Portable Memory Mapping C++ Class
posted by Stephan Brumme
Parsing Files The Easy Way
Recently I had to do a lot of file loading and parsing. It can be especially tricky to come up with a fast solution if you have to jump around within these files. Seeking and proper buffering were my biggest problems.
Memory mapping is one of the nicest features of modern operating systems: after opening a file in memory-mapped mode you can treat the file as a large chunk of memory and use plain pointers. The operating system takes care of loading the data on demand (!) into memory - utilizing caches, of course. When using my C++ class MemoryMapped it's really easy:

// open file
MemoryMapped data("myfile.txt");
// read byte at file offset 2345
unsigned char a = data[2345];
// or if you prefer pointers
const short* raw = (const short*) data.getData();
short b = raw[300];
MemoryMapped hides all the OS-specific stuff in only two files: MemoryMapped.h and MemoryMapped.cpp.
They compile without any warning with GCC 4.7 and Visual C++ 2010. I haven't tried other compilers, but they should be able to handle it, too, even if they are a bit older.
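To give an idea of the kind of OS-specific code such a class has to hide, here is a minimal POSIX-only sketch (Linux, macOS). It is not taken from MemoryMapped.cpp; the helper name readByteViaMmap and its error convention (-1 on failure) are assumptions made purely for illustration.

```cpp
// POSIX-only illustration, not the code of MemoryMapped.cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>     // open
#include <sys/mman.h>  // mmap, munmap
#include <sys/stat.h>  // fstat
#include <unistd.h>    // close

// Hypothetical helper: returns the byte at "offset" of "filename", or -1 on any error
int readByteViaMmap(const char* filename, size_t offset)
{
  assert(filename != NULL);

  int fd = open(filename, O_RDONLY);
  if (fd < 0)
    return -1;

  // file size is needed to know how many bytes to map
  struct stat info;
  if (fstat(fd, &info) < 0 || info.st_size <= 0 || offset >= (size_t) info.st_size)
  {
    close(fd);
    return -1;
  }

  // map the whole file read-only; pages are loaded on first access
  void* view = mmap(NULL, (size_t) info.st_size, PROT_READ, MAP_SHARED, fd, 0);
  if (view == MAP_FAILED)
  {
    close(fd);
    return -1;
  }

  // from here on the file behaves like a plain array
  int result = static_cast<const unsigned char*>(view)[offset];

  munmap(view, (size_t) info.st_size);
  close(fd);
  return result;
}
```

Windows needs a different set of calls for the same job, which is exactly why wrapping everything in one small class is convenient.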
Download
Latest release: September 17, 2013, size: 2552 bytes, 100 lines
CRC32:
5d202964
MD5:
6efd1a7cea536fbd88cf4f02b4c95bcf
SHA1:
f7d0c73a035262f9264724e1ba5d31b50c504c98
SHA256:
2fe563f3d9c24d563ce25c5cc2ffb9c7d2115782fe3a9ebf465d0a9a9a22c9f3
Latest release: November 4, 2015, size: 6.0 kBytes, 322 lines
CRC32:
6aab600a
MD5:
643a883c9aa720a3f39f068f9dcaf463
SHA1:
0ecfc2cb380a7c9b1e023d3889bd3d9a05375fe6
SHA256:
d9cad2e388bae4cc2a00105f4841b785a306c78817b2a85fa943a5059aa4eb73
If you encounter any bugs/problems or have ideas for improving future versions, please write me an email: create@stephan-brumme.com
License
This code is licensed under the zlib License:
This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software.
Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions:
1. The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.
Changelog
- version 2
  - latest and greatest
  - November 4, 2015
  - fixed bug in close()
  - Git tag portable_memory_mapping_v2
- version 1
  - September 17, 2013
  - initial release
  - Git tag portable_memory_mapping_v1
Pros and Cons
The code can be used in a variety of environments:
- it supports Linux and Windows
- it supports 32 and 64 bit CPUs
- it supports large files (>2GB)
- Read-only access to files
Interface At A Glance
You can open a file in the MemoryMapped constructor or by calling the open method.
The file is automagically closed in the destructor or by calling close.
Note: it's a good habit to verify that isValid returns true after the desired file has been opened.
Here is a shortened version of MemoryMapped.h:

/// Portable read-only memory mapping (Windows and Linux)
class MemoryMapped
{
public:
  /// tweak performance
  enum CacheHint
  {
    Normal,         ///< good overall performance
    SequentialScan, ///< read file only once with few seeks
    RandomAccess    ///< jump around
  };
  /// how much should be mapped
  enum MapRange
  {
    WholeFile = 0   ///< everything ... be careful when file is larger than memory
  };
  /// do nothing, must use open()
  MemoryMapped();
  /// open file, mappedBytes = 0 maps the whole file
  MemoryMapped(const std::string& filename, size_t mappedBytes = WholeFile, CacheHint hint = Normal);
  /// close file (see close() )
  ~MemoryMapped();
  /// open file, mappedBytes = 0 maps the whole file
  bool open(const std::string& filename, size_t mappedBytes = WholeFile, CacheHint hint = Normal);
  /// close file
  void close();
  /// access position, no range checking (faster)
  unsigned char operator[](size_t offset) const;
  /// access position, including range checking
  unsigned char at        (size_t offset) const;
  /// raw access
  const unsigned char* getData() const;
  /// true, if file successfully opened
  bool isValid() const;
  /// get file size
  uint64_t size() const;
  /// get number of actually mapped bytes
  size_t mappedSize() const;
  /// replace mapping by a new one of the same file, offset MUST be a multiple of the page size
  bool remap(uint64_t offset, size_t mappedBytes);
};
Large Files On Small Computers
Since memory mapping loads pages only on demand, you can usually map the whole file. However, this is not possible for large files (>2GB) on 32 bit systems. Then you have to implement your own algorithm and call remap whenever the file position you are looking for is not currently mapped into memory. For example:
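const size_t OneGigabyte = 1 << 30;
uint64_t startAt = 0;
MemoryMapped data("largefile.txt", OneGigabyte, MemoryMapped::Normal);
while (startAt < data.size())
{
  const unsigned char* mapped = data.getData();
  // ... do whatever you want with "mapped" (data.mappedSize() bytes are valid)
  // load next chunk
  startAt += OneGigabyte;
  // stop at the end of the file, there is nothing left to map
  if (startAt >= data.size())
    break;
  // compute in 64 bits, the remainder may not fit into size_t on 32 bit systems
  uint64_t remaining = data.size() - startAt;
  // limit to 1 GB
  size_t numBytes = (remaining > OneGigabyte) ? OneGigabyte : (size_t) remaining;
  data.remap(startAt, numBytes);
}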
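The loop above always remaps at multiples of 1 GB, which are automatically aligned. For random access, an arbitrary file position first has to be rounded down to an allowed offset. The following is a minimal sketch of that arithmetic; the names Window and alignWindow are hypothetical and not part of MemoryMapped, and the granularity is passed in as a parameter because the correct value is platform-specific (on Linux the page size can be queried with sysconf(_SC_PAGESIZE); on Windows mapping offsets generally have to be multiples of the allocation granularity reported by GetSystemInfo).

```cpp
#include <cassert>
#include <cstddef>
#include <stdint.h>

// Hypothetical helper type: where to remap and where the wanted byte ends up
struct Window
{
  uint64_t mapOffset; // aligned file offset, suitable as first parameter of remap()
  size_t   delta;     // position of the wanted byte inside the mapped window
};

// Round "wanted" down to a multiple of "granularity" (which must be > 0)
Window alignWindow(uint64_t wanted, size_t granularity)
{
  assert(granularity > 0);
  Window result;
  result.delta     = (size_t) (wanted % granularity);
  result.mapOffset = wanted - result.delta;
  return result;
}
```

Assuming a mapped object called data, usage could look like this: compute Window w = alignWindow(position, granularity), call data.remap(w.mapOffset, numBytes) with numBytes larger than w.delta, and then read the wanted byte as data[w.delta].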
Demo Program mywcl
I need the Unix tool wc daily at work. Well, to be precise, I use wc -l.
The idea behind wc -l is pretty simple: count all line endings.
If your file is completely mapped to memory, the core routine becomes a simple for-loop:
uint64_t numLines = 0;
for (uint64_t i = 0; i < bufferSize; i++)
  numLines += (buffer[i] == '\n');
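The same counting logic can be tried on an ordinary in-memory buffer, without any file involved. The sketch below uses std::count instead of the hand-written loop; the function name countNewlines is made up for this example.

```cpp
#include <algorithm> // std::count
#include <cassert>
#include <cstddef>
#include <stdint.h>

// Hypothetical helper: counts '\n' bytes, equivalent to the for-loop above
uint64_t countNewlines(const char* buffer, size_t bufferSize)
{
  assert(buffer != NULL || bufferSize == 0);
  return (uint64_t) std::count(buffer, buffer + bufferSize, '\n');
}
```

Note that only newline characters are counted, so a final line without a trailing '\n' does not contribute. wc -l behaves the same way.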

// //////////////////////////////////////////////////////////
// mywcl.cpp
// Copyright (c) 2013 Stephan Brumme. All rights reserved.
//
// g++ MemoryMapped.cpp mywcl.cpp -o mywcl -O3 -fopenmp
#include "MemoryMapped.h"
#include <cstdio>
int main(int argc, char* argv[])
{
  // syntax check: exactly one filename is expected
  if (argc != 2)
  {
    printf("Syntax: ./mywcl filename\n");
    return -1;
  }
  // map file to memory
  MemoryMapped data(argv[1], MemoryMapped::WholeFile, MemoryMapped::SequentialScan);
  if (!data.isValid())
  {
    printf("File not found\n");
    return -2;
  }
  // raw pointer to mapped memory (read-only)
  const char* buffer = (const char*)data.getData();
  // store result here
  uint64_t numLines = 0;
  // OpenMP spreads work across CPU cores
#pragma omp parallel for reduction(+:numLines)
  for (uint64_t i = 0; i < data.size(); i++)
    numLines += (buffer[i] == '\n');
  // show result (numLines is unsigned)
#ifdef _MSC_VER
  printf("%I64u\n", numLines);
#else
  printf("%llu\n", (unsigned long long)numLines);
#endif
  return 0;
}
Note the #pragma in front of the for-loop.
This simple line (in addition to the -fopenmp compiler option) enables multi-core line counting:

// OpenMP spreads work across CPU cores
#pragma omp parallel for reduction(+:numLines)
for (uint64_t i = 0; i < data.size(); i++)
  numLines += (buffer[i] == '\n');
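Conceptually, reduction(+:numLines) gives every thread its own private counter and adds all of them up once the loop has finished. The following single-threaded sketch only illustrates that idea with per-chunk partial sums; it is not how OpenMP is implemented, it does not run in parallel, and the name countNewlinesInChunks is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <stdint.h>
#include <vector>

// Hypothetical helper: same result as the plain loop, computed via per-chunk partial sums
uint64_t countNewlinesInChunks(const char* buffer, size_t bufferSize, size_t numChunks)
{
  assert(numChunks > 0);
  std::vector<uint64_t> partial(numChunks, 0);
  const size_t chunkSize = (bufferSize + numChunks - 1) / numChunks; // round up

  // each chunk only touches its own counter (with OpenMP: one private counter per thread)
  for (size_t chunk = 0; chunk < numChunks; chunk++)
  {
    size_t from = chunk * chunkSize;
    size_t to   = from + chunkSize;
    if (from > bufferSize) from = bufferSize;
    if (to   > bufferSize) to   = bufferSize;
    for (size_t i = from; i < to; i++)
      partial[chunk] += (buffer[i] == '\n');
  }

  // the "reduction" step: combine all private counters with +
  uint64_t numLines = 0;
  for (size_t chunk = 0; chunk < numChunks; chunk++)
    numLines += partial[chunk];
  return numLines;
}
```

Because addition is associative, the total does not depend on how the buffer is split, which is why the compiler may distribute the loop iterations freely.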
Here is a comparison against wc -l:
(test data: first 1 GByte of Wikipedia)

> time wc -l enwik9
13147025 enwik9
real 0m0.258s
user 0m0.156s
sys 0m0.102s
> time ./mywcl enwik9
13147025
real 0m0.182s <== 0.076 seconds faster
user 0m1.241s
sys 0m0.116s
wc -l ... Whenever the file has to be read from disk or on single-core machines, wc -l beats me easily.
Moreover, wc accepts data from STDIN (standard input), which is handy for piping. mywcl, on the other hand, only works with files.
Here are the performance timings on my Raspberry Pi: (test data: first 100 MByte of Wikipedia)

> time wc -l enwik8
1128023 enwik8
real 0m1.581s
user 0m0.880s
sys 0m0.420s
> time ./mywcl enwik8
1128023
real 0m2.057s <== 0.476 seconds slower
user 0m1.720s
sys 0m0.080s
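As mentioned, mywcl cannot read from STDIN because a pipe cannot be memory-mapped. A fallback based on buffered reads could look like the sketch below. It is not part of mywcl.cpp; the function name countNewlinesInStream and the 64 KB buffer size are arbitrary choices for illustration.

```cpp
#include <cassert>
#include <cstdio>
#include <stdint.h>

// Hypothetical helper: counts '\n' in a stream that cannot be memory-mapped (e.g. stdin)
uint64_t countNewlinesInStream(FILE* input)
{
  assert(input != NULL);
  char buffer[65536];
  uint64_t numLines = 0;
  size_t numRead;
  // read block by block until end of stream (or a read error)
  while ((numRead = fread(buffer, 1, sizeof(buffer), input)) > 0)
    for (size_t i = 0; i < numRead; i++)
      numLines += (buffer[i] == '\n');
  return numLines;
}
```

Calling countNewlinesInStream(stdin) when no filename is given would allow piping data into the program, at the cost of the explicit buffering that memory mapping avoids.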
Download
Latest release: September 18, 2013, size: 980 bytes, 45 lines
CRC32:
bcf9eca4
MD5:
9a20c6af8e41c92b601397632dbf795e
SHA1:
8dc3d5e35cfe54c4958f2bcd96973db8ddcab4e2
SHA256:
216f321fd2ec55752b19cd7ccdb7e3c6bf804c55a5b9fdb56ad1ed4fa15cea98