Portable Memory Mapping C++ Class

posted by Stephan Brumme

Parsing Files The Easy Way

Recently I had to do a lot with file loading and parsing. It can be especially tricky to come up with a fast solution if you have to jump around within these files. Seeking and proper buffering were my biggest problems.

Memory mapping is one of the nicest features of modern operating systems: after opening a file in memory-mapped mode you can treat the file as a large chunk of memory and use plain pointers. The operating system takes care of loading the data on demand (!) into memory - utilizing caches, of course. When using my C++ class MemoryMapped it's really easy:
How to use:

```cpp
// open file
MemoryMapped data("myfile.txt");

// read byte at file offset 2345
unsigned char a = data[2345];

// or if you prefer pointers
const short* raw = (const short*) data.getData();
short b = raw[300];
```
Windows is completely different from Linux when it comes to opening a file in memory-mapped mode. The class MemoryMapped hides all the OS specific stuff in only two files: MemoryMapped.h and MemoryMapped.cpp.
They compile without any warnings with GCC 4.7 and Visual C++ 2010. I haven't tried other compilers, but they should be able to handle the code, too, even if they are a bit older. Git users: scroll down to the repository link.
Download  MemoryMapped.h
Latest release: September 17, 2013, size: 2552 bytes, 100 lines

CRC32: 5d202964
MD5: 6efd1a7cea536fbd88cf4f02b4c95bcf
SHA1: f7d0c73a035262f9264724e1ba5d31b50c504c98
SHA256: 2fe563f3d9c24d563ce25c5cc2ffb9c7d2115782fe3a9ebf465d0a9a9a22c9f3

Download  MemoryMapped.cpp
Latest release: November 4, 2015, size: 6158 bytes, 322 lines

CRC32: 6aab600a
MD5: 643a883c9aa720a3f39f068f9dcaf463
SHA1: 0ecfc2cb380a7c9b1e023d3889bd3d9a05375fe6
SHA256: d9cad2e388bae4cc2a00105f4841b785a306c78817b2a85fa943a5059aa4eb73

Stay up-to-date: git clone http://create.stephan-brumme.com/portable-memory-mapping/.git

If you encounter any bugs/problems or have ideas for improving future versions, please write me an email: create@stephan-brumme.com


Pros and Cons

The code can be used in a variety of environments:
  1. it supports Linux and Windows
  2. it supports 32 and 64 bit CPUs
  3. it supports large files (>2GB)
To keep things simple, I implemented only the most common use case for memory-mapped files: read-only access.

Interface At A Glance

You can open a file in the MemoryMapped constructor or by calling the open method. The file is automagically closed in the destructor or by calling close. Note: it's a good habit to verify that isValid returns true after the desired file has been opened.

Here is a shortened version of MemoryMapped.h:
Public Interface:

```cpp
/// Portable read-only memory mapping (Windows and Linux)
class MemoryMapped
{
public:
  /// tweak performance
  enum CacheHint
  {
    Normal,         ///< good overall performance
    SequentialScan, ///< read file only once with few seeks
    RandomAccess    ///< jump around
  };

  /// how much should be mapped
  enum MapRange
  {
    WholeFile = 0   ///< everything ... be careful when file is larger than memory
  };

  /// do nothing, must use open()
  MemoryMapped();
  /// open file, mappedBytes = 0 maps the whole file
  MemoryMapped(const std::string& filename, size_t mappedBytes = WholeFile, CacheHint hint = Normal);
  /// close file (see close() )
  ~MemoryMapped();

  /// open file, mappedBytes = 0 maps the whole file
  bool open(const std::string& filename, size_t mappedBytes = WholeFile, CacheHint hint = Normal);
  /// close file
  void close();

  /// access position, no range checking (faster)
  unsigned char operator[](size_t offset) const;
  /// access position, including range checking
  unsigned char at(size_t offset) const;

  /// raw access
  const unsigned char* getData() const;

  /// true, if file successfully opened
  bool isValid() const;

  /// get file size
  uint64_t size() const;
  /// get number of actually mapped bytes
  size_t mappedSize() const;

  /// replace mapping by a new one of the same file, offset MUST be a multiple of the page size
  bool remap(uint64_t offset, size_t mappedBytes);
};
```

Large Files On Small Computers

Since memory mapping loads pages only on-demand you can usually map the whole file. However, this is not possible for large files (>2GB) on 32 bit systems. Then you have to implement your own algorithm and call remap whenever the file position you are looking for is not currently mapped into memory. For example:
Parsing a large file on 32 bit systems:

```cpp
const size_t OneGigabyte = 1 << 30;

uint64_t startAt = 0;
MemoryMapped data("largefile.txt", OneGigabyte, MemoryMapped::Normal);
while (startAt < data.size())
{
  const unsigned char* mapped = data.getData();

  // ... do whatever you want with "mapped"

  // load next chunk
  startAt += OneGigabyte;
  if (startAt >= data.size())
    break;
  // limit to 1 GB (compute in 64 bit to avoid truncation on 32 bit systems)
  uint64_t remaining = data.size() - startAt;
  size_t numBytes = (remaining > OneGigabyte) ? OneGigabyte : (size_t)remaining;
  data.remap(startAt, numBytes);
}
```
Of course, you don't have to worry about that on 64 bit Linux or Windows.

Demo Program mywcl

I need the Unix tool wc daily at work. Well, to be precise, I use wc -l. The idea behind wc -l is pretty simple: count all line endings.
If your file is completely mapped to memory, the core routine becomes a simple for-loop:
```cpp
uint64_t numLines = 0;
for (uint64_t i = 0; i < bufferSize; i++)
  numLines += (buffer[i] == '\n');
```
The full program is only 45 lines long:
mywcl.cpp:

```cpp
// //////////////////////////////////////////////////////////
// mywcl.cpp
// Copyright (c) 2013 Stephan Brumme. All rights reserved.
//
// g++ MemoryMapped.cpp mywcl.cpp -o mywcl -O3 -fopenmp

#include "MemoryMapped.h"
#include <cstdio>

int main(int argc, char* argv[])
{
  // syntax check
  if (argc != 2)
  {
    printf("Syntax: ./mywcl filename\n");
    return -1;
  }

  // map file to memory
  MemoryMapped data(argv[1], MemoryMapped::WholeFile, MemoryMapped::SequentialScan);
  if (!data.isValid())
  {
    printf("File not found\n");
    return -2;
  }

  // raw pointer to mapped memory
  const char* buffer = (const char*)data.getData();

  // store result here
  uint64_t numLines = 0;

  // OpenMP spreads work across CPU cores
  #pragma omp parallel for reduction(+:numLines)
  for (uint64_t i = 0; i < data.size(); i++)
    numLines += (buffer[i] == '\n');

  // show result
#ifdef _MSC_VER
  printf("%I64u\n", numLines);
#else
  printf("%llu\n", numLines);
#endif
  return 0;
}
```
Maybe you have noticed the #pragma in front of the for-loop. This simple line (in addition to the -fopenmp compiler option) enables multi-core line counting:
Enabling OpenMP:

```cpp
// OpenMP spreads work across CPU cores
#pragma omp parallel for reduction(+:numLines)
for (uint64_t i = 0; i < data.size(); i++)
  numLines += (buffer[i] == '\n');
```
If the file is already cached in memory, then my code outperforms the good old wc -l (test data: first 1 GByte of Wikipedia):
Core i7 (64 bit):

```
> time wc -l enwik9
13147025 enwik9
real    0m0.258s
user    0m0.156s
sys     0m0.102s

> time ./mywcl enwik9
13147025
real    0m0.182s   <== 0.076 seconds faster
user    0m1.241s
sys     0m0.116s
```
To be fair, this situation is the only one where my code is faster than wc -l ...
Whenever the file has to be read from disk or on single-core machines wc -l beats me easily.
Moreover, wc accepts data from STDIN (standard input) which is handy for piping. mywcl on the other hand only works with files.

Here are the performance timings on my Raspberry Pi (test data: first 100 MByte of Wikipedia):
Raspberry Pi (32 bit):

```
> time wc -l enwik8
1128023 enwik8
real    0m1.581s
user    0m0.880s
sys     0m0.420s

> time ./mywcl enwik8
1128023
real    0m2.057s   <== 0.466 seconds slower
user    0m1.720s
sys     0m0.080s
```
Download  mywcl.cpp
Latest release: September 18, 2013, size: 980 bytes, 45 lines

CRC32: bcf9eca4
MD5: 9a20c6af8e41c92b601397632dbf795e
SHA1: 8dc3d5e35cfe54c4958f2bcd96973db8ddcab4e2
SHA256: 216f321fd2ec55752b19cd7ccdb7e3c6bf804c55a5b9fdb56ad1ed4fa15cea98

