Portable Memory Mapping C++ Class

posted by Stephan Brumme

Parsing Files The Easy Way

Recently I had to do a lot with file loading and parsing. It can be especially tricky to come up with a fast solution if you have to jump around within these files. Seeking and proper buffering were my biggest problems.

Memory mapping is one of the nicest features of modern operating systems: after opening a file in memory-mapped mode you can treat the file as a large chunk of memory and use plain pointers. The operating system takes care of loading the data on demand (!) into memory - utilizing caches, of course. When using my C++ class MemoryMapped it's really easy:
How to use:

```cpp
// open file
MemoryMapped data("myfile.txt");

// read byte at file offset 2345
unsigned char a = data[2345];

// or if you prefer pointers
const short* raw = (const short*) data.getData();
short b = raw[300];
```
Windows is completely different from Linux when it comes to opening a file in memory-mapped mode. The class MemoryMapped hides all the OS specific stuff in only two files: MemoryMapped.h and MemoryMapped.cpp.
They compile without any warnings with GCC 4.7 and Visual C++ 2010. I haven't tried other compilers, but they should be able to handle the code, too, even if they are a bit older. Git users: scroll down to the repository link.
Download  MemoryMapped.h
Latest release: September 17, 2013, size: 2552 bytes, 100 lines

CRC32: 5d202964
MD5: 6efd1a7cea536fbd88cf4f02b4c95bcf
SHA1: f7d0c73a035262f9264724e1ba5d31b50c504c98
SHA256: 2fe563f3d9c24d563ce25c5cc2ffb9c7d2115782fe3a9ebf465d0a9a9a22c9f3

Download  MemoryMapped.cpp
Latest release: November 4, 2015, size: 6158 bytes, 322 lines

CRC32: 6aab600a
MD5: 643a883c9aa720a3f39f068f9dcaf463
SHA1: 0ecfc2cb380a7c9b1e023d3889bd3d9a05375fe6
SHA256: d9cad2e388bae4cc2a00105f4841b785a306c78817b2a85fa943a5059aa4eb73

Stay up-to-date: git clone http://create.stephan-brumme.com/portable-memory-mapping/.git

If you encounter any bugs/problems or have ideas for improving future versions, please write me an email: create@stephan-brumme.com


Pros and Cons

The code can be used in a variety of environments:
  1. it supports Linux and Windows
  2. it supports 32 and 64 bit CPUs
  3. it supports large files (>2GB)
To keep things simple, I implemented only the most common use case for memory-mapped files: read-only access.

Interface At A Glance

You can open a file in the MemoryMapped constructor or by calling the open method. The file is automagically closed in the destructor or by calling close. Note: it's a good habit to verify that isValid returns true after the desired file has been opened.

Here is a shortened version of MemoryMapped.h:
Public Interface:

```cpp
/// Portable read-only memory mapping (Windows and Linux)
class MemoryMapped
{
public:
  /// tweak performance
  enum CacheHint
  {
    Normal,         ///< good overall performance
    SequentialScan, ///< read file only once with few seeks
    RandomAccess    ///< jump around
  };

  /// how much should be mapped
  enum MapRange
  {
    WholeFile = 0   ///< everything ... be careful when file is larger than memory
  };

  /// do nothing, must use open()
  MemoryMapped();
  /// open file, mappedBytes = 0 maps the whole file
  MemoryMapped(const std::string& filename, size_t mappedBytes = WholeFile, CacheHint hint = Normal);
  /// close file (see close() )
  ~MemoryMapped();

  /// open file, mappedBytes = 0 maps the whole file
  bool open(const std::string& filename, size_t mappedBytes = WholeFile, CacheHint hint = Normal);
  /// close file
  void close();

  /// access position, no range checking (faster)
  unsigned char operator[](size_t offset) const;
  /// access position, including range checking
  unsigned char at(size_t offset) const;

  /// raw access
  const unsigned char* getData() const;

  /// true, if file successfully opened
  bool isValid() const;

  /// get file size
  uint64_t size() const;
  /// get number of actually mapped bytes
  size_t mappedSize() const;

  /// replace mapping by a new one of the same file, offset MUST be a multiple of the page size
  bool remap(uint64_t offset, size_t mappedBytes);
};
```

Large Files On Small Computers

Since memory mapping loads pages only on-demand you can usually map the whole file. However, this is not possible for large files (>2GB) on 32 bit systems. Then you have to implement your own algorithm and call remap whenever the file position you are looking for is not currently mapped into memory. For example:
Parsing a large file on 32 bit systems:

```cpp
const size_t OneGigabyte = 1 << 30;

uint64_t startAt = 0;
MemoryMapped data("largefile.txt", OneGigabyte, MemoryMapped::Normal);
while (startAt < data.size())
{
  const unsigned char* mapped = data.getData();

  // ... do whatever you want with "mapped"

  // load next chunk
  startAt += OneGigabyte;
  if (startAt >= data.size())
    break;
  // limit to 1 GB (compute in 64 bit to avoid truncation on 32 bit systems)
  uint64_t remaining = data.size() - startAt;
  size_t numBytes = (remaining > OneGigabyte) ? OneGigabyte : (size_t)remaining;
  data.remap(startAt, numBytes);
}
```
Of course, you don't have to worry about that on 64 bit Linux or Windows.

Demo Program mywcl

I need the Unix tool wc daily at work. Well, to be precise, I use wc -l. The idea behind wc -l is pretty simple: count all line endings.
If your file is completely mapped to memory, the core routine becomes a simple for-loop:
```cpp
uint64_t numLines = 0;
for (uint64_t i = 0; i < bufferSize; i++)
  numLines += (buffer[i] == '\n');
```
The full program is only 45 lines long:
mywcl.cpp:

```cpp
// //////////////////////////////////////////////////////////
// mywcl.cpp
// Copyright (c) 2013 Stephan Brumme. All rights reserved.
//
// g++ MemoryMapped.cpp mywcl.cpp -o mywcl -O3 -fopenmp

#include "MemoryMapped.h"
#include <cstdio>

int main(int argc, char* argv[])
{
  // syntax check
  if (argc != 2)
  {
    printf("Syntax: ./mywcl filename\n");
    return -1;
  }

  // map file to memory
  MemoryMapped data(argv[1], MemoryMapped::WholeFile, MemoryMapped::SequentialScan);
  if (!data.isValid())
  {
    printf("File not found\n");
    return -2;
  }

  // raw pointer to mapped memory
  const char* buffer = (const char*)data.getData();

  // store result here
  uint64_t numLines = 0;

  // OpenMP spreads work across CPU cores
  #pragma omp parallel for reduction(+:numLines)
  for (uint64_t i = 0; i < data.size(); i++)
    numLines += (buffer[i] == '\n');

  // show result
#ifdef _MSC_VER
  printf("%I64u\n", numLines);
#else
  printf("%llu\n", numLines);
#endif
  return 0;
}
```
Maybe you have noticed the #pragma in front of the for-loop. This simple line (in addition to the -fopenmp compiler option) enables multi-core line counting:
Enabling OpenMP:

```cpp
// OpenMP spreads work across CPU cores
#pragma omp parallel for reduction(+:numLines)
for (uint64_t i = 0; i < data.size(); i++)
  numLines += (buffer[i] == '\n');
```
If the file is already cached in memory, then my code outperforms the good old wc -l (test data: first 1 GByte of Wikipedia):
Core i7 (64 bit):

```
> time wc -l enwik9
13147025 enwik9
real    0m0.258s
user    0m0.156s
sys     0m0.102s

> time ./mywcl enwik9
13147025
real    0m0.182s   <== 0.076 seconds faster
user    0m1.241s
sys     0m0.116s
```
To be fair, this situation is the only one where my code is faster than wc -l ...
Whenever the file has to be read from disk or on single-core machines wc -l beats me easily.
Moreover, wc accepts data from STDIN (standard input) which is handy for piping. mywcl on the other hand only works with files.

Here are the performance timings on my Raspberry Pi (test data: first 100 MByte of Wikipedia):
Raspberry Pi (32 bit):

```
> time wc -l enwik8
1128023 enwik8
real    0m1.581s
user    0m0.880s
sys     0m0.420s

> time ./mywcl enwik8
1128023
real    0m2.057s   <== 0.466 seconds slower
user    0m1.720s
sys     0m0.080s
```
Download  mywcl.cpp
Latest release: September 18, 2013, size: 980 bytes, 45 lines

CRC32: bcf9eca4
MD5: 9a20c6af8e41c92b601397632dbf795e
SHA1: 8dc3d5e35cfe54c4958f2bcd96973db8ddcab4e2
SHA256: 216f321fd2ec55752b19cd7ccdb7e3c6bf804c55a5b9fdb56ad1ed4fa15cea98

