Portable Memory Mapping C++ Class
posted by Stephan Brumme
Parsing Files The Easy Way
Recently I had to do a lot of file loading and parsing. It can be especially tricky to come up with a fast solution if you have to jump around within these files. Seeking and proper buffering were my biggest problems.
Memory mapping is one of the nicest features of modern operating systems: after opening a file in memory-mapped mode you can treat the file as a large chunk of memory and use plain pointers. The operating system takes care of loading the data on demand (!) into memory, utilizing caches, of course. When using my C++ class MemoryMapped it's really easy:
// open file
MemoryMapped data("myfile.txt");
// read byte at file offset 2345
unsigned char a = data[2345];
// or if you prefer pointers
const short* raw = (const short*) data.getData();
short b = raw[300];
MemoryMapped hides all the OS-specific stuff in only two files: MemoryMapped.h and MemoryMapped.cpp.
They compile without any warnings with GCC 4.7 and Visual C++ 2010. I haven't tried other compilers but they should be able to handle it, too, even when they are a bit older.
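For readers curious what "OS specific stuff" can look like, here is a minimal sketch of read-only mapping on Linux using the POSIX calls open, fstat, mmap and munmap. It is not taken from MemoryMapped.cpp; the names MappedFile, mapWholeFile and unmapFile are hypothetical and only serve the illustration. Windows would need a different set of calls.

```cpp
// Sketch only: read-only mapping of a whole file with plain POSIX calls.
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper type: one read-only mapping of a file.
struct MappedFile
{
  const unsigned char* data; // first mapped byte, NULL on failure
  size_t               size; // number of mapped bytes
};

// Map the whole file read-only; returns { NULL, 0 } on any error
// (including empty files, which cannot be mapped).
MappedFile mapWholeFile(const char* filename)
{
  assert(filename != NULL);
  MappedFile result = { NULL, 0 };

  int fd = open(filename, O_RDONLY);
  if (fd < 0)
    return result;

  struct stat info;
  if (fstat(fd, &info) == 0 && info.st_size > 0)
  {
    void* ptr = mmap(NULL, (size_t) info.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (ptr != MAP_FAILED)
    {
      result.data = static_cast<const unsigned char*>(ptr);
      result.size = (size_t) info.st_size;
    }
  }

  // the mapping stays valid after the descriptor is closed
  close(fd);
  return result;
}

// Release a mapping created by mapWholeFile and reset the struct.
void unmapFile(MappedFile& file)
{
  if (file.data != NULL)
    munmap(const_cast<unsigned char*>(file.data), file.size);
  file.data = NULL;
  file.size = 0;
}
```

After a successful call, file.data[2345] reads the byte at file offset 2345, just like the operator[] shown above, as long as the offset is smaller than file.size.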
Download
Latest release: September 17, 2013, size: 2552 bytes, 100 lines
CRC32: 5d202964
MD5: 6efd1a7cea536fbd88cf4f02b4c95bcf
SHA1: f7d0c73a035262f9264724e1ba5d31b50c504c98
SHA256: 2fe563f3d9c24d563ce25c5cc2ffb9c7d2115782fe3a9ebf465d0a9a9a22c9f3
Latest release: November 4, 2015, size: 6.0 kBytes, 322 lines
CRC32: 6aab600a
MD5: 643a883c9aa720a3f39f068f9dcaf463
SHA1: 0ecfc2cb380a7c9b1e023d3889bd3d9a05375fe6
SHA256: d9cad2e388bae4cc2a00105f4841b785a306c78817b2a85fa943a5059aa4eb73
If you encounter any bugs/problems or have ideas for improving future versions, please write me an email: create@stephan-brumme.com
License
This code is licensed under the zlib License:
This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software. Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions:
1. The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.
Changelog
- version 2
  - latest and greatest
  - November 4, 2015
  - fixed bug in close()
  - Git tag portable_memory_mapping_v2
- version 1
  - September 17, 2013
  - initial release
  - Git tag portable_memory_mapping_v1
Pro and Cons
The code can be used in a variety of environments:
- it supports Linux and Windows
- it supports 32 and 64 bit CPUs
- it supports large files (>2GB)
On the downside:
- read-only access to files
Interface At A Glance
You can open a file in the MemoryMapped constructor or by calling the open method.
The file is automagically closed in the destructor or by calling close.
Note: it's a good habit to verify that isValid returns true after the desired file has been opened.
Here is a shortened version of MemoryMapped.h:
/// Portable read-only memory mapping (Windows and Linux)
class MemoryMapped
{
public:
  /// tweak performance
  enum CacheHint
  {
    Normal,         ///< good overall performance
    SequentialScan, ///< read file only once with few seeks
    RandomAccess    ///< jump around
  };

  /// how much should be mapped
  enum MapRange
  {
    WholeFile = 0   ///< everything ... be careful when file is larger than memory
  };

  /// do nothing, must use open()
  MemoryMapped();
  /// open file, mappedBytes = 0 maps the whole file
  MemoryMapped(const std::string& filename, size_t mappedBytes = WholeFile, CacheHint hint = Normal);
  /// close file (see close() )
  ~MemoryMapped();

  /// open file, mappedBytes = 0 maps the whole file
  bool open(const std::string& filename, size_t mappedBytes = WholeFile, CacheHint hint = Normal);
  /// close file
  void close();

  /// access position, no range checking (faster)
  unsigned char operator[](size_t offset) const;
  /// access position, including range checking
  unsigned char at        (size_t offset) const;

  /// raw access
  const unsigned char* getData() const;

  /// true, if file successfully opened
  bool isValid() const;

  /// get file size
  uint64_t size() const;
  /// get number of actually mapped bytes
  size_t   mappedSize() const;

  /// replace mapping by a new one of the same file, offset MUST be a multiple of the page size
  bool remap(uint64_t offset, size_t mappedBytes);
};
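The three CacheHint values resemble the access-pattern advice that operating systems accept for mapped memory. As an illustration, and purely as an assumption about how such hints could be forwarded on Linux, the sketch below translates them into madvise() constants. The enum is repeated to keep the example self-contained; toMadviseFlag and applyHint are hypothetical names and not part of MemoryMapped.

```cpp
// Sketch only: one plausible way to forward cache hints to the Linux kernel.
#include <cassert>
#include <cstddef>
#include <sys/mman.h>

// Same three hints as in MemoryMapped.h, repeated for a self-contained example.
enum CacheHint { Normal, SequentialScan, RandomAccess };

// Translate a hint into the corresponding madvise() constant (assumed mapping).
int toMadviseFlag(CacheHint hint)
{
  switch (hint)
  {
    case SequentialScan: return MADV_SEQUENTIAL; // expect aggressive read-ahead
    case RandomAccess:   return MADV_RANDOM;     // expect little or no read-ahead
    default:             return MADV_NORMAL;     // kernel defaults
  }
}

// Apply a hint to an existing, page-aligned mapping; returns true on success.
bool applyHint(void* mapped, size_t numBytes, CacheHint hint)
{
  assert(mapped != NULL);
  return madvise(mapped, numBytes, toMadviseFlag(hint)) == 0;
}
```

On Windows, comparable hints exist as the CreateFile flags FILE_FLAG_SEQUENTIAL_SCAN and FILE_FLAG_RANDOM_ACCESS. Either way these are only hints: the operating system is free to ignore them.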
Large Files On Small Computers
Since memory mapping loads pages only on demand, you can usually map the whole file. However, this is not possible for large files (>2GB) on 32 bit systems because the address space is too small. Then you have to implement your own algorithm and call remap whenever the file position you are looking for is not currently mapped into memory. For example:
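const size_t OneGigabyte = 1 << 30;
uint64_t startAt = 0;
MemoryMapped data("largefile.txt", OneGigabyte, MemoryMapped::Normal);
while (startAt < data.size())
{
  const unsigned char* mapped = data.getData();
  // ... do whatever you want with "mapped", data.mappedSize() bytes are available

  // load next chunk, stop when the end of the file is reached
  startAt += OneGigabyte;
  if (startAt >= data.size())
    break;

  // compute in 64 bits, a 32 bit size_t cannot hold the remainder of a huge file
  uint64_t remaining = data.size() - startAt;
  // limit to 1 GB
  size_t numBytes = (remaining > OneGigabyte) ? OneGigabyte : (size_t) remaining;
  data.remap(startAt, numBytes);
}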
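The loop above only ever remaps at multiples of 1 GB, which automatically satisfies the rule that the offset must be a multiple of the page size. If you want to jump to an arbitrary file position instead, the offset has to be rounded down first. The following sketch shows that arithmetic; getPageSize, alignDown, AlignedRequest and alignRequest are hypothetical helpers and not part of MemoryMapped. It uses the POSIX call sysconf; on Windows the relevant value is the allocation granularity reported by GetSystemInfo, which is typically larger than a page.

```cpp
// Sketch only: round a wanted file offset down to a valid mapping offset.
#include <cassert>
#include <stdint.h>
#include <unistd.h>

// Page size of the running system (POSIX); 4096 bytes is common but not guaranteed.
uint64_t getPageSize()
{
  return (uint64_t) sysconf(_SC_PAGESIZE);
}

// Round an offset down to the previous multiple of "granularity".
uint64_t alignDown(uint64_t offset, uint64_t granularity)
{
  assert(granularity > 0);
  return offset - (offset % granularity);
}

// Where to start the mapping and how many bytes to skip inside it.
struct AlignedRequest
{
  uint64_t mapOffset; // aligned offset, suitable for remap()
  uint64_t skipBytes; // wantedOffset == mapOffset + skipBytes
};

// Split an arbitrary file offset into an aligned part and a remainder.
AlignedRequest alignRequest(uint64_t wantedOffset, uint64_t granularity)
{
  AlignedRequest request;
  request.mapOffset = alignDown(wantedOffset, granularity);
  request.skipBytes = wantedOffset - request.mapOffset;
  return request;
}
```

With such a helper you would call remap with request.mapOffset, ask for skipBytes more bytes than you actually need, and start reading at getData() + request.skipBytes.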
Demo Program mywcl
I need the Unix tool wc daily at work. Well, to be precise, I use wc -l.
The idea behind wc -l is pretty simple: count all line endings.
If your file is completely mapped to memory, the core routine becomes a simple for-loop:
uint64_t numLines = 0;
for (uint64_t i = 0; i < bufferSize; i++)
  numLines += (buffer[i] == '\n');
The complete program, mywcl.cpp:
// //////////////////////////////////////////////////////////
// mywcl.cpp
// Copyright (c) 2013 Stephan Brumme. All rights reserved.
//
// g++ MemoryMapped.cpp mywcl.cpp -o mywcl -O3 -fopenmp

#include "MemoryMapped.h"
#include <cstdio>

int main(int argc, char* argv[])
{
  // syntax check: exactly one filename is required
  if (argc != 2)
  {
    printf("Syntax: ./mywcl filename\n");
    return -1;
  }

  // map file to memory
  MemoryMapped data(argv[1], MemoryMapped::WholeFile, MemoryMapped::SequentialScan);
  if (!data.isValid())
  {
    printf("File not found\n");
    return -2;
  }

  // raw pointer to mapped memory (read-only)
  const char* buffer = (const char*) data.getData();

  // store result here
  uint64_t numLines = 0;

  // OpenMP spreads work across CPU cores
#pragma omp parallel for reduction(+:numLines)
  for (uint64_t i = 0; i < data.size(); i++)
    numLines += (buffer[i] == '\n');

  // show result (unsigned 64 bit format)
#ifdef _MSC_VER
  printf("%I64u\n", numLines);
#else
  printf("%llu\n", (unsigned long long) numLines);
#endif

  return 0;
}
Note the #pragma in front of the for-loop.
This single line (in addition to the -fopenmp compiler option) enables multi-core line counting:
// OpenMP spreads work across CPU cores
#pragma omp parallel for reduction(+:numLines)
for (uint64_t i = 0; i < data.size(); i++)
  numLines += (buffer[i] == '\n');
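To make the role of reduction(+:numLines) a bit more tangible: each thread counts into its own private copy of numLines, which starts at zero, and OpenMP adds all copies together after the loop. The sketch below wraps the same loop in a stand-alone function so it can be tried without MemoryMapped. The function name countLines and the signed loop index are my choices for this example (a signed index is accepted by older OpenMP versions as well); the pragma is guarded so the code also compiles, single-threaded, without -fopenmp.

```cpp
// Sketch only: the line-counting loop as a self-contained function.
#include <cassert>
#include <cstddef>
#include <stdint.h>

// Count '\n' bytes in a buffer, on several cores if OpenMP is enabled.
uint64_t countLines(const char* buffer, size_t numBytes)
{
  assert(buffer != NULL || numBytes == 0);

  uint64_t numLines = 0;
  const int64_t last = (int64_t) numBytes;

  // every thread sums into a private numLines, the copies are added at the end
#ifdef _OPENMP
#pragma omp parallel for reduction(+:numLines)
#endif
  for (int64_t i = 0; i < last; i++)
    numLines += (buffer[i] == '\n');

  return numLines;
}
```

Calling countLines("a\nb\nc\n", 6) yields 3, with or without OpenMP. The result does not depend on the number of threads because integer addition can be performed in any order.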
Here are the timings compared to wc -l (test data: first 1 GByte of Wikipedia):
> time wc -l enwik9
13147025 enwik9
real 0m0.258s
user 0m0.156s
sys 0m0.102s
> time ./mywcl enwik9
13147025
real 0m0.182s <== 0.076 seconds faster
user 0m1.241s
sys 0m0.116s
Whenever the file has to be read from disk or on single-core machines, wc -l beats me easily.
Moreover, wc accepts data from STDIN (standard input), which is handy for piping. mywcl, on the other hand, only works with files.
Here are the performance timings on my Raspberry Pi (test data: first 100 MByte of Wikipedia):
> time wc -l enwik8
1128023 enwik8
real 0m1.581s
user 0m0.880s
sys 0m0.420s
> time ./mywcl enwik8
1128023
real 0m2.057s <== 0.476 seconds slower
user 0m1.720s
sys 0m0.080s
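As mentioned, mywcl cannot read from STDIN because a pipe cannot be memory-mapped. If you need that feature, a buffered fallback is one way to go. The sketch below is not part of mywcl.cpp; countLinesInStream is a hypothetical helper that counts line endings in any readable stream with plain fread and a fixed 64 KB buffer.

```cpp
// Sketch only: count line endings in a stream that cannot be memory-mapped.
#include <cassert>
#include <cstdio>
#include <stdint.h>

// Count '\n' bytes in any readable stream, e.g. stdin, chunk by chunk.
uint64_t countLinesInStream(FILE* stream)
{
  assert(stream != NULL);

  static const size_t BufferSize = 64 * 1024;
  char buffer[BufferSize];

  uint64_t numLines = 0;
  size_t   numRead;
  // fread returns 0 at the end of the stream or on error
  while ((numRead = fread(buffer, 1, BufferSize, stream)) > 0)
    for (size_t i = 0; i < numRead; i++)
      numLines += (buffer[i] == '\n');

  return numLines;
}
```

A program could call countLinesInStream(stdin) when no filename is given and use the memory-mapped path otherwise. Like wc -l, the function counts newline characters, so a final line without a trailing newline is not counted.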
Download
Latest release: September 18, 2013, size: 980 bytes, 45 lines
CRC32: bcf9eca4
MD5: 9a20c6af8e41c92b601397632dbf795e
SHA1: 8dc3d5e35cfe54c4958f2bcd96973db8ddcab4e2
SHA256: 216f321fd2ec55752b19cd7ccdb7e3c6bf804c55a5b9fdb56ad1ed4fa15cea98
