smalLZ4 - optimal LZ4 compression

posted by Stephan Brumme

You came here for the code ? Scroll down to the Download section and GIT repository.

Optimal Parsing

In the context of compression, optimal parsing is a multi-pass algorithm to find the smallest output of compressible data.
Whenever there are multiple choices of encoding a symbol (or a sequence of symbols), optimal parsing will choose the representation which leads to the overall smallest output.
That means, it may prefer a locally sub-optimal encoding in order to achieve a globally optimal encoding.

Most compression algorithms strive for locally optimal encoding because it's computationally cheaper.
The most basic algorithm, the greedy algorithm, always selects the encoding that fits best "right now".
A significant improvement, called "lazy evaluation", takes the next group of symbols into consideration and decides which encoding yields the smallest output for that whole group.
In a way, lazy evaluation can be seen as a local version of optimal parsing. It does the same job for a group of symbols instead of the whole file.

Compressing the string abcde_bcdefgh_abcdefghxxxxxxx returns different file sizes:
$ echo "abcde_bcdefgh_abcdefghxxxxxxx" | lz4 -9 --no-frame-crc | wc -c
43
$ echo "abcde_bcdefgh_abcdefghxxxxxxx" | ./smallz4 | wc -c
41
Let's assume that plain characters are literals, i.e. uncompressed data.
Pairs of numbers in parentheses indicate distance and length of a match.
And let's ignore xxxxxxx from now on (its only purpose is to hide the fact that the LZ4 format specification forbids matches at the very end of a file).

lz4's result in detail: abcde_(5,4)fgh_(14,5)fghxxxxxxx
It found two matches: The first replaced bcde by a reference to a sequence we saw 5 bytes ago.
That reference "costs" 3 bytes while the literals would have cost 4 bytes. We saved a byte here.
The second match replaces abcde by a reference to the beginning of the file.
Again, we need 3 bytes to represent the match while the replaced text was 5 bytes long. Two more bytes saved !

However, this compression is not optimal. smallz4 can squeeze out another two bytes because it postpones the second match:
smallz4 detects that matching abcde would save 2 bytes but matching bcdefgh saves 4 bytes.

Optimal Parsing - Pass 1

For each byte of input data, the longest match will be found and stored in an array called match (starts at about line 440).
Additional data structures such as previousHash and previousExact only speed up the code.
A simple brute-force routine will give you the same result - albeit slower.
position            1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22  (padding)
character           a    b    c    d    e    _    b    c    d    e    f    g    h    _    a    b    c    d    e    f    g    h  xxxxxxx
match distance                                    5  (5)  (5)  (5)                 (8)   14    9    9    9    9    9  (9)  (9)
match length                                      4  (3)  (2)  (1)                 (1)    5    7    6    5    4  (3)  (2)  (1)
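
Such a brute-force search could look like the following sketch - it fills the same kind of match array, just much slower than smallz4's hash-chain approach (the Match struct and all names are mine, made up for this example; LZ4's limits of a 4 byte minimum match length and 16 bit offsets are built in):

// brute-force variant of pass 1: for every position find the longest preceding match
// (illustration only - smallz4 uses previousHash/previousExact to avoid the O(n^2) scan)
#include <cstddef>
#include <cstdint>
#include <vector>

struct Match                // hypothetical helper struct, not taken from smallz4
{
  uint32_t distance;        // how far back the match starts (0 = no match found)
  uint32_t length;          // number of identical bytes
};

std::vector<Match> findLongestMatches(const unsigned char* data, size_t size)
{
  const size_t MinMatchLength = 4;      // LZ4 can't encode matches shorter than 4 bytes
  const size_t MaxDistance    = 65535;  // LZ4 offsets are 16 bit

  std::vector<Match> match(size, Match{ 0, 0 });
  for (size_t i = 1; i < size; i++)
  {
    size_t earliest = (i > MaxDistance) ? i - MaxDistance : 0;
    for (size_t j = earliest; j < i; j++)
    {
      // count identical bytes starting at positions j and i
      size_t length = 0;
      while (i + length < size && data[j + length] == data[i + length])
        length++;

      if (length >= MinMatchLength && length > match[i].length)
      {
        match[i].distance = uint32_t(i - j);
        match[i].length   = uint32_t(length);
      }
    }
  }
  return match;
}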

(read more ...)

xxHash - a hash algorithm as fast as memcpy

posted by Stephan Brumme

Numbers, please !

Yes, Yann Collet's xxHash algorithm can be faster than memcpy (Visual C++):
                                     throughput      bytes/CPU cycle
memcpy                               5.45 GByte/s    approx. 1.6
xxHash32 (Yann's implementation)     5.9 GByte/s     approx. 1.7
xxHash32 (my code)                   5.9 GByte/s     approx. 1.7
CRC32 Slicing-by-16                  3.2 GByte/s     approx. 1
Yes, xxHash is extremely fast - but keep in mind that memcpy has to read and write lots of bytes whereas this hashing algorithm reads everything but writes only a few bytes.

For comparison: memset achieves 8.4 GByte/s on the same Intel Core i7-2600K CPU @ 3.40GHz system.
Even more interesting is that even pretty old versions of G++ have a faster version of memcpy (7.7 GByte/s) and much, much faster intrinsics for memset (18.2 GByte/s).

Update August 8, 2016:
I wrote very similar code for xxHash64. It's super-fast on x64 but much slower when compiled for x32 systems.
Most of this posting will remain unchanged to avoid confusion - just keep in mind that every time xxHash32 touches 4 bytes, xxHash64 does pretty much the same with 8 bytes (with a few minor exceptions).
                                     throughput      bytes/CPU cycle
xxHash64 (Yann's implementation)     10.5 GByte/s    approx. 3.1
xxHash64 (my code)                   10.5 GByte/s    approx. 3.1


xxHash was designed from the ground up to be as fast as possible on modern CPUs.
It is not a strong cryptographic hash, such as the SHA family, but still passes the SMHasher test set with 10 points.

Most simple hashes, such as FNV (see my posting, too), step through the input data byte-by-byte.
Working on the byte at position 1 requires that all work on the previous byte (at position 0) has finished.
Therefore, we have a dependency which causes the CPU to potentially wait (stalling).

Slightly better algorithms process a block of bytes at once, e.g. CRC Slicing-by-4/8/16 consumes 4/8/16 bytes per step - instead of just one - giving a substantial speed-up.
However, we still have the same problem: working on block 1 requires that all work on block 0 has finished.

xxHash subdivides input data into four independent streams. Each stream processes a block of 4 bytes per step and stores a temporary state.
Only the final step combines these four states into a single one.

The major advantage is that the code generator has lots of opportunities to re-order opcodes to avoid latencies.
I drew a crude kindergarten-style image to visualize how memory is accessed when hashing 64 bytes:

[Figure: the input bytes are assigned round-robin, four bytes at a time, to the four streams - bytes 0-3 go to stream 0, bytes 4-7 to stream 1, bytes 8-11 to stream 2, bytes 12-15 to stream 3, then the pattern repeats; each stream updates its own state[0..3].]
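
The same idea in code - a simplified sketch of the inner loop only (seeding of the four states and the final merge/avalanche steps are omitted; the two prime constants are the ones from the published xxHash32 algorithm, everything else is made up for this illustration):

// xxHash32's four independent streams: 16 input bytes per step, 4 bytes per stream
#include <cstdint>
#include <cstring>

static const uint32_t Prime1 = 2654435761U;
static const uint32_t Prime2 = 2246822519U;

static uint32_t rotateLeft(uint32_t x, unsigned char bits)
{
  return (x << bits) | (x >> (32 - bits));
}

void processBlocks(const void* input, uint64_t length, uint32_t state[4])
{
  const unsigned char* data = (const unsigned char*)input;
  for (uint64_t block = 0; block + 16 <= length; block += 16)
    for (int i = 0; i < 4; i++)
    {
      uint32_t chunk;
      memcpy(&chunk, data + block + 4*i, 4);   // unaligned read (little-endian assumed)
      // the four updates don't depend on each other, so the CPU can overlap them
      state[i] = rotateLeft(state[i] + chunk * Prime2, 13) * Prime1;
    }
}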

(read more ...)

Tiny Regular Expression Matcher

posted by Stephan Brumme

Regular expressions (regex) can be the gate to heaven as well as the gate to hell. Usually they are both at the same time.
Writing a correct regular expression is sometimes as tough as writing a regular expression parser.

Chapter 1 of the excellent book "Beautiful Code" (ISBN 9780596510046) discusses Rob Pike's very short and elegant regex parser code.
It doesn't come with all the bells and whistles of a full Perl regular expression matcher but is quite sufficient for many purposes.
I extended Rob's C code in such a way that it can handle the following regex meta-symbols (see limitations, too):
Regex Symbol   Meaning                                                           Already in Rob Pike's code ?
.              match any character                                               yes
*              the previous character is arbitrarily repeated                    yes
+              the previous character occurs at least once                       no
?              the previous character occurs once or not at all                  no
^              match regular expression only to the beginning of the text       yes
$              match regular expression only to the end of the text             yes
\              treat next character as a literal, even if it is a regex symbol   no
My C code consists of just one .h and one .c file without any external dependencies - not even #include <stdlib.h>.
The source file match.c contains about 100 lines of straight C code (plus tons of comments and blank lines).
Take a look at my explanation of the matching algorithm.
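
For reference, the heart of Rob Pike's matcher - as printed in "Beautiful Code" - consists of three tiny, mutually recursive functions handling ., *, ^ and $; my additions for +, ? and \ follow the same pattern. The sketch below reproduces that core idea and is not a copy of my match.c:

// Rob Pike's minimal matcher (supports . * ^ $ only), reproduced as a sketch
int matchhere(const char* regexp, const char* text);
int matchstar(int c, const char* regexp, const char* text);

// match: search for regexp anywhere in text
int match(const char* regexp, const char* text)
{
  if (regexp[0] == '^')
    return matchhere(regexp + 1, text);
  do {                        // must look even if text is empty
    if (matchhere(regexp, text))
      return 1;
  } while (*text++ != '\0');
  return 0;
}

// matchhere: search for regexp at beginning of text
int matchhere(const char* regexp, const char* text)
{
  if (regexp[0] == '\0')
    return 1;
  if (regexp[1] == '*')
    return matchstar(regexp[0], regexp + 2, text);
  if (regexp[0] == '$' && regexp[1] == '\0')
    return *text == '\0';
  if (*text != '\0' && (regexp[0] == '.' || regexp[0] == *text))
    return matchhere(regexp + 1, text + 1);
  return 0;
}

// matchstar: search for c*regexp at beginning of text
int matchstar(int c, const char* regexp, const char* text)
{
  do {                        // a * matches zero or more instances
    if (matchhere(regexp, text))
      return 1;
  } while (*text != '\0' && (*text++ == c || c == '.'));
  return 0;
}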

You can perform a live search of Bram Stoker's book "Dracula" (public domain, my digital version is 15123 lines long, 838 kBytes):

A very basic use case looks as follows:
// find first match
const char* text  = "aabc";
const char* regex = "^a+b";
if (match(text, regex))
  printf("yes, %s matches the pattern %s", text, regex);

(read more ...)

Fast CRC32

posted by Stephan Brumme, updated

Error Checking

Real-life data tends to get corrupted because machines (and humans) are never as reliable as we wish for. One efficient way to make sure your data wasn't unintentionally modified is to generate some kind of hash. That hash should be unique, compact and efficient: many protocols, like Ethernet and GZIP archives, append a so-called CRC hash to their data streams. It has a few weaknesses regarding uniqueness (sometimes data modifications remain undetected) but is absolutely great when it comes to compactness (CRC32: 32 bits) and efficiency. You find a CRC32 right below every Download button on my blog, too.

There is one very useful property of CRC32: it can be implemented as a rolling algorithm. That means, if you already have some chunk of data and its CRC, then you can append new data and compute the updated CRC by using your original CRC as a seed and scanning only through the appended data.

I don't want to go into the mathematical details of Cyclic Redundancy Checking because there is plenty of information on the internet, e.g. Wikipedia and the Painless Guide To CRC Error Detection Algorithms. This article only discusses how to write a fast CRC32 algorithm in C/C++.

If you aren't too keen on technical details and just want to have the fastest implementation for not-too-small datasets, I strongly recommend using the crc32_fast function.
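
To illustrate both the basic idea and the rolling property, here is a minimal bitwise variant - the slowest algorithm from the table below; the function and parameter names are mine for this sketch, but the seed parameter shows how a computation can be continued as described above:

// minimal bitwise CRC32 (reflected polynomial 0xEDB88320, as used by Ethernet/GZIP/zlib)
#include <cstddef>
#include <cstdint>

uint32_t crc32_bitwise(const void* data, size_t length, uint32_t previousCrc32 = 0)
{
  uint32_t crc = ~previousCrc32;                 // same as previousCrc32 ^ 0xFFFFFFFF
  const unsigned char* current = (const unsigned char*) data;
  while (length-- != 0)
  {
    crc ^= *current++;
    for (int bit = 0; bit < 8; bit++)            // one bit per iteration
      crc = (crc >> 1) ^ (0xEDB88320 & (0 - (crc & 1)));
  }
  return ~crc;                                   // final inversion
}

Thanks to the seed parameter, crc32_bitwise(more, moreLength, crc32_bitwise(first, firstLength)) yields the same value as hashing both buffers in one call.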

Note: The concepts behind the various CRC32 algorithms are not my original work - I only gathered them in one place.

I assume a little-endian architecture (Intel's x86 and x64 CPU default byte ordering) and ints to be 32 bits wide (replace unsigned int by uint32_t if you use GCC64/Unix).

Update November 13, 2011: The code is more portable now, e.g. uint32_t is used. I got it running on an Arduino board, see here for more details.

Update May 4, 2013: Now the code is endian-aware. Read my posting for instructions on how to run a big endian virtual machine on Linux or Windows.

Update February 6, 2014: If you are interested in MD5, SHA1 and/or SHA256 hashes then take a look at my portable C++ hashing library which of course includes CRC32, too.

Update August 12, 2014: added a Javascript version of the simple bitwise algorithm

Update February 4, 2015: added Slicing-by-16 algorithm based on suggestions by Bulat Ziganshin

Update August 14, 2015: compiler-dependent #includes and #ifdefs were improved for Cygwin, MinGW and Clang. CRC32 code remained unchanged

Update October 21, 2016: changed code structure into a library format, added tableless byte algorithms

CRC32 of 1 GByte            published by                                  bits per iteration   table size    time                   throughput              CPU cycles/byte
Original                    (unknown)                                     1                    -             29.2 seconds           35 MByte/s              approx. 100
Branch-free                 (unknown)                                     1                    -             16.7 seconds           61 MByte/s              approx. 50
Improved Branch-free        (unknown)                                     1                    -             14.5 seconds           70 MByte/s              approx. 40
Half-Byte                   (unknown)                                     4                    64 bytes      4.8 seconds            210 MByte/s             approx. 14
Tableless Full-Byte         sent to me by Hagai Gold                      8                    -             6.2 seconds            160 MByte/s             approx. 18
Tableless Full-Byte         found in "Hacker's Delight" by Henry S. Warren 8                   -             6.3 seconds            155 MByte/s             approx. 19
Standard Implementation     Dilip V. Sarwate                              8                    1024 bytes    2.8 seconds            375 MByte/s             approx. 8
Slicing-by-4                Intel Corp.                                   32                   4096 bytes    0.95 or 1.2 seconds*   1050 or 840 MByte/s*    approx. 3 or 4*
Slicing-by-8                Intel Corp.                                   64                   8192 bytes    0.55 or 0.7 seconds*   1800 or 1400 MByte/s*   approx. 1.75 or 2.25*
Slicing-by-16               based on Slicing-by-8, improved by Bulat Ziganshin  128            16384 bytes   0.4 or 0.5 seconds*    3000 or 2000 MByte/s*   approx. 1.1 or 1.5*
4x unrolled with prefetch   based on Slicing-by-8, improved by Bulat Ziganshin  512            16384 bytes   0.35 or 0.5 seconds*   3200 or 2000 MByte/s*   approx. 1 or 1.5*
Note: GCC 4.7 produces faster code than Visual C++ 2010 for Slicing-by-4, Slicing-by-8 and Slicing-by-16 (hence the two numbers marked with * above). For the other algorithms, throughput is almost identical.
(read more ...)

Practical String Searching

posted by Stephan Brumme


There is a huge number of scientific papers on fast, faster and super-fast string searching algorithms.
They usually prove theoretical performance in O-Notation and most papers cover memory consumption as well.

However, theoretical performance isn't always the same as practical performance. That's why I always want to measure real-world throughput: this article presents hopefully understandable C implementations of the most common generic string search algorithms.

In addition I also wrote a simple tool called mygrep that prints all lines of a file where a search phrase is found.
It doesn't come with all the bells and whistles of the Unix tool grep but achieves similar or sometimes even better speed.
A direct comparison can be done online, just scroll down to the Live Demo.

Playing around with various search phrases you will easily see the plain and simple truth: for most situations the naive brute-force algorithm is a great choice.
A slightly modified version based on memchr and memcmp is often several times faster than the second-fastest contender - just because this simple algorithm can make efficient use of modern CPUs' optimized string opcodes and fits well into the caches.
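
The following sketch shows that idea; function and variable names are mine and it is not the exact code used by mygrep:

// naive search, accelerated by memchr (find a candidate) and memcmp (verify it)
#include <cstddef>
#include <cstring>

const char* naiveSearch(const char* haystack, size_t haystackLength,
                        const char* needle,   size_t needleLength)
{
  if (needleLength == 0 || needleLength > haystackLength)
    return 0;

  const char* current = haystack;
  const char* last    = haystack + haystackLength - needleLength;
  while (current <= last)
  {
    // let the heavily optimized memchr skip ahead to the next possible match
    const char* candidate = (const char*) memchr(current, needle[0], last - current + 1);
    if (!candidate)
      return 0;
    // compare the remaining bytes in one go
    if (memcmp(candidate + 1, needle + 1, needleLength - 1) == 0)
      return candidate;
    current = candidate + 1;
  }
  return 0;
}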

These generic routines might slow down considerably in worst-case scenarios, like DNA analysis (small alphabet and many repeated similar parts).
And yes, fgrep is damn fast.

Live Demo

This demo scans through the first 100 MByte of Wikipedia (well, an older snapshot dating back to March 3, 2006) on the server.
Before search starts, the file is mapped into the Linux file cache in order to avoid disk latencies (cat enwik8 > /dev/zero).

Notes: execution times can vary significantly - depending on overall server load.
Special characters such as " or \ are disabled to prevent potential security breaches into my server.
The search algorithms actually can search for these special characters (as plain ASCII symbols).
Don't confuse it with meta characters for regular expressions; those are not supported.
Search is case-sensitive.


(read more ...)

The Mersenne Twister Pseudo Random Number Generator

posted by Stephan Brumme


The Mersenne Twister is often regarded as the fastest pseudo-random number generator which passes almost all statistical tests.

The original C code isn't exactly beautiful, therefore I decided to write my own C++ class.
And for the fun of it, I converted the code to Javascript and added two live demos, too (scroll down).

Live Demo

This demo will give you the first 10 random numbers generated by the Mersenne Twister.
My C++ implementation returns exactly the same values as the original code written by Makoto Matsumoto and Takuji Nishimura.

These numbers will be computed on the webserver whereas the Javascript code - which generates exactly the same numbers, too - obviously runs in your browser.
Modifying the seed value will change the sequence of random numbers. seed must be a 32 bit integer. If you leave it empty, 5489 will be used instead.
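
If you just want to double-check these reference values, the standard library's std::mt19937 implements the very same MT19937 algorithm and even uses 5489 as its default seed - so a few lines suffice (this uses the standard library, not my class):

// print the first 10 numbers of the MT19937 reference sequence (default seed 5489)
#include <cstdio>
#include <random>

int main()
{
  std::mt19937 mt;                           // identical to std::mt19937 mt(5489)
  for (int i = 0; i < 10; i++)
    printf("%08x\n", (unsigned int) mt());   // printed in hexadecimal
  return 0;
}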


(read more ...)

Parallel Sieve of Eratosthenes [updated]

posted by Stephan Brumme, updated

Processing 1 Billion Numbers Per Second

There are 50,847,534 prime numbers between 2 and 1,000,000,000. For reasons unknown, I wanted to know whether a C++ implementation can find all these numbers on my desktop computer (Intel Core i7 860, quad-core, hyperthreading, 2.8GHz) in less than 1 second with a simple algorithm such as the Sieve of Eratosthenes.

By only slightly modifying the basic algorithm and unleashing the raw power of OpenMP, I initially achieved a 12x speed-up: that sieve finished after 1.1 seconds - mission almost accomplished.
The block-wise version described in the update below pushes the speed-up to about 20x and well below one second.

UPDATE October 6, 2011: After adding a simple test for divisibility by 3 (and 5, 7, 9, 11, 13), the NEW block-wise sieve gets much faster and swiftly breaks through the 1-second-barrier.

UPDATE July 7, 2014: Many browsers support asm.js to boost Javascript performance. Scroll down to find my asm.js implementation or click here.
Algorithm          OpenMP           from 2 to 1 billion   time           speed-up   CPU ticks (approx.)   notes
Basic Algorithm    no               50847534 primes       13.5 seconds   (1.0x)     40 billion            shortest code (22 lines)
Odd Numbers Only   no               50847534 primes       6.7 seconds    2.0x       20 billion            only 10% more code than basic algorithm
Odd Numbers Only   yes, 8 threads   50847534 primes       5.4 seconds    2.5x       130 billion
Block-Wise NEW     no               50847534 primes       2.4 seconds    5.6x       7 billion             most efficient
Block-Wise NEW     yes, 8 threads   50847534 primes       0.65 seconds   20.7x      16 billion            fastest
Note: Much more complex algorithms easily outperform my code. My primary focus was parallelism in general, not prime numbers !
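
To give an idea of how little code the "Odd Numbers Only" variant needs, here is a condensed sketch; structure and names are mine, not a verbatim copy of the original source, and the OpenMP pragma simply parallelizes the crossing-out loop (compile with -fopenmp on GCC, otherwise the pragma is ignored and the sieve runs single-threaded):

// sieve of Eratosthenes, odd numbers only - illustration, not the original code
// (sieving up to one billion needs about 500 MByte RAM in this naive form)
#include <cstdint>
#include <cstdio>
#include <vector>

int main()
{
  const uint64_t limit = 1000000000;    // find all primes below one billion
  const uint64_t half  = limit / 2;     // index i represents the odd number 2*i + 1
  std::vector<char> isComposite(half, 0);

  // crossing out can stop once (2*i+1)^2 exceeds the limit
  for (uint64_t i = 1; (2*i + 1) * (2*i + 1) < limit; i++)
  {
    if (isComposite[i])
      continue;

    const int64_t prime = int64_t(2*i + 1);
    // mark all odd multiples of this prime, starting at its square;
    // the iterations are independent, so OpenMP may spread them over all cores
    #pragma omp parallel for
    for (int64_t j = prime * prime / 2; j < int64_t(half); j += prime)
      isComposite[j] = 1;
  }

  uint64_t found = 1;                   // the only even prime: 2
  for (uint64_t i = 1; i < half; i++)   // index 0 stands for 1, which is not prime
    if (!isComposite[i])
      found++;
  printf("%llu primes\n", (unsigned long long) found);
  return 0;
}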

Live Demo

I ported the short Odd Numbers Only algorithm to Javascript and PHP; the code can be seen at the bottom of this page. The table below shows the expected number of primes for some reference inputs and the performance numbers measured on my computer.



In this benchmark, C++ is approx. 4x faster than the best Javascript engine (Mozilla Firefox) and approx. 100x faster than PHP 5.3. Only the PHP times marked with (*) are measured on my local computer while all other PHP timings were measured on the server (which is about 25% slower).

All browsers are still 32 bit programs - while my operating system is already 64 bit - and run out of memory when attempting to find all prime numbers from 1 to 1 billion. The more complex blockwise sieve does not suffer from this problem but I was too lazy to port it, too.
Multithreading is currently supported by neither Javascript nor PHP.
from 2 to ...    primes       C++       Internet Explorer 10   Firefox 18   Opera 12     Chrome 24    PHP 5.3
10               4            <0.001s   <0.001s                <0.001s      <0.001s      <0.001s      <0.001s
100              25           <0.001s   <0.001s                <0.001s      <0.001s      <0.001s      <0.001s
1,000            168          <0.001s   <0.001s                <0.001s      <0.001s      <0.001s      <0.001s
10,000           1,229        <0.001s   <0.001s                <0.001s      <0.001s      <0.001s      0.003s
100,000          9,592        <0.001s   0.001s                 0.001s       0.002s       0.001s       0.027s
1,000,000        78,498       0.002s    0.010s                 0.007s       0.022s       0.015s       0.393s
10,000,000       664,579      0.024s    0.139s                 0.103s       0.248s       0.190s       3.717s
100,000,000      5,761,455    0.555s    about 1.7s             about 1.2s   about 2.7s   about 2.0s   about 29s (*)
1,000,000,000    50,847,534   6.7s      n/a                    n/a          n/a          n/a          n/a
For your interest: the browser ranking is contrary to what I have observed with my Javascript global illumination port of smallpt. I expected PHP to be faster, too.

UPDATE February 14, 2013: More than a year after my initial tests, Firefox became the fastest browser due to its vastly improved Javascript engine.
(read more ...)

Windows and Linux Code Timing

posted by Stephan Brumme

Whenever I have to do some performance testing, I try the most simple way: write a short program and run it within my Linux bash shell which comes with a built-in time command:
$ time ./testcode
real    0m0.788s
user    0m0.696s
sys     0m0.080s
Obviously there can (and will) be some overhead by the operating system. So numbers are typically a bit off.
The timing resolution isn't that great, either, and often lies above 10ms.

That's why I wrote a simple C routine to measure performance inside the code. My goal was to have numbers that are accurate down to about 1 microsecond (10^-6 seconds).
I didn't opt for extremely high timing precision (count CPU ticks) because I rarely need that degree of precision and really want to avoid the tricky details of CPU frequency scaling, cache stalls etc.

Another feature is portability; simply recompile the code and run it on Windows and Linux. The code works fine with GCC as well as Visual C++.

All my timing code contains is a single function seconds() returning a double. You can invoke it in your programs like this:
double start = seconds();
// ... your code that should be measured ...
double duration = seconds() - start;
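
A possible implementation looks like this; it's a sketch of the usual approach (QueryPerformanceCounter on Windows, gettimeofday elsewhere) and may differ from the downloadable code in a few details:

// portable wall-clock timer with roughly microsecond resolution
#ifdef _WIN32
#include <windows.h>
double seconds()
{
  LARGE_INTEGER frequency, now;
  QueryPerformanceFrequency(&frequency);   // ticks per second
  QueryPerformanceCounter  (&now);         // current tick count
  return now.QuadPart / double(frequency.QuadPart);
}
#else
#include <sys/time.h>
double seconds()
{
  timeval now;
  gettimeofday(&now, 0);
  return now.tv_sec + now.tv_usec / 1000000.0;
}
#endif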

(read more ...)

Portable C++ Hashing Library

posted by Stephan Brumme, updated


Let's start with the core features of my C++ hashing library:


Since there are no external dependencies, not even dependencies within the library itself, all you need is to include the header of the hashing algorithm of your choice and add the identically named .cpp to your project. That means if you want to add SHA256 hashing to your C++ program, you only have to include sha256.h and sha256.cpp in your project.

If you prefer SHA256, then your code will look like this:
// SHA2 test program
#include "sha256.h"
#include <iostream> // for std::cout only, not needed for hashing library

int main(int, char**)
{
  // create a new hashing object
  SHA256 sha256;

  // hashing an std::string
  std::cout << sha256("Hello World") << std::endl;
  // => a591a6d40bf420404a011733cfb7b190d62c65bf0bcda32b57b277d9ad9f146e

  // hashing a buffer of bytes
  const char* buffer = "How are you";
  std::cout << sha256(buffer, 11) << std::endl;
  // => 9c7d5b046878838da72e40ceb3179580958df544b240869b80d0275cc07209cc

  // or in a streaming fashion (re-use "How are you")
  SHA256 sha256stream;
  const char* url = ""; // 25 bytes
  int step = 5;
  for (int i = 0; i < 25; i += step)
    sha256stream.add(url + i, step); // add five bytes at a time
  std::cout << sha256stream.getHash() << std::endl;
  // => 82aa771f1183c52f973c798c9243a1c73833ea40961c73e55e12430ec77b69f6

  return 0;
}

(read more ...)

browse through all postings:

older stuff:

2014/Feb/27   Sorting STL Containers
2013/Sep/18   Portable Memory Mapping C++ Class
2013/May/03   Testing Endian Aware Code
2013/Apr/24   Software License
2013/Feb/15   Read-Only Git Access Available
2011/Dec/05   Fowler-Noll-Vo Hash (FNV1a)
2011/Dec/01   Drawing Antialiased Circles and Ellipses
2011/Nov/27   Uncompressing DEFLATE Data
2011/Nov/23   Streaming MPEG-4 Movies From Your Own Server
2011/Nov/08   Binary Search Revisited
2011/Oct/20   Reverse MD5 Hash
2011/Oct/17   Windows 7 Screenshots
2011/Oct/16   Hide Strings In Your Executable
2011/Oct/11   Generate QR Codes With PHP and Google Charts API
2011/Oct/02   Feed.rss and Sitemap.xml
2011/Sep/28   My Experiences with AdSense
2011/Sep/27   Emulating MemCache in Pure PHP
2011/Sep/14   Simple Ajax
2011/Sep/13   Sparklines
2011/Sep/12   smallpt - Global Illumination in Javascript
2011/Sep/11   Multi-Core Sorting
2011/Sep/06   CLein - The tiny, portable OpenCL info tool