
Length-Limited Prefix Codes

posted by Stephan Brumme

Overview

The most common technique to generate prefix-free codes is the Huffman algorithm.
It is part of many popular compression algorithms such as JPEG, DEFLATE (ZIP / GZIP / PNG / etc.).
In most cases it serves as an "afterburner" to compress pre-processed symbols.

The main advantage - variable-length codes - can be a critical drawback, too: if the maximum code length exceeds a certain threshold, e.g. the CPU's integer size (16 / 32 / 64 bit), then the compression/decompression code becomes complex and slow.
All of the aforementioned file formats put a strict upper limit on the maximum length of Huffman codes.
Mainly due to their age and low number of distinct symbols, they all agreed on 16 bits per code.

So-called length-limited prefix codes are no longer Huffman codes because they are not optimal (whenever the length restriction actually had to be enforced).
However, the loss in compression efficiency is often negligible and widely accepted because of the significant gain in decompression performance.

There are a number of algorithms to create such length-limited prefix codes. They vary in speed, efficiency and complexity.
I wrote a C library implementing the most common techniques. Most of the code derives from open-source programs and was invented by others.

My major contribution was to create a shared interface so that these algorithms are easily interchangeable:
unsigned char algorithmName(unsigned char maxLength, unsigned int numCodes, const unsigned int histogram[], unsigned char codeLengths[]);
There are three input parameters (maxLength, numCodes and histogram[]), one output parameter (codeLengths[]) and a return value. If you want to encode the sequence 0 3 1 1 2 5 2 2 then numCodes = 6 and histogram[] = { 1,2,3,1,0,1 } (symbol 0 occurs once, symbol 1 twice, symbol 2 three times, and so on).
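A minimal usage sketch could look like this; packageMerge is a hypothetical stand-in for whichever of the implementations listed below you actually call, and the captured return value is described in the library's documentation:

#include <stdio.h>

// hypothetical stand-in following the shared interface shown above
unsigned char packageMerge(unsigned char maxLength, unsigned int numCodes,
                           const unsigned int histogram[], unsigned char codeLengths[]);

int main()
{
  // sequence 0 3 1 1 2 5 2 2  =>  symbol 0 occurs once, symbol 1 twice, symbol 2 three times, ...
  const unsigned int histogram[6] = { 1, 2, 3, 1, 0, 1 };
  unsigned char codeLengths[6];

  // ask for prefix codes which are at most 3 bits long
  unsigned char result = packageMerge(3, 6, histogram, codeLengths);

  for (unsigned int i = 0; i < 6; i++)
    printf("symbol %u => %u bits\n", i, codeLengths[i]);
  printf("returned: %u\n", result);
  return 0;
}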

However, this shared interface comes with a little bit of overhead: sometimes doubling code size and/or execution time.
Therefore most files have a second public function which is more specialized for its algorithm but may have a few additional restrictions.
A common case is that the histogram has to be sorted and unused symbols removed.

Currently implemented are:
Algorithm     | Author                                   | Pro                        | Contra                         | Description
Package-Merge | Larmore and Hirschberg                   | optimal output             | slowest and most complex       | scroll down
JPEG          | ITU T.81 standard, annex K.3             | shortest code              | mediocre compression           | scroll down
MiniZ         | Rich Geldreich                           | similar to JPEG but faster | mediocre compression           | scroll down
BZip2         | Julian Seward                            | trivial to understand      | worst compression              | scroll down
Kraft         | inspired by Charles Bloom and Jörn Engel | often fastest              | wildly fluctuating compression | strategy A and strategy B

(read more ...)

smalLZ4 - optimal LZ4 compression

posted by Stephan Brumme, updated

You came here to grab just the code or a binary ? Scroll down to the Download section and GIT repository.
A short overview of how smallz4 compares to the original LZ4 and to a program called LZ4X (which follows similar ideas as smallz4) can be found here.

And if you are interested in the inner workings, keep on reading and enjoy the ride ...

Updates

version 1.1 (August 2, 2018):
I improved the compression ratio by better handling of block boundaries (reduced test file enwik9.lz4 by 647 bytes).
Unfortunately, there was a bug in the header of uncompressed blocks - it's fixed.

version 1.2 (August 3, 2018):
Experimental support for dictionary compression

version 1.2.1 (August 21, 2018):
Greatly reduced size of the portable, statically compiled smallz4cat binaries - by using diet libc on Linux (only 7kb now !) and by replacing fprintf with fputs.
The smallz4cat Windows binary shrinks by about 20% due to the fputs trick, too.
The compression algorithm remained completely unchanged.

version 1.3 (November 14, 2018):
Support for LZ4 legacy format (slightly smaller output if input < 8 MByte)
Fixed a wrong cost estimation for matches with length 19+255x - they were often ignored during optimization.

version 1.3.1 (April 25, 2019):
Slightly faster (both encoding and decoding) and more code comments. Windows binaries now compiled with Visual C++ 2019 Community.

version 1.4 (December 18, 2019):
A user-defined pointer can be supplied which is forwarded to the read and write functions. This makes it possible to avoid nasty global variables.
Performance is completely unaffected, therefore I didn't upload binaries for version 1.4.

version 1.5 (April 13, 2020):
Fixed 64-bit bugs affecting files over 2GB.

Optimal Parsing

In the context of compression, optimal parsing is a multi-pass algorithm to find the smallest output of compressible data.
Whenever there are multiple choices of encoding a symbol (or a sequence of symbols), optimal parsing will choose the representation which leads to the overall smallest output.
That means, it may prefer a locally sub-optimal encoding in order to achieve a globally optimal encoding.

Most compression algorithms strive for locally optimal encoding because it's computationally cheaper.
The most basic algorithm, the greedy algorithm, always selects the encoding that fits best "right now".
A significant improvement, called "lazy evaluation", takes the next group of symbols into consideration and decides which encoding yields the smallest output for the whole group.
In a way, lazy evaluation can be seen as a local version of optimal parsing. It does the same job for a group of symbols instead of the whole file.

Compressing the string abcde_bcdefgh_abcdefghxxxxxxx returns different file sizes:
$ echo "abcde_bcdefgh_abcdefghxxxxxxx" | lz4 -9 --no-frame-crc | wc -c 43 $ echo "abcde_bcdefgh_abcdefghxxxxxxx" | ./smallz4 | wc -c 41
Let's assume that red characters are literals, i.e. uncompressed data.
Green pairs of numbers indicate distance and length of a match.
And let's ignore xxxxxxx from now on (its only purpose is to hide the fact that the LZ4 format specification forbids matches at the very end of a file).

lz4's result in detail: abcde_(5,4)fgh_(14,5)fghxxxxxxx
It found two matches: The first replaced bcde by a reference to a sequence we saw 5 bytes ago.
That reference "costs" 3 bytes while the four literals would have occupied 4 bytes. We saved a byte here.
The second match replaces abcde by a reference to the beginning of the file.
Again, we need 3 bytes to represent the match while the replaced text was 5 bytes long. Two more bytes saved !

However, this compression is not optimal. smallz4 can squeeze out another two bytes because it postpones the second match:
abcde_(5,4)fgh_a(9,7)xxxxxxx
smallz4 detects that matching abcde would save 2 bytes but matching bcdefgh saves 4 bytes.

Optimal Parsing - Pass 1

For each byte of input data, the longest match will be found and stored in an array called matches (starts at about line 560).
Additional data structures such as previousHash and previousExact only speed up the code.
A simple brute-force routine will give you the same result - albeit slower (a sketch of such a routine follows the table below).
The input is abcde_bcdefgh_abcdefgh (positions 1 to 22) plus the xxxxxxx padding; positions without a match are omitted below, and values in parentheses belong to matches that are shorter than LZ4's minimum match length of 4 bytes.
position | character | match distance | match length
    7    |     b     |       5        |      4
    8    |     c     |      (5)       |     (3)
    9    |     d     |      (5)       |     (2)
   10    |     e     |      (5)       |     (1)
   14    |     _     |      (8)       |     (1)
   15    |     a     |      14        |      5
   16    |     b     |       9        |      7
   17    |     c     |       9        |      6
   18    |     d     |       9        |      5
   19    |     e     |       9        |      4
   20    |     f     |       9        |     (3)
   21    |     g     |      (9)       |     (2)
   22    |     h     |      (9)       |     (1)
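The following is a rough, hedged sketch of such a brute-force search (not the actual smallz4 code): for every position it scans all earlier positions inside LZ4's 64 KByte window and keeps the longest match.

#include <stddef.h>

struct Match { unsigned int distance; unsigned int length; };

// naive O(n^2) longest-match search, purely for illustration;
// smallz4 itself uses previousHash/previousExact chains to speed this up
Match findLongestMatch(const unsigned char* data, size_t size, size_t pos)
{
  const size_t MaxDistance = 65535; // LZ4 window size
  Match best = { 0, 0 };

  size_t start = (pos > MaxDistance) ? pos - MaxDistance : 0;
  for (size_t earlier = start; earlier < pos; earlier++)
  {
    // count matching bytes
    size_t length = 0;
    while (pos + length < size && data[earlier + length] == data[pos + length])
      length++;

    if (length > best.length)
    {
      best.length   = (unsigned int)length;
      best.distance = (unsigned int)(pos - earlier);
    }
  }
  return best;
}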

(read more ...)

A JPEG encoder in a single C++ file ...

posted by Stephan Brumme, updated
Recently, one of my programs needed to write a few JPEG files. The obvious choice is to link against libjpeg (or even better: libjpeg-turbo).

These libraries are fast, loaded with a plethora of features and very well tested.
However, they are huge (static linking is pretty much a no-go) and their API is kind of too complicated if you "just want to save an image".

Building my own JPEG encoder

I found a great alternative on Jon Olick's website - his JPEG Writer. It's barely 336 lines of C++ code in a single file and part of the very popular stb_image_write library, too.

However, the large number of magic constants bothered me, so I started looking up their meaning and in the end wrote my own JPEG encoder from scratch.
I realized that a few things in Jon's library weren't optimal and/or were missing.
It was a fun experience and I learned a lot about JPEGs and the JFIF file format.

(read more ...)

flexiGIF - Lossless GIF/LZW optimization

posted by Stephan Brumme, updated
flexiGIF shrinks GIF files by optimizing their compression scheme (LZW algorithm).
No visual information is changed and the output is 100% pixel-wise identical to the original file - that's why it's called "lossless optimization".
And the results are still GIF files you can open/view with any standard tool.
Animated GIFs are supported as well.

Most files can be reduced by about 2%. I'm not aware of any open-source software outperforming flexiGIF.

The only downside: it may take several seconds or even minutes for a medium-sized GIF image.
That's several orders of magnitude slower than other GIF encoders.
Your GIF decoder isn't affected at all - probably it even becomes faster because it has less input to process.

flexiGIF does NOT optimize the palette, strip extensions or reduce color information.
There are many sophisticated tools out there, such as ImageMagick, that excel at that job.
flexiGIF is designed to be a tool used after you have run those image-processing programs.
(read more ...)

xxHash - a hash algorithm as fast as memcpy

posted by Stephan Brumme

Numbers, please !

Yes, Yann Collet's xxHash algorithm can be faster than memcpy (Visual C++):
algorithm                        | throughput   | bytes/CPU cycle
memcpy                           | 5.45 GByte/s | approx. 1.6
xxHash32 (Yann's implementation) | 5.9 GByte/s  | approx. 1.7
xxHash32 (my code)               | 5.9 GByte/s  | approx. 1.7
CRC32 Slicing-by-16              | 3.2 GByte/s  | approx. 1
Yes, xxHash is extremely fast - but keep in mind that memcpy has to read and write lots of bytes whereas this hashing algorithm reads everything but writes only a few bytes.

For comparison: memset achieves 8.4 GByte/s on the same Intel Core i7-2600K CPU @ 3.40GHz system.
Even more interesting is that even pretty old versions of G++ have a faster version of memcpy (7.7 GByte/s) and much, much faster intrinsics for memset (18.2 GByte/s).

Update August 8, 2016:
I wrote very similar code for xxHash64. It's super-fast on x64 but much slower when compiled for x32 systems.
Most of this posting will remain unchanged to avoid confusion - just keep in mind that every time xxHash32 touches 4 bytes, xxHash64 does pretty much the same with 8 bytes (with a few minor exceptions).
algorithm                        | throughput   | bytes/CPU cycle
xxHash64 (Yann's implementation) | 10.5 GByte/s | approx. 3.1
xxHash64 (my code)               | 10.5 GByte/s | approx. 3.1

Algorithm

xxHash was designed from the ground up to be as fast as possible on modern CPUs.
It is not a strong cryptographic hash, such as the SHA family, but still passes the SMHasher test set with 10 points.

Most simple hashes, such as FNV (see my posting, too), step through the input data byte-by-byte.
Working on the byte at position 1 requires that all work on the previous byte (at position 0) has finished.
Therefore we have a dependency chain which can cause the CPU to stall while it waits.

Slightly better algorithms process a block of bytes at once, e.g. CRC Slicing-by-4/8/16 consumes 4/8/16 bytes per step - instead of just one - giving a substantial speed-up.
However, we still have the same problem: working on block 1 requires that all work on block 0 has finished.

xxHash subdivides input data into four independent streams. Each stream processes a block of 4 bytes per step and stores a temporary state.
Only the final step combines these four states into a single one.
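In code, that core loop boils down to something like this condensed sketch (using xxHash32's published prime constants; seeding, inputs shorter than 16 bytes and the final avalanche are omitted here):

#include <stdint.h>
#include <stddef.h>

static inline uint32_t rotateLeft(uint32_t x, unsigned char bits)
{
  return (x << bits) | (x >> (32 - bits));
}

// process the bulk of the input in 16-byte blocks, four independent streams;
// sketch only - seeding, short inputs and the final mixing are left out
void processBlocks(uint32_t state[4], const uint32_t* block, size_t numBlocks)
{
  const uint32_t Prime1 = 2654435761U;
  const uint32_t Prime2 = 2246822519U;

  for (size_t i = 0; i < numBlocks; i++, block += 4)
    for (int stream = 0; stream < 4; stream++) // the four streams are independent => the CPU can interleave them
      state[stream] = rotateLeft(state[stream] + block[stream] * Prime2, 13) * Prime1;
}

// afterwards the four states are merged into a single 32-bit value:
// hash = rotateLeft(state[0],1) + rotateLeft(state[1],7) + rotateLeft(state[2],12) + rotateLeft(state[3],18);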

The major advantage is that the code generator has lots of opportunities to re-order opcodes to avoid latencies.
I drew a crude kindergarten-style image to visualize how memory is accessed when hashing 64 bytes:

[Figure: hashing 64 bytes - the input is split round-robin into 4-byte chunks: bytes 0-3, 16-19, 32-35 and 48-51 feed stream 0 (ending in state[0]), bytes 4-7, 20-23, 36-39 and 52-55 feed stream 1 (state[1]), and so on for streams 2 and 3.]

(read more ...)

Tiny Regular Expression Matcher

posted by Stephan Brumme

Regular expressions (regex) can be the gate to heaven as well as the gate to hell. Usually they are both at the same time.
Writing a correct regular expression is sometimes as tough as writing a regular expression parser.

Chapter 1 of the excellent book "Beautiful Code" (ISBN 9780596510046) discusses Rob Pike's very short and elegant regex parser code.
It doesn't come with all the bells and whistles of a full Perl regular expression matcher but is quite sufficient for many purposes.
I extended Rob's C code in such a way that it can handle the following regex meta-symbols (see limitations, too):
Regex Symbol | Meaning                                                              | Already in Rob Pike's code?
.            | match any character                                                  | yes
*            | the previous character is repeated arbitrarily often                 | yes
+            | the previous character occurs at least once                          | no
?            | the previous character occurs once or not at all                     | no
^            | match the regular expression only at the beginning of the text      | yes
$            | match the regular expression only at the end of the text            | yes
\            | treat the next character as a literal, even if it is a regex symbol | no
My C code consists of just one .h and one .c file without any external dependencies - not even #include <stdlib.h>.
The source file match.c contains about 100 lines of straight C code (plus tons of comments and blank lines).
Take a look at my explanation of the matching algorithm.
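For reference, the heart of Rob Pike's original matcher - essentially as printed in "Beautiful Code" - fits in a handful of lines; keep in mind that my extended version additionally handles + ? \ and expects its arguments as match(text, regex):

// Rob Pike's original matcher (no + ? \ yet), argument order: regexp first
int matchhere(const char* regexp, const char* text);

// matchstar: search for c*regexp at beginning of text
int matchstar(int c, const char* regexp, const char* text)
{
  do {                              // a * matches zero or more instances
    if (matchhere(regexp, text))
      return 1;
  } while (*text != '\0' && (*text++ == c || c == '.'));
  return 0;
}

// matchhere: search for regexp at beginning of text
int matchhere(const char* regexp, const char* text)
{
  if (regexp[0] == '\0')
    return 1;
  if (regexp[1] == '*')
    return matchstar(regexp[0], regexp + 2, text);
  if (regexp[0] == '$' && regexp[1] == '\0')
    return *text == '\0';
  if (*text != '\0' && (regexp[0] == '.' || regexp[0] == *text))
    return matchhere(regexp + 1, text + 1);
  return 0;
}

// match: search for regexp anywhere in text
int match(const char* regexp, const char* text)
{
  if (regexp[0] == '^')
    return matchhere(regexp + 1, text);
  do {                              // must look even if text is empty
    if (matchhere(regexp, text))
      return 1;
  } while (*text++ != '\0');
  return 0;
}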

You can perform a live search of Bram Stoker's book "Dracula" (public domain, my digital version is 15123 lines long, 838 kBytes):


A very basic use case looks as follows:
// example: find first match
const char* text  = "aabc";
const char* regex = "^a+b";
if (match(text, regex))
  printf("yes, %s matches the pattern %s", text, regex);

(read more ...)

Fast CRC32

posted by Stephan Brumme, updated

Error Checking

Real-life data tends to get corrupted because machines (and humans) are never as reliable as we wish for. One efficient way to make sure your data wasn't unintentionally modified is to generate some kind of hash. That hash should be unique, compact and efficient to compute.
Many protocols, like Ethernet and GZIP archives, append a so-called CRC hash to their data streams. It has a few weaknesses regarding uniqueness (sometimes data modifications remain undetected) but is absolutely great when it comes to compactness (CRC32: 32 bits) and efficiency. You will find a CRC32 right below every Download button on my blog, too.

There is one very useful property of CRC32: it can be implemented as a rolling algorithm. That means, if you already have some chunk of data and its CRC, then you can append new data and compute the updated CRC by using your original CRC as a seed and scanning through just the appended data.
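As a small sketch (assuming the library's crc32_fast accepts the previous CRC as an optional third parameter):

#include <stdint.h>
#include <stddef.h>
#include <string.h>

// assumed declaration of the library's main entry point
uint32_t crc32_fast(const void* data, size_t length, uint32_t previousCrc32);

int main()
{
  const char* part1 = "Hello ";
  const char* part2 = "World";

  // CRC of the whole string in one go ...
  uint32_t full = crc32_fast("Hello World", 11, 0);

  // ... is identical to feeding the data in chunks, re-using the previous CRC as seed
  uint32_t rolling = crc32_fast(part1, strlen(part1), 0);
  rolling          = crc32_fast(part2, strlen(part2), rolling);

  return full == rolling ? 0 : 1;
}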

I don't want to go into the mathematical details of Cyclic Redundancy Checking because there is plenty of information on the internet, e.g. Wikipedia and the Painless Guide To CRC Error Detection Algorithms. This article only discusses how to write a fast CRC32 algorithm in C/C++.

If you aren't too keen on technical details and just want to have the fastest implementation for not-too-small datasets, I strongly recommend using the crc32_fast function.

Note: The concepts behind the various CRC32 algorithms are not my original work - I only gathered them in one place.

I assume a little-endian architecture (Intel's x86 and x64 CPU default byte ordering) and ints to be 32 bits wide (replace unsigned int by uint32_t if you use GCC64/Unix).
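To illustrate, a minimal bit-by-bit sketch (roughly what the slowest rows of the table further down do) looks like this:

#include <stdint.h>
#include <stddef.h>

// simplest possible CRC32: one input bit per inner iteration, no lookup table
uint32_t crc32_bitwise_sketch(const void* data, size_t length, uint32_t previousCrc32 = 0)
{
  const uint32_t Polynomial = 0xEDB88320;        // reflected CRC-32 polynomial
  uint32_t crc = ~previousCrc32;
  const uint8_t* current = (const uint8_t*) data;

  while (length-- != 0)
  {
    crc ^= *current++;
    for (int bit = 0; bit < 8; bit++)
      if (crc & 1)
        crc = (crc >> 1) ^ Polynomial;
      else
        crc =  crc >> 1;
  }
  return ~crc;                                   // same as crc ^ 0xFFFFFFFF
}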

Update November 13, 2011: The code is more portable now, e.g. uint32_t is used. I got it running on an Arduino board, see here for more details.

Update May 4, 2013: Now the code is endian-aware. Read my posting for instructions on how run a big endian virtual machine on Linux or Windows.

Update February 6, 2014: If you are interested in MD5, SHA1 and/or SHA256 hashes then take a look at my portable C++ hashing library which of course includes CRC32, too.

Update August 12, 2014: added a Javascript version of the simple bitwise algorithm

Update February 4, 2015: added Slicing-by-16 algorithm based on suggestions by Bulat Ziganshin

Update August 14, 2015: compiler-dependent #includes and #ifdefs were improved for Cygwin, MinGW and Clang. CRC32 code remained unchanged

Update October 21, 2016: changed code structure into a library format, added tableless byte algorithms

CRC32 of 1 GByte                         | published by                                       | bits per iteration | table size  | time                 | throughput            | CPU cycles/byte
Original                                 | (unknown)                                          | 1                  | -           | 29.2 seconds         | 35 MByte/s            | approx. 100
Branch-free                              | (unknown)                                          | 1                  | -           | 16.7 seconds         | 61 MByte/s            | approx. 50
Improved Branch-free                     | (unknown)                                          | 1                  | -           | 14.5 seconds         | 70 MByte/s            | approx. 40
Half-Byte                                | (unknown)                                          | 4                  | 64 bytes    | 4.8 seconds          | 210 MByte/s           | approx. 14
Tableless Full-Byte                      | sent to me by Hagai Gold                           | 8                  | -           | 6.2 seconds          | 160 MByte/s           | approx. 18
Tableless Full-Byte                      | found in "Hacker's Delight" by Henry S. Warren     | 8                  | -           | 6.3 seconds          | 155 MByte/s           | approx. 19
Standard Implementation                  | Dilip V. Sarwate                                   | 8                  | 1024 bytes  | 2.8 seconds          | 375 MByte/s           | approx. 8
Slicing-by-4                             | Intel Corp.                                        | 32                 | 4096 bytes  | 0.95 or 1.2 seconds* | 1050 or 840 MByte/s*  | approx. 3 or 4*
Slicing-by-8                             | Intel Corp.                                        | 64                 | 8192 bytes  | 0.55 or 0.7 seconds* | 1800 or 1400 MByte/s* | approx. 1.75 or 2.25*
Slicing-by-16                            | based on Slicing-by-8, improved by Bulat Ziganshin | 128                | 16384 bytes | 0.4 or 0.5 seconds*  | 3000 or 2000 MByte/s* | approx. 1.1 or 1.5*
Slicing-by-16, 4x unrolled with prefetch | based on Slicing-by-8, improved by Bulat Ziganshin | 512                | 16384 bytes | 0.35 or 0.5 seconds* | 3200 or 2000 MByte/s* | approx. 1 or 1.5*
Note: entries marked with * show two numbers because GCC 4.7 produces faster code than Visual C++ 2010 for Slicing-by-4, Slicing-by-8 and Slicing-by-16 (the first number is GCC, the second Visual C++). For the other algorithms, throughput is almost identical.
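To give an idea of the slicing approach, here is a condensed, little-endian-only sketch of Slicing-by-4 including its table generation; the downloadable library additionally covers big-endian CPUs, Slicing-by-8/16, unrolling and prefetching:

#include <stdint.h>
#include <stddef.h>

static uint32_t Crc32Lookup[4][256];

// build the four lookup tables (4 x 256 x 4 bytes = 4096 bytes)
void buildTables()
{
  for (uint32_t i = 0; i <= 0xFF; i++)
  {
    uint32_t crc = i;
    for (int bit = 0; bit < 8; bit++)
      crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320 : 0);
    Crc32Lookup[0][i] = crc;                     // plain one-byte table
  }
  for (uint32_t i = 0; i <= 0xFF; i++)
    for (int slice = 1; slice < 4; slice++)      // table[k] = one byte followed by k zero bytes
      Crc32Lookup[slice][i] = (Crc32Lookup[slice - 1][i] >> 8)
                            ^  Crc32Lookup[0][Crc32Lookup[slice - 1][i] & 0xFF];
}

// Slicing-by-4: consume 32 input bits per iteration (little-endian only in this sketch)
uint32_t crc32_4bytes_sketch(const void* data, size_t length, uint32_t previousCrc32 = 0)
{
  uint32_t crc = ~previousCrc32;
  const uint32_t* current = (const uint32_t*) data;

  while (length >= 4)
  {
    uint32_t one = *current++ ^ crc;             // fold 4 data bytes into the CRC
    crc = Crc32Lookup[0][(one >> 24) & 0xFF] ^
          Crc32Lookup[1][(one >> 16) & 0xFF] ^
          Crc32Lookup[2][(one >>  8) & 0xFF] ^
          Crc32Lookup[3][ one        & 0xFF];
    length -= 4;
  }

  const uint8_t* currentChar = (const uint8_t*) current;
  while (length-- != 0)                          // remaining 1 to 3 bytes, one at a time
    crc = (crc >> 8) ^ Crc32Lookup[0][(crc & 0xFF) ^ *currentChar++];

  return ~crc;
}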
(read more ...)

Practical String Searching

posted by Stephan Brumme

Motivation

There is a huge number of scientific papers on fast, faster and super-fast string searching algorithms.
They usually prove theoretical performance in O-Notation and most papers cover memory consumption as well.

However, theoretical performance isn't always the same as practical performance. That's why I always want to measure real-world throughput: this article presents hopefully understandable C implementations of the most common generic string search algorithms.

In addition I also wrote a simple tool called mygrep that prints all lines of a file where a search phrase is found.
It doesn't come with all the bells and whistles of the Unix tool grep but achieves similar or sometimes even better speed.
A direct comparison can be done online, just scroll down to the Live Demo.

Playing around with various search phrases you will easily see the plain and simple truth: for most situations the naive brute-force algorithm is a great choice.
A slightly modified version based on memchr and memcmp is often several times faster than the second-fastest contender - just because this simple algorithm can make efficient use of modern CPUs' optimized string opcodes and fits well into the caches.
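The idea in a nutshell, sketched with plain C string functions (this is not the exact mygrep code): let memchr jump to the next candidate position and verify the remaining bytes with memcmp.

#include <string.h>
#include <stddef.h>

// find the first occurrence of needle (needleLength bytes) in haystack (haystackLength bytes);
// returns a pointer to the match or NULL - essentially a simplified memmem()
const char* searchSimple(const char* haystack, size_t haystackLength,
                         const char* needle,   size_t needleLength)
{
  if (needleLength == 0 || haystackLength < needleLength)
    return NULL;

  const char* current = haystack;
  const char* last    = haystack + haystackLength - needleLength;

  while (current <= last)
  {
    // let the highly optimized memchr skip ahead to the next candidate
    current = (const char*) memchr(current, needle[0], last - current + 1);
    if (current == NULL)
      return NULL;

    // candidate found: compare the remaining bytes
    if (memcmp(current + 1, needle + 1, needleLength - 1) == 0)
      return current;

    current++;
  }
  return NULL;
}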

These generic routines might slow down considerably in worst-case scenarios, like DNA analysis (small alphabet and many repeated similar parts).
And yes, fgrep is damn fast.

Live Demo

This demo scans through the first 100 MByte of Wikipedia (well, an older snapshot dating back to March 3, 2006) on the server.
Before search starts, the file is mapped into the Linux file cache in order to avoid disk latencies (cat enwik8 > /dev/zero).

Notes: execution times can vary significantly - depending on overall server load.
Special characters such as " or \ are disabled to prevent potential security breaches into my server.
The search algorithms actually can search for these special characters (as plain ASCII symbols).
Don't confuse it with meta characters for regular expressions; those are not supported.
Search is case-sensitive.


(read more ...)

The Mersenne Twister Pseudo Random Number Generator

posted by Stephan Brumme

Introduction

The Mersenne Twister is often regarded as the fastest pseudo-random number generator which passes almost all statistical tests.

The original C code isn't exactly beautiful, so I decided to write my own C++ class.
And for the fun of it, I converted the code to Javascript and added two live demos, too (scroll down).

Live Demo

This demo will give you the first 10 random numbers generated by the Mersenne Twister.
My C++ implementation returns exactly the same values as the original code written by Makoto Matsumoto and Takuji Nishimura.

These numbers will be computed on the webserver whereas the Javascript code - which generates exactly the same numbers, too - obviously runs in your browser.
Modifying the seed value will change the sequence of random numbers. seed must be a 32 bit integer. If you leave it empty, 5489 will be used instead.


(read more ...)

browse through all postings:

older stuff:

2014/Jul/11   Parallel Sieve of Eratosthenes
2014/Jun/23   Windows and Linux Code Timing
2014/Jun/14   Portable C++ Hashing Library
2014/Feb/27   Sorting STL Containers
2013/Sep/18   Portable Memory Mapping C++ Class
2013/May/03   Testing Endian Aware Code
2013/Apr/24   Software License
2013/Feb/15   Read-Only Git Access Available
2011/Dec/05   Fowler-Noll-Vo Hash (FNV1a)
2011/Dec/01   Drawing Antialiased Circles and Ellipses
2011/Nov/27   Uncompressing DEFLATE Data
2011/Nov/23   Streaming MPEG-4 Movies From Your Own Server
2011/Nov/08   Binary Search Revisited
2011/Oct/20   Reverse MD5 Hash
2011/Oct/17   Windows 7 Screenshots
2011/Oct/16   Hide Strings In Your Executable
2011/Oct/11   Generate QR Codes With PHP and Google Charts API
2011/Oct/02   Feed.rss and Sitemap.xml
2011/Sep/28   My Experiences with AdSense
2011/Sep/27   Emulating MemCache in Pure PHP
2011/Sep/14   Simple Ajax
2011/Sep/13   Sparklines
2011/Sep/12   smallpt - Global Illumination in Javascript
2011/Sep/11   Multi-Core Sorting
2011/Sep/06   CLein - The tiny, portable OpenCL info tool