Length-Limited Prefix Codes
posted by Stephan Brumme
Overview
The most common technique to generate prefix-free codes is the Huffman algorithm. It is part of many popular compression formats such as JPEG and DEFLATE (ZIP / GZIP / PNG / etc.).
In most cases it serves as an "afterburner" to compress pre-processed symbols.
The main advantage - variable-length codes - can be a critical drawback, too: if the maximum code length exceeds a certain threshold, e.g. the CPU's integer size (16 / 32 / 64 bit), then the compression/decompression code becomes complex and slows down.
All of the aforementioned file formats put a strict upper limit on the maximum length of Huffman codes.
Mainly due to their age and low number of distinct symbols, they all agreed on 16 bits per code.
So-called length-limited prefix codes are, strictly speaking, not Huffman codes anymore because they are no longer optimal whenever the length restriction actually kicks in.
However, the loss in compression efficiency often turns out to be negligible and is widely accepted in exchange for the significant gain in decompression performance.
There are a number of algorithms to create such length-limited prefix codes. They vary in speed, efficiency and complexity.
I wrote a C library implementing the most common techniques. Most of the code derives from open source programs and was invented by others.
My major contribution was to create a shared interface so that these algorithms are easily interchangeable:
interface
unsigned char algorithmName(unsigned char maxLength, unsigned int numCodes, const unsigned int histogram[], unsigned char codeLengths[]);
There are three input parameters:
- maxLength is the upper limit of the number of bits for each prefix code
- numCodes is the number of distinct symbols (including unused symbols) → size of your alphabet
- histogram is an array of numCodes counters indicating how often each symbol is present
The output parameter codeLengths will contain the bit length of each symbol.
The return value is the longest bit length - or zero if the algorithm failed.
For example, if your input consists of the symbols 0 3 1 1 2 5 2 2,
then numCodes = 6
and histogram[] = { 1,2,3,1,0,1 };
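A short usage sketch (not part of the library; algorithmName is just the placeholder from the interface above - substitute the concrete function of whichever algorithm you link against):
example: calling the shared interface
#include <stdio.h>

// the shared interface as declared above; link against one of the implementations
unsigned char algorithmName(unsigned char maxLength, unsigned int numCodes, const unsigned int histogram[], unsigned char codeLengths[]);

int main(void)
{
  // histogram of the example input 0 3 1 1 2 5 2 2 (symbol 4 never occurs)
  const unsigned int histogram[6] = { 1,2,3,1,0,1 };
  unsigned char codeLengths[6];

  // restrict every prefix code to at most 4 bits
  unsigned char longest = algorithmName(4, 6, histogram, codeLengths);
  if (longest == 0)
    printf("algorithm failed\n");
  else
    for (unsigned int i = 0; i < 6; i++)
      printf("symbol %u needs %d bits\n", i, (int)codeLengths[i]);
  return 0;
}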
However, this shared interface comes with a little bit of overhead: sometimes doubling code size and/or execution time.
Therefore most files have a second public function which is more specialized for its algorithm but may have a few additional restrictions.
A common case is that the histogram has to be sorted and unused symbols removed.
Currently implemented are:
 | Package-Merge | JPEG | MiniZ | BZip2 | Kraft |
---|---|---|---|---|---|
Author | Larmore and Hirschberg | ITU T.81 standard, annex K.3 | Rich Geldreich | Julian Seward | inspired by Charles Bloom and Jörn Engel |
Pro | optimal output | shortest code | similar to JPEG but faster | trivial to understand | often fastest |
Contra | slowest and most complex | mediocre compression | mediocre compression | worst compression | wildly fluctuating compression |
Description | scroll down | scroll down | scroll down | scroll down | strategy A and strategy B |
(read more ...)
smalLZ4 - optimal LZ4 compression
posted by Stephan Brumme,
updated
You came here to grab just the code or a binary ? Scroll down to the Download section and GIT repository.
A short overview of how smallz4 compares to the original LZ4 and to a program called LZ4X (which follows similar ideas as smallz4) can be found here.
And if you are interested in the inner workings, keep on reading and enjoy the ride ...
Updates
version 1.1 (August 2, 2018): I improved the compression ratio by better handling of block boundaries (reduced test file enwik9.lz4 by 647 bytes). Unfortunately, there was a bug in the header of uncompressed blocks - it's fixed.
version 1.2 (August 3, 2018):
Experimental support for dictionary compression
version 1.2.1 (August 21, 2018):
Greatly reduced size of the portable, statically compiled smallz4cat binaries - by using diet libc on Linux (only 7kb now !) and by replacing fprintf with fputs. The smallz4cat Windows binary shrinks by about 20% due to the fputs trick, too. The compression algorithm remained completely unchanged.
version 1.3 (November 14, 2018):
Support for LZ4 legacy format (slightly smaller output if input < 8 MByte)
Matches with length 19+255x had a wrong cost estimation so that they were often ignored during optimization.
version 1.3.1 (April 25, 2019):
Slightly faster (both encoding and decoding) and more code comments. Windows binaries now compiled with Visual C++ 2019 Community.
version 1.4 (December 18, 2019):
A user-defined pointer can be supplied which is forwarded to the read and write functions. This makes it possible to avoid nasty global variables.
Performance is completely unaffected, therefore I didn't upload binaries for version 1.4.
version 1.5 (April 13, 2020):
Fixed 64-bit bugs affecting files over 2GB.
Optimal Parsing
In the context of compression, optimal parsing is a multi-pass algorithm to find the smallest output of compressible data. Whenever there are multiple choices of encoding a symbol (or a sequence of symbols), optimal parsing will choose the representation which leads to the overall smallest output.
That means it may prefer a locally sub-optimal encoding in order to achieve a globally optimal encoding.
Most compression algorithms strive for locally optimal encoding because it's computationally cheaper.
The most basic algorithm, the greedy algorithm, always selects the encoding that fits best "right now".
A significant improvement, called "lazy evaluation", takes the next group of symbols into consideration and decides which encoding yields the smallest output for the whole group.
In a way, lazy evaluation can be seen as a local version of optimal parsing. It does the same job for a group of symbols instead of the whole file.
Compressing the string abcde_bcdefgh_abcdefghxxxxxxx returns different file sizes:
$ echo "abcde_bcdefgh_abcdefghxxxxxxx" | lz4 -9 --no-frame-crc | wc -c
43
$ echo "abcde_bcdefgh_abcdefghxxxxxxx" | ./smallz4 | wc -c
41
Red characters are literals, i.e. uncompressed data. Green pairs of numbers indicate distance and length of a match.
And let's ignore xxxxxxx from now on (its only purpose is to hide the fact that the LZ4 format specification forbids matches at the very end of a file).
lz4's result in detail: abcde_(5,4)fgh_(14,5)fghxxxxxxx
It found two matches: the first replaced bcde by a reference to a sequence we saw 5 bytes ago. That reference "costs" 3 bytes while the four literals would have occupied 4 bytes. We saved a byte here.
The second match replaces abcde by a reference to the beginning of the file. Again, we need 3 bytes to represent the match while the replaced text was 5 bytes long. Two more bytes saved !
However, this compression is not optimal. smallz4 can squeeze out another two bytes because it postpones the second match: abcde_(5,4)fgh_a(9,7)xxxxxxx
smallz4 detects that matching abcde would save 2 bytes but matching bcdefgh saves 4 bytes.
Optimal Parsing - Pass 1
For each byte of input data, the longest match will be found and stored in an array called matches (starts at about line 560).
Additional data structures such as previousHash and previousExact only speed up the code. A simple brute-force routine - sketched below, after the table - will give you the same result, albeit slower.
position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | (padding) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
character | a | b | c | d | e | _ | b | c | d | e | f | g | h | _ | a | b | c | d | e | f | g | h | xxxxxxx |
match distance | | | | | | | 5 | (5) | (5) | (5) | | | | (8) | 14 | 9 | 9 | 9 | 9 | 9 | (9) | (9) | |
match length | | | | | | | 4 | (3) | (2) | (1) | | | | (1) | 5 | 7 | 6 | 5 | 4 | (3) | (2) | (1) | |
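If you want to reproduce these numbers without diving into smallz4's hashing code, a brute-force sketch like the following (not the actual smallz4 code) computes the same matches:
example: brute-force longest matches
#include <stdio.h>
#include <string.h>

int main(void)
{
  const char* data = "abcde_bcdefgh_abcdefgh";
  const int   size = (int)strlen(data);

  // for each position, look at every earlier position and remember the longest match
  for (int current = 0; current < size; current++)
  {
    int bestLength = 0, bestDistance = 0;
    for (int previous = 0; previous < current; previous++)
    {
      int length = 0;
      // overlapping matches are allowed in LZ4, therefore only the end of the data limits the length
      while (current + length < size && data[previous + length] == data[current + length])
        length++;
      if (length > bestLength)
      {
        bestLength   = length;
        bestDistance = current - previous;
      }
    }
    if (bestLength > 0)
      printf("position %2d: distance %2d, length %2d\n", current + 1, bestDistance, bestLength);
  }
  return 0;
}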
(read more ...)
A JPEG encoder in a single C++ file ...
posted by Stephan Brumme,
updated
Recently, one of my programs needed to write a few JPEG files.
The obvious choice is to link against libjpeg (or even better: libjpeg-turbo). These libraries are fast, loaded with a plethora of features and very well tested.
However, they are huge (static linking is pretty much a no-go) and their API is kind of too complicated if you "just want to save an image".
Building my own JPEG encoder
I found a great alternative on Jon Olick's website - his JPEG Writer. It's barely 336 lines of C++ code in a single file and part of the very popular stb_image_write library, too. However, the large number of magic constants bothered me, so I started looking up their meaning and in the end wrote my own JPEG encoder from scratch.
I realized that a few things weren't optimal and/or missing in Jon's library:
- jo_jpeg's grayscale JPEGs still include the Cb and Cr channels which are not needed at all
- downsampling the Cb and Cr channels can often reduce the JPEG filesize by 15 to 20% (YCbCr420 format)
- jo_jpeg always writes to a file, you can't create the JPEG data in memory (for network transfer etc.)
- there were barely any code comments and most short variable names were hard to decipher
- his precomputed Huffman tables bloat the binaries a lot - I compute them on the fly (which can be done extremely fast, see the sketch right after this list)
- my bit operations are faster (see below for benchmark results)
- my quantization matrix scaling code is slightly more accurate - if you want JPEGs to be bitwise identical to Jon's encoder then uncomment lines 520-522
- ... and many more !
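Regarding the on-the-fly tables: JPEG describes a Huffman table merely by the number of codes per bit length plus the symbols in code order (ITU T.81, Annex C), so the actual codes can be generated in a handful of lines. Here is a minimal sketch - not my encoder's actual code, just the general canonical-code construction, shown for the standard luminance DC table:
example: generating canonical Huffman codes from code lengths
#include <stdio.h>

int main(void)
{
  // codes per bit length 1..16, taken from ITU T.81 Annex K.3.1 (luminance DC table)
  const unsigned char numCodesPerLength[17] = { 0,  0,1,5,1,1,1,1,1,1,0,0,0,0,0,0,0 };

  unsigned int code   = 0;
  unsigned int symbol = 0;
  for (int length = 1; length <= 16; length++)
  {
    // codes of the same length are consecutive numbers ...
    for (int i = 0; i < numCodesPerLength[length]; i++)
    {
      printf("symbol %2u => code %4X (%d bits)\n", symbol, code, length);
      symbol++;
      code++;
    }
    // ... and switching to the next length appends a zero bit
    code <<= 1;
  }
  return 0;
}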
(read more ...)
flexiGIF - Lossless GIF/LZW optimization
posted by Stephan Brumme,
updated
flexiGIF shrinks GIF files by optimizing their compression scheme (the LZW algorithm). No visual information is changed and the output is 100% pixel-wise identical to the original file - that's why it's called "lossless optimization".
And the results are still GIF files you can open/view with any standard tool.
Animated GIFs are supported as well.
Most files can be reduced by about 2%. I'm not aware of any open-source software outperforming flexiGIF.
The only downside: it may take several seconds or even minutes for a medium-sized GIF image.
That's several orders of magnitude slower than other GIF encoders.
Your GIF decoder isn't affected at all - probably it even becomes faster because it has less input to process.
flexiGIF does NOT optimize the palette, strip extensions and/or reduce color information.
There are many sophisticated tools out there, such as ImageMagick, that excel at that job.
flexiGIF is designed to be a tool used after you ran those image-processing programs.
(read more ...)
xxHash - a hash algorithm as fast as memcpy
posted by Stephan Brumme
Numbers, please !
Yes, Yann Collet's xxHash algorithm can be faster than memcpy (Visual C++):

 | throughput | bytes/CPU cycle |
---|---|---|
memcpy | 5.45 GByte/s | approx. 1.6 |
xxHash32 (Yann's implementation) | 5.9 GByte/s | approx. 1.7 |
xxHash32 (my code) | 5.9 GByte/s | approx. 1.7 |
CRC32 Slicing-by-16 | 3.2 GByte/s | approx. 1 |
memcpy has to read and write lots of bytes whereas this hashing algorithm reads everything but writes only a few bytes. For comparison: memset achieves 8.4 GByte/s on the same Intel Core i7-2600K CPU @ 3.40GHz system.
Even more interesting is that even pretty old versions of G++ have a faster version of memcpy (7.7 GByte/s) and much, much faster intrinsics for memset (18.2 GByte/s).
Update August 8, 2016: I wrote very similar code for xxHash64. It's super-fast on x64 but much slower when compiled for x32 systems.
Most of this posting will remain unchanged to avoid confusion - just keep in mind that every time xxHash32 touches 4 bytes, xxHash64 does pretty much the same with 8 bytes (with a few minor exceptions).
 | throughput | bytes/CPU cycle |
---|---|---|
xxHash64 (Yann's implementation) | 10.5 GByte/s | approx. 3.1 |
xxHash64 (my code) | 10.5 GByte/s | approx. 3.1 |
Algorithm
xxHash was designed from the ground up to be as fast as possible on modern CPUs. It is not a strong cryptographic hash, such as the SHA family, but it still passes the SMHasher test set with 10 points.
Most simple hashes, such as FNV (see my posting, too), step through the input data byte-by-byte.
Working on the byte at position 1 requires that all work on the previous byte (at position 0) has finished.
Therefore we have a dependency which causes the CPU to potentially wait (stalling).
Slightly better algorithms process a block of bytes at once, e.g. CRC Slicing-by-4/8/16 consumes 4/8/16 bytes per step - instead of just one - giving a substantial speed-up.
However, we still have the same problem: working on block 1 requires that all work on block 0 has finished.
xxHash subdivides input data into four independent streams. Each stream processes a block of 4 bytes per step and stores a temporary state. Only the final step combines these four states into a single one.
The major advantage is that the code generator has lots of opportunities to re-order opcodes to avoid latencies.
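Here is a stripped-down sketch of that idea (not the actual library code): it runs only the four parallel lanes on a whole number of 16-byte blocks; the real xxHash32 additionally mixes in the total length, processes leftover bytes and applies a final avalanche.
example: four independent xxHash32 lanes
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// rotate a 32 bit value left
static uint32_t rotl(uint32_t x, int bits)
{
  return (x << bits) | (x >> (32 - bits));
}

int main(void)
{
  const uint32_t Prime1 = 2654435761U;
  const uint32_t Prime2 = 2246822519U;
  const uint32_t seed   = 0;

  // some test input, 64 bytes
  unsigned char data[64];
  for (int i = 0; i < 64; i++)
    data[i] = (unsigned char)i;

  // four independent states => the CPU may interleave their computations freely
  uint32_t state[4] = { seed + Prime1 + Prime2, seed + Prime2, seed, seed - Prime1 };

  for (int block = 0; block < 64; block += 16)    // 16 bytes per iteration ...
    for (int lane = 0; lane < 4; lane++)          // ... 4 bytes for each lane
    {
      uint32_t value;
      memcpy(&value, data + block + lane * 4, 4); // assumes little-endian input
      state[lane] = rotl(state[lane] + value * Prime2, 13) * Prime1;
    }

  // only the final step lets the four lanes interact
  uint32_t combined = rotl(state[0], 1) + rotl(state[1], 7) + rotl(state[2], 12) + rotl(state[3], 18);
  printf("combined lanes: %08X\n", (unsigned int)combined);
  return 0;
}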
I drew a crude kindergarten-style image to visualize how memory is accessed when hashing 64 bytes:
(read more ...)
Tiny Regular Expression Matcher
posted by Stephan Brumme
Regular expressions (regex) can be the gate to heaven as well as the gate to hell. Usually they are both at the same time.
Writing a correct regular expression is sometimes as tough as writing a regular expression parser.
Chapter 1 of the excellent book "Beautiful Code" (ISBN 9780596510046) discusses Rob Pike's very short and elegant regex parser code.
It doesn't come with all the bells and whistles of a full Perl regular expression matcher but is quite sufficient for many purposes.
I extended Rob's C code in such a way that it can handle the following regex meta-symbols (see limitations, too):
Regex Symbol | Meaning | Already in Rob Pike's code ? |
---|---|---|
. | match any character | yes |
* | the previous character is arbitrarily repeated | yes |
+ | the previous character occurs at least once | no |
? | the previous character occurs once or not at all | no |
^ | match regular expression only to the beginning of the text | yes |
$ | match regular expression only to the end of the text | yes |
\ | treat next character as a literal, even if it is a regex symbol | no |
My code is just one .h and one .c file without any external dependencies - not even #include <stdlib.h>.
The source file match.c contains about 100 lines of straight C code (plus tons of comments and blank lines). Take a look at my explanation of the matching algorithm.
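For reference, Rob Pike's original matcher (as printed in "Beautiful Code", reproduced here roughly from memory) looks like this - note that it takes the pattern first, whereas my match takes the text first as in the example below. My version adds +, ? and \ on top:
Rob Pike's core matcher (roughly)
int matchhere(const char* regexp, const char* text);

/* matchstar: search for c*regexp at beginning of text */
int matchstar(int c, const char* regexp, const char* text)
{
  do {  /* a * matches zero or more instances */
    if (matchhere(regexp, text))
      return 1;
  } while (*text != '\0' && (*text++ == c || c == '.'));
  return 0;
}

/* matchhere: search for regexp at beginning of text */
int matchhere(const char* regexp, const char* text)
{
  if (regexp[0] == '\0')
    return 1;
  if (regexp[1] == '*')
    return matchstar(regexp[0], regexp + 2, text);
  if (regexp[0] == '$' && regexp[1] == '\0')
    return *text == '\0';
  if (*text != '\0' && (regexp[0] == '.' || regexp[0] == *text))
    return matchhere(regexp + 1, text + 1);
  return 0;
}

/* match: search for regexp anywhere in text */
int match(const char* regexp, const char* text)
{
  if (regexp[0] == '^')
    return matchhere(regexp + 1, text);
  do {  /* must look even if string is empty */
    if (matchhere(regexp, text))
      return 1;
  } while (*text++ != '\0');
  return 0;
}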
A very basic use case looks as follows:
example: find first match
#include <stdio.h>
#include "match.h" // the library's single header (name assumed)

const char* text  = "aabc";
const char* regex = "^a+b"; // one or more 'a' followed by 'b', anchored to the start of the text
if (match(text, regex))
  printf("yes, %s matches the pattern %s", text, regex);
(read more ...)
Fast CRC32
posted by Stephan Brumme,
updated
Error Checking
Real life data tends to get corrupted because machines (and humans) are never as reliable as we wish for. One efficient way to make sure your data wasn't unintentionally modified is to generate some kind of hash. That hash shall be unique, compact and efficient:
- unique: any kind of modification to the data shall generate a different hash
- compact: as few bits or bytes as possible to keep the overhead low
- efficient: use little computing resources, i.e. fast and low memory usage
There is one very useful property of CRC32: it can be implemented as a rolling algorithm. That means, if you already have some chunk of data and its CRC, then you can append new data and compute the updated CRC by using your original CRC as a seed and just scanning through the appended data.
I don't want to go into mathematical details of Cyclic Redundancy Checking because there are tons of information on the internet, e.g. Wikipedia and the Painless Guide To CRC Error Detection Algorithms. This article only discusses how to write a fast CRC32 algorithm in C/C++.
If you aren't too keen on technical details and just want to have the fastest implementation for not-too-small datasets, I strongly recommend using the crc32_fast function.
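A short usage sketch - assuming the signature crc32_fast(const void* data, size_t length, uint32_t previousCrc32 = 0) and the header name Crc32.h - that also demonstrates the rolling property mentioned above:
example: one-shot vs. rolling CRC32
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include "Crc32.h" // header name assumed

int main(void)
{
  const char* data   = "Hello World, this is a CRC32 test";
  size_t      length = strlen(data);

  // whole buffer in one go
  uint32_t full = crc32_fast(data, length);

  // same buffer in two chunks: the CRC of the first chunk seeds the second call
  uint32_t rolling = crc32_fast(data, 10);
  rolling = crc32_fast(data + 10, length - 10, rolling);

  printf("%08X vs %08X => %s\n", (unsigned int)full, (unsigned int)rolling,
         full == rolling ? "identical" : "different");
  return 0;
}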
Note: The concepts behind the various CRC32 algorithm are not my original work - I only gathered them in one place.
I assume ints to be 32 bits wide (replace unsigned int by uint32_t if you use GCC64/Unix).
Update November 13, 2011: The code is more portable now, e.g. uint32_t is used. I got it running on an Arduino board, see here for more details.
Update May 4, 2013: Now the code is endian-aware. Read my posting for instructions on how to run a big endian virtual machine on Linux or Windows.
Update February 6, 2014: If you are interested in MD5, SHA1 and/or SHA256 hashes then take a look at my portable C++ hashing library which of course includes CRC32, too.
Update August 12, 2014: added a Javascript version of the simple bitwise algorithm
Update February 4, 2015: added Slicing-by-16 algorithm based on suggestions by Bulat Ziganshin
Update August 14, 2015: compiler-dependent #includes and #ifdefs were improved for Cygwin, MinGW and Clang. CRC32 code remained unchanged.
Update October 21, 2016: changed code structure into a library format, added tableless byte algorithms
CRC32 of 1 GByte | published by | bits per iteration | table size | time | throughput | CPU cycles/byte |
---|---|---|---|---|---|---|
Original | (unknown) | 1 | - | 29.2 seconds | 35 MByte/s | approx. 100 |
Branch-free | (unknown) | 1 | - | 16.7 seconds | 61 MByte/s | approx. 50 |
Improved Branch-free | (unknown) | 1 | - | 14.5 seconds | 70 MByte/s | approx. 40 |
Half-Byte | (unknown) | 4 | 64 bytes | 4.8 seconds | 210 MByte/s | approx. 14 |
Tableless Full-Byte | (sent to me by Hagai Gold) | 8 | - | 6.2 seconds | 160 MByte/s | approx. 18 |
Tableless Full-Byte | found in "Hacker's Delight" by Henry S. Warren | 8 | - | 6.3 seconds | 155 MByte/s | approx. 19 |
Standard Implementation | Dilip V. Sarwate | 8 | 1024 bytes | 2.8 seconds | 375 MByte/s | approx. 8 |
Slicing-by-4 | Intel Corp. | 32 | 4096 bytes | 0.95 or 1.2 seconds* | 1050 or 840 MByte/s* | approx. 3 or 4* |
Slicing-by-8 | Intel Corp. | 64 | 8192 bytes | 0.55 or 0.7 seconds* | 1800 or 1400 MByte/s* | approx. 1.75 or 2.25* |
Slicing-by-16 | based on Slicing-by-8, improved by Bulat Ziganshin | 128 | 16384 bytes | 0.4 or 0.5 seconds* | 3000 or 2000 MByte/s* | approx. 1.1 or 1.5* |
Slicing-by-16, 4x unrolled with prefetch | based on Slicing-by-8, improved by Bulat Ziganshin | 512 | 16384 bytes | 0.35 or 0.5 seconds* | 3200 or 2000 MByte/s* | approx. 1 or 1.5* |
(read more ...)
Practical String Searching
posted by Stephan Brumme
Motivation
There is a huge number of scientific papers on fast, faster and super-fast string searching algorithms. They usually prove theoretical performance in O-Notation and most papers cover memory consumption as well.
However, theoretical performance isn't always the same as practical performance. That's why I always want to measure real-world throughput: this article presents hopefully understandable C implementations of the most common generic string search algorithms.
In addition I also wrote a simple tool called mygrep that prints all lines of a file where a search phrase is found. It doesn't come with all the bells and whistles of the Unix tool grep but achieves similar or sometimes even better speed. A direct comparison can be done online, just scroll down to the Live Demo.
Playing around with various search phrases you will easily see the plain and simple truth: for most situations the naive brute-force algorithm is a great choice.
A slightly modified version based on memchr and memcmp is often several times faster than the second-fastest contender - just because this simple algorithm can make efficient use of modern CPUs' optimized string opcodes and fits well into the caches.
These generic routines might slow down considerably in worst-case scenarios, like DNA analysis (small alphabet and many repeated similar parts).
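A minimal sketch of that memchr/memcmp idea (not the exact mygrep code): let memchr hunt for the first byte of the search phrase and verify the remaining bytes with memcmp.
example: memchr/memcmp search
#include <stdio.h>
#include <string.h>

// return a pointer to the first occurrence of needle in haystack (or NULL)
static const char* find(const char* haystack, size_t haystackLength,
                        const char* needle,   size_t needleLength)
{
  if (needleLength == 0 || needleLength > haystackLength)
    return NULL;
  const char* current = haystack;
  const char* last    = haystack + haystackLength - needleLength;
  while (current <= last)
  {
    // let the heavily optimized memchr skip ahead to the next candidate
    current = (const char*)memchr(current, needle[0], last - current + 1);
    if (!current)
      return NULL;
    // candidate found - compare the remaining bytes
    if (memcmp(current + 1, needle + 1, needleLength - 1) == 0)
      return current;
    current++;
  }
  return NULL;
}

int main(void)
{
  const char* text   = "practical string searching";
  const char* phrase = "string";
  const char* hit    = find(text, strlen(text), phrase, strlen(phrase));
  if (hit)
    printf("found at offset %d\n", (int)(hit - text));
  else
    printf("not found\n");
  return 0;
}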
And yes, fgrep is damn fast.
Live Demo
This demo scans through the first 100 MByte of Wikipedia (well, an older snapshot dating back to March 3, 2006) on the server. Before search starts, the file is mapped into the Linux file cache in order to avoid disk latencies (cat enwik8 > /dev/zero).
Notes: execution times can vary significantly - depending on overall server load.
Special characters such as " or \ are disabled to prevent potential security breaches into my server. The search algorithms actually can search for these special characters (as plain ASCII symbols).
Don't confuse them with meta characters for regular expressions; those are not supported.
Search is case-sensitive.
(read more ...)
The Mersenne Twister Pseudo Random Number Generator
posted by Stephan Brumme
Introduction
The Mersenne Twister is often regarded as the fastest pseudo-random number generator which passes almost all statistical tests. The original C code isn't exactly beautiful, therefore I decided to write my own C++ class.
And for the fun of it, I converted the code to Javascript and added two live demos, too (scroll down).
Live Demo
This demo will give you the first 10 random numbers generated by the Mersenne Twister. My C++ implementation returns exactly the same values as the original code written by Makoto Matsumoto and Takuji Nishimura.
These numbers will be computed on the webserver whereas the Javascript code - which generates exactly the same numbers, too - obviously runs in your browser.
Modifying the seed value will change the sequence of random numbers. seed must be a 32 bit integer. If you leave it empty, 5489 will be used instead.
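If you just want to reproduce the reference sequence without my class or the live demo: the C++ standard library's std::mt19937 implements the very same algorithm and uses 5489 as its default seed, too.
example: first 10 numbers of the reference sequence
#include <cstdio>
#include <random>

int main()
{
  std::mt19937 mt; // default seed is 5489, same algorithm as Matsumoto/Nishimura's code
  for (int i = 0; i < 10; i++)
    printf("%u\n", (unsigned int)mt()); // each call returns the next 32 bit random number
  return 0;
}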
(read more ...)
browse through all postings:
older stuff:
2014/Jul/11 Parallel Sieve of Eratosthenes
2014/Jun/23 Windows and Linux Code Timing
2014/Jun/14 Portable C++ Hashing Library
2014/Feb/27 Sorting STL Containers
2013/Sep/18 Portable Memory Mapping C++ Class
2013/May/03 Testing Endian Aware Code
2013/Apr/24 Software License
2013/Feb/15 Read-Only Git Access Available
2011/Dec/05 Fowler-Noll-Vo Hash (FNV1a)
2011/Dec/01 Drawing Antialiased Circles and Ellipses
2011/Nov/27 Uncompressing DEFLATE Data
2011/Nov/23 Streaming MPEG-4 Movies From Your Own Server
2011/Nov/08 Binary Search Revisited
2011/Oct/20 Reverse MD5 Hash
2011/Oct/17 Windows 7 Screenshots
2011/Oct/16 Hide Strings In Your Executable
2011/Oct/11 Generate QR Codes With PHP and Google Charts API
2011/Oct/02 Feed.rss and Sitemap.xml
2011/Sep/28 My Experiences with AdSense
2011/Sep/27 Emulating MemCache in Pure PHP
2011/Sep/14 Simple Ajax
2011/Sep/13 Sparklines
2011/Sep/12 smallpt - Global Illumination in Javascript
2011/Sep/11 Multi-Core Sorting
2011/Sep/06 CLein - The tiny, portable OpenCL info tool