Length-Limited Prefix Codes
posted by Stephan Brumme
Overview
The most common technique to generate prefix-free codes is the Huffman algorithm. It is part of many popular compression algorithms such as JPEG and DEFLATE (ZIP / GZIP / PNG / etc.).
In most cases it serves as an "afterburner" to compress pre-processed symbols.
The main advantage - variable-length codes - can be a critical drawback, too: if the maximum code length exceeds a certain threshold, e.g. the CPU's integer size (16 / 32 / 64 bit), then the compression/decompression code becomes complex and slows down.
All of the aforementioned file formats put a strict upper limit on the maximum length of Huffman codes.
Mainly due to their age and low number of distinct symbols, they all settled on similar limits: 16 bits per code for JPEG and 15 for DEFLATE.
So-called length-limited prefix codes are not Huffman codes anymore because they are not optimal whenever the length restriction actually takes effect.
The loss in compression efficiency often turns out to be negligible and is widely accepted due to the significant gain in decompression performance.
There are a number of algorithms to create such length-limited prefix codes. They vary concerning speed, efficiency and complexity.
I wrote a C library implementing the most common techniques. Most of the code derives from open source programs and was invented by others.
My major contribution was to create a shared interface so that these algorithms are easily interchangeable:
unsigned char algorithmName(unsigned char maxLength, unsigned int numCodes, const unsigned int histogram[], unsigned char codeLengths[]);

There are three input parameters:
- maxLength is the upper limit of the number of bits for each prefix code
- numCodes is the number of distinct symbols (including unused symbols) → size of your alphabet
- histogram is an array of numCodes counters how often each symbol is present
- codeLengths will contain the bit length of each symbol

The return value is the longest bit length - or zero if the algorithm failed.
For example, if your input data is

0 3 1 1 2 5 2 2

then numCodes = 6 and histogram[] = { 1,2,3,1,0,1 };
However, this shared interface comes with a little bit of overhead: sometimes doubling code size and/or execution time.
Therefore most files have a second public function which is more specialized for its algorithm but may have a few additional restrictions.
A common case is that the histogram has to be sorted and unused symbols removed.
Currently implemented are:
 | Package-Merge | JPEG | MiniZ | BZip2 | Kraft |
---|---|---|---|---|---|
Author | Larmore and Hirschberg | ITU T.81 standard, annex K.3 | Rich Geldreich | Julian Seward | inspired by Charles Bloom and Jörn Engel |
Pro | optimal output | shortest code | similar to JPEG but faster | trivial to understand | often fastest |
Contra | slowest and most complex | mediocre compression | mediocre compression | worst compression | wildly fluctuating compression |
Description | scroll down | scroll down | scroll down | scroll down | strategy A and strategy B |
Usage
Each algorithm runs independently of the others and there are no external dependencies. Some algorithms require pre-computed Huffman codes - for those cases I added a slightly modified version of Andrew Moffat's efficient in-place Huffman code.
If you want to try out an algorithm then all you have to do is:
Compared to the original Huffman algorithm, a few codes became shorter and a few became longer.
#include "packagemerge.h"
#include <stdio.h>
int main()
{
// let's assume you have 7 symbols
unsigned int numCodes = 7;
// first one occurs 270 times, second 20x, third 10x, fourth not at all, fifth once, sixth 6x and seventh once
unsigned int histogram[7] = { 270,20,10,0,1,6,1 };
// at most 4 bits
unsigned char maxLength = 4;
// output: bits per code
unsigned char codeLengths[7];
// compute length-limited codes
unsigned char longest = packageMerge(maxLength, numCodes, histogram, codeLengths);
// display result
for (unsigned int i = 0; i < numCodes; i++)
printf("code %d occurs %d times => %d bits\n", i, histogram[i], codeLengths[i]);
printf("max. code length is %d\n", longest);
return 0;
}
The resulting numbers of bits (codeLengths) are:
Symbol | 0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|---|
Count | 270 | 20 | 10 | 0 | 1 | 6 | 1 | sum = 308 |
Huffman | 1 | 2 | 3 | 0 | 5 | 4 | 5 | sum = 374 bits |
maxLength = 4 | 1 | 2 | 4 | 0 | 4 | 4 | 4 | sum = 382 bits |
maxLength = 3 | 2 | 2 | 3 | 0 | 3 | 3 | 3 | sum = 634 bits |
There are several methods to convert code lengths to actual prefix-free codes. I find canonical Huffman codes to be the most suitable: a simple and fast algorithm.
Using the example data we get these canonical codes:
Symbol | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|---|
Count | 270 | 20 | 10 | 0 | 1 | 6 | 1 |
Huffman | 0 | 10 | 110 | - | 11110 | 1110 | 11111 |
maxLength = 4 | 0 | 10 | 1100 | - | 1101 | 1110 | 1111 |
maxLength = 3 | 00 | 01 | 100 | - | 101 | 110 | 111 |
canonical.c:
#include "packagemerge.h"
#include <stdio.h>
// create canonical codes, for simplicity longest code must not exceed number of bits of an unsigned int (32 or 64)
void canonical(unsigned int numCodes, unsigned char codeLengths[], unsigned int codes[])
{
// count how often each code length is present
unsigned int numLengths[256] = { 0 };
unsigned int i;
for (i = 0; i < numCodes; i++)
numLengths[codeLengths[i]]++;
// precompute the first code of each code length
unsigned int next[256];
// there's no code with length 0 and the first 1-bit code will be a zero
next[1] = 0;
for (i = 2; i < 256; i++)
next[i] = (next[i - 1] + numLengths[i - 1]) << 1;
// now assign all codes
for (i = 0; i < numCodes; i++)
if (codeLengths[i] > 0) // unused symbols (length 0) get no code
codes[i] = next[codeLengths[i]]++;
}
int main()
{
// let's assume you have 7 symbols
unsigned int numCodes = 7;
// first one occurs 270 times, second 20x, third 10x, fourth not at all, fifth once, sixth 6x and seventh once
unsigned int histogram[7] = { 270,20,10,0,1,6,1 };
// at most 4 bits
unsigned char maxLength = 4;
// output: bits per code
unsigned char codeLengths[7];
// compute length-limited codes
unsigned char longest = packageMerge(maxLength, numCodes, histogram, codeLengths);
// create canonical codes
unsigned int codes[7];
canonical(numCodes, codeLengths, codes);
// display result
unsigned int i;
for (i = 0; i < numCodes; i++)
{
printf("code %d occurs %d times\t=> %d bits", i, histogram[i], codeLengths[i]);
// binary representation of canonical code
printf("\t=> ");
if (codeLengths[i] > 0)
{
int shift;
for (shift = codeLengths[i] - 1; shift >= 0; shift--)
printf(codes[i] & (1 << shift) ? "1" : "0");
}
printf("\n");
}
printf("max. code length is %d\n", longest);
return 0;
}
Benchmark
Here are a few results from the first 64k bytes of enwik, measured on a Core i7 / GCC x64:

time ./benchmark 1 12 100000
- where 1 is the algorithm's ID and was between 0 and 6
- limiting the code length to at most 12 bits - note: the unadjusted Huffman codes have up to 16 bits
- each algorithm ran 100000 times
algorithm | ID | total size | compression ratio | execution time |
---|---|---|---|---|
Huffman | 0 | 326,892 bits | 62.35% | 0.54 s |
Package-Merge | 1 | 327,721 bits | 62.51% | 1.17 s |
MiniZ | 2 | 328,456 bits | 62.65% | 0.58 s |
JPEG | 3 | 328,456 bits | 62.65% | 0.61 s |
BZip2 | 4 | 328,887 bits | 62.73% | 0.88 s |
Kraft | 5 | 327,895 bits | 62.54% | 1.72 s |
modified Kraft | 6 | 327,942 bits | 62.55% | 0.41 s |
Changing the code length limit significantly influences both execution time and compression ratio.
I picked the two "best" algorithms, used the same data set and repeated each procedure 100,000 times:
length limit | Package-Merge | Kraft Strategy B |
---|---|---|
8 bits | 70.47%, 0.96 s | 70.76%, 0.24 s |
9 bits | 65.30%, 1.02 s | 65.31%, 0.24 s |
10 bits | 63.49%, 1.07 s | 63.79%, 0.31 s |
11 bits | 62.80%, 1.14 s | 62.84%, 0.37 s |
12 bits | 62.51%, 1.17 s | 62.55%, 0.40 s |
13 bits | 62.40%, 1.22 s | 62.43%, 0.34 s |
14 bits | 62.36%, 1.25 s | 62.42%, 0.40 s |
15 bits | 62.35%, 1.29 s | 62.42%, 0.66 s |
16 bits | 62.35%, 1.35 s | 62.42%, 0.70 s |
Package-Merge
Package-Merge is an algorithm that guarantees to produce optimal length-limited prefix codes. Its name is based on the concept of putting numbers into packages and merging these.
Unfortunately, the original paper by Lawrence L. Larmore and Daniel S. Hirschberg isn't very accessible.
Moreover, further information is pretty scarce on the internet: the Wikipedia entry doesn't help a lot.
The best introduction - in my opinion - is Sebastian Gesemann's Bachelor's Thesis but it's available in German only.
The diagrams on page 24 and 26 walk through an example and page 29 gives pseudo code.
There are two prerequisites:
- remove all unused symbols before running Package-Merge
- and symbols must be sorted in ascending order of their frequency
Let's process the sample data shown above:
Symbol | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|---|
Count | 270 | 20 | 10 | 0 | 1 | 6 | 1 |
(the order of symbols with the same count doesn't matter)
Symbol | 4 | 6 | 5 | 2 | 1 | 0 |
---|---|---|---|---|---|---|
Count | 1 | 1 | 6 | 10 | 20 | 270 |
Naturally you can't have an active symbol with zero bits; the minimum code length is 1 bit.
Package-Merge adds one bit in each iteration so there is a minimum of two iterations (2+1=3) but for whatever reason I decided to have this example with maxLength=4.
Now we add these counts pairwise, i.e. we build packages. If there is an odd number of symbols then the last (which is the largest) is discarded:
Symbol | 4 | 6 | 5 | 2 | 1 | 0 |
---|---|---|---|---|---|---|
Count | 1 | 1 | 6 | 10 | 20 | 270 |
Pairwise Sum | 2 | 16 | 290 |
If a package is identical to a count of the original sequence then that item from the original sequence comes first.
In the next table packages are shown in a bold font:
Count | 1 | 1 | **2** | 6 | 10 | **16** | 20 | 270 | **290** |
---|---|---|---|---|---|---|---|---|---|
Therefore we need to repeat that procedure two more times to reach our goal maxLength=4.
However, there is one little pitfall: pairs are made using the sequence from the previous step but you always merge them with the original sequence.
That last sentence is important and was probably my most critical insight when I learned about the Package-Merge algorithm.
Okay, let's create packages for the second bit:
Count | 1 | 1 | **2** | 6 | 10 | **16** | 20 | 270 | **290** |
---|---|---|---|---|---|---|---|---|---|
Pairwise Sum | 2 | 8 | 26 | 290 |
The four values in the lower row are merged with the original sequence:
Count | 1 | 1 | **2** | 6 | **8** | 10 | 20 | **26** | 270 | **290** |
---|---|---|---|---|---|---|---|---|---|---|
Count | 1 | 1 | **2** | 6 | **8** | 10 | 20 | **26** | 270 | **290** |
---|---|---|---|---|---|---|---|---|---|---|
Pairwise Sum | 2 | 8 | 18 | 46 | 560 |
Count | 1 | 1 | **2** | 6 | **8** | 10 | **18** | 20 | **46** | 270 | **560** |
---|---|---|---|---|---|---|---|---|---|---|---|
Original | 1 | 1 | 6 | 10 | 20 | 270 | | | | | |
---|---|---|---|---|---|---|---|---|---|---|---|
Iteration 1 | 1 | 1 | **2** | 6 | 10 | **16** | 20 | 270 | **290** | | |
Iteration 2 | 1 | 1 | **2** | 6 | **8** | 10 | 20 | **26** | 270 | **290** | |
Iteration 3 | 1 | 1 | **2** | 6 | **8** | 10 | **18** | 20 | **46** | 270 | **560** |
The actual values in the table don't matter - it's only relevant whether a value is a package (bold) or not (plain).
Code lengths for all symbols are initially set to zero.
If a value isn't bold then the code length of a symbol has to be incremented.
The first iteration processes the first 2*numCodes-2 = 10 values of the bottom row, that means 1,1,...,270 but not 560.
Two counters symbol and numMerged are set to zero. For each table entry a decision has to be made:
- if that value is a package (bold) then increment numMerged
- otherwise increment codeLengths[symbol] and then symbol

The next iteration will then process only the first 2*numMerged table entries.
Let's go step-by-step through the example - as mentioned before we start with the bottom row.
The code length of all symbols was set to zero.
(array index) | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
Symbol | 4 | 6 | 5 | 2 | 1 | 0 |
Count | 1 | 1 | 6 | 10 | 20 | 270 |
Code Length (initially) | 0 | 0 | 0 | 0 | 0 | 0 |
Code Length (after processing bottom row) | 1 | 1 | 1 | 1 | 1 | 1 |
- a 1 which wasn't bold → codeLength[0] = 1 and then symbol = 1
- another 1 which wasn't bold → codeLength[1] = 1 and then symbol = 2
- a 2 which is bold → numMerged = 1
- a 6 which wasn't bold → codeLength[2] = 1 and then symbol = 3
- an 8 which is bold → numMerged = 2
- a 10 which wasn't bold → codeLength[3] = 1 and then symbol = 4
- an 18 which is bold → numMerged = 3
- a 20 which wasn't bold → codeLength[4] = 1 and then symbol = 5
- a 46 which is bold → numMerged = 4
- a 270 which wasn't bold → codeLength[5] = 1 and then symbol = 6
Because the final value of numMerged is 4, we look at the first 4*2 = 8 values of the second-to-last row.
Its content was (just a copy, I struck through irrelevant entries):
Iteration 2 | 1 | 1 | **2** | 6 | **8** | 10 | 20 | **26** |
---|---|---|---|---|---|---|---|---|
That means numMerged = 3 and the code length of all-but-the-last symbol becomes 2.
That last symbol, which had the highest count, suddenly gets assigned a lower code length than all the other symbols.
Again, a copy of phase 1's iteration 1 (where only the first 2*numMerged = 6 values will be handled):
Iteration 1 | 1 | 1 | **2** | 6 | 10 | **16** |
---|---|---|---|---|---|---|
You probably already understand that the first four symbols' code lengths need to be incremented and we will look at the first 2*numMerged = 4 numbers of the final row (the top row).

Original | 1 | 1 | 6 | 10 |
---|---|---|---|---|
The first four symbols' code lengths are incremented again. Let's take a look at the code lengths of each step:
(array index) | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
Symbol | 4 | 6 | 5 | 2 | 1 | 0 |
Count | 1 | 1 | 6 | 10 | 20 | 270 |
Code Length (initially) | 0 | 0 | 0 | 0 | 0 | 0 |
Code Length (after processing bottom row / iteration 3) | 1 | 1 | 1 | 1 | 1 | 1 |
Code Length (after processing iteration 2) | 2 | 2 | 2 | 2 | 2 | 1 |
Code Length (after processing iteration 1) | 3 | 3 | 3 | 3 | 2 | 1 |
Code Length (after processing top row) | 4 | 4 | 4 | 4 | 2 | 1 |
No symbol's number of bits is higher than maxLength=4 and the Kraft-McMillan inequality holds true as well: 2^-4 + 2^-4 + 2^-4 + 2^-4 + 2^-2 + 2^-1 = 1.
The algorithm is guaranteed to create optimal code lengths for a prefix-free code. No other combination of code lengths will be better.
I left out any theoretical stuff, such as proofs etc., and skipped practical optimizations to keep things short and sweet.
Take a look at my code (click below) or read the scientific papers to learn more details.
packagemerge.c:
// //////////////////////////////////////////////////////////
// packagemerge.c
// written by Stephan Brumme, 2021
// see https://create.stephan-brumme.com/length-limited-prefix-codes/
//
#include "packagemerge.h"
#include <stdlib.h> // malloc/free/qsort
// ----- package-merge algorithm -----
// to me the best explanation is Sebastian Gesemann's Bachelor Thesis (in German only / University of Paderborn, 2004)
// data types (switching to unsigned int is faster but fails if sum(histogram) > 2^31 or maxLength > 31)
typedef unsigned long long BitMask;
typedef unsigned long long HistItem;
/// compute limited prefix code lengths based on Larmore/Hirschberg's package-merge algorithm
/** - histogram must be in ascending order and no entry must be zero
* - the function rejects maxLength > 63 but I don't see any practical reasons you would need a larger limit ...
* @param maxLength maximum code length, e.g. 15 for DEFLATE or 16 for JPEG
* @param numCodes number of codes, equals the array size of histogram and codeLength
* @param A [in] how often each code/symbol was found [out] computed code lengths
* @result actual maximum code length, 0 if error
*/
unsigned char packageMergeSortedInPlace(unsigned char maxLength, unsigned int numCodes, unsigned int A[])
{
// skip zeros
while (numCodes > 0 && A[0] == 0)
{
numCodes--;
A++;
}
// at least one code needs to be in use
if (numCodes == 0 || maxLength == 0)
return 0;
// one or two codes are always encoded with a single bit
if (numCodes <= 2)
{
A[0] = 1;
if (numCodes == 2)
A[1] = 1;
return 1;
}
// A[] is an input parameter (stores the histogram) as well as
// an output parameter (stores the code lengths)
const unsigned int* histogram = A;
unsigned int* codeLengths = A;
// my allround variable for various loops
unsigned int i;
// check maximum bit length
if (maxLength > 8*sizeof(BitMask) - 1) // 8*8-1 = 63
return 0;
// at least log2(numCodes) bits required for every valid prefix code
unsigned long long encodingLimit = 1ULL << maxLength;
if (encodingLimit < numCodes)
return 0;
// need two buffers to process iterations and an array of bitmasks
unsigned int maxBuffer = 2 * numCodes;
// allocate memory
HistItem* current = (HistItem*) malloc(sizeof(HistItem) * maxBuffer);
HistItem* previous = (HistItem*) malloc(sizeof(HistItem) * maxBuffer);
BitMask* isMerged = (BitMask*) malloc(sizeof(BitMask) * maxBuffer);
// initial value of "previous" is a plain copy of the sorted histogram
for (i = 0; i < numCodes; i++)
previous[i] = histogram[i];
unsigned int numPrevious = numCodes;
// no need to initialize "current", it's completely rebuilt every iteration
// keep track which packages are merged (compact bitmasks):
// if package p was merged in iteration i then (isMerged[p] & (1 << i)) != 0
for (i = 0; i < maxBuffer; i++)
isMerged[i] = 0; // there are no merges before the first iteration
// the last 2 packages are irrelevant
unsigned int numRelevant = 2 * numCodes - 2;
// ... and preparation is finished
// //////////////////////////////////////////////////////////////////////
// iterate through potential bit lengths while packaging and merging pairs
// (step 1 of the algorithm)
// - the histogram is sorted (prerequisite of the function)
// - the output must be sorted, too
// - thus we have to copy the histogram and every now and then insert a new package
// - the code keeps track of the next package and compares it to
// the next item to be copied from the history
// - the smaller value is chosen (if equal, the histogram is chosen)
// - a bitmask named isMerged is used to keep track which items were packages
// - repeat until the whole histogram was copied and all packages inserted
// bitmask for isMerged
BitMask mask = 1;
unsigned char bits;
for (bits = maxLength - 1; bits > 0; bits--)
{
// ignore last element if numPrevious is odd (can't be paired)
numPrevious &= ~1; // bit-twiddling trick to clear the lowest bit, same as numPrevious -= numPrevious % 2
// first merged package
current[0] = histogram[0]; // a sum can't be smaller than its parts
current[1] = histogram[1]; // therefore it's impossible to find a package at index 0 or 1
HistItem sum = current[0] + current[1]; // same as previous[0] + previous[1]
// copy histogram and insert merged sums whenever possible
unsigned int numCurrent = 2; // current[0] and current[1] were already set
unsigned int numHist = numCurrent; // we took them from the histogram
unsigned int numMerged = 0; // but so far no package inserted (however, it's precomputed in "sum")
for (;;) // stop/break is inside the loop
{
// the next package isn't better than the next histogram item ?
if (numHist < numCodes && histogram[numHist] <= sum)
{
// copy histogram item
current[numCurrent++] = histogram[numHist++];
continue;
}
// okay, we have a package being smaller than next histogram item
// mark output value as being "merged", i.e. a package
isMerged[numCurrent] |= mask;
// store package
current [numCurrent] = sum;
numCurrent++;
// already finished last package ?
numMerged++;
if (numMerged * 2 >= numPrevious)
break;
// precompute next sum
sum = previous[numMerged * 2] + previous[numMerged * 2 + 1];
}
// make sure every code from the histogram is included
// (relevant if histogram is very skewed with a few outliers)
while (numHist < numCodes)
current[numCurrent++] = histogram[numHist++];
// prepare next mask
mask <<= 1;
// performance tweak: abort as soon as "previous" and "current" are identical
if (numPrevious >= numRelevant) // ... at least their relevant elements
{
// basically a bool: FALSE == 0, TRUE == 1
char keepGoing = 0;
// compare both arrays: if they are identical then stop
for (i = numRelevant - 1; i > 0; i--) // collisions are most likely at the end
if (previous[i] != current[i])
{
keepGoing++;
break;
}
// early exit ?
if (keepGoing == 0)
break;
}
// swap pointers "previous" and "current"
HistItem* tmp = previous;
previous = current;
current = tmp;
// no need to swap their sizes because only numCurrent needed in next iteration
numPrevious = numCurrent;
}
// shifted one bit too far
mask >>= 1;
// keep only isMerged
free(previous);
free(current);
// //////////////////////////////////////////////////////////////////////
// tracking all merges will produce the code lengths
// (step 2 of the algorithm)
// - analyze each bitlength's mask in isMerged:
// * a "pure" symbol => increase bitlength of that symbol
// * a merged code => just increase counter
// - stop if no more merged codes found
// - if m merged codes were found then only examine
// the first 2*m elements in the next iteration
// (because only they formed these merged codes)
// reset code lengths
for (i = 0; i < numCodes; i++)
codeLengths[i] = 0;
// start with analyzing the first 2n-2 values
unsigned int numAnalyze = numRelevant;
while (mask != 0) // stops if nothing but symbols are found in an iteration
{
// number of merged packages seen so far
unsigned int numMerged = 0;
// the first two elements must be symbols, they can't be packages
codeLengths[0]++;
codeLengths[1]++;
unsigned int symbol = 2;
// look at packages
for (i = symbol; i < numAnalyze; i++)
{
// check bitmask: not merged if bit is 0
if ((isMerged[i] & mask) == 0)
{
// we have a single non-merged symbol, which needs to be one bit longer
codeLengths[symbol]++;
symbol++;
}
else
{
// we have a merged package, so that its parts need to be checked next iteration
numMerged++;
}
}
// look only at those values responsible for merged packages
numAnalyze = 2 * numMerged;
// note that the mask was originally slowly shifted left by the merging loop
mask >>= 1;
}
// last iteration can't have any merges
for (i = 0; i < numAnalyze; i++)
codeLengths[i]++;
// it's a free world ...
free(isMerged);
// first symbol has the longest code because it's the least frequent in the sorted histogram
return codeLengths[0];
}
// the following code is almost identical to function moffat() in moffat.c
// helper struct for qsort()
struct KeyValue
{
unsigned int key;
unsigned int value;
};
// helper function for qsort()
static int compareKeyValue(const void* a, const void* b)
{
struct KeyValue* aa = (struct KeyValue*) a;
struct KeyValue* bb = (struct KeyValue*) b;
// negative if a < b, zero if a == b, positive if a > b
if (aa->key < bb->key)
return -1;
if (aa->key > bb->key)
return +1;
return 0;
}
/// same as before but histogram can be in any order and may contain zeros, the output is stored in a dedicated parameter
/** - the function rejects maxLength > 63 but I don't see any practical reasons you would need a larger limit ...
* @param maxLength maximum code length, e.g. 15 for DEFLATE or 16 for JPEG
* @param numCodes number of codes, equals the array size of histogram and codeLength
* @param histogram how often each code/symbol was found
* @param codeLength [out] computed code lengths
* @result actual maximum code length, 0 if error
*/
unsigned char packageMerge(unsigned char maxLength, unsigned int numCodes, const unsigned int histogram[], unsigned char codeLengths[])
{
// my allround variable for various loops
unsigned int i;
// reset code lengths
for (i = 0; i < numCodes; i++)
codeLengths[i] = 0;
// count non-zero histogram values
unsigned int numNonZero = 0;
for (i = 0; i < numCodes; i++)
if (histogram[i] != 0)
numNonZero++;
// reject an empty alphabet because malloc(0) is undefined
if (numNonZero == 0)
return 0;
// allocate a buffer for sorting the histogram
struct KeyValue* mapping = (struct KeyValue*) malloc(sizeof(struct KeyValue) * numNonZero);
// copy histogram to that buffer
unsigned int storeAt = 0;
for (i = 0; i < numCodes; i++)
{
// skip zeros
if (histogram[i] == 0)
continue;
mapping[storeAt].key = histogram[i];
mapping[storeAt].value = i;
storeAt++;
}
// now storeAt == numNonZero
// invoke C standard library's qsort
qsort(mapping, numNonZero, sizeof(struct KeyValue), compareKeyValue);
// extract ascendingly ordered histogram
unsigned int* sorted = (unsigned int*) malloc(sizeof(unsigned int) * numNonZero);
for (i = 0; i < numNonZero; i++)
sorted[i] = mapping[i].key;
// run package-merge algorithm
unsigned char result = packageMergeSortedInPlace(maxLength, numNonZero, sorted);
// "unsort" code lengths
for (i = 0; i < numNonZero; i++)
codeLengths[mapping[i].value] = sorted[i];
// let it go ...
free(sorted);
free(mapping);
return result;
}
Kraft-McMillan Inequality
In one of the last sentences I mentioned the Kraft-McMillan inequality (sometimes McMillan is forgotten).
Its core message is: if sum(2^-codeLength[0...numCodes]) <= 1 then a prefix-free code exists.
Ideally you try to have sum(2^-codeLength[0...numCodes]) = 1 ("equal" instead of "less-than or equal") because anything else is inefficient.
The Huffman algorithm always gives you sum = 1. The Package-Merge algorithm has the same property.
The following length-limiting techniques modify code lengths in such a way that the maximum code length is enforced ("limited") while the inequality still holds true.
Assume that you ran the Huffman algorithm and found that a few codes are too long. You know that sum = 1.
An important insight is that making one of those long codes just one bit shorter increases the sum, thus violating the inequality:

newSum = sum - 2^-oldLength + 2^-newLength = sum - 2^-oldLength + 2^-(oldLength-1) = sum + 2^-oldLength

However, if there is another symbol with a code length shorter than your desired maxLength, we can make it one bit longer to offset our change.
The formula is almost the same (but plus one instead of minus one in the exponent).
Pretty much all algorithms mentioned in the following chapters pair long with short codes such that the new length limit is respected and the inequality holds true.
JPEG's Length-Limiting Approach
This is the shortest algorithm. It can be found in Annex K.3 of the JPEG standard ITU T.81 along with a nice flow diagram. You can download that document from w3.org, see page 147 of that PDF.
First you create a standard Huffman code. If the maximum code length doesn't exceed your limit then you are obviously done.
If you look at all codes having the longest code length then there is always an even number of them
(if that wasn't the case then one symbol could be made one bit shorter because its last bit is unused → an odd number must be an error in your Huffman implementation).
You repeat this procedure until reaching maxLength:
- pick two symbols x and y having that longest bit length
- the canonical Huffman code of x and y will be almost identical: x ends with 0 and y ends with 1
- let's call their shared prefix P
- we remove x's trailing 0, so that x becomes P, being one bit shorter
- however, y is invalid now
- let's pick a third symbol z which is shorter than the new x (that means at least two bits shorter than the still invalid y)
- appending a 0 to z makes it one bit longer
- appending a 1 to z creates a completely new code which can be used for y
- we made 1 symbol shorter by exactly 1 bit (x)
- we made 1 symbol shorter by at least 1 bit (y)
- we made 1 symbol longer by exactly 1 bit (z)
- from a mathematical point of view the sum of all Kraft values must be <= 1.0
- we know that bits(x) = bits(y)
- if bits(z) = bits(x) - 2 then sum(Kraft) = sum(Kraft')
- because bits(x') = bits(x) - 1
- and bits(y') = bits(y) - 1
- and bits(z') = bits(z) + 1
- so that bits(z') = bits(x') = bits(y')
- let's forget about all symbols except x, y and z so that we have:

sum(K) = K(x) + K(y) + K(z)
= 2^-bits(x) + 2^-bits(y) + 2^-bits(z)
= 2^-bits(x) + 2^-bits(x) + 2^-(bits(x) - 2)
= 2 * 2^-bits(x) + 4 * 2^-bits(x)
= 6 * 2^-bits(x)
sum(K') = K(x') + K(y') + K(z')
= 3 * K(x')
= 3 * 2^-bits(x')
= 3 * 2^-(bits(x) - 1)
= 6 * 2^-bits(x)
= sum(K)

- therefore the sum of all Kraft values remains unchanged after the transformation
Unfortunately, processing a huge alphabet with very large bit lengths can be quite slow.
However, this situation is impossible with the small JPEG alphabet (typically 162 symbols with an upper code length limit of 16).
The code simplifies and becomes faster if we transform a histogram of code lengths instead of the code lengths themselves.
That means you run a single pass over all code lengths and count how many symbols have one bits, two bits, three bits, etc.
The inner loop of my function limitedJpegInPlace is surprisingly short:
unsigned char limitedJpegInPlace(unsigned char newMaxLength, unsigned char oldMaxLength, unsigned int histNumBits[])
{
if (newMaxLength <= 1)
return 0;
if (newMaxLength > oldMaxLength)
return 0;
if (newMaxLength == oldMaxLength)
return newMaxLength;
// iterate over all "excessive" bit lengths, beginning with the longest
unsigned char i = oldMaxLength;
while (i > newMaxLength)
{
// no codes at this bit length ?
if (histNumBits[i] == 0)
{
i--;
continue;
}
// look for codes that are at least two bits shorter
unsigned char j = i - 2;
while (j > 0 && histNumBits[j] == 0)
j--;
// change bit length of 2 of the longest codes
histNumBits[i] -= 2;
// one code becomes one bit shorter
// (using the joint prefix of the old two codes)
histNumBits[i - 1]++;
// the other code has length j+1 now
// but another, not-yet-involved, code with length j
// is moved to bit length j+1 as well
histNumBits[j + 1] += 2;
histNumBits[j]--;
}
// return longest code length that is still used
while (i > 0 && histNumBits[i] == 0)
i--;
// JPEG Annex K.3 specifies an extra line:
// histNumBits[i]--;
// => because JPEG needs a special symbol to avoid 0xFF in its output stream
return i;
}
MiniZ's Length-Limiting Algorithm
The DEFLATE file format imposes a 15 bit limit in its Huffman stage.
Unlike JPEG there is no official length-limiting procedure, so programmers came up with different (but somehow very similar) code.
I found the most easily identifiable code to be MiniZ's huffman_enforce_max_code_size() (written by Rich Geldreich).
gzip and its zlib are more popular but the way they limit code lengths is somewhat hidden and interwoven with other functionality.
As far as I understand, both MiniZ's and zlib's length-limiting behaves identically.
While MiniZ's basic idea is comparable to JPEG's (for most input they produce identical output), it's much faster:
instead of slowly "eating" away one bit at a time from long codes, it immediately shortens all (!) long codes to the new maxLength limit.
Now the Kraft-McMillan inequality might be violated so that a few codes need to be fixed:
- select a code x with maximum length (= maxLength)
- select another code y that is shorter, if possible maxLength-1 bits (even shorter codes work, too, but are inefficient)
- make y one bit longer
- assign x the same length as the "new" y
- we just reduced the Kraft-McMillan sum by (at least) 2^-maxLength
- if it still exceeds 1 then continue with step 1
For example, assume newMaxLength is 16 and you have codes with 15 bits, too.
If you select a 16 bit code and pick a second 15 bit code then that 15 bit code becomes a 16 bit code and the selected 16 bit code stays a 15+1=16 bit code.
The Kraft-McMillan computations can be performed with integers if we multiply everything by 2^maxLength: 2^-maxLength * 2^maxLength = 1, thus each iteration "saves" one unit and the total sum must not exceed 1 * 2^maxLength = 2^maxLength.
If a histogram of code lengths is used instead of the code lengths themselves (just like the JPEG algorithm) then the inner loop is just 50 lines, and most of them are comments:
unsigned char limitedMinizInPlace(unsigned char newMaxLength, unsigned char oldMaxLength, unsigned int histNumBits[])
{
  if (newMaxLength <= 1)
    return 0;
  if (newMaxLength > oldMaxLength)
    return 0;
  if (newMaxLength == oldMaxLength)
    return newMaxLength;

  // my allround variable for various loops
  unsigned int i;

  // move all oversized code lengths to the longest allowed
  for (i = newMaxLength + 1; i <= oldMaxLength; i++)
  {
    histNumBits[newMaxLength] += histNumBits[i];
    histNumBits[i] = 0;
  }

  // compute Kraft sum
  // (using integer math: everything is multiplied by 2^newMaxLength)
  unsigned long long total = 0;
  for (i = newMaxLength; i > 0; i--)
    total += (unsigned long long)histNumBits[i] << (newMaxLength - i);

  // iterate until the Kraft sum doesn't exceed 1 anymore
  unsigned long long one = 1ULL << newMaxLength;
  while (total > one)
  {
    // select a code with maximum length, it will be moved
    histNumBits[newMaxLength]--;

    // find a second code with a shorter length
    for (i = newMaxLength - 1; i > 0; i--)
      if (histNumBits[i] > 0)
      {
        histNumBits[i]--;
        // extend that shorter code by one bit
        // and assign the same code length to the selected one
        histNumBits[i + 1] += 2;
        break;
      }

    // moving these codes reduced the Kraft sum
    total--;
  }
  return newMaxLength;
}
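To see the integer bookkeeping in action, here is a self-contained sketch with a hypothetical kraftSum helper (not part of the library): truncating Huffman lengths 1,2,3,4,4 to a 3 bit limit yields a scaled Kraft sum of 9 > 2^3 = 8, and a single fix-up step (the 2 bit code grows to 3 bits, matching the selected 3 bit code) brings it down to exactly 8.

```c
#include <assert.h>

// histNumBits[i] = number of codes that are exactly i bits long (index 0 unused);
// returns the Kraft-McMillan sum scaled by 2^maxLength, so a valid
// prefix code gives a result of at most 1ULL << maxLength
static unsigned long long kraftSum(const unsigned int histNumBits[], unsigned char maxLength)
{
  unsigned long long total = 0;
  unsigned int i;
  for (i = 1; i <= maxLength; i++)
    total += (unsigned long long)histNumBits[i] << (maxLength - i);
  return total;
}

// five symbols with Huffman lengths 1,2,3,4,4 truncated to three bits: lengths 1,2,3,3,3
static unsigned int truncated[4] = { 0, 1, 1, 3 };
// after one fix-up step: the 2 bit code was extended to 3 bits,
// the selected 3 bit code keeps its 2+1 = 3 bits
static unsigned int fixedUp[4]  = { 0, 1, 0, 4 };
```
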
BZip2's Length-Limiting Algorithm
bzip2 is a weird compression concept. One of its several stages is a Huffman compression stage. Instead of playing around with the Kraft-McMillan inequality, it completely relies on a fast Huffman encoder and trial and error:
- compute Huffman codes for your alphabet
- if the maximum code length doesn't exceed maxLength then you are done
- otherwise reduce each symbol's count (but avoid becoming zero)
- and go back to step 1
While being very fast, it may produce code lengths that are sub-optimal for the actual data. Sometimes dividing by 3 instead of 2 gives better results, but that totally depends on your input data.
The core loop is:
bzip2
The compression efficiency is in almost all cases worse than Package-Merge/JPEG/MiniZ. Moreover, it's slower than the JPEG/MiniZ approach, so there is no real reason to choose bzip2's scheme.
// run Moffat algorithm ...
unsigned char result = moffatSortedInPlace(numNonZero, sorted);
// ... until a proper maximum code length is found
while (result > maxLength)
{
  // more or less divide each weight by two while avoiding zero
  for (i = 0; i < numNonZero; i++)
  {
    unsigned int weight = mapping[i].key;

    // bzip2 "clears" the lowest 8 bits of the histogram
    // to reach the length limit in fewer iterations
    // but sacrifices a lot of efficiency;
    // if you set EXTRA_SHIFT to 0 then the code may need more iterations
    // but finds much better code lengths
    //#define EXTRA_SHIFT 8
    #define EXTRA_SHIFT 0
    // sometimes dividing the weight by a bigger integer (e.g. 3)
    // may lead to more efficient prefix codes
    #define DIVIDE_BY 2

    // the shifts do more harm than good in my opinion
    weight >>= EXTRA_SHIFT;
    // adding 1 avoids zero
    weight = 1 + (weight / DIVIDE_BY);
    weight <<= EXTRA_SHIFT;

    // sorted was overwritten with code lengths
    mapping[i].key = sorted[i] = weight;
  }
  // again: run Moffat algorithm
  result = moffatSortedInPlace(numNonZero, sorted);
}
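The weight transform itself can be isolated into a tiny helper (reduceWeight is a made-up name for illustration) to see why the extra shift hurts: with a shift of 8 the reduced weight keeps only its top bits, with a shift of 0 it keeps full precision.

```c
#include <assert.h>

// bzip2-style weight reduction: roughly divide by divideBy while never
// reaching zero; extraShift discards the weight's lowest bits first
// (no overflow checks, just a sketch of the transform above)
static unsigned int reduceWeight(unsigned int weight, unsigned int extraShift, unsigned int divideBy)
{
  weight >>= extraShift;
  weight = 1 + (weight / divideBy);
  weight <<= extraShift;
  return weight;
}
```
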
Kraft-McMillan Approximations
The goal of each presented length-limiting algorithm is producing a code that is compatible with Huffman decoders. Aside from Package-Merge, they all need to generate a plain Huffman code first and then adjust overlong symbols.
I already mentioned the Kraft-McMillan inequality which states that each code where sum(2^-codeLength[0...numCodes]) <= 1 is a prefix-free code. That means each such code can be decoded by Huffman decoders, too.
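A minimal integer-math check of that inequality might look like this (isKraftValid and the two sample arrays are made up for illustration, not library code):

```c
#include <assert.h>

// returns 1 if sum(2^-codeLengths[i]) <= 1, i.e. a prefix-free code
// with these lengths exists; unused symbols have length 0 and are skipped;
// integer math: everything is scaled by 2^maxLength
static int isKraftValid(const unsigned char codeLengths[], unsigned int numCodes, unsigned char maxLength)
{
  unsigned long long total = 0;
  unsigned int i;
  for (i = 0; i < numCodes; i++)
    if (codeLengths[i] > 0)
      total += 1ULL << (maxLength - codeLengths[i]);
  return total <= (1ULL << maxLength);
}

static const unsigned char okLengths[4]  = { 1, 2, 3, 3 }; // 1/2 + 1/4 + 1/8 + 1/8 = 1
static const unsigned char badLengths[4] = { 1, 2, 2, 3 }; // 1/2 + 1/4 + 1/4 + 1/8 > 1
```
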
I came across a few postings, mainly by Charles Bloom, discussing how solely focusing on the Kraft-McMillan inequality can be both fast and close to optimal.
However, I made a few important changes to speed things up:
- compute the theoretical number of bits for each symbol: entropy = -log2(count / total)
- using a super-fast approximation instead of C's log functions
- round up or down to the closest integer, this is our initial guess for the code length
- but never round down to zero
- if above our length limit, then set it to that limit
- compute the Kraft-McMillan sum of all symbols' code lengths
If that sum is above 1 then we need to modify (=increase) a few code lengths. The main idea is modifying those code lengths which "hurt the least".
When we rounded to the closest integer, many symbols rounded down: their actual code length is smaller than their theoretical bit length.
Assume you have a symbol which represents 9% of all symbols. Then its entropy is -log2(0.09) ≈ 3.474, rounded to 3. There is a "gain" of 3.474 - 3 = 0.474.
"Gain" means that we used 0.474 fewer bits per symbol occurrence than the theoretical minimum, we saved some bits.
If we change that symbol's code length to 4 (assuming it's allowed by the length limit), then there will be a "loss" of 3.474 - 4 = -0.526.
"Loss" means that we used 0.526 more bits per symbol occurrence than the theoretical minimum, we wasted some bits.
The Kraft-McMillan sum will be lower though, hopefully now below 1, at the cost of slightly worse compression.
Let's say there is another symbol representing 10% of all symbols, thus being a little bit more frequent than the previous one.
Its entropy is -log2(0.1) ≈ 3.322, rounded to 3 as well. Therefore its "gain" is 3.322 - 3 = 0.322 and the potential "loss" after adding one bit is 3.322 - 4 = -0.678. Both symbols affect the Kraft-McMillan sum in the same way, but the latter symbol will cause even worse compression.
You have probably noticed that adding one bit to a symbol which had a gain of x leads to a loss of 1-x.
So the main idea is: always look for symbols with maximum gain to cause minimal loss after adding a bit.
I implemented two different strategies to modify the code lengths, called A and B.
And yes, I got tired of the long name "Kraft-McMillan inequality" and just called it "Kraft" (which is the nice German word for "power"/"strength"/"force").
Kraft - Strategy A
This algorithm works in several passes. A threshold, initially a small value, decreases after each pass. Walking through all symbols, we look for symbols where the difference between theoretical bit length and actual bit length is above that threshold ("high gain").
One bit will be added and the Kraft sum adjusted. As soon as the Kraft sum isn't larger than 1 anymore we have a valid prefix-free code.
It's possible to "overshoot" by extending too many previously very short codes - which have a high influence on the Kraft sum.
If we are well below a Kraft sum of 1 then a single pass looks for codes which can be made one bit shorter without violating the Kraft sum.
In many cases this last pass isn't needed and can be skipped.
The algorithm works pretty well. However, the number of passes hugely depends on the input data.
If the initial threshold is too high and/or shrinks only a little after each pass then a large number of passes will be required.
If the initial threshold is too low and/or shrinks a lot after each pass then only a few passes will be required.
While the latter sounds much better, it will miss many optimal code lengths and produce sub-optimal results.
I didn't find a perfect one-size-fits-all configuration. The defaults INITIAL_THRESHOLD = 28 / 64.0 and STEP_THRESHOLD = 1 / 64.0 worked well in my experiments.
Kraft - Strategy B
Strategy A makes a few guesses regarding the threshold. Moreover, it repeatedly scans symbols which need not be modified. My strategy B implements a max-queue: its first element is always the symbol with the highest gain.
Therefore we don't need to scan all symbols but always pick from the top of that max-queue.
After a symbol has been adjusted, it needs to be re-inserted into the max-queue.
One of the most efficient data structures for such an access pattern is a heap. Most standard libraries come with a heap implementation (C++'s, for example).
Roughly half of the code of limitedkraftheap.c is a pretty generic and straightforward heap implementation.
Similar to strategy A, we might "overshoot" so that a fast fix-up exists as well.
There is no major difference between strategy A and B concerning the generated code lengths.
However, strategy B is much faster (typically by a factor of 4) and even outperforms an efficient Huffman implementation.
It's interesting to note that the length limit has a very significant impact on the execution time: lower limits are faster.
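A minimal binary max-heap in C, roughly the kind of structure limitedkraftheap.c builds (this sketch stores only the gain values in a fixed-size array without overflow checks; the heapPush/heapPop names are made up, not the library's):

```c
#include <assert.h>

#define MAX_HEAP 256
static double       heapData[MAX_HEAP];
static unsigned int heapSize = 0;

// insert a value and sift it up until its parent is larger
static void heapPush(double value)
{
  unsigned int pos = heapSize++;
  heapData[pos] = value;
  while (pos > 0 && heapData[(pos - 1) / 2] < heapData[pos])
  {
    double tmp = heapData[pos];
    heapData[pos] = heapData[(pos - 1) / 2];
    heapData[(pos - 1) / 2] = tmp;
    pos = (pos - 1) / 2;
  }
}

// remove and return the largest value, then sift the new root down
static double heapPop(void)
{
  double top = heapData[0];
  unsigned int pos = 0;
  heapData[0] = heapData[--heapSize];
  for (;;)
  {
    unsigned int child = 2 * pos + 1;
    if (child >= heapSize)
      break;
    if (child + 1 < heapSize && heapData[child + 1] > heapData[child])
      child++; // pick the larger of the two children
    if (heapData[pos] >= heapData[child])
      break;
    double tmp = heapData[pos];
    heapData[pos] = heapData[child];
    heapData[child] = tmp;
    pos = child;
  }
  return top;
}
```

The adjust-and-reinsert step of strategy B then becomes: pop the symbol with the highest gain, add one bit to its code length, and push it back with its gain reduced by 1.
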
License
This code is licensed under the zlib License:
This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software. Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions:
1. The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.
Source Code
Git users: scroll down to the repository link.
Latest release: February 5, 2021, size: 1668 bytes, 30 lines
CRC32:
9f50f131
MD5:
9c4358817e4f951186dc6c87866ab727
SHA1:
5533b6f2fa58c109c590ae4f33665ac7d3e4b896
SHA256:
beb1da8cdce260cb540ab871c5f98d46b596e3956d61324fd1567739492384f8
Latest release: June 29, 2021, size: 11.0 kBytes, 338 lines
CRC32:
b70bd1e5
MD5:
e826eda6c202bfa2fa22daf9e3a5a246
SHA1:
fc47f5ef18d5c82996032b3d3d9c83a85df8f940
SHA256:
2dec5ba9d228eeb44bab2f90312518f546475a65982566857bc8ffe402c42a34
Latest release: February 5, 2021, size: 2208 bytes, 41 lines
CRC32:
979b7a5c
MD5:
264867286a2a10419e77ad7e2da9ee69
SHA1:
2cebbac1ae3c7f53ee7562ed73b9e68ca4987dff
SHA256:
04e45e82ad87c9bb938d7e68b9a39346dbe6d1c8ff9e5859894be43ec62ba406
Latest release: February 10, 2021, size: 4436 bytes, 175 lines
CRC32:
a5935e36
MD5:
db5f8486daebbbe7179c2ed964bd7a80
SHA1:
b688a68b47c8ef654e3271a78152a5382c480820
SHA256:
1ec47ed8df361a5cb8ef847b94dc1be85fccfaaab2c08d7d56f4dd551d8b31eb
Latest release: February 5, 2021, size: 4835 bytes, 80 lines
CRC32:
161095e9
MD5:
715a76cfe7bb7d74e8c1e5f2fbcfb4d9
SHA1:
e0f8bd3a6b3afcd6a793fa8daab8b7b34db92cf4
SHA256:
e75a18a20fe66cee275a0d0ebe52cca3bf9ad14c92994c3a11d5a68651d98819
Latest release: February 10, 2021, size: 12.2 kBytes, 356 lines
CRC32:
c503747c
MD5:
e878141ab902742613f9863e97118ec0
SHA1:
e6a180b59caf39df790cc047876392c65c114232
SHA256:
0e85a8d136f07214b4203a4036a75a3f817e64f1edebf7028b3df55217924ed6
Latest release: February 5, 2021, size: 1476 bytes, 29 lines
CRC32:
c958c820
MD5:
5735c2438ffbd21540c640e48a8217fd
SHA1:
e157bb74cc746609ae083ceb497bf4327f31f016
SHA256:
39a411f02183cc34a8559f973724517a6d81c68c5259478f0a05fdb8fb66658a
Latest release: February 10, 2021, size: 4050 bytes, 131 lines
CRC32:
821a4f0f
MD5:
af0a84f35b32310ec33ad6da058ff3e4
SHA1:
a1eff36323c9917f62f0cfd1ee000388c44a4c1e
SHA256:
a8bfb557d645a1419df9fe071682d9be88af71c82905bddfb0f3d10e50964a78
Latest release: February 5, 2021, size: 745 bytes, 17 lines
CRC32:
cb6d4745
MD5:
9f5615706645e6d591769cd9385c90bd
SHA1:
e8e526e2e32b16197ec77c593f476230221217a6
SHA256:
8ee50def030ba1dd2383febc970496c1f7b9cfe824774d133b5f75cc0741b641
Latest release: February 5, 2021, size: 6.9 kBytes, 224 lines
CRC32:
fddbc6e9
MD5:
a210caef2d00f1be7c9da18b193df501
SHA1:
648b90061f069d74beb41decdcd7ed28c4eaffef
SHA256:
545e870024999e917bd52347419872458178f61645853e5360bf04f35e2aa63c
Latest release: February 5, 2021, size: 753 bytes, 17 lines
CRC32:
7a4c7aa3
MD5:
f920b31d7c745895287ae07e357b0737
SHA1:
7bf007989f4ee0ec9e6af857038fa52bf7a5b59f
SHA256:
c12082e6be08b9028d80ad2f148ecfb780ea4daf9ee01b9569e7f30553dba489
Latest release: February 5, 2021, size: 9.7 kBytes, 342 lines
CRC32:
0a9bb4f3
MD5:
1d5fcb52464eb0ae0922ec43dddf862f
SHA1:
99e13c8494a78e5c4d602abb124da76e2064d049
SHA256:
3a4559d076c85703d1936b4ee09a9d1b2fbc71ab73ddf20ee9805ab29420fc56
Stay up-to-date:
git clone https://create.stephan-brumme.com/length-limited-prefix-codes/git
If you encounter any bugs/problems or have ideas for improving future versions, please write me an email: create@stephan-brumme.com
Changelog
- version 1
- June 30, 2021
- initial release