Interactive demos require Javascript. Anything else works.

Length-Limited Prefix Codes

posted by Stephan Brumme

Overview

The most common technique to generate prefix-free codes is the Huffman algorithm.
It is part of many popular compression algorithms such as JPEG, DEFLATE (ZIP / GZIP / PNG / etc.).
In most cases it serves as an "afterburner" to compress pre-processed symbols.

The main advantage - variable-length codes - can be a critical drawback, too: if the maximum code length exceeds a certain threshold, e.g. the CPU's integer size, (16 / 32 / 64 bit), then the compression/decompression code becomes complex and slows down.
All of the aforementioned file formats put a strict upper limit on the maximum length of Huffman codes.
Mainly due to their age and low number of distinct symbols, they all agreed on 16 bits per code.

So-called length-limited prefix codes are not Huffman codes anymore because they are not optimal (if the length restriction was enforced).
The loss in compression efficiency turns out to be often negligible and thus is widely accepted due to the significant gain in decompression performance.

There is a number of algorithms to create such length-limited prefix codes. They vary concerning speed, efficiency and complexity.
I wrote a C library implementing the most common techniques. Most of the code derives from open source programs and was invented by others.

My major contribution was to create a shared interface so that these algorithms are easily interchangeable:
hide interface unsigned char algorithmName(unsigned char maxLength, unsigned int numCodes, const unsigned int histogram[], unsigned char codeLengths[]);
There are three input parameters: and one output parameter: The return value is: If you want to encode the sequence 0 3 1 1 2 5 2 2 then numCodes = 6 and histogram[] = { 1,2,3,1,0,1 };

However, this shared interface comes with a little bit of overhead: sometimes doubling code size and/or execution time.
Therefore most files have a second public function which is more specialized for its algorithm but may have a few additional restrictions.
A common case is that the histogram has to be sorted and unused symbols removed.

Currently implemented are:
  Package-Merge JPEG MiniZ BZip2 Kraft
Author Larmore and Hirschberg ITU T.81 standard, annex K.3 Rich Geldreich Julian Seward inspired by Charles Bloom
and Jörn Engel
Pro optimal output shortest code similar to JPEG but faster trivial to understand often fastest
Contra slowest and most complex mediocre compression mediocre compression worst compression wildly fluctuating compression
Description scroll down scroll down scroll down scroll down strategy A and strategy B

Usage

Each algorithm runs independent of each other. There are no external dependencies, too.
Some algorithms require pre-computed Huffman codes - for those cases I added a slightly modified version of Andrew Moffat's efficient in-place Huffman code.
If you want to try out an algorithm then all you have to do is:
hide Usage #include "packagemerge.h" #include <stdio.h> int main() { // let's assume you have 7 symbols unsigned int numCodes = 7; // first one occurs 270 times, second 20x, third 10x, fourth not at all, fifth once, sixth 6x and seventh once unsigned int histogram[7] = { 270,20,10,0,1,6,1 }; // at most 4 bits unsigned char maxLength = 4; // output: bits per code unsigned char codeLengths[7]; // compute length-limited codes unsigned char longest = packageMerge(maxLength, numCodes, histogram, codeLengths); // display result for (unsigned int i = 0; i < numCodes; i++) printf("code %d occurs %d times => %d bits\n", i, histogram[i], codeLengths[i]); printf("max. code length is %d\n", longest); return 0; }
Compared to the original Huffman algorithm, a few code became shorter and a few became longer.
The number of bits / codeLengths are:
Symbol 0 1 2 3 4 5 6
Count 270 20 10 0 1 6 1 sum = 308
Huffman 1 2 3 0 5 4 5 sum = 374 bits
maxLength = 4 1 2 4 0 4 4 4 sum = 382 bits
maxLength = 3 2 2 3 0 3 3 3 sum = 634 bits
As you can see, limiting the maximum code lengths hurts the compression ratio.

There are several methods to convert code lengths to actual prefix-free codes. I find canonical Huffman codes to be the most suitable: a simple and fast algorithm.
Using the example data we get these canonical codes:
Symbol 0 1 2 3 4 5 6
Count 270 20 10 0 1 6 1
Huffman 0 10 110 - 11110 1110 11111
maxLength = 4 0 10 1100 - 1101 1110 1111
maxLength = 3 00 01 100 - 101 110 111
A basic program showing how to compute canonical codes (click to open):
show canonical.c #include "packagemerge.h" #include <stdio.h> // create canonical codes, for simplicity longest code must not exceed number of bits of an unsigned int (32 or 64) void canonical(unsigned int numCodes, unsigned char codeLengths[], unsigned int codes[]) { // count how often each code length is present unsigned int numLengths[256] = { 0 }; unsigned int i; for (i = 0; i < numCodes; i++) numLengths[codeLengths[i]]++; // precompute the first code of each code length unsigned int next[256]; // there's no code with length 0 and the first 1-bit code with be a zero next[1] = 0; for (i = 2; i < 256; i++) next[i] = (next[i - 1] + numLengths[i - 1]) << 1; // now assign all codes for (i = 0; i < numCodes; i++) codes[i] = next[codeLengths[i]]++; } int main() { // let's assume you have 7 symbols unsigned int numCodes = 7; // first one occurs 270 times, second 20x, third 10x, fourth no at all, fifth once, sixth 6x and seventh once unsigned int histogram[7] = { 270,20,10,0,1,6,1 }; // at most 4 bits unsigned char maxLength = 4; // output: bits per code unsigned char codeLengths[7]; // compute length-limited codes unsigned char longest = packageMerge(maxLength, numCodes, histogram, codeLengths); // create canonical codes unsigned int codes[7]; canonical(numCodes, codeLengths, codes); // display result unsigned int i; for (i = 0; i < numCodes; i++) { printf("code %d occurs %d times\t=> %d bits", i, histogram[i], codeLengths[i]); // binary represenation of canonical code printf("\t=> "); if (codeLengths[i] > 0) { int shift; for (shift = codeLengths[i] - 1; shift >= 0; shift--) printf(codes[i] & (1 << shift) ? "1" : "0"); } printf("\n"); } printf("max. code length is %d\n", longest); return 0; }

Benchmark

Here are a few results from the first 64k bytes of enwik, measured on a Core i7 / GCC x64:

time ./benchmark 1 12 100000 For comparison: the uncompressed data has 64k bytes = 524,288 bits:
algorithm ID total size compression ratio execution time
Huffman 0 326,892 bits 62.35% 0.54 s
Package-Merge 1 327,721 bits 62.51% 1.17 s
MiniZ 2 328,456 bits 62.65% 0.58 s
JPEG 3 328,456 bits 62.65% 0.61 s
BZip2 4 328,887 bits 62.73% 0.88 s
Kraft 5 327,895 bits 62.54% 1.72 s
modified Kraft 6 327,942 bits 62.55% 0.41 s
Remember: each algorithm was repeated 100,000 times, so a single iteration finished in about 4 to 17 microseconds.

Changing the code length limit significantly influences both execution time and compression ratio.
I picked the two "best" algorithms, used the same data set and repeated each procedure 100,000 times:
length limit Package-Merge Kraft Strategy B
8 bits 70.47%, 0.96 s 70.76%, 0.24 s
9 bits 65.30%, 1.02 s 65.31%, 0.24 s
10 bits 63.49%, 1.07 s 63.79%, 0.31 s
11 bits 62.80%, 1.14 s 62.84%, 0.37 s
12 bits 62.51%, 1.17 s 62.55%, 0.40 s
13 bits 62.40%, 1.22 s 62.43%, 0.34 s
14 bits 62.36%, 1.25 s 62.42%, 0.40 s
15 bits 62.35%, 1.29 s 62.42%, 0.66 s
16 bits 62.35%, 1.35 s 62.42%, 0.70 s
For comparison: Moffat's Huffman algorithm needs 0.55 seconds and its longest prefix code has 16 bits.

Package-Merge

Package-Merge is an algorithm that guarantees to produce optimal length-limited prefix codes.
Its name is based on the conception of putting numbers in packages and merging these.

Unfortunately, the original paper by Lawrence L. Larmore and Daniel S. Hirschberg isn't very accessible.
Moreover, further information is pretty scarce on the internet: the Wikipedia entry doesn't help a lot.
The best introduction - in my opinion - is Sebastian Gesemann's Bachelor's Thesis but it's available in German only.
The diagrams on page 24 and 26 walk through an example and page 29 gives pseudo code.

There are two prerequisites:
Let's process the sample data shown above:
Symbol 0 1 2 3 4 5 6
Count 270 20 10 0 1 6 1
Obviously, the prerequisites aren't fulfilled yet: unused symbol 3 needs to be removed and the remaining symbols must be in ascending order of their count:
(the order of symbols with the same count doesn't matter)
Symbol 4 6 5 2 1 0
Count 1 1 6 10 20 270
Since there were 6 symbols (plus an unused one), we need at least 3 bits to encode them.
Naturally you can't have an active symbol with zero bits, the minimum code length is 1 bit.
Package-Merge adds one bit in each iteration so there is a minimum of two iterations (2+1=3) but for whatever reason I decided to have this example with maxLength=4.

Now we add these counts pairwise, we build packages. If there is an odd number of symbols then the last (which is the largest) is discarded:
Symbol 4 6 5 2 1 0
Count 1 1 6 10 20 270
Pairwise Sum 2 16 290
These packages have to be merged with the original sequence in ascending order.
If a package is identical to a count of the original sequence then that item from the original sequence comes first.
In the next table packages are shown in a bold font:
Count 1 1 2 6 10 16 20 270 290
This was one iteration of phase 1. Each iteration represents one bit.
Therefore we need to repeat that procedure two more times to reach our goal maxLength=4.
However, there is one little pitfall: pairs are made using the sequence from the previous step but you always merge them with the original sequence.
That last sentence is important and was probably my most critical insight when I learned about the Package-Merge algorithm.

Okay, let's create package for the second bit:
Count 1 1 2 6 10 16 20 270 290
Pairwise Sum 2 8 26 290
As you can see, there is an odd number of items and thus the last item (290 in the upper row) must be discarded.
The four values in the lower row are merged with the original sequence:
Count 1 1 2 6 8 10 20 26 270 290
One more time "packaging":
Count 1 1 2 6 8 10 20 26 270 290
Pairwise Sum 2 8 18 46 560
And merging:
Count 1 1 2 6 8 10 18 20 46 270 560
Phase 1 has completed. Let's gather all intermediate results from each iteration in one table:
Original 1 1 6 10 20 270
Iteration 1 1 1 2 6 10 16 20 270 290
Iteration 2 1 1 2 6 8 10 20 26 270 290
Iteration 3 1 1 2 6 8 10 18 20 46 270 560
Phase 2 processes the output of each iteration, going from the bottom row to the top row.
The actual values in the table don't matter - it's only relevant whether a value is a package (bold) or not (plain).

Code lengths for all symbols are initially set to zero.
If a value isn't bold then the code length of a symbol has to be incremented.

The first iteration processes the first 2*numCodes-2=10 values of the bottom row, that means 1,1,...,270 but not 560.
Two counters symbol and numMerged are set to zero.

For each table entry a decision has to be made:
The next iteration has to look at the first 2*numMerged table entries.

Let's go step-by-step through the example - as mentioned before we start with the bottom row.
The code length of all symbols was set to zero.
(array index) 0 1 2 3 4 5
Symbol 4 6 5 2 1 0
Count 1 1 6 10 20 270
Code Length (initially) 0 0 0 0 0 0
Code Length
(after processing bottom row)
1 1 1 1 1 1
What happened ? We looked at ten values of iteration 3 of the first phase:
  1. a 1 which wasn't bold → codeLength[0] = 1 and then symbol = 1
  2. another 1 which wasn't bold → codeLength[1] = 1 and then symbol = 2
  3. a 2 which is boldnumMerged = 1
  4. a 6 which wasn't bold → codeLength[2] = 1 and then symbol = 3
  5. a 8 which is boldnumMerged = 2
  6. a 10 which wasn't bold → codeLength[3] = 1 and then symbol = 4
  7. a 18 which is boldnumMerged = 3
  8. a 20 which wasn't bold → codeLength[4] = 1 and then symbol = 5
  9. a 46 which is boldnumMerged = 4
  10. a 270 which wasn't bold → codeLength[5] = 1 and then symbol = 6
Somehow that wasn't very exciting ... all code lengths changed from zero to one.
Because the final value of numMerged is 4, we look at the first 4*2=8 values of the second-to-last row.
Its content was (just a copy, I stroke through irrelevant entries):
Iteration 2 1 1 2 6 8 10 20 26 270 290
This time we cope only with three bold entries and five non-bold.
That means numMerged = 3 and the code length of all-but-the-last symbol becomes 2.
That last symbol, which had the highest count, suddenly gets assigned a lower code length than all the other symbols.

Again, a copy of phase 1's iteration 1 (where only 2*numMerged = 6 will be handled):
Iteration 1 1 1 2 6 10 16 20 270 290
There are two bold entries (merged packages) and four non-bold.
You probably already understand that the first four symbol's code length needs to be incremented and
we will look at the first 2*numMerged = 4 numbers of the final row (the top row).
Original 1 1 6 10 20 270
Obviously there can't be any bold values.
The first four symbol's code length is incremented again. Let's take a look at the code lengths of each step:
(array index) 0 1 2 3 4 5
Symbol 4 6 5 2 1 0
Count 1 1 6 10 20 270
Code Length (initially) 0 0 0 0 0 0
Code Length
(after processing bottom row /
iteration 3)
1 1 1 1 1 1
Code Length
(after processing iteration 2)
2 2 2 2 2 1
Code Length
(after processing iteration 1)
3 3 3 3 2 1
Code Length
(after processing top row)
4 4 4 4 2 1
That's it ! The bottom code contains the final code lengths.
No symbol's number of bits is higher than maxLength=4 and the Kraft-McMillan inequality holds true as well: 2^-4 + 2^-4 + 2^-4 + 2^-4 + 2^-2 + 2^-1 = 1.
The algorithm is guaranteed to create optimal code lengths for a prefix-free code. No other combination of code lengths will be better.

I left out any theoretical stuff, such as proofs etc., and skipped practical optimizations to keep things short and sweet.
Take a look at my code (click below) or read the scientific papers to learn more details.
show packagemerge.c // ////////////////////////////////////////////////////////// // packagemerge.c // written by Stephan Brumme, 2021 // see https://create.stephan-brumme.com/length-limited-prefix-codes/ // #include "packagemerge.h" #include <stdlib.h> // malloc/free/qsort // ----- package-merge algorithm ----- // to me the best explanation is Sebastian Gesemann's Bachelor Thesis (in German only / University of Paderborn, 2004) // data types (switching to unsigned int is faster but fails if sum(histogram) > 2^31 or maxLength > 31) typedef unsigned long long BitMask; typedef unsigned long long HistItem; /// compute limited prefix code lengths based on Larmore/Hirschberg's package-merge algorithm /** - histogram must be in ascending order and no entry must be zero * - the function rejects maxLength > 63 but I don't see any practical reasons you would need a larger limit ... * @param maxLength maximum code length, e.g. 15 for DEFLATE or JPEG * @param numCodes number of codes, equals the array size of histogram and codeLength * @param A [in] how often each code/symbol was found [out] computed code lengths * @result actual maximum code length, 0 if error */ unsigned char packageMergeSortedInPlace(unsigned char maxLength, unsigned int numCodes, unsigned int A[]) { // skip zeros while (numCodes > 0 && A[0] == 0) { numCodes--; A++; } // at least one code needs to be in use if (numCodes == 0 || maxLength == 0) return 0; // one or two codes are always encoded with a single bit if (numCodes <= 2) { A[0] = 1; if (numCodes == 2) A[1] = 1; return 1; } // A[] is an input parameter (stores the histogram) as well as // an output parameter (stores the code lengths) const unsigned int* histogram = A; unsigned int* codeLengths = A; // my allround variable for various loops unsigned int i; // check maximum bit length if (maxLength > 8*sizeof(BitMask) - 1) // 8*8-1 = 63 return 0; // at least log2(numCodes) bits required for every valid prefix code unsigned long long encodingLimit = 1ULL << maxLength; if (encodingLimit < numCodes) return 0; // need two buffers to process iterations and an array of bitmasks unsigned int maxBuffer = 2 * numCodes; // allocate memory HistItem* current = (HistItem*) malloc(sizeof(HistItem) * maxBuffer); HistItem* previous = (HistItem*) malloc(sizeof(HistItem) * maxBuffer); BitMask* isMerged = (BitMask*) malloc(sizeof(BitMask) * maxBuffer); // initial value of "previous" is a plain copy of the sorted histogram for (i = 0; i < numCodes; i++) previous[i] = histogram[i]; unsigned int numPrevious = numCodes; // no need to initialize "current", it's completely rebuild every iteration // keep track which packages are merged (compact bitmasks): // if package p was merged in iteration i then (isMerged[p] & (1 << i)) != 0 for (i = 0; i < maxBuffer; i++) isMerged[i] = 0; // there are no merges before the first iteration // the last 2 packages are irrelevant unsigned int numRelevant = 2 * numCodes - 2; // ... and preparation is finished // ////////////////////////////////////////////////////////////////////// // iterate through potential bit lengths while packaging and merging pairs // (step 1 of the algorithm) // - the histogram is sorted (prerequisite of the function) // - the output must be sorted, too // - thus we have to copy the histogram and every and then insert a new package // - the code keeps track of the next package and compares it to // the next item to be copied from the history // - the smaller value is chosen (if equal, the histogram is chosen) // - a bitmask named isMerged is used to keep track which items were packages // - repeat until the whole histogram was copied and all packages inserted // bitmask for isMerged BitMask mask = 1; unsigned char bits; for (bits = maxLength - 1; bits > 0; bits--) { // ignore last element if numPrevious is odd (can't be paired) numPrevious &= ~1; // bit-twiddling trick to clear the lowest bit, same as numPrevious -= numPrevious % 2 // first merged package current[0] = histogram[0]; // a sum can't be smaller than its parts current[1] = histogram[1]; // therefore it's impossible to find a package at index 0 or 1 HistItem sum = current[0] + current[1]; // same as previous[0] + previous[1] // copy histogram and insert merged sums whenever possible unsigned int numCurrent = 2; // current[0] and current[1] were already set unsigned int numHist = numCurrent; // we took them from the histogram unsigned int numMerged = 0; // but so far no package inserted (however, it's precomputed in "sum") for (;;) // stop/break is inside the loop { // the next package isn't better than the next histogram item ? if (numHist < numCodes && histogram[numHist] <= sum) { // copy histogram item current[numCurrent++] = histogram[numHist++]; continue; } // okay, we have a package being smaller than next histogram item // mark output value as being "merged", i.e. a package isMerged[numCurrent] |= mask; // store package current [numCurrent] = sum; numCurrent++; // already finished last package ? numMerged++; if (numMerged * 2 >= numPrevious) break; // precompute next sum sum = previous[numMerged * 2] + previous[numMerged * 2 + 1]; } // make sure every code from the histogram is included // (relevant if histogram is very skewed with a few outliers) while (numHist < numCodes) current[numCurrent++] = histogram[numHist++]; // prepare next mask mask <<= 1; // performance tweak: abort as soon as "previous" and "current" are identical if (numPrevious >= numRelevant) // ... at least their relevant elements { // basically a bool: FALSE == 0, TRUE == 1 char keepGoing = 0; // compare both arrays: if they are identical then stop for (i = numRelevant - 1; i > 0; i--) // collisions are most likely at the end if (previous[i] != current[i]) { keepGoing++; break; } // early exit ? if (keepGoing == 0) break; } // swap pointers "previous" and "current" HistItem* tmp = previous; previous = current; current = tmp; // no need to swap their sizes because only numCurrent needed in next iteration numPrevious = numCurrent; } // shifted one bit too far mask >>= 1; // keep only isMerged free(previous); free(current); // ////////////////////////////////////////////////////////////////////// // tracking all merges will produce the code lengths // (step 2 of the algorithm) // - analyze each bitlength's mask in isMerged: // * a "pure" symbol => increase bitlength of that symbol // * a merged code => just increase counter // - stop if no more merged codes found // - if m merged codes were found then only examine // the first 2*m elements in the next iteration // (because only they formed these merged codes) // reset code lengths for (i = 0; i < numCodes; i++) codeLengths[i] = 0; // start with analyzing the first 2n-2 values unsigned int numAnalyze = numRelevant; while (mask != 0) // stops if nothing but symbols are found in an iteration { // number of merged packages seen so far unsigned int numMerged = 0; // the first two elements must be symbols, they can't be packages codeLengths[0]++; codeLengths[1]++; unsigned int symbol = 2; // look at packages for (i = symbol; i < numAnalyze; i++) { // check bitmask: not merged if bit is 0 if ((isMerged[i] & mask) == 0) { // we have a single non-merged symbol, which needs to be one bit longer codeLengths[symbol]++; symbol++; } else { // we have a merged package, so that its parts need to be checked next iteration numMerged++; } } // look only at those values responsible for merged packages numAnalyze = 2 * numMerged; // note that the mask was originally slowly shifted left by the merging loop mask >>= 1; } // last iteration can't have any merges for (i = 0; i < numAnalyze; i++) codeLengths[i]++; // it's a free world ... free(isMerged); // first symbol has the longest code because it's the least frequent in the sorted histogram return codeLengths[0]; } // the following code is almost identical to function moffat() in moffat.c // helper struct for qsort() struct KeyValue { unsigned int key; unsigned int value; }; // helper function for qsort() static int compareKeyValue(const void* a, const void* b) { struct KeyValue* aa = (struct KeyValue*) a; struct KeyValue* bb = (struct KeyValue*) b; // negative if a < b, zero if a == b, positive if a > b if (aa->key < bb->key) return -1; if (aa->key > bb->key) return +1; return 0; } /// same as before but histogram can be in any order and may contain zeros, the output is stored in a dedicated parameter /** - the function rejects maxLength > 63 but I don't see any practical reasons you would need a larger limit ... * @param maxLength maximum code length, e.g. 15 for DEFLATE or JPEG * @param numCodes number of codes, equals the array size of histogram and codeLength * @param histogram how often each code/symbol was found * @param codeLength [out] computed code lengths * @result actual maximum code length, 0 if error */ unsigned char packageMerge(unsigned char maxLength, unsigned int numCodes, const unsigned int histogram[], unsigned char codeLengths[]) { // my allround variable for various loops unsigned int i; // reset code lengths for (i = 0; i < numCodes; i++) codeLengths[i] = 0; // count non-zero histogram values unsigned int numNonZero = 0; for (i = 0; i < numCodes; i++) if (histogram[i] != 0) numNonZero++; // reject an empty alphabet because malloc(0) is undefined if (numNonZero == 0) return 0; // allocate a buffer for sorting the histogram struct KeyValue* mapping = (struct KeyValue*) malloc(sizeof(struct KeyValue) * numNonZero); // copy histogram to that buffer unsigned int storeAt = 0; for (i = 0; i < numCodes; i++) { // skip zeros if (histogram[i] == 0) continue; mapping[storeAt].key = histogram[i]; mapping[storeAt].value = i; storeAt++; } // now storeAt == numNonZero // invoke C standard library's qsort qsort(mapping, numNonZero, sizeof(struct KeyValue), compareKeyValue); // extract ascendingly ordered histogram unsigned int* sorted = (unsigned int*) malloc(sizeof(unsigned int) * numNonZero); for (i = 0; i < numNonZero; i++) sorted[i] = mapping[i].key; // run package-merge algorithm unsigned char result = packageMergeSortedInPlace(maxLength, numNonZero, sorted); // "unsort" code lengths for (i = 0; i < numNonZero; i++) codeLengths[mapping[i].value] = sorted[i]; // let it go ... free(sorted); free(mapping); return result; }

Kraft-McMillan Inequality

In one of the last sentences I mentioned the Kraft-McMillan inequality (sometimes McMillan is forgotten).
It's core message is: if sum(2^-codeLength[0...numCodes]) <= 1 then a prefix-free code exists.

Ideally you try to have sum(2^-codeLength[0...numCodes]) = 1 ("equal" instead of "less-than or "equal") because anything else is inefficient.
The Huffman algorithm always gives you sum = 1. The Package-Merge algorithm has the same property.

The following length-limiting techniques modify code lengths in such a way that the maximum code length is enforced ("limited") while the inequality still holds true.

Assume that you ran the Huffman algorithm and found that a few codes are too long. You know that sum = 1.
An important insight is that making one of such long codes only one bit shorter increments the sum, thus violating the inequality:
newSum = sum - 2^-oldLength + 2^-newLength = sum - 2^-oldLength + 2^(oldLength-1) = sum + 2^-oldLength

However, if there is another symbol with a code length that is shorter than your desired maxLength we can make it one bit longer to offset our change.
The formula is almost the same (but plus one instead of minus one in the exponent).
Pretty much all algorithm mentioned in the follwing chapters pair long with short codes such that the new length limit is respected and the inequality holds true.

JPEG's Length-Limiting Approach

This is the shortest algorithm. It can be found in Annex K.3 of the JPEG standard ITU T.81 along with a nice flow diagram.
You can download that document from w3.org, see page 147 of that PDF.

First you create a standard Huffman code. If the maximum code length doesn't exceed your limit then you are obviously done.

If you look at all codes having the longest code length then there is always an even number of them
(if that wasn't the case then one symbol could be made one bit shorter because its last bit is unused → an odd number must be an error in your Huffman implementation).

You repeat this procedure until reaching maxLength: What did we achieve ?
  1. we made 1 symbol shorter by exactly 1 bit (x)
  2. we made 1 symbol shorter by at least 1 bit (y)
  3. we made 1 symbol longer by exactly 1 bit (z)
Why does it work ?
Each step reduces the bit lengths of just two symbols by typically one bit.
Unfortunately, processing a huge alphabet with very large bit lengths can be quite slow.
However, this situation is impossible with the small JPEG alphabet (typically 162 symbols with an upper code length limit of 16).

The code simplifes and becomes faster if work transform a histogram of code lengths instead of the code lengths themselves.
That means you run a single pass over all code lengths and count how many symbols have one bits, two bits, three bits, etc.

The inner loop of my function limitedJpegInPlace is surprisingly shprt:
hide limitedJpegInPlace unsigned char limitedJpegInPlace(unsigned char newMaxLength, unsigned char oldMaxLength, unsigned int histNumBits[]) { if (newMaxLength <= 1) return 0; if (newMaxLength > oldMaxLength) return 0; if (newMaxLength == oldMaxLength) return newMaxLength; // iterate over all "excessive" bit lengths, beginning with the longest unsigned char i = oldMaxLength; while (i > newMaxLength) { // no codes at this bit length ? if (histNumBits[i] == 0) { i--; continue; } // look for codes that are at least two bits shorter unsigned char j = i - 2; while (j > 0 && histNumBits[j] == 0) j--; // change bit length of 2 of the longest codes histNumBits[i] -= 2; // one code becomes one bit shorter // (using the joint prefix of the old two codes) histNumBits[i - 1]++; // the other code has length j+1 now // but another, not-yet-involved, code with length j // is moved to bit length j+1 as well histNumBits[j + 1] += 2; histNumBits[j]--; } // return longest code length that is still used while (i > 0 && histNumBits[i] == 0) i--; // JPEG Annex K.3 specifies an extra line: // histNumBits[i]--; // => because JPEG needs a special symbol to avoid 0xFF in its output stream return i; }

MiniZ's Length-Limiting Algorithm

The DEFLATE file format imposes a 16 bit limit in its Huffman stage.
Unlike JPEG there is no official length-limiting procedure so that programmers came up with different (but somehow very similar) code.
If found the most easily identifiable code was MiniZ's huffman_enforce_max_code_size() (written by Rich Geldreich).
gzip and its zlib are more popular but the way they limit code lengths is somewhat hidden and interweaved with other functionality.
As far as I understand both MiniZ's and zlib's length-limiting behaves identically.

While MiniZ's basic idea is comparable to JPEG's (for most input they produce identical output), it's much faster:
instead of slowly "eating" one bit away at a time from long codes, it immediately shortens all (!) long codes to the new maxLength limit.
Now the Kraft-McMillan inequality might be violated so that a few codes need to be fixed:
  1. select a code x with maximum length (=maxLength)
  2. select another code y that is shorter, if possible maxLength-1 bits (even shorter codes work, too, but are inefficient)
  3. make y one bit longer
  4. assign x the same length as the "new" y
  5. we just reduced the Kraft-McMillan sum by (at least) 2^-maxLength
  6. if it still exceeds 1 then continue with step 1
Note: it's possible (and quite likely !) that the selected code x gets assigned the same code length it had before.
For example, if newMaxLength is 16 and you have codes with 15 bits, too.
Now if you select a 16 bit code and pick a second 15 bit code then that 15 bit code becomes a 16 bit code and the 16 bit code stays a 15+1=16 bit code.

The Kraft-McMillan computations can be performed with integers if we multiply everything by 2^maxLength:
2^-maxLength * 2^maxLength = 1, thus each iteration "saves" one unit and the total sum must be below 1 * 2^maxLength = 2^maxLength.

If a histogram of code lengths is used instead of code length themselves (just like the JPEG algorithm) then the inner loop is just 50 lines, and most of it are comments:
hide limitedMinizInPlace unsigned char limitedMinizInPlace(unsigned char newMaxLength, unsigned char oldMaxLength, unsigned int histNumBits[]) { if (newMaxLength <= 1) return 0; if (newMaxLength > oldMaxLength) return 0; if (newMaxLength == oldMaxLength) return newMaxLength; // my allround variable for various loops unsigned int i; // move all oversized code lengths to the longest allowed for (i = newMaxLength + 1; i <= oldMaxLength; i++) { histNumBits[newMaxLength] += histNumBits[i]; histNumBits[i] = 0; } // compute Kraft sum // (using integer math: everything is multiplied by 2^newMaxLength) unsigned long long total = 0; for (i = newMaxLength; i > 0; i--) total += histNumBits[i] << (newMaxLength - i); // iterate until Kraft sum doesn't exceed 1 anymore unsigned long long one = 1ULL << newMaxLength; while (total > one) { // select a code with maximum length, it will be moved histNumBits[newMaxLength]--; // find a second code with shorter length for (i = newMaxLength - 1; i > 0; i--) if (histNumBits[i] > 0) { histNumBits[i]--; // extend that shorter code by one bit // and assign the same code length to the selected one histNumBits[i + 1] += 2; break; } // moving these codes reduced the Kraft sum total--; } return newMaxLength; }

BZip2's Length-Limiting Algorithm

bzip2 is a weird compression concept. One of its several stages is a Huffman compression stage.
Instead of playing around with the Kraft-McMillan inequality it completely relies on a fast Huffman encoder and trial'n'error: There are many possibilities how to implement step 3. bzip2 divides by two and clears the lower 8 bits of each counter (and adds 1 to avoid zeros).
While being very fast, it may produce sub-optimal Huffman codes. Sometimes dividing by 3 gives better results but that totally depends on your input data.

The core loop is:
hide bzip2 // run Moffat algorithm ... unsigned char result = moffatSortedInPlace(numNonZero, sorted); // ... until a proper maximum code length is found while (result > maxLength) { // more or less divide each weight by two while avoiding zero for (i = 0; i < numNonZero; i++) { unsigned int weight = mapping[i].key; // bzip2 "clears" the lowest 8 bits of the histogram // to reach the length limit in less iterations // but sacrifices lots of efficiency // if you set EXTRA_SHIFT to 0 then the code may need more iterations // but finds much better code lengths //#define EXTRA_SHIFT 8 #define EXTRA_SHIFT 0 // sometimes dividing the weight by a bigger integer (e.g. 3) // may lead to more efficient prefix codes #define DIVIDE_BY 2 // the shifts do more harm than good in my opinion weight >>= EXTRA_SHIFT; // adding 1 avoids zero weight = 1 + (weight / DIVIDE_BY); weight <<= EXTRA_SHIFT; // sorted was overwritten with code lengths mapping[i].key = sorted[i] = weight; } // again: run Moffat algorithm result = moffatSortedInPlace(numNonZero, sorted); }
The compression efficiency is in almost all cases worse that Package-Merge/JPEG/MiniZ. Moreover, it's slower than JPEG/MiniZ so there is no real reason choosing bzip2's approach.

Kraft-McMillan Approximations

The goal of each presented length-limiting algorithm is producing a code that is compatible to Huffman decoders.
Aside from Package-Merge, they all need to generate a plain Huffman code and then adjust overlong symbols.

I already mentioned the Kraft-McMillan inequality which states that each code where sum(2^-codeLength[0...numCodes]) <= 1 is a prefix-free code.
That means each such code can be decoded by Huffman decoders, too.

I came across a few postings, mainly by Charles Bloom, discussing how solely focussing on the Kraft-McMillan inequality can be both fast and close to optimal.
However, I made a few important changes to speed up things:
  1. compute theoretical number of bits for each symbol: entropy = -log2(count / total)
  2. round up or down to closest integer, this is our initial guess for the code length
  3. compute Kraft-McMillan sum of all symbols' code lengths
Now we have a rough estimation. If the Kraft-McMillan sum doesn't exceed 1 then this is actually a proper prefix-free code (but most likely it's sub-optimal).
If that sum is above 1 then we need to modify (=increase) a few code lengths. The main idea is modifying those code lengths which "hurt the least".
When we rounded to the closest integer, many symbols rounded down: their actual code length is smaller than their theoretical bit length.

Assume you have a symbol which represents 9% of all symbols. Then its entropy is -log2(0.09) ≈ 3.474 → rounded to 3. There is a "gain" of 3.474 - 3 = 0.474.
"Gain" means that we used 0.474 less bits per symbol occurrence than the theoretical minimum, we saved some bits.
If we change that symbol's code length to 4 (assuming it's allowed by the length limit), then there will be a "loss" of 3.474 - 4 = -0.526.
"Loss" means that we used 0.526 less bits per symbol occurrence than the theoretical minimum, we wasted some bits.
The Kraft-McMillan sum will be lower though, hopefully now below 1, at the cost of slightly worse compression.

Let's say there is another symbol representing 10% of all symbols, thus being a little bit more frequent than the previous symbol.
It's entropy is -log2(0.1) ≈ 3.322, rounded to 3 as well. Therefore it's "gain" is 3.322 - 3 = 0.322 and potential "loss" after adding one bit is 3.322 - 4 = -0.678.
Both symbols affect the Kraft-McMillan sum in the same way, but the latter symbol will cause even worse compression.

You have probably noticed that adding one bit to symbol which had a gain of x leads to a loss of 1-x.
So the main idea is: always look for symbols with maximum gain to cause minimal loss after adding a bit.

I implemented two different strategies to modify the code lengths, called A and B.
And yes, I got tired of the long name "Kraft-McMillan inequality" and just called it "Kraft" (which is the nice German word for "power"/"strength"/"force").

Kraft - Strategy A

This algorithm works with several passes. A threshold, initially a small value, decreases after each pass.
Walking through all symbols, we look for symbols where the difference between theoretical bit length and actual bit length is above that threshold ("high gain").
One bit will be added and the Kraft sum adjusted. As soon as the Kraft sum isn't larger than 1 anymore we have a valid prefix-free code.

It's possible to "overshoot" by extending too many previously very short codes - which have a high influence on the Kraft sum.
If we are well below a Kraft sum of 1 then a single pass looks for codes which can be made one bit shorter without violating the Kraft sum.
In many cases this last pass isn't needed and can be skipped.

The algorithm works pretty well. However, the number of passes hugely depends on the input data.
If the initial threshold is too high and/or shrinks only a little after each pass then a large number of passes will be required.
If the initial threshold is too low and/or shrinks a lot after each pass then only few passes will be required.
While the latter sounds much better, it will miss many optimal code lengths and produce sub-optimal results.
I didn't find a perfect one-suits-all configuration. The default INITIAL_THRESHOLD = 28 / 64.0 and STEP_THRESHOLD = 1 / 64.0 worked well in my experiments.

Kraft - Strategy B

Strategy A makes a few guesses regarding the threshold. Moreover, it repeatedly scans symbols which need not to be modified.
My strategy B implements a max-queue. Its first element is always the symbol with the highest gain.
Therefore we don't need scan all symbols but always pick from the top of that max-queue.
After a symbol was adjusted, there is a need to re-insert it into the max-queue.

One of the most efficient data structures for such access pattern is a heap. Most standard libraries come with a heap implementation (such as C++).
Roughly half of the code of limitedkraftheap.c is a pretty generic and straightforward heap implementation.

Similar to strategy A, we might "overshoot" so that a fast fix-up exists as well.

There is no major difference between strategy A and B concerning the generated code lengths.
However, strategy B is much faster (typically by a factor of 4) and even outperformes an efficient Huffman implementation.
It's interesting to note that the length limit has a very significant impact on the execution time: lower limits are faster.

License

This code is licensed under the zlib License:
This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software. Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions: 1. The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required. 2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software. 3. This notice may not be removed or altered from any source distribution.zlib License
Git users: scroll down to the repository link
Download  packagemerge.h
Latest release: February 5, 2021, size: 1668 bytes, 30 lines

CRC32: 9f50f131
MD5: 9c4358817e4f951186dc6c87866ab727
SHA1: 5533b6f2fa58c109c590ae4f33665ac7d3e4b896
SHA256:beb1da8cdce260cb540ab871c5f98d46b596e3956d61324fd1567739492384f8

Download  packagemerge.c
Latest release: June 29, 2021, size: 11.0 kBytes, 338 lines

CRC32: b70bd1e5
MD5: e826eda6c202bfa2fa22daf9e3a5a246
SHA1: fc47f5ef18d5c82996032b3d3d9c83a85df8f940
SHA256:2dec5ba9d228eeb44bab2f90312518f546475a65982566857bc8ffe402c42a34

Download  moffat.h
Latest release: February 5, 2021, size: 2208 bytes, 41 lines

CRC32: 979b7a5c
MD5: 264867286a2a10419e77ad7e2da9ee69
SHA1: 2cebbac1ae3c7f53ee7562ed73b9e68ca4987dff
SHA256:04e45e82ad87c9bb938d7e68b9a39346dbe6d1c8ff9e5859894be43ec62ba406

Download  moffat.c
Latest release: February 10, 2021, size: 4436 bytes, 175 lines

CRC32: a5935e36
MD5: db5f8486daebbbe7179c2ed964bd7a80
SHA1: b688a68b47c8ef654e3271a78152a5382c480820
SHA256:1ec47ed8df361a5cb8ef847b94dc1be85fccfaaab2c08d7d56f4dd551d8b31eb

Download  limitedjpegdeflate.h
Latest release: February 5, 2021, size: 4835 bytes, 80 lines

CRC32: 161095e9
MD5: 715a76cfe7bb7d74e8c1e5f2fbcfb4d9
SHA1: e0f8bd3a6b3afcd6a793fa8daab8b7b34db92cf4
SHA256:e75a18a20fe66cee275a0d0ebe52cca3bf9ad14c92994c3a11d5a68651d98819

Download  limitedjpegdeflate.c
Latest release: February 10, 2021, size: 12.2 kBytes, 356 lines

CRC32: c503747c
MD5: e878141ab902742613f9863e97118ec0
SHA1: e6a180b59caf39df790cc047876392c65c114232
SHA256:0e85a8d136f07214b4203a4036a75a3f817e64f1edebf7028b3df55217924ed6

Download  limitedbzip2.h
Latest release: February 5, 2021, size: 1476 bytes, 29 lines

CRC32: c958c820
MD5: 5735c2438ffbd21540c640e48a8217fd
SHA1: e157bb74cc746609ae083ceb497bf4327f31f016
SHA256:39a411f02183cc34a8559f973724517a6d81c68c5259478f0a05fdb8fb66658a

Download  limitedbzip2.c
Latest release: February 10, 2021, size: 4050 bytes, 131 lines

CRC32: 821a4f0f
MD5: af0a84f35b32310ec33ad6da058ff3e4
SHA1: a1eff36323c9917f62f0cfd1ee000388c44a4c1e
SHA256:a8bfb557d645a1419df9fe071682d9be88af71c82905bddfb0f3d10e50964a78

Download  limitedkraft.h
Latest release: February 5, 2021, size: 745 bytes, 17 lines

CRC32: cb6d4745
MD5: 9f5615706645e6d591769cd9385c90bd
SHA1: e8e526e2e32b16197ec77c593f476230221217a6
SHA256:8ee50def030ba1dd2383febc970496c1f7b9cfe824774d133b5f75cc0741b641

Download  limitedkraft.c
Latest release: February 5, 2021, size: 6.9 kBytes, 224 lines

CRC32: fddbc6e9
MD5: a210caef2d00f1be7c9da18b193df501
SHA1: 648b90061f069d74beb41decdcd7ed28c4eaffef
SHA256:545e870024999e917bd52347419872458178f61645853e5360bf04f35e2aa63c

Download  limitedkraftheap.h
Latest release: February 5, 2021, size: 753 bytes, 17 lines

CRC32: 7a4c7aa3
MD5: f920b31d7c745895287ae07e357b0737
SHA1: 7bf007989f4ee0ec9e6af857038fa52bf7a5b59f
SHA256:c12082e6be08b9028d80ad2f148ecfb780ea4daf9ee01b9569e7f30553dba489

Download  limitedkraftheap.c
Latest release: February 5, 2021, size: 9.7 kBytes, 342 lines

CRC32: 0a9bb4f3
MD5: 1d5fcb52464eb0ae0922ec43dddf862f
SHA1: 99e13c8494a78e5c4d602abb124da76e2064d049
SHA256:3a4559d076c85703d1936b4ee09a9d1b2fbc71ab73ddf20ee9805ab29420fc56

Stay up-to-date:git clone https://create.stephan-brumme.com/length-limited-prefix-codes/git

If you encounter any bugs/problems or have ideas for improving future versions, please write me an email: create@stephan-brumme.com

Changelog

homepage