Length-Limited Prefix Codes
posted by Stephan Brumme
Overview
The most common technique to generate prefix-free codes is the Huffman algorithm. It is part of many popular compression algorithms such as JPEG and DEFLATE (ZIP / GZIP / PNG / etc.).
In most cases it serves as an "afterburner" to compress pre-processed symbols.
The main advantage - variable-length codes - can be a critical drawback, too: if the maximum code length exceeds a certain threshold, e.g. the CPU's integer size (16 / 32 / 64 bit), then the compression/decompression code becomes complex and slows down.
All of the aforementioned file formats put a strict upper limit on the maximum length of Huffman codes.
Mainly due to their age and low number of distinct symbols, they all settled on similar limits: 16 bits per code for JPEG and 15 for DEFLATE.
So-called length-limited prefix codes are not Huffman codes anymore because they are not optimal whenever the length restriction actually takes effect.
The loss in compression efficiency often turns out to be negligible and is widely accepted due to the significant gain in decompression performance.
There are a number of algorithms to create such length-limited prefix codes. They vary concerning speed, efficiency and complexity.
I wrote a C library implementing the most common techniques. Most of the code derives from open source programs and was invented by others.
My major contribution was to create a shared interface so that these algorithms are easily interchangeable:
unsigned char algorithmName(unsigned char maxLength, unsigned int numCodes, const unsigned int histogram[], unsigned char codeLengths[]);

There are three input parameters:
- maxLength is the upper limit of the number of bits for each prefix code
- numCodes is the number of distinct symbols (including unused symbols) → size of your alphabet
- histogram is an array of numCodes counters how often each symbol is present
- codeLengths will contain the bit length of each symbol

The return value is the longest bit length - or zero if the algorithm failed.
For example, if your input data is

0 3 1 1 2 5 2 2

then numCodes = 6 and histogram[] = { 1,2,3,1,0,1 };
However, this shared interface comes with a little bit of overhead: sometimes doubling code size and/or execution time.
Therefore most files have a second public function which is more specialized for its algorithm but may have a few additional restrictions.
A common case is that the histogram has to be sorted and unused symbols removed.
Currently implemented are:
 | Package-Merge | JPEG | MiniZ | BZip2 | Kraft |
---|---|---|---|---|---|
Author | Larmore and Hirschberg | ITU T.81 standard, annex K.3 | Rich Geldreich | Julian Seward | inspired by Charles Bloom and Jörn Engel |
Pro | optimal output | shortest code | similar to JPEG but faster | trivial to understand | often fastest |
Contra | slowest and most complex | mediocre compression | mediocre compression | worst compression | wildly fluctuating compression |
Description | scroll down | scroll down | scroll down | scroll down | strategy A and strategy B |
Usage
Each algorithm runs independently of the others and there are no external dependencies. Some algorithms require pre-computed Huffman codes - for those cases I added a slightly modified version of Andrew Moffat's efficient in-place Huffman code.
If you want to try out an algorithm then all you have to do is:
Compared to the original Huffman algorithm, a few codes became shorter and a few became longer.
#include "packagemerge.h"
#include <stdio.h>
int main()
{
// let's assume you have 7 symbols
unsigned int numCodes = 7;
// first one occurs 270 times, second 20x, third 10x, fourth not at all, fifth once, sixth 6x and seventh once
unsigned int histogram[7] = { 270,20,10,0,1,6,1 };
// at most 4 bits
unsigned char maxLength = 4;
// output: bits per code
unsigned char codeLengths[7];
// compute length-limited codes
unsigned char longest = packageMerge(maxLength, numCodes, histogram, codeLengths);
// display result
for (unsigned int i = 0; i < numCodes; i++)
printf("code %d occurs %d times => %d bits\n", i, histogram[i], codeLengths[i]);
printf("max. code length is %d\n", longest);
return 0;
}
The resulting numbers of bits (codeLengths) are:
Symbol | 0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|---|
Count | 270 | 20 | 10 | 0 | 1 | 6 | 1 | sum = 308 |
Huffman | 1 | 2 | 3 | 0 | 5 | 4 | 5 | sum = 374 bits |
maxLength = 4 | 1 | 2 | 4 | 0 | 4 | 4 | 4 | sum = 382 bits |
maxLength = 3 | 2 | 2 | 3 | 0 | 3 | 3 | 3 | sum = 634 bits |
There are several methods to convert code lengths to actual prefix-free codes. I find canonical Huffman codes to be the most suitable: a simple and fast algorithm.
Using the example data we get these canonical codes:
Symbol | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|---|
Count | 270 | 20 | 10 | 0 | 1 | 6 | 1 |
Huffman | 0 | 10 | 110 | - | 11110 | 1110 | 11111 |
maxLength = 4 | 0 | 10 | 1100 | - | 1101 | 1110 | 1111 |
maxLength = 3 | 00 | 01 | 100 | - | 101 | 110 | 111 |
canonical.c:
#include "packagemerge.h"
#include <stdio.h>
// create canonical codes, for simplicity longest code must not exceed number of bits of an unsigned int (32 or 64)
void canonical(unsigned int numCodes, unsigned char codeLengths[], unsigned int codes[])
{
// count how often each code length is present
unsigned int numLengths[256] = { 0 };
unsigned int i;
for (i = 0; i < numCodes; i++)
numLengths[codeLengths[i]]++;
// precompute the first code of each code length
unsigned int next[256];
// there's no code with length 0 and the first 1-bit code will be a zero
next[1] = 0;
for (i = 2; i < 256; i++)
next[i] = (next[i - 1] + numLengths[i - 1]) << 1;
// now assign all codes
for (i = 0; i < numCodes; i++)
if (codeLengths[i] > 0) // unused symbols (length 0) get no code
codes[i] = next[codeLengths[i]]++;
}
int main()
{
// let's assume you have 7 symbols
unsigned int numCodes = 7;
// first one occurs 270 times, second 20x, third 10x, fourth not at all, fifth once, sixth 6x and seventh once
unsigned int histogram[7] = { 270,20,10,0,1,6,1 };
// at most 4 bits
unsigned char maxLength = 4;
// output: bits per code
unsigned char codeLengths[7];
// compute length-limited codes
unsigned char longest = packageMerge(maxLength, numCodes, histogram, codeLengths);
// create canonical codes
unsigned int codes[7];
canonical(numCodes, codeLengths, codes);
// display result
unsigned int i;
for (i = 0; i < numCodes; i++)
{
printf("code %d occurs %d times\t=> %d bits", i, histogram[i], codeLengths[i]);
// binary representation of canonical code
printf("\t=> ");
if (codeLengths[i] > 0)
{
int shift;
for (shift = codeLengths[i] - 1; shift >= 0; shift--)
printf(codes[i] & (1 << shift) ? "1" : "0");
}
printf("\n");
}
printf("max. code length is %d\n", longest);
return 0;
}
Benchmark
Here are a few results from the first 64k bytes of enwik, measured on a Core i7 / GCC x64:

time ./benchmark 1 12 100000
- where 1 is the algorithm's ID and was between 0 and 6
- limiting the code length to at most 12 bits - note: the unadjusted Huffman codes have up to 16 bits
- each algorithm ran 100000 times
algorithm | ID | total size | compression ratio | execution time |
---|---|---|---|---|
Huffman | 0 | 326,892 bits | 62.35% | 0.54 s |
Package-Merge | 1 | 327,721 bits | 62.51% | 1.17 s |
MiniZ | 2 | 328,456 bits | 62.65% | 0.58 s |
JPEG | 3 | 328,456 bits | 62.65% | 0.61 s |
BZip2 | 4 | 328,887 bits | 62.73% | 0.88 s |
Kraft | 5 | 327,895 bits | 62.54% | 1.72 s |
modified Kraft | 6 | 327,942 bits | 62.55% | 0.41 s |
Changing the code length limit significantly influences both execution time and compression ratio.
I picked the two "best" algorithms, used the same data set and repeated each procedure 100,000 times:
length limit | Package-Merge | Kraft Strategy B |
---|---|---|
8 bits | 70.47%, 0.96 s | 70.76%, 0.24 s |
9 bits | 65.30%, 1.02 s | 65.31%, 0.24 s |
10 bits | 63.49%, 1.07 s | 63.79%, 0.31 s |
11 bits | 62.80%, 1.14 s | 62.84%, 0.37 s |
12 bits | 62.51%, 1.17 s | 62.55%, 0.40 s |
13 bits | 62.40%, 1.22 s | 62.43%, 0.34 s |
14 bits | 62.36%, 1.25 s | 62.42%, 0.40 s |
15 bits | 62.35%, 1.29 s | 62.42%, 0.66 s |
16 bits | 62.35%, 1.35 s | 62.42%, 0.70 s |
Package-Merge
Package-Merge is an algorithm that guarantees to produce optimal length-limited prefix codes. Its name is based on the concept of putting numbers into packages and merging these.
Unfortunately, the original paper by Lawrence L. Larmore and Daniel S. Hirschberg isn't very accessible.
Moreover, further information is pretty scarce on the internet: the Wikipedia entry doesn't help a lot.
The best introduction - in my opinion - is Sebastian Gesemann's Bachelor's Thesis but it's available in German only.
The diagrams on page 24 and 26 walk through an example and page 29 gives pseudo code.
There are two prerequisites:
- remove all unused symbols before running Package-Merge
- and symbols must be sorted in ascending order of their frequency
Let's process the sample data shown above:
Symbol | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|---|
Count | 270 | 20 | 10 | 0 | 1 | 6 | 1 |
(the order of symbols with the same count doesn't matter)
Symbol | 4 | 6 | 5 | 2 | 1 | 0 |
---|---|---|---|---|---|---|
Count | 1 | 1 | 6 | 10 | 20 | 270 |
Naturally you can't have an active symbol with zero bits; the minimum code length is 1 bit.
Package-Merge adds one bit in each iteration so there is a minimum of two iterations (2+1=3) but for whatever reason I decided to have this example with maxLength=4.
Now we add these counts pairwise, i.e. we build packages. If there is an odd number of symbols then the last (which is the largest) is discarded:
Symbol | 4 | 6 | 5 | 2 | 1 | 0 |
---|---|---|---|---|---|---|
Count | 1 | 1 | 6 | 10 | 20 | 270 |
Pairwise Sum | 2 | 16 | 290 |
If a package is identical to a count of the original sequence then that item from the original sequence comes first.
In the next table packages are shown in a bold font:
Count | 1 | 1 | **2** | 6 | 10 | **16** | 20 | 270 | **290** |
---|---|---|---|---|---|---|---|---|---|
Therefore we need to repeat that procedure two more times to reach our goal maxLength=4.
However, there is one little pitfall: pairs are made using the sequence from the previous step but you always merge them with the original sequence.
That last sentence is important and was probably my most critical insight when I learned about the Package-Merge algorithm.
Okay, let's create packages for the second bit:
Count | 1 | 1 | **2** | 6 | 10 | **16** | 20 | 270 | **290** |
---|---|---|---|---|---|---|---|---|---|
Pairwise Sum | 2 | 8 | 26 | 290 |
The four values in the lower row are merged with the original sequence:
Count | 1 | 1 | **2** | 6 | **8** | 10 | 20 | **26** | 270 | **290** |
---|---|---|---|---|---|---|---|---|---|---|
Count | 1 | 1 | **2** | 6 | **8** | 10 | 20 | **26** | 270 | **290** |
---|---|---|---|---|---|---|---|---|---|---|
Pairwise Sum | 2 | 8 | 18 | 46 | 560 |
Count | 1 | 1 | **2** | 6 | **8** | 10 | **18** | 20 | **46** | 270 | **560** |
---|---|---|---|---|---|---|---|---|---|---|---|
Original | 1 | 1 | 6 | 10 | 20 | 270 | | | | | |
---|---|---|---|---|---|---|---|---|---|---|---|
Iteration 1 | 1 | 1 | **2** | 6 | 10 | **16** | 20 | 270 | **290** | | |
Iteration 2 | 1 | 1 | **2** | 6 | **8** | 10 | 20 | **26** | 270 | **290** | |
Iteration 3 | 1 | 1 | **2** | 6 | **8** | 10 | **18** | 20 | **46** | 270 | **560** |
The actual values in the table don't matter - it's only relevant whether a value is a package (bold) or not (plain).
Code lengths for all symbols are initially set to zero.
If a value isn't bold then the code length of a symbol has to be incremented.
The first iteration processes the first 2*numCodes-2 = 10 values of the bottom row, that means 1,1,...,270 but not 560.
Two counters symbol and numMerged are set to zero. For each table entry a decision has to be made:
- if that value is a package (bold) then increment numMerged
- otherwise increment codeLengths[symbol] and then symbol

The next iteration will then process only the first 2*numMerged table entries.
Let's go step-by-step through the example - as mentioned before we start with the bottom row.
The code length of all symbols was set to zero.
(array index) | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
Symbol | 4 | 6 | 5 | 2 | 1 | 0 |
Count | 1 | 1 | 6 | 10 | 20 | 270 |
Code Length (initially) | 0 | 0 | 0 | 0 | 0 | 0 |
Code Length (after processing bottom row) | 1 | 1 | 1 | 1 | 1 | 1 |
- a 1 which wasn't bold → codeLength[0] = 1 and then symbol = 1
- another 1 which wasn't bold → codeLength[1] = 1 and then symbol = 2
- a 2 which is bold → numMerged = 1
- a 6 which wasn't bold → codeLength[2] = 1 and then symbol = 3
- an 8 which is bold → numMerged = 2
- a 10 which wasn't bold → codeLength[3] = 1 and then symbol = 4
- an 18 which is bold → numMerged = 3
- a 20 which wasn't bold → codeLength[4] = 1 and then symbol = 5
- a 46 which is bold → numMerged = 4
- a 270 which wasn't bold → codeLength[5] = 1 and then symbol = 6
Because the final value of numMerged is 4, we look at the first 4*2 = 8 values of the second-to-last row.
Its content was (just a copy, I struck through irrelevant entries):
Iteration 2 | 1 | 1 | **2** | 6 | **8** | 10 | 20 | **26** |
---|---|---|---|---|---|---|---|---|
That means numMerged = 3 and the code length of all-but-the-last symbol becomes 2.
That last symbol, which had the highest count, suddenly gets assigned a lower code length than all the other symbols.
Again, a copy of phase 1's iteration 1 (where only the first 2*numMerged = 6 values will be handled):
Iteration 1 | 1 | 1 | **2** | 6 | 10 | **16** |
---|---|---|---|---|---|---|
You probably already understand that the first four symbols' code lengths need to be incremented and we will look at the first 2*numMerged = 4 numbers of the final row (the top row).

Original | 1 | 1 | 6 | 10 |
---|---|---|---|---|
The first four symbols' code lengths are incremented again. Let's take a look at the code lengths of each step:
(array index) | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
Symbol | 4 | 6 | 5 | 2 | 1 | 0 |
Count | 1 | 1 | 6 | 10 | 20 | 270 |
Code Length (initially) | 0 | 0 | 0 | 0 | 0 | 0 |
Code Length (after processing bottom row / iteration 3) | 1 | 1 | 1 | 1 | 1 | 1 |
Code Length (after processing iteration 2) | 2 | 2 | 2 | 2 | 2 | 1 |
Code Length (after processing iteration 1) | 3 | 3 | 3 | 3 | 2 | 1 |
Code Length (after processing top row) | 4 | 4 | 4 | 4 | 2 | 1 |
No symbol's number of bits is higher than maxLength=4 and the Kraft-McMillan inequality holds true as well: 2^-4 + 2^-4 + 2^-4 + 2^-4 + 2^-2 + 2^-1 = 1.
The algorithm is guaranteed to create optimal code lengths for a prefix-free code. No other combination of code lengths will be better.
I left out any theoretical stuff, such as proofs etc., and skipped practical optimizations to keep things short and sweet.
Take a look at my code (click below) or read the scientific papers to learn more details.
packagemerge.c:
// //////////////////////////////////////////////////////////
// packagemerge.c
// written by Stephan Brumme, 2021
// see https://create.stephan-brumme.com/length-limited-prefix-codes/
//
#include "packagemerge.h"
#include <stdlib.h> // malloc/free/qsort
// ----- package-merge algorithm -----
// to me the best explanation is Sebastian Gesemann's Bachelor Thesis (in German only / University of Paderborn, 2004)
// data types (switching to unsigned int is faster but fails if sum(histogram) > 2^31 or maxLength > 31)
typedef unsigned long long BitMask;
typedef unsigned long long HistItem;
/// compute limited prefix code lengths based on Larmore/Hirschberg's package-merge algorithm
/** - histogram must be in ascending order and no entry must be zero
* - the function rejects maxLength > 63 but I don't see any practical reasons you would need a larger limit ...
* @param maxLength maximum code length, e.g. 15 for DEFLATE or 16 for JPEG
* @param numCodes number of codes, equals the array size of histogram and codeLength
* @param A [in] how often each code/symbol was found [out] computed code lengths
* @result actual maximum code length, 0 if error
*/
unsigned char packageMergeSortedInPlace(unsigned char maxLength, unsigned int numCodes, unsigned int A[])
{
// skip zeros
while (numCodes > 0 && A[0] == 0)
{
numCodes--;
A++;
}
// at least one code needs to be in use
if (numCodes == 0 || maxLength == 0)
return 0;
// one or two codes are always encoded with a single bit
if (numCodes <= 2)
{
A[0] = 1;
if (numCodes == 2)
A[1] = 1;
return 1;
}
// A[] is an input parameter (stores the histogram) as well as
// an output parameter (stores the code lengths)
const unsigned int* histogram = A;
unsigned int* codeLengths = A;
// my allround variable for various loops
unsigned int i;
// check maximum bit length
if (maxLength > 8*sizeof(BitMask) - 1) // 8*8-1 = 63
return 0;
// at least log2(numCodes) bits required for every valid prefix code
unsigned long long encodingLimit = 1ULL << maxLength;
if (encodingLimit < numCodes)
return 0;
// need two buffers to process iterations and an array of bitmasks
unsigned int maxBuffer = 2 * numCodes;
// allocate memory
HistItem* current = (HistItem*) malloc(sizeof(HistItem) * maxBuffer);
HistItem* previous = (HistItem*) malloc(sizeof(HistItem) * maxBuffer);
BitMask* isMerged = (BitMask*) malloc(sizeof(BitMask) * maxBuffer);
// initial value of "previous" is a plain copy of the sorted histogram
for (i = 0; i < numCodes; i++)
previous[i] = histogram[i];
unsigned int numPrevious = numCodes;
// no need to initialize "current", it's completely rebuilt every iteration
// keep track which packages are merged (compact bitmasks):
// if package p was merged in iteration i then (isMerged[p] & (1 << i)) != 0
for (i = 0; i < maxBuffer; i++)
isMerged[i] = 0; // there are no merges before the first iteration
// the last 2 packages are irrelevant
unsigned int numRelevant = 2 * numCodes - 2;
// ... and preparation is finished
// //////////////////////////////////////////////////////////////////////
// iterate through potential bit lengths while packaging and merging pairs
// (step 1 of the algorithm)
// - the histogram is sorted (prerequisite of the function)
// - the output must be sorted, too
// - thus we have to copy the histogram and every now and then insert a new package
// - the code keeps track of the next package and compares it to
// the next item to be copied from the history
// - the smaller value is chosen (if equal, the histogram is chosen)
// - a bitmask named isMerged is used to keep track which items were packages
// - repeat until the whole histogram was copied and all packages inserted
// bitmask for isMerged
BitMask mask = 1;
unsigned char bits;
for (bits = maxLength - 1; bits > 0; bits--)
{
// ignore last element if numPrevious is odd (can't be paired)
numPrevious &= ~1; // bit-twiddling trick to clear the lowest bit, same as numPrevious -= numPrevious % 2
// first merged package
current[0] = histogram[0]; // a sum can't be smaller than its parts
current[1] = histogram[1]; // therefore it's impossible to find a package at index 0 or 1
HistItem sum = current[0] + current[1]; // same as previous[0] + previous[1]
// copy histogram and insert merged sums whenever possible
unsigned int numCurrent = 2; // current[0] and current[1] were already set
unsigned int numHist = numCurrent; // we took them from the histogram
unsigned int numMerged = 0; // but so far no package inserted (however, it's precomputed in "sum")
for (;;) // stop/break is inside the loop
{
// the next package isn't better than the next histogram item ?
if (numHist < numCodes && histogram[numHist] <= sum)
{
// copy histogram item
current[numCurrent++] = histogram[numHist++];
continue;
}
// okay, we have a package being smaller than next histogram item
// mark output value as being "merged", i.e. a package
isMerged[numCurrent] |= mask;
// store package
current [numCurrent] = sum;
numCurrent++;
// already finished last package ?
numMerged++;
if (numMerged * 2 >= numPrevious)
break;
// precompute next sum
sum = previous[numMerged * 2] + previous[numMerged * 2 + 1];
}
// make sure every code from the histogram is included
// (relevant if histogram is very skewed with a few outliers)
while (numHist < numCodes)
current[numCurrent++] = histogram[numHist++];
// prepare next mask
mask <<= 1;
// performance tweak: abort as soon as "previous" and "current" are identical
if (numPrevious >= numRelevant) // ... at least their relevant elements
{
// basically a bool: FALSE == 0, TRUE == 1
char keepGoing = 0;
// compare both arrays: if they are identical then stop
for (i = numRelevant - 1; i > 0; i--) // collisions are most likely at the end
if (previous[i] != current[i])
{
keepGoing++;
break;
}
// early exit ?
if (keepGoing == 0)
break;
}
// swap pointers "previous" and "current"
HistItem* tmp = previous;
previous = current;
current = tmp;
// no need to swap their sizes because only numCurrent needed in next iteration
numPrevious = numCurrent;
}
// shifted one bit too far
mask >>= 1;
// keep only isMerged
free(previous);
free(current);
// //////////////////////////////////////////////////////////////////////
// tracking all merges will produce the code lengths
// (step 2 of the algorithm)
// - analyze each bitlength's mask in isMerged:
// * a "pure" symbol => increase bitlength of that symbol
// * a merged code => just increase counter
// - stop if no more merged codes found
// - if m merged codes were found then only examine
// the first 2*m elements in the next iteration
// (because only they formed these merged codes)
// reset code lengths
for (i = 0; i < numCodes; i++)
codeLengths[i] = 0;
// start with analyzing the first 2n-2 values
unsigned int numAnalyze = numRelevant;
while (mask != 0) // stops if nothing but symbols are found in an iteration
{
// number of merged packages seen so far
unsigned int numMerged = 0;
// the first two elements must be symbols, they can't be packages
codeLengths[0]++;
codeLengths[1]++;
unsigned int symbol = 2;
// look at packages
for (i = symbol; i < numAnalyze; i++)
{
// check bitmask: not merged if bit is 0
if ((isMerged[i] & mask) == 0)
{
// we have a single non-merged symbol, which needs to be one bit longer
codeLengths[symbol]++;
symbol++;
}
else
{
// we have a merged package, so that its parts need to be checked next iteration
numMerged++;
}
}
// look only at those values responsible for merged packages
numAnalyze = 2 * numMerged;
// note that the mask was originally slowly shifted left by the merging loop
mask >>= 1;
}
// last iteration can't have any merges
for (i = 0; i < numAnalyze; i++)
codeLengths[i]++;
// it's a free world ...
free(isMerged);
// first symbol has the longest code because it's the least frequent in the sorted histogram
return codeLengths[0];
}
// the following code is almost identical to function moffat() in moffat.c
// helper struct for qsort()
struct KeyValue
{
unsigned int key;
unsigned int value;
};
// helper function for qsort()
static int compareKeyValue(const void* a, const void* b)
{
struct KeyValue* aa = (struct KeyValue*) a;
struct KeyValue* bb = (struct KeyValue*) b;
// negative if a < b, zero if a == b, positive if a > b
if (aa->key < bb->key)
return -1;
if (aa->key > bb->key)
return +1;
return 0;
}
/// same as before but histogram can be in any order and may contain zeros, the output is stored in a dedicated parameter
/** - the function rejects maxLength > 63 but I don't see any practical reasons you would need a larger limit ...
* @param maxLength maximum code length, e.g. 15 for DEFLATE or 16 for JPEG
* @param numCodes number of codes, equals the array size of histogram and codeLength
* @param histogram how often each code/symbol was found
* @param codeLength [out] computed code lengths
* @result actual maximum code length, 0 if error
*/
unsigned char packageMerge(unsigned char maxLength, unsigned int numCodes, const unsigned int histogram[], unsigned char codeLengths[])
{
// my allround variable for various loops
unsigned int i;
// reset code lengths
for (i = 0; i < numCodes; i++)
codeLengths[i] = 0;
// count non-zero histogram values
unsigned int numNonZero = 0;
for (i = 0; i < numCodes; i++)
if (histogram[i] != 0)
numNonZero++;
// reject an empty alphabet because malloc(0) is undefined
if (numNonZero == 0)
return 0;
// allocate a buffer for sorting the histogram
struct KeyValue* mapping = (struct KeyValue*) malloc(sizeof(struct KeyValue) * numNonZero);
// copy histogram to that buffer
unsigned int storeAt = 0;
for (i = 0; i < numCodes; i++)
{
// skip zeros
if (histogram[i] == 0)
continue;
mapping[storeAt].key = histogram[i];
mapping[storeAt].value = i;
storeAt++;
}
// now storeAt == numNonZero
// invoke C standard library's qsort
qsort(mapping, numNonZero, sizeof(struct KeyValue), compareKeyValue);
// extract ascendingly ordered histogram
unsigned int* sorted = (unsigned int*) malloc(sizeof(unsigned int) * numNonZero);
for (i = 0; i < numNonZero; i++)
sorted[i] = mapping[i].key;
// run package-merge algorithm
unsigned char result = packageMergeSortedInPlace(maxLength, numNonZero, sorted);
// "unsort" code lengths
for (i = 0; i < numNonZero; i++)
codeLengths[mapping[i].value] = sorted[i];
// let it go ...
free(sorted);
free(mapping);
return result;
}
Kraft-McMillan Inequality
In one of the last sentences I mentioned the Kraft-McMillan inequality (sometimes McMillan is forgotten).
Its core message is: if sum(2^-codeLength[0...numCodes]) <= 1 then a prefix-free code exists.
Ideally you try to have sum(2^-codeLength[0...numCodes]) = 1 ("equal" instead of "less-than or equal") because anything else is inefficient.
The Huffman algorithm always gives you sum = 1. The Package-Merge algorithm has the same property.
The following length-limiting techniques modify code lengths in such a way that the maximum code length is enforced ("limited") while the inequality still holds true.
Assume that you ran the Huffman algorithm and found that a few codes are too long. You know that sum = 1.
An important insight is that making one of those long codes just one bit shorter increases the sum, thus violating the inequality:

newSum = sum - 2^-oldLength + 2^-newLength = sum - 2^-oldLength + 2^-(oldLength-1) = sum + 2^-oldLength

However, if there is another symbol with a code length shorter than your desired maxLength, we can make it one bit longer to offset our change.
The formula is almost the same (but plus one instead of minus one in the exponent).
Pretty much all algorithms mentioned in the following chapters pair long with short codes such that the new length limit is respected and the inequality holds true.
JPEG's Length-Limiting Approach
This is the shortest algorithm. It can be found in Annex K.3 of the JPEG standard ITU T.81 along with a nice flow diagram. You can download that document from w3.org, see page 147 of that PDF.
First you create a standard Huffman code. If the maximum code length doesn't exceed your limit then you are obviously done.
If you look at all codes having the longest code length then there is always an even number of them
(if that wasn't the case then one symbol could be made one bit shorter because its last bit is unused → an odd number must be an error in your Huffman implementation).
You repeat this procedure until reaching maxLength:
- pick two symbols x and y having that longest bit length
- the canonical Huffman code of x and y will be almost identical: x ends with 0 and y ends with 1
- let's call their shared prefix P
- we remove x's trailing 0, so that x becomes P, being one bit shorter
- however, y is invalid now
- let's pick a third symbol z which is shorter than the new x (that means at least two bits shorter than the still invalid y)
- appending a 0 to z makes it one bit longer
- appending a 1 to z creates a completely new code which can be used for y
- we made 1 symbol shorter by exactly 1 bit (x)
- we made 1 symbol shorter by at least 1 bit (y)
- we made 1 symbol longer by exactly 1 bit (z)
- from a mathematical point of view the sum of all Kraft values must be <= 1.0
- we know that bits(x) = bits(y)
- if bits(z) = bits(x) - 2 then sum(Kraft) = sum(Kraft')
- because bits(x') = bits(x) - 1
- and bits(y') = bits(y) - 1
- and bits(z') = bits(z) + 1
- so that bits(z') = bits(x') = bits(y')
- let's forget about all symbols except x, y and z so that we have:

sum(K) = K(x) + K(y) + K(z)
= 2^-bits(x) + 2^-bits(y) + 2^-bits(z)
= 2^-bits(x) + 2^-bits(x) + 2^-(bits(x) - 2)
= 2 * 2^-bits(x) + 4 * 2^-bits(x)
= 6 * 2^-bits(x)
sum(K') = K(x') + K(y') + K(z')
= 3 * K(x')
= 3 * 2^-bits(x')
= 3 * 2^-(bits(x) - 1)
= 6 * 2^-bits(x)
= sum(K)

- therefore the sum of all Kraft values remains unchanged after the transformation
Unfortunately, processing a huge alphabet with very large bit lengths can be quite slow.
However, this situation is impossible with the small JPEG alphabet (typically 162 symbols with an upper code length limit of 16).
The code simplifies and becomes faster if we transform a histogram of code lengths instead of the code lengths themselves.
That means you run a single pass over all code lengths and count how many symbols have one bits, two bits, three bits, etc.
The inner loop of my function limitedJpegInPlace is surprisingly short:
unsigned char limitedJpegInPlace(unsigned char newMaxLength, unsigned char oldMaxLength, unsigned int histNumBits[])
{
if (newMaxLength <= 1)
return 0;
if (newMaxLength > oldMaxLength)
return 0;
if (newMaxLength == oldMaxLength)
return newMaxLength;
// iterate over all "excessive" bit lengths, beginning with the longest
unsigned char i = oldMaxLength;
while (i > newMaxLength)
{
// no codes at this bit length ?
if (histNumBits[i] == 0)
{
i--;
continue;
}
// look for codes that are at least two bits shorter
unsigned char j = i - 2;
while (j > 0 && histNumBits[j] == 0)
j--;
// change bit length of 2 of the longest codes
histNumBits[i] -= 2;
// one code becomes one bit shorter
// (using the joint prefix of the old two codes)
histNumBits[i - 1]++;
// the other code has length j+1 now
// but another, not-yet-involved, code with length j
// is moved to bit length j+1 as well
histNumBits[j + 1] += 2;
histNumBits[j]--;
}
// return longest code length that is still used
while (i > 0 && histNumBits[i] == 0)
i--;
// JPEG Annex K.3 specifies an extra line:
// histNumBits[i]--;
// => because JPEG needs a special symbol to avoid 0xFF in its output stream
return i;
}
MiniZ's Length-Limiting Algorithm
The DEFLATE file format imposes a 15 bit limit in its Huffman stage.
Unlike JPEG there is no official length-limiting procedure, so programmers came up with different (but somehow very similar) code.
I found the most easily identifiable code to be MiniZ's huffman_enforce_max_code_size() (written by Rich Geldreich).
gzip and its zlib are more popular but the way they limit code lengths is somewhat hidden and interwoven with other functionality.
As far as I understand, both MiniZ's and zlib's length-limiting behaves identically.
While MiniZ's basic idea is comparable to JPEG's (for most input they produce identical output), it's much faster:
instead of slowly "eating" away one bit at a time from long codes, it immediately shortens all (!) long codes to the new maxLength limit.
Now the Kraft-McMillan inequality might be violated so that a few codes need to be fixed:
- select a code x with maximum length (= maxLength)
- select another code y that is shorter, if possible maxLength-1 bits (even shorter codes work, too, but are inefficient)
- make y one bit longer
- assign x the same length as the "new" y
- we just reduced the Kraft-McMillan sum by (at least) 2^-maxLength
- if it still exceeds 1 then continue with step 1
For example, assume newMaxLength is 16 and you have codes with 15 bits, too.
If you select a 16 bit code and pick a second 15 bit code then that 15 bit code becomes a 16 bit code and the selected 16 bit code stays a 15+1=16 bit code.
The Kraft-McMillan computations can be performed with integers if we multiply everything by 2^maxLength: 2^-maxLength * 2^maxLength = 1, thus each iteration "saves" one unit and the total sum must not exceed 1 * 2^maxLength = 2^maxLength.
If a histogram of code lengths is used instead of the code lengths themselves (just like the JPEG algorithm) then the inner loop is just 50 lines, and most of them are comments:
unsigned char limitedMinizInPlace(unsigned char newMaxLength, unsigned char oldMaxLength, unsigned int histNumBits[])
{
  if (newMaxLength <= 1)
    return 0;
  if (newMaxLength > oldMaxLength)
    return 0;
  if (newMaxLength == oldMaxLength)
    return newMaxLength;

  // my allround variable for various loops
  unsigned int i;

  // move all oversized code lengths to the longest allowed
  for (i = newMaxLength + 1; i <= oldMaxLength; i++)
  {
    histNumBits[newMaxLength] += histNumBits[i];
    histNumBits[i] = 0;
  }

  // compute Kraft sum
  // (using integer math: everything is multiplied by 2^newMaxLength)
  unsigned long long total = 0;
  for (i = newMaxLength; i > 0; i--)
    total += (unsigned long long)histNumBits[i] << (newMaxLength - i);

  // iterate until the Kraft sum doesn't exceed 1 anymore
  unsigned long long one = 1ULL << newMaxLength;
  while (total > one)
  {
    // select a code with maximum length, it will be moved
    histNumBits[newMaxLength]--;

    // find a second code with a shorter length
    for (i = newMaxLength - 1; i > 0; i--)
      if (histNumBits[i] > 0)
      {
        histNumBits[i]--;
        // extend that shorter code by one bit
        // and assign the same code length to the selected one
        histNumBits[i + 1] += 2;
        break;
      }

    // moving these codes reduced the Kraft sum
    total--;
  }
  return newMaxLength;
}
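To see the integer bookkeeping in action, here is a self-contained sketch with a hypothetical kraftSum helper (not part of the library): truncating Huffman lengths 1,2,3,4,4 to a 3 bit limit yields a scaled Kraft sum of 9 > 2^3 = 8, and a single fix-up step (the 2 bit code grows to 3 bits, matching the selected 3 bit code) brings it down to exactly 8.

```c
#include <assert.h>

// histNumBits[i] = number of codes that are exactly i bits long (index 0 unused);
// returns the Kraft-McMillan sum scaled by 2^maxLength, so a valid
// prefix code gives a result of at most 1ULL << maxLength
static unsigned long long kraftSum(const unsigned int histNumBits[], unsigned char maxLength)
{
  unsigned long long total = 0;
  unsigned int i;
  for (i = 1; i <= maxLength; i++)
    total += (unsigned long long)histNumBits[i] << (maxLength - i);
  return total;
}

// five symbols with Huffman lengths 1,2,3,4,4 truncated to three bits: lengths 1,2,3,3,3
static unsigned int truncated[4] = { 0, 1, 1, 3 };
// after one fix-up step: the 2 bit code was extended to 3 bits,
// the selected 3 bit code keeps its 2+1 = 3 bits
static unsigned int fixedUp[4]  = { 0, 1, 0, 4 };
```
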
BZip2's Length-Limiting Algorithm
bzip2 is a weird compression concept. One of its several stages is a Huffman compression stage. Instead of playing around with the Kraft-McMillan inequality, it completely relies on a fast Huffman encoder and trial and error:
- compute Huffman codes for your alphabet
- if the maximum code length doesn't exceed maxLength then you are done
- otherwise reduce each symbol's count (but avoid becoming zero)
- and go back to step 1
While being very fast, it may produce code lengths that are sub-optimal for the actual data. Sometimes dividing by 3 instead of 2 gives better results, but that totally depends on your input data.
The core loop is:
bzip2
The compression efficiency is in almost all cases worse than Package-Merge/JPEG/MiniZ. Moreover, it's slower than the JPEG/MiniZ approach, so there is no real reason to choose bzip2's scheme.
// run Moffat algorithm ...
unsigned char result = moffatSortedInPlace(numNonZero, sorted);
// ... until a proper maximum code length is found
while (result > maxLength)
{
  // more or less divide each weight by two while avoiding zero
  for (i = 0; i < numNonZero; i++)
  {
    unsigned int weight = mapping[i].key;

    // bzip2 "clears" the lowest 8 bits of the histogram
    // to reach the length limit in fewer iterations
    // but sacrifices a lot of efficiency;
    // if you set EXTRA_SHIFT to 0 then the code may need more iterations
    // but finds much better code lengths
    //#define EXTRA_SHIFT 8
    #define EXTRA_SHIFT 0
    // sometimes dividing the weight by a bigger integer (e.g. 3)
    // may lead to more efficient prefix codes
    #define DIVIDE_BY 2

    // the shifts do more harm than good in my opinion
    weight >>= EXTRA_SHIFT;
    // adding 1 avoids zero
    weight = 1 + (weight / DIVIDE_BY);
    weight <<= EXTRA_SHIFT;

    // sorted was overwritten with code lengths
    mapping[i].key = sorted[i] = weight;
  }
  // again: run Moffat algorithm
  result = moffatSortedInPlace(numNonZero, sorted);
}
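The weight transform itself can be isolated into a tiny helper (reduceWeight is a made-up name for illustration) to see why the extra shift hurts: with a shift of 8 the reduced weight keeps only its top bits, with a shift of 0 it keeps full precision.

```c
#include <assert.h>

// bzip2-style weight reduction: roughly divide by divideBy while never
// reaching zero; extraShift discards the weight's lowest bits first
// (no overflow checks, just a sketch of the transform above)
static unsigned int reduceWeight(unsigned int weight, unsigned int extraShift, unsigned int divideBy)
{
  weight >>= extraShift;
  weight = 1 + (weight / divideBy);
  weight <<= extraShift;
  return weight;
}
```
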
Kraft-McMillan Approximations
The goal of each presented length-limiting algorithm is producing a code that is compatible with Huffman decoders. Aside from Package-Merge, they all need to generate a plain Huffman code first and then adjust overlong symbols.
I already mentioned the Kraft-McMillan inequality which states that each code where sum(2^-codeLength[0...numCodes]) <= 1 is a prefix-free code. That means each such code can be decoded by Huffman decoders, too.
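A minimal integer-math check of that inequality might look like this (isKraftValid and the two sample arrays are made up for illustration, not library code):

```c
#include <assert.h>

// returns 1 if sum(2^-codeLengths[i]) <= 1, i.e. a prefix-free code
// with these lengths exists; unused symbols have length 0 and are skipped;
// integer math: everything is scaled by 2^maxLength
static int isKraftValid(const unsigned char codeLengths[], unsigned int numCodes, unsigned char maxLength)
{
  unsigned long long total = 0;
  unsigned int i;
  for (i = 0; i < numCodes; i++)
    if (codeLengths[i] > 0)
      total += 1ULL << (maxLength - codeLengths[i]);
  return total <= (1ULL << maxLength);
}

static const unsigned char okLengths[4]  = { 1, 2, 3, 3 }; // 1/2 + 1/4 + 1/8 + 1/8 = 1
static const unsigned char badLengths[4] = { 1, 2, 2, 3 }; // 1/2 + 1/4 + 1/4 + 1/8 > 1
```
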
I came across a few postings, mainly by Charles Bloom, discussing how solely focusing on the Kraft-McMillan inequality can be both fast and close to optimal.
However, I made a few important changes to speed things up:
- compute the theoretical number of bits for each symbol: entropy = -log2(count / total)
- using a super-fast approximation instead of C's log functions
- round up or down to the closest integer, this is our initial guess for the code length
- but never round down to zero
- if above our length limit, then set it to that limit
- compute the Kraft-McMillan sum of all symbols' code lengths
If that sum is above 1 then we need to modify (=increase) a few code lengths. The main idea is modifying those code lengths which "hurt the least".
When we rounded to the closest integer, many symbols rounded down: their actual code length is smaller than their theoretical bit length.
Assume you have a symbol which represents 9% of all symbols. Then its entropy is -log2(0.09) ≈ 3.474, rounded to 3. There is a "gain" of 3.474 - 3 = 0.474.
"Gain" means that we used 0.474 fewer bits per symbol occurrence than the theoretical minimum, we saved some bits.
If we change that symbol's code length to 4 (assuming it's allowed by the length limit), then there will be a "loss" of 3.474 - 4 = -0.526.
"Loss" means that we used 0.526 more bits per symbol occurrence than the theoretical minimum, we wasted some bits.
The Kraft-McMillan sum will be lower though, hopefully now below 1, at the cost of slightly worse compression.
Let's say there is another symbol representing 10% of all symbols, thus being a little bit more frequent than the previous one.
Its entropy is -log2(0.1) ≈ 3.322, rounded to 3 as well. Therefore its "gain" is 3.322 - 3 = 0.322 and the potential "loss" after adding one bit is 3.322 - 4 = -0.678. Both symbols affect the Kraft-McMillan sum in the same way, but the latter symbol will cause even worse compression.
You have probably noticed that adding one bit to a symbol which had a gain of x leads to a loss of 1-x.
So the main idea is: always look for symbols with maximum gain to cause minimal loss after adding a bit.
I implemented two different strategies to modify the code lengths, called A and B.
And yes, I got tired of the long name "Kraft-McMillan inequality" and just called it "Kraft" (which is the nice German word for "power"/"strength"/"force").
Kraft - Strategy A
This algorithm works in several passes. A threshold, initially a small value, decreases after each pass. Walking through all symbols, we look for symbols where the difference between theoretical bit length and actual bit length is above that threshold ("high gain").
One bit will be added and the Kraft sum adjusted. As soon as the Kraft sum isn't larger than 1 anymore we have a valid prefix-free code.
It's possible to "overshoot" by extending too many previously very short codes - which have a high influence on the Kraft sum.
If we are well below a Kraft sum of 1 then a single pass looks for codes which can be made one bit shorter without violating the Kraft sum.
In many cases this last pass isn't needed and can be skipped.
The algorithm works pretty well. However, the number of passes hugely depends on the input data.
If the initial threshold is too high and/or shrinks only a little after each pass then a large number of passes will be required.
If the initial threshold is too low and/or shrinks a lot after each pass then only a few passes will be required.
While the latter sounds much better, it will miss many optimal code lengths and produce sub-optimal results.
I didn't find a perfect one-size-fits-all configuration. The defaults INITIAL_THRESHOLD = 28 / 64.0 and STEP_THRESHOLD = 1 / 64.0 worked well in my experiments.
Kraft - Strategy B
Strategy A makes a few guesses regarding the threshold. Moreover, it repeatedly scans symbols which need not be modified. My strategy B implements a max-queue: its first element is always the symbol with the highest gain.
Therefore we don't need to scan all symbols but always pick from the top of that max-queue.
After a symbol has been adjusted, it needs to be re-inserted into the max-queue.
One of the most efficient data structures for such an access pattern is a heap. Most standard libraries come with a heap implementation (C++'s, for example).
Roughly half of the code of limitedkraftheap.c is a pretty generic and straightforward heap implementation.
Similar to strategy A, we might "overshoot" so that a fast fix-up exists as well.
There is no major difference between strategy A and B concerning the generated code lengths.
However, strategy B is much faster (typically by a factor of 4) and even outperforms an efficient Huffman implementation.
It's interesting to note that the length limit has a very significant impact on the execution time: lower limits are faster.
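A minimal binary max-heap in C, roughly the kind of structure limitedkraftheap.c builds (this sketch stores only the gain values in a fixed-size array without overflow checks; the heapPush/heapPop names are made up, not the library's):

```c
#include <assert.h>

#define MAX_HEAP 256
static double       heapData[MAX_HEAP];
static unsigned int heapSize = 0;

// insert a value and sift it up until its parent is larger
static void heapPush(double value)
{
  unsigned int pos = heapSize++;
  heapData[pos] = value;
  while (pos > 0 && heapData[(pos - 1) / 2] < heapData[pos])
  {
    double tmp = heapData[pos];
    heapData[pos] = heapData[(pos - 1) / 2];
    heapData[(pos - 1) / 2] = tmp;
    pos = (pos - 1) / 2;
  }
}

// remove and return the largest value, then sift the new root down
static double heapPop(void)
{
  double top = heapData[0];
  unsigned int pos = 0;
  heapData[0] = heapData[--heapSize];
  for (;;)
  {
    unsigned int child = 2 * pos + 1;
    if (child >= heapSize)
      break;
    if (child + 1 < heapSize && heapData[child + 1] > heapData[child])
      child++; // pick the larger of the two children
    if (heapData[pos] >= heapData[child])
      break;
    double tmp = heapData[pos];
    heapData[pos] = heapData[child];
    heapData[child] = tmp;
    pos = child;
  }
  return top;
}
```

The adjust-and-reinsert step of strategy B then becomes: pop the symbol with the highest gain, add one bit to its code length, and push it back with its gain reduced by 1.
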
License
This code is licensed under the zlib License:
This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software. Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions:
1. The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.
Source Code
Git users: scroll down to the repository link.
Latest release: February 5, 2021, size: 1668 bytes, 30 lines
CRC32:
9f50f131
MD5:
9c4358817e4f951186dc6c87866ab727
SHA1:
5533b6f2fa58c109c590ae4f33665ac7d3e4b896
SHA256:
beb1da8cdce260cb540ab871c5f98d46b596e3956d61324fd1567739492384f8
Latest release: June 29, 2021, size: 11.0 kBytes, 338 lines
CRC32:
b70bd1e5
MD5:
e826eda6c202bfa2fa22daf9e3a5a246
SHA1:
fc47f5ef18d5c82996032b3d3d9c83a85df8f940
SHA256:
2dec5ba9d228eeb44bab2f90312518f546475a65982566857bc8ffe402c42a34
Latest release: February 5, 2021, size: 2208 bytes, 41 lines
CRC32:
979b7a5c
MD5:
264867286a2a10419e77ad7e2da9ee69
SHA1:
2cebbac1ae3c7f53ee7562ed73b9e68ca4987dff
SHA256:
04e45e82ad87c9bb938d7e68b9a39346dbe6d1c8ff9e5859894be43ec62ba406
Latest release: February 10, 2021, size: 4436 bytes, 175 lines
CRC32:
a5935e36
MD5:
db5f8486daebbbe7179c2ed964bd7a80
SHA1:
b688a68b47c8ef654e3271a78152a5382c480820
SHA256:
1ec47ed8df361a5cb8ef847b94dc1be85fccfaaab2c08d7d56f4dd551d8b31eb
Latest release: February 5, 2021, size: 4835 bytes, 80 lines
CRC32:
161095e9
MD5:
715a76cfe7bb7d74e8c1e5f2fbcfb4d9
SHA1:
e0f8bd3a6b3afcd6a793fa8daab8b7b34db92cf4
SHA256:
e75a18a20fe66cee275a0d0ebe52cca3bf9ad14c92994c3a11d5a68651d98819
Latest release: February 10, 2021, size: 12.2 kBytes, 356 lines
CRC32:
c503747c
MD5:
e878141ab902742613f9863e97118ec0
SHA1:
e6a180b59caf39df790cc047876392c65c114232
SHA256:
0e85a8d136f07214b4203a4036a75a3f817e64f1edebf7028b3df55217924ed6
Latest release: February 5, 2021, size: 1476 bytes, 29 lines
CRC32:
c958c820
MD5:
5735c2438ffbd21540c640e48a8217fd
SHA1:
e157bb74cc746609ae083ceb497bf4327f31f016
SHA256:
39a411f02183cc34a8559f973724517a6d81c68c5259478f0a05fdb8fb66658a
Latest release: February 10, 2021, size: 4050 bytes, 131 lines
CRC32:
821a4f0f
MD5:
af0a84f35b32310ec33ad6da058ff3e4
SHA1:
a1eff36323c9917f62f0cfd1ee000388c44a4c1e
SHA256:
a8bfb557d645a1419df9fe071682d9be88af71c82905bddfb0f3d10e50964a78
Latest release: February 5, 2021, size: 745 bytes, 17 lines
CRC32:
cb6d4745
MD5:
9f5615706645e6d591769cd9385c90bd
SHA1:
e8e526e2e32b16197ec77c593f476230221217a6
SHA256:
8ee50def030ba1dd2383febc970496c1f7b9cfe824774d133b5f75cc0741b641
Latest release: February 5, 2021, size: 6.9 kBytes, 224 lines
CRC32:
fddbc6e9
MD5:
a210caef2d00f1be7c9da18b193df501
SHA1:
648b90061f069d74beb41decdcd7ed28c4eaffef
SHA256:
545e870024999e917bd52347419872458178f61645853e5360bf04f35e2aa63c
Latest release: February 5, 2021, size: 753 bytes, 17 lines
CRC32:
7a4c7aa3
MD5:
f920b31d7c745895287ae07e357b0737
SHA1:
7bf007989f4ee0ec9e6af857038fa52bf7a5b59f
SHA256:
c12082e6be08b9028d80ad2f148ecfb780ea4daf9ee01b9569e7f30553dba489
Latest release: February 5, 2021, size: 9.7 kBytes, 342 lines
CRC32:
0a9bb4f3
MD5:
1d5fcb52464eb0ae0922ec43dddf862f
SHA1:
99e13c8494a78e5c4d602abb124da76e2064d049
SHA256:
3a4559d076c85703d1936b4ee09a9d1b2fbc71ab73ddf20ee9805ab29420fc56
Stay up-to-date:
git clone https://create.stephan-brumme.com/length-limited-prefix-codes/git
If you encounter any bugs/problems or have ideas for improving future versions, please write me an email: create@stephan-brumme.com
Changelog
- version 1
- June 30, 2021
- initial release