# Length-Limited Prefix Codes

posted by Stephan Brumme

## Overview

The most common technique to generate prefix-free codes is the Huffman algorithm. It is part of many popular compression algorithms such as JPEG and DEFLATE (ZIP / GZIP / PNG / etc.).

In most cases it serves as an "afterburner" to compress pre-processed symbols.

The main advantage - variable-length codes - can be a critical drawback, too: if the maximum code length exceeds a certain threshold, e.g. the CPU's integer size (16 / 32 / 64 bit), then the compression/decompression code becomes complex and slows down.

All of the aforementioned file formats put a strict upper limit on the maximum length of Huffman codes.

Mainly due to their age and low number of distinct symbols, they all agreed on 16 bits per code.

So-called length-limited prefix codes are not Huffman codes anymore because they are not optimal (whenever the length restriction actually takes effect).

The loss in compression efficiency often turns out to be negligible and is widely accepted because of the significant gain in decompression performance.

There are a number of algorithms to create such length-limited prefix codes. They vary in speed, efficiency and complexity.

I wrote a C library implementing the most common techniques. Most of the code derives from open source programs and was invented by others.

My major contribution was to create a shared interface so that these algorithms are easily interchangeable:

**interface**
```c
unsigned char algorithmName(unsigned char maxLength, unsigned int numCodes, const unsigned int histogram[], unsigned char codeLengths[]);
```

There are three input parameters:

- `maxLength` is the upper limit of the number of bits for each prefix code
- `numCodes` is the number of distinct symbols (including unused symbols) → size of your alphabet
- `histogram` is an array of `numCodes` counters how often each symbol is present

One output parameter:

- `codeLengths` will contain the bit length of each symbol

The return value is the longest bit length, or zero if the algorithm failed.

For example, if your input data is `0 3 1 1 2 5 2 2` then `numCodes = 6` and `histogram[] = { 1,2,3,1,0,1 };`

However, this shared interface comes with a little bit of overhead: sometimes doubling code size and/or execution time.

Therefore most files have a second public function which is more specialized for its algorithm but may have a few additional restrictions.

A common case is that the histogram has to be sorted and unused symbols removed.

Currently implemented are:

| | Package-Merge | JPEG | MiniZ | BZip2 | Kraft |
|---|---|---|---|---|---|
| Author | Larmore and Hirschberg | ITU T.81 standard, annex K.3 | Rich Geldreich | Julian Seward | inspired by Charles Bloom and Jörn Engel |
| Pro | optimal output | shortest code | similar to JPEG but faster | trivial to understand | often fastest |
| Contra | slowest and most complex | mediocre compression | mediocre compression | worst compression | wildly fluctuating compression |
| Description | scroll down | scroll down | scroll down | scroll down | strategy A and strategy B |

## Usage

Each algorithm runs independently of the others and there are no external dependencies. Some algorithms require pre-computed Huffman codes; for those cases I added a slightly modified version of Andrew Moffat's efficient in-place Huffman code.

If you want to try out an algorithm then all you have to do is:

**Usage**
```c
#include "packagemerge.h"
#include <stdio.h>

int main()
{
  // let's assume you have 7 symbols
  unsigned int numCodes = 7;
  // first one occurs 270 times, second 20x, third 10x, fourth not at all, fifth once, sixth 6x and seventh once
  unsigned int histogram[7] = { 270,20,10,0,1,6,1 };
  // at most 4 bits
  unsigned char maxLength = 4;
  // output: bits per code
  unsigned char codeLengths[7];
  // compute length-limited codes
  unsigned char longest = packageMerge(maxLength, numCodes, histogram, codeLengths);

  // display result
  unsigned int i;
  for (i = 0; i < numCodes; i++)
    printf("code %d occurs %d times => %d bits\n", i, histogram[i], codeLengths[i]);
  printf("max. code length is %d\n", longest);
  return 0;
}
```

Compared to the original Huffman algorithm, a few codes became shorter and a few became longer.

The number of bits / `codeLengths` are:

| Symbol | 0 | 1 | 2 | 3 | 4 | 5 | 6 | |
|---|---|---|---|---|---|---|---|---|
| Count | 270 | 20 | 10 | 0 | 1 | 6 | 1 | sum = 308 |
| Huffman | 1 | 2 | 3 | 0 | 5 | 4 | 5 | sum = 374 bits |
| `maxLength = 4` | 1 | 2 | 4 | 0 | 4 | 4 | 4 | sum = 382 bits |
| `maxLength = 3` | 2 | 2 | 3 | 0 | 3 | 3 | 3 | sum = 634 bits |

There are several methods to convert code lengths to actual prefix-free codes. I find canonical Huffman codes to be the most suitable: a simple and fast algorithm.

Using the example data we get these canonical codes:

| Symbol | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| Count | 270 | 20 | 10 | 0 | 1 | 6 | 1 |
| Huffman | `0` | `10` | `110` | - | `11110` | `1110` | `11111` |
| `maxLength = 4` | `0` | `10` | `1100` | - | `1101` | `1110` | `1111` |
| `maxLength = 3` | `00` | `01` | `100` | - | `101` | `110` | `111` |

**canonical.c**
```c
#include "packagemerge.h"
#include <stdio.h>

// create canonical codes; for simplicity the longest code must not exceed the number of bits of an unsigned int (32 or 64)
void canonical(unsigned int numCodes, unsigned char codeLengths[], unsigned int codes[])
{
  // count how often each code length is present
  unsigned int numLengths[256] = { 0 };
  unsigned int i;
  for (i = 0; i < numCodes; i++)
    numLengths[codeLengths[i]]++;

  // precompute the first code of each code length
  unsigned int next[256];
  // dummy entry so that unused symbols (code length 0) don't read uninitialized memory
  next[0] = 0;
  // there's no code with length 0 and the first 1-bit code will be a zero
  next[1] = 0;
  for (i = 2; i < 256; i++)
    next[i] = (next[i - 1] + numLengths[i - 1]) << 1;

  // now assign all codes
  for (i = 0; i < numCodes; i++)
    codes[i] = next[codeLengths[i]]++;
}

int main()
{
  // let's assume you have 7 symbols
  unsigned int numCodes = 7;
  // first one occurs 270 times, second 20x, third 10x, fourth not at all, fifth once, sixth 6x and seventh once
  unsigned int histogram[7] = { 270,20,10,0,1,6,1 };
  // at most 4 bits
  unsigned char maxLength = 4;
  // output: bits per code
  unsigned char codeLengths[7];
  // compute length-limited codes
  unsigned char longest = packageMerge(maxLength, numCodes, histogram, codeLengths);

  // create canonical codes
  unsigned int codes[7];
  canonical(numCodes, codeLengths, codes);

  // display result
  unsigned int i;
  for (i = 0; i < numCodes; i++)
  {
    printf("code %d occurs %d times\t=> %d bits", i, histogram[i], codeLengths[i]);
    // binary representation of canonical code
    printf("\t=> ");
    if (codeLengths[i] > 0)
    {
      int shift;
      for (shift = codeLengths[i] - 1; shift >= 0; shift--)
        printf(codes[i] & (1 << shift) ? "1" : "0");
    }
    printf("\n");
  }
  printf("max. code length is %d\n", longest);
  return 0;
}
```

## Benchmark

Here are a few results from the first 64k bytes of enwik, measured on a Core i7 / GCC x64:

`time ./benchmark 1 12 100000`

- where `1` is the algorithm's ID and was between `0` and `6`
- `12` limits the code length to at most 12 bits (**note:** the unadjusted Huffman codes have up to 16 bits)
- each algorithm ran `100000` times

| algorithm | ID | total size | compression ratio | execution time |
|---|---|---|---|---|
| Huffman | `0` | 326,892 bits | 62.35% | 0.54 s |
| Package-Merge | `1` | 327,721 bits | 62.51% | 1.17 s |
| MiniZ | `2` | 328,456 bits | 62.65% | 0.58 s |
| JPEG | `3` | 328,456 bits | 62.65% | 0.61 s |
| BZip2 | `4` | 328,887 bits | 62.73% | 0.88 s |
| Kraft | `5` | 327,895 bits | 62.54% | 1.72 s |
| modified Kraft | `6` | 327,942 bits | 62.55% | 0.41 s |

Changing the code length limit significantly influences both execution time and compression ratio.

I picked the two "best" algorithms, used the same data set and repeated each procedure 100,000 times:

| length limit | Package-Merge | Kraft Strategy B |
|---|---|---|
| 8 bits | 70.47%, 0.96 s | 70.76%, 0.24 s |
| 9 bits | 65.30%, 1.02 s | 65.31%, 0.24 s |
| 10 bits | 63.49%, 1.07 s | 63.79%, 0.31 s |
| 11 bits | 62.80%, 1.14 s | 62.84%, 0.37 s |
| 12 bits | 62.51%, 1.17 s | 62.55%, 0.40 s |
| 13 bits | 62.40%, 1.22 s | 62.43%, 0.34 s |
| 14 bits | 62.36%, 1.25 s | 62.42%, 0.40 s |
| 15 bits | 62.35%, 1.29 s | 62.42%, 0.66 s |
| 16 bits | 62.35%, 1.35 s | 62.42%, 0.70 s |

## Package-Merge

Package-Merge is an algorithm that is guaranteed to produce optimal length-limited prefix codes. Its name is based on the concept of putting numbers in packages and merging these.

Unfortunately, the original paper by Lawrence L. Larmore and Daniel S. Hirschberg isn't very accessible.

Moreover, further information is pretty scarce on the internet: the Wikipedia entry doesn't help a lot.

The best introduction - in my opinion - is Sebastian Gesemann's Bachelor's Thesis but it's available in German only.

The diagrams on page 24 and 26 walk through an example and page 29 gives pseudo code.

There are two prerequisites:

- remove all unused symbols before running Package-Merge
- and symbols must be sorted in ascending order of their frequency

Let's process the sample data shown above:

| Symbol | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| Count | 270 | 20 | 10 | 0 | 1 | 6 | 1 |

After removing the unused symbol, the remaining symbols are sorted by their count (the order of symbols with the same count doesn't matter):

| Symbol | 4 | 6 | 5 | 2 | 1 | 0 |
|---|---|---|---|---|---|---|
| Count | 1 | 1 | 6 | 10 | 20 | 270 |

Naturally you can't have an active symbol with zero bits; the minimum code length is 1 bit.

Package-Merge adds one bit in each iteration, so there is a minimum of two iterations; for whatever reason I decided to run this example with `maxLength=4`.

Now we add these counts pairwise to build **packages**. If there is an odd number of symbols then the last one (which is the largest) is discarded:

| Symbol | 4 | 6 | 5 | 2 | 1 | 0 |
|---|---|---|---|---|---|---|
| Count | 1 | 1 | 6 | 10 | 20 | 270 |
| Pairwise Sum | 2 | 16 | 290 | | | |

These packages are **merged** with the original sequence in ascending order.

If a package is identical to a count of the original sequence then that item from the original sequence comes first.

In the next table packages are shown in a bold font:

| Count | 1 | 1 | **2** | 6 | 10 | **16** | 20 | 270 | **290** |
|---|---|---|---|---|---|---|---|---|---|

Therefore we need to repeat that procedure two more times to reach our goal `maxLength=4`.

However, there is one little pitfall: pairs are made using the sequence from the *previous step*, but you always merge them with the *original* sequence.

That last sentence is important and was probably my most critical insight when I learned about the Package-Merge algorithm.

Okay, let's create packages for the second bit:

| Count | 1 | 1 | **2** | 6 | 10 | **16** | 20 | 270 | **290** |
|---|---|---|---|---|---|---|---|---|---|
| Pairwise Sum | 2 | 8 | 26 | 290 | | | | | |

The four values in the lower row are merged with the original sequence:

| Count | 1 | 1 | **2** | 6 | **8** | 10 | 20 | **26** | 270 | **290** |
|---|---|---|---|---|---|---|---|---|---|---|

One more round of pairwise sums for the third iteration:

| Count | 1 | 1 | **2** | 6 | **8** | 10 | 20 | **26** | 270 | **290** |
|---|---|---|---|---|---|---|---|---|---|---|
| Pairwise Sum | 2 | 8 | 18 | 46 | 560 | | | | | |

And merged again with the original sequence:

| Count | 1 | 1 | **2** | 6 | **8** | 10 | **18** | 20 | **46** | 270 | **560** |
|---|---|---|---|---|---|---|---|---|---|---|---|
---|

All rows at a glance (packages in bold):

| Original | 1 | 1 | 6 | 10 | 20 | 270 | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Iteration 1 | 1 | 1 | **2** | 6 | 10 | **16** | 20 | 270 | **290** | | |
| Iteration 2 | 1 | 1 | **2** | 6 | **8** | 10 | 20 | **26** | 270 | **290** | |
| Iteration 3 | 1 | 1 | **2** | 6 | **8** | 10 | **18** | 20 | **46** | 270 | **560** |

The actual values in the table don't matter - it's only relevant whether a value is a package (bold) or not (plain).

Code lengths for all symbols are initially set to zero.

If a value isn't bold then the code length of a symbol has to be incremented.

The first iteration processes the first `2*numCodes-2 = 10` values of the bottom row, that means `1,1,...,270` but not `560`.

Two counters `symbol` and `numMerged` are set to zero. For each table entry a decision has to be made:

- if that value is a package (bold) then increment `numMerged`
- otherwise increment `codeLengths[symbol]` and then `symbol`

The row above it will then be restricted to its first `2*numMerged` table entries.

Let's go step-by-step through the example - as mentioned before we start with the bottom row.

The code length of all symbols was set to zero.

| (array index) | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| Symbol | 4 | 6 | 5 | 2 | 1 | 0 |
| Count | 1 | 1 | 6 | 10 | 20 | 270 |
| Code Length (initially) | 0 | 0 | 0 | 0 | 0 | 0 |
| Code Length (after processing bottom row) | 1 | 1 | 1 | 1 | 1 | 1 |

- a 1 which wasn't bold → `codeLength[0] = 1` and then `symbol = 1`
- another 1 which wasn't bold → `codeLength[1] = 1` and then `symbol = 2`
- a 2 which is bold → `numMerged = 1`
- a 6 which wasn't bold → `codeLength[2] = 1` and then `symbol = 3`
- an 8 which is bold → `numMerged = 2`
- a 10 which wasn't bold → `codeLength[3] = 1` and then `symbol = 4`
- an 18 which is bold → `numMerged = 3`
- a 20 which wasn't bold → `codeLength[4] = 1` and then `symbol = 5`
- a 46 which is bold → `numMerged = 4`
- a 270 which wasn't bold → `codeLength[5] = 1` and then `symbol = 6`

Because the final value of `numMerged` is `4`, we look at the first `4*2 = 8` values of the second-to-last row. Its content was (just a copy, I struck through irrelevant entries):

| Iteration 2 | 1 | 1 | **2** | 6 | **8** | 10 | 20 | **26** |
|---|---|---|---|---|---|---|---|---|

That means `numMerged = 3` and the code length of all-but-the-last symbol becomes 2. That last symbol, which had the highest count, suddenly gets assigned a lower code length than all the other symbols.

Again, a copy of phase 1's iteration 1 (where only `2*numMerged = 6` values will be handled):

| Iteration 1 | 1 | 1 | **2** | 6 | 10 | **16** |
|---|---|---|---|---|---|---|

You probably already understand that the first four symbols' code lengths need to be incremented and that we will look at the first `2*numMerged = 4` numbers of the final row (the top row):

| Original | 1 | 1 | 6 | 10 |
|---|---|---|---|---|

The first four symbols' code lengths are incremented again. Let's take a look at the code lengths after each step:

| (array index) | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| Symbol | 4 | 6 | 5 | 2 | 1 | 0 |
| Count | 1 | 1 | 6 | 10 | 20 | 270 |
| Code Length (initially) | 0 | 0 | 0 | 0 | 0 | 0 |
| Code Length (after processing bottom row / iteration 3) | 1 | 1 | 1 | 1 | 1 | 1 |
| Code Length (after processing iteration 2) | 2 | 2 | 2 | 2 | 2 | 1 |
| Code Length (after processing iteration 1) | 3 | 3 | 3 | 3 | 2 | 1 |
| Code Length (after processing top row) | 4 | 4 | 4 | 4 | 2 | 1 |

No symbol's number of bits is higher than `maxLength=4` and the Kraft-McMillan inequality holds true as well: `2^-4 + 2^-4 + 2^-4 + 2^-4 + 2^-2 + 2^-1 = 1`.

The algorithm is guaranteed to create optimal code lengths for a length-limited prefix-free code. No other combination of code lengths will be better.

I left out any theoretical stuff, such as proofs etc., and skipped practical optimizations to keep things short and sweet.

Take a look at my code (click below) or read the scientific papers to learn more details.

**packagemerge.c**
```c
// //////////////////////////////////////////////////////////
// packagemerge.c
// written by Stephan Brumme, 2021
// see https://create.stephan-brumme.com/length-limited-prefix-codes/
//

#include "packagemerge.h"
#include <stdlib.h> // malloc/free/qsort

// ----- package-merge algorithm -----

// to me the best explanation is Sebastian Gesemann's Bachelor Thesis (in German only / University of Paderborn, 2004)

// data types (switching to unsigned int is faster but fails if sum(histogram) > 2^31 or maxLength > 31)
typedef unsigned long long BitMask;
typedef unsigned long long HistItem;

/// compute limited prefix code lengths based on Larmore/Hirschberg's package-merge algorithm
/** - histogram must be in ascending order and no entry must be zero
 *  - the function rejects maxLength > 63 but I don't see any practical reasons you would need a larger limit ...
 *  @param maxLength maximum code length, e.g. 15 for DEFLATE or JPEG
 *  @param numCodes  number of codes, equals the array size of histogram and codeLengths
 *  @param A         [in] how often each code/symbol was found [out] computed code lengths
 *  @result actual maximum code length, 0 if error
 */
unsigned char packageMergeSortedInPlace(unsigned char maxLength, unsigned int numCodes, unsigned int A[])
{
  // skip zeros
  while (numCodes > 0 && A[0] == 0)
  {
    numCodes--;
    A++;
  }

  // at least one code needs to be in use
  if (numCodes == 0 || maxLength == 0)
    return 0;

  // one or two codes are always encoded with a single bit
  if (numCodes <= 2)
  {
    A[0] = 1;
    if (numCodes == 2)
      A[1] = 1;
    return 1;
  }

  // A[] is an input parameter (stores the histogram) as well as
  // an output parameter (stores the code lengths)
  const unsigned int* histogram = A;
  unsigned int* codeLengths = A;

  // my allround variable for various loops
  unsigned int i;

  // check maximum bit length
  if (maxLength > 8*sizeof(BitMask) - 1) // 8*8-1 = 63
    return 0;

  // at least log2(numCodes) bits required for every valid prefix code
  unsigned long long encodingLimit = 1ULL << maxLength;
  if (encodingLimit < numCodes)
    return 0;

  // need two buffers to process iterations and an array of bitmasks
  unsigned int maxBuffer = 2 * numCodes;
  // allocate memory
  HistItem* current  = (HistItem*) malloc(sizeof(HistItem) * maxBuffer);
  HistItem* previous = (HistItem*) malloc(sizeof(HistItem) * maxBuffer);
  BitMask*  isMerged = (BitMask*)  malloc(sizeof(BitMask)  * maxBuffer);

  // initial value of "previous" is a plain copy of the sorted histogram
  for (i = 0; i < numCodes; i++)
    previous[i] = histogram[i];
  unsigned int numPrevious = numCodes;
  // no need to initialize "current", it's completely rebuilt every iteration

  // keep track which packages are merged (compact bitmasks):
  // if package p was merged in iteration i then (isMerged[p] & (1 << i)) != 0
  for (i = 0; i < maxBuffer; i++)
    isMerged[i] = 0; // there are no merges before the first iteration

  // the last 2 packages are irrelevant
  unsigned int numRelevant = 2 * numCodes - 2;

  // ... and preparation is finished

  // //////////////////////////////////////////////////////////////////////
  // iterate through potential bit lengths while packaging and merging pairs
  // (step 1 of the algorithm)
  // - the histogram is sorted (prerequisite of the function)
  // - the output must be sorted, too
  // - thus we have to copy the histogram and every now and then insert a new package
  // - the code keeps track of the next package and compares it to
  //   the next item to be copied from the histogram
  // - the smaller value is chosen (if equal, the histogram is chosen)
  // - a bitmask named isMerged is used to keep track which items were packages
  // - repeat until the whole histogram was copied and all packages inserted

  // bitmask for isMerged
  BitMask mask = 1;
  unsigned char bits;
  for (bits = maxLength - 1; bits > 0; bits--)
  {
    // ignore last element if numPrevious is odd (can't be paired)
    numPrevious &= ~1; // bit-twiddling trick to clear the lowest bit, same as numPrevious -= numPrevious % 2

    // first merged package
    current[0] = histogram[0]; // a sum can't be smaller than its parts
    current[1] = histogram[1]; // therefore it's impossible to find a package at index 0 or 1
    HistItem sum = current[0] + current[1]; // same as previous[0] + previous[1]

    // copy histogram and insert merged sums whenever possible
    unsigned int numCurrent = 2; // current[0] and current[1] were already set
    unsigned int numHist    = numCurrent; // we took them from the histogram
    unsigned int numMerged  = 0; // but so far no package inserted (however, it's precomputed in "sum")
    for (;;) // stop/break is inside the loop
    {
      // the next package isn't better than the next histogram item ?
      if (numHist < numCodes && histogram[numHist] <= sum)
      {
        // copy histogram item
        current[numCurrent++] = histogram[numHist++];
        continue;
      }

      // okay, we have a package being smaller than next histogram item
      // mark output value as being "merged", i.e. a package
      isMerged[numCurrent] |= mask;
      // store package
      current[numCurrent] = sum;
      numCurrent++;

      // already finished last package ?
      numMerged++;
      if (numMerged * 2 >= numPrevious)
        break;

      // precompute next sum
      sum = previous[numMerged * 2] + previous[numMerged * 2 + 1];
    }

    // make sure every code from the histogram is included
    // (relevant if histogram is very skewed with a few outliers)
    while (numHist < numCodes)
      current[numCurrent++] = histogram[numHist++];

    // prepare next mask
    mask <<= 1;

    // performance tweak: abort as soon as "previous" and "current" are identical
    if (numPrevious >= numRelevant) // ... at least their relevant elements
    {
      // basically a bool: FALSE == 0, TRUE == 1
      char keepGoing = 0;
      // compare both arrays: if they are identical then stop
      for (i = numRelevant - 1; i > 0; i--) // collisions are most likely at the end
        if (previous[i] != current[i])
        {
          keepGoing++;
          break;
        }

      // early exit ?
      if (keepGoing == 0)
        break;
    }

    // swap pointers "previous" and "current"
    HistItem* tmp = previous;
    previous = current;
    current  = tmp;

    // no need to swap their sizes because only numCurrent needed in next iteration
    numPrevious = numCurrent;
  }

  // shifted one bit too far
  mask >>= 1;

  // keep only isMerged
  free(previous);
  free(current);

  // //////////////////////////////////////////////////////////////////////
  // tracking all merges will produce the code lengths
  // (step 2 of the algorithm)
  // - analyze each bitlength's mask in isMerged:
  //   * a "pure" symbol => increase bitlength of that symbol
  //   * a merged code   => just increase counter
  // - stop if no more merged codes found
  // - if m merged codes were found then only examine
  //   the first 2*m elements in the next iteration
  //   (because only they formed these merged codes)

  // reset code lengths
  for (i = 0; i < numCodes; i++)
    codeLengths[i] = 0;

  // start with analyzing the first 2n-2 values
  unsigned int numAnalyze = numRelevant;
  while (mask != 0) // stops if nothing but symbols are found in an iteration
  {
    // number of merged packages seen so far
    unsigned int numMerged = 0;

    // the first two elements must be symbols, they can't be packages
    codeLengths[0]++;
    codeLengths[1]++;
    unsigned int symbol = 2;

    // look at packages
    for (i = symbol; i < numAnalyze; i++)
    {
      // check bitmask: not merged if bit is 0
      if ((isMerged[i] & mask) == 0)
      {
        // we have a single non-merged symbol, which needs to be one bit longer
        codeLengths[symbol]++;
        symbol++;
      }
      else
      {
        // we have a merged package, so that its parts need to be checked next iteration
        numMerged++;
      }
    }

    // look only at those values responsible for merged packages
    numAnalyze = 2 * numMerged;
    // note that the mask was originally slowly shifted left by the merging loop
    mask >>= 1;
  }

  // last iteration can't have any merges
  for (i = 0; i < numAnalyze; i++)
    codeLengths[i]++;

  // it's a free world ...
  free(isMerged);

  // first symbol has the longest code because it's the least frequent in the sorted histogram
  return codeLengths[0];
}

// the following code is almost identical to function moffat() in moffat.c

// helper struct for qsort()
struct KeyValue
{
  unsigned int key;
  unsigned int value;
};

// helper function for qsort()
static int compareKeyValue(const void* a, const void* b)
{
  struct KeyValue* aa = (struct KeyValue*) a;
  struct KeyValue* bb = (struct KeyValue*) b;
  // negative if a < b, zero if a == b, positive if a > b
  if (aa->key < bb->key)
    return -1;
  if (aa->key > bb->key)
    return +1;
  return 0;
}

/// same as before but histogram can be in any order and may contain zeros, the output is stored in a dedicated parameter
/** - the function rejects maxLength > 63 but I don't see any practical reasons you would need a larger limit ...
 *  @param maxLength   maximum code length, e.g. 15 for DEFLATE or JPEG
 *  @param numCodes    number of codes, equals the array size of histogram and codeLengths
 *  @param histogram   how often each code/symbol was found
 *  @param codeLengths [out] computed code lengths
 *  @result actual maximum code length, 0 if error
 */
unsigned char packageMerge(unsigned char maxLength, unsigned int numCodes, const unsigned int histogram[], unsigned char codeLengths[])
{
  // my allround variable for various loops
  unsigned int i;

  // reset code lengths
  for (i = 0; i < numCodes; i++)
    codeLengths[i] = 0;

  // count non-zero histogram values
  unsigned int numNonZero = 0;
  for (i = 0; i < numCodes; i++)
    if (histogram[i] != 0)
      numNonZero++;

  // reject an empty alphabet because malloc(0) is undefined
  if (numNonZero == 0)
    return 0;

  // allocate a buffer for sorting the histogram
  struct KeyValue* mapping = (struct KeyValue*) malloc(sizeof(struct KeyValue) * numNonZero);

  // copy histogram to that buffer
  unsigned int storeAt = 0;
  for (i = 0; i < numCodes; i++)
  {
    // skip zeros
    if (histogram[i] == 0)
      continue;

    mapping[storeAt].key   = histogram[i];
    mapping[storeAt].value = i;
    storeAt++;
  }
  // now storeAt == numNonZero

  // invoke C standard library's qsort
  qsort(mapping, numNonZero, sizeof(struct KeyValue), compareKeyValue);

  // extract ascendingly ordered histogram
  unsigned int* sorted = (unsigned int*) malloc(sizeof(unsigned int) * numNonZero);
  for (i = 0; i < numNonZero; i++)
    sorted[i] = mapping[i].key;

  // run package-merge algorithm
  unsigned char result = packageMergeSortedInPlace(maxLength, numNonZero, sorted);

  // "unsort" code lengths
  for (i = 0; i < numNonZero; i++)
    codeLengths[mapping[i].value] = sorted[i];

  // let it go ...
  free(sorted);
  free(mapping);
  return result;
}
```

## Kraft-McMillan Inequality

In one of the last sentences I mentioned the Kraft-McMillan inequality (sometimes McMillan is forgotten). Its core message is: if `sum(2^-codeLength[0...numCodes]) <= 1` then a prefix-free code exists.

Ideally you try to have `sum(2^-codeLength[0...numCodes]) = 1` ("equal" instead of "less than or equal") because anything else is inefficient.

The Huffman algorithm always gives you `sum = 1`. The Package-Merge algorithm has the same property.

The following length-limiting techniques modify code lengths in such a way that the maximum code length is enforced ("limited") while the inequality still holds true.

Assume that you ran the Huffman algorithm and found that a few codes are too long. You know that `sum = 1`.

An important insight is that making one of these long codes just one bit shorter increases the sum, thus violating the inequality:

`newSum = sum - 2^-oldLength + 2^-newLength = sum - 2^-oldLength + 2^-(oldLength-1) = sum + 2^-oldLength`

However, if there is another symbol with a code length that is *shorter* than your desired `maxLength`, we can make it one bit *longer* to offset our change. The formula is almost the same (but plus one instead of minus one in the exponent).

Pretty much all algorithms mentioned in the following chapters pair long with short codes such that the new length limit is respected and the inequality holds true.

## JPEG's Length-Limiting Approach

This is the shortest algorithm. It can be found in Annex K.3 of the JPEG standard ITU T.81, along with a nice flow diagram. You can download that document from w3.org; see page 147 of that PDF.

First you create a standard Huffman code. If the maximum code length doesn't exceed your limit then you are obviously done.

If you look at all codes having the longest code length then there is always an even number of them

(if that wasn't the case then one symbol could be made one bit shorter because its last bit is unused → an odd number must be an error in your Huffman implementation).

You repeat the following procedure until reaching `maxLength`:
- pick two symbols x and y having that longest bit length
- the canonical Huffman code of x and y will be almost identical: x ends with 0 and y ends with 1
- let's call their shared prefix P
- we remove x's trailing 0, so that x becomes P, being one bit shorter
- however, y is invalid now
- let's pick a third symbol z which is shorter than the new x (that means at least two bits shorter than the still invalid y)
- appending a 0 to z makes it one bit longer
- appending a 1 to z creates a completely new code which can be used for y

To summarize the transformation:

- we made 1 symbol shorter by exactly 1 bit (x)
- we made 1 symbol shorter by at least 1 bit (y)
- we made 1 symbol longer by exactly 1 bit (z)

- from a mathematical point of view the sum of all Kraft values must be <= 1.0
- we know that bits(x) = bits(y)
- if bits(z) = bits(x) - 2 then sum(Kraft) = sum(Kraft')
- because bits(x') = bits(x) - 1
- and bits(y') = bits(y) - 1
- and bits(z') = bits(z) + 1
- so that bits(z') = bits(x') = bits(y')
- let's forget about all symbols except x, y and z so that we have:

```
sum(K)  = K(x) + K(y) + K(z)
        = 2^-bits(x) + 2^-bits(y) + 2^-bits(z)
        = 2^-bits(x) + 2^-bits(x) + 2^-(bits(x) - 2)
        = 2 * 2^-bits(x) + 4 * 2^-bits(x)
        = 6 * 2^-bits(x)

sum(K') = K(x') + K(y') + K(z')
        = 3 * K(x')
        = 3 * 2^-bits(x')
        = 3 * 2^-(bits(x) - 1)
        = 6 * 2^-bits(x)
        = sum(K)
```

Therefore the sum of all Kraft values remains unchanged after the transformation.

Unfortunately, processing a huge alphabet with very large bit lengths can be quite slow. However, this situation is impossible with the small JPEG alphabet (typically 162 symbols with an upper code length limit of 16).

The code simplifies and becomes faster if we transform a histogram of code lengths instead of the code lengths themselves. That means you run a single pass over all code lengths and count how many symbols have one bit, two bits, three bits, etc.

The inner loop of my function `limitedJpegInPlace` is surprisingly short:
**limitedJpegInPlace**
```c
unsigned char limitedJpegInPlace(unsigned char newMaxLength, unsigned char oldMaxLength, unsigned int histNumBits[])
{
  // reject invalid parameters
  if (newMaxLength <= 1)
    return 0;
  if (newMaxLength > oldMaxLength)
    return 0;
  // nothing to do ?
  if (newMaxLength == oldMaxLength)
    return newMaxLength;

  // iterate over all "excessive" bit lengths, beginning with the longest
  unsigned char i = oldMaxLength;
  while (i > newMaxLength)
  {
    // no codes at this bit length ?
    if (histNumBits[i] == 0)
    {
      i--;
      continue;
    }

    // look for codes that are at least two bits shorter
    unsigned char j = i - 2;
    while (j > 0 && histNumBits[j] == 0)
      j--;

    // change bit length of 2 of the longest codes
    histNumBits[i] -= 2;
    // one code becomes one bit shorter
    // (using the joint prefix of the old two codes)
    histNumBits[i - 1]++;
    // the other code has length j+1 now
    // but another, not-yet-involved, code with length j
    // is moved to bit length j+1 as well
    histNumBits[j + 1] += 2;
    histNumBits[j]--;
  }

  // return longest code length that is still used
  while (i > 0 && histNumBits[i] == 0)
    i--;

  // JPEG Annex K.3 specifies an extra line:
  // histNumBits[i]--;
  // => because JPEG needs a special symbol to avoid 0xFF in its output stream
  return i;
}
```

## MiniZ's Length-Limiting Algorithm

The DEFLATE file format imposes a 16 bit limit in its Huffman stage.Unlike JPEG there is no official length-limiting procedure so that programmers came up with different (but somehow very similar) code.

If found the most easily identifiable code was MiniZ's huffman_enforce_max_code_size() (written by Rich Geldreich).

gzip and its zlib are more popular but the way they limit code lengths is somewhat hidden and interwoven with other functionality.

As far as I understand, both MiniZ's and zlib's length limiting behave identically.

While MiniZ's basic idea is comparable to JPEG's (for most input they produce identical output), it's much faster: instead of slowly "eating" away one bit at a time from long codes, it immediately shortens all (!) long codes to the new `maxLength` limit. Now the Kraft-McMillan inequality might be violated so that a few codes need to be fixed:

1. select a code x with maximum length (= `maxLength`)
2. select another code y that is shorter, if possible `maxLength-1` bits (even shorter codes work, too, but are inefficient)
3. make y one bit longer
4. assign x the same length as the "new" y
5. we just reduced the Kraft-McMillan sum by (at least) `2^-maxLength`
6. if it still exceeds 1 then continue with step 1

There is an edge case: assume `newMaxLength` is 16 and you have codes with 15 bits, too. If you select a 16 bit code x and pick a 15 bit code as y, then y becomes a 16 bit code and x stays a 15+1=16 bit code.

The Kraft-McMillan computations can be performed with integers if we multiply everything by `2^maxLength`: since `2^-maxLength * 2^maxLength = 1`, each iteration "saves" one unit, and the total sum must not exceed `1 * 2^maxLength = 2^maxLength`.

If a histogram of code lengths is used instead of the code lengths themselves (just like in the JPEG algorithm) then the inner loop is just 50 lines, and most of it are comments:


**limitedMinizInPlace**

```c
unsigned char limitedMinizInPlace(unsigned char newMaxLength, unsigned char oldMaxLength, unsigned int histNumBits[])
{
  if (newMaxLength <= 1)
    return 0;
  if (newMaxLength > oldMaxLength)
    return 0;
  if (newMaxLength == oldMaxLength)
    return newMaxLength;

  // my allround variable for various loops
  unsigned int i;

  // move all oversized code lengths to the longest allowed
  for (i = newMaxLength + 1; i <= oldMaxLength; i++)
  {
    histNumBits[newMaxLength] += histNumBits[i];
    histNumBits[i] = 0;
  }

  // compute Kraft sum
  // (using integer math: everything is multiplied by 2^newMaxLength;
  //  the cast avoids overflowing the 32 bit shift for large limits)
  unsigned long long total = 0;
  for (i = newMaxLength; i > 0; i--)
    total += (unsigned long long)histNumBits[i] << (newMaxLength - i);

  // iterate until Kraft sum doesn't exceed 1 anymore
  unsigned long long one = 1ULL << newMaxLength;
  while (total > one)
  {
    // select a code with maximum length, it will be moved
    histNumBits[newMaxLength]--;

    // find a second code with shorter length
    for (i = newMaxLength - 1; i > 0; i--)
      if (histNumBits[i] > 0)
      {
        histNumBits[i]--;
        // extend that shorter code by one bit
        // and assign the same code length to the selected one
        histNumBits[i + 1] += 2;
        break;
      }

    // moving these codes reduced the Kraft sum
    total--;
  }

  return newMaxLength;
}
```
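To see it in action, here is a quick sketch (the function is repeated in condensed form so the snippet compiles on its own): limiting the code lengths { 1, 3, 4, 5, 5 } to 4 bits simply moves both 5 bit codes to 4 bits; the scaled Kraft sum is then 13 of 16 units, so the fix-up loop never runs.

```c
// condensed repetition of limitedMinizInPlace from above
unsigned char limitedMinizInPlace(unsigned char newMaxLength, unsigned char oldMaxLength, unsigned int histNumBits[])
{
  unsigned int i;
  if (newMaxLength <= 1 || newMaxLength > oldMaxLength)
    return 0;
  if (newMaxLength == oldMaxLength)
    return newMaxLength;
  // move all oversized code lengths to the longest allowed
  for (i = newMaxLength + 1; i <= oldMaxLength; i++)
  {
    histNumBits[newMaxLength] += histNumBits[i];
    histNumBits[i] = 0;
  }
  // scaled Kraft sum (multiplied by 2^newMaxLength)
  unsigned long long total = 0;
  for (i = newMaxLength; i > 0; i--)
    total += (unsigned long long)histNumBits[i] << (newMaxLength - i);
  // each iteration lowers the sum by exactly one unit
  unsigned long long one = 1ULL << newMaxLength;
  while (total > one)
  {
    histNumBits[newMaxLength]--;
    for (i = newMaxLength - 1; i > 0; i--)
      if (histNumBits[i] > 0)
      {
        histNumBits[i]--;
        histNumBits[i + 1] += 2;
        break;
      }
    total--;
  }
  return newMaxLength;
}
```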

## BZip2's Length-Limiting Algorithm

bzip2 is a weird compression concept. One of its several stages is a Huffman compression stage. Instead of playing around with the Kraft-McMillan inequality it completely relies on a fast Huffman encoder and trial'n'error:

1. compute Huffman codes for your alphabet
2. if the maximum code length doesn't exceed `maxLength` then you are done
3. otherwise reduce each symbol's count (but avoid it becoming zero)
4. and go back to step 1

While being very fast, it may produce sub-optimal code lengths. Sometimes dividing by 3 instead of 2 gives better results, but that totally depends on your input data.

The compression efficiency is in almost all cases worse than Package-Merge/JPEG/MiniZ. Moreover, it's slower than JPEG/MiniZ, so there is no real reason to choose bzip2's approach.

The core loop is:
**bzip2**

```c
// run Moffat algorithm ...
unsigned char result = moffatSortedInPlace(numNonZero, sorted);
// ... until a proper maximum code length is found
while (result > maxLength)
{
  // more or less divide each weight by two while avoiding zero
  for (i = 0; i < numNonZero; i++)
  {
    unsigned int weight = mapping[i].key;

    // bzip2 "clears" the lowest 8 bits of the histogram
    // to reach the length limit in fewer iterations
    // but sacrifices lots of efficiency;
    // if you set EXTRA_SHIFT to 0 then the code may need more iterations
    // but finds much better code lengths
//#define EXTRA_SHIFT 8
#define EXTRA_SHIFT 0
    // sometimes dividing the weight by a bigger integer (e.g. 3)
    // may lead to more efficient prefix codes
#define DIVIDE_BY 2

    // the shifts do more harm than good in my opinion
    weight >>= EXTRA_SHIFT;
    // adding 1 avoids zero
    weight = 1 + (weight / DIVIDE_BY);
    weight <<= EXTRA_SHIFT;

    // sorted was overwritten with code lengths
    mapping[i].key = sorted[i] = weight;
  }
  // again: run Moffat algorithm
  result = moffatSortedInPlace(numNonZero, sorted);
}
```

## Kraft-McMillan Approximations

The goal of each presented length-limiting algorithm is producing a code that is compatible with Huffman decoders. Aside from Package-Merge, they all need to generate a plain Huffman code first and then adjust overlong symbols.

I already mentioned the Kraft-McMillan inequality, which states that each code where `sum(2^-codeLength[0...numCodes-1]) <= 1` is a prefix-free code. That means each such code can be decoded by Huffman decoders, too.

I came across a few postings, mainly by Charles Bloom, discussing how solely focusing on the Kraft-McMillan inequality can be both fast and close to optimal.

However, I made a few important changes to speed things up:

- compute the theoretical number of bits for each symbol: `entropy = -log2(count / total)`, using a super-fast approximation instead of C's `log` functions
- round up or down to the closest integer; this is our initial guess for the code length
- but never round to zero
- if above our length limit, then set it to that limit
- compute the Kraft-McMillan sum of all symbols' code lengths

If that sum is above 1 then we need to modify (=increase) a few code lengths. The main idea is modifying those code lengths which "hurt the least".

When we rounded to the closest integer, many symbols rounded down: their actual code length is smaller than their theoretical bit length.

Assume you have a symbol which represents 9% of all symbols. Then its entropy is `-log2(0.09) ≈ 3.474` → rounded to 3. There is a "gain" of 3.474 - 3 = 0.474. "Gain" means that we used 0.474 fewer bits per symbol occurrence than the theoretical minimum, i.e. we saved some bits.

If we change that symbol's code length to 4 (assuming it's allowed by the length limit), then there will be a "loss" of 3.474 - 4 = -0.526.

"Loss" means that we used 0.526 more bits per symbol occurrence than the theoretical minimum, i.e. we wasted some bits.

The Kraft-McMillan sum will be lower though, hopefully now below 1, at the cost of slightly worse compression.

Let's say there is another symbol representing 10% of all symbols, thus being a little bit more frequent than the previous one. Its entropy is `-log2(0.1) ≈ 3.322`, rounded to 3 as well. Therefore its "gain" is 3.322 - 3 = 0.322 and the potential "loss" after adding one bit is 3.322 - 4 = -0.678. Both symbols affect the Kraft-McMillan sum in the same way, but the latter symbol will cause even worse compression.

You have probably noticed that adding one bit to a symbol which had a gain of `x` leads to a loss of `1-x`. So the main idea is: always look for symbols with maximum gain to cause minimal loss after adding a bit.

I implemented two different strategies to modify the code lengths, called A and B.

And yes, I got tired of the long name "Kraft-McMillan inequality" and just called it "Kraft" (which is the nice German word for "power"/"strength"/"force").

### Kraft - Strategy A

This algorithm works in several passes. A threshold, initially a small value, decreases after each pass. Walking through all symbols, we look for those where the difference between theoretical bit length and actual bit length is above that threshold ("high gain").

One bit will be added and the Kraft sum adjusted. As soon as the Kraft sum isn't larger than 1 anymore we have a valid prefix-free code.

It's possible to "overshoot" by extending too many previously very short codes - which have a high influence on the Kraft sum.

If we are well below a Kraft sum of 1 then a single pass looks for codes which can be made one bit shorter without violating the Kraft sum.

In many cases this last pass isn't needed and can be skipped.

The algorithm works pretty well. However, the number of passes hugely depends on the input data.

If the initial threshold is too high and/or shrinks only a little after each pass then a large number of passes will be required.

If the initial threshold is too low and/or shrinks a lot after each pass then only few passes will be required.

While the latter sounds much better, it will miss many optimal code lengths and produce sub-optimal results.

I didn't find a perfect one-size-fits-all configuration. The defaults `INITIAL_THRESHOLD = 28 / 64.0` and `STEP_THRESHOLD = 1 / 64.0` worked well in my experiments.
### Kraft - Strategy B

Strategy A makes a few guesses regarding the threshold. Moreover, it repeatedly scans symbols which need not be modified. My strategy B implements a max-queue: its first element is always the symbol with the highest gain.

Therefore we don't need to scan all symbols but always pick from the top of that max-queue.

After a symbol was adjusted, it has to be re-inserted into the max-queue.

One of the most efficient data structures for such an access pattern is a heap. Most standard libraries come with a heap implementation (C++, for example).

Roughly half of the code of limitedkraftheap.c is a pretty generic and straightforward heap implementation.
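Such a heap can be sketched in a few lines. The following is a generic binary max-heap keyed by gain, not the actual code from limitedkraftheap.c (the `Item`, `heapPush` and `heapPop` names are mine):

```c
#include <stddef.h>

// one queue entry: a symbol and its current gain
typedef struct { double gain; unsigned int symbol; } Item;

// insert an item, bubbling it up until the heap property holds
void heapPush(Item heap[], size_t *size, Item item)
{
  size_t pos = (*size)++;
  while (pos > 0)
  {
    size_t parent = (pos - 1) / 2;
    if (heap[parent].gain >= item.gain)
      break;
    heap[pos] = heap[parent];
    pos = parent;
  }
  heap[pos] = item;
}

// remove and return the item with the highest gain,
// bubbling the last item down to restore the heap property
Item heapPop(Item heap[], size_t *size)
{
  Item top  = heap[0];
  Item last = heap[--(*size)];
  size_t pos = 0;
  for (;;)
  {
    size_t child = 2 * pos + 1;
    if (child >= *size)
      break;
    // pick the larger of the two children
    if (child + 1 < *size && heap[child + 1].gain > heap[child].gain)
      child++;
    if (heap[child].gain <= last.gain)
      break;
    heap[pos] = heap[child];
    pos = child;
  }
  heap[pos] = last;
  return top;
}
```

Strategy B pops the top symbol, adds one bit to its code, recomputes its (now smaller) gain and pushes it back, until the Kraft sum no longer exceeds 1.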

Similar to strategy A, we might "overshoot" so that a fast fix-up exists as well.

There is no major difference between strategy A and B concerning the generated code lengths.

However, strategy B is much faster (typically by a factor of 4) and even outperforms an efficient Huffman implementation.

It's interesting to note that the length limit has a very significant impact on the execution time: lower limits are faster.

## License

This code is licensed under the zlib License:

This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software.

Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions:

1. The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.

## Source Code

Stay up-to-date:

`git clone https://create.stephan-brumme.com/length-limited-prefix-codes/git`

If you encounter any bugs/problems or have ideas for improving future versions, please write me an email: create@stephan-brumme.com

## Changelog

**version 1** - June 30, 2021
- initial release