smalLZ4 - optimal LZ4 compression

posted by Stephan Brumme

You came here to grab just the code or a binary? Scroll down to the Download section and GIT repository.
A short overview of how smallz4 compares to the original LZ4 and to a program called LZ4X (which follows ideas similar to smallz4) can be found here.

And if you are interested in the inner workings, keep on reading and enjoy the ride ...

Optimal Parsing

In the context of compression, optimal parsing is a multi-pass algorithm to find the smallest output of compressible data.
Whenever there are multiple choices of encoding a symbol (or a sequence of symbols), optimal parsing will choose the representation which leads to the overall smallest output.
That means, it may prefer a locally sub-optimal encoding in order to achieve a globally optimal encoding.

Most compression algorithms strive for locally optimal encoding because it's computationally cheaper.
The most basic algorithm, the greedy algorithm, always selects the encoding that fits best "right now".
A significant improvement is called "lazy evaluation" and takes the next group of symbols into consideration and decides which encoding yields the smallest output for the whole group.
In a way, lazy evaluation can be seen as a local version of optimal parsing. It does the same job for a group of symbols instead of the whole file.
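The difference between greedy and lazy evaluation can be illustrated with a toy decision rule (a sketch only; `savings` and `acceptMatchNow` are made-up names and not part of smallz4):

```cpp
// Bytes saved by replacing `length` literals with a match:
// an LZ4 match costs roughly 1 token byte + 2 offset bytes.
static int savings(int length)
{
  return length - 3;
}

// Greedy: always take the match at the current position.
// Lazy (sketched here): take it only if deferring by one literal
// and using the match at the next position doesn't save more bytes.
static bool acceptMatchNow(int lengthHere, int lengthNext)
{
  return savings(lengthHere) >= savings(lengthNext) - 1; // -1: one extra literal
}
```

With the numbers from the string example further down (a 5-byte match right now versus a 7-byte match one position later), `acceptMatchNow(5, 7)` is false, so lazy evaluation defers the match.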

Compressing the string abcde_bcdefgh_abcdefghxxxxxxx returns different file sizes:
$ echo "abcde_bcdefgh_abcdefghxxxxxxx" | lz4 -9 --no-frame-crc | wc -c
43
$ echo "abcde_bcdefgh_abcdefghxxxxxxx" | ./smallz4 | wc -c
41
Let's assume that red characters are literals, i.e. uncompressed data.
Green pairs of numbers indicate distance and length of a match.
And let's ignore xxxxxxx from now on (its only purpose is to hide the fact that the LZ4 format specification forbids matches at the very end of a file).

lz4's result in detail: abcde_(5,4)fgh_(14,5)fghxxxxxxx
It found two matches: the first replaces bcde by a reference to a sequence we saw 5 bytes ago.
That reference "costs" 3 bytes while the four literals would have occupied 4 bytes. We saved a byte here.
The second match replaces abcde by a reference to the beginning of the file.
Again, we need 3 bytes to represent the match while the replaced text was 5 bytes long. Two more bytes saved!

However, this compression is not optimal. smallz4 can squeeze out another two bytes because it postpones the second match:
abcde_(5,4)fgh_a(9,7)xxxxxxx
smallz4 detects that matching abcde would save 2 bytes but matching bcdefgh saves 4 bytes.

Optimal Parsing - Pass 1

For each byte of input data, the longest match is found and stored in an array called matches (starting at about line 513).
Additional data structures such as previousHash and previousExact only speed up the code.
A simple brute-force routine will give you the same result - albeit slower.
position           1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  (padding)
character          a   b   c   d   e   _   b   c   d   e   f   g   h   _   a   b   c   d   e   f   g   h  xxxxxxx
match distance                             5 (5) (5) (5)             (8)  14   9   9   9   9   9 (9) (9)
match length                               4 (3) (2) (1)             (1)   5   7   6   5   4 (3) (2) (1)
Numbers in brackets indicate matches which cannot be efficiently compressed by LZ4 - the format requires each match to be at least four characters long.
My program discards these short matches; they aren't stored in matches.

Match length 1 isn't shown in the table above; it's the default value for literals.
Each empty cell in the "match length" row is considered to be 1, with its corresponding match distance being undefined.

Optimal Parsing - Pass 2

In pass 2, all matches are analyzed in reverse. Beginning with the last match we compute how big the compressed output will be - that's what I call cost.
You may ask: why in reverse? Well, the idea ("dynamic programming") is to reduce the problem to a smaller one.

The simplest problem is a single byte. Obviously, it can't be LZ4 compressed.
When taking a few more steps we will arrive at a position where a match was found in the previous pass.
Now we have to decide whether this match is more efficient ("costs less") than outputting its literals.
It boils down to deciding whether cost(match) + cost(everything after the match) is smaller than cost(one literal) + cost(everything after that literal).

Usually a match is "cheaper" than the literals it replaces. The case becomes much more interesting if the longest possible match overlaps with a potential match found later in the text.
That's exactly what we saw in the example given above. Therefore we compute the cost for all potential match lengths from 4 (LZ4's minimum match length) up to the longest match found.
In conclusion we look for the minimum of cost(match with length x) + cost(everything after x bytes).

The relevant code can be found in estimateCosts. Its only parameter, matches, is modified such that each locally optimal match is replaced by its globally optimal sibling.
Often, those two are identical but, as explained, a shorter match (which is locally sub-optimal) can sometimes yield a better output.
Reducing match length below 4 converts a match into a literal.

Obviously, literals take 1 byte. Therefore, the cost of a literal will be 1 plus whatever follows: cost[i] = cost[i + 1] + 1

The basic cost of a match is 1+2=3 bytes: the initial token consumes 1 byte, plus 2 bytes to encode the match distance.
However, we can then skip over the whole match: cost[i] = cost[i + matchLength] + 1 + 2
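Putting both formulas together, the backward pass can be sketched like this (a simplified, hypothetical version of estimateCosts that ignores the extra length bytes needed for long literal runs and long matches):

```cpp
#include <cstdint>
#include <vector>

// A match candidate as found in pass 1: length 1 means "just a literal".
struct Match
{
  uint32_t length   = 1;
  uint32_t distance = 0;
};

// cost[i] = cheapest number of output bytes for everything from i to the end.
static std::vector<uint32_t> estimateCostsSimple(std::vector<Match>& matches)
{
  std::vector<uint32_t> cost(matches.size() + 1, 0);
  for (int i = (int)matches.size() - 1; i >= 0; i--)
  {
    // encode position i as a literal: 1 byte plus whatever follows
    uint32_t minCost    = cost[i + 1] + 1;
    uint32_t bestLength = 1;
    // try all match lengths from the minimum (4) up to the longest found
    for (uint32_t length = 4;
         length <= matches[i].length && i + length <= matches.size(); length++)
    {
      uint32_t current = cost[i + length] + 1 + 2; // token + 16-bit distance
      if (current <= minCost)
      {
        minCost    = current;
        bestLength = length;
      }
    }
    matches[i].length = bestLength; // keep only the globally optimal choice
    cost[i] = minCost;
  }
  return cost;
}
```

Fed with the two overlapping matches of the example (a length-5 match immediately followed by a length-7 match), this sketch demotes the first match to a literal - just like smallz4 does.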

The optimal match turns out to be the one with the lowest cost[i] when we vary matchLength between 1 and its original value:
position           1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  (padding)
character          a   b   c   d   e   _   b   c   d   e   f   g   h   _   a   b   c   d   e   f   g   h  xxxxxxx
match distance                             5                              14   9   9   9   9
match length                               4                               5   7   6   5   4
cost              19  18  17  16  15  14  13  13  12  11  10   9   8   7   6   3   3   3   3   3   2   1
minimum cost      17  16  15  14  13  12  11  11  10   9   8   7   6   5   4   3   3   3   3   3   2   1
optimal length                             4                               1   7   6   5   4
In the example above estimateCosts will figure out that a match at position 15 is more expensive than a literal:
Match: cost[15] = cost[15 + matchLength] + 1 + 2 = cost[19] + 3 = 6 (see row "cost")
Literal: cost[15] = cost[15 + 1] + 1 = cost[16] + 1 = 4 (see row "minimum cost")

Therefore, match[15].length will be changed from 4 to 1 (its optimal length).

Note: there are a few additional adjustments to these formulas in the full source code.
The 15th consecutive literal actually costs 1+1=2 bytes because we have to insert an extra length byte (and after that, plus 1 for every further 255 literals).
Very long matches take more bytes to encode their length (plus 1 once a match exceeds 18 bytes, and plus 1 for every further 255 bytes).
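Following the LZ4 block format, these extra length bytes can be counted exactly (a sketch; the function names are made up and the real estimateCosts folds this into its cost formula):

```cpp
#include <cstddef> // size_t

// Extra bytes needed to encode a literal-run length:
// the token nibble holds 0..14 directly; 15 signals additional length bytes
// (255 while the remainder is >= 255, then one final byte).
static size_t extraLiteralLengthBytes(size_t numLiterals)
{
  if (numLiterals < 15)
    return 0;
  return 1 + (numLiterals - 15) / 255;
}

// Extra bytes for a match length: the minimum match of 4 is implied,
// so the token nibble covers lengths 4..18 without extra bytes.
static size_t extraMatchLengthBytes(size_t matchLength)
{
  if (matchLength < 19)
    return 0;
  return 1 + (matchLength - 19) / 255;
}
```

For instance, 14 literals fit into the token, the 15th forces one extra byte, and a run of 270 literals needs two.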

Optimal Parsing - Pass 3

Finally, we walk through the matches from the beginning to the end and emit the corresponding LZ4 codes.
selectBestMatches is implemented in a pretty straightforward way and can be summarized by this pseudo code:

i = 0
while (i < matches.size())
{
  if (matches[i] is a literal)
  {
    add literal to an internal buffer
    i++
  }
  else
  {
    emit token (which contains match length and number of buffered literals)
    emit all buffered literals and clear that buffer
    emit match distance
    i += matches[i].length
  }
}
// if literals buffer isn't empty, emit it, too

LZ4 compressor (C++)

Features:
What's missing:
Comparison chart (the compressed test file is enwik9, the first 10^9 bytes of the English Wikipedia; download it here):
compressor settings bytes duration
lz4 1.7.5 (default) 509,454,875 bytes 5 seconds
lz4 1.7.5 -9 374,843,921 bytes 35 seconds
lz4 1.7.5 -9 --no-frame-crc -BD 374,091,158 bytes 35 seconds
lz4 1.7.5 -12 --no-frame-crc -BD 371,681,832 bytes 144 seconds
LZ4X 1.12 -9 372,068,631 bytes 164 seconds
smallz4 1.0 -9 371,681,075 bytes 242 seconds
Note: LZ4X generates LZ4 legacy format frames.

hide Compressor // ////////////////////////////////////////////////////////// // smallz4.h // Copyright (c) 2016 Stephan Brumme. All rights reserved. // see http://create.stephan-brumme.com/smallz4/ // // "MIT License": // Permission is hereby granted, free of charge, to any person obtaining a copy // of this software and associated documentation files (the "Software"), // to deal in the Software without restriction, including without limitation // the rights to use, copy, modify, merge, publish, distribute, sublicense, // and/or sell copies of the Software, and to permit persons to whom the Software // is furnished to do so, subject to the following conditions: // // The above copyright notice and this permission notice shall be included // in all copies or substantial portions of the Software. // // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, // INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A // PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT // HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION // OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE // SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. #pragma once #include <stdint.h> // uint16_t, uint32_t, ... 
#include <cstdlib> // size_t #include <vector> /// LZ4 compression with optimal parsing /** see smallz4.cpp for a basic I/O interface you can easily replace it by a in-memory version then all you have to do is: #include "smallz4.h" smallz4::lz4(GET_BYTES, SEND_BYTES); **/ class smallz4 { public: // read several bytes, see getBytesFromIn() in smallz4.cpp for a basic implementation typedef size_t (*GET_BYTES) ( void* data, size_t numBytes); // write several bytes, see sendBytesToOut() in smallz4.cpp for a basic implementation typedef void (*SEND_BYTES)(const void* data, size_t numBytes); /// compress everything in input stream (accessed via getByte) and write to output stream (via send) static void lz4(GET_BYTES getBytes, SEND_BYTES sendBytes, unsigned int maxChainLength = MaxChainLength) { smallz4 obj(maxChainLength); obj.compress(getBytes, sendBytes); } /// version string static const char* const getVersion() { return "1.0"; } // compression level thresholds, made public because I display them in the help screen ... 
enum { /// greedy mode for short chains (compression level <= 3) instead of optimal parsing / lazy evaluation ShortChainsGreedy = 3, /// lazy evaluation for medium-sized chains (compression level > 3 and <= 6) ShortChainsLazy = 6 }; // ----- END OF PUBLIC INTERFACE ----- private: // ----- constants and types ----- /// a block can be 4 MB typedef uint32_t Length; /// matches must start within the most recent 64k typedef uint16_t Distance; enum { /// each match's length must be >= 4 MinMatch = 4, /// no matching within the last few bytes BlockEndNoMatch = 12, /// last bytes must be literals BlockEndLiterals = 5, /// match finder's hash table size (2^HashBits entries, must be less than 32) HashBits = 20, /// input buffer size, can be any number but zero ;-) BufferSize = 64*1024, /// maximum match distance MaxDistance = 65534, /// marker for "no match" NoPrevious = MaxDistance + 1, /// stop match finding after MaxChainLength steps (default is unlimited => optimal parsing) MaxChainLength = NoPrevious, /// refer to location of the previous match (implicit hash chain) PreviousSize = 1 << 16 }; /// maximum block size as defined in LZ4 spec //const uint32_t MaxBlockSizeArray[] = { 0,0,0,0,64*1024,256*1024,1024*1024,4*1024*1024 }; /// I only work with the biggest maximum block size (7) static const uint32_t MaxBlockSizeId = 7; // header checksum is precalculated only for 7, too static const uint32_t MaxBlockSize = 4*1024*1024;//MaxBlockSizeArray[MaxBlockSizeId]; // ----- one and only variable ... 
----- /// how many matches are checked in findLongestMatch unsigned int maxChainLength; // ----- code ----- /// match struct Match { /// true, if long enough inline bool isMatch() const { return length >= MinMatch; } /// length of match Length length; /// start of match Distance distance; }; /// create new compressor (only invoked by lz4) explicit smallz4(unsigned int newMaxChainLength = MaxChainLength) : maxChainLength(newMaxChainLength) // => no limit, but can be changed by setMaxChainLength { } /// return true, if four bytes at a and b match inline static bool match4(const void* a, const void* b) { return *(uint32_t*)a == *(uint32_t*)b; } /// find longest match of data[pos] between data[begin] and data[end], use match chain stored in previous Match findLongestMatch(const unsigned char* data, uint32_t pos, uint32_t begin, uint32_t end, const Distance* previous) const { Match result; result.length = 1; // compression level: look only at the first n entries of the match chain int32_t stepsLeft = maxChainLength; // pointer to position that is matched against everything in data const unsigned char* const current = data + pos - begin; // don't match beyond this point const unsigned char* const stop = current + end - pos; // get distance to previous match, abort if -1 => not existing Distance distance = previous[pos % PreviousSize]; uint32_t totalDistance = 0; while (distance != NoPrevious) { // too far back ? 
totalDistance += distance; if (totalDistance > MaxDistance) break; // prepare next position distance = previous[(pos - totalDistance) % PreviousSize]; // stop searching on lower compression levels if (stepsLeft-- <= 0) break; // let's introduce a new pointer atLeast that points to the first "new" byte of a potential longer match const unsigned char* atLeast = current + result.length + 1; // the idea is to split the comparison algorithm into 2 phases // (1) scan backward from atLeast to current, abort if mismatch // (2) scan forward until a mismatch is found and store length/distance of this new best match // current atLeast // | | // -<<<<<<<< phase 1 <<<<<<<< // >>> phase 2 >>> // impossible to find a longer match because not enough bytes left ? if (atLeast > stop) break; // all bytes between current and atLeast shall be identical, compare 4 bytes at once const unsigned char* compare = atLeast - 4; bool ok = true; while (compare > current) { // mismatch ? if (!match4(compare, compare - totalDistance)) { ok = false; break; } // keep going ... compare -= 4; // note: - the first four bytes always match // - in the last iteration, compare is either current + 1 or current + 2 or current + 3 // - therefore we compare a few bytes twice => but a check to skip these checks is more expensive } // mismatch ? 
if (!ok) continue; // we have a new best match, now scan forward from the end compare = atLeast; // fast loop: check four bytes at once while (compare + 4 <= stop && match4(compare, compare - totalDistance)) compare += 4; // slow loop: check the last 1/2/3 bytes while (compare < stop && *compare == *(compare - totalDistance)) compare++; // store new best match result.distance = totalDistance; result.length = compare - current; } return result; } /// create shortest output /** data points to block's begin; we need it to extract literals **/ static std::vector<unsigned char> selectBestMatches(const std::vector<Match>& matches, const unsigned char* data) { // store encoded data std::vector<unsigned char> result; result.reserve(MaxBlockSize / 2); // indices of current literal run size_t literalsFrom = 0; size_t literalsTo = 0; // point beyond last literal of the current run // walk through the whole block for (size_t offset = 0; offset < matches.size(); ) // increment inside of loop { // get best cost-weighted match Match match = matches[offset]; // if no match, then count literals instead if (!match.isMatch()) { // first literal if (literalsFrom == literalsTo) literalsFrom = literalsTo = offset; // one more literal literalsTo++; // ... and definitely no match match.length = 1; } offset += match.length; bool lastToken = (offset == matches.size()); // continue if simple literal if (!match.isMatch() && !lastToken) continue; // emit token // count literals size_t numLiterals = literalsTo - literalsFrom; // store literals' length unsigned char token = (numLiterals < 15) ? (unsigned char)numLiterals : 15; token <<= 4; // store match length (4 is implied because it's the minimum match length) size_t matchLength = match.length - 4; if (!lastToken) token |= (matchLength < 15) ? matchLength : 15; result.push_back(token); // >= 15 literals ? 
(extra bytes to store length) if (numLiterals >= 15) { // 15 is already encoded in token numLiterals -= 15; // emit 255 until remainder is below 255 while (numLiterals >= 255) { result.push_back(255); numLiterals -= 255; } // and the last byte (can be zero, too) result.push_back((unsigned char)numLiterals); } // copy literals if (literalsFrom != literalsTo) { result.insert(result.end(), data + literalsFrom, data + literalsTo); literalsFrom = literalsTo = 0; } // last token doesn't have a match if (lastToken) break; // distance stored in 16 bits / little endian result.push_back( match.distance & 0xFF); result.push_back((match.distance >> 8) & 0xFF); // >= 15+4 bytes matched (4 is implied because it's the minimum match length) if (matchLength >= 15) { // 15 is already encoded in token matchLength -= 15; // emit 255 until remainder is below 255 while (matchLength >= 255) { result.push_back(255); matchLength -= 255; } // and the last byte (can be zero, too) result.push_back((unsigned char)matchLength); } } return result; } /// walk backwards through all matches and compute number of compressed bytes from current position to the end of the block /** note: matches are modified (shortened length) if necessary **/ static void estimateCosts(std::vector<Match>& matches) { size_t blockEnd = matches.size(); typedef uint32_t Cost; // minimum cost from this position to the end of the current block std::vector<Cost> cost(matches.size(), 0); // "cost" represents the number of bytes needed // backwards optimal parsing size_t posLastMatch = matches.size(); for (int i = matches.size() - (1 + BlockEndLiterals); i >= 0; i--) // ignore the last 5 bytes, they are always literals { // watch out for long literal strings that need extra bytes Length numLiterals = posLastMatch - i; // assume no match Cost minCost = cost[i + 1] + 1; // an extra byte for every 255 literals required to store length (first 14 bytes are "for free") if (numLiterals >= 15 && (numLiterals - 15) % 255 == 0) 
minCost++; // if encoded as a literal Length bestLength = 1; // analyze longest match Match match = matches[i]; // match must not cross block borders if (match.isMatch() && i + match.length + BlockEndLiterals > blockEnd) match.length = blockEnd - (i + BlockEndLiterals); // try all match lengths for (Length length = MinMatch; length <= match.length; length++) { // token (1 byte) + offset (2 bytes) Cost currentCost = cost[i + length] + 1 + 2; // very long matches need extra bytes for encoding match length if (length > 18) currentCost += 1 + (length - 18) / 255; // better choice ? if (currentCost <= minCost) { // regarding the if-condition: // "<" prefers literals and shorter matches // "<=" prefers longer matches // they should produce the same number of bytes (because of the same cost) // ... but every now and then it doesn't ! // that's why: too many consecutive literals require an extra length byte // (which we took into consideration a few lines above) // but we only looked at literals beyond the current position // if there are many literal in front of the current position // then it may be better to emit a match with the same cost as the literals at the current position // => it "breaks" the long chain of literals and removes the extra length byte minCost = currentCost; bestLength = length; // performance-wise, a long match is usually faster during decoding than multiple short matches // on the other hand, literals are faster than short matches as well (assuming same cost) } // workaround: very long self-referencing matches can slow down the program A LOT if (match.distance == 1 && match.length > 18 + 255) { // assume that longest match is always the best match bestLength = match.length; minCost = cost[i + match.length] + 1 + 2 + 1 + (match.length - 18) / 255; break; } } // remember position of last match to detect number of consecutive literals if (bestLength >= MinMatch) posLastMatch = i; // store lowest cost so far cost[i] = minCost; // and adjust best match 
matches[i].length = bestLength; if (bestLength == 1) matches[i].distance = NoPrevious; // note: if bestLength is smaller than the previous matches[i].length then there might be a closer match // which could be more cache-friendly (=> faster decoding) } } /// compress everything in input stream (accessed via getByte) and write to output stream (via send) void compress(GET_BYTES getBytes, SEND_BYTES sendBytes) const { // ==================== write header ==================== // magic bytes const unsigned char magic[4] = { 0x04, 0x22, 0x4D, 0x18 }; sendBytes(magic, 4); // flags const unsigned char flags = 1 << 6; sendBytes(&flags, 1); // max blocksize const unsigned char maxBlockSizeId = MaxBlockSizeId << 4; sendBytes(&maxBlockSizeId, 1); // header checksum (precomputed) const unsigned char checksum = 0xDF; sendBytes(&checksum, 1); // ==================== declarations ==================== // read the file in chunks/blocks, data will contain only bytes which are relevant for the current block std::vector<unsigned char> data; // file position corresponding to data[0] size_t dataZero = 0; // last already read position size_t numRead = 0; // passthru data (but still wrap in LZ4 format) const bool uncompressed = (maxChainLength == 0); // last time we saw a hash const uint32_t HashSize = 1 << HashBits; std::vector<size_t> lastHash(HashSize, NoPrevious); const uint32_t HashMultiplier = 22695477; // taken from https://en.wikipedia.org/wiki/Linear_congruential_generator const uint8_t HashShift = 32 - HashBits; // previous position which starts with the same bytes std::vector<Distance> previousHash (PreviousSize, Distance(NoPrevious)); // long chains based on my simple hash std::vector<Distance> previousExact(PreviousSize, Distance(NoPrevious)); // shorter chains based on exact matching of the first four bytes // change buffer size as you like std::vector<unsigned char> buffer(BufferSize); // first and last offset of a block (next is end-of-block plus 1) size_t lastBlock = 0; 
size_t nextBlock = 0; while (true) { // ==================== start new block ==================== // read more bytes from input while (numRead - nextBlock < MaxBlockSize) { // buffer can be significantly smaller than MaxBLockSize, that's the only reason for this while-block size_t incoming = getBytes(&buffer[0], buffer.size()); if (incoming == 0) break; numRead += incoming; data.insert(data.end(), buffer.begin(), buffer.begin() + incoming); } // no more data ? => WE'RE DONE ! if (nextBlock == numRead) break; // determine block borders lastBlock = nextBlock; nextBlock += MaxBlockSize; // not beyond end-of-file if (nextBlock > numRead) nextBlock = numRead; size_t blockSize = nextBlock - lastBlock; // first byte of the currently processed block (std::vector data may contain the last 64k of the previous block, too) const unsigned char* dataBlock = &data[lastBlock - dataZero]; // ==================== full match finder ==================== // greedy mode is much faster but produces larger output bool isGreedy = (maxChainLength <= ShortChainsGreedy); // lazy evaluation: if there is a match, then try running match finder on next position, too, but not after that bool isLazy = !isGreedy && (maxChainLength <= ShortChainsLazy); // skip match finding on the next x bytes in greedy mode size_t skipMatches = 0; // allow match finding on the next byte but skip afterwards (in lazy mode) bool lazyEvaluation = false; std::vector<Match> matches(blockSize); // find longest matches for each position for (size_t i = 0; i < blockSize; i++) { // no matches at the end of the block (or matching disabled by command-line option -0 ) if (i + BlockEndNoMatch >= blockSize || uncompressed) continue; // detect self-matching if (i > 0 && dataBlock[i] == dataBlock[i - 1]) { Match prevMatch = matches[i - 1]; // predecessor had the same match ? 
if (prevMatch.distance == 1 && prevMatch.length > 1024) // TODO: handle very long self-referencing matches { // just copy predecessor without further (expensive) optimizations prevMatch.length--; matches[i] = prevMatch; continue; } } // read next four bytes uint32_t four = *(uint32_t*)(dataBlock + i); // convert to a shorter hash uint32_t hash = (four * HashMultiplier) >> HashShift; // get last occurrence of these bits uint32_t last = lastHash[hash]; // and store current position lastHash[hash] = i + lastBlock; // no predecessor or too far away ? uint32_t distance = i + lastBlock - last; if (last == NoPrevious || distance > MaxDistance) { previousHash [i % PreviousSize] = NoPrevious; previousExact[i % PreviousSize] = NoPrevious; continue; } // build hash chain, i.e. store distance to last match previousHash[i % PreviousSize] = distance; // skip pseudo-matches (hash collisions) and build a second chain where the first four bytes must match exactly while (distance != NoPrevious) { uint32_t curFour = *(uint32_t*)(&data[last - dataZero]); // may be in the previous block, too // actual match found, first 4 bytes are identical if (curFour == four) break; // prevent from accidently hopping on an old, wrong hash chain uint32_t curHash = (curFour * HashMultiplier) >> HashShift; if (curHash != hash) { distance = NoPrevious; break; } // try next pseudo-match Distance next = previousHash[last % PreviousSize]; // pointing to outdated hash chain entry ? distance += next; if (distance > MaxDistance) { previousHash[last % PreviousSize] = NoPrevious; distance = NoPrevious; break; } // closest match is out of range ? last -= next; if (next == NoPrevious || last < dataZero) { distance = NoPrevious; break; } } // no match at all ? 
if (distance == NoPrevious) { previousExact[i % PreviousSize] = NoPrevious; continue; } // store distance to previous match previousExact[i % PreviousSize] = distance; // TODO: long consecutive literals /*if (distance == 1 && i > 0) { if (matches[i - 1].length > MinMatch) { matches[i] = matches[i - 1]; //continue; } }*/ // skip match finding if in greedy mode if (skipMatches > 0) { skipMatches--; if (!lazyEvaluation) continue; lazyEvaluation = false; } // and look for longest match Match longest = findLongestMatch(&data[0], i + lastBlock, dataZero, nextBlock - BlockEndLiterals, &previousExact[0]); matches[i] = longest; // no match finding needed for the next few bytes in greedy/lazy mode if (longest.isMatch() && (isLazy || isGreedy)) { lazyEvaluation = (skipMatches == 0); skipMatches = longest.length; } } // ==================== estimate costs (number of compressed bytes) ==================== // not needed in greedy mode and/or very short blocks if (matches.size() > BlockEndNoMatch && maxChainLength > ShortChainsGreedy) estimateCosts(matches); // ==================== select best matches ==================== std::vector<unsigned char> block; if (!uncompressed) block = selectBestMatches(matches, &data[lastBlock - dataZero]); // ==================== output ==================== // automatically decide whether compressed or uncompressed size_t uncompressedSize = nextBlock - lastBlock; // did compression do harm ? bool useCompression = block.size() < uncompressedSize && !uncompressed; // block size uint32_t numBytes = useCompression ? block.size() : uncompressedSize; uint32_t numBytesTagged = numBytes | (useCompression ? 
0 : 0x80000000); unsigned char num1 = numBytesTagged & 0xFF; sendBytes(&num1, 1); unsigned char num2 = (numBytesTagged >> 8) & 0xFF; sendBytes(&num2, 1); unsigned char num3 = (numBytesTagged >> 16) & 0xFF; sendBytes(&num3, 1); unsigned char num4 = (numBytesTagged >> 24) & 0x7F; sendBytes(&num4, 1); if (useCompression) sendBytes(&block[0], numBytes); else // uncompressed ? => copy input data sendBytes(&data[lastBlock - dataZero], numBytes); // remove already processed data except for the last 64kb which could be used for intra-block matches if (data.size() > MaxDistance) { size_t remove = data.size() - MaxDistance; dataZero += remove; data.erase(data.begin(), data.begin() + remove); } } // add an empty block uint32_t zero = 0; sendBytes(&zero, 4); } };
Here is the code of a small test program:
show Test program // ////////////////////////////////////////////////////////// // smallz4.cpp // Copyright (c) 2016 Stephan Brumme. All rights reserved. // see http://create.stephan-brumme.com/smallz4/ // // "MIT License": // Permission is hereby granted, free of charge, to any person obtaining a copy // of this software and associated documentation files (the "Software"), // to deal in the Software without restriction, including without limitation // the rights to use, copy, modify, merge, publish, distribute, sublicense, // and/or sell copies of the Software, and to permit persons to whom the Software // is furnished to do so, subject to the following conditions: // // The above copyright notice and this permission notice shall be included // in all copies or substantial portions of the Software. // // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, // INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A // PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT // HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION // OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE // SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. #define _CRT_SECURE_NO_WARNINGS #include "smallz4.h" #include <stdio.h> // stdin/stdout/stderr, fopen, ... 
#ifdef _WIN32
  #include <io.h>      // isatty()
#else
  #include <unistd.h>  // isatty()
  #define _fileno fileno
  #define _isatty isatty
#endif

/// error handler
static void error(const char* msg)
{
  fprintf(stderr, "ERROR: %s\n", msg);
  exit(1);
}

// ==================== I/O INTERFACE ====================

/// input stream, usually stdin
FILE* in = 0;
/// read several bytes and store at "data", return number of actually read bytes (return only zero if end of data reached)
size_t getBytesFromIn(void* data, size_t numBytes)
{
  if (data && numBytes > 0)
    return fread(data, 1, numBytes, in);
  return 0;
}

/// output stream, usually stdout
FILE* out = 0;
/// write a block of bytes
void sendBytesToOut(const void* data, size_t numBytes)
{
  if (data && numBytes > 0)
    fwrite(data, 1, numBytes, out);
}

// ==================== COMMAND-LINE HANDLING ====================

// show simple help
static void showHelp(const char* program)
{
  printf("smalLZ4 %s: LZ4 compressor with optimal parsing, fully compatible with LZ4\n"
         "\n"
         "Basic usage:\n"
         "  %s [flags] [input] [output]\n"
         "\n"
         "This program writes to STDOUT if output isn't specified\n"
         "and reads from STDIN if input isn't specified, either.\n"
         "\n"
         "Examples:\n"
         "  %s < abc.txt > abc.txt.lz4      # use STDIN and STDOUT\n"
         "  %s abc.txt > abc.txt.lz4       # read from file and write to STDOUT\n"
         "  %s abc.txt abc.txt.lz4        # read from and write to file\n"
         "  cat abc.txt | %s - abc.txt.lz4 # read from STDIN and write to file\n"
         "  %s -6 abc.txt abc.txt.lz4     # compression level 6 (instead of default 9)\n"
         "  %s -f abc.txt abc.txt.lz4     # overwrite an existing file\n"
         "  %s -f7 abc.txt abc.txt.lz4    # compression level 7 and overwrite an existing file\n"
         "\n"
         "Flags:\n"
         "  -0, -1 ... -9  Set compression level, default: 9 (see below)\n"
         "  -h             Display this help message\n"
         "  -f             Overwrite an existing file\n"
         "\n"
         "Compression levels:\n"
         " -0               No compression\n"
         " -1 ... -%d        Greedy search, check 1 to %d matches\n"
         " -%d ... -8        Lazy matching with optimal parsing, check %d to 8 matches\n"
         " -9               Optimal parsing, check all possible matches\n"
         "\n"
         "(C) 2016 Stephan Brumme, http://create.stephan-brumme.com/smallz4/\n"
         , smallz4::getVersion()
         , program, program, program, program, program, program, program, program
         , smallz4::ShortChainsGreedy,     smallz4::ShortChainsGreedy
         , smallz4::ShortChainsGreedy + 1, smallz4::ShortChainsGreedy + 1);
}

/// parse command-line
int main(int argc, const char* argv[])
{
  // show help if no parameters and stdin isn't a pipe
  if (argc == 1 && _isatty(_fileno(stdin)) != 0)
  {
    showHelp(argv[0]);
    return 0;
  }

  unsigned int maxChainLength = 65536; // "unlimited" because search window contains only 2^16 bytes

  // overwrite output ?
  bool overwrite = false;

  // parse flags
  int nextArgument = 1;
  while (argc > nextArgument && argv[nextArgument][0] == '-')
  {
    int argPos = 1;
    while (argv[nextArgument][argPos] != '\0')
    {
      char flag = argv[nextArgument][argPos++];
      switch (flag)
      {
      // show help
      case 'h':
        showHelp(argv[0]);
        return 0;

      // force overwrite
      case 'f':
        overwrite = true;
        break;

      // set compression level
      case '0': case '1': case '2': case '3': case '4':
      case '5': case '6': case '7': case '8':
        maxChainLength = flag - '0'; // '0' => 0, '1' => 1, ..., '8' => 8
        break;

      // unlimited hash chain length
      case '9':
        // default maxChainLength is already "unlimited"
        break;

      default:
        error("unknown flag");
      }
    }

    nextArgument++;
  }

  // default input/output streams
  in  = stdin;
  out = stdout;

  // input file is given as first parameter or stdin if no parameter is given (or "-")
  if (argc > nextArgument && argv[nextArgument][0] != '-')
  {
    in = fopen(argv[nextArgument], "rb");
    if (!in)
      error("file not found");
    nextArgument++;
  }

  // output file is given as second parameter or stdout if no parameter is given (or "-")
  if (argc == nextArgument + 1 && argv[nextArgument][0] != '-')
  {
    // check if file already exists
    if (!overwrite && fopen(argv[nextArgument], "rb"))
      error("output file already exists");

    out = fopen(argv[nextArgument], "wb");
    if (!out)
      error("cannot create file");
  }

  // and go !
  smallz4::lz4(getBytesFromIn, sendBytesToOut, maxChainLength);
  return 0;
}

LZ4 decompressor (C99)

Features:
- a single, self-contained C99 file without external dependencies
- streaming: all I/O goes through two easily replaceable functions
- constant memory footprint: a 64 KByte history buffer plus a small read buffer

What's missing:
- skippable frames are not implemented (and most likely never will be)
- checksums are not verified
// //////////////////////////////////////////////////////////
// smallz4cat.c
// Copyright (c) 2016 Stephan Brumme. All rights reserved.
// see http://create.stephan-brumme.com/smallz4/
//
// "MIT License":
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"),
// to deal in the Software without restriction, including without limitation
// the rights to use, copy, modify, merge, publish, distribute, sublicense,
// and/or sell copies of the Software, and to permit persons to whom the Software
// is furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included
// in all copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
// INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
// PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
// HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
// OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
// SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

// This program is a shorter, more readable, albeit slower re-implementation of lz4cat ( https://github.com/Cyan4973/lz4 )

// Limitations:
// - skippable frames are not implemented (and most likely never will be)
// - checksums are not verified (see http://create.stephan-brumme.com/xxhash/ for a simple implementation)

// Replace getByteFromIn() and sendBytesToOut() by your own code if you need in-memory LZ4 decompression.
// Corrupted data causes a call to error().

#define _CRT_SECURE_NO_WARNINGS
#include <stdint.h> // uint32_t
#include <stdio.h>  // stdin/stdout/stderr, fopen, ...
#include <stdlib.h> // exit()
#include <string.h> // memcpy

/// error handler
void error(const char* msg)
{
  fprintf(stderr, "ERROR: %s\n", msg);
  exit(1);
}

// ==================== I/O INTERFACE ====================

// read one byte from input, see getByteFromIn()  for a basic implementation
typedef unsigned char (*GET_BYTE)  ();
// write several bytes,      see sendBytesToOut() for a basic implementation
typedef void          (*SEND_BYTES)(const unsigned char*, unsigned int);

/// input stream, usually stdin
static FILE* in = NULL;
/// read a single byte (with simple buffering)
static unsigned char getByteFromIn()
{
  // modify buffer size as you like ... for most use cases a bigger buffer isn't faster anymore - and even reducing it to 1 byte works !
  #define READ_BUFFER_SIZE 4*1024
  static unsigned char readBuffer[READ_BUFFER_SIZE];
  static unsigned int  pos       = 0;
  static unsigned int  available = 0;

  // refill buffer
  if (pos == available)
  {
    pos = 0;
    available = fread(readBuffer, 1, READ_BUFFER_SIZE, in);
    if (available == 0)
      error("out of data");
  }

  // return a byte
  return readBuffer[pos++];
}

/// output stream, usually stdout
static FILE* out = NULL;
/// write a block of bytes
static void sendBytesToOut(const unsigned char* data, unsigned int numBytes)
{
  if (data != NULL && numBytes > 0)
    fwrite(data, 1, numBytes, out);
}

// ==================== LZ4 DECOMPRESSOR ====================

/// decompress everything in input stream (accessed via getByte) and write to output stream (via sendBytes)
void unlz4(GET_BYTE getByte, SEND_BYTES sendBytes)
{
  // signature
  unsigned char signature1 = getByte();
  unsigned char signature2 = getByte();
  unsigned char signature3 = getByte();
  unsigned char signature4 = getByte();
  uint32_t signature = (signature4 << 24) | (signature3 << 16) | (signature2 << 8) | signature1;
  int isModern = (signature == 0x184D2204);
  int isLegacy = (signature == 0x184C2102);
  if (!isModern && !isLegacy)
    error("invalid signature");

  unsigned char hasBlockChecksum   = 0;
  unsigned char hasContentSize     = 0;
  unsigned char hasContentChecksum = 0;
  if (isModern)
  {
    // flags (version is ignored)
    unsigned char flags = getByte();
    hasBlockChecksum   = flags & 16;
    hasContentSize     = flags &  8;
    hasContentChecksum = flags &  4;

    // ignore blocksize
    getByte();

    if (hasContentSize)
    {
      // ignore, skip 8 bytes
      getByte(); getByte(); getByte(); getByte();
      getByte(); getByte(); getByte(); getByte();
    }

    // ignore header checksum (xxhash32 of everything up to this point & 0xFF)
    getByte();
  }

  // don't lower this value, backreferences can be 64kb far away
  #define HISTORY_SIZE 64*1024
  // contains the latest decoded data
  unsigned char history[HISTORY_SIZE];
  // next free position in history[]
  unsigned int  pos = 0;

  // parse all blocks until blockSize == 0
  while (1)
  {
    // block size
    uint32_t blockSize = getByte();
    blockSize |= (uint32_t)getByte() <<  8;
    blockSize |= (uint32_t)getByte() << 16;
    blockSize |= (uint32_t)getByte() << 24;

    // highest bit set ?
    unsigned char isCompressed = isLegacy || (blockSize & 0x80000000) == 0;
    if (isModern)
      blockSize &= 0x7FFFFFFF;

    // stop after last block
    if (blockSize == 0)
      break;

    if (isCompressed)
    {
      // decompress block
      uint32_t blockOffset = 0;
      while (blockOffset < blockSize)
      {
        // get a token
        unsigned char token = getByte();
        blockOffset++;

        // determine number of literals
        uint32_t numLiterals = (token >> 4) & 0x0F;
        if (numLiterals == 15)
        {
          // number of literals length encoded in more than 1 byte
          unsigned char current;
          do
          {
            current = getByte();
            numLiterals += current;
            blockOffset++;
          } while (current == 255);
        }

        blockOffset += numLiterals;
        // copy all those literals
        while (numLiterals-- > 0)
        {
          history[pos++] = getByte();

          // flush output buffer
          if (pos == HISTORY_SIZE)
          {
            sendBytes(history, HISTORY_SIZE);
            pos = 0;
          }
        }

        // last token has only literals
        if (blockOffset == blockSize)
          break;

        // match distance is encoded by two bytes (little endian)
        blockOffset += 2;
        uint32_t delta = getByte();
        delta |= (uint32_t)getByte() << 8;
        // zero isn't allowed
        if (delta == 0)
          error("invalid offset");

        // match length (must be >= 4, therefore length is stored minus 4)
        uint32_t matchLength = 4 + (token & 0x0F);
        if (matchLength == 4 + 0x0F)
        {
          unsigned char current;
          do // match length encoded in more than 1 byte
          {
            current = getByte();
            matchLength += current;
            blockOffset++;
          } while (current == 255);
        }

        // copy match
        uint32_t reference = (pos >= delta) ? pos - delta : HISTORY_SIZE + pos - delta;
        if (pos + matchLength < HISTORY_SIZE && reference + matchLength < HISTORY_SIZE)
        {
          // fast copy
          if (pos >= reference + matchLength || reference >= pos + matchLength)
          {
            // non-overlapping
            memcpy(history + pos, history + reference, matchLength);
            pos += matchLength;
          }
          else
          {
            // overlapping
            while (matchLength-- > 0)
              history[pos++] = history[reference++];
          }
        }
        else
        {
          // slower copy, have to take care of buffer limits
          while (matchLength-- > 0)
          {
            // copy single byte
            history[pos++] = history[reference++];

            // cannot write anymore ? => wrap around
            if (pos == HISTORY_SIZE)
            {
              // flush output buffer
              sendBytes(history, HISTORY_SIZE);
              pos = 0;
            }
            // cannot read anymore ? => wrap around
            if (reference == HISTORY_SIZE)
              reference = 0;
          }
        }
      }

      // all legacy blocks must be completely filled - except for the last one
      if (isLegacy && blockSize < 8*1024*1024)
        break;
    }
    else
    {
      // copy uncompressed data and add it to the history, too (the next block may be compressed and contain matches referring to it)
      while (blockSize-- > 0)
      {
        // copy a byte ...
        history[pos++] = getByte();
        // ... until buffer is full => send to output
        if (pos == HISTORY_SIZE)
        {
          sendBytes(history, HISTORY_SIZE);
          pos = 0;
        }
      }
    }

    if (hasBlockChecksum)
    {
      // ignore checksum, skip 4 bytes
      getByte(); getByte(); getByte(); getByte();
    }
  }

  if (hasContentChecksum)
  {
    // ignore checksum, skip 4 bytes
    getByte(); getByte(); getByte(); getByte();
  }

  // flush output buffer
  sendBytes(history, pos);
}

// ==================== COMMAND-LINE HANDLING ====================

/// parse command-line
int main(int argc, const char* argv[])
{
  // default input/output streams
  in  = stdin;
  out = stdout;

  // first command-line parameter is our input filename / but ignore "-" which stands for STDIN
  if (argc == 2 && !(argv[1][0] == '-' && argv[1][1] == '\0'))
  {
    in = fopen(argv[1], "rb");
    if (!in)
      error("file not found");
  }

  // and go !
  unlz4(getByteFromIn, sendBytesToOut);
  return 0;
}
Git users: scroll down to the repository link
Download  smallz4cat.c
Latest release: October 25, 2016, size: 9587 bytes, 319 lines

CRC32: b476f4c9
MD5: 1cda6e7c6683c16c76946969a74bd42f
SHA1: 38e139445f850d778fa46b622a6d131b293a9b86
SHA256: c006c8dccc97f2af461eb26e1caeba18c77ecc0606e6a139289d6b1e197978cd

Download  smallz4.h
Latest release: October 25, 2016, size: 23.5 kBytes, 680 lines

CRC32: 34382a7b
MD5: e9dd579f5db1745cd0425fd9b7350c3d
SHA1: f614c7ce31dd7aa223b23d4eed986f6f680305f3
SHA256: d81d710f95336be1474e45df511cace5b934c0811c8fea5c713a7dcd77c8d62f

Download  smallz4.cpp
Latest release: October 25, 2016, size: 6052 bytes, 190 lines

CRC32: 8cfc2657
MD5: ff5a16be02c351ae0032a0811d427bf0
SHA1: 502beea9c5fa6672ef0ad1e80d6ac34eeba2ece3
SHA256: 91a1b9e09b789dbcbd0d9e5862c6b1161e0bc9b4242bc26ca3ce58381f49906f

Stay up-to-date: git clone http://create.stephan-brumme.com/smallz4/.git

Windows executables (x64 only)

(compiled with CLang/LLVM 3.7.1: clang++ -O3 -s -Wall -pedantic)
Download  smallz4cat-v1.0.exe
Latest release: October 25, 2016, size: 18.5 kBytes

CRC32: 90160f8f
MD5: bdd845aa4e6c758753dfb505e352d23b
SHA1: 8fbdffd11d6a36a11fd8212c05b65719a8698cbf
SHA256: 98dfbdede24184426a6555e447b2b0fad89ccc5706079b15f2a884238d853c68

Download  smallz4-v1.0.exe
Latest release: November 2, 2016, size: 131.0 kBytes

CRC32: c8a2c84e
MD5: 5a6fc60a0a57dc545be66b8413b8e524
SHA1: 710f09c08c6acdd8b6b03392daf48ca69f7b88d7
SHA256: a3532b1034cb91253ee0996a4eea95192982b09873b0d6c6e08afb11a90bcd0f

Linux executables

(static linking: g++ -O3 -s -Wall -pedantic -static)
Download  smallz4-v1.0
Latest release: October 25, 2016, size: 1.4 MBytes

CRC32: d9723dc1
MD5: 1b5bd31bd75239b9343a40f7cf3af541
SHA1: e72fb94e42120df366e002bf83ea0c8fc83aaf69
SHA256: 0ded16f71c6dd9f7709da0baeab25caae4c1186c68417e0bb2b2e4391a7dfb27

Download  smallz4-x32-v1.0
Latest release: October 25, 2016, size: 571.6 kBytes

CRC32: b8f547b4
MD5: 6ec710f27bdf3b60d0dd402ebde7d351
SHA1: 41e5c168fcc73f24e94b6c607dce0078f1822658
SHA256: 271c56daa1091dc820a947509f89023792c42a590090a736f98a907c6adb0d7c
