xxHash - a hash algorithm as fast as memcpy

posted by Stephan Brumme

Numbers, please !

Yes, Yann Collet's xxHash algorithm can be faster than memcpy (Visual C++):
                                    throughput     bytes/CPU cycle
memcpy                              5.45 GByte/s   approx. 1.6
xxHash32 (Yann's implementation)    5.9 GByte/s    approx. 1.7
xxHash32 (my code)                   5.9 GByte/s    approx. 1.7
CRC32 Slicing-by-16                 3.2 GByte/s    approx. 1
Yes, xxHash is extremely fast - but keep in mind that memcpy has to read and write lots of bytes whereas this hashing algorithm reads everything but writes only a few bytes.

For comparison: memset achieves 8.4 GByte/s on the same Intel Core i7-2600K CPU @ 3.40GHz system.
More interesting: even pretty old versions of G++ ship a faster memcpy (7.7 GByte/s) and much, much faster intrinsics for memset (18.2 GByte/s).

Update August 8, 2016:
I wrote very similar code for xxHash64. It's super-fast on x64 but much slower when compiled for x32 systems.
Most of this posting will remain unchanged to avoid confusion - just keep in mind that every time xxHash32 touches 4 bytes, xxHash64 does pretty much the same with 8 bytes (with a few minor exceptions).
                                    throughput     bytes/CPU cycle
xxHash64 (Yann's implementation)    10.5 GByte/s   approx. 3.1
xxHash64 (my code)                   10.5 GByte/s   approx. 3.1

Algorithm

xxHash was designed from the ground up to be as fast as possible on modern CPUs.
It is not a strong cryptographic hash, such as the SHA family, but still passes the SMHasher test set with 10 points.

Most simple hashes, such as FNV (see my posting, too), step through the input data byte by byte.
Working on the byte at position 1 requires that all work on the previous byte (at position 0) has finished.
This dependency chain can cause the CPU to stall while it waits for the previous result.
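To make that dependency concrete, here is a minimal byte-by-byte hash (FNV-1a, not xxHash) - every iteration needs the hash value written by the previous one, so the loop is strictly serial:

```cpp
#include <stdint.h>
#include <stddef.h>

// FNV-1a, 32 bit: a typical byte-by-byte hash.
// Each iteration reads 'hash' as written by the previous iteration,
// so the CPU cannot start step i before step i-1 has finished.
uint32_t fnv1a(const void* data, size_t numBytes)
{
  const unsigned char* bytes = (const unsigned char*)data;
  uint32_t hash = 2166136261U;   // FNV offset basis
  for (size_t i = 0; i < numBytes; i++)
  {
    hash ^= bytes[i];            // depends on the previous value of 'hash'
    hash *= 16777619U;           // FNV prime
  }
  return hash;
}
```

Because each step feeds the next, the hash is order-sensitive - swapping two input bytes changes the result.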

Slightly better algorithms process a block of bytes at once, e.g. CRC Slicing-by-4/8/16 consumes 4/8/16 bytes per step instead of just one, giving a substantial speed-up.
However, we still have the same problem: working on block 1 requires that all work on block 0 has finished.

xxHash subdivides input data into four independent streams. Each stream processes a block of 4 bytes per step and stores a temporary state.
Only the final step combines these four states into a single one.

The major advantage is that the code generator has lots of opportunities to re-order opcodes to avoid latencies.
I drew a crude kindergarten-style image to visualize how memory is accessed when hashing 64 bytes:

[diagram: the 64 input bytes are dealt out round-robin in 4-byte blocks - within each 16-byte block, bytes 0-3 feed stream 0 (state[0]), bytes 4-7 stream 1 (state[1]), bytes 8-11 stream 2 (state[2]) and bytes 12-15 stream 3 (state[3])]
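That access pattern can be condensed into a short sketch. It uses xxHash32's real round constants and round function, but it is only a sketch: it handles inputs whose length is a multiple of 16 bytes and skips the real algorithm's tail handling and final bit mixing (see the full source code further down for those):

```cpp
#include <stdint.h>
#include <stddef.h>
#include <string.h>

// Condensed sketch of xxHash32's four-stream layout.
// Constants and the round function are the real ones; the finalization
// below is simplified (no tail bytes, no avalanche mixing).
static const uint32_t Prime1 = 2654435761U;
static const uint32_t Prime2 = 2246822519U;

static uint32_t rotateLeft(uint32_t x, unsigned char bits)
{
  return (x << bits) | (x >> (32 - bits));
}

// only handles numBytes that are a multiple of 16
uint32_t fourStreams(const void* data, size_t numBytes, uint32_t seed)
{
  const unsigned char* input = (const unsigned char*)data;
  // four independent states => four independent dependency chains
  uint32_t state[4] = { seed + Prime1 + Prime2, seed + Prime2, seed, seed - Prime1 };
  for (size_t block = 0; block + 16 <= numBytes; block += 16)
    for (int stream = 0; stream < 4; stream++)
    {
      uint32_t word;
      // stream i reads bytes 4i..4i+3 of each 16-byte block
      memcpy(&word, input + block + 4 * stream, 4);
      state[stream] = rotateLeft(state[stream] + word * Prime2, 13) * Prime1;
    }
  // only the final step combines the four states into a single value
  return rotateLeft(state[0],  1) + rotateLeft(state[1],  7) +
         rotateLeft(state[2], 12) + rotateLeft(state[3], 18);
}
```

Since the four `state[stream]` updates inside one block don't depend on each other, the CPU can execute them in parallel instead of waiting for a single serial chain.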

My C++ implementation

Yann's C code is heavily optimized for various environments: little and big endian, aligned and unaligned data, etc.
However, this makes his code hard to read. That's why I wrote my C++ implementation.

It's just 180 lines of code - all in a single header file (no .cpp) and heavily commented. No #ifdefs either - scroll down to see the full source code !

Typical usage:
uint32_t result2 = XXHash32::hash(mypointer, numBytes, myseed); // well, you need to #include "xxhash32.h", too ;-)
A streaming interface is provided as well:
uint32_t myseed = 0;
XXHash32 myhash(myseed);
myhash.add(pointerToSomeBytes,     numberOfBytes);
myhash.add(pointerToSomeMoreBytes, numberOfMoreBytes);
// call add() as often as you like to ...
// and compute hash:
uint32_t result = myhash.hash();
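The invariant behind any such streaming interface is that splitting the input across several add() calls gives the same hash as one big call. A deliberately simple stand-in shows the pattern - it uses FNV-1a instead of xxHash so the sketch stays self-contained (xxHash32's add()/hash() behave the same way, it just additionally has to buffer up to 15 leftover bytes between calls):

```cpp
#include <stdint.h>
#include <stddef.h>

// Minimal streaming hasher illustrating the add()/hash() pattern.
// add() only advances internal state, so feeding the data in chunks
// produces exactly the same result as feeding it all at once.
class StreamingFnv1a
{
public:
  StreamingFnv1a() : state(2166136261U) {}   // FNV offset basis
  void add(const void* input, size_t numBytes)
  {
    const unsigned char* data = (const unsigned char*)input;
    for (size_t i = 0; i < numBytes; i++)
      state = (state ^ data[i]) * 16777619U; // FNV prime
  }
  uint32_t hash() const { return state; }
private:
  uint32_t state;
};
```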

Under the hood, most of the time is spent in process():
static const uint32_t Prime1 = 2654435761U;
static const uint32_t Prime2 = 2246822519U;

/// rotate bits, should compile to a single CPU instruction (ROL)
static inline uint32_t rotateLeft(uint32_t x, unsigned char bits)
{
  return (x << bits) | (x >> (32 - bits));
}

/// process a block of 4x4 bytes, this is the main part of the XXHash32 algorithm
static inline void process(const void* data, uint32_t& state0, uint32_t& state1, uint32_t& state2, uint32_t& state3)
{
  const uint32_t* block = (const uint32_t*) data;
  state0 = rotateLeft(state0 + block[0] * Prime2, 13) * Prime1;
  state1 = rotateLeft(state1 + block[1] * Prime2, 13) * Prime1;
  state2 = rotateLeft(state2 + block[2] * Prime2, 13) * Prime1;
  state3 = rotateLeft(state3 + block[3] * Prime2, 13) * Prime1;
}
Git users: scroll down to the repository link
Download  xxhash32.h
Latest release: August 4, 2016, size: 5849 bytes, 180 lines

CRC32: 6a0b0b35
MD5: 3e979d57c13b37a5167d3a31216fe107
SHA1: 7b51e86ea10d3db60d4a278c9692ee74a244c1d8
SHA256: f92ad5012ff3d60cedc3a9f192a31a392775aa8c0d793468378defd310c83a67

Stay up-to-date: git clone http://create.stephan-brumme.com/xxhash/.git

If you encounter any bugs/problems or have ideas for improving future versions, please write me an email: create@stephan-brumme.com
Download  xxhash64.h
Latest release: August 8, 2016, size: 6398 bytes, 202 lines

CRC32: 8d395986
MD5: 9e4a45bc798b13b75430209fead9106d
SHA1: c1e60786270d8924081f4282c030ad55a7747fcc
SHA256: d8471addff709b135c68532bd7fe4eaadc2bfe1d9cb602cf5a4af5e901cfa63e




Limitations

Keep in mind that my code only gives proper results on little-endian architectures (such as Intel or modern ARM chips).
When fed with unaligned data, it could be MUCH slower than the reference implementation.
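One portable way around the alignment issue - an assumption-level sketch, not what my header actually does - is to read each 32-bit word via memcpy instead of casting the input pointer to uint32_t*; modern compilers turn the memcpy into a single MOV on x86/x64:

```cpp
#include <stdint.h>
#include <string.h>

// Portable unaligned 32-bit read: memcpy avoids the undefined behaviour
// of dereferencing a misaligned uint32_t* and compiles to a plain load
// on architectures that support unaligned access (such as x86/x64).
inline uint32_t readUnaligned32(const void* ptr)
{
  uint32_t word;
  memcpy(&word, ptr, 4);
  return word;
}
```

In the hot loop this would replace every `*(uint32_t*)data` / `block[i]` access; whether it costs any speed depends on the compiler recognizing the idiom.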

Influence of C++ optimizers

Visual C++'s assembler code turns out to be a bit faster than old G++ compilers for xxHash32, but G++ 4.9 (or newer) takes the lead back.
compiler          settings                          OS             throughput (my code)   throughput (original)
Visual C++ 2013   /Ox /Ot /Oi /Oy /Ob2 /arch:AVX2   Windows (x64)  5.9 GByte/s            5.9 GByte/s
G++ 4.4.7         -O3 -march=native                 Linux (x64)    5.7 GByte/s            5.1 GByte/s
G++ 4.7.2         -O3 -march=native                 Linux (x64)    5.7 GByte/s            5.7 GByte/s
G++ 4.9.2         -Ofast -march=native              Linux (x64)    6.1 GByte/s            6.1 GByte/s
G++ 5.3.1         -Ofast -march=native              Linux (x64)    6.1 GByte/s            6.1 GByte/s
clang++ 3.4.2     -O3 -march=native                 Linux (x64)    5.9 GByte/s            5.9 GByte/s
And the same measurements for xxHash64 (no x64 Visual C++ numbers):
compiler          settings               OS           throughput (my code)   throughput (original)
G++ 4.4.7         -O3 -march=native      Linux (x64)  10.5 GByte/s           10.5 GByte/s
G++ 4.7.2         -O3 -march=native      Linux (x64)  10.4 GByte/s           10.4 GByte/s
G++ 4.9.2         -Ofast -march=native   Linux (x64)  8.5 GByte/s            9.1 GByte/s
G++ 5.3.1         -Ofast -march=native   Linux (x64)  9.1 GByte/s            8.8 GByte/s
clang++ 3.4.2     -O3 -march=native      Linux (x64)  8.7 GByte/s            8.2 GByte/s
When compiling xxHash64 for x32 targets you will suffer severe performance degradation: expect a throughput of about 2 GByte/s.

Source code

And here comes the full source code (see above for direct download or Git repository):
// //////////////////////////////////////////////////////////
// xxhash32.h
// Copyright (c) 2016 Stephan Brumme. All rights reserved.
// see http://create.stephan-brumme.com/disclaimer.html
//

#pragma once

#include <stdint.h> // for uint32_t and uint64_t

/// XXHash (32 bit), based on Yann Collet's descriptions, see http://cyan4973.github.io/xxHash/
/** How to use:
    uint32_t myseed = 0;
    XXHash32 myhash(myseed);
    myhash.add(pointerToSomeBytes,     numberOfBytes);
    myhash.add(pointerToSomeMoreBytes, numberOfMoreBytes); // call add() as often as you like to ...
    // and compute hash:
    uint32_t result = myhash.hash();

    // or all of the above in one single line:
    uint32_t result2 = XXHash32::hash(mypointer, numBytes, myseed);

    Note: my code is NOT endian-aware !
**/
class XXHash32
{
public:
  /// create new XXHash (32 bit)
  /** @param seed your seed value, even zero is a valid seed and e.g. used by LZ4 **/
  explicit XXHash32(uint32_t seed)
  {
    state[0] = seed + Prime1 + Prime2;
    state[1] = seed + Prime2;
    state[2] = seed;
    state[3] = seed - Prime1;
    bufferSize  = 0;
    totalLength = 0;
  }

  /// add a chunk of bytes
  /** @param  input  pointer to a continuous block of data
      @param  length number of bytes
      @return false if parameters are invalid / zero **/
  bool add(const void* input, uint64_t length)
  {
    // no data ?
    if (!input || length == 0)
      return false;

    totalLength += length;
    // byte-wise access
    const unsigned char* data = (const unsigned char*)input;

    // unprocessed old data plus new data still fit in temporary buffer ?
    if (bufferSize + length < MaxBufferSize)
    {
      // just add new data
      while (length-- > 0)
        buffer[bufferSize++] = *data++;
      return true;
    }

    // point beyond last byte
    const unsigned char* stop      = data + length;
    const unsigned char* stopBlock = stop - MaxBufferSize;

    // some data left from previous update ?
    if (bufferSize > 0)
    {
      // make sure temporary buffer is full (16 bytes)
      while (bufferSize < MaxBufferSize)
        buffer[bufferSize++] = *data++;

      // process these 16 bytes (4x4)
      process(buffer, state[0], state[1], state[2], state[3]);
    }

    // copying state to local variables helps optimizer A LOT
    uint32_t s0 = state[0], s1 = state[1], s2 = state[2], s3 = state[3];
    // 16 bytes at once
    while (data <= stopBlock)
    {
      // local variables s0..s3 instead of state[0]..state[3] are much faster
      process(data, s0, s1, s2, s3);
      data += 16;
    }
    // copy back
    state[0] = s0; state[1] = s1; state[2] = s2; state[3] = s3;

    // copy remainder to temporary buffer
    bufferSize = stop - data;
    for (unsigned int i = 0; i < bufferSize; i++)
      buffer[i] = data[i];

    // done
    return true;
  }

  /// get current hash
  /** @return 32 bit XXHash **/
  uint32_t hash() const
  {
    uint32_t result = (uint32_t)totalLength;

    // fold 128 bit state into one single 32 bit value
    if (totalLength >= MaxBufferSize)
      result += rotateLeft(state[0],  1) +
                rotateLeft(state[1],  7) +
                rotateLeft(state[2], 12) +
                rotateLeft(state[3], 18);
    else
      // internal state wasn't set in add(), therefore original seed is still stored in state2
      result += state[2] + Prime5;

    // process remaining bytes in temporary buffer
    const unsigned char* data = buffer;
    // point beyond last byte
    const unsigned char* stop = data + bufferSize;

    // at least 4 bytes left ? => eat 4 bytes per step
    for (; data + 4 <= stop; data += 4)
      result = rotateLeft(result + *(uint32_t*)data * Prime3, 17) * Prime4;

    // take care of remaining 0..3 bytes, eat 1 byte per step
    while (data != stop)
      result = rotateLeft(result + (*data++) * Prime5, 11) * Prime1;

    // mix bits
    result ^= result >> 15;
    result *= Prime2;
    result ^= result >> 13;
    result *= Prime3;
    result ^= result >> 16;
    return result;
  }

  /// combine constructor, add() and hash() in one static function (C style)
  /** @param  input  pointer to a continuous block of data
      @param  length number of bytes
      @param  seed   your seed value, e.g. zero is a valid seed and used by LZ4
      @return 32 bit XXHash **/
  static uint32_t hash(const void* input, uint64_t length, uint32_t seed)
  {
    XXHash32 hasher(seed);
    hasher.add(input, length);
    return hasher.hash();
  }

private:
  /// magic constants :-)
  static const uint32_t Prime1 = 2654435761U;
  static const uint32_t Prime2 = 2246822519U;
  static const uint32_t Prime3 = 3266489917U;
  static const uint32_t Prime4 =  668265263U;
  static const uint32_t Prime5 =  374761393U;

  /// temporarily store up to 15 bytes between multiple add() calls
  static const uint32_t MaxBufferSize = 15+1;

  // internal state and temporary buffer
  uint32_t      state[4]; // state[2] == seed if totalLength < MaxBufferSize
  unsigned char buffer[MaxBufferSize];
  unsigned int  bufferSize;
  uint64_t      totalLength;

  /// rotate bits, should compile to a single CPU instruction (ROL)
  static inline uint32_t rotateLeft(uint32_t x, unsigned char bits)
  {
    return (x << bits) | (x >> (32 - bits));
  }

  /// process a block of 4x4 bytes, this is the main part of the XXHash32 algorithm
  static inline void process(const void* data, uint32_t& state0, uint32_t& state1, uint32_t& state2, uint32_t& state3)
  {
    const uint32_t* block = (const uint32_t*) data;
    state0 = rotateLeft(state0 + block[0] * Prime2, 13) * Prime1;
    state1 = rotateLeft(state1 + block[1] * Prime2, 13) * Prime1;
    state2 = rotateLeft(state2 + block[2] * Prime2, 13) * Prime1;
    state3 = rotateLeft(state3 + block[3] * Prime2, 13) * Prime1;
  }
};
And here is xxHash64's source code:
// //////////////////////////////////////////////////////////
// xxhash64.h
// Copyright (c) 2016 Stephan Brumme. All rights reserved.
// see http://create.stephan-brumme.com/disclaimer.html
//

#pragma once

#include <stdint.h> // for uint32_t and uint64_t

/// XXHash (64 bit), based on Yann Collet's descriptions, see http://cyan4973.github.io/xxHash/
/** How to use:
    uint64_t myseed = 0;
    XXHash64 myhash(myseed);
    myhash.add(pointerToSomeBytes,     numberOfBytes);
    myhash.add(pointerToSomeMoreBytes, numberOfMoreBytes); // call add() as often as you like to ...
    // and compute hash:
    uint64_t result = myhash.hash();

    // or all of the above in one single line:
    uint64_t result2 = XXHash64::hash(mypointer, numBytes, myseed);

    Note: my code is NOT endian-aware !
**/
class XXHash64
{
public:
  /// create new XXHash (64 bit)
  /** @param seed your seed value, even zero is a valid seed **/
  explicit XXHash64(uint64_t seed)
  {
    state[0] = seed + Prime1 + Prime2;
    state[1] = seed + Prime2;
    state[2] = seed;
    state[3] = seed - Prime1;
    bufferSize  = 0;
    totalLength = 0;
  }

  /// add a chunk of bytes
  /** @param  input  pointer to a continuous block of data
      @param  length number of bytes
      @return false if parameters are invalid / zero **/
  bool add(const void* input, uint64_t length)
  {
    // no data ?
    if (!input || length == 0)
      return false;

    totalLength += length;
    // byte-wise access
    const unsigned char* data = (const unsigned char*)input;

    // unprocessed old data plus new data still fit in temporary buffer ?
    if (bufferSize + length < MaxBufferSize)
    {
      // just add new data
      while (length-- > 0)
        buffer[bufferSize++] = *data++;
      return true;
    }

    // point beyond last byte
    const unsigned char* stop      = data + length;
    const unsigned char* stopBlock = stop - MaxBufferSize;

    // some data left from previous update ?
    if (bufferSize > 0)
    {
      // make sure temporary buffer is full (32 bytes)
      while (bufferSize < MaxBufferSize)
        buffer[bufferSize++] = *data++;

      // process these 32 bytes (4x8)
      process(buffer, state[0], state[1], state[2], state[3]);
    }

    // copying state to local variables helps optimizer A LOT
    uint64_t s0 = state[0], s1 = state[1], s2 = state[2], s3 = state[3];
    // 32 bytes at once
    while (data <= stopBlock)
    {
      // local variables s0..s3 instead of state[0]..state[3] are much faster
      process(data, s0, s1, s2, s3);
      data += 32;
    }
    // copy back
    state[0] = s0; state[1] = s1; state[2] = s2; state[3] = s3;

    // copy remainder to temporary buffer
    bufferSize = stop - data;
    for (unsigned int i = 0; i < bufferSize; i++)
      buffer[i] = data[i];

    // done
    return true;
  }

  /// get current hash
  /** @return 64 bit XXHash **/
  uint64_t hash() const
  {
    // fold 256 bit state into one single 64 bit value
    uint64_t result;
    if (totalLength >= MaxBufferSize)
    {
      result = rotateLeft(state[0],  1) +
               rotateLeft(state[1],  7) +
               rotateLeft(state[2], 12) +
               rotateLeft(state[3], 18);
      result = (result ^ processSingle(0, state[0])) * Prime1 + Prime4;
      result = (result ^ processSingle(0, state[1])) * Prime1 + Prime4;
      result = (result ^ processSingle(0, state[2])) * Prime1 + Prime4;
      result = (result ^ processSingle(0, state[3])) * Prime1 + Prime4;
    }
    else
    {
      // internal state wasn't set in add(), therefore original seed is still stored in state2
      result = state[2] + Prime5;
    }

    result += totalLength;

    // process remaining bytes in temporary buffer
    const unsigned char* data = buffer;
    // point beyond last byte
    const unsigned char* stop = data + bufferSize;

    // at least 8 bytes left ? => eat 8 bytes per step
    for (; data + 8 <= stop; data += 8)
      result = rotateLeft(result ^ processSingle(0, *(uint64_t*)data), 27) * Prime1 + Prime4;

    // 4 bytes left ? => eat those
    if (data + 4 <= stop)
    {
      result = rotateLeft(result ^ (*(uint32_t*)data) * Prime1, 23) * Prime2 + Prime3;
      data += 4;
    }

    // take care of remaining 0..3 bytes, eat 1 byte per step
    while (data != stop)
      result = rotateLeft(result ^ (*data++) * Prime5, 11) * Prime1;

    // mix bits
    result ^= result >> 33;
    result *= Prime2;
    result ^= result >> 29;
    result *= Prime3;
    result ^= result >> 32;
    return result;
  }

  /// combine constructor, add() and hash() in one static function (C style)
  /** @param  input  pointer to a continuous block of data
      @param  length number of bytes
      @param  seed   your seed value, e.g. zero is a valid seed
      @return 64 bit XXHash **/
  static uint64_t hash(const void* input, uint64_t length, uint64_t seed)
  {
    XXHash64 hasher(seed);
    hasher.add(input, length);
    return hasher.hash();
  }

private:
  /// magic constants :-)
  static const uint64_t Prime1 = 11400714785074694791ULL;
  static const uint64_t Prime2 = 14029467366897019727ULL;
  static const uint64_t Prime3 =  1609587929392839161ULL;
  static const uint64_t Prime4 =  9650029242287828579ULL;
  static const uint64_t Prime5 =  2870177450012600261ULL;

  /// temporarily store up to 31 bytes between multiple add() calls
  static const uint64_t MaxBufferSize = 31+1;

  uint64_t      state[4];
  unsigned char buffer[MaxBufferSize];
  unsigned int  bufferSize;
  uint64_t      totalLength;

  /// rotate bits, should compile to a single CPU instruction (ROL)
  static inline uint64_t rotateLeft(uint64_t x, unsigned char bits)
  {
    return (x << bits) | (x >> (64 - bits));
  }

  /// process a single 64 bit value
  static inline uint64_t processSingle(uint64_t previous, uint64_t input)
  {
    return rotateLeft(previous + input * Prime2, 31) * Prime1;
  }

  /// process a block of 4x8 bytes, this is the main part of the XXHash64 algorithm
  static inline void process(const void* data, uint64_t& state0, uint64_t& state1, uint64_t& state2, uint64_t& state3)
  {
    const uint64_t* block = (const uint64_t*) data;
    state0 = processSingle(state0, block[0]);
    state1 = processSingle(state1, block[1]);
    state2 = processSingle(state2, block[2]);
    state3 = processSingle(state3, block[3]);
  }
};