
# Parallel Sieve of Eratosthenes [updated]

posted by Stephan Brumme, updated

## Processing 1 Billion Numbers Per Second

There are 50,847,534 prime numbers between 2 and 1,000,000,000. For reasons unknown, I wanted to know whether a C++ implementation can find all these numbers on my desktop computer (Intel Core i7 860, quad-core, hyperthreading, 2.8GHz) in less than 1 second with a simple algorithm such as the Sieve of Eratosthenes.

By only slightly modifying the basic algorithm and unleashing the raw power of OpenMP, I achieved a 12x speed-up.
In the end, my improved sieve finishes after 1.1 seconds - mission almost accomplished.

UPDATE October 6, 2011: After adding a simple test for divisibility by 3 (and 5, 7, 9, 11, 13), the NEW block-wise sieve gets much faster and swiftly breaks through the 1-second barrier.

UPDATE July 7, 2014: Many browsers support asm.js to boost Javascript performance. Scroll down to find my asm.js implementation.
| Algorithm | OpenMP | from 2 to 1 billion | Time | Speed-Up | CPU ticks (approx.) | Notes |
|---|---|---|---|---|---|---|
| Basic Algorithm | no | 50,847,534 primes | 13.5 seconds | 1.0x | 40 billion | shortest code (22 lines) |
| Odd Numbers Only | no | 50,847,534 primes | 6.7 seconds | 2.0x | 20 billion | only 10% more code than basic algorithm |
| Odd Numbers Only | yes, 8 threads | 50,847,534 primes | 5.4 seconds | 2.5x | 130 billion | |
| Block-Wise NEW | no | 50,847,534 primes | 2.4 seconds | 5.6x | 7 billion | most efficient |
| Block-Wise NEW | yes, 8 threads | 50,847,534 primes | 0.65 seconds | 20.7x | 16 billion | fastest |
Note: Much more complex algorithms easily outperform my code. My primary focus was parallelism in general, not prime numbers!
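For reference, a minimal sketch of the basic algorithm looks like this (this is deliberately the simple, unoptimized variant - my tuned versions store odd numbers only and sieve block-wise to stay inside the CPU caches):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// count primes in [2, maxNumber] with the basic Sieve of Eratosthenes
size_t countPrimes(size_t maxNumber)
{
  // one flag per number, initially everything is a prime candidate
  std::vector<bool> isPrime(maxNumber + 1, true);
  isPrime[0] = false;
  isPrime[1] = false;

  // cross out all multiples; starting at i*i is enough because
  // smaller multiples were already crossed out by smaller primes
  for (size_t i = 2; i * i <= maxNumber; i++)
    if (isPrime[i])
      for (size_t multiple = i * i; multiple <= maxNumber; multiple += i)
        isPrime[multiple] = false;

  // count the survivors
  size_t count = 0;
  for (size_t i = 2; i <= maxNumber; i++)
    if (isPrime[i])
      count++;
  return count;
}
```

The expected counts match the reference table above, e.g. 78,498 primes below one million.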

## Live Demo

I ported the short Odd Numbers Only algorithm to Javascript and PHP; the code can be seen at the bottom of this page. The table below shows the expected number of primes for some reference inputs and the performance numbers measured on my computer.


In this benchmark, C++ is approx. 4x faster than the best Javascript engine (Mozilla Firefox) and approx. 100x faster than PHP 5.3. Only the PHP times marked with (*) are measured on my local computer while all other PHP timings were measured on the create.stephan-brumme.com server (which is about 25% slower).

All browsers are still 32-bit programs - while my operating system is already 64-bit - and run out of memory when attempting to find all prime numbers from 1 to 1 billion. The more complex block-wise sieve does not suffer from this problem, but I was too lazy to port it, too.
Multithreading is currently supported by neither Javascript nor PHP.
| from 2 to ... | primes | C++ | Internet Explorer 10 | Firefox 18 | Opera 12 | Chrome 24 | PHP 5.3 |
|---|---|---|---|---|---|---|---|
| 10 | 4 | <0.001s | <0.001s | <0.001s | <0.001s | <0.001s | <0.001s |
| 100 | 25 | <0.001s | <0.001s | <0.001s | <0.001s | <0.001s | <0.001s |
| 1,000 | 168 | <0.001s | <0.001s | <0.001s | <0.001s | <0.001s | <0.001s |
| 10,000 | 1,229 | <0.001s | <0.001s | <0.001s | <0.001s | <0.001s | 0.003s |
| 100,000 | 9,592 | <0.001s | 0.001s | 0.001s | 0.002s | 0.001s | 0.027s |
| 1,000,000 | 78,498 | 0.002s | 0.010s | 0.007s | 0.022s | 0.015s | 0.393s |
| 10,000,000 | 664,579 | 0.024s | 0.139s | 0.103s | 0.248s | 0.190s | 3.717s |
| 1,000,000,000 | 50,847,534 | 6.7s | n/a | n/a | n/a | n/a | n/a |
For your interest: the browser ranking is the opposite of what I observed with my Javascript global illumination port of smallpt. I expected PHP to be faster, too.

UPDATE February 14, 2013: More than a year after my initial tests, Firefox became the fastest browser due to its vastly improved Javascript engine.

# Windows and Linux Code Timing

posted by Stephan Brumme

Whenever I have to do some performance testing, I go for the simplest approach: write a short program and run it within my Linux bash shell, which comes with a built-in `time` command:
```
$ time ./testcode

real    0m0.788s
user    0m0.696s
sys     0m0.080s
```
Obviously there can (and will) be some overhead from the operating system, so the numbers are typically a bit off.
The timing resolution isn't that great, either, and often lies above 10ms.

That's why I wrote a simple C routine to measure performance inside the code. My goal was to have numbers that are accurate down to about 1 microsecond (10⁻⁶ seconds).
I didn't opt for extremely high timing precision (count CPU ticks) because I rarely need that degree of precision and really want to avoid the tricky details of CPU frequency scaling, cache stalls etc.

Another feature is portability; simply recompile the code and run it on Windows and Linux. The code works fine with GCC as well as Visual C++.

My timing code consists of just a single function `seconds()` returning a `double`. You can invoke it in your programs like this:
```
double start = seconds();

// ... your code that should be measured ...

double duration = seconds() - start;
```
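A minimal sketch of such a `seconds()` function could look like this, assuming `QueryPerformanceCounter` on Windows and `gettimeofday` on Linux (the implementation actually used on this site may differ in details):

```cpp
#include <cstddef>
#ifdef _WIN32
#include <windows.h>
#else
#include <sys/time.h>
#endif

/// return a wall-clock timestamp in seconds, roughly microsecond resolution
double seconds()
{
#ifdef _WIN32
  // high-resolution counter, converted from ticks to seconds
  LARGE_INTEGER frequency, now;
  QueryPerformanceFrequency(&frequency);
  QueryPerformanceCounter(&now);
  return now.QuadPart / double(frequency.QuadPart);
#else
  // seconds + microseconds since the Unix epoch
  timeval now;
  gettimeofday(&now, NULL);
  return now.tv_sec + now.tv_usec / 1000000.0;
#endif
}
```

Differences of two such timestamps give the elapsed wall-clock time, which is exactly what the `double duration = seconds() - start;` pattern above relies on.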

# Portable C++ Hashing Library

posted by Stephan Brumme, updated

## Introduction

• computes CRC32, MD5, SHA1 and SHA256 (the most common member of the SHA2 family), Keccak and its SHA3 sibling
• optional HMAC (keyed-hash message authentication code)
• no external dependencies, small code size
• can work chunk-wise (for example when reading streams block-by-block)
• portable: supports Windows and Linux, tested on Little Endian and Big Endian CPUs
• roughly as fast as Linux core hashing functions

## Usage

Since there are no external dependencies, not even dependencies within the library itself, all you need is to include the header of the hashing algorithm of your choice and add the identically named `.cpp` to your project. That means if you want to add SHA256 hashing to your C++ program, you only have to include `sha256.h` and `sha256.cpp` in your project.

If you prefer SHA256, then your code will look like this:
```
// SHA2 test program
#include "sha256.h"
#include <iostream> // for std::cout only, not needed for hashing library

int main(int, char**)
{
  // create a new hashing object
  SHA256 sha256;

  // hashing an std::string
  std::cout << sha256("Hello World") << std::endl;
  // => a591a6d40bf420404a011733cfb7b190d62c65bf0bcda32b57b277d9ad9f146e

  // hashing a buffer of bytes
  const char* buffer = "How are you";
  std::cout << sha256(buffer, 11) << std::endl;
  // => 9c7d5b046878838da72e40ceb3179580958df544b240869b80d0275cc07209cc

  // or in a streaming fashion (re-use "How are you")
  SHA256 sha256stream;
  const char* url = "create.stephan-brumme.com"; // 25 bytes
  int step = 5;
  for (int i = 0; i < 25; i += step)
    sha256stream.add(url + i, step); // add five bytes at a time
  std::cout << sha256stream.getHash() << std::endl;
  // => 82aa771f1183c52f973c798c9243a1c73833ea40961c73e55e12430ec77b69f6

  return 0;
}
```

# Sorting STL Containers

posted by Stephan Brumme, updated

## std::sort

Yes, there is `std::sort`. Yes, it's pretty much the most efficient way to sort generic STL containers. No, it's not perfect.

## Improving std::sort

What if the data is already almost sorted? `std::sort` doesn't benefit much from processing already sorted data. However, it's quite common to have (partially) sorted data.

`std::sort` requires `RandomAccessIterator` which are sometimes not available for custom containers.

What about `std::list` ? The syntax is completely different:
```
std::list<int> mylist;
// ... fill mylist with data ...

// compiler error:
std::sort(mylist.begin(), mylist.end());
// must use this instead:
mylist.sort();
```
That's why I wrote some sorting routines on my own to overcome all those limitations:
• can sort containers that provide only basic `ForwardIterator`s (Bubble Sort and Selection Sort)
• most routines work with `BidirectionalIterator`s
• therefore seamless integration of `std::list` (and all other STL containers)
• data types only need a less-than comparison (`operator<`), which can be user-defined, too
The sorting algorithms are ordered more or less by overall performance (slowest first). As you can see, faster algorithms have stricter requirements:
| Algorithm | Stable ? | Iterator | Notes |
|---|---|---|---|
| Bubble Sort | yes | forward iterator | code shown here requires bidirectional iterator, just remove both instances of `--last;` (will be slower then) |
| Selection Sort | no | forward iterator | |
| Insertion Sort | yes | bidirectional iterator | |
| Shell Sort | no | bidirectional iterator | `std::distance` and `std::advance` benefit heavily from random access iterators |
| Heap Sort | no | random-access iterator | my n-ary Heap Sort implementation works with bidirectional iterators, too |
| Merge Sort | yes | random-access iterator | requires additional memory |
| Merge Sort (in-place) | yes | forward iterator | no additional memory but slower than normal merge sort |
| Quick Sort | no | bidirectional iterator | `std::distance` and `std::advance` benefit heavily from random access iterators |
| Intro Sort | no | bidirectional iterator | `std::distance` and `std::advance` benefit heavily from random access iterators |
| std::sort | no | random-access iterator | |
| std::stable_sort | yes | random-access iterator | |
The algorithms are nothing new, they can be found in every decent book on algorithms and, of course, online (e.g. Wikipedia).
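To illustrate the iterator requirements, here is a minimal sketch of an insertion sort that needs only bidirectional iterators and `operator<`, so it works with `std::list` out of the box (a simplified version, not necessarily identical to my routines):

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <list>

// insertion sort over bidirectional iterators,
// only requires operator< on the value type
template <typename BidirectionalIterator>
void insertionSort(BidirectionalIterator first, BidirectionalIterator last)
{
  if (first == last)
    return;

  for (BidirectionalIterator current = std::next(first); current != last; ++current)
  {
    // element to be inserted into the already sorted range [first, current)
    auto value = *current;

    // shift bigger elements one slot to the right
    BidirectionalIterator pos = current;
    while (pos != first && value < *std::prev(pos))
    {
      *pos = *std::prev(pos);
      --pos;
    }
    *pos = value;
  }
}
```

Because nothing here needs random access, the same template instantiates for `std::list<int>::iterator` just as well as for `std::vector<int>::iterator`.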

# Portable Memory Mapping C++ Class

posted by Stephan Brumme

## Parsing Files The Easy Way

Recently I had to do a lot with file loading and parsing. It can be especially tricky to come up with a fast solution if you have to jump around within these files. Seeking and proper buffering were my biggest problems.

Memory mapping is one of the nicest features of modern operating systems: after opening a file in memory-mapped mode you can treat the file as a large chunk of memory and use plain pointers. The operating system takes care of loading the data on demand (!) into memory - utilizing caches, of course. When using my C++ class `MemoryMapped` it's really easy:
```
// open file
MemoryMapped data("myfile.txt");

// read byte at file offset 2345
unsigned char a = data[2345];

// or if you prefer pointers
const short* raw = (const short*) data.getData();
short b = raw[300];
```
Windows is completely different from Linux when it comes to opening a file in memory-mapped mode. The class `MemoryMapped` hides all the OS specific stuff in only two files: `MemoryMapped.h` and `MemoryMapped.cpp`.
Both files compile without any warnings with GCC 4.7 and Visual C++ 2010. I haven't tried other compilers, but they should be able to handle it, too, even somewhat older ones.
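For illustration, a stripped-down sketch of the Linux side (POSIX `mmap`) shows the basic idea; the class name `MiniMap` is made up for this example, and the real `MemoryMapped` class additionally wraps the Windows API, error handling and re-mapping:

```cpp
#include <cstddef>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// minimal read-only memory mapping on Linux (hypothetical class, for illustration only)
class MiniMap
{
public:
  explicit MiniMap(const char* filename)
  : _data(NULL), _size(0)
  {
    int fd = open(filename, O_RDONLY);
    if (fd < 0)
      return;

    struct stat info;
    if (fstat(fd, &info) == 0 && info.st_size > 0)
    {
      _size = (size_t)info.st_size;
      void* mapped = mmap(NULL, _size, PROT_READ, MAP_PRIVATE, fd, 0);
      if (mapped != MAP_FAILED)
        _data = static_cast<const unsigned char*>(mapped);
    }
    close(fd); // the mapping stays valid after closing the descriptor
  }

  ~MiniMap()
  {
    if (_data)
      munmap(const_cast<unsigned char*>(_data), _size);
  }

  // treat the file as a large chunk of memory
  unsigned char operator[](size_t offset) const { return _data[offset]; }
  const void*   getData() const { return _data; }
  size_t        size()    const { return _size; }

private:
  const unsigned char* _data;
  size_t               _size;
};
```

The operating system pages the file contents in on demand, so even huge files "open" instantly.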

# Testing Endian Aware Code

posted by Stephan Brumme

## Turning Around

CPUs store numbers in two different ways: little or big endian. The majority of CPUs access memory in a little endian way, which seems to have the bytes mixed up the wrong way. The memory layout of the hexadecimal number 0x12345678 looks like this:
| | byte 0 | byte 1 | byte 2 | byte 3 | Typical CPUs |
|---|---|---|---|---|---|
| Little Endian | 0x78 | 0x56 | 0x34 | 0x12 | Intel/AMD x86, x64 |
| Big Endian | 0x12 | 0x34 | 0x56 | 0x78 | PowerPC, ARM* |
Note: ARM chips often support little endian, too.
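A common way to detect the host's byte order at runtime is to store a known 32-bit value and inspect its first byte in memory, exactly as in the table above (a minimal sketch):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// runtime check of the host's byte order:
// 0x12345678 starts with byte 0x78 on little endian, 0x12 on big endian
bool isLittleEndian()
{
  const uint32_t probe = 0x12345678;
  unsigned char firstByte;
  std::memcpy(&firstByte, &probe, 1); // look at byte 0 only
  return firstByte == 0x78;
}
```

Such a check is handy in unit tests that should run unchanged on both the x86 host and the emulated PowerPC guest.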

## Virtual Machines / QEMU

I don't have a big endian computer at home (my ARM chips are little endian by default) but some of my source code required testing that everything works as expected on big endian machines, too. That's when I learnt about QEMU, a great open-source processor emulator. It's available on Unix and Windows.

This posting explains how to set up QEMU and run Debian Wheezy on an emulated big endian PowerPC.

# Source Code License

posted by Stephan Brumme

I received many kind requests from companies willing to use code from create.stephan-brumme.com in commercial products.
I publish my code because I believe that sharing knowledge makes this world a better place. Therefore all code will be under a zlib-style license now.
This applies to all past and future postings.

The full license can be found at http://create.stephan-brumme.com/disclaimer.html. In short, you can do with it whatever you want, even build commercial or military software, but please keep my name in the source code (however, there is no need to display it to the end user in the shipped product).
```
// //////////////////////////////////////////////////////////
// Copyright (c) 2013 Stephan Brumme. All rights reserved.
// see http://create.stephan-brumme.com/disclaimer.html
//
```
It would be great if you publish your code, too, but that's not a requirement.

I'd appreciate an email with a short description of your project / product.

# Git Repositories

posted by Stephan Brumme

## What Is Git ?!

If you don't know `git`, grab one of the many tutorials on the web.

## Using My Git Repositories

```
git clone http://create.stephan-brumme.com/crc32/.git
```

You have read-only access via http and any push will produce an error message.
Whenever I update my projects (bugfixes, enhancements, ...), you can get the latest version:
```
# update local repository
git pull
```
Unfortunately, not all repositories start with "version 1" because I lost some of the early source revisions.

The workflow is actually pretty simple.
```
# go to the project directory
cd crc32

# create repository, will be stored in .git sub-directory
git init

# prepare files for the repository
git add Crc32.cpp
git add Crc32.ino
git add Crc32Best.ino
# of course you can use wild-cards, too

# commit these added files
git commit -m "version 1"

# add web server's URL
git remote add origin http://create.stephan-brumme.com/crc32/.git

# compress repository
git gc --aggressive
```
Now you can use your good old FTP program to copy the full `.git` sub-directory to your web server - just like you would copy HTML files. Done !

Note: A more sophisticated approach is to clone the repository from your local machine to your web server. However, if you are on a shared hoster then chances are good that `git` is disabled. Therefore I suggest you stick with FTP.

# Fowler-Noll-Vo Hash (FNV1a)

posted by Stephan Brumme

## At The Speed Of Light

Four weeks ago I posted an article about the popular CRC32 hashing algorithm. CRC32 works best on large data blocks because short sequences might lead to an increased number of collisions.

The FNV hash created by Fowler, Noll and Vo (see their website) comes with zero memory overhead (no precomputed look-up tables), is an incremental (rolling) hash and has good avalanche behavior. It needs about 4.5 CPU cycles per hashed byte on my computer, which is a little more than 700 MByte/second. GCC and Visual C++ produce identical code.

The heart of the hash function can be expressed in a single line (FNV1a):
``` newHash = (oneByte ^ oldHash) * Prime; ```
Reversing the order of `XOR` and multiplication leads to the original FNV1 hash which is considered slightly inferior:
``` newHash = (oneByte * Prime) ^ oldHash; ```
The 32-bit FNV1a hash of an `unsigned char` looks as follows in C/C++:
```
// default values recommended by http://isthe.com/chongo/tech/comp/fnv/
const uint32_t Prime = 0x01000193; //   16777619
const uint32_t Seed  = 0x811C9DC5; // 2166136261

/// hash a single byte
inline uint32_t fnv1a(unsigned char oneByte, uint32_t hash = Seed)
{
  return (oneByte ^ hash) * Prime;
}
```
And an array of bytes just calls `fnv1a` for each element:
```
#include <assert.h>

/// hash a block of memory
uint32_t fnv1a(const void* data, size_t numBytes, uint32_t hash = Seed)
{
  assert(data);

  const unsigned char* ptr = (const unsigned char*)data;
  while (numBytes--)
    hash = fnv1a(*ptr++, hash);
  return hash;
}
```
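Putting both functions together, a tiny self-test can sanity-check an implementation: an empty input must return the seed unchanged, and the single byte `'a'` hashes to 0xE40C292C (this sketch repeats the code from above so it is self-contained):

```cpp
#include <cstddef>
#include <cstdint>

const uint32_t Prime = 0x01000193; //   16777619
const uint32_t Seed  = 0x811C9DC5; // 2166136261

/// hash a single byte (FNV1a)
inline uint32_t fnv1a(unsigned char oneByte, uint32_t hash = Seed)
{
  return (oneByte ^ hash) * Prime;
}

/// hash a block of memory (FNV1a)
uint32_t fnv1a(const void* data, size_t numBytes, uint32_t hash = Seed)
{
  const unsigned char* ptr = (const unsigned char*)data;
  while (numBytes--)
    hash = fnv1a(*ptr++, hash);
  return hash;
}

/// verify two well-known 32-bit FNV1a properties
bool selfTest()
{
  // an empty input must return the seed unchanged
  bool ok = fnv1a("", 0) == Seed;
  // hashing the single byte 'a' yields 0xE40C292C
  ok = ok && fnv1a("a", 1) == 0xE40C292C;
  return ok;
}
```

Because the hash is rolling, `fnv1a("ab", 2)` equals `fnv1a('b', fnv1a('a'))` - exactly the property that makes incremental hashing of streams trivial.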

# Drawing Antialiased Circles and Ellipses

posted by Stephan Brumme

## Fast But Ugly

PHP's GD image library can draw all kinds of circles and ellipses. But they look rather ugly because they lack proper antialiasing:

Aside from GD image handling stuff, it's a single line of code:
```
imageellipse($img, $width/2,$height/2, $width,$height, $color);
```
The image above was created by the following PHP file:
```
<?php
$start = microtime(true);

// default size
$width  = 150;
$height = 100;
// ... or user defined
if (is_numeric($_REQUEST["width" ])) $width  = $_REQUEST["width" ];
if (is_numeric($_REQUEST["height"])) $height = $_REQUEST["height"];

// create new image
$img = imagecreatetruecolor($width, $height);

// transparent background
$background = imagecolorallocate($img, 0xFF, 0xFF, 0xFF);
imagefilledrectangle ($img, 0,0, $width,$height, $background);
imagecolortransparent($img, $background);

// draw red ellipse, 2*10px border
$red = imagecolorallocate($img, 0xFF, 0x00, 0x00);
imageellipse($img, $width/2,$height/2, $width-20,$height-20, $red);

// measure speed
$end = microtime(true);
$duration = number_format(($end - $start)*1000, 3)."ms";
imagestring($img, 1, $width-50,$height-10, $duration, 0);

// send PNG to browser
header("Content-type: image/png");
imagepng($img, NULL, 9, PNG_ALL_FILTERS);
?>
```
| Algorithm | Quality | Speed | Code Size |
|---|---|---|---|
| `imageellipse` | bad | very fast | one line |
| Zooming Trick | okay | very slow | about 10 lines |
| Polygon Approximation | depends | fast | about 10 lines |
| Implicit Equation | good | very slow | about 20 lines |
| Wu's Algorithm | good | slow | about 30 lines |
| True-Type Font | depends | fast | about 10 lines |
All images on this page are drawn on-the-fly. Rendering time (in milliseconds) is displayed in the lower right corner and will be different if you reload.
