Interactive demos require Javascript. Anything else works.

Parallel Sieve of Eratosthenes [updated]

posted by Stephan Brumme, updated

Processing 1 Billion Numbers Per Second

There are 50,847,534 prime numbers between 2 and 1,000,000,000. For reasons unknown, I wanted to know whether a C++ implementation can find all these numbers on my desktop computer (Intel Core i7 860, quad-core, hyperthreading, 2.8GHz) in less than 1 second with a simple algorithm such as the Sieve of Eratosthenes.

By only slightly modifying the basic algorithm and unleashing the raw power of OpenMP, I achieved a 12x 20x speed-up.
In the end, my improved sieve finishes after 1.1 seconds - mission almost accomplished.

UPDATE October 6, 2011: After adding a simple test for divisibility by 3 (and 5, 7, 9, 11, 13), the NEW block-wise sieve gets much faster and swiftly breaks through the 1-second-barrier.

UPDATE July 7, 2014: Many browsers support asm.js to boost Javascript performance. Scroll down to find my asm.js implementation or click here.
Algorithm OpenMP from 2 to 1 billion Speed-Up CPU ticks (approx.) Notes
Basic Algorithm no 50847534 primes 13.5 seconds (1.0x) 40 billion shortest code (22 lines)
Odd Numbers Only no 50847534 primes 6.7 seconds 2.0x 20 billion only 10% more code than basic algorithm
Odd Numbers Only yes, 8 threads 50847534 primes 5.4 seconds 2.5x 130 billion
Block-Wise NEW no 50847534 primes 2.4 seconds 5.6x 7 billion most efficient
Block-Wise NEW yes, 8 threads 50847534 primes 0.65 seconds 20.7x 16 billion fastest
Note: Much more complex algorithms easily outperform my code. My primary focus was parallelism in general, not prime numbers !

Live Demo

I ported the short Odd Numbers Only algorithm to Javascript and PHP; the code can be seen at the bottom of this page. The table below shows the expected number of primes for some reference inputs and the performance numbers measured on my computer.

Compute number of primes from 2 to:




Your browser: CCBot/2.0 (https://commoncrawl.org/faq/)

In this benchmark, C++ is approx. 4x faster than the best Javascript engine (Mozilla Firefox) and approx. 100x faster than PHP 5.3. Only the PHP times marked with (*) are measured on my local computer while all other PHP timings were measured on the create.stephan-brumme.com server (which is about 25% bit slower).

All browsers are still 32 bit programs - while my operating system is already 64 bit - and run out of memory when attempting to find all prime numbers from 1 to 1 billion. The more complex blockwise sieve does not suffer from this problem but I was too lazy to port it, too.
Multithreading is currently not supported by neither Javascript nor PHP.
from 2 to ... primes C++ Internet Explorer 10 Firefox 18 Opera 12 Chrome 24 PHP 5.3
10 4 <0.001s <0.001s <0.001s <0.001s <0.001s <0.001s
100 25 <0.001s <0.001s <0.001s <0.001s <0.001s <0.001s
1,000 168 <0.001s <0.001s <0.001s <0.001s <0.001s <0.001s
10,000 1,229 <0.001s <0.001s <0.001s <0.001s <0.001s 0.003s
100,000 9,592 <0.001s 0.001s 0.001s 0.002s 0.001s 0.027s
1,000,000 78,498 0.002s 0.010s 0.007s 0.022s 0.015s 0.393s
10,000,000 664,579 0.024s 0.139s 0.103s 0.248s 0.190s 3.717s
100,000,000 5,761,455 0.555s about 1.7s about 1.2s about 2.7s about 2.0s about 29s (*)
1,000,000,000 50,847,534 6.7s n/a n/a n/a n/a n/a
For your interest: the browser ranking is contrary to what I have observed with my Javascript global illumination port of smallpt. I expected PHP to be faster, too.

UPDATE February 14, 2013: More than a year after my initial tests, Firefox became the fastest browser due to its vastly improved Javascript engine.
(read more ...)

Windows and Linux Code Timing

posted by Stephan Brumme

Whenever I have to do some performance testing, I try the most simple way: write a short program and run it within my Linux bash shell which comes with a built-in time command:
$ time ./testcode real 0m0.788s user 0m0.696s sys 0m0.080s
Obviously there can (and will) be some overhead by the operating system. So numbers are typically a bit off.
The timing resolution isn't that great, either, and often lies above 10ms.

That's why I wrote a simple C routine to measure performance inside the code. My goal was to have numbers that are accurate down to about 1 microsecond (10-6).
I didn't opt for extremely high timing precision (count CPU ticks) because I rarely need that degree of precision and really want to avoid the tricky details of CPU frequency scaling, cache stalls etc.

Another feature is portability; simply recompile the code and run it on Windows and Linux. The code works fine with GCC as well as Visual C++.

All my timing code contains is a single function seconds() returning a double. You can invoke it in your programs like this:
double start = seconds(); // ... your code that should be measured ... double duration = seconds() - start;

(read more ...)

Portable C++ Hashing Library

posted by Stephan Brumme, updated

Introduction

Let's start with the core features of my C++ hashing library:

Usage

Since there are no external dependencies, not even dependencies within the library itself, all you need is to include the header of the hashing algorithm of your choice and add the identically named .cpp to your project. That means if you want to add SHA256 hashing to your C++ program, you only have to include sha256.h and sha256.cpp in your project.

If you prefer SHA256, then your code will look like this:
hide // SHA2 test program #include "sha256.h" #include <iostream> // for std::cout only, not needed for hashing library int main(int, char**) { // create a new hashing object SHA256 sha256; // hashing an std::string std::cout << sha256("Hello World") << std::endl; // => a591a6d40bf420404a011733cfb7b190d62c65bf0bcda32b57b277d9ad9f146e // hashing a buffer of bytes const char* buffer = "How are you"; std::cout << sha256(buffer, 11) << std::endl; // => 9c7d5b046878838da72e40ceb3179580958df544b240869b80d0275cc07209cc // or in a streaming fashion (re-use "How are you") SHA256 sha256stream; const char* url = "create.stephan-brumme.com"; // 25 bytes int step = 5; for (int i = 0; i < 25; i += step) sha256stream.add(url + i, step); // add five bytes at a time std::cout << sha256stream.getHash() << std::endl; // => 82aa771f1183c52f973c798c9243a1c73833ea40961c73e55e12430ec77b69f6 return 0; }

(read more ...)

Sorting STL Containers

posted by Stephan Brumme, updated

std::sort

Yes, there is std::sort. Yes, it's pretty much the most efficient way to sort generic STL containers. No, it's not perfect.

Improving std::sort

What if the data is already almost sorted ? std::sort doesn't benefit a lot from processing already sorted data. However, it's quite common to have (partially) sorted data.

std::sort requires RandomAccessIterator which are sometimes not available for custom containers.

What about std::list ? The syntax is completely different:
hide std::list<int> mylist; // ... fill mylist with data ... // compiler error: std::sort(mylist.begin(), mylist.end()); // must use this instead: mylist.sort();
That's why I wrote some sorting routines on my own to overcome all those limitations: The sorting algorithms are ordered more or less by overall performance (slowest first). As you can see faster algorithms have stricter requirements:
Algorithm Stable ? Iterator Notes
Bubble Sort yes forward iterator code shown here requires bidirectional iterator, just remove both instances of --last; (will be slower then)
Selection Sort no forward iterator
Insertion Sort yes bidirectional iterator
Shell Sort no bidirectional iterator std::distance and std::advance benefit heavily from random access iterators
Heap Sort no random-access iterator my n-ary Heap Sort implementation works with bidirectional iterators, too
Merge Sort yes random-access iterator requires additional memory
Merge Sort (in-place) yes forward iterator no additional memory but slower than normal merge sort
Quick Sort no bidirectional iterator std::distance and std::advance benefit heavily from random access iterators
Intro Sort no bidirectional iterator std::distance and std::advance benefit heavily from random access iterators
std::sort no random-access iterator
std::stable_sort yes random-access iterator
The algorithms are nothing new, they can be found in every decent book on algorithms and, of course, online (e.g. Wikipedia).


(read more ...)

Portable Memory Mapping C++ Class

posted by Stephan Brumme

Parsing Files The Easy Way

Recently I had to do a lot with file loading and parsing. It can be especially tricky to come up with a fast solution if you have to jump around within these files. Seeking and proper buffering were my biggest problems.

Memory mapping is one of the nicest features of modern operating systems: after opening a file in memory-mapped mode you can treat the file as a large chunk of memory and use plain pointers. The operating system takes care of loading the data on demand (!) into memory - utilizing caches, of course. When using my C++ class MemoryMapped it's really easy:
hide How to use // open file MemoryMapped data("myfile.txt"); // read byte at file offset 2345 unsigned char a = data[2345]; // or if you prefer pointers const short* raw = (const short*) data.getData(); short b = raw[300];
Windows is completely different from Linux when it comes to opening a file in memory-mapped mode. The class MemoryMapped hides all the OS specific stuff in only two files: MemoryMapped.h and MemoryMapped.cpp.
They compile without any warning with GCC 4.7 and Visual C++ 2010. I haven't tried other compilers but they should be able to handle it, too, even when they are a bit older.
(read more ...)

Testing Endian Aware Code

posted by Stephan Brumme

Turning Around

CPUs store numbers in two different ways: little or big endian. The majority of CPUs are designed to access memory in Little Endian way which seems to have the bytes mixed the wrong way. The memory layout of the hexadecimal number 0x12345678 looks like this:
byte 0 byte 1 byte 2 byte 3 Typical CPUs
Little Endian 0x78 0x56 0x34 0x12 Intel/AMD x86, x64
Big Endian 0x12 0x34 0x56 0x78 PowerPC, ARM*
Note: ARM chips often support little endian, too.

Virtual Machines / QEMU

I don't have a big endian computer at home (my ARM chips are little endian by default) but some of my source code required testing that everything works as expected on big endian machines, too. That's when I learnt about QEMU, a great open-source processor emulator. It's available on Unix and Windows.

This posting explains how to setup QEMU and run Debian Wheezy on an emulated big endian PowerPC.
(read more ...)

Software License

posted by Stephan Brumme

I received many kind requests from companies willing to use code from create.stephan-brumme.com in commercial products.
I publish my code because I believe that sharing knowledge makes this world a better place. Therefore all code will be under a zlib-style license now.
This applies to all past and future postings.

The full license can be found on http://create.stephan-brumme.com/disclaimer.html. In short, you can do with it whatever you want, even build commercial or military software, but please keep my name in the source code (however, no need to display it in the shipped product to the end user).
Each file's header contains a reference to the license, too:
// ////////////////////////////////////////////////////////// // Copyright (c) 2013 Stephan Brumme. All rights reserved. // see http://create.stephan-brumme.com/disclaimer.html //
It would be great if you publish your code, too, but that's not a requirement.

I appreciate if you send me an email with a short description of your project / product.

Read-Only Git Access Available

posted by Stephan Brumme

What Is Git ?!

If you don't know git grab one of tutorials on the web.

Using My Git Repositories

I added a git repository to these projects on create.stephan-brumme.com that already came with a big yellow download button, e.g.:

git clone http://create.stephan-brumme.com/crc32/.git

You have read-only access via http and any push will produce an error message.
Whenever I update my projects (bugfixes, enhancements, ...), you can get the latest version:
# update local repository git pull
Unfortunately, not all repositories start with "version 1" because I lost some of the early source revisions.

How To Add Git Repositories To Your Website

The workflow is actually pretty simple.
# go to the project directory cd crc32 # create repository, will be stored in .git sub-directory git init # prepare files for the repository git add Crc32.cpp git add Crc32.ino git add Crc32Best.ino # of course you can use wild-cards, too # commit these added files git commit -m "version 1" # add web server's URL git remote add origin http://create.stephan-brumme.com/crc32/.git # compress repository git gc --aggressive
Now you can use your good old FTP program to copy the full .git sub-directory to your web server - just like you would copy HTML files. Done !

Note: A more sophisticated approach is to clone the repository from your local machine to your web server. However, if you are on a shared hoster then chances are good that git is disabled. Therefore I suggest you stick with FTP.

Fowler-Noll-Vo Hash (FNV1a)

posted by Stephan Brumme

At The Speed Of Light

Four weeks ago I posted an article about the popular CRC32 hashing algorithm. CRC32 works best on large data blocks because short sequences might lead to an increased number of collisions.

The FNV hash created by Fowler, Noll and Vo (see their website) comes with zero memory overhead (no precomputed look-up tables), is an incremental (rolling) hash and has a good avalanche behavior. It needs at about 4.5 CPU cycles per hashed byte on my computer, that's a little bit more than 700 MByte/second. GCC and Visual C++ produce identical code.

The heart of the hash function can be expressed in a single line (FNV1a):
newHash = (oneByte ^ oldHash) * Prime;
Reversing the order of XOR and multiplication leads to the original FNV1 hash which is considered slightly inferior:
newHash = (oneByte * Prime) ^ oldHash;
The 32-bit FNV1a hash of an unsigned char looks as follows in C/C++:
hide fnv1a, one byte // default values recommended by http://isthe.com/chongo/tech/comp/fnv/ const uint32_t Prime = 0x01000193; // 16777619 const uint32_t Seed = 0x811C9DC5; // 2166136261 /// hash a single byte inline uint32_t fnv1a(unsigned char oneByte, uint32_t hash = Seed) { return (oneByte ^ hash) * Prime; }
And an array of bytes just calls fnv1a for each element:
hide fnv1a, multiple bytes /// hash a block of memory uint32_t fnv1a(const void* data, size_t numBytes, uint32_t hash = Seed) { assert(data); const unsigned char* ptr = (const unsigned char*)data; while (numBytes--) hash = fnv1a(*ptr++, hash); return hash; }

(read more ...)

Drawing Antialiased Circles and Ellipses

posted by Stephan Brumme

Fast But Ugly

PHP's GD image library can draw all kinds of circles and ellipses. But they look rather ugly because they lack proper antialiasing:

Standard

Aside from GD image handling stuff, it's a single line of code:
imageellipse($img, $width/2,$height/2, $width,$height, $color);
The image above was created by the following PHP file (click show):
show PHP's built-in imageellipse <?php $start = microtime(true); // default size $width = 150; $height = 100; // ... or user defined if (is_numeric($_REQUEST["width" ])) $width = $_REQUEST["width" ]; if (is_numeric($_REQUEST["height"])) $height = $_REQUEST["height"]; // create new image $img = imagecreatetruecolor($width, $height); // transparent background $background = imagecolorallocate($img, 0xFF, 0xFF, 0xFF); imagefilledrectangle ($img, 0,0, $width,$height, $background); imagecolortransparent($img, $background); // draw red ellipse, 2*10px border $red = imagecolorallocate($img, 0xFF, 0x00, 0x00); imageellipse($img, $width/2,$height/2, $width-20,$height-20, $red); // measure speed $end = microtime(true); $duration = number_format(($end - $start)*1000, 3)."ms"; imagestring($img, 1, $width-50,$height-10, $duration, 0); // send PNG to browser header("Content-type: image/png"); imagepng($img, NULL, 9, PNG_ALL_FILTERS); ?>
This article will show five ways how to add antialiasing to circles/ellipses:
Algorithm Quality Speed Code Size
imageellipse bad very fast one line
Zooming Trick okay very slow about 10 lines
Polygon Approximation depends fast about 10 lines
Implicit Equation good very slow about 20 lines
Wu's Algorithm good slow about 30 lines
True-Type Font depends fast about 10 lines
All images on this page are drawn on-the-fly. Rendering time (in milliseconds) is displayed in the lower right corner and will be different if you reload.
(read more ...)

browse through all postings:
newer stuff:

2021/Jun/30   Length-Limited Prefix Codes
2020/Apr/13   smalLZ4 - optimal LZ4 compression
2019/Jun/17   toojpeg - a JPEG encoder in a single C++ file
2018/Oct/15   flexiGIF - lossless GIF/LZW optimization
2016/Aug/04   xxHash - a hash algorithm as fast as memcpy
2015/Aug/11   Tiny Regular Expression Matcher
2015/Feb/04   Fast CRC32
2014/Aug/25   Practical String Searching
2014/Aug/19   The Mersenne Twister Pseudo Random Number Generator

older stuff:

2011/Nov/27   Uncompressing DEFLATE Data
2011/Nov/23   Streaming MPEG-4 Movies From Your Own Server
2011/Nov/08   Binary Search Revisited
2011/Oct/20   Reverse MD5 Hash
2011/Oct/17   Windows 7 Screenshots
2011/Oct/16   Hide Strings In Your Executable
2011/Oct/11   Generate QR Codes With PHP and Google Charts API
2011/Oct/02   Feed.rss and Sitemap.xml
2011/Sep/28   My Experiences with AdSense
2011/Sep/27   Emulating MemCache in Pure PHP
2011/Sep/14   Simple Ajax
2011/Sep/13   Sparklines
2011/Sep/12   smallpt - Global Illumination in Javascript
2011/Sep/11   Multi-Core Sorting
2011/Sep/06   CLein - The tiny, portable OpenCL info tool
homepage