Parallel Sieve of Eratosthenes [updated]

posted September 17, 2011 by Stephan Brumme, updated July 7, 2014

Processing 1 Billion Numbers Per Second

There are 50,847,534 prime numbers between 2 and 1,000,000,000. For reasons unknown, I wanted to know whether a C++ implementation can find all these numbers on my desktop computer (Intel Core i7 860, quad-core, hyperthreading, 2.8GHz) in less than 1 second with a simple algorithm such as the Sieve of Eratosthenes.

By only slightly modifying the basic algorithm and unleashing the raw power of OpenMP, I achieved a ~~12x~~ 20x speed-up.
~~In the end, my improved sieve finishes after 1.1 seconds - mission almost accomplished.~~

UPDATE October 6, 2011: After adding a simple test for divisibility by 3 (and 5, 7, 9, 11, 13), the NEW block-wise sieve gets much faster and swiftly breaks through the 1-second-barrier.

UPDATE July 7, 2014: Many browsers support asm.js to boost Javascript performance. Scroll down to find my asm.js implementation or click here.

Algorithm	OpenMP	from 2 to 1 billion		Speed-Up	CPU ticks (approx.)	Notes
Basic Algorithm	no	50847534 primes	13.5 seconds	(1.0x)	40 billion	shortest code (22 lines)
Odd Numbers Only	no	50847534 primes	6.7 seconds	2.0x	20 billion	only 10% more code than basic algorithm
Odd Numbers Only	yes, 8 threads	50847534 primes	5.4 seconds	2.5x	130 billion
Block-Wise NEW	no	50847534 primes	2.4 seconds	5.6x	7 billion	most efficient
Block-Wise NEW	yes, 8 threads	50847534 primes	0.65 seconds	20.7x	16 billion	fastest

Note: Much more complex algorithms easily outperform my code. My primary focus was parallelism in general, not prime numbers !

Live Demo

I ported the short Odd Numbers Only algorithm to Javascript and PHP; the code can be seen at the bottom of this page. The table below shows the expected number of primes for some reference inputs and the performance numbers measured on my computer.

In this benchmark, C++ is approx. 4x faster than the best Javascript engine (Mozilla Firefox) and approx. 100x faster than PHP 5.3. Only the PHP times marked with (*) are measured on my local computer while all other PHP timings were measured on the create.stephan-brumme.com server (which is about 25% bit slower).

All browsers are still 32 bit programs - while my operating system is already 64 bit - and run out of memory when attempting to find all prime numbers from 1 to 1 billion. The more complex blockwise sieve does not suffer from this problem but I was too lazy to port it, too.
Multithreading is currently not supported by neither Javascript nor PHP.

from 2 to ...	primes	C++	Internet Explorer 10	Firefox 18	Opera 12	Chrome 24	PHP 5.3
10	4	<0.001s	<0.001s	<0.001s	<0.001s	<0.001s	<0.001s
100	25	<0.001s	<0.001s	<0.001s	<0.001s	<0.001s	<0.001s
1,000	168	<0.001s	<0.001s	<0.001s	<0.001s	<0.001s	<0.001s
10,000	1,229	<0.001s	<0.001s	<0.001s	<0.001s	<0.001s	0.003s
100,000	9,592	<0.001s	0.001s	0.001s	0.002s	0.001s	0.027s
1,000,000	78,498	0.002s	0.010s	0.007s	0.022s	0.015s	0.393s
10,000,000	664,579	0.024s	0.139s	0.103s	0.248s	0.190s	3.717s
100,000,000	5,761,455	0.555s	about 1.7s	about 1.2s	about 2.7s	about 2.0s	about 29s (*)
1,000,000,000	50,847,534	6.7s	n/a	n/a	n/a	n/a	n/a

For your interest: the browser ranking is contrary to what I have observed with my Javascript global illumination port of smallpt. I expected PHP to be faster, too.

UPDATE February 14, 2013: More than a year after my initial tests, Firefox became the fastest browser due to its vastly improved Javascript engine.
(read more ...)

Windows and Linux Code Timing

posted June 23, 2014 by Stephan Brumme

Whenever I have to do some performance testing, I try the most simple way: write a short program and run it within my Linux bash shell which comes with a built-in time command:

    
     $ time ./testcode
     real    0m0.788s
     user    0m0.696s
     sys     0m0.080s

Obviously there can (and will) be some overhead by the operating system. So numbers are typically a bit off.
The timing resolution isn't that great, either, and often lies above 10ms.

That's why I wrote a simple C routine to measure performance inside the code. My goal was to have numbers that are accurate down to about 1 microsecond (10^-6).
I didn't opt for extremely high timing precision (count CPU ticks) because I rarely need that degree of precision and really want to avoid the tricky details of CPU frequency scaling, cache stalls etc.

Another feature is portability; simply recompile the code and run it on Windows and Linux. The code works fine with GCC as well as Visual C++.

All my timing code contains is a single function seconds() returning a double. You can invoke it in your programs like this:

    
     double start = seconds();
     // ... your code that should be measured ...
     double duration = seconds() - start;

(read more ...)

Portable C++ Hashing Library

posted February 5, 2014 by Stephan Brumme, updated June 14, 2014

Introduction

Let's start with the core features of my C++ hashing library:

computes CRC32, MD5, SHA1 and SHA256 (most common member of the SHA2 functions), Keccak and its SHA3 sibling
optional HMAC (keyed-hash message authentication code)
no external dependencies, small code size
can work chunk-wise (for example when reading streams block-by-block)
portable: supports Windows and Linux, tested on Little Endian and Big Endian CPUs
roughly as fast as Linux core hashing functions
open source, zlib license

Usage

Since there are no external dependencies, not even dependencies within the library itself, all you need is to include the header of the hashing algorithm of your choice and add the identically named .cpp to your project. That means if you want to add SHA256 hashing to your C++ program, you only have to include sha256.h and sha256.cpp in your project.

If you prefer SHA256, then your code will look like this:

hide


     // SHA2 test program
     #include "sha256.h"
     #include <iostream> // for std::cout only, not needed for hashing library
     
     int main(int, char**)
     {
       // create a new hashing object
       SHA256 sha256;
     
       // hashing an std::string
       std::cout << sha256("Hello World") << std::endl;
       // => a591a6d40bf420404a011733cfb7b190d62c65bf0bcda32b57b277d9ad9f146e
     
       // hashing a buffer of bytes
       const char* buffer = "How are you";
       std::cout << sha256(buffer, 11) << std::endl;
       // => 9c7d5b046878838da72e40ceb3179580958df544b240869b80d0275cc07209cc
     
       // or in a streaming fashion (re-use "How are you")
       SHA256 sha256stream;
       const char* url = "create.stephan-brumme.com"; // 25 bytes
       int step = 5;
       for (int i = 0; i < 25; i += step)
         sha256stream.add(url + i, step); // add five bytes at a time
       std::cout << sha256stream.getHash() << std::endl;
       // => 82aa771f1183c52f973c798c9243a1c73833ea40961c73e55e12430ec77b69f6
     
       return 0;
     }

(read more ...)

Sorting STL Containers

posted September 7, 2011 by Stephan Brumme, updated November 20, 2020

std::sort

Yes, there is std::sort. Yes, it's pretty much the most efficient way to sort generic STL containers. No, it's not perfect.

Improving std::sort

What if the data is already almost sorted ? std::sort doesn't benefit a lot from processing already sorted data. However, it's quite common to have (partially) sorted data.

std::sort requires RandomAccessIterator which are sometimes not available for custom containers.

What about std::list ? The syntax is completely different:

hide


     std::list<int> mylist;
     // ... fill mylist with data ...
     
     // compiler error:
     std::sort(mylist.begin(), mylist.end());
     // must use this instead:
     mylist.sort();

That's why I wrote some sorting routines on my own to overcome all those limitations:

can sort containers that have only basic ForwardIterator (Bubble Sort and Selection Sort)
most routines work with BidirectionalIterator
therefore seamless integration of std::list
data types need a less-than comparison (operator<); which can be user-defined, too

The sorting algorithms are ordered more or less by overall performance (slowest first). As you can see faster algorithms have stricter requirements:

Algorithm	Stable ?	Iterator	Notes
Bubble Sort	yes	forward iterator	code shown here requires bidirectional iterator, just remove both instances of `--last;` (will be slower then)
Selection Sort	no	forward iterator
Insertion Sort	yes	bidirectional iterator
Shell Sort	no	bidirectional iterator	`std::distance` and `std::advance` benefit heavily from random access iterators
Heap Sort	no	random-access iterator	my n-ary Heap Sort implementation works with bidirectional iterators, too
Merge Sort	yes	random-access iterator	requires additional memory
Merge Sort (in-place)	yes	forward iterator	no additional memory but slower than normal merge sort
Quick Sort	no	bidirectional iterator	`std::distance` and `std::advance` benefit heavily from random access iterators
Intro Sort	no	bidirectional iterator	`std::distance` and `std::advance` benefit heavily from random access iterators
std::sort	no	random-access iterator
std::stable_sort	yes	random-access iterator

The algorithms are nothing new, they can be found in every decent book on algorithms and, of course, online (e.g. Wikipedia).

(read more ...)

Portable Memory Mapping C++ Class

posted September 18, 2013 by Stephan Brumme

Parsing Files The Easy Way

Recently I had to do a lot with file loading and parsing. It can be especially tricky to come up with a fast solution if you have to jump around within these files. Seeking and proper buffering were my biggest problems.

Memory mapping is one of the nicest features of modern operating systems: after opening a file in memory-mapped mode you can treat the file as a large chunk of memory and use plain pointers. The operating system takes care of loading the data on demand (!) into memory - utilizing caches, of course. When using my C++ class MemoryMapped it's really easy:

hide How to use


     // open file
     MemoryMapped data("myfile.txt");
     
     // read byte at file offset 2345
     unsigned char a = data[2345];
     
     // or if you prefer pointers
     const short* raw = (const short*) data.getData();
     short        b   = raw[300];

Windows is completely different from Linux when it comes to opening a file in memory-mapped mode. The class MemoryMapped hides all the OS specific stuff in only two files: MemoryMapped.h and MemoryMapped.cpp.
They compile without any warning with GCC 4.7 and Visual C++ 2010. I haven't tried other compilers but they should be able to handle it, too, even when they are a bit older.
(read more ...)

Testing Endian Aware Code

posted May 3, 2013 by Stephan Brumme

Turning Around

CPUs store numbers in two different ways: little or big endian. The majority of CPUs are designed to access memory in Little Endian way which seems to have the bytes mixed the wrong way. The memory layout of the hexadecimal number 0x12345678 looks like this:

	byte 0	byte 1	byte 2	byte 3	Typical CPUs
Little Endian	0x78	0x56	0x34	0x12	Intel/AMD x86, x64
Big Endian	0x12	0x34	0x56	0x78	PowerPC, ARM*

Note: ARM chips often support little endian, too.

Virtual Machines / QEMU

I don't have a big endian computer at home (my ARM chips are little endian by default) but some of my source code required testing that everything works as expected on big endian machines, too. That's when I learnt about QEMU, a great open-source processor emulator. It's available on Unix and Windows.

This posting explains how to setup QEMU and run Debian Wheezy on an emulated big endian PowerPC.
(read more ...)

Software License

posted April 24, 2013 by Stephan Brumme

I received many kind requests from companies willing to use code from create.stephan-brumme.com in commercial products.
I publish my code because I believe that sharing knowledge makes this world a better place. Therefore all code will be under a zlib-style license now.
This applies to all past and future postings.

The full license can be found on http://create.stephan-brumme.com/disclaimer.html. In short, you can do with it whatever you want, even build commercial or military software, but please keep my name in the source code (however, no need to display it in the shipped product to the end user).
Each file's header contains a reference to the license, too:


     // //////////////////////////////////////////////////////////
     // Copyright (c) 2013 Stephan Brumme. All rights reserved.
     // see http://create.stephan-brumme.com/disclaimer.html
     //

It would be great if you publish your code, too, but that's not a requirement.

I appreciate if you send me an email with a short description of your project / product.

Read-Only Git Access Available

posted February 15, 2013 by Stephan Brumme

What Is Git ?!

If you don't know git grab one of tutorials on the web.

Using My Git Repositories

I added a git repository to these projects on create.stephan-brumme.com that already came with a big yellow download button, e.g.:

git clone http://create.stephan-brumme.com/crc32/.git

You have read-only access via http and any push will produce an error message.
Whenever I update my projects (bugfixes, enhancements, ...), you can get the latest version:

    
    # update local repository
    git pull

Unfortunately, not all repositories start with "version 1" because I lost some of the early source revisions.

How To Add Git Repositories To Your Website

The workflow is actually pretty simple.

    
    # go to the project directory
    cd crc32
    
    # create repository, will be stored in .git sub-directory
    git init
    
    # prepare files for the repository
    git add Crc32.cpp
    git add Crc32.ino
    git add Crc32Best.ino
    # of course you can use wild-cards, too
    
    # commit these added files
    git commit -m "version 1"
    
    # add web server's URL
    git remote add origin http://create.stephan-brumme.com/crc32/.git
    
    # compress repository
    git gc --aggressive
    
   

Now you can use your good old FTP program to copy the full .git sub-directory to your web server - just like you would copy HTML files. Done !

Note: A more sophisticated approach is to clone the repository from your local machine to your web server. However, if you are on a shared hoster then chances are good that git is disabled. Therefore I suggest you stick with FTP.

Fowler-Noll-Vo Hash (FNV1a)

posted December 5, 2011 by Stephan Brumme

At The Speed Of Light

Four weeks ago I posted an article about the popular CRC32 hashing algorithm. CRC32 works best on large data blocks because short sequences might lead to an increased number of collisions.

The FNV hash created by Fowler, Noll and Vo (see their website) comes with zero memory overhead (no precomputed look-up tables), is an incremental (rolling) hash and has a good avalanche behavior. It needs at about 4.5 CPU cycles per hashed byte on my computer, that's a little bit more than 700 MByte/second. GCC and Visual C++ produce identical code.

The heart of the hash function can be expressed in a single line (FNV1a):


     newHash = (oneByte ^ oldHash) * Prime;

Reversing the order of XOR and multiplication leads to the original FNV1 hash which is considered slightly inferior:


     newHash = (oneByte * Prime) ^ oldHash;

The 32-bit FNV1a hash of an unsigned char looks as follows in C/C++:

hide fnv1a, one byte


     // default values recommended by http://isthe.com/chongo/tech/comp/fnv/
     const uint32_t Prime = 0x01000193; //   16777619
     const uint32_t Seed  = 0x811C9DC5; // 2166136261
     /// hash a single byte
     inline uint32_t fnv1a(unsigned char oneByte, uint32_t hash = Seed)
     {
       return (oneByte ^ hash) * Prime;
     }

And an array of bytes just calls fnv1a for each element:

hide fnv1a, multiple bytes


     /// hash a block of memory
     uint32_t fnv1a(const void* data, size_t numBytes, uint32_t hash = Seed)
     {
       assert(data);
       const unsigned char* ptr = (const unsigned char*)data;
       while (numBytes--)
         hash = fnv1a(*ptr++, hash);
       return hash;
     }

(read more ...)

Drawing Antialiased Circles and Ellipses

posted December 1, 2011 by Stephan Brumme

Fast But Ugly

PHP's GD image library can draw all kinds of circles and ellipses. But they look rather ugly because they lack proper antialiasing:

Standard

Aside from GD image handling stuff, it's a single line of code:


     imageellipse($img, $width/2,$height/2, $width,$height, $color);

The image above was created by the following PHP file (click show):

show PHP's built-in imageellipse


     <?php
       $start = microtime(true);
     
       // default size
       $width  = 150;
       $height = 100;
       // ... or user defined
       if (is_numeric($_REQUEST["width" ])) $width  = $_REQUEST["width" ];
       if (is_numeric($_REQUEST["height"])) $height = $_REQUEST["height"];
     
       // create new image
       $img = imagecreatetruecolor($width, $height);
     
       // transparent background
       $background = imagecolorallocate($img, 0xFF, 0xFF, 0xFF);
       imagefilledrectangle ($img, 0,0, $width,$height, $background);
       imagecolortransparent($img, $background);
     
       // draw red ellipse, 2*10px border
       $red = imagecolorallocate($img, 0xFF, 0x00, 0x00);
       imageellipse($img, $width/2,$height/2, $width-20,$height-20, $red);
     
       // measure speed
       $end = microtime(true);
       $duration = number_format(($end - $start)*1000, 3)."ms";
       imagestring($img, 1, $width-50,$height-10, $duration, 0);
     
       // send PNG to browser
       header("Content-type: image/png");
       imagepng($img, NULL, 9, PNG_ALL_FILTERS);
     ?>

This article will show five ways how to add antialiasing to circles/ellipses:

Algorithm	Quality	Speed	Code Size
`imageellipse`	bad	very fast	one line
Zooming Trick	okay	very slow	about 10 lines
Polygon Approximation	depends	fast	about 10 lines
Implicit Equation	good	very slow	about 20 lines
Wu's Algorithm	good	slow	about 30 lines
True-Type Font	depends	fast	about 10 lines

All images on this page are drawn on-the-fly. Rendering time (in milliseconds) is displayed in the lower right corner and will be different if you reload.
(read more ...)

browse through all postings:

newer stuff:

2021/Jun/30 Length-Limited Prefix Codes
2020/Apr/13 smalLZ4 - optimal LZ4 compression
2019/Jun/17 toojpeg - a JPEG encoder in a single C++ file
2018/Oct/15 flexiGIF - lossless GIF/LZW optimization
2016/Aug/04 xxHash - a hash algorithm as fast as memcpy
2015/Aug/11 Tiny Regular Expression Matcher
2015/Feb/04 Fast CRC32
2014/Aug/25 Practical String Searching
2014/Aug/19 The Mersenne Twister Pseudo Random Number Generator

older stuff:

2011/Nov/27 Uncompressing DEFLATE Data
2011/Nov/23 Streaming MPEG-4 Movies From Your Own Server
2011/Nov/08 Binary Search Revisited
2011/Oct/20 Reverse MD5 Hash
2011/Oct/17 Windows 7 Screenshots
2011/Oct/16 Hide Strings In Your Executable
2011/Oct/11 Generate QR Codes With PHP and Google Charts API
2011/Oct/02 Feed.rss and Sitemap.xml
2011/Sep/28 My Experiences with AdSense
2011/Sep/27 Emulating MemCache in Pure PHP
2011/Sep/14 Simple Ajax
2011/Sep/13 Sparklines
2011/Sep/12 smallpt - Global Illumination in Javascript
2011/Sep/11 Multi-Core Sorting
2011/Sep/06 CLein - The tiny, portable OpenCL info tool


June 30, 2021: Length-Limited Prefix Codes April 13, 2020: smalLZ4 - optimal LZ4 compression June 17, 2019: toojpeg - a JPEG encoder in a single C++ file ... and 30 more !	RSS	www.stephan-brumme.com bits.stephan-brumme.com photos.stephan-brumme.com minime.stephan-brumme.com
© 2011-2024 Stephan Brumme products and company names mentioned may be trademarks of their respective owners. see my disclaimer. dude, this is the end ... of this page.