Parallel Sieve of Eratosthenes [updated]

posted by Stephan Brumme, updated

Processing 1 Billion Numbers Per Second

There are 50,847,534 prime numbers between 2 and 1,000,000,000. For reasons unknown, I wanted to know whether a C++ implementation can find all these numbers on my desktop computer (Intel Core i7 860, quad-core, hyperthreading, 2.8GHz) in less than 1 second with a simple algorithm such as the Sieve of Eratosthenes.

By only slightly modifying the basic algorithm and unleashing the raw power of OpenMP, I achieved a 12x 20x speed-up.
In the end, my improved sieve finishes after 1.1 seconds - mission almost accomplished.

UPDATE October 6, 2011: After adding a simple test for divisibility by 3 (and 5, 7, 9, 11, 13), the NEW block-wise sieve gets much faster and swiftly breaks through the 1-second-barrier.

UPDATE July 7, 2014: Many browsers support asm.js to boost Javascript performance. Scroll down to find my asm.js implementation or click here.
Algorithm OpenMP from 2 to 1 billion Speed-Up CPU ticks (approx.) Notes
Basic Algorithm no 50847534 primes 13.5 seconds (1.0x) 40 billion shortest code (22 lines)
Odd Numbers Only no 50847534 primes 6.7 seconds 2.0x 20 billion only 10% more code than basic algorithm
Odd Numbers Only yes, 8 threads 50847534 primes 5.4 seconds 2.5x 130 billion
Block-Wise NEW no 50847534 primes 2.4 seconds 5.6x 7 billion most efficient
Block-Wise NEW yes, 8 threads 50847534 primes 0.65 seconds 20.7x 16 billion fastest
Note: Much more complex algorithms easily outperform my code. My primary focus was parallelism in general, not prime numbers !

Live Demo

I ported the short Odd Numbers Only algorithm to Javascript and PHP; the code can be seen at the bottom of this page. The table below shows the expected number of primes for some reference inputs and the performance numbers measured on my computer.

Compute number of primes from 2 to:




Your browser: CCBot/2.0 (http://commoncrawl.org/faq/)

In this benchmark, C++ is approx. 4x faster than the best Javascript engine (Mozilla Firefox) and approx. 100x faster than PHP 5.3. Only the PHP times marked with (*) are measured on my local computer while all other PHP timings were measured on the create.stephan-brumme.com server (which is about 25% bit slower).

All browsers are still 32 bit programs - while my operating system is already 64 bit - and run out of memory when attempting to find all prime numbers from 1 to 1 billion. The more complex blockwise sieve does not suffer from this problem but I was too lazy to port it, too.
Multithreading is currently not supported by neither Javascript nor PHP.
from 2 to ... primes C++ Internet Explorer 10 Firefox 18 Opera 12 Chrome 24 PHP 5.3
10 4 <0.001s <0.001s <0.001s <0.001s <0.001s <0.001s
100 25 <0.001s <0.001s <0.001s <0.001s <0.001s <0.001s
1,000 168 <0.001s <0.001s <0.001s <0.001s <0.001s <0.001s
10,000 1,229 <0.001s <0.001s <0.001s <0.001s <0.001s 0.003s
100,000 9,592 <0.001s 0.001s 0.001s 0.002s 0.001s 0.027s
1,000,000 78,498 0.002s 0.010s 0.007s 0.022s 0.015s 0.393s
10,000,000 664,579 0.024s 0.139s 0.103s 0.248s 0.190s 3.717s
100,000,000 5,761,455 0.555s about 1.7s about 1.2s about 2.7s about 2.0s about 29s (*)
1,000,000,000 50,847,534 6.7s n/a n/a n/a n/a n/a
For your interest: the browser ranking is contrary to what I have observed with my Javascript global illumination port of smallpt. I expected PHP to be faster, too.

UPDATE February 14, 2013: More than a year after my initial tests, Firefox became the fastest browser due to its vastly improved Javascript engine. Git users: scroll down to the repository link
Download  prime.cpp
Latest release: April 24, 2013, size: 5963 bytes, 234 lines

CRC32:fc281a08
MD5:b483e2911cb1ed64939c0cf227648743
SHA1:ab1333e4b873faddab852f113894e06bc7d5b11a
SHA256:ca28959cdb40d3a0f58d08a5f3a7a67c4d9dcd99fc52254e239b0236751eb95d

Stay up-to-date:git clone http://create.stephan-brumme.com/eratosthenes/.git

If you encounter any bugs/problems or have ideas for improving future versions, please write me an email: create@stephan-brumme.com

Changelog

Basic Algorithm

The Sieve of Eratosthenes can be implemented in just a few lines of C/C++ code:
hide // simple serial sieve of Eratosthenes int eratosthenes(int lastNumber) { // initialize char* isPrime = new char[lastNumber+1]; for (int i = 0; i <= lastNumber; i++) isPrime[i] = 1; // find all non-primes for (int i = 2; i*i <= lastNumber; i++) if (isPrime[i]) for (int j = i*i; j <= lastNumber; j += i) isPrime[j] = 0; // sieve is complete, count primes int found = 0; for (int i = 2; i <= lastNumber; i++) found += isPrime[i]; delete[] isPrime; return found; }
Initially, each number is assumed to be prime, i.e. all elements of isPrime are set to 1. Then all multiples of i need to be set to 0. A final scan over the array counts the number of primes.

My first attempt was based on a bool array instead of char, true instead of 1 and false instead of 0. Even though bools were more intuitive, they were a bit slower (≈7%) because the final scan contained a branch which often flushed the CPU pipeline after a mis-prediction.
for (int i = 2; i <= lastNumber; i++) if (isPrime[i]) found++;
Both data types use 1 byte per element. Moreover, true is by default internally represented as 1 and false is internally represented as 0. That means, the binary code generated by the C++ compiler for isPrime is identical for bool and char - except for the more efficient final scan of char.

Time for lastNumber = 1 billion: 13.5 seconds

Against All Odds

The only even prime number is 2, all other primes are odd. This observation saves 50% memory and is nearly twice as fast as the basic algorithm while requiring only minor code modifications.
hide // odd-only sieve int eratosthenesOdd(int lastNumber) { int memorySize = (lastNumber-1)/2; // initialize char* isPrime = new char[memorySize+1]; for (int i = 0; i <= memorySize; i++) isPrime[i] = 1; // find all odd non-primes for (int i = 3; i*i <= lastNumber; i += 2) if (isPrime[i/2]) for (int j = i*i; j <= lastNumber; j += 2*i) isPrime[j/2] = 0; // sieve is complete, count primes int found = lastNumber >= 2 ? 1 : 0; for (int i = 1; i <= memorySize; i++) found += isPrime[i]; delete[] isPrime; return found; }
Time for lastNumber = 1 billion: 6.7 seconds, speed-up 2.0x

Parallelizing The Sieve

Both code pieces shown above run on a single core. When I added OpenMP #pragmas, Visual C++ rejected my for-loop because the termination condition i*i <= lastNumber is too complex (at least for the 2008 compiler's OpenMP implementation). After changing it to i <= lastNumberSqrt the compiler was happy.

Unfortunately, the performance gain was negligible: I went from 7.4 to 7.2 seconds. The task manager told me that OpenMP distributed the work in its thread pool not evenly: some cores ran at 100% but most were idle. What happened ? The execution time of the outer for-loop can be almost zero (if !isPrime[i/2]), quite high (if i is small) or quite low (if i is close to lastNumberSqrt). The default OpenMP scheduling strategy static assumes that each pass takes about the same time. When I explicitly request dynamic scheduling, the sieve finishes after 5.4 seconds. The final scan needs to be declared as reduction(+:found) to properly count the number of primes in the shared variable found.
hide // odd-only sieve int eratosthenesOdd(int lastNumber, bool useOpenMP) { // enable/disable OpenMP omp_set_num_threads(useOpenMP ? omp_get_num_procs() : 1); // instead of i*i <= lastNumber we write i <= lastNumberSquareRoot to help OpenMP const int lastNumberSqrt = (int)sqrt((double)lastNumber); int memorySize = (lastNumber-1)/2; // initialize char* isPrime = new char[memorySize+1]; #pragma omp parallel for for (int i = 0; i <= memorySize; i++) isPrime[i] = 1; // find all odd non-primes #pragma omp parallel for schedule(dynamic) for (int i = 3; i <= lastNumberSqrt; i += 2) if (isPrime[i/2]) for (int j = i*i; j <= lastNumber; j += 2*i) isPrime[j/2] = 0; // sieve is complete, count primes int found = lastNumber >= 2 ? 1 : 0; #pragma omp parallel for reduction(+:found) for (int i = 1; i <= memorySize; i++) found += isPrime[i]; delete[] isPrime; return found; }
Time for lastNumber = 1 billion: 5.4 seconds, speed-up 2.5x

Block-Wise Sieve

Thanks to OpenMP all 4 CPU cores (plus 4 hyperthreading cores) run under full load. But the performance gain isn't impressive at all: the program became only 25% faster despite burning 700% more CPU cycles. The memory system - including the CPU caches - turns out to be the bottleneck.

Instead of processing all numbers at once, I rewrote the odd-only sieve's code to find all prime numbers inside a range from ... to. Of major interest is minJ which is the starting value of the inner for-loop.

UPDATE October 6, 2011: I added divisibility tests which cut computing time almost in half. More than 13 doesn't give any improvement (test outweights its gain).
hide // process only odd numbers of a specified block int eratosthenesOddSingleBlock(const int from, const int to) { const int memorySize = (to - from + 1) / 2; // initialize char* isPrime = new char[memorySize]; for (int i = 0; i < memorySize; i++) isPrime[i] = 1; for (int i = 3; i*i <= to; i+=2) { // >>> UPDATE October 6, 2011 // skip multiples of three: 9, 15, 21, 27, ... if (i >= 3*3 && i % 3 == 0) continue; // skip multiples of five if (i >= 5*5 && i % 5 == 0) continue; // skip multiples of seven if (i >= 7*7 && i % 7 == 0) continue; // skip multiples of eleven if (i >= 11*11 && i % 11 == 0) continue; // skip multiples of thirteen if (i >= 13*13 && i % 13 == 0) continue; // <<< UPDATE October 6, 2011 // skip numbers before current slice int minJ = ((from+i-1)/i)*i; if (minJ < i*i) minJ = i*i; // start value must be odd if ((minJ & 1) == 0) minJ += i; // find all odd non-primes for (int j = minJ; j <= to; j += 2*i) { int index = j - from; isPrime[index/2] = 0; } } // count primes in this block int found = 0; for (int i = 0; i < memorySize; i++) found += isPrime[i]; // 2 is not odd => include on demand if (from <= 2) found++; delete[] isPrime; return found; } // process slice-by-slice, odd numbers only int eratosthenesBlockwise(int lastNumber, int sliceSize, bool useOpenMP) { int found = 0; // each slices covers ["from" ... "to"], incl. "from" and "to" for (int from = 2; from <= lastNumber; from += sliceSize) { int to = from + sliceSize; if (to > lastNumber) to = lastNumber; found += eratosthenesOddSingleBlock(from, to); } return found; }
A simple, serial program now needs less memory and is faster than the previous OpenMP-version. In plain English: a single core outperform 4x2 cores !

I got the best timings for sliceSize=128*1024. Then, processing 1 billion numbers with the block-wise approach requires only 128KB RAM instead of 1GByte.

Time for lastNumber = 1 billion: 2.4 seconds, speed-up 5.6x

Parallelizing Block-Wise Sieve

The simplest task was to add OpenMP to the block-wise sieve: only add a pragma to eratosthenesBlockwise's for-loop. To reduce to lines of code in my benchmark, I included an OpenMP on/off-switch.
hide // process slice-by-slice, odd numbers only int eratosthenesBlockwise(int lastNumber, int sliceSize, bool useOpenMP) { // enable/disable OpenMP omp_set_num_threads(useOpenMP ? omp_get_num_procs() : 1); int found = 0; // each slices covers ["from" ... "to"], incl. "from" and "to" #pragma omp parallel for reduction(+:found) for (int from = 2; from <= lastNumber; from += sliceSize) { int to = from + sliceSize; if (to > lastNumber) to = lastNumber; found += eratosthenesOddSingleBlock(from, to); } return found; }
Time for lastNumber = 1 billion: 0.65 seconds, speed-up 20.7x

Live Demo's Javascript Code

No significant changes from the C++ Odd-Only code:
hide var isPrime = new Array(); function eratosthenesOdd(lastNumber) { var memorySize = (lastNumber-1) >> 1; // initialize for (var i = 0; i <= memorySize; i++) isPrime[i] = 1; // find all odd non-primes for (var i = 3; i*i <= lastNumber; i += 2) if (isPrime[i >> 1] == 1) for (var j = i*i; j <= lastNumber; j += 2*i) isPrime[j >> 1] = 0; // sieve is complete, count primes var found = (lastNumber >= 2) ? 1 : 0; for (var i = 1; i <= memorySize; i++) found += isPrime[i]; return found; }

Live Demo's Asm.JS Code

Pretty much the same code as above but with all the weird looking Asm.JS syntax:
hide function AsmJs(stdlib, foreign, heap) { "use asm"; function eratosthenesOdd(lastNumber) { lastNumber = lastNumber|0; var memorySize = ((lastNumber - 1) >> 1)|0; // initialize var isPrime = new stdlib.Int8Array(heap); for (var i = 0|0; i <= memorySize|0; i++) isPrime[i|0] = 1|0; // find all odd non-primes for (var i = 3|0; i*i <= lastNumber|0; i += 2|0) if (isPrime[i >> 1|0] == 1|0) for (var j = i*i; j <= lastNumber|0; j += 2*i) isPrime[j >> 1|0] = 0|0; // include 2 (which is even, not odd) var found = (lastNumber >= 2) ? (1|0) : (0|0); // sieve is complete, count primes for (var i = 1|0; i <= memorySize|0; i++) found += isPrime[i|0]|0; return found; } // export return { eratosthenesOdd : eratosthenesOdd }; } // create AsmJs wrapper and call worker method function asmjsEratosthenesOdd(lastNumber) { var module = AsmJs(window, null, new Int8Array(lastNumber / 2 + 1)); return module.eratosthenesOdd(lastNumber); }

Live Demo's PHP Code

My initial port to PHP was based on array. Everything worked fine but I soon ran out of memory because PHP's hash table (the technology behind array) requires too many bytes per item.

The current solution "abuses" a very long string, consisting of '0' and '1' characters. The build-in functions str_repeat and substr_count simplify and speed-up string creation / evaluation. I chose to limit lastNumber to 25,000,000 when the script is executed on my public web server. At home (localhost), no limits are imposed.
hide <? // from 2 to number passed via GET, e.g. eratosthenes.ajax?12345 $lastNumber = $_SERVER["QUERY_STRING"]; if ($lastNumber < 1) die("<b>(invalid)</b>"); // limit $lastNumber depending on server location (localhost or create.stephan-brumme.com) if ($_SERVER["HTTP_HOST"] == "localhost") set_time_limit(0); // no limits ! else if ($lastNumber > 25*1000*1000) // 25 million die("<i>(limited to <b>25000000</b> to reduce server load)</i>"); $startTime = microtime(true); // initialize $memorySize = $lastNumber >> 1; $isPrime = str_repeat(1, $memorySize); // find all odd non-primes for ($i = 3; $i*$i <= $lastNumber; $i += 2) if ($isPrime[$i >> 1] == 1) for ($j = $i*$i; $j <= $lastNumber; $j += 2*$i) $isPrime[$j >> 1] = 0; // sieve is complete, count primes $found = substr_count($isPrime, 1); $duration = number_format(microtime(true) - $startTime, 3); echo "<b>$found primes</b> found ($duration seconds)"; ?>
homepage