Parallel Sieve of Eratosthenes [updated]
posted by Stephan Brumme,
updated
Processing 1 Billion Numbers Per Second
There are 50,847,534 prime numbers between 2 and 1,000,000,000. For reasons unknown, I wanted to know whether a C++ implementation can find all these numbers on my desktop computer (Intel Core i7 860, quad-core, hyperthreading, 2.8GHz) in less than 1 second with a simple algorithm such as the Sieve of Eratosthenes.By only slightly modifying the basic algorithm and unleashing the raw power of OpenMP, I achieved a
UPDATE October 6, 2011: After adding a simple test for divisibility by 3 (and 5, 7, 9, 11, 13), the NEW block-wise sieve gets much faster and swiftly breaks through the 1-second-barrier.
UPDATE July 7, 2014: Many browsers support asm.js to boost Javascript performance. Scroll down to find my asm.js implementation or click here.
Algorithm | OpenMP | from 2 to 1 billion | Speed-Up | CPU ticks (approx.) | Notes | |
---|---|---|---|---|---|---|
Basic Algorithm | no | 50847534 primes | 13.5 seconds | (1.0x) | 40 billion | shortest code (22 lines) |
Odd Numbers Only | no | 50847534 primes | 6.7 seconds | 2.0x | 20 billion | only 10% more code than basic algorithm |
Odd Numbers Only | yes, 8 threads | 50847534 primes | 5.4 seconds | 2.5x | 130 billion | |
Block-Wise NEW | no | 50847534 primes | 2.4 seconds | 5.6x | 7 billion | most efficient |
Block-Wise NEW | yes, 8 threads | 50847534 primes | 0.65 seconds | 20.7x | 16 billion | fastest |
Live Demo
I ported the short Odd Numbers Only algorithm to Javascript and PHP; the code can be seen at the bottom of this page. The table below shows the expected number of primes for some reference inputs and the performance numbers measured on my computer.In this benchmark, C++ is approx. 4x faster than the best Javascript engine (Mozilla Firefox) and approx. 100x faster than PHP 5.3. Only the PHP times marked with (*) are measured on my local computer while all other PHP timings were measured on the create.stephan-brumme.com server (which is about 25% bit slower).
All browsers are still 32 bit programs - while my operating system is already 64 bit - and run out of memory when attempting to find all prime numbers from 1 to 1 billion. The more complex blockwise sieve does not suffer from this problem but I was too lazy to port it, too.
Multithreading is currently not supported by neither Javascript nor PHP.
from 2 to ... | primes | C++ | Internet Explorer 10 | Firefox 18 | Opera 12 | Chrome 24 | PHP 5.3 |
---|---|---|---|---|---|---|---|
10 | 4 | <0.001s | <0.001s | <0.001s | <0.001s | <0.001s | <0.001s |
100 | 25 | <0.001s | <0.001s | <0.001s | <0.001s | <0.001s | <0.001s |
1,000 | 168 | <0.001s | <0.001s | <0.001s | <0.001s | <0.001s | <0.001s |
10,000 | 1,229 | <0.001s | <0.001s | <0.001s | <0.001s | <0.001s | 0.003s |
100,000 | 9,592 | <0.001s | 0.001s | 0.001s | 0.002s | 0.001s | 0.027s |
1,000,000 | 78,498 | 0.002s | 0.010s | 0.007s | 0.022s | 0.015s | 0.393s |
10,000,000 | 664,579 | 0.024s | 0.139s | 0.103s | 0.248s | 0.190s | 3.717s |
100,000,000 | 5,761,455 | 0.555s | about 1.7s | about 1.2s | about 2.7s | about 2.0s | about 29s (*) |
1,000,000,000 | 50,847,534 | 6.7s | n/a | n/a | n/a | n/a | n/a |
UPDATE February 14, 2013: More than a year after my initial tests, Firefox became the fastest browser due to its vastly improved Javascript engine.
Download
Git users: scroll down to the repository link Latest release: April 24, 2013, size: 5.8 kBytes, 234 linesCRC32:
fc281a08
MD5:
b483e2911cb1ed64939c0cf227648743
SHA1:
ab1333e4b873faddab852f113894e06bc7d5b11a
SHA256:
ca28959cdb40d3a0f58d08a5f3a7a67c4d9dcd99fc52254e239b0236751eb95d
Stay up-to-date:git clone https://create.stephan-brumme.com/eratosthenes/git
GitHub mirror:https://github.com/stbrumme/eratosthenes
If you encounter any bugs/problems or have ideas for improving future versions, please write me an email: create@stephan-brumme.com
Changelog
- License
- April 24, 2013
- zlib-style license, applies to all past and future versions
- version 3
- November 9, 2011
- fixed spelling
- Git tag
eratosthenes_v3
- version 2
- October 6, 2011
- added divisibility tests
- improved 64 bit portability
- version 1
- September 17, 2011
- initial release
Basic Algorithm
The Sieve of Eratosthenes can be implemented in just a few lines of C/C++ code:
hide
Initially, each number is assumed to be prime, i.e. all elements of
// simple serial sieve of Eratosthenes
int eratosthenes(int lastNumber)
{
// initialize
char* isPrime = new char[lastNumber+1];
for (int i = 0; i <= lastNumber; i++)
isPrime[i] = 1;
// find all non-primes
for (int i = 2; i*i <= lastNumber; i++)
if (isPrime[i])
for (int j = i*i; j <= lastNumber; j += i)
isPrime[j] = 0;
// sieve is complete, count primes
int found = 0;
for (int i = 2; i <= lastNumber; i++)
found += isPrime[i];
delete[] isPrime;
return found;
}
isPrime
are set to 1
.
Then all multiples of i
need to be set to 0
.
A final scan over the array counts the number of primes.My first attempt was based on a
bool
array instead of char
,
true
instead of 1
and
false
instead of 0
.
Even though bool
s were more intuitive, they were a bit slower (≈7%)
because the final scan contained a branch which often flushed the CPU pipeline after a mis-prediction.
for (int i = 2; i <= lastNumber; i++)
if (isPrime[i])
found++;
true
is by default internally represented as 1
and
false
is internally represented as 0
.
That means, the binary code generated by the C++ compiler for isPrime
is identical
for bool
and char
-
except for the more efficient final scan of char
.Time for
lastNumber
= 1 billion: 13.5 seconds
Against All Odds
The only even prime number is 2, all other primes are odd. This observation saves 50% memory and is nearly twice as fast as the basic algorithm while requiring only minor code modifications.
hide
Time for
// odd-only sieve
int eratosthenesOdd(int lastNumber)
{
int memorySize = (lastNumber-1)/2;
// initialize
char* isPrime = new char[memorySize+1];
for (int i = 0; i <= memorySize; i++)
isPrime[i] = 1;
// find all odd non-primes
for (int i = 3; i*i <= lastNumber; i += 2)
if (isPrime[i/2])
for (int j = i*i; j <= lastNumber; j += 2*i)
isPrime[j/2] = 0;
// sieve is complete, count primes
int found = lastNumber >= 2 ? 1 : 0;
for (int i = 1; i <= memorySize; i++)
found += isPrime[i];
delete[] isPrime;
return found;
}
lastNumber
= 1 billion: 6.7 seconds, speed-up 2.0x
Parallelizing The Sieve
Both code pieces shown above run on a single core. When I added OpenMP#pragma
s, Visual C++ rejected my for
-loop
because the termination condition i*i <= lastNumber
is too complex
(at least for the 2008 compiler's OpenMP implementation).
After changing it to i <= lastNumberSqrt
the compiler was happy.Unfortunately, the performance gain was negligible: I went from 7.4 to 7.2 seconds. The task manager told me that OpenMP distributed the work in its thread pool not evenly: some cores ran at 100% but most were idle. What happened ? The execution time of the outer
for
-loop can be almost zero (if !isPrime[i/2]
),
quite high (if i
is small) or
quite low (if i
is close to lastNumberSqrt
).
The default OpenMP scheduling strategy static
assumes that each pass
takes about the same time.
When I explicitly request dynamic
scheduling, the sieve finishes after 5.4 seconds.
The final scan needs to be declared as reduction(+:found)
to properly
count the number of primes in the shared variable found
.
hide
Time for
// odd-only sieve
int eratosthenesOdd(int lastNumber, bool useOpenMP)
{
// enable/disable OpenMP
omp_set_num_threads(useOpenMP ? omp_get_num_procs() : 1);
// instead of i*i <= lastNumber we write i <= lastNumberSquareRoot to help OpenMP
const int lastNumberSqrt = (int)sqrt((double)lastNumber);
int memorySize = (lastNumber-1)/2;
// initialize
char* isPrime = new char[memorySize+1];
#pragma omp parallel for
for (int i = 0; i <= memorySize; i++)
isPrime[i] = 1;
// find all odd non-primes
#pragma omp parallel for schedule(dynamic)
for (int i = 3; i <= lastNumberSqrt; i += 2)
if (isPrime[i/2])
for (int j = i*i; j <= lastNumber; j += 2*i)
isPrime[j/2] = 0;
// sieve is complete, count primes
int found = lastNumber >= 2 ? 1 : 0;
#pragma omp parallel for reduction(+:found)
for (int i = 1; i <= memorySize; i++)
found += isPrime[i];
delete[] isPrime;
return found;
}
lastNumber
= 1 billion: 5.4 seconds, speed-up 2.5x
Block-Wise Sieve
Thanks to OpenMP all 4 CPU cores (plus 4 hyperthreading cores) run under full load. But the performance gain isn't impressive at all: the program became only 25% faster despite burning 700% more CPU cycles. The memory system - including the CPU caches - turns out to be the bottleneck.Instead of processing all numbers at once, I rewrote the odd-only sieve's code to find all prime numbers inside a range
from ... to
.
Of major interest is minJ
which is the starting value of
the inner for
-loop.UPDATE October 6, 2011: I added divisibility tests which cut computing time almost in half. More than 13 doesn't give any improvement (test outweights its gain).
hide
A simple, serial program now needs less memory and is faster than the previous OpenMP-version.
In plain English: a single core outperform 4x2 cores !
// process only odd numbers of a specified block
int eratosthenesOddSingleBlock(const int from, const int to)
{
const int memorySize = (to - from + 1) / 2;
// initialize
char* isPrime = new char[memorySize];
for (int i = 0; i < memorySize; i++)
isPrime[i] = 1;
for (int i = 3; i*i <= to; i+=2)
{
// >>> UPDATE October 6, 2011
// skip multiples of three: 9, 15, 21, 27, ...
if (i >= 3*3 && i % 3 == 0)
continue;
// skip multiples of five
if (i >= 5*5 && i % 5 == 0)
continue;
// skip multiples of seven
if (i >= 7*7 && i % 7 == 0)
continue;
// skip multiples of eleven
if (i >= 11*11 && i % 11 == 0)
continue;
// skip multiples of thirteen
if (i >= 13*13 && i % 13 == 0)
continue;
// <<< UPDATE October 6, 2011
// skip numbers before current slice
int minJ = ((from+i-1)/i)*i;
if (minJ < i*i)
minJ = i*i;
// start value must be odd
if ((minJ & 1) == 0)
minJ += i;
// find all odd non-primes
for (int j = minJ; j <= to; j += 2*i)
{
int index = j - from;
isPrime[index/2] = 0;
}
}
// count primes in this block
int found = 0;
for (int i = 0; i < memorySize; i++)
found += isPrime[i];
// 2 is not odd => include on demand
if (from <= 2)
found++;
delete[] isPrime;
return found;
}
// process slice-by-slice, odd numbers only
int eratosthenesBlockwise(int lastNumber, int sliceSize, bool useOpenMP)
{
int found = 0;
// each slices covers ["from" ... "to"], incl. "from" and "to"
for (int from = 2; from <= lastNumber; from += sliceSize)
{
int to = from + sliceSize;
if (to > lastNumber)
to = lastNumber;
found += eratosthenesOddSingleBlock(from, to);
}
return found;
}
I got the best timings for
sliceSize=128*1024
.
Then, processing 1 billion numbers with the block-wise approach requires only 128KB RAM instead of 1GByte.Time for
lastNumber
= 1 billion: 2.4 seconds, speed-up 5.6x
Parallelizing Block-Wise Sieve
The simplest task was to add OpenMP to the block-wise sieve: only add a pragma toeratosthenesBlockwise
's for
-loop.
To reduce to lines of code in my benchmark, I included an OpenMP on/off-switch.
hide
Time for
// process slice-by-slice, odd numbers only
int eratosthenesBlockwise(int lastNumber, int sliceSize, bool useOpenMP)
{
// enable/disable OpenMP
omp_set_num_threads(useOpenMP ? omp_get_num_procs() : 1);
int found = 0;
// each slices covers ["from" ... "to"], incl. "from" and "to"
#pragma omp parallel for reduction(+:found)
for (int from = 2; from <= lastNumber; from += sliceSize)
{
int to = from + sliceSize;
if (to > lastNumber)
to = lastNumber;
found += eratosthenesOddSingleBlock(from, to);
}
return found;
}
lastNumber
= 1 billion: 0.65 seconds, speed-up 20.7x
Live Demo's Javascript Code
No significant changes from the C++ Odd-Only code:
hide
var isPrime = new Array();
function eratosthenesOdd(lastNumber)
{
var memorySize = (lastNumber-1) >> 1;
// initialize
for (var i = 0; i <= memorySize; i++)
isPrime[i] = 1;
// find all odd non-primes
for (var i = 3; i*i <= lastNumber; i += 2)
if (isPrime[i >> 1] == 1)
for (var j = i*i; j <= lastNumber; j += 2*i)
isPrime[j >> 1] = 0;
// sieve is complete, count primes
var found = (lastNumber >= 2) ? 1 : 0;
for (var i = 1; i <= memorySize; i++)
found += isPrime[i];
return found;
}
Live Demo's Asm.JS Code
Pretty much the same code as above but with all the weird looking Asm.JS syntax:
hide
function AsmJs(stdlib, foreign, heap)
{
"use asm";
function eratosthenesOdd(lastNumber)
{
lastNumber = lastNumber|0;
var memorySize = ((lastNumber - 1) >> 1)|0;
// initialize
var isPrime = new stdlib.Int8Array(heap);
for (var i = 0|0; i <= memorySize|0; i++)
isPrime[i|0] = 1|0;
// find all odd non-primes
for (var i = 3|0; i*i <= lastNumber|0; i += 2|0)
if (isPrime[i >> 1|0] == 1|0)
for (var j = i*i; j <= lastNumber|0; j += 2*i)
isPrime[j >> 1|0] = 0|0;
// include 2 (which is even, not odd)
var found = (lastNumber >= 2) ? (1|0) : (0|0);
// sieve is complete, count primes
for (var i = 1|0; i <= memorySize|0; i++)
found += isPrime[i|0]|0;
return found;
}
// export
return { eratosthenesOdd : eratosthenesOdd };
}
// create AsmJs wrapper and call worker method
function asmjsEratosthenesOdd(lastNumber)
{
var module = AsmJs(window, null, new Int8Array(lastNumber / 2 + 1));
return module.eratosthenesOdd(lastNumber);
}
Live Demo's PHP Code
My initial port to PHP was based onarray
.
Everything worked fine but I soon ran out of memory because PHP's hash table (the technology behind
array
) requires too many bytes per item.The current solution "abuses" a very long string, consisting of
'0'
and
'1'
characters.
The build-in functions str_repeat
and substr_count
simplify and speed-up string creation / evaluation.
I chose to limit lastNumber
to 25,000,000 when the script is executed
on my public web server.
At home (localhost
), no limits are imposed.
hide
<?
// from 2 to number passed via GET, e.g. eratosthenes.ajax?12345
$lastNumber = $_SERVER["QUERY_STRING"];
if ($lastNumber < 1)
die("<b>(invalid)</b>");
// limit $lastNumber depending on server location (localhost or create.stephan-brumme.com)
if ($_SERVER["HTTP_HOST"] == "localhost")
set_time_limit(0); // no limits !
else
if ($lastNumber > 25*1000*1000) // 25 million
die("<i>(limited to <b>25000000</b> to reduce server load)</i>");
$startTime = microtime(true);
// initialize
$memorySize = $lastNumber >> 1;
$isPrime = str_repeat(1, $memorySize);
// find all odd non-primes
for ($i = 3; $i*$i <= $lastNumber; $i += 2)
if ($isPrime[$i >> 1] == 1)
for ($j = $i*$i; $j <= $lastNumber; $j += 2*$i)
$isPrime[$j >> 1] = 0;
// sieve is complete, count primes
$found = substr_count($isPrime, 1);
$duration = number_format(microtime(true) - $startTime, 3);
echo "<b>$found primes</b> found ($duration seconds)";
?>