Parallel Sieve of Eratosthenes [updated]
posted by Stephan Brumme,
updated
Processing 1 Billion Numbers Per Second
There are 50,847,534 prime numbers between 2 and 1,000,000,000. For reasons unknown, I wanted to know whether a C++ implementation can find all these numbers on my desktop computer (Intel Core i7 860, quadcore, hyperthreading, 2.8GHz) in less than 1 second with a simple algorithm such as the Sieve of Eratosthenes.By only slightly modifying the basic algorithm and unleashing the raw power of OpenMP, I achieved a
UPDATE October 6, 2011: After adding a simple test for divisibility by 3 (and 5, 7, 9, 11, 13), the NEW blockwise sieve gets much faster and swiftly breaks through the 1secondbarrier.
Algorithm  OpenMP  from 2 to 1 billion  SpeedUp  CPU ticks (approx.)  Notes  

Basic Algorithm  no  50847534 primes  13.5 seconds  (1.0x)  40 billion  shortest code (22 lines) 
Odd Numbers Only  no  50847534 primes  6.7 seconds  2.0x  20 billion  only 10% more code than basic algorithm 
Odd Numbers Only  yes, 8 threads  50847534 primes  5.4 seconds  2.5x  130 billion  
BlockWise NEW  no  50847534 primes  2.4 seconds  5.6x  7 billion  most efficient 
BlockWise NEW  yes, 8 threads  50847534 primes  0.65 seconds  20.7x  16 billion  fastest 
Live Demo
I ported the short Odd Numbers Only algorithm to Javascript and PHP; the code can be seen at the bottom of this page. The table below shows the expected number of primes for some reference inputs and the performance numbers measured on my computer.In this benchmark, C++ is approx. 25x faster than the best Javascript engine (Internet Explorer 9) and approx. 100x faster than PHP 5.3. Only the PHP times marked with (*) are measured on my local computer while all other PHP timings were measured on the create.stephanbrumme.com server (which is about 25% bit slower).
All browsers are still 32 bit programs  while my operating system is already 64 bit  and run out of memory when attempting to find all prime numbers from 1 to 1 billion. The more complex blockwise sieve does not suffer from this problem but I was too lazy to port it, too.
Multithreading is currently not supported by neither Javascript nor PHP.
from 2 to ...  primes  C++  Internet Explorer 10  Firefox 18  Opera 12  Chrome 24  PHP 5.3 

10  4  <0.001s  <0.001s  <0.001s  <0.001s  <0.001s  <0.001s 
100  25  <0.001s  <0.001s  <0.001s  <0.001s  <0.001s  <0.001s 
1,000  168  <0.001s  <0.001s  <0.001s  <0.001s  <0.001s  <0.001s 
10,000  1,229  <0.001s  <0.001s  <0.001s  <0.001s  <0.001s  0.003s 
100,000  9,592  <0.001s  0.001s  0.001s  0.002s  0.001s  0.027s 
1,000,000  78,498  0.002s  0.010s  0.007s  0.022s  0.015s  0.393s 
10,000,000  664,579  0.024s  0.139s  0.103s  0.248s  0.190s  3.717s 
100,000,000  5,761,455  0.555s  about 1.7s  about 1.2s  about 2.7s  about 2.0s  about 29s (*) 
1,000,000,000  50,847,534  6.7s  n/a  n/a  n/a  n/a  n/a 
UPDATE February 14, 2013: More than a year after my initial tests, Firefox became the fastest browser due to its vastly improved Javascript engine.
Download
Git users: scroll down for the repository link Latest release: April 24, 2013, size: 5963 bytes, 234 linesCRC32:
fc281a08
MD5:
b483e2911cb1ed64939c0cf227648743
SHA1:
ab1333e4b873faddab852f113894e06bc7d5b11a
SHA256:
ca28959cdb40d3a0f58d08a5f3a7a67c4d9dcd99fc52254e239b0236751eb95d
Stay uptodate:git clone http://create.stephanbrumme.com/eratosthenes/.git
If you encounter any bugs/problems or have ideas for improving future versions, please write me an email: create@stephanbrumme.com
Changelog
 License
 April 24, 2013
 zlibstyle license, applies to all past and future versions
 version 3
 November 9, 2011
 fixed spelling
 version 2
 October 6, 2011
 added divisibility tests
 improved 64 bit portability
 version 1
 September 17, 2011
 initial release
Basic Algorithm
The Sieve of Eratosthenes can be implemented in just a few lines of C/C++ code:
hide
Initially, each number is assumed to be prime, i.e. all elements of
// simple serial sieve of Eratosthenes
int eratosthenes(int lastNumber)
{
// initialize
char* isPrime = new char[lastNumber+1];
for (int i = 0; i <= lastNumber; i++)
isPrime[i] = 1;
// find all nonprimes
for (int i = 2; i*i <= lastNumber; i++)
if (isPrime[i])
for (int j = i*i; j <= lastNumber; j += i)
isPrime[j] = 0;
// sieve is complete, count primes
int found = 0;
for (int i = 2; i <= lastNumber; i++)
found += isPrime[i];
delete[] isPrime;
return found;
}
isPrime
are set to 1
.
Then all multiples of i
need to be set to 0
.
A final scan over the array counts the number of primes.My first attempt was based on a
bool
array instead of char
,
true
instead of 1
and
false
instead of 0
.
Even though bool
s were more intuitive, they were a bit slower (≈7%)
because the final scan contained a branch which often flushed the CPU pipeline after a misprediction.
for (int i = 2; i <= lastNumber; i++)
if (isPrime[i])
found++;
true
is by default internally represented as 1
and
false
is internally represented as 0
.
That means, the binary code generated by the C++ compiler for isPrime
is identical
for bool
and char

except for the more efficient final scan of char
.Time for
lastNumber
= 1 billion: 13.5 seconds
Against All Odds
The only even prime number is 2, all other primes are odd. This observation saves 50% memory and is nearly twice as fast as the basic algorithm while requiring only minor code modifications.
hide
Time for
// oddonly sieve
int eratosthenesOdd(int lastNumber)
{
int memorySize = (lastNumber1)/2;
// initialize
char* isPrime = new char[memorySize+1];
for (int i = 0; i <= memorySize; i++)
isPrime[i] = 1;
// find all odd nonprimes
for (int i = 3; i*i <= lastNumber; i += 2)
if (isPrime[i/2])
for (int j = i*i; j <= lastNumber; j += 2*i)
isPrime[j/2] = 0;
// sieve is complete, count primes
int found = lastNumber >= 2 ? 1 : 0;
for (int i = 1; i <= memorySize; i++)
found += isPrime[i];
delete[] isPrime;
return found;
}
lastNumber
= 1 billion: 6.7 seconds, speedup 2.0x
Parallelizing The Sieve
Both code pieces shown above run on a single core. When I added OpenMP#pragma
s, Visual C++ rejected my for
loop
because the termination condition i*i <= lastNumber
is too complex
(at least for the 2008 compiler's OpenMP implementation).
After changing it to i <= lastNumberSqrt
the compiler was happy.Unfortunately, the performance gain was negligible: I went from 7.4 to 7.2 seconds. The task manager told me that OpenMP distributed the work in its thread pool not evenly: some cores ran at 100% but most were idle. What happened ? The execution time of the outer
for
loop can be almost zero (if !isPrime[i/2]
),
quite high (if i
is small) or
quite low (if i
is close to lastNumberSqrt
).
The default OpenMP scheduling strategy static
assumes that each pass
takes about the same time.
When I explicitly request dynamic
scheduling, the sieve finishes after 5.4 seconds.
The final scan needs to be declared as reduction(+:found)
to properly
count the number of primes in the shared variable found
.
hide
Time for
// oddonly sieve
int eratosthenesOdd(int lastNumber, bool useOpenMP)
{
// enable/disable OpenMP
omp_set_num_threads(useOpenMP ? omp_get_num_procs() : 1);
// instead of i*i <= lastNumber we write i <= lastNumberSquareRoot to help OpenMP
const int lastNumberSqrt = (int)sqrt((double)lastNumber);
int memorySize = (lastNumber1)/2;
// initialize
char* isPrime = new char[memorySize+1];
#pragma omp parallel for
for (int i = 0; i <= memorySize; i++)
isPrime[i] = 1;
// find all odd nonprimes
#pragma omp parallel for schedule(dynamic)
for (int i = 3; i <= lastNumberSqrt; i += 2)
if (isPrime[i/2])
for (int j = i*i; j <= lastNumber; j += 2*i)
isPrime[j/2] = 0;
// sieve is complete, count primes
int found = lastNumber >= 2 ? 1 : 0;
#pragma omp parallel for reduction(+:found)
for (int i = 1; i <= memorySize; i++)
found += isPrime[i];
delete[] isPrime;
return found;
}
lastNumber
= 1 billion: 5.4 seconds, speedup 2.5x
BlockWise Sieve
Thanks to OpenMP all 4 CPU cores (plus 4 hyperthreading cores) run under full load. But the performance gain isn't impressive at all: the program became only 25% faster despite burning 700% more CPU cycles. The memory system  including the CPU caches  turns out to be the bottleneck.Instead of processing all numbers at once, I rewrote the oddonly sieve's code to find all prime numbers inside a range
from ... to
.
Of major interest is minJ
which is the starting value of
the inner for
loop.UPDATE October 6, 2011: I added divisibility tests which cut computing time almost in half. More than 13 doesn't give any improvement (test outweights its gain).
hide
A simple, serial program now needs less memory and is faster than the previous OpenMPversion.
In plain English: a single core outperform 4x2 cores !
// process only odd numbers of a specified block
int eratosthenesOddSingleBlock(const int from, const int to)
{
const int memorySize = (to  from + 1) / 2;
// initialize
char* isPrime = new char[memorySize];
for (int i = 0; i < memorySize; i++)
isPrime[i] = 1;
for (int i = 3; i*i <= to; i+=2)
{
// >>> UPDATE October 6, 2011
// skip multiples of three: 9, 15, 21, 27, ...
if (i >= 3*3 && i % 3 == 0)
continue;
// skip multiples of five
if (i >= 5*5 && i % 5 == 0)
continue;
// skip multiples of seven
if (i >= 7*7 && i % 7 == 0)
continue;
// skip multiples of eleven
if (i >= 11*11 && i % 11 == 0)
continue;
// skip multiples of thirteen
if (i >= 13*13 && i % 13 == 0)
continue;
// <<< UPDATE October 6, 2011
// skip numbers before current slice
int minJ = ((from+i1)/i)*i;
if (minJ < i*i)
minJ = i*i;
// start value must be odd
if ((minJ & 1) == 0)
minJ += i;
// find all odd nonprimes
for (int j = minJ; j <= to; j += 2*i)
{
int index = j  from;
isPrime[index/2] = 0;
}
}
// count primes in this block
int found = 0;
for (int i = 0; i < memorySize; i++)
found += isPrime[i];
// 2 is not odd => include on demand
if (from <= 2)
found++;
delete[] isPrime;
return found;
}
// process slicebyslice, odd numbers only
int eratosthenesBlockwise(int lastNumber, int sliceSize, bool useOpenMP)
{
int found = 0;
// each slices covers ["from" ... "to"], incl. "from" and "to"
for (int from = 2; from <= lastNumber; from += sliceSize)
{
int to = from + sliceSize;
if (to > lastNumber)
to = lastNumber;
found += eratosthenesOddSingleBlock(from, to);
}
return found;
}
I got the best timings for
sliceSize=128*1024
.
Then, processing 1 billion numbers with the blockwise approach requires only 128KB RAM instead of 1GByte.Time for
lastNumber
= 1 billion: 2.4 seconds, speedup 5.6x
Parallelizing BlockWise Sieve
The simplest task was to add OpenMP to the blockwise sieve: only add a pragma toeratosthenesBlockwise
's for
loop.
To reduce to lines of code in my benchmark, I included an OpenMP on/offswitch.
hide
Time for
// process slicebyslice, odd numbers only
int eratosthenesBlockwise(int lastNumber, int sliceSize, bool useOpenMP)
{
// enable/disable OpenMP
omp_set_num_threads(useOpenMP ? omp_get_num_procs() : 1);
int found = 0;
// each slices covers ["from" ... "to"], incl. "from" and "to"
#pragma omp parallel for reduction(+:found)
for (int from = 2; from <= lastNumber; from += sliceSize)
{
int to = from + sliceSize;
if (to > lastNumber)
to = lastNumber;
found += eratosthenesOddSingleBlock(from, to);
}
return found;
}
lastNumber
= 1 billion: 0.65 seconds, speedup 20.7x
Live Demo's Javascript Code
No significant changes from the C++ OddOnly code:
hide
var isPrime = new Array();
function eratosthenesOdd(lastNumber)
{
var memorySize = (lastNumber1) >> 1;
// initialize
for (var i = 0; i <= memorySize; i++)
isPrime[i] = 1;
// find all odd nonprimes
for (var i = 3; i*i <= lastNumber; i += 2)
if (isPrime[i >> 1] == 1)
for (var j = i*i; j <= lastNumber; j += 2*i)
isPrime[j >> 1] = 0;
// sieve is complete, count primes
var found = (lastNumber >= 2) ? 1 : 0;
for (var i = 1; i <= memorySize; i++)
found += isPrime[i];
return found;
}
Live Demo's PHP Code
My initial port to PHP was based onarray
.
Everything worked fine but I soon ran out of memory because PHP's hash table (the technology behind
array
) requires too many bytes per item.The current solution "abuses" a very long string, consisting of
'0'
and
'1'
characters.
The buildin functions str_repeat
and substr_count
simplify and speedup string creation / evaluation.
I chose to limit lastNumber
to 25,000,000 when the script is executed
on my public web server.
At home (localhost
), no limits are imposed.
hide
<?
// from 2 to number passed via GET, e.g. eratosthenes.ajax?12345
$lastNumber = $_SERVER["QUERY_STRING"];
if ($lastNumber < 1)
die("<b>(invalid)</b>");
// limit $lastNumber depending on server location (localhost or create.stephanbrumme.com)
if ($_SERVER["HTTP_HOST"] == "localhost")
set_time_limit(0); // no limits !
else
if ($lastNumber > 25*1000*1000) // 25 million
die("<i>(limited to <b>25000000</b> to reduce server load)</i>");
$startTime = microtime(true);
// initialize
$memorySize = $lastNumber >> 1;
$isPrime = str_repeat(1, $memorySize);
// find all odd nonprimes
for ($i = 3; $i*$i <= $lastNumber; $i += 2)
if ($isPrime[$i >> 1] == 1)
for ($j = $i*$i; $j <= $lastNumber; $j += 2*$i)
$isPrime[$j >> 1] = 0;
// sieve is complete, count primes
$found = substr_count($isPrime, 1);
$duration = number_format(microtime(true)  $startTime, 3);
echo "<b>$found primes</b> found ($duration seconds)";
?>