Parallel Sieve of Eratosthenes [updated]

posted September 17, 2011 by Stephan Brumme, updated July 7, 2014

Processing 1 Billion Numbers Per Second

There are 50,847,534 prime numbers between 2 and 1,000,000,000. For reasons unknown, I wanted to know whether a C++ implementation can find all these numbers on my desktop computer (Intel Core i7 860, quad-core, hyperthreading, 2.8GHz) in less than 1 second with a simple algorithm such as the Sieve of Eratosthenes.

By only slightly modifying the basic algorithm and unleashing the raw power of OpenMP, I achieved a ~~12x~~ 20x speed-up.
~~In the end, my improved sieve finishes after 1.1 seconds - mission almost accomplished.~~

UPDATE October 6, 2011: After adding a simple test for divisibility by 3 (and 5, 7, 9, 11, 13), the NEW block-wise sieve gets much faster and swiftly breaks through the 1-second-barrier.

UPDATE July 7, 2014: Many browsers support asm.js to boost Javascript performance. Scroll down to find my asm.js implementation or click here.

Algorithm	OpenMP	from 2 to 1 billion		Speed-Up	CPU ticks (approx.)	Notes
Basic Algorithm	no	50847534 primes	13.5 seconds	(1.0x)	40 billion	shortest code (22 lines)
Odd Numbers Only	no	50847534 primes	6.7 seconds	2.0x	20 billion	only 10% more code than basic algorithm
Odd Numbers Only	yes, 8 threads	50847534 primes	5.4 seconds	2.5x	130 billion
Block-Wise NEW	no	50847534 primes	2.4 seconds	5.6x	7 billion	most efficient
Block-Wise NEW	yes, 8 threads	50847534 primes	0.65 seconds	20.7x	16 billion	fastest

Note: Much more complex algorithms easily outperform my code. My primary focus was parallelism in general, not prime numbers !

Live Demo

I ported the short Odd Numbers Only algorithm to Javascript and PHP; the code can be seen at the bottom of this page. The table below shows the expected number of primes for some reference inputs and the performance numbers measured on my computer.

In this benchmark, C++ is approx. 4x faster than the best Javascript engine (Mozilla Firefox) and approx. 100x faster than PHP 5.3. Only the PHP times marked with (*) are measured on my local computer while all other PHP timings were measured on the create.stephan-brumme.com server (which is about 25% bit slower).

All browsers are still 32 bit programs - while my operating system is already 64 bit - and run out of memory when attempting to find all prime numbers from 1 to 1 billion. The more complex blockwise sieve does not suffer from this problem but I was too lazy to port it, too.
Multithreading is currently not supported by neither Javascript nor PHP.

from 2 to ...	primes	C++	Internet Explorer 10	Firefox 18	Opera 12	Chrome 24	PHP 5.3
10	4	<0.001s	<0.001s	<0.001s	<0.001s	<0.001s	<0.001s
100	25	<0.001s	<0.001s	<0.001s	<0.001s	<0.001s	<0.001s
1,000	168	<0.001s	<0.001s	<0.001s	<0.001s	<0.001s	<0.001s
10,000	1,229	<0.001s	<0.001s	<0.001s	<0.001s	<0.001s	0.003s
100,000	9,592	<0.001s	0.001s	0.001s	0.002s	0.001s	0.027s
1,000,000	78,498	0.002s	0.010s	0.007s	0.022s	0.015s	0.393s
10,000,000	664,579	0.024s	0.139s	0.103s	0.248s	0.190s	3.717s
100,000,000	5,761,455	0.555s	about 1.7s	about 1.2s	about 2.7s	about 2.0s	about 29s (*)
1,000,000,000	50,847,534	6.7s	n/a	n/a	n/a	n/a	n/a

For your interest: the browser ranking is contrary to what I have observed with my Javascript global illumination port of smallpt. I expected PHP to be faster, too.

UPDATE February 14, 2013: More than a year after my initial tests, Firefox became the fastest browser due to its vastly improved Javascript engine.

Download

Git users: scroll down to the repository link

Download prime.cpp

Latest release: April 24, 2013, size: 5.8 kBytes, 234 lines

CRC32: fc281a08
MD5: b483e2911cb1ed64939c0cf227648743
SHA1: ab1333e4b873faddab852f113894e06bc7d5b11a
SHA256:ca28959cdb40d3a0f58d08a5f3a7a67c4d9dcd99fc52254e239b0236751eb95d

Stay up-to-date:git clone https://create.stephan-brumme.com/eratosthenes/git

GitHub mirror:https://github.com/stbrumme/eratosthenes

If you encounter any bugs/problems or have ideas for improving future versions, please write me an email: create@stephan-brumme.com

Changelog

License
- April 24, 2013
- zlib-style license, applies to all past and future versions
version 3
- November 9, 2011
- fixed spelling
- Git tag eratosthenes_v3
version 2
- October 6, 2011
- added divisibility tests
- improved 64 bit portability
version 1
- September 17, 2011
- initial release

Basic Algorithm

The Sieve of Eratosthenes can be implemented in just a few lines of C/C++ code:

hide


     // simple serial sieve of Eratosthenes
     int eratosthenes(int lastNumber)
     {
       // initialize
       char* isPrime = new char[lastNumber+1];
       for (int i = 0; i <= lastNumber; i++)
         isPrime[i] = 1;
     
       // find all non-primes
       for (int i = 2; i*i <= lastNumber; i++)
         if (isPrime[i])
           for (int j = i*i; j <= lastNumber; j += i)
             isPrime[j] = 0;
     
       // sieve is complete, count primes
       int found = 0;
       for (int i = 2; i <= lastNumber; i++)
         found += isPrime[i];
     
       delete[] isPrime;
       return found;
     }

Initially, each number is assumed to be prime, i.e. all elements of isPrime are set to 1. Then all multiples of i need to be set to 0. A final scan over the array counts the number of primes.

My first attempt was based on a bool array instead of char, true instead of 1 and false instead of 0. Even though bools were more intuitive, they were a bit slower (≈7%) because the final scan contained a branch which often flushed the CPU pipeline after a mis-prediction.

    
     for (int i = 2; i <= lastNumber; i++)
       if (isPrime[i])
         found++;

Both data types use 1 byte per element. Moreover, true is by default internally represented as 1 and false is internally represented as 0. That means, the binary code generated by the C++ compiler for isPrime is identical for bool and char - except for the more efficient final scan of char.

Time for lastNumber = 1 billion: 13.5 seconds

Against All Odds

The only even prime number is 2, all other primes are odd. This observation saves 50% memory and is nearly twice as fast as the basic algorithm while requiring only minor code modifications.

hide


     // odd-only sieve
     int eratosthenesOdd(int lastNumber)
     {
       int memorySize = (lastNumber-1)/2;
     
       // initialize
       char* isPrime = new char[memorySize+1];
       for (int i = 0; i <= memorySize; i++)
         isPrime[i] = 1;
     
       // find all odd non-primes
       for (int i = 3; i*i <= lastNumber; i += 2)
         if (isPrime[i/2])
           for (int j = i*i; j <= lastNumber; j += 2*i)
             isPrime[j/2] = 0;
     
       // sieve is complete, count primes
       int found = lastNumber >= 2 ? 1 : 0;
       for (int i = 1; i <= memorySize; i++)
         found += isPrime[i];
     
       delete[] isPrime;
       return found;
     }

Time for lastNumber = 1 billion: 6.7 seconds, speed-up 2.0x

Parallelizing The Sieve

Both code pieces shown above run on a single core. When I added OpenMP #pragmas, Visual C++ rejected my for-loop because the termination condition i*i <= lastNumber is too complex (at least for the 2008 compiler's OpenMP implementation). After changing it to i <= lastNumberSqrt the compiler was happy.

Unfortunately, the performance gain was negligible: I went from 7.4 to 7.2 seconds. The task manager told me that OpenMP distributed the work in its thread pool not evenly: some cores ran at 100% but most were idle. What happened ? The execution time of the outer for-loop can be almost zero (if !isPrime[i/2]), quite high (if i is small) or quite low (if i is close to lastNumberSqrt). The default OpenMP scheduling strategy static assumes that each pass takes about the same time. When I explicitly request dynamic scheduling, the sieve finishes after 5.4 seconds. The final scan needs to be declared as reduction(+:found) to properly count the number of primes in the shared variable found.

hide


     // odd-only sieve
     int eratosthenesOdd(int lastNumber, bool useOpenMP)
     {
       // enable/disable OpenMP
       omp_set_num_threads(useOpenMP ? omp_get_num_procs() : 1);
     
       // instead of i*i <= lastNumber we write i <= lastNumberSquareRoot to help OpenMP
       const int lastNumberSqrt = (int)sqrt((double)lastNumber);
     
       int memorySize = (lastNumber-1)/2;
     
       // initialize
       char* isPrime = new char[memorySize+1];
     #pragma omp parallel for
       for (int i = 0; i <= memorySize; i++)
         isPrime[i] = 1;
     
       // find all odd non-primes
     #pragma omp parallel for schedule(dynamic)
       for (int i = 3; i <= lastNumberSqrt; i += 2)
         if (isPrime[i/2])
           for (int j = i*i; j <= lastNumber; j += 2*i)
             isPrime[j/2] = 0;
     
       // sieve is complete, count primes
       int found = lastNumber >= 2 ? 1 : 0;
     #pragma omp parallel for reduction(+:found)
       for (int i = 1; i <= memorySize; i++)
         found += isPrime[i];
     
       delete[] isPrime;
       return found;
     }

Time for lastNumber = 1 billion: 5.4 seconds, speed-up 2.5x

Block-Wise Sieve

Thanks to OpenMP all 4 CPU cores (plus 4 hyperthreading cores) run under full load. But the performance gain isn't impressive at all: the program became only 25% faster despite burning 700% more CPU cycles. The memory system - including the CPU caches - turns out to be the bottleneck.

Instead of processing all numbers at once, I rewrote the odd-only sieve's code to find all prime numbers inside a range from ... to. Of major interest is minJ which is the starting value of the inner for-loop.

UPDATE October 6, 2011: I added divisibility tests which cut computing time almost in half. More than 13 doesn't give any improvement (test outweights its gain).

hide


     // process only odd numbers of a specified block
     int eratosthenesOddSingleBlock(const int from, const int to)
     {
       const int memorySize = (to - from + 1) / 2;
     
       // initialize
       char* isPrime = new char[memorySize];
       for (int i = 0; i < memorySize; i++)
         isPrime[i] = 1;
     
       for (int i = 3; i*i <= to; i+=2)
       {
         // >>> UPDATE October 6, 2011
         // skip multiples of three: 9, 15, 21, 27, ...
         if (i >= 3*3 && i % 3 == 0)
           continue;
         // skip multiples of five
         if (i >= 5*5 && i % 5 == 0)
           continue;
         // skip multiples of seven
         if (i >= 7*7 && i % 7 == 0)
           continue;
         // skip multiples of eleven
         if (i >= 11*11 && i % 11 == 0)
           continue;
         // skip multiples of thirteen
         if (i >= 13*13 && i % 13 == 0)
           continue;
         // <<< UPDATE October 6, 2011
     
         // skip numbers before current slice
         int minJ = ((from+i-1)/i)*i;
         if (minJ < i*i)
           minJ = i*i;
     
         // start value must be odd
         if ((minJ & 1) == 0)
           minJ += i;
     
         // find all odd non-primes
         for (int j = minJ; j <= to; j += 2*i)
         {
           int index = j - from;
           isPrime[index/2] = 0;
         }
       }
     
       // count primes in this block
       int found = 0;
       for (int i = 0; i < memorySize; i++)
         found += isPrime[i];
       // 2 is not odd => include on demand
       if (from <= 2)
         found++;
     
       delete[] isPrime;
       return found;
     }
     
     // process slice-by-slice, odd numbers only
     int eratosthenesBlockwise(int lastNumber, int sliceSize, bool useOpenMP)
     {
       int found = 0;
       // each slices covers ["from" ... "to"], incl. "from" and "to"
       for (int from = 2; from <= lastNumber; from += sliceSize)
       {
         int to = from + sliceSize;
         if (to > lastNumber)
           to = lastNumber;
     
         found += eratosthenesOddSingleBlock(from, to);
       }
     
       return found;
     }

A simple, serial program now needs less memory and is faster than the previous OpenMP-version. In plain English: a single core outperform 4x2 cores !

I got the best timings for sliceSize=128*1024. Then, processing 1 billion numbers with the block-wise approach requires only 128KB RAM instead of 1GByte.

Time for lastNumber = 1 billion: 2.4 seconds, speed-up 5.6x

Parallelizing Block-Wise Sieve

The simplest task was to add OpenMP to the block-wise sieve: only add a pragma to eratosthenesBlockwise's for-loop. To reduce to lines of code in my benchmark, I included an OpenMP on/off-switch.

hide


     // process slice-by-slice, odd numbers only
     int eratosthenesBlockwise(int lastNumber, int sliceSize, bool useOpenMP)
     {
       // enable/disable OpenMP
       omp_set_num_threads(useOpenMP ? omp_get_num_procs() : 1);
     
       int found = 0;
       // each slices covers ["from" ... "to"], incl. "from" and "to"
     #pragma omp parallel for reduction(+:found)
       for (int from = 2; from <= lastNumber; from += sliceSize)
       {
         int to = from + sliceSize;
         if (to > lastNumber)
           to = lastNumber;
     
         found += eratosthenesOddSingleBlock(from, to);
       }
     
       return found;
     }

Time for lastNumber = 1 billion: 0.65 seconds, speed-up 20.7x

Live Demo's Javascript Code

No significant changes from the C++ Odd-Only code:

hide


     var isPrime = new Array();
     function eratosthenesOdd(lastNumber)
     {
       var memorySize = (lastNumber-1) >> 1;
     
       // initialize
       for (var i = 0; i <= memorySize; i++)
         isPrime[i] = 1;
     
       // find all odd non-primes
       for (var i = 3; i*i <= lastNumber; i += 2)
         if (isPrime[i >> 1] == 1)
           for (var j = i*i; j <= lastNumber; j += 2*i)
             isPrime[j >> 1] = 0;
     
       // sieve is complete, count primes
       var found = (lastNumber >= 2) ? 1 : 0;
       for (var i = 1; i <= memorySize; i++)
         found += isPrime[i];
     
       return found;
     }

Live Demo's Asm.JS Code

Pretty much the same code as above but with all the weird looking Asm.JS syntax:

hide


     function AsmJs(stdlib, foreign, heap)
     {
       "use asm";
     
       function eratosthenesOdd(lastNumber)
       {
         lastNumber = lastNumber|0;
         var memorySize = ((lastNumber - 1) >> 1)|0;
     
         // initialize
         var isPrime = new stdlib.Int8Array(heap);
         for (var i = 0|0; i <= memorySize|0; i++)
           isPrime[i|0] = 1|0;
     
         // find all odd non-primes
         for (var i = 3|0; i*i <= lastNumber|0; i += 2|0)
           if (isPrime[i >> 1|0] == 1|0)
             for (var j = i*i; j <= lastNumber|0; j += 2*i)
               isPrime[j >> 1|0] = 0|0;
     
         // include 2 (which is even, not odd)
         var found = (lastNumber >= 2) ? (1|0) : (0|0);
         // sieve is complete, count primes
         for (var i = 1|0; i <= memorySize|0; i++)
           found += isPrime[i|0]|0;
     
         return found;
       }
     
       // export
       return { eratosthenesOdd : eratosthenesOdd };
     }
     
     // create AsmJs wrapper and call worker method
     function asmjsEratosthenesOdd(lastNumber)
     {
       var module = AsmJs(window, null, new Int8Array(lastNumber / 2 + 1));
       return module.eratosthenesOdd(lastNumber);
     }

Live Demo's PHP Code

My initial port to PHP was based on array. Everything worked fine but I soon ran out of memory because PHP's hash table (the technology behind array) requires too many bytes per item.

The current solution "abuses" a very long string, consisting of '0' and '1' characters. The build-in functions str_repeat and substr_count simplify and speed-up string creation / evaluation. I chose to limit lastNumber to 25,000,000 when the script is executed on my public web server. At home (localhost), no limits are imposed.

hide


     <?
       // from 2 to number passed via GET, e.g. eratosthenes.ajax?12345
       $lastNumber = $_SERVER["QUERY_STRING"];
       if ($lastNumber < 1)
         die("<b>(invalid)</b>");
     
       // limit $lastNumber depending on server location (localhost or create.stephan-brumme.com)
       if ($_SERVER["HTTP_HOST"] == "localhost")
         set_time_limit(0); // no limits !
       else
         if ($lastNumber > 25*1000*1000) // 25 million
           die("<i>(limited to <b>25000000</b> to reduce server load)</i>");
     
       $startTime = microtime(true);
     
       // initialize
       $memorySize = $lastNumber >> 1;
       $isPrime = str_repeat(1, $memorySize);
     
       // find all odd non-primes
       for ($i = 3; $i*$i <= $lastNumber; $i += 2)
         if ($isPrime[$i >> 1] == 1)
           for ($j = $i*$i; $j <= $lastNumber; $j += 2*$i)
             $isPrime[$j >> 1] = 0;
     
       // sieve is complete, count primes
       $found = substr_count($isPrime, 1);
     
       $duration = number_format(microtime(true) - $startTime, 3);
       echo "<b>$found primes</b> found ($duration seconds)";
     ?>


June 30, 2021: Length-Limited Prefix Codes April 13, 2020: smalLZ4 - optimal LZ4 compression June 17, 2019: toojpeg - a JPEG encoder in a single C++ file ... and 30 more !	RSS	www.stephan-brumme.com bits.stephan-brumme.com photos.stephan-brumme.com minime.stephan-brumme.com
© 2011-2024 Stephan Brumme products and company names mentioned may be trademarks of their respective owners. see my disclaimer. dude, this is the end ... of this page.