Multi-Core Sorting
posted by Stephan Brumme
The old-fashioned way
Most traditional sorting algorithms (I already blogged about their STL-compatible versions) work in a serial manner and are perfectly suited for single-core machines.My current computer has a Intel Core i7 860 CPU which comes with 4 cores and hyperthreading, giving me a total of 8 logical cores. That means, CPU utilization with traditional sorting algorithms (i.e. only able to run on one 1 out of 8 cores) cannot exceed 12.5%. That sucks.
OpenMP
Luckily, the GCC team made parallelized versions of pretty all important STL algorithms available. They still call their implementation experimental but I haven't had any problems so far.Everything is based on OpenMP which has been supported since GCC 4.2. Visual C++ and Intel C++ support OpenMP as well but I haven't tested GCC's library with these commercial compilers yet.
How to use it
There are two ways to switch to GCC's parallel mode:- link against
-D_GLIBCXX_PARALLEL
or - use the
__gnu_parallel
namespace
-fopenmp
and include parallel/algorithm
.
hide
// g++ -O3 -fopenmp -march=native parallelSort.cpp -o parallelSort
#include <algorithm>
std::sort(data.begin(), data.end());
#include <parallel/algorithm>
__gnu_parallel::sort(data.begin(), data.end());
How NOT to use it
If you are sorting only a small set of elements, the overhead for running the parallel code might outweight the time savings.In my experience, if
std::sort
takes less than 5ms,
then don't expect any significant gains from __gnu_parallel::sort
.And if you have lots of threads/processes already pushing your CPU to its limits, then you might prefer to stick to your old, trusted code.
Remember, everything is still labeled as experimental by GCC (early September 2011).
Download
Git users: scroll down to the repository link Latest release: April 24, 2013, size: 3286 bytes, 120 linesCRC32:
ec87210d
MD5:
80f1156aaf4912f7d88b9427192579f2
SHA1:
96042b375d74043f7bd0488f999caf590db7c1c5
SHA256:
425580b95b23d94caa694d2e9b6afc2a28cab6637445ee4bd60cc443d6379c2a
Stay up-to-date:git clone https://create.stephan-brumme.com/parallelSort/git
If you encounter any bugs/problems or have ideas for improving future versions, please write me an email: create@stephan-brumme.com
Changelog
- License
- April 24, 2013
- zlib-style license, applies to all past and future versions
- version 3
- November 9, 2011
- fixed spelling
- version 2
- September 22, 2011
- code formatting
- version 1
- September 11, 2011
- initial release
Benchmark
For reasons unknown, OpenMP was removed from my web server's GCC. It's shared hosting so I can't do anything about it.Therefore, instead of a live demo I just show you the numbers of my desktop computer (Intel Core i7 860, GCC 4.6.0):
100 million integers | sorted | inverted | random |
---|---|---|---|
std::sort | 2.002s | 13.621s | 7.088s |
__gnu_parallel::sort | 0.721s | 3.286s | 1.567s |
speed-up | 2.78x | 4.15x | 4.52x |
#define USE_WINDOWS_TIMER
to get more precise timings.std::sort
uses core 1 and 7
with all other cores idling.
Once __gnu_parallel::sort
starts, all cores are under full load.There is a noteworthy side-effect: my CPU's default speed is 2.8GHz but it spends 99% of the time in power-saving mode at 1.2GHz. Turbo-Boost overclocks the CPU to 3.3GHz with
std::sort
but falls back to only 2.9GHz with
__gnu_parallel::sort
.
I'm still impressed with Turbo Boost because it's about 28 Celsius (82 Fahrenheit) outside and, unfortunately, inside my appartment.