Multi-Core Sorting

posted by Stephan Brumme

The old-fashioned way

Most traditional sorting algorithms (I already blogged about their STL-compatible versions) work in a serial manner and are perfectly suited for single-core machines.
My current computer has a Intel Core i7 860 CPU which comes with 4 cores and hyperthreading, giving me a total of 8 logical cores. That means, CPU utilization with traditional sorting algorithms (i.e. only able to run on one 1 out of 8 cores) cannot exceed 12.5%. That sucks.

OpenMP

Luckily, the GCC team made parallelized versions of pretty all important STL algorithms available. They still call their implementation experimental but I haven't had any problems so far.
Everything is based on OpenMP which has been supported since GCC 4.2. Visual C++ and Intel C++ support OpenMP as well but I haven't tested GCC's library with these commercial compilers yet.

How to use it

There are two ways to switch to GCC's parallel mode:
  1. link against -D_GLIBCXX_PARALLEL or
  2. use the __gnu_parallel namespace
In both cases you have to enable compiler switch -fopenmp and include parallel/algorithm.
hide // g++ -O3 -fopenmp -march=native parallelSort.cpp -o parallelSort #include <algorithm> std::sort(data.begin(), data.end()); #include <parallel/algorithm> __gnu_parallel::sort(data.begin(), data.end());

How NOT to use it

If you are sorting only a small set of elements, the overhead for running the parallel code might outweight the time savings.
In my experience, if std::sort takes less than 5ms, then don't expect any significant gains from __gnu_parallel::sort.

And if you have lots of threads/processes already pushing your CPU to its limits, then you might prefer to stick to your old, trusted code.
Remember, everything is still labeled as experimental by GCC (early September 2011). Git users: scroll down to the repository link
Download  parallelSort.cpp
Latest release: April 24, 2013, size: 3286 bytes, 120 lines

CRC32: ec87210d
MD5: 80f1156aaf4912f7d88b9427192579f2
SHA1: 96042b375d74043f7bd0488f999caf590db7c1c5
SHA256:425580b95b23d94caa694d2e9b6afc2a28cab6637445ee4bd60cc443d6379c2a

Stay up-to-date:git clone http://create.stephan-brumme.com/parallelSort/.git

If you encounter any bugs/problems or have ideas for improving future versions, please write me an email: create@stephan-brumme.com

Changelog

Benchmark

For reasons unknown, OpenMP was removed from my web server's GCC. It's shared hosting so I can't do anything about it.
Therefore, instead of a live demo I just show you the numbers of my desktop computer (Intel Core i7 860, GCC 4.6.0):
100 million integers sorted inverted random
std::sort 2.002s 13.621s 7.088s
__gnu_parallel::sort 0.721s 3.286s 1.567s
speed-up 2.78x 4.15x 4.52x
I strongly recommend to download the C++ source code and run it on your own computer. If you use GCC on Windows (I prefer MinGW), uncomment #define USE_WINDOWS_TIMER to get more precise timings.
A short look at the Windows 7 Task Manager shows that std::sort uses core 1 and 7 with all other cores idling. Once __gnu_parallel::sort starts, all cores are under full load.

There is a noteworthy side-effect: my CPU's default speed is 2.8GHz but it spends 99% of the time in power-saving mode at 1.2GHz. Turbo-Boost overclocks the CPU to 3.3GHz with std::sort but falls back to only 2.9GHz with __gnu_parallel::sort. I'm still impressed with Turbo Boost because it's about 28 Celsius (82 Fahrenheit) outside and, unfortunately, inside my appartment.
homepage