posted by Stephan Brumme

What is CLein ?

CLein is a tiny, portable freeware program which displays all relevant information about the OpenCL drivers installed on your system.
It contains a small benchmark to compare the computing power of your host CPU and your OpenCL device(s), too.

If you have no idea what OpenCL is then read this article on Wikipedia.
See screenshots below to get a first impression of CLein.

The name CLein is based on the German word klein = small, little (now you know what that Calvin underwear really means ...)


Git users: scroll down to the repository link
Download  clein.exe
Latest release: September 7, 2011, size: 108.5 kBytes

CRC32: e34bf29b
MD5: 8d5500a852d203d5d1326ff0afa72f9e
SHA1: 1185354ebc249d8eed1597cc1300fce526bc3cf8

Stay up-to-date:git clone

If you encounter any bugs/problems or have ideas for improving future versions, please write me an email:



Technical Background

CLein is written in C++ - and obviously the Benchmark kernel is written in OpenCL ;-)

At work I almost exclusively use MFC, Qt and (rarely) .NET but the resulting executables are either excessively large and/or have tons of DLL dependencies. Therefore I decided to program CLein the old-fashioned way: just pure Win32 API. I statically link to the C/C++ runtime including STL's std::string and std::vector.

Even though CLein is freeware, its source code is currently not publicly available except for the simple OpenCL kernel of the Benchmark tab (see below or click Show Kernel). At the moment, OpenCL 1.1 is the most recent specification released by The Khronos Group. The Capabilities tab includes all features available in OpenCL 1.0 and 1.1. The kernel is compliant to OpenCL 1.0.

Several N-Queens solvers can be found online, including at least one based on OpenCL. They are fine-tuned for high performance at the cost of a high kernel complexity and/or extensive precomputations. CLein's kernel is short'n'simple and pretty much identical in C++ and OpenCL. The algorithm is based on backtracking and does not exploit board symmetries. Each thread's only input are the board size and its the thread ID (via get_global_id / get_global_size). Each thread stores the number of solutions in data[thread] which is summed by the host program in C++.

To make the kernel even simpler you can revert to a (slower) single-threaded solution: ParallelLevel should be set to 0 and lines 30-59 can be removed.
hide OpenCL Kernel // maximum board size #define MaxQueens 20 // maximum degree of parallelism: ParallelLevel^queens, e.g. 4^8 #define ParallelLevel 4 kernel void myKernel(global int* write_only data, int queens) { const int thread = get_global_id(0); data[thread] = 0; // simple solutions if below ParallelLevel if (queens == 1) data[0] = 1; if (queens == 4) data[0] = 4; if (queens <= 4 || queens > MaxQueens) return; // all bits set const int fullMask = (1 << queens) - 1; local int leftStack [MaxQueens+1]; local int rightStack[MaxQueens+1]; local int downStack [MaxQueens+1]; local int maskStack [MaxQueens+1]; leftStack [0] = 0; rightStack[0] = 0; downStack [0] = 0; maskStack [0] = 0; int foundSolutions = 0; // spread work amongst cores int maxI = 1; for (int i = 0; i < ParallelLevel; i++) maxI *= queens; for (int i = thread; i < maxI; i += get_global_size(0)) { int id = i; int level; bool invalid = false; for (level = 0; level < ParallelLevel; level++) { const int bitPos = 1 << (id % queens); id /= queens; // collides with another queens ? if ((bitPos & maskStack[level]) != 0) { invalid = true; break; } // place queen leftStack [level+1] = (leftStack [level] | bitPos) << 1; rightStack[level+1] = (rightStack[level] | bitPos) >> 1; downStack [level+1] = downStack [level] | bitPos; maskStack [level+1] = leftStack[level+1] | rightStack[level+1] | downStack[level+1]; } if (invalid) continue; // try to fill board with backtracking while (level >= ParallelLevel) { // lowest bit not set (see const int freeSpot = (maskStack[level] + 1) & ~maskStack[level]; // invalid ? if ((freeSpot & fullMask) == 0) { level--; continue; } // board completed ? if (level == queens-1) { foundSolutions++; level--; continue; } maskStack[level] |= freeSpot; // place queen and dig one level deeper level++; leftStack [level] = (leftStack [level-1] | freeSpot) << 1; downStack [level] = downStack [level-1] | freeSpot; rightStack[level] = (rightStack[level-1] | freeSpot) >> 1; maskStack [level] = leftStack[level] | downStack[level] | rightStack[level]; } } data[thread] = foundSolutions; }


These are the timings of my computers (Execution only):
Code Device Compiler MHz Cores Threads 8x8 9x9 10x10 11x11 12x12 13x13 14x14 15x15
C++ CPU Intel Core i7 860 Visual C++ 2800MHz 4x2 1 1ms 1ms 2ms 4ms 18ms 74ms 362ms 2192ms
OpenCL CPU Intel Core i7 860 Intel OpenCL SDK 1.1 2800MHz 4x2 512 1ms 1ms 1ms 2ms 11ms 56ms 327ms 2036ms
OpenCL CPU Intel Core i7 860 AMD APP SDK 2 2800MHz 4x2 1024 1ms 1ms 1ms 3ms 14ms 71ms 415ms 2584ms
OpenCL GPGPU NVidia GT240 Driver 280.26 1462MHz 12x8 512 1ms 2ms 3ms 7ms 33ms 64ms 292ms 1562ms
Solutions: 92 352 724 2680 14200 73712 365596 2279184
C++ CPU Intel Pentium D 820 Visual C++ 2666MHz 2 1 1ms 1ms 2ms 5ms 20ms 106ms 613ms 3962ms
OpenCL GPGPU NVidia 8600GT Driver 280.26 1188MHz 4x8 512 37ms 177ms 1021ms 5925ms
Note: The timings for my Nvidia GT240 change significantly from run to run. Displayed are the best timings measured after 10 runs.

I increased ParallelLevel from 3 to 4 in version 0.2 which give a performance increase of about 10%.

As already stated in the paragraph above, there are many solvers that run much faster - but they are much more complex, too.