CLein
posted by Stephan Brumme
What is CLein ?
CLein is a tiny, portable freeware program which displays all relevant information about the OpenCL drivers installed on your system.It contains a small benchmark to compare the computing power of your host CPU and your OpenCL device(s), too.
If you have no idea what OpenCL is then read this article on Wikipedia.
See screenshots below to get a first impression of CLein.
The name CLein is based on the German word klein = small, little (now you know what that Calvin underwear really means ...)
Requirements
- Windows XP, Vista or 7 (32 and 64 bits)
- OpenCL driver (e.g. NVidia ForceWare 195.39+ or AMD/ATI Catalyst 9.12+, the Intel OpenCL SDK, ...)
- No installation needed: just download clein.exe and start it
- Nothing else: no need for .NET, runtime DLLs or anything else
Download
Git users: scroll down to the repository link Latest release: September 7, 2011, size: 108.5 kBytesCRC32:
e34bf29b
MD5:
8d5500a852d203d5d1326ff0afa72f9e
SHA1:
1185354ebc249d8eed1597cc1300fce526bc3cf8
SHA256:
ecb9e55bd907c36fcbe193d18224684f4b8464aaa4605ccfe2960bea97236bd1
Stay up-to-date:git clone https://create.stephan-brumme.com/clein/git
If you encounter any bugs/problems or have ideas for improving future versions, please write me an email: create@stephan-brumme.com
Changelog
- version 0.2
- latest and greatest
- September 7, 2011
- fixed some GUI bugs
- increased
ParallelLevel
to4
(≈10% faster)
- version 0.1
- September 6, 2011
- initial release
Screenshots
Technical Background
CLein is written in C++ - and obviously the Benchmark kernel is written in OpenCL ;-)At work I almost exclusively use MFC, Qt and (rarely) .NET but the resulting executables are either excessively large and/or have tons of DLL dependencies. Therefore I decided to program CLein the old-fashioned way: just pure Win32 API. I statically link to the C/C++ runtime including STL's
std::string
and
std::vector
.Even though CLein is freeware, its source code is currently not publicly available except for the simple OpenCL kernel of the Benchmark tab (see below or click Show Kernel). At the moment, OpenCL 1.1 is the most recent specification released by The Khronos Group. The Capabilities tab includes all features available in OpenCL 1.0 and 1.1. The kernel is compliant to OpenCL 1.0.
Several N-Queens solvers can be found online, including at least one based on OpenCL. They are fine-tuned for high performance at the cost of a high kernel complexity and/or extensive precomputations. CLein's kernel is short'n'simple and pretty much identical in C++ and OpenCL. The algorithm is based on backtracking and does not exploit board symmetries. Each thread's only input are the board size and its the thread ID (via
get_global_id / get_global_size
).
Each thread stores the number of solutions in data[thread]
which is summed by the host program in C++.To make the kernel even simpler you can revert to a (slower) single-threaded solution:
ParallelLevel
should be set to 0
and lines 30-59 can be removed.
hide
OpenCL Kernel
// maximum board size
#define MaxQueens 20
// maximum degree of parallelism: ParallelLevel^queens, e.g. 4^8
#define ParallelLevel 4
kernel void myKernel(global int* write_only data, int queens)
{
const int thread = get_global_id(0);
data[thread] = 0;
// simple solutions if below ParallelLevel
if (queens == 1)
data[0] = 1;
if (queens == 4)
data[0] = 4;
if (queens <= 4 || queens > MaxQueens)
return;
// all bits set
const int fullMask = (1 << queens) - 1;
local int leftStack [MaxQueens+1];
local int rightStack[MaxQueens+1];
local int downStack [MaxQueens+1];
local int maskStack [MaxQueens+1];
leftStack [0] = 0;
rightStack[0] = 0;
downStack [0] = 0;
maskStack [0] = 0;
int foundSolutions = 0;
// spread work amongst cores
int maxI = 1;
for (int i = 0; i < ParallelLevel; i++)
maxI *= queens;
for (int i = thread; i < maxI; i += get_global_size(0))
{
int id = i;
int level;
bool invalid = false;
for (level = 0; level < ParallelLevel; level++)
{
const int bitPos = 1 << (id % queens);
id /= queens;
// collides with another queens ?
if ((bitPos & maskStack[level]) != 0)
{
invalid = true;
break;
}
// place queen
leftStack [level+1] = (leftStack [level] | bitPos) << 1;
rightStack[level+1] = (rightStack[level] | bitPos) >> 1;
downStack [level+1] = downStack [level] | bitPos;
maskStack [level+1] = leftStack[level+1] | rightStack[level+1] | downStack[level+1];
}
if (invalid)
continue;
// try to fill board with backtracking
while (level >= ParallelLevel)
{
// lowest bit not set (see https://bits.stephan-brumme.com/lowestBitNotSet.html)
const int freeSpot = (maskStack[level] + 1) & ~maskStack[level];
// invalid ?
if ((freeSpot & fullMask) == 0)
{
level--;
continue;
}
// board completed ?
if (level == queens-1)
{
foundSolutions++;
level--;
continue;
}
maskStack[level] |= freeSpot;
// place queen and dig one level deeper
level++;
leftStack [level] = (leftStack [level-1] | freeSpot) << 1;
downStack [level] = downStack [level-1] | freeSpot;
rightStack[level] = (rightStack[level-1] | freeSpot) >> 1;
maskStack [level] = leftStack[level] | downStack[level] | rightStack[level];
}
}
data[thread] = foundSolutions;
}
Benchmark
These are the timings of my computers (Execution only):Code | Device | Compiler | MHz | Cores | Threads | 8x8 | 9x9 | 10x10 | 11x11 | 12x12 | 13x13 | 14x14 | 15x15 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
C++ | CPU | Intel Core i7 860 | Visual C++ | 2800MHz | 4x2 | 1 | 1ms | 1ms | 2ms | 4ms | 18ms | 74ms | 362ms | 2192ms |
OpenCL | CPU | Intel Core i7 860 | Intel OpenCL SDK 1.1 | 2800MHz | 4x2 | 512 | 1ms | 1ms | 1ms | 2ms | 11ms | 56ms | 327ms | 2036ms |
OpenCL | CPU | Intel Core i7 860 | AMD APP SDK 2 | 2800MHz | 4x2 | 1024 | 1ms | 1ms | 1ms | 3ms | 14ms | 71ms | 415ms | 2584ms |
OpenCL | GPGPU | NVidia GT240 | Driver 280.26 | 1462MHz | 12x8 | 512 | 1ms | 2ms | 3ms | 7ms | 33ms | 64ms | 292ms | 1562ms |
Solutions: | 92 | 352 | 724 | 2680 | 14200 | 73712 | 365596 | 2279184 | ||||||
C++ | CPU | Intel Pentium D 820 | Visual C++ | 2666MHz | 2 | 1 | 1ms | 1ms | 2ms | 5ms | 20ms | 106ms | 613ms | 3962ms |
OpenCL | GPGPU | NVidia 8600GT | Driver 280.26 | 1188MHz | 4x8 | 512 | 37ms | 177ms | 1021ms | 5925ms |
I increased
ParallelLevel
from
3
to 4
in version 0.2
which give a performance increase of about 10%.As already stated in the paragraph above, there are many solvers that run much faster - but they are much more complex, too.