Practical String Searching

posted by Stephan Brumme

Motivation

There is a huge number of scientific papers on fast, faster and super-fast string searching algorithms.
They usually prove theoretical performance in O-Notation and most papers cover memory consumption as well.

However, theoretical performance isn't always the same as practical performance. That's why I always want to measure real-world throughput: this article presents hopefully understandable C implementations of the most common generic string search algorithms.

In addition I also wrote a simple tool called mygrep that prints all lines of a file where a search phrase is found.
It doesn't come with all the bells and whistles of the Unix tool grep but achieves similar or sometimes even better speed.
A direct comparison can be done online, just scroll down to the Live Demo.

Playing around with various search phrases you will easily see the plain and simple truth: for most situations the naive brute-force algorithm is a great choice.
A slightly modified version based on memchr and memcmp is often several times faster than the second-fastest contender - just because this simple algorithm can make efficient use of modern CPUs' optimized string opcodes and fits well into the caches.
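Such a memchr/memcmp-based brute-force search can be sketched as follows (a simplified, hypothetical searchSimple helper for illustration - not mygrep's exact code):

```cpp
#include <cstring>
#include <cstddef>

// hypothetical sketch of a memchr/memcmp-accelerated brute-force search;
// returns a pointer to the first match, or NULL if the needle isn't found
const char* searchSimple(const char* haystack, size_t haystackLength,
                         const char* needle,   size_t needleLength)
{
  if (needleLength == 0 || needleLength > haystackLength)
    return NULL;

  const char* lastPossible = haystack + haystackLength - needleLength;
  while (haystack <= lastPossible)
  {
    // memchr skips ahead to the next occurrence of the needle's first byte
    haystack = static_cast<const char*>(
        memchr(haystack, needle[0], lastPossible - haystack + 1));
    if (haystack == NULL)
      return NULL;

    // memcmp verifies the full match
    if (memcmp(haystack, needle, needleLength) == 0)
      return haystack;

    haystack++;
  }
  return NULL;
}
```

Both memchr and memcmp are usually heavily optimized by the C library, which is why this simple loop performs so well in practice.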

These generic routines might slow down considerably in worst-case scenarios, like DNA analysis (small alphabet and many repeated similar parts).
And yes, fgrep is damn fast.

Live Demo

This demo scans through the first 100 MByte of Wikipedia (well, an older snapshot dating back to March 3, 2006) on the server.
Before search starts, the file is mapped into the Linux file cache in order to avoid disk latencies (cat enwik8 > /dev/null).

Notes: execution times can vary significantly - depending on overall server load.
Special characters such as " or \ are disabled to prevent potential security breaches into my server.
The search algorithms actually can search for these special characters (as plain ASCII symbols).
Don't confuse it with meta characters for regular expressions; those are not supported.
Search is case-sensitive.


(read more ...)

The Mersenne Twister Pseudo Random Number Generator

posted by Stephan Brumme

Introduction

The Mersenne Twister is often regarded as the fastest pseudo-random number generator which passes almost all statistical tests.

The original C code isn't exactly beautiful, therefore I decided to write my own C++ class.
And just for the fun of it, I converted the code to Javascript and added two live demos, too (scroll down).

Live Demo

This demo will give you the first 10 random numbers generated by the Mersenne Twister.
My C++ implementation returns exactly the same values as the original code written by Makoto Matsumoto and Takuji Nishimura.

These numbers will be computed on the webserver whereas the Javascript code - which generates exactly the same numbers, too - obviously runs in your browser.
Modifying the seed value will change the sequence of random numbers. seed must be a 32 bit integer. If you leave it empty, 5489 will be used instead.

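Since C++11 the very same MT19937 algorithm ships with the standard library as std::mt19937, which is a convenient way to cross-check any custom implementation - its default seed is the same 5489 mentioned above:

```cpp
#include <random>
#include <cstdint>

// std::mt19937 implements the same MT19937 algorithm;
// a default-constructed generator uses seed 5489
uint32_t firstMersenneTwisterOutput(uint32_t seed = 5489)
{
  std::mt19937 mt19937(seed);
  return mt19937(); // first number of the sequence
}
```

For seed 5489 the first output is 3499211612 - the same value the original reference code produces.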

(read more ...)

Parallel Sieve of Eratosthenes [updated]

posted by Stephan Brumme, updated

Processing 1 Billion Numbers Per Second

There are 50,847,534 prime numbers between 2 and 1,000,000,000. For reasons unknown, I wanted to know whether a C++ implementation can find all these numbers on my desktop computer (Intel Core i7 860, quad-core, hyperthreading, 2.8GHz) in less than 1 second with a simple algorithm such as the Sieve of Eratosthenes.

By only slightly modifying the basic algorithm and unleashing the raw power of OpenMP, I achieved a 12x speed-up (20x after the update below).
In the end, my improved sieve finished after 1.1 seconds - mission almost accomplished.

UPDATE October 6, 2011: After adding a simple test for divisibility by 3 (and 5, 7, 9, 11, 13), the NEW block-wise sieve gets much faster and swiftly breaks through the 1-second-barrier.

UPDATE July 7, 2014: Many browsers support asm.js to boost Javascript performance. Scroll down to find my asm.js implementation or click here.
Algorithm        | OpenMP         | from 2 to 1 billion            | Speed-Up | CPU ticks (approx.) | Notes
Basic Algorithm  | no             | 50847534 primes, 13.5 seconds  | 1.0x     | 40 billion          | shortest code (22 lines)
Odd Numbers Only | no             | 50847534 primes, 6.7 seconds   | 2.0x     | 20 billion          | only 10% more code than basic algorithm
Odd Numbers Only | yes, 8 threads | 50847534 primes, 5.4 seconds   | 2.5x     | 130 billion         |
Block-Wise NEW   | no             | 50847534 primes, 2.4 seconds   | 5.6x     | 7 billion           | most efficient
Block-Wise NEW   | yes, 8 threads | 50847534 primes, 0.65 seconds  | 20.7x    | 16 billion          | fastest
Note: Much more complex algorithms easily outperform my code. My primary focus was parallelism in general, not prime numbers !
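For illustration, the Odd Numbers Only idea can be sketched like this (a simplified single-threaded version, not the exact benchmark code; the OpenMP variants distribute this work across threads):

```cpp
#include <vector>
#include <cstdint>

// count all primes from 2 to 'last', sieving odd numbers only:
// index i of the bit vector represents the odd number 2*i + 1
uint64_t countPrimes(uint64_t last)
{
  if (last < 2)
    return 0;

  uint64_t count = 1; // the only even prime: 2
  std::vector<bool> composite((last + 1) / 2, false);

  // sieve: cross out odd multiples of each odd prime
  for (uint64_t i = 3; i * i <= last; i += 2)
    if (!composite[i / 2])
      for (uint64_t j = i * i; j <= last; j += 2 * i)
        composite[j / 2] = true;

  // count the surviving odd numbers
  for (uint64_t i = 3; i <= last; i += 2)
    if (!composite[i / 2])
      count++;

  return count;
}
```

Skipping even numbers halves both memory and work, which matches the roughly 2x speed-up in the table above.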

Live Demo

I ported the short Odd Numbers Only algorithm to Javascript and PHP; the code can be seen at the bottom of this page. The table below shows the expected number of primes for some reference inputs and the performance numbers measured on my computer.


In this benchmark, C++ is approx. 4x faster than the best Javascript engine (Mozilla Firefox) and approx. 100x faster than PHP 5.3. Only the PHP times marked with (*) are measured on my local computer while all other PHP timings were measured on the create.stephan-brumme.com server (which is about 25% slower).

All browsers are still 32 bit programs - while my operating system is already 64 bit - and run out of memory when attempting to find all prime numbers from 1 to 1 billion. The more complex blockwise sieve does not suffer from this problem but I was too lazy to port it, too.
Multithreading is currently supported by neither Javascript nor PHP.
from 2 to ...  | primes     | C++     | Internet Explorer 10 | Firefox 18 | Opera 12   | Chrome 24  | PHP 5.3
10             | 4          | <0.001s | <0.001s              | <0.001s    | <0.001s    | <0.001s    | <0.001s
100            | 25         | <0.001s | <0.001s              | <0.001s    | <0.001s    | <0.001s    | <0.001s
1,000          | 168        | <0.001s | <0.001s              | <0.001s    | <0.001s    | <0.001s    | <0.001s
10,000         | 1,229      | <0.001s | <0.001s              | <0.001s    | <0.001s    | <0.001s    | 0.003s
100,000        | 9,592      | <0.001s | 0.001s               | 0.001s     | 0.002s     | 0.001s     | 0.027s
1,000,000      | 78,498     | 0.002s  | 0.010s               | 0.007s     | 0.022s     | 0.015s     | 0.393s
10,000,000     | 664,579    | 0.024s  | 0.139s               | 0.103s     | 0.248s     | 0.190s     | 3.717s
100,000,000    | 5,761,455  | 0.555s  | about 1.7s           | about 1.2s | about 2.7s | about 2.0s | about 29s (*)
1,000,000,000  | 50,847,534 | 6.7s    | n/a                  | n/a        | n/a        | n/a        | n/a
For your interest: the browser ranking is contrary to what I have observed with my Javascript global illumination port of smallpt. I expected PHP to be faster, too.

UPDATE February 14, 2013: More than a year after my initial tests, Firefox became the fastest browser due to its vastly improved Javascript engine.
(read more ...)

Windows and Linux Code Timing

posted by Stephan Brumme

Whenever I have to do some performance testing, I try the simplest way: write a short program and run it within my Linux bash shell which comes with a built-in time command:

$ time ./testcode
real    0m0.788s
user    0m0.696s
sys     0m0.080s

Obviously there can (and will) be some overhead caused by the operating system, so the numbers are typically a bit off.
The timing resolution isn't that great, either, and often lies above 10ms.

That's why I wrote a simple C routine to measure performance inside the code. My goal was to have numbers that are accurate down to about 1 microsecond (10^-6 seconds).
I didn't opt for extremely high timing precision (count CPU ticks) because I rarely need that degree of precision and really want to avoid the tricky details of CPU frequency scaling, cache stalls etc.

Another feature is portability; simply recompile the code and run it on Windows and Linux. The code works fine with GCC as well as Visual C++.

All my timing code contains is a single function seconds() returning a double. You can invoke it in your programs like this:
double start = seconds();
// ... your code that should be measured ...
double duration = seconds() - start;

(read more ...)

Portable C++ Hashing Library

posted by Stephan Brumme, updated

Introduction

Let's start with the core features of my C++ hashing library:

Usage

Since there are no external dependencies, not even dependencies within the library itself, all you need is to include the header of the hashing algorithm of your choice and add the identically named .cpp to your project. That means if you want to add SHA256 hashing to your C++ program, you only have to include sha256.h and sha256.cpp in your project.

If you prefer SHA256, then your code will look like this:
// SHA2 test program
#include "sha256.h"
#include <iostream> // for std::cout only, not needed for hashing library

int main(int, char**)
{
  // create a new hashing object
  SHA256 sha256;

  // hashing an std::string
  std::cout << sha256("Hello World") << std::endl;
  // => a591a6d40bf420404a011733cfb7b190d62c65bf0bcda32b57b277d9ad9f146e

  // hashing a buffer of bytes
  const char* buffer = "How are you";
  std::cout << sha256(buffer, 11) << std::endl;
  // => 9c7d5b046878838da72e40ceb3179580958df544b240869b80d0275cc07209cc

  // or in a streaming fashion (re-use "How are you")
  SHA256 sha256stream;
  const char* url = "create.stephan-brumme.com"; // 25 bytes
  int step = 5;
  for (int i = 0; i < 25; i += step)
    sha256stream.add(url + i, step); // add five bytes at a time
  std::cout << sha256stream.getHash() << std::endl;
  // => 82aa771f1183c52f973c798c9243a1c73833ea40961c73e55e12430ec77b69f6

  return 0;
}

(read more ...)

Sorting STL Containers

posted by Stephan Brumme, updated

std::sort

Yes, there is std::sort. Yes, it's pretty much the most efficient way to sort generic STL containers. No, it's not perfect.

Improving std::sort

What if the data is already almost sorted ? std::sort doesn't benefit a lot from processing already sorted data. However, it's quite common to have (partially) sorted data.

std::sort requires RandomAccessIterator which are sometimes not available for custom containers.

What about std::list ? The syntax is completely different:
std::list<int> mylist;
// ... fill mylist with data ...

// compiler error:
std::sort(mylist.begin(), mylist.end());
// must use this instead:
mylist.sort();
That's why I wrote some sorting routines of my own to overcome all those limitations. The sorting algorithms below are ordered more or less by overall performance (slowest first); as you can see, faster algorithms have stricter requirements:
Algorithm             | Stable ? | Iterator               | Notes
Bubble Sort           | yes      | forward iterator       | code shown here requires bidirectional iterator; just remove both instances of --last; (will be slower then)
Selection Sort        | no       | forward iterator       |
Insertion Sort        | yes      | bidirectional iterator |
Shell Sort            | no       | bidirectional iterator | std::distance and std::advance benefit heavily from random access iterators
Heap Sort             | no       | random-access iterator | my n-ary Heap Sort implementation works with bidirectional iterators, too
Merge Sort            | yes      | random-access iterator | requires additional memory
Merge Sort (in-place) | yes      | forward iterator       | no additional memory but slower than normal merge sort
Quick Sort            | no       | bidirectional iterator | std::distance and std::advance benefit heavily from random access iterators
std::sort             | no       | random-access iterator |
std::stable_sort      | yes      | random-access iterator |
The algorithms are nothing new; they can be found in every decent book on algorithms and, of course, online (e.g. Wikipedia).


(read more ...)

Portable Memory Mapping C++ Class

posted by Stephan Brumme

Parsing Files The Easy Way

Recently I had to do a lot with file loading and parsing. It can be especially tricky to come up with a fast solution if you have to jump around within these files. Seeking and proper buffering were my biggest problems.

Memory mapping is one of the nicest features of modern operating systems: after opening a file in memory-mapped mode you can treat the file as a large chunk of memory and use plain pointers. The operating system takes care of loading the data on demand (!) into memory - utilizing caches, of course. When using my C++ class MemoryMapped it's really easy:
// open file
MemoryMapped data("myfile.txt");

// read byte at file offset 2345
unsigned char a = data[2345];

// or if you prefer pointers
const short* raw = (const short*) data.getData();
short b = raw[300];
Windows is completely different from Linux when it comes to opening a file in memory-mapped mode. The class MemoryMapped hides all the OS specific stuff in only two files: MemoryMapped.h and MemoryMapped.cpp.
They compile without any warning with GCC 4.7 and Visual C++ 2010. I haven't tried other compilers but they should be able to handle it, too, even when they are a bit older.
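On Linux the class essentially boils down to open/fstat/mmap; this POSIX-only sketch shows the core idea without the Windows branch or error handling (a hypothetical mapFile helper, not the class's actual code):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>

// map a whole file read-only and return a pointer to its bytes;
// the caller must munmap(pointer, size) and close(fd) when done
const unsigned char* mapFile(const char* filename, size_t* size, int* fd)
{
  *fd = open(filename, O_RDONLY);
  if (*fd < 0)
    return NULL;

  // determine file size
  struct stat info;
  fstat(*fd, &info);
  *size = (size_t)info.st_size;

  // let the OS page the file in on demand
  void* mapped = mmap(NULL, *size, PROT_READ, MAP_PRIVATE, *fd, 0);
  return (mapped == MAP_FAILED) ? NULL : (const unsigned char*)mapped;
}
```

The returned pointer behaves like a plain array over the file's contents; the operating system loads pages lazily and caches them.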
(read more ...)

Fast CRC32

posted by Stephan Brumme, updated

Error Checking

Real-life data tends to get corrupted because machines (and humans) are never as reliable as we wish for. One efficient way to make sure your data wasn't unintentionally modified is to generate some kind of hash. That hash should be unique, compact and efficient: many protocols, like Ethernet and GZIP archives, append a so-called CRC hash to their data streams which has a few weaknesses regarding uniqueness (sometimes data modifications remain undetected) but is absolutely great when it comes to compactness (CRC32: 32 bits) and efficiency. You find a CRC32 right below every Download button on my blog, too.

There is one very useful property of CRC32: it can be implemented as a rolling algorithm. That means, if you already have some chunk of data and its CRC, then you can append new data and compute the updated CRC by using your original CRC as a seed and scanning only through the appended data.

I don't want to go into the mathematical details of Cyclic Redundancy Checking because there is plenty of information on the internet, e.g. Wikipedia and the Painless Guide To CRC Error Detection Algorithms. This article only discusses how to write a fast CRC32 algorithm in C/C++.

Note: The concepts behind the various CRC32 algorithms are not my original work - I only gathered them in one place.

I assume a little-endian architecture (Intel's x86 and x64 CPU default byte ordering) and ints to be 32 bits wide (replace unsigned int by uint32_t if you use GCC64/Unix).
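For reference, the simplest bitwise variant looks roughly like this - shown as a sketch with the reflected polynomial 0xEDB88320 used by GZIP and Ethernet, and with a seed parameter that enables the rolling usage described above:

```cpp
#include <cstdint>
#include <cstddef>

// simplest bitwise CRC32 (reflected polynomial 0xEDB88320);
// pass a previous CRC as seed to continue a rolling computation
uint32_t crc32_bitwise(const void* data, size_t length,
                       uint32_t previousCrc32 = 0)
{
  uint32_t crc = ~previousCrc32; // 0xFFFFFFFF on the first call
  const uint8_t* current = (const uint8_t*) data;

  while (length-- > 0)
  {
    crc ^= *current++;
    // process one byte, bit by bit; the mask trick avoids a branch
    for (int bit = 0; bit < 8; bit++)
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
  }
  return ~crc;
}
```

The standard check value illustrates both properties: the CRC32 of "123456789" is 0xCBF43926, and computing it in two chunks - feeding the first chunk's CRC in as the seed for the second - yields the same result.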

Update November 13, 2011: The code is more portable now, e.g. uint32_t is used. I got it running on an Arduino board, see here for more details.

Update May 4, 2013: Now the code is endian-aware. Read my posting for instructions on how to run a big endian virtual machine on Linux or Windows.

Update February 6, 2014: If you are interested in MD5, SHA1 and/or SHA256 hashes then take a look at my portable C++ hashing library which of course includes CRC32, too.

Update August 12, 2014: I added a Javascript version of the simple bitwise algorithm.

CRC32 of 1 GByte        | published by     | bits per iteration | table size | time                  | throughput           | CPU cycles/byte
Original                | (unknown)        | 1                  | -          | 29.2 seconds          | 35 MByte/s           | approx. 100
Branch-free             | (unknown)        | 1                  | -          | 16.7 seconds          | 61 MByte/s           | approx. 50
Improved Branch-free    | (unknown)        | 1                  | -          | 14.5 seconds          | 70 MByte/s           | approx. 40
Half-Byte               | (unknown)        | 4                  | 64 bytes   | 4.8 seconds           | 210 MByte/s          | approx. 14
Standard Implementation | Dilip V. Sarwate | 8                  | 1024 bytes | 2.8 seconds           | 375 MByte/s          | approx. 8
Slicing-by-4            | Intel Corp.      | 32                 | 4096 bytes | 0.95 or 1.2 seconds*  | 1050 or 840 MByte/s* | approx. 3 or 4*
Slicing-by-8            | Intel Corp.      | 64                 | 8192 bytes | 0.55 or 0.7 seconds*  | 1800 or 1400 MByte/s* | approx. 1.75 or 2.25*
Note: GCC 4.7 produces slightly faster code than Visual C++ 2010 for Slicing-by-4 and Slicing-by-8. For the other algorithms, throughput is almost identical.
(read more ...)

Testing Endian Aware Code

posted by Stephan Brumme

Turning Around

CPUs store numbers in two different ways: little or big endian. The majority of CPUs are designed to access memory in Little Endian way which seems to have the bytes mixed the wrong way. The memory layout of the hexadecimal number 0x12345678 looks like this:
              | byte 0 | byte 1 | byte 2 | byte 3 | Typical CPUs
Little Endian | 0x78   | 0x56   | 0x34   | 0x12   | Intel/AMD x86, x64
Big Endian    | 0x12   | 0x34   | 0x56   | 0x78   | PowerPC, ARM*
Note: ARM chips often support little endian, too.
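A tiny runtime check along these lines can tell the two layouts apart (a sketch; memcpy avoids strict-aliasing issues):

```cpp
#include <cstdint>
#include <cstring>

// true if the CPU stores the least significant byte first
bool isLittleEndian()
{
  uint32_t value = 0x12345678;
  uint8_t  firstByte;
  std::memcpy(&firstByte, &value, 1); // inspect byte 0 only
  return firstByte == 0x78;           // 0x78 => little endian, 0x12 => big endian
}
```

On an x86/x64 machine this returns true, matching the first table row above.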

Virtual Machines / QEMU

I don't have a big endian computer at home (my ARM chips are little endian by default) but some of my source code required testing that everything works as expected on big endian machines, too. That's when I learnt about QEMU, a great open-source processor emulator. It's available on Unix and Windows.

This posting explains how to setup QEMU and run Debian Wheezy on an emulated big endian PowerPC.
(read more ...)

browse through all postings:

older stuff:

2013/Apr/24   Software License
2013/Feb/15   Read-Only Git Access Available
2011/Dec/05   Fowler-Noll-Vo Hash (FNV1a)
2011/Dec/01   Drawing Antialiased Circles and Ellipses
2011/Nov/27   Uncompressing DEFLATE Data
2011/Nov/23   Streaming MPEG-4 Movies From Your Own Server
2011/Nov/08   Binary Search Revisited
2011/Oct/20   Reverse MD5 Hash
2011/Oct/17   Windows 7 Screenshots
2011/Oct/16   Hide Strings In Your Executable
2011/Oct/11   Generate QR Codes With PHP and Google Charts API
2011/Oct/02   Feed.rss and Sitemap.xml
2011/Sep/28   My Experiences with AdSense
2011/Sep/27   Emulating MemCache in Pure PHP
2011/Sep/14   Simple Ajax
2011/Sep/13   Sparklines
2011/Sep/12   smallpt - Global Illumination in Javascript
2011/Sep/11   Multi-Core Sorting
2011/Sep/06   CLein - The tiny, portable OpenCL info tool