Practical String Searching

posted by Stephan Brumme

Motivation

There is a huge number of scientific papers on fast, faster and super-fast string searching algorithms.
They usually prove theoretical performance in O-Notation and most papers cover memory consumption as well.

However, theoretical performance isn't always the same as practical performance. That's why I always want to measure real-world throughput: this article presents hopefully understandable C implementations of the most common generic string search algorithms.

In addition I also wrote a simple tool called mygrep that prints all lines of a file where a search phrase is found.
It doesn't come with all the bells and whistles of the Unix tool grep but achieves similar or sometimes even better speed.
A direct comparison can be done online, just scroll down to the Live Demo.

Playing around with various search phrases you will easily see the plain and simple truth: for most situtations the naive brute-force algorithm is a great choice.
A slightly modified version based on memchr and memcmp is often several times faster than the second-fastest contender - just because this simple algorithm can make efficient use of modern CPUs' optimized string opcodes and fits well into the caches.

These generic routines might slow down considerably in worst-case scenarios, like DNA analysis (small alphabet and many repeated similar parts).
And yes, fgrep is damn fast.

Live Demo

This demo scans through the first 100 MByte of Wikipedia (well, an older snapshot dating back to March 3, 2006) on the server.
Before search starts, the file is mapped into the Linux file cache in order to avoid disk latencies (cat enwik8 > /dev/zero).

Notes: execution times can vary significantly - depending on overall server load.
Special characters such as " or \ are disabled to prevent potential security breaches into my server.
The search algorithms actually can search for these special characters (as plain ASCII symbols).
Don't confuse it with meta characters for regular expressions; those are not supported.
Search is case-sensitive.

Search Phrase:
Algorithm Tool Matches Time
(please enter search phrase and click "Go !"

Library

My library is written in C and compiles without problems using GCC/G++, CLang and Visual C++.
There are two functions for each algorithm: Both return the location of the first match or NULL. An empty needle matches the beginning of the haystack.
Invalid pointers are detected and NULL will be returned.

Unicode is not supported, but the second processes binary data (which often contains NULL bytes) without problems:
hide parameters const char* searchSimpleString(const char* haystack, const char* needle); const char* searchSimple (const char* haystack, size_t haystackLength, const char* needle, size_t needleLength);
mygrep is a very simple tool to compare performance of all mentioned string search algorithms.
Similar to fgrep it scans a file for a given search phrase and prints all lines where a match was found.
It's bare bones, therefore input pipes aren't supported, and neither is memory mapping used.

Click on show to see the full source code.
show mygrep // ////////////////////////////////////////////////////////// // mygrep.c // Copyright (c) 2014 Stephan Brumme. All rights reserved. // see http://create.stephan-brumme.com/disclaimer.html // // gcc -O3 -std=c99 -Wall -pedantic search.c mygrep.c -o mygrep // file size limited to available memory size because whole file is loaded into RAM // enable GNU extensions, such as memmem() #ifndef _GNU_SOURCE #define _GNU_SOURCE #endif #ifdef _MSC_VER /// GNU C++ has a super-optimized version of memmem() which Visual C++ lacks, here is a simple replacement using a function reference #define memmem searchNative #define _CRT_SECURE_NO_WARNINGS #endif #include "search.h" #include <string.h> // memmem() #include <stdio.h> // printf() #include <stdlib.h> // malloc() enum Algorithm { UseBest , UseStrStr , UseMemMem , UseSimple , UseNative , UseKnuthMorrisPratt , UseBoyerMooreHorspool , UseBitap , UseRabinKarp } algorithm; enum { ShowLines = 0, ShowCountOnly = 1 } display; int main(int argc, char* argv[]) { const char* syntax = "Syntax: ./mygrep searchphrase filename [--native|--memmem|--strstr|--simple|--knuthmorrispratt|--boyermoorehorspool|--bitap|--rabinkarp] [-c]\n"; if (argc < 3 || argc > 5) { printf("%s", syntax); return -1; } // don't show lines, just count them display = ShowLines; if (argc == 5 && strcmp(argv[4], "-c") == 0) display = ShowCountOnly; // use safer memmem() by default algorithm = UseBest; if (argc >= 4) { if (strcmp(argv[3], "--native") == 0) algorithm = UseNative; else if (strcmp(argv[3], "--memmem") == 0) algorithm = UseMemMem; else if (strcmp(argv[3], "--strstr") == 0) // be careful: buffer overruns possible !!! algorithm = UseStrStr; else if (strcmp(argv[3], "--simple") == 0) algorithm = UseSimple; else if (strcmp(argv[3], "--knuthmorrispratt") == 0 || strcmp(argv[3], "--kmp") == 0) algorithm = UseKnuthMorrisPratt; else if (strcmp(argv[3], "--boyermoorehorspool") == 0 || strcmp(argv[3], "--bmh") == 0) algorithm = UseBoyerMooreHorspool; else if (strcmp(argv[3], "--bitap") == 0) algorithm = UseBitap; else if (strcmp(argv[3], "--rabinkarp") == 0) algorithm = UseRabinKarp; else if (strcmp(argv[3], "-c") == 0) display = ShowCountOnly; else { printf("%s", syntax); return -2; } } // open file FILE* file = fopen(argv[2], "rb"); if (!file) { printf("Failed to open file"); return -3; } // determine its filesize fseek(file, 0, SEEK_END); long filesize = ftell(file); fseek(file, 0, SEEK_SET); if (filesize == 0) { printf("Empty file"); return -4; } // allocate memory and read the whole file at once char* data = (char*) malloc(filesize + 2); if (!data) { printf("Out of memory"); return -5; } fread(data, filesize, 1, file); fclose(file); // pad data to avoid buffer overruns data[filesize ] = '\n'; data[filesize + 1] = 0; // what we look for const char* needle = argv[1]; const size_t needleLength = strlen(needle); // where we look for const char* haystack = data; const size_t haystackLength = filesize; // fence const char* haystackEnd = haystack + haystackLength; // search until done ... size_t numHits = 0; const char* current = haystack; for (;;) { // offset of current hit from the beginning of the haystack size_t bytesDone = current - haystack; size_t bytesLeft = haystackLength - bytesDone; switch (algorithm) { case UseBest: // when needle is longer than about 16 bytes, Boyer-Moore-Horspool is faster if (needleLength <= 16) current = searchNative (current, bytesLeft, needle, needleLength); else current = searchBoyerMooreHorspool(current, bytesLeft, needle, needleLength); break; case UseMemMem: // correctly handled zeros, unfortunately much slower { const char* before = current; current = (const char*)memmem (current, bytesLeft, needle, needleLength); // workaround for strange GCC behavior, else I get a memory access violation if (current) { int diff = current - before; current = before + diff; } } break; case UseStrStr: // much faster but has problems when bytes in haystack are zero, // requires both to be properly zero-terminated current = strstr (current, needle); break; case UseSimple: // brute-force current = searchSimple (current, bytesLeft, needle, needleLength); break; case UseNative: // brute-force for short needles, based on compiler-optimized functions current = searchNative (current, bytesLeft, needle, needleLength); break; case UseKnuthMorrisPratt: // Knuth-Morris-Pratt current = searchKnuthMorrisPratt (current, bytesLeft, needle, needleLength); break; case UseBoyerMooreHorspool: // Boyer-Moore-Horspool current = searchBoyerMooreHorspool (current, bytesLeft, needle, needleLength); break; case UseBitap: // Bitap / Baeza-Yates-Gonnet algorithm current = searchBitap (current, bytesLeft, needle, needleLength); break; case UseRabinKarp: // Rabin-Karp algorithm current = searchRabinKarp (current, bytesLeft, needle, needleLength); break; } // needle not found in the remaining haystack if (!current) break; numHits++; // find end of line const char* right = current; while (right != haystackEnd && *right != '\n') right++; if (display == ShowCountOnly) { current = right; continue; } // find beginning of line const char* left = current; while (left != haystack && *left != '\n') left--; if (*left == '\n' && left != haystackEnd) left++; // send line to standard output size_t lineLength = right - left; fwrite(left, lineLength, 1, stdout); // and append a newline putchar('\n'); // don't search this line anymore current = right; } if (display == ShowCountOnly) printf("%d\n", numHits); // exit with error code 1 if nothing found return numHits == 0 ? 1 : 0; }

Changelog

Download

Download  search.h
Latest release: August 25, 2014, size: 2391 bytes, 45 lines

CRC32: fc4e7f8d
MD5: fec72f86d29aea4921ff92b01e553559
SHA1: 27f463a9004465e781197425f2adb22bc1640521
SHA256:3d33c1acae049ff9af30a83b2e483a418264d9ef2a7004aae14e743b511e9652

Download  search.c
Latest release: August 25, 2014, size: 15.3 kBytes, 582 lines

CRC32: e2915f86
MD5: f71170ad6f0d1ebe1e7ae4b688485555
SHA1: f5ca408c774310d3b9e6def3ab26cc6fd09368a1
SHA256:c0ee27b1533067266c880989feb51b25af94b06ed45394f9d6e2f09a8065caf2

Download  mygrep.c
Latest release: August 25, 2014, size: 6477 bytes, 237 lines

CRC32: cf92acc5
MD5: a3dbaf10f975f92a2c77b50e4151bd20
SHA1: 52bb4f4b56046a19c85238a0eeee53ea0e88f8b8
SHA256:8415244ac2b57875ec3023475b0d44c431799976d3f8ab951b4ad987920f2a16

Stay up-to-date:git clone http://create.stephan-brumme.com/practical-string-searching/.git

Simple Algorithm a.k.a. Brute Force

The simple algorithm consists of two nested loops: The outer loop walks through all bytes/characters of the haystack and the inner loop compares against the needle. As soon as a match was detected, the algorithm aborts and returns its position inside the haystack.
hide simple algorithm (strings) const char* searchSimpleString(const char* haystack, const char* needle) { // detect invalid input if (!haystack || !needle) return NULL; // until the end of haystack (or a match, of course) while (*haystack) { // compare current haystack to needle size_t i = 0; while (needle[i]) { if (haystack[i] != needle[i]) break; i++; } // needle fully matched (reached terminating NUL-byte) if (!needle[i]) return haystack; // no match, step forward haystack++; } // not found return NULL; }
show simple algorithm (binary) const char* searchSimple(const char* haystack, size_t haystackLength, const char* needle, size_t needleLength) { // detect invalid input if (!haystack || !needle || haystackLength < needleLength) return NULL; // match impossible if less than needleLength bytes left haystackLength -= needleLength - 1; // points beyond last considered start byte const char* haystackEnd = haystack + haystackLength; // until the end of haystack (or a match, of course ...) while (haystack != haystackEnd) { // compare current haystack to needle size_t i = 0; for (; i < needleLength; i++) if (haystack[i] != needle[i]) break; // needle fully matched if (i == needleLength) return haystack; // no match, step forward haystack++; } // not found return NULL; }

Native Algorithm

This algorithm is pretty much identical to the Simple Algorithm but uses the function memchr to locate the first character and then compares the remainder with memcmp.

These two function are standard C and can be found in #include <string.h>. They are implemented extremely efficient and often compare multiple bytes at once.
As you can see in the Live Demo, it's multiple times faster than the Simple Algorithm in its original form.

About 20 lines of the code are spent on speeding up the case of a two-character needle. Optimizing for the three-character case as well doesn't help much because memcmp handles it pretty well.

The phrase Native Algorithm is an invention of mine. I just ran out of ideas how to call it ...
hide native algorithm (binary) const char* searchNative(const char* haystack, size_t haystackLength, const char* needle, size_t needleLength) { // uses memchr() for the first byte, then memcmp to verify it's a valid match // detect invalid input if (!haystack || !needle || haystackLength < needleLength) return NULL; // empty needle matches everything if (needleLength == 0) return haystack; // shorter code for just one character if (needleLength == 1) return (const char*)memchr(haystack, *needle, haystackLength); haystackLength -= needleLength - 1; // points beyond last considered byte const char* haystackEnd = haystack + haystackLength; // optimized code for just two characters if (needleLength == 2) { while ((haystack = (const char*)memchr(haystack, needle[0], haystackLength)) != NULL) { // match second byte/character if (haystack[1] == needle[1]) return haystack; // compute number of remaining bytes haystackLength -= haystackEnd - haystack; if (haystackLength == 0) return NULL; // keep going haystack++; haystackLength--; } // needle not found in haystack return NULL; } // look for first byte while ((haystack = (const char*)memchr(haystack, *needle, haystackLength)) != NULL) { // does last byte match, too ? if (haystack[needleLength - 1] == needle[needleLength - 1]) // okay, perform full comparison, skip first and last byte if (memcmp(haystack + 1, needle + 1, needleLength - 2) == 0) return haystack; // note: the following lines are identical to the "two-character case" from above // compute number of remaining bytes haystackLength = haystackEnd - haystack; if (haystackLength == 0) return NULL; // keep going haystack++; haystackLength--; } // needle not found in haystack return NULL; }

memmem

memmem is a GNU extension to the C standard library and not available for Visual C++.
Its MAN page says it was broken in early releases and only became stable in glibc 2.1.

You need to define _GNU_SOURCE in your code:
#define _GNU_SOURCE #include <string.h> void* memmem(const void* haystack, size_t haystacklen, const void* needle, size_t needlelen);

strstr

strstr is the standard C string search function.

Implementations vary but my observation is that strstr is by far the slowest algorithms on GCC, CLang and Visual C++ (well, their respective Standard C libraries).
#include <string.h> char* strstr(const char* haystack, const char* needle);

Knuth-Morris-Pratt Algorithm

The Knuth-Morris-Pratt algorithm improves on the simple algorithm by skipping multiple characters in case of a mismatch.
It's main drawback is the pre-processing stage which builds a table with needleLength pointers/offsets.

You can find a detailled description on Wikipedia including examples.

My code tries to avoid dynamic allocation of the table if the search phrase is short (and puts the stack on the stack).
Change the constant MaxLocalMemory if you have more/less stack available in your system.
hide Knuth-Morris-Pratt algorithm (strings) const char* searchKnuthMorrisPrattString(const char* haystack, const char* needle) { // detect invalid input if (!haystack || !needle) return NULL; // empty needle matches everything size_t needleLength = strlen(needle); if (needleLength == 0) return haystack; // try to use stack instead of heap (avoid slow memory allocations if possible) const size_t MaxLocalMemory = 256; int localMemory[MaxLocalMemory]; int* skip = localMemory; // stack too small => allocate heap if (needleLength > MaxLocalMemory) { skip = (int*)malloc(needleLength * sizeof(int)); if (skip == NULL) return NULL; } // prepare skip table skip[0] = -1; int i; for (i = 0; needle[i]; i++) { skip[i + 1] = skip[i] + 1; while (skip[i + 1] > 0 && needle[i] != needle[skip[i + 1] - 1]) skip[i + 1] = skip[skip[i + 1] - 1] + 1; } // assume no match const char* result = NULL; int shift = 0; // search while (*haystack) { // look for a matching character while (shift >= 0 && *haystack != needle[shift]) shift = skip[shift]; // single step forward in needle and haystack haystack++; shift++; // reached end of needle => hit if (!needle[shift]) { result = haystack - shift; break; } } // clean up heap (if used) if (skip != localMemory) free(skip); // points to match position or NULL if not found return result; }
show Knuth-Morris-Pratt algorithm (binary) const char* searchKnuthMorrisPratt(const char* haystack, size_t haystackLength, const char* needle, size_t needleLength) { // detect invalid input if (!haystack || !needle || haystackLength < needleLength) return NULL; // empty needle matches everything if (needleLength == 0) return haystack; // try to use stack instead of heap (avoid slow memory allocations if possible) const size_t MaxLocalMemory = 256; int localMemory[MaxLocalMemory]; int* skip = localMemory; // stack too small => allocate heap if (needleLength > MaxLocalMemory) { skip = (int*)malloc(needleLength * sizeof(int)); if (skip == NULL) return NULL; } // prepare skip table skip[0] = -1; size_t i; for (i = 0; i < needleLength; i++) { skip[i + 1] = skip[i] + 1; while (skip[i + 1] > 0 && needle[i] != needle[skip[i + 1] - 1]) skip[i + 1] = skip[skip[i + 1]-1] + 1; } // assume no match const char* result = NULL; const char* haystackEnd = haystack + haystackLength; int shift = 0; // search while (haystack != haystackEnd) { // look for a matching character while (shift >= 0 && *haystack != needle[shift]) shift = skip[shift]; // single step forward in needle and haystack haystack++; shift++; // reached end of needle => hit if ((size_t)shift == needleLength) { result = haystack - shift; break; } } // clean up heap (if used) if (skip != localMemory) free(skip); // points to match position or NULL if not found return result; }

Boyer-Moore-Horspool Algorithm

In contrast to the Knuth-Morris-Pratt algorithm, the Boyer-Moore-Horspool algorithm tries to match the end of the search phrase first.
It needs a temporary table, too, but its size is constant and depends on the size of the alphabet which is 256 for ASCII strings.
The initialisation is a bit more costly.

Again, Wikipedia is a good source of information if you want to go deeper into detail.
show Boyer-Moore-Horspool algorithm (strings) const char* searchBoyerMooreHorspoolString(const char* haystack, const char* needle) { // detect invalid input if (!haystack || !needle) return NULL; // call routine for non-text data return searchBoyerMooreHorspool(haystack, strlen(haystack), needle, strlen(needle)); }
hide Boyer-Moore-Horspool algorithm (binary) const char* searchBoyerMooreHorspool(const char* haystack, size_t haystackLength, const char* needle, size_t needleLength) { // detect invalid input if (!haystack || !needle || haystackLength < needleLength) return NULL; // empty needle matches everything if (needleLength == 0) return haystack; // find right-most position of each character // and store its distance to the end of needle // default value: when a character in haystack isn't in needle, then // we can jump forward needleLength bytes const size_t NumChar = 1 << (8 * sizeof(char)); size_t skip[NumChar]; size_t i; for (i = 0; i < NumChar; i++) skip[i] = needleLength; // figure out for each character of the needle how much we can skip // (if a character appears multiple times in needle, later occurrences // overwrite previous ones, i.e. the value of skip[x] decreases) const size_t lastPos = needleLength - 1; size_t pos; for (pos = 0; pos < lastPos; pos++) skip[(unsigned char)needle[pos]] = lastPos - pos; // now walk through the haystack while (haystackLength >= needleLength) { // all characters match ? for (i = lastPos; haystack[i] == needle[i]; i--) if (i == 0) return haystack; // no match, jump ahead unsigned char marker = (unsigned char) haystack[lastPos]; haystackLength -= skip[marker]; haystack += skip[marker]; } // needle not found in haystack return NULL; }

Bitap Algorithm

The Bitap algorithm is often called Baeza-Yates-Gonnet algorithm. It's original purpose was approximate string matching.
The code below is a simplified version and only supports exact string matching.

The algorithm is based on bit manipulation. Each of needle's characters is represented by a bit.
My implementation is restricted to at most 32 byte long needles on 32-bit systems or at most 64 byte long needles on 64-bit systems.
A temporary table with 256 words is created on the stack.

It's worth noting that the Bitap algorithm's speed is only affected by the length of haystack and needle, not by their content.
In other words: it's very predictable, which might be extremeful interesting for realtime systems.

Please check Wikipedia for further information.
hide Bitap algorithm (strings) const char* searchBitapString(const char* haystack, const char* needle) { // detect invalid input if (!haystack || !needle) return NULL; // empty needle matches everything size_t needleLength = strlen(needle); if (needleLength == 0) return haystack; // create bit masks for each possible byte / ASCII character // each mask is as wide as needleLength const size_t MaxBitWidth = 8 * sizeof(int) - 1; // only if needleLength bits fit into an integer (minus 1), the algorithm will be fast if (needleLength > MaxBitWidth) return searchSimpleString(haystack, needle); // one mask per allowed character (1 byte => 2^8 => 256) // where all bits are set except those where the character is found in needle const size_t AlphabetSize = 256; unsigned int masks[AlphabetSize]; size_t i; for (i = 0; i < AlphabetSize; i++) masks[i] = ~0; for (i = 0; i < needleLength; i++) masks[(unsigned char)needle[i]] &= ~(1 << i); // initial state mask has all bits set except the lowest one unsigned int state = ~1; const unsigned int FullMatch = 1 << needleLength; while (*haystack) { // update the bit array state |= masks[(unsigned char)*haystack]; state <<= 1; // if an unset bit "bubbled up" we have a match if ((state & FullMatch) == 0) return (haystack - needleLength) + 1; haystack++; } // needle not found in haystack return NULL; }
show Bitap algorithm (binary) const char* searchBitap(const char* haystack, size_t haystackLength, const char* needle, size_t needleLength) { // detect invalid input if (!haystack || !needle || haystackLength < needleLength) return NULL; // empty needle matches everything if (needleLength == 0) return haystack; // create bit masks for each possible byte / ASCII character // each mask is as wide as needleLength const size_t MaxBitWidth = 8 * sizeof(int) - 1; // only if needleLength bits fit into an integer (minus 1), the algorithm will be fast if (needleLength > MaxBitWidth) return searchNative(haystack, haystackLength, needle, needleLength); // one mask per allowed character (1 byte => 2^8 => 256) // where all bits are set except those where the character is found in needle const size_t AlphabetSize = 256; unsigned int masks[AlphabetSize]; size_t i; for (i = 0; i < AlphabetSize; i++) masks[i] = ~0; for (i = 0; i < needleLength; i++) masks[(unsigned char)needle[i]] &= ~(1 << i); // points beyond last considered byte const char* haystackEnd = haystack + haystackLength; // initial state mask has all bits set except the lowest one unsigned int state = ~1; const unsigned int FullMatch = 1 << needleLength; while (haystack != haystackEnd) { // update the bit array state |= masks[(unsigned char)*haystack]; state <<= 1; // if an unset bit "bubbled up" we have a match if ((state & FullMatch) == 0) return (haystack - needleLength) + 1; haystack++; } // needle not found in haystack return NULL; }

Rabin-Karp Algorithm

The Rabin-Karp algorithm is based on rolling hashes. Due to hash collisions false positives may be found.
Therefore each potential match must be checked again (simplest way: memcmp).

The main strength of this algorithm is its ability to search for multiple phrases in one pass.
However, I didn't implement this feature to keep things simple. With a single needle there is no need for memory allocation, too.

My hashing routine is a basic additive hash. More elaborate hashes produce way less hash collisions but can be considerably slower.

And, of course, Wikipedia is your friend when it comes to specifics.
hide Rabin-Karp algorithm (strings) const char* searchRabinKarpString(const char* haystack, const char* needle) { // detect invalid input if (!haystack || !needle) return NULL; // empty needle matches everything if (!*needle) return haystack; // find first match of the first letter haystack = strchr(haystack, *needle); if (!haystack) return NULL; // now first letter of haystack and needle is identical // let's compute the sum of all characters of needle unsigned int hashNeedle = *needle; unsigned int hashHaystack = *haystack; const char* scanNeedle = needle + 1; const char* scanHaystack = haystack + 1; while (*scanNeedle && *scanHaystack) { hashNeedle += *scanNeedle++; hashHaystack += *scanHaystack++; } // if scanNeedle doesn't point to zero, then we have too little haystack if (*scanNeedle) return NULL; // length of needle const size_t needleLength = scanNeedle - needle; // walk through haystack and roll the hash for (;;) { // identical hash ? if (hashHaystack == hashNeedle) { // can be a false positive, therefore must check all characters again if (memcmp(haystack, needle, needleLength) == 0) return haystack; } // no more bytes left ? if (!*scanHaystack) break; // update hash hashHaystack -= *haystack++; hashHaystack += *scanHaystack++; } // needle not found in haystack return NULL; }
show Rabin-Karp algorithm (binary) const char* searchRabinKarp(const char* haystack, size_t haystackLength, const char* needle, size_t needleLength) { // detect invalid input if (!haystack || !needle || haystackLength < needleLength) return NULL; // empty needle matches everything if (needleLength == 0) return haystack; // one byte beyond last position where a match can begin const char* haystackEnd = haystack + haystackLength - needleLength + 1; // find first match of the first letter haystack = (const char*)memchr(haystack, *needle, haystackLength); if (!haystack) return NULL; // now first letter of haystack and needle is identical // let's compute the sum of all characters of needle unsigned int hashNeedle = 0; unsigned int hashHaystack = 0; size_t i; for (i = 0; i < needleLength; i++) { // not enough haystack left ? if (!haystack[i]) return NULL; hashNeedle += needle [i]; hashHaystack += haystack[i]; } // walk through haystack and roll the hash computation while (haystack != haystackEnd) { // identical hash ? if (hashHaystack == hashNeedle) { // can be a false positive, therefore must check all characters again if (memcmp(haystack, needle, needleLength) == 0) return haystack; } // update hash hashHaystack += *(haystack + needleLength); hashHaystack -= *haystack++; } // needle not found in haystack return NULL; }
homepage