
Practical String Searching

posted by Stephan Brumme

Motivation

There is a huge number of scientific papers on fast, faster and super-fast string searching algorithms.
They usually prove theoretical performance in O-Notation and most papers cover memory consumption as well.

However, theoretical performance isn't always the same as practical performance. That's why I always want to measure real-world throughput: this article presents hopefully understandable C implementations of the most common generic string search algorithms.

In addition I also wrote a simple tool called mygrep that prints all lines of a file where a search phrase is found.
It doesn't come with all the bells and whistles of the Unix tool grep but achieves similar or sometimes even better speed.
A direct comparison can be done online; just scroll down to the Live Demo.

Playing around with various search phrases you will easily see the plain and simple truth: for most situations the naive brute-force algorithm is a great choice.
A slightly modified version based on memchr and memcmp is often several times faster than the second-fastest contender - just because this simple algorithm can make efficient use of modern CPUs' optimized string opcodes and fits well into the caches.

These generic routines might slow down considerably in worst-case scenarios, like DNA analysis (small alphabet and many repeated similar parts).
And yes, fgrep is damn fast.

Live Demo

This demo scans through the first 100 MByte of Wikipedia (well, an older snapshot dating back to March 3, 2006) on the server.
Before search starts, the file is mapped into the Linux file cache in order to avoid disk latencies (cat enwik8 > /dev/null).

Notes: execution times can vary significantly, depending on overall server load.
Special characters such as " or \ are disabled to prevent potential security breaches into my server.
The search algorithms actually can search for these special characters (as plain ASCII symbols).
Don't confuse it with meta characters for regular expressions; those are not supported.
Search is case-sensitive.


License

This code is licensed under the zlib License:
This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software.

Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions:

1. The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.

Library

My library is written in C and compiles without problems using GCC/G++, CLang and Visual C++.
There are two variants of each algorithm: one for zero-terminated C strings and one for binary data. Both return the location of the first match or NULL. An empty needle matches the beginning of the haystack.
Invalid pointers are detected and NULL will be returned.

Unicode is not supported, but the second variant processes binary data (which often contains NULL bytes) without problems.

mygrep is a very simple tool to compare the performance of all mentioned string search algorithms.
Similar to fgrep it scans a file for a given search phrase and prints all lines where a match was found.
It's bare-bones: input pipes aren't supported, and memory mapping isn't used either.

Click on show to see the full source code.

Changelog

Download

Download  search.h
Latest release: December 20, 2019, size: 2396 bytes, 45 lines

CRC32: d437e7e5
MD5: 6f41563c56ccb5fd8dc9210446c88c44
SHA1: e761193de07d6b44f74e33dc9dcff0eab13162a4
SHA256: ae13eb72366d3b30e17229de93989299fa3d352c351567de78f219702c47fe6b

Download  search.c
Latest release: December 20, 2019, size: 14.8 kBytes, 557 lines

CRC32: 9ac5957d
MD5: 59659ac5b358108fb5395682466fa364
SHA1: 8d4e2da0d19f6f8dbdbf118b6a5a4cd330305dcd
SHA256: 6ed64272eb2181d4a4fa004cedc12637ded57ee8e4da06e0612197169bd6f7df

Download  mygrep.c
Latest release: December 20, 2019, size: 6.4 kBytes, 243 lines

CRC32: 48d0c214
MD5: 7af4d4c5f6497dcdac1685ab8ab90f4e
SHA1: 655ed76cb3dc0c4dd26cc6957e18298b6d4ef017
SHA256: 9e7775a587000b1ae9a7c7a868b279a65bfac30d49bd6e4329f7ea9a903810f7

If you encounter any bugs/problems or have ideas for improving future versions, please write me an email: create@stephan-brumme.com

Simple Algorithm a.k.a. Brute Force

The simple algorithm consists of two nested loops: The outer loop walks through all bytes/characters of the haystack and the inner loop compares against the needle. As soon as a match was detected, the algorithm aborts and returns its position inside the haystack.

Native Algorithm

This algorithm is pretty much identical to the Simple Algorithm but uses the function memchr to locate the first character and then compares the remainder with memcmp.

These two functions are standard C and declared in #include <string.h>. They are implemented extremely efficiently and often compare multiple bytes at once.
As you can see in the Live Demo, it's multiple times faster than the Simple Algorithm in its original form.

About 20 lines of the code are spent on speeding up the case of a two-character needle. Optimizing for the three-character case as well doesn't help much because memcmp handles it pretty well.

The phrase Native Algorithm is an invention of mine. I just ran out of ideas what to call it ...

memmem

memmem is a GNU extension to the C standard library and not available for Visual C++.
Its MAN page says it was broken in early releases and only became stable in glibc 2.1.

You need to define _GNU_SOURCE in your code:

strstr

strstr is the standard C string search function.

Implementations vary, but my observation is that strstr is by far the slowest algorithm on GCC, CLang and Visual C++ (well, their respective Standard C libraries).

Knuth-Morris-Pratt Algorithm

The Knuth-Morris-Pratt algorithm improves on the simple algorithm by skipping multiple characters in case of a mismatch.
Its main drawback is the pre-processing stage which builds a table with needleLength pointers/offsets.

You can find a detailed description on Wikipedia including examples.

My code tries to avoid dynamic allocation of the table if the search phrase is short (and puts the table on the stack instead).
Change the constant MaxLocalMemory if you have more/less stack available in your system.

Boyer-Moore-Horspool Algorithm

In contrast to the Knuth-Morris-Pratt algorithm, the Boyer-Moore-Horspool algorithm tries to match the end of the search phrase first.
It needs a temporary table, too, but its size is constant and depends on the size of the alphabet which is 256 for ASCII strings.
The initialisation is a bit more costly.

Again, Wikipedia is a good source of information if you want to go deeper into detail.

Bitap Algorithm

The Bitap algorithm is often called Baeza-Yates-Gonnet algorithm. Its original purpose was approximate string matching.
The code below is a simplified version and only supports exact string matching.

The algorithm is based on bit manipulation. Each of needle's characters is represented by a bit.
My implementation is restricted to at most 32 byte long needles on 32-bit systems or at most 64 byte long needles on 64-bit systems.
A temporary table with 256 words is created on the stack.

It's worth noting that the Bitap algorithm's speed is only affected by the length of haystack and needle, not by their content.
In other words: it's very predictable, which might be extremely interesting for realtime systems.

Please check Wikipedia for further information.

Rabin-Karp Algorithm

The Rabin-Karp algorithm is based on rolling hashes. Due to hash collisions false positives may be found.
Therefore each potential match must be checked again (simplest way: memcmp).

The main strength of this algorithm is its ability to search for multiple phrases in one pass.
However, I didn't implement this feature to keep things simple. With a single needle there is no need for memory allocation either.

My hashing routine is a basic additive hash. More elaborate hashes produce way less hash collisions but can be considerably slower.

And, of course, Wikipedia is your friend when it comes to specifics.