Practical String Searching
posted by Stephan Brumme
Motivation
There is a huge number of scientific papers on fast, faster and super-fast string searching algorithms.They usually prove theoretical performance in O-Notation and most papers cover memory consumption as well.
However, theoretical performance isn't always the same as practical performance. That's why I always want to measure real-world throughput: this article presents hopefully understandable C implementations of the most common generic string search algorithms.
In addition I also wrote a simple tool called
mygrep
that prints all lines of a file where a search phrase is found.It doesn't come with all the bells and whistles of the Unix tool
grep
but achieves similar or sometimes even better speed.A direct comparison can be done online, just scroll down to the Live Demo.
Playing around with various search phrases you will easily see the plain and simple truth: for most situtations the naive brute-force algorithm is a great choice.
A slightly modified version based on
memchr
and memcmp
is often several times faster
than the second-fastest contender - just because this simple algorithm can make efficient use of modern CPUs' optimized string opcodes
and fits well into the caches.These generic routines might slow down considerably in worst-case scenarios, like DNA analysis (small alphabet and many repeated similar parts).
And yes,
fgrep
is damn fast.Live Demo
This demo scans through the first 100 MByte of Wikipedia (well, an older snapshot dating back to March 3, 2006) on the server.Before search starts, the file is mapped into the Linux file cache in order to avoid disk latencies (
cat enwik8 > /dev/zero
).Notes: execution times can vary significantly - depending on overall server load.
Special characters such as
"
or \
are disabled to prevent potential security breaches into my server.The search algorithms actually can search for these special characters (as plain ASCII symbols).
Don't confuse it with meta characters for regular expressions; those are not supported.
Search is case-sensitive.
Algorithm | Tool | Matches | Time |
---|---|---|---|
(please enter search phrase and click "Go !" |
License
This code is licensed under the zlib License:This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software. Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions: 1. The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required. 2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software. 3. This notice may not be removed or altered from any source distribution.zlib License
Library
My library is written in C and compiles without problems using GCC/G++, CLang and Visual C++.There are two functions for each algorithm:
- one has the same parameters as
strstr
and works on NULL-terminated strings - and the second requires the length of each string (in bytes)
NULL
. An empty needle matches the beginning of the haystack.Invalid pointers are detected and NULL will be returned.
Unicode is not supported, but the second processes binary data (which often contains NULL bytes) without problems:
hide
parameters
const char* searchSimpleString(const char* haystack, const char* needle);
const char* searchSimple (const char* haystack, size_t haystackLength, const char* needle, size_t needleLength);
mygrep
is a very simple tool to compare performance of all mentioned string search algorithms.Similar to
fgrep
it scans a file for a given search phrase and prints all lines where a match was found.It's bare bones, therefore input pipes aren't supported, and neither is memory mapping used.
Click on show to see the full source code.
show
mygrep
// //////////////////////////////////////////////////////////
// mygrep.c
// Copyright (c) 2014,2019 Stephan Brumme. All rights reserved.
// see http://create.stephan-brumme.com/disclaimer.html
//
// gcc -O3 -std=c99 -Wall -pedantic search.c mygrep.c -o mygrep
// file size limited to available memory size because whole file is loaded into RAM
// enable GNU extensions, such as memmem()
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#ifdef _MSC_VER
/// GNU C++ has a super-optimized version of memmem() which Visual C++ lacks, here is a simple replacement using a function reference
#define memmem searchNative
#define _CRT_SECURE_NO_WARNINGS
#endif
#include "search.h"
#include <string.h> // memmem()
#include <stdio.h> // printf()
#include <stdlib.h> // malloc()
enum Algorithm
{
UseBest
, UseStrStr
, UseMemMem
, UseSimple
, UseNative
, UseKnuthMorrisPratt
, UseBoyerMooreHorspool
, UseBitap
, UseRabinKarp
} algorithm;
enum
{
ShowLines = 0,
ShowCountOnly = 1
} display;
int main(int argc, char* argv[])
{
const char* syntax = "Syntax: ./mygrep searchphrase filename [--native|--memmem|--strstr|--simple|--knuthmorrispratt|--boyermoorehorspool|--bitap|--rabinkarp] [-c]\n";
if (argc < 3 || argc > 5)
{
printf("%s", syntax);
return -1;
}
// don't show lines, just count them
display = ShowLines;
if (argc == 5 && strcmp(argv[4], "-c") == 0)
display = ShowCountOnly;
// use safer memmem() by default
algorithm = UseBest;
if (argc >= 4)
{
if (strcmp(argv[3], "--native") == 0)
algorithm = UseNative;
else if (strcmp(argv[3], "--memmem") == 0)
algorithm = UseMemMem;
else if (strcmp(argv[3], "--strstr") == 0) // be careful: buffer overruns possible !!!
algorithm = UseStrStr;
else if (strcmp(argv[3], "--simple") == 0)
algorithm = UseSimple;
else if (strcmp(argv[3], "--knuthmorrispratt") == 0 ||
strcmp(argv[3], "--kmp") == 0)
algorithm = UseKnuthMorrisPratt;
else if (strcmp(argv[3], "--boyermoorehorspool") == 0 ||
strcmp(argv[3], "--bmh") == 0)
algorithm = UseBoyerMooreHorspool;
else if (strcmp(argv[3], "--bitap") == 0)
algorithm = UseBitap;
else if (strcmp(argv[3], "--rabinkarp") == 0)
algorithm = UseRabinKarp;
else if (strcmp(argv[3], "-c") == 0)
display = ShowCountOnly;
else
{
printf("%s", syntax);
return -2;
}
}
// open file
FILE* file = fopen(argv[2], "rb");
if (!file)
{
printf("Failed to open file\n");
return -3;
}
// determine its filesize
fseek(file, 0, SEEK_END);
long filesize = ftell(file);
fseek(file, 0, SEEK_SET);
if (filesize == 0)
{
printf("Empty file\n");
return -4;
}
// allocate memory and read the whole file at once
char* data = (char*) malloc(filesize + 2);
if (!data)
{
printf("Out of memory\n");
return -5;
}
fread(data, filesize, 1, file);
fclose(file);
// pad data to avoid buffer overruns
data[filesize ] = '\n';
data[filesize + 1] = 0;
// what we look for
const char* needle = argv[1];
const size_t needleLength = strlen(needle);
// where we look for
const char* haystack = data;
const size_t haystackLength = filesize;
// fence
const char* haystackEnd = haystack + haystackLength;
// "native" and "Boyer-Moore-Horspool" are in almost all cases the best choice
if (algorithm == UseBest)
{
// when needle is longer than about 16 bytes, Boyer-Moore-Horspool is faster
if (needleLength <= 16)
algorithm = UseNative;
else
algorithm = UseBoyerMooreHorspool;
}
// search until done ...
unsigned int numHits = 0;
const char* current = haystack;
for (;;)
{
// offset of current hit from the beginning of the haystack
size_t bytesDone = current - haystack;
size_t bytesLeft = haystackLength - bytesDone;
switch (algorithm)
{
case UseMemMem:
// correctly handled zeros, unfortunately much slower
{
const char* before = current;
current = (const char*)memmem (current, bytesLeft, needle, needleLength);
// workaround for strange GCC behavior, else I get a memory access violation
if (current)
{
int diff = current - before;
current = before + diff;
}
}
break;
case UseStrStr:
// much faster but has problems when bytes in haystack are zero,
// requires both to be properly zero-terminated
current = strstr (current, needle);
break;
case UseSimple:
// brute-force
current = searchSimple (current, bytesLeft, needle, needleLength);
break;
case UseNative:
// brute-force for short needles, based on compiler-optimized functions
current = searchNative (current, bytesLeft, needle, needleLength);
break;
case UseKnuthMorrisPratt:
// Knuth-Morris-Pratt
current = searchKnuthMorrisPratt (current, bytesLeft, needle, needleLength);
break;
case UseBoyerMooreHorspool:
// Boyer-Moore-Horspool
current = searchBoyerMooreHorspool(current, bytesLeft, needle, needleLength);
break;
case UseBitap:
// Bitap / Baeza-Yates-Gonnet algorithm
current = searchBitap (current, bytesLeft, needle, needleLength);
break;
case UseRabinKarp:
// Rabin-Karp algorithm
current = searchRabinKarp (current, bytesLeft, needle, needleLength);
break;
default:
printf("Unknown search algorithm\n");
return -6;
}
// needle not found in the remaining haystack
if (!current)
break;
numHits++;
// find end of line
const char* right = current;
while (right != haystackEnd && *right != '\n')
right++;
if (display == ShowCountOnly)
{
current = right;
continue;
}
// find beginning of line
const char* left = current;
while (left != haystack && *left != '\n')
left--;
if (*left == '\n' && left != haystackEnd)
left++;
// send line to standard output
size_t lineLength = right - left;
fwrite(left, lineLength, 1, stdout);
// and append a newline
putchar('\n');
// don't search this line anymore
current = right;
}
if (display == ShowCountOnly)
printf("%d\n", numHits);
// exit with error code 1 if nothing found
return numHits == 0 ? 1 : 0;
}
Changelog
- version 1
- August 25, 2014
- initial release
- Git tag
practical_string_searching_v1
Download
Git users: scroll down to the repository link Latest release: December 20, 2019, size: 2396 bytes, 45 linesCRC32:
d437e7e5
MD5:
6f41563c56ccb5fd8dc9210446c88c44
SHA1:
e761193de07d6b44f74e33dc9dcff0eab13162a4
SHA256:
ae13eb72366d3b30e17229de93989299fa3d352c351567de78f219702c47fe6b
Latest release: December 20, 2019, size: 14.8 kBytes, 557 lines
CRC32:
9ac5957d
MD5:
59659ac5b358108fb5395682466fa364
SHA1:
8d4e2da0d19f6f8dbdbf118b6a5a4cd330305dcd
SHA256:
6ed64272eb2181d4a4fa004cedc12637ded57ee8e4da06e0612197169bd6f7df
Latest release: December 20, 2019, size: 6.4 kBytes, 243 lines
CRC32:
48d0c214
MD5:
7af4d4c5f6497dcdac1685ab8ab90f4e
SHA1:
655ed76cb3dc0c4dd26cc6957e18298b6d4ef017
SHA256:
9e7775a587000b1ae9a7c7a868b279a65bfac30d49bd6e4329f7ea9a903810f7
Stay up-to-date:git clone https://create.stephan-brumme.com/practical-string-searching/git
GitHub mirror:https://github.com/stbrumme/practical-string-searching
If you encounter any bugs/problems or have ideas for improving future versions, please write me an email: create@stephan-brumme.com
Simple Algorithm a.k.a. Brute Force
The simple algorithm consists of two nested loops: The outer loop walks through all bytes/characters of the haystack and the inner loop compares against the needle. As soon as a match was detected, the algorithm aborts and returns its position inside the haystack.
hide
simple algorithm (strings)
const char* searchSimpleString(const char* haystack, const char* needle)
{
// detect invalid input
if (!haystack || !needle)
return NULL;
// until the end of haystack (or a match, of course)
while (*haystack)
{
// compare current haystack to needle
size_t i = 0;
while (needle[i])
{
if (haystack[i] != needle[i])
break;
i++;
}
// needle fully matched (reached terminating NUL-byte)
if (!needle[i])
return haystack;
// no match, step forward
haystack++;
}
// not found
return NULL;
}
show
simple algorithm (binary)
const char* searchSimple(const char* haystack, size_t haystackLength,
const char* needle, size_t needleLength)
{
// detect invalid input
if (!haystack || !needle || haystackLength < needleLength)
return NULL;
// match impossible if less than needleLength bytes left
haystackLength -= needleLength - 1;
// points beyond last considered start byte
const char* haystackEnd = haystack + haystackLength;
// until the end of haystack (or a match, of course ...)
while (haystack != haystackEnd)
{
// compare current haystack to needle
size_t i = 0;
for (; i < needleLength; i++)
if (haystack[i] != needle[i])
break;
// needle fully matched
if (i == needleLength)
return haystack;
// no match, step forward
haystack++;
}
// not found
return NULL;
}
Native Algorithm
This algorithm is pretty much identical to the Simple Algorithm but uses the functionmemchr
to locate the first character and then compares the remainder with memcmp
.These two function are standard C and can be found in
#include <string.h>
.
They are implemented extremely efficient and often compare multiple bytes at once.As you can see in the Live Demo, it's multiple times faster than the Simple Algorithm in its original form.
About 20 lines of the code are spent on speeding up the case of a two-character needle. Optimizing for the three-character case as well doesn't help much because
memcmp
handles it pretty well.The phrase Native Algorithm is an invention of mine. I just ran out of ideas how to call it ...
hide
native algorithm (binary)
const char* searchNative(const char* haystack, size_t haystackLength,
const char* needle, size_t needleLength)
{
// uses memchr() for the first byte, then memcmp to verify it's a valid match
// detect invalid input
if (!haystack || !needle || haystackLength < needleLength)
return NULL;
// empty needle matches everything
if (needleLength == 0)
return haystack;
// shorter code for just one character
if (needleLength == 1)
return (const char*)memchr(haystack, *needle, haystackLength);
haystackLength -= needleLength - 1;
// points beyond last considered byte
const char* haystackEnd = haystack + haystackLength;
// look for first byte
while ((haystack = (const char*)memchr(haystack, *needle, haystackLength)) != NULL)
{
// does last byte match, too ?
if (haystack[needleLength - 1] == needle[needleLength - 1])
// okay, perform full comparison, skip first and last byte (if just 2 bytes => already finished)
if (needleLength == 2 || memcmp(haystack + 1, needle + 1, needleLength - 2) == 0)
return haystack;
// compute number of remaining bytes
haystackLength = haystackEnd - haystack;
if (haystackLength == 0)
return NULL;
// keep going
haystack++;
haystackLength--;
}
// needle not found in haystack
return NULL;
}
memmem
memmem
is a GNU extension to the C standard library and not available for Visual C++.Its MAN page says it was broken in early releases and only became stable in glibc 2.1.
You need to define
_GNU_SOURCE
in your code:
#define _GNU_SOURCE
#include <string.h>
void* memmem(const void* haystack, size_t haystacklen, const void* needle, size_t needlelen);
strstr
strstr
is the standard C string search function.Implementations vary but my observation is that
strstr
is by far the slowest algorithms on GCC, CLang and Visual C++
(well, their respective Standard C libraries).
#include <string.h>
char* strstr(const char* haystack, const char* needle);
Knuth-Morris-Pratt Algorithm
The Knuth-Morris-Pratt algorithm improves on the simple algorithm by skipping multiple characters in case of a mismatch.It's main drawback is the pre-processing stage which builds a table with
needleLength
pointers/offsets.You can find a detailled description on Wikipedia including examples.
My code tries to avoid dynamic allocation of the table if the search phrase is short (and puts the stack on the stack).
Change the constant
MaxLocalMemory
if you have more/less stack available in your system.
hide
Knuth-Morris-Pratt algorithm (strings)
const char* searchKnuthMorrisPrattString(const char* haystack, const char* needle)
{
// detect invalid input
if (!haystack || !needle)
return NULL;
// empty needle matches everything
size_t needleLength = strlen(needle);
if (needleLength == 0)
return haystack;
// try to use stack instead of heap (avoid slow memory allocations if possible)
const size_t MaxLocalMemory = 256;
int localMemory[MaxLocalMemory];
int* skip = localMemory;
// stack too small => allocate heap
if (needleLength > MaxLocalMemory)
{
skip = (int*)malloc(needleLength * sizeof(int));
if (skip == NULL)
return NULL;
}
// prepare skip table
skip[0] = -1;
int i;
for (i = 0; needle[i]; i++)
{
skip[i + 1] = skip[i] + 1;
while (skip[i + 1] > 0 && needle[i] != needle[skip[i + 1] - 1])
skip[i + 1] = skip[skip[i + 1] - 1] + 1;
}
// assume no match
const char* result = NULL;
int shift = 0;
// search
while (*haystack)
{
// look for a matching character
while (shift >= 0 && *haystack != needle[shift])
shift = skip[shift];
// single step forward in needle and haystack
haystack++;
shift++;
// reached end of needle => hit
if (!needle[shift])
{
result = haystack - shift;
break;
}
}
// clean up heap (if used)
if (skip != localMemory)
free(skip);
// points to match position or NULL if not found
return result;
}
show
Knuth-Morris-Pratt algorithm (binary)
const char* searchKnuthMorrisPratt(const char* haystack, size_t haystackLength,
const char* needle, size_t needleLength)
{
// detect invalid input
if (!haystack || !needle || haystackLength < needleLength)
return NULL;
// empty needle matches everything
if (needleLength == 0)
return haystack;
// try to use stack instead of heap (avoid slow memory allocations if possible)
const size_t MaxLocalMemory = 256;
int localMemory[MaxLocalMemory];
int* skip = localMemory;
// stack too small => allocate heap
if (needleLength > MaxLocalMemory)
{
skip = (int*)malloc(needleLength * sizeof(int));
if (skip == NULL)
return NULL;
}
// prepare skip table
skip[0] = -1;
size_t i;
for (i = 0; i < needleLength; i++)
{
skip[i + 1] = skip[i] + 1;
while (skip[i + 1] > 0 && needle[i] != needle[skip[i + 1] - 1])
skip[i + 1] = skip[skip[i + 1]-1] + 1;
}
// assume no match
const char* result = NULL;
const char* haystackEnd = haystack + haystackLength;
int shift = 0;
// search
while (haystack != haystackEnd)
{
// look for a matching character
while (shift >= 0 && *haystack != needle[shift])
shift = skip[shift];
// single step forward in needle and haystack
haystack++;
shift++;
// reached end of needle => hit
if ((size_t)shift == needleLength)
{
result = haystack - shift;
break;
}
}
// clean up heap (if used)
if (skip != localMemory)
free(skip);
// points to match position or NULL if not found
return result;
}
Boyer-Moore-Horspool Algorithm
In contrast to the Knuth-Morris-Pratt algorithm, the Boyer-Moore-Horspool algorithm tries to match the end of the search phrase first.It needs a temporary table, too, but its size is constant and depends on the size of the alphabet which is 256 for ASCII strings.
The initialisation is a bit more costly.
Again, Wikipedia is a good source of information if you want to go deeper into detail.
show
Boyer-Moore-Horspool algorithm (strings)
const char* searchBoyerMooreHorspoolString(const char* haystack, const char* needle)
{
// detect invalid input
if (!haystack || !needle)
return NULL;
// call routine for non-text data
return searchBoyerMooreHorspool(haystack, strlen(haystack), needle, strlen(needle));
}
hide
Boyer-Moore-Horspool algorithm (binary)
const char* searchBoyerMooreHorspool(const char* haystack, size_t haystackLength,
const char* needle, size_t needleLength)
{
// detect invalid input
if (!haystack || !needle || haystackLength < needleLength)
return NULL;
// empty needle matches everything
if (needleLength == 0)
return haystack;
// find right-most position of each character
// and store its distance to the end of needle
// default value: when a character in haystack isn't in needle, then
// we can jump forward needleLength bytes
const size_t NumChar = 1 << (8 * sizeof(char));
size_t skip[NumChar];
size_t i;
for (i = 0; i < NumChar; i++)
skip[i] = needleLength;
// figure out for each character of the needle how much we can skip
// (if a character appears multiple times in needle, later occurrences
// overwrite previous ones, i.e. the value of skip[x] decreases)
const size_t lastPos = needleLength - 1;
size_t pos;
for (pos = 0; pos < lastPos; pos++)
skip[(unsigned char)needle[pos]] = lastPos - pos;
// now walk through the haystack
while (haystackLength >= needleLength)
{
// all characters match ?
for (i = lastPos; haystack[i] == needle[i]; i--)
if (i == 0)
return haystack;
// no match, jump ahead
unsigned char marker = (unsigned char) haystack[lastPos];
haystackLength -= skip[marker];
haystack += skip[marker];
}
// needle not found in haystack
return NULL;
}
Bitap Algorithm
The Bitap algorithm is often called Baeza-Yates-Gonnet algorithm. It's original purpose was approximate string matching.The code below is a simplified version and only supports exact string matching.
The algorithm is based on bit manipulation. Each of needle's characters is represented by a bit.
My implementation is restricted to at most 32 byte long needles on 32-bit systems or at most 64 byte long needles on 64-bit systems.
A temporary table with 256 words is created on the stack.
It's worth noting that the Bitap algorithm's speed is only affected by the length of haystack and needle, not by their content.
In other words: it's very predictable, which might be extremeful interesting for realtime systems.
Please check Wikipedia for further information.
hide
Bitap algorithm (strings)
const char* searchBitapString(const char* haystack, const char* needle)
{
// detect invalid input
if (!haystack || !needle)
return NULL;
// empty needle matches everything
size_t needleLength = strlen(needle);
if (needleLength == 0)
return haystack;
// create bit masks for each possible byte / ASCII character
// each mask is as wide as needleLength
const size_t MaxBitWidth = 8 * sizeof(int) - 1;
// only if needleLength bits fit into an integer (minus 1), the algorithm will be fast
if (needleLength > MaxBitWidth)
return searchSimpleString(haystack, needle);
// one mask per allowed character (1 byte => 2^8 => 256)
// where all bits are set except those where the character is found in needle
const size_t AlphabetSize = 256;
unsigned int masks[AlphabetSize];
size_t i;
for (i = 0; i < AlphabetSize; i++)
masks[i] = ~0;
for (i = 0; i < needleLength; i++)
masks[(unsigned char)needle[i]] &= ~(1 << i);
// initial state mask has all bits set except the lowest one
unsigned int state = ~1;
const unsigned int FullMatch = 1 << needleLength;
while (*haystack)
{
// update the bit array
state |= masks[(unsigned char)*haystack];
state <<= 1;
// if an unset bit "bubbled up" we have a match
if ((state & FullMatch) == 0)
return (haystack - needleLength) + 1;
haystack++;
}
// needle not found in haystack
return NULL;
}
show
Bitap algorithm (binary)
const char* searchBitap(const char* haystack, size_t haystackLength,
const char* needle, size_t needleLength)
{
// detect invalid input
if (!haystack || !needle || haystackLength < needleLength)
return NULL;
// empty needle matches everything
if (needleLength == 0)
return haystack;
// create bit masks for each possible byte / ASCII character
// each mask is as wide as needleLength
const size_t MaxBitWidth = 8 * sizeof(int) - 1;
// only if needleLength bits fit into an integer (minus 1), the algorithm will be fast
if (needleLength > MaxBitWidth)
return searchNative(haystack, haystackLength, needle, needleLength);
// one mask per allowed character (1 byte => 2^8 => 256)
// where all bits are set except those where the character is found in needle
const size_t AlphabetSize = 256;
unsigned int masks[AlphabetSize];
size_t i;
for (i = 0; i < AlphabetSize; i++)
masks[i] = ~0;
for (i = 0; i < needleLength; i++)
masks[(unsigned char)needle[i]] &= ~(1 << i);
// points beyond last considered byte
const char* haystackEnd = haystack + haystackLength;
// initial state mask has all bits set except the lowest one
unsigned int state = ~1;
const unsigned int FullMatch = 1 << needleLength;
while (haystack != haystackEnd)
{
// update the bit array
state |= masks[(unsigned char)*haystack];
state <<= 1;
// if an unset bit "bubbled up" we have a match
if ((state & FullMatch) == 0)
return (haystack - needleLength) + 1;
haystack++;
}
// needle not found in haystack
return NULL;
}
Rabin-Karp Algorithm
The Rabin-Karp algorithm is based on rolling hashes. Due to hash collisions false positives may be found.Therefore each potential match must be checked again (simplest way:
memcmp
).The main strength of this algorithm is its ability to search for multiple phrases in one pass.
However, I didn't implement this feature to keep things simple. With a single
needle
there is no need for memory allocation, too.My hashing routine is a basic additive hash. More elaborate hashes produce way less hash collisions but can be considerably slower.
And, of course, Wikipedia is your friend when it comes to specifics.
hide
Rabin-Karp algorithm (strings)
const char* searchRabinKarpString(const char* haystack, const char* needle)
{
// detect invalid input
if (!haystack || !needle)
return NULL;
// empty needle matches everything
if (!*needle)
return haystack;
// find first match of the first letter
haystack = strchr(haystack, *needle);
if (!haystack)
return NULL;
// now first letter of haystack and needle is identical
// let's compute the sum of all characters of needle
unsigned int hashNeedle = *needle;
unsigned int hashHaystack = *haystack;
const char* scanNeedle = needle + 1;
const char* scanHaystack = haystack + 1;
while (*scanNeedle && *scanHaystack)
{
hashNeedle += *scanNeedle++;
hashHaystack += *scanHaystack++;
}
// if scanNeedle doesn't point to zero, then we have too little haystack
if (*scanNeedle)
return NULL;
// length of needle
const size_t needleLength = scanNeedle - needle;
// walk through haystack and roll the hash
for (;;)
{
// identical hash ?
if (hashHaystack == hashNeedle)
{
// can be a false positive, therefore must check all characters again
if (memcmp(haystack, needle, needleLength) == 0)
return haystack;
}
// no more bytes left ?
if (!*scanHaystack)
break;
// update hash
hashHaystack -= *haystack++;
hashHaystack += *scanHaystack++;
}
// needle not found in haystack
return NULL;
}
show
Rabin-Karp algorithm (binary)
const char* searchRabinKarp(const char* haystack, size_t haystackLength,
const char* needle, size_t needleLength)
{
// detect invalid input
if (!haystack || !needle || haystackLength < needleLength)
return NULL;
// empty needle matches everything
if (needleLength == 0)
return haystack;
// one byte beyond last position where a match can begin
const char* haystackEnd = haystack + haystackLength - needleLength + 1;
// find first match of the first letter
haystack = (const char*)memchr(haystack, *needle, haystackLength);
if (!haystack)
return NULL;
// now first letter of haystack and needle is identical
// let's compute the sum of all characters of needle
unsigned int hashNeedle = 0;
unsigned int hashHaystack = 0;
size_t i;
for (i = 0; i < needleLength; i++)
{
// not enough haystack left ?
if (!haystack[i])
return NULL;
hashNeedle += needle [i];
hashHaystack += haystack[i];
}
// walk through haystack and roll the hash computation
while (haystack != haystackEnd)
{
// identical hash ?
if (hashHaystack == hashNeedle)
{
// can be a false positive, therefore must check all characters again
if (memcmp(haystack, needle, needleLength) == 0)
return haystack;
}
// update hash
hashHaystack += *(haystack + needleLength);
hashHaystack -= *haystack++;
}
// needle not found in haystack
return NULL;
}