
Practical String Searching

posted by Stephan Brumme

Motivation

There is a huge number of scientific papers on fast, faster and super-fast string searching algorithms.
They usually prove theoretical performance in O-Notation and most papers cover memory consumption as well.

However, theoretical performance isn't always the same as practical performance. That's why I always want to measure real-world throughput: this article presents hopefully understandable C implementations of the most common generic string search algorithms.

In addition I also wrote a simple tool called mygrep that prints all lines of a file where a search phrase is found.
It doesn't come with all the bells and whistles of the Unix tool grep but achieves similar or sometimes even better speed.
A direct comparison can be done online; just scroll down to the Live Demo.

Playing around with various search phrases you will easily see the plain and simple truth: for most situations the naive brute-force algorithm is a great choice.
A slightly modified version based on memchr and memcmp is often several times faster than the second-fastest contender - just because this simple algorithm can make efficient use of modern CPUs' optimized string opcodes and fits well into the caches.

These generic routines might slow down considerably in worst-case scenarios, like DNA analysis (small alphabet and many repeated similar parts).
And yes, fgrep is damn fast.

Live Demo

This demo scans through the first 100 MByte of Wikipedia (well, an older snapshot dating back to March 3, 2006) on the server.
Before search starts, the file is mapped into the Linux file cache in order to avoid disk latencies (cat enwik8 > /dev/null).

Notes: execution times can vary significantly, depending on overall server load.
Special characters such as " or \ are disabled to prevent potential security breaches into my server.
The search algorithms actually can search for these special characters (as plain ASCII symbols).
Don't confuse it with meta characters for regular expressions; those are not supported.
Search is case-sensitive.


License

This code is licensed under the zlib License:
This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software.

Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions:

1. The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.

Library

My library is written in C and compiles without problems using GCC/G++, CLang and Visual C++.
There are two variants of each algorithm: one for zero-terminated C strings and one for binary data. Both return the location of the first match or NULL. An empty needle matches the beginning of the haystack.
Invalid pointers are detected and NULL will be returned.

Unicode is not supported, but the second variant processes binary data (which often contains NULL bytes) without problems.

mygrep is a very simple tool to compare the performance of all mentioned string search algorithms.
Similar to fgrep it scans a file for a given search phrase and prints all lines where a match was found.
It's bare-bones: input pipes aren't supported, and memory mapping isn't used either.

Click on show to see the full source code.

Changelog

Download

Download  search.h
Latest release: December 20, 2019, size: 2396 bytes, 45 lines

CRC32: d437e7e5
MD5: 6f41563c56ccb5fd8dc9210446c88c44
SHA1: e761193de07d6b44f74e33dc9dcff0eab13162a4
SHA256: ae13eb72366d3b30e17229de93989299fa3d352c351567de78f219702c47fe6b

Download  search.c
Latest release: December 20, 2019, size: 14.8 kBytes, 557 lines

CRC32: 9ac5957d
MD5: 59659ac5b358108fb5395682466fa364
SHA1: 8d4e2da0d19f6f8dbdbf118b6a5a4cd330305dcd
SHA256: 6ed64272eb2181d4a4fa004cedc12637ded57ee8e4da06e0612197169bd6f7df

Download  mygrep.c
Latest release: December 20, 2019, size: 6.4 kBytes, 243 lines

CRC32: 48d0c214
MD5: 7af4d4c5f6497dcdac1685ab8ab90f4e
SHA1: 655ed76cb3dc0c4dd26cc6957e18298b6d4ef017
SHA256: 9e7775a587000b1ae9a7c7a868b279a65bfac30d49bd6e4329f7ea9a903810f7

If you encounter any bugs/problems or have ideas for improving future versions, please write me an email: create@stephan-brumme.com

Simple Algorithm a.k.a. Brute Force

The simple algorithm consists of two nested loops: The outer loop walks through all bytes/characters of the haystack and the inner loop compares against the needle. As soon as a match was detected, the algorithm aborts and returns its position inside the haystack.

Native Algorithm

This algorithm is pretty much identical to the Simple Algorithm but uses the function memchr to locate the first character and then compares the remainder with memcmp.

These two functions are standard C and declared in #include <string.h>. They are implemented extremely efficiently and often compare multiple bytes at once.
As you can see in the Live Demo, it's multiple times faster than the Simple Algorithm in its original form.

About 20 lines of the code are spent on speeding up the case of a two-character needle. Optimizing for the three-character case as well doesn't help much because memcmp handles it pretty well.

The phrase Native Algorithm is an invention of mine. I just ran out of ideas what to call it ...

memmem

memmem is a GNU extension to the C standard library and not available for Visual C++.
Its MAN page says it was broken in early releases and only became stable in glibc 2.1.

You need to define _GNU_SOURCE in your code:

strstr

strstr is the standard C string search function.

Implementations vary, but my observation is that strstr is by far the slowest algorithm on GCC, CLang and Visual C++ (well, their respective Standard C libraries).

Knuth-Morris-Pratt Algorithm

The Knuth-Morris-Pratt algorithm improves on the simple algorithm by skipping multiple characters in case of a mismatch.
Its main drawback is the pre-processing stage which builds a table with needleLength pointers/offsets.

You can find a detailed description on Wikipedia including examples.

My code tries to avoid dynamic allocation of the table if the search phrase is short (and puts the table on the stack instead).
Change the constant MaxLocalMemory if you have more/less stack available in your system.

Boyer-Moore-Horspool Algorithm

In contrast to the Knuth-Morris-Pratt algorithm, the Boyer-Moore-Horspool algorithm tries to match the end of the search phrase first.
It needs a temporary table, too, but its size is constant and depends on the size of the alphabet which is 256 for ASCII strings.
The initialisation is a bit more costly.

Again, Wikipedia is a good source of information if you want to go deeper into detail.

Bitap Algorithm

The Bitap algorithm is often called Baeza-Yates-Gonnet algorithm. Its original purpose was approximate string matching.
The code below is a simplified version and only supports exact string matching.

The algorithm is based on bit manipulation. Each of needle's characters is represented by a bit.
My implementation is restricted to at most 32 byte long needles on 32-bit systems or at most 64 byte long needles on 64-bit systems.
A temporary table with 256 words is created on the stack.

It's worth noting that the Bitap algorithm's speed is only affected by the length of haystack and needle, not by their content.
In other words: it's very predictable, which might be extremely interesting for realtime systems.

Please check Wikipedia for further information.

Rabin-Karp Algorithm

The Rabin-Karp algorithm is based on rolling hashes. Due to hash collisions false positives may be found.
Therefore each potential match must be checked again (simplest way: memcmp).

The main strength of this algorithm is its ability to search for multiple phrases in one pass.
However, I didn't implement this feature to keep things simple. With a single needle there is no need for memory allocation either.

My hashing routine is a basic additive hash. More elaborate hashes produce way less hash collisions but can be considerably slower.

And, of course, Wikipedia is your friend when it comes to specifics.