Tiny Regular Expression Matcher
posted by Stephan Brumme
Regular expressions (regex) can be the gate to heaven as well as the gate to hell. Usually they are both at the same time.
Writing correct regular expression is sometimes as tough as writing a regular expression parser.
Chapter 1 of the excellent book "Beautiful Code" (ISBN 9780596510046) discusses Rob Pike's very short and elegant regex parser code.
It doesn't come with all the bells and whistles of a full Perl regular expression matcher but is quite sufficient for many purposes.
I extended Rob's C code in such a way that it can handle the following regex meta-symbols (see limitations, too):
Regex Symbol | Meaning | Already in Rob Pike's code ? |
---|---|---|
. | match any character | yes |
* | the previous character is arbitrarily repeated | yes |
+ | the previous character occurs at least once | no |
? | the previous character occurs once or not at all | no |
^ | match regular expression only to the beginning of the text | yes |
$ | match regular expression only to the end of the text | yes |
\ | treat next character as a literal, even if it is a regex symbol | no |
.h
and one .c
file without any external dependencies - not even #include <stdlib.h>
.The source file
match.c
contains about 100 lines of straight C code (plus tons of comments and blank lines).Take a look at my explanation of the matching algorithm.
A very basic use case looks as follows:

const char* text = "aabc";
const char* regex = "^a+b";
if (match(text, regex))
printf("yes, %s matches the pattern %s", text, regex);
$
symbols were used in regex
and regex
doesn't match an empty text
, too):

const char* text = "aabc";
const char* regex = "a*b";
const char* where = text;
while (where = match(where, regex))
{
printf("%s matches the pattern %s at offset %d", text, regex, where - text);
where++;
}
match.h
consists of just 2 public functions:
// //////////////////////////////////////////////////////////
// match.h
// Copyright (c) 2015 Stephan Brumme. All rights reserved.
// see http://create.stephan-brumme.com/disclaimer.html
//
#pragma once
// allowed metasymbols:
// ^ -> text must start with pattern (only allowed as first symbol in regular expression)
// $ -> text must end with pattern (only allowed as last symbol in regular expression)
// . -> any literal
// x? -> literal x occurs zero or one time
// x* -> literal x is repeated arbitrarily
// x+ -> literal x is repeated at least once
/// points to first regex match or NULL if not found, text is treated as one line and ^ and $ refer to its start/end
const char* match (const char* text, const char* regex);
/// points to first regex match or NULL if not found, text is split into lines (\n) and ^ and $ refer to each line's start/end
const char* matchlines (const char* text, const char* regex);
CRC32:
6a11ebb0
MD5:
8a302fa5a47ffd48dd43d228b8e24ac8
SHA1:
dbb27490bf0fc2cf00fe4db8173f0ba7c9d0df66
SHA256:
5fed7747dd69bdce3ce0470ba29f0e9a2315d3743b55bec173c147078129473d

// //////////////////////////////////////////////////////////
// match.c
// Copyright (c) 2015 Stephan Brumme. All rights reserved.
// see http://create.stephan-brumme.com/disclaimer.html
//
// I improved Rob Pike's code found in the excellent book "Beautiful Code", ISBN 9780596510046
// http://www.cs.princeton.edu/courses/archive/spr09/cos333/beautiful.html
// additions (already mentioned in his extension proposals):
// - match a symbol which is repeated at least once "x+" (which is the same as "xx*")
// - match a symbol which appears at most once "x?"
// - regex can contain escaped metasymbols => "\." searches for a dot
// - matchlines splits input into lines
#include "match.h"
// we have to define NULL because we don't include string.h directly or indirectly
#define NULL 0
// track escaped characters
#define YES 1
#define NO 0
// Windows uses \r\n for newlines, Unix only \n
#define PROCESS_CRLF 1
/// matchhere: search for regex only at beginning of text, stop at textEnd or when a zero-byte is found
/** local helper function for match():
@param text - text to be matched
@param textEnd - last byte of text to be considered, can be NULL if text is zero-terminated
@param regex - simplified regular expression as explained in the header file
@param escaped - YES treats first character of regex as a literal only and ignores their meta symbol meaning, but NO is default
@return if successful points beyond last matching literal, else NULL
**/
static const char* matchhere(const char* text, const char* textEnd, const char* regex, int escaped)
{
// process current byte
if (escaped == NO)
{
switch (*regex)
{
case 0: // regex empty ?
return text;
case '\\': // treat next character as a literal, not a metasymbol
return matchhere(text, textEnd, regex + 1, YES);
case '^': // ^ was already handled in match() / matchlines()
return NULL;
}
}
// lookahead one byte (i.e. watch out for meta-symbols)
switch (regex[1])
{
case '+': // symbol appears at least once
while (*regex == *text || // keep running if next symbol matches
(*regex == '.' && escaped == NO && text != textEnd && *text != 0))
{
// match remainder
const char* matched = matchhere(++text, textEnd, regex + 2, NO);
if (matched != NULL)
return matched;
}
// running out of text
break; // failed
case '*': // symbol appears arbitrarily often
// eat text symbol-by-symbol and try to match remaining regex
do
{
// match remaining regex
const char* matched = matchhere(text, textEnd, regex + 2, NO);
if (matched != NULL)
return matched;
} while (*regex == *text++ || // keep running if next symbol matches
(*regex == '.' && escaped == NO && text != textEnd && *text != 0));
// running out of text
break; // failed
case '?': // symbol appears exactly zero or one time
// assume symbol doesn't appear
{
const char* matched = matchhere(text, textEnd, regex + 2, NO);
if (matched != NULL)
return matched;
}
// continue matching at next position only if symbol appears
if (*regex == *text || (*regex == '.' && escaped == NO && *text != 0))
return matchhere(text + 1, textEnd, regex + 2, NO);
break; // failed
case 0: // *regex will be the last symbol
// dollar matches only end of text
if (*regex == '$' && escaped == NO)
{
if (text == textEnd || *text == 0)
return text;
break; // failed
}
// nothing left for further matching
if (text == textEnd || *text == 0)
break; // failed
// does last literal match ?
return (*regex == *text || (*regex == '.' && escaped == NO)) ? text + 1 : NULL;
default: // no meta-symbol, just plain character
// nothing left for further matching
if (text == textEnd || *text == 0)
break; // failed
// $ must be followed by 0 (previous case)
if (*regex == '$' && escaped == NO)
break; // failed
// same character (or any character) ?
if (*regex == *text || (*regex == '.' && escaped == NO))
return matchhere(text + 1, textEnd, regex + 1, NO);
break; // failed
}
// failed
return NULL;
}
/// points to first regex match or NULL if not found, text is treated as one line and ^ and $ refer to its start/end
/** @param text - text to be matched
@param regex - simplified regular expression as explained in the header file
@return if successful points to the next matching literal, else NULL if no match
**/
const char* match(const char* text, const char* regex)
{
// detect invalid input
if (!text || !regex)
return NULL;
// match only start
if (*regex == '^')
// matchhere looks out for zero byte if textEnd parameter is set to NULL
return matchhere(text, NULL, regex + 1, NO) ? text : NULL;
// try to start match at every text position
do
{
// match found ?
if (matchhere(text, NULL, regex, NO))
return text;
} while (*text++ != 0);
// failed to match
return NULL;
}
/// points to first regex match or NULL if not found, text is split into lines (\n) and ^ and $ refer to each line's start/end
/** @param text - text to be matched
@param regex - simplified regular expression as explained in the header file
@return if successful points to the next matching literal, else NULL if no match
**/
const char* matchlines(const char* text, const char* regex)
{
// detect invalid input
if (!text || !regex)
return NULL;
// until the end of the world ...
do
{
// use simple matching for end-of-text
if (*text == 0)
return match(text, regex);
// find extents of current line
const char* right = text;
#ifdef PROCESS_CRLF
while (*right != 0 && *right != '\n' && *right != '\r')
right++;
#else
while (*right != 0 && *right != '\n')
right++;
#endif
// ^ refers to initial text pointer position in the first line,
// thereafter to each start of a new line
// => if initial text pointer refers to the middle of a line, a false ^ match may be produced
// match only start
if (*regex == '^')
{
if (matchhere(text, right, regex + 1, NO))
return text;
}
else
{
// try to start match at every text position
while (text <= right)
{
// match found ?
if (matchhere(text, right, regex, NO))
return text;
// no match yet, continue
text++;
}
}
// skip forward to next line
text = right;
// skip new-line
#ifdef PROCESS_CRLF
while (*text == '\r')
text++;
#endif
if (*text == '\n')
text++;
} while (*text++ != 0);
// failed to match
return NULL;
}
CRC32:
9766b355
MD5:
b125c34b1fd553c05b0e258546be8c87
SHA1:
c7d15fc1dff69634c857092c84ae5d5cc62589fa
SHA256:
2ffaf82d521fd370353cf165065215a3d948b173a9e7db99760b6d14e79657ab
Stay up-to-date:git clone https://create.stephan-brumme.com/tiny-regular-expression-matcher/git
GitHub mirror:https://github.com/stbrumme/tiny-regular-expression-matcher
If you encounter any bugs/problems or have ideas for improving future versions, please write me an email: create@stephan-brumme.com
License
This code is licensed under the zlib License:This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software. Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions: 1. The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required. 2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software. 3. This notice may not be removed or altered from any source distribution.zlib License
Changelog
- version 1
- August 11, 2015
- initial release
- Git tag
tiny_regular_expression_matcher_v1
Algorithm
The core functionmatchhere
tries to find the first match of the regular expression regex
in text
and returns a pointer to that position.Search is aborted if the end of
text
is reached, i.e. *text == 0
(zero-terminated C string) or text == textEnd
.The latter is important to successfully match the symbol
$
against the end of a line which isn't necessarily the end of text
.matchhere
invokes itself recursively and processes one or two bytes of regex
in each call.If it fails, the
text
pointer will be incremented and the same bytes of regex
will be matched again.Therefore some special cases might have a runtime somewhere close to O(length(
text
*regex
).- Look at current byte of
regex
: - if
regex
is empty, then currenttext
is successfully matched - if
*regex
is a backslash\
then the next byte ofregex
must be treated as a plain character - if
*regex
is^
thenregex
is invalid (onlymatch
andmatchlines
check for^
) - Take a lookahead at the next bytes of
regex
: - if
regex[1]
is+
then as long as*text
equals*regex
callmatchhere(++text, regex + 2)
until a match is found - if
regex[1]
is*
then callmatchhere(++text, regex + 2)
until a match is found or*text
is not equal to*regex
- if
regex[1]
is?
then matchtext
against the remainder ofregex
(callmatchhere(text, regex + 2)
), if failed, try the same withtext + 1
- if
regex[1]
is the end ofregex
, then*regex
must be either$
or.
or identical to*text
- in every other case
*regex
must be either.
or identical to*text
match
and matchlines
only provide thin wrappers for single-line (match
) and multi-line input (matchlines
).Test Suite
My library passes all these tests successfully (seetest.c
in the Git repository):test case | text | regex | match(text, regex) |
correct | Rob Pike's result |
---|---|---|---|---|---|
Empty1 | "" | "" | yes | yes | yes |
Empty2 | "abc" | "" | yes | yes | yes |
Empty3 | "" | "a?" | yes | yes | unsupported, failed |
Empty4 | "" | ".?" | yes | yes | unsupported, failed |
Empty5 | "" | "a*" | yes | yes | yes |
Empty6 | "" | ".*" | yes | yes | yes |
Empty7 | "" | "^" | yes | yes | yes |
Empty8 | "" | "$" | yes | yes | yes |
Empty9 | "" | "^$" | yes | yes | yes |
Empty10 | "" | "^.?" | yes | yes | unsupported, failed |
Empty11 | "" | "^.*" | yes | yes | yes |
Empty12 | "" | ".?$" | yes | yes | unsupported, failed |
Empty13 | "" | ".*$" | yes | yes | yes |
Empty14 | "" | "^.?$" | yes | yes | unsupported, failed |
Empty15 | "" | "^.*$" | yes | yes | yes |
Empty16 | "" | "." | no | yes | yes |
Empty17 | "" | "^." | no | yes | yes |
Empty18 | "" | ".$" | no | yes | yes |
Empty19 | "" | "^.$" | no | yes | yes |
Exact1 | "abc" | "abc" | yes | yes | yes |
Exact2 | "abc" | "ab" | yes | yes | yes |
Exact3 | "abc" | "bc" | yes | yes | yes |
Exact4 | "abc" | "^a" | yes | yes | yes |
Exact5 | "abc" | "^ab" | yes | yes | yes |
Exact6 | "abc" | "^abc" | yes | yes | yes |
Exact7 | "abc" | "^abc$" | yes | yes | yes |
Exact8 | "abc" | "abc$" | yes | yes | yes |
Exact9 | "abc" | "bc$" | yes | yes | yes |
Exact10 | "abc" | "c$" | yes | yes | yes |
Exact11 | "abc" | "Abc" | no | yes | yes |
Exact12 | "aabc" | "abc" | yes | yes | yes |
Exact13 | "aabcc" | "abc" | yes | yes | yes |
Exact14 | "abcc" | "abc" | yes | yes | yes |
Exact15 | "aabc" | "^abc" | no | yes | yes |
Exact16 | "aabc" | "abc$" | yes | yes | yes |
Exact17 | "abcc" | "abc$" | no | yes | yes |
Exact18 | "abcc" | "^abc" | yes | yes | yes |
Star1 | "a" | "a*" | yes | yes | yes |
Star2 | "aa" | "a*" | yes | yes | yes |
Star3 | "b" | "a*" | yes | yes | yes |
Star4 | "ab" | "a*" | yes | yes | yes |
Star5 | "aab" | "a*" | yes | yes | yes |
Star6 | "a" | "^a*" | yes | yes | yes |
Star7 | "aa" | "^a*" | yes | yes | yes |
Star8 | "b" | "^a*" | yes | yes | yes |
Star9 | "ab" | "^a*" | yes | yes | yes |
Star10 | "aab" | "^a*" | yes | yes | yes |
Star11 | "a" | "a*$" | yes | yes | yes |
Star12 | "aa" | "a*$" | yes | yes | yes |
Star13 | "b" | "a*$" | yes | yes | yes |
Star14 | "ba" | "a*$" | yes | yes | yes |
Star15 | "baa" | "a*$" | yes | yes | yes |
Star16 | "a" | "^a*$" | yes | yes | yes |
Star17 | "aa" | "^a*$" | yes | yes | yes |
Star18 | "b" | "^a*$" | no | yes | yes |
Star19 | "a" | "a*a" | yes | yes | yes |
Star20 | "aa" | "a*a" | yes | yes | yes |
Plus1 | "a" | "a+" | yes | yes | unsupported, failed |
Plus2 | "aa" | "a+" | yes | yes | unsupported, failed |
Plus3 | "b" | "a+" | no | yes | unsupported, yes |
Plus4 | "ab" | "a+" | yes | yes | unsupported, failed |
Plus5 | "aab" | "a+" | yes | yes | unsupported, failed |
Plus6 | "a" | "^a+" | yes | yes | unsupported, failed |
Plus7 | "aa" | "^a+" | yes | yes | unsupported, failed |
Plus8 | "b" | "^a+" | no | yes | unsupported, yes |
Plus9 | "ab" | "^a+" | yes | yes | unsupported, failed |
Plus10 | "aab" | "^a+" | yes | yes | unsupported, failed |
Plus11 | "a" | "a+$" | yes | yes | unsupported, failed |
Plus12 | "aa" | "a+$" | yes | yes | unsupported, failed |
Plus13 | "b" | "a+$" | no | yes | unsupported, yes |
Plus14 | "ba" | "a+$" | yes | yes | unsupported, failed |
Plus15 | "baa" | "a+$" | yes | yes | unsupported, failed |
Plus16 | "a" | "^a+$" | yes | yes | unsupported, failed |
Plus17 | "aa" | "^a+$" | yes | yes | unsupported, failed |
Plus18 | "b" | "^a+$" | no | yes | unsupported, yes |
Plus19 | "a" | "a+a" | no | yes | unsupported, yes |
Plus20 | "aa" | "a+a" | yes | yes | unsupported, failed |
Dot1 | "a" | "." | yes | yes | unsupported, yes |
Dot2 | "." | "." | yes | yes | unsupported, yes |
Dot3 | "ab" | "." | yes | yes | unsupported, yes |
Dot4 | "a" | ".?" | yes | yes | unsupported, failed |
Dot5 | "ab" | ".?" | yes | yes | unsupported, failed |
Dot6 | "a" | "^." | yes | yes | unsupported, yes |
Dot7 | "ab" | "^." | yes | yes | unsupported, yes |
Dot8 | "a" | ".$" | yes | yes | unsupported, yes |
Dot9 | "ab" | ".$" | yes | yes | unsupported, yes |
Dot10 | "a" | "^.$" | yes | yes | unsupported, yes |
Dot11 | "ab" | "^.$" | no | yes | unsupported, yes |
Escape1 | "." | "\\." | yes | yes | unsupported, failed |
Escape2 | "a" | "\\." | no | yes | unsupported, yes |
Escape3 | "a" | "\\.?" | yes | yes | unsupported, failed |
Escape4 | "a" | "\\.*" | yes | yes | unsupported, failed |
Escape5 | "a" | "\\.+" | no | yes | unsupported, yes |
Escape6 | "*" | "\\*" | yes | yes | unsupported, failed |
Escape7 | "a" | "\\*" | no | yes | unsupported, yes |
Escape8 | "a" | "\\*?" | yes | yes | unsupported, failed |
Escape9 | "a" | "\\**" | yes | yes | unsupported, failed |
Escape10 | "a" | "\\*+" | no | yes | unsupported, yes |
Escape11 | "+" | "\\+" | yes | yes | unsupported, failed |
Escape12 | "a" | "\\+" | no | yes | unsupported, yes |
Escape13 | "a" | "\\+?" | yes | yes | unsupported, failed |
Escape14 | "a" | "\\+*" | yes | yes | unsupported, failed |
Escape15 | "a" | "\\++" | no | yes | unsupported, yes |
Escape16 | "^" | "\\^" | yes | yes | unsupported, failed |
Escape17 | "a" | "\\^" | no | yes | unsupported, yes |
Escape18 | "a" | "\\^?" | yes | yes | unsupported, failed |
Escape19 | "a" | "\\^*" | yes | yes | unsupported, failed |
Escape20 | "a" | "\\^+" | no | yes | unsupported, yes |
Escape21 | "$" | "\\$" | yes | yes | unsupported, failed |
Escape22 | "a" | "\\$" | no | yes | unsupported, yes |
Escape23 | "a" | "\\$?" | yes | yes | unsupported, failed |
Escape24 | "a" | "\\$*" | yes | yes | unsupported, failed |
Escape25 | "a" | "\\$+" | no | yes | unsupported, yes |
Escape26 | "?" | "\\?" | yes | yes | unsupported, failed |
Escape27 | "a" | "\\?" | no | yes | unsupported, yes |
Escape28 | "a" | "\\??" | yes | yes | unsupported, failed |
Escape29 | "a" | "\\?*" | yes | yes | unsupported, failed |
Escape30 | "a" | "\\?+" | no | yes | unsupported, yes |
Escape31 | "\\" | "\\" | no | yes | unsupported, yes |
Escape32 | "\\" | "\\\\" | yes | yes | unsupported, failed |
Escape33 | "a\n" | "\n" | yes | yes | yes |
Begin1 | "^" | "^^" | no | yes | failed |
Begin2 | "a" | "a^" | no | yes | misinterpreted, yes |
Begin3 | "^" | "\\^" | yes | yes | unsupported, failed |
Begin4 | "^" | "^\\^" | yes | yes | unsupported, failed |
Begin5 | "^a" | "^\\^" | yes | yes | unsupported, failed |
Begin6 | "^" | "\\^^" | no | yes | unsupported, yes |
End1 | "$" | "$$" | no | yes | failed |
End2 | "a" | "$a" | no | yes | misinterpreted, yes |
End3 | "$" | "\\$" | yes | yes | unsupported, failed |
End4 | "$" | "\\$$" | yes | yes | unsupported, failed |
End5 | "a$" | "\\$$" | yes | yes | unsupported, failed |
End6 | "$" | "$\\$" | no | yes | unsupported, yes |
test case | text | regex | matchlines(text, regex) |
correct |
---|---|---|---|---|
LinesExact1 | "" | "" | 1 | yes |
LinesExact2 | "\n" | "" | 2 | yes |
test case | text | regex | how often succeeds match(text, regex) |
correct |
---|---|---|---|---|
CountExact1 | "" | "a" | 0 | yes |
CountExact2 | "a" | "a" | 1 | yes |
CountExact3 | "aa" | "a" | 2 | yes |
CountExact4 | "aaa" | "a" | 3 | yes |
CountStar1 | "aaa" | "a*" | 4 | yes |
CountPlus1 | "aaa" | "a+" | 3 | yes |
CountDot1 | "aaa" | "." | 3 | yes |
CountDot2 | "aba" | "." | 3 | yes |
CountDot3 | "aba" | ".." | 2 | yes |
CountDot4 | "aba" | "..." | 1 | yes |
CountDot5 | "aba" | "...." | 0 | yes |
CountDot6 | "aba" | "....?" | 1 | yes |
CountDot7 | "aba" | "..?.." | 1 | yes |
CountDot8 | "aba" | "..?..?" | 2 | yes |
CountDot9 | "" | "." | 0 | yes |
CountDot10 | "" | ".?" | 1 | yes |
CountEnd1 | "aaa" | "a$" | 1 | yes |
CountEnd2 | "aaa" | "$" | 1 | yes |
Limitations
The real magic of regular expression parsers is beyond the scope of this short posting.Advanced stuff like groups, alternatives, quantifiers, character classes and back-references were deliberately NOT implemented.
If you need those features, look out for one of the big regex implementations, C++11's
regex
is a good start.Runtime efficiency can be a serious problem for certain use cases. My library wasn't optimized for speed and therefore some regular expression may take a long time.
However, most regex sift through the public domain version of Bram Stoker's "Dracula" within less than 0.01 seconds (i.e. throughput of about 80 MByte/sec on a modest CPU).