Tiny Regular Expression Matcher

posted by Stephan Brumme

Regular expressions (regex) can be the gate to heaven as well as the gate to hell. Usually they are both at the same time.
Writing correct regular expression is sometimes as tough as writing a regular expression parser.

Chapter 1 of the excellent book "Beautiful Code" (ISBN 9780596510046) discusses Rob Pike's very short and elegant regex parser code.
It doesn't come with all the bells and whistles of a full Perl regular expression matcher but is quite sufficient for many purposes.
I extended Rob's C code in such a way that it can handle the following regex meta-symbols (see limitations, too):
Regex Symbol Meaning Already in Rob Pike's code ?
. match any character yes
* the previous character is arbitrarily repeated yes
+ the previous character occurs at least once no
? the previous character occurs once or not at all no
^ match regular expression only to the beginning of the text yes
$ match regular expression only to the end of the text yes
\ treat next character as a literal, even if it is a regex symbol no
My C code consists of just one .h and one .c file without any external dependencies - not even #include <stdlib.h>.
The source file match.c contains about 100 lines of straight C code (plus tons of comments and blank lines).
Take a look at my explanation of the matching algorithm.

You can perform a live search of Bram Stoker's book "Dracula" (public domain, my digital version is 15123 lines long, 838 kBytes):



A very basic use case looks as follows:
hide example: find first match const char* text = "aabc"; const char* regex = "^a+b"; if (match(text, regex)) printf("yes, %s matches the pattern %s", text, regex);
To find all matches (this simplified loop assumes no $ symbols were used in regex and regex doesn't match an empty text, too):
hide example: find all matches const char* text = "aabc"; const char* regex = "a*b"; const char* where = text; while (where = match(where, regex)) { printf("%s matches the pattern %s at offset %d", text, regex, where - text); where++; }
The whole header file match.h consists of just 2 public functions:
hide match.h // ////////////////////////////////////////////////////////// // match.h // Copyright (c) 2015 Stephan Brumme. All rights reserved. // see http://create.stephan-brumme.com/disclaimer.html // #pragma once // allowed metasymbols: // ^ -> text must start with pattern (only allowed as first symbol in regular expression) // $ -> text must end with pattern (only allowed as last symbol in regular expression) // . -> any literal // x? -> literal x occurs zero or one time // x* -> literal x is repeated arbitrarily // x+ -> literal x is repeated at least once /// points to first regex match or NULL if not found, text is treated as one line and ^ and $ refer to its start/end const char* match (const char* text, const char* regex); /// points to first regex match or NULL if not found, text is split into lines (\n) and ^ and $ refer to each line's start/end const char* matchlines (const char* text, const char* regex);
Download  match.h
Latest release: July 31, 2015, size: 936 bytes, 21 lines

CRC32: 6a11ebb0
MD5: 8a302fa5a47ffd48dd43d228b8e24ac8
SHA1: dbb27490bf0fc2cf00fe4db8173f0ba7c9d0df66
SHA256:5fed7747dd69bdce3ce0470ba29f0e9a2315d3743b55bec173c147078129473d


show match.c // ////////////////////////////////////////////////////////// // match.c // Copyright (c) 2015 Stephan Brumme. All rights reserved. // see http://create.stephan-brumme.com/disclaimer.html // // I improved Rob Pike's code found in the excellent book "Beautiful Code", ISBN 9780596510046 // http://www.cs.princeton.edu/courses/archive/spr09/cos333/beautiful.html // additions (already mentioned in his extension proposals): // - match a symbol which is repeated at least once "x+" (which is the same as "xx*") // - match a symbol which appears at most once "x?" // - regex can contain escaped metasymbols => "\." searches for a dot // - matchlines splits input into lines #include "match.h" // we have to define NULL because we don't include string.h directly or indirectly #define NULL 0 // track escaped characters #define YES 1 #define NO 0 // Windows uses \r\n for newlines, Unix only \n #define PROCESS_CRLF 1 /// matchhere: search for regex only at beginning of text, stop at textEnd or when a zero-byte is found /** local helper function for match(): @param text - text to be matched @param textEnd - last byte of text to be considered, can be NULL if text is zero-terminated @param regex - simplified regular expression as explained in the header file @param escaped - YES treats first character of regex as a literal only and ignores their meta symbol meaning, but NO is default @return if successful points beyond last matching literal, else NULL **/ static const char* matchhere(const char* text, const char* textEnd, const char* regex, int escaped) { // process current byte if (escaped == NO) { switch (*regex) { case 0: // regex empty ? return text; case '\\': // treat next character as a literal, not a metasymbol return matchhere(text, textEnd, regex + 1, YES); case '^': // ^ was already handled in match() / matchlines() return NULL; } } // lookahead one byte (i.e. watch out for meta-symbols) switch (regex[1]) { case '+': // symbol appears at least once while (*regex == *text || // keep running if next symbol matches (*regex == '.' && escaped == NO && text != textEnd && *text != 0)) { // match remainder const char* matched = matchhere(++text, textEnd, regex + 2, NO); if (matched != NULL) return matched; } // running out of text break; // failed case '*': // symbol appears arbitrarily often // eat text symbol-by-symbol and try to match remaining regex do { // match remaining regex const char* matched = matchhere(text, textEnd, regex + 2, NO); if (matched != NULL) return matched; } while (*regex == *text++ || // keep running if next symbol matches (*regex == '.' && escaped == NO && text != textEnd && *text != 0)); // running out of text break; // failed case '?': // symbol appears exactly zero or one time // assume symbol doesn't appear { const char* matched = matchhere(text, textEnd, regex + 2, NO); if (matched != NULL) return matched; } // continue matching at next position only if symbol appears if (*regex == *text || (*regex == '.' && escaped == NO && *text != 0)) return matchhere(text + 1, textEnd, regex + 2, NO); break; // failed case 0: // *regex will be the last symbol // dollar matches only end of text if (*regex == '$' && escaped == NO) { if (text == textEnd || *text == 0) return text; break; // failed } // nothing left for further matching if (text == textEnd || *text == 0) break; // failed // does last literal match ? return (*regex == *text || (*regex == '.' && escaped == NO)) ? text + 1 : NULL; default: // no meta-symbol, just plain character // nothing left for further matching if (text == textEnd || *text == 0) break; // failed // $ must be followed by 0 (previous case) if (*regex == '$' && escaped == NO) break; // failed // same character (or any character) ? if (*regex == *text || (*regex == '.' && escaped == NO)) return matchhere(text + 1, textEnd, regex + 1, NO); break; // failed } // failed return NULL; } /// points to first regex match or NULL if not found, text is treated as one line and ^ and $ refer to its start/end /** @param text - text to be matched @param regex - simplified regular expression as explained in the header file @return if successful points to the next matching literal, else NULL if no match **/ const char* match(const char* text, const char* regex) { // detect invalid input if (!text || !regex) return NULL; // match only start if (*regex == '^') // matchhere looks out for zero byte if textEnd parameter is set to NULL return matchhere(text, NULL, regex + 1, NO) ? text : NULL; // try to start match at every text position do { // match found ? if (matchhere(text, NULL, regex, NO)) return text; } while (*text++ != 0); // failed to match return NULL; } /// points to first regex match or NULL if not found, text is split into lines (\n) and ^ and $ refer to each line's start/end /** @param text - text to be matched @param regex - simplified regular expression as explained in the header file @return if successful points to the next matching literal, else NULL if no match **/ const char* matchlines(const char* text, const char* regex) { // detect invalid input if (!text || !regex) return NULL; // until the end of the world ... do { // use simple matching for end-of-text if (*text == 0) return match(text, regex); // find extents of current line const char* right = text; #ifdef PROCESS_CRLF while (*right != 0 && *right != '\n' && *right != '\r') right++; #else while (*right != 0 && *right != '\n') right++; #endif // ^ refers to initial text pointer position in the first line, // thereafter to each start of a new line // => if initial text pointer refers to the middle of a line, a false ^ match may be produced // match only start if (*regex == '^') { if (matchhere(text, right, regex + 1, NO)) return text; } else { // try to start match at every text position while (text <= right) { // match found ? if (matchhere(text, right, regex, NO)) return text; // no match yet, continue text++; } } // skip forward to next line text = right; // skip new-line #ifdef PROCESS_CRLF while (*text == '\r') text++; #endif if (*text == '\n') text++; } while (*text++ != 0); // failed to match return NULL; }
Download  match.c
Latest release: August 7, 2015, size: 6820 bytes, 228 lines

CRC32: 9766b355
MD5: b125c34b1fd553c05b0e258546be8c87
SHA1: c7d15fc1dff69634c857092c84ae5d5cc62589fa
SHA256:2ffaf82d521fd370353cf165065215a3d948b173a9e7db99760b6d14e79657ab

Stay up-to-date:git clone http://create.stephan-brumme.com/tiny-regular-expression-matcher/.git

Changelog

Algorithm

The core function matchhere tries to find the first match of the regular expression regex in text and returns a pointer to that position.
Search is aborted if the end of text is reached, i.e. *text == 0 (zero-terminated C string) or text == textEnd.
The latter is important to successfully match the symbol $ against the end of a line which isn't necessarily the end of text.

matchhere invokes itself recursively and processes one or two bytes of regex in each call.
If it fails, the text pointer will be incremented and the same bytes of regex will be matched again.
Therefore some special cases might have a runtime somewhere close to O(length(text*regex).
  1. Look at current byte of regex:
  2. Take a lookahead at the next bytes of regex:
match and matchlines only provide thin wrappers for single-line (match) and multi-line input (matchlines).

Test Suite

My library passes all these tests successfully (see test.c in the Git repository):
test case text regex match(text, regex) correct Rob Pike's result
Empty1 "" "" yes yes yes
Empty2 "abc" "" yes yes yes
Empty3 "" "a?" yes yes unsupported, failed
Empty4 "" ".?" yes yes unsupported, failed
Empty5 "" "a*" yes yes yes
Empty6 "" ".*" yes yes yes
Empty7 "" "^" yes yes yes
Empty8 "" "$" yes yes yes
Empty9 "" "^$" yes yes yes
Empty10 "" "^.?" yes yes unsupported, failed
Empty11 "" "^.*" yes yes yes
Empty12 "" ".?$" yes yes unsupported, failed
Empty13 "" ".*$" yes yes yes
Empty14 "" "^.?$" yes yes unsupported, failed
Empty15 "" "^.*$" yes yes yes
Empty16 "" "." no yes yes
Empty17 "" "^." no yes yes
Empty18 "" ".$" no yes yes
Empty19 "" "^.$" no yes yes
Exact1 "abc" "abc" yes yes yes
Exact2 "abc" "ab" yes yes yes
Exact3 "abc" "bc" yes yes yes
Exact4 "abc" "^a" yes yes yes
Exact5 "abc" "^ab" yes yes yes
Exact6 "abc" "^abc" yes yes yes
Exact7 "abc" "^abc$" yes yes yes
Exact8 "abc" "abc$" yes yes yes
Exact9 "abc" "bc$" yes yes yes
Exact10 "abc" "c$" yes yes yes
Exact11 "abc" "Abc" no yes yes
Exact12 "aabc" "abc" yes yes yes
Exact13 "aabcc" "abc" yes yes yes
Exact14 "abcc" "abc" yes yes yes
Exact15 "aabc" "^abc" no yes yes
Exact16 "aabc" "abc$" yes yes yes
Exact17 "abcc" "abc$" no yes yes
Exact18 "abcc" "^abc" yes yes yes
Star1 "a" "a*" yes yes yes
Star2 "aa" "a*" yes yes yes
Star3 "b" "a*" yes yes yes
Star4 "ab" "a*" yes yes yes
Star5 "aab" "a*" yes yes yes
Star6 "a" "^a*" yes yes yes
Star7 "aa" "^a*" yes yes yes
Star8 "b" "^a*" yes yes yes
Star9 "ab" "^a*" yes yes yes
Star10 "aab" "^a*" yes yes yes
Star11 "a" "a*$" yes yes yes
Star12 "aa" "a*$" yes yes yes
Star13 "b" "a*$" yes yes yes
Star14 "ba" "a*$" yes yes yes
Star15 "baa" "a*$" yes yes yes
Star16 "a" "^a*$" yes yes yes
Star17 "aa" "^a*$" yes yes yes
Star18 "b" "^a*$" no yes yes
Star19 "a" "a*a" yes yes yes
Star20 "aa" "a*a" yes yes yes
Plus1 "a" "a+" yes yes unsupported, failed
Plus2 "aa" "a+" yes yes unsupported, failed
Plus3 "b" "a+" no yes unsupported, yes
Plus4 "ab" "a+" yes yes unsupported, failed
Plus5 "aab" "a+" yes yes unsupported, failed
Plus6 "a" "^a+" yes yes unsupported, failed
Plus7 "aa" "^a+" yes yes unsupported, failed
Plus8 "b" "^a+" no yes unsupported, yes
Plus9 "ab" "^a+" yes yes unsupported, failed
Plus10 "aab" "^a+" yes yes unsupported, failed
Plus11 "a" "a+$" yes yes unsupported, failed
Plus12 "aa" "a+$" yes yes unsupported, failed
Plus13 "b" "a+$" no yes unsupported, yes
Plus14 "ba" "a+$" yes yes unsupported, failed
Plus15 "baa" "a+$" yes yes unsupported, failed
Plus16 "a" "^a+$" yes yes unsupported, failed
Plus17 "aa" "^a+$" yes yes unsupported, failed
Plus18 "b" "^a+$" no yes unsupported, yes
Plus19 "a" "a+a" no yes unsupported, yes
Plus20 "aa" "a+a" yes yes unsupported, failed
Dot1 "a" "." yes yes unsupported, yes
Dot2 "." "." yes yes unsupported, yes
Dot3 "ab" "." yes yes unsupported, yes
Dot4 "a" ".?" yes yes unsupported, failed
Dot5 "ab" ".?" yes yes unsupported, failed
Dot6 "a" "^." yes yes unsupported, yes
Dot7 "ab" "^." yes yes unsupported, yes
Dot8 "a" ".$" yes yes unsupported, yes
Dot9 "ab" ".$" yes yes unsupported, yes
Dot10 "a" "^.$" yes yes unsupported, yes
Dot11 "ab" "^.$" no yes unsupported, yes
Escape1 "." "\\." yes yes unsupported, failed
Escape2 "a" "\\." no yes unsupported, yes
Escape3 "a" "\\.?" yes yes unsupported, failed
Escape4 "a" "\\.*" yes yes unsupported, failed
Escape5 "a" "\\.+" no yes unsupported, yes
Escape6 "*" "\\*" yes yes unsupported, failed
Escape7 "a" "\\*" no yes unsupported, yes
Escape8 "a" "\\*?" yes yes unsupported, failed
Escape9 "a" "\\**" yes yes unsupported, failed
Escape10 "a" "\\*+" no yes unsupported, yes
Escape11 "+" "\\+" yes yes unsupported, failed
Escape12 "a" "\\+" no yes unsupported, yes
Escape13 "a" "\\+?" yes yes unsupported, failed
Escape14 "a" "\\+*" yes yes unsupported, failed
Escape15 "a" "\\++" no yes unsupported, yes
Escape16 "^" "\\^" yes yes unsupported, failed
Escape17 "a" "\\^" no yes unsupported, yes
Escape18 "a" "\\^?" yes yes unsupported, failed
Escape19 "a" "\\^*" yes yes unsupported, failed
Escape20 "a" "\\^+" no yes unsupported, yes
Escape21 "$" "\\$" yes yes unsupported, failed
Escape22 "a" "\\$" no yes unsupported, yes
Escape23 "a" "\\$?" yes yes unsupported, failed
Escape24 "a" "\\$*" yes yes unsupported, failed
Escape25 "a" "\\$+" no yes unsupported, yes
Escape26 "?" "\\?" yes yes unsupported, failed
Escape27 "a" "\\?" no yes unsupported, yes
Escape28 "a" "\\??" yes yes unsupported, failed
Escape29 "a" "\\?*" yes yes unsupported, failed
Escape30 "a" "\\?+" no yes unsupported, yes
Escape31 "\\" "\\" no yes unsupported, yes
Escape32 "\\" "\\\\" yes yes unsupported, failed
Escape33 "a\n" "\n" yes yes yes
Begin1 "^" "^^" no yes failed
Begin2 "a" "a^" no yes misinterpreted, yes
Begin3 "^" "\\^" yes yes unsupported, failed
Begin4 "^" "^\\^" yes yes unsupported, failed
Begin5 "^a" "^\\^" yes yes unsupported, failed
Begin6 "^" "\\^^" no yes unsupported, yes
End1 "$" "$$" no yes failed
End2 "a" "$a" no yes misinterpreted, yes
End3 "$" "\\$" yes yes unsupported, failed
End4 "$" "\\$$" yes yes unsupported, failed
End5 "a$" "\\$$" yes yes unsupported, failed
End6 "$" "$\\$" no yes unsupported, yes

test case text regex matchlines(text, regex) correct
LinesExact1 "" "" 1 yes
LinesExact2 "\n" "" 2 yes

test case text regex how often succeeds match(text, regex) correct
CountExact1 "" "a" 0 yes
CountExact2 "a" "a" 1 yes
CountExact3 "aa" "a" 2 yes
CountExact4 "aaa" "a" 3 yes
CountStar1 "aaa" "a*" 4 yes
CountPlus1 "aaa" "a+" 3 yes
CountDot1 "aaa" "." 3 yes
CountDot2 "aba" "." 3 yes
CountDot3 "aba" ".." 2 yes
CountDot4 "aba" "..." 1 yes
CountDot5 "aba" "...." 0 yes
CountDot6 "aba" "....?" 1 yes
CountDot7 "aba" "..?.." 1 yes
CountDot8 "aba" "..?..?" 2 yes
CountDot9 "" "." 0 yes
CountDot10 "" ".?" 1 yes
CountEnd1 "aaa" "a$" 1 yes
CountEnd2 "aaa" "$" 1 yes

Limitations

The real magic of regular expression parsers is beyond the scope of this short posting.
Advanced stuff like groups, alternatives, quantifiers, character classes and back-references were deliberately NOT implemented.
If you need those features, look out for one of the big regex implementations, C++11's regex is a good start.

Runtime efficiency can be a serious problem for certain use cases. My library wasn't optimized for speed and therefore some regular expression may take a long time.
However, most regex sift through the public domain version of Bram Stoker's "Dracula" within less than 0.01 seconds (i.e. throughput of about 80 MByte/sec on a modest CPU).
homepage