Tiny Regular Expression Matcher

posted August 11, 2015 by Stephan Brumme

Regular expressions (regex) can be the gate to heaven as well as the gate to hell. Usually they are both at the same time.
Writing correct regular expression is sometimes as tough as writing a regular expression parser.

Chapter 1 of the excellent book "Beautiful Code" (ISBN 9780596510046) discusses Rob Pike's very short and elegant regex parser code.
It doesn't come with all the bells and whistles of a full Perl regular expression matcher but is quite sufficient for many purposes.
I extended Rob's C code in such a way that it can handle the following regex meta-symbols (see limitations, too):

Regex Symbol	Meaning	Already in Rob Pike's code ?
.	match any character	yes
*	the previous character is arbitrarily repeated	yes
+	the previous character occurs at least once	no
?	the previous character occurs once or not at all	no
^	match regular expression only to the beginning of the text	yes
$	match regular expression only to the end of the text	yes
\	treat next character as a literal, even if it is a regex symbol	no

My C code consists of just one .h and one .c file without any external dependencies - not even #include <stdlib.h>.
The source file match.c contains about 100 lines of straight C code (plus tons of comments and blank lines).
Take a look at my explanation of the matching algorithm.

A very basic use case looks as follows:

hide example: find first match


     const char* text  = "aabc";
     const char* regex = "^a+b";
     if (match(text, regex))
       printf("yes, %s matches the pattern %s", text, regex);

To find all matches (this simplified loop assumes no $ symbols were used in regex and regex doesn't match an empty text, too):

hide example: find all matches


     const char* text  = "aabc";
     const char* regex = "a*b";
     const char* where = text;
     while (where = match(where, regex))
     {
       printf("%s matches the pattern %s at offset %d", text, regex, where - text);
       where++;
     }

The whole header file match.h consists of just 2 public functions:

hide match.h


     // //////////////////////////////////////////////////////////
     // match.h
     // Copyright (c) 2015 Stephan Brumme. All rights reserved.
     // see http://create.stephan-brumme.com/disclaimer.html
     //
     
     #pragma once
     
     // allowed metasymbols:
     // ^  -> text must start with pattern (only allowed as first symbol in regular expression)
     // $  -> text must end   with pattern (only allowed as last  symbol in regular expression)
     // .  -> any literal
     // x? -> literal x occurs zero or one time
     // x* -> literal x is repeated arbitrarily
     // x+ -> literal x is repeated at least once
     
     /// points to first regex match or NULL if not found, text is treated as one line   and ^ and $ refer to its start/end
     const char* match      (const char* text, const char* regex);
     
     /// points to first regex match or NULL if not found, text is split into lines (\n) and ^ and $ refer to each line's start/end
     const char* matchlines (const char* text, const char* regex);

Git users: scroll down to the repository link

Download match.h

Latest release: July 31, 2015, size: 936 bytes, 21 lines

CRC32: 6a11ebb0
MD5: 8a302fa5a47ffd48dd43d228b8e24ac8
SHA1: dbb27490bf0fc2cf00fe4db8173f0ba7c9d0df66
SHA256:5fed7747dd69bdce3ce0470ba29f0e9a2315d3743b55bec173c147078129473d

show match.c


     // //////////////////////////////////////////////////////////
     // match.c
     // Copyright (c) 2015 Stephan Brumme. All rights reserved.
     // see http://create.stephan-brumme.com/disclaimer.html
     //
     
     // I improved Rob Pike's code found in the excellent book "Beautiful Code", ISBN 9780596510046
     // http://www.cs.princeton.edu/courses/archive/spr09/cos333/beautiful.html
     // additions (already mentioned in his extension proposals):
     // - match a symbol which is repeated at least once "x+" (which is the same as "xx*")
     // - match a symbol which appears at most once      "x?"
     // - regex can contain escaped metasymbols => "\." searches for a dot
     // - matchlines splits input into lines
     
     #include "match.h"
     
     // we have to define NULL because we don't include string.h directly or indirectly
     #define NULL 0
     
     // track escaped characters
     #define YES  1
     #define NO   0
     
     // Windows uses \r\n for newlines, Unix only \n
     #define PROCESS_CRLF 1
     
     
     /// matchhere: search for regex only at beginning of text, stop at textEnd or when a zero-byte is found
     /** local helper function for match():
         @param text    - text to be matched
         @param textEnd - last byte of text to be considered, can be NULL if text is zero-terminated
         @param regex   - simplified regular expression as explained in the header file
         @param escaped - YES treats first character of regex as a literal only and ignores their meta symbol meaning, but NO is default
         @return if successful points beyond last matching literal, else NULL
     **/
     static const char* matchhere(const char* text, const char* textEnd, const char* regex, int escaped)
     {
       // process current byte
       if (escaped == NO)
       {
         switch (*regex)
         {
         case 0: // regex empty ?
           return text;
         case '\\': // treat next character as a literal, not a metasymbol
           return matchhere(text, textEnd, regex + 1, YES);
         case '^': // ^ was already handled in match() / matchlines()
           return NULL;
         }
       }
     
       // lookahead one byte (i.e. watch out for meta-symbols)
       switch (regex[1])
       {
       case '+': // symbol appears at least once
         while (*regex == *text || // keep running if next symbol matches
                (*regex == '.' && escaped == NO && text != textEnd && *text != 0))
         {
           // match remainder
           const char* matched = matchhere(++text, textEnd, regex + 2, NO);
           if (matched != NULL)
             return matched;
         }
     
         // running out of text
         break; // failed
     
       case '*': // symbol appears arbitrarily often
         // eat text symbol-by-symbol and try to match remaining regex
         do
         {
           // match remaining regex
           const char* matched = matchhere(text, textEnd, regex + 2, NO);
           if (matched != NULL)
             return matched;
         } while (*regex == *text++ || // keep running if next symbol matches
                  (*regex == '.' && escaped == NO && text != textEnd && *text != 0));
     
         // running out of text
         break; // failed
     
       case '?': // symbol appears exactly zero or one time
         // assume symbol doesn't appear
         {
           const char* matched = matchhere(text, textEnd, regex + 2, NO);
           if (matched != NULL)
             return matched;
         }
     
         // continue matching at next position only if symbol appears
         if (*regex == *text || (*regex == '.' && escaped == NO && *text != 0))
           return matchhere(text + 1, textEnd, regex + 2, NO);
     
         break; // failed
     
       case 0: // *regex will be the last symbol
         // dollar matches only end of text
         if (*regex == '$' && escaped == NO)
         {
           if (text == textEnd || *text == 0)
             return text;
           break; // failed
         }
     
         // nothing left for further matching
         if (text == textEnd || *text == 0)
           break; // failed
     
         // does last literal match ?
         return (*regex == *text || (*regex == '.' && escaped == NO)) ? text + 1 : NULL;
     
       default: // no meta-symbol, just plain character
         // nothing left for further matching
         if (text == textEnd || *text == 0)
           break; // failed
     
         // $ must be followed by 0 (previous case)
         if (*regex == '$' && escaped == NO)
           break; // failed
     
         // same character (or any character) ?
         if (*regex == *text || (*regex == '.' && escaped == NO))
           return matchhere(text + 1, textEnd, regex + 1, NO);
     
         break; // failed
       }
     
       // failed
       return NULL;
     }
     
     
     /// points to first regex match or NULL if not found, text is treated as one line   and ^ and $ refer to its start/end
     /** @param text    - text to be matched
         @param regex   - simplified regular expression as explained in the header file
         @return if successful points to the next matching literal, else NULL if no match
     **/
     const char* match(const char* text, const char* regex)
     {
       // detect invalid input
       if (!text || !regex)
         return NULL;
     
       // match only start
       if (*regex == '^')
         // matchhere looks out for zero byte if textEnd parameter is set to NULL
         return matchhere(text, NULL, regex + 1, NO) ? text : NULL;
     
       // try to start match at every text position
       do
       {
         // match found ?
         if (matchhere(text, NULL, regex, NO))
           return text;
       } while (*text++ != 0);
     
       // failed to match
       return NULL;
     }
     
     
     /// points to first regex match or NULL if not found, text is split into lines (\n) and ^ and $ refer to each line's start/end
     /** @param text    - text to be matched
         @param regex   - simplified regular expression as explained in the header file
         @return if successful points to the next matching literal, else NULL if no match
     **/
     const char* matchlines(const char* text, const char* regex)
     {
       // detect invalid input
       if (!text || !regex)
         return NULL;
     
       // until the end of the world ...
       do
       {
         // use simple matching for end-of-text
         if (*text == 0)
           return match(text, regex);
     
         // find extents of current line
         const char* right = text;
     #ifdef PROCESS_CRLF
         while (*right != 0 && *right != '\n' && *right != '\r')
           right++;
     #else
         while (*right != 0 && *right != '\n')
           right++;
     #endif
     
         // ^ refers to initial text pointer position in the first line,
         // thereafter to each start of a new line
         // => if initial text pointer refers to the middle of a line, a false ^ match may be produced
     
         // match only start
         if (*regex == '^')
         {
           if (matchhere(text, right, regex + 1, NO))
             return text;
         }
         else
         {
           // try to start match at every text position
           while (text <= right)
           {
             // match found ?
             if (matchhere(text, right, regex, NO))
               return text;
     
             // no match yet, continue
             text++;
           }
         }
     
         // skip forward to next line
         text = right;
     
         // skip new-line
     #ifdef PROCESS_CRLF
         while (*text == '\r')
           text++;
     #endif
         if    (*text == '\n')
           text++;
       } while (*text++ != 0);
     
       // failed to match
       return NULL;
     }

Download match.c

Latest release: August 7, 2015, size: 6.7 kBytes, 228 lines

CRC32: 9766b355
MD5: b125c34b1fd553c05b0e258546be8c87
SHA1: c7d15fc1dff69634c857092c84ae5d5cc62589fa
SHA256:2ffaf82d521fd370353cf165065215a3d948b173a9e7db99760b6d14e79657ab

Stay up-to-date:git clone https://create.stephan-brumme.com/tiny-regular-expression-matcher/git

GitHub mirror:https://github.com/stbrumme/tiny-regular-expression-matcher

If you encounter any bugs/problems or have ideas for improving future versions, please write me an email: create@stephan-brumme.com

License

This code is licensed under the zlib License:

This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software. Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions: 1. The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required. 2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software. 3. This notice may not be removed or altered from any source distribution.zlib License

Changelog

version 1
- August 11, 2015
- initial release
- Git tag tiny_regular_expression_matcher_v1

Algorithm

The core function matchhere tries to find the first match of the regular expression regex in text and returns a pointer to that position.
Search is aborted if the end of text is reached, i.e. *text == 0 (zero-terminated C string) or text == textEnd.
The latter is important to successfully match the symbol $ against the end of a line which isn't necessarily the end of text.

matchhere invokes itself recursively and processes one or two bytes of regex in each call.
If it fails, the text pointer will be incremented and the same bytes of regex will be matched again.
Therefore some special cases might have a runtime somewhere close to O(length(text*regex).

Look at current byte of regex:

if regex is empty, then current text is successfully matched
if *regex is a backslash \ then the next byte of regex must be treated as a plain character
if *regex is ^ then regex is invalid (only match and matchlines check for ^)

Take a lookahead at the next bytes of regex:

if regex[1] is + then as long as *text equals *regex call matchhere(++text, regex + 2) until a match is found
if regex[1] is * then call matchhere(++text, regex + 2) until a match is found or *text is not equal to *regex
if regex[1] is ? then match text against the remainder of regex (call matchhere(text, regex + 2)), if failed, try the same with text + 1
if regex[1] is the end of regex, then *regex must be either $ or . or identical to *text
in every other case *regex must be either . or identical to *text

match and matchlines only provide thin wrappers for single-line (match) and multi-line input (matchlines).

Test Suite

My library passes all these tests successfully (see test.c in the Git repository):

test case	text	regex	`match(text, regex)`	correct	Rob Pike's result
Empty1	""	""	yes	yes	yes
Empty2	"abc"	""	yes	yes	yes
Empty3	""	"a?"	yes	yes	unsupported, failed
Empty4	""	".?"	yes	yes	unsupported, failed
Empty5	""	"a*"	yes	yes	yes
Empty6	""	".*"	yes	yes	yes
Empty7	""	"^"	yes	yes	yes
Empty8	""	"$"	yes	yes	yes
Empty9	""	"^$"	yes	yes	yes
Empty10	""	"^.?"	yes	yes	unsupported, failed
Empty11	""	"^.*"	yes	yes	yes
Empty12	""	".?$"	yes	yes	unsupported, failed
Empty13	""	".*$"	yes	yes	yes
Empty14	""	"^.?$"	yes	yes	unsupported, failed
Empty15	""	"^.*$"	yes	yes	yes
Empty16	""	"."	no	yes	yes
Empty17	""	"^."	no	yes	yes
Empty18	""	".$"	no	yes	yes
Empty19	""	"^.$"	no	yes	yes
Exact1	"abc"	"abc"	yes	yes	yes
Exact2	"abc"	"ab"	yes	yes	yes
Exact3	"abc"	"bc"	yes	yes	yes
Exact4	"abc"	"^a"	yes	yes	yes
Exact5	"abc"	"^ab"	yes	yes	yes
Exact6	"abc"	"^abc"	yes	yes	yes
Exact7	"abc"	"^abc$"	yes	yes	yes
Exact8	"abc"	"abc$"	yes	yes	yes
Exact9	"abc"	"bc$"	yes	yes	yes
Exact10	"abc"	"c$"	yes	yes	yes
Exact11	"abc"	"Abc"	no	yes	yes
Exact12	"aabc"	"abc"	yes	yes	yes
Exact13	"aabcc"	"abc"	yes	yes	yes
Exact14	"abcc"	"abc"	yes	yes	yes
Exact15	"aabc"	"^abc"	no	yes	yes
Exact16	"aabc"	"abc$"	yes	yes	yes
Exact17	"abcc"	"abc$"	no	yes	yes
Exact18	"abcc"	"^abc"	yes	yes	yes
Star1	"a"	"a*"	yes	yes	yes
Star2	"aa"	"a*"	yes	yes	yes
Star3	"b"	"a*"	yes	yes	yes
Star4	"ab"	"a*"	yes	yes	yes
Star5	"aab"	"a*"	yes	yes	yes
Star6	"a"	"^a*"	yes	yes	yes
Star7	"aa"	"^a*"	yes	yes	yes
Star8	"b"	"^a*"	yes	yes	yes
Star9	"ab"	"^a*"	yes	yes	yes
Star10	"aab"	"^a*"	yes	yes	yes
Star11	"a"	"a*$"	yes	yes	yes
Star12	"aa"	"a*$"	yes	yes	yes
Star13	"b"	"a*$"	yes	yes	yes
Star14	"ba"	"a*$"	yes	yes	yes
Star15	"baa"	"a*$"	yes	yes	yes
Star16	"a"	"^a*$"	yes	yes	yes
Star17	"aa"	"^a*$"	yes	yes	yes
Star18	"b"	"^a*$"	no	yes	yes
Star19	"a"	"a*a"	yes	yes	yes
Star20	"aa"	"a*a"	yes	yes	yes
Plus1	"a"	"a+"	yes	yes	unsupported, failed
Plus2	"aa"	"a+"	yes	yes	unsupported, failed
Plus3	"b"	"a+"	no	yes	unsupported, yes
Plus4	"ab"	"a+"	yes	yes	unsupported, failed
Plus5	"aab"	"a+"	yes	yes	unsupported, failed
Plus6	"a"	"^a+"	yes	yes	unsupported, failed
Plus7	"aa"	"^a+"	yes	yes	unsupported, failed
Plus8	"b"	"^a+"	no	yes	unsupported, yes
Plus9	"ab"	"^a+"	yes	yes	unsupported, failed
Plus10	"aab"	"^a+"	yes	yes	unsupported, failed
Plus11	"a"	"a+$"	yes	yes	unsupported, failed
Plus12	"aa"	"a+$"	yes	yes	unsupported, failed
Plus13	"b"	"a+$"	no	yes	unsupported, yes
Plus14	"ba"	"a+$"	yes	yes	unsupported, failed
Plus15	"baa"	"a+$"	yes	yes	unsupported, failed
Plus16	"a"	"^a+$"	yes	yes	unsupported, failed
Plus17	"aa"	"^a+$"	yes	yes	unsupported, failed
Plus18	"b"	"^a+$"	no	yes	unsupported, yes
Plus19	"a"	"a+a"	no	yes	unsupported, yes
Plus20	"aa"	"a+a"	yes	yes	unsupported, failed
Dot1	"a"	"."	yes	yes	unsupported, yes
Dot2	"."	"."	yes	yes	unsupported, yes
Dot3	"ab"	"."	yes	yes	unsupported, yes
Dot4	"a"	".?"	yes	yes	unsupported, failed
Dot5	"ab"	".?"	yes	yes	unsupported, failed
Dot6	"a"	"^."	yes	yes	unsupported, yes
Dot7	"ab"	"^."	yes	yes	unsupported, yes
Dot8	"a"	".$"	yes	yes	unsupported, yes
Dot9	"ab"	".$"	yes	yes	unsupported, yes
Dot10	"a"	"^.$"	yes	yes	unsupported, yes
Dot11	"ab"	"^.$"	no	yes	unsupported, yes
Escape1	"."	"\\."	yes	yes	unsupported, failed
Escape2	"a"	"\\."	no	yes	unsupported, yes
Escape3	"a"	"\\.?"	yes	yes	unsupported, failed
Escape4	"a"	"\\.*"	yes	yes	unsupported, failed
Escape5	"a"	"\\.+"	no	yes	unsupported, yes
Escape6	"*"	"\\*"	yes	yes	unsupported, failed
Escape7	"a"	"\\*"	no	yes	unsupported, yes
Escape8	"a"	"\\*?"	yes	yes	unsupported, failed
Escape9	"a"	"\\**"	yes	yes	unsupported, failed
Escape10	"a"	"\\*+"	no	yes	unsupported, yes
Escape11	"+"	"\\+"	yes	yes	unsupported, failed
Escape12	"a"	"\\+"	no	yes	unsupported, yes
Escape13	"a"	"\\+?"	yes	yes	unsupported, failed
Escape14	"a"	"\\+*"	yes	yes	unsupported, failed
Escape15	"a"	"\\++"	no	yes	unsupported, yes
Escape16	"^"	"\\^"	yes	yes	unsupported, failed
Escape17	"a"	"\\^"	no	yes	unsupported, yes
Escape18	"a"	"\\^?"	yes	yes	unsupported, failed
Escape19	"a"	"\\^*"	yes	yes	unsupported, failed
Escape20	"a"	"\\^+"	no	yes	unsupported, yes
Escape21	"$"	"\\$"	yes	yes	unsupported, failed
Escape22	"a"	"\\$"	no	yes	unsupported, yes
Escape23	"a"	"\\$?"	yes	yes	unsupported, failed
Escape24	"a"	"\\$*"	yes	yes	unsupported, failed
Escape25	"a"	"\\$+"	no	yes	unsupported, yes
Escape26	"?"	"\\?"	yes	yes	unsupported, failed
Escape27	"a"	"\\?"	no	yes	unsupported, yes
Escape28	"a"	"\\??"	yes	yes	unsupported, failed
Escape29	"a"	"\\?*"	yes	yes	unsupported, failed
Escape30	"a"	"\\?+"	no	yes	unsupported, yes
Escape31	"\\"	"\\"	no	yes	unsupported, yes
Escape32	"\\"	"\\\\"	yes	yes	unsupported, failed
Escape33	"a\n"	"\n"	yes	yes	yes
Begin1	"^"	"^^"	no	yes	failed
Begin2	"a"	"a^"	no	yes	misinterpreted, yes
Begin3	"^"	"\\^"	yes	yes	unsupported, failed
Begin4	"^"	"^\\^"	yes	yes	unsupported, failed
Begin5	"^a"	"^\\^"	yes	yes	unsupported, failed
Begin6	"^"	"\\^^"	no	yes	unsupported, yes
End1	"$"	"$$"	no	yes	failed
End2	"a"	"$a"	no	yes	misinterpreted, yes
End3	"$"	"\\$"	yes	yes	unsupported, failed
End4	"$"	"\\$$"	yes	yes	unsupported, failed
End5	"a$"	"\\$$"	yes	yes	unsupported, failed
End6	"$"	"$\\$"	no	yes	unsupported, yes

test case	text	regex	`matchlines(text, regex)`	correct
LinesExact1	""	""	1	yes
LinesExact2	"\n"	""	2	yes

test case	text	regex	how often succeeds `match(text, regex)`	correct
CountExact1	""	"a"	0	yes
CountExact2	"a"	"a"	1	yes
CountExact3	"aa"	"a"	2	yes
CountExact4	"aaa"	"a"	3	yes
CountStar1	"aaa"	"a*"	4	yes
CountPlus1	"aaa"	"a+"	3	yes
CountDot1	"aaa"	"."	3	yes
CountDot2	"aba"	"."	3	yes
CountDot3	"aba"	".."	2	yes
CountDot4	"aba"	"..."	1	yes
CountDot5	"aba"	"...."	0	yes
CountDot6	"aba"	"....?"	1	yes
CountDot7	"aba"	"..?.."	1	yes
CountDot8	"aba"	"..?..?"	2	yes
CountDot9	""	"."	0	yes
CountDot10	""	".?"	1	yes
CountEnd1	"aaa"	"a$"	1	yes
CountEnd2	"aaa"	"$"	1	yes

Limitations

The real magic of regular expression parsers is beyond the scope of this short posting.
Advanced stuff like groups, alternatives, quantifiers, character classes and back-references were deliberately NOT implemented.
If you need those features, look out for one of the big regex implementations, C++11's regex is a good start.

Runtime efficiency can be a serious problem for certain use cases. My library wasn't optimized for speed and therefore some regular expression may take a long time.
However, most regex sift through the public domain version of Bram Stoker's "Dracula" within less than 0.01 seconds (i.e. throughput of about 80 MByte/sec on a modest CPU).


June 30, 2021: Length-Limited Prefix Codes April 13, 2020: smalLZ4 - optimal LZ4 compression June 17, 2019: toojpeg - a JPEG encoder in a single C++ file ... and 30 more !	RSS	www.stephan-brumme.com bits.stephan-brumme.com photos.stephan-brumme.com minime.stephan-brumme.com
© 2011-2024 Stephan Brumme products and company names mentioned may be trademarks of their respective owners. see my disclaimer. dude, this is the end ... of this page.