Today, we’re going to implement applicative regular expressions and parsers (in the style of the regex-applicative library) using free structures!
Free structures are some of my favorite tools in Haskell, and I’ve actually written a few posts about them before, including this one using free groups, this one on a free monad variation, and this one on a “free” applicative on a monoid.
Regular expressions (and parsers) are ubiquitous in computer science and programming, and I hope that demonstrating that they are pretty straightforward to implement using free structures will help you see the value in free structures without getting too bogged down in the details!
All of the code in this post is available online as a “stack executable”. When you run it (
./regexp.hs), you’ll load up a ghci session with all of the definitions in scope, so you can play around with the functions and types :)
This post should be accessible to late beginner or early intermediate Haskell users, and requires some basic familiarity with pattern matching, algebraic data types, and abstractions like
Functor, and do notation.
A regular expression is something that defines a regular language. Formally, it consists of the following base elements:
- The empty set, which always fails to match.
- The empty string, which always succeeds matching the empty string.
- The literal character, denoting a single matching character
And the following operations:
RS, sequence one after the other. A set product.
R|S, one or the other. A set union.
- Kleene Star:
R*, the repetition of
Rzero or more times.
And that’s all that’s in a regular expression. Nothing more, nothing less. From these basic tools, you can derive the rest of the regexp operations — for example,
a+ can be expressed as
aa*, and categories like
\w can be expressed as alternations of valid characters.
Looking at this, does this look a little familiar? It reminds me a lot of the
Alternative typeclass. If a functor
f has an
Alternative instance, it means that it has:
empty, the failing operation
pure x, the always-succeeding operation (from the
<*>, the sequencing operation (from the
<|>, the alternating operation
many, the “zero or more” operation.
This…looks a lot like the construction of a regular language, doesn’t it? It’s almost as if
Alternative has almost exactly what we need. The only thing missing is the literal character primitive.
If you’re unfamiliar with
Alternative, the typeclassopedia has a good step-by-step introduction. But for the purposes of this article, it’s basically just a “double monoid”, with two “combining” actions
<|>, which roughly correspond to
+ in the integers. It’s basically pretty much nothing more than 1-5 in the list above, and some distributivity laws.
So, one way we can look at regular expressions is “The entire
Alternative interface, plus a character primitive”. But! There’s another way of looking at this, that leads us directly to free structures.
Instead of seeing things as “
Alternative with a character primitive”, we can look at it as a character primitive enriched with a Alternative instance.
Let’s write this out. Our character primitive will be:
Note that because we’re working with functors, applicatives, alternatives, etc., all of our regular expressions can have an associated “result”. This is because our regexp values will have a type parameter (which is required for
Alternative). We can choose to ignore this type parameter, but we can also have some fun by using it to represent a “result” that a regexp match will be interpreted as. This is similar to the idea of “capturing” in industrial regexp applications.
Here, the value
Prim 'a' 1 :: Prim Int will represent a primitive that matches on the character
a, interpreting it with a result of
And now…we give it
Alternative structure using the Free Alternative, from the free package:
And that’s it! That’s our entire regular expression type! By giving a
Functor, we get all of the operations of
Alternative over our base. That’s because we have
instance Applicative (Alt f) and
instance Alternative (Alt f). We now have:
- The empty set, coming from
- The empty string, coming from
- The character primitive, coming from the underlying functor
Primthat we are enhancing
- The concatenation operation, from
- The alternating operation, from
- The kleene star, from
All of these additions to our primitive come “for free”!
Essentially, what a free structure gives us is the structure of the abstraction (
Alternative, here) automatically for our base type, and nothing else.
Remember that regular expressions have these operations, and nothing else — no more, no less. That’s exactly what the free Alternative gives us: these operations and the primitive. No more, no less.
After adding some convenient wrappers…we’re done here!
-- source: https://github.com/mstksg/inCode/tree/master/code-samples/misc/regexp.hs#L22-L33 -- | charAs: Parse a given character as a given constant result. charAs :: Char -> a -> RegExp a charAs c x = liftAlt (Prim c x) -- liftAlt lets us use the underlying -- functor Prim in RegExp -- | char: Parse a given character as itself. char :: Char -> RegExp Char char c = charAs c c -- | string: Parse a given string as itself. string :: String -> RegExp String string = traverse char -- neat, huh
Let’s try it out! Let’s match on
(a|b)(cd)*e and return
void from Data.Functor discards the results, since we only care if it matches or not. But we use
many exactly how we’d expect to concatenate and alternate things with
Or maybe more interesting (but slightly more complicated), let’s match on the same pattern and return how many
cds are repeated
This one does require a little more finesse with
<*: the arrows point towards which result to “keep”. And since
many (string "cd") :: RegExp [String] (it returns a list, with an item for each repetition), we can
fmap length to get the
Int result of “how many repetitions”.
However, we can also turn on -XApplicativeDo and write it using do notation, which requires a little less thought:
It’s all a little bit like how we often use “captures” in regular expressions to access a specific part of a match. Here’s an example in ruby:
except we also include a “post-processing” process to get the length of the number of repetitions.
Here’s another handy regexp that matches on a digit between 0 to 9, and the result is the digit
asum [x,y,z] = x <|> y <|> z: it represents a choice between the items in a list.
We can again do some fancy things with it, like make a regexp
\[\d\] that matches on a digit inside
Okay, so all we did was define a data structure that supports character matching, concatenation, alternation, and starring. Big whoop. What we really want to do is use it to parse things, right? How does the Free Alternative help us with that?
Well, it does a lot, actually. Let’s look at two ways of doing this!
What is Freeness?
The “canonical” way of using a free structure is by using its “folding” operation into a concrete structure with the proper instances. For example,
foldMap will turn a free monoid into a value of any monoid instance:
foldMap lifts an
a -> m into a
[a] -> m (or,
FreeMonoid a -> m), with a concrete monoid
m. The general idea is that using a free structure can “defer” the concretization from between the time of construction to the time of use.
For example, we can construct value in the free monoid made from integers:
And now we can decide how we want to interpret
<> — should it be
Or should it be
Or maybe even
The idea is that we can “defer” the choice of concrete
<> is interpreted under by first pushing 1, 2, 3, and 4 into a free monoid value. The free monoid on
Int gives exactly enough structure to
Int to do this job: no more, no less.
foldMap, we say “how to handle the base type”, and it lets us handle the free structure in its entirety by offloading the behavior of
<> to the concrete monoid.
Interpreting in State
In practice, getting a result from a free structure is often about finding (or creating) the right concrete
Alternative that gives us the behavior we want. In this case, we’re in luck. There’s a concrete
Alternative instance that works just the way we want:
StateT String Maybe:
<*>works by sequencing changes in state; in this case, we’ll consider the state as “characters yet to be parsed”, so sequential parsing fits perfectly with
<*>. That’s because combining regexps sequentially can be thought of as statefully chomping down on a string.
<|>works by backtracking and trying again if it runs into a failure. It saves the state of the last successful point and resets to it on failure. This is exactly how we want regexp alternation
The “folding” operation of the free alternative is called
And in the case of
RegExp, we have:
If you’re unfamiliar with the RankN type (the
forall b. stuff), there’s a nice introduction here. But basically, you just need to provide
runAlt with a function that can handle a
Prim b for any
b (and not just a specific one like
foldMap, we need to say “how to handle our base type”. Here, we have to answer “How do we handle
This lets us interpret a
Prim as a
StateT String Maybe action where the state is the “string left to be be processed”. Remember, a
Prim a contains the character we want to match on, and the
a value we want it to be interpreted as. To process a
- Get the state’s (the string left to be parsed) head and tail, using
get. If this match fails, backtrack.
guard, backtrack unless the head matches what the
- Set the state to be the original tail, using
put. This is because we parsed the head already, so now the “string left to be parsed” is just the original tail.
- The result is what the
Primsays it should be.
We can use this to write a function that matches the
RegExp on a prefix. We need to run the state action (using
evalStateT) on the string we want to parse:
And that’s it! Our first solution:
ghci> matchPrefix testRegexp_ "acdcdcde" Just () ghci> matchPrefix testRegexp_ "acdcdcdx" Nothing ghci> matchPrefix testRegexp "acdcdcde" Just 3 ghci> matchPrefix testRegexp "bcdcdcdcdcdcdcde" Just 7 ghci> matchPrefix digit "9" Just 9 ghci> matchPrefix bracketDigit "" Just 2 ghci> matchPrefix (many bracketDigit) "" Just [2,3,4,5] ghci> matchPrefix (sum <$> many bracketDigit) "" Just 14
Wait, what just happened?
Okay, so that might have happened a little quicker than you expected. One minute we were writing our primitive, and the next we had already finished. Here’s the entirety of the code, in a few lines of Haskell:
And now we have a fully functioning regexp parser? What happened?
From a high-level view, remember that
Alt Prim has, in its structure,
runAlt does is that it uses a given concrete
StateT String Maybe) to get the behavior of
many. But! As we can see from that list,
StateT does not have a built-in behavior for
Prim. And so, that’s where
processPrim comes in.
StateT String Maybe’s
So, really, 83% of the work was done for us by
Alternative instance, and the other 17% is in
Admittedly, this does feel a little disappointing, or at least anticlimactic. This makes us wonder: why even use
Alt in the first place? Why not just have
type RegExp = StateT String Maybe and write an appropriate
char :: Char -> StateT String Maybe Char? If
StateT does all of the work anyway, why even bother with
Alt, the free Alternative?
One major advantage we get from using
Alt is that
StateT is…pretty powerful. It’s actually stupid powerful. It can represent a lot of things…most troubling, it can represent things that are not regular expressions. For example, something as simple as
put "hello" :: StateT String Maybe () does not correspond to any regular expression.
So, while we can say that
Alt Prim corresponds to “regular expressions, nothing less and nothing more”, we cannot say the same about
StateT String Maybe.
Alt Prim contains a “perfect fit” representation of a regular expression data type. Everything it can express is a regular expression, and there is nothing it can express that isn’t a regular expression.2
Here, we can think of
StateT as the context that we use to interpret a
RegExp as a parser. But, there might be other ways we want to work with a
RegExp. For example, we might want to inspect it and “print” it out for inspection. This is something we can’t do with
We can’t say that
StateT String Maybe “is” a regular expression — only that it can represent a parser based on a regular expression. But we can say that about
Using the Free structure directly
Alright, that’s great and all. But what if we didn’t want to offload 83% of the behavior to a type that has already been written for us. Is there a way we can directly use the structure of
Alt itself to write our parser?
This is analogous to asking, what if we wanted to actually write a function on a list (by pattern matching on
) instead of always using
foldMap? Can we directly operate on the structure of the list instead of using
foldMap with some concrete monoid?
I’m glad you asked! Let’s look at the definition of the free alternative:
It’s a mutually recursive (and non-regular) type, so it might be a little confusing. One way to understand
Alt is that
Alt xs contains a list of alternatives, or a list of
<|>s. And each of those alternatives is an
AltF, which is a sequence of
f as chained by
<*> (as a chain of function applications).
You can essentially think of
AltF f a as a linked list
[f r], except with a different
r for each item.
Ap is cons (
:), containing the
f r, and
Pure is nil (
forall r. here is -XExistentialQuantification, and is what lets us have a different intermediate type for each item in our chain.
All in all,
Alt f is like a list (
Alt list) of lists (
AltF chains), which take turn alternating between alternative lists and application sequences. A list of chains of lists of chains of lists of chains …
In the big picture, you can think of
Alt f as a “normalized” form of successive or nested
<|>s, similar to how
[a] is a “normalized” form of successive
Ultimately we want to write a
RegExp a -> String -> Maybe a, which parses a string based on a
RegExp. We can do this using the most basic of all Haskell function-writing techniques: pattern matching on each case, and handling each case.
First, the top-level
Alt case. When faced with a list of chains, we can try to parse each one. The result is the first success.
asum :: [Maybe a] -> Maybe a finds the first
Just (success) in a list of attempts.
Now, we need to handle the chain case. To do this, we can pattern match on each constructor, and handle each case.
From here, it’s mostly “type tetris”! We just continually ask GHC what goes in what holes (and what types need to change) until we get something that typechecks.
In the end of the very mechanical process, we get:
-- source: https://github.com/mstksg/inCode/tree/master/code-samples/misc/regexp.hs#L71-L76 matchChain :: AltF Prim a -> String -> Maybe a matchChain (Ap (Prim c x) next) cs = case cs of  -> Nothing d:ds | c == d -> matchAlts (($ x) <$> next) ds | otherwise -> Nothing matchChain (Pure x) _ = Just x
Ap(analogous to cons,
:), it means we’re in the middle of the chain.
- If the input string is empty, then we fail to match.
- Otherwise, here’s the interesting thing. We have the
Prim rwith the character we want to match, the first letter in the string, and
next :: RegExp (r -> a), the next
RegExpin our sequential parsing chain.
- If the match is a success, we continue down the chain, to
next. We just need to massage the types a bit to make it all work out.
- Otherwise, it’s a failure. We’re done here.
- If the match is a success, we continue down the chain, to
Pure x(analogous to nil,
), it means we’re at the end of the chain. We return the result in
In the end though, you don’t really need to understand any of this too deeply in order to write this. Sure, it’s nice to understand what
AltF, etc. really “mean”. But, we don’t have to — the types take care of all of it for you :)
That should be good enough to implement another prefix parser:
ghci> matchAlts testRegexp_ "acdcdcde" Just () ghci> matchAlts testRegexp_ "acdcdcdx" Nothing ghci> matchAlts testRegexp "acdcdcde" Just 3 ghci> matchAlts testRegexp "bcdcdcdcdcdcdcde" Just 7 ghci> matchAlts digit "9" Just 9 ghci> matchAlts bracketDigit "" Just 2 ghci> matchAlts (many bracketDigit) "" Just [2,3,4,5] ghci> matchAlts (sum <$> many bracketDigit) "" Just 14
What did we do?
The two attempts here can be compared to using lists via
foldMap vs. using lists via pattern matching.
Because lists act as a free monoid, any list function can be written using
foldMap and not directly pattern matching. If this seems unbelievable to you, try finding a function that can’t — you might be surprised!
However, because lists are an algebraic data type, any list function can be written using straight up pattern matching on
One nice thing about list is that, no matter how you assemble it, it always ends up as a series of conses and nil. We say that the free monoid is normalizing. That is,
[1,2,3] <>  has the same representation as
 <> [2,3] <> . When we pattern match on
, we can’t distinguish between those two original methods of creation.
Alt is normalizing as well. An example of a possible variant that is not normalizing is:
This is how we might have written
RegExp, if we didn’t know about the free alternative. However, this representation is not normalizing, because we have two
RegExp a values that represent the same regexp:
These two match the same thing. But they have different representations. This representation not normalizing, since the same regexp can be expressed in two different ways.
Alt Prim is better because if two regexps are the same…they will correspond to the same
Alt Prim. It forces each value to exist in a “canonical” normalized form.
This means that when we eventually write our parsing function
matchAlts, we aren’t allowed to care about “how” the regexps are made. We aren’t allowed to distinguish between
a|(b|c). The normalizing property of
Alt means that we are forced to treat both of those the exact same way. It forces us to obey the laws about the structures of regular languages.
It’s easy to imagine a bug that might occur if we accidentally treated
(a|b)|c differently than
a|(b|c) — and indeed it sounds like an easy bug to accidentally make if we are using the non-normalizing representation.
Alt instead of rolling our own regexp expression not only enforces our integrity, but it also eliminates a huge class of potential bugs.
However, it should be noted that, while
Alt f is strongly normalizing with respect to Alternative structure,
Alt Prim isn’t strongly normalizing with respect to regular expressions structure in every single case. For example,
Alt Prim will still treat
a|a as different from
a. This is mostly because
Alt has to be “agnostic” to
f. But, like with all structural “type safety”, I always follow this rule of thumb: A lot of safety is better than no safety. This method can’t eliminate all bugs arising from this angle, but it can eliminate a whoooole lot.
Some subtle caveats
Before we conclude, let’s take some time to clarify a subtle point. Feel free to skip this whole section if you don’t care about the fact that these aren’t identical to the mathematical formalism of regular languages.
While we can use
RegExp just like a regular expression, the formal concept of regular expressions is actually slightly different, due to one pesky thing: laziness.
We really shouldn’t be too surprised, since laziness actually throws a wrench into a lot of Haskell abstractions that are based on math. For example, laziness is the reason that lists aren’t “true” mathematical free monoids.
The reason is that because of laziness and unbounded recursion, we can create an “infinite” regular language:
a|aa|aaa|aaaa|aaaaa|aaaaaa|..., forever and ever. However, infinite regular expressions aren’t allowed in the mathematical version. In Haskell, unfortunately, there is no way to “turn off” recursion: we’re stuck with it.
Even more unfortunately, this is actually how the
Alt encoding of the free alternative above implements
a* is implemented as
|a|aa|aaa|aaaa|aaaaa|..., infinitely. So the representation actually relies on laziness and infinite recursion to do its job. If you look at the contents of
many (char 'a'), you will see an infinite list.
“Haskell without recursion” is fine with
Alt for a “star-free language”, but it won’t cut it for a regular one.
For the purposes we talked about in this post, this doesn’t matter. However, this does create serious issues if we want to write a non-deterministic finite automata based parser (NFA), which is the de facto standard for implementing fast regexp parsers. We can only really generate an NFA if we have a finite value, which means anything using
many is out of the question.
Not all hope is lost, however. We can actually use the “final encoding” of
Alt, from Control.Alternative.Free.Final, to gain a
many that is non-recursive.
Using the final encoding means we lose the “pattern match” method, and can only use the
runAlt method. However, we can off-load to
Alternative instances that have non-recursive
many (like the
RE type from regex-applicative) that allows us to generate an NFA parser. While this still has issues because Haskell allows general recursion, at least
many in specific is no longer dependent on infinite structures.
There’s another interesting point to be made, however, regarding compatibility with NFAs. Even though this recursive encoding doesn’t allow us to create an explicit NFA (a full graph with nodes and transitions), it does allow us to make an implicit one. We can’t ever make an explicit NFA involving
many because the naive
many in the normal
Alt gives us an infinite data structure, so we get an infinite graph.
However, our implementation of
matchPrefix is actually an implicit NFA in the GHC runtime, where the ‘states’ can be considered function pointers on the heap. These pointers refer to other pointers, and the overall behavior works like an unoptimized NFA under the hood that’s assembled as we go. This circumvents the infinite data structure problem because GHC Haskell internally implements recursion by cycles in pointer structures.
Let’s wrap things up! As a cherry on top, we can write our final function to find matches anywhere inside a string by using
tails (which gives us all prefixes in a string) and
mapMaybe (which maps
matchPrefix on every prefix and keeps the successes). It’s also useful to write a function to get the first successful match, using
This is pretty efficient, due to how
matchPrefix short-circuits with
Nothing as soon as it fails, and how
listToMaybe short-circuits as soon as it finds a
Hopefully from this, you can see the value of free structures :)
- Given some base primitive, they give you exactly the structure you need — no more, no less.
- They let you work with your type safely, and then unsafely “run” it inside different useful contexts.
- They are normalizing, so you are not allowed to make “illegal” distinctions. This eliminates a class of bugs where you accidentally treat two cases differently when they are meant to be treated the same way.
Our journey from going to regular languages to
Alt Prim was to recognize that the structures involved in regular expressions matched an
Alternative interface “plus” some extra primitives — and then shifting our perspective to enriching a primitive with
Where can we go from here? First, try playing around with the sample code. One easy addition would be to add other types of primitives:
data Prim a = Only Char a -- ^ match a char with a given result | Letter a -- ^ match any letter with the same result | Digit (Int -> a) -- ^ match any digit, with a result computed from it | Wildcard (Char -> a) -- ^ match any char, with a computed result | Satisfy (Char -> Maybe a) -- ^ match potentially any char, based on a function
so we can support a lot of the basic character classes that many implementations of regular expressions support. Try this out in the sample code as an exercise!
One fun thing you can do also is to use our regexp type to generate a string that it would match on. Try doing this both in the
runAlt-based method and also the explicit pattern matching method!
Another interesting direction we can take, along the lines of build systems a la carte, is experimenting with different free structures to give rise to different types of languages/expressions. For example, if we use the free Applicative, we get a language that has only concatenation and empty strings and primitives, and no alternations. It’s like regular expressions with no
|, or basically only straight up matches. If we use the free Monad, we get a context-sensitive language with no backtracking. If we use the free MonadPlus, we get a context-sensitive language with backtracking. And if we use the (redundant) free Functor…we get a language that can parse a string of one and only one character. It’s nice that we get this sort of “a la carte” scaling system by our choice of free structure.
I hope that after working through this example, you will begin to start recognizing opportunities for using free structures everywhere you look! Once you start, it’s hard to stop :)
I am very humbled to be supported by an amazing community, who make it possible for me to devote time to researching and writing these posts. Very special thanks to my supporters at the “Amazing” level on patreon, Sam Stites and Josh Vera! :)