# Streaming Huffman Compression in Haskell (Part 2: Binary and Searches)

by Justin Le ♦

Source ♦ Markdown ♦ LaTeX ♦ Posted in Haskell, Tutorials ♦ Comments

Continuing on this series of beginner/intermediate projects for newer Haskell users, let’s look back at our Huffman encoding project.

In our last post we went over two types of binary trees implemented as algebraic data structures in Haskell, and also a scheme for assembling a Huffman encoding tree using the State monad.

Now let’s look at serializing and unserializing our prefix trees for easy storage, and then at actually using them to encode and decode!

## Binary

There are a couple of serialization libraries in Haskell; the dominant one is binary, but cereal is also not uncommon. The two diverge on several design points, and you can read up on them in the documentation for *cereal*. We’ll be using *binary* for the this tutorial; among many reasons, for its easy integration with the *pipes* library we will be working with later.

### The Easy Way

So let’s make `PreTree`

serialize/unserializable.

The easy way is to enable the `DeriveGeneric`

language extension on GHC, use `deriving (Generic)`

when we define our `PreTree`

, and then:

And…that’s it! We just auto-generated functions to serialize and deserialize our `PreTree`

s (if what they contain is itself serializable).

In real life, we would do this. However, for the sake of learning, let’s dig a bit more into the `Binary`

typeclass.

### The *Other* Easy Way

So the big crux of *binary* is the `Binary`

typeclass:

where `Put`

and `Get`

are sort of “instruction objects for putting/getting binary”. `Get`

is a monad, and `Put`

is a wrapped `PutM`

, which is a writer monad. (To be more specific, `Put`

is `PutM ()`

, because the final action has no result and only “writes”)

So `Binary`

things are things that you can serialize (with the instructions in `put`

) and deserialize (with the instructions in `get`

).

Luckily, because of Haskell’s great composition tools, assembling these instructions by hand are easy peasy!

#### Put

Let’s define our own custom `Put`

for our `PreTree`

s:

```
-- source: https://github.com/mstksg/inCode/tree/master/code-samples/huffman/PreTree.hs#L69-L76
-- interactive: https://www.fpcomplete.com/user/jle/huffman-encoding
putPT :: Binary a => PreTree a -> Put
putPT (PTLeaf x) = do
put True -- signify we have a leaf
put x
putPT (PTNode pt1 pt2) = do
put False -- signify we have a node
put pt1
put pt2
```

This all should be fairly readable and self-explanatory.

“To put a

`PTLeaf x`

, first put a flag saying you have a leaf, then put the value of`x`

.”“To put a

`PTNode pt1 pt2`

, first put a flag saying you have a node, then put both trees.”

Due to how monads and pattern matching work, the whole thing is pretty expressive, pleasant to read, and satisfying to write.

The only slightly annoying thing is that we subject ourselves to boolean blindness by using `True`

or `False`

; we have to keep track of what means what. Alternatively, we can create our own binary data types, `data PTType = IsNode | IsLeaf`

, and `put`

*that*, instead…but in this case it might not be so bad to live with boolean blindness for now.

#### Get

Now let’s define our own custom `Get`

:

```
-- source: https://github.com/mstksg/inCode/tree/master/code-samples/huffman/PreTree.hs#L79-L84
-- interactive: https://www.fpcomplete.com/user/jle/huffman-encoding
getPT :: Binary a => Get (PreTree a)
getPT = do
isLeaf <- get
if isLeaf
then PTLeaf <$> get
else PTNode <$> get <*> get
```

This also shouldn’t be too bad!

- “Get” the boolean flag, to tell you if you have a leaf or a node.
- If it’s a leaf, then
`get`

the data inside the leaf, and wrap it in a`PTLeaf`

. - If it’s not,
`get`

the next two`PreTree a`

’s, and put them both in a`PTNode`

.

The neat thing here is that `get`

is polymorphic in its return type. We know that the first `get`

expects a `Bool`

, so it knows to parse a `Bool`

. We know that the second `get`

expects an `a`

, so it knows to parse an `a`

. We know that the final two `get`

s both expect `PreTree a`

’s, so it nows what to parse for that too.

Hooray for type inference!

If you’re not familiar with the `f <$> x <*> y`

idiom, you can consider it to be the same thing as `f x y`

, except that `x`

and `y`

are “inside” things:

Where `(<$>)`

and `(<*>)`

come from `Control.Applicative`

. We call this style “applicative style”, in the biz.

#### Wrapping it up

And finally, to tie it all together:

```
-- source: https://github.com/mstksg/inCode/tree/master/code-samples/huffman/PreTree.hs#L36-L38
-- interactive: https://www.fpcomplete.com/user/jle/huffman-encoding
instance Binary a => Binary (PreTree a) where
put = putPT
get = getPT
```

### Testing it out

However way we decide to write our `Binary`

instance, let’s test it all out.

```
ghci> let (Just pt) = runBuildTree "hello world"
ghci> let encoded = encode pt
ghci> :t encoded
encoded :: ByteString -- a string of bytes
ghci> let decoded = decode encoded :: PreTree Char
ghci> decoded
PTNode (PTNode (PTNode (PTLeaf 'h')
(PTLeaf 'e')
)
(PTNode (PTLeaf 'w')
(PTLeaf 'r')
)
)
(PTNode (PTLeaf 'l')
(PTNode (PTNode (PTLeaf 'd')
(PTLeaf ' ')
)
(PTLeaf 'o')
)
)
ghci> decoded == t
True
```

Neat! We can also write it to a file and re-read:

```
ghci> encodeFile "test.dat" t
ghci> t' <- decodeFile "test.dat" :: IO (PreTree Char)
ghci> t'
PTNode (PTNode (PTNode (PTLeaf 'h')
(PTLeaf 'e')
)
(PTNode (PTLeaf 'w')
(PTLeaf 'r')
)
)
(PTNode (PTLeaf 'l')
(PTNode (PTNode (PTLeaf 'd')
(PTLeaf ' ')
)
(PTLeaf 'o')
)
)
ghci> t' == t
True
```

And this looks like it works pretty well!

## Encoding

Now that we’ve got that out of the way, let’s work on actually encoding and decoding.

So, basically, we encode a character in a huffman tree by path you take to reach the character.

Let’s represent this path as a list of `Direction`

s:

```
-- source: https://github.com/mstksg/inCode/tree/master/code-samples/huffman/PreTree.hs#L25-L30
-- interactive: https://www.fpcomplete.com/user/jle/huffman-encoding
data Direction = DLeft
| DRight
deriving (Show, Eq, Generic)
type Encoding = [Direction]
```

Eventually, an `Encoding`

will be turned into a `ByteString`

, with `DLeft`

representing the 0 bit and `DRight`

representing the 1 bit. But we keep them as their own data types now because everyone hates boolean blindness. Instead of keeping a `True`

or `False`

, we keep data types that actually carry semantic meaning :) And we can’t do silly things like use a boolean as a direction…what the heck? Why would you even want to do that? How is “true” a direction?

### Direct search

Here’s a naive recursive direct (depth-first) search.

```
-- source: https://github.com/mstksg/inCode/tree/master/code-samples/huffman/PreTree.hs#L90-L96
-- interactive: https://www.fpcomplete.com/user/jle/huffman-encoding
findPT :: Eq a => PreTree a -> a -> Maybe Encoding
findPT pt0 x = go pt0 []
where
go (PTLeaf y ) enc | x == y = Just (reverse enc)
| otherwise = Nothing
go (PTNode pt1 pt2) enc = go pt1 (DLeft : enc) <|>
go pt2 (DRight : enc)
```

The algorithm goes:

If you find a

`PTLeaf`

, if the data matches what you are looking for, return the current path in a`Just`

. If not, this is a dead-end; return`Nothing`

.If you find a

`PTNode`

, search the left branch adding a`DLeft`

to the current path, and the right branch adding a`DRight`

to the current path. Use`(<|>)`

to perform the search lazily (ie, stop after the first success).

```
ghci> let pt = runBuildTree "hello world"
ghci> findPT pt 'e'
Just [DLeft, DLeft, DRight]
ghci> findPT pt 'q'
Nothing
```

While it is clearly horribly inefficient, it does serve as a nice clean example of a depth-first search (which exits as soon as it finds the goal), and probably a nice reference implementation for us to reference later.

Its inefficiency lies in many things — chiefly of those being the fact that Huffman trees don’t give you any real help as a search tree, and nothing short of a full depth-first traversal would work. Also, you probably don’t want to do this every time you want to encode something; you’d want to have some sort of memoizing and caching, ideally.

### Pre-searching

We can sort of “solve” both of these problems this by traversing through our `PreTree`

and adding an entry to a `Map`

at every leaf. This fixes our repetition problem by memoizing all of our results into a map…and it fixes our search problem because `Map`

s are an ordered binary search tree with efficient O(log n) lookups.^{1}

There are many ways to do this; my favorite right now is to do it by doing collapsing our tree into one giant map, using the Monoid instance of `Map`

.

Basically, we turn each of our leaves into little `Map`

s, and then “combine” them all, using `(<>)`

, which “combines” or merges two `Map k v`

’s, using their Monoid instance:

```
-- source: https://github.com/mstksg/inCode/tree/master/code-samples/huffman/PreTree.hs#L101-L106
-- interactive: https://www.fpcomplete.com/user/jle/huffman-encoding
ptTable :: Ord a => PreTree a -> Map a Encoding
ptTable pt = go pt []
where
go (PTLeaf x) enc = x `M.singleton` reverse enc
go (PTNode pt1 pt2) enc = go pt1 (DLeft : enc) <>
go pt2 (DRight : enc)
```

We do some sort of fancy depth-first “map” over all of the leaves, keeping track of how deep we are. Then we combine it all as we go along with `<>`

.

Note how it is almost identical in structure to `findPT`

:

```
-- source: https://github.com/mstksg/inCode/tree/master/code-samples/huffman/PreTree.hs#L90-L96
-- interactive: https://www.fpcomplete.com/user/jle/huffman-encoding
findPT :: Eq a => PreTree a -> a -> Maybe Encoding
findPT pt0 x = go pt0 []
where
go (PTLeaf y ) enc | x == y = Just (reverse enc)
| otherwise = Nothing
go (PTNode pt1 pt2) enc = go pt1 (DLeft : enc) <|>
go pt2 (DRight : enc)
```

Except instead of doing a “short-circuit combination” with `(<|>)`

, we do a “full combination” with `(<>)`

.

### Lookup, Act 2

So now that we have our lookup table, our new lookup/find function is both simple and performant:

```
-- source: https://github.com/mstksg/inCode/tree/master/code-samples/huffman/PreTree.hs#L110-L111
-- interactive: https://www.fpcomplete.com/user/jle/huffman-encoding
lookupPTTable :: Ord a => Map a Encoding -> a -> Maybe Encoding
lookupPTTable = flip M.lookup
```

given, of course, that we generate our table first.

```
ghci> let pt = runBuildTree "hello world"
ghci> let tb = fmap ptTable pt
ghci> tb >>= \tb' -> lookupPTTable tb' 'e'
Just [DLeft, DLeft, DRight]
ghci> tb >>= \tb' -> lookupPTTable tb' 'q'
Nothing
```

(Here we use the Monad instance for Maybe, to extract the `tb'`

out of the `Just tb`

. We “sequence” two Maybe’s together. For more information, check out my blog post on this exact topic)

### Encoding many

Now, we’d like to be able to decode an entire stream of `a`

’s, returning a list of the encodings.

```
-- source: https://github.com/mstksg/inCode/tree/master/code-samples/huffman/PreTree.hs#L114-L117
-- interactive: https://www.fpcomplete.com/user/jle/huffman-encoding
encodeAll :: Ord a => PreTree a -> [a] -> Maybe Encoding
encodeAll pt xs = concat <$> sequence (map (lookupPTTable tb) xs)
where
tb = ptTable pt
```

This is a bit dense! But I’m sure that you are up for it.

First, we build the lookup table and call it

`tb`

.Then, we map

`lookupPTTable tb`

over our list`xs`

, to get a list of type`[Maybe Encoding]`

.Then, we use

`sequence`

, which in our case is`[Maybe a] -> Maybe [a]`

. It turns a list of Maybe’s into a list inside a Maybe. Recall the semantics of the Maybe monad: If you ever encounter a`Nothing`

, the*whole thing*is a`Nothing`

. So in this case, if*any*of the inputs are not decodable,*the entire thing is Nothing*.Note that the standard libraries provide a synonym for

`sequence . map`

—`mapM`

. So we could have written it as`mapM (lookupPTTable t) xs`

…but that is significantly less clear/immediately understandable.Recall that our

`sequence`

left us with a`Maybe [Encoding]`

…but we only want`Maybe Encoding`

. So we can use`(<$>)`

to`concat`

all of the`Encoding`

s inside the Maybe.

```
ghci> let pt = runBuildTree "hello world" -- :: Maybe (PreTree Char)
ghci> pt >>= \pt' -> encodeAll pt' "hello world"
Just [DLeft, DLeft, DLeft, DLeft, DLeft, DRight, DRight, DLeft, DRight, DLeft,
DRight, DRight, DRight, DRight, DRight, DLeft, DRight, DLeft, DRight, DLeft,
DRight, DRight, DRight, DLeft, DRight, DRight, DRight, DLeft, DRight, DRight,
DLeft, DLeft]
ghci> pt >>= \pt' -> encodeAll pt' "hello worldq"
Nothing
```

Welp, that’s half the battle!

## Decoding

For huffman trees, decoding is the much simpler process. Simply traverse down the tree using the given encoding and return a value whenever you reach a leaf.

```
-- source: https://github.com/mstksg/inCode/tree/master/code-samples/huffman/PreTree.hs#L123-L128
-- interactive: https://www.fpcomplete.com/user/jle/huffman-encoding
decodePT :: PreTree a -> Encoding -> Maybe (a, Encoding)
decodePT (PTLeaf x) ds = Just (x, ds)
decodePT (PTNode pt1 pt2) (d:ds) = case d of
DLeft -> decodePT pt1 ds
DRight -> decodePT pt2 ds
decodePT (PTNode _ _) [] = Nothing
```

The logic should seem pretty familiar. The main algorithm involves going down the tree, “following” the direction list. If you reach a leaf, then you have found something (and return the directions you haven’t followed yet). If you run out of directions while on a node…something has gone wrong.

```
ghci> do pt <- runBuildTree "hello world"
| enc <- encodeAll pt "hello world"
| decodePT pt enc
Just ('h', [DLeft, DLeft ...])
```

(Here we are using the Maybe monad, in order to “stitch together” three possibly-failing operations in a row. We call `pt`

and `enc`

the values “inside” the `Just pt`

and `Just enc`

returned by `runBuildTree`

and `encodeAll`

; the whole thing fails if any of the steps fail at any time. If you are not familiar with this, I sort of literally wrote an entire blog post on this subject :) )

### Decoding many

We’d like to repeatedly iterate this until we have consumed our entire encoding.

Basically, starting with a list of encodings, we want to continually chop it up and build a list from it.

This sounds a lot like the `Data.List`

function `unfoldr`

:

`unfoldr`

makes a list by applying your function repeatedly to a “de-cumulator”, carrying the state of the decumulator, and stopping when your function returns `Nothing`

. You can think of it as the “opposite” of `foldr`

.

Using `unfoldr`

, we can write a `decodeAll`

:

```
-- source: https://github.com/mstksg/inCode/tree/master/code-samples/huffman/PreTree.hs#L132-L133
-- interactive: https://www.fpcomplete.com/user/jle/huffman-encoding
decodeAll :: PreTree a -> Encoding -> [a]
decodeAll pt = unfoldr (decodePT pt)
```

```
ghci> do pt <- runBuildTree "hello world"
| enc <- encodeAll pt "hello world"
| return (decodeAll pt enc)
```

Which works exactly as we’d like!

## Testing

We can write a utility to test our encoding/decoding functions:

```
-- source: https://github.com/mstksg/inCode/tree/master/code-samples/huffman/Huffman.hs#L106-L113
-- interactive: https://www.fpcomplete.com/user/jle/huffman-encoding
testTree :: Ord a => [a] -> [a]
testTree [] = [] -- handle the empty list
testTree xs = decoded
where
Just decoded = do
pt <- runBuildTree xs
enc <- encodeAll pt xs
return (decodeAll pt enc)
```

(Again, refer to my MonadPlus article from earlier, if you are unfamiliar with working with the Maybe monad)

`testTree`

should be an identity; that is, `testTree xs === xs`

.

```
ghci> testTree "hello world"
"hello world"
ghci> testTree "the quick brown fox jumps over the lazy dog"
"the quick brown fox jumps over the lazy dog"
```

Note the very unsafe irrefutable pattern match on `Just decoded`

. We’ll fix this later!

### QuickCheck

Now that we have a neat proposition, we can use `quickcheck`

on it, from the great QuickCheck library. `quickcheck`

will basically test our proposition `testTree xs == xs`

by generating several random `xs`

’s.

```
ghci> import Test.QuickCheck
ghci> :set -XScopedTypeVariables
ghci> quickCheck (\(xs :: String) -> testTree xs == xs)
*** Failed! Falsifiable (after 3 tests and 2 shrinks):
"a"
```

#### Failure!

Oh! We failed? And on such a simple case? What happened?

If we look at how `"a"`

is encoded, it’ll become apparent:

```
ghci> let (Just pt) = runBuildTree "aaa"
ghci> pt
PTLeaf 'a'
ghci> findPT pt 'a'
Just []
ghci> encodeAll pt "aaaaaaaaaaa"
Just []
```

Ah. Well, that’s a problem. Basically, our input string has “zero” entropy, according to typical measurements. So we cannot naively huffman encode it.

#### Success!

There are a few ways to deal with this. The most “immediate” way would be to realize that `decodeAll`

is partial (that is, it does not terminate/is undefined on some of its inputs), and will actually never terminate if the given tree is a singleton tree. We can write a “safe” `decodeAll`

:

```
-- source: https://github.com/mstksg/inCode/tree/master/code-samples/huffman/PreTree.hs#L137-L139
-- interactive: https://www.fpcomplete.com/user/jle/huffman-encoding
decodeAll' :: PreTree a -> Encoding -> Maybe [a]
decodeAll' (PTLeaf _) _ = Nothing
decodeAll' pt enc = Just $ unfoldr (decodePT pt) enc
```

In doing this, we don’t exactly “fix” the problem…we only defer responsibility. Now, whoever uses `decodeAll'`

(like our eventual encoding interface) is *forced to handle the error* (by handing the `Nothing`

case). In this way, *the type system enforces safety*. Had we always used the unsafe `decodeAll`

, then whoever uses it eventually has to “manually remember” to handle the unterminating case, by carefully reading documentation or something. In this case, the type system is a big, explicit reminder saying “hey, deal with the unterminating case.”

We’ll also a “safe” `testTree`

:

```
-- source: https://github.com/mstksg/inCode/tree/master/code-samples/huffman/Huffman.hs#L117-L121
-- interactive: https://www.fpcomplete.com/user/jle/huffman-encoding
testTree' :: Ord a => [a] -> Maybe [a]
testTree' xs = do
pt <- runBuildTree xs
enc <- encodeAll pt xs
decodeAll' pt enc
```

So we can now quickcheck:

```
ghci> quickCheck (\(xs :: String) -> testTree' xs `elem` [Nothing, Just xs])
+++ OK, passed 100 tests.
```

Hooray!

### Re: Testing

All I’ll admit that I didn’t even anticipate the degenerate singleton tree case until I decided to add a quickcheck section to this post. It just goes to show that you should always test! And it also shows how easy it is to write tests in quickcheck. One line could mean five unit tests, and you might even test edge/corner cases that you might have never even thought about!

For example, we probably should have tested `lookupPTTable`

against `findPT`

, our reference implementation :) We should have also tested our binary encode/decode!

## Next Time

We’re almost there!

For our last section, we are going to be focusing on pulling it all together to make a streaming compression/decompression interface that will be able to read a file and encode/decode into a new file as it goes, in constant memory, using pipes. After that, we will also be looking at how to profile code, applying some optimization tricks we can do to get things just right, and other things to wrap up.

Note — this section was largely rewritten; it used to contain a rather involved yet misled tutorial about the Writer monad, as suggested by old links/titles. This can still be found here, if you want to read through it.↩︎