Streaming Huffman Compression in Haskell (Part 2: Binary and Searches)



Continuing on this series of beginner/intermediate projects for newer Haskell users, let’s look back at our Huffman encoding project.

In our last post we went over two types of binary trees implemented as algebraic data structures in Haskell, and also a scheme for assembling a Huffman encoding tree using the State monad.

Now let’s look at serializing and unserializing our prefix trees for easy storage, and then at actually using them to encode and decode!

Binary

There are a couple of serialization libraries in Haskell; the dominant one is binary, but cereal is also not uncommon. The two diverge on several design points, and you can read up on them in the documentation for cereal. We’ll be using binary for this tutorial; among other reasons, it integrates easily with the pipes library we will be working with later.

The Easy Way

So let’s make PreTree serializable and deserializable.

The easy way is to enable the DeriveGeneric language extension on GHC, use deriving (Generic) when we define our PreTree, and then:
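Here is a minimal sketch of that route. It assumes PreTree has the PTLeaf/PTNode shape from Part 1 (redeclared here so the snippet stands alone); roundTrip is a hypothetical helper added only for illustration:

```haskell
{-# LANGUAGE DeriveGeneric #-}

import Data.Binary (Binary, decode, encode)
import GHC.Generics (Generic)

-- Assumed shape of PreTree from Part 1, with Generic added to the deriving list.
data PreTree a = PTLeaf a
               | PTNode (PreTree a) (PreTree a)
               deriving (Show, Eq, Generic)

-- An empty instance body: put and get fall back to the Generic-based defaults.
instance Binary a => Binary (PreTree a)

-- Hypothetical helper: serialize to a lazy ByteString and read it straight back.
roundTrip :: PreTree Char -> PreTree Char
roundTrip = decode . encode
```

If everything is wired up correctly, roundTrip should hand back the same tree it was given.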

And…that’s it! We just auto-generated functions to serialize and deserialize our PreTrees (if what they contain is itself serializable).

In real life, we would do this. However, for the sake of learning, let’s dig a bit more into the Binary typeclass.

The Other Easy Way

So the big crux of binary is the Binary typeclass:

where Put and Get are sort of “instruction objects for putting/getting binary”. Get is a monad, and Put is a wrapped PutM, which is a writer monad. (To be more specific, Put is PutM (), because the final action has no result and only “writes”)

So Binary things are things that you can serialize (with the instructions in put) and deserialize (with the instructions in get).
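To make the “instruction objects” idea concrete, here is a small sketch using the existing Binary instance for Bool. The class methods have the types put :: t -> Put and get :: Get t; runPut and runGet are what actually carry the instructions out. The names boolInstructions, boolBytes, readBool and boolAgain are made up for this example:

```haskell
import Data.Binary (Binary (get, put), Get, Put)
import Data.Binary.Get (runGet)
import Data.Binary.Put (runPut)
import qualified Data.ByteString.Lazy as BL

-- put only builds the instructions for writing a value...
boolInstructions :: Put
boolInstructions = put True

-- ...and runPut carries them out, producing bytes.
boolBytes :: BL.ByteString
boolBytes = runPut boolInstructions

-- get is the matching set of instructions for reading a value back...
readBool :: Get Bool
readBool = get

-- ...and runGet carries those out against some bytes.
boolAgain :: Bool
boolAgain = runGet readBool boolBytes
```

Nothing is written or read until runPut or runGet is called; until then, Put and Get values are just descriptions that we can combine.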

Luckily, because of Haskell’s great composition tools, assembling these instructions by hand is easy peasy!

Put

Let’s define our own custom Put for our PreTrees:

This all should be fairly readable and self-explanatory.

  • “To put a PTLeaf x, first put a flag saying you have a leaf, then put the value of x.”

  • “To put a PTNode pt1 pt2, first put a flag saying you have a node, then put both trees.”
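The two rules above might be written out like this. Treat it as a sketch: PreTree is redeclared with the shape assumed from Part 1, the choice of True for “leaf” is just the convention used in this example, and exampleBytes is a hypothetical helper for peeking at the output:

```haskell
import Data.Binary (Binary (put), Put)
import Data.Binary.Put (runPut)
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word8)

-- Assumed shape of PreTree from Part 1.
data PreTree a = PTLeaf a
               | PTNode (PreTree a) (PreTree a)
               deriving (Show, Eq)

putPT :: Binary a => PreTree a -> Put
putPT (PTLeaf x) = do
    put True            -- flag: this is a leaf
    put x               -- then the value itself
putPT (PTNode pt1 pt2) = do
    put False           -- flag: this is a node
    putPT pt1           -- then both subtrees, left first
    putPT pt2

-- Hypothetical helper: the raw bytes for a tiny two-leaf tree.
exampleBytes :: [Word8]
exampleBytes = BL.unpack (runPut (putPT tree))
  where
    tree = PTNode (PTLeaf 7) (PTLeaf 9) :: PreTree Word8
```

With Word8 leaves the output is easy to read by eye: exampleBytes comes out as [0,1,7,1,9], that is, a node flag followed by two (leaf flag, value) pairs.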

Due to how monads and pattern matching work, the whole thing is pretty expressive, pleasant to read, and satisfying to write.

The only slightly annoying thing is that we subject ourselves to boolean blindness by using True or False; we have to keep track of what means what. Alternatively, we can create our own binary data types, data PTType = IsNode | IsLeaf, and put that, instead…but in this case it might not be so bad to live with boolean blindness for now.

Get

Now let’s define our own custom Get:

This also shouldn’t be too bad!

  • “Get” the boolean flag, to tell you if you have a leaf or a node.
  • If it’s a leaf, then get the data inside the leaf, and wrap it in a PTLeaf.
  • If it’s not, get the next two PreTree a’s, and put them both in a PTNode.
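A sketch of those three steps follows. Since this snippet has no Binary instance for PreTree in scope, it recurses through getPT directly for the subtrees rather than through get; PreTree is the shape assumed from Part 1, and exampleTree is a hypothetical value that parses some hand-written bytes:

```haskell
import Data.Binary (Binary (get), Get)
import Data.Binary.Get (runGet)
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word8)

-- Assumed shape of PreTree from Part 1.
data PreTree a = PTLeaf a
               | PTNode (PreTree a) (PreTree a)
               deriving (Show, Eq)

getPT :: Binary a => Get (PreTree a)
getPT = do
    isLeaf <- get                        -- inferred to be a Bool by the `if`
    if isLeaf
      then PTLeaf <$> get                -- a leaf: read the value inside it
      else PTNode <$> getPT <*> getPT    -- a node: read two subtrees

-- Hypothetical example: a node flag, then two (leaf flag, value) pairs.
exampleTree :: PreTree Word8
exampleTree = runGet getPT (BL.pack [0, 1, 7, 1, 9])
```

Here exampleTree evaluates to PTNode (PTLeaf 7) (PTLeaf 9).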

The neat thing here is that get is polymorphic in its return type. We know that the first get expects a Bool, so it knows to parse a Bool. We know that the second get expects an a, so it knows to parse an a. We know that the final two gets both expect PreTree a’s, so it knows what to parse for that too.

Hooray for type inference!

If you’re not familiar with the f <$> x <*> y idiom, you can consider it to be the same thing as f x y, except that x and y are “inside” things:

Where (<$>) and (<*>) come from Control.Applicative. We call this style “applicative style”, in the biz.
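As a quick illustration of applicative style, here is the same function applied plainly and applied “inside” Maybe. The names plainSum, insideSum and missingSum are just for this example:

```haskell
-- Plain function application.
plainSum :: Int
plainSum = (+) 1 2

-- The same function, applied to arguments that live inside a Maybe.
insideSum :: Maybe Int
insideSum = (+) <$> Just 1 <*> Just 2

-- If either argument is missing, so is the result.
missingSum :: Maybe Int
missingSum = (+) <$> Just 1 <*> Nothing
```

plainSum is 3, insideSum is Just 3, and missingSum is Nothing.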

Wrapping it up

And finally, to tie it all together:
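A sketch of how the pieces might fit together, under the same assumptions as before (PreTree’s shape from Part 1, True as the leaf flag). Once the instance exists, putPT and getPT can recurse through plain put and get; roundTrip is a hypothetical helper for checking the result:

```haskell
import Data.Binary (Binary (get, put), Get, Put, decode, encode)

-- Assumed shape of PreTree from Part 1.
data PreTree a = PTLeaf a
               | PTNode (PreTree a) (PreTree a)
               deriving (Show, Eq)

putPT :: Binary a => PreTree a -> Put
putPT (PTLeaf x)       = put True  >> put x
putPT (PTNode pt1 pt2) = put False >> put pt1 >> put pt2

getPT :: Binary a => Get (PreTree a)
getPT = do
    isLeaf <- get
    if isLeaf
      then PTLeaf <$> get
      else PTNode <$> get <*> get   -- these gets use the instance below

-- The instance simply delegates to our hand-written functions.
instance Binary a => Binary (PreTree a) where
    put = putPT
    get = getPT

-- Hypothetical helper: encode to a lazy ByteString and decode it again.
roundTrip :: PreTree Char -> PreTree Char
roundTrip = decode . encode
```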

Testing it out

Whichever way we decide to write our Binary instance, let’s test it all out.

Neat! We can also write it to a file and re-read:
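For the file round trip, binary provides encodeFile and decodeFile. The sketch below uses the Generic-derived instance to stay short; saveAndReload is a hypothetical helper, and the caller picks the path:

```haskell
{-# LANGUAGE DeriveGeneric #-}

import Data.Binary (Binary, decodeFile, encodeFile)
import GHC.Generics (Generic)

-- Assumed shape of PreTree from Part 1.
data PreTree a = PTLeaf a
               | PTNode (PreTree a) (PreTree a)
               deriving (Show, Eq, Generic)

instance Binary a => Binary (PreTree a)

-- Hypothetical helper: write the tree to disk, then read it back.
saveAndReload :: FilePath -> PreTree Char -> IO (PreTree Char)
saveAndReload path pt = do
    encodeFile path pt
    decodeFile path
```

Note that decodeFile needs to know which type to parse; here the type signature of saveAndReload pins it down.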

And this looks like it works pretty well!

Encoding

Now that we’ve got that out of the way, let’s work on actually encoding and decoding.

So, basically, we encode a character in a Huffman tree as the path you take to reach that character.

Let’s represent this path as a list of Directions:

Eventually, an Encoding will be turned into a ByteString, with DLeft representing the 0 bit and DRight representing the 1 bit. But we keep them as their own data types now because everyone hates boolean blindness. Instead of keeping a True or False, we keep data types that actually carry semantic meaning :) And we can’t do silly things like use a boolean as a direction…what the heck? Why would you even want to do that? How is “true” a direction?
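A minimal sketch of these types is below. Direction and Encoding follow the names used in the text; directionBit is a hypothetical helper (not something this post defines) that only spells out the 0/1 correspondence mentioned above:

```haskell
-- One step down the tree.
data Direction = DLeft
               | DRight
               deriving (Show, Eq)

-- A path from the root to a leaf.
type Encoding = [Direction]

-- Hypothetical helper: the bit each direction will eventually become.
directionBit :: Direction -> Int
directionBit DLeft  = 0
directionBit DRight = 1
```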

Here’s a naive recursive direct (depth-first) search.

The algorithm goes:

  1. If you find a PTLeaf, if the data matches what you are looking for, return the current path in a Just. If not, this is a dead-end; return Nothing.

  2. If you find a PTNode, search the left branch adding a DLeft to the current path, and the right branch adding a DRight to the current path. Use (<|>) to perform the search lazily (ie, stop after the first success).
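The two steps above can be sketched as follows. This is one possible way to write it, assuming the PreTree shape from Part 1; the path is accumulated in reverse and flipped at the leaf, which is an implementation choice of this sketch:

```haskell
import Control.Applicative ((<|>))

-- Assumed shape of PreTree from Part 1.
data PreTree a = PTLeaf a
               | PTNode (PreTree a) (PreTree a)
               deriving (Show, Eq)

data Direction = DLeft | DRight deriving (Show, Eq)
type Encoding = [Direction]

findPT :: Eq a => PreTree a -> a -> Maybe Encoding
findPT pt0 x = go pt0 []
  where
    -- enc holds the path so far, most recent step first.
    go (PTLeaf y) enc
      | x == y    = Just (reverse enc)   -- found it: return the path
      | otherwise = Nothing              -- dead end
    go (PTNode pt1 pt2) enc =
      -- For Maybe, (<|>) only evaluates the right side if the left is Nothing.
      go pt1 (DLeft : enc) <|> go pt2 (DRight : enc)
```

For a tree with 'a' on the left and a node holding 'b' and 'c' on the right, looking up 'c' gives Just [DRight,DRight], and looking up something absent gives Nothing.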

While it is clearly horribly inefficient, it does serve as a nice clean example of a depth-first search (which exits as soon as it finds the goal), and probably a nice reference implementation for us to check against later.

Its inefficiency lies in many things, chief among them the fact that Huffman trees don’t give you any real help as a search tree, and nothing short of a full depth-first traversal would work. Also, you probably don’t want to do this every time you want to encode something; you’d want to have some sort of memoization and caching, ideally.

Pre-searching

We can sort of “solve” both of these problems by traversing through our PreTree and adding an entry to a Map at every leaf. This fixes our repetition problem by memoizing all of our results into a map…and it fixes our search problem because a Map is an ordered binary search tree with efficient O(log n) lookups.[1]

There are many ways to do this; my favorite right now is to collapse our tree into one giant map, using the Monoid instance of Map.

Basically, we turn each of our leaves into little Maps, and then “combine” them all, using (<>), which “combines” or merges two Map k v’s, using their Monoid instance:

We do some sort of fancy depth-first “map” over all of the leaves, keeping track of how deep we are. Then we combine it all as we go along with <>.
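One way this could look is sketched below, assuming the PreTree shape from Part 1 and a GHC recent enough that (<>) is exported from the Prelude. The name ptTable is the one this sketch uses for the table builder:

```haskell
import Data.Map (Map)
import qualified Data.Map as M

-- Assumed shape of PreTree from Part 1.
data PreTree a = PTLeaf a
               | PTNode (PreTree a) (PreTree a)
               deriving (Show, Eq)

data Direction = DLeft | DRight deriving (Show, Eq)
type Encoding = [Direction]

ptTable :: Ord a => PreTree a -> Map a Encoding
ptTable pt = go pt []
  where
    -- Each leaf becomes a one-entry Map from its value to the path taken.
    go (PTLeaf x) enc = M.singleton x (reverse enc)
    -- Each node merges the Maps built from its two subtrees.
    go (PTNode pt1 pt2) enc = go pt1 (DLeft : enc) <> go pt2 (DRight : enc)
```

For Map, (<>) is a left-biased union; since each value should appear at only one leaf of a Huffman tree, the bias never comes into play here.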

Note how it is almost identical in structure to findPT:

Except instead of doing a “short-circuit combination” with (<|>), we do a “full combination” with (<>).

Lookup, Act 2

So now that we have our lookup table, our new lookup/find function is both simple and performant:

given, of course, that we generate our table first.
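With a table in hand, the lookup can be little more than Map’s own lookup with the arguments flipped. A sketch, with ptTable repeated from the previous idea so the snippet stands alone (both under the same assumptions about PreTree):

```haskell
import Data.Map (Map)
import qualified Data.Map as M

-- Assumed shape of PreTree from Part 1.
data PreTree a = PTLeaf a
               | PTNode (PreTree a) (PreTree a)
               deriving (Show, Eq)

data Direction = DLeft | DRight deriving (Show, Eq)
type Encoding = [Direction]

-- Build the lookup table once, by walking the whole tree.
ptTable :: Ord a => PreTree a -> Map a Encoding
ptTable pt = go pt []
  where
    go (PTLeaf x) enc       = M.singleton x (reverse enc)
    go (PTNode pt1 pt2) enc = go pt1 (DLeft : enc) <> go pt2 (DRight : enc)

-- Table first, key second, so it partially applies nicely over a list.
lookupPTTable :: Ord a => Map a Encoding -> a -> Maybe Encoding
lookupPTTable = flip M.lookup
```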

(Here we use the Monad instance for Maybe, to extract the tb' out of the Just tb. We “sequence” two Maybe’s together. For more information, check out my blog post on this exact topic)

Encoding many

Now, we’d like to be able to encode an entire stream of a’s, returning a list of the encodings.

This is a bit dense! But I’m sure that you are up for it.

  1. First, we build the lookup table and call it tb.

  2. Then, we map lookupPTTable tb over our list xs, to get a list of type [Maybe Encoding].

  3. Then, we use sequence, which in our case is [Maybe a] -> Maybe [a]. It turns a list of Maybe’s into a list inside a Maybe. Recall the semantics of the Maybe monad: If you ever encounter a Nothing, the whole thing is a Nothing. So in this case, if any of the inputs are not encodable, the entire thing is Nothing.

    Note that the standard libraries provide a synonym for sequence . map: mapM. So we could have written it as mapM (lookupPTTable tb) xs…but that is significantly less clear/immediately understandable.

  4. Recall that our sequence left us with a Maybe [Encoding]…but we only want Maybe Encoding. So we can use (<$>) to concat all of the Encodings inside the Maybe.
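The four steps above can be sketched as a single definition. The helpers ptTable and lookupPTTable are repeated so that the snippet compiles on its own, and everything rests on the PreTree shape assumed from Part 1:

```haskell
import Data.Map (Map)
import qualified Data.Map as M

-- Assumed shape of PreTree from Part 1.
data PreTree a = PTLeaf a
               | PTNode (PreTree a) (PreTree a)
               deriving (Show, Eq)

data Direction = DLeft | DRight deriving (Show, Eq)
type Encoding = [Direction]

ptTable :: Ord a => PreTree a -> Map a Encoding
ptTable pt = go pt []
  where
    go (PTLeaf x) enc       = M.singleton x (reverse enc)
    go (PTNode pt1 pt2) enc = go pt1 (DLeft : enc) <> go pt2 (DRight : enc)

lookupPTTable :: Ord a => Map a Encoding -> a -> Maybe Encoding
lookupPTTable = flip M.lookup

encodeAll :: Ord a => PreTree a -> [a] -> Maybe Encoding
encodeAll pt xs = concat <$> sequence (map (lookupPTTable tb) xs)
  where
    tb = ptTable pt   -- step 1: build the table once
    -- step 2: map the lookup over xs       :: [Maybe Encoding]
    -- step 3: sequence                     :: Maybe [Encoding]
    -- step 4: concat inside the Maybe      :: Maybe Encoding
```

With 'a' on the left and a node of 'b' and 'c' on the right, encoding "abc" gives Just [DLeft,DRight,DLeft,DRight,DRight], while any input containing a character missing from the tree gives Nothing.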

Welp, that’s half the battle!

Decoding

For Huffman trees, decoding is the much simpler process. Simply traverse down the tree using the given encoding and return a value whenever you reach a leaf.

The logic should seem pretty familiar. The main algorithm involves going down the tree, “following” the direction list. If you reach a leaf, then you have found something (and return the directions you haven’t followed yet). If you run out of directions while on a node…something has gone wrong.
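A sketch of that logic, assuming the PreTree shape from Part 1. The function is called decodePT here; it returns the decoded value together with whatever directions are left over:

```haskell
-- Assumed shape of PreTree from Part 1.
data PreTree a = PTLeaf a
               | PTNode (PreTree a) (PreTree a)
               deriving (Show, Eq)

data Direction = DLeft | DRight deriving (Show, Eq)
type Encoding = [Direction]

decodePT :: PreTree a -> Encoding -> Maybe (a, Encoding)
-- Reached a leaf: we found something; hand back the unused directions too.
decodePT (PTLeaf x) ds = Just (x, ds)
-- On a node with directions left: follow the next one.
decodePT (PTNode pt1 pt2) (d : ds) =
    case d of
      DLeft  -> decodePT pt1 ds
      DRight -> decodePT pt2 ds
-- On a node with no directions left: something has gone wrong.
decodePT (PTNode _ _) [] = Nothing
```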

(Here we are using the Maybe monad, in order to “stitch together” three possibly-failing operations in a row. We call pt and enc the values “inside” the Just pt and Just enc returned by runBuildTree and encodeAll; the whole thing fails if any of the steps fail at any time. If you are not familiar with this, I sort of literally wrote an entire blog post on this subject :) )

Decoding many

We’d like to repeatedly iterate this until we have consumed our entire encoding.

Basically, starting with a list of encodings, we want to continually chop it up and build a list from it.

This sounds a lot like the Data.List function unfoldr:

unfoldr makes a list by applying your function repeatedly to a “de-cumulator”, carrying the state of the decumulator, and stopping when your function returns Nothing. You can think of it as the “opposite” of foldr.
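For reference, unfoldr has the type (b -> Maybe (a, b)) -> b -> [a]. A tiny example unrelated to Huffman trees may help; countdown is a made-up function for illustration:

```haskell
import Data.List (unfoldr)

-- Each step emits the current number and carries on with one less,
-- stopping (by returning Nothing) once the state reaches zero.
countdown :: Int -> [Int]
countdown = unfoldr step
  where
    step n
      | n <= 0    = Nothing
      | otherwise = Just (n, n - 1)
```

So countdown 5 is [5,4,3,2,1].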

Using unfoldr, we can write a decodeAll:

Which works exactly as we’d like!
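A sketch of what such a decodeAll might look like. The shape of decodePT’s result, Maybe (a, Encoding), is exactly what unfoldr wants from its step function, so the definition is a one-liner; decodePT and the types are repeated so the snippet stands alone, under the usual assumption about PreTree:

```haskell
import Data.List (unfoldr)

-- Assumed shape of PreTree from Part 1.
data PreTree a = PTLeaf a
               | PTNode (PreTree a) (PreTree a)
               deriving (Show, Eq)

data Direction = DLeft | DRight deriving (Show, Eq)
type Encoding = [Direction]

decodePT :: PreTree a -> Encoding -> Maybe (a, Encoding)
decodePT (PTLeaf x) ds             = Just (x, ds)
decodePT (PTNode pt1 _) (DLeft : ds)  = decodePT pt1 ds
decodePT (PTNode _ pt2) (DRight : ds) = decodePT pt2 ds
decodePT (PTNode _ _) []           = Nothing

-- The leftover directions act as the state carried between steps.
decodeAll :: PreTree a -> Encoding -> [a]
decodeAll pt = unfoldr (decodePT pt)
```

With 'a' on the left and a node of 'b' and 'c' on the right, decoding [DLeft,DRight,DLeft] yields "ab".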

Testing

We can write a utility to test our encoding/decoding functions:

(Again, refer to my MonadPlus article from earlier, if you are unfamiliar with working with the Maybe monad)

testTree should be an identity; that is, testTree xs === xs.

Note the very unsafe irrefutable pattern match on Just decoded. We’ll fix this later!

QuickCheck

Now that we have a neat proposition, we can use quickCheck on it, from the great QuickCheck library. quickCheck will basically test our proposition testTree xs == xs by generating several random xs’s.

Failure!

Oh! We failed? And on such a simple case? What happened?

If we look at how "a" is encoded, it’ll become apparent:

Ah. Well, that’s a problem. Basically, our input string has “zero” entropy, according to typical measurements. So we cannot naively Huffman encode it.
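To see the problem in isolation, here is a sketch that assumes the tree built for "a" is just the single leaf PTLeaf 'a'. The helpers are the same sketches of ptTable and decodePT used elsewhere, repeated so this compiles alone; singletonTable and firstFew are names invented for this example:

```haskell
import Data.List (unfoldr)
import Data.Map (Map)
import qualified Data.Map as M

-- Assumed shape of PreTree from Part 1.
data PreTree a = PTLeaf a
               | PTNode (PreTree a) (PreTree a)
               deriving (Show, Eq)

data Direction = DLeft | DRight deriving (Show, Eq)
type Encoding = [Direction]

ptTable :: Ord a => PreTree a -> Map a Encoding
ptTable pt = go pt []
  where
    go (PTLeaf x) enc       = M.singleton x (reverse enc)
    go (PTNode pt1 pt2) enc = go pt1 (DLeft : enc) <> go pt2 (DRight : enc)

decodePT :: PreTree a -> Encoding -> Maybe (a, Encoding)
decodePT (PTLeaf x) ds                = Just (x, ds)
decodePT (PTNode pt1 _) (DLeft : ds)  = decodePT pt1 ds
decodePT (PTNode _ pt2) (DRight : ds) = decodePT pt2 ds
decodePT (PTNode _ _) []              = Nothing

-- The root is already a leaf, so the path to 'a' is empty.
singletonTable :: Map Char Encoding
singletonTable = ptTable (PTLeaf 'a')

-- Decoding the empty encoding never consumes anything and never fails,
-- so unfoldr keeps producing 'a' forever; take 3 just peeks at the start.
firstFew :: String
firstFew = take 3 (unfoldr (decodePT (PTLeaf 'a')) [])
```

singletonTable holds the single entry ('a', []), so every 'a' encodes to nothing at all, and firstFew is "aaa" only because take cuts the infinite list short.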

Success!

There are a few ways to deal with this. The most “immediate” way would be to realize that decodeAll is partial (that is, it does not terminate/is undefined on some of its inputs), and will actually never terminate if the given tree is a singleton tree. We can write a “safe” decodeAll:

In doing this, we don’t exactly “fix” the problem…we only defer responsibility. Now, whoever uses decodeAll' (like our eventual encoding interface) is forced to handle the error (by handling the Nothing case). In this way, the type system enforces safety. Had we always used the unsafe decodeAll, then whoever uses it eventually has to “manually remember” to handle the unterminating case, by carefully reading documentation or something. In this case, the type system is a big, explicit reminder saying “hey, deal with the unterminating case.”
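A sketch of what a decodeAll' along these lines could look like: refuse singleton trees up front, and otherwise decode as before. As in the other snippets, PreTree’s shape is assumed from Part 1 and decodePT is repeated so this stands alone:

```haskell
import Data.List (unfoldr)

-- Assumed shape of PreTree from Part 1.
data PreTree a = PTLeaf a
               | PTNode (PreTree a) (PreTree a)
               deriving (Show, Eq)

data Direction = DLeft | DRight deriving (Show, Eq)
type Encoding = [Direction]

decodePT :: PreTree a -> Encoding -> Maybe (a, Encoding)
decodePT (PTLeaf x) ds                = Just (x, ds)
decodePT (PTNode pt1 _) (DLeft : ds)  = decodePT pt1 ds
decodePT (PTNode _ pt2) (DRight : ds) = decodePT pt2 ds
decodePT (PTNode _ _) []              = Nothing

decodeAll' :: PreTree a -> Encoding -> Maybe [a]
-- A lone leaf would decode forever, so report failure instead.
decodeAll' (PTLeaf _) _ = Nothing
-- Any real node consumes at least one direction per symbol, so this terminates.
decodeAll' pt enc       = Just (unfoldr (decodePT pt) enc)
```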

We’ll also write a “safe” testTree:

So we can now quickcheck:

Hooray!

Re: Testing

I’ll admit that I didn’t even anticipate the degenerate singleton tree case until I decided to add a quickcheck section to this post. It just goes to show that you should always test! And it also shows how easy it is to write tests in quickcheck. One line could mean five unit tests, and you might even test edge/corner cases that you might have never even thought about!

For example, we probably should have tested lookupPTTable against findPT, our reference implementation :) We should have also tested our binary encode/decode!

Next Time

We’re almost there!

For our last section, we are going to be focusing on pulling it all together to make a streaming compression/decompression interface that will be able to read a file and encode/decode into a new file as it goes, in constant memory, using pipes. After that, we will also be looking at how to profile code, applying some optimization tricks we can do to get things just right, and other things to wrap up.


  1. Note: this section was largely rewritten; it used to contain a rather involved yet misguided tutorial about the Writer monad, as suggested by old links/titles. This can still be found here, if you want to read through it.
