It seems these days like programming with dependent types in Haskell (and its advantages) is moving slowly but steadily into the mainstream of Haskell programming. In the current state of Haskell education, dependent types are often considered a topic for “advanced” Haskell users. However, I can foresee a day when the ease of use of modern Haskell libraries relying on dependent types forces programming with dependent types to become an integral part of normal intermediate (or even beginner) Haskell education.
There are more and more great resources, tutorials, and introductions to integrating dependent types into your Haskell every day. The point of this series is to show some more practical examples of using dependent types to guide your programming, and also to walk through the “why” and the high-level philosophy of the way you structure your Haskell programs. It’ll also hopefully instill an intuition for a dependently typed workflow of “exploring” how dependent types can help your current programs. The intended audience of this post is intermediate Haskell programmers in general, with no required knowledge of dependently typed programming. I should also point out that I’m no expert — I’m still in the process of learning this all, myself :)
The first project in this series will build up to a type-safe artificial neural network implementation with back-propagation training.
This post is written on stack snapshot nightly-2016-06-28, with singletons-2.2, but uses an unreleased version of hmatrix, hmatrix-0.18 (commit 42a88fb). I maintain my own documentation for reference.
If you’re forced to use GHC 7.10 for some reason, there’s also a bug in singletons-2.0.1 package that’s fixed in singletons-2.1, but 2.1 is not available with GHC 7.10 – I have a github fork that fixes the bug if you want to stay on GHC 7.10.
You can add the appropriate package locations to the packages field of your directory’s or global stack.yaml, and stack will know what versions of hmatrix and singletons to use when you use stack runghc or stack ghc, etc. to build your files.
Artificial neural networks have been somewhat of a hot topic in computing recently. Implementations of training algorithms (like back-propagation) are tricky to implement correctly — despite being simple, there are many locations where accidental bugs might pop up when multiplying the wrong matrices, for example.
Though some might recognize that complicated matrix and vector arithmetic is a common application of phantom type-based dependent types, it’s not necessarily always easy to gauge before-the-fact what would or would not be a good candidate for adding dependent types to. Often times, it can even be considered premature to start off with “as powerful types as you can”. So let’s walk through programming things with as “dumb” types as possible, and see where types can help.
We’ll be following a process called “type-driven development” — start with general and non-descriptive types, write the implementation and recognize partial functions and red flags, and slowly refine and add more and more powerful types to fix the problems.
Here’s a quick run through on background for ANN’s — but remember, this isn’t an article on ANN’s, so we are going to be glossing over some of the details.
We’re going to be implementing a feed-forward neural network with back-propagation training. These networks are layers of “nodes”, each connected to each of the nodes of the previous layer. Input goes to the first layer, which feeds information to the next layer, which feeds it to the next, etc., until the final layer, where we read it off as the “answer” that the network is giving us. Layers between the input and output layers are called hidden layers. Every node “outputs” a weighted sum of all of the outputs of the previous layer, plus an always-on “bias” term (so that its result can be non-zero even when all of its inputs are zero). Symbolically, it looks like:

$$y_j = b_j + \sum_i w_{ji} x_i$$
Or, if we treat the output of a layer and the list of lists of weights as vectors and a matrix, we can write it a little cleaner:

$$\mathbf{y} = \mathbf{b} + W \mathbf{x}$$

The result, the $n$-vector of nodes $\mathbf{y}$, is computed from the $n$-vector of biases $\mathbf{b}$ and the $n \times m$ weight matrix $W$ multiplied with the $m$-vector input, $\mathbf{x}$.
To “scale” the result (and to give the system the magical powers of nonlinearity), we actually apply an “activation function” to the output before passing it down to the next step. We’ll be using the popular logistic function, $f(x) = \frac{1}{1 + e^{-x}}$.
Training a network involves picking the right set of weights to get the network to answer the question you want.
We can store a network by storing the matrices of weights and biases between each layer:
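The data declaration is elided in this rendering; a minimal sketch consistent with the W constructor used in the code below (field names here are illustrative):

```haskell
data Weights = W { wBiases :: !(Vector Double)  -- n
                 , wNodes  :: !(Matrix Double)  -- n x m
                 }                              -- "m to n" layer
```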
Weights linking an m-node layer to an n-node layer have an n-dimensional bias vector (one component for each output) and an n-by-m node weight matrix (one row for each output, one column for each input).
(We’re using the Matrix type from the awesome hmatrix library for performant linear algebra, implemented using BLAS/LAPACK under the hood.)
A feed-forward neural network is then just a linked list of these weights:
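The declaration itself is elided here; a sketch consistent with the O and (:&~) constructors used throughout:

```haskell
data Network :: * where
    O     :: !Weights
          -> Network
    (:&~) :: !Weights
          -> !Network
          -> Network
infixr 5 :&~
```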
Note that we’re using GADT syntax here, which just lets us define Network (with a kind signature, *) by providing the types of its constructors, O and (:&~). It’d be equivalent to the following normal data declaration:
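A sketch of the equivalent plain declaration (without the strictness annotations, for clarity):

```haskell
data Network = O Weights
             | Weights :&~ Network
```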
A network with one input layer, two inner layers, and one output layer would look like:
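Something like this (the names here are illustrative, mirroring the weight-set naming used later in the post):

```haskell
ih :: Weights   -- input to first hidden layer
hh :: Weights   -- between the two hidden layers
ho :: Weights   -- last hidden layer to output

net :: Network
net = ih :&~ hh :&~ O ho
```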
The first component is the weights from the input to first inner layer, the second is the weights between the two hidden layers, and the last is the weights between the last hidden layer and the output layer.
We can write simple procedures, like generating random networks:
```haskell
-- source: https://github.com/mstksg/inCode/tree/master/code-samples/dependent-haskell/NetworkUntyped.hs#L46-L56

randomWeights :: MonadRandom m => Int -> Int -> m Weights
randomWeights i o = do
    seed1 :: Int <- getRandom
    seed2 :: Int <- getRandom
    let wB = randomVector seed1 Uniform o * 2 - 1
        wN = uniformSample seed2 o (replicate i (-1, 1))
    return $ W wB wN

randomNet :: MonadRandom m => Int -> [Int] -> Int -> m Network
randomNet i []     o = O <$> randomWeights i o
randomNet i (h:hs) o = (:&~) <$> randomWeights i h <*> randomNet h hs o
```
(We’re using the MonadRandom typeclass from the MonadRandom library, which builds on the mechanisms in System.Random and gives us a generic way of working with monads where we can get random values with getRandom.)
And now we can write a function to “run” our network on a given input vector, following the matrix equation we wrote earlier:
```haskell
-- source: https://github.com/mstksg/inCode/tree/master/code-samples/dependent-haskell/NetworkUntyped.hs#L30-L44

logistic :: Floating a => a -> a
logistic x = 1 / (1 + exp (-x))

runLayer :: Weights -> Vector Double -> Vector Double
runLayer (W wB wN) v = wB + wN #> v

runNet :: Network -> Vector Double -> Vector Double
runNet (O w)      !v = logistic (runLayer w v)
runNet (w :&~ n') !v = let v' = logistic (runLayer w v)
                       in  runNet n' v'
```
(#> is matrix-vector multiplication.)
If you’re a non-Haskell programmer, this might all seem perfectly fine and normal, and you probably have only a slightly elevated heart rate. If you are a Haskell programmer, you are most likely already having heart attacks. Let’s imagine all of the bad things that could happen:
- How do we know that we didn’t accidentally mix up the dimensions in our implementation of randomWeights? We could have switched the parameters and be none the wiser.
- How do we even know that each subsequent matrix in the network is “compatible”? We want the outputs of one matrix to line up with the inputs of the next, but there’s no way to know. It’s possible to build a bad network, and things will just explode at runtime.
- How do we know the size of vector the network expects? What stops you from sending in a badly sized vector at run-time? We might do runtime checks, but the compiler won’t help us.
- How do we verify that we have implemented runNet in a way that won’t suddenly fail at runtime? We write l #> v, but how do we know that it’s even correct… what if we forgot to multiply something, or used something in the wrong place? We can prove it ourselves, but the compiler won’t help us.
Now, let’s try implementing back-propagation! It’s a textbook gradient descent algorithm. There are many explanations on the internet; the basic idea is that you try to minimize the squared error between what the neural network outputs for a given input and the actual expected output. You find the direction of change that minimizes the error (by finding the derivative), and move in that direction. The implementation of back-propagation is found in many sources online and in the literature, so let’s see the implementation in Haskell:
```haskell
-- source: https://github.com/mstksg/inCode/tree/master/code-samples/dependent-haskell/NetworkUntyped.hs#L58-L96

train :: Double           -- ^ learning rate
      -> Vector Double    -- ^ input vector
      -> Vector Double    -- ^ target vector
      -> Network          -- ^ network to train
      -> Network
train rate x0 target = fst . go x0
  where
    go :: Vector Double    -- ^ input vector
       -> Network          -- ^ network to train
       -> (Network, Vector Double)
    -- handle the output layer
    go !x (O w@(W wB wN))
        = let y    = runLayer w x
              o    = logistic y
              -- the gradient (how much y affects the error)
              -- (logistic' is the derivative of logistic)
              dEdy = logistic' y * (o - target)
              -- new bias weights and node weights
              wB'  = wB - scale rate dEdy
              wN'  = wN - scale rate (dEdy `outer` x)
              w'   = W wB' wN'
              -- bundle of derivatives for next step
              dWs  = tr wN #> dEdy
          in  (O w', dWs)
    -- handle the inner layers
    go !x (w@(W wB wN) :&~ n)
        = let y          = runLayer w x
              o          = logistic y
              -- get dWs', bundle of derivatives from rest of the net
              (n', dWs') = go o n
              -- the gradient (how much y affects the error)
              dEdy       = logistic' y * dWs'
              -- new bias weights and node weights
              wB'        = wB - scale rate dEdy
              wN'        = wN - scale rate (dEdy `outer` x)
              w'         = W wB' wN'
              -- bundle of derivatives for next step
              dWs        = tr wN #> dEdy
          in  (w' :&~ n', dWs)
```
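The logistic' referenced in the comments above is the derivative of the logistic function, defined alongside logistic in the linked source file; using the identity $f'(x) = f(x)\,(1 - f(x))$, it can be written as:

```haskell
logistic' :: Floating a => a -> a
logistic' x = logix * (1 - logix)
  where
    logix = logistic x
```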
The algorithm computes the updated network by recursively updating the layers, backwards up from the output layer. At every step, it returns the updated layer/network, as well as a bundle of derivatives (
dWs) for the next layer up to use to calculate its descent direction.
At the output layer, all it needs to calculate the direction of descent is just o - target, the difference between the output and the target. At the inner layers, it has to use the dWs bundle it receives from the lower layers to figure it out. dWs essentially “bubbles up” from the output layer up to the input layer calculations.
Writing this is a bit of a struggle. I actually implemented this incorrectly several times before writing it as you see here. The type system doesn’t help you like it normally does in Haskell, and you can’t really use parametricity to help you write your code like normal Haskell. Everything is monomorphic, and everything multiplies with everything else. You don’t have any hints about what to multiply with what at any point in time. It’s like all of the bad things mentioned before, but amplified.
In short, you’re leaving yourself open to many potential bugs…and the compiler doesn’t help you write your code at all! This is the nightmare of every Haskell programmer. There must be a better way!1
Putting it to the test
Pretty much the only way you can verify this code is to test it out on example cases. In the source file, main tests out the backprop, training a network on a 2D function that is “on” for two small circles and “off” everywhere else (a nice, cute, non-linearly-separable function to test our network on). We basically train the network to be able to recognize the two-circle pattern. I implemented a simple printing function and tested the trained network on a grid:
```
$ stack install hmatrix MonadRandom
$ stack ghc -- -O2 ./NetworkUntyped.hs
$ ./NetworkUntyped
# Training network...
#
#
#            .=########=
#          .##############.
#          ################
#          ################
#          .##############-
#            .###########
#      ...       ...
#   -##########.
#  -##############.
#  ################
#  ################
#   =############=
#     .#######=.
#
```
Not too bad! The network learned to recognize the circles. But, I was basically forced to resort to unit testing to ensure my code was correct. Let’s see if we can do better.
The Call of Types
Before we go on to the “typed” version of our program, let’s take a step back and look at some big checks you might want to ask yourself after you write code in Haskell.
- Are any of my functions either partial or implemented using partial functions?
- How could I have written things that are incorrect, and yet still type check? Where does the compiler not help me by restricting my choices?
Both of these questions usually yield some truth about the code you write and the things you should worry about. As a Haskeller, they should always be at the back of your mind!
Looking back at our untyped implementation, we notice some things:
- Literally every single function we wrote is partial. Like, actually.2 If we had passed in an incorrectly sized matrix/vector, or stored mismatched vectors in our network, everything would fall apart.
- There are billions of ways we could have implemented our functions where they would still typecheck. We could multiply mismatched matrices, or forget to multiply a matrix, etc.
With Static Size-Indexed Types
Gauging our potential problems, it seems like the first major class of bugs we can address is improperly sized and incompatible matrices. If the compiler always made sure we used compatible matrices, we could avoid bugs at compile-time, and we’d also get a friendly helper when we write programs (it would know what works with what, tell us what we need where, and help us organize our logic).
Let’s write a Weights type that tells you the size of its output and the size of the input it expects. Let’s have, say, a Weights 10 5 be a set of weights that takes you from a layer of 10 nodes to a layer of 5 nodes, and a w :: Weights 4 6 take you from a layer of 4 nodes to a layer of 6 nodes:
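A sketch of the declaration, swapping the untyped Vector/Matrix for the sized types from Numeric.LinearAlgebra.Static (field names illustrative, mirroring the untyped version):

```haskell
data Weights i o = W { wBiases :: !(R o)    -- o-vector of biases
                     , wNodes  :: !(L o i)  -- o-by-i weight matrix
                     }
```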
The type constructor Weights has the kind Weights :: Nat -> Nat -> * — it takes two types of kind Nat (from the GHC.TypeLits module, whose integer type literals we get with DataKinds enabled) and returns a * — a “normal type”.
We’re using the Numeric.LinearAlgebra.Static module from hmatrix, which offers matrix and vector types with their sizes in their types: an R 5 is a vector of Doubles with 5 elements, and an L 3 6 is a 3x6 matrix of Doubles.
These types are called “dependent” types because the type itself depends on its value. If an R n contains a 5-element vector, its type is R 5.
The Static module in hmatrix relies on the KnownNat mechanism that GHC offers. Almost all operations in the library require a KnownNat constraint on the type-level Nats — for example, you can take the dot product of two vectors with dot :: KnownNat n => R n -> R n -> Double. It lets the library use the information in the n at runtime as an Integer. (More on this later!)
Moving on, our network type for this post will be something like Network 10 '[7,5,3] 2: take 10 inputs, return 2 outputs — and internally, have hidden layers of sizes 7, 5, and 3. (The '[7,5,3] is a type-level list of Nats; the optional ' apostrophe is just for our own benefit, to distinguish it from a value-level list of integers.)
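The declaration itself is elided in this rendering; a sketch consistent with the constructors described next:

```haskell
data Network :: Nat -> [Nat] -> Nat -> * where
    O     :: !(Weights i o)
          -> Network i '[] o
    (:&~) :: KnownNat h
          => !(Weights i h)
          -> !(Network h hs o)
          -> Network i (h ': hs) o
infixr 5 :&~
```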
We use GADT syntax here again. The kind signature of the type constructor means that the Network type constructor takes three inputs: a Nat (a type-level numeral, like 5), a list of Nats, and another Nat (the input, hidden layer, and output sizes). Let’s go over the two constructors.
- The O constructor takes a Weights i o and returns a Network i '[] o. That is, if your network is just a single set of weights from i inputs to o outputs, then your network itself takes i inputs and returns o outputs, with no hidden layers.
- The (:&~) constructor takes a Network h hs o – a network with h inputs and o outputs – and “conses” an extra input layer in front. If you give it a Weights i h, its outputs fit perfectly into the inputs of the subnetwork, and you get a Network i (h ': hs) o. ((':) is the type-level version of the normal list cons (:); the apostrophe is optional here too, but it’s nice to be able to visually distinguish the two.)
We add a KnownNat constraint on the h, so that whenever you pattern match on w :&~ net, you automatically get a KnownNat constraint for the input size of net (and the output size of w) that the hmatrix library can use.
We can still construct them the same way:
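For example (the concrete sizes here are illustrative, chosen to match the type-error example that follows):

```haskell
ih :: Weights 10 7
hh :: Weights  7 4
ho :: Weights  4 2

network :: Network 10 '[7,4] 2
network = ih :&~ hh :&~ O ho
```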
Note that the shape of the constructors requires all of the weight matrices to “fit together”. ih :&~ O ho would be a type error (feeding a 7-output layer into a 4-input layer). Also, if we ever pattern match on :&~, we know that the resulting matrices and vectors are compatible!
One neat thing is that this approach is also self-documenting. I don’t need to specify what the dimensions are in the docs and trust the users to read it and obey it. The types tell them! And if they don’t listen, they get a compiler error! (You should, of course, still provide reasonable documentation. But, in this case, the compiler actually enforces your documentation’s statements!)
Generating random weights and networks is even nicer now:
```haskell
-- source: https://github.com/mstksg/inCode/tree/master/code-samples/dependent-haskell/NetworkTyped.hs#L57-L64

randomWeights :: (MonadRandom m, KnownNat i, KnownNat o)
              => m (Weights i o)
randomWeights = do
    s1 :: Int <- getRandom
    s2 :: Int <- getRandom
    let wB = randomVector s1 Uniform * 2 - 1
        wN = uniformSample s2 (-1) 1
    return $ W wB wN
```
Notice that the Static versions of randomVector and uniformSample don’t actually require the size of the vector/matrix you want as an input – they just use type inference to figure out what size you want! This is the same process that read uses to figure out what type of thing to return. You would use randomVector s Uniform :: R 10, and type inference would give you a 10-element vector the same way read "hello" :: Int would give you an Int.
It’s important to note that it’s much harder to implement this incorrectly. Before, you could give the matrix the wrong dimensions (maybe you flipped the parameters?), or give the wrong size parameter to the vector generator.
But here, you are guaranteed/forced to return the correctly sized vectors and matrices. In fact, you don’t even have to worry about it — it’s handled automatically by the magic of type inference3! I consider this a very big victory. One of the whole points of types is to give you less to “worry about”, as a programmer. Here, we completely eliminate an entire dimension of programmer concern.
Benefits to the user
Not only is this style nicer for you as the implementer, it’s also very beneficial for the user of the function. Consider looking at the two competing type signatures side-by-side:
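Roughly (the untyped signature is the one from the earlier section; the typed one is from the version above):

```haskell
-- untyped version:
randomWeights :: MonadRandom m => Int -> Int -> m Weights

-- typed version:
randomWeights :: (MonadRandom m, KnownNat i, KnownNat o) => m (Weights i o)
```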
If you want to use this function, you have to look up some things from the documentation:
- What do the two arguments represent?
- In what order is the function expecting these two arguments?
- What will be the dimension of the result?
These are three things you need to look up in the documentation. There’s simply no way around it.
But, here, all of these questions are answered immediately, just from the type (which you can get from GHC, or from ghci). You don’t need to worry about what the arguments represent. You don’t need to worry about what order the function is expecting the arguments in. And you already know exactly what the dimensions of the result are, right in the type.
I often implement many of my functions in this style, even if the rest of my program isn’t intended to be dependently typed (I can just convert the type to a “dumb” type as soon as I get the result). All of these benefits come even when the caller doesn’t care at all about dependently typed programming — it’s just a better style of defining functions/offering an API!
Singletons and Induction
The code for the updated randomNet takes a bit of background to understand, so let’s take a quick detour through the concepts of singletons, dependent pattern matching, and induction on dependent data types.4
Let’s say we want to implement an algorithm that can create any Network i hs o, so that we can construct a Network 4 '[3,2] 1 or something. In true Haskell fashion, we want to do this recursively (“inductively”). After all, we know how to make a Network i '[] o (just O <$> randomWeights), and we know how to create a Network i (h ': hs) o if we had a Network h hs o (just use (:&~) with a fresh randomWeights). Now all we have to do is just “pattern match” on the type-level list, and…
Oh wait. We can’t pattern match on types like that in Haskell. This is a consequence of one of Haskell’s fundamental design decisions: types are erased at runtime. We need to have a way to “access” the type (at run-time) as a value so we can pattern match on it and do things with it.
In Haskell, the popular way to deal with this is by using singletons — (parameterized) types which have only one valid constructor for each choice of parameter. The canonical method of working with singletons in Haskell is the singletons library, which provides a uniform interface for all sorts of singletons of types you’ll encounter in everyday use.
We want to “pattern match” on a type-level list, so we want a singleton for lists. The singletons library provides them:
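In singletons-2.2, the singleton for type-level lists is a data family instance along these lines (a sketch of its shape, not a verbatim copy):

```haskell
data instance Sing (xs :: [k]) where
    SNil  :: Sing '[]
    SCons :: Sing x -> Sing xs -> Sing (x ': xs)
```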
This means that if we ever get a value of type Sing as (where as is a type-level list), we can pattern match on it. If we match on the SNil constructor, we know it’s a Sing '[] in that branch, and if we match on the SCons constructor, we know it’s a Sing (a ': as) – a non-empty list. This is called dependent pattern matching: every “branch” of your case statement has a different inferred type for the arguments, depending on the constructor you match on.
singletons actually provides a whole bunch of singleton constructors for different types and kinds — like for Bool. (Here, 'True is a type, of kind Bool.)
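The Bool singleton can be sketched the same way as the list one:

```haskell
data instance Sing (b :: Bool) where
    SFalse :: Sing 'False
    STrue  :: Sing 'True
```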
So, if we are ever given a Sing b with some type-level Bool b that we don’t know, we can pattern match on it. In the branch where STrue matches, we know b is 'True, and in the branch where SFalse matches, we know b is 'False.
Singletons give us a way to pattern match on types by having an actual term-level value we can pattern match on. So, we could implement:
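A version that takes the singleton explicitly as an argument could be sketched like this (the linked source file defines a similar helper; the name randomNet' here is illustrative):

```haskell
randomNet' :: (MonadRandom m, KnownNat i, KnownNat o)
           => Sing hs -> m (Network i hs o)
randomNet' = \case
    SNil            ->     O <$> randomWeights
    SNat `SCons` ss -> (:&~) <$> randomWeights <*> randomNet' ss
```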
The function gets to directly pattern match on and deconstruct the Sing hs it is given.
However, for actual APIs, it’s often more convenient to not require the extra parameter, and to have it be “inferred”, the way we’ve been doing it before. That way the user doesn’t have the burden of supplying it. The singletons library offers a typeclass we can use to implicitly conjure up values of a singleton type – SingI. We can use sing :: SingI s => Sing s to generate the “inferred” singleton:
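The class itself is tiny — essentially (sketched):

```haskell
class SingI s where
    sing :: Sing s
```

with singletons providing instances for all the singleton types it defines.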
So if you have a function of type SingI hs => ..., it’s really no different than Sing hs -> .... The function itself gets to use a Sing hs either way … but for the first, the argument is implicit.
The final piece of the puzzle is the singleton for a type-level Nat. It’s a little different: when you pattern match on it, instead of directly learning about the type, you “receive” a KnownNat instance you can use.
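Its data instance can be sketched as:

```haskell
data instance Sing (n :: Nat) where
    SNat :: KnownNat n => Sing n
```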
Essentially, the SNat data constructor comes “packaged” with a KnownNat n instance. The creation of SNat :: Sing n requires the presence of KnownNat n. So if you ever pattern match on a validly created SNat, the fact that the SNat constructor even exists (instead of, say, being undefined) is a witness to that very KnownNat instance, and the type system lets us use this. It’s as if you “pattern match” out the instance itself, like any other value the constructor might contain.
Now we have enough pieces of the puzzle:
```haskell
-- source: https://github.com/mstksg/inCode/tree/master/code-samples/dependent-haskell/NetworkTyped.hs#L66-L75

randomNet :: forall m i hs o. (MonadRandom m, KnownNat i, SingI hs, KnownNat o)
          => m (Network i hs o)
randomNet = go sing
  where
    go :: forall h hs'. KnownNat h
       => Sing hs'
       -> m (Network h hs' o)
    go = \case
      SNil            ->     O <$> randomWeights
      SNat `SCons` ss -> (:&~) <$> randomWeights <*> go ss
```
The real heavy lifting is done by go (written with LambdaCase), which takes the singleton structure it needs and recursively calls itself until it reaches the base case (SNil, an output layer). We just call go sing to give it the initial structure it needs. Note there that sing :: Sing hs, but this is inferred: go takes a Sing hs' and returns an m (Network h hs' o), and since randomNet as a whole is being asked to return an m (Network i hs o), GHC can safely infer that we want sing :: Sing hs.
Remember that we can write O <$> randomWeights because randomWeights, like read, adapts to whatever type we want from it — in this case, we ask for a Weights h o, and type inference is the real hero.
When possible, we like to write functions like go that take explicit singletons. In a lot of situations, we’ll actually write our internal logic itself using explicit singletons, and only use SingI and implicit singletons at the external boundaries of our API (like randomNet) as a convenience for the user.
We’ve stumbled upon a common pattern in dependent Haskell: “building up” a value-level singleton structure from the type that we want (either explicitly given as an argument, or provided through a typeclass like SingI) and then inductively piggybacking on that structure’s constructors to build the thing you really want (called “elimination”). Here, we use SingI hs to build the structure, and “eliminate” it to create our Network i hs o.
On Typeclasses and Dictionaries
One of the more bizarre things here, to me, is that SNat somehow gives us a KnownNat n instance that we can use and pass off to randomWeights. However, once you realize that typeclasses in Haskell really aren’t any more than a way to pass in implicit arguments, it starts to make sense.
The only thing you can really do with a KnownNat is to get an Integer from it with natVal. So really, KnownNat n => ... is more or less the same as Integer -> .... That’s right — at runtime, a KnownNat n constraint is more or less just an Integer that GHC passes around automatically for you, to save you the hassle of passing it in yourself. (We say that an Integer is the “dictionary” of the KnownNat typeclass.)
So, the constructor:
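That is, the Nat singleton’s constructor from above (sketched):

```haskell
SNat :: KnownNat n => Sing n
```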
Is really kind of like:
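In spirit — this is the runtime picture, not real code:

```haskell
SNat :: Integer -> Sing n
```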
The GADT constructor for SNat requires a KnownNat n instance in scope to be produced, and that instance is essentially stored inside the constructor (as if it were just an Integer). Then, later, when you pattern match on it, you pattern match out the instance that was originally put in there, and you can use it!
So what’s the big deal — why not just ditch KnownNat and pass around Integers? The difference is that GHC can now track these at compile-time to check how your Nats act together on the type level, allowing it to catch mismatches with compile-time checks instead of run-time checks.
Running with it
So now, you can use randomNet :: IO (Network 5 '[4,3] 2) to get a random network of the desired dimensions! (IO is an instance of MonadRandom.)

Can we just pause right here to appreciate how awesome it is that we can generate random networks of whatever size we want by just requesting something by its type? Our implementation is also guaranteed to have the right sized matrices — no worrying about using the right size parameters for the right matrix in the right order. GHC does it for you automatically! And the person who uses randomNet doesn’t have to bungle around figuring out which function argument indicates what, and in what order, and they don’t have to play a guessing game about the shape of the returned matrix.
The code for running the nets is actually literally identical from before:
```haskell
-- source: https://github.com/mstksg/inCode/tree/master/code-samples/dependent-haskell/NetworkTyped.hs#L42-L55

runLayer :: (KnownNat i, KnownNat o)
         => Weights i o
         -> R i
         -> R o
runLayer (W wB wN) v = wB + wN #> v

runNet :: (KnownNat i, KnownNat o)
       => Network i hs o
       -> R i
       -> R o
runNet = \case
    O w        -> \(!v) -> logistic (runLayer w v)
    (w :&~ n') -> \(!v) -> let v' = logistic (runLayer w v)
                           in  runNet n' v'
```
But now, we get the assurance that the matrices and vectors all fit each other, at compile-time. GHC basically writes our code for us. The operations all demand vectors and matrices that “fit together”, so you can only ever multiply a matrix by a properly sized vector.

The source code is nearly the same as before, so there isn’t any extra annotation overhead. The correctness proofs and guarantees basically come without any extra work — they’re free!
Our back-prop algorithm is ported pretty nicely too:
```haskell
-- source: https://github.com/mstksg/inCode/tree/master/code-samples/dependent-haskell/NetworkTyped.hs#L77-L116

train :: forall i hs o. (KnownNat i, KnownNat o)
      => Double           -- ^ learning rate
      -> R i              -- ^ input vector
      -> R o              -- ^ target vector
      -> Network i hs o   -- ^ network to train
      -> Network i hs o
train rate x0 target = fst . go x0
  where
    go :: forall j js. KnownNat j
       => R j              -- ^ input vector
       -> Network j js o   -- ^ network to train
       -> (Network j js o, R j)
    go !x (O w@(W wB wN))
        = let y    = runLayer w x
              o    = logistic y
              -- the gradient (how much y affects the error)
              -- (logistic' is the derivative of logistic)
              dEdy = logistic' y * (o - target)
              -- new bias weights and node weights
              wB'  = wB - konst rate * dEdy
              wN'  = wN - konst rate * (dEdy `outer` x)
              w'   = W wB' wN'
              -- bundle of derivatives for next step
              dWs  = tr wN #> dEdy
          in  (O w', dWs)
    -- handle the inner layers
    go !x (w@(W wB wN) :&~ n)
        = let y          = runLayer w x
              o          = logistic y
              -- get dWs', bundle of derivatives from rest of the net
              (n', dWs') = go o n
              -- the gradient (how much y affects the error)
              dEdy       = logistic' y * dWs'
              -- new bias weights and node weights
              wB'        = wB - konst rate * dEdy
              wN'        = wN - konst rate * (dEdy `outer` x)
              w'         = W wB' wN'
              -- bundle of derivatives for next step
              dWs        = tr wN #> dEdy
          in  (w' :&~ n', dWs)
```
It’s again almost an exact copy-and-paste, but now with GHC checking that everything fits together in our implementation.
One thing that’s hard for me to convey here without walking through the implementation step-by-step is how much the types help you in writing this code.
Going into writing a back-prop implementation without the help of types, I’d probably be a bit concerned. I mentioned earlier that writing the untyped version was no fun at all. But with the types, writing the implementation became a joy again. And you have the help of hole-driven development, too.
If you need, say, an R n, there might be only one way to get it! And if you have something that you need to combine with something you don’t know about, you can use typed holes (_) and GHC will give you a list of all the values in scope that can fit there. Your programs basically write themselves!
The more you can restrict the implementations of your functions with your types, the more of a joy programming in Haskell is. Things fit together and fall together before your eyes…and the best part is that if they’re wrong, the compiler will nudge you gently into the correct direction.
The most stressful part of programming happens when you have to tenuously hold a complex and fragile network of ideas and constraints in your brain, and any slight distraction or break in focus causes everything to crash down in your mind. Over time, people have begun to believe that this is “normal” in programming. Don’t believe this lie — it’s not! A good programming experience involves maintaining as little in your head as possible, and letting the compiler handle remembering/checking the rest.
The final test
```
$ stack install hmatrix MonadRandom singletons
$ stack ghc -- -O2 ./NetworkTyped.hs
$ ./NetworkTyped
# Training network...
#
#
#            -#########-
#           -#############=
#          -###############-
#          =###############=
#           ##############=.
#            .##########=.
#               .==#=-
#   -###########-
#  =##############.
# .###############=
#  =##############-
#   =############-
#     -######=-.
#
```
Finding Something to Depend on
We wrote out an initial “non-typed” implementation and recognized a lot of red flags that you might already be trained to spot if you have been programming Haskell for a while: partial functions and multiple potential implementations.
We followed our well-tuned Haskell guts, listened to our hearts, and introduced extra power in our types to remove all partial functions and eliminate most potential implementations (though not all, yet — there are more gains to be made from pulling in more parametric polymorphism).
Though we might have been able to find the bugs we avoided “eventually”, we were able to remove entire dimensions of programmer concern and also leverage parametric polymorphism to help write our programs for us. We found joy again in programming.
In the process, however, we encountered some unexpected resistance from Haskell (the language). We couldn’t directly pattern match on our types, so we ended up playing games with singletons and GADT constructors to pass instances.
In practice, using types as powerful and descriptive as these begins to require a whole new set of tools once you get past the simplest use cases shown here. For example, our Network types so far require you to specify their size in the program itself (Network 2 '[16, 8] 1 in the example source code, for instance). But what if we wanted to generate a network with a runtime-determined size (for example, getting the size from user input)? What if we wanted to load a pre-trained network whose size we don’t know? How can we manipulate our networks in a “dynamic” and generic way that still gives us all of the benefits of type-safe programming? We’ll find out in the next post!
What we’re looking at here is a world where types can depend on run-time values … and values can depend on types. A world where types can be returned from functions and become as much of a first-class, manipulable citizen of our language as values are.
The art of working with types like this is dependently typed programming. We’re going to feel a bit of push back from Haskell at first, but after we hit our stride and tame the tools we need, we’re going to open up a whole new world of potential!
Here are some exercises you can do for fun to test your understanding and apply some of the concepts! The links are to the solutions in the source file.
- Write a function that “pops” the input layer off of a Network, returning both the input layer’s weights and the rest of the network, as a (Weights i h, Network h hs o). Think about what its type would have to be. Could it possibly be called on a network that cannot be popped (that is, one that has only one layer)?
- Write a function that takes two networks of the same dimensions and adds together their weights. Remember that L m n has a Num instance that adds the matrices together element-by-element. Could this function ever be accidentally called on two networks that have different internal structures?
- Write a function that takes a Network i hs o and returns the singleton representing its hidden layer structure — a Sing hs.
This sentence is the story of my Haskell life.↩
Okay, maybe not literally every one. But, pretty much every one.↩
Thank you based Hindley-Milner.↩
This entire section was previously written to use the typelits-witnesses library, but has now been re-written to use the canonical singletons library after a conversation with Andras Kovacs. The old version is still online at github, if you want to look at it or compare!↩