Post

Data.Reify for GADTs

'Reimplementing' data-reify for GADTs

Data.Reify for GADTs

Context

Some parts of my PhD research deal with languages’ design and implementation. Eventually, I had to define an AST for a functional, first-order, typed and compiled language. This language would be implemented as a domain-specific language (DSL) in Haskell.

This language would be a prototype for a bigger project (an eDSL, but I’ll keep that for later), so the idea is to implement the language, in a very naive way. The simpler, the better.

Since the language is functional, we have to support recursion (e.g. recursive functions). And since we want a typed AST, using a GADT seemed to be the way to go.

In theory, tackling recursion seemed quite straight forward, thanks to the data-reify1 library.

data-reify, what is it?

This library, result of the ‘Type-safe observable sharing in Haskell’ paper (A. Gill, 2009)2, basically takes a (potentially infinitely) recursive object (e.g. an AST), and returns a graph, where each node is an AST. These ASTs’ sub-expressions are turned into indexes, referring to other nodes in the graph.

Let’s take the example of an infinite binary-tree (taken from the package’s repository)[^tree-link]

[^tree-link] Link

1
2
3
4
5
6
7
8
9
10
data Tree a =
    Leaf a
  | Node (Tree a) (Tree a)
  deriving (Show)

loop1 :: Tree Int
loop1 = Node (Node (Leaf 1) loop1) loop2

loop2 :: Tree Int
loop2 = Node loop1 (Leaf 2)

data-reify would give us the following graph:

1
2
3
4
5
6
Root: 1
1: Node (Unique 2) (Unique 4)
2: Node (Unique 3) (Unique 1)
3: Leaf 1
4: Node (Unique 1) (Unique 5)
5: Leaf 2

Using Unique values to refer to other nodes in the graph, we avoid explicit object recursion to represent a recursive function. Operating on this now-graph Tree (like getting the left-most value) is trivial.

This Tree example is very simple, but is enough to showcase what the library does. Note that, in some example, the reification of a recursive tree can lead to an infinite loop. This can be avoided if you define something like a let-binding in your AST, establishing sharing and avoid infinite tree expansion.

Now, for data-reify to work, some boilerplate is necessary:

1
2
3
4
5
6
7
8
9
data TreeF a t =
    LeafF a
  | ForkF t t
  deriving (Show, Functor, Foldable)

instance MuRef (Tree a) where
  type DeRef (Tree a) = TreeF a
  mapDeRef _     (Leaf v) = pure $ LeafF v
  mapDeRef child (Fork l r) = liftA2 ForkF (child l) (child r)

We need to define a TreeF, similar to the Tree ADT, but where the sub-expression have the type t, instead of Tree a. This is necessary for the graph to have Unique values, instead of Trees.

Let’s take a look at that MuRef type-class:

1
2
3
4
5
6
class MuRef a where
  type DeRef a :: * -> *
  mapDeRef :: (Applicative f) =>
              (forall b . (MuRef b, DeRef a ~ DeRef b) => b -> f u)
              -> a
              -> f (DeRef a u)

MuRef is a type-class which takes a fully applied type as parameter (e.g. Tree a, not just Tree). DeRef is a type family that gives the type of the Tree where the subtrees can be anything (i.e. with the type-recursion factored out). Thus, type DeRef (Tree a) = TreeF a. The mapDeRef definition is a bit scary. It’s actually just traverse from the Traversable type-class: it takes a Tree and a function that can be applied to the Tree’s subtrees and give back an Applicative. Worded differently, it takes the Tree as input, apply the function to each subtree, and use pure, liftA2 and/or <*> to lift the result in the Applicative returned by that function.

Once the instance of that type-class is written, we can use the ‘main’ function of this library:

1
reifyGraph :: (MuRef s) => s -> IO (Graph (DeRef s))

Thus, given a Tree, the function would return a Graph (TreeF s Unique). The Graph type is just a pair of [(Unique, (TreeF s) Unique)] and a Unit, referring to the root node of the graph.

TLDR: The main ‘challenge’ with data-reify is to understand what DeRef does and how to implement mapDeRef.

Why it does not work as is

What about GADTs? A GADT-defined tree would be something like:

1
2
3
data Tree e a where
  Leaf :: a -> Tree e a
  Node :: e a -> e a -> Tree e a

Where a is the type of the values in the Tree and e is a type that needs a type (i.e. e :: * -> *). The latter defines the type of the subtrees (e a means that the subtree must have values of type a).

Using a GADT to define tree-like structure is a common way to define, say, ASTs. Haskell being a popular choice of languages implementation, defining ASTs is actually a common thing to do.

A ‘problem’ with this approach is that building AST is a bit tricky: Node (Leaf 1) (Node (Leaf 2) (Leaf 3)) has type TreeF (TreeF (TreeF e)) a. See how the type information gives us the depth of the tree? Yeah, we don’t want that. Instead, we will define and use a Fix type to fix this (pun intended) (go checkout this awesome article for more [^fix-gadt])

Let’s try to declare the instance of MuRef for our GADT

1
2
instance MuRef (Fix Tree a) where
  type DeRef (Fix Tree a) = Tree -- ...???

OK, first problem here: Tree has kind (* -> *) -> * -> *, not * -> *. Fix Tree has the correct kind though, but the type parameters do not match semantically: the first * in * -> * is the kind of the sub-expression type. However, in our GADT, that sub-expression has kind (* -> *), not * (We would also need to swap the type parameters, but it can easily be done though a type synonym or something).

The type of MuRef is also problematic, in the sense that its definition is not compatible with the fact that our sub-expressions are typed: it would not be (forall b . (MuRef b, DeRef a ~ DeRef b) => b -> f u) but (forall b t . (MuRef (b t), DeRef (a t) ~ DeRef (b t)) => b t -> f u). But even then, this does not work because the sub-expressions of a may not have the same type as a.

For example, consider this AST

1
2
3
data AST e a where
  Fst :: e (a, a1) -> AST e a
  Snd:: e (a1, a) -> AST e a

(a, a1) is not equal to a, making the equality between DeRef (a t) and DeRef (b t) impossible if b is a sub-expression of a with a different t.

Let’s see how we could overcome this.

TLDR

In short:

  • The DeRef type-family cannot be used with a GADT, as sub-expressions are typed, and the former’s definition does not take that into account. DeRef a should have hind (* -> *) -> * -> *
  • The mapDeRef’s constraint uses type equality to ensure that it is applied to the correct type of sub-expression. However, this type equality cannot be used with GADTs since we can’t guarantee that the sub-expressions’ semantic type is the same as the parent node’s.

[^fix-gadt] Fixing GADTs

Solution

After a couple of days tinkering, I found a (sad) solution: a full ‘reimplementation’ Well, saying it’s a full reimplementation is a bit of a stretch since we (fortunately) won’t have to change the actual logic of reifyGraph, just the type definition.

Graph

Let’s start with the easiest part:

1
2
3
4
5
6
7
type Unique = Int -- Same as in the original library

data Graph e a = Graph [(Unique, Node e)] Unique

data Node e = forall t. MkNode (DeRef e Terminal t)

newtype Terminal a = Terminal {unTerminal :: Unique}

Graph is similar to the original library’s: it’s still a map from Unique to AST/Tree Node.

Node is just a wrapper around DeRef e applied to Terminal and t, an existentially quantified type. This allows us to have a list of tuples where the second element is a Node of ‘any’ type.

Terminal a is a wrapper around a Unique is a phantom type, allowing us to say to the type-checker ‘Hey!, this sub-expression in the graph has type a, if you evaluate this sub-expression, you’ll get an a, I promise’.

MuRef

1
2
3
4
5
6
7
8
class MuRef (a :: Type -> Type) where
    type DeRef a :: (Type -> Type) -> Type -> Type
    type E a :: Type -> Type
    mapDeRef ::
        (Applicative f, e ~ E a) =>
        (forall t'. (MuRef e) => e t' -> f (u t')) ->
        a t ->
        f ((DeRef a) u t)

MuRef a has kind * -> Constraint, meaning that a does not contain the semantic type of the expression it represents. E.g., we would define instance MuRef (Fix Tree), instead of instance MuRef (Fix Tree a)

DeRef a has a kind compatible with GADTs. For example, DeRef (Fix Tree) would be Tree

E a gives the type of the sub-expression of a. For Fix Tree, it would be Fix Tree. We’ll use this type-family along with equality constraint to guarantee to reifyGraph that the input tree has monotonic subtree types.

Finally, mapDeRef takes a tree a with type t and a function that takes a subtree of a (of type e with whatever type t') and returns a f (u t'). The function returns, wrapped in an applicative, a tree like a but where the subtrees have type u. We still guarantee that the returned tree have the same type as the input tree.

reifyGraph

reifyGraph’s type becomes:

1
2
3
4
reifyGraph ::
    (MuRef s, E (E s) ~ E s, DeRef (E s) ~ DeRef s) =>
    s a ->
    IO (Graph (DeRef s) a)

In this new definition, we enforce that the input tree’s subtrees have the same type as the parent tree. This might feel a bit restrictive, but the original definition of mapDeRef had the DeRef b ~ DeRef a constraint, which enforced a similar restriction.

We’ll have to add this constraint to every subfunction of reifyGraph, but considering there are only three, I guess it’s totally acceptable.

Passing a Fix Tree a returns a Tree Terminal a.

Conclusion

Too bad we could not reuse code from the original package and had to replicate it to just adapt the types to our needs, considering the internal logic is untouched. But this choice was taken after a couple of days thinking about it. Considering the small size of the package, we could consider it a quick/dirty solution.

Defining a wrapper around the original MuRef type-class is not possible considering that at some point we would have to remove type information, and recover them would be a pain.

I have not tried to use this new version on non-GADT tree structures, that would be an interesting future work.

The code is available on GitHub and has been published on Hackage as data-reify-gadt. The repository contains code examples, if you want to take a look.

The day after I published the package, I stumbled across data-treify 3, a 10-year-old library, which seems to tackle our issue. However, unlike what was presented in this article, it seems to be a full-reimplementation with a few more type information.

This post is licensed under CC BY 4.0 by the author.

Trending Tags