Playing Fast and Loose With GADTs

We’re going to be playing with Generalized Algebraic Data Types (terrible article) in Clojure. To preface, I am not an authority on GADTs, or much of anything — I mostly learn about things by implementing and explaining them. The code is here.

We’ll begin with a brief exposition of what these things are, and then go on to implement a tiny, domain-specific language in which we can express them, pattern-match them, and define functions over them. We will then implement a trivial, type-safe expression interpreter in this DSL.

Note that we’re not interested in a concrete implementation of GADTs, just exploring them at the type level — we’re writing a type-inferring checker, not adding a feature to a programming language.

Algebraic?

What’s algebraic about a data type? If we think about particular ways of combining types as mirroring algebraic operations, it’ll help us reason about fancier typing constructs (and compilers love it). Let’s look at a “product type”:

data Point2D = Point2D Float Float

This type can represent every possible float paired with every possible float, or Float * Float Which is a sick number of floats¹. We’ve also got sum types:

data Point = Point2D Float Float | Point3D Float Float Float

Which correspond to addition, or (Float * Float) + (Float * Float * Float). I’m taking for granted that everyone has internalized this syntax, but the above declares, in Haskell-land, a type Point with two constructors, Point2D and Point3D. Our ADTs can take type variables, e.g. Maybe, which is called Option almost everywhere else:

data Maybe a = Nothing | Just a

Where a is a type variable. These are just ADTs - what’s a generalized ADT?

Generalized ADTs

In the previous example, Point2D and Point3D are both Point, which is fine, but there are cases where we’d like different type constructors to produce different instances of the type. The classic example is the typed interpreter:

data Expr a where
  LitInt  :: Int  -> Expr Int
  LitBool :: Bool -> Expr Bool
  Add     :: Expr Int -> Expr Int -> Expr Int
  If      :: Expr Bool -> Expr a -> Expr a -> Expr a

These constructors are just functions - Add takes two Expr Int and returns an Expr Int. The power of this type becomes apparent when we look at the implementation of an evaluation function:

eval :: Expr a -> a
eval (LitInt n)      = n
eval (LitBool b)     = b
eval (Add x y)       = eval x + eval y
eval (If cond t e)   = if eval cond then eval t else eval e

When we pattern match on an Expr constructor (e.g. LitInt n), we refine the type variable (eval is typed as Expr a -> a), and thus know at compile time that n is an Int, and can treat it like one on the right-hand side of the =. We can’t do this with plain ADTs. There are other cute applications, like type-level data structure invariants.

The specifics will be covered in more detail below.

Implementation

I’m going to repeatedly refer to unification, which will be familiar to anyone who has experimented with logic programming, or type inference. In our context, it means finding a variable substution which will make two types equivalent, if such a thing is possible.

Our implementation is purely functional, because that’s the simplest approach here. It isn’t necessarily the simplest approach for for all unification problems.

Note that the implementation is optimized for brevity and simplicity; there are many edge cases it doesn’t handle. We’re doing the bare minimum to support our typed interpreter example.

Beyond the initial explanations, we will also use an idealized syntax when discussing terms in the program, like {A (List A)} instead of spelling out the record types.

Let’s get the boring stuff out of the way. I apologize in advance that I write and format my code in a particularly twee style.

(ns gadt.core
  (:require [clojure.string :as str]
            [clojure.walk   :as walk]))

(defrecord TypeVar  [name])
(defrecord TypeCtor [name tag params ret-type])

Type constructors hold the name of the type (e.g. Expr), the tag, or constructor name (IntLit) and a sequence of the type parameters and the return type.

(defrecord Type [name params])

Above is how we represent occurrences of the (parameterized) type itself within its constructors, e.g. all instances of Expr in:

...
Add :: Expr Int -> Expr Int -> Expr Int

While these are strictly type constructors, we treat them differently because:

We can’t construct types with them outside of the definition of Expr.
They require unification, as they appear in the parameter lists and return types of exposed type constructors.

(defn type-var?    [t] (instance? TypeVar  t))
(defn type-ctor?   [t] (instance? TypeCtor t))
(defn type?        [t] (instance? Type     t))
(defn atomic-type? [t] (keyword? t))

(defn make-type-var [name]
  (->TypeVar name))
(defn make-type-ctor [name tag params ret-type]
  (->TypeCtor name tag params ret-type))
(defn make-type [name params]
  (->Type name params))

(defn env-lookup [env var]
  (or (env var)
      (throw (ex-info (str "Unbound identifier: " var) {:var var}))))

env is just an immutable map. In practice, it’s almost always *env*, a dynamic binding. Questionable choice, maybe, but we’re going to have functions with pattern matching expressions inside of them, and if they all had to be passed an environment map explicitly, it would get messy. Let’s jump right into how we declare our data types:

Declarations

(declare-data-type List [A]
  (Nil  []           (List A))
  (Cons [A (List A)] (List A)))

We’re using upper-case letters for type-variables by convention. What’s happening here is we’re introducing a GADT called List, with a single universally quantified (i.e. “forall A”) type variable, which the Nil constructor does not constrain — it returns the polymorphic (List A). The Cons constructor takes a value of some type, and a List of that same type, returning a List of that one type.

The declare-data-type macro is not important, but after it runs, we will end up with the following *env*:

{Nil
 #TypeCtor{:name      List
           :tag       Nil
           :params    []
           :ret-type  #Type{:name List :args [#TypeVar{:name A}]}}
 Cons
 #TypeCtor{:name      List
           :tag       Cons
           :params    [#TypeVar{:name A}
                       #Type{:name List :args [#TypeVar{:name A}]}]
           :ret-type  #Type{:name List :args [#TypeVar{:name A}]}}}

Whew. It’s important to note that it’s not called define-data-type — the form binds no names in the current namespace, only in the currently bound type environment.

Typechecking

Let’s look at a redacted version of typecheck, and imagine the case where we’re typechecking the constructor (Nil) in the environment we defined above.

(defn typecheck
  ([expr env]
   (cond
     (atomic-type? expr) expr
     (number?      expr) :number
     (string?      expr) :string
     (boolean?     expr) :bool
     (symbol?      expr) (env-lookup env expr)

     ....
     (and (list? expr) (symbol? (first expr)))
     (let [[ctor & args] expr
           ctor-type     (env-lookup env ctor)]
       (cond
         (type-ctor? ctor-type)
         (let [param-types (:params ctor-type)]
           (assert (= (count args) (count param-types)))
           (let [ret-type  (:ret-type ctor-type)
                 arg-types (map #(typecheck % env) args)
                 subst     (unify-pairwise param-types arg-types)]
             (instantiate ret-type subst)))

         ...
         :else (throw (ex-info "Not a constructor or function"
                               {:expr expr}))))))
  ([expr] (typecheck expr *env*)))

We can see that not much would happen. unify-pairwise has nothing to do, because there are no parameters or arguments, and instantiate (which substitutes type variables for types, given a substitution map) will return the input type unmolested if given an empty one. Here, we’re instantiating the return type of (Nil), which is the polymorphic (List A). So, the result of typechecking (Nil) is (List A).

instantiate is pretty simple, subst is a map of type variables to types, and we try to look up each type variable in it, recurring structurally anywhere a type variable might be hiding.

(defn instantiate [type subst]
  (cond
    (type-var?  type) (subst type type)
    (type-ctor? type) (make-type-ctor
                       (:name type)
                       (:tag  type)
                       (map #(instantiate % subst) (:params type))
                       (instantiate (:ret-type type) subst))
    (type?      type) (make-type
                       (:name type)
                       (map #(instantiate % subst) (:params type)))
    :else             type)))

Our unification functions:

(declare unify)

(defn unify-pairwise [t1s t2s]
  (reduce
   (fn [subst [arg1 arg2]]
     (let [arg1'  (instantiate arg1 subst)
           arg2'  (instantiate arg2 subst)
           subst' (unify arg1' arg2')]
       (merge subst subst')))
   {}
   (map vector t1s t2s)))


(defn unify [t1 t2]
  (cond
    (= t1 t2) {}

    (type-var? t1)
    (if-not (contains? (type-vars t2) t1)
      {t1 t2}
      {})

    (type-var? t2)
    (if-not (contains? (type-vars t1) t2)
      {t2 t1}
      {})

    (and (type? t1)
         (type? t2)
         (= (:name t1)
            (:name t2))) (unify-pairwise (:params t1) (:params t2))

    :else
    (throw (ex-info "Cannot unify" {:types [t1 t2]}))))

(defn try-unify [t1 t2]
  (try
    (unify t1 t2)
    (catch Exception _
      nil)))

type-vars (elided), is similar structurally to instantiate insofar is it visits the interesting parts of each type.

What about a case like (typecheck '(Cons 2 (Nil)))? In typecheck, the parameter types for Cons are [A (List A)], and our supplied arguments are of type [:number (List A)], obtained by recursive calls to typecheck. 2 is trivially a :number, and recall from our previous example that the return type of (Nil) is the polymorphic (List A). We then call unify-pairwise on [A (List A)] [:number (List A)] which:

Instantiates A and :number (no-op)
Unifies A and :number, yielding the substition map {A :number}
Instantiates (List A) and (List A) with that substition map (both instantiate to (List :number))
Both trivially unify via =, adding no additional substitutions
The return type of Cons, which is (List A), is then instantiated in {A :number}, giving us the type (List :number) for the expression

Now, we want to pattern match our types. This involves some trade-offs. I didn’t want the implementation to grow in complexity unnecessarily, so I didn’t handle nested constructors, meaning that this may very well match:

(match (Cons 1 (Nil))
  [(Cons x (Cons y (Nil))) x])

I don’t see this as an immediate problem, because this is a fledgling typechecker, which is not often going to have access to concrete terms.

Pattern Matching

We need to add another clause to typecheck:

(defn typecheck [expr env]
  (cond
    ...
    (and (list? expr) (= (first expr) 'match))
    (let [[_ inspect & cases] expr
          ty                  (typecheck inspect env)]
      (or (process-match-cases env ty cases nil)
          (throw (ex-info "No matching cases in match expression" {:expr expr}))))))

Let’s take match pattern like this, the body of a hypothetical first function:

(declare-data-type Maybe [A]
  (Nothing []  (Maybe A))
  (Just    [A] (Maybe A)))

(match (Cons 1 (Nil))
  [(Cons x _) (Just x)]
  [(Nil)      (Nothing)])

Which typechecks to (Maybe :number). We first typecheck (Cons 1 (Nil)), getting (List :number) and bind the patterns to cases (in the above code listing). Let’s look at process-match-cases:

(defn process-match-cases [env ty cases result]
  (if (empty? cases)
    result
    (let [[[pat body] & rest-cases] cases]
      (if-let [refined-env (pattern-match env pat ty)]
        (let [body-type (typecheck body refined-env)
              result    (if result
                          (if (type-var? result)
                            result
                            (if (type-var? body-type)
                              body-type
                              (if-let [subst (try-unify result body-type)]
                                (instantiate result subst)
                                ty)))
                          body-type)]
          (process-match-cases env ty rest-cases result))
        (process-match-cases env ty rest-cases result)))))

We process all of the cases, because we’re a type checker. pattern-match returns nil if it can’t match the given pattern against the supplied type, otherwise it returns the environment extended with the bindings that’ll exist on the right hand side of the match (the body). The handling of result is to try and maintain the most general type possible across all matching patterns. Here’s pattern-match:

(defn pattern-match [env pattern ty]
  (cond
    (= pattern '_)    env
    (symbol? pattern) (assoc env pattern ty)

    (and (list? pattern) (symbol? (first pattern)))
    (let [[ctor-name & sub-pats] pattern
          ctor-type              (try
                                   (env-lookup env ctor-name)
                                   (catch Exception _
                                     nil))]
      (when (and ctor-type (type-ctor? ctor-type))
        (let [param-types (:params ctor-type)
              ret-type    (:ret-type ctor-type)]
          (when (= (count sub-pats) (count param-types))
            (when-let [subst (try-unify ret-type ty)]
              (let [refined-params (map #(instantiate % subst) param-types)]
                (loop [env            env
                       [p & patterns] sub-pats
                       [t & types]    refined-params]
                  (if-not t
                    env
                    (when-let [env (pattern-match env p t)]
                      (recur env patterns types))))))))))

    (or (number?  pattern)
        (boolean? pattern)
        (string?  pattern))
    (if (try-unify (typecheck pattern env) ty)
      env
      nil)

    :else nil))

In the case where we’re matching the pattern (Cons x _) we’ll hit the third clause in the cond. Recall that ty is the result of typechecking the subject of the match pattern — (Cons 1 (Nil)). refined-params will be [:number (List :number)], so the recursive calls in loop (x => (assoc env x :number), _ => env) refine the environment. The result is (Maybe :number) when the body, (Just x), is typechecked in the refined env by the result processing logic in process-match-cases.

Functions

(defrecord FuncType [params ret-type])

(defn func-type? [t] (instance? FuncType t))

(defn make-func-type [param-types ret-type]
  (->FuncType param-types ret-type))

(def ^:dynamic *env*
  {'+   (make-func-type [:number :number] :number)
   '-   (make-func-type [:number :number] :number)
   '*   (make-func-type [:number :number] :number)
   'not (make-func-type [:bool] :bool)
   '=   (make-func-type [:number :number] :bool)})

(defn instantiate [type subst]
  (cond
    (func-type? type) (make-func-type
                       (map #(instantiate % subst) (:params type))
    ...)))

(defn unify [t1 t2]
  (cond
    (and (func-type? t1)
         (func-type? t2))
    (let [param-subst (unify-pairwise (:params t1) (:params t2))
          ret-subst   (unify (instantiate (:ret-type t1) param-subst)
                             (instantiate (:ret-type t2) param-subst))]
      (merge param-subst ret-subst))
      ...))

(defn typecheck
  ([expr env]
   (cond
     ...
     (and (list? expr) (= (first expr) 'if))
     (let [[_ cond then else] expr
           cond-type (typecheck cond env)
           then-type (typecheck then env)
           else-type (typecheck else env)]
       (unify cond-type :bool)
       (unify then-type else-type)
       (let [result-subst (unify then-type else-type)]
         (instantiate then-type result-subst)))

     (and (list? expr) (symbol? (first expr)))
     (let [[ctor & args] expr
           ctor-type     (env-lookup env ctor)]
       (cond
         ...
         (func-type? ctor-type)
         (let [param-types (:params ctor-type)]
           (assert (= (count args) (count param-types)))
           (let [ret-type  (:ret-type ctor-type)
                 arg-types (map #(typecheck % env) args)]
             (doseq [[param-type arg-type] (map vector param-types arg-types)]
               (unify param-type arg-type))
             ret-type))))))
  ([expr] (typecheck expr *env*)))

Nothing too surprising here. Again, we’re going to skip over the macro we use to declare our functions, and just declare some:

(declare-fn [A] identity [x A] A
  x)

(type->str (*env* 'identity)) ;; => 'A -> 'A

This is the polymorphic identity function, parameterized over the type variable A. Here’s a simple factorial:

(declare-fn factorial [n :number] :number
  (if (= n 0)
    1
    (* n (factorial (- n 1)))))

Note that as before, we’re not binding anything at the top-level, and we can’t execute these functions. They are typechecked (the inferred body type is unified with the annotated return type) at declaration time. And, finally:

(declare-data-type Expr [A]
  (LitNum  [:number]         (Expr :number))
  (LitBool [:bool]           (Expr :bool))
  (Not     [(Expr :bool)]    (Expr :bool))
  (Add     [(Expr :number)
            (Expr :number)]  (Expr :number))
  (Eq?     [(Expr :number)
            (Expr :number)]  (Expr :bool))
  (If      [(Expr :bool)
            (Expr A)
            (Expr A)]        (Expr A)))

(declare-fn [A] evaluate [x (Expr A)] A
  (match x
    [(LitNum n)  n]
    [(LitBool b) b]
    [(Not e)     (not (evaluate e))]
    [(Add e1 e2) (+ (evaluate e1) (evaluate e2))]
    [(Eq? e1 e2) (= (evaluate e1) (evaluate e2))]
    [(If cond then-expr else-expr)
     (if (evaluate cond)
       (evaluate then-expr)
       (evaluate else-expr))]))

;; Expr 'A -> 'A

Happy hacking! Hope there weren’t too many howlers. Full implementation source.

We might be tempted to think about this in terms of sets, where Point2D is the cartesian product of the set of Float and the set of Float. Set-theoretic type systems are a thing, but we’re sticking with algebraic types here. ↩