CS 242: Algebraic data types

After the last few lectures, we have most of our language fundamentals in place: variables, functions, numbers, and booleans. Now it’s time to start building up data structures that give us rich representations of objects in the world. For example, the world’s most popular data structure is probably the list: an ordered sequence of elements, usually of the same type. While most Turing languages implement lists as arrays due to their correspondence with random-access memory, Church languages usually implement lists as linked lists due to their correspondence to an inductive datatype. Specifically, a linked list in OCaml looks like:

module type List = sig
  type 'a list
end

module LinkedList : List = struct
  type 'a list = Null | Node of 'a * 'a list
end

To define a linked list, we invoke five different type-level features:

Variants (or enums), which define a type that can be one of multiple things. Here, a linked list is either empty or a node.
Within a node, we have a tuple that combines two types together: an 'a (pronounced “alpha”) and an alpha list.
This “alpha” uses another feature, polymorphism, to allow a linked list to contain elements of potentially any type.
The recursive usage of alpha list requires recursive types, or types whose definition contains self-reference.
Modules allow the concrete implementation of the list type to be hidden from its users, which requires an idea of existential types.
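To see the first four features working together, here is a minimal sketch. It repeats the variant from LinkedList at the top level (so the constructors are visible), and the length function and the two example lists are my own illustrative additions, not part of the module above.

```ocaml
(* The same variant as in LinkedList, written at the top level for brevity. *)
type 'a list = Null | Node of 'a * 'a list

(* Recursion over a recursive type: one match branch per variant case. *)
let rec length (l : 'a list) : int =
  match l with
  | Null -> 0
  | Node (_, rest) -> 1 + length rest

(* Polymorphism: the same type and function work for ints and strings. *)
let ints : int list = Node (1, Node (2, Node (3, Null)))
let strs : string list = Node ("a", Node ("b", Null))

let () = Printf.printf "%d %d\n" (length ints) (length strs)
```

Each Node holds a tuple of an element and the rest of the list, which is exactly the 'a * 'a list in the type definition.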

In the next three lectures, we will explore the theory and practice of each feature except for recursive types (due to time constraints). By the end, we will have a proper mathematical formalization of a modular, polymorphic, inductive linked list data type. Today, we will start by discussing algebraic data types, i.e. structs and enums.

ADTs by example

The basic idea behind algebraic data types (ADTs) is to represent relations between data, specifically the notions of “and” and “or”. An AND type is one that represents multiple types combined together, and an OR type represents a value that is exactly one of many possible types.

Records

The former idea is quite common in programming languages. C has structs:

#include <stdio.h>

typedef struct {
  int x;
  int y;
} point_t;

int main(void) {
  point_t origin;
  origin.x = 0;
  origin.y = 0;
  printf("(%d, %d)\n", origin.x, origin.y);
  return 0;
}

Python has tuples:

origin = (0, 0)
print(origin[0], origin[1])

These are both instances of the same core idea: types that contain multiple components. If the components are anonymous, like in the Python example, we call them tuples¹, and if the components have names, we call them records (or structs in the C case). Tuples are distinct from lists, as tuples have a fixed size. Records are distinct from dictionaries (aka maps, aka associative arrays) for the same reason. You have most certainly seen tuples or records before, so not much more detail is necessary. Here’s what they look like in OCaml:

type point = {
  x : int;
  y : int;
}

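let () =
  (* A tuple: components are anonymous, so we destructure to access them. *)
  let origin: int * int = (0, 0) in
  let (x, y) = origin in
  Printf.printf "(%d, %d)\n" x y;

  (* A record: components are named, so we access them by field. *)
  let origin: point = { x = 0; y = 0; } in
  Printf.printf "(%d, %d)\n" origin.x origin.y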

Two points of interest. First, in order to access elements of the tuple, we destructured the tuple into its components with let (x, y) = origin. This, along with its generalization, pattern matching, is a common feature of functional programming languages (and sadly uncommon in other languages). Second, when constructing the point record, the record literal itself did not have to be annotated with the type name (e.g. point { x = 0; y = 0 }); its type is instead inferred.
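Destructuring is not limited to let. As a small sketch, the same patterns can appear directly in function parameters, for both tuples and records. The functions swap_pair and manhattan and the value p are names I made up for illustration.

```ocaml
type point = { x : int; y : int }

(* A tuple pattern in parameter position: no separate let is needed. *)
let swap_pair ((a, b) : int * int) : int * int = (b, a)

(* A record pattern: { x; y } binds each field to a variable of the same name. *)
let manhattan ({ x; y } : point) : int = abs x + abs y

(* The literal is inferred to be a point from its field names alone. *)
let p = { x = 3; y = -4 }

let () = Printf.printf "%d\n" (manhattan p)
```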

Variants

The latter idea, of having a type that represents one of many choices, is far less common in traditional programming languages, but just as important as records. In OCaml, this is usually called a “variant” type. The simplest example is the option type (OCaml):

type option = Some of int | None

let div (m : int) (n : int) : option =
  if n = 0 then None
  else Some (m / n)

let () =
  let k = div 5 0 in
  match k with
  | Some k' -> Printf.printf "Success: %d" k'
  | None -> Printf.printf "Failure :("

A value of the option type can be one of two things: either a None, or a Some(n) for any integer n. In the div function, we use this to indicate the absence or presence of a result: if we divide by zero, then return None, else return Some(m/n).

That creates a value of the option type, but in order to use it, we have to be able to ask: is this result a some, or a none? For that, we can use the match statement, which allows us to define a branch for each possible value of the result. Here, we say “if k is a some, then get the integer result and print it, otherwise print failure.”
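One consequence worth seeing in code: the only way to get the integer out of an option is to go through a match, and the compiler warns if a branch is missing. Here is a minimal sketch reusing the option type and div from above, where with_default is a hypothetical helper of my own.

```ocaml
type option = Some of int | None

let div (m : int) (n : int) : option =
  if n = 0 then None
  else Some (m / n)

(* Hypothetical helper: collapse an option to an int, falling back to d
   when there is no result. Both branches must be handled. *)
let with_default (d : int) (o : option) : int =
  match o with
  | Some k -> k
  | None -> d

let () =
  Printf.printf "%d %d\n" (with_default 0 (div 6 3)) (with_default 0 (div 5 0))
```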

As an extended example, variants are amazingly useful for representing errors. Consider some alternative approaches to error handling. In C, programs traditionally use error codes:

#include <stdio.h>
#include <stdlib.h>

#define ERROR_DIV_BY_ZERO -1

const char* error_to_string(int err) {
  if (err == ERROR_DIV_BY_ZERO) {
    return "Div by zero";
  } else {
    return "<Unknown>";
  }
}

/* Named int_div because <stdlib.h> already declares a function called div. */
int int_div(int m, int n, int* result) {
  if (n == 0) {
    return ERROR_DIV_BY_ZERO;
  } else {
    *result = m / n;
    return 0;
  }
}

int main(void) {
  int result;
  int err = int_div(5, 0, &result);
  if (err < 0) {
    printf("Error: %s\n", error_to_string(err));
    exit(1);
  }
  return 0;
}

This interface is horrendous.

We have to pass a pointer to our result, only for it to possibly be filled in, and we could still ignore the error and use result’s undefined value.
We have to maintain a gross morass of #defines and error-to-string conversions.
We have to have a clear convention on error codes: is 0 success or failure? Is a value greater than 0 a success or a failure? What about a value less than 0?

Some “modern languages” still get this wrong². Any language with a notion of null pointers (e.g. Java) often uses pointers as “implicit” option types: if the returned pointer is null, then the operation failed, otherwise it succeeded. By contrast, look at how we used our OCaml div function above:

let k = div 5 0 in
match k with
| Some k' -> Printf.printf "Success: %d" k'
| None -> Printf.printf "Failure :("

We are disallowed from using the result value unless it succeeded. We simply cannot accidentally mess up the error handling. Moreover, we could use a more expressive result type to have richer kinds of errors:

type error = DivByZero | ...
type result = Ok of int | Error of error

let error_to_string e = match e with
| DivByZero -> "Div by zero"
| ...

let div (m : int) (n : int) : result =
  if n = 0 then Error (DivByZero)
  else Ok (m / n)

let k = div 5 0 in
match k with
| Ok k' -> Printf.printf "Success: %d" k'
| Error e -> Printf.printf "Failure: %s" (error_to_string e)

We can reuse the variant mechanism not just to represent our results, but also our errors too! One thing this makes clear: the power of a variant over the traditional notion of an “enum” (like you find in C) is that variant branches can have fields. This enables rich, safe interfaces like the one above.
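For a version of the result example that compiles on its own, here is a sketch with the elided cases filled by one hypothetical constructor of my choosing, Overflow (the quotient of min_int and -1 does not fit in an int). The describe function is also my own addition. Treat both as assumptions for illustration, not as the intended contents of the elided cases.

```ocaml
(* Overflow is a hypothetical second error, added only for illustration. *)
type error = DivByZero | Overflow
type result = Ok of int | Error of error

let error_to_string (e : error) : string =
  match e with
  | DivByZero -> "Div by zero"
  | Overflow -> "Overflow"

let div (m : int) (n : int) : result =
  if n = 0 then Error DivByZero
  else if m = min_int && n = -1 then Error Overflow
  else Ok (m / n)

(* Returns the message as a string rather than printing it directly. *)
let describe (r : result) : string =
  match r with
  | Ok k -> Printf.sprintf "Success: %d" k
  | Error e -> Printf.sprintf "Failure: %s" (error_to_string e)

let () = print_endline (describe (div 5 0))
```

Adding a new error is a one-line change to the error type, and the compiler then warns about every match that has not been updated to handle it.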

ADT theory

Hopefully that gives you an intuition for what ADTs are and how they work. Now, we’re going to explore a formalism that extends our typed lambda calculus with ADTs.

Product types

First, we’re going to look at simplified records: pairs (or 2-uples). Before I provide you the solution, I want you to think for a moment—how would you implement pairs? What do you need to add to the syntax, statics, and dynamics?

As a general methodology for developing new language features, a given feature usually has two “forms” (syntax extensions): an introduction form, and an elimination form. Introductions create a thing, and eliminations destroy it, usually providing value in the process. For example, functions are introduced by lambdas, and eliminated by function application.
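The introduction/elimination split is easy to spot in OCaml itself. A rough sketch, with all value names invented for illustration:

```ocaml
(* Functions: introduced by a lambda, eliminated by application. *)
let double : int -> int = fun x -> x * 2
let four : int = double 2

(* Pairs: introduced by (a, b), eliminated by the projections fst and snd. *)
let p : int * string = (1, "one")
let n : int = fst p

(* Variants: introduced by a constructor, eliminated by match. *)
let o : int option = Some 3
let m : int = match o with Some v -> v | None -> 0
```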

For pairs, we’ll do the same thing. We need a form to create pairs, and to destroy pairs (i.e. access their elements).

$\begin{alignat*}{3} \msf{Type}~\tau ::= \qamp \ldots \\ \mid \qamp \tprod{\tau_L}{\tau_R} \qqamp \text{product} \\[1.2em] \msf{Direction}~d ::= \qamp L \mid R\\[1.2em] \msf{Expression}~e ::= \qamp \ldots \\ \mid \qamp \pair{e_L}{e_R} \qqamp \text{pair} \\ \mid \qamp \proj{e}{d} \qqamp \text{projection} \end{alignat*}$

We add a new “direction” syntax to specify which element of the pair we’re accessing. The operation of accessing an element of a pair is called a “projection.” For example, here is a function that adds two pairs of integers componentwise:

$\funt{p_1}{\tprod{\tint}{\tint}}{ \funt{p_2}{\tprod{\tint}{\tint}}{ \pair{\proj{p_1}{L} + \proj{p_2}{L}}{\proj{p_1}{R} + \proj{p_2}{R}}}}$
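For comparison, the same function can be sketched in OCaml, where the standard functions fst and snd play the role of the L and R projections. The name add_pairs is my own.

```ocaml
(* fst and snd correspond to projecting with L and R respectively. *)
let add_pairs (p1 : int * int) (p2 : int * int) : int * int =
  (fst p1 + fst p2, snd p1 + snd p2)

let () =
  let (a, b) = add_pairs (1, 2) (3, 4) in
  Printf.printf "(%d, %d)\n" a b
```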

We also added a new type: $\tprod{\tau_L}{\tau_R}$ . In the world of type theory, this is usually called a “product type.” The reasoning for the name will become more apparent when we discuss the “algebra” part of ADTs. In the meantime, we’ll need new static semantics:

$\ir{T-Pair} {\typeJC{e_L}{\tau_L} \s \typeJC{e_R}{\tau_R}} {\typeJC{\pair{e_L}{e_R}}{\tprod{\tau_L}{\tau_R}}} \s \ir{T-Project-L} {\typeJC{e}{\tprod{\tau_L}{\tau_R}}} {\typeJC{\proj{e}{L}}{\tau_L}} \s \ir{T-Project-R} {\typeJC{e}{\tprod{\tau_L}{\tau_R}}} {\typeJC{\proj{e}{R}}{\tau_R}}$

Read them carefully to make sure you parse the syntax. The rules themselves should seem fairly commonsense: if a pair is composed of two expressions with types $\tau_L$ and $\tau_R$, then the pair type is $\tprod{\tau_L}{\tau_R}$. If an expression has type $\tprod{\tau_L}{\tau_R}$ and you’re taking out the left element, then the result is of type $\tau_L$, and vice versa for the right element. Lastly, the dynamic semantics:

$\ir{D-Pair}{}{\val{\pair{e_L}{e_R}}} \s \ir{D-Project-Step}{\steps{e}{e'}}{\steps{\proj{e}{d}}{\proj{e'}{d}}} \nl \ir{D-Project-L}{}{\steps{\proj{\pair{e_L}{e_R}}{L}}{e_L}} \s \ir{D-Project-R}{}{\steps{\proj{\pair{e_L}{e_R}}{R}}{e_R}}$

We adopt a lazy semantics (don’t evaluate a pair’s components) for simplicity. The projection rules should be self-explanatory at this point—my goal is that you can start reading operational semantics, and understanding the core idea without much additional English.
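If it helps to see the rules as code, here is a minimal small-step interpreter sketch for pairs. The expr type, the Add constructor (standing in for the base language’s arithmetic), and the helper names are all assumptions of mine. Only the Pair and Proj cases correspond to the rules above.

```ocaml
type dir = L | R

type expr =
  | Int of int
  | Add of expr * expr      (* assumed base-language arithmetic *)
  | Pair of expr * expr
  | Proj of expr * dir

(* D-Pair: a pair is a value even when its components are not (lazy). *)
let is_val (e : expr) : bool =
  match e with
  | Int _ | Pair _ -> true
  | Add _ | Proj _ -> false

(* One small step. None means e is a value or is stuck. *)
let rec step (e : expr) : expr option =
  match e with
  | Int _ | Pair _ -> None
  | Add (Int a, Int b) -> Some (Int (a + b))
  | Add (e1, e2) when not (is_val e1) ->
    (match step e1 with Some e1' -> Some (Add (e1', e2)) | None -> None)
  | Add (e1, e2) ->
    (match step e2 with Some e2' -> Some (Add (e1, e2')) | None -> None)
  | Proj (Pair (el, _), L) -> Some el                      (* D-Project-L *)
  | Proj (Pair (_, er), R) -> Some er                      (* D-Project-R *)
  | Proj (e0, d) ->                                        (* D-Project-Step *)
    (match step e0 with Some e0' -> Some (Proj (e0', d)) | None -> None)

(* Repeatedly step until no rule applies. *)
let rec eval (e : expr) : expr =
  match step e with
  | Some e' -> eval e'
  | None -> e

let example = Proj (Pair (Add (Int 1, Int 2), Int 0), L)
```

Note how laziness shows up: projecting from example first yields the unevaluated Add (Int 1, Int 2), and only a later step reduces it to Int 3.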

Sum types

As with products, we will present a simplified version of variants, usually called “sum types.” And as before, we will start by defining the syntax of the introduction and elimination forms, and then defining their semantics.

$\begin{alignat*}{3} \msf{Type}~\tau ::= \qamp \ldots \\ \mid \qamp \tsum{\tau_L}{\tau_R} \qqamp \text{sum} \\[1.2em] \msf{Expression}~e ::= \qamp \ldots \\ \mid \qamp \inj{e}{d}{\tau} \qqamp \text{injection} \\ \mid \qamp \case{e}{x_L}{e_L}{x_R}{e_R} \qqamp \text{case} \end{alignat*}$

This syntax looks a little more intimidating! First, to introduce a value of sum type, we use an “injection”. For example, $\inj{2}{L}{\tsum{\tint}{(\tfun{\tint}{\tint})}}$ creates a value that could be an integer, or could be a function (but it’s actually an integer). Then we use a $\msf{case}$ statement like a more limited match to conditionally run code based on the value of a sum type. For example, a function that either returns a number or calls a function:

$\funt{x}{\tsum{\tint}{(\tfun{\tint}{\tint})}}{\case{x}{n}{n}{f}{(\app{f}{0})}}$
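As a point of reference, binary sums can be sketched in OCaml with a two-constructor variant. The type name sum, the constructors Inl and Inr, and the function name run are my own choices here. Inl and Inr play the role of injecting with L and R, and match plays the role of case.

```ocaml
(* A binary sum: a value is either a left 'l or a right 'r. *)
type ('l, 'r) sum = Inl of 'l | Inr of 'r

(* Either returns the number, or calls the function on 0. *)
let run (x : (int, int -> int) sum) : int =
  match x with
  | Inl n -> n
  | Inr f -> f 0

let () = Printf.printf "%d %d\n" (run (Inl 2)) (run (Inr (fun n -> n + 1)))
```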

We need to add typing rules for our new syntax:

$\ir{T-Inject-L} {\typeJC{e}{\tau_L}} {\typeJC{\inj{e}{L}{\tsum{\tau_L}{\tau_R}}}{\tsum{\tau_L}{\tau_R}}} \s \ir{T-Inject-R} {\typeJC{e}{\tau_R}} {\typeJC{\inj{e}{R}{\tsum{\tau_L}{\tau_R}}}{\tsum{\tau_L}{\tau_R}}} \nl \ir{T-Case} {\typeJC{e}{\tsum{\tau_L}{\tau_R}} \s \typeJ{\ctx,\hasType{x_L}{\tau_L}}{e_L}{\tau} \s \typeJ{\ctx,\hasType{x_R}{\tau_R}}{e_R}{\tau}} {\typeJC{\case{e}{x_L}{e_L}{x_R}{e_R}}{\tau}}$

The first two rules are fairly simple: if an injection says “this $e$ is a sum type $\tsum{\tau_L}{\tau_R}$”, then double-check the type of $e$ against the side being injected and roll with it, no further questions asked. Case statements are trickier. First, we check that the expression we’re casing on is actually a sum type $\tsum{\tau_L}{\tau_R}$. Then we have two expressions with variable bindings, and type-checking rules reminiscent of the original rule for lambdas. For $e_L$, we have to check what type it returns assuming $\hasType{x_L}{\tau_L}$, getting some $\tau$. We then check the type of $e_R$ assuming $\hasType{x_R}{\tau_R}$. Importantly, the two expressions have to return the same type. This is implicitly written in the rule by using the same variable name $\tau$ for both judgments.
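These three rules translate almost line for line into a type checker. Below is a minimal sketch covering only integers, variables, injections, and case. The typ and expr definitions, the association-list context, and the name type_of are assumptions of mine rather than anything fixed by the formalism.

```ocaml
type dir = L | R
type typ = TInt | TSum of typ * typ

type expr =
  | Int of int
  | Var of string
  | Inj of expr * dir * typ                        (* annotated with the sum type *)
  | Case of expr * string * expr * string * expr

(* Returns Some t if e has type t under ctx, and None if e is ill-typed. *)
let rec type_of (ctx : (string * typ) list) (e : expr) : typ option =
  match e with
  | Int _ -> Some TInt
  | Var x -> List.assoc_opt x ctx
  | Inj (e1, d, (TSum (tl, tr) as t)) ->
    (* T-Inject-L / T-Inject-R: check e1 against the chosen side. *)
    let expected = (match d with L -> tl | R -> tr) in
    if type_of ctx e1 = Some expected then Some t else None
  | Inj _ -> None                                  (* annotation is not a sum *)
  | Case (e0, xl, el, xr, er) ->
    (* T-Case: the scrutinee must be a sum, and both branches must agree. *)
    (match type_of ctx e0 with
     | Some (TSum (tl, tr)) ->
       (match type_of ((xl, tl) :: ctx) el, type_of ((xr, tr) :: ctx) er with
        | Some t1, Some t2 when t1 = t2 -> Some t1
        | _ -> None)
     | _ -> None)
```

The `when t1 = t2` guard is where the shared $\tau$ of the rule becomes an explicit equality check.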

Lastly, the dynamic semantics:

$\ir{D-Inject}{}{\val{\inj{e}{d}{\tau}}} \nl \ir{D-Case-Step} {\steps{e}{e'}} {\steps {\case{e}{x_L}{e_L}{x_R}{e_R}} {\case{e'}{x_L}{e_L}{x_R}{e_R}}} \nl \ir{D-Case-L} {} {\steps {\case{(\inj{e}{L}{\tau})}{x_L}{e_L}{x_R}{e_R}} {\subst{x_L}{e}{e_L}}} \nl \ir{D-Case-R} {} {\steps {\case{(\inj{e}{R}{\tau})}{x_L}{e_L}{x_R}{e_R}} {\subst{x_R}{e}{e_R}}}$

As an exercise, you should read through these yourself and see if you can understand the semantics. A goal of this course is to equip you with the necessary tools and methodologies to explore programming languages on your own, so this is a valuable lesson for improving your fluency in the language of PL.
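Once you have tried reading the rules, one way to check your understanding is to compare against an executable sketch. The one below is my own rough transcription, under two simplifying assumptions: injections drop their type annotation, and substituted expressions are closed, so variable capture cannot occur.

```ocaml
type dir = L | R

type expr =
  | Int of int
  | Var of string
  | Inj of expr * dir                              (* type annotation omitted *)
  | Case of expr * string * expr * string * expr

(* subst x v e replaces free occurrences of x in e with v.
   Assumes v is closed, so no capture-avoidance is needed. *)
let rec subst (x : string) (v : expr) (e : expr) : expr =
  match e with
  | Int _ -> e
  | Var y -> if x = y then v else e
  | Inj (e1, d) -> Inj (subst x v e1, d)
  | Case (e0, xl, el, xr, er) ->
    (* A branch that rebinds x shadows it, so leave that branch alone. *)
    Case (subst x v e0,
          xl, (if xl = x then el else subst x v el),
          xr, (if xr = x then er else subst x v er))

(* One small step. None means e is a value or is stuck. *)
let rec step (e : expr) : expr option =
  match e with
  | Int _ | Var _ -> None
  | Inj _ -> None                                              (* D-Inject *)
  | Case (Inj (e0, L), xl, el, _, _) -> Some (subst xl e0 el)  (* D-Case-L *)
  | Case (Inj (e0, R), _, _, xr, er) -> Some (subst xr e0 er)  (* D-Case-R *)
  | Case (e0, xl, el, xr, er) ->                               (* D-Case-Step *)
    (match step e0 with
     | Some e0' -> Some (Case (e0', xl, el, xr, er))
     | None -> None)

let example = Case (Inj (Int 2, L), "n", Var "n", "f", Int 0)
```

Stepping example picks the left branch and substitutes Int 2 for n, giving Int 2.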

ADT metatheory

Now that we’ve changed our simply typed lambda calculus, we’re going to have to re-prove our type safety theorems. After all, we could have introduced potentially incorrect/breaking changes. However, one of the neat things about structural induction proofs is that they are naturally extensible. Rather than having to go through the entire language again, I can just prove the cases of the theorem for each new rule I provided. Below, I’ll walk through a proof of progress for sum types as an example.

Theorem: if $\hasType{e}{\tau}$ then either $\val{e}$ or there exists an $e'$ such that $\steps{e}{e'}$ .

Proof: by structural induction (extending from last time).

T-Inject-L: if $\typeJ{}{\inj{e}{L}{\tsum{\tau_L}{\tau_R}}}{\tsum{\tau_L}{\tau_R}}$ then either $\val{(\inj{e}{L}{\tsum{\tau_L}{\tau_R}})}$ or there exists an $e'$ such that $\steps{\inj{e}{L}{\tsum{\tau_L}{\tau_R}}}{e'}$ .

By D-Inject, $\val{(\inj{e}{L}{\tsum{\tau_L}{\tau_R}})}$ , so progress holds.
T-Inject-R: proof is same as T-Inject-L.
T-Case: if $\typeJ{}{\case{e}{x_L}{e_L}{x_R}{e_R}}{\tau}$ then either $\val{(\case{e}{x_L}{e_L}{x_R}{e_R})}$ or there exists an $e'$ such that $\steps{\case{e}{x_L}{e_L}{x_R}{e_R}}{e'}$ .

First, assume the premises of T-Case (listed above).

Second, assume the inductive hypothesis for $e, e_L, e_R$ .

Third, casing on each possibility for $e$ derived from the IH:
1. If $\val{e}$, then because $\typeJ{}{e}{\tsum{\tau_L}{\tau_R}}$, by inversion of T-Inject-L and T-Inject-R there are two possible forms for $e$:
  1. $\inj{e'}{L}{\tsum{\tau_L}{\tau_R}}$ : then by D-Case-L:
    $\steps {\case{(\inj{e'}{L}{\tau})}{x_L}{e_L}{x_R}{e_R}} {\subst{x_L}{e'}{e_L}}$
  2. $\inj{e'}{R}{\tsum{\tau_L}{\tau_R}}$ : then by D-Case-R:
    $\steps {\case{(\inj{e'}{R}{\tau})}{x_L}{e_L}{x_R}{e_R}} {\subst{x_R}{e'}{e_R}}$
2. If $\steps{e}{e'}$ , then by D-Case-Step:
  $\steps {\case{e}{x_L}{e_L}{x_R}{e_R}} {\case{e'}{x_L}{e_L}{x_R}{e_R}}$
Hence, in each case, there exists an $e'$ that the $\msf{case}$ steps to, and the progress theorem holds.

Algebra of ADTs

Product types and sum types are collectively called “algebraic” data types because they have algebraic properties similar to normal integers. It’s a neat little construction that might help you understand the relation between products/sums, although it probably won’t change much in the design of your programs.

Generally, you can understand the algebraic properties in terms of the number of values that inhabit a particular type. For example, the type $\tbool$ is inhabited by two values, true and false. We write this as $\mag{\tbool} = 2$ . To fully develop the algebra, we first need to add two concepts to our language:

$\begin{alignat*}{3} \msf{Type}~\tau ::= \qamp \ldots \\ \mid \qamp \tvoid \qqamp \text{void type} \\ \mid \qamp \tunit \qqamp \text{unit type} \\[1em] \msf{Expression}~e ::= \qamp \ldots \\ \mid \qamp () \qqamp \text{unit value} \end{alignat*}$

Above we introduce the void and unit types³. The idea behind these types is that there are 0 expressions that have the type $\tvoid$ , and there is 1 expression that has the type $\tunit$ (the unit value). With these in the language, now we have an identity for both of our data types:

$\begin{align*} |\tprod{\tau}{\tunit}| &= |\tau| \\ |\tsum{\tau}{\tvoid}| &= |\tau| \end{align*}$

For example, if $\tau = \tbool$, then the terms that have the type $\tprod{\tbool}{\tunit}$ are $(\msf{true}, ())$ and $(\msf{false}, ())$. In an information-theoretic sense, adding the unit type (or the $1$ type) to our pair provides no additional information about our data structure. The number of terms inhabiting the type remains the same. Similarly, for the sum case, the type $\tsum{\tbool}{\tvoid}$ has two values: $\inj{\msf{true}}{L}{\tsum{\tbool}{\tvoid}}$ and $\inj{\msf{false}}{L}{\tsum{\tbool}{\tvoid}}$. There are no values of type $0$, so there are no other possible injections.

More generally, the sum and product type operators follow these basic rules:

$\begin{align*} |\tprod{\tau_L}{\tau_R}| &= |\tau_L| \times |\tau_R| \\ |\tsum{\tau_L}{\tau_R}| &= |\tau_L| + |\tau_R| \end{align*}$
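We can sanity-check these counting rules by enumerating inhabitants explicitly. In the sketch below, each type is represented by a list of all its values. The sum type, the empty void type (which relies on OCaml’s empty-variant syntax, available in recent compiler versions), and the helper names are my own illustrative assumptions.

```ocaml
type ('l, 'r) sum = Inl of 'l | Inr of 'r
type void = |   (* a type with no constructors, hence no values *)

(* Each list enumerates every inhabitant of its element type. *)
let all_bool : bool list = [true; false]
let all_unit : unit list = [()]
let all_void : void list = []

(* Inhabitants of a product: every left value paired with every right value. *)
let product (ls : 'a list) (rs : 'b list) : ('a * 'b) list =
  List.concat (List.map (fun l -> List.map (fun r -> (l, r)) rs) ls)

(* Inhabitants of a sum: every left injection, then every right injection. *)
let sum (ls : 'a list) (rs : 'b list) : ('a, 'b) sum list =
  List.map (fun l -> Inl l) ls @ List.map (fun r -> Inr r) rs

let () =
  assert (List.length (product all_bool all_unit) = 2);   (* |bool x unit| = |bool| *)
  assert (List.length (sum all_bool all_void) = 2);       (* |bool + void| = |bool| *)
  assert (List.length (product all_bool all_bool) = 2 * 2);
  assert (List.length (sum all_bool all_bool) = 2 + 2)
```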

These rules then give rise to a number of algebraic properties. For example:

$\begin{align*} |\tprod{\tau_L}{\tau_R}| &= |\tprod{\tau_R}{\tau_L}| \tag*{(commutativity)} \\ |\tprod{(\tprod{\tau_L}{\tau_R})}{\tau_R'}| &= |\tprod{\tau_L}{(\tprod{\tau_R}{\tau_R'})}| \tag*{(associativity)} \\ |\tau_L' \times (\tau_L + \tau_R)| &= |(\tau_L' \times \tau_L) + (\tau_L' \times \tau_R)| \tag*{(distributivity)} \end{align*}$

One way to interpret this is as calculating information content. A pair of two numbers is logically the same as the same pair reversed. As in, there’s not a real difference between { int x; int y; } and { int y; int x; }. For more on this, I recommend The algebra (and calculus!) of algebraic data types (uses Haskell syntax, but ideas are the same).
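The “logically the same” intuition can be made concrete by writing functions that convert between the two sides of an equation without losing information. Here is a sketch for commutativity and distributivity, with a sum type and function names of my own choosing.

```ocaml
type ('l, 'r) sum = Inl of 'l | Inr of 'r

(* Commutativity of products: swap is its own inverse. *)
let swap ((a, b) : 'a * 'b) : 'b * 'a = (b, a)

(* Distributivity, left to right: c x (a + b) becomes (c x a) + (c x b). *)
let distribute ((c, s) : 'c * ('a, 'b) sum) : ('c * 'a, 'c * 'b) sum =
  match s with
  | Inl a -> Inl (c, a)
  | Inr b -> Inr (c, b)

(* Distributivity, right to left: the inverse of distribute. *)
let factor (s : ('c * 'a, 'c * 'b) sum) : 'c * ('a, 'b) sum =
  match s with
  | Inl (c, a) -> (c, Inl a)
  | Inr (c, b) -> (c, Inr b)

let () =
  assert (swap (swap (1, "a")) = (1, "a"));
  assert (factor (distribute (true, Inl 3)) = (true, Inl 3))
```

Round-tripping through each pair of functions gets you back where you started, which is why the two types have the same number of inhabitants.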

¹ “Tuple” is a generalization of single, double, triple, quadruple, quintuple, etc. to $t$-uple, a composite of $t$ elements. Hence, tuple.
² It’s always a strong statement to say something is categorically “wrong” in a programming language, but I feel pretty strongly about this one.
³ While OCaml has no built-in void type, the unit type is used frequently as the result of side-effecting functions. For example, Printf.printf returns a value of type unit.
