CS 242: Modules

In programming language design, a module can often refer to many kinds of functionality. At its core, a module is a bundle of related functionality, usually under a particular namespace. In Python for example, a module is literally a file, and all the functions/data defined in the file are grouped under the file’s module. The module’s name is defined by the file’s name, and then a complex system of imports (e.g. from ..a.b import c) locates a file and executes it.

The other primary role of a module is to distinguish between interface and implementation. For example, in Javascript, a module can choose to selectively export certain symbols with an export keyword which become the module’s interface. Similarly, Java has private classes that can only be referenced in the defining module.

For this course, we will treat the former role of file-finding as a less interesting, more logistical feature. There are certainly many meaningful design decisions involved, particularly in the creation of a networked package manager like npm, pip, maven, so on. From a theoretical perspective, we will focus on the latter role of data abstraction and information hiding: how can we formally encode the notion of interface/implementation separation?

Abstract data types

As before, we will begin by walking through a few concrete examples in OCaml before diving into the theory. A common use case for modules in OCaml is to define an abstract data type, or ADT (not to be confused with an algebraic data type). You may have learned about them all the way back in CS 106B when talking about vectors, stacks, and queues (slide link):

The key idea of an abstract data type is that you can talk about a data structure in terms of its high-level operations and not its implementation details. For example, a list can be implemented as an array or a linked list, but I should be able to index, search, and sort it all the same without knowing which implementation is used (ignoring computational complexity questions for now).

In CS 106B, you learn about ADTs from a cognitive perspective: how do you think about a problem in terms of stacks, vectors, grids, graphs, and so on? How can you map a real world problem (like N-grams) onto these generic data structures? However in practice, ADTs also provide a significant software engineering benefit. If my client code does not rely on implementation details of your module, then you can freely change the module’s implementation as long as the interface stays the same. This decoupling enables software teams to decompose big problems and work independently. For example, Jeff Bezos famously mandated this idea in 2002:

All teams will henceforth expose their data and functionality through service interfaces.

Teams must communicate with each other through these interfaces.

There will be no other form of interprocess communication allowed: no direct linking, no direct reads of another team’s data store, no shared-memory model, no back-doors whatsoever. The only communication allowed is via service interface calls over the network.

It doesn’t matter what technology they use. HTTP, Corba, Pubsub, custom protocols – doesn’t matter. Bezos doesn’t care.

All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions.

Anyone who doesn’t do this will be fired.

OCaml modules

Now let’s take a look at a concrete example. In OCaml, we can define an interface (or “signature”) for a queue as follows:

module type Queue = sig
  type 'a t

  val make : unit -> 'a t
  val enqueue : 'a t -> 'a -> 'a t
  val dequeue : 'a t -> ('a * 'a t) option
end

A queue has two components: an abstract type 'a t, where 'a means the type of the element in the queue, and t means the type of the queue itself. The usage of t is idiomatic for OCaml, so a client can refer to an integer queue as the type int Queue.t as opposed to e.g. int Queue.queue. Here, t means “the main type associated with the module”.

The val keyword indicates values as a part of the interface, here three different functions. make creates a queue, enqueue puts an element on the queue, and dequeue takes an element off the queue (returning None if the queue is empty). Because this is a functional data structure, modifications do not happen in-place, but rather return a new data structure.

We can implement a queue using OCaml’s standard linked list data structure:

module SingleListQueue : Queue = struct
  type 'a t = 'a list

  let make () = []

  let enqueue (q : 'a t) (x : 'a) : 'a t = x :: q

  let rec dequeue (q : 'a t) : ('a * 'a t) option =
    match q with
    | [] -> None
    | [x] -> Some (x, [])
    | x :: q' ->
      (match dequeue q' with
       | Some (x', q'') -> Some (x', x :: q'')
       | None -> None)
end

This provides a concrete implementation of both the type and the functions. We assign 'a t to 'a list, defining our implementation type. Then enqueue is simple, just put the enqueued element at the top of the list. Dequeue is then complex, since we have to perform an $O(n)$ operation to walk to the end of the list and remove the last element.

Given this module, a piece of client code can use the queue’s type and functions as follows:

let module Q = SingleListQueue in (* module alias for less verbosity *)
let q : int Q.t = Q.make () in
let q = Q.enqueue q 2 in
let q = Q.enqueue q 3 in
assert (match Q.dequeue q with
| Some (n, _) -> n = 2
| None -> false)

Note that if we declare the module without the : Queue syntax, e.g.

module SingleListQueue = struct ... end

Then we can make use of the module’s implementation details. For example:

let q : int Q.t = Q.make () in
match q with
| [] -> Printf.printf "Empty"
| x :: q' -> Printf.printf "First element: %d" x

However, if we run the same code but with SingleListQueue : Queue, then we now get the error:

Error: This pattern matches values of type 'a list
       but a pattern was expected which matches values of type
         int Q.t = int SingleListQueue.t

This demonstrates how OCaml enforces the interface/implementation distinction. When a module implements a signature, only the knowledge exported by the signature can be used in the client. Here, because the fact that 'a t = 'a list is only contained in the implementation, it’s a type error to use that fact outside the SingleListQueue implementation.

Functors

Again, one advantage of being implementation-agnostic is that different queues imlementations can be freely and safely swapped in the client code. For example, imagine a more efficient queue implemented with two lists:

module DoubleListQueue : Queue = struct
  type 'a t = 'a list * 'a list

  let make () = ([], [])

  let enqueue ((enq, deq) : 'a t) (x : 'a) : 'a t = (x :: enq, deq)
  let rec dequeue ((enq, deq) : 'a t) : ('a * 'a t) option =
    match (enq, deq) with
    | ([], []) -> None
    | (_, []) -> dequeue ([], List.rev enq)
    | (_, x :: deq') -> Some ((x, (enq, deq')))
end

This clever queue has an amortized $O(1)$ dequeue cost (see Example 2 in the reference). As a client, I can write my code using a parameterized module, or a functor, so its queue implementation can be easily changed.

module QueueTest(Q : Queue) = struct
  let q : int Q.t = Q.make () in
  let q = Q.enqueue q 2 in
  let q = Q.enqueue q 3 in
  assert (match Q.dequeue q with
  | Some (n, _) -> n = 2
  | None -> false)
end

module SingleListQueueTest = QueueTest(SingleListQueue)
module DoubleListQueueTest = QueueTest(DoubleListQueue)

This syntax says: the module QueueTest takes as input another module Q that implements the Queue interface. It’s a module-level function, essentially. Then it makes use of a generic Q module, agnostic of its implementation. We can stamp out the test structure for each new implementation as shown at the bottom. Now that’s what I call modularity!

Existential types

As always, our next step is to turn this informal intuition into a formal mathematical description. Recall back to algebraic data types: if we view records and variants as products and sums, they become the natural dual of each other. If you have a type that gives you x AND y, then it’s that begs the question: what does x OR y mean?

Now think back to last lecture, where we saw that parametric polymorphism could be cast as a for-all type $\tpoly{\alpha}{\tau}$ . Meaning, a program that worked for all input types $\alpha$ , regardless of what they were. In first-order logic, the opposite of a for-all is “there-exists”, or “ $\exists$ ” symbolically. It turns out that we can neatly capture the essence of abstract data types with existential types.

The intuition behind existential types as ADTs is the idea of selectively erasing type-knowledge about an expression. As a toy example, let’s say I have a record type where field “state” should be private, and a function “get” should be public. The type:

$\msf{obj} \triangleq \trec{\alpha}\{\msf{state} : \tnum, \msf{get} : \tfun{\alpha}{\tnum}\}$

And an example of making such an object:

$\begin{align*} &\lett{x}{\msf{obj}}{(\fold{\{\msf{state} = 1; \msf{get} = \funt{o}{\msf{obj}}{(\unfold{o}).\msf{state}}\}}{\msf{obj}})}{\\ &\lett{n}{\tnum}{(\unfold{x}).\msf{get}~x}{\ldots}} \end{align*}$

However, a client using the object $x$ can freely reference $(\unfold{x}).\msf{state}$ due to standard record semantics. What we’d like to do is change the type of $\msf{obj}$ to disallow the access to $\msf{state}$ , while keeping the underlying value the same. As an existential type, we’d write that as:

$\msf{obj\_priv} \triangleq \tmod{\beta}{\trec{\alpha}\{\msf{state} : \beta, \msf{get} : \tfun{\alpha}{\tnum}\}}$

The key idea is that we “hide” the concrete type of state as an unknown type $\beta$ . The type says “there exists a $\beta$ such that this object is a type”. Then any client using the object can still access $x.\msf{state}$ , but their program will only type-check knowing that $x.\msf{state} : \beta$ , not $x.\msf{state} : \tnum$ . Hence, they cannot perform any numeric operations on the state, and have to go through the $\mathsf{get}$ function to access the state.

Let’s now walk through formally the syntax and semantics of existential types. First, we extend our type language with an existential:

$\begin{alignat*}{2} \msf{Type}~\tau ::= \qamp \ldots \\ \mid \qamp \tmod{\alpha}{\tau} \qqamp \text{existential type} \end{alignat*}$

In the expression language, our goal is first to add an operation that allow us to create values of existential type by “erasing” types. How can we make an $\msf{obj\_priv}$ ? Then we want to make an operation that lets us use values of existential type. Thinking about existential types as mapping to modules, we call these operations “export” and “import.”

$\begin{alignat*}{2} \msf{Expression}~e ::= \qamp \ldots \\ \mid \qamp \export{e}{\tau_\msf{adt}}{\tau_\msf{mod}} \qqamp \text{create module} \\ \mid \qamp \import{x}{\alpha}{e_\msf{mod}}{e_\msf{body}} \qqamp \text{use module} \end{alignat*}$

The idea is that export erases the type $\tau_\msf{adt}$ from $e$ , returning a value of existential type $\tau_\msf{mod}$ . Then an import allows client code to access the exported expression, but without knowledge of the actual type $\tau_\msf{adt}$ , only the abstract type $\alpha$ . Extending our previous example:

$\begin{align*} &\lett{m_\msf{mod}}{\msf{obj\_priv}}{\export{(\fold{\{\msf{state} = 1; \msf{get} = \funt{o}{\msf{obj}}{(\unfold{o}).\msf{state}}\}}{\msf{obj}})}{\tnum}{\msf{obj\_priv}}}{\\ &\import{m_\msf{api}}{\alpha}{m_\msf{mod}}{\\ &\lett{n}{\tnum}{(\unfold{m_\msf{api}}).\msf{get}~m_\msf{api}}{\ldots}}} \end{align*}$

Then inside this piece of code, if we try to compute $(\unfold{m_\msf{api}}).\msf{state}$ , we get back a value of type $\alpha$ , not of type $\tnum$ . That enforces the desired privacy boundary.

Formally, we encode this data type abstraction behavior through two type rules:

$\ir{T-Export} {\typeJC{e}{\subst{\alpha}{\tau_\msf{adt}}{\tau_\msf{mod}}}} {\typeJC {\export{e}{\tau_\msf{adt}}{\tmod{\alpha}{\tau_\msf{mod}}}} {\tmod{\alpha}{\tau_\msf{mod}}}} \nl \ir{T-Import} {\typeJC{e_\msf{mod}}{\tmod{\alpha}{\tau_\msf{mod}}} \s \typeJ{\ctx, x : \subst{\alpha}{\beta}{\tau_\msf{mod}}}{e_\msf{body}}{\tau_\msf{body}}} {\typeJC {\import{x}{\beta}{e_\msf{mod}}{e_\msf{body}}} {\tau_\msf{body}}}$

Interpreting each rule in turn:

T-Export: remember that the key idea in creating a value of existential type is to “erase” a component of a value’s type. The type $\tmod{\alpha}{\tau_\msf{mod}}$ represents the user’s desired existential type with the erased component, e.g. $\msf{obj\_priv}$ erases $\msf{state} : \tnum$ to $\msf{state} : \beta$ . To ensure that the provided expression can actually erase to the provided existential type, we have to check the key condition: does $e$ have type $\tau_\msf{mod}$ if we replace the abstract type variable $\alpha$ with the concrete type $\tau_\msf{adt}$ ? For example, in the expression $(\msf{let}~m_\msf{mod} : \msf{obj\_priv} = \ldots)$ , we have to check that replacing $\beta$ with $\tnum$ in $\tmod{\beta}{\trec{\alpha}\{\msf{state} : \beta, \msf{get} : \tfun{\alpha}{\tnum}\}}$ gives us back the actual object type $\msf{obj}$ .
T-Import: to then use a value of existential type, an import expression provides access to both the module body $e_\msf{mod}$ and its abstract data type $\alpha$ . The client $e_\msf{body}$ can only use the module under its existential type, which enforces the privacy condition that knowledge of erased types cannot be used. We replace $\alpha$ with $\beta$ in $\tau_\msf{mod}$ so the client can use a separate type variable than the one used in the existential type.

And finally, we have the dynamic semantics.

$\ir{D-Export}{}{\val{(\export{e}{\tau_\msf{adt}}{\tmod{\alpha}{\tau_\msf{mod}}})}} \nl \ir{D-Import-Step} {\steps{e_\msf{mod}}{e'_\msf{mod}}} {\steps {\import{x}{\alpha}{e_\msf{mod}}{e_\msf{body}}} {\import{x}{\alpha}{e'_\msf{mod}}{e_\msf{body}}}} \nl \ir{D-Import-Sub} {} {\steps {\import{x}{\alpha}{(\export{e_\msf{mod}}{\tau_\msf{adt}}{\tau_\msf{mod}})}{e_\msf{body}}} {\subst{x}{e_\msf{mod}}{e_\msf{body}}}}$

Just like with parametric polymorphism, our modules are essentially no-ops at runtime. All the feature complexity lives in the typing rules to ensure that the expressions are used in only the proper way. As one final example, we can now encode the queue API from above into our lambda calculus. First, we write down the module signature as a combination of an existential and record type.

$\begin{align*} \alpha~\msf{queue\_api} \triangleq \tmod{\beta}{\{ &\msf{make} : \tfun{\tunit}{\beta}; \\ &\msf{enqueue} : \tfun{\beta}{\tfun{\alpha}{\beta}}; \\ &\msf{dequeue} : \tfun{\beta}{(\alpha \times \beta)~\msf{option}} \}} \end{align*}$

Note that we’re not defining $\alpha~\msf{queue}$ , as in the type of a concrete queue (like we did with $\alpha~\msf{list}$ last time). Instead we’re defining the type of an API over queues of $\alpha$ elements, where $\beta$ is the concrete implementation of the queue structure. We can use export to create such a concrete implementation:

$\begin{align*} \lett{qm_\msf{list}}{\tpoly{\alpha}{\alpha~\msf{queue}}}{\\ \poly{\alpha}{\export{\{&\msf{make}: \funt{\_}{\tunit}{\msf{Nil}~[\alpha]}, \\ &\msf{enqueue}: \funt{q}{\alpha~\msf{list}}{\funt{x}{\alpha}{\msf{Cons}~[\alpha]~q~x}}, \\ &\msf{dequeue}: \fix{f}{\tfun{\alpha~\msf{list}}{(\alpha \times \alpha~\msf{list})~\msf{option}}}{ \funt{q}{\alpha~\msf{list}}{\\ & \hspace{2em} \case{(\unfold{q})}{\_}{\msf{None}~[\alpha]}{t}{ \ldots } }} \\ &\}}{\alpha~\msf{list}}{\alpha~\msf{queue\_api}}} \\ }{\ldots} \end{align*}$

Then we can use the implementation with an import:

$\begin{align*} &\import{qm}{\beta}{qm_\msf{list}~[\tnum]}{ \\ &\lett{q}{\beta}{qm.\msf{empty}~()}{ \\ &\lett{q}{\beta}{qm.\msf{enqueue}~q~0}{ \ldots }}} \end{align*}$

And that’s it! We now have a formalized notion of modules through existential types.

Sections

Abstract data types

OCaml modules

Functors

Existential types