pl-rants

Rants about programming languages

Dec 23, 2019

Compiling to Closures

This is partly a follow-up to the previous post about the RTypes library and partly some random thoughts on the topic in general.

It's been quite a while since I published a post, and not because I had nothing to share, far from it. The reason is that I have been spending almost every available minute of my off-work time on my side project. Alas, it isn't doing as well as I hoped it would. On the bright side, I finally found time to write about some interesting things related to PL topics.

TL;DR: the "compiled to closures" interpreter turned out to be roughly 2x faster than the walk-the-tree interpreter.

So what is "compiling to closures"? To tell the truth, up until recently I have been mentally cringing every time I heard someone put the words "compile" and "closures" together. Normally, when we talk about compiling in the context of programming languages, we refer to the process of translating abstract syntax trees (AST) to some lower-level, simplified, machine-readable form. For example, we can compile C code to x86 machine code, or Erlang to BEAM bytecode to be executed on the BEAM virtual machine. As opposed to that, interpreting, in its traditional sense, is the process of directly evaluating code while traversing the AST. Nowadays, when we have multi-stage JIT compilers, GraalVM, and whatnot, it's not an easy task to draw a solid line between the two. Still, the distinction becomes clearer if we look at some code that illustrates the ideas.

Let's imagine we have a type1 that defines an AST (most examples are in Elixir)

@type ast :: list(expr)
@typep expr :: {:funcall, funname, list(expr)}
             | {:binop, bopname, lhs::expr, rhs::expr}
             | {:unop, uopname, expr}
             | term
@typep term :: {:integer, integer()}
             | {:float, float()}
             | {:bool, boolean()}
             | {:string, binary()}

The type says that the AST for some hypothetical language is a list of expressions, where each expression can be either a function call, a binary operator, a unary operator, or a single term, and the basic types allowed in that language are integers, floating point numbers, booleans, and strings. The AST is abstract, that is, it says nothing about the concrete language and can equally well represent code in different surface languages. To highlight the idea, consider the following code fragments in some language A and some other language B

// language A
f(1+2, "hi there") * 512;
print(5, "hello" + "world");
# language B
(f (1 intadd 2) 'hi there') mult 512
print 5 ('hello'%'world')

They are likely to produce the same AST

[
  {:binop, :*,
    {:funcall, "f", [{:binop, :+, {:integer, 1}, {:integer, 2}},
                     {:string, "hi there"}]},
    {:integer, 512}},
  {:funcall, "print", [{:integer, 5},
                       {:binop, :concat,
                        {:string, "hello"}, {:string, "world"}}]}
]

The point is, ASTs represent code structure, ignoring particularities of a surface language. That's why we concentrate only on AST in the following sections.

Interpreter vs Compiler

To write an interpreter for the AST above, we can sketch a function eval that recursively walks the tree, performing the necessary operations as it proceeds

# the function takes an AST node and an environment and returns
# a reduced AST node and a possibly updated environment
@spec eval(expr(), env()) :: {expr(), env()}

# a unary operator evaluates the term and then applies the op
def eval({:unop, op, term}, env) do
  {term1, env1} = eval(term, env)
  apply_unop(op, term1, env1)
end

# a binary operator evaluates both sides and then applies the op
def eval({:binop, op, lhs, rhs}, env) do
  {lhs1, env1} = eval(lhs, env)
  {rhs1, env2} = eval(rhs, env1)
  apply_binop(op, lhs1, rhs1, env2)
end

# a function call is a bit more involved. It first fetches
# the function from the environment, then evaluates all
# the arguments left-to-right, and finally applies the
# function to the evaluated arguments
def eval({:funcall, fname, args}, env) do
  f = get_fun(env, fname)
  {args1, env1} = Enum.reduce(args, {[], env}, fn arg, {acc, env} ->
    {result, env1} = eval(arg, env)
    {[result | acc], env1}
  end)

  apply(f, Enum.reverse(args1), env1)
end

# def eval(...)

# the driver function chains multiple evaluations together,
# discarding all intermediate results
@spec interpret(ast(), env()) :: {expr(), env()}
def interpret(exprs, env) do
  Enum.reduce(exprs, {:none, env}, fn expr, {_result, env} ->
    eval(expr, env)
  end)
end

The function recursively evaluates tree nodes, left-to-right, depth-first. In other words, arguments to functions and operators are evaluated first. One thing to note here is the type of the eval function: it takes an expression and returns a reduced expression, possibly updating the environment as it proceeds.
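
The helpers get_fun, apply_unop, apply_binop, and apply/3 are left out above. Here is a minimal sketch of one possible shape for them, assuming the dynamic environment is a map from names to functions; these definitions are not part of the original code and only serve to make the fragment self-contained (apply_unop would be analogous to apply_binop)

# the module needs to shadow the auto-imported Kernel.apply/3
import Kernel, except: [apply: 3]

# fetch a function value from the dynamic environment
defp get_fun(env, fname), do: Map.fetch!(env, fname)

# one possible clause for integer addition; the environment
# is threaded through unchanged
defp apply_binop(:+, {:integer, a}, {:integer, b}, env) do
  {{:integer, a + b}, env}
end

# a fetched function receives the already-evaluated arguments
# together with the environment
defp apply(f, args, env), do: f.(args, env)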

Now let's outline a compiler which produces code for an over-simplified stack-based VM. The VM supports only two instructions: push and call. push puts a value on top of the stack, while call pops the required number of values off the stack and puts the result of a function call back.

@spec compile(expr(), static_env()) :: [instruction]
  when instruction: {:push, val} | {:call, ref}

def compile({:integer, val}, _static_env) do
  [{:push, {:integer, val}}]
end

def compile({:string, val}, _static_env) do
  [{:push, {:string, val}}]
end

# ... code for other primitive types

def compile({:unop, op, term}, static_env) do
  op_ref = get_unop(static_env, op)
  # `code` should put the fully evaluated term
  # on top of the stack
  code = compile(term, static_env)
  code ++ [{:call, op_ref}]
end

def compile({:binop, op, lhs, rhs}, static_env) do
  op_ref = get_binop(static_env, op)
  code_lhs = compile(lhs, static_env)
  code_rhs = compile(rhs, static_env)
  code_lhs ++ code_rhs ++ [{:call, op_ref}]
end

def compile({:funcall, fname, args}, static_env) do
  fun_ref = get_fun(static_env, fname)
  code = Enum.flat_map(args, fn arg ->
    compile(arg, static_env)
  end)
  code ++ [{:call, fun_ref}]
end

We can see that the compile function also does a depth-first, left-to-right tree traversal, but its return value is a list of instructions. Also, the environment it is evaluated in is static (compile-time), while for the interpreter the environment is dynamic (run-time).

Interpreters walk ASTs every time they are called, producing a simplified (reduced) AST as a result, while compilers walk the trees only once, producing some low-level code. The funny thing is, the resulting low-level code can in turn be interpreted by a VM or compiled into some other representation. In real compilers the process can be multi-stage, combining the two approaches both at run time and ahead of time.
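
To make the contrast concrete, here is a minimal sketch of the VM loop itself; it is not part of the compiler above and assumes a call target is a {fun, arity} pair, so the VM knows how many values to pop off the stack

# run a list of push/call instructions against an empty stack
# and a dynamic environment
def run(instructions, env) do
  Enum.reduce(instructions, {[], env}, fn
    {:push, val}, {stack, env} ->
      {[val | stack], env}

    {:call, {fun, arity}}, {stack, env} ->
      # pop the arguments (topmost is the rightmost), apply
      # the function, and push the result back
      {args, rest} = Enum.split(stack, arity)
      {result, env1} = fun.(Enum.reverse(args), env)
      {[result | rest], env1}
  end)
end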

Compiling to Closures

What is "compiling to closures" then? The idea is to walk the tree only once, as if we were to compile it, but instead of producing a list of instructions, generate a chain of suspended function calls. Let's demonstrate it on the same AST type.

## `cc` takes an expression and a static environment and returns
## a function that takes a dynamic environment and returns a fully
## evaluated term and a possibly updated dynamic environment
@spec cc(expr(), static_env()) :: (env() -> {term(), env()})

def cc({:integer, val}, _static_env) do
  # primitive values are returned as is
  fn env -> {{:integer, val}, env} end
end

# def cc({:string, ...}) # other primitive types

# unary operations require only one closure
def cc({:unop, op, term}, static_env) do
  op_fun = get_unop(static_env, op)
  closure = cc(term, static_env)
  fn env ->
    {val, env1} = closure.(env)
    apply(op_fun, [val], env1)
  end
end

# binary ops require two closures
def cc({:binop, op, lhs, rhs}, static_env) do
  op_fun = get_binop(static_env, op)
  closure_lhs = cc(lhs, static_env)
  closure_rhs = cc(rhs, static_env)
  fn env ->
    {lhs_val, env1} = closure_lhs.(env)
    {rhs_val, env2} = closure_rhs.(env1)
    apply(op_fun, [lhs_val, rhs_val], env2)
  end
end

# function calls require a closure for each arg
def cc({:funcall, fname, args}, static_env) do
  f = get_fun(static_env, fname)
  cls = Enum.map(args, &cc(&1, static_env))

  fn env ->
    # values for function application must be evaluated at run-time
    {vals, env1} = Enum.reduce(cls, {[], env}, fn c, {vals, env} ->
      {val, env1} = c.(env)
      {[val | vals], env1}
    end)

    apply(f, Enum.reverse(vals), env1)
  end
end

As we can see, the code is very similar to what we had in the compile function: instead of producing instruction sequences we prepare closures at compile time, and instead of concatenating the sequences with the ++ operator we return new closures. With the above in mind, I now think the word "compile" is somewhat justified. Yes, we don't spit out lower-level code, but once we abstract over the "combine" operation, we get exactly the same code. Compare, e.g., the two binop clauses

def compile({:binop, op, lhs, rhs}, static_env) do
  op_ref = get_binop(static_env, op)
  code_lhs = compile(lhs, static_env)
  code_rhs = compile(rhs, static_env)
  code_lhs ++ code_rhs ++ [{:call, op_ref}]
end

def cc({:binop, op, lhs, rhs}, static_env) do
  op_fun = get_binop(static_env, op)
  closure_lhs = cc(lhs, static_env)
  closure_rhs = cc(rhs, static_env)
  fn env ->
    {lhs_val, env1} = closure_lhs.(env)
    {rhs_val, env2} = closure_rhs.(env1)
    apply(op_fun, [lhs_val, rhs_val], env2)
  end
end

An interesting aspect here is that while the cc function is evaluated in a static (compile-time) environment, the returned closure must be evaluated in the dynamic (run-time) environment.
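
The pay-off is that the tree is walked only once. A hypothetical usage sketch (static_env and env stand for whatever environments the host application provides):

# compile the expression `1 + 41` once...
compiled = cc({:binop, :+, {:integer, 1}, {:integer, 41}}, static_env)

# ...then evaluate it as many times as needed
{result, _env1} = compiled.(env)
# result is {:integer, 42}, assuming :+ maps to integer addition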

Why compile to closures?

The answer, I believe, is performance. Imagine that you need to pass through a large, complex maze. Interpreting an AST can be thought of as making a decision at each intersection using a rule book, e.g. "if there are three exits and the previous intersection had at least two exits, then take the one in the middle". The more rules in that rule book, the more time-consuming the process becomes. And if you don't have perfect memory, the next time you pass through, you repeat that tedious process again. Compiling to closures can be thought of as someone having already passed through that maze once and marked each exit using the rule book. You still need to walk the maze, but there is no need to do the tedious calculations at each intersection; the path is known. The "rule book" in this analogy corresponds to the type that defines the AST, and the maze is analogous to some concrete source code.

Closures in RTypes library

I had a chance to apply the above in practice. The first iteration of the RTypes library contained a simple walk-the-tree interpreter. The Extractor module extracted type specifications from a .beam file and returned their AST, while the Checker module implemented the interpreter that walked the extracted AST for a given term.

# primitive integer type
def check(term, {:type, _line, :integer, []}) do
  is_integer(term)
end

# integer range, e.g. 1..100
def check(term, {:type, _, :range, [{:integer, _, l}, {:integer, _, u}]}) do
  is_integer(term) and term >= l and term <= u
end

# list with elements of type `typ`
def check(term, {:type, _line, :list, [typ]}) when is_list(term) do
  Enum.reduce_while(term, true, fn elem, _ ->
    if check(elem, typ), do: {:cont, true}, else: {:halt, false}
  end)
end

# ... another 50 clauses handling primitive and compound types

It worked quite nicely as a sanity checker during the active development phase (of the project that motivated it in the first place, that is). And the return on the time I invested in writing the library was tenfold, because validating complex data structures became trivial. Some time later I decided to use it as a validator for JSON requests over HTTP (in another project). When you potentially have thousands of requests per second, performance does matter.

My first thought was to generate Core Erlang and compile it directly to .beam files, but I remembered the "compile to functions" discussion I had had a few weeks before and decided to give the approach a try. Modifying the Checker module turned out to be a straightforward, if not routine, task. The result was the Lambda module, which generates the closures

def build({:type, _line, :integer, _args}) do
  fn term -> is_integer(term) end
end

def build({:type, _, :range, [{:integer, _, l}, {:integer, _, u}]}) do
  fn term -> is_integer(term) and term >= l and term <= u end
end

def build({:type, _line, :list, [typ]}) do
  typ? = build(typ)
  fn term ->
    is_list(term) && Enum.all?(term, typ?)
  end
end

## ... another 50 clauses for primitive and compound types
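
A hypothetical usage sketch (the tuples follow the Erlang abstract format matched above, with line numbers zeroed out): build produces the predicate once, and the predicate can then be applied to any number of terms

# a checker for the range type 1..100
in_range? = build({:type, 0, :range, [{:integer, 0, 1}, {:integer, 0, 100}]})

in_range?.(42)    #=> true
in_range?.(101)   #=> false
in_range?.(:atom) #=> false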

I also created a simple benchmark to measure the performance gain on two structures, a simple one and a fairly complex one, both containing reasonably large terms. To be honest, I wasn't completely sure which approach would win. On one hand, "closures" should, in theory, be faster. On the other hand, pattern matching is pervasive in Erlang and thus must be heavily optimised.

It turned out, in both cases the "compiled to closures" checker was roughly 2x faster:

Name                       ips      average  deviation  median   99th %
simple term, closures      544.60   1.84 ms  ±4.32%     1.81 ms  2.22 ms
simple term, interpreter   272.75   3.67 ms  ±6.23%     3.60 ms  4.49 ms
complex term, closures     826.93   1.21 ms  ±9.68%     1.20 ms  1.61 ms
complex term, interpreter  393.51   2.54 ms  ±10.28%    2.54 ms  3.06 ms

The library is available on hex.pm and on GitHub. Running the benchmark is as simple as

$ mix deps.get
$ MIX_ENV=bench mix run bench/run.exs

Conclusion

A walk-the-tree interpreter is often the simplest and most straightforward solution for anything AST-related. When performance matters, however, compiling to closures may provide a quick and easy gain, and it is worth attempting before embarking on a journey towards bytecode compilation, virtual machines, or JIT.

Footnotes:

1. The AST does not allow variable bindings, function definitions, and other such things, in order to keep the examples concise.