pl-rants

Rants about programming languages

Dec 30, 2018

A Nix parser in OCaml. Part 1: Lexer

I think I may consider myself lucky: I work in a team which, while being a service assurance team (hence naturally risk-averse and conservative), still doesn't hesitate to use some "fancy" technologies when they make sense.

So we are slowly migrating from Puppet and Ubuntu hosts to NixOS. The technology makes it easy to build reproducible environments down to the byte level and, at the same time, have instantaneous roll-backs and recursive dependency locking. In other words, if a program "works on my machine" it is guaranteed to work in the production environment.

As a byproduct of that migration process I found myself writing considerable amounts of Nix code. Nix, the language, is essentially a DSL for writing package specifications. It is a pure, functional, lazy language, and my inner PL geek is overwhelmed with glee to use it in production. The problem, however, is that tooling is almost non-existent (to the best of my knowledge). There's a somewhat stoic Emacs mode and a half-working third-party formatting tool. So the user experience is not that great. To top it off, the language is dynamically typed, which makes it easy to overlook the bugs that inevitably lurk in edge cases or stem from omnipresent typos. With the above in mind, and with some free time over the Christmas holidays, I set out to write a Nix parser, since it's a prerequisite for almost any of the tools I had in mind1.

Lexer and Parser Dichotomy

When one wants to write a parser for a language, one has to come up with (or read the specification of) a grammar for that language. The grammar consists of a set of terminal and non-terminal symbols and a set of production rules which allow one to generate any valid sentence in that language. One of the tricky parts here is deciding what exactly to choose as terminals. A lexer is used to convert the input character stream into a token stream or, in other words, into a stream of terminal symbols. The token stream is then fed into a parser which may output a parse tree or interpret it in place.

For example, in a language which allows only printable ASCII symbols, we can have a relatively small set of terminal symbols which consists of the printable symbols themselves. That makes lexing trivial - an identity function over the character stream - while forcing the parser to do all the "dirty" job. On the other hand, if we have a sophisticated enough lexer (and OCamllex is indeed a sophisticated one), we can pass in some state through a chain of mutually recursive functions, essentially making it akin to a full-featured parser. Striking a balance between the two is somewhat of an art.
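
For instance, the degenerate lexer from the first option fits in a couple of lines of OCamllex (a toy sketch; CHAR is a hypothetical single-character token):

(* a toy "identity" lexer: every printable character becomes its own token *)
rule identity = parse
  | eof    { EOF }
  | _ as c { CHAR c }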

Nix, the expression language

Since Nix is a pure, functional language, its programs are composed of expressions2. Those expressions in turn are constructed from values, bindings, and function applications. Values may be atomic, such as numbers, strings, and functions, or composite - lists and attribute sets. The latter are key-value stores and the workhorse of the language:

let # a binding
    x = [ 1 1 2 ]; # a list
    y = { a = 1; b = { c = 3; }; }; # an attribute set
in
  f y.a y.b.c # application of function f to 1 and 3

Functions are first-class values, represented by lambda expressions. They can be bound to a name, passed in as an argument and returned from a function

nix-repl> let add1 = x: x + 1; in add1 1
2
nix-repl> let add1 = x: x + 1; g = f: x: (f x) + 1; in g add1 1
3

So conceptually the language is very simple, but it has lots of conventions and syntactic sugar that make it a great, purpose-built tool (please refer to the manual for details). And those conventions, while making the expressions easy to write, do complicate parsing.

Before parsing, though, comes lexing. And with the latter the main complexity is hidden in Nix strings.

Nix strings

A string in Nix can be either a double-quoted string or the so-called indented string, enclosed in double single quotes. Both allow "antiquotation", also known as string interpolation:

nix-repl> let x = 1; in "x+1=${x+1}"
"x+1=2"

The expression enclosed in ${...} can be an arbitrary Nix expression, possibly including other strings, containing antiquotations as well:

nix-repl> let x = "hello"; y = " world"; in "greeting = ${x + "${y}"}"
"greeting = hello world"

A common approach is to have a lexeme representing the whole string. So we could write a grammar rule in a concise manner, e.g.

assignment -> IDENTIFIER "=" STRING

The question, though, is how we can represent the strings that contain antiquotations. They cannot be atomic values. Being arbitrarily nested, they naturally correspond to recursive data structures. So to lex them we need something that can be called recursively. Happily, OCamllex is powerful enough to do that and more.
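
For intuition, on the AST side such a string naturally becomes an alternating sequence of literal pieces and embedded sub-expressions, roughly like this (a hypothetical sketch; the type and constructor names are made up):

(* a hypothetical AST fragment for strings with antiquotations *)
type expr =
  | Str of str_piece list
  (* ... other expression forms elided ... *)
and str_piece =
  | Lit of string    (* a literal fragment *)
  | Aquote of expr   (* an antiquotation ${ ... } *)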

After some thought I came up with the following representation

"Hi. I'm  ${ something } or foo ${ zzz } goo zoxx"
^^^^^^^^^^              ^^^^^^^^        ^^^^^^^^^ ^
STR_START               STR_MID         STR_MID   STR_END
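
In terms of the token type, these lexemes could be declared roughly as follows (a sketch; the string payloads are an assumption, suggested by the token_of_str function shown later):

type token =
  | STR_START of string   (* the text up to the first ${ or the closing quote *)
  | STR_MID of string     (* the text between a } and the next ${ or the closing quote *)
  | STR_END               (* the closing quote *)
  | AQUOTE_OPEN           (* ${ *)
  | AQUOTE_CLOSE          (* the } closing an antiquotation *)
  | EOF
  (* ... the rest of the tokens are elided ... *)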

In other words, a Nix string is an STR_START token, followed, for each antiquotation, by whatever tokens the expression inside ${...} generates and an STR_MID token, and terminated by an STR_END token. To know when to switch between "string" and "normal" modes I introduced a stack of curly brace types

type braces =
  | AQUOTE
  | SET

type braces_stack = braces list

where AQUOTE corresponds to ${ } and SET to { }. When the lexer encounters the ${ character sequence, it pushes AQUOTE onto the stack and emits an AQUOTE_OPEN token; when it "sees" the { symbol, it pushes SET onto the stack and emits LBRACE. When the closing curly brace } is seen at a later point, the lexer pops an element from the stack: if it is AQUOTE, the emitted token is AQUOTE_CLOSE, otherwise the token is RBRACE.
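
In OCamllex terms this bookkeeping boils down to rules along these lines (a simplified sketch of the relevant "normal" mode rules; the handling of an unbalanced closing brace is a guess):

| "${"  { stack := AQUOTE :: !stack; Queue.add AQUOTE_OPEN q }
| '{'   { stack := SET :: !stack; Queue.add LBRACE q }
| '}'   { match !stack with
          | AQUOTE :: rest -> stack := rest; Queue.add AQUOTE_CLOSE q
          | SET :: rest    -> stack := rest; Queue.add RBRACE q
          | []             -> failwith "unbalanced closing brace" }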

Getting back to strings: when the lexer in "string" mode hits the ${ sequence, it pushes AQUOTE onto a freshly created, empty stack and switches to "normal" mode. The "normal" mode then produces tokens until it hits AQUOTE_CLOSE with the stack empty. After that it switches back to "string" mode.

For example, when lexing the expression

"hi ${ {a="there";}.a }" == "hi there"

The sequence of tokens, lexing mode, and the brace stack would be as follows

token         mode    brace stack
STR_START     string  []
AQUOTE_OPEN   normal  [AQUOTE]
LBRACE        normal  [SET; AQUOTE]
ID            normal  [SET; AQUOTE]
ASSIGN        normal  [SET; AQUOTE]
STR_START     string  []  <- new stack
STR_END       normal  [SET; AQUOTE]
SEMI          normal  [SET; AQUOTE]
RBRACE        normal  [AQUOTE]
SELECT        normal  [AQUOTE]
ID            normal  [AQUOTE]
AQUOTE_CLOSE  normal  []
STR_MID       string  []
STR_END       normal  []
EQ            normal  []
STR_START     string  []
STR_END       normal  []

The above algorithm is implemented in the collect_tokens function

let collect_tokens lexer q lexbuf =
  let stack = ref [] in  (* a new stack *)
  let queue = Queue.create () in (* a local token queue *)
  let rec go () =
    match (try Some (Queue.take queue) with Queue.Empty -> None) with
    | Some token ->
      (
        match token, !stack with
        | AQUOTE_CLOSE, [] -> (* exit condition, stop recursion *)
          Queue.add AQUOTE_CLOSE q
        | EOF, _ -> (* end of input reached, stop recursion *)
          Queue.add EOF q
        | _, _ ->
          (* otherwise add the token to the global queue and loop *)
          Queue.add token q; go ()
      )
    | None ->
      (* fill the local token queue if it's empty and loop *)
      lexer queue stack lexbuf; go ()
  in
  (* add AQUOTE_OPEN to the global queue put AQUOTE on to the stack *)
  Queue.add AQUOTE_OPEN q; stack := [AQUOTE];
  lexer queue stack lexbuf; go ()

The function is called from the "string" mode (the ellipses ... here denote non-relevant parts of the code)

(* the "normal" lexer *)
rule get_tokens queue stack = parse
   ...
(* "string" mode lexer *)
and string state buf q = parse
  | '"'   (* terminate when we hit '"' *)
    { Queue.add (token_of_str state buf) q; Queue.add STR_END q }
  | '\n'
    { ...}
  | ("\\n" | "\\r" | "\\t" | "\\\\" | "\\${") as s { ...}
  | "\\" (_ as c) { ... }
  | "${" (* collect all the tokens till we hit the matching '}' *)
    {
      Queue.add (token_of_str state buf) q;
      collect_tokens get_tokens q lexbuf;
      string `Mid (Buffer.create 64) q lexbuf
    }
  | _ as c (* otherwise just add the character to the buffer *)
    { ... }

The token_of_str function returns either STR_START when the state is `Start or STR_MID if the state is `Mid, i.e. ${ has already been seen.
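
Presumably token_of_str is just a two-way switch over that state (a sketch, assuming the STR_START and STR_MID constructors carry the accumulated buffer contents):

let token_of_str state buf =
  match state with
  | `Start -> STR_START (Buffer.contents buf)
  | `Mid   -> STR_MID (Buffer.contents buf)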

Flattening the token stream

While the mode-switching trick described in the previous section solves the problem of lexing Nix strings, it introduces another problem. The generated parser expects a function that, given a lexbuf, returns a token

lexer: Lexing.lexbuf -> Parser.token

However, the "string" mode returns not one but multiple tokens at a time, e.g.

"pointless ${ "${"but"}" } legit ${"string"}"

produces

STR_START
  AQUOTE_OPEN
    STR_START
      AQUOTE_OPEN
        STR_START STR_END
      AQUOTE_CLOSE
    STR_MID
    STR_END
  AQUOTE_CLOSE
STR_MID
  AQUOTE_OPEN
    STR_START STR_END
  AQUOTE_CLOSE
STR_MID
STR_END

In other words, a single lexer call, given the string above, would produce 17 tokens.

Initially, I was going to store the tokens in an immutable tree and walk the tree during the parsing process. The downside of that approach is that the tree has to be completely built before use. This could've been mitigated by using a lazy data structure. However, the signature of the lexer function excludes the possibility of threading state through the calls3. So after some consideration I decided to use the Queue module from the standard library. It provides a mutable data structure with in-place modifications. The queue is passed into both "string" and "normal" modes, and all the emitted tokens are simply added to it, preserving the order in which they occur in the source string.

Extracting individual tokens is done through a wrapper function

let rec next_token
    (q: token Queue.t)
    (s: braces list ref)
    (lexbuf: Lexing.lexbuf)
  : token =
  match (try Some (Queue.take q) with Queue.Empty -> None) with
  (* return the token if the queue is not empty *)
  | Some token ->
    token
  (* otherwise replenish the queue and try again *)
  | None ->
    get_tokens q s lexbuf;
    next_token q s lexbuf
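
The standalone lexer program used for the benchmarks below then simply drives this wrapper in a loop until EOF. A rough sketch (string_of_token stands for the token -> string printer mentioned further down):

let () =
  let lexbuf = Lexing.from_channel stdin in
  let q = Queue.create () in
  let s = ref [] in
  let rec loop () =
    match next_token q s lexbuf with
    | EOF -> ()
    | tok -> print_endline (string_of_token tok); loop ()
  in
  loop ()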

Tuning performance

After writing the rules for the other tokens (which were straightforward enough to omit from this write-up) and taking care of the problems mentioned in the sections above, I decided to experiment a bit with performance. To have a plausible test data set I ran it on the complete nixpkgs source tree

$ time find nixpkgs -name '*.nix' | xargs cat | lexer > /dev/null

real    0m10.795s
user    0m7.580s
sys     0m3.239s

The command above finds all the *.nix files in the source tree, concatenates them, and redirects the result to the lexer's stdin. The lexer program, in turn, tokenises the input and prints all the tokens to stdout. The result is redirected to /dev/null so that it doesn't affect the timing. The number of *.nix files at the moment of testing was 14,674 and their total size added up to roughly 50MB. On the first run I got around 11s. Hmm… that was a relatively poor result, especially when compared to the speed of the parser that comes with Nix.

$ time find nixpkgs -name '*.nix' | xargs nix-instantiate --parse > /dev/null

real    0m4.526s
user    0m3.628s
sys     0m1.061s

So lexing alone was more than twice as slow as the Flex/Bison-based parser. Subtracting the time it took to prepare the input

$ time find ../nixpkgs -name '*.nix' | xargs cat > /dev/null

real    0m0.667s
user    0m0.173s
sys     0m0.680s

I got even worse numbers.

Assuming that the running time of a lexer should be roughly proportional to the size of the generated FSM, I tried to reduce the number of states by utilising the trick described in the OCamllex manual. Namely, instead of having a distinct rule for every keyword, use a pre-built hash table that returns either a keyword token or an identifier token

(* lookup table for keywords *)
let keyword_table = Hashtbl.create 10
let _ =
  List.iter (fun (kwd, tok) -> Hashtbl.add keyword_table kwd tok)
    [ "with", WITH;
      "rec", REC;
      "let", LET;
      "in", IN;
      "inherit", INHERIT;
      "null", NULL;
      "if" , IF;
      "then", THEN;
      "else", ELSE;
      "assert", ASSERT;
      "or", ORDEF ]

(* a rule for keywords or identifiers *)
| ((alpha | '_')+ (alpha_digit | ['_' '\'' '-'])*) as id
    { Queue.add (try Hashtbl.find keyword_table id with Not_found -> ID id) q}

Another "optimisation" was to simplify the rule for URI strings. As a syntactic sugar Nix allows writing URI and path strings omitting the quotes, i.e.

nix-repl> let uri1 = "http://goo.foo"; uri2 = http://goo.foo; in uri1 == uri2
true

So I replaced the relatively complex rule for recognising URIs in general with a simplified rule which recognises only certain schemes.

let uri_chr = ['%' '/' '?' ':' '@' '&' '=' '+' '$' ',' '-' '_' '.' '!' '~' '*' '\'']
let scheme = "http" 's'? | "ftp" | "ssh" | "git" | "mirror" | "svn"
let uri = scheme ':' (alpha_digit | uri_chr)+
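
The matching rule then simply wraps the whole lexeme into a token (the URI token name here is an assumption):

| uri as u { Queue.add (URI u) q }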

The final, and probably the most useless "optimisation" was to create a constant-time lookup table for one-character tokens

(* lookup table for one-character tokens *)
let char_table = Array.make 93 EOF
let _ =
  List.iter (fun (k, v) -> Array.set char_table ((int_of_char k) - 1) v)
    [
      '.', SELECT; '?', QMARK; '!', NOT; '=', ASSIGN; '<', LT;
      '>', GT; '[', LBRACK; ']', RBRACK; '+', PLUS; '-', MINUS;
      '*', TIMES; '/', SLASH; '(', LPAREN; ')', RPAREN;
      ':', COLON; ';', SEMICOLON; ',', COMMA; '@', AS
    ]

let char_tokens =
   ['.' '?' '!' '=' '<' '>' '[' ']' '+' '-' '*' '/' '(' ')' ':' ';' ',' '@']

(* the rule for one-character tokens *)
| char_tokens as c
    { Queue.add (Array.get char_table ((int_of_char c) - 1)) q }

With all the changes in place I reduced the run time to 9.7s. It wasn't a huge improvement, but still a decent one. Only after I had exhausted all the other ideas did I decide to run a profiler ;). The only excuse I have for not doing so earlier is that using gprof on macOS isn't really possible and I don't know how to use DTrace, especially with ocamlopt-generated code. If not for the excellent Landmarks library I'd probably still be trying to micro-optimise the code wherever possible.

Reflections aside, the first thing I tried to measure was how long it took to output the result. I mean, I had the function token -> string and I was calling it for every one of the ~7 million tokens produced.

It was a jaw-dropping moment when Landmarks told me that the program spent almost 80% of its run time in printing!

I had used the print_endline function from OCaml's Pervasives module. I should've read the documentation more carefully:

Print a string, followed by a newline character, on standard output and flush standard output.

On the positive side, now I know that it takes about 4s to make 7 million write system calls on my MacBook Pro :D

Changing print_endline into Printf.printf reduced the run time to ~2s, and discarding the output altogether (i.e. not calling print at all) got me to 1.7s. Taking into account that find-ing and concatenating the files takes ~0.6s, the lexer seems to be pretty fast, able to process text at a rate of ~50MB/s!
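
The difference is easy to reproduce in isolation (a toy sketch): print_endline forces a flush, and hence a write system call, on every line, while Printf.printf leaves the output buffered in the channel.

let () =
  for i = 1 to 7_000_000 do
    (* slow: flushes stdout on every iteration *)
    (* print_endline (string_of_int i) *)

    (* fast: output stays buffered and is flushed in large chunks *)
    Printf.printf "%d\n" i
  done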

Quite happy with the result, I set out to write the parser. In the next post I'll get into the details of that adventure.

The source code of the lexer is available on GitHub.

Footnotes:

1

To be completely honest, though, I was just looking for the right kind of nail for the hammer I was eager to try.

2

Not entirely true. There are assertions which may terminate evaluation at any point making them semantically close to statements.

3

In addition to the monolithic API, Menhir can expose an incremental API which makes it possible to keep the parser/lexer pure. However, the incremental API requires the slower, table-based backend and is more complicated. So I decided to stick with the monolithic API for the time being.