A Nix parser in OCaml. Part 1: Lexer
Table of Contents
I think I may consider myself lucky for I work in a team which is while being a service assurance (hence naturally risk averse and conservative) team, still doesn't hesitate to use some "fancy" technologies when they make sense.
So we are slowly migrating from Puppet and Ubuntu hosts to NixOS. The technology makes it easy to build reproducible environments down to the byte level and, at the same time, have instantaneous roll-backs and recursive dependency locking. In other words, if a program "works on my machine" it is guaranteed to work in the production environment.
As a byproduct of that migration process I found myself writing considerable amounts of Nix code. Nix, the language, is essentially a DSL for writing package specifications. It is a pure, functional, lazy language and my inner PL geek is overwhelmed with glee to use it in production. The problem, however, is that tooling is almost non-existent (to the best of my knowledge). There's a somewhat stoic Emacs mode and a half-working third-party formatting tool. So the user experience is not that great. To top it off, the language is dynamically typed, which makes it easy to overlook some inevitable bugs lurking in the edge cases or due to omnipresent typos. With the above in mind, and getting some free time over the Christmas holidays, I set myself to write a Nix parser, since it's a prerequisite for almost any of the tools I had in mind1.
Lexer and Parser Dichotomy
When one wants to write a parser for a language, one has to come up with (or read the specification of) a grammar for that language. The grammar consists of a set of terminal and non-terminal symbols and a set of production rules which allow to generate any valid sentence in that language. The tricky part here (one of) is what exactly to choose as terminals. A lexer is used to convert the input character stream into the token stream or, in other words, into a stream of terminal symbols. The token stream is then fed into a parser which may output a parse tree or interpret it in-place.
For example, in a language which allows only printable ASCII symbols, we can have a relatively small set of terminal symbols which consists of the printable symbols themselves. That makes lexing trivial - an identity function over the character stream - while forcing the parser to do all the "dirty" job. On the other hand, if we have a sophisticated enough lexer (and OCamllex is indeed a sophisticated one), we can pass in some state through a chain of mutually recursive functions, essentially making it akin to a full-featured parser. Striking a balance between the two is somewhat of an art.
Nix, the expression language
Being a pure, functional language, Nix programs are composed from expressions2. Those expressions in turn are constructed from values, bindings, and function applications. Values may be atomic, such as numbers, strings, and functions, or composite - lists and attribute sets. The latter are key-value store and the workhorse of the language
let # a binding x = [ 1 1 2 ]; # a list y = { a = 1; b = { c = 3; }; }; # an attribute set in f y.a y.b.c # application of function f to 1 and 3
Functions are first-class values, represented by lambda expressions. They can be bound to a name, passed in as an argument and returned from a function
nix-repl> let add1 = x: x + 1; in add1 1 2 nix-repl> let g = f: x: (f x) + 1; in g add1 1 3
So conceptually the language is very simple, but has lots of conventions and syntactic sugar to make it a great tool, built for the purpose (please refer to the manual for details). And those conventions, while making it easy to write the expressions, do complicate parsing.
Before parsing, though, comes lexing. And with the latter the main complexity is hidden in Nix strings.
Nix strings
A string in Nix can be either a double-quoted string or the so-called indented string, enclosed in double single quotes. Both allow "antiquotation", also known as string interpolation:
nix-repl> let x = 1; in "x+1=${x+1}" "x+1=2"
The expression enclosed in ${...} can be an arbitrary Nix expression,
possibly including other strings, containing antiquotations as well:
nix-repl> let x = "hello"; y = " world"; in "greeting = ${x + "${y}"}" "greeting = hello world"
A common approach is to have a lexeme representing the whole string. So we could write a grammar rule in a concise manner, e.g.
assignment -> IDENTIFIER "=" STRING
The question, though, is how can we represent the strings that contain antiquotations? They can not be atomic values. Being arbitrarily nested, they are naturally correspond to recursive data structures. So to lex them we need something that can be called recursively. Happily, OCamllex is powerful enough to do that and more.
After some thought I came up with the following representation
"Hi. I'm  ${ something } or foo ${ zzz } goo zoxx"
^^^^^^^^^^              ^^^^^^^^        ^^^^^^^^^ ^
STR_START               STR_MID         STR_MID   STR_END
In other words, a Nix string is STR_START token followed by any tokens
that the expression inside ${...} generates, followed by STR_MID and
terminated by STR_END token. To know when to switch between "string"
and "normal" modes I introduced a stack of types of curly braces
type braces = | AQUOTE | SET type braces_stack = braces list
where AQUOTE corresponds to the ${ } and SET to { }. When lexer
encounters ${ character sequence, it places AQUOTE onto the stack and emits
AQUOTE_OPEN token, and when it "sees" { symbol it places SET onto the
stack and emits LBRACE token. So when at a later point the closing curly
brace } is seen, the lexer pops an element from the stack and if it matches
with AQUOTE then the emitted token is AQUOTE_CLOSE. Otherwise the token is
RBRACE.
Getting back to strings, when lexer in "string" mode hits ${ sequence it
places AQUOTE on to the empty, just created stack and switches to "normal"
mode. The "normal" mode then produces a list of tokens till it hits
AQUOTE_CLOSE and the stack is empty. After that it switches back to "string"
mode.
For example, when lexing the expression
"hi ${ {a="there";}.a }" == "hi there"
The sequence of tokens, lexing mode, and the brace stack would be as follows
| token | mode | brace stack | 
|---|---|---|
| IF | normal | [] | 
| STR_START | string | [] | 
| AQUOTE_OPEN | normal | [AQUOTE] | 
| LBRACE | normal | [SET; AQUOTE] | 
| ID | normal | [SET; AQUOTE] | 
| STR_START | string | []<- new stack | 
| STR_END | normal | [SET; AQUOTE] | 
| SEMI | normal | [SET; AQUOTE] | 
| RBRACE | normal | [AQUOTE] | 
| SELECT | normal | [AQUOTE] | 
| AQUOTE_CLOSE | normal | [] | 
| STR_MID | string | [] | 
| STR_END | normal | [] | 
| EQ | normal | [] | 
| STR_START | string | [] | 
| STR_END | normal | [] | 
The above algorithm is implemented in the collect_tokens function
let collect_tokens lexer q lexbuf = let stack = ref [] in (* a new stack *) let queue = Queue.create () in (* a local token queue *) let rec go () = match (try Some (Queue.take queue) with Queue.Empty -> None) with | Some token -> ( match token, !stack with | AQUOTE_CLOSE, [] -> (* exit condition, stop recursion *) Queue.add AQUOTE_CLOSE q | EOF, _ -> (* end of input reached, stop recursion *) Queue.add EOF q; | _, _ -> (* otherwise add the token to the global queue and loop *) Queue.add token q; go () ) | None -> (* fill the local token queue if it's empty and loop *) lexer queue stack lexbuf; go () in (* add AQUOTE_OPEN to the global queue put AQUOTE on to the stack *) Queue.add AQUOTE_OPEN q; stack := [AQUOTE]; lexer queue stack lexbuf; go ()
The function is called from the "string" mode ( the ellipsis ... here denote
non-relevant parts of code)
(* the "normal" lexer *) rule get_tokens queue stack = parse ... (* "string" mode lexer *) and string state buf q = parse | '"' (* terminate when we hit '"' *) { Queue.add (token_of_str state buf) q; Queue.add STR_END q } | '\n' { ...} | ("\\n" | "\\r" | "\\t" | "\\\\" | "\\${") as s { ...} | "\\" (_ as c) { ... } | "${" (* collect all the tokens till we hit the matching '}' *) { Queue.add (token_of_str state buf) q; collect_tokens get_tokens q lexbuf; string `Mid (Buffer.create 64) q lexbuf } | _ as c (* otherwise just add the character to the buffer *) { ... }
The token_of_str function returns either STR_START when the state is
`Start or STR_MID if the state is `Mid, i.e. =${= has already been seen.
Flattening the token stream
While the mode-switching trick described in the previous section solves the
problem of lexing Nix strings, it introduces another problem. The generated
parser expects a function that given the lexbuf returns a token
lexer: Lexing.lexbuf -> Parser.token
However, the "string" mode returns not one but multiple tokens at a time, e.g.
"pointless ${ "${"but"}" } legit ${"string"}"
produces
STR_START
  AQUOTE_OPEN
    STR_START
      AQUOTE_OPEN
        STR_START STR_END
          AQUOTE_CLOSE
        STR_END
      AQUOTE_CLOSE
STR_MID
  AQUOTE_OPEN
    STR_START STR_END
  AQUOTE_CLOSE
STR_MID
STR_END
In other words, a single lexer call, given the string above, would produce 16
tokens.
Initially, I was going to store the tokens in an immutable tree and walk the
tree during the parsing process. The downside of the approach is that the tree
should be completely built before use. This could've been mitigated by using a
lazy data structure. However, the signature of the lexer function excludes the
possibility of threading the state through the calls 3. So after some
consideration I decided to use Queue module from the standard library. It
provides a mutable data structure with in-place modifications. The object is
passed into both "string" and "normal" modes, and all the emitted tokens are
just added to the queue, preserving the order they occur in the source string.
Extracting individual tokens is done through a wrapper function
let rec next_token (q: token Queue.t) (s: braces list ref) (lexbuf: Lexing.lexbuf) : token = match (try Some (Queue.take q) with Queue.Empty -> None) with (* return the token if the queue is not empty *) | Some token -> token (* otherwise replenish the queue and try again *) | None -> get_tokens q s lexbuf; next_token q s lexbuf
Tuning performance
After writing the rules for the other tokens (which was straightforward enough to omit them in this writing) and taking care of the problems mentioned in the above sections, I decided to experiment a bit with the performance. To have a plausible test data set I ran it on the complete nixpkgs source tree
$ time find nixkpkgs -name '*.nix' | xargs cat | lexer > /dev/null
real    0m10.795s
user    0m7.580s
sys     0m3.239s
The command above finds all the *.nix files in the source tree, concatenates
them and redirects the result onto the lexer's stdin. The lexer program, in
turn, tokenises the input and prints out all the tokens onto stdout. The
result is redirected to /dev/null so that it won't affect the time. The
number of *.nix files at the moment of testing was 14,674 and their total
size added up to roughly 50MB. On the first run I got around 11s. Hmm… that
was relatively poor result, especially when compared to the speed of the parser
that comes with Nix.
$ time find nixkpkgs -name '*.nix' | xargs nix-instantiate --parse > /dev/null
real    0m4.526s
user    0m3.628s
sys     0m1.061s
So lexing alone was more than twice as slow as the Flex/Bison-based parser. Subtracting the time it took to prepare the input
$ time find ../nixpkgs -name '*.nix' | xargs cat > /dev/null
real    0m0.667s
user    0m0.173s
sys     0m0.680s
I got even worse numbers.
Assuming that the speed of a lexer should be proportional to the size of the generated FSM I tried to reduce the number of states by utilising the trick described in the OCamllex manual. Namely, instead of having distinct rules for every keyword, use a pre-built hash table that returns either a keyword token or an identifier token
(* lookup table for keywords *) let keyword_table = Hashtbl.create 10 let _ = List.iter (fun (kwd, tok) -> Hashtbl.add keyword_table kwd tok) [ "with", WITH; "rec", REC; "let", LET; "in", IN; "inherit", INHERIT; "null", NULL; "if" , IF; "then", THEN; "else", ELSE; "assert", ASSERT; "or", ORDEF ] (* a rule for keywords or identifies *) | ((alpha | '_')+ (alpha_digit | ['_' '\'' '-'])*) as id { Queue.add (try Hashtbl.find keyword_table id with Not_found -> ID id) q}
Another "optimisation" was to simplify the rule for URI strings. As a syntactic sugar Nix allows writing URI and path strings omitting the quotes, i.e.
nix-repl> let uri1 = "http://goo.foo"; uri2 = http://goo.foo; in uri1 == uri2 true
So the relatively complex rule for recognising URIs in general I replaced by the simplified rule which recognises only certain protocols.
let uri_chr = ['%' '/' '?' ':' '@' '&' '=' '+' '$' ',' '-' '_' '.' '!' '~' '*' '\''] let scheme = "http" 's'? | "ftp" | "ssh" | "git" | "mirror" | "svn" let uri = scheme ':' (alpha_digit | uri_chr)+
The final, and probably the most useless "optimisation" was to create a constant-time lookup table for one-character tokens
(* lookup table for one-character tokens *) let char_table = Array.make 93 EOF let _ = List.iter (fun (k, v) -> Array.set char_table ((int_of_char k) - 1) v) [ '.', SELECT; '?', QMARK; '!', NOT; '=', ASSIGN; '<', LT; '>', GT; '[', LBRACK; ']', RBRACK; '+', PLUS; '-', MINUS; '*', TIMES; '/', SLASH; '(', LPAREN; ')', RPAREN; ':', COLON; ';', SEMICOLON; ',', COMMA; '@', AS ] let char_tokens = ['.' '?' '!' '=' '<' '>' '[' ']' '+' '-' '*' '/' '(' ')' ':' ';' ',' '@'] (* the rule for one-character tokens *) | char_tokens as c { Queue.add (Array.get char_table ((int_of_char c) - 1)) q }
With all the changes in place I reduced the run time to 9.7s. It wasn't a huge
improvement but still a decent one. Not until I exhausted all the other ideas I
decided to run a profiler ;) . The only excuse I have for not doing so earlier
is that using gprof on MacOS isn't quite possible and I don't really know how
to use DTrace, especially with ocamlopt generated code. If not for the
excellent Landmarks library I'd probably still have been trying to
micro-optimise the code wherever possible.
Reflections aside, the first thing I tried to measure was how long it
took to output the result. I mean, I had the function token -> string
and I was calling it for every one of the ~7 million tokens produced.
It was a jaw-dropping moment when Landmarks told me that the program spent almost 80% of its run time in printing!
I used print_endline function from OCaml Pervasives module. I should've
read the documentation carefully
Print a string, followed by a newline character, on standard output and flush standard output.
On the positive side, now I know that it takes about 4s to make 7 million
write system calls on my MacBook Pro :D
Changing print_endline into Printf.printf reduced the run time to ~2s and
discarding the output altogether (i.e. not calling print at all) got me to
1.7s. Taking into account that find-ing and concatenating the files takes
~0.6s, the lexer seems to be pretty fast, being able to process text at ~50MB/s
rate!
Quite happy with the achieved result I set on to writing the parser. In the next post I'll get into details of the said adventure.
The source code of the lexer is available on GitHub.
Footnotes:
To be completely honest, though, I was just looking for the right kind of a nail for the hummer I was eager to try.
Not entirely true. There are assertions which may terminate evaluation at any point making them semantically close to statements.
In addition to the monolithic API, Menhir may expose incremental API which makes it possible to keep the parser/lexer pure. However, the incremental API requires the slower, table-based backed and is more complicated. So I decided to stick to the monolithic API for a time being.