pl-rants

Rants about programming languages

Nov 27, 2018

Using Leex and Yecc in Elixir


Sometimes regular expressions are not enough. Sometimes there needs to be a higher-level construct, a driver that directs the process of parsing, building a complex data structure along the way.

A few weeks ago I found myself in exactly that kind of situation. We are slowly migrating our huge, legacy code base from Perl to Elixir. Most of what we, as a team, do is related to network monitoring, and the (meta)data for it comes in many different shapes and forms, including a relational DB, JSON files, and what not. So I used to parse OIDs taken from JSON files using the regex \d+(?:\.\d+)+ and convert them to a list of integers

str
|> String.split(".")
|> Enum.map(&String.to_integer/1)

The expression above takes a string like "1.3.2.5.6" and converts it into the list [1,3,2,5,6].

It works fine for metrics that are represented by a single OID. But we also have something called "calc metrics". And those are strings that are eval-ed by Perl. Looking at the data in our dev environment I found that those expressions, with some variations, look like the following

([1.3.1.5]+[1.3.1.6])*100/[1.3.1.7]

In other words, those were simple arithmetic expressions where OIDs enclosed in square brackets were substituted by their values.

And because arithmetic expressions correspond to a Chomsky Type 2 (context-free) grammar, I couldn't use regular expressions anymore. That's a job for a more powerful tool.

Leex and Yecc [1]

The venerable lex and yacc tools have been around for almost 40 years now, but that does not lessen their value. If you find yourself in need of quickly writing a parser for a programming language of sorts (as I did), they are your best friends. And because they are so helpful, they have been ported, in one way or another, to almost every programming language out there. Erlang has them in its standard distribution under the names leex and yecc.

Leex

leex obviously stands for "Lexical analyzer generator for Erlang". Its primary job is to take a stream of characters and convert it into a stream of tokens. A token is a terminal symbol in the alphabet that forms a language's grammar. An example of a token could be the * or + character, or a number, e.g. 42. Or, in this case, a string representation of an OID enclosed in square brackets - [1.3.2.4.6].

What exactly constitutes a token, or, in other words, the rule that maps some character sequence into a token, is defined by a regular expression. In our case we can say that the * token corresponds to the regular expression \*, and the number token is defined as [0-9]+, while the oid token is defined by \[[0-9]+(\.[0-9]+)+\].

Thus the string ([1.3.1.5]+[1.3.1.6])*100/[1.3.1.7] corresponds to the sequence of tokens '(' oid '+' oid ')' '*' number '/' oid.

leex expects those rules to be written in a file as

%% number
[0-9]+      : {token,
                { number,
                  TokenLine,
                  list_to_integer(TokenChars)
                 }
               }.
%% left paren
\(          : {token, {'(', TokenLine}}.
%% ... other tokens
%% skip white spaces
[\s\n\r\t]+ : skip_token.

where a regular expression is on the left side of : and the term on the right side of : defines a token. The term, however, is not arbitrary. A two-tuple {token, _} emits a token, while the atom skip_token, well… skips the token, simply discarding the consumed characters. TokenLine and TokenChars are pre-defined variables which contain the line number in the source and the list of consumed characters respectively. A rule must end with a full stop. Apart from that, the rhs is a valid Erlang expression, so arbitrary Erlang functions (e.g. list_to_integer above) are available. See the leex man page for more details.

The rules go into a file with the .xrl extension and, for mix to be able to compile it, the file should be placed under the src/ folder. The name of the file determines the name of the module that implements the lexer. That is, src/foo.xrl, after running mix compile, will produce a module :foo that defines a few functions. In particular, calling

:foo.string '1+2*(5-6)'

results in a tuple that contains the token list and the end line number:

{ :ok,
  [ {:number, 1, 1},
    {:"+", 1},
    {:number, 1, 2},
    {:"*", 1},
    {:"(", 1},
    {:number, 1, 5},
    {:"-", 1},
    {:number, 1, 6},
    {:")", 1} ],
  1 }

Yecc

The unpronounceable yecc stands for "An LALR-1 parser generator for Erlang, similar to yacc". The job of the tool is to produce a parser that takes a stream of tokens as its input and generates some data structure (an AST, perhaps) as the output.

To produce a parser module, yecc needs an alphabet [2] and grammar rules. It also requires a root symbol which indicates the entry point of the grammar.

The alphabet must define non-terminal and terminal symbols. A non-terminal symbol (e.g. expr) stands for a sequence of terminals and/or other non-terminals, while terminal symbols stand only for themselves (e.g. + or number) and directly correspond to the tokens produced by the lexer.

Grammar is defined as a sequence of production rules, for example

op -> '*' : '$1'.
op -> '/' : '$1'.
op -> '+' : '$1'.
op -> '-' : '$1'.

defines a production rule for a non-terminal symbol op that may stand for terminals *, /, +, -. In the EBNF notation the same rule may be written as

<op> ::= '*' | '/' | '+' | '-'

Again, as in leex rules, the part before : defines the rule, while the part after defines the output structure. The expression $1 means "literally substitute this with the value of the first matched symbol". In the example above the value of the terminals will be the corresponding token, i.e. {'*', 1} or {'+', 1}.

A more complex example would be

expr -> expr op term : {'$2', '$1', '$3'}.
expr -> term : '$1'.

op -> '+' : '+'.
op -> '-' : '-'.

term -> number : '$1'.

which defines a grammar for expressions like 1+1-2 or 1+2+3-5-6. The output of the parser in this case is a tree of nested tuples where nodes are + or - operators and numbers are leaves.

For example, parsing the 1+1-2 expression with the rules above yields

{
   '-',
   {
     '+',
     {:number, 1, 1},
     {:number, 1, 1}
   },
   {:number, 1, 2}
}

First iteration: arithmetic expressions

The tools make rolling out your own parser a remarkably easy job. Reading the documentation of the modules and writing the lexing and parsing rules for the language of arithmetic expressions on OIDs took me no more than two hours. I spent another hour writing a simple interpreter and testing it in the repl. The lexer was defined as

Definitions.
Rules.
%% a number
[0-9]+ : {token, {number, TokenLine, list_to_integer(TokenChars)}}.
%% an OID
\[[0-9]+(\.[0-9]+)*\] : {token, {oid, TokenLine, oid(TokenChars)}}.
%% open/close parens
\( : {token, {'(', TokenLine}}.
\) : {token, {')', TokenLine}}.
%% arithmetic operators
\+ : {token, {'+', TokenLine}}.
\- : {token, {'-', TokenLine}}.
\* : {token, {'*', TokenLine}}.
\/ : {token, {'/', TokenLine}}.
%% white space
[\s\n\r\t]+           : skip_token.

Erlang code.
oid(Oid) ->
    S = tl(lists:droplast(Oid)),
    L = string:split(S, ".", all),
    lists:map(fun list_to_integer/1, L).

The above is the complete code of the lexer.xrl file. The Erlang code. section defines a helper function oid/1 that converts a list of characters representing an OID into a list of integers.
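For a single OID, a quick check in iex should produce something along these lines (the exact output is my expectation based on the rules above, not a transcript):

iex> :lexer.string '[1.3.1.5]'
{:ok, [{:oid, 1, [1, 3, 1, 5]}], 1}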

The grammar for arithmetic expressions is well-known and also simple enough to fit into a few lines of code

Nonterminals expr term factor.
Terminals oid number '+' '-' '*' '/' '(' ')'.
Rootsymbol expr.

expr -> expr '+' term : {plus, '$1', '$3'}.
expr -> expr '-' term : {minus, '$1', '$3'}.
expr -> term : '$1'.

term -> factor '*' term : {mult, '$1', '$3'}.
term -> factor '/' term : {divi, '$1', '$3'}.
term -> factor : '$1'.

factor -> '(' expr ')' : '$2'.
factor -> oid : '$1'.
factor -> number : '$1'.

That code went verbatim into the parser.yrl file.

Having defined the files above, running mix compile created two modules, :lexer and :parser, that I could play with:

iex> {:ok, tokens, _} = :lexer.string '1+2*3'
  {:ok,
    [{:number, 1, 1},
     {:"+", 1},
     {:number, 1, 2},
     {:"*", 1},
     {:number, 1, 3}],
   1}
iex> {:ok, ast} = :parser.parse tokens
  {:ok,
    { :plus,
      {:number, 1, 1},
      {:mult, {:number, 1, 2}, {:number, 1, 3}}
    }
   }

After that defining a tree-walk interpreter was a straightforward task:

# two mutually recursive data types, defining the AST
@type expr_ast :: {:mult, aterm, aterm} | {:divi, aterm, aterm}
                  | {:plus, aterm, aterm} | {:minus, aterm, aterm}
                  | aterm

@type aterm :: {:number, any(), integer()}
               | {:oid, any(), SNMP.oid()} | expr_ast

# and the spec for evaluator function
@spec eval(expr_ast, store :: map) :: integer()
# where `store` is a map of oids to their values

# evaluating a number returns its value
def eval({:number, _, n}, _store), do: n
# evaluating an oid returns its value from the store
# or throws an exception
def eval({:oid, _, oid}, store) do
    case Map.fetch(store, oid) do
      {:ok, val} -> val
      :error -> throw({:undefined, oid})
    end
end
# evaluation a binary operation is an application
# of a corresponding operator to the lhs and rhs
# evaluated recursively
def eval({:mult, lhs, rhs}, s), do: eval(lhs, s) * eval(rhs, s)
def eval({:divi, lhs, rhs}, s), do: div(eval(lhs, s), eval(rhs, s))
def eval({:plus, lhs, rhs}, s), do: eval(lhs, s) + eval(rhs, s)
def eval({:minus, lhs, rhs}, s), do: eval(lhs, s) - eval(rhs, s)

A colleague of mine, upon seeing the code, said "writing an interpreter in a language that has pattern-matching feels like cheating" :)
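Putting the pieces together, an end-to-end session could look roughly like this (the Calc module name and the store values are made up for illustration; the eval/2 clauses above are assumed to live in Calc):

iex> {:ok, tokens, _} = :lexer.string '([1.3.1.5]+[1.3.1.6])*100/[1.3.1.7]'
iex> {:ok, ast} = :parser.parse tokens
iex> store = %{[1, 3, 1, 5] => 2, [1, 3, 1, 6] => 3, [1, 3, 1, 7] => 10}
iex> Calc.eval(ast, store)
50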

So a few dozen lines of code contain a full-featured parser and interpreter. Ain't those tools awesome?

Iteration 2: lt and gt

It was all good and fine up to the moment I tried to run the application on one of the bigger sites. I noticed some parse errors in the log files

error: can not parse calc metric "[1.3.5.6] - (2**32)*([1.3.5.7] < 0)"

Wtf? [3] I can understand that 2**32 means 2 raised to the power of 32, but what on Earth does x < 0 evaluate to? Apparently, < and > expressions in Perl (like True and False in Python) evaluate to 1 or 0, so the "boolean" expression above can legitimately be part of an arithmetic expression.

Well, the modifications to the lexer and parser were minimal.

Lexer:

%% power (should be defined before `*`)
\*\* : {token, {'**', TokenLine}}.
%% less-than and greater-than
\< : {token, {'<', TokenLine}}.
\> : {token, {'>', TokenLine}}.

Parser:

Nonterminals expr term factor val comp_expr.
Terminals oid number '+' '-' '*' '/' '(' ')' '**' '<' '>'.
Rootsymbol comp_expr.
comp_expr -> expr '<' comp_expr : {lt, '$1', '$3'}.
comp_expr -> expr '>' comp_expr : {gt, '$1', '$3'}.
comp_expr -> expr : '$1'.
%% factor now has a sub-component to maintain the precedence
factor -> val '**' factor : {pow, '$1', '$3'}.
factor -> val : '$1'.
val -> '(' comp_expr ')' : '$2'.
val -> oid : '$1'.
val -> number : '$1'.

Modifications to the interpreter were also relatively straightforward - three extra function clauses and a helper function.

def eval({:pow, lhs, rhs}, store) do
    round(:math.pow(eval(lhs, store), eval(rhs, store)))
end
def eval({:lt, lhs, rhs}, store) do
    bool2int(eval(lhs, store) < eval(rhs, store))
end
def eval({:gt, lhs, rhs}, store) do
    bool2int(eval(lhs, store) > eval(rhs, store))
end

defp bool2int(true), do: 1
defp bool2int(false), do: 0
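With those in place, the expression that triggered this iteration should evaluate to a number again (same hypothetical Calc module, store values made up):

iex> {:ok, tokens, _} = :lexer.string '[1.3.5.6] - (2**32)*([1.3.5.7] < 0)'
iex> {:ok, ast} = :parser.parse tokens
iex> Calc.eval(ast, %{[1, 3, 5, 6] => 5, [1, 3, 5, 7] => -1})
-4294967291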

Iteration 3: function calls

And again, it was all good and fine until I test-ran the application on one of the largest production sites. I noticed parse errors mentioning something unfathomable

error: can not parse: [1.3.6.5]+unpack("I",pack("i",[1.3.6.6]),
  illegal characters: ['u']

%$#k! [4] The problem was that a complete implementation of Perl's pack and unpack functions is a job worthy of, if not a Turing Award, then at least a CS degree and an honorable mention in "Workaholics Daily" magazine!

Luckily, "I" and "i" were the only templates I could spot so implementing the functions for that particular case was not such a big deal.

I rolled up my sleeves and added the following lines to the lexer file

Definitions.
%% a symbol which can be a part of a *sensible* function name in Perl
W = [a-zA-Z0-9_]

Rules.

%% identifiers
{W}+ : {token, {ident, TokenLine, list_to_existing_atom(TokenChars)}}.

%% double-quoted strings (do not allow escaped double quotes!)
\"[^\"]*\" : {token, {str, TokenLine, strip_quotes(TokenChars)}}.
\, : {token, {',', TokenLine}}.

Erlang code.
strip_quotes(Str) ->
    S = tl(lists:droplast(Str)),
    list_to_binary(S).

Additions to the parser were more involved

%% extended alphabet to include function calls and strings
Nonterminals expr term factor val comp_expr funcall funargs.
Terminals oid number ident str '+' '-' '*' '/' '(' ')' '**' '<' '>' ','.

funargs -> comp_expr : ['$1'].
funargs -> funargs ',' comp_expr : ['$3' | '$1'].

%% a function call with no arguments
funcall -> ident '(' ')': {funcall, '$1', []}.

%% we have to reverse the arg list here
funcall ->
   ident '(' funargs ')': {funcall, '$1', lists:reverse('$3')}.

%% more cases for 'val'
val -> funcall : '$1'.
val -> str : '$1'.

An interesting one here is funargs, which recursively builds a list of function arguments, where each argument can be an arbitrary expression. And because the production rule for funargs is left-recursive, the resulting list has to be reversed.
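For example, a call with two OID arguments should parse into something like the following (max is used here only because it already exists as an atom, which list_to_existing_atom requires):

iex> {:ok, tokens, _} = :lexer.string 'max([1.3.6.5], [1.3.6.6])'
iex> :parser.parse tokens
{:ok,
 {:funcall, {:ident, 1, :max},
  [{:oid, 1, [1, 3, 6, 5]}, {:oid, 1, [1, 3, 6, 6]}]}}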

The interpreter also had to become more sophisticated, because the store now had to contain pre-defined identifiers next to the OID values. The two added clauses for the evaluator were

# string expressions evaluate to the string itself
def eval({:str, _, s}, _) when is_binary(s), do: s

# function calls
def eval({:funcall, {:ident, _, fun}, args}, store) do
    # find the identifier in the store
    case Map.fetch(store, fun) do
      {:ok, impl_fun} ->
        evaled_args = Enum.map(args, fn arg -> eval(arg, store) end)
        # this calls a function defined by the atom in `impl_fun`
        # in the running process which is potentially dangerous, so `impl_fun`s
        # should be very rigorous in validating their arguments
        Kernel.apply(__MODULE__, impl_fun, evaled_args)
      :error ->
        throw({:undefined_fun, fun})
    end
end
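For completeness, here is a minimal sketch of what the "I"/"i" implementations might look like inside the evaluator module (the function heads, the 32-bit size and the little-endian byte order are my assumptions, not the original code):

# pack a signed 32-bit integer into 4 bytes...
def pack("i", n) when is_integer(n), do: <<n::signed-little-size(32)>>
# ...and read those 4 bytes back as an unsigned 32-bit integer,
# so unpack("I", pack("i", -1)) comes out as 4294967295
def unpack("I", <<n::unsigned-little-size(32)>>), do: n

With :pack and :unpack mapped to these clauses in the store, the offending expression should evaluate without complaints.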

Conclusion?

I do hope that I stumbled upon all the possible cases and that I won't end up writing a full-featured Perl interpreter :) in the process.

Yet, I must admit that having the tools proved to be an invaluable asset. Each iteration took me no more than a few hours to implement, which easily could have been days (weeks?) otherwise.

Footnotes:

[1]

Originally I didn't plan to include this section, but later on I decided to add more details so that the post could serve as an introductory tutorial for the tools.

[2]

I don't think that "alphabet" is a commonly used name for that. What I mean here is the union of the non-terminal and terminal symbol sets.

[3]

Which naturally stands for "What is The Fantastically brilliant way to write arithmetic expressions?".

[4]

Which approximately stands for "When you think that life is boring and all the deadlines for the project were due "two weeks ago", the only thing that can spice it up is a big, warm, smelly pile of shit work".