Using Leex and Yecc in Elixir
Sometimes regular expressions are not enough. Sometimes there needs to be a higher-level construct, a driver that directs the process of parsing, building a complex data structure along the way.
A few weeks ago I found myself in exactly this kind of situation. We are slowly
migrating our huge, legacy code base from Perl to Elixir. Most things we, as a
team, do are related to network monitoring. And the (meta)data for that comes
in many different shapes and forms, including a relational DB, JSON files, and
whatnot. So I used to parse OIDs taken from JSON files using the regex
\d+(?:\.\d+)+
and convert them to a list of integers with
str |> String.split(".") |> Enum.map(&String.to_integer/1)
The expression above takes a string like "1.3.2.5.6" and converts it into the list [1,3,2,5,6].
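For example, in iex:

iex> "1.3.2.5.6" |> String.split(".") |> Enum.map(&String.to_integer/1)
[1, 3, 2, 5, 6]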
It works fine for metrics that are represented by a single OID. But we also
have something called "calc metrics". Those are strings that are eval-ed by
Perl. Looking at the data in our dev environment, I found that those
expressions, with some variations, look like the following
([1.3.1.5]+[1.3.1.6])*100/[1.3.1.7]
In other words, those were simple arithmetic expressions where OIDs enclosed in square brackets were substituted by their values.
And because arithmetic expressions correspond to a Chomsky Type 2 grammar, I couldn't use regular expressions anymore. That's a job for a more powerful tool.
Leex and Yecc [1]
The venerable lex and yacc tools have been around for almost 40 years now,
but that does not lessen their value. If you find yourself in need of quickly
writing a parser for a programming language of sorts (as I did), those are your
best friends. And because they are so helpful, they've been ported in one way or
another to almost any programming language out there. Erlang has them in its
standard distribution under the names leex and yecc.
Leex
leex obviously stands for "Lexical analyzer generator for Erlang". Its
primary job is to take a stream of characters and convert it into a stream of
tokens. A token is a terminal symbol in the alphabet that forms a language's
grammar. An example of a token could be the * or + character, or a number,
e.g. 42. Or, in this case, a string representation of an OID enclosed in
square brackets - [1.3.2.4.6].
What exactly constitutes a token, or, in other words, the rule that maps some
character sequence into a token, is defined by a regular expression. In our
case we can say that the * token corresponds to the regular expression \*,
the number token is defined as [0-9]+, while the oid token is defined
by \[[0-9]+(\.[0-9]+)+\].
Thus the string ([1.3.1.5]+[1.3.1.6])*100/[1.3.1.7] corresponds to the
sequence of tokens '(' oid '+' oid ')' '*' number '/' oid.
leex expects those rules to be written in a file as

%% number
[0-9]+ : {token, {number, TokenLine, list_to_integer(TokenChars)}}.

%% left paren
\( : {token, {'(', TokenLine}}.

%% ... other tokens

%% skip white spaces
[\s\n\r\t]+ : skip_token.
where a regular expression is on the left side of : and the term on the right
side of : defines a token. The term, however, is not arbitrary. A
two-tuple {token, _} emits a token, while the atom skip_token, well… skips a
token, simply discarding the consumed characters. TokenLine and TokenChars
are pre-defined variables which contain the line number in the source and the
list of consumed characters respectively. A rule must end with a full
stop. Apart from that, the rhs is a valid Erlang expression, so arbitrary
Erlang functions (e.g. list_to_integer above) are available. See the leex man
page for more details.
The rules go into a file that has the .xrl extension, and, for mix to be able
to compile it, the file should be placed under the src/ folder. The name
of the file determines the name of the module that implements the lexer. That
is, src/foo.xrl, after running mix compile, will produce a module :foo that
defines a few functions. In particular, calling
:foo.string '1+2*(5-6)'
results in a tuple that contains the token list and the end line number:
{:ok,
 [
   {:number, 1, 1},
   {:"+", 1},
   {:number, 1, 2},
   {:"*", 1},
   {:"(", 1},
   {:number, 1, 5},
   {:"-", 1},
   {:number, 1, 6},
   {:")", 1}
 ], 1}
Yecc
The unpronounceable yecc stands for "An LALR-1 parser generator for Erlang,
similar to yacc". The job of the tool is to produce a parser that takes a
stream of tokens as its input and generates some data structure (an AST,
perhaps) as the output.
To produce a parser module, yecc needs an alphabet [2] and grammar
rules. It also requires a root symbol which indicates the entry point of the
grammar.
The alphabet must define non-terminal and terminal symbols. A non-terminal
symbol (e.g. expr) stands for a sequence of terminals and/or other
non-terminals, while terminal symbols stand only for themselves (e.g. + or
number) and directly correspond to the tokens produced by the lexer.
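In a .yrl file those declarations could look something like this (a generic sketch with made-up symbols, not yet the OID grammar):

Nonterminals expr op.
Terminals number '+' '-' '*' '/'.
Rootsymbol expr.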
The grammar is defined as a sequence of production rules, for example

op -> '*' : '$1'.
op -> '/' : '$1'.
op -> '+' : '$1'.
op -> '-' : '$1'.
defines a production rule for a non-terminal symbol op that may stand for
the terminals *, /, +, -. In EBNF notation the same rule may be written as
<op> ::= '*' | '/' | '+' | '-'
Again, as with leex rules, the part before : defines the rule, while the part
after defines the output structure. The expression '$1' means "literally
substitute this with the value of the first matched symbol". In the example
above the value of the terminals will be the corresponding token, i.e.
{'*', 1} or {'+', 1}.
A more complex example would be

expr -> expr op term : {'$2', '$1', '$3'}.
expr -> term : '$1'.

op -> '+' : '+'.
op -> '-' : '-'.

term -> number : '$1'.
which defines a grammar for expressions like 1+1-2 or 1+2+3-5-6. The output
of the parser in this case is a tree of nested tuples where the nodes are + or
- operators and the numbers are leaves.
For example, parsing the 1+1-2 expression with the rules above yields

{'-',
 {'+', {:number, 1, 1}, {:number, 1, 1}},
 {:number, 1, 2}}
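To make that structure concrete, here is a tiny tree walk over it (my illustration, not code from this project; note that '-' and '+' are the atoms :- and :+ when written in Elixir):

# numbers evaluate to themselves; {op, lhs, rhs} nodes apply the operator
def eval_demo({:number, _line, n}), do: n
def eval_demo({op, lhs, rhs}) when op in [:+, :-],
  do: apply(Kernel, op, [eval_demo(lhs), eval_demo(rhs)])

# eval_demo({:-, {:+, {:number, 1, 1}, {:number, 1, 1}}, {:number, 1, 2}}) #=> 0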
First iteration: arithmetic expressions
The tools make rolling out your own parser a remarkably easy job. Reading the documentation of the modules and writing the lexing and parsing rules for the language of arithmetic expressions on OIDs took me no more than 2 hours. I spent another hour writing a simple interpreter and testing it in the REPL. The lexer was defined as
Definitions.

Rules.

%% a number
[0-9]+ : {token, {number, TokenLine, list_to_integer(TokenChars)}}.

%% an OID
\[[0-9]+(\.[0-9]+)*\] : {token, {oid, TokenLine, oid(TokenChars)}}.

%% open/close parens
\( : {token, {'(', TokenLine}}.
\) : {token, {')', TokenLine}}.

%% arithmetic operators
\+ : {token, {'+', TokenLine}}.
\- : {token, {'-', TokenLine}}.
\* : {token, {'*', TokenLine}}.
\/ : {token, {'/', TokenLine}}.

%% white space
[\s\n\r\t]+ : skip_token.

Erlang code.

oid(Oid) ->
    S = tl(lists:droplast(Oid)),
    L = string:split(S, ".", all),
    lists:map(fun list_to_integer/1, L).
The above is the complete code of the lexer.xrl file. The Erlang code. section
defines a helper function oid/1 that converts a list of characters
representing an OID into a list of integers.
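For instance, lexing a single OID should produce something like this (output shape inferred from the rules above):

iex> :lexer.string '[1.3.1.5]'
{:ok, [{:oid, 1, [1, 3, 1, 5]}], 1}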
The grammar for arithmetic expressions is well-known and also simple enough to fit into a few lines of code
Nonterminals expr term factor.
Terminals oid number '+' '-' '*' '/' '(' ')'.
Rootsymbol expr.

expr -> expr '+' term : {plus, '$1', '$3'}.
expr -> expr '-' term : {minus, '$1', '$3'}.
expr -> term : '$1'.

term -> factor '*' term : {mult, '$1', '$3'}.
term -> factor '/' term : {divi, '$1', '$3'}.
term -> factor : '$1'.

factor -> '(' expr ')' : '$2'.
factor -> oid : '$1'.
factor -> number : '$1'.
That code went verbatim into the parser.yrl file.
Defining the files above and running mix compile created two modules,
:lexer and :parser, that I could play with:
iex> {:ok, tokens, _} = :lexer.string '1+2*3'
{:ok, [{:number, 1, 1}, {:"+", 1}, {:number, 1, 2}, {:"*", 1}, {:number, 1, 3}], 1}

iex> {:ok, ast} = :parser.parse tokens
{:ok, {:plus, {:number, 1, 1}, {:mult, {:number, 1, 2}, {:number, 1, 3}}}}
After that, defining a tree-walk interpreter was a straightforward task:
# two mutually recursive data types, defining the AST
@type expr_ast ::
        {:mult, aterm, aterm}
        | {:divi, aterm, aterm}
        | {:plus, aterm, aterm}
        | {:minus, aterm, aterm}
        | aterm

@type aterm :: {:number, any(), integer()} | {:oid, any(), SNMP.oid()} | expr_ast

# and the spec for the evaluator function,
# where `store` is a map of OIDs to their values
@spec eval(expr_ast, store :: map) :: integer()

# evaluating a number returns its value
def eval({:number, _, n}, _store), do: n

# evaluating an oid returns its value from the store
# or throws an exception
def eval({:oid, _, oid}, store) do
  case Map.fetch(store, oid) do
    {:ok, val} -> val
    :error -> throw({:undefined, oid})
  end
end

# evaluating a binary operation is an application
# of the corresponding operator to the lhs and rhs
# evaluated recursively
def eval({:mult, lhs, rhs}, s), do: eval(lhs, s) * eval(rhs, s)
def eval({:divi, lhs, rhs}, s), do: div(eval(lhs, s), eval(rhs, s))
def eval({:plus, lhs, rhs}, s), do: eval(lhs, s) + eval(rhs, s)
def eval({:minus, lhs, rhs}, s), do: eval(lhs, s) - eval(rhs, s)
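Putting the pieces together, evaluating a calc metric could look roughly like this (the Calc module name and the store values are made up for illustration):

store = %{
  [1, 3, 1, 5] => 10,
  [1, 3, 1, 6] => 20,
  [1, 3, 1, 7] => 10
}

with {:ok, tokens, _} <- :lexer.string('([1.3.1.5]+[1.3.1.6])*100/[1.3.1.7]'),
     {:ok, ast} <- :parser.parse(tokens),
     do: Calc.eval(ast, store)
#=> 300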
A colleague of mine, upon seeing the code, said: "writing an interpreter in a language that has pattern-matching feels like cheating" :)
So a few dozen lines of code contain a full-featured parser and interpreter. Ain't those tools awesome?
Iteration 2: lt and gt
It was all good and fine up to the moment I tried to run the application on one of the bigger sites. I noticed some parse errors in the log files
error: can not parse calc metric "[1.3.5.6] - (2**32)*([1.3.5.7] < 0)"
Wtf? [3] I can understand that 2**32 means 2 raised to the power of 32, but
what on Earth does x < 0 evaluate to? Apparently, < and > expressions in Perl
(like True and False in Python) evaluate to 1 or 0, so the "boolean"
expression above can legitimately be a part of an arithmetic expression.
Well, the modifications to the lexer and parser were minimal.
Lexer:
%% power (should be defined before `*`)
\*\* : {token, {'**', TokenLine}}.

%% less-than and greater-than
\< : {token, {'<', TokenLine}}.
\> : {token, {'>', TokenLine}}.
Parser:
Nonterminals expr term factor val comp_expr.
Terminals oid number '+' '-' '*' '/' '(' ')' '**' '<' '>'.
Rootsymbol comp_expr.

comp_expr -> expr '<' comp_expr : {lt, '$1', '$3'}.
comp_expr -> expr '>' comp_expr : {gt, '$1', '$3'}.
comp_expr -> expr : '$1'.

%% factor now has a sub-component to maintain the precedence
factor -> val '**' factor : {pow, '$1', '$3'}.
factor -> val : '$1'.

val -> '(' comp_expr ')' : '$2'.
val -> oid : '$1'.
val -> number : '$1'.
Modifications to the interpreter were also relatively straightforward - three extra function clauses and a helper function.
def eval({:pow, lhs, rhs}, store) do
  round(:math.pow(eval(lhs, store), eval(rhs, store)))
end

def eval({:lt, lhs, rhs}, store) do
  bool2int(eval(lhs, store) < eval(rhs, store))
end

def eval({:gt, lhs, rhs}, store) do
  bool2int(eval(lhs, store) > eval(rhs, store))
end

defp bool2int(true), do: 1
defp bool2int(false), do: 0
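With those changes, the expression from the log parses and evaluates as expected. A quick sketch (again, the Calc module name and the store values are mine):

iex> {:ok, tokens, _} = :lexer.string '[1.3.5.6] - (2**32)*([1.3.5.7] < 0)'
iex> {:ok, ast} = :parser.parse tokens
iex> Calc.eval(ast, %{[1, 3, 5, 6] => 100, [1, 3, 5, 7] => -1})
-4294967196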
Iteration 3: function calls
And again, it was all good and fine until I test-ran the application on one of the largest production sites. I noticed parse errors mentioning something unfathomable
error: can not parse: [1.3.6.5]+unpack("I",pack("i",[1.3.6.6]), illegal characters: ['u']
%$#k! [4] The problem was that a complete implementation of Perl's pack
and unpack functions is a job worthy of, if not a Turing Award, then at least a
CS degree and an honorable mention in "Workaholics Daily" magazine!
Luckily, "I"
and "i"
were the only templates I could spot so implementing
the functions for that particular case was not such a big deal.
I rolled up my sleeves and added the following lines to the lexer file
Definitions.

%% a symbol which can be a part of a *sensible* function name in Perl
W = [a-zA-Z0-9_]

Rules.

%% identifiers
{W}+ : {token, {ident, TokenLine, list_to_existing_atom(TokenChars)}}.

%% double-quoted strings (do not allow escaped double quotes!)
\"[^\"]*\" : {token, {str, TokenLine, strip_quotes(TokenChars)}}.

\, : {token, {',', TokenLine}}.

Erlang code.

strip_quotes(Str) ->
    S = tl(lists:droplast(Str)),
    list_to_binary(S).
Additions to the parser were more involved
%% extended alphabet to include function calls and strings
Nonterminals expr term factor val comp_expr funcall funargs.
Terminals oid number ident str '+' '-' '*' '/' '(' ')' '**' '<' '>' ','.

funargs -> comp_expr : ['$1'].
funargs -> funargs ',' comp_expr : ['$3' | '$1'].

%% a function call with no arguments
funcall -> ident '(' ')' : {funcall, '$1', []}.
%% we have to reverse the arg list here
funcall -> ident '(' funargs ')' : {funcall, '$1', lists:reverse('$3')}.

%% another case for 'val'
val -> funcall : '$1'.
An interesting rule here is funargs, which recursively builds a list of
function arguments, where each argument can be an arbitrary expression. And
because the production rule for funargs is left-recursive, the resulting list
has to be reversed.
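So, for example, the unpack/pack call from the log (with the parentheses balanced) should lex and parse into roughly the following nested structure. Note that the ident rule uses list_to_existing_atom, so the :unpack and :pack atoms must already exist:

iex> {:ok, tokens, _} = :lexer.string 'unpack("I", pack("i", [1.3.6.6]))'
iex> :parser.parse tokens
{:ok,
 {:funcall, {:ident, 1, :unpack},
  [
    {:str, 1, "I"},
    {:funcall, {:ident, 1, :pack}, [{:str, 1, "i"}, {:oid, 1, [1, 3, 6, 6]}]}
  ]}}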
The interpreter also had to become more sophisticated, because the store now
had to contain pre-defined identifiers next to the OID values. The two
added clauses for the evaluator were
# string expressions evaluate to the string itself
def eval({:str, _, s}, _) when is_binary(s), do: s

# function calls
def eval({:funcall, {:ident, _, fun}, args}, store) do
  # find the identifier in the store
  case Map.fetch(store, fun) do
    {:ok, impl_fun} ->
      evaled_args = Enum.map(args, fn arg -> eval(arg, store) end)
      # this calls a function defined by the atom in `impl_fun`
      # in the running process, which is potentially dangerous, so `impl_fun`s
      # should be very rigorous in validating their arguments
      Kernel.apply(__MODULE__, impl_fun, evaled_args)

    :error ->
      throw({:undefined_fun, fun})
  end
end
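To give an idea of what those store-backed identifiers might look like: the Perl idiom unpack("I", pack("i", x)) reinterprets a signed 32-bit integer as an unsigned one. A minimal sketch, assuming the store maps the :pack and :unpack identifiers to same-named functions in the interpreter module (the bodies below are my illustration, limited to the "i"/"I" templates, not the production code):

# in the store: %{pack: :pack, unpack: :unpack, ...}

# pack a signed 32-bit integer into a binary ("i" template only)
def pack("i", n), do: <<n::little-signed-integer-size(32)>>

# unpack an unsigned 32-bit integer from a binary ("I" template only)
def unpack("I", <<n::little-unsigned-integer-size(32)>>), do: n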
Conclusion?
I do hope that I have stumbled upon all the possible cases and that I won't end up writing a full-featured Perl interpreter :) in the process.
Yet, I must admit that having the tools proved to be an invaluable asset. Each iteration took me no more than a few hours to implement, which easily could have been days (weeks?) otherwise.
Footnotes:
1. Originally I didn't plan to include this section, but later on I decided to add more details so that the post could serve as an introductory tutorial for the tools.
2. I don't think that "alphabet" is a commonly used name for this. What I mean here is the union of the non-terminal and terminal symbol sets.
3. Which naturally stands for "What is The Fantastically brilliant way to write arithmetic expressions?".
4. Which approximately stands for "When you think that life is boring and all the deadlines for the project were due "two weeks ago", the only thing that can spice it up is a big, warm, smelly pile of shit work".