Regular Expressions vs Parser Combinators in OCaml
After witnessing the outstanding results demonstrated by Attoparsec library I was intrigued to compare the two approaches in OCaml.
The original problem I had to solve can be formulated as:
Given a short (less than 256 bytes) string of ASCII characters replace any contiguous sequence of any of
/:\ .(),'"
characters by underscore_
.
It is not a completely fare comparison because in Haskell version I had to write my
own implementation of replaceAll
function, while all the libraries I compare
against in OCaml already have that out of the box.
The contenders from the regular expression camp were re, re2, and pcre. And while the latter two are wrappers for Google’s re2 and the venerable pcre libraries, the former one is pure OCaml.
I can’t help but mention the excellent intro
documentation for pcre
and the lack of
thereof for re2
. It took me a while to figure out how to actually use it. And
the relatively poor results it demonstrated may be due to my lack of
understanding how to use it properly.
From the parser combinators camp I chose angstrom which seem to be heavily inspired by Attoparsec library.
Implementation details
The code that uses regular expressions is the very definition of
straightforwardness. The code for each library in encapsulated in a separate
module. All three pre-compile the regular expression and wrap library-specific
replace_all
equivalent:
module LibRe : Benchmarkable = struct
let re = Re.compile @@ Re.Pcre.re "[/:\\\\. (),'\"]+"
let replace_all = Re.replace_string re ~by:"_"
end
module LibRe2 : Benchmarkable = struct
let re = Re2.create_exn "[/:\\\\. (),'\"]+"
let replace_all = Re2.rewrite_exn re ~template:"_"
end
module LibPcre : Benchmarkable = struct
let re = Pcre.regexp "[/:\\\\. (),'\"]+"
let replace_all s = Pcre.replace ~rex:re ~templ:"_" s
end
where Benchmarkable
is a module signature that exposes only one function:
module type Benchmarkable = sig
val replace_all : string -> string
end
Angstrom-related code is slightly more involved. It defines two parsers good
and bad
and then combines them using sep_by1
combinator.
module LibAngstrom : Benchmarkable = struct
open Angstrom
let is_escape_char = function
| '/' | ':' | '\\' | '.' | ' '
| '(' | ')' | ',' | '"' | '\'' -> true
| _ -> false
(* "good" parser consumes any number of characters till it
hits one of the escape characters *)
let good = take_till is_escape_char
(* while "bad" parser consumes one or more escape characters *)
let bad = take_while1 is_escape_char
(* and the final parser is a combination of "good" and "bad"
parsers combined by `sep_by1` combinator that must consume
characters till the end of the input *)
let p = sep_by1 bad good <* end_of_input
let replace_all s = match parse_string p s with
| Result.Ok xs -> String.concat "_" xs
| Result.Error msg -> failwith msg
end
Results and conclusion
I used the excellent core_bench library for running the benchmarks. Although documentation is scarce at best, there is a fantastic blog post that goes at great lengths describing the motivation and implementation details of the library.
Name Time/Run mWd/Run mjWd/Run Prom/Run
--------- ---------- ------------ ----------- -----------
Re 4.18ms 942.30kw 177.06w 177.06w
Pcre 5.89ms 1_668.67kw 3_526.08w 3_526.08w
Re2 8.37ms 14.40kw 0.63w 0.63w
Angstrom 4.81ms 1_294.75kw 2_004.67w 2_004.67w
There are a couple of surprises here. Surprise #1 is how well re
did. I would
have never expected a natively-implemented regular expression library to
outperform C-wrappers. Surprise #2 is re2
. While it allocates minuscule amount
of memory and thus may be a plausible choice for some applications, seeing it
among outsiders in run time was unexpected.
Oh, and Angstrom did great. Easy to use, fast, concise… maybe a bit lacking in
terms of documentation, but reading through the mli
file and examples was
enough to get started.
The source code of the benchmark is available on GitHub. Pull requests are more than welcome.