Regular Expressions vs Parser Combinators in OCaml
Table of Contents
After witnessing the outstanding results demonstrated by Attoparsec library I was intrigued to compare the two approaches in OCaml.
The original problem I had to solve can be formulated as:
Given a short (less than 256 bytes) string of ASCII characters replace any contiguous sequence of any of
/:\ .(),'"characters by_.
It is not a completely fare comparison, because in Haskell version I had to
write my own implementation of replaceAll function, while all the libraries I
compare against in OCaml already have that out of the box.
The contenders from the regular expression camp were re, re2, and pcre. And while the latter two are wrappers for Google's re2 and the venerable pcre libraries, the former one is pure OCaml.
I can't help but mention the excellent intro documentation for pcre and the
lack of thereof for re2. It took me a while to figure out how to actually use
it. And the relatively poor results it demonstrated may be due to my lack of
understanding how to use it properly.
From the parser combinators camp I chose angstrom which seem to be heavily inspired by Attoparsec library.
Implementation details
The code that uses regular expressions is the very definition of
straightforwardness. The code for each library in encapsulated in a separate
module. All three pre-compile the regular expression and wrap library-specific
replace_all equivalent:
module LibRe : Benchmarkable = struct let re = Re.compile @@ Re.Pcre.re "[/:\\\\. (),'\"]+" let replace_all = Re.replace_string re ~by:"_" end module LibRe2 : Benchmarkable = struct let re = Re2.create_exn "[/:\\\\. (),'\"]+" let replace_all = Re2.rewrite_exn re ~template:"_" end module LibPcre : Benchmarkable = struct let re = Pcre.regexp "[/:\\\\. (),'\"]+" let replace_all s = Pcre.replace ~rex:re ~templ:"_" s end
where Benchmarkable is a module signature that exposes only one function:
module type Benchmarkable = sig val replace_all : string -> string end
Angstrom-related code is slightly more involved. It defines two parsers good
and bad and then combines them using sep_by1 combinator.
module LibAngstrom : Benchmarkable = struct open Angstrom let is_escape_char = function | '/' | ':' | '\\' | '.' | ' ' | '(' | ')' | ',' | '"' | '\'' -> true | _ -> false (* "good" parser consumes any number of characters till it hits one of the escape characters *) let good = take_till is_escape_char (* while "bad" parser consumes one or more escape characters *) let bad = take_while1 is_escape_char (* and the final parser is a combination of "good" and "bad" parsers combined by `sep_by1` combinator that must consume characters till the end of the input *) let p = sep_by1 bad good <* end_of_input let replace_all s = match parse_string p s with | Result.Ok xs -> String.concat "_" xs | Result.Error msg -> failwith msg end
Results and conclusion
I used the excellent core_bench library for running the benchmarks. Although
documentation is scarce at best, there is a thorough blog post that details the
motivation and implementation of the library.
Name Time/Run mWd/Run mjWd/Run Prom/Run --------- ---------- ------------ ----------- ----------- Re 4.18ms 942.30kw 177.06w 177.06w Pcre 5.89ms 1_668.67kw 3_526.08w 3_526.08w Re2 8.37ms 14.40kw 0.63w 0.63w Angstrom 4.81ms 1_294.75kw 2_004.67w 2_004.67w
There are a couple of surprises here. Surprise #1 is how well re did.  I
would have never expected a natively-implemented regular expression library to
outperform C-wrappers. Surprise #2 is re2. While it allocates minuscule
amount of memory and thus may be a plausible choice for some applications,
seeing it among outsiders in run time was unexpected.
Oh, and Angstrom did great. Easy to use, fast, concise… maybe a bit lacking
in terms of documentation, but reading through the mli file and examples was
enough to get started.
The source code of the benchmark is available on GitHub. Pull requests are more than welcome.