Regular Expressions vs Parser Combinators in OCaml

After witnessing the outstanding results demonstrated by Attoparsec library I was intrigued to compare the two approaches in OCaml.

The original problem I had to solve can be formulated as:

Given a short (less than 256 bytes) string of ASCII characters replace any contiguous sequence of any of /:\ .(),'" characters by underscore _.

It is not a completely fare comparison because in Haskell version I had to write my own implementation of replaceAll function, while all the libraries I compare against in OCaml already have that out of the box.

The contenders from the regular expression camp were re, re2, and pcre. And while the latter two are wrappers for Google’s re2 and the venerable pcre libraries, the former one is pure OCaml.

I can’t help but mention the excellent intro documentation for pcre and the lack of thereof for re2. It took me a while to figure out how to actually use it. And the relatively poor results it demonstrated may be due to my lack of understanding how to use it properly.

From the parser combinators camp I chose angstrom which seem to be heavily inspired by Attoparsec library.

Implementation details

The code that uses regular expressions is the very definition of straightforwardness. The code for each library in encapsulated in a separate module. All three pre-compile the regular expression and wrap library-specific replace_all equivalent:

module LibRe : Benchmarkable = struct
  let re = Re.compile @@ Re.Pcre.re "[/:\\\\. (),'\"]+"
  let replace_all = Re.replace_string re ~by:"_"
end

module LibRe2 : Benchmarkable = struct
  let re = Re2.create_exn "[/:\\\\. (),'\"]+"
  let replace_all = Re2.rewrite_exn re ~template:"_"
end

module LibPcre : Benchmarkable = struct
  let re = Pcre.regexp "[/:\\\\. (),'\"]+"
  let replace_all s = Pcre.replace ~rex:re ~templ:"_" s
end

where Benchmarkable is a module signature that exposes only one function:

module type Benchmarkable = sig
  val replace_all : string -> string
end

Angstrom-related code is slightly more involved. It defines two parsers good and bad and then combines them using sep_by1 combinator.

module LibAngstrom : Benchmarkable = struct
  open Angstrom

  let is_escape_char = function
    | '/' | ':' | '\\' | '.' | ' '
    | '(' | ')' | ',' | '"' | '\'' -> true
    | _ -> false

  (* "good" parser consumes any number of characters till it
    hits one of the escape characters *)
  let good = take_till is_escape_char
  (* while "bad" parser consumes one or more escape characters *)
  let bad = take_while1 is_escape_char
  (* and the final parser is a combination of "good" and "bad"
     parsers combined by `sep_by1` combinator that must consume
     characters till the end of the input *)
  let p = sep_by1 bad good <* end_of_input

  let replace_all s = match parse_string p s with
    | Result.Ok xs -> String.concat "_" xs
    | Result.Error msg -> failwith msg
end

Results and conclusion

I used the excellent core_bench library for running the benchmarks. Although documentation is scarce at best, there is a fantastic blog post that goes at great lengths describing the motivation and implementation details of the library.

 Name       Time/Run      mWd/Run    mjWd/Run    Prom/Run
 --------- ---------- ------------ ----------- -----------
 Re           4.18ms     942.30kw     177.06w     177.06w
 Pcre         5.89ms   1_668.67kw   3_526.08w   3_526.08w
 Re2          8.37ms      14.40kw       0.63w       0.63w
 Angstrom     4.81ms   1_294.75kw   2_004.67w   2_004.67w

There are a couple of surprises here. Surprise #1 is how well re did. I would have never expected a natively-implemented regular expression library to outperform C-wrappers. Surprise #2 is re2. While it allocates minuscule amount of memory and thus may be a plausible choice for some applications, seeing it among outsiders in run time was unexpected.

Oh, and Angstrom did great. Easy to use, fast, concise… maybe a bit lacking in terms of documentation, but reading through the mli file and examples was enough to get started.

The source code of the benchmark is available on GitHub. Pull requests are more than welcome.