pl-rants

Rants about programming languages

Aug 24, 2018

Haskell vs. Go vs. OCaml vs…


update 31/08/2018 - moved "Libraries" section into a separate post

A couple of years ago I read the paper Haskell vs. Ada vs. C++ vs. AWK vs. … which left me with the impression that Haskell might be a great choice for fast prototyping. Since then I have done a few toy projects in Haskell and read Real World Haskell and Programming in Haskell (which was absolutely marvelous), yet it was hard to come across a real-world project where I could verify the claim.

A few weeks ago I finally got the chance. The team I am a part of, apart from writing totally awesome things in Elixir/Erlang, in between times maintains a tremendous amount of legacy applications written in Perl with some bits of Python. One of those tools is responsible for querying a bunch of RRD files daily using Perl, enriching the data with metadata using Python, and finally dumping it all into a single .csv file. And when I say "a bunch", that means almost a million files in some cases. No one really looked at the tool for years until the number of files to process increased 4x overnight. The tool could not keep up, effectively making the daily reports unusable. It was taking more than 24 hours to complete a run! A temporary workaround was to run the application on a beefy 24-core, 2x CPU, 128GB RAM HP Gen9 blade, which was able to run the thing in 9 hours.

It just so happened that when the problem manifested itself, I was chatting with my boss about how awesome Haskell was, how OCaml was the implementation language of choice for some modern programming languages, and how most software developers unreasonably turn a blind eye to their great merits. And the boss - otherwise an absolutely rational and pragmatic person - said that my dream might have come true and that I was to rewrite that tool in Haskell.

While I was putting the finishing touches on the project I was working on at the time, eagerly awaiting the start of the Haskell Rewrite Project, another team member found… let's say an anti-pattern in the Python part of the tool. That reduced 8 hours of Python crunching data to 20 minutes, bringing the total run time down to 1 hour and 20 minutes. The change eliminated the real business justification for the rewrite. However, my boss - being one of the most awesome bosses around - allowed me to start the project anyway and prototype the tool in Haskell, on the condition that I also write a prototype in Go and have it all done within a week. By the end of the week I was to present the results of the experiment to the team. I happily accepted the conditions and started the project.

The Problem

In a nutshell, the problem can be described as follows:

  • parse command line arguments to gather parameters about DB connection, in-out folders/files and something called device groups
  • query the required set of devices and relevant metadata from a MySQL database
  • based on the device set deduce the list of RRD files
  • for each file in the RRD file set query data points for the 12:00 a.m. - 11:55 p.m. interval of the previous day
  • dump every data point using human-readable metric name, device name, etc. as a row to the specified .csv file

It was a fairly straightforward problem; the scale, however, did present a challenge.
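To make the shape of the task concrete, here is a toy-scale sketch in OCaml. The device query and RRD fetch are reduced to trivial stand-ins (the real versions talk to MySQL and rrdtool/librrd), and every name below is a hypothetical placeholder rather than code from any of the actual implementations.

```ocaml
(* Toy sketch of the pipeline: fetch devices, deduce RRD files,
   pull yesterday's data points, and dump everything as CSV rows.
   The "queries" below are hard-coded stand-ins for MySQL/librrd calls. *)
let fetch_devices () = [ ("router-01", "core") ]

let rrd_files_for devices =
  List.map (fun (name, _group) -> "/var/rrd/" ^ name ^ ".rrd") devices

let fetch_yesterday _file = [ ("2018-08-23 00:00", 42.0) ]

let () =
  let devices = fetch_devices () in
  let out = open_out "report.csv" in
  List.iter
    (fun file ->
       fetch_yesterday file
       |> List.iter (fun (ts, value) ->
              Printf.fprintf out "%s,%s,%f\n" file ts value))
    (rrd_files_for devices);
  close_out out
```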

TL;DR or The Comparison Table

update 17/09/2018 - added multiprocess OCaml and Go+goroutines results

update 28/10/2018 - added async OCaml results

The Go version straight out of the box demonstrated the best performance. The Haskell version, after some tinkering, came second, and the OCaml version, while slower than the other two, was the champion in memory consumption.

Initially I didn't use goroutines or any other concurrency/parallelism techniques in any of the versions, because I needed a fair comparison with the Perl/Python version and also wanted as simple a solution as possible. Go and Haskell used a parallel GC though, so the comparison was not entirely fair.

Also, it is worth noting that the Go version used a librrd wrapper and consequently skipped parsing rrdtool output altogether.

Run time: how long it took to process those 800,000 files.
Memory: the maximum memory consumption for the file set.
LOC: the number of lines of code, including imports.
Size: the executable file size (everything linked statically).
Build time: how long it takes to rebuild the executable. All versions but OCaml + glob have only one file. The build time does not take into account building any extra dependencies.
Impl time: approximately how many hours it took me to implement a version.
Version            | Run time   | Memory | LOC      | Size  | Build time | Impl time
Perl + Python      | 80 min     | -      | 480+360  | -     | -          | -
Go                 | 12 min     | 1.3GB  | 277      | 6.5MB | ~ 1s       | 24 h
Go + goroutines    | 130s       | 42GB   | 301      | 6.5MB | ~ 1s       | 27 h (24+3)
Haskell (rushed)   | 144 min    | 1.7GB  | 234      | 4.2MB | 1.7s       | 16 h
Haskell optimised  | 17 min     | 1.4GB  | 286      | 4.2MB | 1.8s       | 39 h (16+23)
OCaml + find       | 22 min     | 60MB   | 233      | 2.2MB | 0.3s       | 14 h
OCaml + glob       | 23 min     | 100MB  | 232 + 25 | 4.7MB | 0.4s       | 18 h (14+4)
OCaml multiprocess | 67s        | 30GB   | 288 + 25 | 4.7MB | 0.4s       | 23 h (18+5)
OCaml + Lwt        | 14 min 40s | 1.6GB  | 253 + 25 | 5.3MB | 0.8s       | 24 h (18+6)

The latest stable versions of the respective compilers were used for the comparison: Go 1.10.4, GHC 8.4.3, and OCaml 4.07.0.

Go version

Go had to be the first one because I was not sure I would be able to complete it in time and wanted as much leeway as possible. I started with A Tour of Go and some other tutorials, which consumed one weekend of my time. I excluded that time from the table above because I already had some experience with Haskell and OCaml, so taking it into account did not seem fair.

Go's tools, ecosystem, documentation

I found the tooling good, easy to use, and fast. Most things just worked out of the box. There is a single entry point - the go command - which lets you build/install/remove packages and executables.

What I liked the most (although it's probably irrelevant for many) is that setting up the Emacs environment took me only 30-40 minutes. go-mode, go-eldoc, go-autocomplete, and go-rename required little to no extra tweaking, delivering a pleasant, virtually bug-free development experience. I especially liked go-playground, which provided me with almost repl-like functionality.

Documentation for the language, the standard library, the tools, and third-party packages is well written. It is approachable and beginner-friendly, yet comprehensive at the same time.

To sum it up: excellent.

Go's overall impression and thoughts

Go was easy to set up and easy to use. The abundance of documentation and of ready-to-use libraries written clearly and concisely helped a lot. Language-wise, my experience was not dissimilar to what I had with AWK - somewhat awkward but straightforward - the KISS principle at its fullest. I think it is a good language that anyone can start hacking in without having to read a Category Theory book.

The flip side of the coin is not that bright, I am afraid. There is no standard way (at the time of this writing, at least) to lock dependencies. It is not a problem for small one-off projects, but I can't imagine how people are able to develop/maintain larger code bases without falling into cabal hell. As with Haskell, NixOS could be an answer to that problem.

Another observation, and it does stand out, especially after the Haskell experience, is that language-wise it feels somewhat… lame. Which is not necessarily a bad thing. I've got mixed feelings about it and will probably write a post or two to go into more detail.

Haskell version

Evaluation of Haskell in that semi-production setting was the main goal of the experiment. Although I knew enough of the language to start writing code straight away, I wanted to measure how productive I could be (and how much faster the tool could become). The only problem was that I had only two days left to answer the question.

Haskell's tools, ecosystem, documentation

Back in 2016, haskell-mode together with interactive-haskell-mode and structured-haskell-mode provided a well-rounded development environment in Emacs. I am not sure what has changed since then, but I could not glue it all together and spent at least three hours trying to make it work cohesively. I ended up using intero-mode and brittany for code formatting. It worked reasonably well, yet determining the type of a highlighted expression, especially one involving polymorphic functions, was highly unreliable. I had to use typed holes all the time. Worst of all, the repl within Emacs tended to produce scary error messages - "unfathomable operation" (or something along those lines) - every time I did anything IO-related. I had to run the repl in a terminal to avoid that.

Documentation for most Haskell packages is scarce at best. The majority of libraries provide but a few pointers - merely remarks for the authors themselves - rather than comprehensive documentation for their users, let alone "getting started" guides or tutorials.

Hoogle is your friend and works just fine from the command line/Emacs. It has been invaluable, and I wish all languages provided something similar. The problem, however, is that Haddock pages are hard to navigate. The "Synopsis" is useless when there are more than a handful of functions (which is true in most cases) because it does not scroll (!!), effectively displaying only the first 10-15 items. The only sane way to navigate through, e.g., the 100+ functions in the ByteString library was calling w3m from within Emacs.

I used stack for building/compiling. It was OK. I think it has become even better over the past two years. My only gripe is that it lacks a search command and something like deps fetch/update/remove to simplify the handling of dependencies.

Haskell's overall impression and thoughts

Using Haskell was surprisingly hard. When searching for information scattered across the Haskell Wiki, StackOverflow posts, and multiple blogs, I felt more like a detective or an investigator than a software engineer. Had I had something akin to Real World Haskell but up to date, the whole process could have been smoother. Alas, I had not, and I struggled.

Another observation I made was that my intuition about performance was almost always off. GHC provides fantastic tools for profiling, and they were the only way for me to improve performance. The laziness of the language drastically changes run-time behaviour and, compared to strict languages, is something you always have to keep in mind. I should write another post about the journey to cut the run time down from 144 minutes to 17. It is almost a detective story. All I can say here is that the performance hits were never where I expected them to be.

update 05/09/2018 - the post has been published.

OCaml version

I decided to add OCaml into the mix because I wanted to compare Haskell with a strict language that had an equally powerful type system. I also wanted to know if I could write the whole thing without a single type declaration (spoiler: I was not able to).

OCaml's tools, ecosystem, documentation

Setting up the Emacs environment was straightforward. There is an excellent tool called merlin that provides auto-completion, type information, and source code navigation, and integrates with Emacs seamlessly. There is also a fancy repl called utop that I think far exceeds any other stand-alone repl out there in terms of usability and the number of colours in use. And there is an easy-to-use and feature-rich package manager called opam. I would consider it superior to Haskell's stack in terms of usability, if not feature-wise. The icing on the cake is the dune build system. It is insanely fast, although, like everything (it would seem) coming from Jane Street, it feels a bit too opinionated. Overall my impression of the tools is positive: Go-level or even better.

The story is not that great with respect to documentation. While many packages have concise and comprehensible READMEs, the tutorials on the official website are mostly outdated and/or incomplete. Real World OCaml is also somewhat dated and is heavily biased towards Core - the standard library replacement. There is a newer edition of the book in progress, but it is still Core (or rather Base) biased and does not touch on some of the other excellent libraries out there.

The ecosystem is not as healthy as Go's and definitely not as vast as Haskell's, but I found it good enough and, more importantly, useful. For example, I could install a package whose latest commit was 5-8 years old, and it worked straight out of the box. That's something unimaginable with Haskell.

OCaml's overall impression and thoughts

I fell in love with OCaml. It's a simple, explicit language that is pleasant to work with and has a strong, sophisticated type system that actually guides the development process. It does not try to produce scary, intimidating error messages ☺. The type-level language has a different syntax from the term-level language, arguably making code easier to "brain-parse".
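As a small illustration of that separation (a throwaway example, not code from the tool), the signature syntax on top reads quite differently from the implementation below it:

```ocaml
(* The signature (type-level) describes what the module offers;
   the structure (term-level) says how it is implemented. *)
module Lookup : sig
  val find : string -> (string * 'a) list -> 'a option
end = struct
  let find key assoc =
    try Some (List.assoc key assoc) with Not_found -> None
end
```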

A pleasant deviation from Haskell, I think, is that the community seems to prefer writing libraries with the goal of getting the job done, not defending a CS thesis. I used batteries because I was looking for something that could interleave IO with pure computations, and I found Enum - and a lot, lot more - in there. I think it's an excellent library, and somehow I find it more to my liking than Core.
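To give a flavour of what interleaving IO with pure computations looks like, here is a minimal sketch using Batteries' lazy enumerations (the file name is a made-up placeholder, and this is not code from the actual tool):

```ocaml
(* Lines are pulled from the file on demand as the final iteration runs,
   so the pure transformations are interleaved with the IO. *)
let () =
  BatFile.lines_of "datapoints.txt"        (* IO: lazy enum of lines *)
  |> BatEnum.filter (fun l -> l <> "")     (* pure: drop empty lines *)
  |> BatEnum.map String.uppercase_ascii    (* pure: transform        *)
  |> BatEnum.iter print_endline            (* IO: consume the enum   *)
```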

I was also stunned by how effortless it was to write a C stub. Just throw a C file into the source directory, declare the external function's signature, and it's done! I couldn't find a "glob" library, so I used the find tool in the first version. The program built a huge pattern and then read the command's output through a pipe. It felt somewhat unfair because find was doing all the work. After some research I stumbled upon a piece of code that called a libc function from OCaml. So I wrote ~20 lines of code that wrapped libc's glob function and used it instead. That painless, almost native-like interop with C code was something I did enjoy.
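For illustration, the OCaml side of such a binding boils down to a single external declaration. Everything below is hypothetical: the stub name, the pattern, and the ~20 lines of C that would implement it are not the author's actual code.

```ocaml
(* Declare the C stub and call it like any other OCaml function.
   The matching C file (not shown) calls glob(3) and copies the
   matched paths into an OCaml string array. *)
external glob : string -> string array = "caml_glob_stub"

let () =
  glob "/var/rrd/*/*.rrd"
  |> Array.iter print_endline
```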

Conclusion

I believe the experiment was successful at illuminating the strengths and weaknesses of the languages in the context of writing small, one-off tools. In other words, if you have a bunch of Python/Perl/Ruby scripts that you want/need to make run faster, which language should you choose?

Go gave the best bang for the buck - the best performance per hour spent - out of the box. If you have a team of engineers of varying levels of expertise, by using Go you could expect that a) the result would be good-to-great and b) anyone on the team would be able to maintain it.

Haskell's laziness may be tricky, but the excellent built-in profiling tools remedy that, albeit at the cost of longer development time. More importantly though, the fact that the library ecosystem leaves a lot to be desired, coupled with the tendency to hit stumbling blocks where there really shouldn't be any, made Haskell the worst tool for that particular job. On the other hand, if you have the time/resources to build a set of domain-specific, optimised libraries, it may pay off. The remaining problem is the highly fragile run-time performance. All the time saved by the awesome language features will probably be spent figuring out why that tiny change cut performance in half.

OCaml appears to be somewhere in the middle. On one hand, an insanely fast compiler (significantly faster than Go's!) and a sophisticated type system coupled with a powerful module system (much better than the other two's) make the language exceptionally pleasant to work with. On the other hand, it is somewhat lacking in centralised documentation. There aren't that many books aimed at either beginners or advanced users. The availability of libraries/packages is also not that great. And finally, the absence of a parallel GC may incur performance hits in some situations, although the low memory footprint somewhat compensates for that.