Haskell vs. Go vs. OCaml vs…
Table of Contents
update 31/08/2018 - moved "Libraries" section into a separate post
A couple of years ago I read the paper Haskell vs. Ada vs. C++ vs. AWK vs. … which left me under the impression that Haskell might be a great choice for fast prototyping. Since then I have done a few toy projects in Haskell and read Real World Haskell and Programming in Haskell (which was absolutely marvellous), yet it was hard to come across a real-world project where I could verify the claim.
A few weeks ago I finally got the chance. The team I am a part of, apart from writing totally awesome things in Elixir/Erlang, in between times maintains a tremendous amount of legacy applications written in Perl with some bits of Python. One of the tools is responsible for daily querying a bunch of RRD files using Perl, enriching the data with metadata using Python, and finally dumping it all into a single .csv file. And when I say "a bunch", that means almost a million files in some cases. No one really looked at the tool for years until the number of files to process increased 4x overnight. The tool could not keep up, effectively making the daily reports unusable. It was taking more than 24h to complete a run! A temporary workaround was to run the application on a beefy 24-core, 2x CPU, 128GB RAM HP Gen9 blade, which was able to run the thing in 9 hours.
It just so happened that when the problem manifested itself, I was chatting with my boss about how awesome Haskell was, how OCaml was the implementation language of choice for some modern programming languages, and how most software developers unreasonably turn a blind eye to their great merits. And the boss - otherwise an absolutely rational and pragmatic person - said that my dream might have come true and that I was to rewrite that tool in Haskell.
While I was making some finishing touches to the project I was working on at the time, eagerly awaiting the start of the Haskell Rewrite Project, another team member found… let's say an anti-pattern in the Python part of the tool. Fixing it reduced 8 hours of Python data crunching to 20 minutes, bringing the total run time down to 1 hour and 20 minutes. The change eliminated the real business justification for the rewrite. However, my boss - being one of the most awesome bosses around - allowed me to start the project anyway and prototype the tool in Haskell, on the condition that I also wrote a prototype in Go and had it all done within a week. By the end of the week I was to present the results of the experiment to the team. I happily accepted the conditions and started the project.
The Problem
In a nutshell, the problem can be described as follows:
- parse command line arguments to gather parameters about the DB connection, in/out folders/files, and something called device groups
- query the required set of devices and the relevant metadata from a MySQL database
- based on the device set, deduce the list of RRD files
- for each file in the RRD file set, query data points for the 12:00 a.m. - 11:55 p.m. interval of the previous day
- dump every data point, using a human-readable metric name, device name, etc., as a row to the specified .csv file
It was a fairly straightforward problem; however, the scale did present a challenge.
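To make the shape of the task concrete, here is a minimal OCaml sketch of that pipeline. It is not the actual tool: the MySQL and rrdtool steps are reduced to hypothetical placeholder functions (query_devices, fetch_datapoints), and the command-line handling is simplified to two positional arguments.

```ocaml
(* A minimal sketch of the pipeline, not the real tool: the MySQL and
   rrdtool steps are stubbed out to show only the overall data flow. *)

type device = { name : string; rrd_path : string }

(* Placeholder for the MySQL query that returns devices and metadata. *)
let query_devices (_group : string) : device list = []

(* Placeholder for querying yesterday's 12:00 a.m. - 11:55 p.m. data
   points from a single RRD file; returns (metric, value) pairs. *)
let fetch_datapoints (_rrd_path : string) : (string * float) list = []

let dump_csv out_file group =
  let oc = open_out out_file in
  query_devices group
  |> List.iter (fun dev ->
         fetch_datapoints dev.rrd_path
         |> List.iter (fun (metric, value) ->
                Printf.fprintf oc "%s,%s,%f\n" dev.name metric value));
  close_out oc

let () =
  match Sys.argv with
  | [| _; group; out_file |] -> dump_csv out_file group
  | _ -> prerr_endline "usage: dump-rrd <device-group> <output.csv>"
```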
TL;DR or The Comparison Table
update 17/09/2018 - added multiprocess OCaml and Go+goroutines results
update 28/10/2018 - added async OCaml results
The Go version straight out of the box demonstrated the best performance. The Haskell version, after some tinkering, came second, and the OCaml version, while being slower than the other two, was the champion in memory consumption.
I didn't use goroutines or any other concurrency/parallelism techniques in any version initially, because I needed a fair comparison with the Perl/Python version and also wanted as simple a solution as possible. Go and Haskell used a parallel GC though, so the comparison was not completely fair.
Also, it is worth noting that the Go version used a librrd wrapper and consequently skipped parsing rrdtool output altogether.
Run time - is how long it took to process those 800000 files.
Memory - is the maximum memory consumption for the file set.
LOC - is the number of lines of code, including imports.
Size - is the executable file size (everything linked statically).
Build time - is how long it takes to rebuild the executable. All versions but OCaml + glob have only one file. The build time does not take into account building any extra dependencies.
Impl time - is how many hours (approximately) it took me to implement a version.
Version | Run time | Memory | LOC | Size | Build time | Impl time |
---|---|---|---|---|---|---|
Perl + Python | 80 min | - | 480+360 | - | - | - |
Go | 12 min | 1.3GB | 277 | 6.5MB | ~ 1s | 24 h |
Go + goroutines | 130s | 42GB | 301 | 6.5MB | ~ 1s | 27 h (24+3) |
Haskell (rushed) | 144 min | 1.7GB | 234 | 4.2MB | 1.7s | 16 h |
Haskell optimised | 17 min | 1.4GB | 286 | 4.2MB | 1.8s | 39 h (16+23) |
OCaml + find | 22 min | 60MB | 233 | 2.2MB | 0.3s | 14 h |
OCaml + glob | 23 min | 100MB | 232 + 25 | 4.7MB | 0.4s | 18 h (14+4) |
OCaml multiprocess | 67s | 30GB | 288 + 25 | 4.7MB | 0.4s | 23 h (18+5) |
OCaml + Lwt | 14 min 40s | 1.6GB | 253 + 25 | 5.3MB | 0.8s | 24 h (18+6) |
The latest stable versions of the respective compilers were used for the comparison: Go 1.10.4, GHC 8.4.3, and OCaml 4.07.0.
Go version
Go had to be the first one because I was not sure whether I would be able to complete it in time and wanted as much leeway as possible. I started with A Tour of Go and some other tutorials, which consumed one weekend of my time. I excluded that time from the table above because I already had some experience with Haskell and OCaml, so taking it into account did not seem fair.
Go's tools, ecosystem, documentation
I found the tooling good, easy to use, and fast. Most things just worked out of the box. There is a single entry point - the go command - which lets you build/install/remove packages and executables.
What I liked the most (although it's probably irrelevant for many) is that setting up the Emacs environment took me only 30-40 minutes. go-mode, go-eldoc, go-autocomplete, and go-rename required little to no extra tweaking, delivering a pleasant, virtually bug-free development experience. I especially liked go-playground, which provided me with almost repl-like functionality.
Documentation for the language, the standard library, tools, third-party packages is well written. It is approachable and beginner-friendly but comprehensive at the same time.
To sum it up: excellent.
Go's overall impression and thoughts
Go was easy to set up and easy to use. An abundance of documentation and ready-to-use libraries, written clearly and concisely, helped a lot. Language-wise, the experience was not dissimilar to what I had with AWK - somewhat awkward but straightforward - the KISS principle at its fullest. I think it is a good language that anyone can start hacking in without having to read a Category Theory book.
The flip side of the coin is not that bright, I am afraid. There is no standard way (at the time of this writing, at least) to lock dependencies. It is not a problem for small one-off projects, but I can't imagine how people are able to develop/maintain larger code bases without falling into cabal hell. As with Haskell, NixOS could be an answer to that problem.
Another observation, and it does stand out, especially after the Haskell experience, is that language-wise it feels somewhat… lame. Which is not necessarily a bad thing. I've got mixed feelings about it and will probably write a post or two to go into more detail.
Haskell version
Evaluation of Haskell in that semi-production setting was the main goal of the experiment. Although I knew enough of the language to start writing code straight away, I wanted to measure how productive I could be (and how much faster the tool could become). The only problem was that I had only two days left to answer the question.
Haskell's tools, ecosystem, documentation
Back in 2016, haskell-mode with interactive-haskell-mode and structured-haskell-mode for Emacs provided a well-rounded development environment. I am not sure what has changed since then, but I could not glue it all together and spent at least three hours trying to make it work cohesively. I ended up using intero-mode and brittany for code formatting. It worked reasonably well, yet determining the type of a highlighted expression, especially if it involved polymorphic functions, was highly unreliable. I had to use typed holes all the time. Worst of all was that the repl within Emacs tended to produce scary error messages - "unfathomable operation" (or something along those lines) - every time I did anything IO-related. I had to run the repl in the terminal to avoid that.
Documentation for most Haskell packages is scarce at best. The majority of libraries provide but a few pointers, merely remarks for the authors themselves, rather than comprehensive documentation for their users, let alone "getting started" guides or tutorials.
Hoogle is your friend and works just fine from the command line/Emacs. It has been invaluable and I wish all languages provided something similar. The problem, however, is that those Haddock pages are hard to navigate. The "Synopsis" is useless when there are more than a handful of functions (which is true in most cases) because it does not scroll (!!), effectively displaying only the first 10-15 items. The only sane way to navigate through, e.g., the 100+ functions in the ByteString library was calling w3m from within Emacs.
I used stack for building/compiling. It was OK. I think it has become even better over the past two years. My only gripe is that it lacks a search command and something like deps fetch/update/remove to simplify handling of dependencies.
Haskell's overall impression and thoughts
Using Haskell was surprisingly hard. When searching for information scattered across the Haskell Wiki, StackOverflow posts, and multiple blogs, I felt more like a detective or an investigator than a software engineer. Had I had something akin to Real World Haskell, but up to date, the whole process could have been smoother. Alas, I had not, and I struggled.
Another observation I made was that my intuition about performance was almost always off. GHC provides fantastic tools for profiling, and profiling was the only way for me to improve performance. The language's laziness changes the run-time behaviour drastically and, compared to strict languages, is something to always keep in mind. I should write another post about the journey to cut the run time from 144 minutes to 17. It is almost a detective story. All I can say here is that the performance hits were never where I expected them to be.
update 05/09/2018 - the post has been published.
OCaml version
I decided to add OCaml into the mix because I wanted to compare Haskell with a strict language that had an equally powerful type system. I also wanted to know whether I could write the whole thing without a single type declaration (spoiler: I was not able to).
OCaml's tools, ecosystem, documentation
Setting up the Emacs environment was straightforward. There is an excellent merlin tool that provides auto-completion, type information, and source code navigation, and integrates with Emacs seamlessly. There is also a fancy repl called utop that I think far exceeds any stand-alone repls out there in terms of usability and the number of colours in use. And there is an easy-to-use and feature-rich package manager called opam. I would consider it superior to Haskell's stack tool in terms of usability, if not feature-wise. The icing on the cake is the dune build system. It is insanely fast, although, like everything (it would seem) coming from Jane Street, it feels a bit too opinionated. Overall my impression of the tools is positive: on a par with Go's or even better.
The story is not that great with respect to documentation. While many packages have concise and comprehensible READMEs, tutorials on the official website are mostly outdated and/or incomplete. Real World OCaml is also somewhat dated and heavily biased towards Core - the standard library replacement. There is a newer edition of the book in progress, but it is still Core- (or rather Base-) biased and does not touch on some other excellent libraries out there.
The ecosystem is not as healthy as Go's and definitely not as vast as Haskell's, but I found it good enough and, more importantly, useful. For example, I could install a package whose latest commit was 5-8 years ago and it worked straight out of the box. That's something unimaginable with Haskell.
OCaml's overall impression and thoughts
I fell in love with OCaml. It's a simple, explicit language that is pleasant to work with and has a strong, sophisticated type system that actually guides the development process. It does not try to produce scary, intimidating error messages ☺. The type-level language has a different syntax from the main language, arguably making code easier to "brain-parse".
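To illustrate what I mean by the two syntaxes, here is a tiny made-up example: the same function described once in the signature (type-level) language and once in the implementation language.

```ocaml
(* The signature language: what the module offers, types only. *)
module type Stats = sig
  val average : float list -> float
end

(* The implementation language: how it does it, values and expressions. *)
module Stats : Stats = struct
  let average xs =
    List.fold_left ( +. ) 0.0 xs /. float_of_int (List.length xs)
end
```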
A pleasant deviation from Haskell, I think, is that the community seems to prefer writing libraries with the goal of getting the job done, not defending a CS thesis. I used batteries because I was looking for something that could interleave IO with pure computations, and I found Enum and a lot, lot more in there. I think it's an excellent library, and somehow I find it more to my liking than Core.
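For illustration, here is a rough sketch of the pattern (assuming the batteries package is installed), not the tool's actual code: the file is consumed lazily through an Enum, so the pure transformation step is interleaved with the IO instead of loading everything up front.

```ocaml
(* Lines are pulled from the file on demand; the pure map step runs
   per element, interleaved with the reading and the writing. *)
let upper_case_file in_file out_file =
  let oc = open_out out_file in
  BatFile.lines_of in_file                  (* string BatEnum.t, lazy *)
  |> BatEnum.map String.uppercase_ascii     (* pure transformation *)
  |> BatEnum.iter (fun line -> output_string oc (line ^ "\n"));
  close_out oc
```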
I was also stunned by how effortless it was to write a C stub. Just throw a C file into the source directory, declare the external function signature, and it's done! I couldn't find a "glob" library, so I used the find tool in the first version. The program built a huge pattern and then read the command's output through a pipe. It felt somewhat unfair because find was doing all the work. After some research I stumbled upon a piece of code that called a libc function from OCaml. So I wrote ~20 lines of code that wrapped libc's glob function and used it instead. That painless, almost native-like interop with C code was something I did enjoy.
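As a sketch of what that interop looks like (the C side is omitted here): the OCaml part is just an external declaration pointing at the stub's C function, whose name and the example pattern are made up for illustration.

```ocaml
(* The OCaml half of the wrapper: caml_glob_stub is the (hypothetical)
   C function in the accompanying stub file that calls libc's glob()
   and returns the matching paths as an array of strings. *)
external glob : string -> string array = "caml_glob_stub"

let () =
  glob "/var/rrd/*.rrd"        (* hypothetical pattern for illustration *)
  |> Array.iter print_endline
```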
Conclusion
I believe the experiment was successful at illuminating the strengths and weaknesses of the languages in the context of writing small, one-off tools. In other words, if you have a bunch of Python/Perl/Ruby scripts that you want/need to make run faster, which language should you choose?
Go had the best bang for the buck: the best performance per hour spent, out of the box. If you have a team of engineers of varying levels of expertise, by using Go you could expect that a) the result would be good-to-great and b) anyone on the team would be able to maintain it.
Haskell's laziness may be tricky, but the excellent built-in profiling tools remedy it, albeit at the cost of longer development time. More importantly though, the fact that the library ecosystem leaves a lot to be desired, coupled with the tendency to have stumbling blocks where there really shouldn't be any, made Haskell the worst tool for that particular job. On the other hand, if you have the time/resources to build a set of domain-specific, optimised libraries, it may pay off. The remaining problem is highly fragile run-time performance. All the time saved by the awesome language features will probably be spent figuring out "why that tiny change cut the performance in half".
OCaml appears to be somewhere in the middle. On one hand, an insanely fast compiler (significantly faster than Go's!) and a sophisticated type system, coupled with a powerful module system (much better than the other two's), make the language exceptionally pleasant to work with. On the other hand, it is somewhat lacking in centralised documentation. There aren't that many books oriented at both beginner and advanced levels. The availability of libraries/packages is also not that great. And finally, the absence of a parallel GC may incur performance hits in some situations, although the low memory footprint somewhat mitigates that.