Scalability of Go and OCaml on a particular problem

This is a follow-up post of Ocaml: run it on 24 cores and Goroutines on many cores.

After publishing those posts I got curious to see how exactly those implementations scale with the number of processes or goroutines. To obtain more accurate results I ran the tests 5 to 10 times for each level of concurrency. The task was somewhat complicated because I could only run those in the relatively narrow time slots where no other OS processes were actively using CPU or the storage. So it took me several weeks to collect the results and there were likely cases where the test runs partially overlapped with other CPU or IO-intensive programs running on the hardware.

Note: These measurements are taken for a nearly static data set, with a concrete implementation of a solution to a particular problem and thus can not be used to draw any sort of a general conclusion.

Execution flow diagram

The idea of either implementation was to evenly distribute the work among the available cores. The result of a DB query was converted into the device set in the main process because that operation takes just a few milliseconds. After chunking the device set each chunk was fed into a separate process/goroutine which produced a list of RRD files of varying size. The intermediate results were later collected into one big list and chunked again to be processed concurrently. After that the implementations slightly differ. The Go version receives data points per file and dumps them to CSV. OCaml version receives data points collected from the whole chunk and only after that dumps the data to CSV.

Below is a flow-like diagram representing the described process. Blue rectangles denote computations happening in the “main process”, sequentially, while red ones run in separate processes/goroutines.

Go with goroutines

Run time

It would seem that the best average result was achieved when the data set was divided into 16-17 parts (~2 min) and then the run time started slowly climbing up.

Speedup

The graph is built for average values. The best speedup is almost 5x.

Multi-process OCaml

Run time

OCaml version behaves slightly better. It plateaued at n_proc set to 22-24 and it stayed approximately at that level ( ~1 min 18s) even with n_proc set to 64. I skipped measurements of all the intermediate points for obvious reasons but took measurements at 48 just to ensure that it would behave the same. Setting the value to 256 already incurs some performance hits.

Speedup

As with Go version, the graph is for average values. There is an abnormality around n_proc values of 11-14 which I don’t have a good explanation for. Maybe I recorded wrong data or there was an inference from other programs running simultaneously on the same hardware. The speedup seems to be asymptotically approaching 18x with the max average value 17.35x.

Conclusion

While both implementations clearly demonstrate behaviour non-contradicting to Amdahl’s law, it would seem that multi-process approach scales better on a multicore system.