Scalability of Go and OCaml on a particular problem
This is a follow-up to Ocaml: run it on 24 cores and Goroutines on many cores.
After publishing those posts I got curious about how exactly those implementations scale with the number of processes or goroutines. To obtain more accurate results I ran the tests 5 to 10 times for each level of concurrency. The task was complicated by the fact that I could only run the tests in relatively narrow time slots when no other OS processes were actively using the CPU or storage. It took several weeks to collect the results, and some test runs likely still overlapped with other CPU- or IO-intensive programs running on the same hardware.
Note: These measurements were taken on a nearly static data set, with one concrete implementation of a solution to one particular problem, and therefore cannot be used to draw any general conclusions.
Execution flow diagram
The idea behind both implementations was to distribute the work evenly among the available cores. The result of a DB query was converted into the device set in the main process, because that operation takes just a few milliseconds. The device set was then chunked, and each chunk was fed into a separate process/goroutine, which produced a list of RRD files of varying sizes. The intermediate results were later collected into one big list and chunked again to be processed concurrently. After that the implementations diverge slightly: the Go version receives data points per file and dumps them to CSV, while the OCaml version receives the data points collected from the whole chunk and only then dumps them to CSV.
Below is a flow diagram representing the described process. Blue rectangles denote computations happening sequentially in the "main process", while red ones run in separate processes/goroutines.
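The chunk-then-fan-out-then-collect shape described above can be sketched in Go as follows. This is a minimal illustration of the pattern only: the device IDs and the per-chunk "work" are placeholders, not the actual RRD-processing code from the earlier posts.

```go
package main

import (
	"fmt"
	"sync"
)

// chunk splits items into parts of roughly equal size so that there
// are at most n of them.
func chunk(items []int, n int) [][]int {
	var out [][]int
	size := (len(items) + n - 1) / n
	for i := 0; i < len(items); i += size {
		end := i + size
		if end > len(items) {
			end = len(items)
		}
		out = append(out, items[i:end])
	}
	return out
}

func main() {
	// Placeholder "device set"; in the real code this comes from a DB query.
	devices := make([]int, 100)
	for i := range devices {
		devices[i] = i
	}

	chunks := chunk(devices, 16)
	results := make(chan []int, len(chunks))
	var wg sync.WaitGroup
	for _, c := range chunks {
		wg.Add(1)
		go func(c []int) { // fan-out: one goroutine per chunk
			defer wg.Done()
			out := make([]int, 0, len(c))
			for _, d := range c {
				out = append(out, d*2) // placeholder for the per-device work
			}
			results <- out
		}(c)
	}
	wg.Wait()
	close(results)

	// Fan-in: collect per-chunk results into one big list, which would
	// then be chunked again for the next concurrent stage.
	var all []int
	for r := range results {
		all = append(all, r...)
	}
	fmt.Println("collected", len(all), "items")
}
```

In the actual pipeline this fan-out/fan-in step happens twice: once to produce the lists of RRD files, and once to process them into data points.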
Go with goroutines
Run time
It would seem that the best average result was achieved when the data set was divided into 16-17 parts (~2 min); after that the run time slowly climbed back up.
Speedup
The graph is built from average values. The best speedup is almost 5x.
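Speedup here is the usual ratio of the single-worker run time to the run time at a given concurrency level. A tiny sketch of the calculation, using illustrative numbers rather than the exact measurements behind the graph:

```go
package main

import "fmt"

// speedup is the classic ratio T1/Tn: single-worker run time divided
// by the run time with n-way concurrency.
func speedup(t1, tn float64) float64 {
	return t1 / tn
}

func main() {
	// Hypothetical run times in seconds, chosen only to illustrate the
	// calculation (a ~2 min best run against a ~10 min sequential run).
	t1 := 590.0
	t16 := 120.0
	fmt.Printf("speedup at 16 chunks: %.2fx\n", speedup(t1, t16))
	// prints: speedup at 16 chunks: 4.92x
}
```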
Multi-process OCaml
Run time
The OCaml version behaves slightly better. It plateaued with n_proc set to 22-24 and stayed at approximately that level (~1 min 18 s) even with n_proc set to 64. I skipped measurements of the intermediate points for obvious reasons, but took a measurement at 48 just to make sure the behaviour held. Setting the value to 256 already incurs a performance hit.
Speedup
As with the Go version, the graph is built from average values. There is an anomaly around n_proc values of 11-14 which I don't have a good explanation for: maybe I recorded wrong data, or there was interference from other programs running simultaneously on the same hardware. The speedup seems to be asymptotically approaching 18x, with a maximum average value of 17.35x.
Conclusion
While both implementations demonstrate behaviour consistent with Amdahl's law, it would seem that the multi-process approach scales better on this multicore system.
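For reference, Amdahl's law predicts a speedup of S(n) = 1/(s + (1-s)/n) on n workers for a program with serial fraction s, capping the achievable speedup at 1/s. The sketch below plugs in a guessed serial fraction of 1/18 (chosen only because it caps the speedup at 18x, the same ballpark as the OCaml asymptote); it is an illustration of the formula, not a fit to the measured data.

```go
package main

import "fmt"

// amdahl returns the speedup predicted by Amdahl's law for n workers
// when a fraction s of the work is strictly serial.
func amdahl(s float64, n int) float64 {
	return 1.0 / (s + (1.0-s)/float64(n))
}

func main() {
	// Illustrative serial fraction: 1/18 of the work being serial caps
	// the speedup at exactly 18x, no matter how many workers are added.
	s := 1.0 / 18.0
	for _, n := range []int{4, 16, 64, 1 << 20} {
		fmt.Printf("n=%8d  predicted speedup=%.2fx\n", n, amdahl(s, n))
	}
}
```

Note that even at n = 64 the predicted speedup stays well below the 18x cap, which is why the measured curves flatten long before the core count is exhausted.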