Goroutines on many cores
After seeing the remarkable effects of OCaml code parallelisation, I got curious how fast Go would become using a similar approach to data partitioning.
Chunking device set
Following the OCaml version, I first decided to split the device set into N chunks and run each chunk in a separate goroutine:
```go
ch := make(chan [][]string)
nChunks := runtime.NumCPU()
chunkSize := (len(devices) + nChunks - 1) / nChunks
for i := 0; i < len(devices); i += chunkSize {
	end := i + chunkSize
	if end > len(devices) {
		end = len(devices)
	}
	go runChunk(opts, devices[i:end], interfaces, ch)
}
```
where the runChunk function was almost identical to the one used in the single-core version, except that data points were pushed to a channel instead of stdout:
```go
func runChunk(opts Options, devices []Device, interfaces map[int]IfacesMap, ch chan [][]string) {
	for _, d := range devices {
		rrds := listRRDFiles(d, opts, interfaces[d.Id])
		for _, file := range rrds {
			processRRDFile(file, ch)
		}
	}
	ch <- nil
}
```
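The ceiling-division arithmetic used to size the chunks generalises to a small standalone helper. Here is a sketch with a hypothetical `chunks` function and toy data, not taken from the original program:

```go
package main

import (
	"fmt"
	"runtime"
)

// chunks splits a slice into at most n contiguous sub-slices using the
// same ceiling division as the chunking loop above. Hypothetical helper,
// not part of the original code.
func chunks(items []string, n int) [][]string {
	size := (len(items) + n - 1) / n
	var out [][]string
	for i := 0; i < len(items); i += size {
		end := i + size
		if end > len(items) {
			end = len(items)
		}
		out = append(out, items[i:end])
	}
	return out
}

func main() {
	devices := []string{"a", "b", "c", "d", "e"}
	// With 2 chunks this prints [a b c] then [d e].
	for _, c := range chunks(devices, runtime.NumCPU()) {
		fmt.Println(c)
	}
}
```

Ceiling division guarantees that no more than n chunks are produced and that the last chunk simply absorbs the remainder.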
I used the ch <- nil trick to signal that a worker had finished processing its chunk; it was needed to know when to stop reading from the channel. Because the number of chunks was set to the number of cores, I knew exactly how many nils to expect:
```go
for i := 0; i < nChunks; i++ {
	for records := <-ch; records != nil; records = <-ch {
		writer.WriteAll(records)
	}
}
```
It worked, but seemed somewhat convoluted. Running the program produced results not dissimilar to the OCaml version - 2 min 54s.
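For what it's worth, the nil-sentinel counting can be avoided entirely: a sync.WaitGroup can close the channel once all workers are done, and the reader then simply ranges until the channel is closed. A sketch with a toy worker standing in for runChunk (names are illustrative, not from the original code):

```go
package main

import (
	"fmt"
	"sync"
)

// fanIn runs one goroutine per chunk and closes the output channel once
// all of them have finished, so the reader needs no sentinel values.
// The worker body is a toy stand-in for runChunk.
func fanIn(chunks [][]string) [][]string {
	ch := make(chan [][]string)
	var wg sync.WaitGroup

	for _, chunk := range chunks {
		wg.Add(1)
		go func(chunk []string) {
			defer wg.Done()
			for _, item := range chunk {
				ch <- [][]string{{item}} // one record per item
			}
		}(chunk)
	}

	// Close the channel only after every worker is done; the range
	// below then terminates on its own.
	go func() {
		wg.Wait()
		close(ch)
	}()

	var out [][]string
	for records := range ch {
		out = append(out, records...)
	}
	return out
}

func main() {
	fmt.Println(fanIn([][]string{{"a", "b"}, {"c"}, {"d", "e"}}))
}
```

Whether this is faster is a separate question, but it removes the need to know the worker count on the reading side.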
Chunking rrd file set
Again, following the OCaml lead I anticipated better performance with more uniform core utilisation. That meant chunking not the device set but the rrd file set:
```go
rrdChunkSize := (len(rrds) + nChunks - 1) / nChunks
for i := 0; i < len(rrds); i += rrdChunkSize {
	end := i + rrdChunkSize
	if end > len(rrds) {
		end = len(rrds)
	}
	go runRRDChunk(rrds[i:end], outCh)
}
for i := 0; i < nChunks; i++ {
	for records := <-outCh; records != nil; records = <-outCh {
		writer.WriteAll(records)
	}
}
```
where runRRDChunk iterated over the slice:
```go
func runRRDChunk(rrds []RRDFile, ch chan [][]string) {
	for _, file := range rrds {
		processRRDFile(file, ch)
	}
	ch <- nil
}
```
and the device set -> file set mapping was done on many cores by a slightly revised runChunk from the previous section.
Results were good, but not great - 2 min 10s, with total memory consumption peaking at 42GB. Both are slightly worse than OCaml, and I am not sure how exactly to make it run any faster. Increasing the number of goroutines makes things slower, if anything, and using explicit locking mechanisms would feel like throwing away all the good stuff Go provides.
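One untested alternative to pre-sliced chunks is a fixed pool of workers pulling individual jobs from a shared channel: when files vary in size, the load balances itself instead of one slow chunk pinning a core. Purely a sketch, with a toy job (squaring) standing in for processRRDFile and made-up names:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// squareAll feeds jobs to a fixed worker pool over a shared channel;
// each worker grabs the next job as soon as it is free, so uneven job
// costs spread evenly across cores. Toy stand-in for RRD processing.
func squareAll(nums []int, workers int) int {
	jobs := make(chan int)
	results := make(chan int)
	var wg sync.WaitGroup

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for n := range jobs {
				results <- n * n
			}
		}()
	}

	// Feed all jobs, then close results once every worker is done.
	go func() {
		for _, n := range nums {
			jobs <- n
		}
		close(jobs)
		wg.Wait()
		close(results)
	}()

	sum := 0
	for r := range results {
		sum += r
	}
	return sum
}

func main() {
	fmt.Println(squareAll([]int{1, 2, 3, 4}, runtime.NumCPU())) // 30
}
```

Whether this would beat the chunked version here is an open question - the shared channel adds contention, which may be exactly why more goroutines made things slower above.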
To sum it up, running Go on 24 cores (48 if we count hyper-threading) made it run only 6 times faster - 2 min 10s vs 12 min. That being said, it is still a solid improvement over the excellent single-core performance.