Goroutines on many cores

After seeing the remarkable effects of OCaml code parallelisation I got curious to learn how fast Go would become using a similar approach to data partitioning.

Chunking the device set

Following the OCaml version, I first decided to split the device set into N chunks and run each chunk in a separate goroutine:

       ch := make(chan [][]string)
       nChunks := runtime.NumCPU()
       chunkSize := (len(devices) + nChunks - 1) / nChunks
       for i := 0; i < len(devices); i += chunkSize {
               end := i + chunkSize
               if end > len(devices) {
                       end = len(devices)
               }
               go runChunk(opts, devices[i:end], interfaces, ch)
       }

where the runChunk function was almost identical to the one used in the single-core version, with the exception that data points were pushed to a channel instead of stdout:

func runChunk(opts Options, devices []Device, interfaces map[int]IfacesMap, ch chan [][]string) {
       for _, d := range devices {
               rrds := listRRDFiles(d, opts, interfaces[d.Id])
               for _, file := range rrds {
                       processRRDFile(file, ch)
               }
       }
       ch <- nil
}

I used the ch <- nil trick to indicate that a worker had finished processing its chunk. It was required to know when to stop reading from the channel. Because the number of chunks was set to the number of cores, I knew exactly how many nils I should be getting:

       for i := 0; i < nChunks; i++ {
               for records := <- ch; records != nil; records = <- ch {
                       writer.WriteAll(records)
               }
       }

It worked but seemed somewhat convoluted.

Running the program produced results not dissimilar to the OCaml version: 2 min 54s.

Chunking the rrd file set

Again, following the OCaml lead, I anticipated better performance with more uniform core utilisation. That meant chunking not the device set but the rrd file set:

       rrdChunkSize := (len(rrds) + nChunks - 1) / nChunks
       for i := 0; i < len(rrds); i += rrdChunkSize {
               end := i + rrdChunkSize
               if end > len(rrds) {
                       end = len(rrds)
               }
               go runRRDChunk(rrds[i:end], outCh)
       }

       for i := 0; i < nChunks; i++ {
               for records := <-outCh; records != nil; records = <-outCh {
                       writer.WriteAll(records)
               }
       }

where runRRDChunk iterated over the slice:

func runRRDChunk(rrds []RRDFile, ch chan [][]string) {
       for _, file := range rrds {
               processRRDFile(file, ch)
       }
       ch <- nil
}

and the device set -> file set mapping was done on many cores by a slightly revised version of the runChunk function mentioned in the previous section.

Results were good, but not great: 2 min 10s, with total memory consumption peaking at 42GB. Both are slightly worse than OCaml's, and I am not sure how to make it run any faster. Increasing the number of goroutines makes things slower if anything, and using explicit locking mechanisms would feel like throwing away all the good stuff Go provides.

To sum it up, running Go on 24 cores (48 if we count hyper-threading) made it only 6 times faster: 2 min 10s vs 12 min. That being said, it’s still a solid improvement over the excellent single-core performance.