Haskell: journey from 144 min to 17 min

While working on the Haskell vs. Go vs. … experiment I had to rush the initial implementation because I had only two days before presenting the results. I made a number of bad choices along the way, and in this post I’ll try to cover them alongside their solutions.

Initial assumption

As I mentioned in the previous post, I could not find librrd bindings on Stackage, so I decided to start rrdtool once per file.

import Control.Monad (forM_)
import System.Process (readProcess)

processRRDFile (start, end) f = do
    -- readProcess takes the command, its arguments and a stdin string
    output <- readProcess "rrdtool" ["fetch", f, "AVERAGE" {- , ... -}] ""
    forM_ (drop 2 (lines output)) mapper
  where mapper = -- ...

I did realise that this was a performance killer: starting a process is a relatively heavy operation, and doing it ~800,000 times would take minutes.

Right after the presentation I learnt that rrdtool can be run in a daemon-like mode where it receives commands on stdin and writes results to stdout. But before making the change I recompiled the program with stack build --profile, just to double-check my assumption.

First 30% improvement

To my great surprise, the majority of the time was spent not in spawning the rrdtool process but in the seemingly innocent expression:

import Text.Printf

timestamp :: Int -> String
timestamp mins = printf "%02d%02d" (mins `div` 60) (mins `mod` 60)

On the production data it was called in the innermost loop, about 270 million times. Replacing it with the following function:

timestamp :: Int -> String
timestamp mins =
    hh <> mm
  where
    hh = prepend $ show (mins `div` 60)
    mm = prepend $ show (mins `mod` 60)
    prepend [c] = ['0', c]
    prepend s = s

improved run-time by 30%.
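As a quick sanity check (a throwaway sketch, not part of the actual program), the hand-rolled version can be compared with the printf one over a whole day of 5-minute slots:

import Text.Printf (printf)

-- Throwaway check: the hand-rolled timestamp must agree with the printf
-- version for every 5-minute slot of a day.
timestampsAgree :: Bool
timestampsAgree =
    and [ timestamp m == printf "%02d%02d" (m `div` 60) (m `mod` 60)
        | m <- [0,5..1435] ]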

Next 20% improvement

I re-ran the program with the +RTS -p flag and this time (to my equally great astonishment) the culprit was the expression

parseTimestamp s = (read . takeWhile isDigit) s :: Int

The idea was to read digits up to the first non-digit and convert the resulting String into an Int, for lines like

1232145: 0.5000e+3

And again, replacing read . takeWhile isDigit with the following

import Data.Char (isDigit, ord)
import Data.List (foldl')

parseInt :: String -> Int
parseInt s =
    fst $ parse s []
  where
    -- accumulate the digits in reverse, then fold them back into a number
    parse cs xs =
       case cs of
         c:cs' | isDigit c ->
                 parse cs' ((ord c - 48) : xs)
         _ ->
            foldl' mult (0, 1) xs
    mult (sum, pow) x = (sum + x*pow, pow * 10)

brought another 20% improvement compared to the previous version. A slight modification to the function, so that it also returns the remainder of the string -

parseInt :: String -> (Int, String)
parseInt s =
    parse s []
  where
    parse cs xs =
       case cs of
         c:cs' | isDigit c ->
                 parse cs' ((ord c - 48) : xs)
         _ ->
            -- fold the accumulated digits and return the unparsed remainder
            let (sum, _) = foldl' mult (0, 1) xs in (sum, cs)
    mult (sum, pow) x = (sum + x*pow, pow * 10)

resulted in another 10% improvement compared to the previous result because it allowed me to get rid of (break isSpace) in

readValue = dropWhile isSpace . snd . (break isSpace)

So now I could run a simpler expression

readValue = (dropWhile isSpace) . tail

on the second element of the tuple returned by parseInt above.
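Put together, the parser and readValue split a raw line in a single pass, roughly like this (a sketch; splitLine is a hypothetical name, not a function from the actual program):

import Data.Char (isSpace)

-- Hypothetical combined step: split "1232145: 0.5000e+3" into the
-- timestamp and the value string in one pass.
splitLine :: String -> (Int, String)
splitLine line =
    let (ts, rest) = parseInt line            -- rest  == ": 0.5000e+3"
    in (ts, dropWhile isSpace (tail rest))    -- value == "0.5000e+3"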

Only after these changes did processRRDFile climb to the top of the .prof file.

ByteString and long-running rrdtool

The changes above cut the total run time in half. However, the result was still disappointing, being only marginally better than the original Perl/Python version.

Proper use of rrdtool

rrdtool was now started once, in daemon mode, and communicated with via pipes:

import qualified System.Process as SP

main = do
  -- ...
  (Just sin, Just sout, _, ph) <-
      SP.createProcess (SP.proc "rrdtool" ["-"])
          { SP.std_in  = SP.CreatePipe
          , SP.std_out = SP.CreatePipe }
  -- ... code that uses sin and sout

Sending a command like fetch <file_path> AVERAGE -s <posixstart> -e <posixend> to its stdin produced the same output on stdout as starting the program per file, followed by an OK:... line. The parsing step remained, but repeated process starts were no longer required.
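A single round trip over the pipe then looks roughly like this (a sketch built on the sin and sout handles from the snippet above; the fetchOverPipe helper and its arguments are hypothetical):

import System.IO (Handle, hPutStrLn, hFlush, hGetLine)

-- Hypothetical helper: send one fetch command and collect the response
-- lines up to (but not including) the terminating "OK ..." line.
fetchOverPipe :: Handle -> Handle -> FilePath -> Int -> Int -> IO [String]
fetchOverPipe hin hout path start end = do
    hPutStrLn hin (unwords ["fetch", path, "AVERAGE", "-s", show start, "-e", show end])
    hFlush hin
    readUntilOK
  where
    readUntilOK = do
        line <- hGetLine hout
        if take 2 line == "OK"
            then return []
            else (line :) <$> readUntilOK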

Replacing String with ByteString

Using the strict version of ByteString resulted in the most significant improvement. I literally used ByteString wherever I could, e.g.

import qualified Data.ByteString.Char8 as S

getResponseLines sout = do
   line <- S.hGetLine sout
   if S.pack "OK" `S.isPrefixOf` line
      then return []
      else do
        rest <- getResponseLines sout
        return (line : rest)

-- parseInt/readValue replacement
parseLine line =
   case S.readInt line of
      Just (ts, rest) -> (ts, S.drop 2 rest)  -- drop the ": " after the timestamp
      Nothing -> error "parseLine: line does not start with a timestamp"

I haven’t run it on the production data set, but on my dev data set the two changes above cut the 27s of the pre-optimised version down to 3.5s. That was much better, yet still worse than Go. At that point the program was spending about 20% of its time in the Glob library, 18% preparing the string to print out, and 11% reading rrdtool output line by line.

Using ByteString.Builder

Instead of accumulating intermediate results in a list and then flushing them to stdout, I decided to try the Builder interface. As I understand it, using strict ByteString together with Builder gives the best of both worlds - efficient string operations where they are needed and lazy accumulation otherwise. So instead of re-allocating buffers on every string concatenation we get a promise to produce the whole thing when (if!) it is needed.

import qualified Data.ByteString.Builder as B
import System.IO (stdout)

renderLines :: [S.ByteString] -> B.Builder
renderLines lines =
    -- parseLine here is assumed to render one parsed line to a Builder
    mconcat [ parseLine line <> B.charUtf8 '\n' | line <- lines ]

renderCSVRow :: [S.ByteString] -> B.Builder
renderCSVRow [] = mempty
renderCSVRow (c : cs) =
    B.byteString c <> mconcat [ B.charUtf8 ',' <> B.byteString c' | c' <- cs ]

-- somewhere down the file
  B.hPutBuilder stdout (renderLines lines)

B.hPutBuilder is the most awesome part: it writes the Builder straight to the file handle without any unnecessary intermediate allocations.
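One extra tweak worth doing once at startup (an assumption on my part, following the Data.ByteString.Builder documentation rather than anything shown above) is to put stdout into binary, block-buffered mode so hPutBuilder can flush whole chunks at a time:

import System.IO (BufferMode (BlockBuffering), hSetBinaryMode, hSetBuffering, stdout)

-- Sketch: switch stdout to binary, block-buffered output once at startup
-- so that hPutBuilder writes large chunks instead of individual lines.
setupStdout :: IO ()
setupStdout = do
    hSetBinaryMode stdout True
    hSetBuffering stdout (BlockBuffering Nothing)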

Pre-allocated map

The last piece of optimisation was to eliminate the surprisingly costly timestamp function.

After the ByteString conversion it looked like this:

timestamp :: Int -> S.ByteString
timestamp = str . convert
  where
    convert mins = (mins `div` 60, mins `mod` 60)
    str (hh, mm) = prepend hh <> prepend mm
    prepend x | x < 10    = S.pack ('0' : show x)
              | otherwise = S.pack (show x)

I decided to pre-allocate a map to have the most common values cached.

import qualified Data.IntMap.Strict as IntMap

tsmap :: IntMap.IntMap S.ByteString
tsmap = foldr (\i m -> IntMap.insert i (timestamp i) m) IntMap.empty [0,5..1440]

timestamp' :: Int -> S.ByteString
timestamp' i =
   case IntMap.lookup i tsmap of
     Just s -> s
     Nothing -> timestamp i

And that eliminated the second most time-consuming part of the program.

Conclusion

After all these efforts the most time-consuming part was the Glob library call, now occupying 48% of the total run time. On the dev data set the time went down to 1.2s (from 27s), and processing the production data took only 17 min (from 144 min). The results correlate well with the Go version, which takes ~1s on the dev data and 12 min on the production set. I attribute the remaining difference to the fact that Go doesn’t have to waste time parsing the output of rrdtool.

Can Haskell be as fast as Go? Definitely yes; however, the amount of effort I had to put in was three times what I spent on the initial version, whereas with Go I got excellent results straight away.

Another observation is that the more “strictness” I introduced, the better the results got, and that the cost centres were never where I expected them to be. So laziness is a double-edged sword that should be treated with care.