pl-rants

Rants about programming languages

Sep 02, 2018

Haskell: journey from 144 min to 17 min

While working on the Haskell vs. Go vs. ... experiment I had to rush the initial implementation because I had only two days before the experiment results were presented. A number of bad choices were made along the way, and I'll try to cover them, along with solutions, in this post.

Initial assumption

As I mentioned in the previous post, I could not find librrd bindings on Stackage, so I decided to start rrdtool for each file.

import Control.Monad (forM_)
import System.Process (readProcess)

processRRDFile (start, end) f = do
    -- readProcess takes the command, its arguments, and a stdin string
    output <- readProcess "rrdtool" ["fetch", f, "AVERAGE" {- , ... -}] ""
    forM_  (drop 2 (lines output)) mapper
  where mapper = -- ...

I did realise that it was a performance killer, because starting a process is a relatively heavy operation, and doing it ~800000 times would take minutes.

A few minutes after the presentation ended I found that rrdtool can be used in a daemon mode (of course!) where it receives commands on stdin and writes results to stdout. But before making the change I recompiled the program with stack build --profile, just to double-check my assumption.

First 30% improvement

I was surprised to discover that the majority of the time was spent not in spawning rrdtool processes, as I had assumed, but in this seemingly innocent expression:

import Text.Printf

timestamp :: Int -> String
timestamp mins = printf "%02d%02d" (mins `div` 60) (mins `mod` 60)

On the production data it was called in the innermost loop, about 270 million times. Replacing it with the following function:

timestamp mins =
    hh <> mm
  where
    hh = prepend $ show (mins `div` 60)
    mm = prepend $ show (mins `mod` 60)
    prepend [c] = ['0', c]
    prepend s = s

improved run-time by 30%.
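As a quick sanity check (a sketch of my own, not from the original code), the hand-rolled formatter can be verified against the printf version over every minute of a day:

```haskell
import Text.Printf (printf)

-- the original printf-based formatter
timestampPrintf :: Int -> String
timestampPrintf mins = printf "%02d%02d" (mins `div` 60) (mins `mod` 60)

-- the hand-rolled replacement
timestamp :: Int -> String
timestamp mins = hh <> mm
  where
    hh = prepend $ show (mins `div` 60)
    mm = prepend $ show (mins `mod` 60)
    prepend [c] = ['0', c]
    prepend s   = s

-- check that both agree on every minute of the day
main :: IO ()
main =
  if all (\m -> timestamp m == timestampPrintf m) [0 .. 1439]
    then putStrLn "OK"
    else error "mismatch"
```

The zero-padding is the only tricky part: a single-digit hour or minute gets a leading '0' prepended, everything else passes through unchanged.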

Next 20% improvement

I re-ran the program with the +RTS -p flag, and this time (to my equally great astonishment) the culprit was the expression

parseTimestamp s = (read . takeWhile isDigit) s :: Int

The idea was to read digits up to the first non-digit and convert the resulting String into an Int, for strings like

1232145: 0.5000e+3

And again, replacing read . takeWhile isDigit with the following

import Data.Char (isDigit, ord)
import Data.List (foldl')

parseInt :: String -> Int
parseInt s =
    fst $ parse s []
  where
    parse cs xs =
       case cs of
         c:cs' | isDigit c ->
                 parse cs' ((ord c - 48) : xs)
         _ ->
            foldl' mult (0, 1) xs
    mult (sum, pow) x = (sum + x*pow, pow * 10)

introduced another 20% improvement compared to the previous version. A slight modification to the function to return the remainder of the string -

parseInt :: String -> (Int, String)
parseInt s =
    parse s []
  where
    parse cs xs =
       case cs of
         c:cs' | isDigit c ->
                 parse cs' ((ord c - 48) : xs)
         _ ->
            let (n, _) = foldl' mult (0, 1) xs in (n, cs)
    mult (sum, pow) x = (sum + x*pow, pow * 10)

resulted in another 10% improvement over the previous result, because it allowed me to get rid of break isSpace in

readValue = dropWhile isSpace . snd . break isSpace

So now I could run a simpler expression

readValue = dropWhile isSpace . tail

on the second element of the tuple returned by parseInt above.
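Putting the two pieces together on the sample line from above (my own illustration, assuming the fixed tuple-returning parseInt):

```haskell
import Data.Char (isDigit, isSpace, ord)
import Data.List (foldl')

-- parse a leading run of digits, returning the number and the rest of the string
parseInt :: String -> (Int, String)
parseInt s = parse s []
  where
    parse cs xs =
      case cs of
        c:cs' | isDigit c -> parse cs' ((ord c - 48) : xs)
        _ -> let (n, _) = foldl' mult (0, 1) xs in (n, cs)
    -- digits were accumulated least-significant first, so fold with growing powers of 10
    mult (acc, pow) x = (acc + x * pow, pow * 10)

-- skip the ':' separator and any following spaces
readValue :: String -> String
readValue = dropWhile isSpace . tail

main :: IO ()
main = do
  let (ts, rest) = parseInt "1232145: 0.5000e+3"
  print ts                  -- 1232145
  putStrLn (readValue rest) -- 0.5000e+3
```

The accumulator trick works because pushing digits onto the list reverses them, so the fold can multiply by successive powers of ten without ever calling length.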

Only after these changes did processRRDFile pop into first place in the .prof file.

ByteString and long-running rrdtool

The changes above cut the total run time in half. However, it was still bad, being just marginally better than the original Perl/Python version.

Proper use of rrdtool

rrdtool was started in the daemon mode and could be communicated with via pipes

import qualified System.Process as SP

main = do
 -- ...
 (Just sin, Just sout, _, ph) <-
    SP.createProcess
        (SP.proc "rrdtool" ["-"])
            {SP.std_in = SP.CreatePipe, SP.std_out = SP.CreatePipe}
 -- ... code that uses sin and sout

Sending a command like

fetch <file_path> AVERAGE -s <posixstart> -e <posixend>

to its stdin would result in the same output on stdout as if the program were started per-file, followed by an OK:... status line. The parsing step remained, but subsequent process starts were no longer required.
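A small helper along these lines (the names fetchCmd and sendFetch are mine, not the original code) can format the command and push it down the pipe; flushing matters, since the daemon only responds once it sees a complete line:

```haskell
import System.IO (Handle, hFlush, hPutStr)

-- hypothetical helper: render a fetch command for the long-running rrdtool
fetchCmd :: FilePath -> Int -> Int -> String
fetchCmd path start end =
  unwords ["fetch", path, "AVERAGE", "-s", show start, "-e", show end] ++ "\n"

-- write the command to the daemon's stdin and flush so it actually gets sent
sendFetch :: Handle -> FilePath -> Int -> Int -> IO ()
sendFetch sin path start end = do
  hPutStr sin (fetchCmd path start end)
  hFlush sin

main :: IO ()
main = putStr (fetchCmd "some.rrd" 1535000000 1535086400)
```

After a sendFetch, the response lines can be read off the daemon's stdout handle until the OK line appears.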

Replacing String with ByteString

Using the strict version of ByteString resulted in the most significant improvement. I literally used ByteString wherever I could, e.g.

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as S

getResponseLines sout = do
   line <- S.hGetLine sout
   if S.isPrefixOf "OK" line
      then return []
      else do
        rest <- getResponseLines sout
        return (line : rest)

-- parseInt/readValue replacement
parseLine line =
   case S.readInt line of
      Just (ts, rest) -> (ts, S.drop 2 rest)
      Nothing -> error "malformed line"

I haven't run it on the production data set, so I am not sure what the result would have been like. On my local data set the two changes above cut the 27s of the pre-optimised version down to 3.5s. That was much better, yet still worse than the Go version. At that point the program was spending about 20% of its time in the Glob library, 18% preparing the strings to print out, and 11% reading the rrdtool output line by line.
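In isolation (a self-contained sketch of my own, repeating the snippet above), the S.readInt-based parser handles the earlier sample line like this:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as S

-- S.readInt parses a leading integer and returns the remainder of the line;
-- S.drop 2 then skips the ": " separator
parseLine :: S.ByteString -> (Int, S.ByteString)
parseLine line =
  case S.readInt line of
    Just (ts, rest) -> (ts, S.drop 2 rest)
    Nothing         -> error "malformed line"

main :: IO ()
main = print (parseLine "1232145: 0.5000e+3")
```

Unlike the hand-rolled String parser, S.readInt works directly on the packed bytes, so there is no per-character cons-cell allocation at all.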

Using ByteString.Builder

Instead of accumulating intermediate results in a list and then flushing them to stdout, I decided to try the Builder interface. As I understand it, using strict ByteString together with Builder gives the best of both worlds: efficient string operations where needed, and lazy accumulation otherwise. So instead of re-allocating buffers on every string concatenation, we get the promise of the whole thing materialising only when (if!) it is needed.

import qualified Data.ByteString.Builder as B
import System.IO (stdout)

renderLines :: [S.ByteString] -> B.Builder
renderLines lines =
    mconcat [ parseLine line <> B.charUtf8 '\n' | line <- lines ]

renderCSVRow [] = mempty
renderCSVRow (c : cs) =
    B.byteString c <> mconcat [ B.charUtf8 ',' <> B.byteString c' | c' <- cs ]

-- somewhere down the file
  B.hPutBuilder stdout (renderLines lines)

B.hPutBuilder is the most awesome one: it dumps the Builder straight to the file handle without any unnecessary allocations.
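Here is a self-contained sketch of the Builder approach (toLazyByteString is used only to show the rendered result; in the real program the Builder goes straight to the handle):

```haskell
import qualified Data.ByteString.Builder as B
import qualified Data.ByteString.Char8 as S
import qualified Data.ByteString.Lazy.Char8 as L
import System.IO (stdout)

-- join fields with commas, accumulated cheaply as a Builder
renderCSVRow :: [S.ByteString] -> B.Builder
renderCSVRow []       = mempty
renderCSVRow (c : cs) =
  B.byteString c <> mconcat [B.charUtf8 ',' <> B.byteString c' | c' <- cs]

main :: IO ()
main = do
  -- materialise the Builder once, for inspection
  L.putStrLn (B.toLazyByteString (renderCSVRow (map S.pack ["a", "b", "c"])))
  -- or dump it straight to a handle with no intermediate ByteString
  B.hPutBuilder stdout (renderCSVRow (map S.pack ["x", "y"]) <> B.charUtf8 '\n')
```

The design point is that concatenating Builders is O(1); bytes are only copied into output buffers when the Builder is finally executed against a handle.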

Pre-allocated map

The last piece of optimisation was to eliminate the surprisingly costly timestamp function.

After ByteString conversion it looked like

timestamp :: Int -> S.ByteString
timestamp = str . convert
  where
    convert mins = ((mins `div` 60), (mins `mod` 60))
    str (hh, mm) = prepend hh <> prepend mm
    prepend x | x < 10 = S.pack ('0' : show x)
              | otherwise = S.pack (show x)

I decided to pre-allocate a map to have the most common values cached.

import qualified Data.IntMap.Strict as IntMap

tsmap :: IntMap.IntMap S.ByteString
tsmap = foldr (\i m -> IntMap.insert i (timestamp i) m) IntMap.empty [0,5..1440]

timestamp' i =
   case IntMap.lookup i tsmap of
     Just s -> s
     Nothing -> timestamp i

And that eliminated the second-most time consuming part of the program.
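In isolation the cache behaves like this (my own self-contained sketch; findWithDefault replaces the explicit case for brevity):

```haskell
import qualified Data.ByteString.Char8 as S
import qualified Data.IntMap.Strict as IntMap

-- the ByteString version of the timestamp formatter
timestamp :: Int -> S.ByteString
timestamp mins = prepend (mins `div` 60) <> prepend (mins `mod` 60)
  where
    prepend x | x < 10    = S.pack ('0' : show x)
              | otherwise = S.pack (show x)

-- pre-allocate every 5-minute slot of a day
tsmap :: IntMap.IntMap S.ByteString
tsmap = foldr (\i m -> IntMap.insert i (timestamp i) m) IntMap.empty [0, 5 .. 1440]

-- cached lookup with a fallback to computing the value on the fly
timestamp' :: Int -> S.ByteString
timestamp' i = IntMap.findWithDefault (timestamp i) i tsmap

main :: IO ()
main = do
  S.putStrLn (timestamp' 65) -- hit: multiple of 5, served from the map
  S.putStrLn (timestamp' 61) -- miss: falls back to timestamp
```

Since the map is a top-level CAF, it is built once on first use and every later lookup is a cheap IntMap probe instead of a fresh div/mod/show/pack chain.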

Conclusion

After all the effort, the biggest time consumer was the Glob library call, now occupying 48% of the total run time. On the local data set the time went down to 1.2s (from 27s), and processing the production data took only 17 min (compared to the initial 144 min). The results correlate well with the Go version, which takes ~1s on the local data set and 12 min on the production set. I attribute the difference to the fact that Go doesn't have to waste time parsing the output of rrdtool.

Can Haskell be as fast as Go? Definitely yes. However, the amount of effort I had to put in was three times what I spent on the initial version, while with Go I got excellent results straight away.

Another observation is that the more "strictness" I introduced, the better the results got, and that the cost centres were never where I expected them to be. So laziness is a double-edged sword that should be handled with extreme care.