Haskell: journey from 144 min to 17 min
While working on the Haskell vs. Go vs. … experiment I had to rush the initial implementation, because I had only two days before the presentation of the results. A number of bad choices were made along the way, and in this post I'll try to cover them together with their solutions.
Initial assumption
As I mentioned in the previous post, I could not find librrd bindings on Stackage, so I decided to start rrdtool for each file.
```haskell
import Control.Monad (forM_)
import System.Process (readProcess)

processRRDFile (start, end) f = do
  -- one rrdtool process per file; the final "" is the child's stdin
  output <- readProcess "rrdtool" ["fetch", f, "AVERAGE" {- , ... -}] ""
  forM_ (drop 2 (lines output)) mapper
  where
    mapper = -- ...
```
I did realise that this was a performance killer: starting a process is a relatively heavy operation, and doing it ~800,000 times would take minutes.
A few minutes after the presentation ended I found that rrdtool can be run in a daemon mode (of course!), in which it receives commands on stdin and writes results to stdout. But before making the change I recompiled the program with stack build --profile, just to double-check my assumption.
First 30% improvement
I was surprised to discover that the majority of the time was spent not in spawning the rrdtool process, as I had assumed, but in the seemingly innocent expression:
```haskell
import Text.Printf

timestamp mins = printf "%02d%02d" (mins `div` 60) (mins `mod` 60)
```
On the production data it was called in the innermost loop, about 270 million times. Replacing it with the following function:
```haskell
timestamp mins = hh <> mm
  where
    hh = preppend $ show (mins `div` 60)
    mm = preppend $ show (mins `mod` 60)
    preppend [c] = ['0', c]
    preppend s   = s
```
improved run-time by 30%.
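As a quick sanity check, here is a standalone sketch (the names timestampPrintf and timestampFast are mine, restating the two versions above) showing that the hand-rolled function agrees with the printf one for every 5-minute slot of a day:

```haskell
import Text.Printf (printf)

-- the original printf-based version
timestampPrintf :: Int -> String
timestampPrintf mins = printf "%02d%02d" (mins `div` 60) (mins `mod` 60)

-- the hand-rolled replacement
timestampFast :: Int -> String
timestampFast mins = prepend (show (mins `div` 60)) <> prepend (show (mins `mod` 60))
  where
    prepend [c] = ['0', c]
    prepend s   = s

main :: IO ()
main = print (all (\m -> timestampPrintf m == timestampFast m) [0,5..1435])
-- prints True
```

The replacement wins because printf has to interpret the format string on every single call, while the hand-rolled version only pads and concatenates.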
Next 20% improvement
I re-ran the program with the +RTS -p flag and this time (to my equally great astonishment) the culprit was the expression
```haskell
parseTimestamp s = (read . takeWhile isDigit) s :: Int
```
The idea was to read digits up to the first non-digit and convert the resulting String into an Int, for strings like

```
1232145: 0.5000e+3
```
And again, replacing read . takeWhile isDigit with the following
```haskell
import Data.Char (isDigit, ord)
import Data.List (foldl')

parseInt :: String -> Int
parseInt s = fst (parse s [])
  where
    parse cs xs = case cs of
      c:cs' | isDigit c -> parse cs' ((ord c - 48) : xs)
      _                 -> foldl' mult (0, 1) xs
    mult (acc, pow) x = (acc + x * pow, pow * 10)
```
introduced another 20% improvement over the previous version. A slight modification to the function to also return the remainder of the string:
```haskell
parseInt :: String -> (Int, String)
parseInt s = parse s []
  where
    parse cs xs = case cs of
      c:cs' | isDigit c -> parse cs' ((ord c - 48) : xs)
      _                 -> (fst (foldl' mult (0, 1) xs), cs)
    mult (acc, pow) x = (acc + x * pow, pow * 10)
```
resulted in another 10% improvement over the previous result, because it allowed me to get rid of break isSpace in

```haskell
readValue = dropWhile isSpace . snd . break isSpace
```

So now I could run the simpler expression

```haskell
readValue = dropWhile isSpace . tail
```

on the second element of the tuple returned by parseInt above.
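Putting the two pieces together on a sample line (a standalone sketch that restates parseInt and readValue from above; the sample input is illustrative):

```haskell
import Data.Char (isDigit, isSpace, ord)
import Data.List (foldl')

parseInt :: String -> (Int, String)
parseInt s = parse s []
  where
    parse cs xs = case cs of
      c:cs' | isDigit c -> parse cs' ((ord c - 48) : xs)
      _                 -> (fst (foldl' mult (0, 1) xs), cs)
    mult (acc, pow) x = (acc + x * pow, pow * 10)

-- the remainder begins with the ':' found right after the digits,
-- so tail drops it and dropWhile strips the padding spaces
readValue :: String -> String
readValue = dropWhile isSpace . tail

main :: IO ()
main = do
  let (ts, rest) = parseInt "1232145: 0.5000e+3"
  print ts                  -- 1232145
  putStrLn (readValue rest) -- 0.5000e+3
```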
Only after these changes did processRRDFile rise to first place in the .prof file.
ByteString and long-running rrdtool
The changes above cut the total run time in half. However, it was still bad: just marginally better than the original Perl/Python version.
Proper use of rrdtool
rrdtool was now started in daemon mode and the program communicated with it via pipes:
```haskell
import qualified System.Process as SP

main = do
  -- ...
  (Just sin, Just sout, _, ph) <-
    SP.createProcess (SP.proc "rrdtool" ["-"])
      { SP.std_in  = SP.CreatePipe
      , SP.std_out = SP.CreatePipe
      }
  -- ... code that uses sin and sout
```
Sending a command like

```
fetch <file_path> AVERAGE -s <posixstart> -e <posixend>
```

to stdin produces the same output on stdout as if the program had been started per-file, followed by an OK:... status line. The parsing step would remain, but the repeated process starts were no longer required.
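A minimal sketch of the request/response cycle (the helpers fetchCommand and takeUntilOK are mine; it assumes each fetch is answered by payload lines terminated by a status line starting with "OK"):

```haskell
import Data.List (isPrefixOf)

-- build the command string written to rrdtool's stdin
fetchCommand :: FilePath -> Int -> Int -> String
fetchCommand path start end =
  unwords ["fetch", path, "AVERAGE", "-s", show start, "-e", show end]

-- the response ends with a status line starting with "OK";
-- everything before it is the payload to parse
takeUntilOK :: [String] -> [String]
takeUntilOK = takeWhile (not . ("OK" `isPrefixOf`))
```

In the real program the command goes to sin followed by an hFlush, and the payload lines are read back from sout one by one.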
Replacing String with ByteString
Using the strict version of ByteString resulted in the most significant improvement. I literally used ByteString wherever I could, e.g.
```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as S

getResponseLines sout = do
  line <- S.hGetLine sout
  if S.isPrefixOf "OK" line
    then return []
    else do
      rest <- getResponseLines sout
      return (line : rest)

-- parseInt/readValue replacement
parseLine line = case S.readInt line of
  Just (ts, rest) -> (ts, S.drop 2 rest)
  Nothing         -> error "malformed line"
```
I haven't run it on the production data set, so I am not sure what the result would have been like there. On my local data set the two changes above cut the 27s of the pre-optimised version down to 3.5s. Much better, yet still worse than the Go version. At that point the program was spending about 20% of its time in the Glob library, 18% preparing the strings to print out, and 11% reading rrdtool output line by line.
Using ByteString.Builder
Instead of accumulating intermediate results in a list and then flushing them to stdout, I decided to try the Builder interface. As I understand it, using strict ByteString together with Builder gives the best of both worlds: efficient string operations where they are needed, and lazy accumulation otherwise. So instead of re-allocating buffers on every string concatenation, we get the promise of the whole string when (if!) it is needed.
```haskell
import qualified Data.ByteString.Builder as B
import System.IO (stdout)

renderLines :: [S.ByteString] -> B.Builder
renderLines lines =
  mconcat [ B.byteString line <> B.charUtf8 '\n' | line <- lines ]

renderCSVRow (c : cs) =
  B.byteString c <> mconcat [ B.charUtf8 ',' <> B.byteString c' | c' <- cs ]

-- somewhere down the file
B.hPutBuilder stdout (renderLines lines)
```
The B.hPutBuilder is the most awesome one: it dumps the Builder straight to the file handle without any unnecessary allocations.
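For instance, a CSV row built this way only materialises its bytes when the Builder is finally run (a standalone sketch; renderCSVRow is restated from above, the inputs are illustrative):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Builder as B
import qualified Data.ByteString.Char8 as S
import qualified Data.ByteString.Lazy.Char8 as L

renderCSVRow :: [S.ByteString] -> B.Builder
renderCSVRow []       = mempty
renderCSVRow (c : cs) =
  B.byteString c <> mconcat [ B.charUtf8 ',' <> B.byteString c' | c' <- cs ]

main :: IO ()
main = L.putStrLn (B.toLazyByteString (renderCSVRow ["0105", "0.5000e+3"]))
-- prints 0105,0.5000e+3
```

toLazyByteString is used here only to inspect the result; in the program itself hPutBuilder writes the chunks directly to the handle.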
Pre-allocated map
The last piece of optimisation was to eliminate the surprisingly costly timestamp function, which after the ByteString conversion looked like
```haskell
timestamp :: Int -> S.ByteString
timestamp = str . convert
  where
    convert mins = (mins `div` 60, mins `mod` 60)
    str (hh, mm) = preppend hh <> preppend mm
    preppend x
      | x < 10    = S.pack ('0' : show x)
      | otherwise = S.pack (show x)
```
I decided to pre-allocate a map to have the most common values cached.
```haskell
import qualified Data.IntMap.Strict as IntMap

tsmap :: IntMap.IntMap S.ByteString
tsmap = foldr (\i m -> IntMap.insert i (timestamp i) m) IntMap.empty [0,5..1440]

timestamp' i = case IntMap.lookup i tsmap of
  Just s  -> s
  Nothing -> timestamp i
```
And that eliminated the second-most time consuming part of the program.
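The same cache can be expressed a bit more directly with fromList and findWithDefault (an equivalent standalone sketch, not what the original used; timestamp is restated in a compressed form):

```haskell
import qualified Data.ByteString.Char8 as S
import qualified Data.IntMap.Strict as IntMap

timestamp :: Int -> S.ByteString
timestamp mins = pad (mins `div` 60) <> pad (mins `mod` 60)
  where
    pad x | x < 10    = S.pack ('0' : show x)
          | otherwise = S.pack (show x)

-- every 5-minute slot of a day, rendered exactly once
tsmap :: IntMap.IntMap S.ByteString
tsmap = IntMap.fromList [ (i, timestamp i) | i <- [0,5..1440] ]

timestamp' :: Int -> S.ByteString
timestamp' i = IntMap.findWithDefault (timestamp i) i tsmap
```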
Conclusion
After all these efforts the most time-consuming part was the Glob library call, now occupying 48% of the total run time. On the local data set the time went down to 1.2s (from 27s), and processing the production data took only 17 min (compared to the initial 144 min). The results correlate well with the Go version, which takes ~1s on the local data set and 12 min on the production set. I attribute the difference to the fact that Go doesn't have to waste time parsing the output of rrdtool.
Can Haskell be as fast as Go? Definitely yes. However, the amount of effort I had to put in was three times what I spent on the initial version, while with Go I got excellent results straight away.
Another observation is that the more "strictness" I introduced, the better the results got, and that the cost centres were never where I expected them to be. So laziness is a double-edged sword that should be treated with extreme care.