When JSON Sucks[1] or The Road To SQLite Enlightenment
This is a story of how I reduced an application's start time from 30 minutes down to a mere 10s.
One of the projects I worked on was a massively concurrent SNMP poller. It operates on the scale of tens of thousands of devices, collecting billions of metrics per day. It is a re-write of a legacy Perl-based poller in Elixir/Erlang.
The application configuration was stored in JSON files. That is, for every device that should be polled there was a corresponding JSON file describing how to access the device and which OIDs to query. Some of those files were fairly large, occupying more than 100MB on disk. The parsed configuration was stored in memory alongside the SHA1 hash of the file. This design made efficient incremental configuration updates possible, avoiding re-parsing of unchanged files.
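The hash-based incremental update can be illustrated with a small Python sketch (the cache layout and function name are mine, not the application's): a parsed config is cached under its file's SHA1 digest, so an unchanged file is never re-parsed.

```python
import hashlib
import json

# Hypothetical cache: path -> (sha1 hex digest, parsed config).
_cache = {}

def load_config(path, raw_bytes):
    """Parse raw_bytes as JSON, skipping the parse if the file is unchanged."""
    digest = hashlib.sha1(raw_bytes).hexdigest()
    cached = _cache.get(path)
    if cached is not None and cached[0] == digest:
        return cached[1]              # unchanged file: reuse the parsed config
    parsed = json.loads(raw_bytes)
    _cache[path] = (digest, parsed)
    return parsed

raw = b'{"ip": "10.0.0.1", "oids": ["1.3.6.1.2.1.1.3.0"]}'
first = load_config("device-1.json", raw)
second = load_config("device-1.json", raw)   # cache hit, no json.loads call
print(first is second)  # → True
```

Hashing is cheap compared to parsing, so the cost of an update is dominated by the changed files only.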
The first iteration of the poller wasn't a proper distributed application; it could only run in an "active/stand-by" setup. In this mode, when the active node fails, the stand-by node takes over. Once the failed node comes back to life, the then-active node hands the load back.
One day, due to a chain of events, the active node failed and the stand-by node took over (a well-anticipated scenario). When the necessary steps had been taken, the failed node came back to life and the formerly stand-by node handed the load back. What I had not anticipated was that parsing the massive number of JSON files took roughly 30 minutes and saturated the available CPU cores. During that time the node was stalled and couldn't poll metrics, resulting in a 30-minute outage.
The problem could have been mitigated, of course, by delaying the hand-over until the configuration files were parsed and the node became fully operational. However, that would still leave a half-hour window with no resiliency. That was unacceptable. I was given two weeks to find and implement a solution.
Back to the drawing board.
Day 1. Let's use a stream JSON parser!
I decided to target the large JSON files first. Parsing them and allocating the corresponding structures in memory is non-linear with respect to their size. From a past life, when I had to work with XML files, I knew that a stream parser (SAX) could be much faster - O(file size) - albeit at the cost of more complicated bookkeeping/inline processing code.
The application already used the fastest (to the best of my knowledge) JSON parser in the Erlang world - jiffy. It is a "DOM-style" parser, i.e. it allocates the complete document tree in memory. Parsing the largest configuration file with jiffy, consuming it from memory, took 10s[2]. A stream JSON parser - jaxon - with its built-in NIF tokeniser chugged for 18s. Hmm… I thought the tokeniser was the weak spot, but a hand-written Erlang tokeniser proved me wrong - it bumped the parse time of the file to 31s.
It was a dead end.
Day 2. JSON No More or SQLite Attempt #1
After a sleepless night I had an insight: the fastest possible way to parse something is to not parse it at all! SQLite allows in-memory databases, so if I could put all the metadata into an SQLite database file and then "attach" it to an in-memory database, I would eliminate the parsing step altogether.
A crude benchmark demonstrated that attaching a database and copying all the tables with

```sql
CREATE TABLE foo AS SELECT * FROM attached.foo
```

for the largest file took only 300ms! It was a win… or so I thought before I measured it:
```
iex(11)> Metadata.scan_folder("~/hash-prod.sqlite3-many-files")
10:11:13.958 [info] Processed 73214 devices in 353842 ms
```
The result (354s) was a dramatic improvement over 30 min but was well above the allowed one-minute budget.
I noticed, however, that the problem became IO-bound. CPU utilisation was somewhere between 10-20% per core.
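The attach-and-copy trick translates directly to Python's stdlib `sqlite3` module; this sketch (with made-up table names and paths) copies a device's on-disk table into an in-memory database with no parsing at all, just page reads and writes:

```python
import os
import sqlite3
import tempfile

# Build a small on-disk database standing in for one device's config file.
path = os.path.join(tempfile.mkdtemp(), "device.sqlite3")
disk = sqlite3.connect(path)
disk.execute("CREATE TABLE oids (oid TEXT, interval_s INTEGER)")
disk.executemany("INSERT INTO oids VALUES (?, ?)",
                 [("1.3.6.1.2.1.1.3.0", 60), ("1.3.6.1.2.1.2.2.1.10", 30)])
disk.commit()
disk.close()

# Attach the on-disk file to an in-memory database and copy the table over.
mem = sqlite3.connect(":memory:")
mem.execute("ATTACH DATABASE ? AS attached", (path,))
mem.execute("CREATE TABLE oids AS SELECT * FROM attached.oids")
mem.commit()
mem.execute("DETACH DATABASE attached")
print(mem.execute("SELECT COUNT(*) FROM oids").fetchone()[0])  # → 2
```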
Day 3. SQLite Backup API
SQLite has a wealth of useful APIs. One of them is the Backup API, which allows efficiently copying an on-disk database file into an in-memory database. Exactly what I needed!
The then-existing SQLite libraries for Erlang/Elixir didn't expose the API, so I wrote a POC in C which processed all the files sequentially.
Success? Not quite:
```
[~/sqlite-poc]$ ./poc ~/hash-prod.sqlite3-many-files
712130ms
```
So processing all the files took approximately 12 minutes, and CPU utilisation did not exceed 7%. The problem was clearly IO-bound. Reasoning that writing a NIF to expose the API was a gamble at that point[3], I didn't pursue this path any further.
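Python's stdlib `sqlite3` happens to wrap the very same Backup API as `Connection.backup`, so the idea can be sketched without any C (paths and table names are illustrative):

```python
import os
import sqlite3
import tempfile

# An on-disk database standing in for one device's configuration.
path = os.path.join(tempfile.mkdtemp(), "config.sqlite3")
disk = sqlite3.connect(path)
disk.execute("CREATE TABLE oids (oid TEXT)")
disk.executemany("INSERT INTO oids VALUES (?)",
                 [("1.3.6.1.2.1.1.1.0",), ("1.3.6.1.2.1.1.5.0",)])
disk.commit()

# sqlite3_backup_* under the hood: copy the whole database, page by page,
# into a fresh in-memory one.
mem = sqlite3.connect(":memory:")
disk.backup(mem)
disk.close()
print(mem.execute("SELECT COUNT(*) FROM oids").fetchone()[0])  # → 2
```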
Re-thinking was required. I went back to the drawing board.
Day 4. OBDB (One Big Database)
If the problem is IO-bound, what can be done to improve the situation? It is well known that a sequential read of a large file is the best use-case for spinning disks and SSDs alike. I decided to put all the configuration into one large database file.
It is worth mentioning that the original JSON files occupied 18GB. The same data put into many SQLite files took 6.9GB on disk, while having everything in one database brought the size down to 4.8GB. So it was a win-win: a smaller size meant less IO, and all the IO was sequential! I was rubbing my hands with glee, anticipating the end of the journey…
```
iex(11)> Metadata.scan("~/hash-prod.sqlite3-oids-table/OBDB.sqlite3")
10:11:13.958 [info] Processed 73214 devices in 12581 ms
```
I was almost there. Alas, 2min for a cold start was still over the 1 min requirement ☹
Another problem was the degradation in overall application performance. The Erlang SQLite library - esqlite - spawns an OS thread per database connection. I had an Erlang process per polled device, and all of them required access to the configuration data. Thus, using a connection per Erlang process was a bad choice (it would spawn 70 thousand OS threads!), while sharing one connection across all Erlang processes meant most of them were waiting on a single mutex for no good reason.
Day 5. Not SQLite?
I hit a wall and couldn't see any way through. Desperate, I tried putting the configuration data into Redis and Postgres. Both were slower than the OBDB solution[4].
LMDB looked promising on paper. However, the only existing Erlang library exhibited the same OS-thread-per-connection problem. I was running out of time and was not sure I could afford several extra days to write a full-featured LMDB wrapper.
Another problem with a key-value store was that I couldn't figure out an elegant solution for incremental, atomic updates, which was a highly desirable feature.
It was a very depressing day. No progress was made and the alternatives were worse in one way or another.
Day 6. SQLite It Is!
A colleague of mine suggested using Erlang's native serialisation functions - term_to_binary/binary_to_term. I had discarded the option back on day 2 or 3 because deserialising the largest configuration file took 1.4s, significantly slower than SQLite's 300ms, and I thought that if I couldn't get under 1 min at 300ms per file, there was no chance of getting there at 1.4s per file.
The insight was that I could just store individual device configuration data as BLOBs in an SQLite table. That would be the ideal scenario: the data could be read from disk sequentially, while each individual BLOB could be processed in parallel by the corresponding Erlang processes. (Note, this could have been done with JSON as well, but jiffy takes 10s to parse the largest file, while binary_to_term needs only 1.4s.)
The resulting file occupied 5.9GB on disk. I tried simulating the cold start:
```
iex(11)> Metadata.scan("~/hash-prod.sqlite3-blob-table/ALL.sqlite3")
11:01:55.449 [info] Processed 73214 devices in 30581 ms
```
It took only 30s. Victory!
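In Python terms the layout looks roughly like this, with `pickle` standing in for `term_to_binary` and a dict per device as a made-up config:

```python
import pickle
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE devices (name TEXT PRIMARY KEY, config BLOB)")

configs = {"router-1": {"oids": ["1.3.6.1.2.1.1.3.0"]},
           "switch-7": {"oids": ["1.3.6.1.2.1.2.2.1.10"]}}
for name, cfg in configs.items():
    # The serialised config is an opaque BLOB as far as SQLite is concerned.
    db.execute("INSERT INTO devices VALUES (?, ?)", (name, pickle.dumps(cfg)))

# One sequential scan reads every BLOB; each one can then be deserialised
# independently (in the real system, by its own Erlang process).
restored = {name: pickle.loads(blob)
            for name, blob in db.execute("SELECT name, config FROM devices")}
print(restored == configs)  # → True
```

The database stays small and dumb on purpose: SQLite handles layout, atomicity and sequential IO, while deserialisation parallelises across readers.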
Completely exhausted yet delighted I had a sound sleep for the first time since the start of the endeavour 😁.
Day 7. Enlightenment
The problem was solved; however, a 5.9GB file didn't quite feel right. Luckily, Erlang's serialisation function - term_to_binary/2 - makes it trivial to enable compression of the serialised terms. With compression enabled, the database file shrank to 725MB, and it took 28s to process.
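In Erlang that is a one-liner - `term_to_binary(Term, [compressed])`, which applies zlib under the hood. A rough Python equivalent, zlib over a pickled term (the sample config is made up):

```python
import pickle
import zlib

# A made-up config with plenty of redundancy, as OID lists tend to have.
config = {"oids": ["1.3.6.1.2.1.2.2.1.10.%d" % i for i in range(1000)]}

raw = pickle.dumps(config)
packed = zlib.compress(raw)

print(len(packed) < len(raw))                            # → True
print(pickle.loads(zlib.decompress(packed)) == config)   # → True
```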
Then I moved some device-specific information, such as IP address and credentials, outside of the configuration BLOB and put the compressed terms into a separate table with a unique constraint, hoping that some of those tens of thousands of devices might have identical sets of OIDs to poll. It worked: 595MB and 25s. So normalisation not only improves a database design, but also boosts its performance 😎.
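A minimal sketch of that normalisation in Python/SQL terms (schema and helper names are mine): identical OID sets collapse into a single row keyed by their hash, and each device merely references it.

```python
import hashlib
import pickle
import sqlite3
import zlib

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE configs (hash TEXT PRIMARY KEY, blob BLOB);
CREATE TABLE devices (ip TEXT PRIMARY KEY,
                      config_hash TEXT REFERENCES configs(hash));
""")

def store(ip, oids):
    # Sort for a canonical representation so equal sets serialise identically.
    blob = zlib.compress(pickle.dumps(sorted(oids)))
    digest = hashlib.sha1(blob).hexdigest()
    db.execute("INSERT OR IGNORE INTO configs VALUES (?, ?)", (digest, blob))
    db.execute("INSERT OR REPLACE INTO devices VALUES (?, ?)", (ip, digest))

store("10.0.0.1", ["1.3.6.1.2.1.1.3.0", "1.3.6.1.2.1.2.2.1.10"])
store("10.0.0.2", ["1.3.6.1.2.1.2.2.1.10", "1.3.6.1.2.1.1.3.0"])  # same set

print(db.execute("SELECT COUNT(*) FROM configs").fetchone()[0])  # → 1
print(db.execute("SELECT COUNT(*) FROM devices").fetchone()[0])  # → 2
```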
As a bonus, incremental, atomic updates became trivial and having hashes alongside the BLOBs made reconciliation with an upstream database a breeze.
Finally, I reasoned that if Erlang's deserialisation took 1.4s per file while copying tables between SQLite databases took 300ms (see Day 2), I might improve the cold start time even further.
SQLite, being unimaginably awesome, provides exactly the right kind of API: it allows serialising/deserialising an entire database into/from a BLOB, and it takes just a few milliseconds. So the idea was to store databases within a container database[5].
Unfortunately, I didn't have the time to write a NIF wrapper for those functions, so we had to settle for the 28s solution.
A few weeks later, stealing some weekend time from my family, I wrote the raw_sqlite3 library. It is a NIF wrapper which exposes as much of the SQLite API as possible. It avoids spawning extra OS threads by running on dirty schedulers. I stored zlib-compressed[6], serialised SQLite databases as BLOBs in the container SQLite database. All those thousands of Erlang processes received their extracted BLOBs, decompressed them and called sqlite3_deserialize on them.
The result was an 800MB file, while the startup became astonishingly fast: a mere 10s! All the changes combined led to a 180x improvement in time and a 22.5x reduction in space compared to the original solution.
SQLite is wicked cool. While the JSON data format has some nice properties - abundant tooling, human readability, native support in many languages - as a storage format it can't hold a candle to a relational database, not only in terms of features (transactions, constraints, complex queries) but also in performance and size. Next time I need configuration storage for an application, I know what I'll be looking at first.
Also, please give raw_sqlite3 a try. It has been used in production under heavy load for several months now and its API has mostly stabilised.
[1] I am sorry for the somewhat rude title. I couldn't find a better word to characterise it.
[2] To be fair, jiffy could parse the string in 2s when configured to return lists of tuples. I needed maps, however, so it had to be configured accordingly.
[3] Given that the CPU utilisation was already low, it was highly unlikely that running it in parallel would improve the time.
[4] Both Postgres and Redis are amazing engineering products, and in some scenarios they are likely to outperform SQLite. Not in my case, however.
[5] More than one person mentioned the "Inception" film upon hearing it.
[6] Perhaps using LZ4 or ZSTD could push the envelope even further.