Well that didn’t take long… I switched everything over to Message Pack from Message Pack for the indices, and JSON for the actual logs as described in the last post. It was just too easy to use the
to_vec methods in the Message Pack crate. I think this will also give me slightly increased performance when reading messages from the disk, because I won’t have to read as much data as pure JSON. That said, I could have probably compress the JSON before writing it down, but it’s pretty small. That said, something like Zstandard does have a Rust crate.
While I had the individual pieces that write logs to disk, and index fields, I needed something to put it all together: the
DataManager. This struct holds an index for each key found in a log message, and the
LogFile that stores all the logs. To my surprise, when I tested it with a single log message, it just worked! It created an index file per key, and recorded the log to the log file.
The next thing to implement was a way to actually read the data. I implemented
get() for the data manager which looks up the correct index for a given key, and then pulls back the combination of in-memory and on-disk locations in the log file where the corresponding records can be found. These locations are then read from the log file, and the records returned in a
The next test was performance. I was curious how quickly all of this would work when attempting to write 10K messages to disk, and then get them back. The answer is both fast and slow. Writing the messages was blazingly fast taking only ~0.06s to write 10K of virtually the same message. Reading those messages back however; not so fast. I didn’t even let the 10K reads finish it was taking so long. Cranking it down to 100 calls to
get() took 14s. Reading the data back from disk was taking ~0.06s per call. I realize it’s 10K messages or 4.2MB of logs being read… but it still seems like it should be faster than this! Also, at 0.06s per call x 100 calls, that’s only 6 (call it 7 between friends) seconds… where is the other half of my seconds? Something is going on here that I don’t yet understand.
My best guess is it has something to do with the deserialization code. Earlier when I was implementing the
get() method for the
IndexFile, I ran into an issue where deserialization was returning an error. I scratched my head for a bit, then realized that I was writing a tuple of
(LogValue, Vec<u64>) to disk, but attempting to only read
Vec<u64>. Somehow, Serde/rmp was able to know this was wrong, and return an error. I haven’t a clue how it’s able to do this. It must be gleaning this type information from somewhere, and then checking what is deserialized matches what is expected. This type-checking must be happening at run-time, and so I’m wondering if it’s costly.
This also makes me wonder if there is something I can disable in rmp to make deserialization go faster in production, or if I should look at something inherently faster like Cap’n Proto. At this point I’m going to punt, because it’s “good enough for now” and I can always change it later.
As mentioned last time, here’s the latest commit before writing this post: 0ae5901.
As always, thanks for reading!