Building a Log Store in Rust – Part 5

4 Feb

The lesson learned this time… there’s probably a library for that! This isn’t terribly surprising, and I’m kinda surprised I had to learn it again. One of my fellow grad students once called Java a scavenger hunt for a class that does what you want. I think Rust will probably turn into the same thing… and that isn’t a bad thing! It simply means people are writing libraries for Rust and that the documentation is fantastic (in most cases).

Speeding Up Record Reading

I was trying to speed up record reads, as they were taking ~4s for 100K reads. That’s a decent number of records, but only ~20MB of actual data. The first thing I did was replace the seek-then-read pair with a single positioned read, using the positioned-io crate. This led to my first issue.
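To make the difference concrete, here is a minimal sketch of seek-then-read next to a positioned read. It does not use positioned-io itself; it assumes a Unix target and uses the standard library's FileExt::read_exact_at as a stand-in for the same idea. The function names and the sample_file helper are hypothetical, made up for illustration.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::os::unix::fs::FileExt; // Unix-only: adds read_exact_at to File

/// Seek-then-read: two calls, and it needs `&mut File` because it moves the file cursor.
fn read_with_seek(file: &mut File, offset: u64, len: usize) -> io::Result<Vec<u8>> {
    let mut buf = vec![0u8; len];
    file.seek(SeekFrom::Start(offset))?;
    file.read_exact(&mut buf)?;
    Ok(buf)
}

/// Positioned read: one call, no cursor movement, and it works through `&File`.
fn read_positioned(file: &File, offset: u64, len: usize) -> io::Result<Vec<u8>> {
    let mut buf = vec![0u8; len];
    file.read_exact_at(&mut buf, offset)?;
    Ok(buf)
}

/// Hypothetical helper: writes `bytes` to a file in the temp directory and reopens it for reading.
fn sample_file(name: &str, bytes: &[u8]) -> io::Result<File> {
    let path = std::env::temp_dir().join(name);
    std::fs::write(&path, bytes)?;
    File::open(&path)
}
```

Because the positioned version only needs a shared reference to the file, several threads can read from the same handle without fighting over a single cursor position.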

Detour: Fixing Version Mismatches in Cargo

The problem I was facing was that I already declared byteorder as a dependency at version 1.2.1. However, positioned-io brought in byteorder at version 0.5. This caused an incompatibility with reads. I’m not sure this is the right way to do this, but I simply cloned the code, bumped the version, (submitted a PR), and then patched the dependency:
positioned-io = { path = "../positioned-io" }

With positioned-io working properly, I moved on to performing the reads on threads. I started with CpuPool via the futures-cpupool crate. However, I couldn’t get it to work for various reasons. I won’t rewrite all the issues I faced, as I leveraged the Rust Programming Language Forum and got some great help from vitalyd and HadrienG.

Rayon: Simple Work-Stealing Parallelism for Rust

HadrienG gave me the tip to look at Rayon, and I’m glad they did. It took my tens of lines of not-really-working parallel code and reduced them to a single line:

let ret = locs.into_par_iter().map(|loc| {self.log_file.get(loc).unwrap()}).collect();

It doesn’t get much easier than that. This code creates a parallel iterator given a vector of locations, and then calls get on the location, and collects all the records into another vector that is eventually returned.
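For a sense of what that one line saves you from writing, here is a rough std-only sketch of an order-preserving parallel map using scoped threads. This is not how Rayon works internally (Rayon uses a work-stealing thread pool, while this splits the input into fixed chunks), and get_record is a hypothetical stand-in for self.log_file.get(loc).

```rust
use std::thread;

/// Hypothetical stand-in for `self.log_file.get(loc)`: turns a location into a record.
fn get_record(loc: u64) -> String {
    format!("record@{loc}")
}

/// Maps `get_record` over `locs` on up to `workers` threads, preserving input order.
fn parallel_get(locs: &[u64], workers: usize) -> Vec<String> {
    if locs.is_empty() {
        return Vec::new();
    }
    // Ceiling division so every location lands in exactly one chunk.
    let workers = workers.max(1);
    let chunk_len = (locs.len() + workers - 1) / workers;
    thread::scope(|s| {
        // Spawn one thread per chunk; collecting the handles first lets all chunks run concurrently.
        let handles: Vec<_> = locs
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || chunk.iter().map(|&loc| get_record(loc)).collect::<Vec<_>>()))
            .collect();
        // Joining in spawn order keeps the results in the same order as `locs`.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Calling parallel_get(&[3, 1, 2], 2) returns the three records in the original order. Fixed chunks like these can leave threads idle when some reads are slower than others, which is the problem work stealing is meant to address.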

Caching Isn’t Always Faster

My first attempt at reading records faster was simply to cache previous reads in an LruCache. This helped tremendously when the records were in the cache. However, because using the cache requires a mutable reference (even when you don’t want it to), I couldn’t put it in LogFile, whose methods I wanted to keep taking self by shared reference rather than by mutable reference. Instead I moved it into DataManager, and got creative with iterators. I partition the list of locations into those that are in the cache and those that are not. I then read the ones that need to come from disk in parallel, and copy the rest out of the cache. Finally I stitch it all together, returning the full list.
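The partition-and-stitch idea can be sketched roughly as follows. This is not the code from the branch: a plain HashMap stands in for the LruCache, the disk reads run sequentially rather than in parallel, and read_from_disk and get_all are hypothetical names. It also does a single lookup per location instead of a contains check followed by a separate get.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a disk read of the record at `loc`.
fn read_from_disk(loc: u64) -> String {
    format!("disk:{loc}")
}

/// Returns one record per location, in input order, using the cache where possible.
fn get_all(cache: &mut HashMap<u64, String>, locs: &[u64]) -> Vec<String> {
    // First pass: one lookup per location yields either a copy of the cached record or None.
    let mut out: Vec<Option<String>> = locs.iter().map(|loc| cache.get(loc).cloned()).collect();

    // Indices of the slots that missed; these are the reads that would go to disk.
    let misses: Vec<usize> = (0..out.len()).filter(|&i| out[i].is_none()).collect();
    for i in misses {
        let record = read_from_disk(locs[i]);
        cache.insert(locs[i], record.clone());
        out[i] = Some(record);
    }

    // Every slot is filled by now, so flattening drops nothing and keeps the order.
    out.into_iter().flatten().collect()
}
```

Keeping the output as a vector of optional slots means the stitching step is just filling in the blanks by index, so no separate merge of the two partitions is needed.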

As it turns out, this is not that much faster. Without caching I was getting a consistent 1.5s. With caching the first call takes 2.6s, but the calls after take 1.2s. So a super-small gain in performance, but with the requirement that all items must be in the cache. Even then, reading directly from disk in parallel is only 0.3s slower. (Note: all of these performance tests are done without optimizations. Enabling optimizations, it takes 0.4s.) I think it’s because of having to copy values around, but it might also have to do with needing to iterate through the list of locations twice, once calling contains, which might not be terribly fast.

So I could keep working on other parts of the code while I figured out how to make reading records faster, I created a branch. You can find the code relating to this post:

As always, thanks for reading!

One Reply to “Building a Log Store in Rust – Part 5”

  1. I have a bunch of things in my queue but I try to keep up with this.
    I found my first gripe about rust. This is just silly but I have to mention it because that’s how I am.
    Variables are by default immutable. Okay I get that, force constness and you can’t possibly have the const virus problem you have in c++, and you get all the default safety of having to go out of your way to specify something that can cause you grief in the future. I’m a fan (so far). Of course I imagine what you end up with instead of the const virus is the mutable virus in rust. But that at least makes some sense.

    But what’s the problem? Variables are inherently mutable. Why? Because THE DEFINITION OF VARIABLE IS SOMETHING THAT CHANGES. Sorry, just had to throw that out there. 🙂
    I’m not saying they should have picked a different word (‘memory storage identifier’), because being cool by coming up with a new word for “variable” is worse than misusing it a bit just to be cool.
    But it struck me as odd so I thought I’d mention it.
    And yeah, a const variable in c++ is just as incorrect, but it’s not the default.

    Anyway, so far, still a fan.

    Looping through anything twice will always be slower than not looping through something twice. Wasteful though it may be (I can’t read rust too well and I haven’t had time to sift through the program yet) but perhaps the second time you can use a hash created the first time?
