I think I finally have the on-disk format/files working, though I’m sure I’ll change them! It’s currently using 2 different serializers (both through the amazing serde crate): JSON and MessagePack. The log file writes the JSON logs directly to disk (each prefixed with the size of the JSON string via the RecordFile). This is a much more verbose format than MessagePack, but it is easier to debug at this stage. Also, converting it to MessagePack later should be fairly easy.
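To make the size-prefixed layout concrete, here is a minimal sketch. It assumes a 4-byte little-endian length prefix, which may well differ from what the real RecordFile does. The names write_record and read_record are made up for illustration, and the JSON is treated as opaque bytes so the sketch needs nothing beyond the standard library.

```rust
use std::io::{self, Read, Write};

/// Writes one record: a 4-byte little-endian length, then the payload bytes.
/// (The prefix width and byte order are assumptions for this sketch.)
fn write_record<W: Write>(out: &mut W, payload: &[u8]) -> io::Result<()> {
    if payload.len() > u32::MAX as usize {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "record too large"));
    }
    out.write_all(&(payload.len() as u32).to_le_bytes())?;
    out.write_all(payload)
}

/// Reads the next record. Returns `None` when the input ends before a full
/// length prefix can be read; a truncated payload is reported as an error.
fn read_record<R: Read>(input: &mut R) -> io::Result<Option<Vec<u8>>> {
    let mut len_buf = [0u8; 4];
    match input.read_exact(&mut len_buf) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
        Err(e) => return Err(e),
    }
    let mut payload = vec![0u8; u32::from_le_bytes(len_buf) as usize];
    input.read_exact(&mut payload)?;
    Ok(Some(payload))
}
```

Because the reader only needs the prefix to know where a record ends, swapping the JSON payload for MessagePack later would not change this framing.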
For the IndexFile I went with MessagePack for no other reason than I wanted to try something else, and what is written to disk isn’t very human-readable to start with. Each record holds the term (LogValue) found for the field being indexed, and an array of all the locations in the LogFile that have a log where the field’s value is that term. I then keep an in-memory map of term -> location in the index file. So the look-up for a particular field:term value will go: look up the term in the in-memory map -> read the record from the index file -> read the logs from the log file. I think this will be fairly efficient, but the only way to actually tell will be to test it.
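The three-step look-up can be sketched roughly like this. Everything here is a simplified stand-in: the function names are hypothetical, both files are anything that implements Read + Seek, records use an assumed 4-byte length prefix, and the index record is just a packed list of u64 log offsets rather than real MessagePack.

```rust
use std::collections::HashMap;
use std::io::{self, Read, Seek, SeekFrom};

/// Reads one length-prefixed record that starts at `offset`.
fn read_record_at<F: Read + Seek>(file: &mut F, offset: u64) -> io::Result<Vec<u8>> {
    file.seek(SeekFrom::Start(offset))?;
    let mut len_buf = [0u8; 4];
    file.read_exact(&mut len_buf)?;
    let mut payload = vec![0u8; u32::from_le_bytes(len_buf) as usize];
    file.read_exact(&mut payload)?;
    Ok(payload)
}

/// Appends a length-prefixed record and returns the offset where it starts.
fn append_record(buf: &mut Vec<u8>, payload: &[u8]) -> u64 {
    let offset = buf.len() as u64;
    buf.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    buf.extend_from_slice(payload);
    offset
}

/// Follows term -> index record -> log records.
fn lookup<I: Read + Seek, L: Read + Seek>(
    term_map: &HashMap<String, u64>,
    index_file: &mut I,
    log_file: &mut L,
    term: &str,
) -> io::Result<Vec<String>> {
    // Step 1: the in-memory map gives the record's offset in the index file.
    let index_offset = match term_map.get(term) {
        Some(&offset) => offset,
        None => return Ok(Vec::new()),
    };
    // Step 2: the index record lists every log location for this term
    // (assumed here to be consecutive little-endian u64 offsets).
    let index_record = read_record_at(index_file, index_offset)?;
    // Step 3: read each log from the log file.
    let mut logs = Vec::new();
    for chunk in index_record.chunks_exact(8) {
        let mut raw = [0u8; 8];
        raw.copy_from_slice(chunk);
        let log = read_record_at(log_file, u64::from_le_bytes(raw))?;
        logs.push(String::from_utf8_lossy(&log).into_owned());
    }
    Ok(logs)
}
```

In this sketch a look-up costs one seek into the index file plus one seek per matching log, which is the part that would need measuring.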
I’ve actually been struggling with what the on-disk format of the IndexFile should be. The problem I keep running into is that the records are not all the same size. Because of that, I cannot take an SSTable/LSM-Tree approach and use binary search inside the file to find a particular record. I could write a sorted list of (start, length) tuples that correspond to the actual records, and then binary search this list of tuples because they’d all be the same size. However, for each comparison I’d need to seek to and read the actual record. I need to look into how DBs like RethinkDB handle indexing variable-size records.
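Here is a rough sketch of that (start, length) table idea, mostly to show where the extra seek per comparison comes from. It is not what the code does today. The 16-byte entry layout, the function names, and the simplification that each record is just the raw term bytes are all assumptions.

```rust
use std::cmp::Ordering;
use std::io::{self, Read, Seek, SeekFrom};

/// Each table entry is a fixed 16 bytes: start (u64 LE), then length (u64 LE).
const ENTRY_SIZE: u64 = 16;

/// Builds an in-memory data buffer and offset table from already-sorted terms.
fn build(sorted_terms: &[&str]) -> (Vec<u8>, Vec<u8>) {
    let (mut data, mut table) = (Vec::new(), Vec::new());
    for term in sorted_terms {
        table.extend_from_slice(&(data.len() as u64).to_le_bytes());
        table.extend_from_slice(&(term.len() as u64).to_le_bytes());
        data.extend_from_slice(term.as_bytes());
    }
    (data, table)
}

/// Reads the i-th (start, length) entry from the fixed-size table.
fn read_entry<T: Read + Seek>(table: &mut T, i: u64) -> io::Result<(u64, u64)> {
    table.seek(SeekFrom::Start(i * ENTRY_SIZE))?;
    let mut buf = [0u8; 16];
    table.read_exact(&mut buf)?;
    let (mut start, mut len) = ([0u8; 8], [0u8; 8]);
    start.copy_from_slice(&buf[..8]);
    len.copy_from_slice(&buf[8..]);
    Ok((u64::from_le_bytes(start), u64::from_le_bytes(len)))
}

/// Binary searches the table; returns the entry index whose record equals `key`.
fn find_record<T: Read + Seek, D: Read + Seek>(
    table: &mut T,
    entry_count: u64,
    data: &mut D,
    key: &[u8],
) -> io::Result<Option<u64>> {
    let (mut lo, mut hi) = (0u64, entry_count);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        let (start, len) = read_entry(table, mid)?;
        // The cost mentioned above: every comparison seeks into the data file.
        data.seek(SeekFrom::Start(start))?;
        let mut record = vec![0u8; len as usize];
        data.read_exact(&mut record)?;
        match record.as_slice().cmp(key) {
            Ordering::Less => lo = mid + 1,
            Ordering::Greater => hi = mid,
            Ordering::Equal => return Ok(Some(mid)),
        }
    }
    Ok(None)
}
```

With n records this does about log2(n) table reads and the same number of data reads per look-up, versus a single hash look-up with the in-memory map.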
In the first post I commented that I wasn’t sure what the difference between iter() and into_iter() was… well, I found out when I tried to use the RecordFile iterator I’d created and couldn’t. The long-and-short of it is that into_iter() consumes self instead of borrowing it. Because of this, you cannot then use self after you’ve called into_iter(). This makes it really tough to actually do anything. Looking at the API details for Vec, this is very clear to me: fn into_iter(self) -> IntoIter<T> vs fn iter(&self) -> Iter<T>.
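A tiny example with a plain Vec shows the difference. The function names are made up for illustration.

```rust
/// Borrowing iteration: `iter()` takes `&self`, so the caller keeps the vector.
fn lengths_by_ref(logs: &[String]) -> Vec<usize> {
    logs.iter().map(|line| line.len()).collect()
}

/// Consuming iteration: `into_iter()` takes `self`, so the vector is moved
/// into this function and the caller cannot use it afterwards.
fn uppercase_owned(logs: Vec<String>) -> Vec<String> {
    logs.into_iter().map(|line| line.to_uppercase()).collect()
}

fn iter_vs_into_iter() -> (Vec<usize>, Vec<String>) {
    let logs = vec![String::from("info"), String::from("warning")];
    let lengths = lengths_by_ref(&logs); // `logs` is only borrowed here
    let shouted = uppercase_owned(logs); // `logs` is moved here
    // let again = lengths_by_ref(&logs); // would not compile: value was moved
    (lengths, shouted)
}
```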
Because the IntoIterator trait does not allow you to implement any other method signature, I was forced to implement the trait a second time for &'a mut RecordFile. Unfortunately, there is no great way (at least according to the folks on IRC) to reuse my existing IntoIterator implementation. This also led me to learn about interior mutability.
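Roughly, the second implementation looks like the sketch below. The RecordFile here is a hypothetical in-memory stand-in (a Cursor over length-prefixed records), not the real struct, and RecordIter is a made-up name.

```rust
use std::io::{Cursor, Read};

/// Hypothetical stand-in for the real RecordFile.
struct RecordFile {
    data: Cursor<Vec<u8>>,
}

impl RecordFile {
    fn new() -> Self {
        RecordFile { data: Cursor::new(Vec::new()) }
    }

    /// Appends one record with an assumed 4-byte little-endian length prefix.
    fn append(&mut self, payload: &[u8]) {
        let bytes = self.data.get_mut();
        bytes.extend_from_slice(&(payload.len() as u32).to_le_bytes());
        bytes.extend_from_slice(payload);
    }
}

/// Iterator that only borrows the file, so the file outlives the loop.
struct RecordIter<'a> {
    file: &'a mut RecordFile,
}

impl<'a> Iterator for RecordIter<'a> {
    type Item = Vec<u8>;

    fn next(&mut self) -> Option<Vec<u8>> {
        // Any read failure (including end of data) ends the iteration.
        let mut len_buf = [0u8; 4];
        self.file.data.read_exact(&mut len_buf).ok()?;
        let mut payload = vec![0u8; u32::from_le_bytes(len_buf) as usize];
        self.file.data.read_exact(&mut payload).ok()?;
        Some(payload)
    }
}

/// Implemented for the mutable reference, so `into_iter` consumes only the
/// borrow and not the RecordFile itself.
impl<'a> IntoIterator for &'a mut RecordFile {
    type Item = Vec<u8>;
    type IntoIter = RecordIter<'a>;

    fn into_iter(self) -> RecordIter<'a> {
        self.data.set_position(0); // always start from the first record
        RecordIter { file: self }
    }
}

/// Iterates twice over the same file, appending in between.
fn count_records_twice() -> (usize, usize) {
    let mut file = RecordFile::new();
    file.append(b"first");
    file.append(b"second");
    let mut first_pass = 0;
    for _record in &mut file {
        first_pass += 1;
    }
    file.append(b"third"); // the file is still usable after the loop
    let second_pass = (&mut file).into_iter().count();
    (first_pass, second_pass)
}
```

The mutable borrow is needed in this sketch because reading moves the cursor position, which is exactly the kind of thing interior mutability can hide behind a shared reference.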
My primitive understanding of it is that it applies when you have an immutable reference to a structure, yet you want to mutate the fields inside the structure. The way around this is using Rust’s RefCell. Basically, by wrapping a field in a RefCell you can call methods to mutate it, even if self is borrowed as immutable. There is a great blog post by Ricardo Martins that does a much better and more thorough job of explaining this topic than I ever could. He even has 2 other posts that go even deeper.
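As a small illustration of my current understanding, here is a sketch where a method takes &self and still updates a counter. The IndexReader struct and its fields are hypothetical and exist only for this example.

```rust
use std::cell::RefCell;

/// Hypothetical reader that counts look-ups even through a shared reference.
struct IndexReader {
    terms: Vec<String>,
    lookups: RefCell<u64>,
}

impl IndexReader {
    fn new(terms: Vec<String>) -> Self {
        IndexReader { terms, lookups: RefCell::new(0) }
    }

    /// Takes `&self`, yet mutates the counter through the RefCell.
    fn contains(&self, term: &str) -> bool {
        *self.lookups.borrow_mut() += 1;
        self.terms.iter().any(|t| t.as_str() == term)
    }

    fn lookup_count(&self) -> u64 {
        *self.lookups.borrow()
    }
}
```

The catch is that RefCell moves the borrow check to runtime: calling borrow_mut() while another borrow of the same cell is still alive panics instead of failing to compile.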
No Before and After in Unit Tests
The last thing I uncovered was that there is no “before” and “after” in unit tests in Rust/Cargo. If you want to run a snippet of code before and/or after each unit test, you must manually copy and paste it before and after every unit test. I believe this is mainly because unit tests are run in parallel in Rust/Cargo, and so something as simple as configuring a logger becomes much more difficult. Because all I’m looking to do at the moment is initialize my logger, it isn’t that big of a deal; however, I could see this becoming a bigger problem as the complexity of my code grows. There is, however, an RFC for this, so I have hope that this feature will come soon. Who knows, maybe I’ll even implement it 😉
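One workaround I have seen for the “before” half is a shared helper guarded by std::sync::Once, so the setup body runs only once even when tests run in parallel. This is a sketch, not what the project does: setup and INIT_RUNS are hypothetical names, the counter stands in for real logger initialization, and every test still needs its own call to the helper.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Once;

static INIT: Once = Once::new();
/// Counts how often the setup body actually ran (for demonstration only).
static INIT_RUNS: AtomicUsize = AtomicUsize::new(0);

/// Shared "before" step. Safe to call from many tests at once:
/// the closure runs exactly once per process.
fn setup() {
    INIT.call_once(|| {
        // A real project would initialize its logger here.
        INIT_RUNS.fetch_add(1, Ordering::SeqCst);
    });
}

#[test]
fn first_test() {
    setup();
    assert_eq!(INIT_RUNS.load(Ordering::SeqCst), 1);
}

#[test]
fn second_test() {
    setup();
    assert_eq!(INIT_RUNS.load(Ordering::SeqCst), 1);
}
```

This does nothing for “after” steps; for per-test cleanup, a guard value whose Drop implementation does the teardown is the usual substitute.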
I want to get in the habit of linking the last commit I made before writing this blog entry, so here it is: abb3940. This way, if you’re following along with the blog, you can see exactly what the code looked like at this point in time.
As always, thanks for reading!