
Building a key-value storage engine from scratch

An embedded, append-only key-value storage engine. The following text walks through some of the reasoning behind the decisions I had to make. It was written in Rust, but I don't go into many of the implementation details, so if you're not familiar with the language you can still follow the structural decisions. You can check out the repo at: https://github.com/viniciusvidaldev/appendix-db

set off

As the name suggests, a storage engine needs to store stuff, so first of all we need to define how and where. This project is an embedded storage system, so the where is easy, a file in the filesystem. Now, in what format? You might think of a .txt file, since most of what gets stored is text and it would be easy to debug, but it's not a good fit. A .txt file is meant to store ASCII (or UTF-8) text, and a storage engine has to accept arbitrary bytes, anything the user hands it. There's another problem: reserved bytes like newlines collide with user data and force escaping, and escaping means the on-disk size stops matching the real size. In the end I'd be paying for trade-offs I get nothing from, so I discarded this format. The next option, and the one I chose, was a raw binary file: a contiguous blob of bytes that follows whatever format I impose. No reserved characters, no escaping, no hidden costs. A linear sequence also means I can index into it by offset, something that'll matter in later steps.

format

I said that I could impose a format for each entry, now I had to define it. The first thing to decide would be how to separate entries from one another, maybe we could use a delimiter? Something like key1:item1,key2:item2,key3:item3. But there's a problem with this, the delimiter is a reserved byte, and any value containing , would collide with it and force escaping, the same problem I just rejected text for. And even if I picked a safer delimiter, to read a full entry I would still have to go byte by byte until I find it, I can't jump ahead. A clever approach to this is to use no delimiter at all. Instead, every entry says at the start how long it is. If I know the next entry is N bytes, I just read N bytes and stop, no delimiter to find, no scanning, and no byte is reserved. I'd put a small header at the start of every record telling the reader the size of what comes after, and then I can jump from one entry to the next just by reading the header and skipping ahead. So the record layout looks like this:

[key_len: u16][val_len: u32][key][value]

The two length fields come first, so the reader knows exactly how many bytes to pull for the key and the value before reading either. As you can see, both lengths have a defined data type, which already sets some constraints for this storage engine: keys can be up to 2^16 - 1 bytes, just under 64 KiB, while values can reach 2^32 - 1 bytes, just under 4 GiB.
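
To make that concrete, a minimal sketch of serializing a record at this stage, assuming little-endian integers (the endianness itself doesn't matter as long as reads and writes agree):

language: rust
// Minimal sketch: length-prefixed record, no checksum yet at this stage.
fn encode_record(key: &[u8], value: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(2 + 4 + key.len() + value.len());
    buf.extend_from_slice(&(key.len() as u16).to_le_bytes());
    buf.extend_from_slice(&(value.len() as u32).to_le_bytes());
    buf.extend_from_slice(key);
    buf.extend_from_slice(value);
    buf
}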

checksum

But there's still a hole in this format. The file is append-only, and writes can be interrupted, the machine could crash mid-write, lose power, whatever. When I open the file again, the last record might be only half-written, and the reader would happily decode garbage lengths and pull more garbage, and worse, it would be misaligned for every record after it. A solution is the good old concept of a checksum: compute a small fingerprint of the bytes in a record and store it alongside them. When I read a record back, I recompute the fingerprint from the bytes I just pulled and compare it with the stored one. If they match, the record is intact, if they don't, I know something went wrong and I stop there. I put it at the very start of each record because I can't trust any of the other bytes until I've verified them. For the algorithm I went with one I had never used before. SHA256 felt like overkill, it's designed to resist tampering by an adversary and produces a 32-byte digest, while here I just need to catch accidental corruption from a torn write. CRC32C does exactly that in 4 bytes, and it's hardware-accelerated on modern CPUs, so it's basically free to compute. So each entry looks like this now:

[checksum: u32][key_len: u16][val_len: u32][key][value]

A fixed-size header and the actual key/value.
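
The read-side check is a one-liner. A sketch using the crc32c crate (one of the Rust implementations of CRC32C), where the stored checksum covers everything after it in the record:

language: rust
// Recompute the fingerprint over the bytes after the checksum field
// and compare it with what was stored at write time.
fn verify(stored_checksum: u32, rest_of_record: &[u8]) -> bool {
    crc32c::crc32c(rest_of_record) == stored_checksum
}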

index / read

I can write records to disk now, but I haven't said anything about how to read them back. The naive approach would be to scan the file from the beginning every time someone asks for a key, reading record by record until I find a match. That works, but it's O(n) on every single get, and worse, if the same key was set multiple times I'd still have to scan the whole file to be sure I picked up the latest version. Not great. A better approach is to keep a small map in memory, keyed by the key, with the value's location on disk as the entry. Remember when I said the file would be indexable by offset? This is where it matters. The map looks something like this:

language: rust
HashMap<Key, (offset, value_len)>

And here's the nice part. The offset doesn't even need to point at the start of the record, it points straight at the value bytes, skipping the header and the key entirely. About as cheap as reading from a file gets. Writes do two things now, append the record to the file as before, and update the in-memory map with the new offset. If the same key gets set again later, the map just overwrites the previous entry, the old bytes are still on disk but the map only points at the latest one.
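
As a sketch, a set under these rules does the append and the map update in one go (using the crc32c crate again, with placeholder types rather than the repo's API):

language: rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{Seek, SeekFrom, Write};

fn set(
    file: &mut File,
    index: &mut HashMap<Vec<u8>, (u64, u32)>,
    key: &[u8],
    value: &[u8],
) -> std::io::Result<()> {
    // Encode lengths + key + value; the checksum covers all of it.
    let mut rest = Vec::with_capacity(2 + 4 + key.len() + value.len());
    rest.extend_from_slice(&(key.len() as u16).to_le_bytes());
    rest.extend_from_slice(&(value.len() as u32).to_le_bytes());
    rest.extend_from_slice(key);
    rest.extend_from_slice(value);
    let checksum = crc32c::crc32c(&rest);

    // Append at the tail.
    let record_start = file.seek(SeekFrom::End(0))?;
    file.write_all(&checksum.to_le_bytes())?;
    file.write_all(&rest)?;

    // Point the index straight at the value bytes: skip the 10-byte header
    // (checksum + key_len + val_len, no type byte yet) and the key.
    let value_offset = record_start + 10 + key.len() as u64;
    index.insert(key.to_vec(), (value_offset, value.len() as u32));
    Ok(())
}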

delete

So far I can write and read, but I haven't said anything about deleting. The obvious approach would be to find the record on disk and remove it, but that breaks the append-only rule that everything else has been built on. I can't mutate the file, I can only append to it.

The trick is to make delete a write, just a different kind of write. Instead of removing anything from disk, I append a new record that says "this key is gone". When replaying the log later, the reader sees the set first, then the delete record, and ends up with the key absent. The in-memory map follows the same logic, on a delete, just remove the entry. The old value bytes are still on disk, but nothing points at them anymore.

This means I now have two kinds of records, sets and deletes, and the format so far doesn't say anything about that, the reader has no way to tell them apart. So the header grows by one byte, an entry type byte that says what kind of record this is:

[checksum: u32][entry_type: u8][key_len: u16][val_len: u32][key][value]

One byte is plenty, I only have two kinds today, and even if I add more later, 256 record types is more than I'll ever need. One last thing, what if someone tries to delete a key that doesn't exist? I made this a no-op, not an error. The alternative would be to force every caller to check first, and that's just an annoying API. The end state is the same either way, the key isn't there, which is what the caller wanted.
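
A sketch of how the type byte could be modeled; the numeric values here are my choice, the repo may assign them differently:

language: rust
// The on-disk type byte, one value per record kind.
#[repr(u8)]
enum EntryType {
    Set = 0,
    Delete = 1,
}

impl EntryType {
    fn from_byte(b: u8) -> Option<EntryType> {
        match b {
            0 => Some(EntryType::Set),
            1 => Some(EntryType::Delete),
            _ => None, // unknown type: treat it like a failed check and stop
        }
    }
}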

replay

The index lives in memory, so it disappears every time the process restarts. The records on disk are the only durable thing in the system, and the index has to be rebuilt from them on start. Replay is also where I get to find out if the last shutdown was clean. The tail record might be half-written from a crash, and if I trust it without checking I'd be building the index on top of corruption and silently exposing garbage on the next read.

The naive way is two passes. First pass scans the whole log to validate every record and find where the trustworthy tail ends. Second pass walks it again to build the index. Imagine a 10GB log, that's 20GB of sequential I/O on every restart just to open the database. The work in both passes is almost identical anyway, walk records in order, parse the header, check the checksum, and the only thing that differs is what I do with each valid record once I've verified it. Folding them together costs nothing extra.

So replay is a single walk. Each record gets verified, then either fed into the index or skipped, and the moment anything fails the checks the walk stops there. The walk goes in file order, which is also write order, so when the same key is set twice the later record naturally overrides the earlier one. The position right before the failing record becomes the truncation point, and the writer chops everything after it before any new appends happen.
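
Putting it together, a sketch of the single walk; the field widths follow the format above, while little-endian and the type byte values (0 for set, 1 for delete) are my assumptions:

language: rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{BufReader, Read};

// Walk records in file order, verify each, apply it to the index,
// and stop at the first record that fails a check. Returns the index
// and the truncation point (the end of the last good record).
fn replay(file: File) -> std::io::Result<(HashMap<Vec<u8>, (u64, u32)>, u64)> {
    let mut reader = BufReader::new(file);
    let mut index = HashMap::new();
    let mut offset: u64 = 0;

    loop {
        // Fixed 11-byte header: checksum(4) + entry_type(1) + key_len(2) + val_len(4).
        let mut header = [0u8; 11];
        if reader.read_exact(&mut header).is_err() {
            break; // clean EOF or a torn header: stop here
        }
        let stored = u32::from_le_bytes(header[0..4].try_into().unwrap());
        let entry_type = header[4];
        let key_len = u16::from_le_bytes(header[5..7].try_into().unwrap()) as usize;
        let val_len = u32::from_le_bytes(header[7..11].try_into().unwrap()) as usize;

        let mut body = vec![0u8; key_len + val_len];
        if reader.read_exact(&mut body).is_err() {
            break; // torn body
        }
        // The checksum covers everything after the checksum field itself.
        if crc32c::crc32c(&[&header[4..], &body[..]].concat()) != stored {
            break; // corruption: nothing past this point can be trusted
        }

        let key = body[..key_len].to_vec();
        match entry_type {
            0 => {
                // Later sets of the same key overwrite earlier ones naturally.
                index.insert(key, (offset + 11 + key_len as u64, val_len as u32));
            }
            1 => {
                index.remove(&key); // a delete record (val_len would be 0)
            }
            _ => break, // unknown record type, treat like a failed check
        }
        offset += 11 + (key_len + val_len) as u64;
    }
    // `offset` is the truncation point the writer chops to before new appends.
    Ok((index, offset))
}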

handling multi-thread concurrency

Everything so far has assumed a single thread. Reads don't mutate anything and should be able to fan out across threads, writes need to be careful about what they share, and the in-memory index sits in between.

reads

A read looks up an offset in the index and pulls bytes from the file. Neither step mutates anything I own, but the file handle has an internal cursor that moves on every read, and that cursor turns into shared state the moment two threads use the same handle.

The fix I went with was pread (read_exact_at on Unix). The offset is a parameter to the read call itself. The kernel positions atomically, no cursor gets touched. Many threads can hammer the same handle at the same time and the reads never coordinate. No locks involved.
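
A sketch of the read path; read_exact_at takes &self, so any number of threads can share the same File:

language: rust
use std::fs::File;
use std::os::unix::fs::FileExt; // brings read_exact_at (pread) into scope

// Positioned read: the offset travels with the call, no shared cursor moves.
fn read_value(file: &File, offset: u64, len: u32) -> std::io::Result<Vec<u8>> {
    let mut buf = vec![0u8; len as usize];
    file.read_exact_at(&mut buf, offset)?;
    Ok(buf)
}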

writes

Every write appends to the tail of the log and updates the index. The writer caches the end-of-file offset to skip a syscall per append, and that cache is shared between any two threads that try to write. The append + the fsync + the index update also need to be one atomic step, otherwise writer A could log its record and writer B could update the index first, leaving the index pointing at the wrong record on disk.

A Mutex around the writer covers all of this. Every caller wants to mutate the writer state, append bytes, update the cached offset, modify the index. There's no path that just reads what's behind the lock, so the read/write split that RwLock is built around would be wasted here. Mutex is the lighter pick that says "one at a time", which is the only thing the writer needs.

index

The index sits in the middle, read by gets and written by sets and deletes. A Mutex would serialize every read behind every write, throwing away the pread concurrency the read path was built around. Readers don't conflict with each other, there's no reason to queue them, and RwLock is exactly the primitive for that.
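
Rough shape of the shared state under these choices (the names are mine, not the repo's):

language: rust
use std::collections::HashMap;
use std::fs::File;
use std::sync::{Mutex, RwLock};

// Everything a write touches lives behind the Mutex; the index gets
// its own RwLock so gets never queue behind each other.
struct Writer {
    file: File, // append handle
    tail: u64,  // cached end-of-file offset
}

struct Appendix {
    writer: Mutex<Writer>,
    index: RwLock<HashMap<Vec<u8>, (u64, u32)>>,
    reader: File, // shared pread handle, no lock needed
}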

compaction

The log only ever appends. Every overwrite and every delete leaves bytes on disk that the index will never point at again. Past a certain point those have to be reclaimed.

The in-memory index already defines what's live, so compaction is just materializing that into a fresh file beside the original. The original stays untouched until the new file is fully written and fsynced, so a crash at any point leaves the canonical data undisturbed. A rename swaps it in atomically once it's ready, and fsyncing the parent directory ensures that rename survives a power cut, since the kernel won't necessarily flush directory metadata on its own.
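
A sketch of that sequence on Unix; the temp filename is made up, and in practice the live records would be streamed out of the old file via the index rather than materialized up front:

language: rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

// `records` are the live entries, already re-encoded in the normal record format.
fn compact(dir: &Path, records: &[Vec<u8>]) -> std::io::Result<()> {
    let tmp_path = dir.join("appendix.db.compact");
    let mut tmp = File::create(&tmp_path)?;
    for record in records {
        tmp.write_all(record)?;
    }
    tmp.sync_all()?;                                 // new file durable before the swap
    fs::rename(&tmp_path, dir.join("appendix.db"))?; // atomic swap
    File::open(dir)?.sync_all()?;                    // fsync the parent directory
    Ok(())
}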

After the swap the old file descriptor still pins the old inode. Reads through it would pull bytes from a deleted file at offsets that no longer mean anything in the new one, so the reader has to be reopened. That swap happens under concurrent reads, and putting a lock on it would cancel the lock-free pread concurrency the read path was built around. ArcSwap makes the swap itself atomic and keeps reads a single atomic load.
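
A sketch with the arc-swap crate; a get does one atomic load, and compaction publishes the reopened file with a plain store:

language: rust
use arc_swap::ArcSwap;
use std::fs::File;
use std::os::unix::fs::FileExt;
use std::sync::Arc;

// Reader handle behind an ArcSwap: gets do a single atomic load,
// compaction swaps in the reopened file without blocking them.
struct ReadHandle {
    file: ArcSwap<File>,
}

impl ReadHandle {
    fn read_at(&self, offset: u64, len: u32) -> std::io::Result<Vec<u8>> {
        let file = self.file.load(); // atomic load, no lock
        let mut buf = vec![0u8; len as usize];
        file.read_exact_at(&mut buf, offset)?;
        Ok(buf)
    }

    fn swap_after_compaction(&self, new_file: File) {
        // In-flight reads keep the old Arc (and the old inode) alive until they finish.
        self.file.store(Arc::new(new_file));
    }
}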

The trigger is two conditions rather than one. A tiny mostly-dead file isn't worth the syscall overhead, and a huge mostly-live file isn't worth the rewrite. Both the volume and the waste ratio have to cross a threshold. Checking the ratio after every write has to be O(1), so the live byte count is maintained incrementally on every set and delete rather than derived from the index on demand.
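
The check itself stays trivial; the thresholds below are made up:

language: rust
// O(1) trigger: both the size floor and the waste ratio have to trip.
// `live_bytes` is maintained incrementally on every set and delete.
fn should_compact(file_len: u64, live_bytes: u64) -> bool {
    const MIN_FILE_LEN: u64 = 64 * 1024 * 1024; // don't bother below 64 MiB
    const MAX_LIVE_RATIO: f64 = 0.5;            // compact when over half is dead
    file_len >= MIN_FILE_LEN
        && (live_bytes as f64) / (file_len as f64) <= MAX_LIVE_RATIO
}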

next steps

large logs / large values

If logs get too large we could start segmenting the log into immutable closed segments plus one active segment. Compaction could then work per segment group rather than over the whole file, so its cost stops scaling with total data. Something like:

data/appendix.db/
  000001.log   closed, 256 MiB
  000002.log   closed, 256 MiB
  000003.log   closed, 256 MiB
  000004.log   active, ~73 MiB and growing

Writes always go to the active segment. Once 000004.log crosses the size threshold, the writer would fsync and close it, then open 000005.log as the new active segment. The record format inside each file stays exactly what we have today, we'd just have many small log files instead of one big one. The index entry would grow from (offset, value_len) to (segment_id, offset, value_len) so reads know which file to pread from. Compaction would then pick a group of cold segments (say the three with the lowest live-byte ratio), stream their live entries into a fresh segment, fsync, atomically swap the old files out, and update the index. The active segment never participates, so writes don't have to pause.
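
The index entry growth could be as simple as a small struct (field names are mine):

language: rust
// Sketch of the per-key entry once segments exist.
struct Location {
    segment_id: u32, // which .log file to pread from
    offset: u64,     // value bytes within that segment
    value_len: u32,
}
// The map becomes HashMap<Key, Location> instead of HashMap<Key, (offset, value_len)>.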

For oversized values, we could write anything past some threshold to a separate blob file and store just a pointer in the log, which would also let big payloads skip compaction rewrites. And once the keyset grows too large to keep in memory, the index would need to move partially onto disk, with sharded hash tables or an LSM-style structure being reasonable options for that.

recovery

Replay today is fine because the log is small, but once it grows to a few GB the open path becomes a problem. Every record has to be checksummed before the database is usable, so opening a 10GB log means 10GB of sequential I/O plus all the CRC work, every single time. At small sizes that's nothing, at large sizes that's minutes.

The first lever worth pulling is hint files. Every time a segment closes, we'd write a tiny sidecar next to it containing just (key, offset, value_len) triples, the bare minimum the index needs to be rebuilt. On the next open the closed segments wouldn't be touched at all, the hint files get loaded directly into the index, and only the active segment still needs a full scan. Something like:

data/appendix.db/
  000001.log
  000001.hint
  000002.log
  000002.hint
  000003.log   (active)

Hint files are basically free to write, they get produced once when a segment closes and never touched again, and they shrink open time from "walk every byte of every log file" to "walk the active segment", which is bounded by the rollover threshold.
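
A hint record could reuse the length-prefixed style of the main log, something like this (my sketch, not necessarily what a real implementation would pick):

[key_len: u16][offset: u64][value_len: u32][key]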

On top of that, a clean-shutdown marker would handle the rest. The setup is small. An explicit close() on Appendix writes an empty CLEAN file next to the segments and fsyncs the parent directory so the file's existence survives a power cut. Every append already fsyncs as it lands, so close has nothing else to flush.

On the next open, before any new writes happen, we check whether CLEAN exists and remove it. A present marker says the last shutdown went through close cleanly, the active segment ends on a complete record, and replay can just walk lengths to rebuild the index. A missing marker says something killed the database before close could run, and replay does the regular CRC-validating walk to find the last good record and truncate to it.

Only close creates CLEAN and open removes it before doing anything else, so every failure mode short of a graceful shutdown leaves no marker on the next open and lands on the slow path automatically. Panic, kill, power cut, all the same answer.
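
A sketch of the open-path half, with CLEAN as the marker name from above and the error handling simplified:

language: rust
use std::fs;
use std::path::Path;

// Consume the marker before any writes: present means fast replay,
// absent means the full CRC-validating walk plus truncation.
fn take_clean_marker(dir: &Path) -> bool {
    // Treats any failure to remove as "no marker"; a real implementation
    // would distinguish NotFound from other I/O errors.
    fs::remove_file(dir.join("CLEAN")).is_ok()
}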

secondary indexes

For secondary indexes the store has to know how to extract index keys from values. The caller would register an extractor per index, something like Fn(&[u8]) -> Vec<IndexKey>, that runs on every value as it's set. Each index would be its own HashMap<IndexKey, HashSet<PrimaryKey>>, or a BTreeMap if we want range queries. The mappings store primary keys, which stay stable across compaction, so compaction can move records around freely without invalidating any of the indexes.
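
A sketch of what registration and the write hook could look like; the names and signatures here are illustrative, not the repo's API:

language: rust
use std::collections::{HashMap, HashSet};

type PrimaryKey = Vec<u8>;
type IndexKey = Vec<u8>;

// One secondary index: an extractor plus the mapping it maintains.
struct SecondaryIndex {
    extract: Box<dyn Fn(&[u8]) -> Vec<IndexKey> + Send + Sync>,
    map: HashMap<IndexKey, HashSet<PrimaryKey>>, // BTreeMap here would buy range queries
}

impl SecondaryIndex {
    // Runs on every set; a delete would walk the extracted keys of the
    // old value to remove the primary key from each set again.
    fn on_set(&mut self, primary: &PrimaryKey, value: &[u8]) {
        for index_key in (self.extract)(value) {
            self.map.entry(index_key).or_default().insert(primary.clone());
        }
    }
}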

For persistence, the simple option is to rebuild on replay by running each extractor over every record, since the value is already in the log. If extractor cost on open becomes a problem, we'd write the extractor output to the log at set time under its own type byte, so replay reads the secondary entries back directly and the extractor only runs once per write.

DynamoDB takes a different angle worth holding onto. GSIs there are derivable from the item itself because the indexed fields live inside the value, so there's no separate index to persist. Appendix can't do that today since values are opaque bytes, but if the value format ever becomes structured, the storage engine could pull index keys out directly and the persistence question mostly goes away.