Test Project: A Toy Key-Value Database
Below is a specification of the core functionality and 3 optional extensions. Please complete the core functionality first, as it is a prerequisite for the extensions, and feel free to pick any subset of the extensions to work on.
Overview
For the purpose of this test project, this toy database keeps track of key-value mappings where keys are 64-bit integers and values are variable-length strings, and allows clients to query and update the mapping. More specifically, the database allows clients to:
- `set` a mapping by specifying a key-value pair, which either creates a new mapping or overrides an existing mapping;
- `get` a mapping's value given a 64-bit key;
- `remove` a mapping for a given 64-bit key.
The database supports persistent state, which means all changes should be persisted to file(s) on the disk after the updates are processed by the database, and will be available for future queries even after restarts.
For the purpose of this test project, we recommend using log-structured file storage to hold the persistent state of the database. A log-structured file storage uses an append-only file to hold all database updates, where new updates are only appended to the end of the file (thus the "log" in the name). Specifically:
- When processing a `set` update, we append a new record to the end of the file containing the key-value pair. Note that there might be other `set` updates for the same key earlier in the file already.
- When processing a `remove` update, we append a new record to the end of the file indicating the key is now removed. Note that the value for the key is still stored earlier in the file by previous updates, but we don't proactively remove it (yet).
The log-structured storage keeps all historical updates in the file, which has two consequences:
- If we start from scratch and replay all updates in the file sequentially, we should be able to recover the same final state.
- The file may contain many obsolete entries (garbage) - consider what happens if we update the same key 300 times - which means we can consider a) generating an index containing all keys that are present in the final state and the location of the latest value for each key, and b) running garbage collection to remove obsolete entries while making sure the file still recovers to the same final state (see Optional extension 1 below).
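The replay property above can be sketched in a few lines. This is an illustrative sketch, not a mandated design: the one-JSON-object-per-line record format and the function name `replay` are assumptions made here for demonstration.

```python
import json

def replay(log_path):
    """Rebuild the final key -> value state by replaying the log in order.

    Assumes one JSON record per line, e.g.
      {"op": "set", "key": 1, "value": "a"}
      {"op": "remove", "key": 1}
    """
    state = {}
    with open(log_path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if record["op"] == "set":
                state[record["key"]] = record["value"]  # later sets win
            elif record["op"] == "remove":
                state.pop(record["key"], None)          # tombstone the key
    return state
```

Replaying the same log always yields the same state, which is exactly the invariant garbage collection must preserve.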
Core functionality: a library to read and write local database
The library should provide a way to open a local database with a path (which could be the path to a file or a folder, depending on your design).
The library should provide at least these APIs:
- `set`: given an open database instance, set the associated value of a certain key. If an existing value was overwritten, return it.
- `remove`: given an open database instance, remove a certain key (and its associated value). If an existing value was removed, return it.
- `get`: given an open database instance, return the associated value for a certain key, if there is any.
Since a database is typically used to store large amounts of data, you should assume that we do not have enough memory to keep all the values in memory. For simplicity, you can assume that we do have enough memory to hold all the keys and some metadata in memory.
`set`, `remove`, and `get` should all require O(1) disk accesses (which can be amortized).
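One possible shape for this library surface is sketched below; the class name `KvDb` and its field names are illustrative assumptions, not part of the specification. Note that `set` and `remove` return the previous value (or `None`), as required above.

```python
from typing import Optional

class KvDb:
    """Illustrative API surface for the key-value library (names are assumptions)."""

    def __init__(self, path: str):
        self.path = path   # path to a file or folder, depending on the design
        self._index = {}   # in-memory index: key -> location of latest value on disk

    def set(self, key: int, value: str) -> Optional[str]:
        """Set key to value; return the previous value if one was overwritten."""
        ...

    def remove(self, key: int) -> Optional[str]:
        """Remove key; return the removed value if the key existed."""
        ...

    def get(self, key: int) -> Optional[str]:
        """Return the value for key, or None if the key is absent."""
        ...
```

The `Optional[str]` return types encode the "return the old value if there was one" contract directly in the signatures.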
You can use the log-structured storage described above to implement the storage of such a database. It is possible to implement this with a single data file. If using multiple data files, the number of files should be significantly smaller than the number of keys (for example, using one file per key is not a good idea).
Specifically, the implementation of the database operations should be:
- When opening the database, read all the data file(s) sequentially to build an in-memory index that maps keys to their values' locations in the data file(s).
- For `get`, use the in-memory index to find out whether a key exists and, if so, its value's location in the data file(s), then read the data file to return the value.
- For `set` and `remove`, simply append the command ("key `0` is set to `x`", "key `1` is removed") to the end of a data file, and update the in-memory index accordingly.
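The three bullets above can be sketched as a single-file implementation. This is a minimal sketch under assumed conventions (one JSON record per line, an index mapping each key to the byte offset of its latest `set` record); a real submission would need durability and error handling beyond this.

```python
import json
import os
from typing import Optional

class LogDb:
    """Single-file log-structured store: sketch only, names are illustrative."""

    def __init__(self, path: str):
        self._path = path
        self._index = {}  # key -> byte offset of the latest set record
        if os.path.exists(path):
            # Replay the log once on open to rebuild the in-memory index.
            with open(path, "rb") as f:
                offset = 0
                for line in f:
                    record = json.loads(line)
                    if record["op"] == "set":
                        self._index[record["key"]] = offset
                    else:  # "remove"
                        self._index.pop(record["key"], None)
                    offset += len(line)
        self._file = open(path, "ab")  # append-only handle for updates

    def _read_value(self, offset: int) -> str:
        # One seek + one read: O(1) disk accesses per get.
        with open(self._path, "rb") as f:
            f.seek(offset)
            return json.loads(f.readline())["value"]

    def get(self, key: int) -> Optional[str]:
        off = self._index.get(key)
        return None if off is None else self._read_value(off)

    def set(self, key: int, value: str) -> Optional[str]:
        old = self.get(key)  # previous value, if any, to return
        offset = self._file.tell()
        record = {"op": "set", "key": key, "value": value}
        self._file.write((json.dumps(record) + "\n").encode())
        self._file.flush()
        self._index[key] = offset
        return old

    def remove(self, key: int) -> Optional[str]:
        old = self.get(key)
        if key in self._index:
            self._file.write((json.dumps({"op": "remove", "key": key}) + "\n").encode())
            self._file.flush()
            del self._index[key]
        return old
```

Only offsets live in memory, so memory usage scales with the number of keys, not the total size of the values.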
Optional extension 1: garbage collection
To prevent disk space from growing unboundedly when the same keys are set and removed many times, some garbage collection mechanism needs to be implemented to remove the unnecessary `set` and `remove` commands stored in the data file(s). If garbage collection is correctly implemented, the disk space usage should never be an order of magnitude larger than the total size of the data currently in the database.
For the purpose of this project, it's okay if you do garbage collection in a "pause-the-world" style without any concurrency or multi-threading. It is desirable if your garbage collection mechanism preserves the property that `set`, `remove`, and `get` all require amortized O(1) disk accesses.
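A pause-the-world compaction can be sketched as: replay the log, write only live entries to a fresh file, then atomically swap it in. This sketch assumes the same illustrative JSON-lines format used above; for brevity it buffers the live state in memory, whereas a real implementation would stream records in bounded memory since we cannot assume all values fit in RAM.

```python
import json
import os

def compact(log_path: str) -> None:
    """Rewrite the log so it contains exactly one set record per live key."""
    # Replay to find the final state (the same recovery logic as startup).
    state = {}
    with open(log_path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if record["op"] == "set":
                state[record["key"]] = record["value"]
            else:
                state.pop(record["key"], None)
    # Write live entries to a temporary file, then atomically replace the
    # old log so a crash mid-compaction never loses the original file.
    tmp_path = log_path + ".compact"
    with open(tmp_path, "w", encoding="utf-8") as f:
        for key, value in state.items():
            f.write(json.dumps({"op": "set", "key": key, "value": value}) + "\n")
    os.replace(tmp_path, log_path)
```

After compaction the file size is proportional to the live data, and replaying the compacted log yields the same state as replaying the original, preserving the recovery invariant.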
Optional extension 2: implement an in-memory cache for key-value pairs
In the Core functionality implementation, the `get` operation is implemented by locating the data in the file using the in-memory index and then reading the file. In this section, please implement an in-memory cache for the key-value pairs. Feel free to design a cache management policy that you find reasonable, but please:
- provide a short one-sentence comment describing the cache management policy you designed;
- make sure the total memory usage of the cache does not exceed 16MB (only including the keys and values stored in the cache, excluding overhead of book-keeping data structures).
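One reasonable policy choice (ours, not mandated by the spec) is a byte-bounded LRU cache: charge each entry 8 bytes for the 64-bit key plus the encoded value size, and evict the least recently used pair once the budget is exceeded. A sketch:

```python
from collections import OrderedDict

class LruCache:
    """Byte-bounded LRU cache for key-value pairs (illustrative sketch).

    Policy: evict the least recently used pair once keys + values exceed
    the byte budget; bookkeeping overhead is not counted, per the spec.
    """

    def __init__(self, capacity_bytes: int = 16 * 1024 * 1024):
        self._capacity = capacity_bytes
        self._used = 0
        self._items = OrderedDict()  # key -> value, least recently used first

    @staticmethod
    def _size(key: int, value: str) -> int:
        return 8 + len(value.encode("utf-8"))  # 64-bit key + value bytes

    def get(self, key: int):
        if key not in self._items:
            return None  # miss: caller falls back to the index + data file
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: int, value: str) -> None:
        if key in self._items:
            self._used -= self._size(key, self._items.pop(key))
        self._items[key] = value
        self._used += self._size(key, value)
        while self._used > self._capacity:  # evict LRU entries to fit budget
            old_key, old_value = self._items.popitem(last=False)
            self._used -= self._size(old_key, old_value)
```

On a `get` miss, the database falls back to the index-plus-file path from the core functionality and then `put`s the result into the cache.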
Optional extension 3: implement a database server and client tool
Most databases work by serving network connections from clients. Please implement a database server that serves client requests over some TCP-based protocol.
The server should be an executable that listens on a TCP port and is able to handle concurrent connections from multiple clients.
The client should be a command-line tool that can be used to connect to a server, perform `set`, `remove`, and `get` operations, and close the connection. The client tool should have at least 3 sub-commands: `set`, `remove`, and `get`, corresponding to the server operations. Each sub-command takes the address (host and port) of the database as an argument, as well as any other necessary arguments, and prints the result to stdout.
You should feel free to use an existing RPC framework such as gRPC, Thrift, CapnProto, or similar. If you have no preference, we suggest using gRPC as it's mature, relatively easy to use, and well-documented.
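If you would rather see the shape of the protocol without a framework, a hand-rolled alternative might look like the sketch below: one JSON request per line, one JSON response per line, with a threading server for concurrent connections. The wire format, handler, and the module-level dict standing in for the core library are all illustrative assumptions.

```python
import json
import socket
import socketserver

STORE = {}  # stand-in for the core-functionality database library

class Handler(socketserver.StreamRequestHandler):
    """Serves one connection: reads request lines until the client closes."""

    def handle(self):
        for line in self.rfile:  # one JSON request per line
            req = json.loads(line)
            if req["cmd"] == "set":
                old = STORE.get(req["key"])
                STORE[req["key"]] = req["value"]
                resp = {"ok": True, "value": old}
            elif req["cmd"] == "remove":
                resp = {"ok": True, "value": STORE.pop(req["key"], None)}
            else:  # "get"
                resp = {"ok": True, "value": STORE.get(req["key"])}
            self.wfile.write((json.dumps(resp) + "\n").encode())

def request(host: str, port: int, **req):
    """Tiny client helper: send one request line, read one response line."""
    with socket.create_connection((host, port)) as conn:
        conn.sendall((json.dumps(req) + "\n").encode())
        return json.loads(conn.makefile().readline())
```

`socketserver.ThreadingTCPServer(("0.0.0.0", port), Handler)` gives one thread per connection; the client sub-commands (`set`, `remove`, `get`) each map to one `request(...)` call whose result is printed to stdout.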