Test Project: A Toy Key-Value Database
Below is a specification of the core functionality and 3 optional extensions. Please complete the core functionality first, as it is a prerequisite for the extensions, and feel free to pick any subset of the extensions to work on.
Overview
For the purpose of this test project, this toy database keeps track of key-value mappings where keys are 64-bit integers and values are variable-length strings, and allows clients to query and update the mapping. More specifically, the database allows clients to:
- `set` a mapping by specifying a key-value pair, which either creates a new mapping or overrides an existing mapping;
- `get` a mapping's value given a 64-bit key;
- `remove` a mapping for a given 64-bit key.
The database supports persistent state, which means all changes should be persisted to file(s) on the disk after the updates are processed by the database, and will be available for future queries even after restarts.
For the purpose of this test project, we recommend using log-structured file storage to hold the persistent state of the database. A log-structured file storage uses an append-only file to hold all database updates, where new updates are only appended to the end of the file (thus the "log" in the name). Specifically:
- When processing a `set` update, we append a new record to the end of the file containing the key-value pair. Note that there might be other `set` updates for the same key earlier in the file already.
- When processing a `remove` update, we append a new record to the end of the file indicating the key is now removed. Note that the value for the key is still stored earlier in the file by previous updates, but we don't proactively remove it (yet).
The log-structured storage keeps all historical updates in the file, which has two consequences:
- If we start from scratch and replay all updates in the file sequentially, we should be able to recover the same final state.
- The file may contain many obsolete entries (garbage) - consider what happens if we update the same key 300 times - which means we can consider a) generating an index containing all keys that are present in the final state and the location of the latest value for each key, and b) running garbage collection to remove obsolete entries while making sure the file still recovers to the same final state (see Optional extension 1 below).
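The replay property above can be sketched in a few lines. This is an illustrative sketch, not a mandated design: the one-JSON-object-per-line record format and the function name `replay` are assumptions made here for demonstration.

```python
import json

def replay(log_path):
    """Rebuild the final key -> value state by replaying the log in order.

    Assumes one JSON record per line, e.g.
      {"op": "set", "key": 1, "value": "a"}
      {"op": "remove", "key": 1}
    """
    state = {}
    with open(log_path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if record["op"] == "set":
                state[record["key"]] = record["value"]  # later sets win
            elif record["op"] == "remove":
                state.pop(record["key"], None)          # tombstone the key
    return state
```

Replaying the same log always yields the same state, which is exactly the invariant garbage collection must preserve.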
Core functionality: a library to read and write local database
The library should provide a way to open a local database with a path (which could be the path to a file or a folder, depending on your design).
The library should provide at least these APIs:
- `set`: given an open database instance, set the associated value of a certain key. If an existing value was overwritten, return it.
- `remove`: given an open database instance, remove a certain key (and its associated value). If an existing value was removed, return it.
- `get`: given an open database instance, return the associated value for a certain key, if there is any.
Since a database is typically used to store large amounts of data, you should assume that we do not have enough memory to keep all the values in memory. For simplicity, you can assume that we do have enough memory to hold all the keys and some metadata in memory.
`set`, `remove`, and `get` should all require O(1) disk accesses (which can be amortized).
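One possible shape for this library surface is sketched below; the class name `KvDb` and its field names are illustrative assumptions, not part of the specification. Note that `set` and `remove` return the previous value (or `None`), as required above.

```python
from typing import Optional

class KvDb:
    """Illustrative API surface for the key-value library (names are assumptions)."""

    def __init__(self, path: str):
        self.path = path   # path to a file or folder, depending on the design
        self._index = {}   # in-memory index: key -> location of latest value on disk

    def set(self, key: int, value: str) -> Optional[str]:
        """Set key to value; return the previous value if one was overwritten."""
        ...

    def remove(self, key: int) -> Optional[str]:
        """Remove key; return the removed value if the key existed."""
        ...

    def get(self, key: int) -> Optional[str]:
        """Return the value for key, or None if the key is absent."""
        ...
```

The `Optional[str]` return types encode the "return the old value if there was one" contract directly in the signatures.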
You can use the log-structured storage described above to implement the storage of such a database. It is possible to implement this with a single data file. If using multiple data files, the number of files should be significantly smaller than the number of keys (for example, using one file per key is not a good idea).
Specifically, the implementation of the database operations should be:
- When opening the database, read all the data file(s) sequentially to build an in-memory index that maps keys to their values' locations in the data file(s).
- For `get`, use the in-memory index to find out whether a key exists and, if so, its value's location in the data file(s), then read the data file to return the value.
- For `set` and `remove`, simply append the command ("key `0` is set to `x`", "key `1` is removed") to the end of a data file, and update the in-memory index accordingly.
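The three bullets above can be sketched as a single-file implementation. This is a minimal sketch under assumed conventions (one JSON record per line, an index mapping each key to the byte offset of its latest `set` record); a real submission would need durability and error handling beyond this.

```python
import json
import os
from typing import Optional

class LogDb:
    """Single-file log-structured store: sketch only, names are illustrative."""

    def __init__(self, path: str):
        self._path = path
        self._index = {}  # key -> byte offset of the latest set record
        if os.path.exists(path):
            # Replay the log once on open to rebuild the in-memory index.
            with open(path, "rb") as f:
                offset = 0
                for line in f:
                    record = json.loads(line)
                    if record["op"] == "set":
                        self._index[record["key"]] = offset
                    else:  # "remove"
                        self._index.pop(record["key"], None)
                    offset += len(line)
        self._file = open(path, "ab")  # append-only handle for updates

    def _read_value(self, offset: int) -> str:
        # One seek + one read: O(1) disk accesses per get.
        with open(self._path, "rb") as f:
            f.seek(offset)
            return json.loads(f.readline())["value"]

    def get(self, key: int) -> Optional[str]:
        off = self._index.get(key)
        return None if off is None else self._read_value(off)

    def set(self, key: int, value: str) -> Optional[str]:
        old = self.get(key)  # previous value, if any, to return
        offset = self._file.tell()
        record = {"op": "set", "key": key, "value": value}
        self._file.write((json.dumps(record) + "\n").encode())
        self._file.flush()
        self._index[key] = offset
        return old

    def remove(self, key: int) -> Optional[str]:
        old = self.get(key)
        if key in self._index:
            self._file.write((json.dumps({"op": "remove", "key": key}) + "\n").encode())
            self._file.flush()
            del self._index[key]
        return old
```

Only offsets live in memory, so memory usage scales with the number of keys, not the total size of the values.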
Optional extension 1: garbage collection
To prevent disk space from growing unboundedly when the same keys are set and removed many times, some garbage collection mechanism needs to be implemented to remove the unnecessary `set` and `remove` commands stored in the data file(s). If garbage collection is correctly implemented, the disk space usage should never be an order of magnitude larger than the total size of the data currently in the database.
For the purpose of this project, it's okay if you do garbage collection in a "pause-the-world" style without any concurrency or multi-threading. It is desirable if your garbage collection mechanism preserves the property that `set`, `remove`, and `get` all require amortized O(1) disk accesses.
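A pause-the-world compaction can be sketched as: replay the log, write only live entries to a fresh file, then atomically swap it in. This sketch assumes the same illustrative JSON-lines format used above; for brevity it buffers the live state in memory, whereas a real implementation would stream records in bounded memory since we cannot assume all values fit in RAM.

```python
import json
import os

def compact(log_path: str) -> None:
    """Rewrite the log so it contains exactly one set record per live key."""
    # Replay to find the final state (the same recovery logic as startup).
    state = {}
    with open(log_path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if record["op"] == "set":
                state[record["key"]] = record["value"]
            else:
                state.pop(record["key"], None)
    # Write live entries to a temporary file, then atomically replace the
    # old log so a crash mid-compaction never loses the original file.
    tmp_path = log_path + ".compact"
    with open(tmp_path, "w", encoding="utf-8") as f:
        for key, value in state.items():
            f.write(json.dumps({"op": "set", "key": key, "value": value}) + "\n")
    os.replace(tmp_path, log_path)
```

After compaction the file size is proportional to the live data, and replaying the compacted log yields the same state as replaying the original, preserving the recovery invariant.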
Optional extension 2: implement an in-memory cache for key-value pairs
In the Core functionality implementation, the `get` operation is implemented by locating the data in the file using the in-memory index and then reading the file. In this section, please implement an in-memory cache for the key-value pairs. Feel free to design a cache management policy that you find reasonable, but please:
- provide a short one-sentence comment describing the cache management policy you designed;
- make sure the total memory usage of the cache does not exceed 16MB (only including the keys and values stored in the cache, excluding overhead of book-keeping data structures).
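One reasonable policy choice (ours, not mandated by the spec) is a byte-bounded LRU cache: charge each entry 8 bytes for the 64-bit key plus the encoded value size, and evict the least recently used pair once the budget is exceeded. A sketch:

```python
from collections import OrderedDict

class LruCache:
    """Byte-bounded LRU cache for key-value pairs (illustrative sketch).

    Policy: evict the least recently used pair once keys + values exceed
    the byte budget; bookkeeping overhead is not counted, per the spec.
    """

    def __init__(self, capacity_bytes: int = 16 * 1024 * 1024):
        self._capacity = capacity_bytes
        self._used = 0
        self._items = OrderedDict()  # key -> value, least recently used first

    @staticmethod
    def _size(key: int, value: str) -> int:
        return 8 + len(value.encode("utf-8"))  # 64-bit key + value bytes

    def get(self, key: int):
        if key not in self._items:
            return None  # miss: caller falls back to the index + data file
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: int, value: str) -> None:
        if key in self._items:
            self._used -= self._size(key, self._items.pop(key))
        self._items[key] = value
        self._used += self._size(key, value)
        while self._used > self._capacity:  # evict LRU entries to fit budget
            old_key, old_value = self._items.popitem(last=False)
            self._used -= self._size(old_key, old_value)
```

On a `get` miss, the database falls back to the index-plus-file path from the core functionality and then `put`s the result into the cache.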
Optional extension 3: implement a database server and client tool
Most databases work by serving network connections from clients. Please implement a database server that serves client requests over some TCP-based protocol.
The server should be an executable that listens on a TCP port and is able to handle concurrent connections from multiple clients.
The client should be a command-line tool that can be used to connect to a server, perform `set`, `remove`, and `get` operations, and close the connection. The client tool should have at least 3 sub-commands: `set`, `remove`, and `get`, corresponding to the server operations. Each sub-command takes the address (host and port) of the database as an argument, as well as any other necessary arguments, and prints the result to stdout.
You should feel free to use an existing RPC framework such as gRPC, Thrift, CapnProto, or similar. If you have no preference, we suggest using gRPC as it's mature, relatively easy to use, and well-documented.
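If you would rather see the shape of the protocol without a framework, a hand-rolled alternative might look like the sketch below: one JSON request per line, one JSON response per line, with a threading server for concurrent connections. The wire format, handler, and the module-level dict standing in for the core library are all illustrative assumptions.

```python
import json
import socket
import socketserver

STORE = {}  # stand-in for the core-functionality database library

class Handler(socketserver.StreamRequestHandler):
    """Serves one connection: reads request lines until the client closes."""

    def handle(self):
        for line in self.rfile:  # one JSON request per line
            req = json.loads(line)
            if req["cmd"] == "set":
                old = STORE.get(req["key"])
                STORE[req["key"]] = req["value"]
                resp = {"ok": True, "value": old}
            elif req["cmd"] == "remove":
                resp = {"ok": True, "value": STORE.pop(req["key"], None)}
            else:  # "get"
                resp = {"ok": True, "value": STORE.get(req["key"])}
            self.wfile.write((json.dumps(resp) + "\n").encode())

def request(host: str, port: int, **req):
    """Tiny client helper: send one request line, read one response line."""
    with socket.create_connection((host, port)) as conn:
        conn.sendall((json.dumps(req) + "\n").encode())
        return json.loads(conn.makefile().readline())
```

`socketserver.ThreadingTCPServer(("0.0.0.0", port), Handler)` gives one thread per connection; the client sub-commands (`set`, `remove`, `get`) each map to one `request(...)` call whose result is printed to stdout.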