Assignment 4 - B+-Tree
In this assignment you are implementing a B+-tree index. The index should be backed up by a page file and pages of the index should be accessed through your buffer manager. As discussed in the lecture each node should occupy one page. However, for debugging purposes you should support trees with a smaller fan-out and still let each node occupy a full page. A B+-tree stores pointer to records (the
RID introduced in the last assignment) index by a keys of a given datatype. The assignment only requires you to support
DT_INT (integer) keys (see optional extensions). Pointers to intermediate nodes should be represented by the page number of the page the node is stored in. To make testing easier your implementation should follow the conventions stated below.
- Leaf Split: In case a leaf node need to be split during insertion and
nis even, the left node should get the extra key. E.g, if
n = 2and we insert a key 4 into a node
[1,5], then the resulting nodes should be
. For odd values of
nwe can always evenly split the keys between the two nodes. In both cases the value inserted into the parent is the smallest value of the right node.
- Non-Leaf Split: In case a non-leaf node needs to be split and
nis odd, we cannot split the node evenly (one of the new nodes will have one more key). In this case the "middle" value inserted into the parent should be taken from the right node. E.g., if
n = 3and we have to split a non-leaf node
[1,3,4,5], the resulting nodes would be
. The value inserted into the parent would be
- Leaf Underflow: In case of a leaf underflow your implementation should first try to
redistributevalues from a sibling and only if this fails merge the node with one of its siblings. Both approaches should prefer the left sibling. E.g., if we can borrow values from both the left and right sibling, you should borrow from the left one.
You can use the B+-tree shown below to sanity check your implementation of inserts. It has been generated by the following sequence of key insertions
1,11,13,17,23,52 for RIDs
Index Manager Functions
These functions are used to initialize the index manager and shut it down, freeing up all acquired resources.
These functions are used to create or delete a b-tree index. The later should also remove the corresponding page file. Furthermore, before a client can access an b-tree index it has to be opened (
openBtree). When closing a b-tree (
closeBtree), the index manager should ensure that all new or modified pages of the index are flushed back to disk (use the buffer manager function for that).
These functions are used to find, insert, and delete keys in/from a given B+-tree.
findKey return the
RID for the entry with the search key in the b-tree. If the key does not exist this function should return
insertKey inserts a new key and record pointer pair into the index. It should return error code
RC_IM_KEY_ALREADY_EXISTS if this key is already stored in the b-tree.
deleteKey removes a key (and corresponding record pointer) from the index. It should return
RC_IM_KEY_NOT_FOUND if the key is not in the index. For deletion it is up to the client whether this is handled as an error.
Furthermore, clients can scan through all entries of a BTree in sort order using the
closeTreeScan methods. The
nextEntry method should return
RC_IM_NO_MORE_ENTRIES if there are no more entries to be returned (the scan has gone beyond the last entry of the B+-tree. Below is an example of how the scan can be used.
printTree is used to create a string representation of a b-tree. It is used in the test cases and can be helpful for debugging. You code has to generate the format described in the following. Each node of a b-tree is represented as one line in the string representation. The order of nodes should be in depth-first pre-order. The following representation should be used for each node
(pos)[pointer,key,pointer, ...]. Key should be represented using the serializeValue method from the last assignment. Pointers to nodes should be shown as the node position according to depth-first pre-order traversal. Thus, the string representations generated by
printTree is independent of the actual positions of the index nodes on disk.
RID=s should be represented as =PageNumber.Slot. As an example reconsider the b-tree shown above. Its string representation is shown below.
(0)[1,13,2,23,3] (1)[1.1,1,2.3,11,2] (2)[1.2,13,3.5,17,3] (3)[4.4,23,3.2,52]
Integrate the B+-tree with the record manager: Modify your record manager to use the B+-tree for search. That is when creating a new relation with the record manager you should let the user specify which attribute is the key of the table. The record manager should then create and maintain the B+-tree for storing the keys from this attribute. When a new record is inserted, the corresponding key should be inserted in the B+-tree. A scan that searches for a record with a given key should use the B+-tree instead of scanning through all the records. If you want to support keys with more than one attribute, you are allowed to change the b-tree interface to support that.
Allow different data types as keys: Allow the other data types introduced in assignment 3 to be used as keys for the B+-tree.
Implement pointer swizzling: Once a B-tree node is memory, conceptual pointers to that node (
pagenum) should be actual memory pointers. Note that B-trees pages are only pointed to by their intermediate parent node. Thus, a simple way to realize swizzling at page load time, is to immediately replace the pointer to a child node with the memory pointer once some code tries to access it. The complicated part is to make sure that memory pointers are replaced with
pagenum if the buffer manager evicts a page from the buffer that was pointed to by a memory pointer. A reasonable way to implement this is to create a bookkeeping data structure that stores all places of pointers to a node that have been replaced by memory pointers. The buffer manager has to be extended with a callback function parameter. This callback is called whenever a page is evicted. For this extension the callback would be a clean-up function that accesses the bookkeeping data structure to figure out where it has to replace memory pointers with page numbers. This data structure is also needed when a node gets evicted that has memory pointer to one or more of its child nodes.
Support multiple entries for one key: Instead of enabling only a single
RID to be stored for each key, allow insertions of multiple entries with the same key. You are allowed to change the
btree_mgr.h to support accessing all the entries for a key. Please do not change the signature of the
findKey method, but rather add a new
findAllEntries method that returns an array/list of entries. The
findKey should return the first entry only. Instead of
RIDs the pointers in leave nodes should reference a page storing all entries for a key. If there are more entries for a key then you can fit onto one page, additional pages should be organized as a linked list. That is each page stores a pointer to the next page.
Source Code Structure
Your source code directories should be structured as follows. You should reuse your existing storage manager and buffer manager implementations (also record manager if you plan to implement the record manager integration extension). So before you start to develop, please copy these implementations into the assign4 folder.
Put all source files in a folder
assign4 in your git repository
This folder should contain at least …
- the provided header and C files
- a make file for building your code
- a bunch of
*.hfiles implementing the B+-tree
README.md: A markdown or text file with a brief description of your solution
E.g., the structure may look like that:
Defines several helper methods for implementing test cases such as
This file implements several test cases using the
expr.h interface. Please let your make file generate a
test_expr binary for this code. You are encouraged to extend it with new test cases or use it as a template to develop your own test files.
This file implements several test cases using the
btree_mgr.h interface. Please let your make file generate a
test_assign4 binary for this code. You are encouraged to extend it with new test cases or use it as a template to develop your own test files.