Assignment Two – 25%
Algorithms and Data Structures – COMP3506/7505 – Semester 2, 2024
Due: 3pm on Friday October 18th (week 12)
The main objective of this assignment is to extend your knowledge from assignment one to build more complex data structures and solve more complex problems. In particular, you will be working on graph and compression problems. In this second assignment, we will leave more of the design choices up to you (like the k-mers part in A1). This assessment will make up 25% of your total grade. We recommend you start early.
A Getting Started
The assignment is structured similarly to assignment one. The skeleton codebase, data, software dependencies, and implementation rules are described below. Rules for success: Think before you code. Think before you post an Ed question. Use a pen and paper. Don’t be afraid to be wrong. Give yourself time to think. Start thinking about these problems early. Read the entire spec before you do anything at all.
Codebase
The codebase contains a number of data structure stubs that you must implement, as well as some scripts that allow your code to be tested. Figure 1 shows a snapshot of the project directory tree with the different files categorized. Note that we provide you with (simplified) versions of the data structures built during assignment one. You are permitted to modify any of the files listed. You may also use structures/util.py for any utilities that do not deserve their own file, or add your own data structures if you think they may help; store them in their own files inside the structures directory.
Data
We also provide a number of test graphs for you to use, but you are encouraged to build further test graphs of your own; you may also share your test graphs with other students if you wish. Each graph is represented as a simple text file that stores an adjacency list for each vertex in the graph. There are three specific types of graphs, each with their own subdirectory. All graph types are undirected. 4N graphs are simple graphs where each vertex can be thought of as occupying a position on a square grid/lattice. As such, these nodes can have at most 4 neighbours. KN graphs are an extension that allow an arbitrary number of neighbors. POSW graphs extend KN graphs to apply positive integer weights to edges. The appendix in Section M contains an example of each graph type.
Dependencies
Our codebase is written for Python 3.10+ as we have provided type annotations; as such, you will need to use Python 3.10 at minimum. The second assignment has one special dependency – the curses library – that allows your algorithms to be visualized in a simple terminal window.
Summary
structures/
    dynamic_array.py
    linked_list.py
    bit_vector.py
    entry.py
    graph.py
    pqueue.py
    map.py
    bloom_filter.py
    util.py
algorithms/
    pathfinding.py
    problems.py
    compression.py
test_structures.py
test_pathfinding.py
test_problems.py
Figure 1 The directory tree organized by data structures (inside the structures directory), and the three executable programs (in the root directory, coloured orange).
4N/
    one.graph
    two.graph
    three.graph
    four.graph
    five.graph
KN/
    one.graph
    two.graph
    three.graph
    four.graph
    five.graph
POSW/
    one.graph
    two.graph
    three.graph
    four.graph
    five.graph
Figure 2 The data tree organized by graph types. 4N are the simplest grid-based graphs. KN are graphs where each node has an arbitrary degree. POSW are graphs with arbitrary degree nodes and positive weights on the edges.
If you are developing locally, you may need to install curses. See the documentation (https://docs.python.org/3/howto/curses.html) for more information. This library is already available on moss.
Note that you can do the entire assignment without using the visualizer, but it will be less fun and you won’t be able to show off to your friends. The visualizer is only useful for the earlier pathfinding solutions on grids (Task 2), and it must be executed in a terminal window.
Implementation Rules
The following list outlines some important information regarding the skeleton code, and your implementation. If you have any doubts, please ask on Ed discussion.
❙ The code is written in Python and, in particular, should be executed with Python 3.10 or higher. The EAIT student server, moss, has Python 3.11 installed. We recommend using moss for the development and testing of your assignment, but you can use your own system if you wish.
❙ You are not allowed to use built-in methods or data structures – this is an algorithms and data structures course, after all. If you want to use a dict (aka {}), you will need to implement that yourself. Lists can be used as “dumb arrays” by manually allocating space like myArray = [None] * 10 but you may not use built-ins like clear, count, copy, extend, index, insert, remove, reverse, sort, min, max, and so on. List functions like sorted, reversed, zip are also banned. Similarly, don’t use any other collections or structures such as set. You cannot use the default hash function. Be sensible – if you need the functionality provided by these methods, you may implement them yourself.
❙ You are not allowed to use libraries such as numpy, pandas, scipy, collections, etc.
❙ Exceptions: The only additional libraries you can use are random, math, and functools (but only for the total_ordering decorator). You are allowed to use range and enumerate to handle looping. You may use tuples (for example, mytup = ("abc", 123)) to store multiple objects of different types. You may use len wherever you like, and you can use list slicing if given a Python list as input. If we ask for a Python list in a function return type, you can use append or pop.
B Task 1: Data Structures
We’ll start off by implementing some new data structures. All we specify is the interface; the choice of design is yours, as long as the interface behaves correctly and efficiently. You may test these with the test_structures.py program: python3.11 test_structures.py.
Task 1.1: Fix and Extend the Priority Queue (3 marks)
A queue is a data structure that can handle efficient access to elements on a first-in-first-out basis. Recall that a priority queue is an extension of a simple queue that supports efficient access to the element with the highest priority; new elements can be inserted with a given (arbitrary) priority, and the priority queue must be able to support efficient dequeue operations based on the priority order. For this assignment, we will assume priority values are numeric and comparable (so, they may be floats or integers), with lower values representing a higher priority. In other words, we’re going to be supporting a min-heap.
We have provided you with a semi-working priority queue in pqueue.py. Unfortunately, we ran out of time to get it working perfectly, so there are a few bugs lurking in the implementation. First, crush these bugs so the priority queue works properly (1 mark)!
Once your heap is operating correctly, we need to handle a few more subtleties; we’d like to support in-place construction in linear time through the ip_build function, and in-place sorting via the sort function. Note that the in-place operations should operate directly on the data array without creating a copy; running in-place heap sort will yield a sorted array, but will destroy the heap ordering. As such, you may assume the user will no longer use the heap once the sort function has been used. Welcome to UQ(ueue).
You can test via: python3.11 test_structures.py --pq
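The linear-time build and the in-place sort described above can be sketched as follows. This is a minimal illustration on a plain Python list, not the skeleton's pqueue.py interface: the function names are hypothetical, and the output order is an assumption (with a min-heap, the usual in-place procedure leaves the array in descending order; check the skeleton for the order it expects).

```python
def sift_down(data, start, size):
    """Move data[start] down until the min-heap property holds within data[0:size]."""
    i = start
    while True:
        left = 2 * i + 1
        right = left + 1
        smallest = i
        if left < size and data[left] < data[smallest]:
            smallest = left
        if right < size and data[right] < data[smallest]:
            smallest = right
        if smallest == i:
            return
        data[i], data[smallest] = data[smallest], data[i]
        i = smallest


def build_min_heap(data):
    """Bottom-up construction: sift down every internal node, O(n) in total, no copy."""
    size = len(data)
    for i in range(size // 2 - 1, -1, -1):
        sift_down(data, i, size)


def heap_sort_descending(data):
    """In-place heap sort on a min-heap; the heap ordering is destroyed afterwards."""
    build_min_heap(data)
    for end in range(len(data) - 1, 0, -1):
        # Move the current minimum to the end, then repair the shrunken heap.
        data[0], data[end] = data[end], data[0]
        sift_down(data, 0, end)
```

Each extraction swaps the current minimum to the back of the shrinking heap region, which is why no second array is needed.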
Task 1.2: Implement a Map (3 marks)
Your next job is to implement a concrete data structure to support the map interface. Recall that a map allows items to be stored via unique keys. In particular, given a key/value pair (k, v) (otherwise known as an entry), a map M can support efficient insertions (associate k with v in M), accesses (return the value v associated with k if k ∈ M), updates (update the value stored by key k from v to v′) and deletes (remove (k,v) from M). In other words, you will be supporting operations like a Python dict (aka {}) class.
Test via: python3.11 test_structures.py --map
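One possible design for the four operations above is a hash table with separate chaining. The sketch below is illustrative only: the class and method names are hypothetical (the required interface is whatever the stub in map.py specifies), keys are assumed to be strings, and resizing is left out for brevity.

```python
class ChainedMap:
    """Toy map using separate chaining; keys are assumed to be strings."""

    def __init__(self, capacity=8):
        # Each slot is either None or a list of [key, value] pairs.
        self._buckets = [None] * capacity
        self._size = 0

    def _index(self, key):
        # Polynomial rolling hash over characters (not Python's built-in hash).
        h = 0
        for ch in key:
            h = (h * 31 + ord(ch)) % 1_000_000_007
        return h % len(self._buckets)

    def insert(self, key, value):
        """Associate key with value, overwriting any existing value."""
        i = self._index(key)
        if self._buckets[i] is None:
            self._buckets[i] = []
        for entry in self._buckets[i]:
            if entry[0] == key:
                entry[1] = value
                return
        self._buckets[i].append([key, value])
        self._size += 1

    def find(self, key):
        """Return the value stored under key, or None if key is absent."""
        bucket = self._buckets[self._index(key)]
        if bucket is not None:
            for entry in bucket:
                if entry[0] == key:
                    return entry[1]
        return None

    def remove(self, key):
        """Delete key if present by overwriting it with the bucket's last entry."""
        bucket = self._buckets[self._index(key)]
        if bucket is None:
            return
        for pos in range(len(bucket)):
            if bucket[pos][0] == key:
                bucket[pos] = bucket[len(bucket) - 1]
                bucket.pop()
                self._size -= 1
                return

    def size(self):
        return self._size
```

Without resizing, chains grow with the number of entries, so a real implementation would also track the load factor and rehash into a larger table.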
Task 1.3: Implement a Bloom Filter (3 marks)
Bloom filters are an interesting probabilistic data structure that can support extremely efficient and compact set membership operations. In particular, they use hashing in combination with bitvectors to toggle on sets of bits; when looking up a given key k, the key is hashed via a series of unique hash functions and mapped to various indexes of the bitvector. Next, these bits are observed; if they are all on, then we return True which means “yes, this key might be in the set.” Otherwise, we return False which means “No, this key is definitely not in the set.” Your Bloom filter does not need to double check that the True values are definitely in the set; that job is for another data structure.
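As a rough illustration of the insert and lookup behaviour just described, here is a small Bloom filter over string keys. Everything in it is an assumption made for the sketch: the class and method names are hypothetical, a bytearray stands in for the provided bit vector, and the "series of hash functions" is simulated by combining two simple polynomial hashes (double hashing).

```python
class TinyBloomFilter:
    """Toy Bloom filter over string keys; a bytearray stands in for a bit vector."""

    def __init__(self, num_bits, num_hashes):
        self._num_bits = num_bits
        self._num_hashes = num_hashes
        self._bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Two polynomial hashes combined by double hashing:
        # position_i = (h1 + i * (h2 + 1)) mod num_bits.
        h1 = 0
        h2 = 0
        for ch in key:
            h1 = (h1 * 31 + ord(ch)) % 2_147_483_647
            h2 = (h2 * 131 + ord(ch)) % 1_000_000_007
        positions = [0] * self._num_hashes
        for i in range(self._num_hashes):
            positions[i] = (h1 + i * (h2 + 1)) % self._num_bits
        return positions

    def insert(self, key):
        """Switch on every bit that the key maps to."""
        for p in self._positions(key):
            self._bits[p // 8] |= 1 << (p % 8)

    def contains(self, key):
        """False means definitely absent; True means possibly present."""
        for p in self._positions(key):
            if not self._bits[p // 8] & (1 << (p % 8)):
                return False
        return True
```

An inserted key can never be reported absent, because its bits are only ever switched on; false positives arise when other keys happen to switch on the same bits.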
C Preliminaries: The Graph Class
Many of the following problems (all of Task 2, and some aspects of Task 3) will require the use of a graph data structure. We have provided a concrete implementation of a graph data structure for you, and you will need to get familiar with it in order to progress. The graph types are defined in structures/graph.py.
Graph Types
There are two key types of graphs. The Graph class is the base class which stores nodes and edges. Each node in the graph (Node) stores an id which is the index of the node in the graph’s adjacency list. For example, if you have a node with an id 22, this means that the node will be stored at the Graph’s self._nodes[22] and can be accessed via the Graph’s get_node() function. The Graph also provides a function to return a list of neighbours given an index/node identifier.
There is a special LatticeGraph type that extends the Graph class (and a LatticeNode that extends Node). This specialized graph is used only for graphs that are placed on a lattice. In other words, these graphs can be thought of as simple grids, where each vertex has between zero and four neighbours. As such, some additional properties including the number of logical rows and columns in the (4N) graph are stored. For your purposes, the only real difference you need to know about with this special type is that you can ask for the (x, y) coordinates of a given LatticeNode using the get_coordinates() function. You can also directly return the nodes to the north/south/east/west using the appropriate get_north() (etc) functions.
Your Implementations
All of the following tasks have pre-made function stubs. You should pay close attention to the type hints so you know what is expected to be taken as parameters, and what should be returned.
Figure 3 The Mega Gurkey – Artwork by Jesse Irwin.
D (Optional) Backstory for the Remainder of Assignment Two
Last year, the COMP3506/7505 cohort helped Barry Malloc capture an enterprising Australian Brush Turkey (https://en.wikipedia.org/wiki/Australian_brushturkey), named Gurkey Tobbler, that was ruining his garden. Afterwards, the chief scientist at MallocLabs (Dr. Amongus) transported Gurkey to the lab to conduct some genomic sequencing. Thanks to your great work on DNA compatibility, Dr. Amongus has since discovered that Gurkey DNA is compatible with that of the Loxodonta Africana, the African Bush Elephant (https://en.wikipedia.org/wiki/African_bush_elephant)!
While this is a crowning scientific discovery, there is one (big) problem; Dr. Amongus has created a giant hybrid mega Gurkey through the irresponsible use of genetic modification tools. Our goal in this section is to find the mega Gurkey before it’s too late, and to help Barry conduct further analysis on the Gurkey genome.
Meta comment: Why do we make up these crazy backstories and bury details inside them? Well, because you need to practice looking at a problem you have and extracting the important details. It is highly unlikely you will ever be given an extremely well specified problem. It is also a lot more fun this way :-)
E Task 2: Pathfinding Algorithms
Getting Started
To get started, we will focus on lattice graphs. Note that we have provided some graphs for you already, and the ones we are interested in (for now) are those in the data/4N directory. However, your solutions here must also work on the data/KN and data/POSW graphs (note that KN and POSW are the same types of graph if an algorithm does not use edge weights).
We have provided a program called test_pathfinding.py to help you test your algorithms. This program allows different pathfinding algorithms through two-dimensional mazes to be tested (mandatory) and visualized (optional). Note that in order to make life easier, we’re randomly generating the origin and goal vertices, so you will need to supply a seed to the random number generator (via --seed) to yield different origins and goals each time you run the program. All implementations for Task 2 must be inside the algorithms/pathfinding.py file, where appropriate stubs are provided for you.
Task 2.1: Breadth-First Search (2 marks)
Given some arbitrary start vertex u, and some goal vertex v, Breadth-First Search (BFS) systematically walks across the graph until either v is found, or there are no remaining vertices to explore. Figure 4 provides a sketch of this process. You must implement the bfs_traversal() stub; note that both the visited list, and the path, are expected to be returned. Please see the type annotations for the specific details about what should be returned.
To make your results reproducible, you must enqueue/push the unvisited neighbours in the order they are given to you from the get_neighbours() function.
Finally, while we will be visualizing our BFS on the lattice graphs, you must ensure that your algorithms translate to graphs with arbitrary degree. This should be trivial to implement. For the avoidance of doubt, your BFS algorithm will be tested on the KN graphs.
Test via: python3.11 test_pathfinding.py --graph data/4N/one.graph --bfs --seed <number> [--viz]
Note that the --viz flag is optional (and triggers the visualizer to run) and <number> should be substituted with an integer.
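The BFS procedure above can be sketched on a stand-in graph. The assumptions are stated plainly: the graph is a plain list of adjacency lists indexed by integer node id (not the provided Graph class), the function name and the (visited, path) return order are hypothetical, and vertices are flagged when enqueued so that none enters the queue twice. The bfs_traversal() stub and its type annotations remain the authority on the real signature.

```python
def bfs(adjacency, origin, goal):
    """Return (visited order, origin-to-goal path); the path is [] if goal is unreachable."""
    n = len(adjacency)
    discovered = [False] * n
    parent = [-1] * n
    queue = [None] * n  # fixed-size array queue: each vertex is enqueued at most once
    head = 0
    tail = 0
    queue[tail] = origin
    tail += 1
    discovered[origin] = True
    visited = []
    found = False
    while head < tail:
        current = queue[head]
        head += 1
        visited.append(current)
        if current == goal:
            found = True
            break
        # Enqueue unvisited neighbours in the order the adjacency list gives them.
        for neighbour in adjacency[current]:
            if not discovered[neighbour]:
                discovered[neighbour] = True
                parent[neighbour] = current
                queue[tail] = neighbour
                tail += 1
    if not found:
        return visited, []
    # Count the path length first, then fill the path from the back using parent links.
    length = 1
    node = goal
    while node != origin:
        node = parent[node]
        length += 1
    path = [None] * length
    node = goal
    for i in range(length - 1, -1, -1):
        path[i] = node
        node = parent[node]
    return visited, path
```

Recording a parent for each vertex when it is first discovered is what allows the path to be rebuilt afterwards without storing a path per vertex.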
Task 2.2: Dijkstra’s Algorithm (3 marks)
BFS is nice; it is quite simple and it works well at finding the Gurkey when the graph is unweighted. However, Brisbane is a hilly city, and some paths are more expensive than others; we’ll need to take this into account to find the true shortest path to the Gurkey. We also don’t necessarily know where the Gurkey will be, so it would be good to find the shortest path from our current location to all possible locations.
Your goal is as follows. Given a weighted graph and a source node, return the cost of the lowest-cost path to all reachable nodes. If a node is not reachable (for instance, if the Gurkey has destroyed all of the bridges) then you should not return it in your list. Please see the type annotations for the specific details about what this function should return.
Test via: python3.11 test_pathfinding.py --graph data/POSW/one.graph --dijkstra --seed <number>
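A minimal sketch of the single-source computation described above follows. It uses Python's heapq module as a stand-in priority queue purely for brevity; the implementation rules above do not permit that in submitted code, where your own priority queue would take its place. The adjacency format (lists of (neighbour, weight) pairs), the function name, and the decision to include the source with cost 0 are all assumptions of the sketch.

```python
import heapq


def dijkstra(adjacency, source):
    """Return [(node, cost)] for every node reachable from source, in node-id order."""
    n = len(adjacency)
    best = [None] * n      # best known cost per node; None means not reached
    settled = [False] * n  # True once a node's cost is final
    best[source] = 0
    frontier = [(0, source)]
    while frontier:
        cost, node = heapq.heappop(frontier)
        if settled[node]:
            continue  # stale queue entry for a node that is already final
        settled[node] = True
        for neighbour, weight in adjacency[node]:
            candidate = cost + weight
            if best[neighbour] is None or candidate < best[neighbour]:
                best[neighbour] = candidate
                heapq.heappush(frontier, (candidate, neighbour))
    return [(node, best[node]) for node in range(n) if best[node] is not None]
```

Unreachable nodes never receive a cost, so they are simply absent from the result, which matches the requirement not to return them.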
Input graph (vertex labels as drawn): A B C D / E F G H / I J L / O P
Queue on each step:
(1) B
(2) A F
(3) F
(4) E G J
(5) G J I
(6) J I H
Visited Nodes: [B, A, F, E, G, J, ...]
BFS starts at an arbitrary vertex B, by enqueueing it. At each step, a dequeue returns the next candidate vertex which is then marked visited (and added to the visited list). Each neighboring vertex of the current candidate (that is yet to be visited) is enqueued.
BFS halts when either the goal node (H) is visited, or when there are no other vertices to visit.
Figure 4 A sketch of breadth-first search starting at vertex B and searching for vertex H. A queue keeps track of the next vertices to visit, and they are visited as they are dequeued. A list can be used to track the order in which nodes are visited.
Note that it does not really make sense to use the --viz flag with Dijkstra, because the 4N graphs do not have edge weights (and the viz tool needs to use 4N graphs).
Dijkstra's Algorithm
Input Graph G (Origin/Source: F)
Destination   Path     Cost
A             [G, C]   6
B             [G, C]   5
C             [G]      4
D             []       4
E             [G]      4
G             []       2
H             []       2
Path is not required to be returned, but shown for clarity.
Output: A list of vertices and their associated shortest path cost from the source. Cost is calculated as the sum of edge weights across the shortest path. Graphs are guaranteed to have positive edge weights.
Figure 5 A sketch of Dijkstra’s algorithm. See the code for the expected output format and structure.
Task 2.3: Depth-First Search (2 marks – COMP7505 only)
Depth-First Search (DFS) operates very similarly to Breadth-First Search. However, instead of using a FIFO queue, it uses a LIFO stack. You must implement the dfs_traversal() stub (plus any additional data structures you may require); note that both the visited set, and the path, are expected to be returned. Please see the type annotations for the specific details about what these functions should return.
To make your results reproducible, you must push the unvisited neighbours in the order they are given to you from the get_neighbours() function.
Finally, while you can visualize DFS on the lattice graphs, you must ensure that your algorithms translate to graphs with arbitrary degree. This should be trivial to implement. For the avoidance of doubt, your DFS algorithm will be tested on the KN graphs.
Test via: python3.11 test_pathfinding.py --graph data/4N/one.graph --dfs --seed <number> [--viz]
Note that the --viz flag is optional (and triggers the visualizer to run) and <number> should be substituted with an integer.
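For comparison with BFS, here is a stack-based sketch of DFS under the same stand-in assumptions: a plain list of adjacency lists rather than the provided Graph class, a hypothetical function name and (visited, path) return order, and vertices flagged when they are pushed. A Python list with append and pop plays the role of the stack here; the dfs_traversal() stub defines the real signature.

```python
def dfs(adjacency, origin, goal):
    """Return (visited order, origin-to-goal path); the path is [] if goal is unreachable."""
    n = len(adjacency)
    marked = [False] * n
    parent = [-1] * n
    stack = [origin]
    marked[origin] = True
    visited = []
    found = False
    while len(stack) > 0:
        current = stack.pop()  # LIFO: the most recently pushed vertex is explored next
        visited.append(current)
        if current == goal:
            found = True
            break
        # Push unvisited neighbours in the order the adjacency list gives them.
        for neighbour in adjacency[current]:
            if not marked[neighbour]:
                marked[neighbour] = True
                parent[neighbour] = current
                stack.append(neighbour)
    if not found:
        return visited, []
    # Follow parent links back from the goal, then reverse by swapping ends.
    path = [goal]
    while path[len(path) - 1] != origin:
        path.append(parent[path[len(path) - 1]])
    i = 0
    j = len(path) - 1
    while i < j:
        path[i], path[j] = path[j], path[i]
        i += 1
        j -= 1
    return visited, path
```

Because neighbours are pushed in the given order, the last neighbour pushed is the first one explored, which is the main behavioural difference from the queue-based version.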
Input graph (vertex labels as drawn): A B C D / E F G H / I J L / O P
Stack on each step: shown for steps (1) to (7).
DFS starts at an arbitrary vertex B, by marking it as visited and pushing it on the stack. At each step, the stack is popped to get the current candidate vertex (which is then added to the visited list) and each neighboring vertex is marked as visited and pushed onto the stack.
DFS halts when either the goal node (I) is visited, or when there are no other vertices to visit.
Visited Nodes: [B, A, F, J, G, H, L, ...]
Figure 6 A sketch of depth-first search starting at vertex B and searching for node I. A stack keeps track of the next vertices to visit, and they are visited as they are popped from the stack. A list can be used to track the order in which nodes are visited.
F Task 3: Problem Solving
Now that the mega Gurkey is back in the lab, we will need to conduct additional testing to ensure such an event never happens again (well, at least until COMP3506/7505 2025). Unfortunately, Dr. Amongus has already been fired, so it is our job to help Barry Malloc determine how this all happened in the first place.
Task 3.1: Maybe Maybe Maybe (3 marks)
MallocLabs has a huge database of k-mers that have been sequenced throughout their many years of operation. To determine which genomes may have been involved in the genetic modification of the Gurkey, we can simply compare the Gurkey genome to all genomes in the database to find out which ones match, and return those for further analysis. The problem, however, is that the database contains trillions of k-mers.
Our job is to create a fast and compact filtering algorithm that, given a list of database k-mers, D, and another list of query k-mers Q, returns a list of k-mers from Q that are likely to be in D. We award more marks for having lower false positive rates; the maximum allowed false positive rate is 10%, and then we will measure at 5% and 1%. Note that lower false positive rates might come at higher time and space costs.
Test via: python3.11 test_problems.py --maybe --seed <number>
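If a Bloom filter is used for this task (an assumption; the task does not mandate one), the trade-off between false positive rate and space can be estimated with the standard sizing formulas: for n expected keys and target rate p, roughly m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The helper name below is hypothetical.

```python
import math


def bloom_parameters(expected_items, target_fp_rate):
    """Estimate (number of bits, number of hash functions) for a Bloom filter."""
    # m = -n * ln(p) / (ln 2)^2, rounded up to a whole number of bits.
    bits = math.ceil(-expected_items * math.log(target_fp_rate) / (math.log(2) ** 2))
    # k = (m / n) * ln 2, rounded to the nearest integer and kept at least 1.
    hashes = round(bits / expected_items * math.log(2))
    if hashes < 1:
        hashes = 1
    return bits, hashes


# Sizing for 1000 keys at the three thresholds mentioned above.
for rate in (0.10, 0.05, 0.01):
    print(rate, bloom_parameters(1000, rate))
```

For 1000 keys this gives about 4.8 bits per key at 10%, 6.2 at 5%, and 9.6 at 1%, which makes the extra space cost of the stricter thresholds concrete.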
Maybe Maybe Maybe
Input k-mer database D: GCTACTCC, CTAAGTTT, TTTCTGTT, ATCTACTT, GTACTTTC
Input query k-mers Q: CTGTATCC, GTACTTTC, CCTCTCCC, ATCTACTT, ATCCATCC, AACCGGTT
Output (likely) matches: GTACTTTC, ATCTACTT, AACCGGTT
Note the false positive (AACCGGTT) in the output.
Figure 7 A sketch of Maybe³ – See the code for the expected output format and structure.
Task 3.2: Dora and the Chin Bicken (3 marks)
MallocLabs’ spies have recently discovered that their main competitor CallocLabs has hired Dr. Amongus, and is planning to release a giant Ibis named Chin Bicken to wreak havoc on MallocLabs HQ! Barry and the team need to get prepared. The head honchos at MallocLabs have decided on the following strategy:
❙ Chin Bicken will, at some point, attack the MallocLabs HQ;
❙ Since Chin Bicken is enormous, it may attack different parts of the building simultaneously;
❙ At this time, MallocLabs will release a robot – Dora – which does the following:
    - It will receive as input an undirected graph G where vertices represent rooms in MallocLabs HQ, and edges represent undamaged connections between rooms.
    - From its starting location, it will explore all reachable rooms of the building to collect genomic data left by the Chin Bicken.
    - This genomic data comes in the form of special gene symbols, s, represented by a single character; there is one at each vertex of G.
    - Next, the robot builds a gene symbol frequency table T which maps each gene symbol s to its total frequency in G, denoted f_s.
    - Once T is computed, the robot builds a minimum redundancy code via Huffman’s algorithm, resulting in a codebook C_G mapping each s to a codeword c_s.
    - Finally, the robot receives a sequence L = ⟨s_0, ..., s_{n−1}⟩ of n symbols, drawn from all symbols appearing in G. This sequence represents the specific body part of Chin Bicken that MallocLabs believes is its weak point. The robot will use C_G to encode all s ∈ L into one long bitvector B. That is, B will hold the concatenation of the encoding of each symbol in L: c_{s_0} c_{s_1} ··· c_{s_{n−1}}.
❙ Once the robot produces B, Barry can feed it into the GeneCoder6000 to develop a weapon to fend off the Chin Bicken.
Of course, Dora will need to be fast. It has to visit all vertices in the graph before Chin Bicken causes any further chaos, after all. You have been tasked to write the logic for Dora. Get to it!
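The codebook-building and encoding steps of Dora's pipeline can be sketched as below, assuming the frequency table has already been collected. For brevity the sketch uses a dict, sorted, and heapq, none of which the implementation rules permit in submitted code, and a string of "0"/"1" characters stands in for the bitvector B. The function names and the tie-breaking order between equal frequencies are assumptions, so the exact codewords may differ from what the tests expect.

```python
import heapq


def huffman_codebook(frequencies):
    """Build a prefix-free codebook {symbol: bit string} from {symbol: frequency}."""
    heap = []
    order = 0  # tie-breaker so that equal frequencies never compare the payloads
    for symbol in sorted(frequencies):
        heap.append((frequencies[symbol], order, symbol))
        order += 1
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}  # a lone symbol still needs a one-bit codeword
    while len(heap) > 1:
        # Merge the two least frequent subtrees into one internal node.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, order, (left, right)))
        order += 1
    codebook = {}
    stack = [(heap[0][2], "")]
    while stack:
        node, prefix = stack.pop()
        if isinstance(node, tuple):  # internal node: (left subtree, right subtree)
            stack.append((node[0], prefix + "0"))
            stack.append((node[1], prefix + "1"))
        else:  # leaf: a single-character gene symbol
            codebook[node] = prefix
    return codebook


def encode(sequence, codebook):
    """Concatenate the codeword of every symbol in the sequence."""
    return "".join(codebook[symbol] for symbol in sequence)
```

Frequent symbols end up near the root and receive short codewords, which is what minimises the total length of the encoded sequence.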
Task 3.3: Chain Reaction (3 marks)
To progress further with the reconstruction of Dr. Amongus’ cloning programme, Barry now needs to find what is called the optimal reaction compound. We are given n candidate compounds, each of which is represented by a unique ⟨x,y⟩ coordinate based on their reactivity in two specific dimensions of interest. Each compound also holds a floating point value known as the spike radius r.
Compound A is said to cause a reaction with compound B if the circle centered on ⟨x_A, y_A⟩ with radius r (the spike radius of A) overlaps with the compound at ⟨x_B, y_B⟩. However, reactions do not occur naturally: they must be triggered by some other reaction. When a compound reacts, any compound that it is reactive with will also be triggered (and so on).
You are given one charged molecule to set off a chain reaction, and you must select the given compound i ∈ [0, n − 1] that will maximize the total number of compounds in the chain reaction. If there are ties, return the one with the smallest identifier.
Test via: python3.11 test_problems.py --chain --seed <number>
Chain Reaction
Input: List of x, y coordinates with associated radius values.
Output: The identifier of the compound that we should use to trigger the largest chain reaction. The answer here is 3. If there is a tie, return the lowest identifier.
Triggered Node   Compounds in Reaction
1                [1, 2]
2                [2]
3                [3, 1, 2, 4, 6]
4                [4, 6]
5                [5]
6                [6, 4]
Figure 9 A sketch of the Chain Reaction problem. See the code for the expected output format and structure.
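One direct way to think about the problem is as reachability in a directed graph, where A points to B when B lies inside A's circle. The sketch below takes that view under several assumptions: compounds are given as (x, y, radius) tuples identified by list index, a compound exactly on the circle boundary counts as overlapping, and the function name is hypothetical. It is a brute-force illustration (cubic time in the worst case), not a tuned solution.

```python
def best_trigger(compounds):
    """Return the index of the compound whose chain reaction reaches the most compounds."""
    n = len(compounds)
    best_id = -1
    best_count = 0
    for start in range(n):
        triggered = [False] * n
        triggered[start] = True
        stack = [start]
        count = 0
        while len(stack) > 0:
            a = stack.pop()
            count += 1
            ax, ay, ar = compounds[a]
            for b in range(n):
                if not triggered[b]:
                    bx, by, _ = compounds[b]
                    # Compare squared distances to avoid a square root.
                    if (ax - bx) ** 2 + (ay - by) ** 2 <= ar * ar:
                        triggered[b] = True
                        stack.append(b)
        # Strict comparison keeps the smallest identifier when counts tie.
        if count > best_count:
            best_count = count
            best_id = start
    return best_id
```

Note that the relation is not symmetric: a compound with a large radius can trigger a distant compound whose own small radius cannot reach back.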
Task 3.4: Lost in the Labyrinth (aka notably more k-cool) (2 marks)
The attack from CallocLabs compromised some of the building structure at MallocLabs HQ, and the team is concerned that the Gurkey might break free. Barry would like to build a labyrinth to contain the Gurkey, and has offers from various construction companies. However, he is concerned that some of these companies are trying to scam him, so we need to help him to come up with an algorithm to determine whether a labyrinth can even be constructed from each offer.
Each company treats the design of a labyrinth as a graph problem. They provide us with four integers: n, m, k, and c, where n is the number of vertices (|V |), m is the number of edges (|E|), k is the diameter of the graph, and c is the cost to produce it. From these four integers, we must determine if their offer is valid or not. A labyrinth is considered valid if it conforms to the following rules:
❙ It is a connected graph.
❙ It has no double edges or self loops.
❙ The largest shortest simple path between any two vertices v1 and v2 (the diameter) is at most k. (In other words, if you found the shortest simple path between every pair of vertices, the diameter of the graph is the length of the longest one of these.)
Given a list of offers, you must return the cheapest offer that can be constructed. If there are ties, return the one with the smallest identifier.
Test via: python3.11 test_problems.py --labyrinth --seed <number>
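One way to reason about validity without building any graph is by counting edges, sketched below. The reasoning assumes n ≥ 1: a connected simple graph needs between n − 1 and n(n − 1)/2 edges; a single vertex has diameter 0; only the complete graph has diameter 1; and a star (diameter 2) can absorb any number of extra edges without its diameter growing. The function names, the (n, m, k, c) tuple format, and the -1 return value for "no valid offer" are assumptions of the sketch.

```python
def labyrinth_possible(n, m, k):
    """Can a connected simple graph with n vertices, m edges and diameter <= k exist?"""
    max_edges = n * (n - 1) // 2
    if n < 1 or m < n - 1 or m > max_edges:
        return False  # too few edges to connect, or too many for a simple graph
    if n == 1:
        return k >= 0  # a single vertex has diameter 0
    if m == max_edges:
        return k >= 1  # the complete graph has diameter 1
    return k >= 2      # a star plus extra edges has diameter 2


def cheapest_valid_offer(offers):
    """Return the index of the cheapest constructible offer, or -1 if none is valid."""
    best_index = -1
    best_cost = None
    for index in range(len(offers)):
        n, m, k, cost = offers[index]
        # Strict comparison keeps the smallest index when costs tie.
        if labyrinth_possible(n, m, k) and (best_cost is None or cost < best_cost):
            best_cost = cost
            best_index = index
    return best_index
```

Each offer is checked in constant time, so a list of offers is processed in time linear in its length.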
G Task 4: Txt Cmprsn (up to 3 bonus marks)
Keen for more punishment? We have just the thing... We’ll be running a simple compression challenge. You will be given an arbitrary file, and you need to provide a compression/decompression algorithm. The stubs are provided for you in the compression.py file. There will be no marks given for incorrect/lossy algorithms; the output (after decompression) must exactly match the provided input. The marking scheme is as follows.
❙ One mark: Your algorithm can compress our file to at most half of its original size.
❙ One mark: Your algorithm is in the top 50% of those submitted.
❙ One mark: Your algorithm is in the top 10 of those submitted.
We will have a public compression leaderboard available that can be observed. There will be a separate submission area on Gradescope for this part. Please just submit your single file compress.py without any zipping. All of your references need to be placed in this file.