Programming Project #3: Genome-scale comparisons for Covid strains
Due Apr 25 by 11:59pm Points 20 Submitting a file upload Available until Apr 26 at 1 1:59pm This assignment was locked Apr 26 at 1 1:59pm.
General Instructions: You are allowed to work in teams of two each (including yourself), or by yourself, for this programming assignment. You are encouraged to use C or C++ (Python or Java are also okay but less preferred owing to memory ef ficiency). Grading will be based on a combination of factors such as correctness, coding style, implementation efficiency , exception handling, source code documentation, and code reusability . When coding, please remember that you will be reusing the code for this assignment in future projects of this course. So make sure you provide a nice interface and modularized code structure during design. For this project, there will not be any demo. So your report + the source code will be the key deliverables. Also, there will be no extensions. Submission: All submissions should include a cover page: DOC (assignment submissions without the cover sheet will not be graded) If you worked as a team, be sure to include both your names on the cover page. The assignment should be zipped into an archive folder (named after your lastname(s) - e.g., Project3-XYZ.zip), and uploaded into this Canvas assignment dropbox for Programming Project
3. Submissions are due by 1 1:59pm Pacific Time on the due date. A 24-hour grace period is
allowed although it will incur an automatic late penalty of 10%. Submissions that are more than 24 hours late will NOT be accepted. No exceptions. If you worked as a team of two members, then each member should submit so that I can upload the grade for both. Both members will get exactly the same grade.
Academic integrity policy: Reproduction of source codes or code snippets from online resources or other's people's assignments is not allowed. All source code should be entirely yours, written from scratch. If violation is found, it will be considered cheating.
The goal of this programming assignment is to perform a couple of dif ferent types of analyses on coronavirus genomic strains. See the lecture notes for a quick summary of all the steps of this project. Quick background and some relevant external resources: Coronavirus represents a family of viruses which is the main pathogen in SARS (2003-2004), MERS (2012-2014), and in the novel coronavirus COVID-19 of the present. Thousands of viral strains from this family have been sequenced till date. The NCBI V irus (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#) database maintains thousands of genome accessions for this family of viruses, including sequenced genomes for the dif ferent strains seen so far .
- The Nextstrain web portal (https://nextstrain.org/) tracks the real-time evolution of the novel coronavirus (COVID-19) along with resources to other viral pathogens. You can view a dendrogram in dif ferent modes on this website, that shows the evolutionary trajectory for the dozens of strains already sequenced.
- Testing for COVID-19 relies on our ability to extract and amplify the part of the DNA that is unique to a strain. Since the viral genetic material is RNA, a process called Reverse Transcription PCR is used to first convert the extracted RNA into a DNA, and then the DNA copies are target amplified in order to ascertain if the strain is COVID-19 (positive) or not (negative). There are many tutorials online that explain this process now , but for a simple (high-level) primer , you can refer to this YouTube video (https://www.youtube.com/watch?v=mELtUXvnhIQ) by PBS Nova. Project objectives: Input: A set of k genomes. W e will refer to this set as S = {s_1, s_2, ...., s_k}. For this programming project, you will implement two core functions to analyze genomes of coronavirus strains. Given a set of k genomes (isolates from dif ferent strains), : Task 1) Identify "DNA fingerprints" for each of the k strains. Here, a "DNA fingerprint" for a given strain s_i is defined to be a s hortest unique substring of s_i - i.e., a substring that is present only in strain s_i, and is the shortest possible such candidate. If there are multiple candidates for the DNA fingerprint of a strain, output any one of them and its length. On the other hand, if there exists no such fingerprint for a string s_i, then you need to state so (or) output the entire string s_i.
Compute and output an all-pairs "similarity matrix" D, which contains k rows and k columns. Cell (i,j) contains simply a similarity value (defined below) between strain s_i and strain s_j. Note that the similarity value as defined below , is a symmetric function and therefore the matrix D will also be symmetric along the main diagonal. Definition: The similarity value between two strings ( s_1, s2 ) is defined over a given alignment of those two strings. Given the two strings and an alignment (local or global), the similarity value is simply equal to the number of matches in the alignment. For example, if the two strings are tacgt and gaccgg , and the alignment is a local alignment: a c g a c c g then, the similarity value for this <input, alignment> combination is 3 (as there are 3 matches). Note that, as defined above, there are numerous similarity values that you can compute for the same pair of sequences (depending on the alignment). The idea is to report the "best possible" such similarity value for the string pair . Note: I am not deliberately using the word "optimal" here (as what I propose below in the algorithm is a heuristic). Read on.
Detailed Specification: PDF Task 1) Detecting DNA fingerprints For this task, your approach should follow roughly the following steps: 1) Build the GST for set S. 2) Using the GST , locate nodes that correspond to short shared matches, which upon a single character extension guarantees uniqueness in a given string s_i. This step can be implemented by doing a post- order traversal of the tree and using a coloring scheme to color nodes by their string origins. The algorithms were described in the class lectures. Please refer to those notes along with the lecture recording for any queries. Task 2) Computing similarity matrix For this task, you will need to compare every pair of sequences in set S and report a similarity score. To perform the comparison, you can do an alignment between each pair of sequences. However , you should assume that performing a brute-force alignment using dynamic programming will NOT be feasible because of the long string lengths (30Kbp vs. 30Kbp). Therefore, I would like you to implement an approach of identifying a longest common substring between the two strings and then using that as a "seed" to extend on either side to identify a local alignment anchored on the seed. More specifically , the steps are as follows: For each pair of sequences ( s_i, s_j ):
Build the GST for {s_i, s_j }. 2) Use the GST to find a longest common substring ("lcs") and its coordinates on s_i and s_j. Let us say , the lcs identified starts at index x_1 in s_i and index x_2 in s_j; and ends at index y_1 in s_i and index y_2 in s_j. Also, let the length of the lcs be denoted by b. 3) a) Extract the prefixes s_i[1...x_1] and s_j[1...x_2] ; b) Reverse the prefixes - let us call the reversed versions of the strings s_i_rev and s_j_rev ; c) Compute a variant of the global alignment between s_i_rev and s_j_rev (using af fine gap penalty function), where the recurrence is same as the Needleman-W unsch's algorithm in the forward phase. However , for finding the optimal answer , pick the cell that contains the maximum score from the overall matrix (i.e., not necessarily the last cell). Let a be the number of matches for the alignment ending at this max-cell. For all alignments, use the following alignment parameters: match score = +1, mismatch penalty = -2, opening gap (h) = -5, gap extension (g) = -1. 4) a) Extract the suf fixes s_i[y_1...m] and s_j[y_2...n] ; let us call these suffixes s_i_fwd and s_j_fwd respectively . b) Compute a variant of the global alignment between s_i_fwd and s_j_fwd (using af fine gap penalty function), where the recurrence is same as the Needleman-W unsch's algorithm in the forward phase. However , for finding the optimal answer , pick the cell that contains the maximum score from the overall matrix (i.e., not necessarily the last cell). Let c be the number of matches for the alignment ending at this max-cell. 5) Then, the similarity value for the string pair ( s_i, s_j ) is equal to a+b+c . Set D[i,j] = a+b+c ; Output the similarity matrix D. Input File Formats:
- The F ASTA file format is the same format used in Project #1. For this assignment you can assume that there is only one string in the file. But that string can be spread over multiple (contiguous) lines in the input F ASTA file.
- For the input alphabet, specify a file with all the symbols in a single line delimited by space. For coding convenience, you can assume the contents of this file to be case sensitive. For example, if you specify {A,C,G,T}in the alphabet file, you can safely assume that the input file has its sequence written in uppercase. Conversely , if you find a lowercase letter somewhere in the input sequence then you could exit with an error .
Test inputs: Covid-19 Inputs: (source: NCBI V irus database (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/) ) String IDStrain name Input sequence fileSeque length (in bp 1 COVID-19 ChinaFASTA 2 COVID-19 USAFASTA 3 COVID-19 AustraliaFASTA 4 COVID-19 IndiaFASTA 5 COVID-19 BrazilFASTA 6 SARS_2003FASTA 7 SARS_2017FASTA 8 MERS_2012_SaudiFASTA 10 MERS_2014_USAFASTA
Report: In a separate W ord or PDF document, report the following:
- System configuration: CPU used, Clock rate, RAM, Cache size (if you know it).
- Quality:
a. Create a table that shows for each input string, the length of its corresponding fingerprint
(task 1 output). As for the fingerprint themselves, please output them to a file.
b. Show the similarity matrix D as a table of values (for all pairs of sequences).
c. Discuss your results and observations. More specifically ,
- what does the fingerprint lengths tell you about each strain and its uniqueness?
- what does the similarity matrix D tell you about the relationship between these strains? Do you see any logical groupings visible through this matrix?
- Performance: a. Create tables and/or charts to show the runtime results of the code. Please break down the runtime into the dif ferent components: Task 1 Fingerprinting performance: suf fix tree construction time; time to identify fingerprints; total time for the entire fingerprinting task; Task 2 Similarity matrix performance: time spent building suf fix trees; time spent performing the alignments; total time for the computation of the matrix D. In addition, for task 2 performance, report the length of the longest common substrings reported for every pair of strings. You can show that too in the form of a table.
- Other comments: If you have any other observations or comments, you can add them here in this section.
CHECKLIST FOR SUBMISSION: Cover page source code