COMP90016 - Assignment 1
Version 1 Last edited 13/3/2023
Semester 1, 2023
NAME = ""
ID = ""
Marking
Cells that must be completed to receive marks are clearly labelled. Some cells are code cells, in which you must complete the code to solve a problem. Others are markdown cells, in which you must write your answers to short-answer questions.
Cells that must be completed to receive marks are labelled like this:
# -- GRADED CELL (1 mark) - complete this cell --
Some graded cells are code cells, in which you must complete the code to solve a problem. Other graded cells are markdown cells, in which you must write your answers to short-answer questions.
You will see the following text in graded code cells:
# YOUR CODE HERE
raise NotImplementedError()
You must remove the raise NotImplementedError()
line from the cell, and replace it with your solution.
Only add answers to graded cells. If you want to import a library or use a helper function, this must be included in a graded cell.
Only graded cells will be marked. Don't make changes outside graded cells, and don't add or remove cells from the notebook.
Word limits, where stated, will be strictly enforced. Answers exceeding the limit will not be marked.
Run-time limits will be imposed for each coding question. The run-time of a code cell can be calculated by including
%time
at the top of your cell. Cells exceeding the run-time limit will not be marked. The run-time limits only apply to test cases that are included in this document.
No marks are allocated to commenting in your code. We do however, encourage efficient and well-commented code.
The total marks for the assignment add up to 100, and it will be worth 10% of your overall subject grade.
Part 1: 55 marks
Part 2: 25 marks
Part 3: 20 marks
Submitting
Before you turn this assignment in, make sure everything runs as expected. First, restart the kernel (in the menubar, select Kernel$\rightarrow$Restart) and then run all cells (in the menubar, select Cell$\rightarrow$Run All).
Make sure you fill in any place that says YOUR CODE HERE
or "YOUR ANSWER HERE", as well as your name and student ID number below:
Overview
In this assignment, you will answer questions about working with short reads, sequence motifs and codon bias.
You will use the biopython
library in your functions. You may want to refer to sections of the biopython
documentation for additional help (https://biopython.org/wiki/Documentation). Additional to biopython
and standard Python 3 functions and methods, you may also use any other library we have used in Computational Genomics including collections
, numpy
, pandas
, math
, itertools
, seaborn
and matplotlib
.
Part 1: Working with short reads
Setup
import os
import requests
from IPython.core.display import HTML
# Handy function to fetch our datafile
def fetch_file(url):
response = requests.get(url)
if response.status_code == 200:
print('File found!')
filename = os.path.basename(url).split('?', 1)[0]
with open(filename, 'wb') as f:
f.write(response.content)
f.close()
print(f'Saved to: {filename}')
else:
print('File not found')
# Make the notebook pretty
HTML(requests.get('https://raw.githubusercontent.com/melbournebioinformatics/COMP90016/main/data/2023/style/custom.css').text)
# Fetch assignment data
url = 'https://github.com/melbournebioinformatics/COMP90016/blob/main/data/2023/Assignment_01/data/comp90016_assignment_1.fastq.gz?raw=true'
fetch_file(url)
First, we read in a read set from the comp90016_assignment_1.fastq.gz
file. Note that converting a readset into a list of biopython
objects makes it easier to handle.
import gzip
from Bio import SeqIO, SeqRecord, Seq
fname = 'comp90016_assignment_1.fastq.gz'
# Our fastq file is compressed using gzip.
# We must open it before SeqIO can read the contents
with gzip.open(fname, "rt") as handle:
readset = list(SeqIO.parse(handle, "fastq"))
Questions
In the cells below, complete the following tasks:
Question 1.1
(5 marks) Challenge: Write a Python function to compute the following stats for an input readset: - The minimum read length - The mean read length and - The maximum read length Input: a list of Bio.SeqRecord.SeqRecord objects. Output: Return a `list` with the minimum read length as an integer, the mean read length as a floating-point number and the maximum read length as in integer. The elements of the output list should be in the order: minimum read length, mean read length, maximum read length. If the input list is empty, return None.# GRADED CELL 1.1 (5 marks, max 1 min run-time)
def read_lengths(reads):
"""
Calculate minimum, mean and maximum read lengths in a readset.
Assume reads is a list of Bio.SeqRecord.SeqRecord objects containing DNA sequences.
Return a list with the minimum read length as an integer, the mean read length as a floating-point number and the maximum read length as in integer.
The output list should be in the order: minimum read length, mean read length, maximum read length.
If the input list is empty, return None.
"""
# YOUR CODE HERE
# creating an empty list to store results
length_list = []
# checking the empty situation
if not reads:
return None
# obtaining the minimum & the maximum read length
read_min = min(reads, key = len)
read_max = max(reads, key = len)
minimum = len(read_min)
maximum = len(read_max)
# calculating the mean read length
total_len = 0
for element in reads:
total_len += len(element)
avg = float(total_len) / float(len(reads))
# inserting
length_list.extend([minimum, avg, maximum])
return length_list
#raise NotImplementedError()
# Test your function in this cell
# First we will create a list of dummy reads
demo_reads = [SeqRecord.SeqRecord(Seq.Seq('GTTGGATTCATGAAAGA'), 'ERR024571.2', '', ''),
SeqRecord.SeqRecord(Seq.Seq('ATGAAATGAATGTCTTGA'), 'ERR024571.2', '', '')]
print(read_lengths(demo_reads)) # should output [17, 17.5, 18]
print(read_lengths(readset))
[17, 17.5, 18]
[56, 75.445, 76]
# --- AUTOGRADING CELL DO NOT EDIT ----
Question 1.2
(15 marks) Suppose quality control metrics indicated that there were unacceptable levels of adapter contamination in the reads. The sequence of the adapter has been determined and it matches an adapter sequence used in the sequencing. Challenge: Write a Python function that performs the following tasks and returns the new readset as list of SeqRecord objects. - If a read has an unbroken subsequence of length min_match bases or greater in common with the adapter sequence, trim all bases in the reads that are 3’ of the start of the adapter sequence match, including the first matching base. - Remove any read that is shorter than min_length bases long after trimming. Assumptions: - Assume min_match and min_length are positive integers. - Assume reads is list of Bio.SeqRecord.SeqRecord objects, each object contains a sequence that is a Bio.Seq.Seq object. - Assume adapter_seq is a Bio.SeqRecord.SeqRecord object. If the readset list or the adapter_seq is empty, return None.# GRADED CELL 1.2 (15 marks, max 1 min run-time)
def adapter_trim(reads, adapter_seq, min_match, min_length):
"""
Trim reads in a readset based on a given adapter sequence.
If a read has an unbroken subsequence of length min_match bases or greater in common with the adapter sequence, trim all bases in the reads that are 3’ of the start of the adapter sequence match, including the first matching base.
Remove any read that is min_length bases long or shorter after trimming.
Assume min_match and min_length are positive integers.
Assume reads is list of Bio.SeqRecord.SeqRecord objects.
Assume adapter_seq is a Bio.SeqRecord.SeqRecord object.
If the readset list or the adapter_seq is empty, return None.
Return the new readset as a list of Bio.SeqRecord.SeqRecord objects.
"""
# YOUR CODE HERE
# checking the empty situation
if not reads or not adapter_seq:
return None
# trim starts from 3': trim starts from right
# if matching_sequence < min_match: no trimming
# else: trimming
# if left_over < min_length: remove read
# return the reads!!!
#raise NotImplementedError()
#return filtered_reads
# Test your function in this cell
illumina_adapter = SeqRecord.SeqRecord(Seq.Seq('GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGGATCGGAAGAGCACACG'), '', '', '')
demo_adapter = SeqRecord.SeqRecord(Seq.Seq('GATGAAAG'), '', '', '')
# should output a list with just one SeqRecord object containing the sequence ATGAAATGAATGTCTTGA
print(adapter_trim(demo_reads, demo_adapter, 7, 15))
# Test with longer min length
#print(len(adapter_trim(readset, illumina_adapter, 10, 50)))
[4, 3]
# --- AUTOGRADING CELL DO NOT EDIT ----
Question 1.3
(10 marks) Challenge: Write Python code to produce two histograms showing the distribution of read lengths in... - The original `comp90016_assignment_1.fastq.gz` readset. - The output readset after adapters have been removed from `comp90016_assignment_1.fastq.gz` using your function from part 1.2. - Trim reads with settings: `adapter_seq=illumina_adapter, min_match=10, min_length=50` You may show both distributions on the same set of axes, or two completely separate histograms. The plots should be produced inline, in the Jupyter notebook. Choose an appropriate bin-width. Label your axes appropriately (include units). Use labels to make it clear which distribution has been pre-processed. You may use the readset list of Bio.SeqRecord.SeqRecord objects as input. Your code does not have to include a function.# Here's some Jupyter magic to render plots in the notebook
%matplotlib inline
# You may want to import some additional packages for building and formatting your graphs (non-essential)
# Un-comment as reqired
#import numpy as np
#import matplotlib.pyplot as plt
#import seaborn
#from collections import Counter
# GRADED CELL 1.3 (10 marks, max 1 min run-time)
# Use this cell to make your histograms.
# YOUR CODE HERE
raise NotImplementedError()
# Set the variables below to report total read count and mean read length before and after trimming.
# Stats for reads
#length_stats = None
# Stats for trimmed reads
#trimmed_reads = None
#trimmed_length_stats = None
print(f'Mean read length before adapter trimming : {length_stats[1]:.3f}')
print(f'Mean read length after adapter trimming : {trimmed_length_stats[1]:.3f}')
print(f'Number of reads in the readset before adapter trimming : {len(readset)}')
print(f'Number of reads in the readset after adapter trimming : {len(trimmed_reads)}')
Question 1.4
(5 marks, max 50 words) In words, compare the two histograms and the two mean read lengths from before and after adapter removal.-- GRADED CELL (5 marks) - complete this cell --
YOUR ANSWER HERE
Question 1.5
(10 marks, max 100 words) Describe factors should impact your choice of min_match and min_length when using your adapter removal function? Describe two possible consequences of choosing inappropriate values for these parameters?
-- GRADED CELL (10 marks) - complete this cell --
You answer here
-
The size of both read sequences and the adapter would affect our choice.
The quality of selected dataset and the choice of an optimal threshold are also influencing factors.
In addition, genetic code may affact results when translating specific queries.
-
An unexpected reduction in the number of reads may occur as a result, since some reads fail to pass minimum quality criteria during trimming.
Besides, reads maybe substantially shortened and thus the reduction in information may introduce some bias in the following comparison.
Question 1.6
(10 marks, max 150 words) Sequence data can be stored and shared in a variety of file formats. Describe the main differences between FASTA, FASTQ and GenBank file formats. For each of these three file formats, give an example of the type of data that might be appropriately stored in this format.-- GRADED CELL (10 marks) - complete this cell --
YOUR ANSWER HERE
-
FASTA is a text-based file format, which contains a single-line header and sequences. Sequences could have numbers and spaces within, and there are no additional columns or blank lines in a such file. For instance, this format could be used to store sequence information of amino acids.
-
FASTQ is also a text-based file format, while it can store not only sequences but corresponding quality values. Besides, there is always a sequence identifier and spacer for formatting in FASTQ. Nucleotide sequences would be an example of this file storage.
-
Finally, GenBank is a flat format, storing more types of sequence information and additional information that is more than FASTA and FASTQ. For example, GenBank could store DNA sequence.
Part 2: Sequence motifs
Sequence motifs are short, recurring patterns in nucleic-acid sequences. Many are involved in important biological functions.
Setup
# Set up two DNA sequences to test your code on. Do not change these sequences.
linear_seq = Seq.Seq('TTACAGTGATTATGAAAACTTTGCGGGGCATGGCTACGACTTGTTCAGCCACGTCCGAGGGCAGAAACCTCGAGGGGTTTGTATGTTCAGCTATCTTCTACCCATCCCCGGAGGTTAAGTACGAGGGGAGATGCGGAAGAGGCTCTCGATCATCCCGTGGGACATCAACCTTTCCCTTGATAAAGCACCCCGCTCGGGTA')
circular_seq = Seq.Seq('TGGCAGAGAGAACGCCTTCTGAATTGTGCTATCCTTCGACCTTATCAAAGCTTGCTACCAATAATTAGGATTATTGCCTTGCGACAGACCTCCTACTCACACTGCCTCACATTGAGCTAGTCAGTGAGCGATTAGCTTGACCCGCTCTCTAGGGTCGCGAGTACGTGAGCTAGGGCTCCGGACTGGGCTATATAGTCGAG')
# Set up a dictionary of sequence motifs. Do not change this dictionary.
motif_dict = {'motif_a': Seq.Seq('TACAGTG'),
'motif_b': Seq.Seq('AGCTTGCT'),
'motif_c': Seq.Seq('ATATATAC'),
'motif_d': Seq.Seq('CGAGGGG'),
'motif_e': Seq.Seq('CGAGTG')}
Questions
Question 2.1
(10 marks) Challenge: Write a Python function to count the number of times sequence motifs are present in a DNA sequence. Assume motifs is a dictionary with motif names as strings for keys and Bio.Seq.Seq objects for values. Assume seq is a Bio.Seq.Seq object. Return a pandas DataFrame with each motif represented as a row. The first column should contain motif names as strings, the second column should contain integer counts of exact matches. Overlapping motifs should be considered. The column names should be “Motif” and “Counts”. If either seq or motifs are empty, return None.# GRADED CELL Question 2.1 (10 marks, max 1 min run-time)
import pandas as pd
def motif_count(seq, motifs):
"""
Count the number of times sequence motifs are present in a DNA sequence.
Overlapping motifs should be considered.
Assume motifs is a dictionary with motif names as strings for keys and Bio.Seq.Seq objects for values.
Assume seq is an Bio.Seq.Seq object.
Return a pandas DataFrame with each motif represented as a row.
The first column should contain motif names as strings, the second column should contain integer counts of exact matches.
The column names should be “Motif” and “Counts”, in that order.
If either seq or motifs are empty, return None.
"""
# YOUR CODE HERE
# checking the empty situation
if not seq or motifs == {}:
return None
#print(seq) -> GTTGGATTCATGAAAGA
#print(motifs.keys()) -> 'motif_a', 'motif_b'
#print(motifs.values())
motif_list = []
count_list = []
for motif_seq in motifs.values():
# overlapping situation
count = sum(1 for i in range(len(seq)) if seq.startswith(motif_seq, i))
count_list.append(count)
if motif_seq in seq:
#print("found!")
motif_list= motifs.keys()
# creating a DataFrame
df = pd.DataFrame(list(zip(motif_list, count_list)), columns =['Motif', 'Counts'])
return df
#raise NotImplementedError()
# ~~ Test your function in this cell ~~
demo_seq = Seq.Seq('GTTGGATTCATGAAAGA')
demo_motifs = {'motif_a': Seq.Seq('GTTG'), 'motif_b': Seq.Seq('AAAA')}
# Should output a pandas dataframe with 2 rows and 2 columns. Row 1: motif_a, 1. Row 2: motif_b, 0
'''
Motif Counts
0 motif_a 1
1 motif_b 0
'''
print(motif_count(demo_seq, demo_motifs))
# Test on linear seq
print(motif_count(linear_seq, motif_dict))
# --- AUTOGRADING CELL DO NOT EDIT ----
Question 2.2
(15 marks) A colleague is working on sequence motifs in circular DNA molecules. Circular DNA molecules can also be stored in FASTQ files and Bio.Seq.Seq objects but care has to be taken when programming so that the final base is treated as though it is adjacent to the first base. You decide to help your colleague by enhancing your function from 2.1 Challenge: Write a Python function to count the number of times sequence motifs are present in a circular DNA sequence. The function should count the number of exact matches and the number of near misses. A near miss is defined as a sequence that is the same length a sequence motif but has a single base mismatch. Assume motifs is a dictionary with motif names as strings for keys and Bio.Seq.Seq objects for values. Assume seq_circular is an Bio.Seq.Seq object. Return a pandas DataFrame with each motif represented as a row. The first column should contain motif names as strings, the second column should contain integer counts of exact matches, the third column should contain integer counts of near misses. The column names should be “Motif”, “Match_counts” and "Near_miss_counts". If either seq_circular or motifs are empty, return None.# Import additional library
from collections import Counter
# Helper function
# Use this to identify near miss instances of a reference motif i.e. containing a single base mismatch
def hamming_dist(s1, s2):
"""
A helper function to calculate the hamming distance between two sequences.
Assume the input sequences are strings.
They should be of equal length in order to calculate hamming distance
"""
assert len(s1) == len(s2)
return sum(c1 != c2 for c1, c2 in zip(s1, s2))
# GRADED CELL Question 2.2 (15 marks, max 1 min run-time)
def motif_count_circular(seq, motifs):
"""
Count the number of times sequence motifs are present in a circular DNA sequence.
Counts the number of exact matches and the number of near misses.
A near miss is defined as a sequence that is the same length a sequence motif but has a single base mismatch.
Assume motifs is a dictionary with motif names as strings for keys and Bio.Seq.Seq objects for values.
Assume seq_circular is an Bio.Seq.Seq object.
Return a pandas DataFrame with each motif represented as a row.
The first column should contain motif names as strings, the second column should contain integer counts of exact matches, the third column should contain integer counts of near misses.
The column names should be “Motif”, “Match_counts” and "Near_miss_counts".
If either seq_circular or motifs are empty, return None.
"""
# YOUR CODE HERE
raise NotImplementedError()
# Test your function in this cell
# Should output a pandas dataframe with 2 rows and 3 columns that looks like this:
'''
Motif Match_counts Near_miss_counts
0 motif_a 1 0
1 motif_b 0 3
'''
print(motif_count_circular(demo_seq, demo_motifs))
# Consider function behaviour on a circular sequence
print(motif_count_circular(circular_seq, motif_dict))
# --- AUTOGRADING CELL DO NOT EDIT ----
Part 3: Codon bias
Setup
The frequency of occurrence of synonymous codons in coding DNA differs between species. Some codons that encode a particular amino acid are common, while other are rare. This is referred to as codon bias.
Questions
Question 3.1
(10 marks) Challenge: Write a Python function to calculate the codon usage frequencies from a set of DNA coding sequences. Assume coding_seqs is a list of Bio.Seq.Seq objects. Assume the DNA sequences begin with a start codon, end with a stop codon and do not contain any introns. Return a 2D dictionary, where the keys are one-letter amino-acid strings and the values are dictionaries containing codon frequencies. The inner dictionaries must have DNA codon strings as keys and codon usage frequencies as values (floating-point numbers between 0 and 1). The codon usage frequency for a codon is the total number of occurrences of that codon divided by the total number of codons that code for the same amino acid. If coding_seqs is of length 0, return None.# GRADED CELL Question 3.1 (10 marks, max 1 min run-time)
from collections import Counter
from collections import defaultdict
def codon_usage(coding_seqs):
"""
Calculate the codon usage frequencies from a set of DNA coding sequences.
Assume coding_seqs is a list of Bio.Seq.Seq objects.
Assume the DNA sequences begin with a start codon, end with a stop codon and do not contain any introns.
Return a 2D dictionary, where the keys are one-letter amino-acid strings and the values are dictionaries containing codon frequencies.
The inner dictionaries must have DNA codon strings as keys and codon usage frequencies as values (floating-point numbers between 0 and 1).
The codon usage frequency for a codon is the total number of occurrences of that codon divided by the total number of codons that code for the same amino acid.
If coding_seqs is of length 0, return None.
"""
# YOUR CODE HERE
raise NotImplementedError()
# ~~ Test your function in this cell ~~
demo_seqs = [Seq.Seq('ATGTCGTAA'), Seq.Seq('ATGTCCAAATAG')]
sequence_a = Seq.Seq('ATGGCTGAAGCCGCATCCCCAGCTTTTATAGAGTATCTCCCACGATCTGACCCGTTGCTCTGTATTATACACAAGGTTGGAGTCGGATGTGAGTCTTCTCACCGGAGACCCAAGACATAG')
sequence_b = Seq.Seq('ATGGTTCTCCTTTGGATCTTATTCGGCAAAAGCATCGCGGCGTCACTAGCTAGTTACGTTTTGAGGACGCTCGCAGATTCTGCCAACCAATCTTATACGAAAATACGGAGGCACGCGTAA')
sequence_c = Seq.Seq('ATGCTAGTGATCCCGTCGTGGGCTAGAGAGCGGCGAGAGTGGGGTTTGGAAATTCGCGACATTGAGCCAACTATGGCTAATGCCATGGGCGGATTATGGGGGTCACTCAGAACTGTATAA')
sequence_d = Seq.Seq('ATGCTACTTACCAAGAGATATCTTCTTAATACACTGATCCATAACGTCATTTCGCGGCTGAGATGGTGGGTGCATGAGCAAGTAATTGTGACTCCGCGGGTTTCGCCAAAGCAAACCTAA')
seqs = [sequence_a, sequence_b, sequence_c, sequence_d]
# Should return {'M':{'ATG':1.0}, 'S':{'TCG':0.5, 'TCC':0.5}, 'K':{'AAA':1.0}, '*':{'TAA':0.5, 'TAG':0.5}}
print(codon_usage(demo_seqs))
#print(codon_usage(seqs))
# --- AUTOGRADING CELL DO NOT EDIT ----
-- GRADED CELL (10 marks) - complete this cell --
YOUR ANSWER HERE
Condon bias indicates the difference in frequency due to the preferential use of synonymous codons, which could influence translation rate and accuracy.
Since The RNA viruses utilize RNA as genetic material, they would be then coded for different viral proteins. The molecular functions genes and cells are then accordingly influenced by base combinations.
END OF ASSIGNMENT
Submitting
Before you turn this assignment in, make sure everything runs as expected. First, restart the kernel (in the menubar, select Kernel$\rightarrow$Restart) and then run all cells (in the menubar, select Cell$\rightarrow$Run All).
Make sure you have filled in any place that says YOUR CODE HERE
or "YOUR ANSWER HERE"
Your completed notebook file containing all your answers must be turned in via LMS in .ipynb
format.
You must also submit a copy of this notebook in html
format with the output cleared.
You can do this by using the clear all output
option in the menu.
Your submission should include only two files with names formatted as: Assignment_1.ipynb and Assignment_1.html