COMP90016 Computational Genomics - Assignment 1: Working with short reads

Engage in a Conversation

COMP90016 - Assignment 1

Version 1 Last edited 13/3/2023 CourseNana.COM

Semester 1, 2023

NAME = ""

ID = ""

Marking

Cells that must be completed to receive marks are clearly labelled. Some cells are code cells, in which you must complete the code to solve a problem. Others are markdown cells, in which you must write your answers to short-answer questions. CourseNana.COM

Cells that must be completed to receive marks are labelled like this: CourseNana.COM

# -- GRADED CELL (1 mark) - complete this cell -- CourseNana.COM

Some graded cells are code cells, in which you must complete the code to solve a problem. Other graded cells are markdown cells, in which you must write your answers to short-answer questions. CourseNana.COM

You will see the following text in graded code cells: CourseNana.COM

# YOUR CODE HERE
raise NotImplementedError()

You must remove the raise NotImplementedError() line from the cell, and replace it with your solution. CourseNana.COM

Only add answers to graded cells. If you want to import a library or use a helper function, this must be included in a graded cell. CourseNana.COM

Only graded cells will be marked. Don't make changes outside graded cells, and don't add or remove cells from the notebook. CourseNana.COM

Word limits, where stated, will be strictly enforced. Answers exceeding the limit will not be marked. CourseNana.COM

Run-time limits will be imposed for each coding question. The run-time of a code cell can be calculated by including %time at the top of your cell. Cells exceeding the run-time limit will not be marked. The run-time limits only apply to test cases that are included in this document. CourseNana.COM

No marks are allocated to commenting in your code. We do however, encourage efficient and well-commented code. CourseNana.COM

The total marks for the assignment add up to 100, and it will be worth 10% of your overall subject grade. CourseNana.COM

Part 1: 55 marks CourseNana.COM

Part 2: 25 marks CourseNana.COM

Part 3: 20 marks CourseNana.COM

Submitting

Before you turn this assignment in, make sure everything runs as expected. First, restart the kernel (in the menubar, select Kernel$\rightarrow$Restart) and then run all cells (in the menubar, select Cell$\rightarrow$Run All). CourseNana.COM

Make sure you fill in any place that says YOUR CODE HERE or "YOUR ANSWER HERE", as well as your name and student ID number below: CourseNana.COM

Overview

In this assignment, you will answer questions about working with short reads, sequence motifs and codon bias. CourseNana.COM

You will use the biopython library in your functions. You may want to refer to sections of the biopython documentation for additional help (https://biopython.org/wiki/Documentation). Additional to biopython and standard Python 3 functions and methods, you may also use any other library we have used in Computational Genomics including collections, numpy, pandas, math, itertools, seaborn and matplotlib. CourseNana.COM

Part 1: Working with short reads

Setup

import os
import requests
from IPython.core.display import HTML

# Handy function to fetch our datafile
def fetch_file(url): 
    response = requests.get(url)
    if response.status_code == 200:
        print('File found!')
        filename = os.path.basename(url).split('?', 1)[0]
        with open(filename, 'wb') as f:
            f.write(response.content)
            f.close()
        print(f'Saved to: {filename}')
    else:
        print('File not found')

# Make the notebook pretty
HTML(requests.get('https://raw.githubusercontent.com/melbournebioinformatics/COMP90016/main/data/2023/style/custom.css').text)

# Fetch assignment data
url = 'https://github.com/melbournebioinformatics/COMP90016/blob/main/data/2023/Assignment_01/data/comp90016_assignment_1.fastq.gz?raw=true'
fetch_file(url)

First, we read in a read set from the comp90016_assignment_1.fastq.gz file. Note that converting a readset into a list of biopython objects makes it easier to handle. CourseNana.COM

import gzip
from Bio import SeqIO, SeqRecord, Seq

fname = 'comp90016_assignment_1.fastq.gz'

# Our fastq file is compressed using gzip. 
# We must open it before SeqIO can read the contents
with gzip.open(fname, "rt") as handle:
    readset = list(SeqIO.parse(handle, "fastq"))

Questions

In the cells below, complete the following tasks: CourseNana.COM

Question 1.1

(5 marks) Challenge: Write a Python function to compute the following stats for an input readset: - The minimum read length - The mean read length and - The maximum read length Input: a list of Bio.SeqRecord.SeqRecord objects. Output: Return a `list` with the minimum read length as an integer, the mean read length as a floating-point number and the maximum read length as in integer. The elements of the output list should be in the order: minimum read length, mean read length, maximum read length. If the input list is empty, return None.

# GRADED CELL 1.1 (5 marks, max 1 min run-time)

def read_lengths(reads):
    """
    Calculate minimum, mean and maximum read lengths in a readset. 
    Assume reads is a list of Bio.SeqRecord.SeqRecord objects containing DNA sequences. 
    Return a list with the minimum read length as an integer, the mean read length as a floating-point number and the maximum read length as in integer.
    The output list should be in the order: minimum read length, mean read length, maximum read length.
    If the input list is empty, return None.
    """

    # YOUR CODE HERE

    # creating an empty list to store results 
    length_list = []

    # checking the empty situation
    if not reads:
        return None

    # obtaining the minimum & the maximum read length
    read_min = min(reads, key = len)
    read_max = max(reads, key = len)
    minimum = len(read_min)
    maximum = len(read_max)

    # calculating the mean read length
    total_len = 0
    for element in reads:
        total_len += len(element)
        avg = float(total_len) / float(len(reads))

    # inserting
    length_list.extend([minimum, avg, maximum])
    return length_list

    #raise NotImplementedError()

# Test your function in this cell
# First we will create a list of dummy reads
demo_reads = [SeqRecord.SeqRecord(Seq.Seq('GTTGGATTCATGAAAGA'), 'ERR024571.2', '', ''),
             SeqRecord.SeqRecord(Seq.Seq('ATGAAATGAATGTCTTGA'), 'ERR024571.2', '', '')]

print(read_lengths(demo_reads)) # should output [17, 17.5, 18]

print(read_lengths(readset))

[17, 17.5, 18]
[56, 75.445, 76]

# --- AUTOGRADING CELL DO NOT EDIT ----

Question 1.2

(15 marks) Suppose quality control metrics indicated that there were unacceptable levels of adapter contamination in the reads. The sequence of the adapter has been determined and it matches an adapter sequence used in the sequencing. Challenge: Write a Python function that performs the following tasks and returns the new readset as list of SeqRecord objects. - If a read has an unbroken subsequence of length min_match bases or greater in common with the adapter sequence, trim all bases in the reads that are 3’ of the start of the adapter sequence match, including the first matching base. - Remove any read that is shorter than min_length bases long after trimming. Assumptions: - Assume min_match and min_length are positive integers. - Assume reads is list of Bio.SeqRecord.SeqRecord objects, each object contains a sequence that is a Bio.Seq.Seq object. - Assume adapter_seq is a Bio.SeqRecord.SeqRecord object. If the readset list or the adapter_seq is empty, return None.

# GRADED CELL 1.2 (15 marks, max 1 min run-time)

def adapter_trim(reads, adapter_seq, min_match, min_length):
    """
    Trim reads in a readset based on a given adapter sequence.
    If a read has an unbroken subsequence of length min_match bases or greater in common with the adapter sequence, trim all bases in the reads that are 3’ of the start of the adapter sequence match, including the first matching base. 
    Remove any read that is min_length bases long or shorter after trimming.
    Assume min_match and min_length are positive integers.
    Assume reads is list of Bio.SeqRecord.SeqRecord objects.
    Assume adapter_seq is a Bio.SeqRecord.SeqRecord object.
    If the readset list or the adapter_seq is empty, return None.
    Return the new readset as a list of Bio.SeqRecord.SeqRecord objects.
    """

    # YOUR CODE HERE

    # checking the empty situation
    if not reads or not adapter_seq:
        return None

    # trim starts from 3': trim starts from right 

    # if matching_sequence < min_match: no trimming
    # else: trimming

    # if left_over < min_length: remove read

    # return the reads!!!

    #raise NotImplementedError()

    #return filtered_reads

# Test your function in this cell
illumina_adapter = SeqRecord.SeqRecord(Seq.Seq('GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGGATCGGAAGAGCACACG'), '', '', '')
demo_adapter = SeqRecord.SeqRecord(Seq.Seq('GATGAAAG'), '', '', '')

# should output a list with just one SeqRecord object containing the sequence ATGAAATGAATGTCTTGA
print(adapter_trim(demo_reads, demo_adapter, 7, 15)) 

# Test with longer min length
#print(len(adapter_trim(readset, illumina_adapter, 10, 50)))

[4, 3]

# --- AUTOGRADING CELL DO NOT EDIT ----

Question 1.3

(10 marks) Challenge: Write Python code to produce two histograms showing the distribution of read lengths in... - The original `comp90016_assignment_1.fastq.gz` readset. - The output readset after adapters have been removed from `comp90016_assignment_1.fastq.gz` using your function from part 1.2. - Trim reads with settings: `adapter_seq=illumina_adapter, min_match=10, min_length=50` You may show both distributions on the same set of axes, or two completely separate histograms. The plots should be produced inline, in the Jupyter notebook. Choose an appropriate bin-width. Label your axes appropriately (include units). Use labels to make it clear which distribution has been pre-processed. You may use the readset list of Bio.SeqRecord.SeqRecord objects as input. Your code does not have to include a function.

# Here's some Jupyter magic to render plots in the notebook
%matplotlib inline

# You may want to import some additional packages for building and formatting your graphs (non-essential)
# Un-comment as reqired
#import numpy as np
#import matplotlib.pyplot as plt
#import seaborn
#from collections import Counter

# GRADED CELL 1.3 (10 marks, max 1 min run-time)

# Use this cell to make your histograms.

# YOUR CODE HERE
raise NotImplementedError()

# Set the variables below to report total read count and mean read length before and after trimming.

# Stats for reads
#length_stats = None 

# Stats for trimmed reads
#trimmed_reads = None
#trimmed_length_stats = None

print(f'Mean read length before adapter trimming : {length_stats[1]:.3f}')
print(f'Mean read length after adapter trimming : {trimmed_length_stats[1]:.3f}')

print(f'Number of reads in the readset before adapter trimming : {len(readset)}')
print(f'Number of reads in the readset after adapter trimming : {len(trimmed_reads)}')

Question 1.4

(5 marks, max 50 words) In words, compare the two histograms and the two mean read lengths from before and after adapter removal.

-- GRADED CELL (5 marks) - complete this cell --

YOUR ANSWER HERE CourseNana.COM

Question 1.5

(10 marks, max 100 words) Describe factors should impact your choice of min_match and min_length when using your adapter removal function? Describe two possible consequences of choosing inappropriate values for these parameters?

CourseNana.COM

-- GRADED CELL (10 marks) - complete this cell --

You answer here CourseNana.COM

The size of both read sequences and the adapter would affect our choice. CourseNana.COM

The quality of selected dataset and the choice of an optimal threshold are also influencing factors. CourseNana.COM

In addition, genetic code may affact results when translating specific queries. CourseNana.COM
An unexpected reduction in the number of reads may occur as a result, since some reads fail to pass minimum quality criteria during trimming. CourseNana.COM

Besides, reads maybe substantially shortened and thus the reduction in information may introduce some bias in the following comparison. CourseNana.COM

Question 1.6

(10 marks, max 150 words) Sequence data can be stored and shared in a variety of file formats. Describe the main differences between FASTA, FASTQ and GenBank file formats. For each of these three file formats, give an example of the type of data that might be appropriately stored in this format.

-- GRADED CELL (10 marks) - complete this cell --

YOUR ANSWER HERE CourseNana.COM

FASTA is a text-based file format, which contains a single-line header and sequences. Sequences could have numbers and spaces within, and there are no additional columns or blank lines in a such file. For instance, this format could be used to store sequence information of amino acids. CourseNana.COM
FASTQ is also a text-based file format, while it can store not only sequences but corresponding quality values. Besides, there is always a sequence identifier and spacer for formatting in FASTQ. Nucleotide sequences would be an example of this file storage. CourseNana.COM
Finally, GenBank is a flat format, storing more types of sequence information and additional information that is more than FASTA and FASTQ. For example, GenBank could store DNA sequence. CourseNana.COM

Part 2: Sequence motifs

Sequence motifs are short, recurring patterns in nucleic-acid sequences. Many are involved in important biological functions. CourseNana.COM

Setup

# Set up two DNA sequences to test your code on. Do not change these sequences.
linear_seq = Seq.Seq('TTACAGTGATTATGAAAACTTTGCGGGGCATGGCTACGACTTGTTCAGCCACGTCCGAGGGCAGAAACCTCGAGGGGTTTGTATGTTCAGCTATCTTCTACCCATCCCCGGAGGTTAAGTACGAGGGGAGATGCGGAAGAGGCTCTCGATCATCCCGTGGGACATCAACCTTTCCCTTGATAAAGCACCCCGCTCGGGTA')
circular_seq = Seq.Seq('TGGCAGAGAGAACGCCTTCTGAATTGTGCTATCCTTCGACCTTATCAAAGCTTGCTACCAATAATTAGGATTATTGCCTTGCGACAGACCTCCTACTCACACTGCCTCACATTGAGCTAGTCAGTGAGCGATTAGCTTGACCCGCTCTCTAGGGTCGCGAGTACGTGAGCTAGGGCTCCGGACTGGGCTATATAGTCGAG')

# Set up a dictionary of sequence motifs. Do not change this dictionary.
motif_dict = {'motif_a': Seq.Seq('TACAGTG'), 
              'motif_b': Seq.Seq('AGCTTGCT'), 
              'motif_c': Seq.Seq('ATATATAC'), 
              'motif_d': Seq.Seq('CGAGGGG'), 
              'motif_e': Seq.Seq('CGAGTG')}

Questions

Question 2.1

(10 marks) Challenge: Write a Python function to count the number of times sequence motifs are present in a DNA sequence. Assume motifs is a dictionary with motif names as strings for keys and Bio.Seq.Seq objects for values. Assume seq is a Bio.Seq.Seq object. Return a pandas DataFrame with each motif represented as a row. The first column should contain motif names as strings, the second column should contain integer counts of exact matches. Overlapping motifs should be considered. The column names should be “Motif” and “Counts”. If either seq or motifs are empty, return None.

# GRADED CELL Question 2.1 (10 marks, max 1 min run-time)

import pandas as pd

def motif_count(seq, motifs):
    """
    Count the number of times sequence motifs are present in a DNA sequence. 
    Overlapping motifs should be considered.
    Assume motifs is a dictionary with motif names as strings for keys and Bio.Seq.Seq objects for values. 
    Assume seq is an Bio.Seq.Seq object. 
    Return a pandas DataFrame with each motif represented as a row. 
    The first column should contain motif names as strings, the second column should contain integer counts of exact matches. 
    The column names should be “Motif” and “Counts”, in that order. 
    If either seq or motifs are empty, return None.
    """

    # YOUR CODE HERE

    # checking the empty situation
    if not seq or motifs == {}:
        return None

    #print(seq) -> GTTGGATTCATGAAAGA
    #print(motifs.keys()) -> 'motif_a', 'motif_b'
    #print(motifs.values())

    motif_list = []
    count_list = []

    for motif_seq in motifs.values():
        # overlapping situation
        count = sum(1 for i in range(len(seq)) if seq.startswith(motif_seq, i)) 
        count_list.append(count)

        if motif_seq in seq:
            #print("found!")
            motif_list= motifs.keys()

    # creating a DataFrame  
    df = pd.DataFrame(list(zip(motif_list, count_list)), columns =['Motif', 'Counts']) 
    return df

    #raise NotImplementedError()

# ~~ Test your function in this cell ~~

demo_seq = Seq.Seq('GTTGGATTCATGAAAGA')
demo_motifs = {'motif_a': Seq.Seq('GTTG'), 'motif_b': Seq.Seq('AAAA')}

# Should output a pandas dataframe with 2 rows and 2 columns. Row 1: motif_a, 1. Row 2: motif_b, 0
'''
     Motif  Counts
0  motif_a       1
1  motif_b       0
'''
print(motif_count(demo_seq, demo_motifs)) 

# Test on linear seq
print(motif_count(linear_seq, motif_dict))

# --- AUTOGRADING CELL DO NOT EDIT ----

Question 2.2

(15 marks) A colleague is working on sequence motifs in circular DNA molecules. Circular DNA molecules can also be stored in FASTQ files and Bio.Seq.Seq objects but care has to be taken when programming so that the final base is treated as though it is adjacent to the first base. You decide to help your colleague by enhancing your function from 2.1 Challenge: Write a Python function to count the number of times sequence motifs are present in a circular DNA sequence. The function should count the number of exact matches and the number of near misses. A near miss is defined as a sequence that is the same length a sequence motif but has a single base mismatch. Assume motifs is a dictionary with motif names as strings for keys and Bio.Seq.Seq objects for values. Assume seq_circular is an Bio.Seq.Seq object. Return a pandas DataFrame with each motif represented as a row. The first column should contain motif names as strings, the second column should contain integer counts of exact matches, the third column should contain integer counts of near misses. The column names should be “Motif”, “Match_counts” and "Near_miss_counts". If either seq_circular or motifs are empty, return None.

# Import additional library
from collections import Counter 

# Helper function
# Use this to identify near miss instances of a reference motif i.e. containing a single base mismatch
def hamming_dist(s1, s2):
    """
    A helper function to calculate the hamming distance between two sequences.
    Assume the input sequences are strings.
    They should be of equal length in order to calculate hamming distance
    """
    assert len(s1) == len(s2)
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

# GRADED CELL Question 2.2 (15 marks, max 1 min run-time)

def motif_count_circular(seq, motifs):
    """
    Count the number of times sequence motifs are present in a circular DNA sequence. 
    Counts the number of exact matches and the number of near misses. 
    A near miss is defined as a sequence that is the same length a sequence motif but has a single base mismatch. 
    Assume motifs is a dictionary with motif names as strings for keys and Bio.Seq.Seq objects for values. 
    Assume seq_circular is an Bio.Seq.Seq object. 
    Return a pandas DataFrame with each motif represented as a row. 
    The first column should contain motif names as strings, the second column should contain integer counts of exact matches, the third column should contain integer counts of near misses.
    The column names should be “Motif”, “Match_counts” and "Near_miss_counts".
    If either seq_circular or motifs are empty, return None.
    """

    # YOUR CODE HERE

    raise NotImplementedError()

# Test your function in this cell
# Should output a pandas dataframe with 2 rows and 3 columns that looks like this:
'''
     Motif  Match_counts  Near_miss_counts
0  motif_a             1                 0
1  motif_b             0                 3
'''
print(motif_count_circular(demo_seq, demo_motifs)) 

# Consider function behaviour on a circular sequence
print(motif_count_circular(circular_seq, motif_dict))

# --- AUTOGRADING CELL DO NOT EDIT ----

Part 3: Codon bias

Setup

The frequency of occurrence of synonymous codons in coding DNA differs between species. Some codons that encode a particular amino acid are common, while other are rare. This is referred to as codon bias. CourseNana.COM

Questions

Question 3.1

(10 marks) Challenge: Write a Python function to calculate the codon usage frequencies from a set of DNA coding sequences. Assume coding_seqs is a list of Bio.Seq.Seq objects. Assume the DNA sequences begin with a start codon, end with a stop codon and do not contain any introns. Return a 2D dictionary, where the keys are one-letter amino-acid strings and the values are dictionaries containing codon frequencies. The inner dictionaries must have DNA codon strings as keys and codon usage frequencies as values (floating-point numbers between 0 and 1). The codon usage frequency for a codon is the total number of occurrences of that codon divided by the total number of codons that code for the same amino acid. If coding_seqs is of length 0, return None.

# GRADED CELL Question 3.1 (10 marks, max 1 min run-time)

from collections import Counter
from collections import defaultdict

def codon_usage(coding_seqs):
    """
    Calculate the codon usage frequencies from a set of DNA coding sequences. 
    Assume coding_seqs is a list of Bio.Seq.Seq objects. 
    Assume the DNA sequences begin with a start codon, end with a stop codon and do not contain any introns. 
    Return a 2D dictionary, where the keys are one-letter amino-acid strings and the values are dictionaries containing codon frequencies. 
    The inner dictionaries must have DNA codon strings as keys and codon usage frequencies as values (floating-point numbers between 0 and 1). 
    The codon usage frequency for a codon is the total number of occurrences of that codon divided by the total number of codons that code for the same amino acid. 
    If coding_seqs is of length 0, return None.
    """

   # YOUR CODE HERE

    raise NotImplementedError()

# ~~ Test your function in this cell ~~
demo_seqs = [Seq.Seq('ATGTCGTAA'), Seq.Seq('ATGTCCAAATAG')]

sequence_a = Seq.Seq('ATGGCTGAAGCCGCATCCCCAGCTTTTATAGAGTATCTCCCACGATCTGACCCGTTGCTCTGTATTATACACAAGGTTGGAGTCGGATGTGAGTCTTCTCACCGGAGACCCAAGACATAG')
sequence_b = Seq.Seq('ATGGTTCTCCTTTGGATCTTATTCGGCAAAAGCATCGCGGCGTCACTAGCTAGTTACGTTTTGAGGACGCTCGCAGATTCTGCCAACCAATCTTATACGAAAATACGGAGGCACGCGTAA')
sequence_c = Seq.Seq('ATGCTAGTGATCCCGTCGTGGGCTAGAGAGCGGCGAGAGTGGGGTTTGGAAATTCGCGACATTGAGCCAACTATGGCTAATGCCATGGGCGGATTATGGGGGTCACTCAGAACTGTATAA')
sequence_d = Seq.Seq('ATGCTACTTACCAAGAGATATCTTCTTAATACACTGATCCATAACGTCATTTCGCGGCTGAGATGGTGGGTGCATGAGCAAGTAATTGTGACTCCGCGGGTTTCGCCAAAGCAAACCTAA')
seqs = [sequence_a, sequence_b, sequence_c, sequence_d]

# Should return {'M':{'ATG':1.0}, 'S':{'TCG':0.5, 'TCC':0.5}, 'K':{'AAA':1.0}, '*':{'TAA':0.5, 'TAG':0.5}}
print(codon_usage(demo_seqs)) 

#print(codon_usage(seqs))

# --- AUTOGRADING CELL DO NOT EDIT ----

Question 3.2 (10 marks, max 100 words) Viruses infect cells and use the molecular machinery of their host cell to replicate themselves. Explain why codon bias is an important factor in the potential transmission of RNA viruses between different species.

-- GRADED CELL (10 marks) - complete this cell --

YOUR ANSWER HERE CourseNana.COM

Condon bias indicates the difference in frequency due to the preferential use of synonymous codons, which could influence translation rate and accuracy. CourseNana.COM

Since The RNA viruses utilize RNA as genetic material, they would be then coded for different viral proteins. The molecular functions genes and cells are then accordingly influenced by base combinations. CourseNana.COM

END OF ASSIGNMENT

Submitting

Make sure you have filled in any place that says YOUR CODE HERE or "YOUR ANSWER HERE" CourseNana.COM

Your completed notebook file containing all your answers must be turned in via LMS in .ipynb format. You must also submit a copy of this notebook in html format with the output cleared. You can do this by using the clear all output option in the menu. CourseNana.COM

Your submission should include only two files with names formatted as: Assignment_1.ipynb and Assignment_1.html CourseNana.COM

COMP90016 Computational Genomics - Assignment 1: Working with short reads

COMP90016 - Assignment 1

Semester 1, 2023

Marking

Submitting

Overview

Part 1: Working with short reads

Setup

Questions

Question 1.1

Question 1.2

Question 1.3

Question 1.4

-- GRADED CELL (5 marks) - complete this cell --

Question 1.5

-- GRADED CELL (10 marks) - complete this cell --

Question 1.6

-- GRADED CELL (10 marks) - complete this cell --

Part 2: Sequence motifs

Setup

Questions

Question 2.1

Question 2.2

Part 3: Codon bias

Setup

Questions

Question 3.1

-- GRADED CELL (10 marks) - complete this cell --

END OF ASSIGNMENT

Submitting

Get in Touch with Our Experts