Omics Data Analysis, Assignment 4
Assignment 4
Domain identification & Protein coding gene prediction
Deadline: Nov. 14 23:59
There are 15 points total (100% is 12 pts, with 3 bonus points on Q3).
1. [3pt] Suppose we have just got an assembled yeast genome sequence from sequencing project, and we want to predict all protein coding genes using Augustus. The genome sequence is available in:
/mnt/scratch/yling/Genomics/Hw04/sce_genome.fa
• Create a shell script (yourlastname_augustus.sh) which noted your commands to run Augustus on sce_genome.fa.
• Use the o
o o o o o o
following parameters:
Search on both strand
Predict only complete genes
Predict gene independently on each strand
Provide the path to a proper config directory
Output file should be gff3 format
Use saccharomyces_cerevisiae_S288C as the species_
Set the output name as yourname_sce_genome_augustus.out
-
Because each augustus process will take about 40 minutes to get the result. So I suggest each group just do it at most once. Everyone should just hand in yourname_augustus.sh to tell me how you get this result.
-
Before running augustus, remember to use top to check if there are someone large programs still running (Like hmmscan, augustus etc.). If so, wait until they are done. Or you should use nohup and & in your command.
2. [3pt] Next we will determine what kinds of protein domains can be found in these predicted protein sequences. We will use hmmscan to scan the predicted proteins using all the Pfam protein domain Hidden Markov Models.
• In order to do this, we have to extract the predicted protein sequences from yourname_sce_genome_augustus.out first. I have parsed the Augustus output file and get all the protein coding genes out. Link it to your own directory:
/mnt/scratch/yling/Genomics/Hw04/Ling_sce_genome_augustus_out.fa
All the sequence in the file just has >gxx as gene label, no more annotation information.
-
Also, use ln to link the directory where Pfam-A stored in your own directory, so you can use Pfam-A.hmm directly:
ln –s /mnt/scratch/yling/OmicsDataAnalysis/ ./Pfam
-
Run hmmscan with the following options:
1|Page
Omics Data Analysis, Assignment 4
o Use Pfam-A.hmm as the database
o UseLing_sce_genome_augustus_out.faassequencefile
o Don’t output alignments [hint:--noali]
o Save table of per-sequence hits to file and call that file
yourname_sce_augustsus.pfam. [Hint: --tblout]
o Use HMM profile's trusted cutoffs (TCs) to set all thresholding [Hint: --cut_tc]
• Also, it will take a long time (about 85 minutes) to run this program. So, you can skip the real running procedure, just write down and hand it the script (yourname_hmmscan.sh) on how to do it.
-
[3pt bonus] Note that for Question 2 we need to extract sequences out of the Augustus output. Write a program that:
-
Take yourname_sce_genome_augustus.out as input
-
Generate an output that looks like Ling_sce_genome_augustus.out.fa.
-
Call this program yourname_get_gff_seq.py and submit it.
-
-
[3pt] Create a program called find_pfam_annotated_gene.py to get the gene sequence of those predicted proteins which matches Pfam models. Save those sequences in a file named pfam_annotated_gene.fa. And, report (print out) the %Pfam annotation. (annotated genes/all predicted genes). [Hint: In yling_sce_augustus.pfam, target name(Pfam domain name), length of target name: 20; accession (Pfam ID), length of accession: 10, query name (gene ID, g1,g2,gx....), length of query name: 20]
• The hmmscan result yling_sce_augustus.pfam could be use as one input. (I removed the command lines information at the end of this file). It is available in:
/mnt/scratch/yling/Genomics/Hw04/yling_sce_augustus.pfam
• Ling_sce_gene_augustus_out.fa in the same directory could be used as another input for Q4.
5. [3pt] Create a program called finding_Pfam_abundancy.py to extract the information of how many proteins each domain hits, write the information into a file named Yeast_Pfam_domain_counts, which has 3 columns information (delimited by tab):
Domain name Pfam_ID Gene counts (one count per protein)
Besides Yeast_Pfam_domain_counts, you must report (print out) which domain is the most
abundant one in predicted proteins of Saccharomyces cerevisiae.
Zip 5 files below in a package named yourname_ODA_Assignment4, and hand it over:
yourname_augustus.sh
yourname_hmmscan.sh
yourname_get_gff_seq.py
find_pfam_annotated_gene.py finding_Pfam_abundancy.py
2|Page