
Omics Data Analysis, Assignment 4: Protein coding genes prediction


Assignment 4
Domain identification & Protein coding gene prediction

Deadline: Nov. 14 23:59

There are 15 points in total (100% is 12 points, plus 3 bonus points for Q3).

1. [3pt] Suppose we have just obtained an assembled yeast genome sequence from a sequencing project, and we want to predict all protein-coding genes using Augustus. The genome sequence is available at:

/mnt/scratch/yling/Genomics/Hw04/sce_genome.fa

Create a shell script (yourname_augustus.sh) that records the commands you use to run Augustus on sce_genome.fa.

Use the following parameters:

o Search on both strands
o Predict only complete genes
o Predict genes independently on each strand
o Provide the path to a proper config directory
o Output file should be in gff3 format
o Use saccharomyces_cerevisiae_S288C as the species
o Set the output name as yourname_sce_genome_augustus.out

  • Each Augustus run takes about 40 minutes to finish, so I suggest each group run it at most once. Everyone should hand in yourname_augustus.sh to show how the result was obtained.

  • Before running Augustus, remember to use top to check whether any large programs (such as hmmscan or augustus) are still running. If so, wait until they are done, or use nohup and & in your command.
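As a rough illustration of how the parameters above map onto Augustus command-line options, the sketch below assembles the command in Python and prints it as a single shell line. The option names are standard Augustus options; the config directory path is a placeholder assumption, and the helper name `augustus_command` is hypothetical.

```python
import shlex


def augustus_command(genome, config_dir, outfile):
    """Build an Augustus command line (as an argument list) for the listed parameters."""
    return [
        "augustus",
        "--strand=both",              # search on both strands
        "--genemodel=complete",       # predict only complete genes
        "--singlestrand=true",        # predict genes independently on each strand
        f"--AUGUSTUS_CONFIG_PATH={config_dir}",
        "--gff3=on",                  # gff3 output
        "--species=saccharomyces_cerevisiae_S288C",
        f"--outfile={outfile}",
        genome,
    ]


cmd = augustus_command(
    "/mnt/scratch/yling/Genomics/Hw04/sce_genome.fa",
    "/path/to/augustus/config",       # placeholder: replace with the real config directory
    "yourname_sce_genome_augustus.out",
)
print(shlex.join(cmd))
```

The printed line is what would go into the shell script. If other large jobs are running, it can be wrapped as `nohup ... &` so that it keeps running after logout.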

2. [3pt] Next we will determine what kinds of protein domains can be found in these predicted protein sequences. We will use hmmscan to scan the predicted proteins against all the Pfam protein domain hidden Markov models.

To do this, we first have to extract the predicted protein sequences from yourname_sce_genome_augustus.out. I have parsed the Augustus output file and extracted all the predicted protein-coding genes. Link the resulting file to your own directory:

/mnt/scratch/yling/Genomics/Hw04/Ling_sce_genome_augustus_out.fa

Every sequence in the file has only >gxx as its gene label, with no further annotation.

  • Also, use ln to link the directory where Pfam-A is stored into your own directory, so you can use Pfam-A.hmm directly:

    ln -s /mnt/scratch/yling/OmicsDataAnalysis/ ./Pfam

  • Run hmmscan with the following options:

o Use Pfam-A.hmm as the database
o Use Ling_sce_genome_augustus_out.fa as the sequence file
o Don't output alignments [Hint: --noali]
o Save the table of per-sequence hits to a file named yourname_sce_augustus.pfam [Hint: --tblout]
o Use the HMM profiles' trusted cutoffs (TCs) to set all thresholding [Hint: --cut_tc]

Running this program also takes a long time (about 85 minutes), so you may skip the actual run. Just write the command in a script (yourname_hmmscan.sh) and hand it in.
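As a rough illustration, the hmmscan options above can be assembled the same way. In the sketch below the helper name `hmmscan_command` is hypothetical, and the location `./Pfam/Pfam-A.hmm` assumes that the linked directory contains Pfam-A.hmm at its top level.

```python
import shlex


def hmmscan_command(hmm_db, seq_file, tbl_out):
    """Build an hmmscan command line (as an argument list) for the listed options."""
    return [
        "hmmscan",
        "--noali",             # don't output alignments
        "--tblout", tbl_out,   # save the per-sequence hit table
        "--cut_tc",            # use the profiles' trusted cutoffs
        hmm_db,                # database comes first ...
        seq_file,              # ... then the sequence file
    ]


hmm_cmd = hmmscan_command(
    "./Pfam/Pfam-A.hmm",       # assumption: Pfam-A.hmm sits directly in the linked directory
    "Ling_sce_genome_augustus_out.fa",
    "yourname_sce_augustus.pfam",
)
print(shlex.join(hmm_cmd))
```

hmmscan still writes its main report to standard output, so in the shell script it may be worth redirecting that to a file.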

3. [3pt bonus] Note that for Question 2 we need to extract sequences from the Augustus output. Write a program that:

    • Takes yourname_sce_genome_augustus.out as input

    • Generates an output that looks like Ling_sce_genome_augustus.out.fa

    • Is named yourname_get_gff_seq.py; submit it.
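One possible shape for the extraction step is sketched below. It assumes the Augustus output marks each gene with a `# start gene gN` comment and writes the protein as `# protein sequence = [...]`, possibly continued over several comment lines. Check this against the real output file before relying on it. The function names and the sample lines are made up for illustration.

```python
import re


def extract_proteins(lines):
    """Yield (gene_id, protein_sequence) pairs from Augustus output lines."""
    gene_id = None
    chunks = None  # None means we are not inside a protein block
    for line in lines:
        line = line.strip()
        m = re.match(r"#\s*start gene\s+(\S+)", line)
        if m:
            gene_id = m.group(1)
            continue
        if chunks is None:
            m = re.match(r"#\s*protein sequence\s*=\s*\[(.*)", line)
            if not m:
                continue
            chunks = []
            text = m.group(1)
        else:
            # continuation line: drop the leading comment marker
            text = line.lstrip("#").strip()
        if text.endswith("]"):
            chunks.append(text[:-1])
            yield gene_id, "".join(chunks)
            chunks = None
        else:
            chunks.append(text)


def format_fasta(records, width=60):
    """Format (gene_id, sequence) pairs as FASTA text with >gN headers."""
    out = []
    for gene_id, seq in records:
        out.append(f">{gene_id}")
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


# Invented sample in the assumed format
sample = """\
# start gene g1
chrI\tAUGUSTUS\tgene\t100\t400\t0.9\t+\t.\tID=g1
# protein sequence = [MSEGK
# ASDF]
# end gene g1
# start gene g2
# protein sequence = [MKV]
# end gene g2
""".splitlines()

records = list(extract_proteins(sample))
fasta_text = format_fasta(records)
print(fasta_text, end="")
```

For the real file, the same functions can be fed an open file handle instead of the sample list, and the returned text written to the output FASTA file.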

4. [3pt] Create a program called find_pfam_annotated_gene.py to get the gene sequences of those predicted proteins that match Pfam models. Save those sequences in a file named pfam_annotated_gene.fa, and report (print out) the percentage of Pfam annotation (annotated genes / all predicted genes). [Hint: In yling_sce_augustus.pfam, the columns are: target name (Pfam domain name), width 20; accession (Pfam ID), width 10; query name (gene ID: g1, g2, gx, ...), width 20.]

The hmmscan result yling_sce_augustus.pfam can be used as one input. (I removed the command-line information at the end of this file.) It is available at:

/mnt/scratch/yling/Genomics/Hw04/yling_sce_augustus.pfam

Ling_sce_gene_augustus_out.fa in the same directory can be used as the other input for Q4.
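A minimal sketch of the matching logic follows. It assumes the hit table is whitespace-delimited with the query name (gene ID) in the third column, as the hint describes, and that comment lines start with `#`. The function names, the domain names, and the Pfam IDs in the sample rows are invented for illustration.

```python
def read_fasta(lines):
    """Return {sequence_id: sequence} from FASTA lines."""
    seqs = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line)
    return {k: "".join(v) for k, v in seqs.items()}


def annotated_gene_ids(tbl_lines):
    """Return the set of query names (gene IDs) that have at least one hit."""
    ids = set()
    for line in tbl_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split()
        ids.add(fields[2])  # columns: target name, accession, query name, ...
    return ids


# Invented sample data
fasta_lines = [">g1", "MSEGK", "ASDF", ">g2", "MKV", ">g3", "MTT"]
tbl_lines = [
    "# target name  accession  query name  ...",
    "DomA  PF99991.1  g1  -  1e-50  170.2  0.0",
    "DomB  PF99992.1  g3  -  1e-10  40.1  0.5",
    "DomA  PF99991.1  g3  -  2e-30  100.0  0.1",
]

seqs = read_fasta(fasta_lines)
hits = annotated_gene_ids(tbl_lines)
annotated = {gene: seq for gene, seq in seqs.items() if gene in hits}
percent = 100.0 * len(annotated) / len(seqs)
print(f"Pfam annotation: {len(annotated)}/{len(seqs)} = {percent:.1f}%")
```

The `annotated` dictionary holds what would be written to pfam_annotated_gene.fa. Slicing each line by the fixed column widths from the hint would be an alternative to `split()`.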

5. [3pt] Create a program called finding_Pfam_abundancy.py to extract how many proteins each domain hits. Write this information to a file named Yeast_Pfam_domain_counts, which has 3 tab-delimited columns:

Domain name    Pfam_ID    Gene counts (one count per protein)

Besides Yeast_Pfam_domain_counts, you must report (print out) which domain is the most abundant in the predicted proteins of Saccharomyces cerevisiae.
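The counting step could be sketched as follows, under the same assumption that the first three whitespace-delimited columns of the hit table are domain name, Pfam ID, and gene ID. Collecting genes in a set per domain gives one count per protein even if a pair were listed more than once. The function name and sample rows are invented for illustration.

```python
from collections import defaultdict


def domain_gene_counts(tbl_lines):
    """Return {(domain_name, pfam_id): number of distinct genes hit}."""
    genes_by_domain = defaultdict(set)
    for line in tbl_lines:
        if line.startswith("#") or not line.strip():
            continue
        name, pfam_id, gene = line.split()[:3]
        genes_by_domain[(name, pfam_id)].add(gene)
    return {dom: len(genes) for dom, genes in genes_by_domain.items()}


# Invented sample rows: DomA hits g1 twice, which counts once
pfam_rows = [
    "# target name  accession  query name  ...",
    "DomA  PF99991.1  g1  -  1e-50  170.2  0.0",
    "DomA  PF99991.1  g1  -  3e-20  80.0  0.0",
    "DomA  PF99991.1  g3  -  2e-30  100.0  0.1",
    "DomB  PF99992.1  g3  -  1e-10  40.1  0.5",
]

counts = domain_gene_counts(pfam_rows)
ranked = sorted(counts.items(), key=lambda item: -item[1])
table = "".join(f"{name}\t{pfam_id}\t{n}\n" for (name, pfam_id), n in ranked)
top_domain = ranked[0][0]

print(table, end="")
print("Most abundant domain:", top_domain[0])
```

The `table` string is what would be written to Yeast_Pfam_domain_counts. Ties for the top count are not handled here and would need a rule of their own.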

Zip the 5 files below into a package named yourname_ODA_Assignment4 and hand it in:

yourname_augustus.sh
yourname_hmmscan.sh
yourname_get_gff_seq.py
find_pfam_annotated_gene.py
finding_Pfam_abundancy.py
