Gene Finding Genome Annotation

Gene Finding Genome Annotation. Gene finding is a cornerstone of genomic analysis Genome content and organization Differential expression analysis Epigenomics.

Dec 22, 2015

Page 1

Gene Finding
Genome Annotation

Page 2

Gene finding is a cornerstone of genomic analysis

• Genome content and organization
• Differential expression analysis
• Epigenomics
• Population biology & evolution
• Medical genomics

Page 3

Basic Approaches

• Computational
  – Absolute rules:
    • start and stop codons
  – Statistical probabilities:
    • which codon is a true start?
    • introns
    • splice junctions
    • codon usage
• Experimental
  – Comparison with known genes/proteins (BLAST)
  – Expressed sequence tags
  – RNAseq data
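The "absolute rules" approach can be illustrated with a minimal open reading frame (ORF) scan. This is only a sketch under simplifying assumptions: it looks at the forward strand only, ignores introns, and treats the first ATG after a stop as the start. The function name `find_orfs` and the `min_codons` parameter are made up for this example and are not part of any gene finder.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

# Toy sequence with one short ORF (ATG AAA TTT TAG) in frame 2
print(find_orfs("CCATGAAATTTTAGGG"))
```

A scan like this finds every candidate ORF but cannot say which ATG is the true start or where introns lie, which is why the statistical and experimental approaches are also needed.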

Page 4

Computational Gene Prediction

• Statistical properties of protein-coding genes differ from those of non-coding sequence
  – Long ORFs
    • In random sequence, stop codons should occur on average 3 times in every 64 codons (~1 in 21)
  – Codon bias

    Codon  Amino acid  %
    ACA    Thr         24.6
    ACC    Thr         35.5
    ACG    Thr         28.4
    ACU    Thr         11.4
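Both statistics can be sketched in a few lines. The example below assumes all 64 codons are equally likely in random sequence (equal base frequencies), which is a simplification, and it works on DNA, so ACU appears as ACT. The function names are invented for illustration.

```python
from collections import Counter

def prob_no_stop(n_codons):
    """Chance that n random codons contain no stop codon (61 of 64 codons are sense codons)."""
    return (61 / 64) ** n_codons

def codon_fractions(cds, codons):
    """Percent usage of each codon in `codons` among the in-frame codons of `cds`."""
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    total = sum(counts[c] for c in codons)
    if total == 0:
        return {}
    return {c: 100 * counts[c] / total for c in codons}

# A stop-free run of 100 codons is already unlikely by chance (under 1%)
print(prob_no_stop(100))

# Usage of the four threonine codons in a toy coding sequence
print(codon_fractions("ACAACCACCACG", ["ACA", "ACC", "ACG", "ACT"]))
```

The longer a stop-free reading frame is, the less likely it is to have arisen by chance, which is what makes long ORFs a useful signal.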

Page 5

Gene features tend to occur in specific sequence contexts

from Korf (2004)

a. Splice acceptor sites
b. Splice donor sites
c. Translation starts
d. Splice acceptor sites for A. thaliana genes predicted using C. elegans parameters

Page 6

Many of the ab initio gene finders use Hidden Markov Models (HMMs)

• HMMs
  – Contain parameters defining probabilities that specific gene features occur in different sequence contexts
• They can be used to predict
  – Transcription start sites
  – Intron splice junctions
  – Poly-A addition sites
  – Promoters
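To make the HMM idea concrete, here is a toy two-state model decoded with the Viterbi algorithm. It is a sketch only: the states, the transition and emission probabilities, and the assumption that coding sequence is GC-rich are all invented for illustration. Real gene finders such as SNAP and AUGUSTUS use far richer state sets and parameters trained on known genes.

```python
import math

STATES = ("noncoding", "coding")

# Toy parameters chosen for illustration only (not trained on any genome)
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},  # AT-rich
    "coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},     # GC-rich
}

def viterbi(seq):
    """Return the most probable state path for a non-empty DNA string (log-space Viterbi)."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[k][s] = best predecessor of state s at position k + 1
    for base in seq[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            pointers[s] = best
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][base])
        back.append(pointers)
    # Trace back from the best final state
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

print(viterbi("ATATATGCGCGCGC"))
```

With these toy parameters the AT-rich prefix is labelled noncoding and the GC-rich suffix coding. The same decoding principle, applied to states for exons, introns, and splice sites, is how an HMM turns sequence context into a gene model.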

Page 7

Standard practice is to perform gene predictions with multiple programs

We will run two programs in today’s exercise:

• SNAP
  – Korf (2004) Gene finding in novel genomes. BMC Bioinformatics 5:59
• AUGUSTUS
  – Stanke et al. (2004) AUGUSTUS: a web server for gene finding in eukaryotes. Nucl. Acids Research 32:W309

Page 8

Gene validation

• Independent evidence that our candidate gene is, in fact, a gene

– Conserved protein motifs

– BLAST matches

– Expressed sequence tags

– RNAseq reads

Page 9

For today’s exercise

• We will use the following evidence:
  – Genes/proteins already identified in M. oryzae (many being well supported by BLAST, EST and other transcriptomic data)
  – Splice junction information from the RNAseq mapping that we performed yesterday
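One simple way to picture how splice junction evidence is used: an intron in a predicted gene is supported if the RNAseq mapping reported a junction with the same coordinates. The sketch below assumes both sets are already available as (seqid, start, end, strand) tuples in the same coordinate system. The function name and the coordinates are hypothetical, and this is not how any particular annotation tool implements the comparison.

```python
def intron_support(predicted_introns, rnaseq_junctions):
    """Split predicted introns into those matched and unmatched by RNAseq junctions.

    Both arguments are iterables of (seqid, start, end, strand) tuples.
    """
    junctions = set(rnaseq_junctions)
    supported = [i for i in predicted_introns if i in junctions]
    unsupported = [i for i in predicted_introns if i not in junctions]
    return supported, unsupported

# Hypothetical coordinates for illustration
predicted = [("chr1", 501, 620, "+"), ("chr1", 901, 1010, "+")]
observed = [("chr1", 501, 620, "+"), ("chr1", 2001, 2100, "-")]

supported, unsupported = intron_support(predicted, observed)
print(supported)
print(unsupported)
```

Exact matching is deliberately strict; an intron that is off by even one base is reported as unsupported, which flags it for closer inspection.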

Page 10

Information overload!!!

• Results from:
  – SNAP
  – AUGUSTUS
• Magnaporthe genes
• Magnaporthe proteins
• RNAseq mapping data
• How are we going to make sense out of these highly redundant datasets?

Page 11

Enter…MAKER

• Synthesizes multiple forms of gene prediction data
  – Predictions and evidence
• Outputs a single, consistent set of genes and gene models, including quality values
• Uses a standard gene annotation format
  – GFF3 (related to the GTF format used yesterday)
  – Results can be imported into a genome browser

Page 12

GFF3 format

1      2       3     4      5    6      7       8      9
seqid  source  type  start  end  score  strand  phase  attributes
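The nine tab-separated columns map naturally onto a small record type. The parser below is a minimal sketch, not a full GFF3 implementation: it handles one feature line at a time, treats "." as a missing value, and splits column 9 into tag=value pairs. The names `GffRecord` and `parse_gff3_line` and the example line are made up for illustration.

```python
from collections import namedtuple
from urllib.parse import unquote

GffRecord = namedtuple(
    "GffRecord",
    "seqid source type start end score strand phase attributes",
)

def parse_gff3_line(line):
    """Parse one GFF3 feature line; return None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    seqid, source, ftype, start, end, score, strand, phase, attr_text = fields
    attributes = {}
    if attr_text != ".":
        for pair in attr_text.split(";"):
            if pair:
                key, _, value = pair.partition("=")
                attributes[unquote(key)] = unquote(value)  # undo URL escaping
    return GffRecord(
        seqid, source, ftype,
        int(start), int(end),                     # 1-based, inclusive coordinates
        None if score == "." else float(score),
        strand,
        None if phase == "." else int(phase),
        attributes,
    )

# Hypothetical feature line for illustration
record = parse_gff3_line("chr1\tmaker\tgene\t1000\t2500\t.\t+\t.\tID=gene0001;Name=example")
print(record.type, record.start, record.end, record.attributes["ID"])
```

Keeping coordinates as integers and attributes as a dictionary makes it easy to filter features by type or look up parent-child relationships later.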

Page 13

Gene finding is an iterative process

[Figure: workflow diagram with the labels SNAP, AUGUSTUS, HMM, gene models, BLAST matches, ESTs, and MAKER]