Introduction to ab initio and evidence-based gene finding Wilson Leung 08/2015

Page 1

Introduction to ab initio and evidence-based gene finding

Wilson Leung 08/2015

Page 2

Outline

Overview of computational gene predictions

Different types of eukaryotic gene predictors

Common types of gene prediction errors

Page 3

Computational gene predictions

Identify genes in genomic sequence:

Protein-coding genes

Non-coding RNA genes

Regulatory regions (enhancers, promoters)

Predictions must be confirmed experimentally

Eukaryotic gene predictions have high error rates

Two major types of RefSeq records:

NM/NP = experimentally confirmed

XM/XP = computational predictions
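The prefix rule on this slide can be applied mechanically. The following is a minimal sketch, assuming accessions follow the usual PREFIX_number.version layout; the helper name and the accessions used to try it out are hypothetical placeholders, and the labels simply restate the slide's summary.

```python
# Prefix meanings as summarized on this slide.
REFSEQ_STATUS = {
    "NM": "experimentally confirmed mRNA",
    "NP": "experimentally confirmed protein",
    "XM": "computationally predicted mRNA",
    "XP": "computationally predicted protein",
}


def refseq_status(accession):
    """Classify a RefSeq accession by its two-letter prefix (hypothetical helper)."""
    prefix = accession.split("_", 1)[0].upper()
    return REFSEQ_STATUS.get(prefix, "other RefSeq record type")


# Placeholder accessions, not real records.
for acc in ("NM_000001.1", "XP_000002.1"):
    print(acc, "->", refseq_status(acc))
```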

Page 4

Primary goal of computational gene prediction algorithms:

Label each nucleotide in a genomic sequence

Identify the most likely sequence of labels (i.e. optimal path)

Sequence: TTTCACACGTAAGTATAGTGTGTGA
Path 1:   EEEEEEEESSIIIIIIIIIIIIIII
Path 2:   EEEEEEEEEEEESSIIIIIIIIIII
Path 3:   EEEEEEEEEEEEEEEEESSIIIIII

Labels: Exon (E), 5’ Splice Site (S), Intron (I)
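The candidate paths on this slide can be generated by placing the splice-site label on each GT dinucleotide in turn. The sketch below is only an illustration of that labeling idea, not how any particular gene finder enumerates paths; the function name is hypothetical. Scanning the slide's sequence finds five GT dinucleotides, and the first three correspond to Paths 1 to 3.

```python
def candidate_paths(seq, donor="GT"):
    """Return one exon/splice-site/intron label string per candidate donor site."""
    paths = []
    start = seq.find(donor, 1)  # start at 1 so at least one exon base precedes the site
    while start != -1:
        intron_len = len(seq) - start - len(donor)
        paths.append("E" * start + "S" * len(donor) + "I" * intron_len)
        start = seq.find(donor, start + 1)
    return paths


seq = "TTTCACACGTAAGTATAGTGTGTGA"
for number, path in enumerate(candidate_paths(seq), start=1):
    print(f"Path {number}: {path}")
```

A gene finder then has to decide which of these label sequences is the most likely one, which is where the probabilistic models discussed later come in.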

Page 5

Basic properties of gene prediction algorithms

Predictions must satisfy a set of biological constraints

Coding region must begin with a start codon

Initial exon must occur before splice sites and introns

Coding region must end with a stop codon

Model these rules using a finite state machine (FSM)

Use species-specific characteristics to improve the accuracy of gene predictions

Distribution of exon and intron sizes

Base frequencies (e.g., GC content, codon bias)
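The biological constraints listed on this slide can be written as a small finite state machine. This is a minimal sketch under simplifying assumptions: the state names and the transition table are hypothetical, and a real gene finder tracks more detail (reading frame, untranslated regions, both strands).

```python
# Allowed transitions between gene features (hypothetical state names).
TRANSITIONS = {
    "START": {"start_codon"},
    "start_codon": {"exon"},
    "exon": {"donor", "stop_codon"},
    "donor": {"intron"},
    "intron": {"acceptor"},
    "acceptor": {"exon"},
    "stop_codon": set(),
}


def is_valid_gene(features):
    """Return True if the feature sequence obeys every transition rule."""
    state = "START"
    for feature in features:
        if feature not in TRANSITIONS[state]:
            return False
        state = feature
    # A coding region must end with a stop codon.
    return state == "stop_codon"


two_exon_gene = ["start_codon", "exon", "donor", "intron",
                 "acceptor", "exon", "stop_codon"]
print(is_valid_gene(two_exon_gene))
print(is_valid_gene(["exon", "donor", "intron", "acceptor", "exon", "stop_codon"]))
```

The second call fails the check because the model does not begin with a start codon.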

Page 6

Prokaryotic gene predictions

Prokaryotes have relatively simple gene structure

Single open reading frame

Alternative start codons: AUG, GUG, UUG

Gene finders can predict most prokaryotic genes accurately (> 90% sensitivity and specificity)

Glimmer: Salzberg S., et al., Microbial gene identification using interpolated Markov models, NAR. (1998) 26, 544-548
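Because a prokaryotic gene is a single open reading frame, a naive scan already captures the basic idea. The sketch below is an assumption-laden illustration, not Glimmer's method (Glimmer scores candidates with interpolated Markov models): it searches only the forward strand, uses the DNA equivalents of the start codons listed on this slide, and the function name and `min_codons` threshold are hypothetical.

```python
STARTS = {"ATG", "GTG", "TTG"}  # DNA equivalents of AUG, GUG, UUG
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end) pairs for forward-strand ORFs.

    Coordinates are 0-based, end is exclusive and includes the stop codon.
    Each ORF begins at the first start codon seen after the previous stop.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon in STARTS:
                start = i
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


print(find_orfs("AAATGAAACCCTAAGG"))
```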

Page 7

Eukaryotic gene predictions have high error rates

Gene finders generally do a poor job (<50%) predicting genes in eukaryotes

More variations in the gene models:

Alternative splicing (multiple isoforms)

Non-canonical splice sites

Alternate start codon (e.g., Fmr1)

Stop codon read through (e.g., gish)

Nested genes (e.g., ko)

Trans-splicing (e.g., mod(mdg4))

Pseudogenes

Page 8

Types of eukaryotic gene predictors

Ab initio: GENSCAN, geneid, SNAP, GlimmerHMM

Evidence-based (extrinsic): Augustus, genBlastG, Exonerate, GenomeScan

Comparative genomics: Twinscan/N-SCAN, SGP2

Combine ab initio and evidence-based approaches: EVM, GLEAN, Gnomon, JIGSAW, MAKER

Page 9

Ab initio gene prediction

Ab initio = from the beginning

Predict genes using only the genomic DNA sequence

Search for signals of protein coding regions

Typically based on a probabilistic model:

Hidden Markov Models (HMM)

Support Vector Machines (SVM)

GENSCAN: Burge, C. and Karlin, S., Prediction of complete gene structures in human genomic DNA, JMB. (1997), 268, 78-94

Page 10

Hidden Markov Models (HMM)

A type of supervised machine learning algorithm

Uses Bayesian statistics

Makes classifications based on characteristics of training data

Many types of applications:

Speech and gesture recognition

Bioinformatics:

Gene predictions

Sequence alignments

ChIP-seq analysis

Protein folding

Page 11

Supervised machine learning

Norvig, P. How to write a spelling corrector. http://www.norvig.com/spell-correct.html

Use aggregated data of previous search results to predict the search term and the correct spelling

Page 12

GEP curriculum on HMM

Developed by Anton Weisstein (Truman State University) and Zane Goodwin (TA for Bio 4342)

Use an HMM to predict a splice donor site

Use Excel to experiment with different emission and transition probabilities

See the Beyond Annotation section of the GEP web site
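The idea of finding the most likely label path with emission and transition probabilities can also be tried in a few lines of code. The following is a minimal Viterbi sketch for a toy exon / donor / intron HMM. It is not the GEP spreadsheet and not GENSCAN: the state names, the two-state donor (D1 emits G, D2 emits T), and every probability value are made-up assumptions chosen only to keep the example small.

```python
import math

STATES = ["E", "D1", "D2", "I"]
LABEL = {"E": "E", "D1": "S", "D2": "S", "I": "I"}

# All probabilities below are illustrative assumptions.
START = {"E": 1.0}          # every path begins in an exon
END = {"I": 0.1}            # every path must finish in an intron
TRANS = {
    "E": {"E": 0.9, "D1": 0.1},
    "D1": {"D2": 1.0},
    "D2": {"I": 1.0},
    "I": {"I": 0.9},
}
EMIT = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "D1": {"G": 1.0},
    "D2": {"T": 1.0},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def safe_log(p):
    """Natural log that maps probability 0 to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Return the most likely label string (E, S, I) for a DNA sequence."""
    # score[s] = best log-probability of any path ending in state s
    score = {s: safe_log(START.get(s, 0)) + safe_log(EMIT[s].get(seq[0], 0))
             for s in STATES}
    back = []
    for base in seq[1:]:
        prev, score, ptr = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: prev[r] + safe_log(TRANS[r].get(s, 0)))
            score[s] = (prev[best] + safe_log(TRANS[best].get(s, 0))
                        + safe_log(EMIT[s].get(base, 0)))
            ptr[s] = best
        back.append(ptr)
    # Pick the best final state, then follow the back-pointers.
    state = max(STATES, key=lambda s: score[s] + safe_log(END.get(s, 0)))
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return "".join(LABEL[s] for s in reversed(path))


print(viterbi("TTTCACACGTAAGTATAGTGTGTGA"))
```

With these particular made-up parameters the sequence from Page 4 is labeled like Path 1, because the AT-rich intron model favors the earliest GT. Changing the emission or transition values, as the Excel exercise invites, can move the predicted donor site.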

Page 13

GENSCAN HMM Model

GENSCAN uses the following information to construct gene models:

Promoter, splice site and polyadenylation signals

Hexamer frequencies and base compositions

Probability of coding and non-coding DNA

Distribution of gene, exon and intron lengths

Burge, C. and Karlin, S., Prediction of complete gene structures in human genomic DNA, JMB. (1997) 268, 78-94
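Two of the inputs listed on this slide, hexamer frequencies and base composition, are simple to compute from a sequence. The sketch below shows only the counting step under the assumption of an uppercase DNA string; the function names are hypothetical, and how GENSCAN turns such statistics into coding and non-coding probabilities is described in the cited paper, not here.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def hexamer_counts(seq):
    """Count every overlapping 6-mer in the sequence."""
    return Counter(seq[i:i + 6] for i in range(len(seq) - 5))


print(gc_content("GGCCAATT"))
print(hexamer_counts("ATGATGATG").most_common(1))
```

Comparing hexamer counts from known coding regions with counts from non-coding regions is one way a training set informs the model.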

Page 14

Use multiple HMMs to describe different parts of a gene

Stanke M., Waack, S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics. (2003) 19 Suppl 2:ii215-25.

Page 15

Evidence-based gene predictions

Use sequence alignments to improve predictions

EST, cDNA or protein from closely-related species

Exon sensitivity: Percent of real exons identified

Exon specificity: Percent of predicted exons that are correct

Yeh RF, et al., Computational Inference of Homologous Gene Structures in the Human Genome, Genome Res. (2001) 11, 803-816
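The two exon-level measures defined on this slide can be computed directly from exon coordinates. This is a minimal sketch, assuming an exon counts as correct only when both boundaries match exactly; the function name and the coordinates are hypothetical. Note that "specificity" as defined here is what other fields usually call precision.

```python
def exon_accuracy(real_exons, predicted_exons):
    """Return (sensitivity, specificity) using exact boundary matches."""
    real, predicted = set(real_exons), set(predicted_exons)
    correct = real & predicted
    sensitivity = len(correct) / len(real) if real else 0.0
    specificity = len(correct) / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Hypothetical coordinates: the third prediction has the wrong start.
real = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted = [(100, 200), (300, 400), (450, 600)]
sensitivity, specificity = exon_accuracy(real, predicted)
print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```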

Page 16

Augustus gene prediction service

http://bioinf.uni-greifswald.de/augustus/

Stanke, M., et al., Using native and syntenically mapped cDNA alignments to improve de novo gene finding, Bioinformatics. (2008) 24, 637-644

Page 17

Predictions using comparative genomics

Use whole genome alignments from one or more informant species

CONTRAST predicts 50% of genes correctly

Requires high quality whole genome alignments and training data

Flicek P, Gene prediction: compare and CONTRAST. Genome Biology 2007, 8, 233

Page 18

Generate consensus gene models

Gene predictors have different strengths and weaknesses

Create consensus gene models by combining results from multiple gene finders and sequence alignments

GLEAN: Elsik CG, Mackey AJ, Reese JT, et al. Creating a honey bee consensus gene set. Genome Biology 2007, 8:R13

GLEAN-R (reconciled) reference gene sets for 11 Drosophila species available at FlyBase
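The simplest way to picture a consensus is a vote among gene finders. The sketch below keeps an exon when at least `min_votes` predictors report identical coordinates. It is a deliberately naive stand-in and not the algorithm used by GLEAN or any other combiner named in this talk; the function name, the threshold, and the coordinates are hypothetical.

```python
from collections import Counter


def consensus_exons(predictions, min_votes=2):
    """Keep exons reported by at least min_votes gene finders."""
    votes = Counter(exon
                    for exons in predictions.values()
                    for exon in set(exons))
    return sorted(exon for exon, count in votes.items() if count >= min_votes)


# Hypothetical exon coordinates from three gene finders.
predictions = {
    "genscan": [(100, 200), (300, 400)],
    "augustus": [(100, 200), (300, 410)],
    "snap": [(100, 200), (300, 400), (900, 950)],
}
print(consensus_exons(predictions))
```

Real combiners also weigh each source by how reliable it tends to be and take sequence alignments into account, which a plain vote ignores.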

Page 19

GLEAN-R prediction for the ey ortholog in D. erecta

Single GLEAN-R prediction per genomic location

Models have not been confirmed experimentally

GLEAN-R RefSeq records have XM and XP prefixes

Page 20

Automated annotation pipelines

http://www.ncbi.nlm.nih.gov/projects/genome/guide/gnomon.shtml

NCBI Gnomon gene prediction pipeline

Integrate biological evidence into the predicted gene models

Examples: Eannot, NCBI Gnomon, Ensembl, UCSC Gene Build

EGASP results for the Ensembl pipeline:

71.6% gene sensitivity

67.3% gene specificity

Page 21

New Gnomon predictions for eight Drosophila species

Based on RNA-Seq data from either the same or closely-related species

D. simulans, D. yakuba, D. erecta, D. ananassae, D. pseudoobscura, D. willistoni, D. virilis, and D. mojavensis

Predictions include untranslated regions and multiple isoforms

Records not yet available through the NCBI RefSeq database

Page 22

[Figure labels: Gene predictions; Final models]

Common problems with gene finders

Split single gene into multiple predictions

Fused with neighboring genes

Missing exons

Overpredicted exons or genes

Missing isoforms

Page 23

Non-canonical splice donors and acceptors

Frequency of non-canonical splice sites in FlyBase Release 6.06 (Number of unique introns: 71,476)

Donor site     Count
GC             599
AT             27
GA             14

Acceptor site  Count
AC             33
TG             28
AT             17

Many gene predictors only predict genes with canonical splice donor (GT) and acceptor (AG) sites

Check Gene Record Finder or FlyBase for genes that use non-canonical splice sites in D. melanogaster
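When reviewing a gene model, it helps to check the first and last two bases of each intron against the canonical GT donor and AG acceptor. The following is a minimal sketch, assuming the intron is given as a forward-strand DNA string; the function names and the example sequences are hypothetical.

```python
def splice_sites(intron):
    """Return the (donor, acceptor) dinucleotides of an intron sequence."""
    intron = intron.upper()
    return intron[:2], intron[-2:]


def is_canonical(intron):
    """True when the intron starts with GT and ends with AG."""
    return splice_sites(intron) == ("GT", "AG")


# Made-up intron sequences for illustration only.
for intron in ("GTAAGTATAGTGTTTCAG", "GCAAGTATTTTCAG"):
    donor, acceptor = splice_sites(intron)
    print(donor, acceptor, "canonical" if is_canonical(intron) else "non-canonical")
```

An intron flagged as non-canonical by a check like this is not necessarily wrong; as the table above shows, GC donors in particular occur in hundreds of D. melanogaster introns.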

Page 24

Annotate unusual features in gene models using D. melanogaster as a reference

Examine the “Comments on Gene Model” section of the FlyBase Gene Report

Non-canonical start codon:

Stop codon read through:

Page 25

Nested genes in Drosophila

D. mel.

D. erecta

Page 26

Trans-spliced gene in Drosophila

A special type of RNA processing where exons from two primary transcripts are ligated together

Page 27

Gene prediction results for the GEP annotation projects

Gene prediction results are available through the GEP UCSC Genome Browser mirror

Under the Genes and Gene Prediction Tracks section

Access the predicted peptide sequence by clicking on the feature, then clicking on the Predicted Protein link

Original gene predictor output is available inside the Genefinder folder in the annotation package

The Genscan folder contains a PDF with a graphical schematic of the gene predictions

Page 28

Summary

Gene predictors can quickly identify potentially interesting features within a genomic sequence

The predictions are hypotheses that must be confirmed experimentally

Eukaryotic gene predictors generally can accurately identify internal exons

Much lower sensitivity and specificity when predicting complete gene models

Page 29

Questions?

http://www.flickr.com/photos/cristinacosta/4304968451/sizes/m/in/photostream/