Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov
Jan 15, 2016
Gene Recognition
Credits for slides:Serafim BatzoglouMarina AlexanderssonLior PachterSerge Saxonov
The Central Dogma
Protein
RNA
DNA
transcription
translation
CCTGAGCCAACTATTGATGAA
PEPTIDE
CCUGAGCCAACUAUUGAUGAA
Gene structure
exon1 exon2 exon3intron1 intron2
transcription
translation
splicing
exon = protein-codingintron = non-coding
Codon:A triplet of nucleotides that is converted to one amino acid
Locating Genes
• We have a genome sequence, maybe with related genomes aligned to it…where are the genes?
• Yeast genome is about 70% protein coding• About 6000 genes
• Human genome is about 1.5% protein coding• About 22,000 genes
Finding Genes in Yeast
Start codonATG
5’ 3’
Stop codonTAG/TGA/TAA
Intergenic Coding Intergenic
Mean coding length about 1500bp (500 codons)
Transcript
Finding Genes in Yeast
• ORF Scanning Look for long open reading frames (ORFs)
ORFs start with ATG and contain no in-frame stop codons
Long ORFs unlikely to occur by chance (i.e., they are probably genes)
Finding Genes in Yeast
Yeast ORF distribution
Introns: The Bane of ORF Scanning
Start codonATG
5’ 3’
Stop codonTAG/TGA/TAA
Splice sites
Intergenic Exon Intron IntergenicExon ExonIntron
Transcript
Introns: The Bane of ORF Scanning
• Drosophila:
• 3.4 introns per gene on average
• mean intron length 475, mean exon length 397
• Human:
• 8.8 introns per gene on average
• mean intron length 4400, mean exon length 165
• ORF scanning is defeated
Where are the genes?Where are the genes?
Needles in a Haystack
Now What?
• We need to use more information to help recognize genes
Regular structure
Exon/intron lengths
Nucleotide composition
Biological signals• Start codon, stop codon, splice sites
Patterns of conservation
Regular Gene Structure
• Protein coding region starts with ATG, ends with TAA/TAG/TGA
• Exons alternate with introns
• Introns start with GT/GC, end with AG
• Each exon has a reading frame determined by the codon position at the end of the last exon
Next Exon:Frame 0
Next Exon:Frame 1
Exon/Intron Lengths
Nucleotide Composition
• Base composition in exons is characteristic due to the genetic code
Amino Acid SLC DNA CodonsIsoleucine I ATT, ATC, ATALeucine L CTT, CTC, CTA, CTG, TTA, TTGValine V GTT, GTC, GTA, GTGPhenylalanine F TTT, TTCMethionine M ATGCysteine C TGT, TGCAlanine A GCT, GCC, GCA, GCG Glycine G GGT, GGC, GGA, GGG Proline P CCT, CCC, CCA, CCGThreonine T ACT, ACC, ACA, ACGSerine S TCT, TCC, TCA, TCG, AGT, AGCTyrosine Y TAT, TACTryptophan W TGGGlutamine Q CAA, CAGAsparagine N AAT, AACHistidine H CAT, CACGlutamic acid E GAA, GAGAspartic acid D GAT, GACLysine K AAA, AAGArginine R CGT, CGC, CGA, CGG, AGA, AGG
Biological Signals
• How does the cell recognize start/stop codons and splice sites? In part, from characteristic base composition
• Donor site (start of intron) is recognized by a section of U1 snRNA
U1 snRNA: GUCCAUUCADonor site consensus: MAGGTRAGT
M means “A or C”, R means “A or G”
atg
tga
ggtgag
ggtgag
ggtgag
caggtg
cagatg
cagttg
caggccggtgag
5’ 3’Donor site
Position
-8 … -2 -1 0 1 2 … 17
A 26 … 60 9 0 0 54 … 21C 26 … 15 5 0 1 2 … 27G 25 … 12 78 100 0 41 … 27T 23 … 13 8 0 99 3 … 25
Splice Sites
Splice Sites
(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)
• WMM: weight matrix model = PSSM (Staden 1984)• WAM: weight array model = 1st order Markov (Zhang & Marr 1993)• MDD: maximal dependence decomposition (Burge & Karlin 1997)
Decision-tree algorithm to take pairwise dependencies into account
• For each position I, calculate Si = ji2(Ci, Xj)
• Choose i* such that Si* is maximal and partition into two subsets, until
• No significant dependencies left, or
• Not enough sequences in subset
Train separate WMM models for each subset
All donor splice sites
G5
not G5
G5G-1
G5
not G-1
G5G-1
A2
G5G-1
not A2
G5G-1
A2U6
G5G-1A2
not U6
Splice Sites
Patterns of Conservation
• Functional sequences are much more conserved than nonfunctional sequences
• Signal sequences show compensatory mutations If one position mutates away from consensus, often a
different one will mutate to consensus
• Coding sequence shows three-periodic pattern of conservation
Three Periodicity
• Most amino acids can be coded for by more than one DNA triplet
• Usually, the degeneracy is in the last position
Human CCTGTT (Proline, Valine)Mouse CCAGTC (Proline, Valine)Rat CCAGTC (Proline, Valine)Dog CCGGTA (Proline, Valine)Chicken CCCGTG (Proline, Valine)
GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA
Exon Exon ExonIntronIntronIntergenic Intergenic
Hidden Markov Models for Gene Finding
Intergene State
First Exon State
IntronState
GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA
Exon Exon ExonIntronIntronIntergenic Intergenic
Hidden Markov Models for Gene Finding
Intergene State
First Exon State
IntronState
GENSCAN
GENSCAN
• Burge and Karlin, Stanford, 1997
• Before The Human Genome Project No alignments available Estimated human gene count was 100,000
• Explicit state duration HMM (with tricks) Intergenic and intronic regions have geometric length
distribution Exons are only possible when correct flanking
sequences are present
GENSCAN
• Output probabilities for NC and CDS depend on previous 5 bases (5th-order) P(Xi | Xi-1, Xi-2, Xi-3, Xi-4, Xi-5)
• Each CDS frame has its own model
• WAM models for start/stop codons and acceptor sites
• MDD model for donor sites
• Separate parameters for regions of different GC content
GENSCAN Performance
• First program to do well on realistic sequences Long, multiple genes in both orientations
• Pretty good sensitivity, poor specificity 70% exon Sn, 40% exon Sp
• Not enough exons per gene
• Was the best gene predictor for about 4 years
TWINSCAN
• Korf, Flicek, Duan, Brent, Washington University in St. Louis, 2001
• Uses an informant sequence to help predict genes For human, informant is normally mouse
• Informant sequence consists of three characters Match: | Mismatch: : Unaligned: .
• Informant sequence assumed independent of target sequence
The TWINSCAN Model
• Just like GENSCAN, except adds models for conservation sequence
• 5th-order models for CDS and NC, 2nd-order models for start and stop codons and splice sites One CDS model for all frames
• Many informants tried, but mouse seems to be at the “sweet spot”
TWINSCAN Performance
• Slightly more sensitive than GENSCAN, much more specific Exon sensitivity/specificity about 75%
• Much better at the gene level Most genes are mostly right, about 25% exactly right
• Was the best gene predictor for about 4 years
N-SCAN
• Gross and Brent, Washington University in St. Louis, 2005
• If one informant sequence is good, let’s try more!
• Also several other improvements on TWINSCAN
N-SCAN Improvements
• Multiple informants
• Richer models of sequence evolution
• Frame-specific CDS conservation model
• Conserved noncoding sequence model
• 5’ UTR structure model
• GENSCAN
• TWINSCAN
• N-SCAN
HMM Outputs
Target GGTGAGGTGACCAAGAACGTGTTGACAGTA
Target GGTGAGGTGACCAAGAACGTGTTGACAGTAConservation |||:||:||:|||||:||||||||......sequence
Target GGTGAGGTGACCAAGAACGTGTTGACAGTAInformant1 GGTCAGC___CCAAGAACGTGTAG......Informant2 GATCAGC___CCAAGAACGTGTAG......Informant3 GGTGAGCTGACCAAGATCGTGTTGACACAA.
..
Phylogenetic Bayesian Network Models
)|()|()|(
)|()|()|()(),,,,,,(
3323
21211321
ARPAMPAAP
AHPAAPACPAPAAARMCHP
Homology-Based Gene Prediction
• Idea: Try to predict a gene in one organism using a known orthologous gene or protein from another organism
• Genewise Protein homology
• Projector Gene structure homology
• Very accurate if (and only if??) homology is high
Evaluating Performance
• Three main levels of performance: gene, exon, nucleotide
• Two measures of performance: Sensitivity: what fraction of the true features did we
predict correctly? Specificity: what fraction of our predicted features
were correct?
• Testing standard is whole-genome prediction Predicting on single-gene sequences is easier and less
interesting
Exact Exon Accuracy
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Exon Sn Exon Sp
GENSCAN EXONIPHY SGP2 TWINSCAN 2.0 N-SCAN
Exact Gene Accuracy
0
0.1
0.2
0.3
0.4
0.5
Gene Sn Gene Sp
GENSCAN SGP2 TWINSCAN 2.0 N-SCAN
Intron Sensitivity By Length
0
0.2
0.4
0.6
0.8
1
0-10
10-2
0
20-3
0
30-4
0
40-5
0
50-6
0
60-7
0
70-8
0
80-9
0
90-1
00Length (Kb)
N-SCAN
SGP2
GENSCAN
TWINSCAN
Human Informant Effectiveness
00.10.20.30.40.50.60.70.80.9
Gene Sn Gene Sp Exon Sn Exon Sp
Chicken Rat Mouse All
Drosophila Informant Effectiveness
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Gene Sn Gene Sp Exon Sn Exon Sp
A. gambiae D. yakuba D. pseudoobscura All
The Future
• Many new genomes being sequenced—they will need annotations! Current experimental “shotgun” methods not enough However, cheap targeted experiments are available to
verify predicted genes
• Promising directions in gene prediction: Conditional random fields Multiple informants—can we actually get them to
work???