Gene prediction and HMM Computational Genomics 2005/6 Lecture 9b Slides taken (and rapidly mixed) from William Stafford Noble, Larry Hunter, and Eyal Pribman. Partially modified by Benny Chor.
Dec 21, 2015
Gene prediction and HMMComputational Genomics 2005/6
Lecture 9b
Slides taken (and rapidly mixed) from William Stafford Noble, Larry Hunter, and Eyal Pribman. Partially modified by Benny Chor.
Annotation of Genomic Sequence
Given the sequence of an organism’s genome, we would like to be able to identify:– Genes– Exon boundaries & splice sites– Beginning and end of translation– Alternative splicings– Regulatory elements (e.g. promoters)
The only certain way to do this is experimentally, but it is time consuming and expensive. Computational
methods can achieve reasonable accuracy quickly, and help direct experimental approaches.
primary goals
secondary goals
Prokaryotic Gene Structure
Promoter CDS Terminator
transcription
Genomic DNA
mRNA
Most bacterial promoters contain the Shine-Delgarno signal, at about -10 that has the consensus sequence: 5'-TATAAT-3'.
The terminator: a signal at the end of the coding sequence that terminates the transcription of RNA
The coding sequence is composed of nucleotide triplets. Each triplet codes for an amino acid. The AAs are the building blocks of proteins.
Pieces of a (Eukaryotic) Gene(on the genome)
5’
3’
3’
5’
~ 1-100 Mbp
5’
3’
3’
5’
……
……
~ 1-1000 kbp
exons (cds & utr) / introns(~ 102-103 bp) (~ 102-105 bp)
Polyadenylation site
promoter (~103 bp)
enhancers (~101-102 bp)other regulatory sequences (~ 101-102 bp)
What is it about genes that we can measure (and model)?
• Most of our knowledge is biased towards protein-coding characteristics
– ORF (Open Reading Frame): a sequence defined by in-frame AUG and stop codon, which in turn defines a putative amino acid sequence.
– Codon Usage: most frequently measured by CAI (Codon Adaptation Index)
• Other phenomena– Nucleotide frequencies and correlations:
• value and structure– Functional sites:
• splice sites, promoters, UTRs, polyadenylation sites
A simple measure: ORF length Comparison of Annotation and Spurious ORFs in S. cerevisiae
Basrai MA, Hieter P, and Boeke J Genome Research 1997 7:768-771
Codon Adaptation Index (CAI)
• Parameters are empirically determined by examining a “large” set of example genes
• This is not perfect– Genes sometimes have unusual codons for a reason– The predictive power is dependent on length of
sequence
max
i
i
codon
i codons codon
fCAI
f
Splice signals (mice): GT , AG
General Things to Remember about (Protein-coding) Gene Prediction Software
• It is, in general, organism-specific
• It works best on genes that are reasonably similar to something seen previously
• It finds protein coding regions far better than non-coding regions
• In the absence of external (direct) information, alternative forms will not be identified
• It is imperfect! (It’s biology, after all…)
Simple HMM : Prokaryotes
0002.0001.00
0996.0001.05.0
0002.0998.05.0
0000
32.0
18.0
18.0
32.0
25.0
25.0
22.0
28.0
H
xm(i) = probability of being in state m at position i;
H(m,yi) = probability of emitting character yi in state m;
mk = probability of transition from state k to m.
Outline: Rest of Lecture
• Eukaryotic gene structure
• Modeling gene structure
• Using the model to make predictions
• Improving the model topology
• Modeling fixed-length signals
A eukaryotic gene
• This is the human p53 tumor suppressor gene on chromosome 17.
• Genscan is one of the most popular gene prediction algorithms.
A eukaryotic gene
3’ untranslated region
Final exon
Initial exon
Introns
Internal exons
This particular gene lies on the reverse strand.
An Intron
3’ splice site 5’ splice site
revcomp(CT)=AGrevcomp(AC)=GTGT: signals start of intron
AG: signals end of intron
Signals vs contents
• In gene finding, a small pattern within the genomic DNA is referred to as a signal, whereas a region of genomic DNA is a content.
• Examples of signals: splice sites, starts and ends of transcription or translation, branch points, transcription factor binding sites
• Examples of contents: exons, introns, UTRs, promoter regions
Prior knowledge
• We want to build a probabilistic model of a gene that incorporates our prior knowledge.
• E.g., the translated region must have a length that is a multiple of 3.
Prior knowledge
• The translated region must have a length that is a multiple of 3.
• Some codons are more common than others.• Exons are usually shorter than introns.• The translated region begins with a start signal
and ends with a stop codon.• 5’ splice sites (exon to intron) are usually GT; • 3’ splice sites (intron to exon) are usually AG.• The distribution of nucleotides and dinucleotides
is usually different in introns and exons.
A simple gene model
Transcriptionstop
Transcriptionstart
Start EndGene
Intergenic
Intergenic Intergenic
Intergenic
A probabilistic gene model
Transcriptionstop
Transcriptionstart
Start EndGene
Intergenic
Intergenic Intergenic
Intergenic
Every box stores transition probabilities for outgoing arrows.Every arrow stores emission probabilities for emitted nucleotides.
0.67
0.33
1.00
0.25
0.75
Pr(TACAGTAGATATGA) = 0.0001 Pr(AACAGT) = 0.001
Pr(AACAGTAC) = 0.002…
Parse
• For a given sequence, a parse is an assignment of gene structure to that sequence.
• In a parse, every base is labeled, corresponding to the content it (is predicted to) belongs to.
• In our simple model, the parse contains only “I” (intergenic) and “G” (gene).
• A more complete model would contain, e.g., “-” for intergenic, “E” for exon and “I” for intron.
S = ACTGACTACTACGACTACGATCTACTACGGGCGCGACCTATGCGP = IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGGGGG
TATGTTTTGAACTGACTATGCGATCTACGACTCGACTAGCTAC GGGGGGGGGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
The probability of a parse
Transcriptionstop
Transcriptionstart
Start EndGene
Intergenic
Intergenic Intergenic
Intergenic
0.67
0.33
1.00
0.25
0.75
Pr(ACTGACTACTACGACTACGATCTACTACGGGCGCGACCT) =
0.0000543
Pr(ATGCGTATGTTTTGA) = 0.00000000142
Pr(ACTGACTATGCGATCTACGACTCGACTAGCTAC) = 0.0000789
Pr(parse P| sequence S, model M) = 0.67 0.0000543 1.00 0.00000000142 0.75 x 0.0000789 = 3.057 10-18
S = ACTGACTACTACGACTACGATCTACTACGGGCGCGACCTATGCGTATGTTTTGAACTGACTATGCGATCTACGACTCGACTAGCTACP = IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGGGGGGGGGGGGGGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
Finding the best parse
• For a given sequence S, the model M assigns a probability Pr(P|S,M) to every parse P.
• We want to find the parse P* that receives the highest probability.
MSpPp
,Prmaxarg*
Beyond Simplest Model
• Improving the gene model topology• Fixed-length signals
– PSSMs– Dependencies between positions
• Variable-length contents– Using HMMs– Semi-Markov models
• Parsing algorithms– Viterbi– Posterior decoding
• Including other types of data– Expressed sequence tags– Orthology
Improved model topology
• Draw a model that includes introns
Transcriptionstop
Transcriptionstart
Start EndGene
Intergenic 2
Intergenic 1 Intergenic 4
Intergenic 3
Improved model topology
Transcriptionstop
Transcriptionstart
Start
End
5’ splicesite
3’ splicesite
Improved model topology
Transcriptionstop
Transcriptionstart
Start
End
5’ splicesite
3’ splicesite
4 intergenics1 intron4 exons
Improved model topology
Transcriptionstop
Transcriptionstart
Start
End
5’ splicesite
3’ splicesite
Single exonInitial exon
Intron
Internal exon
Final exon
Modeling the 5’ splice site
• Most introns begin with the letters “GT.”
• We can add this signal to the model.
5’ splicesite
3’ splicesiteIntronGT
Modeling the 5’ splice site
• Most introns begin with the letters “GT.”• We can add this signal to the model.• Indeed, we can model each nucleotide
with its own arrow.
5’ splicesite
3’ splicesiteIntronG T
Pr(A)=0Pr(C)=0Pr(G)=0Pr(T)=1
Pr(A)=0Pr(C)=0Pr(G)=1Pr(T)=0
Modeling the 5’ splice site
• Like most biological phenomenon, the splice site signal admits exceptions.
• The resulting model of the 5’ splice site is a length-2 PSSM.
5’ splicesite
3’ splicesiteIntronG T
Pr(A)=0.01Pr(C)=0.01Pr(G)=0.01Pr(T)=0.97
Pr(A)=0.01Pr(C)=0.01Pr(G)=0.97Pr(T)=0.01
Real splice sites
• Real splice sites show some conservation at positions beyond the first two.
• We can add additional arrows to model these states.
weblogo.berkeley.edu
Modeling the 5’ splice site
5’ splicesite
3’ splicesiteIntron
Adding signals
Transcriptionstop
Transcriptionstart
Start
End
5’ splicesite
3’ splicesite
Single exonInitial exon
Intron
Internal exon
Final exon
Red ellipses correspondto signal models like this:
Positional Independence
Pr(“ACTT”|M) = Pr(“A” at position 1 and “C” at position 2 and “T” at
position 3 and “T” at position 4|M)= Pr(“A” at position 1|M) Pr(“C” at position 2|M)
Pr(“T” at position 3|M) Pr(“T” at position 4|M)
• In general, probabilities of independent events get multiplied.
• A PSSM assumes independence among nucleotides at different positions.
Positional dependence
• In this data, every time a “G” appears in position 1, an “A” appears in position 3.
• Conversely, an “A” in position 1 always occurs with a “T” in position 3.
ACTG
ACTT
GCAC
ACTT
ACTA
GCAT
ACTA
ACTT
nth-order PSSM
• Normally, PSSM entry (i,j) gives the score for observing the ith letter in position j.
• In an nth-order PSSM, each score is conditioned on the preceding letters in the sequence.
• The entries A|A, C|A, G|A and T|A should sum to 1.
1 2 3 4
A|A 0.25 0.45 0.12 0.21
A|C 0.29 0.20 0.24 0.15
A|G 0.33 0.13 0.41 0.33
A|T 0.13 0.22 0.23 0.31
C|A 0.34 0.35 0.09 0.10
…
T|T 0.19 0.24 0.25 0.31
2nd-order PSSM
nth-order PSSM
• Normally, PSSM entry (i,j) gives the score for observing the ith letter in position j.
• In an nth-order PSSM, each score is conditioned on the preceding letters in the sequence.
• How many rows are in a 3rd-order PSSM for nucleotides? nth-order?
1 2 3 4
A|A 0.25 0.45 0.12 0.21
A|C 0.29 0.20 0.24 0.15
A|G 0.33 0.13 0.41 0.33
A|T 0.13 0.22 0.23 0.31
C|A 0.34 0.35 0.09 0.10
…
T|T 0.19 0.24 0.25 0.31
2nd-order PSSM
The probability of observing an “A”
in position 3, given that we
already observed a “C” in position
2.
Conditional probability
• What is the probability of observing an “A” at position 2, given that we observed a “C” at the previous position?
GCGCAGCCGGCGCCGCCGGCGCCTCCGGGGCGGGCGAGGCAGCCTCATCCTGCG
Conditional probability
• What is the probability of observing an “A” at position 2, given that we observed a “C” at the previous position?
• Answer: total number of CA’s divided by total number of C’s in position 1.
• 3/11 = 27%• Probability of observing CA = 3/18 =
17%.
GCGCAGCCGGCGCCGCCGGCGCCTCCGGGGCGGGCGAGGCAGCCTCATCCTGCG
Conditional probability
• The conditional probability Pr(x|y) =
Number of occurrences of y:xNumber of occurrences of y:*
where * is any letter.
GCGCAGCCGGCGCCGCCGGCGCCTCCGGGGCGGGCGAGGCAGCCTCATCCTGCG
Conditional probability
• What is the probability of observing a “G” at position 3, given that we observed a “C” at the previous position?
GCGCAGCCGGCGCCGCCGGCGCCTCCGGGGCGGGCGAGGCAGCCTCATCCTGCG
Conditional probability
• What is the probability of observing a “G” at position 3, given that we observed a “C” at the previous position?
• Answer: 9/12 = 75%.
GCGCAGCCGGCGCCGCCGGCGCCTCCGGGGCGGGCGAGGCAGCCTCATCCTGCG
Modeling signals
Transcriptionstop
Transcriptionstart
Start
End
5’ splicesite
3’ splicesite
Single exonInitial exon
Intron
Internal exon
Final exon
Red ellipses may correspond to nth-order PSSMs.
Modeling variable-length regions
Exon length
Modeling variable-length regions
1. The easy way, using standard HMMs.
2. And why that’s not so great.
How are variable-length insertions modeled in protein HMMs?
The HMM solution
5’ splicesite
3’ splicesiteIntron
Fixed-length signals
Variable-length content
5’ splicesite
3’ splicesiteIntron
Codons
starttranslation
endtranslationSingle
exon
starttranslation
endtranslationSingle
exon
0 1 2
2
0 1
The complete model
Transcriptionstop
Transcriptionstart
Start
End
5’ splicesite
3’ splicesite
Single exonInitial exon
Intron
Internal exon
Final exon
Red ellipses correspond to nth-order PSSMs.Every arrow contains an invisible box with a self-loop.
A small problem
• Say that each blue arrow emits one letter.
• What is the probability that the intron will be exactly 2 letters long?
• 3 letters long?
• 4 letters long?
5’ splicesite
3’ splicesiteIntron
0.1
0.9
A small problem
• Say that each blue arrow emits one letter.
• What is the probability that the intron will be exactly 2 letters long? 10%
• 3 letters long? 9%
• 4 letters long? 8.1%
5’ splicesite
3’ splicesiteIntron
0.1
0.9
A small problemHMMs tend to
produce geometric
distributions
Real contents are not necessarily geometric.
Building an HMM
• Input: annotated gene sequences
• Output: HMM parameters– Emission distributions within each content– Length distributions of contents– Transition distributions between contents
A more realistic (and complex) HMM model for Gene
Prediction (Genie)
Kulp, D., PhD Thesis, UCSC 2003
Assessing performance:Sensitivity and Specificity
•Testing of predictions is performed on sequences where the gene structure is known
•Sensitivity is the fraction of known genes (or bases or exons) correctly predicted
–“Am I finding the things that I’m supposed to find”
•Specificity is the fraction of predicted genes (or bases or exons) that correspond to true genes
–“What fraction of my predictions are true”?
•In general, increasing one decreases the other
Graphic View of Specificity and Sensitivity
iveFalseNegatveTruePositi
veTruePositi
AllTrue
veTruePositiSn
iveFalsePositveTruePositi
veTruePositi
eAllPositiv
veTruePositiSp
Quantifying the tradeoff:Correlation Coefficient
FNTNPNFPTPPP
FNTPAPFPTNAN
PNAPPPAN
FNFPTNTPCC
;
;;
Specificity/Sensitivity Tradeoffs
•Ideal Distribution of Scores
•More Realistically…
0
200
400
600
800
1000
1200
0 5 10 15 20 25 30 35 40 45 50
score (arb units)
co
un
t
random sequence true sites
0
200
400
600
800
1000
1200
0 10 20 30 40 50
score (arb units)
co
un
t
random sequence true sites
Bayesian Statistics
•Bayes’ Rule
•M: the model, D: data or evidence
D
MMDDM
PP|P
|P posterior
likelihood prior
marginal
continuousP|P
discreteP|PP
dMMMD
MMDD
Basic Bayesian Statistics
•Bayes’ Rule is at the heart of much predictive software
•In the simplest example, we can simply compare two models, and reduce it to a log-odds ratio
2
1
2
1
2
1
M
Mlog
M
Mlog
M
Mlog
P
P
dataP
dataP
dataP
dataP
Genetic +
Genetic -
short +
short -
intergenic
Initiation +
Initiation - Termination -
Termination +
overlap 0
overlap 1 overlap 2
overlap 3
Prokaryotes HMMs: Taking Overlaps onTwo Strands into Account
Genetic +
Genetic -
short +
short -
Initiation +
Initiation - Termination -
Termination +
overlap 0
overlap 1 overlap 2
overlap 3
Coding region (genes)
intergenic
A A A
A A C
A A G….…..…
T T T
Transition from any codon to any other.
Model of
all
possible
64 codons
Coding region (genes)
Integenic regions and overlap regions:
Model Design (3)
Two consecutive genes either overlap each other or separated by an itergenic region.
The overlaping segment or the intergenic region is bordered in one of 4 possible ways.
5'
3'
3'
5'
5'
3'
3'
5'
5'
3'
3'
5'
5'
3'
3'
5'
5'
3'
3'
5'
5'
3'
3'
5'
5'
3'
3'
5'
5'
3'
3'
5'
Tail–Head
Head–Tail
Tail–Tail
Head–Head
Intermediate intergenic region Overlapping Region
Tail Head
Example 1
Genetic +
Genetic -
short +
short -
Initiation +
Initiation - Termination -
Termination +
overlap 0
overlap 1 overlap 2
overlap 3 Transition between two genes on the same strand.
5'
3'
3'
5'
intergenic
Example 2
Genetic +
Genetic -
short +
short -
Initiation +
Initiation - Termination -
Termination +
overlap 0
overlap 1 overlap 2
overlap 3 Two genes on the opposite strands.
5'
3'
3'
5'
intergenic
Transitions between genes
Genetic +
Genetic -
short +
short -
intergenic
Initiation +
Initiation - Termination -
Termination +
overlap 0
overlap 1 overlap 2
overlap 3
5'
3'
3'
5'
Intergenic regions are modeled by
profile HMMs.
Intergenic Regions
5'
3'
3'
5'
We model two different types of intergenic regions:
1. Short intergenic sequences:
9 bases long.
Model situations where two same strand genes are close together.
This situation is common in polycistronic operons.
2. Long intergenic sequences are the more common case.
They are modeled by the following 2 profile HMMs:
Transcription termination signal: 18 bases long.
Promoter region including the Shine-Dalgarno signal: 25 bases long.
ACGT----
A--CG--T
A----CGT
ACG----T
ACGT----
Weight matrix models[i] (WMM) are used to represent overlapping regions of 1 or 4 bases,
consisting of the stop codon of the previous gene and the start codon of the next one. .
T A A GT
N N A T G A N NA-C-G-T-
A-C-G-T-
A---CG-T
ACG----T
ACGT----
A----CGT
A-C-G-T-
A-C-G-T-
Overlap Regions (1)
1 base overlap of stop codon TAA or TGA, with init codon ATG:
4 bases overlap: First gene terminated by TGA, second gene starts with [AG]TG:
WMM formatbases
bases WMM format
Overlap regions of 1 or 4 bases:
For each one of the 4 possible paths described (headhead, head tail, tail tail,
tailhead), all possible frame differences are allowed.
For example: a tailhead transition allows a 1 or 2 bases' shift of the reading frame.
Overlap Regions (2)
Overlap regions of 6 or more bases:
Frame 1Stop codon
Frame 1Stop codon
Frame 1
Frame 2
Frame 3
Init codon
Initcodon
Frame 2/3
5'
3'
3'
5'