8/13/2019 Gene Predict http://slidepdf.com/reader/full/gene-predict 1/69 Gene prediction and HMM Computational Genomics
Gene prediction and HMM
Computational Genomics
Annotation of Genomic Sequence
Given the sequence of an organism’s genome, we would like to be able to identify:
 – Genes
 – Exon boundaries & splice sites
 – Beginning and end of translation
 – Alternative splicings
 – Regulatory elements (e.g. promoters)
The only certain way to do this is experimentally, but it is time consuming and expensive. Computational methods can achieve reasonable accuracy quickly, and help direct experimental approaches.
primary goals
secondary goals
Prokaryotic Gene Structure
Promoter CDS Terminator
transcription
Genomic DNA
mRNA
Most bacterial promoters contain the Pribnow box, at about -10, which has the consensus sequence 5'-TATAAT-3'. (The Shine-Dalgarno signal is a different element: the ribosome-binding site, a few bases upstream of the start codon.)
The terminator: a signal at the end of the coding sequence that terminates the transcription of RNA.
The coding sequence is composed of nucleotide triplets. Each triplet codes for an amino acid. The AAs are the building blocks of proteins.
Pieces of a (Eukaryotic) Gene (on the genome)
[Figure: double-stranded genomic DNA (5’ to 3’ and 3’ to 5’), ~1-100 Mbp, with a zoomed region of ~1-1000 kbp showing the labeled elements below]
 – exons (cds & utr): ~10^2-10^3 bp
 – introns: ~10^2-10^5 bp
 – promoter: ~10^3 bp
 – enhancers: ~10^1-10^2 bp
 – other regulatory sequences: ~10^1-10^2 bp
 – polyadenylation site
What is it about genes that we can measure (and model)?
• Most of our knowledge is biased towards protein-coding characteristics
 – ORF (Open Reading Frame): a sequence defined by an in-frame AUG and stop codon, which in turn defines a putative amino acid sequence.
 – Codon Usage: most frequently measured by CAI (Codon Adaptation Index)
• Other phenomena
 – Nucleotide frequencies and correlations:
   • value and structure
 – Functional sites:
   • splice sites, promoters, UTRs, polyadenylation sites
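The ORF definition above (an in-frame start codon followed by the first in-frame stop codon) can be illustrated with a minimal Python sketch. It scans only the forward strand of a DNA string, uses ATG in place of AUG, and the function name `find_orfs` and the `min_codons` threshold are illustrative assumptions rather than anything given in the lecture.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end) half-open index pairs of forward-strand ORFs.

    An ORF here runs from the first in-frame ATG to the next in-frame
    stop codon (inclusive). min_codons is a hypothetical length filter.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close it at the stop codon
    return orfs

orfs = find_orfs("CCATGAAATGACC")
```

A real gene finder would also scan the reverse complement and would treat ORF length as only one piece of evidence among several.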
A simple measure: ORF length
Comparison of Annotation and Spurious ORFs in S. cerevisiae
Basrai MA, Hieter P, and Boeke J Genome Research 1997 7:768-771
Codon Adaptation Index (CAI)
• Parameters are empirically determined by examining a “large” set of example genes
• This is not perfect
 – Genes sometimes have unusual codons for a reason
 – The predictive power is dependent on length of sequence

$$\mathrm{CAI} = \prod_{i \in \mathrm{codons}} \frac{f_{\mathrm{codon}_i}}{f^{\max}_{\mathrm{codon}_i}}$$
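A minimal sketch of how a CAI-style score could be computed in Python. It estimates codon frequencies from a reference gene set, takes the ratio of each codon's frequency to that of the most frequent synonymous codon, and combines the ratios as a geometric mean. The geometric-mean normalisation, the pseudocount, and all function names are assumptions of this sketch; the slide's formula shows only the product of ratios.

```python
import math
from collections import Counter

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def codons(seq):
    """Split a coding sequence into complete codons."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

def relative_adaptiveness(reference_genes, pseudocount=0.5):
    """w[codon] = f(codon) / f(most frequent synonymous codon)."""
    counts = Counter(c for g in reference_genes for c in codons(g))
    best = {}
    for codon, aa in CODON_TABLE.items():
        best[aa] = max(best.get(aa, 0.0), counts[codon] + pseudocount)
    return {codon: (counts[codon] + pseudocount) / best[aa]
            for codon, aa in CODON_TABLE.items()}

def cai(seq, w):
    """Geometric mean of w over the sense codons of seq (assumed form)."""
    logs = [math.log(w[c]) for c in codons(seq)
            if c in w and CODON_TABLE[c] != "*"]
    return math.exp(sum(logs) / len(logs)) if logs else 0.0

w = relative_adaptiveness(["ATGGCTGCTGCTGCCTAA"])   # toy reference set
score_preferred = cai("ATGGCTGCT", w)   # uses only the most frequent codons
score_rare = cai("ATGGCCGCC", w)        # uses a rarer alanine codon
```

With a one-gene toy reference the numbers mean little; the point is only that sequences built from preferred codons score higher.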
Splice signals (mice): GT , AG
General Things to Remember about (Protein-coding) Gene Prediction Software
• It is, in general, organism-specific
• It works best on genes that are reasonably similar to something seen previously
• It finds protein coding regions far better than non-coding regions
• In the absence of external (direct) information, alternative forms will not be identified
• It is imperfect! (It’s biology, after all…)
Simple HMM : Prokaryotes
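[Figure: HMM with a transition matrix (entries include 0.998, 0.996, 0.5, 0.001 and 0.0002) and an emission matrix H with rows (0.32, 0.18, 0.18, 0.32), (0.25, 0.25, 0.22, 0.28) and an all-zero row]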
x_m(i) = probability of being in state m at position i;
H(m, y_i) = probability of emitting character y_i in state m;
mk = probability of transition from state k to m.
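The three quantities defined above are what a forward recursion over an HMM needs. The sketch below is one possible Python rendering, not the lecture's own code: the two emission rows are the ones recoverable from the slide (their A, C, G, T column order and the state names are assumptions), and the transition and initial probabilities are hypothetical values chosen only so that the example runs.

```python
STATES = ("coding", "noncoding")   # state names are assumptions

# Emission rows from the slide; column order A, C, G, T is assumed.
EMIT = {
    "coding": dict(zip("ACGT", (0.32, 0.18, 0.18, 0.32))),
    "noncoding": dict(zip("ACGT", (0.25, 0.25, 0.22, 0.28))),
}

# TRANS[m][k] = probability of a transition from state k to state m.
# These values are hypothetical, not taken from the slide.
TRANS = {
    "coding": {"coding": 0.998, "noncoding": 0.001},
    "noncoding": {"coding": 0.002, "noncoding": 0.999},
}
INIT = {"coding": 0.5, "noncoding": 0.5}   # hypothetical start distribution

def forward(seq):
    """x[i][m] = Pr(first i+1 characters of seq, state m at position i)."""
    x = [{m: INIT[m] * EMIT[m][seq[0]] for m in STATES}]
    for y in seq[1:]:
        prev = x[-1]
        x.append({m: EMIT[m][y] * sum(TRANS[m][k] * prev[k] for k in STATES)
                  for m in STATES})
    return x

table = forward("ACGT")
sequence_probability = sum(table[-1].values())
```

For long sequences the products underflow, so practical implementations work with logarithms or rescale each column.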
Outline: Rest of Lecture
• Eukaryotic gene structure
• Modeling gene structure
• Using the model to make predictions
• Improving the model topology
• Modeling fixed-length signals
A eukaryotic gene
• This is the human p53 tumor suppressor gene on chromosome 17.
• Genscan is one of the most popular gene prediction algorithms.
A eukaryotic gene
3’ untranslated region
Final exon
Initial exon
Introns
Internal exons
This particular gene lies on the reverse strand.
An Intron
[Figure: an intron on the reverse strand, with the 3’ splice site on the left and the 5’ splice site on the right]
revcomp(CT) = AG
revcomp(AC) = GT
GT: signals start of intron
AG: signals end of intron
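The revcomp notation used on this slide can be made concrete with a short Python sketch. The function name follows the slide; the implementation is an assumption and handles only the four unambiguous bases.

```python
# Map each base to its Watson-Crick partner (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

# A gene on the reverse strand shows its splice signals reverse-complemented
# when read along the forward strand.
end_of_intron = revcomp("CT")
start_of_intron = revcomp("AC")
```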
Signals vs contents
• In gene finding, a small pattern within the genomic DNA is referred to as a signal, whereas a region of genomic DNA is a content.
• Examples of signals: splice sites, starts and ends of transcription or translation, branch points, transcription factor binding sites
• Examples of contents: exons, introns, UTRs, promoter regions
Prior knowledge
• We want to build a probabilistic model of a gene that incorporates our prior knowledge.
• E.g., the translated region must have a length that is a multiple of 3.
Prior knowledge
• The translated region must have a length that is a multiple of 3.
• Some codons are more common than others.
• Exons are usually shorter than introns.
• The translated region begins with a start signal and ends with a stop codon.
• 5’ splice sites (exon to intron) are usually GT;
• 3’ splice sites (intron to exon) are usually AG.
• The distribution of nucleotides and dinucleotides is usually different in introns and exons.
A simple gene model
[Figure: state diagram with labels Transcription start, Transcription stop, Start, Gene, End, and four Intergenic regions]
A probabilistic gene model
[Figure: state diagram with labels Transcription start, Transcription stop, Start, Gene, End, and four Intergenic regions; transition probabilities 0.67, 0.33, 1.00, 0.25 and 0.75 are marked on it]
Every box stores transition probabilities for outgoing arrows.
Every arrow stores emission probabilities for emitted nucleotides.
Pr(TACAGTAGATATGA) = 0.0001
Pr(AACAGT) = 0.001
Pr(AACAGTAC) = 0.002
…
Parse
• For a given sequence, a parse is an assignment of gene structure to that sequence.
• In a parse, every base is labeled, corresponding to the content it belongs (or is predicted to belong) to.
• In our simple model, the parse contains only “I” (intergenic) and “G” (gene).
• A more complete model would contain, e.g., “-” for intergenic, “E” for exon and “I” for intron.
S = ACTGACTACTACGACTACGATCTACTACGGGCGCGACCT ATGCGTATGTTTTGA ACTGACTATGCGATCTACGACTCGACTAGCTAC
P = IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII GGGGGGGGGGGGGGG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
The probability of a parse
[Figure: state diagram with labels Transcription start, Transcription stop, Start, Gene, End, and four Intergenic regions; transition probabilities 0.67, 0.33, 1.00, 0.25 and 0.75 are marked on it]
Pr(ACTGACTACTACGACTACGATCTACTACGGGCGCGACCT) = 0.0000543
Pr(ATGCGTATGTTTTGA) = 0.00000000142
Pr(ACTGACTATGCGATCTACGACTCGACTAGCTAC) = 0.0000789
Pr(parse P | sequence S, model M)
= 0.67 × 0.0000543 × 1.00 × 0.00000000142 × 0.75 × 0.0000789
= 3.057 × 10^-18
S = ACTGACTACTACGACTACGATCTACTACGGGCGCGACCT ATGCGTATGTTTTGA ACTGACTATGCGATCTACGACTCGACTAGCTAC
P = IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII GGGGGGGGGGGGGGG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
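The parse probability on this slide is a plain product of transition and emission probabilities, which the following Python sketch reproduces. The numbers are the ones shown on the slide; the variable names are illustrative. The log form is included because such products quickly become too small to handle comfortably.

```python
import math

# Factors in the order they appear on the slide: transition, emission,
# transition, emission, transition, emission.
factors = [0.67, 0.0000543, 1.00, 0.00000000142, 0.75, 0.0000789]

parse_probability = math.prod(factors)

# Working in log space turns the product into a sum, which avoids
# underflow for long sequences.
log10_probability = sum(math.log10(f) for f in factors)
```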
Finding the best parse
• For a given sequence S, the model M assigns a probability Pr(P|S,M) to every parse P.
• We want to find the parse P* that receives the highest probability.

$$P^* = \arg\max_{P} \Pr(P \mid S, M)$$
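One standard way to compute this argmax for an HMM is the Viterbi dynamic program. The slides do not name an algorithm at this point, so the sketch below is a swapped-in illustration rather than the lecture's method: it uses a per-base two-state model ("I" and "G"), and every probability in it is a hypothetical value chosen so that the example separates an AT-rich flank from a GC-rich core.

```python
import math

def viterbi(seq, states, log_init, log_trans, log_emit):
    """Most probable state path; log_trans[k][s] is log Pr(k -> s)."""
    best = [{s: log_init[s] + log_emit[s][seq[0]] for s in states}]
    back = []
    for y in seq[1:]:
        scores, ptr = {}, {}
        for s in states:
            # Best predecessor of state s at this position.
            prev = max(states, key=lambda k: best[-1][k] + log_trans[k][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][y]
            ptr[s] = prev
        best.append(scores)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))

def logs(d):
    return {k: math.log(v) for k, v in d.items()}

# Hypothetical parameters, for illustration only.
STATES = "IG"
LOG_INIT = logs({"I": 0.5, "G": 0.5})
LOG_TRANS = {"I": logs({"I": 0.9, "G": 0.1}), "G": logs({"I": 0.1, "G": 0.9})}
LOG_EMIT = {"I": logs({"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}),
            "G": logs({"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1})}

parse = viterbi("ATATATGCGCGCGCATATAT", STATES, LOG_INIT, LOG_TRANS, LOG_EMIT)
```

The running time is proportional to the sequence length times the square of the number of states, which is what makes exhaustive enumeration of parses unnecessary.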
Improved model topology
• Draw a model that includes introns
[Figure: state diagram with labels Transcription start, Transcription stop, Start, Gene, End, Intergenic 1, Intergenic 2, Intergenic 3 and Intergenic 4]
Improved model topology
[Figure: state diagram with signals Transcription start, Transcription stop, Start, End, 5’ splice site and 3’ splice site]
Improved model topology
[Figure: state diagram with signals Transcription start, Transcription stop, Start, End, 5’ splice site and 3’ splice site]
4 intergenics, 1 intron, 4 exons
Improved model topology
[Figure: state diagram with signals Transcription start, Transcription stop, Start, End, 5’ splice site and 3’ splice site, and contents Single exon, Initial exon, Intron, Internal exon and Final exon]
Modeling the 5’ splice site
• Most introns begin with the letters “GT.”
• We can add this signal to the model.
[Figure: the Intron between the 5’ splice site and the 3’ splice site, with “GT” emitted at the 5’ splice site]
Modeling the 5’ splice site
• Most introns begin with the letters “GT.”
• We can add this signal to the model.
• Indeed, we can model each nucleotide with its own arrow.
[Figure: the Intron between the 5’ splice site and the 3’ splice site, with the 5’ splice site drawn as two arrows emitting G and T]
Arrow 1 (G): Pr(A)=0 Pr(C)=0 Pr(G)=1 Pr(T)=0
Arrow 2 (T): Pr(A)=0 Pr(C)=0 Pr(G)=0 Pr(T)=1
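Modeling the 5’ splice site
• Like most biological phenomena, the splice site signal admits exceptions.
• The resulting model of the 5’ splice site is a length-2 PSSM (position-specific scoring matrix).
[Figure: the Intron between the 5’ splice site and the 3’ splice site, with the 5’ splice site drawn as two arrows emitting G and T]
Position 1 (G): Pr(A)=0.01 Pr(C)=0.01 Pr(G)=0.97 Pr(T)=0.01
Position 2 (T): Pr(A)=0.01 Pr(C)=0.01 Pr(G)=0.01 Pr(T)=0.97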
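As a small illustration, the length-2 PSSM above can be written as one probability table per position and evaluated by multiplying the per-position entries. This Python sketch uses the slide's probabilities; the names `PSSM_5SS` and `pssm_probability` are assumptions.

```python
import math

# One column per position of the 5' splice site signal.
PSSM_5SS = [
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},   # position 1: mostly G
    {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},   # position 2: mostly T
]

def pssm_probability(site, pssm):
    """Probability of a candidate site under a 0th-order PSSM."""
    if len(site) != len(pssm):
        raise ValueError("site length must match the PSSM length")
    return math.prod(column[base] for column, base in zip(pssm, site))

p_consensus = pssm_probability("GT", PSSM_5SS)   # 0.97 * 0.97
p_variant = pssm_probability("GC", PSSM_5SS)     # 0.97 * 0.01
```

Because the non-consensus entries are small but non-zero, a rare GC donor site is penalised rather than ruled out.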
Real splice sites
• Real splice sites show some conservation at positions beyond the first two.
• We can add additional arrows to model these states.
[Figure: sequence logo of real splice sites; source: weblogo.berkeley.edu]
Adding signals
[Figure: state diagram with signals Transcription start, Transcription stop, Start, End, 5’ splice site and 3’ splice site, and contents Single exon, Initial exon, Intron, Internal exon and Final exon]
Red ellipses correspond to signal models like this:
Positional Independence
Pr(“ACTT”|M)
= Pr(“A” at position 1 and “C” at position 2 and “T” at position 3 and “T” at position 4|M)
= Pr(“A” at position 1|M) × Pr(“C” at position 2|M) × Pr(“T” at position 3|M) × Pr(“T” at position 4|M)
• In general, probabilities of independent events get multiplied.
• A PSSM assumes independence among nucleotides at different positions.
Positional dependence
• In this data, every time a “G” appears in position 1, an “A” appears in position 3.
• Conversely, an “A” in position 1 always occurs with a “T” in position 3.
A C T G
A C T T
G C A C
A C T T
A C T A
G C A T
A C T A
A C T T
nth-order PSSM
• Normally, PSSM entry (i,j) gives the score for observing the i-th letter in position j.
• In an nth-order PSSM, each score is conditioned on the preceding letters in the sequence.
• The entries A|A, C|A, G|A and T|A should sum to 1.

2nd-order PSSM
       1     2     3     4
A|A   0.25  0.45  0.12  0.21
A|C   0.29  0.20  0.24  0.15
A|G   0.33  0.13  0.41  0.33
A|T   0.13  0.22  0.23  0.31
C|A   0.34  0.35  0.09  0.10
…
T|T   0.19  0.24  0.25  0.31
nth-order PSSM
• Normally, PSSM entry (i,j) gives the score for observing the i-th letter in position j.
• In an nth-order PSSM, each score is conditioned on the preceding letters in the sequence.
• How many rows are in a 3rd-order PSSM for nucleotides? nth-order?

2nd-order PSSM
       1     2     3     4
A|A   0.25  0.45  0.12  0.21
A|C   0.29  0.20  0.24  0.15
A|G   0.33  0.13  0.41  0.33
A|T   0.13  0.22  0.23  0.31
C|A   0.34  0.35  0.09  0.10
…
T|T   0.19  0.24  0.25  0.31

Row A|C, column 3: the probability of observing an “A” in position 3, given that we already observed a “C” in position 2.
Conditional probability
• What is the probability of observing an “A” at position 2, given that we observed a “C” at the previous position?
GCG
CAG
CCG
GCG
CCG
CCG
GCG
CCT
CCG
GGG
CGG
GCG
AGG
CAG
CCT
CAT
CCT
GCG
Conditional probability
• What is the probability of observing an “A” at position 2, given that we observed a “C” at the previous position?
• Answer: total number of CA’s divided by total number of C’s in position 1.
• 3/11 = 27%
• Probability of observing CA = 3/18 = 17%.
GCG
CAG
CCG
GCG
CCG
CCG
GCG
CCT
CCG
GGG
CGG
GCG
AGG
CAG
CCT
CAT
CCT
GCG
Conditional probability
• The conditional probability

$$\Pr(x \mid y) = \frac{\text{number of occurrences of } y{:}x}{\text{number of occurrences of } y{:}*}$$

where * is any letter.
GCG
CAG
CCG
GCG
CCG
CCG
GCG
CCT
CCG
GGG
CGG
GCG
AGG
CAG
CCT
CAT
CCT
GCG
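The counting rule above can be checked directly against the 18 example sites listed on these slides. The Python sketch below is illustrative (the function name `conditional` and its argument order are assumptions); it counts occurrences of y:x at adjacent positions and divides by occurrences of y:*.

```python
from fractions import Fraction

# The 18 example sites from the slide.
SITES = ("GCG CAG CCG GCG CCG CCG GCG CCT CCG "
         "GGG CGG GCG AGG CAG CCT CAT CCT GCG").split()

def conditional(sites, pos, x, y):
    """Pr(x at 1-based position pos | y at position pos - 1).

    Raises ZeroDivisionError if y never occurs at position pos - 1.
    """
    given = [s for s in sites if s[pos - 2] == y]    # occurrences of y:*
    hits = [s for s in given if s[pos - 1] == x]     # occurrences of y:x
    return Fraction(len(hits), len(given))

p_a_after_c = conditional(SITES, 2, "A", "C")   # "A" at 2 given "C" at 1
p_g_after_c = conditional(SITES, 3, "G", "C")   # "G" at 3 given "C" at 2
p_joint_ca = Fraction(sum(s.startswith("CA") for s in SITES), len(SITES))
```

Exact fractions are used so that the counts (3 of 11, 9 of 12, 3 of 18) stay visible instead of being rounded.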
Conditional probability
• What is the probability of observing a “G” at position 3, given that we observed a “C” at the previous position?
GCG
CAG
CCG
GCG
CCG
CCG
GCG
CCT
CCG
GGG
CGG
GCG
AGG
CAG
CCT
CAT
CCT
GCG
Conditional probability
• What is the probability of observing a “G” at position 3, given that we observed a “C” at the previous position?
• Answer: 9/12 = 75%.
GCG
CAG
CCG
GCG
CCG
CCG
GCG
CCT
CCG
GGG
CGG
GCG
AGG
CAG
CCT
CAT
CCT
GCG
Modeling signals
[Figure: state diagram with signals Transcription start, Transcription stop, Start, End, 5’ splice site and 3’ splice site, and contents Single exon, Initial exon, Intron, Internal exon and Final exon]
Red ellipses may correspond to nth-order PSSMs.
Modeling variable-length regions
Exon length
Modeling variable-length regions
1. The easy way, using standard HMMs.
2. And why that’s not so great.
How are variable-length insertions modeled in protein HMMs?
The HMM solution
[Figure: the Intron between the 5’ splice site and the 3’ splice site, drawn twice; the splice sites are labeled as fixed-length signals and the intron as variable-length content]
Codons
[Figure: the Single exon content between start translation and end translation, drawn twice, with codon positions labeled 0, 1 and 2]
The complete model
[Figure: state diagram with signals Transcription start, Transcription stop, Start, End, 5’ splice site and 3’ splice site, and contents Single exon, Initial exon, Intron, Internal exon and Final exon]
Red ellipses correspond to nth-order PSSMs.
Every arrow contains an invisible box with a self-loop.
A small problem
• Say that each blue arrow emits one letter.
• What is the probability that the intron will be exactly 2 letters long?
• 3 letters long?
• 4 letters long?
[Figure: the Intron between the 5’ splice site and the 3’ splice site, with probabilities 0.1 and 0.9 marked on its arrows]
A small problem
• Say that each blue arrow emits one letter.
• What is the probability that the intron will be exactly 2 letters long? 10%
• 3 letters long? 9%
• 4 letters long? 8.1%
[Figure: the Intron between the 5’ splice site and the 3’ splice site, with probabilities 0.1 and 0.9 marked on its arrows]
A small problem
HMMs tend to produce geometric distributions.
Real contents are not necessarily geometric.
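The 10%, 9%, 8.1% sequence from the previous slide is a geometric distribution: each extra letter costs one more pass through the self-loop. The Python sketch below assumes, consistently with those answers, that 0.9 is the self-loop probability, 0.1 the exit probability, and that two letters are emitted regardless; the function name and parameters are illustrative.

```python
def intron_length_probability(length, stay=0.9, fixed=2):
    """Pr(intron has exactly `length` letters) for a self-loop state.

    `fixed` letters are always emitted; each additional letter requires
    taking the self-loop (probability `stay`) once more before exiting.
    """
    if length < fixed:
        return 0.0
    return stay ** (length - fixed) * (1 - stay)

probabilities = [intron_length_probability(n) for n in (2, 3, 4)]
```

The shortest length is always the most probable one under this model, which is the mismatch with real intron and exon length distributions that the slide points out.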
Building an HMM
• Input: annotated gene sequences
• Output: HMM parameters
 – Emission distributions within each content
 – Length distributions of contents
 – Transition distributions between contents
A more realistic (and complex) HMM model for Gene Prediction (Genie)
Kulp, D., PhD Thesis, UCSC 2003
Assessing performance: Sensitivity and Specificity
• Testing of predictions is performed on sequences where the gene structure is known
• Sensitivity is the fraction of known genes (or bases or exons) correctly predicted
 – “Am I finding the things that I’m supposed to find?”
• Specificity is the fraction of predicted genes (or bases or exons) that correspond to true genes
 – “What fraction of my predictions are true?”
• In general, increasing one decreases the other
Graphic View of Specificity and Sensitivity
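$$Sn = \frac{\text{TruePositive}}{\text{AllTrue}} = \frac{\text{TruePositive}}{\text{TruePositive} + \text{FalseNegative}}$$

$$Sp = \frac{\text{TruePositive}}{\text{AllPositive}} = \frac{\text{TruePositive}}{\text{TruePositive} + \text{FalsePositive}}$$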
Quantifying the tradeoff: Correlation Coefficient

$$CC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{PP \cdot PN \cdot AP \cdot AN}}$$

$$PP = TP + FP; \quad PN = TN + FN; \quad AP = TP + FN; \quad AN = TN + FP$$
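The sensitivity, specificity and correlation coefficient defined on these slides reduce to a few lines of arithmetic on the four confusion counts. This Python sketch follows those definitions; the function name `accuracy` and the example counts are illustrative assumptions.

```python
import math

def accuracy(tp, fp, tn, fn):
    """Return (Sn, Sp, CC) from base-level or exon-level counts."""
    ap, an = tp + fn, tn + fp      # actual positives / actual negatives
    pp, pn = tp + fp, tn + fn      # predicted positives / predicted negatives
    sn = tp / ap                   # fraction of true items that were found
    sp = tp / pp                   # fraction of predictions that are true
    cc = (tp * tn - fp * fn) / math.sqrt(pp * pn * ap * an)
    return sn, sp, cc

# Hypothetical counts for 1000 bases, 100 of which are truly coding.
sn, sp, cc = accuracy(tp=80, fp=20, tn=880, fn=20)
```

Note that CC is undefined when any of the four marginal totals is zero, for example when a program predicts no genes at all.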
Specificity/Sensitivity Tradeoffs
• Ideal Distribution of Scores
• More Realistically…
[Figure: two histograms of count (0 to 1200) against score (arbitrary units, 0 to 50), each comparing random sequence with true sites]
Bayesian Statistics
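• Bayes’ Rule
• M: the model, D: data or evidence

$$\underbrace{P(M \mid D)}_{\text{posterior}} = \frac{\overbrace{P(D \mid M)}^{\text{likelihood}} \; \overbrace{P(M)}^{\text{prior}}}{\underbrace{P(D)}_{\text{marginal}}}$$

$$P(D) = \begin{cases} \sum_{M} P(D \mid M)\, P(M) & \text{discrete} \\ \int P(D \mid M)\, P(M)\, dM & \text{continuous} \end{cases}$$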
Basic Bayesian Statistics
• Bayes’ Rule is at the heart of much predictive software
• In the simplest example, we can simply compare two models, and reduce it to a log-odds ratio

$$\log \frac{P(M_1 \mid \text{data})}{P(M_2 \mid \text{data})} = \log \frac{P(\text{data} \mid M_1)}{P(\text{data} \mid M_2)} + \log \frac{P(M_1)}{P(M_2)}$$
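The log-odds comparison of two models splits into a likelihood term and a prior term, which the short Python sketch below makes explicit. The function name, the default equal priors, and the example likelihoods are all assumptions made for illustration.

```python
import math

def posterior_log_odds(lik1, lik2, prior1=0.5, prior2=0.5):
    """log P(M1|data)/P(M2|data) = log-likelihood ratio + log-prior ratio."""
    return math.log(lik1 / lik2) + math.log(prior1 / prior2)

# Hypothetical example: a 2-base site scored under a signal model
# (likelihood 0.9409) against a uniform background (0.25 * 0.25).
equal_priors = posterior_log_odds(0.9409, 0.0625)
rare_signal = posterior_log_odds(0.9409, 0.0625, prior1=0.001, prior2=0.999)
```

A positive value favours the first model; with a strongly skewed prior the same likelihood ratio can give a negative result, which is why true sites that are rare need a high score to be called.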
Prokaryotes HMMs: Taking Overlaps on Two Strands into Account
[Figure: HMM state diagram with states Genetic +, Genetic -, short +, short -, intergenic, Initiation +, Initiation -, Termination +, Termination -, overlap 0, overlap 1, overlap 2 and overlap 3]
Coding region (genes)
[Figure: HMM state diagram with states Genetic +, Genetic -, short +, short -, intergenic, Initiation +, Initiation -, Termination +, Termination -, overlap 0, overlap 1, overlap 2 and overlap 3]
Coding region (genes)
Model of all possible 64 codons: AAA, AAC, AAG, …, TTT.
Transition from any codon to any other.
Model Design (3)
Intergenic regions and overlap regions:
Two consecutive genes either overlap each other or are separated by an intergenic region.
The overlapping segment or the intergenic region is bordered in one of 4 possible ways:
 – Tail – Head
 – Head – Tail
 – Tail – Tail
 – Head – Head
[Figure: each of the four cases drawn on the two strands (5' 3' and 3' 5'), once with an intermediate intergenic region and once with an overlapping region; legend: Tail, Head, 5', 3']
Example 1
Transition between two genes on the same strand.
[Figure: HMM state diagram with states Genetic +, Genetic -, short +, short -, intergenic, Initiation +, Initiation -, Termination +, Termination -, overlap 0, overlap 1, overlap 2 and overlap 3, drawn above the two strands (5' 3' and 3' 5')]
Example 2
Two genes on opposite strands.
[Figure: HMM state diagram with states Genetic +, Genetic -, short +, short -, intergenic, Initiation +, Initiation -, Termination +, Termination -, overlap 0, overlap 1, overlap 2 and overlap 3, drawn above the two strands (5' 3' and 3' 5')]
Transitions between genes
[Figure: HMM state diagram with states Genetic +, Genetic -, short +, short -, intergenic, Initiation +, Initiation -, Termination +, Termination -, overlap 0, overlap 1, overlap 2 and overlap 3]
Intergenic Regions
Intergenic regions are modeled by profile HMMs.
We model two different types of intergenic regions:
1. Short intergenic sequences:
 – 9 bases long.
 – Model situations where two same-strand genes are close together.
 – This situation is common in polycistronic operons.
2. Long intergenic sequences are the more common case. They are modeled by the following 2 profile HMMs:
 – Transcription termination signal: 18 bases long.
 – Promoter region including the Shine-Dalgarno signal: 25 bases long.
Overlap Regions (1)
Overlap regions of 1 or 4 bases:
Weight matrix models (WMMs) are used to represent overlapping regions of 1 or 4 bases, consisting of the stop codon of the previous gene and the start codon of the next one.
 – 1 base overlap of stop codon TAA or TGA with init codon ATG. Bases: T [AG] A T G
 – 4 bases overlap: first gene terminated by TGA, second gene starts with [AG]TG. Bases: N N [AG] T G A N N
[Figure: each overlap shown as bases and in WMM format, with one column of A, C, G, T weights per position; N positions weight all four bases equally]
Overlap Regions (2)
Overlap regions of 6 or more bases:
For each one of the 4 possible paths described (head-head, head-tail, tail-tail, tail-head), all possible frame differences are allowed.
For example: a tail-head transition allows a shift of 1 or 2 bases in the reading frame.
[Figure: reading frames 1, 2 and 3 drawn as parallel tracks; a gene in Frame 1 ending at its stop codon overlaps a gene in Frame 2/3 beginning at its init codon]