HMM for CpG Islands
Arti Kelkar, Pete Rossetti, Peter Warren
Dec 19, 2015
• HMM history
• General background
• Three Fundamental Problems
  1. Evaluation
  2. Decoding
  3. Training
• HMM Applications
  – Bioinformatics
  – Non-Bioinformatics
• CpG Islands Problem
  – CpG Islands
  – Definition
  – Why interesting
  – Hidden Markov Model for CpG
  – What’s Hidden
• Mathematica Implementation
  – Training
  – Decoding
AA Markov
Andrei Andreyevich Markov (1856-1922)
• Early 1900s
  – Markov conceives “Markov chains”, including a proof of the Central Limit Theorem for Markov chains
  – Studies with Chebyshev and takes over his classes at the Univ. of St. Petersburg
• 1913
  – Russian government celebrates the 300th anniversary of the House of Romanov
  – AA Markov organizes a counter-celebration: the 200th anniversary of Bernoulli’s Law of Large Numbers
HMM – History
• 1960s
  – Use of HMMs developed by a Cold War-era research team in a classified program at the Communication Research Division of the Institute for Defense Analyses (Oscar Rothaus).
• 1970s
  – HMM work is declassified and is soon being used in many peaceful applications.
Markov Chain
• Sunny yesterday ==> 0.5 probability that it will be sunny today, and 0.25 each that it will be cloudy or rainy
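The weather example can be sketched in a few lines. Only the "sunny" row below comes from the slide; the "cloudy" and "rainy" rows, and the function name `sequence_probability`, are assumptions made up for illustration.

```python
# Weather Markov chain: TRANSITIONS[yesterday][today] = P(today | yesterday).
TRANSITIONS = {
    "sunny":  {"sunny": 0.5,  "cloudy": 0.25, "rainy": 0.25},  # from the slide
    "cloudy": {"sunny": 0.25, "cloudy": 0.5,  "rainy": 0.25},  # assumed values
    "rainy":  {"sunny": 0.25, "cloudy": 0.25, "rainy": 0.5},   # assumed values
}


def sequence_probability(weather, transitions=TRANSITIONS):
    """Probability of the days after the first one, given the first day."""
    prob = 1.0
    # Markov property: each day depends only on the day before it.
    for yesterday, today in zip(weather, weather[1:]):
        prob *= transitions[yesterday][today]
    return prob


# sunny -> sunny -> rainy: 0.5 * 0.25
print(sequence_probability(["sunny", "sunny", "rainy"]))  # 0.125
```

Each row of the table sums to 1, because some weather must follow every day.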
Hidden Markov Model
HMM Definition
• Hidden Markov Model is a triplet (Π, A, B)
  – Π: Vector of initial state probabilities
  – A: Matrix of state transition probabilities
  – B: Matrix of observation probabilities
  – N: Number of hidden states in the model
  – M: Number of observation symbols
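Written out, the triplet might look as follows. The symbols $q_t$ (hidden state at step $t$), $o_t$ (observation at step $t$), $\pi_i$, $a_{ij}$ and $b_j(k)$ are conventional notation assumed here; the slide itself names only Π, A, B, N and M.

```latex
\begin{aligned}
\Pi &= (\pi_i), & \pi_i &= P(q_1 = i), & \sum_{i=1}^{N} \pi_i &= 1 \\
A &= (a_{ij}) \in \mathbb{R}^{N \times N}, & a_{ij} &= P(q_{t+1} = j \mid q_t = i), & \sum_{j=1}^{N} a_{ij} &= 1 \\
B &= (b_j(k)) \in \mathbb{R}^{N \times M}, & b_j(k) &= P(o_t = k \mid q_t = j), & \sum_{k=1}^{M} b_j(k) &= 1
\end{aligned}
```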
HMM – Three Problems
• Evaluation
• Decoding
• Training
HMM - Overview: Evaluation Problem
Given a set of HMMs, which one is most likely to have produced the observation sequence?
Observation sequence: GACGAAACCCTGTCTCTATTTATCC
Candidate models: HMM 1, HMM 2, HMM 3, …, HMM n
p(HMM-1)? p(HMM-2)? p(HMM-3)? … p(HMM-n)?
The Forward Algorithm computes the probability of the sequence under each HMM; the model with the largest probability is chosen.
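A minimal sketch of the evaluation step, assuming each model is given as (Π, A, B) in plain Python lists. The two candidate models and the name `forward_probability` are invented for illustration; they are not taken from the slides.

```python
def forward_probability(pi, A, B, obs):
    """P(obs | model) computed with the forward algorithm.

    pi[i]   : initial probability of state i
    A[i][j] : transition probability from state i to state j
    B[i][k] : probability that state i emits symbol k
    obs     : list of symbol indices
    """
    n = len(pi)
    # Initialisation: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    # Recursion: alpha_{t+1}(j) = b_j(o_{t+1}) * sum_i alpha_t(i) * a_ij
    for symbol in obs[1:]:
        alpha = [
            B[j][symbol] * sum(alpha[i] * A[i][j] for i in range(n))
            for j in range(n)
        ]
    # Termination: sum over all states at the final position
    return sum(alpha)


SYMBOLS = "ACGT"
obs = [SYMBOLS.index(c) for c in "GACGAAACCCTGTCTCTATTTATCC"]

# Two hypothetical two-state models; all parameters are made up.
models = {
    "HMM-1": ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]],
              [[0.1, 0.4, 0.4, 0.1], [0.25, 0.25, 0.25, 0.25]]),
    "HMM-2": ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]],
              [[0.4, 0.1, 0.1, 0.4], [0.25, 0.25, 0.25, 0.25]]),
}
scores = {name: forward_probability(*m, obs) for name, m in models.items()}
best = max(scores, key=scores.get)
print(best, scores[best])
```

For long sequences the raw probabilities underflow, so a practical version would rescale `alpha` at each step or work with logarithms.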
HMM - Overview: Decoding Problem
• States: A+, C+, G+, T+, A-, C-, G-, T-
[Figure: trellis with one column of the eight states (A+, C+, G+, T+, A-, C-, G-, T-) for each symbol of the observation sequence C G C G A]
HMM - Overview: Training Problem
AATAGAGAGGTTCGACTCTGCATTTCCCAAATACGTAATGCTTACGGTACACGACCCAAGCTCTCTGCTTGAATCCCAAATCTGAGCGGACAGATGAGGGGGCGCAGAGGAAAAACAGGTTTTGGACCCTACATAAANAGAGAGGTTCGTAAATAGAGAGGTTCGACTCTGCATTTCCCAAATACGTAATGCTTACGGTTAAATAGAGAGGTTCGACTCTGCATTTCCCAAATACGTAATGCTTACGGTACACGACCCAAGCTCTCTGCTTGTAACTTGTTTTNGTCGCAGCTGGTCTTGCCTTTGCTGGGGCTGCTGAC
From raw sequence data… to Transition Probabilities (rows: from state; columns: to state)
      A+    C+    G+    T+    A-    C-    G-    T-
A+   0.17  0.26  0.42  0.11  0.01  0.01  0.01  0.01
C+   0.16  0.36  0.26  0.18  0.01  0.01  0.01  0.01
G+   0.15  0.33  0.37  0.11  0.01  0.01  0.01  0.01
T+   0.07  0.35  0.37  0.17  0.01  0.01  0.01  0.01
A-   0.01  0.01  0.01  0.01  0.29  0.20  0.27  0.20
C-   0.01  0.01  0.01  0.01  0.31  0.29  0.07  0.29
G-   0.01  0.01  0.01  0.01  0.24  0.23  0.29  0.20
T-   0.01  0.01  0.01  0.01  0.17  0.23  0.28  0.28
How?
HMM - Applications: Bioinformatics
• DNA sequence analysis
• Protein family profiling
• Prediction of protein folding
• Prediction of genes
• Horizontal gene transfer
• Radiation hybrid mapping, linkage analysis
• Prediction of DNA functional sites
• CpG island prediction
• Splicing signals prediction
HMM - Applications: Non-Bioinformatics
• Speech Recognition
• Vehicle Trajectory Projection
• Gesture Learning for Human-Robot Interface
• Positron Emission Tomography (PET)
• Optical Signal Detection
• Digital Communications
• Music Analysis
Some HMM based Bioinformatics Resources
• PROBE www.ncbi.nlm.nih.gov/
• BLOCKS www.blocks.fhcrc.org/
• META-MEME www.cse.ucsd.edu/users/bgrundy/metameme.1.0.html
• SAM www.cse.ucsc.edu/research/compbio/sam.html
• HMMERS hmmer.wustl.edu/
• HMMpro www.netid.com/
• GENEWISE www.sanger.ac.uk/Software/Wise2/
• PSI-BLAST www.ncbi.nlm.nih.gov/BLAST/newblast.html
• PFAM www.sanger.ac.uk/Pfam/
HMM for CpG Islands: CpG ISLANDS
• “CpG” means “C precedes G”: a C followed by a G along the same strand
• Not C-G base pairs across the two strands
• Nucleotides - 4 bases in DNA:
  – A (Adenine)
  – C (Cytosine)
  – G (Guanine)
  – T (Thymine)
HMM for CpG Islands: What’s a “CpG Island”?
• CG-rich region: P(CG) ~ 0.25
• CG-poor regions: P(CG) ~ 0.07!
[Figure: stretch of DNA with the promoter region and the gene coding region marked]
HMM for CpG Islands: Why the difference?
• Away from gene regions:
  – The C in CG pairs is usually methylated
  – Methylation inhibits gene transcription
  – These CGs tend to mutate to TG
• Near promoter and coding regions:
  – Methylation is suppressed
  – CGs remain CGs
  – Makes transcription easier!
HMM for CpG Islands: Motivation
• CpG-rich regions are associated with genes that are frequently transcribed.
• Finding them helps to understand how gene expression relates to location in the genome.
HMM for CpG Islands: Motivation
• Q: Why an HMM?
• It can answer the questions:
  – Short sequence: does it come from a CpG island or not?
  – Long sequence: where are the CpG islands?
• So, what’s a good model?
  – Well, we need states for ISLAND bases and NON-ISLAND bases …
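HMM for CpG Islands: Straight Markov Models
[Figure: two separate Markov chains, each with its own START and END.
 CpG Island (+): states A+, C+, G+, T+, each emitting its own base with probability 1 (P(A) = 1, P(C) = 1, P(G) = 1, P(T) = 1).
 CpG NON-Island (-): states A-, C-, G-, T-, each emitting its own base with probability 1.]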
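HMM for CpG Islands: Combined Hidden Markov Model
[Figure: a single model with one START and one END containing both groups of states.
 CpG Island: A+, C+, G+, T+, each emitting its own base with probability 1.
 CpG NON-Island: A-, C-, G-, T-, each emitting its own base with probability 1.]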
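As a rough illustration, the combined model can be written down as the triplet (Π, A, B). The transition values are the ones shown on the training slides. The uniform initial vector is an assumption (the slides give no initial probabilities), and START and END are not modelled in this sketch.

```python
STATES = ["A+", "C+", "G+", "T+", "A-", "C-", "G-", "T-"]  # N = 8 hidden states
SYMBOLS = ["A", "C", "G", "T"]                              # M = 4 observation symbols

# A: transition probabilities from the slides (row = from state, column = to state).
A = [
    [0.17, 0.26, 0.42, 0.11, 0.01, 0.01, 0.01, 0.01],  # A+
    [0.16, 0.36, 0.26, 0.18, 0.01, 0.01, 0.01, 0.01],  # C+
    [0.15, 0.33, 0.37, 0.11, 0.01, 0.01, 0.01, 0.01],  # G+
    [0.07, 0.35, 0.37, 0.17, 0.01, 0.01, 0.01, 0.01],  # T+
    [0.01, 0.01, 0.01, 0.01, 0.29, 0.20, 0.27, 0.20],  # A-
    [0.01, 0.01, 0.01, 0.01, 0.31, 0.29, 0.07, 0.29],  # C-
    [0.01, 0.01, 0.01, 0.01, 0.24, 0.23, 0.29, 0.20],  # G-
    [0.01, 0.01, 0.01, 0.01, 0.17, 0.23, 0.28, 0.28],  # T-
]

# B: each state emits its own base with probability 1 (P(A) = 1 in A+ and A-, etc.).
B = [[1.0 if state[0] == symbol else 0.0 for symbol in SYMBOLS] for state in STATES]

# Pi: assumed uniform over the eight states.
PI = [1.0 / len(STATES)] * len(STATES)

# Sanity check: every row of A and B is a probability distribution.
for row in A + B:
    assert abs(sum(row) - 1.0) < 1e-9
```

Because the emissions are deterministic, the observed base narrows the hidden state to two candidates (for example C+ or C-); what stays hidden is the island / non-island label.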
HMM for CpG Islands: What’s “hidden”?
• Hidden: the states A+, C+, G+, T+ (CpG Island) and A-, C-, G-, T- (CpG NON-Island), between START and END
• Visible: the bases A, C, G, T
HMM for CpG Islands: The Three Problems
• (Evaluation – not in CpG Islands)
• Training
• Decoding
HMM for CpG Islands: Training Problem
CG-RICH sequences
AATAGAGAGGTTCGACTCTGCATTTCCCAAATACGTAATGCTTACGGTACACGACCCAAGCTCTCTGCTTGAATCCCAAATCTGAGCGGACAGATGAGGGGGCGCAGAGGAAAAACAGGTTTTGGACCCTACATAAANAGAGAGGTTCGTAAATAGAGA
CG-POOR sequences
GGTTCGACTCTGCATTTCCCAAATACGTAATGCTTACGGTTAAATAGAGAGGTTCGACTCTGCATTTCCCAAATACGTAATGCTTACGGTACACGACCCAAGCTCTCTGCTTGTAACTTGTTTTNGTCGCAGCTGGTCTTGCCTTTGCTGGGGCTGCTGA
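HOW?
• Maximum likelihood (ML) estimation from transition counts, when the training sequences are already labelled as island or non-island
• Forward/Backward (Baum-Welch) algorithm, when the state paths are unknown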
Transition probabilities (rows: from state; columns: to state)
      A+    C+    G+    T+    A-    C-    G-    T-
A+   0.17  0.26  0.42  0.11  0.01  0.01  0.01  0.01
C+   0.16  0.36  0.26  0.18  0.01  0.01  0.01  0.01
G+   0.15  0.33  0.37  0.11  0.01  0.01  0.01  0.01
T+   0.07  0.35  0.37  0.17  0.01  0.01  0.01  0.01
A-   0.01  0.01  0.01  0.01  0.29  0.20  0.27  0.20
C-   0.01  0.01  0.01  0.01  0.31  0.29  0.07  0.29
G-   0.01  0.01  0.01  0.01  0.24  0.23  0.29  0.20
T-   0.01  0.01  0.01  0.01  0.17  0.23  0.28  0.28
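One way the ML route could look, assuming the training sequences are already labelled as CG-rich or CG-poor: count how often each base follows each other base and normalise every row. The function name `estimate_transitions`, the `pseudocount` parameter and the toy sequence are assumptions for illustration. This sketch yields one 4x4 block of the table; the small island / non-island switching probabilities are not estimated here.

```python
from collections import Counter

BASES = "ACGT"


def estimate_transitions(sequences, pseudocount=0.0):
    """ML estimate a_st = c_st / sum over t' of c_st', from labelled sequences."""
    counts = Counter()
    for seq in sequences:
        for s, t in zip(seq, seq[1:]):
            if s in BASES and t in BASES:  # skip ambiguous bases such as N
                counts[(s, t)] += 1
    table = {}
    for s in BASES:
        total = sum(counts[(s, t)] + pseudocount for t in BASES)
        # A base never seen as a predecessor gets an all-zero row.
        table[s] = {
            t: (counts[(s, t)] + pseudocount) / total if total else 0.0
            for t in BASES
        }
    return table


# Toy CG-rich training sequence, invented for this example.
plus_block = estimate_transitions(["CGCGGCGCATCGCG"])
print(plus_block["C"])
```

Running the same function on the CG-poor sequences gives the "-" block. A non-zero `pseudocount` keeps transitions that never occur in a small training set from being assigned probability 0.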
HMM for CpG Islands: Decoding Problem
Viterbi Algorithm
• Decoding: finding the meaning of an observation sequence by looking at the underlying states.
• Hidden states: A+, C+, G+, T+, A-, C-, G-, T-
• Observation sequence: CGCGA
• Candidate state sequences: C+,G+,C+,G+,A+ or C-,G-,C-,G-,A- or C+,G-,C+,G-,A+
• Most probable path: C+,G+,C+,G+,A+
HMM for CpG Islands: Decoding Problem II
Viterbi Algorithm
• Hidden Markov model: states S, transition probabilities $a_{kl}$, emission probabilities $e_l(x)$.
• Observed symbol sequence: $E = x_1, \ldots, x_n$.
• Find: the most probable path of states that resulted in symbol sequence $E$.
• Let $v_k(i)$ be the partial probability of the most probable path for the symbol sequence $x_1, x_2, \ldots, x_i$ ending in state $k$. Then:
  $v_l(i+1) = e_l(x_{i+1}) \max_k \left( v_k(i)\, a_{kl} \right)$
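The recurrence can be sketched in Python using the transition table from the training slides. The uniform start distribution, the omission of START/END, the log-space arithmetic and the names `viterbi` and `safe_log` are assumptions of this sketch, not part of the slides.

```python
import math

STATES = ["A+", "C+", "G+", "T+", "A-", "C-", "G-", "T-"]
A = [  # a_kl: row = from state k, column = to state l
    [0.17, 0.26, 0.42, 0.11, 0.01, 0.01, 0.01, 0.01],
    [0.16, 0.36, 0.26, 0.18, 0.01, 0.01, 0.01, 0.01],
    [0.15, 0.33, 0.37, 0.11, 0.01, 0.01, 0.01, 0.01],
    [0.07, 0.35, 0.37, 0.17, 0.01, 0.01, 0.01, 0.01],
    [0.01, 0.01, 0.01, 0.01, 0.29, 0.20, 0.27, 0.20],
    [0.01, 0.01, 0.01, 0.01, 0.31, 0.29, 0.07, 0.29],
    [0.01, 0.01, 0.01, 0.01, 0.24, 0.23, 0.29, 0.20],
    [0.01, 0.01, 0.01, 0.01, 0.17, 0.23, 0.28, 0.28],
]


def safe_log(p):
    """log(p), with log(0) mapped to -infinity."""
    return math.log(p) if p > 0 else float("-inf")


def emission(state, symbol):
    """e_l(x): state X+ or X- emits base X with probability 1."""
    return 1.0 if state[0] == symbol else 0.0


def viterbi(obs, states=STATES, trans=A):
    n = len(states)
    start = 1.0 / n  # assumed uniform start distribution
    # v[k] = log-probability of the best path ending in state k
    v = [safe_log(start) + safe_log(emission(states[k], obs[0])) for k in range(n)]
    back = []  # back[i][l] = best predecessor of state l at step i + 1
    for x in obs[1:]:
        new_v, ptr = [], []
        for l in range(n):
            # max over k of v_k(i) * a_kl, done in log space
            best_k = max(range(n), key=lambda k: v[k] + safe_log(trans[k][l]))
            ptr.append(best_k)
            new_v.append(safe_log(emission(states[l], x))
                         + v[best_k] + safe_log(trans[best_k][l]))
        v = new_v
        back.append(ptr)
    # Traceback from the best final state
    last = max(range(n), key=lambda k: v[k])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return [states[k] for k in path], math.exp(v[last])


path, prob = viterbi("CGCGA")
print(path)  # ['C+', 'G+', 'C+', 'G+', 'A+']
```

Under these assumptions the decoded path for CGCGA agrees with the most probable path given on the Decoding Problem slide.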
HMM for CpG Islands: Decoding Problem III
[Figure: Viterbi trellis with one column of the eight states (A+, C+, G+, T+, A-, C-, G-, T-) for each symbol of the observation sequence C G C G A]
HMM for CpG Islands: Decoding Problem III
Summary
• Computationally far less expensive than enumerating every possible state path; the cost is of the same order as the forward algorithm, with a max in place of the sum.
• The largest partial probability at the final position is the probability of the most probable path.
• Decision of best path is based on the whole sequence, not on an individual observation.
Now, on to our Mathematica implementation…