HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren

Dec 19, 2015

Page 1: HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren.

HMM for CpG Islands

Arti Kelkar

Pete Rossetti

Peter Warren

Page 2: HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren.

HMM for CpG Islands

• HMM history

• General background

• Three Fundamental Problems

1. Evaluation

2. Decoding

3. Training

Page 3: HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren.

HMM for CpG Islands

• HMM Applications

– Bioinformatics

– Non-Bioinformatics

• CpG Islands Problem

– CpG Islands

– Definition

– Why interesting

– Hidden Markov Model for CpG

– What’s Hidden

• Mathematica Implementation

– Training

– Decoding

Page 4: HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren.

Andrei Andreyevich Markov, 1856-1922

Page 5: HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren.

• Early 1900s

– Markov conceives “Markov chains”, including a proof of the Central Limit Theorem for Markov chains

– Studies with Chebyshev and takes over his classes at the Univ. of St. Petersburg

• 1913

– Russian government celebrates the 300th anniversary of the House of Romanov

– AA Markov organizes a counter-celebration: the 200th anniversary of Bernoulli’s Law of Large Numbers

AA Markov

Page 6: HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren.

• 1960s

– Use of HMMs developed by a Cold War era research team in a classified program at the Communication Research Division of the Institute for Defense Analyses (Oscar Rothaus).

• 1970s

– HMM work is declassified and is soon being used in many peaceful applications.

HMM – History

Page 7: HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren.

Markov Chain

• Sunny yesterday ==> 0.5 probability that it will be sunny today, and 0.25 each that it will be cloudy or rainy
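A rough sketch of the weather chain above, written in Python for illustration. Only the "sunny" row comes from the slide, reading "0.25 that it will be cloudy or rainy" as 0.25 each so the row sums to 1. The "cloudy" and "rainy" rows are made-up placeholders, and the names `STATES`, `TRANSITIONS` and `step` are hypothetical.

```python
# Minimal Markov-chain sketch. Only the "sunny" row is taken from the slide;
# the "cloudy" and "rainy" rows are hypothetical values for illustration.
STATES = ["sunny", "cloudy", "rainy"]

TRANSITIONS = {
    "sunny":  {"sunny": 0.50, "cloudy": 0.25, "rainy": 0.25},  # from the slide
    "cloudy": {"sunny": 0.30, "cloudy": 0.40, "rainy": 0.30},  # assumed
    "rainy":  {"sunny": 0.20, "cloudy": 0.30, "rainy": 0.50},  # assumed
}


def step(distribution):
    """Advance a probability distribution over STATES by one day."""
    return {
        today: sum(distribution[yesterday] * TRANSITIONS[yesterday][today]
                   for yesterday in STATES)
        for today in STATES
    }


# Yesterday was certainly sunny, so today's distribution is the "sunny" row.
yesterday = {"sunny": 1.0, "cloudy": 0.0, "rainy": 0.0}
today = step(yesterday)
```

Calling `step` repeatedly gives the forecast further ahead; each day depends only on the day before, which is the Markov property.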

Page 8: HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren.

Hidden Markov Model

Page 9: HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren.

HMM Definition

• Hidden Markov Model is a triplet (Π, A, B)

– Π: vector of initial state probabilities

– A: matrix of state transition probabilities

– B: matrix of observation probabilities

– N: number of hidden states in the model

– M: number of observation symbols
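One possible way to hold the triplet (Π, A, B) in code, sketched in Python. The class name `HMM`, its methods, and the toy numbers are assumptions for illustration and do not come from the slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HMM:
    """An HMM as the triplet (pi, A, B) from the definition above."""
    pi: List[float]        # initial state probabilities, length N
    A: List[List[float]]   # state transition probabilities, N x N
    B: List[List[float]]   # observation probabilities, N x M

    @property
    def n_states(self) -> int:       # N
        return len(self.pi)

    @property
    def n_symbols(self) -> int:      # M
        return len(self.B[0])

    def is_valid(self, tol: float = 1e-9) -> bool:
        """Check shapes and that pi and every row of A and B sum to 1."""
        n = self.n_states
        if len(self.A) != n or len(self.B) != n:
            return False
        if any(len(row) != n for row in self.A):
            return False
        if any(len(row) != self.n_symbols for row in self.B):
            return False
        rows = [self.pi] + self.A + self.B
        return all(abs(sum(row) - 1.0) < tol for row in rows)


# Hypothetical two-state, two-symbol model, used only to exercise the class.
toy = HMM(pi=[0.6, 0.4],
          A=[[0.7, 0.3], [0.4, 0.6]],
          B=[[0.9, 0.1], [0.2, 0.8]])
```

The validity check reflects the fact that Π and each row of A and B are probability distributions.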

Page 10: HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren.

HMM – Three Problems

• Evaluation

• Decoding

• Training

Page 11: HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren.

Given a set of HMMs, which is the one most likely to have produced the observation sequence?

HMM - Overview Evaluation Problem

GACGAAACCCTGTCTCTATTTATCC

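HMM 1, HMM 2, HMM 3, …, HMM n

p(HMM-1)? p(HMM-2)? p(HMM-3)? … p(HMM-n)?

The Forward Algorithm computes the probability of the observation sequence under each HMM; the HMM with the largest probability is chosen.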
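A minimal Python sketch of the evaluation problem: run the forward algorithm on each candidate model and keep the one with the largest probability. The two models in `MODELS` are hypothetical, and observation symbols are assumed to be encoded as integer indices.

```python
def forward_probability(obs, pi, A, B):
    """Return p(obs | model) with the forward algorithm.

    obs: list of symbol indices; pi, A, B: the (pi, A, B) triplet as lists.
    """
    n = len(pi)
    # Initialisation: start in state k and emit the first symbol.
    alpha = [pi[k] * B[k][obs[0]] for k in range(n)]
    # Recursion: sum over every predecessor state, then emit the next symbol.
    for symbol in obs[1:]:
        alpha = [
            B[l][symbol] * sum(alpha[k] * A[k][l] for k in range(n))
            for l in range(n)
        ]
    # Termination: sum over the states the path may end in.
    return sum(alpha)


def most_likely_model(obs, models):
    """Pick the name of the model with the largest p(obs | model)."""
    return max(models, key=lambda name: forward_probability(obs, *models[name]))


# Two hypothetical 2-state models over the symbols 0 and 1 (illustration only).
MODELS = {
    "HMM-1": ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.1, 0.9]]),
    "HMM-2": ([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]),
}
best = most_likely_model([0, 0, 0, 0], MODELS)
```

For long sequences the products underflow, so a practical version would rescale `alpha` at each step or work with log probabilities.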

Page 12: HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren.

• States A+,C+,G+,T+,A-,C-,G-,T-

HMM - Overview Decoding Problem

[Figure: trellis with the eight states A+, C+, G+, T+, A-, C-, G-, T- repeated at each of the five positions of the observation sequence C G C G A]

Page 13: HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren.

HMM - Overview Training Problem

AATAGAGAGGTTCGACTCTGCATTTCCCAAATACGTAATGCTTACGGTACACGACCCAAGCTCTCTGCTTGAATCCCAAATCTGAGCGGACAGATGAGGGGGCGCAGAGGAAAAACAGGTTTTGGACCCTACATAAANAGAGAGGTTCGTAAATAGAGAGGTTCGACTCTGCATTTCCCAAATACGTAATGCTTACGGTTAAATAGAGAGGTTCGACTCTGCATTTCCCAAATACGTAATGCTTACGGTACACGACCCAAGCTCTCTGCTTGTAACTTGTTTTNGTCGCAGCTGGTCTTGCCTTTGCTGGGGCTGCTGAC

      A+    C+    G+    T+    A-    C-    G-    T-
A+   0.17  0.26  0.42  0.11  0.01  0.01  0.01  0.01
C+   0.16  0.36  0.26  0.18  0.01  0.01  0.01  0.01
G+   0.15  0.33  0.37  0.11  0.01  0.01  0.01  0.01
T+   0.07  0.35  0.37  0.17  0.01  0.01  0.01  0.01
A-   0.01  0.01  0.01  0.01  0.29  0.20  0.27  0.20
C-   0.01  0.01  0.01  0.01  0.31  0.29  0.07  0.29
G-   0.01  0.01  0.01  0.01  0.24  0.23  0.29  0.20
T-   0.01  0.01  0.01  0.01  0.17  0.23  0.28  0.28

From raw sequence data… to Transition Probabilities

How?

Page 14: HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren.

• DNA Sequence analysis

• Protein family profiling

• Prediction of protein folding

• Prediction of genes

• Horizontal gene transfer

• Radiation hybrid mapping, linkage analysis

• Prediction of DNA functional sites.

• CpG island prediction

• Splicing signals prediction

HMM - Applications BioInformatics

Page 15: HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren.

• Speech Recognition

• Vehicle Trajectory Projection

• Gesture Learning for Human-Robot Interface

• Positron Emission Tomography (PET)

• Optical Signal Detection

• Digital Communications

• Music Analysis

HMM - Applications Non-BioInformatics

Page 16: HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren.

Some HMM based Bioinformatics Resources

• PROBE www.ncbi.nlm.nih.gov/

• BLOCKS www.blocks.fhcrc.org/

• META-MEME www.cse.ucsd.edu/users/bgrundy/metameme.1.0.html

• SAM www.cse.ucsc.edu/research/compbio/sam.html

• HMMERS hmmer.wustl.edu/

• HMMpro www.netid.com/

• GENEWISE www.sanger.ac.uk/Software/Wise2/

• PSI-BLAST www.ncbi.nlm.nih.gov/BLAST/newblast.html

• PFAM www.sanger.ac.uk/Pfam/

Page 17: HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren.

CpG ISLANDS

“CpG” means “C precedes G”

Not CG base pairs

HMM for CpG Islands

Page 18: HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren.

• Nucleotides - 4 bases in DNA:

– A (Adenine)

– C (Cytosine)

– G (Guanine)

– T (Thymine)

HMM for CpG Islands

Page 19: HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren.

HMM for CpG Islands What’s a “CpG Island”

[Figure: a DNA strand with the promoter region and the gene coding region marked]

CG-rich region: P(CG) ~ 0.25

CG-poor regions: P(CG) ~ 0.07!
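A small Python sketch for measuring this on a sequence. It assumes that P(CG) on the slide means the probability that a C is immediately followed by a G; that reading, and the function name `prob_g_after_c`, are assumptions rather than something the slide states.

```python
def prob_g_after_c(sequence):
    """Fraction of C bases that are immediately followed by G."""
    seq = sequence.upper()
    c_followed = 0   # C's that have any successor base
    cg_pairs = 0     # C's whose successor is G
    for current, following in zip(seq, seq[1:]):
        if current == "C":
            c_followed += 1
            if following == "G":
                cg_pairs += 1
    return cg_pairs / c_followed if c_followed else 0.0
```

Under this reading, a value near 0.25 would suggest a CG-rich stretch and a value near 0.07 a CG-poor one, although short sequences give noisy estimates.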

Page 20: HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren.

• Away from gene regions:

– The C in CG pairs is usually methylated

– Methylation inhibits gene transcription

– These CGs tend to mutate to TG

• Near promoter and coding regions:

– Methylation is suppressed

– CGs remain CGs

– Makes transcription easier!

HMM for CpG Islands Why the difference?

Page 21: HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren.

• CpG-rich regions are associated with genes which are frequently transcribed.

• Helps to understand gene expression related to location in genome.

HMM for CpG Islands Motivation:

Page 22: HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren.

• Q: Why an HMM?

• It can answer the questions:

– Short sequence: does it come from a CpG island or not?

– Long sequence: where are the CpG islands?

• So, what’s a good model?

– Well, we need states for ISLAND bases and NON-ISLAND bases …

HMM for CpG Islands Motivation:
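For the short-sequence question, one common approach is a log-odds score between an island (+) chain and a non-island (-) chain. The Python sketch below assumes that approach; the slides do not say this is what the authors did. It takes the + to + and - to - blocks of the transition table on the training slide and rescales each row to sum to 1, since the blocks alone sum to 0.96. The function names are hypothetical.

```python
import math

BASES = "ACGT"

# Within-island (+) and within-non-island (-) blocks of the transition table
# from the training slide. Rows: from-base, columns: to-base (A, C, G, T).
PLUS = {
    "A": [0.17, 0.26, 0.42, 0.11],
    "C": [0.16, 0.36, 0.26, 0.18],
    "G": [0.15, 0.33, 0.37, 0.11],
    "T": [0.07, 0.35, 0.37, 0.17],
}
MINUS = {
    "A": [0.29, 0.20, 0.27, 0.20],
    "C": [0.31, 0.29, 0.07, 0.29],
    "G": [0.24, 0.23, 0.29, 0.20],
    "T": [0.17, 0.23, 0.28, 0.28],
}


def normalised(table):
    """Rescale each row to sum to 1 (an assumption; see the lead-in)."""
    return {base: [p / sum(row) for p in row] for base, row in table.items()}


def log_odds(sequence, plus=None, minus=None):
    """Sum of log2(P+(next | prev) / P-(next | prev)) over adjacent bases."""
    plus = normalised(plus or PLUS)
    minus = normalised(minus or MINUS)
    score = 0.0
    for prev, nxt in zip(sequence, sequence[1:]):
        j = BASES.index(nxt)
        score += math.log2(plus[prev][j] / minus[prev][j])
    return score
```

A positive score means the sequence looks more like a CpG island, a negative score more like a non-island. The sketch expects upper-case A, C, G, T only.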

Page 23: HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren.

HMM for CpG Islands Straight Markov Models

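[Figure: two separate Markov models, one per region type; each state emits its own base with probability 1]

CpG Island (+): START, A+ (P(A) = 1), C+ (P(C) = 1), G+ (P(G) = 1), T+ (P(T) = 1), END

CpG NON-Island (-): START, A- (P(A) = 1), C- (P(C) = 1), G- (P(G) = 1), T- (P(T) = 1), END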

Page 24: HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren.

HMM for CpG Islands Combined Hidden Markov Model

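[Figure: the two models combined into one, with a single START state and a single END state]

CpG Island: A+ (P(A) = 1), C+ (P(C) = 1), G+ (P(G) = 1), T+ (P(T) = 1)

CpG NON-Island: A- (P(A) = 1), C- (P(C) = 1), G- (P(G) = 1), T- (P(T) = 1)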

Page 25: HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren.

HMM for CpG Islands What’s “hidden”?

Hidden: the states A+, C+, G+, T+ (CpG Island) and A-, C-, G-, T- (CpG NON-Island), between START and END

Visible: the bases A, C, G, T

Page 26: HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren.

HMM for CpG Islands The Three Problems

• (Evaluation – not in CpG Islands)

• Training

• Decoding

Page 27: HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren.

HMM for CpG Islands Training Problem

CG-RICH sequences

AATAGAGAGGTTCGACTCTGCATTTCCCAAATACGTAATGCTTACGGTACACGACCCAAGCTCTCTGCTTGAATCCCAAATCTGAGCGGACAGATGAGGGGGCGCAGAGGAAAAACAGGTTTTGGACCCTACATAAANAGAGAGGTTCGTAAATAGAGA

CG-POOR sequences

GGTTCGACTCTGCATTTCCCAAATACGTAATGCTTACGGTTAAATAGAGAGGTTCGACTCTGCATTTCCCAAATACGTAATGCTTACGGTACACGACCCAAGCTCTCTGCTTGTAACTTGTTTTNGTCGCAGCTGGTCTTGCCTTTGCTGGGGCTGCTGA

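HOW?

ML (maximum likelihood) counting when the training sequences are labeled as CG-rich or CG-poor, or the Forward/Backward algorithm when the state path is unknown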

      A+    C+    G+    T+    A-    C-    G-    T-
A+   0.17  0.26  0.42  0.11  0.01  0.01  0.01  0.01
C+   0.16  0.36  0.26  0.18  0.01  0.01  0.01  0.01
G+   0.15  0.33  0.37  0.11  0.01  0.01  0.01  0.01
T+   0.07  0.35  0.37  0.17  0.01  0.01  0.01  0.01
A-   0.01  0.01  0.01  0.01  0.29  0.20  0.27  0.20
C-   0.01  0.01  0.01  0.01  0.31  0.29  0.07  0.29
G-   0.01  0.01  0.01  0.01  0.24  0.23  0.29  0.20
T-   0.01  0.01  0.01  0.01  0.17  0.23  0.28  0.28
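A hedged Python sketch of the ML (counting) route for labeled training data: count how often each base follows each other base, then normalise every row. The pseudocount default and the function name `estimate_transitions` are assumptions, and this is not the authors' Mathematica code. Run it once on the CG-RICH sequences and once on the CG-POOR sequences to get the + and - blocks.

```python
from collections import Counter

BASES = "ACGT"


def estimate_transitions(sequences, pseudocount=1.0):
    """Maximum-likelihood (count-based) transition probabilities.

    Counts each adjacent base pair in the training sequences, adds a
    pseudocount, and normalises each row. Pairs containing anything
    other than A, C, G, T (for example N) are skipped.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for prev, nxt in zip(seq, seq[1:]):
            if prev in BASES and nxt in BASES:
                counts[prev, nxt] += 1

    table = {}
    for prev in BASES:
        row = [counts[prev, nxt] + pseudocount for nxt in BASES]
        total = sum(row)
        # A row with no data and no pseudocount falls back to uniform.
        table[prev] = [c / total if total else 0.25 for c in row]
    return table
```

The pseudocount keeps unseen transitions from getting probability zero. The small 0.01 entries that link the + and - states in the table cannot be estimated this way from separately labeled sequences and would have to be set by other means.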

Page 28: HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren.

Viterbi Algorithm

• Decoding: finding the meaning of an observation sequence by looking at the underlying states.

• Hidden states A+,C+,G+,T+,A-,C-,G-,T-

• Observation sequence CGCGA

• State sequences C+,G+,C+,G+,A+ or C-,G-,C-,G-,A- or C+,G-,C+,G-,A+

• Most Probable Path C+,G+,C+,G+,A+

HMM for CpG Islands Decoding Problem

Page 29: HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren.

Viterbi Algorithm

Hidden Markov model: states $S$, transition probabilities $a_{kl}$, alphabet $\Sigma$, emission probabilities $e_l(x)$.

Observed symbol sequence $E = x_1, \ldots, x_n$.

Find: the most probable path of states that resulted in the symbol sequence $E$.

Let $v_k(i)$ be the partial probability of the most probable path for the symbol sequence $x_1, x_2, \ldots, x_i$ ending in state $k$. Then:

$$v_l(i+1) = e_l(x_{i+1}) \, \max_k \big( v_k(i) \, a_{kl} \big)$$

HMM for CpG Islands Decoding Problem II
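A Python sketch of the Viterbi recursion on the eight-state CpG model. It uses the transition table from the training slides and works in log space. The uniform start distribution is an assumption, since the slides do not give START probabilities, and this is not the authors' Mathematica implementation.

```python
import math

STATES = ["A+", "C+", "G+", "T+", "A-", "C-", "G-", "T-"]

# Transition table from the training slides (rows: from, columns: to).
TRANS = [
    [0.17, 0.26, 0.42, 0.11, 0.01, 0.01, 0.01, 0.01],
    [0.16, 0.36, 0.26, 0.18, 0.01, 0.01, 0.01, 0.01],
    [0.15, 0.33, 0.37, 0.11, 0.01, 0.01, 0.01, 0.01],
    [0.07, 0.35, 0.37, 0.17, 0.01, 0.01, 0.01, 0.01],
    [0.01, 0.01, 0.01, 0.01, 0.29, 0.20, 0.27, 0.20],
    [0.01, 0.01, 0.01, 0.01, 0.31, 0.29, 0.07, 0.29],
    [0.01, 0.01, 0.01, 0.01, 0.24, 0.23, 0.29, 0.20],
    [0.01, 0.01, 0.01, 0.01, 0.17, 0.23, 0.28, 0.28],
]


def viterbi(observations):
    """Return the most probable state path for a string of bases."""
    n = len(STATES)

    def log_emit(l, base):
        # Each state emits its own base with probability 1, any other with 0.
        return 0.0 if STATES[l][0] == base else -math.inf

    # Assumed uniform start: every state is equally likely at position 1.
    v = [math.log(1.0 / n) + log_emit(l, observations[0]) for l in range(n)]
    back = []  # back[i][l] = best predecessor of state l at position i + 2

    for base in observations[1:]:
        pointers, new_v = [], []
        for l in range(n):
            # max over k of v_k(i) * a_kl, computed in log space
            k_best = max(range(n), key=lambda k: v[k] + math.log(TRANS[k][l]))
            pointers.append(k_best)
            new_v.append(log_emit(l, base) + v[k_best] + math.log(TRANS[k_best][l]))
        v = new_v
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n), key=lambda l: v[l])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return [STATES[s] for s in reversed(path)]
```

With this table, `viterbi("CGCGA")` returns C+, G+, C+, G+, A+, matching the most probable path given on the decoding slide.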

Page 30: HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren.

HMM for CpG Islands Decoding Problem III

[Figure: trellis with the eight states A+, C+, G+, T+, A-, C-, G-, T- at each position of the observation sequence C G C G A]

Page 31: HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren.

Summary

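• Same order of computational cost as the forward algorithm, O(N^2 T) for N states and T observations: the sum over predecessor states is replaced by a max.

• The largest partial probability at the final position is the probability of the most probable path.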

• Decision of best path based on whole sequence, not an individual observation.

HMM for CpG Islands Decoding Problem III

Page 32: HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren.

Now, on to our Mathematica implementation…

HMM for CpG Islands

Page 33: HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren.

References…

R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. "Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids". Cambridge University Press, 1998. Chapters 3 and 5.

A. Krogh, M. Brown, I. Saira Mian, K. Sjolander, and D. Haussler. "Hidden Markov Models in Computational Biology: Applications to Protein Modeling". J. Mol. Biol. (1994) 235, 1501-1531.

L. Rabiner. "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition". Proceedings of the IEEE, Vol. 77, No. 2, Feb. 1989.

On-line tutorial: http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/html_dev/main.html

HMM for CpG Islands