Top Banner
HMM Hidden Markov Model
55

HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Dec 26, 2015

Download

Documents

Violet Charles
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

HMM

Hidden Markov Model

Page 2: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

CpG islands

In human genome, CG dinucleotides are relatively rare CpG pairs undergo a process called

methylation that modifies the C nucleotide

A methylated C mutates (with relatively high chance) to a T

Promotor regions are CG rich These regions are not methylated, and

thus mutate less often These are called CG (aka CpG) islands

Page 3: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.
Page 4: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Initiation of Transcription from a Promoter

Page 5: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Properties of CpG Islands

Page 6: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.
Page 7: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Finding CpG islands

Simple approach: Pick a window of size N

(N = 100, for example) Compute log-ratio for the sequence in the

window, and classify based on that

Problems: How do we select N ? What do we do when the window

intersects the boundary of a CpG island?

Page 8: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Slot machine

Page 9: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Fair Bet Casino The game is to flip coins, which results in

only two possible outcomes: Head or Tail. The Fair coin will give Heads and Tails

with same probability ½. The Biased coin will give Heads with

prob. ¾.

Fair coin: ½ ½

Biased coin: ¾ ¼

Page 10: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

The “Fair Bet Casino” (cont’d)

Thus, we define the probabilities: P(H|F) = P(T|F) = ½ P(H|B) = ¾, P(T|B) = ¼ The crooked dealer changes

between Fair and Biased coins with probability 10%

Game start:

T H H H H T T T T H T

F FF F F F FBBB B

Page 11: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

The Fair Bet Casino Problem

Input: A sequence x = x1x2x3…xn of coin tosses made by two possible coins (F or B).

Output: A sequence π = π1 π2 π3…

πn, with each πi being either F or B indicating that xi is the result of tossing the Fair or Biased coin respectively.

Page 12: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Log-odds Ratio

We define log-odds ratio as follows:

log2(P(x|fair coin) / P(x|biased coin)) = Σk

i=1 log2(p+(xi) / p-

(xi))

= n – k log23

Page 13: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Log-odds Ratio in Sliding Windows

x1x2x3x4x5x6x7x8…xn

Consider a sliding window of the outcome sequence. Find the log-odds for this short window.

Log-odds value

0

Fair coin most likely used

Biased coin most likely used

Disadvantages:- the length of Fair coin seq is not known in advance- different windows may classify the same position differently

Page 14: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

HMM

Page 15: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Markov Process & Transition Matrix

A stochastic process for which the probability of entering a certain state depends only on the last state occupied is called a Markov process. The process is governed by a transition matrix.

Ex. Suppose that the 1995 state of land use in a city of 50 square miles of area is

Assuming that the transition matrix for 5-year intervals are given by

The 2000 state: I (Residential) = ?

I (residential used) 30%II (commercially used) 20%III (industrially used) 50%

To I To II To III

From I 0.8 0.1 0.1

From II

0.1 0.7 0.2

From III

0 0.1 0.9

.8*30+.1*20+0*50 = 26%

Page 16: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Markov ModelA stochastic process for which the

probability of entering a certain state depends only on the last state occupied is called a Markov process.

Residentiallyused

Commerciallyused

Industriallyused

0.8

0.1 0.9

0.0

0.1

0.2

0.7

0.1 0.1

RIICCRRRCCII

Markov Chain

Page 17: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Basic Mathematics Markov chains: probability of a

sequence: S=a1a2...an

P(S)=P(a1)P(a2|a1)P(a3|a1 a2)...P(an|a1... an-1) P(S)=P(a1)P(a2|a1)P(a3|a2)...P(an|an-1) P(S)=P(a1) P P(ai|ai-1)

ExerciseProbability of sequence CATG

Bayes’ theorem

A T G C

A 0.2 0.35 0.15 0.3

T 0.25 0.15 0.4 0.2

G 0.25 0.25 0.25 0.25

C 0.3 0.25 0.2 0.25

Page 18: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Hidden Markov Model1 2 4 3 1 4 5 2 3 4 6 1 2 2 1 …Which dice was used is hidden.

Page 19: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

HMM The state sequence is hidden The process is governed by a transition

matrix

akl = P (i=l | i-1=k) and emission probabilities

ek(b)= P (xi=b | i=k) HMMs: prob. depends on states passed

thru if known: (states = s1 s2 ... sn)

P(Seq)=P(a1|s1)P(s2|s1)P(a2|s2)P(s3|s2)...P(sn|sn-

1)P(an|sn) if unknown, sum over all possible paths find the

sequence that maximize P(S)?

Page 20: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Hidden Markov Model (HMM)

Can be viewed as an abstract machine with k hidden states that emits symbols from an alphabet Σ.

Each state has its own probability distribution, and the machine switches between states according to this probability distribution.

While in a certain state, the machine makes 2 decisions: What state should I move to next? What symbol - from the alphabet Σ -

should I emit?

Page 21: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Why “Hidden”?

Observers can see the emitted symbols of an HMM but have no ability to know which state the HMM is currently in.

Thus, the goal is to infer the most likely hidden states of an HMM based on the given sequence of emitted symbols.

Page 22: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Fair Bet Casino Problem

Fair Bet Casino ProblemAny observed outcome of coin tosses could have been generated by any sequence of states!

Need to incorporate a way to grade different sequences differently.

Decoding Problem

Page 23: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

HMM Parameters

Σ: set of emission characters.Ex.: Σ = {H, T} for coin tossing

Σ = {1, 2, 3, 4, 5, 6} for dice tossing Σ = {A, C, G, T} for a DNA seq

Q: set of hidden states, each emitting symbols from Σ.

Q={F,B} for coin tossing Q={intron, exon} for a gene

Page 24: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

HMM Parameters (cont’d)

A = (akl): a |Q| |Q| matrix of probability of changing from state k to state l.

aFF = 0.9 aFB = 0.1

aBF = 0.1 aBB = 0.9

E = (ek(b)): a |Q| |Σ| matrix of probability of emitting symbol b while being in state k.

eF(0) = ½ eF(1) = ½

eB(0) = ¼ eB(1) = ¾

Page 25: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

HMM for Fair Bet Casino The Fair Bet Casino in HMM terms:

Σ = {0, 1} (0 for Tails and 1 Heads)Q = {F,B} – F for Fair & B for Biased coin.

Transition Probabilities A *** Emission Probabilities EFair Biased

Fair aFF = 0.9

aFB = 0.1

Biased

aBF = 0.1

aBB = 0.9

Tails(0) Heads(1)

Fair eF(0) = ½

eF(1) = ½

Biased

eB(0) = ¼

eB(1) = ¾

Page 26: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

HMM for Fair Bet Casino (cont’d)

HMM model for the Fair Bet Casino Problem

Page 27: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Hidden Paths A path π = π1… πn in the HMM is defined

as a sequence of states. Consider path π = FFFBBBBBFFF and

sequence x = 01011101001 (0=T, 1=H)

x 0 1 0 1 1 1 0 1 0 0 1

π = F F F B B B B B F F FP(xi|πi) ½ ½ ½ ¾ ¾ ¾ ¼ ¾ ½ ½ ½

P(πi-1 πi) ½ 9/10 9/10 1/10

9/10 9/10

9/10 9/10

1/10 9/10

9/10 Transition probability from state πi-1 to state πi

Probability that xi was emitted from state πi

Page 28: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Decoding Problem

Goal: Find an optimal (most probable) hidden path of states given observations.

Input: Sequence of observations x = x1…xn generated by an HMM M(Σ, Q, A, E)

Output: A path that maximizes P(x, π) over all possible paths π.

Page 29: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Decoding Problem

T H H H H T T T TF

B

Page 30: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

P(x, π) Calculation

P(x, π): Probability that sequence x was generated by the path π: n

P(x, π) = P(π0→ π1) · Π P(xi|πi) · P(πi → πi+1)

i=1

= a π0, π1 · Π e πi (xi) · a πi,

πi+1

= Π e πi+1 (xi+1) · a

πi, πi+1

if we count from i=0 instead of i=1 to

i=n

Number of possible paths?

Page 31: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Building Manhattan for Decoding Problem

Andrew Viterbi used the Manhattan grid model to solve the Decoding Problem.

Every choice of π = π1… πn

corresponds to a path in the graph. The only valid direction in the graph

is eastward. This graph has |Q|2(n-1) edges.

?

Page 32: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.
Page 33: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Graph for Decoding Problem

Page 34: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Decoding Problem vs. Alignment Problem

Valid directions in the alignment problem.

Valid directions in the decoding problem.

Page 35: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Decoding Problem as Finding a Longest Path

in a DAG The Decoding Problem is reduced to

finding a longest path in the directed acyclic graph (DAG) above.

Notes: the length of the path is defined as the product of its edges’ weights, not the sum.

Page 36: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Decoding Problem: weights of edges

The weight w = el(xi+1) . akl

w?

(k, i) (l, i+1)

T H H H H T T T TF

B

Page 37: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Decoding Problem n

Maximize: P(x, π) = Π e πi+1 (xi+1) . a πi, πi+1

i=0

Page 38: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Decoding Problem (cont’d)

Every path in the graph has the probability P(x,π) (= length of the path).

The Viterbi algorithm finds the path that maximizes P(x, π) among all possible paths.

The Viterbi algorithm runs in O(n|Q|2) time.

?

Page 39: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Decoding Problem and Dynamic Programming

sl,i+1=maxk Q {sk,i · weight of (k,i) (l,i+1)}

=maxk Q {sk,i · akl · el (xi+1)}

=el (xi+1) · maxk Q {sk,i · akl}

lk

i i+1

Page 40: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Decoding Problem (cont’d)

Initialization: sbegin,0 = 1

sk,0 = 0 for k ≠ begin.

Let π* be the optimal path. Then,

P(x, π*) = maxk Є Q {sk,n . ak,end}

Page 41: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Most probable path: Viterbi alg

Dynamic programming define pi,j as the prob of the most probable path

ending in state j after emitting element i Define solution recursively:

suppose we know pi-1,j states up to previous char

update: pi,k=P(ai, sk) * maxj(pi-1,j*P(sk, sj)) traceback

keep table of state probs, start with 1st char, assign prob to each state, iterate updates...

a1 a2 a3 a4s1 0.6 0.9 0.9 0.2s2 0.1 0.1 0.05 0.8s3 0.3 0 0.05 0

Page 42: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Most probable path: Viterbi alg

Dynamic programming

a1 a2 a3 a4

s1 0.6 0.9 0.9 0.2

s2 0.1 0.1 0.05 0.8

s3 0.3 0 0.05 0

Page 43: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Problem with Viterbi Algorithm

The value of the product can become extremely small, which leads to under-flowing.

To avoid overflowing, use log value instead.

sk,i+1= log el(xi+1) + max k Є Q {sk,i + log(akl)}

Page 44: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Exercise

Consider a hidden Markov model with the following transition and emission matrices:

What is the most probable sequence of states for a given DNA sequence ACGG?

exon intron

exon 0.9 0.1

intron 0.05 0.95

purine pyrimidine

exon 0.3 0.7

intron 0.5 0.5

Page 45: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Forward-Backward Problem

Given: a sequence of coin tosses generated by an HMM.

Goal: find the probability that the dealer was using a biased coin at a particular time i.

T H H H H T T T T H T

Page 46: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Forward-Backward Problem

T H H H H T T T TF

Bi

Page 47: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Forward Algorithm Define fk,i (forward probability) as

the probability of emitting the prefix x1…xi and reaching the state πi = k.

The recurrence for the forward algorithm:

fk,i = ek(xi) Σ fl,i-1 alk l Є Q

Similar to Viterbi, except replace ‘max’ with probabilistic ‘sum’!

kl

i1 i

Page 48: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Dishonest Casino Computing posterior probabilities

for “fair” at each point in a long sequence:

Page 49: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Backward Algorithm

However, forward probability is not the only quantity that provides info to P(πi = k|x).

The sequence of transitions and emissions that the HMM undergoes between πi+1 and πn also affect P(πi = k|x).

forward xi backward

Page 50: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Define backward probability bk,i as the probability of being in state πi = k and emitting the suffix xi+1…xn.

The recurrence for the backward algorithm:

bk,i = Σ akl el(xi+1) bl,i+1

l Є Q

Backward Algorithm (cont’d)

k l

i i+1

Page 51: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

The probability that the dealer used a biased coin at any moment i:

P(x, πi = k) fk(i) . bk(i)

P(πi = k|x) = _______________ = ____________

P(x) P(x)

Backward-Forward Algorithm

P(x) is the sum of P(x, πi = k) over all k

Page 52: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Markov chain for CpG Islands

Construct a Markov chain for CpG rich and another for CpG poor regions

Using maximum likelihood estimates from 60K nucleotide, the two models:

Page 53: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

Ratio Test for CpC islands

Given a sequence X1,…,Xn , compute the likelihood ratio

1

1

1

11

1

( , , | )( , , ) log

( , , | )

log i i

i i

i i

nn

n

X X

i X X

X Xi

P X XS X X

P X X

A

A

2

2

Page 54: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

HMM Approach Build one model that include “+” states

and “-” states

A state “remembers” last nucleotide and the type of region

A transition from a state to a + corresponds to the start of a CpG island

Page 55: HMM Hidden Markov Model Hidden Markov Model. CpG islands CpG islands In human genome, CG dinucleotides are relatively rare In human genome, CG dinucleotides.

p? q?