School of Computer Science
Hidden Markov Model and Conditional Random Fields
Probabilistic Graphical Models (10-708)
Lecture 12, Oct 29, 2007
Eric Xing
Reading: J-Chap. 12, and additional papers

[Figure: example graphical models — an eight-node network X1–X8 and a signaling pathway with Receptor A/B, Kinase C/D/E, TF F, and Genes G/H]

Announcements
• Feedback
• Today: office hour, recitation
• Project milestone this Wednesday
Applications of HMMs
Some early applications of HMMs:
• finance, but we never saw them
• speech recognition
• modelling ion channels
In the mid-late 1980s, HMMs entered genetics and molecular biology, and they are now firmly entrenched.
Some current applications of HMMs to biology:
• mapping chromosomes
• aligning biological sequences
• predicting sequence structure
• inferring evolutionary relationships
• finding genes in DNA sequences
Computational complexity and implementation details
What is the running time, and space required, for Forward and Backward?
Time: O(K²N); Space: O(KN), for K states and a sequence of length N.
Useful implementation techniques to avoid underflow (a code sketch follows the recursions below):
• Viterbi: work with sums of logs
• Forward/Backward: rescale at each position by multiplying by a constant
The recursions, in the multinomial-indicator notation ($y_t^k = 1$ means state k at time t):

Forward:  $\alpha_t^k = p(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k}$

Backward: $\beta_t^k = \sum_i a_{k,i}\, p(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i$

Viterbi:  $V_t^k = p(x_t \mid y_t^k = 1) \max_i a_{i,k} V_{t-1}^i$
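To make these concrete, here is a minimal NumPy sketch (not from the slides) of the rescaled forward pass and the log-space Viterbi recursion; all names (pi, A, B, x) are illustrative assumptions:

```python
# Rescaled forward pass and log-space Viterbi for a K-state HMM over M symbols.
import numpy as np

def forward_rescaled(pi, A, B, x):
    """pi: (K,) initial distribution; A: (K, K) transitions a_{i,k};
    B: (K, M) emissions; x: sequence of observation indices."""
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K))
    c = np.zeros(T)                          # per-position scaling constants
    alpha[0] = pi * B[:, x[0]]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):
        # alpha_t^k = p(x_t | y_t^k = 1) * sum_i alpha_{t-1}^i a_{i,k}
        alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]                     # rescale to avoid underflow
    return alpha, np.log(c).sum()            # log p(x) = sum_t log c_t

def viterbi_log(pi, A, B, x):
    """Most likely state path, computed with sums of logs."""
    T, K = len(x), len(pi)
    logV = np.log(pi) + np.log(B[:, x[0]])
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # V_t^k = p(x_t | y_t^k = 1) * max_i a_{i,k} V_{t-1}^i, in log space
        cand = logV[:, None] + np.log(A)
        back[t] = cand.argmax(axis=0)
        logV = np.log(B[:, x[t]]) + cand.max(axis=0)
    path = [int(logV.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Both routines run in the O(K²N) time and O(KN) space stated above.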
Learning HMM
Supervised learning: estimation when the "right answer" is known
Examples:
• GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islands
• GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls
Unsupervised learning: estimation when the "right answer" is unknown
Examples:
• GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition
• GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice
QUESTION: Update the parameters θ of the model to maximize P(x|θ) --- maximum likelihood (ML) estimation
Learning HMM: two scenarios
Supervised learning: if only we knew the true state path, then ML parameter estimation would be trivial.
E.g., recall that for a completely observed tabular BN:

$\theta_{ijk}^{ML} = \frac{n_{ijk}}{\sum_{k'} n_{ijk'}}$

What if y is continuous? We can treat $\{(x_{n,t}, y_{n,t}) : t = 1, \ldots, T,\ n = 1, \ldots, N\}$ as $N \times T$ observations of, e.g., a GLIM, and apply the learning rules for GLIM.

Unsupervised learning: when the true state path is unknown, we can fill in the missing values using the inference recursions.
The Baum-Welch algorithm (i.e., EM):
• guaranteed to increase the log likelihood of the model after each iteration
• converges to a local optimum, depending on initial conditions

For the supervised case, the ML estimates reduce to normalized counts (a counting sketch in code follows):

$a_{ij}^{ML} = \frac{\#(i \to j)}{\#(i \to \bullet)} = \frac{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i\, y_{n,t}^j}{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i}$

$b_{ik}^{ML} = \frac{\#(i \to k)}{\#(i \to \bullet)} = \frac{\sum_n \sum_{t=1}^{T} y_{n,t}^i\, x_{n,t}^k}{\sum_n \sum_{t=1}^{T} y_{n,t}^i}$
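A minimal sketch of this counting (not from the slides; the names xs, ys, K, M are illustrative assumptions):

```python
# Supervised ML estimation of HMM parameters by normalized counting.
import numpy as np

def supervised_mle(xs, ys, K, M):
    """xs, ys: lists of equal-length integer sequences (observations, states);
    K hidden states, M observation symbols. Returns (A, B)."""
    A = np.zeros((K, K))                     # transition counts #(i -> j)
    B = np.zeros((K, M))                     # emission counts #(i -> k)
    for x, y in zip(xs, ys):
        for t in range(1, len(y)):
            A[y[t - 1], y[t]] += 1
        for t in range(len(y)):
            B[y[t], x[t]] += 1
    A /= A.sum(axis=1, keepdims=True)        # a_ij = #(i -> j) / #(i -> .)
    B /= B.sum(axis=1, keepdims=True)        # b_ik = #(i -> k) / #(i -> .)
    return A, B
```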
The Baum-Welch algorithm
The complete log likelihood:

$\ell_c(\theta; \mathbf{x}, \mathbf{y}) = \log p(\mathbf{x}, \mathbf{y}) = \log \prod_n \Big( p(y_{n,1}) \prod_{t=2}^{T} p(y_{n,t} \mid y_{n,t-1}) \prod_{t=1}^{T} p(x_{n,t} \mid y_{n,t}) \Big)$

The expected complete log likelihood:

$\langle \ell_c(\theta; \mathbf{x}, \mathbf{y}) \rangle = \sum_n \sum_i \langle y_{n,1}^i \rangle_{p(y_{n,1} \mid \mathbf{x}_n)} \log \pi_i + \sum_n \sum_{t=2}^{T} \sum_{i,j} \langle y_{n,t-1}^i y_{n,t}^j \rangle_{p(y_{n,t-1}, y_{n,t} \mid \mathbf{x}_n)} \log a_{i,j} + \sum_n \sum_{t=1}^{T} \sum_{i,k} x_{n,t}^k \langle y_{n,t}^i \rangle_{p(y_{n,t} \mid \mathbf{x}_n)} \log b_{i,k}$

EM
The E step:

$\gamma_{n,t}^i = \langle y_{n,t}^i \rangle = p(y_{n,t}^i = 1 \mid \mathbf{x}_n)$

$\xi_{n,t}^{i,j} = \langle y_{n,t-1}^i y_{n,t}^j \rangle = p(y_{n,t-1}^i = 1, y_{n,t}^j = 1 \mid \mathbf{x}_n)$

The M step ("symbolically" identical to MLE):

$\pi_i^{ML} = \frac{\sum_n \gamma_{n,1}^i}{N}$

$a_{ij}^{ML} = \frac{\sum_n \sum_{t=2}^{T} \xi_{n,t}^{i,j}}{\sum_n \sum_{t=1}^{T-1} \gamma_{n,t}^i}$

$b_{ik}^{ML} = \frac{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i\, x_{n,t}^k}{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i}$
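Putting the E and M steps together, a compact sketch of one Baum-Welch iteration (not from the slides; the names pi, A, B, xs are illustrative assumptions):

```python
# One Baum-Welch (EM) iteration: rescaled forward-backward pass, then the
# E-step quantities gamma and xi, then the M-step normalized expected counts.
import numpy as np

def baum_welch_step(pi, A, B, xs):
    """pi: (K,), A: (K, K) transitions, B: (K, M) emissions;
    xs: list of observation-index sequences. Returns (pi', A', B')."""
    K, M = B.shape
    pi_acc = np.zeros(K)            # sum_n gamma_{n,1}
    xi_acc = np.zeros((K, K))       # sum_n sum_{t=2..T} xi_{n,t}
    gA_acc = np.zeros(K)            # sum_n sum_{t=1..T-1} gamma_{n,t}
    gB_num = np.zeros((K, M))       # sum_n sum_t gamma_{n,t}^i x_{n,t}^k
    gB_den = np.zeros(K)            # sum_n sum_t gamma_{n,t}^i
    for x in xs:
        T = len(x)
        alpha = np.zeros((T, K)); beta = np.zeros((T, K)); c = np.zeros(T)
        alpha[0] = pi * B[:, x[0]]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
        for t in range(1, T):       # rescaled forward pass
            alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)
            c[t] = alpha[t].sum(); alpha[t] /= c[t]
        beta[T - 1] = 1.0
        for t in range(T - 2, -1, -1):   # matching rescaled backward pass
            beta[t] = (A @ (B[:, x[t + 1]] * beta[t + 1])) / c[t + 1]
        gamma = alpha * beta        # E step: gamma[t, i] = p(y_t^i = 1 | x)
        for t in range(1, T):       # E step: xi[i, j] = p(y_{t-1}^i = 1, y_t^j = 1 | x)
            xi_acc += (alpha[t - 1][:, None] * A
                       * (B[:, x[t]] * beta[t])[None, :]) / c[t]
        pi_acc += gamma[0]
        gA_acc += gamma[:-1].sum(axis=0)
        for t in range(T):
            gB_num[:, x[t]] += gamma[t]
        gB_den += gamma.sum(axis=0)
    # M step: normalized expected counts, as in the formulas above
    return pi_acc / len(xs), xi_acc / gA_acc[:, None], gB_num / gB_den[:, None]
```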
Shortcomings of Hidden Markov Model
• HMMs capture dependencies between each state and only its corresponding observation.
  NLP example: in a sentence segmentation task, each segmental state may depend not just on a single word (and the adjacent segmental states), but also on (non-local) features of the whole line, such as line length, indentation, amount of white space, etc.
• Mismatch between the learning objective function and the prediction objective function:
  an HMM learns a joint distribution of states and observations, P(Y, X), but in a prediction task we need the conditional probability P(Y|X).

[Figure: HMM — a chain of states Y1 … Yn, each state Yt connected only to its own observation Xt]
Solution: Maximum Entropy Markov Model (MEMM)
• Models the dependence between each state and the full observation sequence explicitly
• More expressive than HMMs
• Discriminative model:
  completely ignores modeling P(X), which saves modeling effort;
  the learning objective function is consistent with the predictive function, P(Y|X)
• States with fewer outgoing transitions do not have an unfair advantage (no label bias)!
Announcements
• Office hour
• Mid-term project milestone due today
• Next week
From MEMM …

[Figure: MEMM — a directed chain of states Y1 … Yn, each state also conditioned on the full observation sequence X1:n]
From MEMM to CRF
CRF is a partially directed model:
• Discriminative model, like MEMM
• Use of a global normalizer Z(x) overcomes the label bias problem of MEMM
• Models the dependence between each state and the entire observation sequence (like MEMM)

[Figure: linear-chain CRF — an undirected chain Y1 … Yn, with each state connected to the full observation x1:n]
Conditional Random Fields
General parametric form:

[Figure: linear-chain CRF — states Y1 … Yn, each connected to the full observation x1:n]
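The equation itself did not survive extraction; the standard linear-chain CRF form the slide refers to, with edge features $f_k$ weighted by $\lambda_k$ and node features $g_l$ weighted by $\mu_l$, is:

$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x}, \lambda, \mu)} \exp\Big( \sum_t \Big[ \sum_k \lambda_k f_k(y_t, y_{t-1}, \mathbf{x}) + \sum_l \mu_l g_l(y_t, \mathbf{x}) \Big] \Big)$

where $Z(\mathbf{x}, \lambda, \mu)$ sums the exponent over all state sequences $\mathbf{y}$, making the normalizer global rather than per-state.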
CRFs: Inference
Given CRF parameters λ and µ, find the y* that maximizes P(y|x).
We can ignore Z(x) because it is not a function of y.
Run the max-product algorithm on the junction tree of the CRF:

[Figure: chain CRF Y1 … Yn with observation x1:n; its junction tree has cliques {Yt-1, Yt} with separators {Yt}]

Same as the Viterbi decoding used in HMMs!
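A minimal sketch of this max-product decoding (not from the slides). Here node_score[t, j] collects the µ–g terms for y_t = j and edge_score[i, j] the λ–f terms, both pre-evaluated on the given x and assumed time-homogeneous for simplicity; these names are illustrative assumptions. Z(x) is dropped since it is constant in y:

```python
# Max-product (Viterbi) decoding for a linear-chain CRF in log space.
import numpy as np

def crf_viterbi(node_score, edge_score):
    """node_score: (T, K) log node potentials; edge_score: (K, K) log edge
    potentials. Returns the MAP state sequence y*."""
    T, K = node_score.shape
    V = node_score[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = V[:, None] + edge_score       # cand[i, j]: best path ending i -> j
        back[t] = cand.argmax(axis=0)
        V = node_score[t] + cand.max(axis=0)
    y = [int(V.argmax())]
    for t in range(T - 1, 0, -1):            # trace back the argmax pointers
        y.append(int(back[t, y[-1]]))
    return y[::-1]
```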
CRF learning
Given training data $\{(\mathbf{x}_d, \mathbf{y}_d)\}_{d=1}^{N}$, find $\lambda^*, \mu^*$ that maximize the conditional log likelihood $L(\lambda, \mu) = \sum_d \log P(\mathbf{y}_d \mid \mathbf{x}_d)$.
Computing the gradient w.r.t. λ: the gradient of the log-partition function in an exponential family is the expectation of the sufficient statistics.
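Written out (a standard exponential-family identity, using the $f_k$ notation above):

$\frac{\partial L}{\partial \lambda_k} = \sum_d \sum_t f_k(y_{d,t}, y_{d,t-1}, \mathbf{x}_d) - \sum_d E_{P(\mathbf{y} \mid \mathbf{x}_d)}\Big[ \sum_t f_k(y_t, y_{t-1}, \mathbf{x}_d) \Big]$

i.e., observed feature counts minus the feature counts expected under the current model.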
CRF learning
Computing the model expectations:
Requires an exponentially large number of summations — is it intractable?
Tractable! We can compute the marginals using the sum-product algorithm on the chain.
The expectation of f is then an expectation over the corresponding marginal probability of neighboring nodes (written out below)!
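In symbols, for the edge features:

$E_{P(\mathbf{y} \mid \mathbf{x}_d)}\Big[ \sum_t f_k(y_t, y_{t-1}, \mathbf{x}_d) \Big] = \sum_t \sum_{y_{t-1}, y_t} P(y_{t-1}, y_t \mid \mathbf{x}_d)\, f_k(y_t, y_{t-1}, \mathbf{x}_d)$

so only the pairwise marginals of neighboring nodes are needed, and the chain sum-product algorithm delivers exactly those.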
CRF learning
Computing marginals using junction-tree calibration:
Junction tree initialization: each clique potential is initialized from its factors, e.g. $\psi(y_{t-1}, y_t) \propto \exp\big( \sum_k \lambda_k f_k(y_t, y_{t-1}, \mathbf{x}) + \sum_l \mu_l g_l(y_t, \mathbf{x}) \big)$.
After calibration: each clique potential is proportional to the marginal $P(y_{t-1}, y_t \mid \mathbf{x})$.

[Figure: chain CRF Y1 … Yn with observation x1:n; junction tree with cliques {Yt-1, Yt} and separators {Yt}]

This is also called the forward-backward algorithm.
CRF learning
Computing feature expectations using the calibrated potentials:
Now we know how to compute $\nabla_\lambda L(\lambda, \mu)$.
Learning can now be done using gradient ascent:
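A plain gradient-ascent update (with step size $\eta$; the specific update schedule is an assumption, as the slide's equation was not in the extracted text):

$\lambda^{(i+1)} = \lambda^{(i)} + \eta\, \nabla_\lambda L(\lambda^{(i)}, \mu^{(i)}), \qquad \mu^{(i+1)} = \mu^{(i)} + \eta\, \nabla_\mu L(\lambda^{(i)}, \mu^{(i)})$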
CRF learning
In practice, we use a Gaussian regularizer on the parameter vector to improve generalizability (the penalized objective is sketched below).
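A common choice (assumed here; the extracted slide does not show the formula) is a zero-mean Gaussian prior with variance $\sigma^2$, giving the penalized objective:

$L_{reg}(\lambda, \mu) = L(\lambda, \mu) - \frac{\lVert \lambda \rVert^2}{2\sigma^2} - \frac{\lVert \mu \rVert^2}{2\sigma^2}$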
In practice, gradient ascent converges very slowly. Alternatives include conjugate gradient and limited-memory quasi-Newton methods (e.g., L-BFGS).
Summary
• Conditional Random Fields are partially directed discriminative models.
• They overcome the label bias problem of MEMMs by using a global normalizer.
• Inference for 1-D chain CRFs is exact: the same as max-product (Viterbi) decoding.
• Learning is also exact: globally optimal parameters can be learned. This requires the sum-product (forward-backward) algorithm.
• CRFs involving arbitrary graph structures are intractable in general (e.g., grid CRFs): inference and learning require approximation techniques.