School of Computer Science
Hidden Markov Model and Conditional Random Fields
Probabilistic Graphical Models (10-708)
Lecture 12, Oct 29, 2007
Eric Xing
Reading: J-Chap. 12, and additional papers

[Figure: example graphical models — an eight-node network X1–X8 and a signaling pathway with Receptor A/B, Kinase C/D/E, TF F, and Genes G/H]

Announcements
• Feedback
• Today: office hour, recitation
• Project milestone this Wednesday
Applications of HMMs
Some early applications of HMMs:
• finance, but we never saw them
• speech recognition
• modelling ion channels
In the mid-late 1980s, HMMs entered genetics and molecular biology, and they are now firmly entrenched.
Some current applications of HMMs to biology:
• mapping chromosomes
• aligning biological sequences
• predicting sequence structure
• inferring evolutionary relationships
• finding genes in DNA sequences
Computational complexity and implementation details
What is the running time, and space required, for Forward and Backward?
Time: O(K²N); Space: O(KN), for K states and a sequence of length N.
Useful implementation techniques to avoid underflow (a code sketch follows the recursions below):
• Viterbi: work with sums of logs
• Forward/Backward: rescale at each position by multiplying by a constant
The recursions, in the multinomial-indicator notation ($y_t^k = 1$ means state k at time t):

Forward:  $\alpha_t^k = p(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k}$

Backward: $\beta_t^k = \sum_i a_{k,i}\, p(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i$

Viterbi:  $V_t^k = p(x_t \mid y_t^k = 1) \max_i a_{i,k} V_{t-1}^i$
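To make these concrete, here is a minimal NumPy sketch (not from the slides) of the rescaled forward pass and the log-space Viterbi recursion; all names (pi, A, B, x) are illustrative assumptions:

```python
# Rescaled forward pass and log-space Viterbi for a K-state HMM over M symbols.
import numpy as np

def forward_rescaled(pi, A, B, x):
    """pi: (K,) initial distribution; A: (K, K) transitions a_{i,k};
    B: (K, M) emissions; x: sequence of observation indices."""
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K))
    c = np.zeros(T)                          # per-position scaling constants
    alpha[0] = pi * B[:, x[0]]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):
        # alpha_t^k = p(x_t | y_t^k = 1) * sum_i alpha_{t-1}^i a_{i,k}
        alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]                     # rescale to avoid underflow
    return alpha, np.log(c).sum()            # log p(x) = sum_t log c_t

def viterbi_log(pi, A, B, x):
    """Most likely state path, computed with sums of logs."""
    T, K = len(x), len(pi)
    logV = np.log(pi) + np.log(B[:, x[0]])
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # V_t^k = p(x_t | y_t^k = 1) * max_i a_{i,k} V_{t-1}^i, in log space
        cand = logV[:, None] + np.log(A)
        back[t] = cand.argmax(axis=0)
        logV = np.log(B[:, x[t]]) + cand.max(axis=0)
    path = [int(logV.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Both routines run in the O(K²N) time and O(KN) space stated above.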
Learning HMM
Supervised learning: estimation when the "right answer" is known
Examples:
• GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islands
• GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls
Unsupervised learning: estimation when the "right answer" is unknown
Examples:
• GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition
• GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice
QUESTION: Update the parameters θ of the model to maximize P(x|θ) --- maximum likelihood (ML) estimation
Learning HMM: two scenarios
Supervised learning: if only we knew the true state path, then ML parameter estimation would be trivial.
E.g., recall that for a completely observed tabular BN:

$\theta_{ijk}^{ML} = \frac{n_{ijk}}{\sum_{k'} n_{ijk'}}$

What if y is continuous? We can treat $\{(x_{n,t}, y_{n,t}) : t = 1, \ldots, T,\ n = 1, \ldots, N\}$ as $N \times T$ observations of, e.g., a GLIM, and apply the learning rules for GLIM.

Unsupervised learning: when the true state path is unknown, we can fill in the missing values using the inference recursions.
The Baum-Welch algorithm (i.e., EM):
• guaranteed to increase the log likelihood of the model after each iteration
• converges to a local optimum, depending on initial conditions

For the supervised case, the ML estimates reduce to normalized counts (a counting sketch in code follows):

$a_{ij}^{ML} = \frac{\#(i \to j)}{\#(i \to \bullet)} = \frac{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i\, y_{n,t}^j}{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i}$

$b_{ik}^{ML} = \frac{\#(i \to k)}{\#(i \to \bullet)} = \frac{\sum_n \sum_{t=1}^{T} y_{n,t}^i\, x_{n,t}^k}{\sum_n \sum_{t=1}^{T} y_{n,t}^i}$
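A minimal sketch of this counting (not from the slides; the names xs, ys, K, M are illustrative assumptions):

```python
# Supervised ML estimation of HMM parameters by normalized counting.
import numpy as np

def supervised_mle(xs, ys, K, M):
    """xs, ys: lists of equal-length integer sequences (observations, states);
    K hidden states, M observation symbols. Returns (A, B)."""
    A = np.zeros((K, K))                     # transition counts #(i -> j)
    B = np.zeros((K, M))                     # emission counts #(i -> k)
    for x, y in zip(xs, ys):
        for t in range(1, len(y)):
            A[y[t - 1], y[t]] += 1
        for t in range(len(y)):
            B[y[t], x[t]] += 1
    A /= A.sum(axis=1, keepdims=True)        # a_ij = #(i -> j) / #(i -> .)
    B /= B.sum(axis=1, keepdims=True)        # b_ik = #(i -> k) / #(i -> .)
    return A, B
```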
The Baum-Welch algorithm
The complete log likelihood:

$\ell_c(\theta; \mathbf{x}, \mathbf{y}) = \log p(\mathbf{x}, \mathbf{y}) = \log \prod_n \Big( p(y_{n,1}) \prod_{t=2}^{T} p(y_{n,t} \mid y_{n,t-1}) \prod_{t=1}^{T} p(x_{n,t} \mid y_{n,t}) \Big)$

The expected complete log likelihood:

$\langle \ell_c(\theta; \mathbf{x}, \mathbf{y}) \rangle = \sum_n \sum_i \langle y_{n,1}^i \rangle_{p(y_{n,1} \mid \mathbf{x}_n)} \log \pi_i + \sum_n \sum_{t=2}^{T} \sum_{i,j} \langle y_{n,t-1}^i y_{n,t}^j \rangle_{p(y_{n,t-1}, y_{n,t} \mid \mathbf{x}_n)} \log a_{i,j} + \sum_n \sum_{t=1}^{T} \sum_{i,k} x_{n,t}^k \langle y_{n,t}^i \rangle_{p(y_{n,t} \mid \mathbf{x}_n)} \log b_{i,k}$

EM
The E step:

$\gamma_{n,t}^i = \langle y_{n,t}^i \rangle = p(y_{n,t}^i = 1 \mid \mathbf{x}_n)$

$\xi_{n,t}^{i,j} = \langle y_{n,t-1}^i y_{n,t}^j \rangle = p(y_{n,t-1}^i = 1, y_{n,t}^j = 1 \mid \mathbf{x}_n)$

The M step ("symbolically" identical to MLE):

$\pi_i^{ML} = \frac{\sum_n \gamma_{n,1}^i}{N}$

$a_{ij}^{ML} = \frac{\sum_n \sum_{t=2}^{T} \xi_{n,t}^{i,j}}{\sum_n \sum_{t=1}^{T-1} \gamma_{n,t}^i}$

$b_{ik}^{ML} = \frac{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i\, x_{n,t}^k}{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i}$
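Putting the E and M steps together, a compact sketch of one Baum-Welch iteration (not from the slides; the names pi, A, B, xs are illustrative assumptions):

```python
# One Baum-Welch (EM) iteration: rescaled forward-backward pass, then the
# E-step quantities gamma and xi, then the M-step normalized expected counts.
import numpy as np

def baum_welch_step(pi, A, B, xs):
    """pi: (K,), A: (K, K) transitions, B: (K, M) emissions;
    xs: list of observation-index sequences. Returns (pi', A', B')."""
    K, M = B.shape
    pi_acc = np.zeros(K)            # sum_n gamma_{n,1}
    xi_acc = np.zeros((K, K))       # sum_n sum_{t=2..T} xi_{n,t}
    gA_acc = np.zeros(K)            # sum_n sum_{t=1..T-1} gamma_{n,t}
    gB_num = np.zeros((K, M))       # sum_n sum_t gamma_{n,t}^i x_{n,t}^k
    gB_den = np.zeros(K)            # sum_n sum_t gamma_{n,t}^i
    for x in xs:
        T = len(x)
        alpha = np.zeros((T, K)); beta = np.zeros((T, K)); c = np.zeros(T)
        alpha[0] = pi * B[:, x[0]]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
        for t in range(1, T):       # rescaled forward pass
            alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)
            c[t] = alpha[t].sum(); alpha[t] /= c[t]
        beta[T - 1] = 1.0
        for t in range(T - 2, -1, -1):   # matching rescaled backward pass
            beta[t] = (A @ (B[:, x[t + 1]] * beta[t + 1])) / c[t + 1]
        gamma = alpha * beta        # E step: gamma[t, i] = p(y_t^i = 1 | x)
        for t in range(1, T):       # E step: xi[i, j] = p(y_{t-1}^i = 1, y_t^j = 1 | x)
            xi_acc += (alpha[t - 1][:, None] * A
                       * (B[:, x[t]] * beta[t])[None, :]) / c[t]
        pi_acc += gamma[0]
        gA_acc += gamma[:-1].sum(axis=0)
        for t in range(T):
            gB_num[:, x[t]] += gamma[t]
        gB_den += gamma.sum(axis=0)
    # M step: normalized expected counts, as in the formulas above
    return pi_acc / len(xs), xi_acc / gA_acc[:, None], gB_num / gB_den[:, None]
```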
Shortcomings of Hidden Markov Model
• HMMs capture dependencies between each state and only its corresponding observation.
  NLP example: in a sentence segmentation task, each segmental state may depend not just on a single word (and the adjacent segmental states), but also on (non-local) features of the whole line, such as line length, indentation, amount of white space, etc.
• Mismatch between the learning objective function and the prediction objective function:
  an HMM learns a joint distribution of states and observations, P(Y, X), but in a prediction task we need the conditional probability P(Y|X).

[Figure: HMM — a chain of states Y1 … Yn, each state Yt connected only to its own observation Xt]
Solution: Maximum Entropy Markov Model (MEMM)
• Models the dependence between each state and the full observation sequence explicitly
• More expressive than HMMs
• Discriminative model:
  completely ignores modeling P(X), which saves modeling effort;
  the learning objective function is consistent with the predictive function, P(Y|X)
• States with fewer outgoing transitions do not have an unfair advantage (no label bias)!
Announcements
• Office hour
• Mid-term project milestone due today
• Next week
From MEMM …

[Figure: MEMM — a directed chain of states Y1 … Yn, each state also conditioned on the full observation sequence X1:n]
From MEMM to CRF
CRF is a partially directed model:
• Discriminative model, like MEMM
• Use of a global normalizer Z(x) overcomes the label bias problem of MEMM
• Models the dependence between each state and the entire observation sequence (like MEMM)

[Figure: linear-chain CRF — an undirected chain Y1 … Yn, with each state connected to the full observation x1:n]
Conditional Random Fields
General parametric form:

[Figure: linear-chain CRF — states Y1 … Yn, each connected to the full observation x1:n]
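The equation itself did not survive extraction; the standard linear-chain CRF form the slide refers to, with edge features $f_k$ weighted by $\lambda_k$ and node features $g_l$ weighted by $\mu_l$, is:

$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x}, \lambda, \mu)} \exp\Big( \sum_t \Big[ \sum_k \lambda_k f_k(y_t, y_{t-1}, \mathbf{x}) + \sum_l \mu_l g_l(y_t, \mathbf{x}) \Big] \Big)$

where $Z(\mathbf{x}, \lambda, \mu)$ sums the exponent over all state sequences $\mathbf{y}$, making the normalizer global rather than per-state.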
CRFs: Inference
Given CRF parameters λ and µ, find the y* that maximizes P(y|x).
We can ignore Z(x) because it is not a function of y.
Run the max-product algorithm on the junction tree of the CRF:

[Figure: chain CRF Y1 … Yn with observation x1:n; its junction tree has cliques {Yt-1, Yt} with separators {Yt}]

Same as the Viterbi decoding used in HMMs!
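A minimal sketch of this max-product decoding (not from the slides). Here node_score[t, j] collects the µ–g terms for y_t = j and edge_score[i, j] the λ–f terms, both pre-evaluated on the given x and assumed time-homogeneous for simplicity; these names are illustrative assumptions. Z(x) is dropped since it is constant in y:

```python
# Max-product (Viterbi) decoding for a linear-chain CRF in log space.
import numpy as np

def crf_viterbi(node_score, edge_score):
    """node_score: (T, K) log node potentials; edge_score: (K, K) log edge
    potentials. Returns the MAP state sequence y*."""
    T, K = node_score.shape
    V = node_score[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = V[:, None] + edge_score       # cand[i, j]: best path ending i -> j
        back[t] = cand.argmax(axis=0)
        V = node_score[t] + cand.max(axis=0)
    y = [int(V.argmax())]
    for t in range(T - 1, 0, -1):            # trace back the argmax pointers
        y.append(int(back[t, y[-1]]))
    return y[::-1]
```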
CRF learning
Given training data $\{(\mathbf{x}_d, \mathbf{y}_d)\}_{d=1}^{N}$, find $\lambda^*, \mu^*$ that maximize the conditional log likelihood $L(\lambda, \mu) = \sum_d \log P(\mathbf{y}_d \mid \mathbf{x}_d)$.
Computing the gradient w.r.t. λ: the gradient of the log-partition function in an exponential family is the expectation of the sufficient statistics.
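Written out (a standard exponential-family identity, using the $f_k$ notation above):

$\frac{\partial L}{\partial \lambda_k} = \sum_d \sum_t f_k(y_{d,t}, y_{d,t-1}, \mathbf{x}_d) - \sum_d E_{P(\mathbf{y} \mid \mathbf{x}_d)}\Big[ \sum_t f_k(y_t, y_{t-1}, \mathbf{x}_d) \Big]$

i.e., observed feature counts minus the feature counts expected under the current model.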
CRF learning
Computing the model expectations:
Requires an exponentially large number of summations — is it intractable?
Tractable! We can compute the marginals using the sum-product algorithm on the chain.
The expectation of f is then an expectation over the corresponding marginal probability of neighboring nodes (written out below)!
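In symbols, for the edge features:

$E_{P(\mathbf{y} \mid \mathbf{x}_d)}\Big[ \sum_t f_k(y_t, y_{t-1}, \mathbf{x}_d) \Big] = \sum_t \sum_{y_{t-1}, y_t} P(y_{t-1}, y_t \mid \mathbf{x}_d)\, f_k(y_t, y_{t-1}, \mathbf{x}_d)$

so only the pairwise marginals of neighboring nodes are needed, and the chain sum-product algorithm delivers exactly those.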
CRF learning
Computing marginals using junction-tree calibration:
Junction tree initialization: each clique potential is initialized from its factors, e.g. $\psi(y_{t-1}, y_t) \propto \exp\big( \sum_k \lambda_k f_k(y_t, y_{t-1}, \mathbf{x}) + \sum_l \mu_l g_l(y_t, \mathbf{x}) \big)$.
After calibration: each clique potential is proportional to the marginal $P(y_{t-1}, y_t \mid \mathbf{x})$.

[Figure: chain CRF Y1 … Yn with observation x1:n; junction tree with cliques {Yt-1, Yt} and separators {Yt}]

This is also called the forward-backward algorithm.
CRF learning
Computing feature expectations using the calibrated potentials:
Now we know how to compute $\nabla_\lambda L(\lambda, \mu)$.
Learning can now be done using gradient ascent:
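A plain gradient-ascent update (with step size $\eta$; the specific update schedule is an assumption, as the slide's equation was not in the extracted text):

$\lambda^{(i+1)} = \lambda^{(i)} + \eta\, \nabla_\lambda L(\lambda^{(i)}, \mu^{(i)}), \qquad \mu^{(i+1)} = \mu^{(i)} + \eta\, \nabla_\mu L(\lambda^{(i)}, \mu^{(i)})$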
CRF learning
In practice, we use a Gaussian regularizer on the parameter vector to improve generalizability (the penalized objective is sketched below).
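A common choice (assumed here; the extracted slide does not show the formula) is a zero-mean Gaussian prior with variance $\sigma^2$, giving the penalized objective:

$L_{reg}(\lambda, \mu) = L(\lambda, \mu) - \frac{\lVert \lambda \rVert^2}{2\sigma^2} - \frac{\lVert \mu \rVert^2}{2\sigma^2}$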
In practice, gradient ascent converges very slowly. Alternatives include conjugate gradient and limited-memory quasi-Newton methods (e.g., L-BFGS).
Summary
• Conditional Random Fields are partially directed discriminative models.
• They overcome the label bias problem of MEMMs by using a global normalizer.
• Inference for 1-D chain CRFs is exact: the same as max-product (Viterbi) decoding.
• Learning is also exact: globally optimal parameters can be learned. This requires the sum-product (forward-backward) algorithm.
• CRFs involving arbitrary graph structures are intractable in general (e.g., grid CRFs): inference and learning require approximation techniques.