Sequential Data Modeling - Conditional Random Fields
Posted on 16-Oct-2021
Transcript
Slide 1: Sequential Data Modeling – Conditional Random Fields
Graham Neubig, Nara Institute of Science and Technology (NAIST)
Slide 2: Prediction Problems
Given x, predict y
Slide 3: Prediction Problems
Given x, predict y:
● A book review ("Oh, man I love this book!" / "This book is so boring...")
  → Is it positive? (yes / no) → Binary prediction (2 choices)
● A tweet ("On the way to the park!" / "公園に行くなう!")
  → Its language (English / Japanese) → Multi-class prediction (several choices)
● A sentence ("I read a book")
  → Its parts of speech ("I/N read/VBD a/DET book/NN") → Structured prediction (millions of choices)
Slide 4: Logistic Regression
Slide 5: Example we will use:
● Given an introductory sentence from Wikipedia
● Predict whether the article is about a person
● This is binary classification (of course!)
Given:
"Gonso was a Sanron sect priest (754-827) in the late Nara and early Heian periods."
Predict: Yes!
"Shichikuzan Chigogataki Fudomyoo is a historical site located at Magura, Maizuru City, Kyoto Prefecture."
Predict: No!
Slide 6: Review: Linear Prediction Model
● Each element that helps us predict is a feature
● Each feature has a weight, positive if it indicates “yes”, and negative if it indicates “no”
● For a new example, sum the weights
● If the sum is at least 0: “yes”, otherwise: “no”
Features: contains "priest", contains "(<#>-<#>)", contains "site", contains "Kyoto Prefecture"

w_contains "priest" = 2
w_contains "(<#>-<#>)" = 1
w_contains "site" = -3
w_contains "Kyoto Prefecture" = -1

"Kuya (903-972) was a priest born in Kyoto Prefecture."
→ 2 + (-1) + 1 = 2 ≥ 0, so "yes"
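The weighted-sum prediction above can be sketched in a few lines of Python. The feature names and weights are the ones from the slide; the `predict` helper is our own illustrative name, not something from the lecture:

```python
# Linear prediction: sum the weights of the active features,
# answer "yes" if the sum is at least 0, otherwise "no".
weights = {
    'contains "priest"': 2,
    'contains "(<#>-<#>)"': 1,
    'contains "site"': -3,
    'contains "Kyoto Prefecture"': -1,
}

def predict(features, weights):
    """Sum the weights of the active features; 'yes' if the sum is >= 0."""
    score = sum(weights.get(f, 0) for f in features)
    return score, ("yes" if score >= 0 else "no")

# "Kuya (903-972) was a priest born in Kyoto Prefecture."
score, label = predict(
    ['contains "priest"', 'contains "(<#>-<#>)"', 'contains "Kyoto Prefecture"'],
    weights,
)
# score = 2 + (-1) + 1 = 2, so the answer is "yes"
```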
Slide 7: Review: Mathematical Formulation
y = sign(w · φ(x)) = sign(Σ_{i=1}^{I} w_i φ_i(x))

● x: the input
● φ(x): vector of feature functions {φ_1(x), φ_2(x), …, φ_I(x)}
● w: the weight vector {w_1, w_2, …, w_I}
● y: the prediction, +1 if "yes", -1 if "no"
● (sign(v) is +1 if v ≥ 0, -1 otherwise)
Slide 8: Perceptron and Probabilities

[Plot: p(y|x) as a step function of w·φ(x)]

● Sometimes we want the probability P(y|x)
  ● Estimating confidence in predictions
  ● Combining with other systems
● However, the perceptron only gives us a prediction:

y = sign(w·φ(x))

In other words:
P(y=1|x) = 1 if w·φ(x) ≥ 0
P(y=1|x) = 0 if w·φ(x) < 0
Slide 9: The Logistic Function

[Plots: p(y|x) vs. w·φ(x) for the perceptron (step function) and for the logistic function]

● The logistic function is a "softened" version of the function used in the perceptron:

P(y=1|x) = e^{w·φ(x)} / (1 + e^{w·φ(x)})

● Can account for uncertainty
● Differentiable
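As a quick sketch, the logistic function of the score w·φ(x) can be computed directly (the function name is ours, not from the slides):

```python
import math

def logistic(score):
    """P(y=1|x) = e^{w.phi(x)} / (1 + e^{w.phi(x)}), where score = w.phi(x)."""
    return math.exp(score) / (1 + math.exp(score))

# logistic(0) = 0.5: maximal uncertainty at the decision boundary
```

Unlike sign(), the output moves smoothly from 0 to 1, which is what makes the gradient on the following slides well-defined.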
Slide 10: Logistic Regression

● Train based on conditional likelihood
● Find the parameters w that maximize the conditional likelihood of all answers y_i given the examples x_i:

w = argmax_w Π_i P(y_i | x_i; w)

● How do we solve this?
Slide 11: Review: Perceptron Training Algorithm

create map w
for I iterations
    for each labeled pair x, y in the data
        phi = create_features(x)
        y' = predict_one(w, phi)
        if y' != y
            w += y * phi

● In other words:
  ● Try to classify each training example
  ● Every time we make a mistake, update the weights
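A minimal runnable version of this pseudocode, assuming features have already been extracted into count maps (so `create_features` is not shown, and the toy data below is our own):

```python
from collections import defaultdict

def predict_one(w, phi):
    """Sign prediction from a sparse weight map and a sparse feature map."""
    score = sum(w[k] * v for k, v in phi.items())
    return 1 if score >= 0 else -1

def train_perceptron(data, iterations):
    w = defaultdict(float)            # "create map w"
    for _ in range(iterations):       # "for I iterations"
        for phi, y in data:
            if predict_one(w, phi) != y:
                for k, v in phi.items():
                    w[k] += y * v     # "w += y * phi"
    return w

# toy data: phi maps feature names to counts
data = [({"good": 1}, 1), ({"bad": 1}, -1)]
w = train_perceptron(data, iterations=5)
```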
Slide 12: Stochastic Gradient Descent

● Online training algorithm for probabilistic models (including logistic regression)

create map w
for I iterations
    for each labeled pair x, y in the data
        w += α * dP(y|x)/dw

● In other words:
  ● For every training example, calculate the gradient (the direction that will increase the probability of y)
  ● Move in that direction, multiplied by learning rate α
Slide 13: Gradient of the Logistic Function

[Plot: dP(y|x)/dw·φ(x) as a function of w·φ(x), peaking at w·φ(x) = 0]

● Take the derivative of the probability:

d/dw P(y=1|x) = d/dw [e^{w·φ(x)} / (1 + e^{w·φ(x)})]
             = φ(x) e^{w·φ(x)} / (1 + e^{w·φ(x)})²

d/dw P(y=-1|x) = d/dw [1 − e^{w·φ(x)} / (1 + e^{w·φ(x)})]
              = −φ(x) e^{w·φ(x)} / (1 + e^{w·φ(x)})²
Slide 14: Example: Initial Update

● Set α=1, initialize w=0

x = "A site , located in Maizuru , Kyoto", y = -1

w·φ(x) = 0
d/dw P(y=-1|x) = −e^0/(1+e^0)² φ(x) = −0.25 φ(x)
w ← w + (−0.25) φ(x)

w_unigram "A" = -0.25
w_unigram "site" = -0.25
w_unigram "," = -0.5
w_unigram "located" = -0.25
w_unigram "in" = -0.25
w_unigram "Maizuru" = -0.25
w_unigram "Kyoto" = -0.25
Slide 15: Example: Second Update

x = "Shoken , monk born in Kyoto", y = 1

w·φ(x) = -1
d/dw P(y=1|x) = e^{-1}/(1+e^{-1})² φ(x) = 0.196 φ(x)
w ← w + 0.196 φ(x)

w_unigram "A" = -0.25
w_unigram "site" = -0.25
w_unigram "," = -0.304
w_unigram "located" = -0.25
w_unigram "in" = -0.054
w_unigram "Maizuru" = -0.25
w_unigram "Kyoto" = -0.054
w_unigram "Shoken" = 0.196
w_unigram "monk" = 0.196
w_unigram "born" = 0.196
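The two updates above can be reproduced with a short script. The gradient coefficient y·e^{w·φ}/(1+e^{w·φ})² comes straight from the derivative slide; `sgd_update` and `unigrams` are our own helper names:

```python
import math
from collections import defaultdict

def sgd_update(w, phi, y, alpha=1.0):
    """One SGD step on dP(y|x)/dw; returns the coefficient multiplying phi(x)."""
    s = sum(w[k] * v for k, v in phi.items())
    coeff = y * math.exp(s) / (1 + math.exp(s)) ** 2
    for k, v in phi.items():
        w[k] += alpha * coeff * v
    return coeff

def unigrams(sentence):
    """Unigram count features; note "," occurs twice in the first sentence."""
    phi = defaultdict(int)
    for word in sentence.split():
        phi[word] += 1
    return phi

w = defaultdict(float)
c1 = sgd_update(w, unigrams("A site , located in Maizuru , Kyoto"), -1)  # -0.25
c2 = sgd_update(w, unigrams("Shoken , monk born in Kyoto"), 1)           # ~0.196
```

After the second call, w["in"] ≈ -0.054 and w[","] ≈ -0.304, matching the slide.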
Slide 16: Calculating Optimal Sequences, Probabilities
Slide 17: Sequence Likelihood

● Logistic regression considered the probability P(y|x) of a single label y ∈ {−1, +1}
● What if we want to consider the probability P(Y|X) of a sequence?

X_i = "I visited Nara"
Y_i = "PRN VBD NNP"
Slide 18: Calculating Multi-class Probabilities

● Each sequence has its own feature vector
● Use weights for each feature to calculate scores

φ("time/N flies/V"): φ_{T,<S>,N}=1, φ_{T,N,V}=1, φ_{T,V,</S>}=1, φ_{E,N,time}=1, φ_{E,V,flies}=1
φ("time/V flies/N"): φ_{T,<S>,V}=1, φ_{T,V,N}=1, φ_{T,N,</S>}=1, φ_{E,V,time}=1, φ_{E,N,flies}=1
φ("time/N flies/N"): φ_{T,<S>,N}=1, φ_{T,N,N}=1, φ_{T,N,</S>}=1, φ_{E,N,time}=1, φ_{E,N,flies}=1
φ("time/V flies/V"): φ_{T,<S>,V}=1, φ_{T,V,V}=1, φ_{T,V,</S>}=1, φ_{E,V,time}=1, φ_{E,V,flies}=1

Weights: w_{T,<S>,N}=1, w_{E,N,time}=1, w_{T,V,</S>}=1 (all others 0)

φ("time/N flies/V")·w = 3    φ("time/V flies/N")·w = 0
φ("time/N flies/N")·w = 2    φ("time/V flies/V")·w = 1
Slide 19: The Softmax Function

● Turn scores into probabilities by taking the exponent and normalizing (the softmax function):

P(Y|X) = e^{w·φ(Y,X)} / Σ_{Y'} e^{w·φ(Y',X)}

exp(φ("time/N flies/V")·w) = 20.08    exp(φ("time/V flies/N")·w) = 1.00
exp(φ("time/N flies/N")·w) = 7.39     exp(φ("time/V flies/V")·w) = 2.72

P(N V | time flies) = .6437    P(V N | time flies) = .0320
P(N N | time flies) = .2369    P(V V | time flies) = .0872
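These numbers can be checked with a direct softmax over the four sequence scores from the previous slide (variable names are ours):

```python
import math

# scores w.phi(Y,X): N V = 3, N N = 2, V V = 1, V N = 0
scores = {"N V": 3.0, "N N": 2.0, "V V": 1.0, "V N": 0.0}

def softmax(scores):
    """Exponentiate each score and normalize so the values sum to 1."""
    exps = {y: math.exp(s) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

probs = softmax(scores)
# probs["N V"] ~ .6437, probs["N N"] ~ .2369, probs["V V"] ~ .0872, probs["V N"] ~ .0320
```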
Slide 20: Calculating Edge Features

● Like the perceptron, we can calculate features for each edge of the lattice over "time flies":

<S> → N (time): φ_{T,<S>,N}=1, φ_{E,N,time}=1
<S> → V (time): φ_{T,<S>,V}=1, φ_{E,V,time}=1
N → N (flies): φ_{T,N,N}=1, φ_{E,N,flies}=1
N → V (flies): φ_{T,N,V}=1, φ_{E,V,flies}=1
V → N (flies): φ_{T,V,N}=1, φ_{E,N,flies}=1
V → V (flies): φ_{T,V,V}=1, φ_{E,V,flies}=1
N → </S>: φ_{T,N,</S>}=1
V → </S>: φ_{T,V,</S>}=1
Slide 21: Calculating Edge Probabilities

● Calculate scores, and take the exponent:

<S> → N (time): e^{w·φ} = 7.39, P = .881    <S> → V (time): e^{w·φ} = 1.00, P = .119
N → N (flies): e^{w·φ} = 1.00, P = .237     N → V (flies): e^{w·φ} = 1.00, P = .644
V → N (flies): e^{w·φ} = 1.00, P = .032     V → V (flies): e^{w·φ} = 1.00, P = .087
N → </S>: e^{w·φ} = 1.00, P = .269          V → </S>: e^{w·φ} = 2.72, P = .731

● This is now the same form as the HMM
  ● Can find the best sequence using the Viterbi algorithm
  ● Can calculate probabilities using forward-backward
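Since the lattice now has the same form as an HMM, the Viterbi algorithm applies directly. A sketch over this two-word example, using the weights from the slides (any feature not listed is assumed to weigh zero; helper names are ours):

```python
# Viterbi search: additive scores w.phi over transition (T) and emission (E)
# features; returns the best-scoring tag sequence.
w = {("T", "<S>", "N"): 1.0, ("E", "N", "time"): 1.0, ("T", "V", "</S>"): 1.0}
TAGS = ["N", "V"]

def viterbi(words):
    best = {"<S>": (0.0, [])}                      # tag -> (score, path so far)
    for word in words:
        new = {}
        for tag in TAGS:
            for prev, (score, path) in best.items():
                s = score + w.get(("T", prev, tag), 0.0) + w.get(("E", tag, word), 0.0)
                if tag not in new or s > new[tag][0]:
                    new[tag] = (s, path + [tag])
        best = new
    # final transition to </S>
    return max((score + w.get(("T", prev, "</S>"), 0.0), path)
               for prev, (score, path) in best.items())

score, path = viterbi(["time", "flies"])
# best sequence: N V, with score 3
```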
Slide 22: Conditional Random Fields
Slide 23: Maximizing CRF Likelihood

● Want to maximize the likelihood for sequences:

w = argmax_w Π_i P(Y_i | X_i; w)    where    P(Y|X) = e^{w·φ(Y,X)} / Σ_{Y'} e^{w·φ(Y',X)}

● For convenience, we consider the log likelihood:

log P(Y|X) = w·φ(Y,X) − log Σ_{Y'} e^{w·φ(Y',X)}

● Want to find the gradient d/dw log P(Y|X) for stochastic gradient descent
Slide 24: Deriving a CRF Gradient

log P(Y|X) = w·φ(Y,X) − log Σ_{Y'} e^{w·φ(Y',X)}
           = w·φ(Y,X) − log Z

d/dw log P(Y|X) = φ(Y,X) − d/dw log Σ_{Y'} e^{w·φ(Y',X)}
                = φ(Y,X) − (1/Z) Σ_{Y'} d/dw e^{w·φ(Y',X)}
                = φ(Y,X) − Σ_{Y'} (e^{w·φ(Y',X)} / Z) φ(Y',X)
                = φ(Y,X) − Σ_{Y'} P(Y'|X) φ(Y',X)
Slide 25: In Other Words...

● To get the gradient, we:

d/dw log P(Y|X) = φ(Y,X) − Σ_{Y'} P(Y'|X) φ(Y',X)

  ● add the correct feature vector φ(Y,X)
  ● subtract the expectation of the features over all candidate sequences Y'
Slide 26: Example

Correct answer: "time/N flies/V". Candidate feature vectors (as in Slide 18) and probabilities:

φ("time/N flies/V"): φ_{T,<S>,N}, φ_{T,N,V}, φ_{T,V,</S>}, φ_{E,N,time}, φ_{E,V,flies} = 1    P = .644
φ("time/V flies/N"): φ_{T,<S>,V}, φ_{T,V,N}, φ_{T,N,</S>}, φ_{E,V,time}, φ_{E,N,flies} = 1    P = .032
φ("time/N flies/N"): φ_{T,<S>,N}, φ_{T,N,N}, φ_{T,N,</S>}, φ_{E,N,time}, φ_{E,N,flies} = 1    P = .237
φ("time/V flies/V"): φ_{T,<S>,V}, φ_{T,V,V}, φ_{T,V,</S>}, φ_{E,V,time}, φ_{E,V,flies} = 1    P = .087

Gradient for each feature:
φ_{T,<S>,N}, φ_{E,N,time}: 1 − .644 − .237 = .119
φ_{T,N,V}: 1 − .644 = .356
φ_{T,V,</S>}, φ_{E,V,flies}: 1 − .644 − .087 = .269
φ_{T,V,N}: 0 − .032 = −.032
φ_{T,N,N}: 0 − .237 = −.237
φ_{T,V,V}: 0 − .087 = −.087
φ_{T,<S>,V}, φ_{E,V,time}: 0 − .032 − .087 = −.119
φ_{T,N,</S>}, φ_{E,N,flies}: 0 − .032 − .237 = −.269
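A brute-force check of these numbers, enumerating all candidate sequences explicitly (feasible only because this example has just 4; variable names are ours):

```python
import math
from itertools import product

# transition/emission weights from the slides; unlisted features weigh 0
w = {("T", "<S>", "N"): 1.0, ("E", "N", "time"): 1.0, ("T", "V", "</S>"): 1.0}
words = ["time", "flies"]
TAGS = ["N", "V"]

def features(tags):
    """Count map of transition (T) and emission (E) features for one sequence."""
    phi = {}
    prev = "<S>"
    for tag, word in zip(tags, words):
        for f in [("T", prev, tag), ("E", tag, word)]:
            phi[f] = phi.get(f, 0) + 1
        prev = tag
    phi[("T", prev, "</S>")] = phi.get(("T", prev, "</S>"), 0) + 1
    return phi

def score(phi):
    return sum(w.get(f, 0.0) * v for f, v in phi.items())

# softmax over all tag sequences
seqs = list(product(TAGS, repeat=len(words)))
exps = {y: math.exp(score(features(y))) for y in seqs}
z = sum(exps.values())

# gradient = phi(Y,X) - sum_Y' P(Y'|X) phi(Y',X), with correct answer Y = (N, V)
grad = dict(features(("N", "V")))
for y in seqs:
    p = exps[y] / z
    for f, v in features(y).items():
        grad[f] = grad.get(f, 0.0) - p * v
# e.g. grad[("T","<S>","N")] ~ .119, grad[("T","N","V")] ~ .356
```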
Slide 27: Combinatorial Explosion

● Problem: the number of hypotheses Y' in the sum is exponential, O(T^|X|), where T is the number of tags:

d/dw log P(Y|X) = φ(Y,X) − Σ_{Y'} P(Y'|X) φ(Y',X)
Slide 28: Calculate Feature Expectations Using Edge Probabilities!

● If we know the edge probabilities, we can just multiply them in:

<S> → N (time): e^{w·φ} = 7.39, P = .881, features φ_{T,<S>,N}, φ_{E,N,time}
<S> → V (time): e^{w·φ} = 1.00, P = .119, features φ_{T,<S>,V}, φ_{E,V,time}
…

φ_{T,<S>,N}, φ_{E,N,time}: 1 − .881 = .119
φ_{T,<S>,V}, φ_{E,V,time}: 0 − .119 = −.119

● Same answer as when we explicitly expand all Y':
φ_{T,<S>,N}, φ_{E,N,time}: 1 − .644 − .237 = .119
φ_{T,<S>,V}, φ_{E,V,time}: 0 − .032 − .087 = −.119
Slide 29: CRF Training Procedure

create map w
for I iterations
    for each labeled pair X, Y in the data
        gradient = φ(Y,X)
        calculate e^{φ(edge)·w} for each edge
        run forward-backward algorithm to get P(edge)
        for each edge
            gradient -= P(edge) * φ(edge)
        w += α * gradient

● Can perform stochastic gradient descent with learning rate α, like logistic regression
● The only major difference is the gradient calculation
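The forward-backward step in this procedure can be sketched for the two-word example (weights from the earlier slides, all others assumed zero; variable and helper names are ours). It recovers the edge marginals of Slide 21 without enumerating sequences:

```python
import math

w = {("T", "<S>", "N"): 1.0, ("E", "N", "time"): 1.0, ("T", "V", "</S>"): 1.0}
TAGS = ["N", "V"]
words = ["time", "flies"]

def psi(prev, tag, word=None):
    """e^{w.phi(edge)}: exponentiated score of one edge."""
    s = w.get(("T", prev, tag), 0.0)
    if word is not None:
        s += w.get(("E", tag, word), 0.0)
    return math.exp(s)

# forward pass: total exponentiated score of all paths reaching each tag
f = [{"<S>": 1.0}]
for word in words:
    f.append({t: sum(f[-1][p] * psi(p, t, word) for p in f[-1]) for t in TAGS})
Z = sum(f[-1][p] * psi(p, "</S>") for p in f[-1])

# backward pass: total exponentiated score of all paths leaving each tag
b = [None] * (len(words) + 1)
b[len(words)] = {t: psi(t, "</S>") for t in TAGS}
for i in range(len(words) - 1, 0, -1):
    b[i] = {t: sum(psi(t, n, words[i]) * b[i + 1][n] for n in TAGS) for t in TAGS}

def edge_prob(i, prev, tag):
    """Marginal probability of the edge prev->tag covering words[i-1]."""
    return f[i - 1][prev] * psi(prev, tag, words[i - 1]) * b[i][tag] / Z

# edge_prob(1, "<S>", "N") ~ .881, edge_prob(2, "N", "V") ~ .644
```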
Slide 30: Learning Algorithms
Slide 31: Batch Learning

● Online learning: update after each example
● Batch learning: update after all examples

Online Stochastic Gradient Descent:
create map w
for I iterations
    for each labeled pair x, y in the data
        w += α * dP(y|x)/dw

Batch Gradient Descent:
create map w
for I iterations
    gradient = 0
    for each labeled pair x, y in the data
        gradient += α * dP(y|x)/dw
    w += gradient
Slide 32: Batch Learning Algorithms: Newton/Quasi-Newton Methods

● Newton-Raphson method:
  ● Choose how far to update using the second-order derivatives (the Hessian matrix)
  ● Faster convergence, but |w|×|w| time and memory
● Limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm (L-BFGS):
  ● Guesses second-order derivatives from first-order ones
  ● Most widely used?
  ● Library: http://www.chokkan.org/software/liblbfgs/
● More information: http://homes.cs.washington.edu/~galen/files/quasi-newton-notes.pdf
Slide 33: Online Learning vs. Batch Learning

● Online:
  ● In general, simpler mathematical derivation
  ● Often converges faster
● Batch:
  ● More stable (does not change based on the order of examples)
  ● Trivially parallelizable
Slide 34: Regularization
Slide 35: Cannot Distinguish Between Large and Small Classifiers

● For these examples:
-1 he saw a bird in the park
+1 he saw a robbery in the park

● Which classifier is better?
Classifier 1: he +3, saw -5, a +0.5, bird -1, robbery +1, in +5, the -3, park -2
Classifier 2: bird -1, robbery +1
Slide 36: Cannot Distinguish Between Large and Small Classifiers (continued)

● Probably classifier 2! It doesn't use irrelevant information.
Slide 37: Regularization

[Plot: L1 and L2 penalties as a function of the weight value]

● A penalty on adding extra weights
● L2 regularization:
  ● Big penalty on large weights, small penalty on small weights
  ● High accuracy
● L1 regularization:
  ● Uniform increase whether large or small
  ● Will cause many weights to become zero → small model
Slide 38: Regularization in Logistic Regression/CRF

● To do so in logistic regression/CRF, we add the penalty to the log likelihood (for the whole corpus):

w = argmax_w (Σ_i log P(Y_i | X_i; w) − c Σ_{w_j ∈ w} w_j²)    (L2 regularization)

● c adjusts the strength of the regularization
  ● smaller: more freedom to fit the data
  ● larger: less freedom to fit the data, better generalization
● L1 is also used; it is slightly more difficult to optimize
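As a sketch of how the penalty enters training: in a per-example SGD step (an approximation, since the slide defines the penalty over the whole corpus), the derivative of c·w² simply pulls every weight toward zero by 2·c·w. The helper name and toy values below are ours:

```python
import math

def l2_sgd_update(w, phi, y, alpha=1.0, c=0.01):
    """One SGD step on dP(y|x)/dw, followed by the L2 penalty gradient."""
    s = sum(w.get(k, 0.0) * v for k, v in phi.items())
    coeff = y * math.exp(s) / (1 + math.exp(s)) ** 2
    for k, v in phi.items():
        w[k] = w.get(k, 0.0) + alpha * coeff * v
    for k in w:
        w[k] -= alpha * 2 * c * w[k]   # d/dw of c*w^2 is 2*c*w
    return w

# with c=0 this is the plain update from Slide 14: w["a"] = -0.25;
# with c=0.01 the weight is shrunk slightly toward zero
```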
Slide 39: Conclusion
Slide 40: Conclusion
● Logistic regression is a probabilistic classifier
● Conditional random fields are probabilistic structured discriminative prediction models
● They can be trained using:
  ● Online stochastic gradient descent (like the perceptron)
  ● Batch learning using a method such as L-BFGS
● Regularization can help solve problems of overfitting
Slide 41: Thank You!