Conditional Random Fields

Transcript
Page 1: Conditional Random Fields

Conditional Random Fields

Page 2: Conditional Random Fields

Sequence Labeling: The Problem

• Given a sequence (in NLP, words), assign appropriate labels to each word.

• For example, POS tagging:

The   cat   sat   on    the   mat   .
DT    NN    VBD   IN    DT    NN    .

Page 3: Conditional Random Fields

Sequence Labeling: The Problem

• Given a sequence (in NLP, words), assign appropriate labels to each word.

• Another example, partial parsing (aka chunking):

The    cat    sat    on     the    mat
B-NP   I-NP   B-VP   B-PP   B-NP   I-NP

Page 4: Conditional Random Fields

Sequence Labeling: The Problem

• Given a sequence (in NLP, words), assign appropriate labels to each word.

• Another example, relation extraction:

The     cat     sat     on      the     mat
B-Arg   I-Arg   B-Rel   I-Rel   B-Arg   I-Arg

Page 5: Conditional Random Fields

The CRF Equation

• A CRF model consists of
  – F = <f1, …, fk>, a vector of "feature functions"
  – θ = <θ1, …, θk>, a vector of weights for each feature function.

• Let O = < o1, …, oT> be an observed sentence

• Let X = <x1, …, xT> be the latent variables.

• The model defines the conditional probability

P(X = x \mid O) = \frac{\exp(\theta \cdot F(x, O))}{\sum_{x'} \exp(\theta \cdot F(x', O))}

• This is the same as the Maximum Entropy equation!

Page 6: Conditional Random Fields

CRF Equation, standard format

• Typically, we write

P(X = x \mid O) = \frac{1}{Z(O)} \exp(\theta \cdot F(x, O))

where

Z(O) = \sum_{x'} \exp(\theta \cdot F(x', O))

• Note that the denominator Z(O) depends on O, but not on x (it marginalizes over x).
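To make the normalization concrete, here is a minimal brute-force sketch of this equation for a toy label set. The feature functions, weights, and sentence are made-up placeholders, and enumerating every label sequence is only feasible for tiny examples; real CRFs avoid this enumeration with dynamic programming, as described later.

```python
import itertools
import math

def crf_prob(x, O, feature_fns, theta, label_set):
    """Brute-force P(X = x | O) = exp(theta . F(x, O)) / Z(O) for a toy CRF."""
    def score(labels):
        # theta . F(labels, O): weighted sum of all feature-function values
        return sum(w * f(labels, O) for w, f in zip(theta, feature_fns))

    Z = sum(math.exp(score(xp))
            for xp in itertools.product(label_set, repeat=len(O)))
    return math.exp(score(x)) / Z

# Toy features: count of N labels, and count of adjacent (D, N) label pairs
feature_fns = [
    lambda labels, O: sum(1 for xi in labels if xi == "N"),
    lambda labels, O: sum(1 for a, b in zip(labels, labels[1:]) if (a, b) == ("D", "N")),
]
theta = [0.5, 1.2]
print(crf_prob(("D", "N", "V"), ("the", "cat", "sat"), feature_fns, theta, ("D", "N", "V")))
```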

Page 7: Conditional Random Fields

Making Structured Predictions

Page 8: Conditional Random Fields

Structured prediction vs. Text Classification

Recall: max. ent. for text classification:

A(doc) = \arg\max_c P(c \mid doc) = \arg\max_c \frac{1}{Z(doc)} \exp(\theta \cdot F(c, doc)) = \arg\max_c \theta \cdot F(c, doc)

CRFs for sequence labeling:

A(O) = \arg\max_y P(y \mid O) = \arg\max_y \frac{1}{Z(O)} \exp(\theta \cdot F(y, O)) = \arg\max_y \theta \cdot F(y, O)

What's the difference?

Page 9: Conditional Random Fields

Structured prediction vs. Text Classification

Two (related) differences, both for the sake of efficiency:

1) Feature functions in CRFs are restricted to graph parts (described later)

2) We can't do brute force to compute the argmax. Instead, we do Viterbi.

Page 10: Conditional Random Fields

Finding the Best Sequence

The best sequence is

\hat{x} = \arg\max_x P(X = x \mid O) = \arg\max_x \frac{1}{Z(O)} \exp(\theta \cdot F(x, O)) = \arg\max_x \theta \cdot F(x, O)

Recall from the HMM discussion: if there are K possible states for each x_i variable, and N total x_i variables, then there are K^N possible settings for x. So brute force can't find the best sequence. Instead, we resort to a Viterbi-like dynamic program.

Page 11: Conditional Random Fields

Viterbi Algorithm

\delta_j(t) = \max_{x_1 \ldots x_{t-1}} P(x_1 \ldots x_{t-1}, o_1 \ldots o_{t-1}, x_t = h_j, o_t)

The state sequence which maximizes the score of seeing the observations to time t-1, landing in state h_j at time t, and seeing the observation at time t.

[Figure: trellis over hidden states X_1 ... X_{t-1}, X_t = h_j and observations o_1 ... o_T]

Page 12: Conditional Random Fields

Viterbi Algorithm

\hat{X}_T = \arg\max_i \delta_i(T)

\hat{X}_t = \psi_{\hat{X}_{t+1}}(t+1)

P(\hat{X}) = \max_i \delta_i(T)

Compute the most likely state sequence by working backwards.

[Figure: trellis over hidden states x_1 ... x_T and observations o_1 ... o_T]

Page 13: Conditional Random Fields

Viterbi Algorithm

Recursive computation:

\delta_j(t+1) = \max_i \, \delta_i(t) \, a_{ij} \, b_{j o_{t+1}}

\psi_j(t+1) = \arg\max_i \, \delta_i(t) \, a_{ij} \, b_{j o_{t+1}}

where, as before,

\delta_j(t) = \max_{x_1 \ldots x_{t-1}} P(x_1 \ldots x_{t-1}, o_1 \ldots o_{t-1}, x_t = h_j, o_t)

[Figure: trellis over X_1 ... X_{t-1}, X_t = h_j, X_{t+1} and observations o_1 ... o_T; the a_{ij} and b_{j o_{t+1}} terms are marked "??!"]
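For concreteness, here is a minimal sketch of this recursion in Python; the parameter names (pi, a, b) and the NumPy array layout are assumptions for the illustration, not anything from the slides.

```python
import numpy as np

def viterbi(pi, a, b, obs):
    """HMM Viterbi: delta[t, j] = max_i delta[t-1, i] * a[i, j] * b[j, obs[t]].
    pi: (K,) initial state probs; a: (K, K) transitions; b: (K, V) emissions;
    obs: list of observation indices. Returns the most likely state sequence."""
    K, T = len(pi), len(obs)
    delta = np.zeros((T, K))           # best score of any path ending in state j at time t
    psi = np.zeros((T, K), dtype=int)  # backpointers
    delta[0] = pi * b[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * a * b[:, obs[t]][None, :]  # indexed (i, j)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0)
    # Backtrack from the best final state
    path = [int(delta[T - 1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```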

Page 14: Conditional Random Fields

Feature functions and Graph parts

To make efficient computation (dynamic programs) possible, we restrict the feature functions to:

Graph parts (or just parts): A feature function that counts how often a particular configuration occurs for a clique in the CRF graph.

Clique: a set of completely connected nodes in a graph. That is, each node in the clique has an edge connecting it to every other node in the clique.

Page 15: Conditional Random Fields

Clique Example

The cliques in a linear chain CRF are the set of individual nodes, and the set of pairs of consecutive nodes.

[Figure: linear-chain CRF over label nodes X1–X6, each connected to its observation o1–o6 and to the neighboring label nodes]

Page 16: Conditional Random Fields

Clique Example

The cliques in a linear chain CRF are the set of individual nodes, and the set of pairs of consecutive nodes.

[Figure: the same linear-chain CRF, with the individual-node cliques {X1}, {X2}, …, {X6} highlighted]

Page 17: Conditional Random Fields

Clique Example

The cliques in a linear chain CRF are the set of individual nodes, and the set of pairs of consecutive nodes.

[Figure: the same linear-chain CRF, with the pair-of-node cliques {X1,X2}, {X2,X3}, …, {X5,X6} highlighted]

Page 18: Conditional Random Fields

Clique Example

For non-linear-chain CRFs (something we won’t normally consider in this class), you can get larger cliques:

[Figure: a non-linear-chain CRF with an extra label node X5' whose connections create larger cliques]

Page 19: Conditional Random Fields

Graph part as Feature Function Example

Graph parts are feature functions f(x,o) that count how many cliques have a particular configuration.

For example, f(x,o) = count of [xi = Noun].

Here, x2 and x6 are both Nouns, so f(x,o) = 2.

[Figure: linear-chain CRF with labels x1=D, x2=N, x3=V, x4=D, x5=A, x6=N over observations o1–o6]

Page 20: Conditional Random Fields

Graph part as Feature Function Example

For a pair-of-nodes example, f(x,o) = count of [xi = Noun,xi+1=Verb]

Here, x2 is a Noun and x3 is a Verb, so f(x,o) = 1.

[Figure: the same CRF with labels x1=D, x2=N, x3=V, x4=D, x5=A, x6=N over observations o1–o6]
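A small sketch of the two graph-part feature functions from the last two slides, using the label sequence from the figures. The function names and observation words are illustrative only.

```python
def count_noun_nodes(labels, obs):
    """Node-clique part: f(x, o) = count of [x_i = Noun]."""
    return sum(1 for x in labels if x == "N")

def count_noun_verb_pairs(labels, obs):
    """Pair-of-nodes part: f(x, o) = count of [x_i = Noun, x_{i+1} = Verb]."""
    return sum(1 for a, b in zip(labels, labels[1:]) if (a, b) == ("N", "V"))

labels = ["D", "N", "V", "D", "A", "N"]
obs = ["The", "cat", "chased", "the", "tiny", "fly"]
print(count_noun_nodes(labels, obs))       # 2: x2 and x6 are Nouns
print(count_noun_verb_pairs(labels, obs))  # 1: x2=N, x3=V
```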

Page 21: Conditional Random Fields

Features can depend on the whole observation

In a CRF, each feature function can depend on o, in addition to a clique in x

Normally, we draw a CRF like this:

[Figure: an HMM (directed edges from each X_i to o_i and from X_i to X_{i+1}) shown next to a linear-chain CRF (undirected edges) over X1–X6 and o1–o6]

Page 22: Conditional Random Fields

Features can depend on the whole observation

In a CRF, each feature function can depend on o, in addition to a clique in x

But really, it’s more like this:

This would cause problems for a generative model, but in a conditional model, o is always a fixed constant. So we can still run relevant algorithms like Viterbi efficiently.

[Figure: the HMM as before, and the CRF redrawn with every label node X_i connected to the entire observation sequence o1–o6]

Page 23: Conditional Random Fields

Graph part as Feature Function Example

An example part including x and o: f(x,o) = count of [x_i = A or D, x_{i+1} = N, o_2 = cat]

Here, x1 is a D and x2 is an N, plus x5 is an A and x6 is an N, plus o2 = cat, so f(x,o) = 2.

Notice that the clique x5-x6 is allowed to depend on o2.

[Figure: CRF with labels x1=D (The), x2=N (cat), x3=V (chased), x4=D (the), x5=A (tiny), x6=N (fly)]

Page 24: Conditional Random Fields

Graph part as Feature Function Example

A more usual example including x and o: f(x,o) = count of [x_i = A or D, x_{i+1} = N, o_{i+1} = cat]

Here, x1 is a D and x2 is an N, plus o2 = cat, so f(x,o) = 1.

[Figure: the same CRF with labels x1=D (The), x2=N (cat), x3=V (chased), x4=D (the), x5=A (tiny), x6=N (fly)]

Page 25: Conditional Random Fields

The CRF Equation, with Parts

• A CRF model consists of
  – P = <p1, …, pk>, a vector of parts
  – θ = <θ1, …, θk>, a vector of weights for each part.

• Let O = < o1, …, oT> be an observed sentence

• Let X = <x1, …, xT> be the latent variables.

P(X = x \mid O) = \frac{1}{Z(O)} \exp(\theta \cdot P(x, O))

Page 26: Conditional Random Fields

Viterbi Algorithm – 2nd Try

Recursive computation:

\delta_j(t+1) = \max_i \left[ \delta_i(t) + \theta_{one} \cdot P_{one}(x_{t+1} = h_j, o) + \theta_{pair} \cdot P_{pair}(x_t = h_i, x_{t+1} = h_j, o) \right]

\psi_j(t+1) = \arg\max_i \left[ \delta_i(t) + \theta_{one} \cdot P_{one}(x_{t+1} = h_j, o) + \theta_{pair} \cdot P_{pair}(x_t = h_i, x_{t+1} = h_j, o) \right]

where

\delta_j(t) = \max_{x_1 \ldots x_{t-1}} \theta \cdot P(x_1 \ldots x_{t-1}, x_t = h_j, o)

[Figure: trellis over X_1 ... X_{t-1}, X_t = h_j, X_{t+1} and observations o_1 ... o_T]
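A minimal sketch of this CRF Viterbi recursion, assuming the node and pair scores (the θ_one · P_one and θ_pair · P_pair terms) have already been collapsed into arrays; the array names are placeholders for illustration.

```python
import numpy as np

def crf_viterbi(node_scores, pair_scores):
    """Viterbi for a linear-chain CRF.
    node_scores: (T, K) array, node_scores[t, j] = theta_one . P_one(x_t = j, o)
    pair_scores: (K, K) array, pair_scores[i, j] = theta_pair . P_pair(x_t = i, x_{t+1} = j, o)
    (Assumed precomputed from the feature functions; pair scores could also vary
    with t without changing the algorithm.)  Returns the best label sequence."""
    T, K = node_scores.shape
    delta = np.zeros((T, K))
    psi = np.zeros((T, K), dtype=int)
    delta[0] = node_scores[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + pair_scores + node_scores[t][None, :]
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0)
    path = [int(delta[T - 1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```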

Page 27: Conditional Random Fields

Supervised Parameter Estimation

Page 28: Conditional Random Fields

Conditional Training

• Given a set of observations o and the correct labels x for each, determine the best θ:

\theta^* = \arg\max_\theta P(x \mid o, \theta)

• Because the CRF equation is just a special form of the maximum entropy equation, we can train it exactly the same way:
  – Determine the gradient
  – Step in the direction of the gradient
  – Repeat until convergence

Page 29: Conditional Random Fields

Recall: Training a ME model

Training is an optimization problem: find the value of λ that maximizes the conditional log-likelihood of the training data:

CLL(Train) = \sum_{(c,d) \in Train} \log P(c \mid d) = \sum_{(c,d) \in Train} \left[ \sum_i \lambda_i f_i(c,d) - \log Z(d) \right]

Page 30: Conditional Random Fields

Recall: Training a ME model

Optimization is normally performed using some form of gradient ascent:

0) Initialize λ^0 to 0
1) Compute the gradient: ∇CLL
2) Take a step in the direction of the gradient: λ^{i+1} = λ^i + α ∇CLL
3) Repeat until CLL doesn't improve: stop when |CLL(λ^{i+1}) – CLL(λ^i)| < ε
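A minimal sketch of this loop, assuming hypothetical helpers cll(lam) and cll_gradient(lam) that compute the conditional log-likelihood and its gradient (as derived on the next slides).

```python
import numpy as np

def train_maxent(cll, cll_gradient, dim, alpha=0.1, eps=1e-6, max_iters=1000):
    """Gradient ascent on the conditional log-likelihood.
    cll(lam) -> float and cll_gradient(lam) -> (dim,) array are assumed to be
    provided by the model (hypothetical helpers for this sketch)."""
    lam = np.zeros(dim)                        # 0) initialize lambda to 0
    prev = cll(lam)
    for _ in range(max_iters):
        lam = lam + alpha * cll_gradient(lam)  # 1-2) step along the gradient
        curr = cll(lam)
        if abs(curr - prev) < eps:             # 3) stop when CLL no longer improves
            break
        prev = curr
    return lam
```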

Page 31: Conditional Random Fields

Recall: Training a ME model

Computing the gradient:

\frac{\partial}{\partial \lambda_i} CLL(Train) = \frac{\partial}{\partial \lambda_i} \sum_{(c,d) \in Train} \left[ \sum_j \lambda_j f_j(c,d) - \log Z(d) \right]

= \sum_{(c,d) \in Train} f_i(c,d) - \frac{\partial}{\partial \lambda_i} \sum_{(c,d) \in Train} \log \sum_{c'} \exp \sum_j \lambda_j f_j(c',d)

= \sum_{(c,d) \in Train} f_i(c,d) - \sum_{(c,d) \in Train} \frac{\sum_{c'} f_i(c',d) \exp \sum_j \lambda_j f_j(c',d)}{\sum_{c''} \exp \sum_j \lambda_j f_j(c'',d)}

= \sum_{(c,d) \in Train} f_i(c,d) - \sum_{(c,d) \in Train} E_{P(c' \mid d)}\left[ f_i(c',d) \right]

Page 32: Conditional Random Fields

Recall: Training a ME model

Computing the gradient (expanding the expectation):

\frac{\partial}{\partial \lambda_i} CLL(Train) = \sum_{(c,d) \in Train} f_i(c,d) - \sum_{(c,d) \in Train} \sum_{c'} P(c' \mid d) \, f_i(c',d)

The second term involves a sum over all possible classes c'.

Page 33: Conditional Random Fields

Recall: Training a ME model: Expected feature counts

• In ME models, each document d is classified independently.

• The sum

\sum_{c'} P(c' \mid d) \, f_i(c', d)

involves as many terms as there are classes c'.

• Very doable.
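A small sketch of this expected count for a maxent model; the argument names are placeholders, and P(c'|d) is computed by brute force over the class set, which is exactly why this is "very doable" here.

```python
import math

def expected_feature_count(d, classes, feature_fns, lam, i):
    """E_{P(c'|d)}[ f_i(c', d) ]: the model's expected count of feature i on document d.
    P(c'|d) is the maxent distribution exp(lam . f(c', d)) / Z(d)."""
    scores = {c: math.exp(sum(w * f(c, d) for w, f in zip(lam, feature_fns)))
              for c in classes}
    Z = sum(scores.values())
    return sum((scores[c] / Z) * feature_fns[i](c, d) for c in classes)
```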

Page 34: Conditional Random Fields

Training a CRF

\frac{\partial}{\partial \theta_i} CLL(Train) = \frac{\partial}{\partial \theta_i} \sum_{(x,o) \in Train} \left[ \sum_t \theta \cdot f(x_t, x_{t+1}, o) - \log Z(o) \right]

= \sum_{(x,o) \in Train} \sum_t f_i(x_t, x_{t+1}, o) - \sum_{(x,o) \in Train} \sum_{x'} P(x' \mid o) \sum_t f_i(x'_t, x'_{t+1}, o)

The expected-count term (the sum over all possible label sequences x') is the hard part for CRFs.

Page 35: Conditional Random Fields

Training a CRF: Expected feature counts

• For CRFs, the term

\sum_{x'} P(x' \mid o) \sum_t f_i(x'_t, x'_{t+1}, o)

involves an exponential sum (one term for every possible label sequence x').

• The solution again involves dynamic programming, very similar to the Forward algorithm for HMMs.
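A minimal sketch of that Forward-style dynamic program, reusing the node/pair score arrays assumed in the Viterbi sketch above. It computes log Z(o); combining a forward and a backward pass in the same way yields the expected feature counts.

```python
import numpy as np
from scipy.special import logsumexp

def crf_log_partition(node_scores, pair_scores):
    """Forward algorithm for a linear-chain CRF: computes log Z(o).
    node_scores: (T, K); pair_scores: (K, K), as in the Viterbi sketch above.
    Replacing max with logsumexp in the Viterbi recursion turns the best-path
    score into the log of the sum over all label sequences."""
    T, K = node_scores.shape
    alpha = node_scores[0].copy()
    for t in range(1, T):
        alpha = logsumexp(alpha[:, None] + pair_scores, axis=0) + node_scores[t]
    return logsumexp(alpha)
```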

Page 36: Conditional Random Fields

CRFs vs. HMMs

Page 37: Conditional Random Fields

Generative (Joint Probability) Models

• HMMs are generative models: That is, they can compute the joint probability P(sentence, hidden-states)

• From a generative model, one can compute:
  – Two conditional models:
    • P(sentence | hidden-states) and
    • P(hidden-states | sentence)
  – Marginal models P(sentence) and P(hidden-states)

• For sequence labeling, we want P(hidden-states | sentence)

Page 38: Conditional Random Fields

Discriminative (Conditional) Models

• Most often, people are most interested in the conditional probability P(hidden-states | sentence). For example, this is the distribution needed for sequence labeling.

• Discriminative (also called conditional) models directly represent the conditional distribution P(hidden-states | sentence)
  – These models cannot tell you the joint distribution, marginals, or other conditionals.
  – But they're quite good at this particular conditional distribution.

Page 39: Conditional Random Fields

Discriminative vs. Generative

Marginal, or language model P(sentence):
  – HMM (generative): Forward algorithm or Backward algorithm, linear in length of sentence.
  – CRF (discriminative): Can't do it.

Find optimal label sequence:
  – HMM: Viterbi, linear in length of sentence.
  – CRF: Viterbi, linear in length of sentence.

Supervised parameter estimation:
  – HMM: Bayesian learning, easy and fast.
  – CRF: Convex optimization, can be slow-ish (multiple passes through the data).

Unsupervised parameter estimation:
  – HMM: Baum-Welch (non-convex optimization), slow but doable.
  – CRF: Very difficult, and requires making extra assumptions.

Feature functions:
  – HMM: Parents and children in the graph. Restrictive!
  – CRF: Arbitrary functions of a latent state and any portion of the observed nodes.

Page 40: Conditional Random Fields

CRFs vs. HMMs, a closer look

It's possible to convert an HMM into a CRF:

Set p_{prior,state}(x,o) = count[x_1 = state]
Set θ_{prior,state} = log P_HMM(x_1 = state) = log π_state

Set p_{trans,state1,state2}(x,o) = count[x_i = state1, x_{i+1} = state2]
Set θ_{trans,state1,state2} = log P_HMM(x_{i+1} = state2 | x_i = state1) = log A_{state1,state2}

Set p_{obs,state,word}(x,o) = count[x_i = state, o_i = word]
Set θ_{obs,state,word} = log P_HMM(o_i = word | x_i = state) = log B_{state,word}
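A small sketch of this construction, assuming the HMM parameters are given as nested dictionaries (pi, A, and B are placeholder names for this example).

```python
import math

def hmm_to_crf_weights(pi, A, B):
    """Convert HMM parameters into CRF feature weights using the construction above.
    pi[s]: initial probs; A[s1][s2]: transition probs; B[s][w]: emission probs.
    Returns a dict of weights keyed by the corresponding graph part."""
    theta = {}
    for s, p in pi.items():
        theta[("prior", s)] = math.log(p)              # part: count[x_1 = s]
    for s1, row in A.items():
        for s2, p in row.items():
            theta[("trans", s1, s2)] = math.log(p)     # part: count[x_i = s1, x_{i+1} = s2]
    for s, row in B.items():
        for w, p in row.items():
            theta[("obs", s, w)] = math.log(p)         # part: count[x_i = s, o_i = w]
    return theta
```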

Page 41: Conditional Random Fields

CRF vs. HMM, a closer look

If we convert an HMM to a CRF, all of the CRF parameters θ will be logs of probabilities. Therefore, they will all be between –∞ and 0.

Notice: CRF parameters can be between –∞ and +∞.

So, how do HMMs and CRFs compare in terms of bias and variance (as sequence labelers)?
  – HMMs have more bias
  – CRFs have more variance

Page 42: Conditional Random Fields

Comparing feature functions

The biggest advantage of CRFs over HMMs is that they can handle overlapping features.

For example, for POS tagging, using words as features (like xi="the" or xj="jogging") is quite useful.

However, it’s often also useful to use “orthographic” features, like “the word ends in –ing” or “the word starts with a capital letter.”

These features overlap: some words end in “ing”, some don’t.

• Generative models have trouble handling overlapping features correctly

• Discriminative models don’t: they can simply use the features.

Page 43: Conditional Random Fields

CRF Example

A CRF POS Tagger for English

Page 44: Conditional Random Fields

Vocabulary

We need to determine the set of possible word types V.

Let V = {all types in 1 million tokens of Wall Street Journal text, which we'll use for training} ∪ {UNKNOWN} (for word types we haven't seen)

Page 45: Conditional Random Fields

L = Label Set

Standard Penn Treebank tagset

Number Tag Description

1. CC Coordinating conjunction

2. CD Cardinal number

3. DT Determiner

4. EX Existential there

5. FW Foreign word

6. IN Preposition or subordinating conjunction

7. JJ Adjective

8. JJR Adjective, comparative


9. JJS Adjective, superlative

10. LS List item marker

11. MD Modal

12. NN Noun, singular or mass

13. NNS Noun, plural

14. NNP Proper noun, singular

15. NNPS Proper noun, plural

16. PDT Predeterminer

17. POS Possessive ending

Page 46: Conditional Random Fields

L = Label Set

18. PRP Personal pronoun

19. PRP$ Possessive pronoun

20. RB Adverb

21. RBR Adverb, comparative

22. RBS Adverb, superlative

23. RP Particle

24. SYM Symbol

25. TO to

26. UH Interjection

27. VB Verb, base form

28. VBD Verb, past tense

29. VBG Verb, gerund or present participle


30. VBN Verb, past participle

31. VBP Verb, non-3rd person singular present

32. VBZ Verb, 3rd person singular present

33. WDT Wh-determiner

34. WP Wh-pronoun

35. WP$ Possessive wh-pronoun

36. WRB Wh-adverb

Page 47: Conditional Random Fields

CRF Features

Feature type and description:

• Prior: x_i = k

• Transition: x_i = k and x_{i+1} = k'

• Word:
  – x_i = k and o_i = w
  – x_i = k and o_{i-1} = w
  – x_i = k and o_{i+1} = w
  – x_i = k and o_i = w and o_{i-1} = w'
  – x_i = k and o_i = w and o_{i+1} = w'

• Orthography, suffix: for s in {"ing", "ed", "ogy", "s", "ly", "ion", "tion", "ity", …}: x_i = k and o_i ends with s

• Orthography, punctuation:
  – x_i = k and o_i is capitalized
  – x_i = k and o_i is hyphenated
  – x_i = k and o_i contains a period
  – x_i = k and o_i is ALL CAPS
  – x_i = k and o_i contains a digit (0-9)
  – …
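As an illustration, here is a small sketch that instantiates a few of these templates for one position. The feature names and tuple encoding are made up for the example; a real tagger would also generate the prior, transition, and context-word features.

```python
def word_and_orthographic_features(label, obs, i):
    """Instantiate some word and orthographic feature templates for position i."""
    w = obs[i]
    feats = [("word", label, w)]                      # x_i = label and o_i = w
    for suffix in ("ing", "ed", "ogy", "s", "ly", "ion", "tion", "ity"):
        if w.endswith(suffix):
            feats.append(("suffix", label, suffix))   # x_i = label and o_i ends with suffix
    if w[:1].isupper():
        feats.append(("capitalized", label))
    if "-" in w:
        feats.append(("hyphenated", label))
    if any(ch.isdigit() for ch in w):
        feats.append(("has-digit", label))
    return feats

print(word_and_orthographic_features("VBG", ["Fido", "was", "jogging"], 2))
```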