
Sequence Labeling

Grzegorz Chrupała and Nicolas Stroppa

Google · Saarland University


2010

Outline

1 Hidden Markov Models

2 Maximum Entropy Markov Models

3 Sequence perceptron


Entity recognition in news

West Indian all-rounder Phil Simons/PERSON took four for 38 on Friday as Leicestershire ...

We want to categorize news articles based on which entities they talk about

We can annotate a number of articles with appropriate labels

And learn a model from the annotated data

Assigning labels to words in a sentence is an example of a sequence labeling task


Sequence labeling

Word            POS   Chunk  NE
West            NNP   B-NP   B-MISC
Indian          NNP   I-NP   I-MISC
all-rounder     NN    I-NP   O
Phil            NNP   I-NP   B-PER
Simons          NNP   I-NP   I-PER
took            VBD   B-VP   O
four            CD    B-NP   O
for             IN    B-PP   O
38              CD    B-NP   O
on              IN    B-PP   O
Friday          NNP   B-NP   O
as              IN    B-PP   O
Leicestershire  NNP   B-NP   B-ORG
beat            VBD   B-VP   O


Sequence labeling

Assigning sequences of labels to sequences of some objects is a very common task (NLP, bioinformatics)

In NLP

- speech recognition
- POS tagging
- chunking (shallow parsing)
- named-entity recognition


In general, learn a function h : Σ* → L* to assign a sequence of labels from L to the sequence of input elements from Σ

The most easily tractable case: each element of the input sequence receives exactly one label:

h : Σ^n → L^n

In cases where this does not naturally hold, such as chunking, we decompose the task so that it is satisfied.

IOB scheme: each element gets a label indicating whether it is initial in chunk X (B-X), non-initial in chunk X (I-X), or outside of any chunk (O).
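
To make the IOB scheme concrete, here is a small sketch in Python of encoding chunk spans as IOB labels and decoding them back (the function names are illustrative, not from the slides):

    def to_iob(tokens, chunks):
        # chunks: list of (start, end, type) spans, end exclusive
        labels = ["O"] * len(tokens)
        for start, end, typ in chunks:
            labels[start] = "B-" + typ
            for i in range(start + 1, end):
                labels[i] = "I-" + typ
        return labels

    def from_iob(labels):
        # Recover (start, end, type) spans from per-token IOB labels
        chunks, start = [], None
        for i, lab in enumerate(labels + ["O"]):      # sentinel flushes the last chunk
            if start is not None and not lab.startswith("I-"):
                chunks.append((start, i, labels[start][2:]))
                start = None
            if lab.startswith("B-"):
                start = i
        return chunks

    tokens = ["West", "Indian", "all-rounder", "Phil", "Simons"]
    print(to_iob(tokens, [(0, 2, "MISC"), (3, 5, "PER")]))
    # ['B-MISC', 'I-MISC', 'O', 'B-PER', 'I-PER']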


Local classifier

The simplest approach to sequence labeling is to just use a regular classifier and make a local decision for each word.

Predictions for previous words can be used in predicting the current word

This straightforward strategy can sometimes give surprisingly good results
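
As a sketch of this strategy: greedy left-to-right tagging that feeds each previous prediction back in as a feature. Here predict_label stands in for any trained classifier and is not something defined in the slides:

    def greedy_tag(tokens, predict_label):
        labels = []
        for i, tok in enumerate(tokens):
            feats = {
                "word": tok,
                "prev_label": labels[i - 1] if i > 0 else "<s>",  # previous prediction
            }
            labels.append(predict_label(feats))
        return labels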


Outline

1 Hidden Markov Models

2 Maximum Entropy Markov Models

3 Sequence perceptron


HMM refresher

HMMs: simplified models of the process generating the sequences of interest

Observations are generated by hidden states
- States are analogous to classes
- There are dependencies between states


Formally

Sequence of observations x = x_1, x_2, ..., x_N

Corresponding hidden states z = z_1, z_2, ..., z_N

ẑ = argmax_z P(z | x)
  = argmax_z P(x | z) P(z) / Σ_{z'} P(x | z') P(z')
  = argmax_z P(x, z)

By the chain rule (no independence assumptions yet):

P(x, z) = ∏_{i=1}^{N} P(x_i | x_1, ..., x_{i-1}, z_1, ..., z_i) P(z_i | x_1, ..., x_{i-1}, z_1, ..., z_{i-1})


Simplifying assumptions

The current state depends only on the previous state

Previous observations influence the current one only via the states

P(x_1, x_2, ..., x_N, z_1, z_2, ..., z_N) = ∏_{i=1}^{N} P(x_i | z_i) P(z_i | z_{i-1})

P(x_i | z_i) – emission probabilities

P(z_i | z_{i-1}) – transition probabilities
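
This factorization is straightforward to transcribe into code. A minimal sketch in Python, assuming the model is given as plain dictionaries pi (initial state distribution), A (transition), and E (emission); these names are illustrative, not from the slides:

    # Joint probability P(x, z) = pi[z_1] E[z_1][x_1] × ∏_i A[z_{i-1}][z_i] E[z_i][x_i]
    def joint_prob(x, z, pi, A, E):
        p = pi[z[0]] * E[z[0]][x[0]]
        for i in range(1, len(x)):
            p *= A[z[i - 1]][z[i]] * E[z[i]][x[i]]
        return p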


A real Markov process

A dishonest casino

A casino has two dice:
- Fair die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
- Loaded die: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10 and P(6) = 1/2

The casino player switches back and forth between the fair and the loaded die once every 20 turns on average


Evaluation question

Given a sequence of rolls: 1245526462146146136136661664661636616366163616515615115146123562344

How likely is this sequence, given our model of the casino?


Decoding question

Given a sequence of rolls: 1245526462146146136136661664661636616366163616515615115146123562344

Which throws were generated by the fair die and which by the loaded die?


Learning question

Given a sequence of rolls: 1245526462146146136136661664661636616366163616515615115146123562344

Can we infer how the casino works? How loaded is the loaded die? How often does the casino player switch between the dice?


The dishonest casino model

[Figure: two-state HMM diagram with a Fair and a Loaded state, self-transition probability 0.95, switching probability 0.05, and the emission probabilities given above]

Example

Let the sequence of rolls be: x = (1, 2, 1, 5, 6, 2, 1, 6, 2, 4)

A candidate parse is z = (F, F, F, F, F, F, F, F, F, F)

What is the probability P(x, z)?

P(x, z) = ∏_{i=1}^{N} P(x_i | z_i) P(z_i | z_{i-1})

(Let's assume initial transition probabilities P(F | 0) = P(L | 0) = 1/2)

P(x, z) = 1/2 × P(1 | F) P(F | F) × P(2 | F) P(F | F) × ... × P(4 | F)
        = 1/2 × (1/6)^10 × 0.95^9
        = 5.21 × 10^-9


Example

What about the parse z = (L, L, L, L, L, L, L, L, L, L)?

P(x, z) = 1/2 × P(1 | L) P(L | L) × P(2 | L) P(L | L) × ... × P(4 | L)
        = 1/2 × (1/10)^8 × (1/2)^2 × 0.95^9
        = 7.9 × 10^-10

So it is about 6.6 times more likely that all the throws came from the fair die than from the loaded one.
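
These numbers can be checked mechanically with the joint_prob sketch from earlier, using the same hypothetical dictionaries:

    pi = {"F": 0.5, "L": 0.5}
    A = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.05, "L": 0.95}}
    E = {"F": {r: 1 / 6 for r in range(1, 7)},
         "L": {**{r: 1 / 10 for r in range(1, 7)}, 6: 0.5}}
    x = [1, 2, 1, 5, 6, 2, 1, 6, 2, 4]
    print(joint_prob(x, ["F"] * 10, pi, A, E))   # ≈ 5.21e-09
    print(joint_prob(x, ["L"] * 10, pi, A, E))   # ≈ 7.90e-10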

Stroppa and Chrupala (UdS) Sequences 2010 19 / 37

Page 22: Grzegorz Chrupa la and Nicolas Stroppagrzegorz.chrupala.me/papers/ml4nlp/sequence-labeling.pdf · Sequence Labeling Grzegorz Chrupa la and Nicolas Stroppa Google Saarland University

Example

Now let the throws be: x = (1, 6, 6, 5, 6, 2, 6, 6, 3, 6)

What is P(x, F^10) now?

1/2 × (1/6)^10 × 0.95^9 = 5.21 × 10^-9

Same as before

What is P(x, L^10)?

1/2 × (1/10)^4 × (1/2)^6 × 0.95^9 ≈ 4.9 × 10^-7

So now it is about 100 times more likely that all the throws came from the loaded die
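
Again, this can be verified with the same sketch and dictionaries:

    x2 = [1, 6, 6, 5, 6, 2, 6, 6, 3, 6]
    print(joint_prob(x2, ["F"] * 10, pi, A, E))   # ≈ 5.21e-09, same as before
    print(joint_prob(x2, ["L"] * 10, pi, A, E))   # ≈ 4.92e-07, roughly 100× larger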


Decoding

Given x we want to find the best z, i.e. the one which maximizes P(x, z)

ẑ = argmax_z P(x, z)

Enumerate all possible z and evaluate P(x, z)?

The number of candidate sequences is exponential in the length of the input

Dynamic programming to the rescue


Decoding

Store intermediate results in a table for reuse

Score to remember: the probability of the most likely sequence of states up to position i, with the state at position i being k

V_k(i) = max_{z_1, ..., z_{i-1}} P(x_1, ..., x_{i-1}, z_1, ..., z_{i-1}, x_i, z_i = k)


Decoding

We can define V_l(i) recursively:

V_l(i+1) = max_{z_1, ..., z_i} P(x_1, ..., x_i, z_1, ..., z_i, x_{i+1}, z_{i+1} = l)
         = max_{z_1, ..., z_i} P(x_{i+1}, z_{i+1} = l | x_1, ..., x_i, z_1, ..., z_i) × P(x_1, ..., x_i, z_1, ..., z_i)
         = max_{z_1, ..., z_i} P(x_{i+1}, z_{i+1} = l | z_i) P(x_1, ..., x_i, z_1, ..., z_i)
         = max_k P(x_{i+1}, z_{i+1} = l | z_i = k) max_{z_1, ..., z_{i-1}} P(x_1, ..., x_i, z_1, ..., z_{i-1}, z_i = k)
         = max_k P(x_{i+1}, z_{i+1} = l | z_i = k) V_k(i)
         = P(x_{i+1} | z_{i+1} = l) max_k P(z_{i+1} = l | z_i = k) V_k(i)

We introduce simplified notation for the parameters, A_kl = P(z_{i+1} = l | z_i = k) and E_l(x) = P(x | z = l), so that

V_l(i+1) = E_l(x_{i+1}) max_k A_kl V_k(i)


Viterbi algorithm

Input: x = (x_1, ..., x_N)

Initialization:
  V_0(0) = 1, where 0 is the fake starting position
  V_k(0) = 0 for all k > 0

Recursion:
  V_l(i) = E_l(x_i) max_k A_kl V_k(i-1)
  Z_l(i) = argmax_k A_kl V_k(i-1)

Termination:
  P(x, ẑ) = max_k V_k(N)
  ẑ_N = argmax_k V_k(N)

Traceback:
  ẑ_{i-1} = Z_{ẑ_i}(i)
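
A minimal Python sketch of this algorithm, assuming the same kind of model dictionaries as before (pi, A, E); a production implementation would work in log space to avoid underflow:

    def viterbi(x, states, pi, A, E):
        # V[i][l]: probability of the best state sequence for x[0..i] ending in state l
        V = [{l: pi[l] * E[l][x[0]] for l in states}]
        back = [{}]
        for i in range(1, len(x)):
            V.append({}); back.append({})
            for l in states:
                k_best = max(states, key=lambda k: V[i - 1][k] * A[k][l])
                V[i][l] = V[i - 1][k_best] * A[k_best][l] * E[l][x[i]]
                back[i][l] = k_best
        # Termination: pick the best final state, then trace back
        z = [max(states, key=lambda k: V[-1][k])]
        for i in range(len(x) - 1, 0, -1):
            z.append(back[i][z[-1]])
        return list(reversed(z))

    # Decoding the second example sequence with the casino dictionaries from above:
    print(viterbi([1, 6, 6, 5, 6, 2, 6, 6, 3, 6], ["F", "L"], pi, A, E))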


Learning HMM

Learning from labeled data:
- Estimate parameters (emission and transition probabilities) from (smoothed) relative counts; see the sketch after this list

  A_kl = C(k, l) / Σ_{l'} C(k, l')        E_k(x) = C(k, x) / Σ_{x'} C(k, x')

Learning from unlabeled data with Expectation Maximization:
- Start with randomly initialized parameters θ_0
- Iterate until convergence:
  - Compute a (soft) labeling given the current θ_i
  - Compute updated parameters θ_{i+1} from this labeling
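
For the supervised case, a sketch of count-based estimation in Python; the add-one smoothing here is an assumption for illustration, and the function and variable names are not from the slides:

    from collections import Counter

    def estimate_hmm(data, states, vocab, alpha=1.0):
        # data: list of (x, z) pairs of equal-length observation/state sequences
        trans, emit = Counter(), Counter()
        for x, z in data:
            prev = "<s>"                 # fake start state
            for xi, zi in zip(x, z):
                trans[(prev, zi)] += 1   # C(k, l)
                emit[(zi, xi)] += 1      # C(k, x)
                prev = zi
        A = {k: {l: (trans[(k, l)] + alpha) /
                    (sum(trans[(k, l2)] for l2 in states) + alpha * len(states))
                 for l in states}
             for k in states + ["<s>"]}
        E = {k: {x: (emit[(k, x)] + alpha) /
                    (sum(emit[(k, x2)] for x2 in vocab) + alpha * len(vocab))
                 for x in vocab}
             for k in states}
        return A, E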


Outline

1 Hidden Markov Models

2 Maximum Entropy Markov Models

3 Sequence perceptron


Maximum Entropy Markov Models

Model structure as in an HMM

Logistic regression (maxent) is used to learn P(z_i | x, z_{i-1})

For decoding, use the learned probabilities and run Viterbi
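
A sketch of the local training step, assuming scikit-learn's LogisticRegression and DictVectorizer; the feature templates are illustrative toy choices, not the slides' feature set:

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    def local_features(x, i, z_prev):
        return {"word=" + x[i]: 1, "suff2=" + x[i][-2:]: 1, "prev=" + z_prev: 1}

    def train_memm(data):
        feats, labels = [], []
        for x, z in data:
            prev = "<s>"
            for i, zi in enumerate(z):
                feats.append(local_features(x, i, prev))
                labels.append(zi)
                prev = zi
        vec = DictVectorizer()
        clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(feats), labels)
        return vec, clf   # clf.predict_proba supplies P(z_i | x, z_{i-1}) for Viterbi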


HMMs and MEMMs

HMM POS tagging model:

ẑ = argmax_z P(z | x)
  = argmax_z P(x | z) P(z)
  = argmax_z ∏_i P(x_i | z_i) P(z_i | z_{i-1})

MEMM POS tagging model:

ẑ = argmax_z P(z | x)
  = argmax_z ∏_i P(z_i | x, z_{i-1})

The maximum entropy model directly gives the conditional probabilities P(z_i | x, z_{i-1})


Conditioning probabilities in an HMM and a MEMM

[Figure: the two graphical models side by side; in the HMM each observation x_i is generated from its state z_i, while in the MEMM each state z_i is conditioned on the observations and on the previous state z_{i-1}]

Viterbi in MEMMs

Decoding works almost the same as in the HMM

Except the entries in the DP table are values of P(z_i | x, z_{i-1})

Recursive step: the Viterbi value at position i+1 for state l is

V_l(i+1) = max_k P(z_{i+1} = l | x, z_i = k) V_k(i)


Outline

1 Hidden Markov Models

2 Maximum Entropy Markov Models

3 Sequence perceptron


Perceptron for sequences

SequencePerceptron({x}_{1:N}, {z}_{1:N}, I):

1: w ← 0
2: for i = 1 ... I do
3:     for n = 1 ... N do
4:         ŷ(n) ← argmax_z w · Φ(x(n), z)
5:         if ŷ(n) ≠ z(n) then
6:             w ← w + Φ(x(n), z(n)) − Φ(x(n), ŷ(n))
7: return w


Feature function

Harry/PER loves/O Mary/PER

Φ(x, z) = Σ_i φ(x, z_{i-1}, z_i)

 i   x_i = Harry ∧ z_i = PER   suff2(x_i) = ry ∧ z_i = PER   x_i = loves ∧ z_i = O
 1             1                           1                           0
 2             0                           0                           1
 3             0                           1                           0
 Φ             1                           2                           1


Search

ẑ(n) = argmax_z w · Φ(x(n), z)

The global score is computed incrementally:

w · Φ(x, z) = Σ_{i=1}^{|x|} w · φ(x, z_{i-1}, z_i)
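
Putting the pseudocode, the feature function, and the incremental scoring together, here is a minimal sketch of the sequence perceptron in Python. The toy feature map mirrors the suff2 example above, the decoder maximizes the global score by Viterbi over label pairs, and all names are illustrative rather than the slides' own:

    from collections import defaultdict

    def phi(x, i, z_prev, z_cur):
        # Local feature map φ(x, z_{i-1}, z_i) at position i, as a sparse dict
        return {
            ("word", x[i], z_cur): 1.0,        # x_i = w ∧ z_i = k
            ("suff2", x[i][-2:], z_cur): 1.0,  # suff2(x_i) ∧ z_i = k
            ("trans", z_prev, z_cur): 1.0,     # z_{i-1} ∧ z_i
        }

    def score(w, feats):
        return sum(w[f] * v for f, v in feats.items())

    def decode(w, x, labels):
        # argmax_z w · Φ(x, z), computed incrementally by Viterbi over scores
        V = [{k: score(w, phi(x, 0, "<s>", k)) for k in labels}]
        back = [{}]
        for i in range(1, len(x)):
            V.append({}); back.append({})
            for l in labels:
                k_best = max(labels, key=lambda k: V[i - 1][k] + score(w, phi(x, i, k, l)))
                V[i][l] = V[i - 1][k_best] + score(w, phi(x, i, k_best, l))
                back[i][l] = k_best
        z = [max(labels, key=lambda k: V[-1][k])]
        for i in range(len(x) - 1, 0, -1):
            z.append(back[i][z[-1]])
        return list(reversed(z))

    def Phi(x, z):
        # Global feature vector Φ(x, z) = Σ_i φ(x, z_{i-1}, z_i)
        total = defaultdict(float)
        prev = "<s>"
        for i, zi in enumerate(z):
            for f, v in phi(x, i, prev, zi).items():
                total[f] += v
            prev = zi
        return total

    def sequence_perceptron(data, labels, epochs=5):
        w = defaultdict(float)
        for _ in range(epochs):
            for x, z in data:
                y = decode(w, x, labels)
                if y != z:                      # update on mistakes only
                    for f, v in Phi(x, z).items():
                        w[f] += v
                    for f, v in Phi(x, y).items():
                        w[f] -= v
        return w

    data = [(["Harry", "loves", "Mary"], ["PER", "O", "PER"])]
    w = sequence_perceptron(data, labels=["PER", "ORG", "O"])
    print(decode(w, ["Harry", "loves", "Mary"], ["PER", "ORG", "O"]))  # ['PER', 'O', 'PER']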


Update term

w(n) = w(n−1) + [Φ(x(n), z(n)) − Φ(x(n), ŷ(n))]

For example, with gold labels PER O PER and prediction ORG O PER:

Φ(Harry loves Mary, PER O PER) − Φ(Harry loves Mary, ORG O PER) =

              x_i = Harry ∧ z_i = PER   x_i = Harry ∧ z_i = ORG   suff2(x_i) = ry ∧ z_i = PER   ...
 gold                   1                         0                            2                ...
 predicted              0                         1                            1                ...
 difference             1                        −1                            1                ...


Comparison

Model           HMM           MEMM            Perceptron
Type            Generative    Discriminative  Discriminative
Distribution    P(x, z)       P(z | x)        N/A
Smoothing       Crucial       Optional        Optional
Output dep.     Chain         Chain           Chain
Sup. learning   No decoding   No decoding     With decoding


The end
