Learning for Structured Prediction. Linear Methods for Sequence Labeling: Hidden Markov Models vs Structured Perceptron. Ivan Titov

Ivan Titov

Feb 11, 2016

Transcript
Page 1: Ivan  Titov

Learning for Structured Prediction

Linear Methods For Sequence Labeling: Hidden Markov Models vs Structured Perceptron

Ivan Titov

Page 2: Ivan  Titov

Last Time: Structured Prediction

1. Selecting the feature representation $\varphi(x, y)$:
It should be sufficient to discriminate correct structures from incorrect ones.
It should be possible to decode with it (see (3)).

2. Learning: which error function to optimize on the training set (for example, the margin condition below), and how to make it efficient (see (3)).

3. Decoding: dynamic programming for simpler representations? Approximate search for more powerful ones?

We illustrated all these challenges on the example of dependency parsing

Decoding rule: $\hat{y} = \arg\max_{y' \in \mathcal{Y}(x)} w \cdot \varphi(x, y')$

Margin condition: $w \cdot \varphi(x, y^*) - \max_{y' \in \mathcal{Y}(x),\, y' \neq y^*} w \cdot \varphi(x, y') > \gamma$

$\varphi(x, y)$ is the feature representation; $x$ is an input (e.g., a sentence), $y$ is an output (e.g., a syntactic tree).

Page 3: Ivan  Titov


Outline

Sequence labeling / segmentation problems: settings and example problems:

Part-of-speech tagging, named entity recognition, gesture recognition

Hidden Markov Models: standard definition + maximum likelihood estimation; general view: as a representative of linear models

Perceptron and Structured Perceptron algorithms / motivations

Decoding with the Linear Model

Discussion: Discriminative vs. Generative

Page 4: Ivan  Titov

Sequence Labeling Problems

Definition:
Input: sequences of variable length
Output: every position is labeled

Examples: part-of-speech tagging, named-entity recognition, shallow parsing ("chunking"), gesture recognition from video streams, …

$x = (x_1, x_2, \ldots, x_{|x|}),\ x_i \in X$
$y = (y_1, y_2, \ldots, y_{|x|}),\ y_i \in Y$

x = John carried a tin can .
y = NP VBD DT NN NN .

Page 5: Ivan  Titov

Part-of-speech tagging

Labels: NNP – proper singular noun; VBD – verb, past tense; DT – determiner; NN – singular noun; MD – modal; . – final punctuation

Consider:

x = John carried a tin can .
y = NNP VBD DT NN NN (or MD?) .

If you just predict the most frequent tag for each word, you will make a mistake here.

In fact, even knowing that the previous word is a noun is not enough:

x = Tin can cause poisoning …
y = NN MD VB NN …

One needs to model interactions between labels to successfully resolve ambiguities, so this should be tackled as a structured prediction problem.

Page 6: Ivan  Titov

Named Entity Recognition

[ORG Chelsea], despite their name, are not based in [LOC Chelsea], but in neighbouring [LOC Fulham].

Not as trivial as it may seem, consider:
[PERS Bill Clinton] will not embarrass [PERS Chelsea] at her wedding (Chelsea can be a person too!)
Tiger failed to make a birdie in the South Course (is it an animal or a person?)

Encoding example (BIO encoding):
x = Bill Clinton embarrassed Chelsea at her wedding at Astor Courts
y = B-PERS I-PERS O B-PERS O O O O B-LOC I-LOC

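As a small aside, BIO tags are easy to turn back into labeled entity spans. A minimal Python sketch (this helper is my own illustration, not part of the slides):

def bio_to_spans(tokens, tags):
    # Collect (label, text) spans from a BIO-tagged token sequence.
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or tag == "O" or (tag.startswith("I-") and tag[2:] != label):
            if label is not None:                              # close the open span, if any
                spans.append((label, " ".join(tokens[start:i])))
            start, label = (i, tag[2:]) if tag != "O" else (None, None)
        # a well-formed I-<label> that matches the open span simply extends it
    if label is not None:
        spans.append((label, " ".join(tokens[start:])))
    return spans

tokens = "Bill Clinton embarrassed Chelsea at her wedding at Astor Courts".split()
tags = ["B-PERS", "I-PERS", "O", "B-PERS", "O", "O", "O", "O", "B-LOC", "I-LOC"]
print(bio_to_spans(tokens, tags))
# [('PERS', 'Bill Clinton'), ('PERS', 'Chelsea'), ('LOC', 'Astor Courts')]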
Page 7: Ivan  Titov

Vision: Gesture Recognition

Given a sequence of frames in a video, annotate each frame with a gesture type:

Types of gestures: flip back, shrink vertically, expand vertically, double back, point and back, expand horizontally.

It is hard to predict gestures from each frame in isolation; you need to exploit relations between frames and gesture types.

Figures from (Wang et al., CVPR 06)

Page 8: Ivan  Titov


Outline

Sequence labeling / segmentation problems: settings and example problems:

Part-of-speech tagging, named entity recognition, gesture recognition

Hidden Markov Models: standard definition + maximum likelihood estimation; general view: as a representative of linear models

Perceptron and Structured Perceptron algorithms / motivations

Decoding with the Linear Model

Discussion: Discriminative vs. Generative

Page 9: Ivan  Titov

Hidden Markov Models

We will consider the part-of-speech (POS) tagging example:

x = John carried a tin can .
y = NP VBD DT NN NN .

A "generative" model, i.e.:
Model: introduce a parameterized model $P(x, y \mid \theta)$ of how both words and tags are generated
Learning: use a labeled training set to estimate the most likely parameters $\theta$ of the model
Decoding: $\hat{y} = \arg\max_{y'} P(x, y' \mid \theta)$

Page 10: Ivan  Titov

Hidden Markov Models

A simplistic state diagram for noun phrases (N – number of tags, M – vocabulary size):
States correspond to POS tags; words are emitted independently from each POS tag.

Parameters (to be estimated from the training set):
Transition probabilities $P(y^{(t)} \mid y^{(t-1)})$: an [N x N] matrix
Emission probabilities $P(x^{(t)} \mid y^{(t)})$: an [N x M] matrix
Stationarity assumption: these probabilities do not depend on the position t in the sequence.

[State-diagram figure: states $ (start/stop), Det, Adj, Noun with transition probabilities on the edges and emission probabilities such as 0.5 : the, 0.5 : a (Det); 0.01 : red, 0.01 : hungry (Adj); 0.01 : herring, 0.01 : dog (Noun). Example generated phrase: "a hungry dog".]

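To make the generative story concrete, here is a minimal Python sketch that samples a noun phrase from a toy HMM in the spirit of the diagram above; the exact probability values are illustrative assumptions, not the numbers from the slide:

import random

transitions = {                      # P(next state | state); "$" is the start/stop state
    "$":    {"Det": 0.8, "Noun": 0.2},
    "Det":  {"Adj": 0.5, "Noun": 0.5},
    "Adj":  {"Adj": 0.2, "Noun": 0.8},
    "Noun": {"Noun": 0.2, "$": 0.8},
}
emissions = {                        # P(word | state)
    "Det":  {"the": 0.5, "a": 0.5},
    "Adj":  {"red": 0.5, "hungry": 0.5},
    "Noun": {"herring": 0.5, "dog": 0.5},
}

def sample(dist):
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, probs)[0]

def generate():
    state, words, tags = sample(transitions["$"]), [], []
    while state != "$":
        words.append(sample(emissions[state]))    # draw a word from the current state
        tags.append(state)
        state = sample(transitions[state])        # select the next state
    return words, tags

print(generate())    # e.g. (['a', 'hungry', 'dog'], ['Det', 'Adj', 'Noun'])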
Page 11: Ivan  Titov

Hidden Markov Models

Representation as an instantiation of a graphical model (N – number of tags, M – vocabulary size):
States correspond to POS tags; words are emitted independently from each POS tag.

Parameters (to be estimated from the training set):
Transition probabilities $P(y^{(t)} \mid y^{(t-1)})$: an [N x N] matrix
Emission probabilities $P(x^{(t)} \mid y^{(t)})$: an [N x M] matrix
Stationarity assumption: these probabilities do not depend on the position t in the sequence.

[Figure: the HMM as a graphical model, a chain of hidden tags y(1) → y(2) → y(3) → y(4) → …, each emitting a word x(t). An arrow means that in the generative story x(4) is generated from some P(x(4) | y(4)). Example instantiation: y(1) = Det, y(2) = Adj, y(3) = Noun emitting x(1) = a, x(2) = hungry, x(3) = dog.]

Page 12: Ivan  Titov

Hidden Markov Models: Estimation

N – the number of tags, M – vocabulary size.
Parameters (to be estimated from the training set):
Transition probabilities $a_{i,j} = P(y^{(t)} = j \mid y^{(t-1)} = i)$: A, an [N x N] matrix
Emission probabilities $b_{i,k} = P(x^{(t)} = k \mid y^{(t)} = i)$: B, an [N x M] matrix

Training corpus:
x(1) = (In, an, Oct., 19, review, of, …), y(1) = (IN, DT, NNP, CD, NN, IN, …)
x(2) = (Ms., Haag, plays, Elianti, .), y(2) = (NNP, NNP, VBZ, NNP, .)
…
x(L) = (The, company, said, …), y(L) = (DT, NN, VBD, NNP, .)

How do we estimate the parameters using maximum likelihood estimation? You can probably guess what these estimates should be.

Page 13: Ivan  Titov

Hidden Markov Models: Estimation

Parameters (to be estimated from the training set):
Transition probabilities $a_{i,j}$: A, an [N x N] matrix
Emission probabilities $b_{i,k}$: B, an [N x M] matrix

Training corpus: $(x^{(l)}, y^{(l)})$, $l = 1, \ldots, L$.

Write down the probability of the corpus according to the HMM:

$P(\{x^{(l)}, y^{(l)}\}_{l=1}^{L}) = \prod_{l=1}^{L} P(x^{(l)}, y^{(l)}) = \prod_{l=1}^{L} a_{\$, y^{(l)}_1} \Big( \prod_{t=1}^{|x^{(l)}|-1} b_{y^{(l)}_t, x^{(l)}_t}\, a_{y^{(l)}_t, y^{(l)}_{t+1}} \Big)\, b_{y^{(l)}_{|x^{(l)}|}, x^{(l)}_{|x^{(l)}|}}\, a_{y^{(l)}_{|x^{(l)}|}, \$} = \prod_{i,j=1}^{N} a_{i,j}^{CT(i,j)} \prod_{i=1}^{N} \prod_{k=1}^{M} b_{i,k}^{CE(i,k)}$

(Select the tag for the first word; then repeatedly draw a word from the current state and select the next state; draw the last word; transit into the $ state.)

CT(i, j) is the number of times tag i is followed by tag j; here we assume that $ is a special tag which precedes and succeeds every sentence.
CE(i, k) is the number of times word k is emitted by tag i.

$a_{i,j} = P(y^{(t)} = j \mid y^{(t-1)} = i)$, $b_{i,k} = P(x^{(t)} = k \mid y^{(t)} = i)$

Page 14: Ivan  Titov

Hidden Markov Models: Estimation

Maximize:
$P(\{x^{(l)}, y^{(l)}\}_{l=1}^{L}) = \prod_{i,j=1}^{N} a_{i,j}^{CT(i,j)} \prod_{i=1}^{N} \prod_{k=1}^{M} b_{i,k}^{CE(i,k)}$

(CT(i, j) is the number of times tag i is followed by tag j; CE(i, k) is the number of times word k is emitted by tag i.)

Equivalently, maximize the logarithm of this:
$\log P(\{x^{(l)}, y^{(l)}\}_{l=1}^{L}) = \sum_{i=1}^{N} \Big( \sum_{j=1}^{N} CT(i, j) \log a_{i,j} + \sum_{k=1}^{M} CE(i, k) \log b_{i,k} \Big)$

subject to probabilistic constraints:
$\sum_{j=1}^{N} a_{i,j} = 1$, $\sum_{k=1}^{M} b_{i,k} = 1$, for $i = 1, \ldots, N$

Or, we can decompose it into 2N independent optimization tasks:

For transitions, for $i = 1, \ldots, N$:
$\max_{a_{i,1}, \ldots, a_{i,N}} \sum_{j=1}^{N} CT(i, j) \log a_{i,j}$  s.t. $\sum_{j=1}^{N} a_{i,j} = 1$

For emissions, for $i = 1, \ldots, N$:
$\max_{b_{i,1}, \ldots, b_{i,M}} \sum_{k=1}^{M} CE(i, k) \log b_{i,k}$  s.t. $\sum_{k=1}^{M} b_{i,k} = 1$

Page 15: Ivan  Titov

Hidden Markov Models: Estimation

For transitions (for some i):
$\max_{a_{i,1}, \ldots, a_{i,N}} \sum_{j=1}^{N} CT(i, j) \log a_{i,j}$  s.t. $1 - \sum_{j=1}^{N} a_{i,j} = 0$

Constrained optimization task; the Lagrangian is:
$L(a_{i,1}, \ldots, a_{i,N}, \lambda) = \sum_{j=1}^{N} CT(i, j) \log a_{i,j} + \lambda \Big(1 - \sum_{j=1}^{N} a_{i,j}\Big)$

Find critical points of the Lagrangian by solving the system of equations:
$\frac{\partial L}{\partial \lambda} = 1 - \sum_{j=1}^{N} a_{i,j} = 0$
$\frac{\partial L}{\partial a_{i,j}} = \frac{CT(i,j)}{a_{i,j}} - \lambda = 0 \;\Rightarrow\; a_{i,j} = \frac{CT(i,j)}{\lambda}$

Substituting into the constraint gives $\lambda = \sum_{j'} CT(i, j')$, so:
$P(y_t = j \mid y_{t-1} = i) = a_{i,j} = \frac{CT(i,j)}{\sum_{j'} CT(i,j')}$

Similarly, for emissions:
$P(x_t = k \mid y_t = i) = b_{i,k} = \frac{CE(i,k)}{\sum_{k'} CE(i,k')}$

The maximum likelihood solution is just normalized counts of events. This is always the case for generative models if all the labels are visible in training.

(I ignore "smoothing", needed to handle rare or unseen word-tag combinations; it is outside the scope of the seminar.)

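The closed-form solution is easy to implement: maximum likelihood estimation reduces to counting and normalizing. A minimal Python sketch (the tiny corpus is adapted from the earlier slide, with the second sentence completed by me; the function and variable names are my own):

from collections import Counter, defaultdict

def estimate_hmm(corpus):
    # corpus: list of (words, tags) pairs.
    # Returns transition table a[i][j] = P(next tag j | tag i) and
    # emission table b[i][k] = P(word k | tag i), with "$" as the special
    # boundary tag that precedes and succeeds every sentence.
    ct = defaultdict(Counter)     # CT(i, j): # times tag i is followed by tag j
    ce = defaultdict(Counter)     # CE(i, k): # times tag i emits word k
    for words, tags in corpus:
        padded = ["$"] + list(tags) + ["$"]
        for prev, nxt in zip(padded, padded[1:]):
            ct[prev][nxt] += 1
        for word, tag in zip(words, tags):
            ce[tag][word] += 1
    a = {i: {j: c / sum(row.values()) for j, c in row.items()} for i, row in ct.items()}
    b = {i: {k: c / sum(row.values()) for k, c in row.items()} for i, row in ce.items()}
    return a, b

corpus = [
    ("Ms. Haag plays Elianti .".split(), ["NNP", "NNP", "VBZ", "NNP", "."]),
    ("The company said nothing .".split(), ["DT", "NN", "VBD", "NN", "."]),
]
a, b = estimate_hmm(corpus)
print(a["NNP"])   # {'NNP': 0.333..., 'VBZ': 0.333..., '.': 0.333...}
print(b["NNP"])   # {'Ms.': 0.333..., 'Haag': 0.333..., 'Elianti': 0.333...}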
Page 16: Ivan  Titov

HMMs as linear models

Decoding (we will talk about the decoding algorithm slightly later; for now, let us generalize the Hidden Markov Model):

x = John carried a tin can .
y = ? ? ? ? ? .

$\hat{y} = \arg\max_{y'} P(x, y' \mid A, B) = \arg\max_{y'} \log P(x, y' \mid A, B)$

$\log P(x, y' \mid A, B) = \sum_{t} \big( \log b_{y'_t, x_t} + \log a_{y'_t, y'_{t+1}} \big) = \sum_{i=1}^{N} \sum_{j=1}^{N} CT(y', i, j) \log a_{i,j} + \sum_{i=1}^{N} \sum_{k=1}^{M} CE(x, y', i, k) \log b_{i,k}$

(with the boundary $ transitions included), where CT(y', i, j) is the number of times tag i is followed by tag j in the candidate y', and CE(x, y', i, k) is the number of times tag i corresponds to word k in (x, y').

But this is just a linear model!

Page 17: Ivan  Titov

Scoring: example

(x, y') = John carried a tin can . / NP VBD DT NN NN .

Unary features: NP:John, NP:Mary, …; edge features: NP-VBD, NN-., MD-., …

$\varphi(x, y') = (\,1,\ 0,\ \ldots,\ 1,\ 1,\ 0,\ \ldots\,)$ (the feature values are the counts $CE(x, y', i, k)$ and $CT(y', i, j)$)

$w_{ML} = (\,\log b_{NP,John},\ \log b_{NP,Mary},\ \ldots,\ \log a_{NP,VBD},\ \log a_{NN,.},\ \log a_{MD,.},\ \ldots\,)$

Their inner product is exactly the log-probability:
$w_{ML} \cdot \varphi(x, y') = \sum_{i=1}^{N} \sum_{j=1}^{N} CT(y', i, j) \log a_{i,j} + \sum_{i=1}^{N} \sum_{k=1}^{M} CE(x, y', i, k) \log b_{i,k} = \log P(x, y' \mid A, B)$

But maybe there are other (and better?) ways to estimate $w$, especially when we know that the HMM is not a faithful model of reality? It is not only a theoretical question! (We'll talk about that in a moment.)

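To see the correspondence numerically, here is a minimal Python sketch that scores a candidate tag sequence both as a log-probability under an HMM and as the dot product $w_{ML} \cdot \varphi(x, y')$; the toy parameter values and the feature-naming scheme are my own assumptions:

import math
from collections import Counter

# Toy HMM parameters (illustrative values only, not estimated from real data).
a = {("$", "NP"): 0.8, ("NP", "VBD"): 0.4, ("VBD", "DT"): 0.5,
     ("DT", "NN"): 0.9, ("NN", "NN"): 0.2, ("NN", "."): 0.5, (".", "$"): 1.0}
b = {("NP", "John"): 0.1, ("VBD", "carried"): 0.05, ("DT", "a"): 0.5,
     ("NN", "tin"): 0.01, ("NN", "can"): 0.02, (".", "."): 0.9}

x = ["John", "carried", "a", "tin", "can", "."]
y = ["NP", "VBD", "DT", "NN", "NN", "."]

# Log-probability computed directly from the generative story.
tags = ["$"] + y + ["$"]
log_p = sum(math.log(a[i, j]) for i, j in zip(tags, tags[1:])) \
      + sum(math.log(b[t, w]) for t, w in zip(y, x))

# The same quantity as a linear model: w_ML . phi(x, y').
phi = Counter()                                        # counts CT(y', i, j) and CE(x, y', i, k)
phi.update(("edge", i, j) for i, j in zip(tags, tags[1:]))
phi.update(("unary", t, w) for t, w in zip(y, x))
w_ml = {("edge",) + pair: math.log(p) for pair, p in a.items()}
w_ml.update({("unary",) + pair: math.log(p) for pair, p in b.items()})
score = sum(count * w_ml[feat] for feat, count in phi.items())

print(abs(log_p - score) < 1e-9)                       # True: the two scores coincide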
Page 18: Ivan  Titov

Feature view

Basically, we define features which correspond to edges in the graph:

[Figure: the graphical model again, a chain y(1) → y(2) → y(3) → y(4) with emissions x(1), …, x(4); the x nodes are shaded because they are visible (both in training and testing).]

Page 19: Ivan  Titov

Generative modeling

For a very large dataset (asymptotic analysis):
If the data is generated from some "true" HMM, then (if the training set is sufficiently large) we are guaranteed to have an optimal tagger.
Otherwise, the HMM will (generally) not correspond to an optimal linear classifier.
Discriminative methods, which minimize the error more directly, are guaranteed (under some fairly general conditions) to converge to an optimal linear classifier.

For smaller training sets:
Generative classifiers converge faster to their optimal error [Ng & Jordan, NIPS 01].

[Figure: errors on a regression dataset (predicting housing prices in the Boston area) as a function of the number of training examples, comparing a discriminative classifier with a generative model. The real case: the HMM is a coarse approximation of reality.]

Page 20: Ivan  Titov


Outline

Sequence labeling / segmentation problems: settings and example problems:

Part-of-speech tagging, named entity recognition, gesture recognition

Hidden Markov Models: standard definition + maximum likelihood estimation; general view: as a representative of linear models

Perceptron and Structured Perceptron algorithms / motivations

Decoding with the Linear Model

Discussion: Discriminative vs. Generative

Page 21: Ivan  Titov

Perceptron

Let us start with a binary classification problem, $y \in \{+1, -1\}$.

For binary classification the prediction rule is $\hat{y} = \mathrm{sign}(w \cdot \varphi(x))$ (break ties, i.e. $w \cdot \varphi(x) = 0$, in some deterministic way).

Perceptron algorithm, given a training set $\{x^{(l)}, y^{(l)}\}_{l=1}^{L}$ and a learning rate $\eta > 0$:

w = 0                                      // initialize
do
  err = 0
  for l = 1 .. L                           // over the training examples
    if ( y(l) (w · φ(x(l))) < 0 )          // if mistake
      w += η y(l) φ(x(l))                  // update
      err++                                // count errors
    endif
  endfor
while ( err > 0 )                          // repeat until no errors
return w

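A minimal NumPy sketch of this loop (the cap on the number of epochs and the treatment of ties as mistakes are my additions, so that non-separable data does not loop forever):

import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=100):
    # X: (L, d) feature matrix, with phi(x)_0 = 1 as a bias column; y: labels in {+1, -1}.
    w = np.zeros(X.shape[1])                      # initialize
    for _ in range(max_epochs):                   # cap added in case the data is not separable
        errors = 0
        for phi, label in zip(X, y):
            if label * np.dot(w, phi) <= 0:       # mistake (ties counted as mistakes)
                w += eta * label * phi            # update
                errors += 1
        if errors == 0:                           # repeat until no errors
            break
    return w

# Tiny linearly separable example (first column is the constant bias feature).
X = np.array([[1.0, 2.0, 1.0], [1.0, 1.5, 2.0], [1.0, -1.0, -0.5], [1.0, -2.0, -1.5]])
y = np.array([+1, +1, -1, -1])
w = perceptron(X, y)
print(np.sign(X @ w))   # [ 1.  1. -1. -1.]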
Page 22: Ivan  Titov

Linear classification

Linearly separable case, a "perfect" classifier:

[Figure: two classes of points in the plane ($\varphi(x)_1$, $\varphi(x)_2$) separated by the hyperplane $w \cdot \varphi(x) + b = 0$, with normal vector $w$.]

Linear functions are often written as $y = \mathrm{sign}(w \cdot \varphi(x) + b)$, but we can assume that $\varphi(x)_0 = 1$ for any x, so the bias b can be absorbed into w.

Figure adapted from Dan Roth’s class at UIUC

Page 23: Ivan  Titov

Perceptron: geometric interpretation

if ( y(l) (w · φ(x(l))) < 0 )      // if mistake
  w += η y(l) φ(x(l))              // update
endif

Page 24: Ivan  Titov

Perceptron: geometric interpretation

if ( y(l) (w · φ(x(l))) < 0 )      // if mistake
  w += η y(l) φ(x(l))              // update
endif

Page 25: Ivan  Titov

Perceptron: geometric interpretation

if ( y(l) (w · φ(x(l))) < 0 )      // if mistake
  w += η y(l) φ(x(l))              // update
endif

Page 26: Ivan  Titov

Perceptron: geometric interpretation

if ( y(l) (w · φ(x(l))) < 0 )      // if mistake
  w += η y(l) φ(x(l))              // update
endif

Page 27: Ivan  Titov

Perceptron: algebraic interpretation

if ( y(l) (w · φ(x(l))) < 0 )      // if mistake
  w += η y(l) φ(x(l))              // update
endif

We want the update to increase $y^{(l)} \big(w \cdot \varphi(x^{(l)})\big)$. If the increase is large enough, there will be no misclassification.

Let's see that this is what happens after the update:
$y^{(l)} \big((w + \eta y^{(l)} \varphi(x^{(l)})) \cdot \varphi(x^{(l)})\big) = y^{(l)} \big(w \cdot \varphi(x^{(l)})\big) + \eta (y^{(l)})^2 \big(\varphi(x^{(l)}) \cdot \varphi(x^{(l)})\big)$

where $(y^{(l)})^2 = 1$ and $\varphi(x^{(l)}) \cdot \varphi(x^{(l)})$ is a squared norm, $> 0$.

So the perceptron update moves the decision hyperplane towards the misclassified example $\varphi(x^{(l)})$.

Page 28: Ivan  Titov


Perceptron

The perceptron algorithm, obviously, can only converge if the training set is linearly separable

It is guaranteed to converge in a finite number of iterations, depending on how well the two classes are separated (Novikoff, 1962)

Page 29: Ivan  Titov

Averaged Perceptron

A small modification (do not run until convergence; run for a fixed number of iterations K):

w = 0, wΣ = 0                        // initialize
for k = 1 .. K                       // for a number of iterations
  for l = 1 .. L                     // over the training examples
    if ( y(l) (w · φ(x(l))) < 0 )    // if mistake
      w += η y(l) φ(x(l))            // update, η > 0
    endif
    wΣ += w                          // sum of w over the course of training (note: after the endif)
  endfor
endfor
return wΣ / (K L)                    // the averaged weight vector

More stable in training: a weight vector w which survived more iterations without updates is more similar to the resulting vector wΣ / (K L), as it was added a larger number of times.

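The same loop in NumPy, as a minimal sketch (variable names are mine):

import numpy as np

def averaged_perceptron(X, y, eta=1.0, epochs=10):
    # Run exactly `epochs` = K passes (do not run until convergence) and
    # return the average of w over all K * L inner steps.
    w = np.zeros(X.shape[1])
    w_sum = np.zeros(X.shape[1])
    for _ in range(epochs):                      # k = 1 .. K
        for phi, label in zip(X, y):             # l = 1 .. L
            if label * np.dot(w, phi) <= 0:      # mistake
                w += eta * label * phi           # update
            w_sum += w                           # accumulate after the if, on every step
    return w_sum / (epochs * len(X))             # averaged weight vector

X = np.array([[1.0, 2.0, 1.0], [1.0, 1.5, 2.0], [1.0, -1.0, -0.5], [1.0, -2.0, -1.5]])
y = np.array([+1, +1, -1, -1])
print(np.sign(X @ averaged_perceptron(X, y)))    # [ 1.  1. -1. -1.]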
Page 30: Ivan  Titov

Structured Perceptron

Now let us turn to the structured problem: $\hat{y} = \arg\max_{y' \in \mathcal{Y}(x)} w \cdot \varphi(x, y')$

Perceptron algorithm, given a training set $\{x^{(l)}, y^{(l)}\}_{l=1}^{L}$ and $\eta > 0$:

w = 0                                                  // initialize
do
  err = 0
  for l = 1 .. L                                       // over the training examples
    ŷ = argmax_{y' ∈ Y(x(l))} w · φ(x(l), y')           // model prediction
    if ( w · φ(x(l), ŷ) > w · φ(x(l), y(l)) )           // if mistake
      w += η ( φ(x(l), y(l)) - φ(x(l), ŷ) )             // update
      err++                                            // count errors
    endif
  endfor
while ( err > 0 )                                      // repeat until no errors
return w

The update pushes the correct sequence up and the incorrectly predicted one down.

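A minimal Python sketch of this loop for tag sequences. The toy feature map, data, and fixed number of epochs are my own choices; the argmax is a brute-force enumeration over all tag sequences so that the example stays self-contained (the Viterbi algorithm described later makes this step efficient), and the update fires whenever the predicted sequence differs from the gold one, a common way to implement the mistake test above:

from collections import Counter
from itertools import product

TAGS = ["D", "N", "V"]                                 # toy tag set (hypothetical)

def phi(x, y):
    # Sparse feature counts: word-tag unary features and tag-tag edge features.
    feats = Counter()
    feats.update(("unary", w, t) for w, t in zip(x, y))
    feats.update(("edge", p, n) for p, n in zip(["$"] + list(y), list(y) + ["$"]))
    return feats

def score(w, feats):
    return sum(c * w.get(f, 0.0) for f, c in feats.items())

def decode(w, x):
    # Brute-force argmax over all tag sequences; Viterbi (later) does this efficiently.
    return max(product(TAGS, repeat=len(x)), key=lambda y: score(w, phi(x, y)))

def structured_perceptron(data, eta=1.0, epochs=10):
    w = Counter()                                      # initialize
    for _ in range(epochs):
        for x, y in data:                              # over the training examples
            y_hat = decode(w, x)                       # model prediction
            if tuple(y_hat) != tuple(y):               # mistake
                for f, c in phi(x, y).items():         # push the correct sequence up
                    w[f] += eta * c
                for f, c in phi(x, y_hat).items():     # and the predicted one down
                    w[f] -= eta * c
    return w

data = [("the dog barks".split(), ["D", "N", "V"]),
        ("a cat sleeps".split(), ["D", "N", "V"])]
w = structured_perceptron(data)
print(decode(w, "the cat barks".split()))              # ('D', 'N', 'V')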
Page 31: Ivan  Titov

Structured perceptron: algebraic interpretation

if ( w · φ(x(l), ŷ) > w · φ(x(l), y(l)) )      // if mistake
  w += η ( φ(x(l), y(l)) - φ(x(l), ŷ) )        // update
endif

We want the update to increase $w \cdot \big(\varphi(x^{(l)}, y^{(l)}) - \varphi(x^{(l)}, \hat{y})\big)$. If the increase is large enough, then $y^{(l)}$ will be scored above $\hat{y}$.

Clearly, this is achieved, as the update increases this product by $\eta\, \|\varphi(x^{(l)}, y^{(l)}) - \varphi(x^{(l)}, \hat{y})\|^2$.

There might be other $y' \in \mathcal{Y}(x^{(l)})$ still scored above $y^{(l)}$, but we will deal with them in the next iterations.

Page 32: Ivan  Titov

Structured Perceptron

Positives:
Very easy to implement.
Often achieves respectable results.
Like other discriminative techniques, it does not make assumptions about the generative process.
Additional features can be easily integrated, as long as decoding stays tractable.

Drawbacks:
"Good" discriminative algorithms should optimize some measure which is closely related to the expected testing results; what the perceptron optimizes on non-linearly-separable data is not clear.
However, for the averaged (voted) version there is a generalization bound which characterizes the generalization properties of the perceptron (Freund & Schapire, 98).

Later, we will consider more advanced learning algorithms.

Page 33: Ivan  Titov


Outline

Sequence labeling / segmentation problems: settings and example problems:

Part-of-speech tagging, named entity recognition, gesture recognition

Hidden Markov Models: standard definition + maximum likelihood estimation; general view: as a representative of linear models

Perceptron and Structured Perceptron algorithms / motivations

Decoding with the Linear Model

Discussion: Discriminative vs. Generative

Page 34: Ivan  Titov

Decoding with the Linear Model

Decoding: $\hat{y} = \arg\max_{y' \in \mathcal{Y}(x)} w \cdot \varphi(x, y')$

Again a linear model with the following edge features (a generalization of an HMM). In fact, the algorithm does not depend on the features of the input (they do not need to be local).

[Figure: the chain y(1) → y(2) → y(3) → y(4) with observed words x(1), …, x(4).]

Page 35: Ivan  Titov

Decoding with the Linear Model

Decoding: $\hat{y} = \arg\max_{y' \in \mathcal{Y}(x)} w \cdot \varphi(x, y')$

Again a linear model with the following edge features (a generalization of an HMM). In fact, the algorithm does not depend on the features of the input (they do not need to be local).

[Figure: the same chain y(1) → y(2) → y(3) → y(4), now with the whole input x drawn as a single observed node.]

Page 36: Ivan  Titov

Decoding with the Linear Model

Decoding: $\hat{y} = \arg\max_{y' \in \mathcal{Y}(x)} w \cdot \varphi(x, y')$

Let's change notation. Edge scores $f_t(y_{t-1}, y_t, x)$: for the HMM this roughly corresponds to $\log a_{y_{t-1}, y_t} + \log b_{y_t, x_t}$.
Defined at the start boundary too ("start" feature: $y_0 = \$$); start/stop symbol information ($) can be encoded with the edge scores as well.

Decode: $\hat{y} = \arg\max_{y' \in \mathcal{Y}(x)} \sum_{t=1}^{|x|} f_t(y'_{t-1}, y'_t, x)$

Decoding is done with a dynamic programming algorithm: the Viterbi algorithm.

[Figure: the chain y(1) → … → y(4) with the input x.]

Page 37: Ivan  Titov

Viterbi algorithm

Decoding: $\hat{y} = \arg\max_{y' \in \mathcal{Y}(x)} \sum_{t=1}^{|x|} f_t(y'_{t-1}, y'_t, x)$

Loop invariant (for t = 1, …, |x|):
score_t[y] – the score of the highest-scoring label sequence up to position t that ends with tag y
prev_t[y] – the previous tag on this sequence

Init: score_0[$] = 0, score_0[y] = -∞ for all other y
Recomputation (for t = 1, …, |x|):
prev_t[y] = argmax_{y'} score_{t-1}[y'] + f_t(y', y, x)
score_t[y] = score_{t-1}[prev_t[y]] + f_t(prev_t[y], y, x)
Return: retrace the prev pointers starting from argmax_y score_{|x|}[y]

Time complexity: O(N² |x|)

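A minimal Python sketch of the algorithm just described, written for a generic edge-score function f(t, y_prev, y, x); the HMM-style score helper, the toy parameters, and the tiny smoothing constant (to avoid log(0)) are my own assumptions:

import math

def viterbi(x, tags, f):
    # x: input sequence; tags: the tag set; f(t, y_prev, y, x): edge score.
    # Returns the highest-scoring tag sequence in O(N^2 |x|) time.
    score = [{"$": 0.0}]                               # score_0[$] = 0, all other tags -inf
    prev = [{}]
    for t in range(1, len(x) + 1):
        score.append({})
        prev.append({})
        for y in tags:
            best = max(score[t - 1], key=lambda yp: score[t - 1][yp] + f(t, yp, y, x))
            prev[t][y] = best
            score[t][y] = score[t - 1][best] + f(t, best, y, x)
    y = max(score[len(x)], key=score[len(x)].get)      # best final tag
    path = [y]
    for t in range(len(x), 1, -1):                     # retrace the prev pointers
        y = prev[t][y]
        path.append(y)
    return list(reversed(path))

def make_hmm_scores(a, b, smooth=1e-6):
    # HMM-style edge scores: f_t(y_prev, y, x) = log a[y_prev][y] + log b[y][x_t].
    def f(t, y_prev, y, x):
        return math.log(a.get(y_prev, {}).get(y, smooth)) + math.log(b.get(y, {}).get(x[t - 1], smooth))
    return f

a = {"$": {"D": 0.9, "N": 0.1}, "D": {"N": 1.0}, "N": {"V": 0.6, "N": 0.4}}
b = {"D": {"the": 1.0}, "N": {"dog": 0.7, "barks": 0.3}, "V": {"barks": 1.0}}
print(viterbi("the dog barks".split(), ["D", "N", "V"], make_hmm_scores(a, b)))   # ['D', 'N', 'V']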
Page 38: Ivan  Titov


Outline

Sequence labeling / segmentation problems: settings and example problems:

Part-of-speech tagging, named entity recognition, gesture recognition

Hidden Markov Models: standard definition + maximum likelihood estimation; general view: as a representative of linear models

Perceptron and Structured Perceptron algorithms / motivations

Decoding with the Linear Model

Discussion: Discriminative vs. Generative

Page 39: Ivan  Titov

Recap: Sequence Labeling

Hidden Markov Models: how to estimate them.
Discriminative models: how to learn with the structured perceptron.
Both learning algorithms result in a linear model.
How to label with the linear models (Viterbi decoding).

Page 40: Ivan  Titov

Discriminative vs Generative

Generative models:
Cheap to estimate: simply normalized counts.
Hard to integrate complex features: you need to come up with a generative story, and this story may be wrong.
Do not result in an optimal classifier when the model assumptions are wrong (i.e., always).

Discriminative models:
More expensive to learn: you need to run decoding (here, Viterbi) during training, usually multiple times per example.
Easy to integrate features: though some features may make decoding intractable.
Usually less accurate on small datasets (not necessarily the case for generative models with latent variables).

Page 41: Ivan  Titov

Reminders

Speakers: slides are due about a week before the talk; meetings with me before/after this point will normally be needed.
Reviewers: reviews are accepted only before the day we consider the topic.
Everyone: references to the papers to read are on GoogleDocs.
These slides (and the previous ones) will be online today; speakers, please send me the final version of your slides too.

Next time: Lea on Models of Parsing, PCFGs vs general WCFGs (Michael Collins' book chapter).