Learning for Structured Prediction. Linear Methods for Sequence Labeling: Hidden Markov Models vs Structured Perceptron. Ivan Titov

Ivan Titov

Feb 11, 2016

Transcript
Page 1: Ivan  Titov

Learning for Structured Prediction

Linear Methods For Sequence Labeling: Hidden Markov Models vs Structured Perceptron

Ivan Titov

Page 2: Ivan  Titov

Last Time: Structured Prediction

1. Selecting the feature representation $\varphi(x, y)$:
It should be sufficient to discriminate correct structures from incorrect ones.
It should be possible to decode with it (see (3)).

2. Learning: which error function to optimize on the training set (for example, the margin condition below), and how to make it efficient (see (3)).

3. Decoding: dynamic programming for simpler representations? Approximate search for more powerful ones?

We illustrated all these challenges on the example of dependency parsing

Decoding rule: $\hat{y} = \arg\max_{y' \in \mathcal{Y}(x)} w \cdot \varphi(x, y')$

Margin condition: $w \cdot \varphi(x, y^*) - \max_{y' \in \mathcal{Y}(x),\, y' \neq y^*} w \cdot \varphi(x, y') > \gamma$

$\varphi(x, y)$ is the feature representation; $x$ is an input (e.g., a sentence), $y$ is an output (e.g., a syntactic tree).

Page 3: Ivan  Titov


Outline

Sequence labeling / segmentation problems: settings and example problems:

Part-of-speech tagging, named entity recognition, gesture recognition

Hidden Markov Models: standard definition + maximum likelihood estimation; general view: as a representative of linear models

Perceptron and Structured Perceptron algorithms / motivations

Decoding with the Linear Model

Discussion: Discriminative vs. Generative

Page 4: Ivan  Titov

Sequence Labeling Problems

Definition:
Input: sequences of variable length
Output: every position is labeled

Examples: part-of-speech tagging, named-entity recognition, shallow parsing ("chunking"), gesture recognition from video streams, …

$x = (x_1, x_2, \ldots, x_{|x|}),\ x_i \in X$
$y = (y_1, y_2, \ldots, y_{|x|}),\ y_i \in Y$

x = John carried a tin can .
y = NP VBD DT NN NN .

Page 5: Ivan  Titov

Part-of-speech tagging

Labels: NNP – proper singular noun; VBD – verb, past tense; DT – determiner; NN – singular noun; MD – modal; . – final punctuation

Consider:

x = John carried a tin can .
y = NNP VBD DT NN NN (or MD?) .

If you just predict the most frequent tag for each word, you will make a mistake here.

In fact, even knowing that the previous word is a noun is not enough:

x = Tin can cause poisoning …
y = NN MD VB NN …

One needs to model interactions between labels to successfully resolve ambiguities, so this should be tackled as a structured prediction problem.

Page 6: Ivan  Titov

Named Entity Recognition

[ORG Chelsea], despite their name, are not based in [LOC Chelsea], but in neighbouring [LOC Fulham].

Not as trivial as it may seem, consider:
[PERS Bill Clinton] will not embarrass [PERS Chelsea] at her wedding (Chelsea can be a person too!)
Tiger failed to make a birdie in the South Course (is it an animal or a person?)

Encoding example (BIO encoding):
x = Bill Clinton embarrassed Chelsea at her wedding at Astor Courts
y = B-PERS I-PERS O B-PERS O O O O B-LOC I-LOC

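As a small aside, BIO tags are easy to turn back into labeled entity spans. A minimal Python sketch (this helper is my own illustration, not part of the slides):

def bio_to_spans(tokens, tags):
    # Collect (label, text) spans from a BIO-tagged token sequence.
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or tag == "O" or (tag.startswith("I-") and tag[2:] != label):
            if label is not None:                              # close the open span, if any
                spans.append((label, " ".join(tokens[start:i])))
            start, label = (i, tag[2:]) if tag != "O" else (None, None)
        # a well-formed I-<label> that matches the open span simply extends it
    if label is not None:
        spans.append((label, " ".join(tokens[start:])))
    return spans

tokens = "Bill Clinton embarrassed Chelsea at her wedding at Astor Courts".split()
tags = ["B-PERS", "I-PERS", "O", "B-PERS", "O", "O", "O", "O", "B-LOC", "I-LOC"]
print(bio_to_spans(tokens, tags))
# [('PERS', 'Bill Clinton'), ('PERS', 'Chelsea'), ('LOC', 'Astor Courts')]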
Page 7: Ivan  Titov

Vision: Gesture Recognition

Given a sequence of frames in a video, annotate each frame with a gesture type:

Types of gestures: flip back, shrink vertically, expand vertically, double back, point and back, expand horizontally.

It is hard to predict gestures from each frame in isolation; you need to exploit relations between frames and gesture types.

Figures from (Wang et al., CVPR 06)

Page 8: Ivan  Titov


Outline

Sequence labeling / segmentation problems: settings and example problems:

Part-of-speech tagging, named entity recognition, gesture recognition

Hidden Markov Models: standard definition + maximum likelihood estimation; general view: as a representative of linear models

Perceptron and Structured Perceptron algorithms / motivations

Decoding with the Linear Model

Discussion: Discriminative vs. Generative

Page 9: Ivan  Titov

Hidden Markov Models

We will consider the part-of-speech (POS) tagging example:

x = John carried a tin can .
y = NP VBD DT NN NN .

A "generative" model, i.e.:
Model: introduce a parameterized model $P(x, y \mid \theta)$ of how both words and tags are generated
Learning: use a labeled training set to estimate the most likely parameters $\theta$ of the model
Decoding: $\hat{y} = \arg\max_{y'} P(x, y' \mid \theta)$

Page 10: Ivan  Titov

Hidden Markov Models

A simplistic state diagram for noun phrases (N – number of tags, M – vocabulary size):
States correspond to POS tags; words are emitted independently from each POS tag.

Parameters (to be estimated from the training set):
Transition probabilities $P(y^{(t)} \mid y^{(t-1)})$: an [N x N] matrix
Emission probabilities $P(x^{(t)} \mid y^{(t)})$: an [N x M] matrix
Stationarity assumption: these probabilities do not depend on the position t in the sequence.

[State-diagram figure: states $ (start/stop), Det, Adj, Noun with transition probabilities on the edges and emission probabilities such as 0.5 : the, 0.5 : a (Det); 0.01 : red, 0.01 : hungry (Adj); 0.01 : herring, 0.01 : dog (Noun). Example generated phrase: "a hungry dog".]

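To make the generative story concrete, here is a minimal Python sketch that samples a noun phrase from a toy HMM in the spirit of the diagram above; the exact probability values are illustrative assumptions, not the numbers from the slide:

import random

transitions = {                      # P(next state | state); "$" is the start/stop state
    "$":    {"Det": 0.8, "Noun": 0.2},
    "Det":  {"Adj": 0.5, "Noun": 0.5},
    "Adj":  {"Adj": 0.2, "Noun": 0.8},
    "Noun": {"Noun": 0.2, "$": 0.8},
}
emissions = {                        # P(word | state)
    "Det":  {"the": 0.5, "a": 0.5},
    "Adj":  {"red": 0.5, "hungry": 0.5},
    "Noun": {"herring": 0.5, "dog": 0.5},
}

def sample(dist):
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, probs)[0]

def generate():
    state, words, tags = sample(transitions["$"]), [], []
    while state != "$":
        words.append(sample(emissions[state]))    # draw a word from the current state
        tags.append(state)
        state = sample(transitions[state])        # select the next state
    return words, tags

print(generate())    # e.g. (['a', 'hungry', 'dog'], ['Det', 'Adj', 'Noun'])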
Page 11: Ivan  Titov

Hidden Markov Models

Representation as an instantiation of a graphical model (N – number of tags, M – vocabulary size):
States correspond to POS tags; words are emitted independently from each POS tag.

Parameters (to be estimated from the training set):
Transition probabilities $P(y^{(t)} \mid y^{(t-1)})$: an [N x N] matrix
Emission probabilities $P(x^{(t)} \mid y^{(t)})$: an [N x M] matrix
Stationarity assumption: these probabilities do not depend on the position t in the sequence.

[Figure: the HMM as a graphical model, a chain of hidden tags y(1) → y(2) → y(3) → y(4) → …, each emitting a word x(t). An arrow means that in the generative story x(4) is generated from some P(x(4) | y(4)). Example instantiation: y(1) = Det, y(2) = Adj, y(3) = Noun emitting x(1) = a, x(2) = hungry, x(3) = dog.]

Page 12: Ivan  Titov

Hidden Markov Models: Estimation

N – the number of tags, M – vocabulary size.
Parameters (to be estimated from the training set):
Transition probabilities $a_{i,j} = P(y^{(t)} = j \mid y^{(t-1)} = i)$: A, an [N x N] matrix
Emission probabilities $b_{i,k} = P(x^{(t)} = k \mid y^{(t)} = i)$: B, an [N x M] matrix

Training corpus:
x(1) = (In, an, Oct., 19, review, of, …), y(1) = (IN, DT, NNP, CD, NN, IN, …)
x(2) = (Ms., Haag, plays, Elianti, .), y(2) = (NNP, NNP, VBZ, NNP, .)
…
x(L) = (The, company, said, …), y(L) = (DT, NN, VBD, NNP, .)

How do we estimate the parameters using maximum likelihood estimation? You can probably guess what these estimates should be.

Page 13: Ivan  Titov

Hidden Markov Models: Estimation

Parameters (to be estimated from the training set):
Transition probabilities $a_{i,j}$: A, an [N x N] matrix
Emission probabilities $b_{i,k}$: B, an [N x M] matrix

Training corpus: $(x^{(l)}, y^{(l)})$, $l = 1, \ldots, L$.

Write down the probability of the corpus according to the HMM:

$P(\{x^{(l)}, y^{(l)}\}_{l=1}^{L}) = \prod_{l=1}^{L} P(x^{(l)}, y^{(l)}) = \prod_{l=1}^{L} a_{\$, y^{(l)}_1} \Big( \prod_{t=1}^{|x^{(l)}|-1} b_{y^{(l)}_t, x^{(l)}_t}\, a_{y^{(l)}_t, y^{(l)}_{t+1}} \Big)\, b_{y^{(l)}_{|x^{(l)}|}, x^{(l)}_{|x^{(l)}|}}\, a_{y^{(l)}_{|x^{(l)}|}, \$} = \prod_{i,j=1}^{N} a_{i,j}^{CT(i,j)} \prod_{i=1}^{N} \prod_{k=1}^{M} b_{i,k}^{CE(i,k)}$

(Select the tag for the first word; then repeatedly draw a word from the current state and select the next state; draw the last word; transit into the $ state.)

CT(i, j) is the number of times tag i is followed by tag j; here we assume that $ is a special tag which precedes and succeeds every sentence.
CE(i, k) is the number of times word k is emitted by tag i.

$a_{i,j} = P(y^{(t)} = j \mid y^{(t-1)} = i)$, $b_{i,k} = P(x^{(t)} = k \mid y^{(t)} = i)$

Page 14: Ivan  Titov

Hidden Markov Models: Estimation

Maximize:
$P(\{x^{(l)}, y^{(l)}\}_{l=1}^{L}) = \prod_{i,j=1}^{N} a_{i,j}^{CT(i,j)} \prod_{i=1}^{N} \prod_{k=1}^{M} b_{i,k}^{CE(i,k)}$

(CT(i, j) is the number of times tag i is followed by tag j; CE(i, k) is the number of times word k is emitted by tag i.)

Equivalently, maximize the logarithm of this:
$\log P(\{x^{(l)}, y^{(l)}\}_{l=1}^{L}) = \sum_{i=1}^{N} \Big( \sum_{j=1}^{N} CT(i, j) \log a_{i,j} + \sum_{k=1}^{M} CE(i, k) \log b_{i,k} \Big)$

subject to probabilistic constraints:
$\sum_{j=1}^{N} a_{i,j} = 1$, $\sum_{k=1}^{M} b_{i,k} = 1$, for $i = 1, \ldots, N$

Or, we can decompose it into 2N independent optimization tasks:

For transitions, for $i = 1, \ldots, N$:
$\max_{a_{i,1}, \ldots, a_{i,N}} \sum_{j=1}^{N} CT(i, j) \log a_{i,j}$  s.t. $\sum_{j=1}^{N} a_{i,j} = 1$

For emissions, for $i = 1, \ldots, N$:
$\max_{b_{i,1}, \ldots, b_{i,M}} \sum_{k=1}^{M} CE(i, k) \log b_{i,k}$  s.t. $\sum_{k=1}^{M} b_{i,k} = 1$

Page 15: Ivan  Titov

Hidden Markov Models: Estimation

For transitions (for some i):
$\max_{a_{i,1}, \ldots, a_{i,N}} \sum_{j=1}^{N} CT(i, j) \log a_{i,j}$  s.t. $1 - \sum_{j=1}^{N} a_{i,j} = 0$

Constrained optimization task; the Lagrangian is:
$L(a_{i,1}, \ldots, a_{i,N}, \lambda) = \sum_{j=1}^{N} CT(i, j) \log a_{i,j} + \lambda \Big(1 - \sum_{j=1}^{N} a_{i,j}\Big)$

Find critical points of the Lagrangian by solving the system of equations:
$\frac{\partial L}{\partial \lambda} = 1 - \sum_{j=1}^{N} a_{i,j} = 0$
$\frac{\partial L}{\partial a_{i,j}} = \frac{CT(i,j)}{a_{i,j}} - \lambda = 0 \;\Rightarrow\; a_{i,j} = \frac{CT(i,j)}{\lambda}$

Substituting into the constraint gives $\lambda = \sum_{j'} CT(i, j')$, so:
$P(y_t = j \mid y_{t-1} = i) = a_{i,j} = \frac{CT(i,j)}{\sum_{j'} CT(i,j')}$

Similarly, for emissions:
$P(x_t = k \mid y_t = i) = b_{i,k} = \frac{CE(i,k)}{\sum_{k'} CE(i,k')}$

The maximum likelihood solution is just normalized counts of events. This is always the case for generative models if all the labels are visible in training.

(I ignore "smoothing", needed to handle rare or unseen word-tag combinations; it is outside the scope of the seminar.)

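The closed-form solution is easy to implement: maximum likelihood estimation reduces to counting and normalizing. A minimal Python sketch (the tiny corpus is adapted from the earlier slide, with the second sentence completed by me; the function and variable names are my own):

from collections import Counter, defaultdict

def estimate_hmm(corpus):
    # corpus: list of (words, tags) pairs.
    # Returns transition table a[i][j] = P(next tag j | tag i) and
    # emission table b[i][k] = P(word k | tag i), with "$" as the special
    # boundary tag that precedes and succeeds every sentence.
    ct = defaultdict(Counter)     # CT(i, j): # times tag i is followed by tag j
    ce = defaultdict(Counter)     # CE(i, k): # times tag i emits word k
    for words, tags in corpus:
        padded = ["$"] + list(tags) + ["$"]
        for prev, nxt in zip(padded, padded[1:]):
            ct[prev][nxt] += 1
        for word, tag in zip(words, tags):
            ce[tag][word] += 1
    a = {i: {j: c / sum(row.values()) for j, c in row.items()} for i, row in ct.items()}
    b = {i: {k: c / sum(row.values()) for k, c in row.items()} for i, row in ce.items()}
    return a, b

corpus = [
    ("Ms. Haag plays Elianti .".split(), ["NNP", "NNP", "VBZ", "NNP", "."]),
    ("The company said nothing .".split(), ["DT", "NN", "VBD", "NN", "."]),
]
a, b = estimate_hmm(corpus)
print(a["NNP"])   # {'NNP': 0.333..., 'VBZ': 0.333..., '.': 0.333...}
print(b["NNP"])   # {'Ms.': 0.333..., 'Haag': 0.333..., 'Elianti': 0.333...}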
Page 16: Ivan  Titov

HMMs as linear models

Decoding (we will talk about the decoding algorithm slightly later; for now, let us generalize the Hidden Markov Model):

x = John carried a tin can .
y = ? ? ? ? ? .

$\hat{y} = \arg\max_{y'} P(x, y' \mid A, B) = \arg\max_{y'} \log P(x, y' \mid A, B)$

$\log P(x, y' \mid A, B) = \sum_{t} \big( \log b_{y'_t, x_t} + \log a_{y'_t, y'_{t+1}} \big) = \sum_{i=1}^{N} \sum_{j=1}^{N} CT(y', i, j) \log a_{i,j} + \sum_{i=1}^{N} \sum_{k=1}^{M} CE(x, y', i, k) \log b_{i,k}$

(with the boundary $ transitions included), where CT(y', i, j) is the number of times tag i is followed by tag j in the candidate y', and CE(x, y', i, k) is the number of times tag i corresponds to word k in (x, y').

But this is just a linear model!

Page 17: Ivan  Titov

Scoring: example

(x, y') = John carried a tin can . / NP VBD DT NN NN .

Unary features: NP:John, NP:Mary, …; edge features: NP-VBD, NN-., MD-., …

$\varphi(x, y') = (\,1,\ 0,\ \ldots,\ 1,\ 1,\ 0,\ \ldots\,)$ (the feature values are the counts $CE(x, y', i, k)$ and $CT(y', i, j)$)

$w_{ML} = (\,\log b_{NP,John},\ \log b_{NP,Mary},\ \ldots,\ \log a_{NP,VBD},\ \log a_{NN,.},\ \log a_{MD,.},\ \ldots\,)$

Their inner product is exactly the log-probability:
$w_{ML} \cdot \varphi(x, y') = \sum_{i=1}^{N} \sum_{j=1}^{N} CT(y', i, j) \log a_{i,j} + \sum_{i=1}^{N} \sum_{k=1}^{M} CE(x, y', i, k) \log b_{i,k} = \log P(x, y' \mid A, B)$

But maybe there are other (and better?) ways to estimate $w$, especially when we know that the HMM is not a faithful model of reality? It is not only a theoretical question! (We'll talk about that in a moment.)

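To see the correspondence numerically, here is a minimal Python sketch that scores a candidate tag sequence both as a log-probability under an HMM and as the dot product $w_{ML} \cdot \varphi(x, y')$; the toy parameter values and the feature-naming scheme are my own assumptions:

import math
from collections import Counter

# Toy HMM parameters (illustrative values only, not estimated from real data).
a = {("$", "NP"): 0.8, ("NP", "VBD"): 0.4, ("VBD", "DT"): 0.5,
     ("DT", "NN"): 0.9, ("NN", "NN"): 0.2, ("NN", "."): 0.5, (".", "$"): 1.0}
b = {("NP", "John"): 0.1, ("VBD", "carried"): 0.05, ("DT", "a"): 0.5,
     ("NN", "tin"): 0.01, ("NN", "can"): 0.02, (".", "."): 0.9}

x = ["John", "carried", "a", "tin", "can", "."]
y = ["NP", "VBD", "DT", "NN", "NN", "."]

# Log-probability computed directly from the generative story.
tags = ["$"] + y + ["$"]
log_p = sum(math.log(a[i, j]) for i, j in zip(tags, tags[1:])) \
      + sum(math.log(b[t, w]) for t, w in zip(y, x))

# The same quantity as a linear model: w_ML . phi(x, y').
phi = Counter()                                        # counts CT(y', i, j) and CE(x, y', i, k)
phi.update(("edge", i, j) for i, j in zip(tags, tags[1:]))
phi.update(("unary", t, w) for t, w in zip(y, x))
w_ml = {("edge",) + pair: math.log(p) for pair, p in a.items()}
w_ml.update({("unary",) + pair: math.log(p) for pair, p in b.items()})
score = sum(count * w_ml[feat] for feat, count in phi.items())

print(abs(log_p - score) < 1e-9)                       # True: the two scores coincide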
Page 18: Ivan  Titov

Feature view

Basically, we define features which correspond to edges in the graph:

[Figure: the graphical model again, a chain y(1) → y(2) → y(3) → y(4) with emissions x(1), …, x(4); the x nodes are shaded because they are visible (both in training and testing).]

Page 19: Ivan  Titov

Generative modeling

For a very large dataset (asymptotic analysis):
If the data is generated from some "true" HMM, then (if the training set is sufficiently large) we are guaranteed to have an optimal tagger.
Otherwise, the HMM will (generally) not correspond to an optimal linear classifier.
Discriminative methods, which minimize the error more directly, are guaranteed (under some fairly general conditions) to converge to an optimal linear classifier.

For smaller training sets:
Generative classifiers converge faster to their optimal error [Ng & Jordan, NIPS 01].

[Figure: errors on a regression dataset (predicting housing prices in the Boston area) as a function of the number of training examples, comparing a discriminative classifier with a generative model. The real case: the HMM is a coarse approximation of reality.]

Page 20: Ivan  Titov


Outline

Sequence labeling / segmentation problems: settings and example problems:

Part-of-speech tagging, named entity recognition, gesture recognition

Hidden Markov Models: standard definition + maximum likelihood estimation; general view: as a representative of linear models

Perceptron and Structured Perceptron algorithms / motivations

Decoding with the Linear Model

Discussion: Discriminative vs. Generative

Page 21: Ivan  Titov

Perceptron

Let us start with a binary classification problem, $y \in \{+1, -1\}$.

For binary classification the prediction rule is $\hat{y} = \mathrm{sign}(w \cdot \varphi(x))$ (break ties, i.e. $w \cdot \varphi(x) = 0$, in some deterministic way).

Perceptron algorithm, given a training set $\{x^{(l)}, y^{(l)}\}_{l=1}^{L}$ and a learning rate $\eta > 0$:

w = 0                                      // initialize
do
  err = 0
  for l = 1 .. L                           // over the training examples
    if ( y(l) (w · φ(x(l))) < 0 )          // if mistake
      w += η y(l) φ(x(l))                  // update
      err++                                // count errors
    endif
  endfor
while ( err > 0 )                          // repeat until no errors
return w

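A minimal NumPy sketch of this loop (the cap on the number of epochs and the treatment of ties as mistakes are my additions, so that non-separable data does not loop forever):

import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=100):
    # X: (L, d) feature matrix, with phi(x)_0 = 1 as a bias column; y: labels in {+1, -1}.
    w = np.zeros(X.shape[1])                      # initialize
    for _ in range(max_epochs):                   # cap added in case the data is not separable
        errors = 0
        for phi, label in zip(X, y):
            if label * np.dot(w, phi) <= 0:       # mistake (ties counted as mistakes)
                w += eta * label * phi            # update
                errors += 1
        if errors == 0:                           # repeat until no errors
            break
    return w

# Tiny linearly separable example (first column is the constant bias feature).
X = np.array([[1.0, 2.0, 1.0], [1.0, 1.5, 2.0], [1.0, -1.0, -0.5], [1.0, -2.0, -1.5]])
y = np.array([+1, +1, -1, -1])
w = perceptron(X, y)
print(np.sign(X @ w))   # [ 1.  1. -1. -1.]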
Page 22: Ivan  Titov

Linear classification

Linearly separable case, a "perfect" classifier:

[Figure: two classes of points in the plane ($\varphi(x)_1$, $\varphi(x)_2$) separated by the hyperplane $w \cdot \varphi(x) + b = 0$, with normal vector $w$.]

Linear functions are often written as $y = \mathrm{sign}(w \cdot \varphi(x) + b)$, but we can assume that $\varphi(x)_0 = 1$ for any x, so the bias b can be absorbed into w.

Figure adapted from Dan Roth’s class at UIUC

Page 23: Ivan  Titov

Perceptron: geometric interpretation

if ( y(l) (w · φ(x(l))) < 0 )      // if mistake
  w += η y(l) φ(x(l))              // update
endif

Page 24: Ivan  Titov

Perceptron: geometric interpretation

if ( y(l) (w · φ(x(l))) < 0 )      // if mistake
  w += η y(l) φ(x(l))              // update
endif

Page 25: Ivan  Titov

Perceptron: geometric interpretation

if ( y(l) (w · φ(x(l))) < 0 )      // if mistake
  w += η y(l) φ(x(l))              // update
endif

Page 26: Ivan  Titov

Perceptron: geometric interpretation

if ( y(l) (w · φ(x(l))) < 0 )      // if mistake
  w += η y(l) φ(x(l))              // update
endif

Page 27: Ivan  Titov

Perceptron: algebraic interpretation

if ( y(l) (w · φ(x(l))) < 0 )      // if mistake
  w += η y(l) φ(x(l))              // update
endif

We want the update to increase $y^{(l)} \big(w \cdot \varphi(x^{(l)})\big)$. If the increase is large enough, there will be no misclassification.

Let's see that this is what happens after the update:
$y^{(l)} \big((w + \eta y^{(l)} \varphi(x^{(l)})) \cdot \varphi(x^{(l)})\big) = y^{(l)} \big(w \cdot \varphi(x^{(l)})\big) + \eta (y^{(l)})^2 \big(\varphi(x^{(l)}) \cdot \varphi(x^{(l)})\big)$

where $(y^{(l)})^2 = 1$ and $\varphi(x^{(l)}) \cdot \varphi(x^{(l)})$ is a squared norm, $> 0$.

So the perceptron update moves the decision hyperplane towards the misclassified example $\varphi(x^{(l)})$.

Page 28: Ivan  Titov


Perceptron

The perceptron algorithm, obviously, can only converge if the training set is linearly separable

It is guaranteed to converge in a finite number of iterations, depending on how well the two classes are separated (Novikoff, 1962)

Page 29: Ivan  Titov

Averaged Perceptron

A small modification (do not run until convergence; run for a fixed number of iterations K):

w = 0, wΣ = 0                        // initialize
for k = 1 .. K                       // for a number of iterations
  for l = 1 .. L                     // over the training examples
    if ( y(l) (w · φ(x(l))) < 0 )    // if mistake
      w += η y(l) φ(x(l))            // update, η > 0
    endif
    wΣ += w                          // sum of w over the course of training (note: after the endif)
  endfor
endfor
return wΣ / (K L)                    // the averaged weight vector

More stable in training: a weight vector w which survived more iterations without updates is more similar to the resulting vector wΣ / (K L), as it was added a larger number of times.

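The same loop in NumPy, as a minimal sketch (variable names are mine):

import numpy as np

def averaged_perceptron(X, y, eta=1.0, epochs=10):
    # Run exactly `epochs` = K passes (do not run until convergence) and
    # return the average of w over all K * L inner steps.
    w = np.zeros(X.shape[1])
    w_sum = np.zeros(X.shape[1])
    for _ in range(epochs):                      # k = 1 .. K
        for phi, label in zip(X, y):             # l = 1 .. L
            if label * np.dot(w, phi) <= 0:      # mistake
                w += eta * label * phi           # update
            w_sum += w                           # accumulate after the if, on every step
    return w_sum / (epochs * len(X))             # averaged weight vector

X = np.array([[1.0, 2.0, 1.0], [1.0, 1.5, 2.0], [1.0, -1.0, -0.5], [1.0, -2.0, -1.5]])
y = np.array([+1, +1, -1, -1])
print(np.sign(X @ averaged_perceptron(X, y)))    # [ 1.  1. -1. -1.]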
Page 30: Ivan  Titov

Structured Perceptron

Now let us turn to the structured problem: $\hat{y} = \arg\max_{y' \in \mathcal{Y}(x)} w \cdot \varphi(x, y')$

Perceptron algorithm, given a training set $\{x^{(l)}, y^{(l)}\}_{l=1}^{L}$ and $\eta > 0$:

w = 0                                                  // initialize
do
  err = 0
  for l = 1 .. L                                       // over the training examples
    ŷ = argmax_{y' ∈ Y(x(l))} w · φ(x(l), y')           // model prediction
    if ( w · φ(x(l), ŷ) > w · φ(x(l), y(l)) )           // if mistake
      w += η ( φ(x(l), y(l)) - φ(x(l), ŷ) )             // update
      err++                                            // count errors
    endif
  endfor
while ( err > 0 )                                      // repeat until no errors
return w

The update pushes the correct sequence up and the incorrectly predicted one down.

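A minimal Python sketch of this loop for tag sequences. The toy feature map, data, and fixed number of epochs are my own choices; the argmax is a brute-force enumeration over all tag sequences so that the example stays self-contained (the Viterbi algorithm described later makes this step efficient), and the update fires whenever the predicted sequence differs from the gold one, a common way to implement the mistake test above:

from collections import Counter
from itertools import product

TAGS = ["D", "N", "V"]                                 # toy tag set (hypothetical)

def phi(x, y):
    # Sparse feature counts: word-tag unary features and tag-tag edge features.
    feats = Counter()
    feats.update(("unary", w, t) for w, t in zip(x, y))
    feats.update(("edge", p, n) for p, n in zip(["$"] + list(y), list(y) + ["$"]))
    return feats

def score(w, feats):
    return sum(c * w.get(f, 0.0) for f, c in feats.items())

def decode(w, x):
    # Brute-force argmax over all tag sequences; Viterbi (later) does this efficiently.
    return max(product(TAGS, repeat=len(x)), key=lambda y: score(w, phi(x, y)))

def structured_perceptron(data, eta=1.0, epochs=10):
    w = Counter()                                      # initialize
    for _ in range(epochs):
        for x, y in data:                              # over the training examples
            y_hat = decode(w, x)                       # model prediction
            if tuple(y_hat) != tuple(y):               # mistake
                for f, c in phi(x, y).items():         # push the correct sequence up
                    w[f] += eta * c
                for f, c in phi(x, y_hat).items():     # and the predicted one down
                    w[f] -= eta * c
    return w

data = [("the dog barks".split(), ["D", "N", "V"]),
        ("a cat sleeps".split(), ["D", "N", "V"])]
w = structured_perceptron(data)
print(decode(w, "the cat barks".split()))              # ('D', 'N', 'V')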
Page 31: Ivan  Titov

Structured perceptron: algebraic interpretation

if ( w · φ(x(l), ŷ) > w · φ(x(l), y(l)) )      // if mistake
  w += η ( φ(x(l), y(l)) - φ(x(l), ŷ) )        // update
endif

We want the update to increase $w \cdot \big(\varphi(x^{(l)}, y^{(l)}) - \varphi(x^{(l)}, \hat{y})\big)$. If the increase is large enough, then $y^{(l)}$ will be scored above $\hat{y}$.

Clearly, this is achieved, as the update increases this product by $\eta\, \|\varphi(x^{(l)}, y^{(l)}) - \varphi(x^{(l)}, \hat{y})\|^2$.

There might be other $y' \in \mathcal{Y}(x^{(l)})$ still scored above $y^{(l)}$, but we will deal with them in the next iterations.

Page 32: Ivan  Titov

Structured Perceptron

Positives:
Very easy to implement.
Often achieves respectable results.
Like other discriminative techniques, it does not make assumptions about the generative process.
Additional features can be easily integrated, as long as decoding stays tractable.

Drawbacks:
"Good" discriminative algorithms should optimize some measure which is closely related to the expected testing results; what the perceptron optimizes on non-linearly-separable data is not clear.
However, for the averaged (voted) version there is a generalization bound which characterizes the generalization properties of the perceptron (Freund & Schapire, 98).

Later, we will consider more advanced learning algorithms.

Page 33: Ivan  Titov


Outline

Sequence labeling / segmentation problems: settings and example problems:

Part-of-speech tagging, named entity recognition, gesture recognition

Hidden Markov Models: standard definition + maximum likelihood estimation; general view: as a representative of linear models

Perceptron and Structured Perceptron algorithms / motivations

Decoding with the Linear Model

Discussion: Discriminative vs. Generative

Page 34: Ivan  Titov

Decoding with the Linear Model

Decoding: $\hat{y} = \arg\max_{y' \in \mathcal{Y}(x)} w \cdot \varphi(x, y')$

Again a linear model with the following edge features (a generalization of an HMM). In fact, the algorithm does not depend on the features of the input (they do not need to be local).

[Figure: the chain y(1) → y(2) → y(3) → y(4) with observed words x(1), …, x(4).]

Page 35: Ivan  Titov

Decoding with the Linear Model

Decoding: $\hat{y} = \arg\max_{y' \in \mathcal{Y}(x)} w \cdot \varphi(x, y')$

Again a linear model with the following edge features (a generalization of an HMM). In fact, the algorithm does not depend on the features of the input (they do not need to be local).

[Figure: the same chain y(1) → y(2) → y(3) → y(4), now with the whole input x drawn as a single observed node.]

Page 36: Ivan  Titov

Decoding with the Linear Model

Decoding: $\hat{y} = \arg\max_{y' \in \mathcal{Y}(x)} w \cdot \varphi(x, y')$

Let's change notation. Edge scores $f_t(y_{t-1}, y_t, x)$: for the HMM this roughly corresponds to $\log a_{y_{t-1}, y_t} + \log b_{y_t, x_t}$.
Defined at the start boundary too ("start" feature: $y_0 = \$$); start/stop symbol information ($) can be encoded with the edge scores as well.

Decode: $\hat{y} = \arg\max_{y' \in \mathcal{Y}(x)} \sum_{t=1}^{|x|} f_t(y'_{t-1}, y'_t, x)$

Decoding is done with a dynamic programming algorithm: the Viterbi algorithm.

[Figure: the chain y(1) → … → y(4) with the input x.]

Page 37: Ivan  Titov

Viterbi algorithm

Decoding: $\hat{y} = \arg\max_{y' \in \mathcal{Y}(x)} \sum_{t=1}^{|x|} f_t(y'_{t-1}, y'_t, x)$

Loop invariant (for t = 1, …, |x|):
score_t[y] – the score of the highest-scoring label sequence up to position t that ends with tag y
prev_t[y] – the previous tag on this sequence

Init: score_0[$] = 0, score_0[y] = -∞ for all other y
Recomputation (for t = 1, …, |x|):
prev_t[y] = argmax_{y'} score_{t-1}[y'] + f_t(y', y, x)
score_t[y] = score_{t-1}[prev_t[y]] + f_t(prev_t[y], y, x)
Return: retrace the prev pointers starting from argmax_y score_{|x|}[y]

Time complexity: O(N² |x|)

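A minimal Python sketch of the algorithm just described, written for a generic edge-score function f(t, y_prev, y, x); the HMM-style score helper, the toy parameters, and the tiny smoothing constant (to avoid log(0)) are my own assumptions:

import math

def viterbi(x, tags, f):
    # x: input sequence; tags: the tag set; f(t, y_prev, y, x): edge score.
    # Returns the highest-scoring tag sequence in O(N^2 |x|) time.
    score = [{"$": 0.0}]                               # score_0[$] = 0, all other tags -inf
    prev = [{}]
    for t in range(1, len(x) + 1):
        score.append({})
        prev.append({})
        for y in tags:
            best = max(score[t - 1], key=lambda yp: score[t - 1][yp] + f(t, yp, y, x))
            prev[t][y] = best
            score[t][y] = score[t - 1][best] + f(t, best, y, x)
    y = max(score[len(x)], key=score[len(x)].get)      # best final tag
    path = [y]
    for t in range(len(x), 1, -1):                     # retrace the prev pointers
        y = prev[t][y]
        path.append(y)
    return list(reversed(path))

def make_hmm_scores(a, b, smooth=1e-6):
    # HMM-style edge scores: f_t(y_prev, y, x) = log a[y_prev][y] + log b[y][x_t].
    def f(t, y_prev, y, x):
        return math.log(a.get(y_prev, {}).get(y, smooth)) + math.log(b.get(y, {}).get(x[t - 1], smooth))
    return f

a = {"$": {"D": 0.9, "N": 0.1}, "D": {"N": 1.0}, "N": {"V": 0.6, "N": 0.4}}
b = {"D": {"the": 1.0}, "N": {"dog": 0.7, "barks": 0.3}, "V": {"barks": 1.0}}
print(viterbi("the dog barks".split(), ["D", "N", "V"], make_hmm_scores(a, b)))   # ['D', 'N', 'V']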
Page 38: Ivan  Titov


Outline

Sequence labeling / segmentation problems: settings and example problems:

Part-of-speech tagging, named entity recognition, gesture recognition

Hidden Markov Models: standard definition + maximum likelihood estimation; general view: as a representative of linear models

Perceptron and Structured Perceptron algorithms / motivations

Decoding with the Linear Model

Discussion: Discriminative vs. Generative

Page 39: Ivan  Titov

Recap: Sequence Labeling

Hidden Markov Models: how to estimate them.
Discriminative models: how to learn with the structured perceptron.
Both learning algorithms result in a linear model.
How to label with the linear models (Viterbi decoding).

Page 40: Ivan  Titov

Discriminative vs Generative

Generative models:
Cheap to estimate: simply normalized counts.
Hard to integrate complex features: you need to come up with a generative story, and this story may be wrong.
Do not result in an optimal classifier when the model assumptions are wrong (i.e., always).

Discriminative models:
More expensive to learn: you need to run decoding (here, Viterbi) during training, usually multiple times per example.
Easy to integrate features: though some features may make decoding intractable.
Usually less accurate on small datasets (not necessarily the case for generative models with latent variables).

Page 41: Ivan  Titov

Reminders

Speakers: slides are due about a week before the talk; meetings with me before/after this point will normally be needed.
Reviewers: reviews are accepted only before the day we consider the topic.
Everyone: references to the papers to read are on GoogleDocs.
These slides (and the previous ones) will be online today; speakers, please send me the final version of your slides too.

Next time: Lea on Models of Parsing, PCFGs vs general WCFGs (Michael Collins' book chapter).