
CS546: Machine Learning and Natural Language

Multi-Class and Structured Prediction Problems

Slides from Taskar and Klein are used in this lecture


Outline

– Multi-Class classification
– Structured Prediction
– Models for Structured Prediction and Classification
  • Example of POS tagging


Multiclass problems

– Most of the machinery we have talked about was focused on binary classification problems
  • e.g., the SVMs we discussed so far
– However, most problems we encounter in NLP are either:
  • MultiClass: e.g., text categorization
  • Structured Prediction: e.g., predicting the syntactic structure of a sentence
– How do we deal with them?


Binary linear classification


Multiclass classification


Perceptron


Structured Perceptron

• Joint feature representation:
• Algorithm:
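The slide's own formulas are not reproduced in this text, so the following is only a minimal sketch of a perceptron with a joint feature representation, under stated assumptions: the joint feature map f(x, y) is taken to be the common "copy x into the block of class y" construction, and the names `joint_features`, `predict`, and `train_perceptron` are hypothetical. For truly structured outputs, the argmax over y would be replaced by a search procedure such as Viterbi.

```python
import numpy as np

def joint_features(x, y, num_classes):
    """Assumed joint feature map f(x, y): x copied into the block of class y."""
    f = np.zeros(num_classes * x.shape[0])
    f[y * x.shape[0]:(y + 1) * x.shape[0]] = x
    return f

def predict(w, x, num_classes):
    """Return argmax_y w . f(x, y) by enumerating all outputs."""
    scores = [w @ joint_features(x, y, num_classes) for y in range(num_classes)]
    return int(np.argmax(scores))

def train_perceptron(X, Y, num_classes, epochs=10):
    """Mistake-driven update: promote the gold output, demote the predicted one."""
    w = np.zeros(num_classes * X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(X, Y):
            y_hat = predict(w, x, num_classes)
            if y_hat != y:
                w += joint_features(x, y, num_classes)
                w -= joint_features(x, y_hat, num_classes)
                mistakes += 1
        if mistakes == 0:  # every training example is classified correctly
            break
    return w

# Toy data (illustrative only): three points, one per class.
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
Y = np.array([0, 1, 2])
w = train_perceptron(X, Y, num_classes=3)
predictions = [predict(w, x, 3) for x in X]
print(predictions)
```

Enumerating every y is only feasible when the output set is small; that is the step that changes when moving from multiclass to structured prediction.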


Perceptron


Binary Classification Margin


Generalize to MultiClass


Converting to MultiClass SVM


Max margin = Min Norm


• As before, these are equivalent formulations:
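The formulas from this slide did not survive in the text. As a hedged reconstruction, the standard textbook form of the two multiclass problems is sketched below; the notation (weights w, joint features f(x, y), margin γ) is an assumption and may differ from the original slide.

```latex
% Max margin: largest gamma achievable by a unit-norm weight vector
\max_{\gamma,\; \|w\| = 1} \; \gamma
\quad \text{s.t.} \quad
w^\top f(x_i, y_i) - w^\top f(x_i, y) \ge \gamma
\qquad \forall i,\; \forall y \ne y_i

% Min norm: fix the margin to 1 and minimize the norm instead
\min_{w} \; \tfrac{1}{2} \|w\|^2
\quad \text{s.t.} \quad
w^\top f(x_i, y_i) - w^\top f(x_i, y) \ge 1
\qquad \forall i,\; \forall y \ne y_i
```

Rescaling w by 1/γ maps a solution of the first problem to a feasible point of the second, which is why maximizing the margin and minimizing the norm select the same direction.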


Problems:


• Requires separability
• What if we have noise in the data?
• What if we only have a small, simple feature space?


Non-separable case


Non-separable case
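The slide content for the non-separable case is missing here. A common way to handle it, assumed in this sketch, is to add one slack variable per example, which at the optimum equals the multiclass hinge loss. The per-class weight matrix `W`, the constant `C`, and the function names are illustrative choices, not taken from the slides.

```python
import numpy as np

def multiclass_hinge(W, x, y):
    """Slack for one example: max(0, 1 + max_{y' != y} W[y'].x - W[y].x)."""
    scores = W @ x
    rivals = np.delete(scores, y)  # scores of all wrong classes
    return max(0.0, 1.0 + rivals.max() - scores[y])

def objective(W, X, Y, C=1.0):
    """Soft-margin objective: 0.5 * ||W||^2 + C * sum of slacks."""
    slack = sum(multiclass_hinge(W, x, y) for x, y in zip(X, Y))
    return 0.5 * np.sum(W ** 2) + C * slack

# Toy values (illustrative only): one row of W per class.
W = np.array([[2.0, 0.0], [0.0, 2.0]])
X = np.array([[1.0, 0.0], [0.4, 0.6]])
Y = np.array([0, 1])
slacks = [multiclass_hinge(W, x, y) for x, y in zip(X, Y)]
print(slacks, objective(W, X, Y))
```

The first example is classified with a margin of at least 1 and needs no slack; the second is classified correctly but inside the margin, so it pays a positive slack.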


Compare with MaxEnt


Loss Comparison
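The comparison plot is not available in this text. As a rough illustration, the sketch below evaluates three losses as functions of the margin m (score of the correct label minus score of the best competitor). The binary-style log loss log(1 + exp(-m)) stands in for the MaxEnt loss; this simplification and the function names are assumptions.

```python
import math

def zero_one_loss(margin):
    """1 if the example is misclassified (non-positive margin), else 0."""
    return 1.0 if margin <= 0 else 0.0

def hinge_loss(margin):
    """SVM loss: zero only once the margin reaches 1."""
    return max(0.0, 1.0 - margin)

def log_loss(margin):
    """MaxEnt-style loss (binary simplification): never exactly zero."""
    return math.log(1.0 + math.exp(-margin))

margins = (-2.0, 0.0, 0.5, 1.0, 3.0)
for m in margins:
    print(f"m={m:+.1f}  0-1={zero_one_loss(m):.0f}  "
          f"hinge={hinge_loss(m):.2f}  log={log_loss(m):.3f}")
```

The hinge loss upper-bounds the 0-1 loss and is exactly zero beyond the margin, while the log loss keeps rewarding larger margins.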


Multiclass -> Structured

• So far, we considered multiclass classification
• 0-1 losses l(y, y')
• What if what we want to predict is:
  – sequences of POS tags
  – syntactic trees
  – translations


Predicting word alignments


Predicting Syntactic Trees


Structured Models


Parsing


Max Margin Markov Networks (M3Ns)


Taskar et al, 2003; similar Tsochantaridis et al, 2004


Max Margin Markov Networks (M3Ns)


MultiClass Classification

Solving MultiClass with binary learning

• MultiClass classifier
  – Function f : R^d → {1, 2, 3, ..., k}
• Decompose into binary problems
• Not always possible to learn
• Different scale
• No theoretical justification

Real Problem


Learning via One-Versus-All (OvA) Assumption

• Find v_r, v_b, v_g, v_y ∈ R^n such that
  – v_r·x > 0 iff y = red
  – v_b·x > 0 iff y = blue
  – v_g·x > 0 iff y = green
  – v_y·x > 0 iff y = yellow
• Classifier: f(x) = argmax_i v_i·x
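A minimal sketch of the One-Versus-All scheme above, assuming a plain binary perceptron as the underlying learner (the slides do not fix one) and a constant bias feature appended to each input. The helper names and the three-colour toy data are hypothetical.

```python
import numpy as np

def train_binary_perceptron(X, targets, epochs=20):
    """Learn v such that sign(v . x) matches targets in {-1, +1}."""
    v = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, t in zip(X, targets):
            if t * (v @ x) <= 0:  # mistake or zero score: update
                v += t * x
    return v

def train_ova(X, Y, classes):
    """One classifier per class: that class (+1) against all the rest (-1)."""
    return {c: train_binary_perceptron(X, np.where(Y == c, 1, -1))
            for c in classes}

def predict_ova(models, x):
    """f(x) = argmax_i v_i . x over the per-class weight vectors."""
    return max(models, key=lambda c: models[c] @ x)

# Toy data (illustrative only); the last feature is a constant bias term.
X = np.array([[2.0, 0.0, 1.0], [0.0, 2.0, 1.0], [-2.0, -2.0, 1.0]])
Y = np.array(["red", "blue", "green"])
models = train_ova(X, Y, classes=["red", "blue", "green"])
predictions = [predict_ova(models, x) for x in X]
print(predictions)
```

Each v_i is trained independently, so the scores compared in the argmax are not calibrated against each other; this is the "different scale" concern listed earlier.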

Individual Classifiers

Decision Regions

H = R^{kn}


Learning via All-Versus-All (AvA) Assumption

• Find v_rb, v_rg, v_ry, v_bg, v_by, v_gy ∈ R^d such that
  – v_rb·x > 0 if y = red, < 0 if y = blue
  – v_rg·x > 0 if y = red, < 0 if y = green
  – ... (for all pairs)

Individual Classifiers

Decision Regions

H = R^{kkn}

How to classify?


Classifying with AvA

Tree

Majority Vote: 1 red, 2 yellow, 2 green?

Tournament

All of these are applied after learning and may produce inconsistent decisions.
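A minimal sketch of All-Versus-All with majority voting, again assuming a binary perceptron as the pairwise learner and a constant bias feature; all names and the toy data are hypothetical. Ties in the vote are broken by whichever label was counted first, which is one concrete way the post-learning combination step can behave arbitrarily.

```python
from collections import Counter
from itertools import combinations

import numpy as np

def train_binary_perceptron(X, targets, epochs=20):
    """Learn v such that sign(v . x) matches targets in {-1, +1}."""
    v = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, t in zip(X, targets):
            if t * (v @ x) <= 0:
                v += t * x
    return v

def train_ava(X, Y, classes):
    """One classifier per pair (a, b), trained only on examples of a and b."""
    models = {}
    for a, b in combinations(classes, 2):
        mask = (Y == a) | (Y == b)
        targets = np.where(Y[mask] == a, 1, -1)
        models[(a, b)] = train_binary_perceptron(X[mask], targets)
    return models

def predict_ava(models, x):
    """Majority vote: each pairwise classifier votes for one of its two labels."""
    votes = Counter(a if v @ x > 0 else b for (a, b), v in models.items())
    return votes.most_common(1)[0][0]

# Toy data (illustrative only); the last feature is a constant bias term.
X = np.array([[2.0, 0.0, 1.0], [0.0, 2.0, 1.0], [-2.0, -2.0, 1.0]])
Y = np.array(["red", "blue", "green"])
models = train_ava(X, Y, classes=["red", "blue", "green"])
predictions = [predict_ava(models, x) for x in X]
print(predictions)
```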


POS Tagging


• English tags


POS Tagging, examples from WSJ

From McCallum


POS Tagging


• Ambiguity: not a trivial task

• Useful task:
  – important features for other steps are based on POS
  – e.g., use POS as input to a parser

But still, why is it so popular?

– Historically the first statistical NLP problem
– Easy to apply arbitrary classifiers:
  • both sequence models and independent classifiers
– Can be regarded as a finite-state problem
– Easy to evaluate
– Annotation is cheaper to obtain than treebanks (other languages)


HMM (reminder)


HMM (reminder) - transitions


Transition Estimates


Emission Estimates
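The estimation formulas from the transition and emission slides are not preserved in this text. The sketch below shows plain relative-frequency (maximum likelihood) estimates from a tagged corpus. It assumes a bigram HMM with a start symbol `<s>` and no smoothing, which the original slides may well have added; the function name and the toy corpus are hypothetical.

```python
from collections import Counter

def estimate_hmm(tagged_sentences, start="<s>"):
    """Relative-frequency estimates of P(tag | previous tag) and P(word | tag)."""
    transition_counts, emission_counts = Counter(), Counter()
    context_counts, tag_counts = Counter(), Counter()
    for sentence in tagged_sentences:
        previous = start
        for word, tag in sentence:
            transition_counts[(previous, tag)] += 1
            context_counts[previous] += 1
            emission_counts[(tag, word)] += 1
            tag_counts[tag] += 1
            previous = tag
    transitions = {key: count / context_counts[key[0]]
                   for key, count in transition_counts.items()}
    emissions = {key: count / tag_counts[key[0]]
                 for key, count in emission_counts.items()}
    return transitions, emissions

# Toy corpus (illustrative only).
corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
transitions, emissions = estimate_hmm(corpus)
print(transitions[("DT", "NN")], emissions[("NN", "dog")])
```

Without smoothing, any unseen word or tag bigram gets probability zero, so a practical tagger needs some treatment of unknown words.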


MaxEnt (reminder)


Decoding: HMM vs MaxEnt
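The decoding details on this slide are missing. As a hedged illustration, the sketch below implements Viterbi decoding for a bigram HMM in log space. The probability tables, the start symbol `<s>`, and the `floor` value used for unseen events are assumptions made for the example. A MaxEnt tagger that conditions on the previous tag can be decoded with the same dynamic program by replacing the transition-times-emission term with its local conditional probability.

```python
import math

def viterbi(words, tags, trans, emit, floor=1e-12):
    """Most probable tag sequence under a bigram HMM (log-space Viterbi)."""
    def lp(table, key):
        return math.log(table.get(key, floor))  # floor stands in for unseen events

    best = {t: lp(trans, ("<s>", t)) + lp(emit, (t, words[0])) for t in tags}
    backpointers = []
    for word in words[1:]:
        new_best, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[p] + lp(trans, (p, t)))
            new_best[t] = best[prev] + lp(trans, (prev, t)) + lp(emit, (t, word))
            pointers[t] = prev
        best = new_best
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))

# Toy model (illustrative numbers only).
tags = ["DT", "NN", "VB"]
trans = {("<s>", "DT"): 0.8, ("<s>", "NN"): 0.2, ("DT", "NN"): 0.9,
         ("DT", "VB"): 0.1, ("NN", "VB"): 0.6, ("NN", "NN"): 0.4}
emit = {("DT", "the"): 1.0, ("NN", "dog"): 0.5, ("NN", "barks"): 0.1,
        ("VB", "barks"): 0.9, ("VB", "dog"): 0.1}
path = viterbi(["the", "dog", "barks"], tags, trans, emit)
print(path)
```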


Accuracies overview


Accuracies overview


SVMs for tagging

– We can use SVMs in a similar way to MaxEnt (or other classifiers)
– We can use a window around the word
– 97.16% on WSJ
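A minimal sketch of what "a window around the word" could look like as a feature extractor for a per-token classifier. The feature naming scheme, the window size, the lowercasing, and the `<pad>` token are all assumptions; the feature set actually used to obtain the reported accuracy is not given in the slides.

```python
def window_features(words, i, size=2):
    """Indicator features for the words at offsets -size..+size around position i."""
    features = {}
    for offset in range(-size, size + 1):
        j = i + offset
        token = words[j].lower() if 0 <= j < len(words) else "<pad>"
        features[f"w[{offset:+d}]={token}"] = 1.0
    return features

sentence = ["The", "dog", "barks"]
features = window_features(sentence, 1, size=1)
print(sorted(features))
```

Each token is then classified independently from its feature dictionary, so no tag sequence information is used unless previously predicted tags are added as extra features.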


SVMs for tagging


from Giménez & Màrquez


No sequence modeling


CRFs and other global models


CRFs and other global models


Compare


CRFs: no local normalization

MEMMs: after each step t, the remaining probability mass cannot be reduced; it can only be distributed among the possible state transitions

HMMs


Label Bias


based on a slide from Joe Drish


Label Bias


• Recall transition-based parsing: Nivre's algorithm (with beam search)
• At each step we can observe only local features (limited look-ahead)
• If we later see that the following word is impossible, we can only distribute probability uniformly across all (im-)possible decisions
• If there is only a small number of such decisions, we cannot decrease the probability dramatically
• So, label bias is likely to be a serious problem if:
  – there are non-local dependencies
  – states have a small number of possible outgoing transitions
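The effect described above can be seen in a small numeric sketch. It assumes a locally normalized model (as in an MEMM) where each state's outgoing scores are passed through a softmax; the scores and the helper name `local_normalize` are made up for illustration.

```python
import math

def local_normalize(scores):
    """Softmax over the outgoing transitions of a single state."""
    z = sum(math.exp(s) for s in scores.values())
    return {label: math.exp(s) / z for label, s in scores.items()}

# A state with one outgoing transition: even a very poor score (the observation
# does not fit at all) is renormalized to probability 1.
single = local_normalize({"only_next": -50.0})

# A state with several outgoing transitions: the same poor score is penalized.
several = local_normalize({"a": -50.0, "b": 2.0, "c": 1.0})

print(single["only_next"], several["a"])
```

With a single outgoing transition the observation is effectively ignored, which is why states with few outgoing transitions are the problematic case. A globally normalized model such as a CRF keeps the raw score and normalizes once over whole sequences.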

POS Tagging Experiments

– “+” is an extended feature set (hard to integrate in a generative model)

– oov: out-of-vocabulary


Supervision


– So far, we considered the supervised case
  • The training set is labeled
– However, we can try to induce word classes without supervision
  • Unsupervised tagging
  • We will later discuss the EM algorithm
– It can also be done in a partly supervised way:
  • Seed tags
  • Small labeled dataset
  • Parallel corpus
  • ...

Why not predict POS + parse trees simultaneously?

– It is possible and often done this way
– Doing tagging internally often benefits parsing accuracy
– Unfortunately, parsing models are less robust than taggers
  • e.g., non-grammatical sentences, different domains
– It is more expensive and does not help...


Questions


• Why is there no label-bias problem for a generative model (e.g., an HMM)?
• How would you integrate word features in a generative model (e.g., HMMs for POS tagging)?
  – e.g., if the word has:
    • -ing, -s, -ed, -d, -ment, ...
    • post-, de-, ...


“CRFs” for more complex structured output problems


• We considered sequence labeling problems
• Here, the structure of dependencies is fixed
• What if we do not know the structure but would like to have interactions respecting the structure?


• Recall, we had the MST algorithm (McDonald and Pereira, 05)


• Complex inference
  – E.g., arbitrary 2nd-order dependency parsing models are not tractable in the non-projective case: NP-complete (McDonald & Pereira, EACL 06)
• Recently, conditional models for constituent parsing:
  – (Finkel et al, ACL 08)
  – (Carreras et al, CoNLL 08)
  – ...


Back to MultiClass


– Let us review how to decompose a multiclass problem into binary classification problems


Summary


• Margin-based method for multiclass classification and structured prediction

• CRFs vs HMMs vs MEMMs for POS tagging


Conclusions

• All approaches use a linear representation
• The differences are:
  – Features
  – How to learn weights
  – Training paradigms:
    • Global Training (CRF, Global Perceptron)
    • Modular Training (PMM, MEMM, ...)
      – These approaches are easier to train, but may require additional mechanisms to enforce global constraints.
