CS546: Machine Learning and Natural Language. Multi-Class and Structured Prediction Problems. Slides from Taskar and Klein are used in this lecture.

Dec 29, 2015


Alannah Malone

CS546: Machine Learning and Natural Language

Multi-Class and Structured Prediction Problems

Slides from Taskar and Klein are used in this lecture


Outline

– Multi-Class classification
– Structured Prediction
– Models for Structured Prediction and Classification
  • Example of POS tagging


Multiclass problems

– Most of the machinery we discussed before focused on binary classification problems
  – e.g., the SVMs we have discussed so far
– However, most problems we encounter in NLP are either:
  • MultiClass: e.g., text categorization
  • Structured Prediction: e.g., predicting the syntactic structure of a sentence
– How do we deal with them?


Binary linear classification


Multiclass classification


Perceptron


Structured Perceptron

• Joint feature representation:
• Algorithm:
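A minimal sketch of the perceptron with a joint feature representation, under stated assumptions: the feature map `joint_features` (each input feature paired with the label), the toy data, and all function names are hypothetical illustrations, not the lecture's own definitions. For truly structured outputs the `max` over labels would be replaced by a search procedure such as Viterbi.

```python
from collections import defaultdict

def joint_features(x, y):
    """Hypothetical joint feature map f(x, y): pair each input feature with the label."""
    return {(feat, y): value for feat, value in x.items()}

def score(w, x, y):
    """Linear score w . f(x, y)."""
    return sum(w[k] * v for k, v in joint_features(x, y).items())

def predict(w, x, labels):
    """Return the highest-scoring output among the candidate labels."""
    return max(labels, key=lambda y: score(w, x, y))

def train_perceptron(data, labels, epochs=10):
    """On each mistake, add f(x, y_gold) and subtract f(x, y_predicted)."""
    w = defaultdict(float)
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:
            y_hat = predict(w, x, labels)
            if y_hat != y:
                mistakes += 1
                for k, v in joint_features(x, y).items():
                    w[k] += v
                for k, v in joint_features(x, y_hat).items():
                    w[k] -= v
        if mistakes == 0:  # a full pass without errors: stop early
            break
    return w

# Toy, linearly separable data (made up for illustration).
labels = ["sports", "politics", "finance"]
data = [
    ({"bias": 1.0, "goal": 1.0}, "sports"),
    ({"bias": 1.0, "vote": 1.0}, "politics"),
    ({"bias": 1.0, "stock": 1.0}, "finance"),
]
w = train_perceptron(data, labels)
```

The update touches only the features of the gold and the predicted outputs, which is what lets the same rule carry over from multiclass to structured prediction.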


Perceptron


Binary Classification Margin


Generalize to MultiClass


Converting to MultiClass SVM


Max margin = Min Norm

• As before, these are equivalent formulations:


Problems:

• Requires separability
• What if we have noise in the data?
• What if the feature space is too simple?


Non-separable case


Non-separable case


Compare with MaxEnt


Loss Comparison


Multiclass -> Structured

• So far, we considered multiclass classification
• 0-1 losses l(y,y’)
• What if what we want to predict is:
  • sequences of POS tags
  • syntactic trees
  • translations


Predicting word alignments


Predicting Syntactic Trees


Structured Models


Parsing


Max Margin Markov Networks (M3Ns)

Taskar et al, 2003; similar Tsochantaridis et al, 2004


Max Margin Markov Networks (M3Ns)


MultiClass Classification

Solving MultiClass with binary learning

• MultiClass classifier
  – Function $f : \mathbb{R}^d \to \{1, 2, 3, \ldots, k\}$
• Decompose into binary problems
• Not always possible to learn
• Different scale
• No theoretical justification

[Figure: Real Problem]


Learning via One-Versus-All (OvA) Assumption

• Find $v_r, v_b, v_g, v_y \in \mathbb{R}^n$ such that
  – $v_r \cdot x > 0$ iff y = red
  – $v_b \cdot x > 0$ iff y = blue
  – $v_g \cdot x > 0$ iff y = green
  – $v_y \cdot x > 0$ iff y = yellow
• Classifier $f(x) = \arg\max_i v_i \cdot x$

[Figures: Individual Classifiers; Decision Regions]

$H = \mathbb{R}^{kn}$
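The One-Versus-All scheme above can be sketched as follows. This is an illustration only: the binary learner (a plain perceptron), the helper names, and the four toy points with an appended bias feature are assumptions, not part of the lecture. Each class gets its own weight vector trained to separate that class from the rest, and prediction takes the argmax of the scores.

```python
def dot(v, x):
    return sum(vi * xi for vi, xi in zip(v, x))

def train_binary_perceptron(examples, epochs=50):
    """examples: list of (x, t) with t in {+1, -1}; returns a weight vector."""
    v = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, t in examples:
            if t * dot(v, x) <= 0:  # misclassified (or on the boundary)
                v = [vi + t * xi for vi, xi in zip(v, x)]
    return v

def train_ova(data, classes):
    """One binary problem per class: this class (+1) versus all others (-1)."""
    return {
        c: train_binary_perceptron([(x, 1 if y == c else -1) for x, y in data])
        for c in classes
    }

def classify(models, x):
    """f(x) = argmax_i v_i . x"""
    return max(models, key=lambda c: dot(models[c], x))

# Toy data: (x1, x2, bias); each class is linearly separable from the rest.
classes = ["red", "blue", "green", "yellow"]
data = [
    ([2.0, 0.0, 1.0], "red"),
    ([-2.0, 0.0, 1.0], "blue"),
    ([0.0, 2.0, 1.0], "green"),
    ([0.0, -2.0, 1.0], "yellow"),
]
models = train_ova(data, classes)
```

Because the k classifiers are trained independently, their scores are not calibrated against each other; the argmax works here only because the OvA assumption holds for this toy data.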


Learning via All-Versus-All (AvA) Assumption

• Find $v_{rb}, v_{rg}, v_{ry}, v_{bg}, v_{by}, v_{gy} \in \mathbb{R}^d$ such that
  – $v_{rb} \cdot x > 0$ if y = red, $< 0$ if y = blue
  – $v_{rg} \cdot x > 0$ if y = red, $< 0$ if y = green
  – ... (for all pairs)

[Figures: Individual Classifiers; Decision Regions]

$H = \mathbb{R}^{kkn}$

How to classify?


Classifying with AvA

• Tree
• Majority Vote (1 red, 2 yellow, 2 green ?)
• Tournament

All are post-learning procedures and may produce unexpected results
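A small sketch of majority voting over pairwise (AvA) classifiers, showing how the "1 red, 2 yellow, 2 green" situation leaves the decision ambiguous. The function names, the dictionary layout of `models`, and the toy weights are hypothetical.

```python
from collections import Counter

def ava_votes(models, x):
    """models maps a class pair (a, b) to a weight vector; a wins if v . x > 0."""
    votes = []
    for (a, b), v in models.items():
        s = sum(vi * xi for vi, xi in zip(v, x))
        votes.append(a if s > 0 else b)
    return votes

def majority_vote(votes):
    """Return all labels sharing the top vote count; more than one means a tie."""
    counts = Counter(votes)
    best = max(counts.values())
    return sorted(label for label, n in counts.items() if n == best)

# The ambiguous case from the slide: no single winner.
tied = majority_vote(["red", "yellow", "yellow", "green", "green"])

# A clear case with three made-up pairwise classifiers.
models = {
    ("red", "blue"): [1.0, 0.0],
    ("red", "green"): [0.0, -1.0],
    ("blue", "green"): [-1.0, 1.0],
}
clear = majority_vote(ava_votes(models, [1.0, 0.5]))
```

Any tie-breaking rule applied afterwards is chosen after learning, so the pairwise classifiers were never trained to make it work.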


POS Tagging

• English tags


POS Tagging, examples from WSJ

From McCallum


POS Tagging

• Ambiguity: not a trivial task
• Useful task:
  • important features for other steps are based on POS
  • e.g., use POS as input to a parser


But still, why is it so popular?

– Historically the first statistical NLP problem
– Easy to apply arbitrary classifiers:
  – both sequence models and independent classifiers
– Can be regarded as a finite-state problem
– Easy to evaluate
– Annotation is cheaper to obtain than treebanks (other languages)


HMM (reminder)


HMM (reminder) - transitions


Transition Estimates


Emission Estimates
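One common way to obtain transition and emission estimates for an HMM tagger is relative-frequency (maximum-likelihood) counting over a tagged corpus. The sketch below assumes exactly that, with no smoothing and no end-of-sentence transition; the lecture may have used a different estimator. The function name, the `<s>` start symbol, and the two-sentence corpus are hypothetical.

```python
from collections import Counter

def estimate_hmm(tagged_sentences):
    """Unsmoothed estimates of P(tag | previous tag) and P(word | tag)."""
    trans, emit = Counter(), Counter()
    prev_count, tag_count = Counter(), Counter()
    for sentence in tagged_sentences:
        prev = "<s>"  # assumed start-of-sentence symbol
        for word, tag in sentence:
            trans[(prev, tag)] += 1
            prev_count[prev] += 1
            emit[(tag, word)] += 1
            tag_count[tag] += 1
            prev = tag
    p_trans = {k: c / prev_count[k[0]] for k, c in trans.items()}
    p_emit = {k: c / tag_count[k[0]] for k, c in emit.items()}
    return p_trans, p_emit

corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
p_trans, p_emit = estimate_hmm(corpus)
```

Unseen (tag, word) pairs get no entry at all here, which is why practical taggers add smoothing or special handling for out-of-vocabulary words.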


MaxEnt (reminder)


Decoding: HMM vs MaxEnt


Accuracies overview


Accuracies overview


SVMs for tagging

– We can use SVMs in the same way as MaxEnt (or other classifiers)
– We can use a window around the word
– 97.16% on WSJ
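A window around the word can be turned into classifier features roughly as follows. This is a sketch under assumptions: the feature naming scheme, the `<pad>` token, lowercasing, and the default window size are illustrative choices, not those of the system that produced the reported accuracy.

```python
def window_features(words, i, size=2):
    """Indicator features for the words at positions i-size .. i+size."""
    feats = {}
    for offset in range(-size, size + 1):
        j = i + offset
        # Positions outside the sentence are filled with a padding token.
        token = words[j] if 0 <= j < len(words) else "<pad>"
        feats[f"w[{offset:+d}]={token.lower()}"] = 1.0
    return feats

sentence = ["The", "dog", "barks"]
feats = window_features(sentence, 0, size=1)
```

Each position in the sentence becomes one independent classification example, so any multiclass classifier (SVM, MaxEnt, perceptron) can consume these features.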


SVMs for tagging

from Giménez & Màrquez


No sequence modeling


CRFs and other global models


CRFs and other global models


Compare

• CRFs: no local normalization
• MEMMs: note that after each step t the remaining probability mass cannot be reduced; it can only be distributed among the possible state transitions
• HMMs

[Figure: model diagrams with nodes labeled W and T]


Label Bias


based on a slide from Joe Drish


Label Bias

• Recall transition-based parsing: Nivre’s algorithm (with beam search)
• At each step we can observe only local features (limited look-ahead)
• If we later see that the following word is impossible, we can only distribute probability uniformly across all (im-)possible decisions
• If there is only a small number of such decisions, we cannot decrease the probability dramatically
• So, label bias is likely to be a serious problem if there are:
  • Non-local dependencies
  • States with a small number of possible outgoing transitions
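The effect can be seen in a tiny numeric example. The two-step state graph, the scores, and the names below are made up for illustration. State A has a single outgoing transition whose score for the observed word is very low; state B has two moderately scored transitions. With local (per-step) normalization the low score vanishes, while normalizing over whole paths keeps it.

```python
import math

def softmax(scores):
    """Normalize exponentiated scores into a probability distribution."""
    z = sum(math.exp(s) for s in scores.values())
    return {key: math.exp(s) / z for key, s in scores.items()}

# Step 1: A and B look equally good.
first = softmax({"A": 0.0, "B": 0.0})

# Step 2, locally normalized (MEMM-style): each state's outgoing mass sums to 1.
from_a = softmax({"A2": -10.0})            # single successor, terrible score
from_b = softmax({"B2": 1.0, "B3": 1.0})   # two successors, decent scores

memm = {
    "A->A2": first["A"] * from_a["A2"],
    "B->B2": first["B"] * from_b["B2"],
    "B->B3": first["B"] * from_b["B3"],
}

# Globally normalized (CRF-style): sum the raw scores along each path,
# then normalize once over all paths.
crf = softmax({"A->A2": 0.0 + -10.0, "B->B2": 0.0 + 1.0, "B->B3": 0.0 + 1.0})

best_memm = max(memm, key=memm.get)
best_crf_prob_of_a = crf["A->A2"]
```

Under local normalization the path through A gets probability 0.5 despite the bad evidence at step 2, because A's only transition must receive all of A's mass; under global normalization the same path becomes very unlikely.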


POS Tagging Experiments

– “+” is an extended feature set (hard to integrate in a generative model)
– oov: out-of-vocabulary


Supervision

– So far we considered the supervised case
  – Training set is labeled
– However, we can try to induce word classes without supervision
  – Unsupervised tagging
  – We will later discuss the EM algorithm
– We can also do it in a partly supervised way:
  – Seed tags
  – Small labeled dataset
  – Parallel corpus
  – ...


Why not predict POS tags and parse trees simultaneously?

– It is possible and often done this way
– Doing tagging internally often benefits parsing accuracy
– Unfortunately, parsing models are less robust than taggers
  – e.g., non-grammatical sentences, different domains
– It is more expensive and does not help...


Questions

• Why is there no label-bias problem for a generative model (e.g., HMM)?
• How would you integrate word features in a generative model (e.g., HMMs for POS tagging)?
  • e.g., if the word has:
    • -ing, -s, -ed, -d, -ment, ...
    • post-, de-, ...


“CRFs” for more complex structured output problems

• We considered sequence labeling problems
• Here, the structure of dependencies is fixed
• What if we do not know the structure but would like to have interactions respecting the structure?


“CRFs” for more complex structured output problems


• Recall, we had the MST algorithm (McDonald and Pereira, 05)


“CRFs” for more complex structured output problems

• Complex inference
  • E.g., arbitrary 2nd order dependency parsing models are not tractable (non-projective): NP-complete (McDonald & Pereira, EACL 06)
• Recently, conditional models for constituent parsing:
  • (Finkel et al, ACL 08)
  • (Carreras et al, CoNLL 08)
  • ...


Back to MultiClass

– Let us review how to decompose a multiclass problem into binary classification problems


Summary


• Margin-based method for multiclass classification and structured prediction

• CRFs vs HMMs vs MEMMs for POS tagging


Conclusions

• All approaches use a linear representation
• The differences are:
  – Features
  – How to learn weights
  – Training paradigms:
    • Global training (CRF, Global Perceptron)
    • Modular training (PMM, MEMM, ...)
      – These approaches are easier to train, but may require additional mechanisms to enforce global constraints.