MaxEnt: Training, Smoothing, Tagging Advanced Statistical Methods in NLP Ling572 February 7, 2012 1

Jan 01, 2016


Margaret George
MaxEnt: Training, Smoothing, Tagging

Advanced Statistical Methods in NLP
Ling572

February 7, 2012

Roadmap
MaxEnt:
  Training
  Smoothing
  Case study: POS Tagging (redux)
Beam search

Training
Learn λs from training data
Challenge: Usually can't solve analytically
  Employ numerical methods
Main techniques:
  Generalized Iterative Scaling (GIS; Darroch & Ratcliff, '72)
  Improved Iterative Scaling (IIS; Della Pietra et al., '95)
  L-BFGS, ...

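Generalized Iterative Scaling
GIS setup:
  GIS requires the constraint $\forall (x,y):\ \sum_{j=1}^{k} f_j(x,y) = C$, where C is a constant
  If not, then set $C = \max_{(x,y)} \sum_{j=1}^{k} f_j(x,y)$
  and add a correction feature function $f_{k+1}$: $f_{k+1}(x,y) = C - \sum_{j=1}^{k} f_j(x,y)$
  GIS also requires at least one active feature for any event
    Default feature functions solve this problem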
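The GIS setup step can be sketched in a few lines of Python. This is only an illustration: the function name `add_correction_feature` and the list-of-lists representation of feature vectors are assumptions, not part of the lecture.

```python
def add_correction_feature(feature_vectors):
    """Pad each feature vector so that all of them sum to the same constant C.

    feature_vectors: one list of non-negative feature values per event (x, y).
    Returns (C, padded_vectors). Representation is hypothetical.
    """
    # C is the largest total feature value over all events.
    C = max(sum(vec) for vec in feature_vectors)
    # The correction feature f_{k+1} takes up the slack C - sum_j f_j(x, y).
    padded = [vec + [C - sum(vec)] for vec in feature_vectors]
    return C, padded


events = [[1, 0, 1], [1, 1, 1], [0, 0, 1]]
C, padded = add_correction_feature(events)
print(C)       # 3
print(padded)  # [[1, 0, 1, 1], [1, 1, 1, 0], [0, 0, 1, 2]]
```

After padding, every event's features sum to C, which is the condition GIS needs.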

GIS Iteration
Compute the empirical expectation
Initialization: $\lambda_j^{(0)}$; set to 0 or some value
Iterate until convergence, for each j:
  Compute p(y|x) under the current model
  Compute model expectation under current model
  Update model parameters by weighted ratio of empirical and model expectations

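GIS Iteration
Compute $E_{\tilde{p}} f_j = \frac{1}{N}\sum_{i=1}^{N} f_j(x_i, y_i)$
Initialization: $\lambda_j^{(0)}$; set to 0 or some value
Iterate until convergence:
  Compute $p^{(n)}(y|x) = \frac{\exp\left(\sum_j \lambda_j^{(n)} f_j(x,y)\right)}{Z^{(n)}(x)}$, where $Z^{(n)}(x) = \sum_{y'} \exp\left(\sum_j \lambda_j^{(n)} f_j(x,y')\right)$
  Compute $E_{p^{(n)}} f_j = \frac{1}{N}\sum_{i=1}^{N}\sum_{y} p^{(n)}(y|x_i)\, f_j(x_i, y)$
  Update $\lambda_j^{(n+1)} = \lambda_j^{(n)} + \frac{1}{C}\log\frac{E_{\tilde{p}} f_j}{E_{p^{(n)}} f_j}$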
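A minimal sketch of the GIS loop in plain Python, under stated assumptions: the toy data, the feature layout (one binary feature per (word, label) pair plus one per label, so every event has exactly C = 2 active features), and all names are hypothetical. Features with zero empirical count are simply left at their initial weight here.

```python
import math

LABELS = ["N", "V"]
WORDS = ["flies", "like"]
# Hypothetical binary features: one per (word, label) pair plus one per label,
# so every event (x, y) has exactly C = 2 active features.
PAIRS = [(w, y) for w in WORDS for y in LABELS]
PAIR_INDEX = {pair: i for i, pair in enumerate(PAIRS)}
NUM_FEATS = len(PAIRS) + len(LABELS)
C = 2


def active(x, y):
    """Indices of the features that fire for event (x, y)."""
    return [PAIR_INDEX[(x, y)], len(PAIRS) + LABELS.index(y)]


def cond_prob(lam, x):
    """p(y|x) for every label y under the weights lam."""
    scores = {y: math.exp(sum(lam[j] for j in active(x, y))) for y in LABELS}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}


def gis_train(data, iterations=100):
    n = len(data)
    # Empirical expectation of each feature (computed once).
    empirical = [0.0] * NUM_FEATS
    for x, y in data:
        for j in active(x, y):
            empirical[j] += 1.0 / n
    lam = [0.0] * NUM_FEATS  # initialization: all weights 0
    for _ in range(iterations):
        # Model expectation of each feature under the current weights.
        model = [0.0] * NUM_FEATS
        for x, _gold in data:
            p = cond_prob(lam, x)
            for y in LABELS:
                for j in active(x, y):
                    model[j] += p[y] / n
        # Update: add (1/C) * log(empirical / model) to each weight.
        for j in range(NUM_FEATS):
            if empirical[j] > 0 and model[j] > 0:
                lam[j] += math.log(empirical[j] / model[j]) / C
    return lam


data = [("flies", "N")] * 3 + [("flies", "V")] + [("like", "V")] * 3 + [("like", "N")]
weights = gis_train(data)
print(cond_prob(weights, "flies"))
```

On this toy data the learned p(N|flies) approaches the empirical relative frequency 3/4, as expected for an unsmoothed model with one feature per (word, label) pair.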

Convergence
Methods have convergence guarantees
However, full convergence may take a very long time
  Frequently use a threshold as the stopping criterion

Calculating LL(p)
LL = 0
For each sample x in the training data:
  Let y be the true label of x
  prob = p(y|x)
  LL += 1/N * log(prob)
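The LL(p) computation could look like this in Python, taking LL to be the average log-probability of the true labels (an assumption about the intended definition). The `prob` callback standing in for the model's p(y|x) is hypothetical.

```python
import math


def log_likelihood(data, prob):
    """Average log-probability of the true labels.

    data: list of (x, y) pairs, y being the true label of x.
    prob: function (y, x) -> p(y|x) under the current model (hypothetical API).
    """
    n = len(data)
    return sum(math.log(prob(y, x)) for x, y in data) / n


# Toy model that assigns probability 0.5 to every label.
toy_data = [("flies", "N"), ("like", "V")]
ll = log_likelihood(toy_data, lambda y, x: 0.5)
print(ll)
```

Training can then stop once the change in this value between iterations drops below a chosen threshold.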

Running Time
For each iteration the running time is O(NPA), where:
  N: number of training instances
  P: number of classes
  A: average number of active features for an instance (x,y)

L-BFGS
Limited-memory version of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method
Quasi-Newton method for unconstrained optimization
Good for optimization problems with many variables
"Algorithm of choice" for MaxEnt and related models

L-BFGS
References:
  Nocedal, J. (1980). "Updating Quasi-Newton Matrices with Limited Storage". Mathematics of Computation 35: 773–782
  Liu, D. C.; Nocedal, J. (1989). "On the Limited Memory BFGS Method for Large Scale Optimization". Mathematical Programming B 45 (3): 503–528
Implementations: Java, Matlab, Python via scipy, R, etc.
  See Wikipedia page

Smoothing
Based on Klein & Manning, 2003; F. Xia

Smoothing
Problems of scale:
  Large numbers of features
    Some NLP problems in MaxEnt have on the order of 1M features
    Storage can be a problem
  Sparseness problems
    Ease of overfitting
  Optimization problems
    Feature weights can be near infinite and take a long time to converge

Smoothing
Consider the coin flipping problem
[Figure: three empirical distributions and the corresponding models. From K&M '03]

Need for Smoothing
Two problems
  Optimization:
    Optimal value of λ? ∞
    Slow to optimize
  No smoothing:
    Learned distribution just as spiky (K&M '03)
From K&M '03

Possible Solutions
Early stopping
Feature selection
Regularization

Early Stopping
Prior use of early stopping
  Decision tree heuristics
Similarly here
  Stop training after a few iterations
  λ will have increased
  Guarantees bounded, finite training time

Feature Selection
Approaches:
  Heuristic: Drop features based on fixed thresholds
    e.g. number of occurrences
  Wrapper methods:
    Add feature selection to training loop
Heuristic approaches: Simple, reduce features, but could harm performance
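The count-threshold heuristic can be sketched as follows. The function name, the string feature names, and the cutoff value are illustrative assumptions only.

```python
from collections import Counter


def select_features(instances, min_count):
    """Drop every feature that occurs fewer than min_count times.

    instances: list of feature-name lists, one per training instance.
    Returns (filtered_instances, kept_feature_set).
    """
    counts = Counter(f for feats in instances for f in feats)
    keep = {f for f, c in counts.items() if c >= min_count}
    filtered = [[f for f in feats if f in keep] for feats in instances]
    return filtered, keep


instances = [["w=time", "suf=e"], ["w=flies", "suf=s"], ["w=likes", "suf=s"]]
filtered, keep = select_features(instances, min_count=2)
print(keep)      # {'suf=s'}
print(filtered)  # [[], ['suf=s'], ['suf=s']]
```

Note that the first instance ends up with no features at all, which shows how a fixed cutoff can remove useful information.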

Regularization
In statistics and machine learning, regularization is any method of preventing overfitting of data by a model.
Typical examples of regularization in statistical machine learning include ridge regression, lasso, and the L2 norm in support vector machines.
In this case, we change the objective function:
  log P(Y,λ|X) = log P(λ) + log P(Y|X,λ)
From K&M '03, F. Xia

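Prior
Possible prior distributions: uniform, exponential
Gaussian prior:
  $P(\lambda_j) = \frac{1}{\sigma_j\sqrt{2\pi}} \exp\left(-\frac{(\lambda_j-\mu_j)^2}{2\sigma_j^2}\right)$
log P(Y,λ|X) = log P(λ) + log P(Y|X,λ)
  $= \sum_i \log P(y_i|x_i,\lambda) - \sum_j \frac{(\lambda_j-\mu_j)^2}{2\sigma_j^2} + \text{const}$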

Without a prior: maximize P(Y|X,λ)
With a prior: maximize P(Y,λ|X)
In practice, μ = 0; 2σ² = 1
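As a rough sketch, the Gaussian prior turns into a quadratic penalty on the weights. The function names are hypothetical; the defaults follow the "in practice" setting of μ = 0 and 2σ² = 1, and the constant term of log P(λ) is dropped.

```python
def penalized_objective(log_likelihood, weights, mu=0.0, two_sigma_sq=1.0):
    """log P(Y|X, lambda) + log P(lambda), up to an additive constant."""
    penalty = sum((w - mu) ** 2 for w in weights) / two_sigma_sq
    return log_likelihood - penalty


def penalty_gradient(weights, mu=0.0, two_sigma_sq=1.0):
    """Gradient of the penalty term: d/dw of -(w - mu)^2 / (2 sigma^2)."""
    return [-2.0 * (w - mu) / two_sigma_sq for w in weights]


print(penalized_objective(-10.0, [1.0, -2.0]))  # -15.0
print(penalty_gradient([1.0, -2.0]))            # [-2.0, 4.0]
```

The gradient term always pulls each weight back toward μ, which is what keeps the optimal weights finite.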

L1 and L2 Regularization

Smoothing: POS Example

Advantages of Smoothing
Smooths distributions
Moves weight onto more informative features
Enables effective use of larger numbers of features
Can speed up convergence

Summary: Training
Many training methods:
  Generalized Iterative Scaling (GIS)
Smoothing:
  Early stopping, feature selection, regularization
Regularization:
  Change objective function: add prior
  Common prior: Gaussian prior
  Maximizing posterior not equivalent to max ent

MaxEnt POS Tagging

Notation (Ratnaparkhi, 1996)
h: history x
  Word and tag history
t: tag y

POS Tagging Model
$P(t_1,\ldots,t_n \mid w_1,\ldots,w_n) \approx \prod_{i=1}^{n} p(t_i \mid h_i)$
where $h_i = \{w_i, w_{i-1}, w_{i-2}, w_{i+1}, w_{i+2}, t_{i-1}, t_{i-2}\}$
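To make the model concrete, here is a small sketch that scores a tag sequence as a sum of per-word log-probabilities, each conditioned on the history h_i. The padding symbols (`<s>`, `</s>`, `BOS`), the tuple layout of the history, and the `tag_prob` callback are assumptions for illustration.

```python
import math


def history(words, tags, i):
    """h_i = (w_i, w_{i-1}, w_{i-2}, w_{i+1}, w_{i+2}, t_{i-1}, t_{i-2})."""
    def word(k):
        if k < 0:
            return "<s>"       # assumed left padding
        if k >= len(words):
            return "</s>"      # assumed right padding
        return words[k]

    def tag(k):
        return tags[k] if k >= 0 else "BOS"

    return (word(i), word(i - 1), word(i - 2), word(i + 1), word(i + 2),
            tag(i - 1), tag(i - 2))


def sequence_log_prob(words, tags, tag_prob):
    """Sum over i of log p(t_i | h_i); tag_prob(tag, history) is hypothetical."""
    return sum(math.log(tag_prob(t, history(words, tags, i)))
               for i, t in enumerate(tags))


words = ["time", "flies"]
tags = ["N", "V"]
print(history(words, tags, 0))
print(sequence_log_prob(words, tags, lambda t, h: 0.5))
```

Working in log space avoids underflow when many per-word probabilities are multiplied.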

MaxEnt Feature Set

Example

Feature for ‘about’

Exclude features seen < 10 times

Training
GIS
Training time: O(NTA)
  N: training set size
  T: number of tags
  A: average number of features active for event (h,t)
24 hours on a '96 machine

Finding Features
In training, where do features come from?
Where do features come from in testing?
  Tag features come from classification of prior word

             w-1     w0      w-1w0        w+1     t-1    y
x1 (Time)    <s>     Time    <s> Time     flies   BOS    N
x2 (flies)   Time    flies   Time flies   like    N      N
x3 (like)    flies   like    flies like   an      N      V
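The feature columns in the table can be produced by a small extraction function. This is a sketch: the function name, the dictionary keys, and the `</s>` end-of-sentence symbol are assumptions; at test time `prev_tags` would hold the tags already predicted for earlier words.

```python
def extract_features(words, prev_tags, i):
    """Features for word i: w-1, w0, w-1w0, w+1 and t-1.

    prev_tags holds the tags of words 0..i-1 (gold tags in training,
    predicted tags in testing).
    """
    prev_word = words[i - 1] if i > 0 else "<s>"
    next_word = words[i + 1] if i + 1 < len(words) else "</s>"
    prev_tag = prev_tags[i - 1] if i > 0 else "BOS"
    return {
        "w-1": prev_word,
        "w0": words[i],
        "w-1w0": prev_word + " " + words[i],
        "w+1": next_word,
        "t-1": prev_tag,
    }


sentence = ["Time", "flies", "like", "an", "arrow"]
print(extract_features(sentence, [], 0))
print(extract_features(sentence, ["N", "N"], 2))
```

The second call reproduces the row for x3 (like): previous word "flies", next word "an", previous tag N.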

Decoding
Goal: Identify highest probability tag sequence
Issues:
  Features include tags from previous words
    Not immediately available
  Uses tag history
    Just knowing highest probability preceding tag insufficient

Beam Search
Intuition:
  Breadth-first search explores all paths
  Lots of paths are (pretty obviously) bad
  Why explore bad paths?
  Restrict to (apparently best) paths
Approach:
  Perform breadth-first search, but
  Retain only k 'best' paths thus far
  k: beam width

Beam Search, k=3
[Figure: beam search lattice for "<s> time flies like an arrow", starting from BOS, with candidate tags N/V for "time", N/V for "flies", and P/V for "like"]

Beam Search
W = {w1, w2, …, wn}: test sentence
sij: jth highest prob. sequence up to & inc. word wi
Generate tags for w1, keep top k, set s1j accordingly
for i = 2 to n:
  Extension: add tags for wi to each s(i-1)j
  Beam selection:
    Sort sequences by probability
    Keep only top k sequences

Beam Search
W = {w1, w2, …, wn}: test sentence
sij: jth highest prob. sequence up to & inc. word wi
Generate tags for w1, keep topN, set s1j accordingly
for i = 2 to n:
  For each s(i-1)j:
    for wi form vector, keep topN tags for wi
  Beam selection:
    Sort sequences by probability
    Keep only top sequences, using pruning on next slide
Return highest probability sequence sn1
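The basic loop (extend every kept sequence, sort, keep the top k) can be sketched like this. The probability table `TOY` and the helper names are invented for illustration; a real tagger would call the trained MaxEnt model in place of `toy_prob`.

```python
import math

# Hypothetical p(tag | word, previous tag); "*" is the default row.
TOY = {
    "time":  {"*": {"N": 0.7, "V": 0.3}},
    "flies": {"*": {"N": 0.4, "V": 0.6},
              "N": {"N": 0.4, "V": 0.6},
              "V": {"N": 0.9, "V": 0.1}},
    "like":  {"*": {"V": 0.5, "P": 0.5},
              "N": {"V": 0.7, "P": 0.3},
              "V": {"V": 0.1, "P": 0.9}},
    "an":    {"*": {"D": 1.0}},
    "arrow": {"*": {"N": 1.0}},
}


def tags_for(word):
    return list(TOY[word]["*"])


def toy_prob(tag, words, i, prev_tags):
    prev = prev_tags[-1] if prev_tags else "BOS"
    table = TOY[words[i]]
    return table.get(prev, table["*"])[tag]


def beam_search(words, k=3):
    """Return (log_prob, tags) of the best sequence found with beam width k."""
    beam = [(0.0, [])]  # each entry: (log probability, tag sequence so far)
    for i, word in enumerate(words):
        candidates = []
        for logp, seq in beam:           # extension step
            for tag in tags_for(word):
                p = toy_prob(tag, words, i, seq)
                candidates.append((logp + math.log(p), seq + [tag]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:k]            # beam selection: keep top k
    return beam[0]


best_logp, best_tags = beam_search("time flies like an arrow".split(), k=3)
print(best_tags)  # ['N', 'V', 'P', 'D', 'N']
```

No dynamic programming table is needed: each step is just extension followed by sorting.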

Beam Search
Pruning and storage:
  W = beam width
  For each node, store:
    Tag for wi
    Probability of sequence so far, $prob_{i,j} = \prod_{m=1}^{i} p(t_m \mid h_m)$
  For each candidate j, $s_{i,j}$:
    Keep the node if $prob_{i,j}$ is in topK, and
    $prob_{i,j}$ is sufficiently high
      e.g. $\lg(prob_{i,j}) + W \geq \lg(max\_prob)$
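The two-part pruning test can be sketched as a single function. The name `prune` and the (log2 probability, sequence) tuple layout are assumptions; `beam_width` plays the role of W and `top_k` the role of topK.

```python
def prune(candidates, top_k, beam_width):
    """Keep candidates that are in the top_k and within beam_width of the best.

    candidates: list of (log2_prob, sequence) pairs.
    """
    if not candidates:
        return []
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)[:top_k]
    max_logp = ranked[0][0]
    # Keep a node only if lg(prob) + W >= lg(max_prob).
    return [c for c in ranked if c[0] + beam_width >= max_logp]


cands = [(-1.0, "a"), (-2.5, "b"), (-4.0, "c"), (-1.5, "d")]
print(prune(cands, top_k=3, beam_width=1))  # [(-1.0, 'a'), (-1.5, 'd')]
```

Here "c" is cut by the topK limit and "b" by the width test, so only two sequences survive.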

Decoding
Tag dictionary:
  known word: returns tags seen with word in training
  unknown word: returns all tags
Beam width = 5
Running time: O(NTAB)
  N, T, A as before
  B: beam width
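A tag dictionary of this kind might be built as below. The function names and the data layout (sentences as lists of (word, tag) pairs) are assumptions for illustration.

```python
from collections import defaultdict


def build_tag_dictionary(tagged_sentences):
    """Map each training word to the set of tags it was seen with."""
    tag_dict = defaultdict(set)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_dict[word].add(tag)
    all_tags = set().union(*tag_dict.values()) if tag_dict else set()
    return dict(tag_dict), all_tags


def candidate_tags(word, tag_dict, all_tags):
    """Known word: tags seen in training. Unknown word: all tags."""
    return sorted(tag_dict.get(word, all_tags))


training = [[("time", "N"), ("flies", "V")],
            [("time", "V"), ("flies", "N"), ("like", "P")]]
tag_dict, all_tags = build_tag_dictionary(training)
print(candidate_tags("like", tag_dict, all_tags))   # ['P']
print(candidate_tags("arrow", tag_dict, all_tags))  # ['N', 'P', 'V']
```

Restricting known words to their observed tags shrinks the number of candidates the beam has to score at each step.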

POS Tagging
Overall accuracy: 96.3+%
Unseen word accuracy: 86.2%
Comparable to HMM tagging accuracy or TBL
Provides
  Probabilistic framework
  Better able to model different info sources
Topline accuracy 96-97%
  Consistency issues

Beam Search
Beam search decoding:
  Variant of breadth-first search
  At each layer, keep only top k sequences
Advantages:
  Efficient in practice: beam 3-5 near optimal
    Empirically, beam covers 5-10% of search space; prunes 90-95%
  Simple to implement
    Just extensions + sorting, no dynamic programming
Disadvantage: Not guaranteed optimal (or complete)

MaxEnt POS Tagging
Part of speech tagging by classification:
  Feature design
    word and tag context features
    orthographic features for rare words
Sequence classification problems:
  Tag features depend on prior classification
Beam search decoding
  Efficient, but inexact
  Near optimal in practice
