MaxEnt: Training, Smoothing, Tagging Advanced Statistical Methods in NLP Ling572 February 7, 2012 1.

1

MaxEnt: Training, Smoothing, Tagging

Advanced Statistical Methods in NLPLing572

February 7, 2012

2

RoadmapMaxent:

Training

Smoothing

Case study: POS Tagging (redux)Beam search

3

Training

4

TrainingLearn λs from training data

5


Challenge: Usually can’t solve analyticallyEmploy numerical methods

6


Challenge: Usually can’t solve analyticallyEmploy numerical methods

Main different techniques:Generalized Iterative Scaling (GIS, Darroch &Ratcliffe,

‘72)

Improved Iterative Scaling (IIS, Della Pietra et al, ‘95)

L-BFGS,…..

7

Generalized Iterative Scaling

GIS Setup:GIS required constraint: , where C is a constant

8



If not, then set

:

9



If not, then set

10



If not, then set

and add a correction feature function fk+1:

11



If not, then set

and add a correction feature function fk+1:

GIS also requires at least one active feature for any eventDefault feature functions solve this problem

12

GIS IterationCompute the empirical expectation

13


Initialization:λj(0) ; set to 0 or some value

14



Iterate until convergence for each j:

15



Iterate until convergence for each j:Compute p(y|x) under the current model

16




Compute model expectation under current model

17




Compute model expectation under current model

Update model parameters by weighted ratio of empirical and model expectations

18

GIS IterationCompute

19



20



Iterate until convergence:Compute

21



Iterate until convergence:Compute p(n)(y|x)=

22




Compute

23




Compute

Update

24

ConvergenceMethods have convergence guarantees

25


However, full convergence may take very long time

26


However, full convergence may take very long timeFrequently use threshold

27


However, full convergence may take very long timeFrequently use threshold

28

Calculating LL(p)LL = 0

For each sample x in the training dataLet y be the true label of xprob = p(y|x)LL += 1/N * prob

29

Running TimeFor each iteration the running time is:

30

Running TimeFor each iteration the running time is O(NPA),

where:

N: number of training instances

P: number of classes

A: Average number of active features for instance (x,y)

31

L-BFGSLimited-memory version of

Broyden–Fletcher–Goldfarb–Shanno (BFGS) method

http://en.wikipedia.org/wiki/BFGS_method


32



Quasi-Newton method for unconstrained optimization



33




Good for optimization problems with many variables



34




Good for optimization problems with many variables

“Algorithm of choice” for MaxEnt and related models



35

L-BFGSReferences:

Nocedal, J. (1980). "Updating Quasi-Newton Matrices with Limited Storage". Mathematics of Computation 35: 773–782

Liu, D. C.; Nocedal, J. (1989)"On the Limited Memory Method for Large Scale Optimization". Mathematical Programming B 45 (3): 503–528

http://www.ece.northwestern.edu/~nocedal/PSfiles/limited-memory.ps.gz



36

L-BFGSReferences:

Nocedal, J. (1980). "Updating Quasi-Newton Matrices with Limited Storage". Mathematics of Computation 35: 773–782

Liu, D. C.; Nocedal, J. (1989)"On the Limited Memory Method for Large Scale Optimization". Mathematical Programming B 45 (3): 503–528

Implementations: Java, Matlab, Python via scipy, R, etcSee Wikipedia page




37

SmoothingBased on Klein & Manning, 2003; F. Xia

38

SmoothingProblems of scale:

39


Large numbers of featuresSome NLP problems in MaxEnt 1M features

Storage can be a problem

40




Sparseness problemsEase of overfitting

41




Sparseness problemsEase of overfitting

Optimization problemsFeatures can be near infinite, take long time to

converge

42

SmoothingConsider the coin flipping problem

Three empirical distributionsModels

From K&M ‘03

43

Need for Smoothing Two problems

From K&M ‘03

44


Optimization:Optimal value of λ?

∞Slow to optimize

From K&M ‘03

45


Optimization:Optimal value of λ?

∞Slow to optimize

No smoothingLearned distribution

just as spiky (K&M’03)

From K&M ‘03

46

Possible Solutions

47

Possible SolutionsEarly stopping

Feature selection

Regularization

48

Early StoppingPrior use of early stopping

49


Decision tree heuristics

50


Decision tree heuristics

Similarly hereStop training after a few iterations

λwill have increased

Guarantees bounded, finite training time

51

Feature SelectionApproaches:

52


Heuristic: Drop features based on fixed thresholdsi.e. number of occurrences

53



Wrapper methods:Add feature selection to training loop

54



Wrapper methods:Add feature selection to training loop

Heuristic approaches: Simple, reduce features, but could harm

performance

55

RegularizationIn statistics and machine learning, regularization

is any method of preventing overfitting of data by a model.

From K&M ’03, F. Xia

56



Typical examples of regularization in statistical machine learning include ridge regression, lasso, and L2-normin support vector machines.


57



Typical examples of regularization in statistical machine learning include ridge regression, lasso, and L2-normin support vector machines.

In this case, we change the objective function: log P(Y,λ|X) = log P(λ)+log P(Y|X,λ)


58

Prior Possible prior distributions: uniform, exponential

59


Gaussian prior:

60


Gaussian prior:

log P(Y,λ|X) = log P(λ)+log P(Y|X,λ)

61

Maximize P(Y|X,λ)

Maximize P(Y, λ|X)

In practice, μ=0; 2σ2=1

62

L1 and L2 Regularization

63

Smoothing: POS Example

64

Advantages of SmoothingSmooths distributions

65


Moves weight onto more informative features

66



Enables effective use of larger numbers of features

67



Enables effective use of larger numbers of features

Can speed up convergence

68

Summary: TrainingMany training methods:

Generalized Iterative Scaling (GIS)

Smoothing:Early stopping, feature selection, regularization

Regularization:Change objective function – add priorCommon prior: Gaussian priorMaximizing posterior not equivalent to max ent

69

MaxEnt POS Tagging

70

Notation(Ratnaparkhi, 1996)

h: history xWord and tag history

t: tag y

71

POS Tagging ModelP(t1,…,tn|w1,…,wn)

where hi={wi,wi-1,wi-2,wi+1,wi+2,ti-1,ti-2}

72

MaxEnt Feature Set

73

Example

Feature for ‘about’

Exclude features seen < 10 times

74

TrainingGIS

Training time: O(NTA)N: training set sizeT: number of tagsA: average number of features active for event

(h,t)

24 hours on a ‘96 machine

75

Finding FeaturesIn training, where do features come from?

Where do features come from in testing?

w-1 w0 w-1w0 w+1 t-1 y

x1(Time)

<s> Time <s>Time flies BOS N

x2 (flies)

Time flies Time flies like N N

x3 (like)

flies like flies like an N V

76

Finding FeaturesIn training, where do features come from?

Where do features come from in testing?tag features come from classification of prior word

w-1 w0 w-1w0 w+1 t-1 y

x1(Time)

<s> Time <s>Time flies BOS N

x2 (flies)

Time flies Time flies like N N

x3 (like)

flies like flies like an N V

77

DecodingGoal: Identify highest probability tag sequence

78


Issues:Features include tags from previous words

Not immediately available

79


Issues:Features include tags from previous words

Not immediately available

Uses tag historyJust knowing highest probability preceding tag

insufficient

80

Beam SearchIntuition:

Breadth-first search explores all pathsLots of paths are (pretty obviously) badWhy explore bad paths?Restrict to (apparently best) paths

Approach:Perform breadth-first search, but

81

Beam SearchIntuition:

Breadth-first search explores all pathsLots of paths are (pretty obviously) badWhy explore bad paths?Restrict to (apparently best) paths

Approach:Perform breadth-first search, butRetain only k ‘best’ paths thus fark: beam width

82

Beam Search, k=3 <s> time flies like an arrow

BOS

83

Beam Search, k=3

V

N

<s> time flies like an arrow

BOS

84

Beam Search, k=3

V

N


BOS

N

V

N

V

85

Beam Search, k=3

V

N


BOS

N

V

N

V

P

V

P

V

P

V

86

Beam Search, k=3

V

N


BOS

N

V

N

V

P

56V

P

V

P

V

87

Beam SearchW={w1,w2,…,wn}: test sentence

88


sij: jth highest prob. sequence up to & inc. word wi

89



Generate tags for w1, keep top k, set s1j accordingly

90




for i=2 to n:

91




for i=2 to n:Extension: add tags for wi to each s(i-1)j

92




for i=2 to n:Extension: add tags for wi to each s(i-1)j

Beam selection: Sort sequences by probabilityKeep only top k sequences

93



Generate tags for w1, keep topN, set s1j accordingly

for i=2 to n: For each s(i-1)j

for wi form vector, keep topN tags for wi

Beam selection: Sort sequences by probabilityKeep only top sequences, using pruning on next slide

Return highest probability sequence sn1

94

Beam SearchPruning and storage:

W = beam widthFor each node, store:

Tag for wi

Probability of sequence so far, probi,j=

For each candidate j, si,j

Keep the node if probi,j in topK, and

probi,j is sufficiently high

e.g. lg(probi,j)+W>=lg(max_prob)

95

DecodingTag dictionary:

known word: returns tags seen with word in training

unknown word: returns all tags

Beam width = 5

Running time: O(NTAB)N,T,A as beforeB: beam width

96

POS TaggingOverall accuracy: 96.3+%

Unseen word accuracy: 86.2%

Comparable to HMM tagging accuracy or TBL

ProvidesProbabilistic frameworkBetter able to model different info sources

Topline accuracy 96-97%Consistency issues

97

Beam SearchBeam search decoding:

Variant of breadth first searchAt each layer, keep only top k sequences

Advantages:

98



Advantages:Efficient in practice: beam 3-5 near optimal

Empirically, beam 5-10% of search space; prunes 90-95%

99





Simple to implementJust extensions + sorting, no dynamic programming

100





Simple to implementJust extensions + sorting, no dynamic programming

Running time:


Variant of breadth first searchAt each layer, keep only top sequences


Empirically, beam 5-10% of search space; prunes 90-95%Simple to implement

Just extensions + sorting, no dynamic programming

Disadvantage: Not guaranteed optimal (or complete)

101

MaxEnt POS TaggingPart of speech tagging by classification:

Feature designword and tag context featuresorthographic features for rare words

102



Sequence classification problems:Tag features depend on prior classification

103



Sequence classification problems:Tag features depend on prior classification

Beam search decodingEfficient, but inexact

Near optimal in practice

104

MaxEnt: Training, Smoothing, Tagging Advanced Statistical Methods in NLP Ling572 February 7, 2012 1.

Documents