Transcript
Page 1: Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Yoshimasa Tsuruoka, Jun'ichi Tsujii, and Sophia Ananiadou

University of Manchester

Page 2: Log-linear models in NLP

• Maximum entropy models
  – Text classification (Nigam et al., 1999)
  – History-based approaches (Ratnaparkhi, 1998)
• Conditional random fields
  – Part-of-speech tagging (Lafferty et al., 2001), chunking (Sha and Pereira, 2003), etc.
• Structured prediction
  – Parsing (Clark and Curran, 2004), Semantic Role Labeling (Toutanova et al., 2005), etc.

Page 3: Log-linear models

• Log-linear (a.k.a. maximum entropy) model

  p(y \mid x; \mathbf{w}) = \frac{1}{Z(x)} \exp\Big( \sum_i w_i f_i(x, y) \Big)

  where w_i is the weight of feature function f_i, and the partition function is

  Z(x) = \sum_{y} \exp\Big( \sum_i w_i f_i(x, y) \Big)

• Training
  – Maximize the conditional likelihood of the training data:

  L(\mathbf{w}) = \sum_{j=1}^{N} \log p(y^j \mid x^j; \mathbf{w}) - R(\mathbf{w})
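A minimal sketch (not from the slides) of how these probabilities could be computed with dense feature vectors; the function name and the array layout are assumptions for illustration:

import numpy as np

def log_linear_probs(w, features):
    """p(y | x; w) for one input x, given feature vectors f(x, y) for each candidate y.

    w        : weight vector, shape (d,)
    features : array of shape (num_labels, d); row y holds f(x, y)
    """
    scores = features @ w                  # sum_i w_i f_i(x, y) for each y
    scores -= scores.max()                 # stabilize before exponentiating
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()   # normalize by the partition function Z(x)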

Page 4: Regularization

• To avoid overfitting to the training data
  – Penalize the weights of the features

• L1 regularization

  L(\mathbf{w}) = \sum_{j=1}^{N} \log p(y^j \mid x^j; \mathbf{w}) - C \sum_i |w_i|

  – Most of the weights become zero
  – Produces sparse (compact) models
  – Saves memory and storage
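A one-line sketch of evaluating this objective, assuming loglik holds the per-example conditional log-likelihoods (an assumed input, not something shown on the slides):

import numpy as np

def l1_objective(loglik, w, C):
    """L(w): total conditional log-likelihood minus the L1 penalty C * sum_i |w_i|."""
    return loglik.sum() - C * np.abs(w).sum()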

Page 5: Training log-linear models

• Numerical optimization methods
  – Gradient descent (steepest descent or hill climbing)
  – Quasi-Newton methods (e.g. BFGS, OWL-QN)
  – Stochastic gradient descent (SGD)
  – etc.
• Training can take several hours (or even days), depending on the complexity of the model, the size of the training data, etc.

Page 6: Gradient Descent (Hill Climbing)

[Figure: the objective surface over two weights w1 and w2, with the full-gradient ascent path toward the optimum.]

Page 7: Stochastic Gradient Descent (SGD)

[Figure: the same objective surface over w1 and w2; each step computes an approximate gradient using one training sample, so the path is noisier.]

Page 8: Stochastic Gradient Descent (SGD)

• Weight update procedure
  – Very simple (similar to the perceptron algorithm)

  w_i^{k+1} = w_i^k + \eta_k \frac{\partial}{\partial w_i} \Big( \log p(y^j \mid x^j; \mathbf{w}^k) - \frac{C}{N} \sum_i |w_i| \Big)

  \eta_k : learning rate

  – The L1 term is not differentiable (at zero).

Page 9: Using subgradients

• Weight update procedure

  w_i^{k+1} = w_i^k + \eta_k \frac{\partial}{\partial w_i} \Big( \log p(y^j \mid x^j; \mathbf{w}^k) - \frac{C}{N} \sum_i |w_i| \Big)

  where the subgradient of |w_i| is taken as

  \frac{\partial |w_i|}{\partial w_i} =
    \begin{cases}
       1 & \text{if } w_i > 0 \\
       0 & \text{if } w_i = 0 \\
      -1 & \text{if } w_i < 0
    \end{cases}
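A minimal dense-vector sketch of this naive subgradient update; grad_loglik is assumed to be the gradient of log p(y^j | x^j; w) for the sampled example (that computation is not shown on the slides):

import numpy as np

def sgd_subgradient_step(w, grad_loglik, eta, C, N):
    """One naive SGD step on the L1-regularized objective.

    w           : current weight vector
    grad_loglik : gradient of log p(y^j | x^j; w) for the sampled example
    eta         : learning rate for this step
    C, N        : regularization strength and number of training samples
    """
    subgrad_l1 = np.sign(w)                       # +1, 0, or -1 per weight
    return w + eta * (grad_loglik - (C / N) * subgrad_l1)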

Page 10: Using subgradients

  w_i^{k+1} = w_i^k + \eta_k \frac{\partial}{\partial w_i} \Big( \log p(y^j \mid x^j; \mathbf{w}^k) - \frac{C}{N} \sum_i |w_i| \Big)

• Problems
  – The L1 penalty needs to be applied to all features (including the ones that are not used in the current sample).
  – Few weights become zero as a result of training.

Page 11: Clipping-at-zero approach

• Carpenter (2008)
• Special case of the FOLOS algorithm (Duchi and Singer, 2008) and the truncated gradient method (Langford et al., 2009)
• Enables lazy update

[Figure: trajectory of a weight w under the clipping-at-zero update.]

Page 12: Clipping-at-zero approach

• First take a gradient step using only the log-likelihood term:

  w_i^{k+\frac{1}{2}} = w_i^k + \eta_k \frac{\partial}{\partial w_i} \log p(y^j \mid x^j; \mathbf{w}^k)

• Then apply the L1 penalty, clipping the weight at zero if the penalty would flip its sign:

  \text{if } w_i^{k+\frac{1}{2}} > 0 \text{ then } w_i^{k+1} = \max\Big( 0,\; w_i^{k+\frac{1}{2}} - \frac{C}{N}\eta_k \Big)

  \text{else if } w_i^{k+\frac{1}{2}} < 0 \text{ then } w_i^{k+1} = \min\Big( 0,\; w_i^{k+\frac{1}{2}} + \frac{C}{N}\eta_k \Big)
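A minimal dense-vector sketch of the two-step clipped update above (the lazy, per-feature bookkeeping that the previous slide mentions is omitted here):

import numpy as np

def clipping_at_zero_step(w, grad_loglik, eta, C, N):
    """Gradient step on the log-likelihood, then shrink each weight toward zero,
    clipping it to exactly zero if the penalty would change its sign."""
    w_half = w + eta * grad_loglik            # w^{k+1/2}
    penalty = (C / N) * eta
    w_new = w_half.copy()
    pos, neg = w_half > 0, w_half < 0
    w_new[pos] = np.maximum(0.0, w_half[pos] - penalty)
    w_new[neg] = np.minimum(0.0, w_half[neg] + penalty)
    return w_new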

Page 13: Number of non-zero features

• Text chunking

  Quasi-Newton                18,109
  SGD (Naive)                455,651
  SGD (Clipping-at-zero)      87,792

• Named entity recognition

  Quasi-Newton                30,710
  SGD (Naive)              1,032,962
  SGD (Clipping-at-zero)     279,886

• Part-of-speech tagging

  Quasi-Newton                50,870
  SGD (Naive)              2,142,130
  SGD (Clipping-at-zero)     323,199

Page 14: Why it does not produce sparse models

• In SGD, weights are not updated smoothly

[Figure: a weight oscillating around zero under noisy SGD updates; it fails to become exactly zero, and the L1 penalty is wasted away.]

Page 15: Cumulative L1 penalty

• The absolute value of the total L1 penalty which should have been applied to each weight:

  u_k = \frac{C}{N} \sum_{t=1}^{k} \eta_t

• The total L1 penalty which has actually been applied to each weight:

  q_i^k = \sum_{t=1}^{k} \big( w_i^{t+1} - w_i^{t+\frac{1}{2}} \big)

Page 16: Applying L1 with cumulative penalty

  w_i^{k+\frac{1}{2}} = w_i^k + \eta_k \frac{\partial}{\partial w_i} \log p(y^j \mid x^j; \mathbf{w}^k)

  \text{if } w_i^{k+\frac{1}{2}} > 0 \text{ then } w_i^{k+1} = \max\Big( 0,\; w_i^{k+\frac{1}{2}} - (u_k + q_i^{k-1}) \Big)

  \text{else if } w_i^{k+\frac{1}{2}} < 0 \text{ then } w_i^{k+1} = \min\Big( 0,\; w_i^{k+\frac{1}{2}} + (u_k - q_i^{k-1}) \Big)

• Penalize each weight according to the difference between u_k and q_i^{k-1}
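A minimal dense-vector sketch of this update; the actual implementation applies the penalty lazily per active feature, and grad_loglik, eta, C, N are assumed inputs here:

import numpy as np

def cumulative_penalty_step(w, q, u, grad_loglik, eta, C, N):
    """One SGD step with the cumulative L1 penalty.

    w : weight vector
    q : total L1 penalty actually applied to each weight so far (q_i^{k-1})
    u : total L1 penalty that should have been applied so far (scalar u_{k-1})
    """
    u += (C / N) * eta                        # u_k = (C/N) * sum of learning rates
    w_half = w + eta * grad_loglik            # gradient step on the log-likelihood only
    w_new = w_half.copy()
    pos, neg = w_half > 0, w_half < 0
    w_new[pos] = np.maximum(0.0, w_half[pos] - (u + q[pos]))
    w_new[neg] = np.minimum(0.0, w_half[neg] + (u - q[neg]))
    q += w_new - w_half                       # accumulate the penalty actually applied
    return w_new, q, u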

Page 17: Implementation

• 10 lines of code!
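The listing on this slide is not reproduced in the transcript; as a hedged reconstruction of the idea (not the authors' code), a lazy per-feature penalty application might look like this, with w and q kept as dictionaries and u the running cumulative penalty:

def apply_penalty(i, w, q, u):
    """Apply the cumulative L1 penalty to feature i just before it is used."""
    z = w.get(i, 0.0)
    qi = q.get(i, 0.0)
    if z > 0.0:
        w[i] = max(0.0, z - (u + qi))
    elif z < 0.0:
        w[i] = min(0.0, z + (u - qi))
    else:
        return                      # a zero weight stays at zero
    q[i] = qi + (w[i] - z)          # record the penalty actually applied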

Page 18: Experiments

• Model: Conditional Random Fields (CRFs)
• Baseline: OWL-QN (Andrew and Gao, 2007)
• Tasks
  – Text chunking (shallow parsing)
    • CoNLL 2000 shared task data
    • Recognize base syntactic phrases (e.g. NP, VP, PP)
  – Named entity recognition
    • NLPBA 2004 shared task data
    • Recognize names of genes, proteins, etc.
  – Part-of-speech (POS) tagging
    • WSJ corpus (sections 0-18 for training)

Page 19: CoNLL 2000 chunking task: objective

Page 20: CoNLL 2000 chunking: non-zero features

Page 21: CoNLL 2000 chunking

                                Passes   Obj.     # Features   Time (sec)   F-score
  OWL-QN                           160   -1.583       18,109          598     93.62
  SGD (Naive)                       30   -1.671      455,651        1,117     93.64
  SGD (Clipping + Lazy Update)      30   -1.671       87,792          144     93.65
  SGD (Cumulative)                  30   -1.653       28,189          149     93.68
  SGD (Cumulative + ED)             30   -1.622       23,584          148     93.66

• Performance of the produced model
• Training is 4 times faster than OWL-QN
• The model is 4 times smaller than the clipping-at-zero approach
• The objective is also slightly better

Page 22: NLPBA 2004 named entity recognition / Part-of-speech tagging on WSJ

NLPBA 2004 named entity recognition

                                Passes   Obj.     # Features   Time (sec)   F-score
  OWL-QN                           160   -2.448       30,710        2,253     71.76
  SGD (Naive)                       30   -2.537    1,032,962        4,528     71.20
  SGD (Clipping + Lazy Update)      30   -2.538      279,886          585     71.20
  SGD (Cumulative)                  30   -2.479       31,986          631     71.40
  SGD (Cumulative + ED)             30   -2.443       25,965          631     71.63

Part-of-speech tagging on WSJ

                                Passes   Obj.     # Features   Time (sec)   Accuracy
  OWL-QN                           124   -1.941       50,870        5,623     97.16
  SGD (Naive)                       30   -2.013    2,142,130       18,471     97.18
  SGD (Clipping + Lazy Update)      30   -2.013      323,199        1,680     97.18
  SGD (Cumulative)                  30   -1.987       62,043        1,777     97.19
  SGD (Cumulative + ED)             30   -1.954       51,857        1,774     97.17

Page 23: Discussions

• Convergence
  – Demonstrated empirically
  – Penalties applied are not i.i.d.
• Learning rate
  – The need for tuning can be annoying
  – Rule of thumb: exponential decay (passes = 30, alpha = 0.85); see the sketch below
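A sketch of an exponentially decaying learning-rate schedule of this kind; the exact functional form and the initial rate eta0 are assumptions, and only the decay factor alpha = 0.85 and the 30 passes come from the slide:

def exp_decay_eta(eta0, alpha, k, N):
    """Learning rate after k sample updates: decays by a factor of alpha
    per pass over the N training samples."""
    return eta0 * alpha ** (k / N)

# With alpha = 0.85 and 30 passes, the final rate is eta0 * 0.85**30,
# i.e. roughly 0.8% of the initial rate.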

Page 24: Conclusions

• Stochastic gradient descent training for L1-regularized log-linear models
  – Force each weight to receive the total L1 penalty that would have been applied if the true (noiseless) gradient were available
• 3 to 4 times faster than OWL-QN
• Extremely easy to implement