Page 1

Neural network language models

Lecture, Feb 16
CS 690N, Spring 2017

Advanced Natural Language Processing
http://people.cs.umass.edu/~brenocon/anlp2017/

Brendan O'Connor
College of Information and Computer Sciences

University of Massachusetts Amherst

Page 2

Neural Language Models

Feed-forward network

h = g(Vx + c)

y = Wh + b

[Figure: network diagram with input x, hidden layer h, and output y]

[Slide: Phil Blunsom]
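A minimal NumPy sketch of this forward pass (the dimensions, the random initialization, and the choice of tanh for g are illustrative assumptions, not from the slide):

```python
import numpy as np

# Illustrative sizes and initialization (assumptions, not from the slide).
d_in, d_hidden, d_out = 8, 16, 4
rng = np.random.default_rng(0)
V = rng.normal(size=(d_hidden, d_in))    # input-to-hidden weights
c = np.zeros(d_hidden)                   # hidden bias
W = rng.normal(size=(d_out, d_hidden))   # hidden-to-output weights
b = np.zeros(d_out)                      # output bias

x = rng.normal(size=d_in)                # an arbitrary input vector

h = np.tanh(V @ x + c)                   # h = g(Vx + c), with g = tanh here
y = W @ h + b                            # y = Wh + b
```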

Page 3

Nonlinear activation functions


[Excerpt from Jacob Eisenstein's course notes, §5.3, "Recurrent neural network language models":]

Figure 5.2: Nonlinear activation functions for neural networks [figure]

[...] recurrent neural network (RNN; Mikolov et al., 2010). The basic idea is to recurrently update the context vectors as we move through the sequence. Let us write h_m for the contextual information at position m in the sequence. RNNs employ the following recurrence:

x_m ≜ φ_{w_m}    (5.27)

h_m = g(Θ h_{m-1} + x_m)    (5.28)

p(w_{m+1} | w_1, w_2, ..., w_m) = exp(β_{w_{m+1}} · h_m) / Σ_{w' ∈ V} exp(β_{w'} · h_m)    (5.29)

where φ is a matrix of input word embeddings, and x_m denotes the embedding for word w_m. The function g is an element-wise nonlinear activation function. Typical choices are:

• tanh(x), the hyperbolic tangent;
• σ(x), the sigmoid function 1 / (1 + exp(-x));
• (x)_+, the rectified linear unit, (x)_+ = max(x, 0), also called ReLU.

These activation functions are shown in Figure 5.2. The sigmoid and tanh functions "squash" their inputs into a fixed range: [0, 1] for the sigmoid, [-1, 1] for tanh. This makes it possible to chain together many iterations of these functions without numerical instability.

A key point about the RNN language model is that although each w_m depends only on the context vector h_{m-1}, this vector is in turn influenced by all previous tokens, w_1, w_2, ..., w_{m-1}, through the recurrence operation: w_1 affects h_1, which affects h_2, and so on, until the information is propagated all the way to h_{m-1}, and then on to w_m (see Figure 5.1). This is an important distinction from n-gram language models, where any information outside the n-word window is ignored. Thus, in principle, the RNN language model can [excerpt cut off]

(c) Jacob Eisenstein 2014-2017. Work in progress.
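A minimal NumPy sketch of one step of this recurrence, eqs. (5.27)-(5.29) (the sizes, the tanh choice for g, and the random parameters are assumptions for illustration):

```python
import numpy as np

# Illustrative sizes and parameters (assumptions, not from the excerpt).
vocab_size, d = 1000, 32
rng = np.random.default_rng(0)
phi = rng.normal(scale=0.1, size=(vocab_size, d))    # input word embeddings (rows of φ)
beta = rng.normal(scale=0.1, size=(vocab_size, d))   # output word embeddings (rows of β)
Theta = rng.normal(scale=0.1, size=(d, d))           # recurrence matrix Θ

def rnn_lm_step(w_m, h_prev):
    """One step of eqs. (5.27)-(5.29): returns h_m and p(w_{m+1} | w_1..w_m)."""
    x_m = phi[w_m]                                   # x_m = φ_{w_m}
    h_m = np.tanh(Theta @ h_prev + x_m)              # h_m = g(Θ h_{m-1} + x_m)
    scores = beta @ h_m                              # β_{w'} · h_m for every word w'
    scores -= scores.max()                           # numerical stability for the softmax
    p_next = np.exp(scores) / np.exp(scores).sum()   # eq. (5.29)
    return h_m, p_next

h = np.zeros(d)
h, p = rnn_lm_step(w_m=3, h_prev=h)                  # arbitrary word id 3
```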

sigmoid(x) = e^x / (1 + e^x)

tanh(x) = 2 · sigmoid(2x) - 1

(x)_+ = max(0, x), a.k.a. "ReLU"
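The three activations above as a NumPy sketch (a minimal illustration; the function names are mine):

```python
import numpy as np

def sigmoid(x):
    # sigmoid(x) = e^x / (1 + e^x) = 1 / (1 + e^(-x)); squashes values into [0, 1]
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # tanh(x) = 2 * sigmoid(2x) - 1; squashes values into [-1, 1]
    return np.tanh(x)

def relu(x):
    # (x)_+ = max(0, x), the rectified linear unit
    return np.maximum(0.0, x)
```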

Page 4

Neural Language Models

Trigram NN language model

h_n = g(V [w_{n-1}; w_{n-2}] + c)

p̂_n = softmax(W h_n + b)

softmax(u)_i = exp(u_i) / Σ_j exp(u_j)

• w_i are one-hot vectors and p̂_i are distributions,
• |w_i| = |p̂_i| = V (words in the vocabulary),
• V is usually very large, > 10^5.

[Figure: network diagram with inputs w_{n-2} and w_{n-1} feeding the hidden layer h_n, which produces p̂_n; the second build of the slide labels the input projections as word embeddings]

[Slide: Phil Blunsom]
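A minimal NumPy sketch of this trigram model's forward pass, using word-id lookups in place of explicit one-hot multiplications (the sizes, initialization, and tanh for g are illustrative assumptions):

```python
import numpy as np

vocab_size, d_hidden = 10_000, 128          # illustrative sizes
rng = np.random.default_rng(0)

# V maps the concatenation [w_{n-1}; w_{n-2}] (two stacked one-hot vectors) to the hidden layer.
V = rng.normal(scale=0.01, size=(d_hidden, 2 * vocab_size))
c = np.zeros(d_hidden)
W = rng.normal(scale=0.01, size=(vocab_size, d_hidden))
b = np.zeros(vocab_size)

def softmax(u):
    u = u - u.max()                          # numerical stability
    e = np.exp(u)
    return e / e.sum()

def forward(w_prev1, w_prev2):
    """p̂_n = softmax(W g(V [w_{n-1}; w_{n-2}] + c) + b), with word ids instead of one-hots."""
    # Multiplying V by two stacked one-hot vectors just selects and adds two columns of V.
    h = np.tanh(V[:, w_prev1] + V[:, vocab_size + w_prev2] + c)
    return softmax(W @ h + b)

p = forward(w_prev1=42, w_prev2=7)           # arbitrary word ids
```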

Page 5

Neural Language Models: Sampling

w_n | w_{n-1}, w_{n-2} ~ p̂_n

[Figure: given the context w_{n-2}, w_{n-1} = "he built", the model's distribution p̂_n over the vocabulary (the, it, if, was, and, ..., aardvark) is shown as a bar chart, and the next word "a" is sampled from it]

[Slide: Phil Blunsom]

Page 6

Neural Language Models: Sampling

w_n | w_{n-1}, w_{n-2} ~ p̂_n

[Figure: generation starts from the context w_{-1}, w_0 = <s>, <s>; the model computes h_1 and p̂_1, and the first word, "There", is sampled]

[Slide: Phil Blunsom]

Page 7

Neural Language Models: Sampling

w_n | w_{n-1}, w_{n-2} ~ p̂_n

[Figure: continuing the build: after "There" is sampled from p̂_1, the model computes h_2 and p̂_2 from the updated context and samples "he"]

[Slide: Phil Blunsom]

Page 8

Neural Language Models: Sampling

w_n | w_{n-1}, w_{n-2} ~ p̂_n

[Figure: the build continues: after "There" and "he", the model computes h_3 and p̂_3 and samples "built"]

[Slide: Phil Blunsom]

Page 9

Neural Language Models: Sampling

w_n | w_{n-1}, w_{n-2} ~ p̂_n

[Figure: the completed example: starting from <s>, <s>, successive draws from p̂_1, p̂_2, p̂_3, p̂_4 yield "There", "he", "built", "a"]

[Slide: Phil Blunsom]
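A minimal sketch of this sampling loop. It reuses the hypothetical forward(w_prev1, w_prev2) function from the trigram sketch above; the <s> and end-of-sentence ids are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
BOS = 0                                        # assume id 0 is the <s> token

def sample_sentence(forward, max_len=20, eos=1):
    """Draw w_n ~ p̂_n = forward(w_{n-1}, w_{n-2}) until <eos> (assumed id 1) or max_len."""
    w_prev2, w_prev1 = BOS, BOS                # start from the context <s> <s>
    words = []
    for _ in range(max_len):
        p_n = forward(w_prev1, w_prev2)        # distribution over the vocabulary
        w_n = rng.choice(len(p_n), p=p_n)      # sample the next word id
        if w_n == eos:
            break
        words.append(w_n)
        w_prev2, w_prev1 = w_prev1, w_n        # slide the trigram context
    return words
```

With trained parameters, repeated draws like this would produce prefixes along the lines of the "There he built a ..." example in the slides.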

Page 10

Neural Language Models: Training

The usual training objective is the cross entropy of the data given the model (MLE):

F = -(1/N) Σ_n cost_n(w_n, p̂_n)

The cost function is simply the model's estimated log-probability of w_n:

cost(a, b) = aᵀ log b

(assuming w_i is a one-hot encoding of the word)

[Figure: the trigram network with a cost node cost_n comparing the output p̂_n to the target word w_n]

[Slide: Phil Blunsom]
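A minimal NumPy sketch of this objective for a batch of target word ids (the function and argument names are hypothetical; it relies on the one-hot assumption stated above):

```python
import numpy as np

def cross_entropy_objective(p_hats, targets):
    """F = -(1/N) * sum_n cost_n, with cost_n = w_nᵀ log p̂_n = log p̂_n[target word].

    p_hats:  (N, V) array, row n is the predicted distribution p̂_n
    targets: (N,) array of target word ids (the one-hot positions of w_n)
    """
    n = np.arange(len(targets))
    log_probs = np.log(p_hats[n, targets])     # w_nᵀ log p̂_n picks out one log-probability
    return -log_probs.mean()                   # negative mean log-likelihood
```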

Page 11

Neural Language Models: Training

Calculating the gradients is straightforward with backpropagation:

∂F/∂W = -(1/N) Σ_n (∂cost_n/∂p̂_n) (∂p̂_n/∂W)

∂F/∂V = -(1/N) Σ_n (∂cost_n/∂p̂_n) (∂p̂_n/∂h_n) (∂h_n/∂V)

[Figure: the trigram network with the cost node, annotated with these gradient paths]

[Slide: Phil Blunsom]
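A sketch of these gradients for a single example of the trigram model, using the standard softmax-plus-log-loss simplification (the gradient of the per-example loss with respect to the logits is p̂_n - w_n); the tanh nonlinearity and parameter layout are carried over from the earlier sketches and are assumptions, not the slide's derivation:

```python
import numpy as np

def gradients_one_example(V, c, W, b, w_prev1, w_prev2, target, vocab_size):
    """Backprop for one (context, target) pair of the trigram sketch above."""
    # Forward pass (same as the earlier forward(), kept explicit here).
    a = V[:, w_prev1] + V[:, vocab_size + w_prev2] + c
    h = np.tanh(a)
    z = W @ h + b
    p_hat = np.exp(z - z.max())
    p_hat /= p_hat.sum()

    # Backward pass for the per-example loss -cost_n = -log p̂_n[target].
    dz = p_hat.copy()
    dz[target] -= 1.0                          # p̂_n - w_n  (softmax + log-loss)
    dW = np.outer(dz, h)
    db = dz
    dh = W.T @ dz
    da = dh * (1.0 - h ** 2)                   # tanh'(a) = 1 - tanh(a)^2
    dV = np.zeros_like(V)
    dV[:, w_prev1] += da                       # the gradient only touches the two
    dV[:, vocab_size + w_prev2] += da          # columns selected by the one-hots
    dc = da
    return dV, dc, dW, db
```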

Page 12

Neural Language Models: Training

Calculating the gradients is straightforward with backpropagation:

∂F/∂W = -(1/4) Σ_{n=1}^{4} (∂cost_n/∂p̂_n) (∂p̂_n/∂W),    ∂F/∂V = -(1/4) Σ_{n=1}^{4} (∂cost_n/∂p̂_n) (∂p̂_n/∂h_n) (∂h_n/∂V)

[Figure: the model unrolled over four positions: contexts (w_{-1}, w_0) through (w_2, w_3) produce h_1, ..., h_4 and p̂_1, ..., p̂_4; each p̂_n is compared to the target w_n by cost_n, and the four costs sum into F]

Note that calculating the gradients for each time step n is independent of all other time steps, so they can be calculated in parallel and summed.

[Slide: Phil Blunsom]
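A sketch of that parallelism for the output weights W: the per-position terms are computed in one batched matrix operation and summed. The array names are hypothetical and assume the forward quantities for all positions are already stacked:

```python
import numpy as np

def grad_W_batched(H, P_hat, targets):
    """∂F/∂W summed over positions, computed in one batched operation.

    H:       (N, d_hidden) hidden vectors h_1..h_N
    P_hat:   (N, V) predicted distributions p̂_1..p̂_N
    targets: (N,) target word ids w_1..w_N
    """
    N = len(targets)
    dZ = P_hat.copy()
    dZ[np.arange(N), targets] -= 1.0        # p̂_n - w_n for every position at once
    return dZ.T @ H / N                     # equals (1/N) Σ_n outer(dz_n, h_n)
```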

Page 13

Comparison with Count-Based N-Gram LMs

Good

• Better generalisation on unseen n-grams, poorer on seen n-grams. Solution: direct (linear) n-gram features.

• Simple NLMs are often an order of magnitude smaller in memory footprint than their vanilla n-gram cousins (though not if you use the linear features suggested above!).

Bad

• The number of parameters in the model scales with the n-gram size and thus the length of the history captured.

• The n-gram history is finite and thus there is a limit on the longest dependencies that can be captured.

• Mostly trained with Maximum Likelihood-based objectives which do not encode the expected frequencies of words a priori.

[Slide: Phil Blunsom]

Page 14

Training NNs

• Dropout (preferred regularization method); see the sketch after this list

• Minibatching

• Parallelization (GPUs)

• Local optima?
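For the dropout bullet above, a minimal sketch of inverted dropout applied to a hidden vector (the rate and the train-time rescaling convention are standard choices used here for illustration, not taken from the slide):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, rate=0.5, train=True):
    """Inverted dropout: zero each unit with probability `rate`, rescale at train time."""
    if not train or rate == 0.0:
        return h                               # no-op at test time
    mask = rng.random(h.shape) >= rate         # keep each unit with probability 1 - rate
    return h * mask / (1.0 - rate)             # rescale so expectations match test time
```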

Page 15

Word/feature embeddings

• "Lookup layer": from discrete input features (words, n-grams, etc.) to continuous vectors (see the sketch after this list)

• Anything that was directly used as a feature in log-linear models: move to using vectors

• Learn or not?

• Learn: they’re just model parameters

• Fixed: use pretrained embeddings

• Use a faster-to-train model on a very large, perhaps different, dataset [e.g. word2vec, GloVe pretrained word vectors]

• Both: initialize with pretrained, then learn

• Word at test but not training time?

• Shared representations for domain adaptation and multitask learning
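For the lookup-layer and pretrained-initialization bullets above, a minimal sketch (the toy vocabulary, dimensions, and the pretrained dictionary are hypothetical stand-ins; real vectors would be loaded from word2vec or GloVe releases):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"<unk>": 0, "the": 1, "cat": 2}        # hypothetical toy vocabulary
d = 50

# Option 1: learn from scratch -- the embeddings are just model parameters.
E = rng.normal(scale=0.1, size=(len(vocab), d))

# Option 2/3: initialize rows from pretrained vectors where available, then
# either freeze E or keep training it along with the rest of the model.
pretrained = {"the": rng.normal(size=d)}        # stand-in for loaded word2vec/GloVe vectors
for word, idx in vocab.items():
    if word in pretrained:
        E[idx] = pretrained[word]

def lookup(words):
    """The "lookup layer": map discrete tokens to their continuous vectors."""
    return E[[vocab.get(w, vocab["<unk>"]) for w in words]]

x = lookup(["the", "cat", "dog"])               # unseen "dog" falls back to <unk>
```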

Page 16

                                      Local models                  Long-history models
                                      w_t | w_{t-2}, w_{t-1}        w_t | w_1, ..., w_{t-1}

Fully observed direct word models     . . . . . . Log-linear models . . . . . .

Latent-class direct word models       Markovian neural LM           Recurrent neural LM
