Neural Networks

Wei Xu (many slides from Greg Durrett and Philipp Koehn)

Mar 23, 2020

Transcript
Page 1:

Neural Networks

Wei Xu (many slides from Greg Durrett and Philipp Koehn)

Page 2:

Recap: Loss Functions

[Figure: loss as a function of w^T x for the hinge (SVM), logistic, perceptron, and 0-1 (ideal) losses]

Page 3:

Recap: Logistic Regression

‣ To learn weights: maximize discriminative log likelihood of data P(y|x)

  P(y = +|x) = logistic(w^T x)

  P(y = +|x) = exp(Σ_{i=1..n} w_i x_i) / (1 + exp(Σ_{i=1..n} w_i x_i))

  L(x_j, y_j = +) = log P(y_j = +|x_j)
                  = Σ_{i=1..n} w_i x_{ji} − log(1 + exp(Σ_{i=1..n} w_i x_{ji}))
                    (sum over features)
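A minimal numpy sketch of these two expressions (the weight and feature values here are made up for illustration):

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -1.0, 2.0])   # illustrative weights
x = np.array([1.0, 0.0, 1.0])    # illustrative features

p_pos = logistic(w @ x)                              # P(y = + | x)
log_lik = w @ x - np.log(1.0 + np.exp(w @ x))        # log P(y = + | x)
assert np.isclose(np.log(p_pos), log_lik)            # same quantity, two forms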

Page 4:

Recap: Multiclass Logistic Regression

  P_w(y|x) = exp(w^T f(x, y)) / Σ_{y' ∈ Y} exp(w^T f(x, y'))
             (sum over output space to normalize)

Why? Interpret raw classifier scores as probabilities.

Worked example ("too many drug trials, too few patients"):

  class     score w^T f(x,y)   exp (unnormalized probabilities, must be >= 0)   normalize (probabilities, must sum to 1)   correct (gold) probabilities
  Health    +2.2               6.05                                             0.21                                       1.00
  Sports    +3.1               22.2                                             0.77                                       0.00
  Science   -0.6               0.55                                             0.02                                       0.00

Compare: L(x_j, y*_j) = log P(y*_j|x_j) = log(0.21) = -1.56

  L(x, y) = Σ_{j=1..n} log P(y*_j|x_j)
  L(x_j, y*_j) = w^T f(x_j, y*_j) − log Σ_y exp(w^T f(x_j, y))
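The normalization step above in a small Python sketch (generic scores: exponentiate so everything is nonnegative, then divide by the sum so the values form a distribution):

import numpy as np

def softmax(scores):
    exps = np.exp(scores - np.max(scores))   # subtract max for numerical stability
    return exps / exps.sum()

scores = np.array([2.2, 3.1, -0.6])          # raw scores for Health, Sports, Science
probs = softmax(scores)
print(probs, probs.sum())                    # nonnegative and sums to 1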

Page 5:

Recap: Multiclass Logistic Regression

  P_w(y|x) = exp(w^T f(x, y)) / Σ_{y' ∈ Y} exp(w^T f(x, y'))
             (sum over output space to normalize)

‣ Training: maximize

  L(x, y) = Σ_{j=1..n} log P(y*_j|x_j)
          = Σ_{j=1..n} ( w^T f(x_j, y*_j) − log Σ_y exp(w^T f(x_j, y)) )

Page 6:

Administrivia

‣ Homework 1 graded (on Carmen)

‣ Homework 2 will be released soon.

‣ Reading: Eisenstein 3.1-3.3, Jurafsky+Martin 7.1-7.4

Page 7:

This Lecture

‣ Feedforward neural networks + backpropagation

‣ Neural network basics

‣ Applications

‣ Neural network history

‣ Implementing neural networks (if time)

Page 8:

History: NN "dark ages"

‣ ConvNets: applied to MNIST by LeCun in 1998

‣ LSTMs: Hochreiter and Schmidhuber (1997)

‣ Henderson (2003): neural shift-reduce parser, not SOTA

https://www.youtube.com/watch?v=FwFduRA_L6Q&feature=youtu.be

https://www.andreykurenkov.com/writing/ai/a-brief-history-of-neural-nets-and-deep-learning/

Page 9:

2008-2013: A glimmer of light…

‣ Collobert and Weston 2011: "NLP (almost) from scratch"

  – Feedforward neural nets induce features for sequential CRFs ("neural CRF")

  – The 2008 version was marred by bad experiments and claimed SOTA but wasn't; the 2011 version tied SOTA

‣ Socher 2011-2014: tree-structured RNNs working okay

‣ Krizhevsky et al. (2012): AlexNet for vision

Page 10:

2014: Stuff starts working

‣ Sutskever et al. (2014) + Bahdanau et al. (2015): seq2seq + attention for neural MT (LSTMs work for NLP?)

‣ Kim (2014) + Kalchbrenner et al. (2014): sentence classification / sentiment (convnets work for NLP?)

‣ Chen and Manning (2014): transition-based dependency parser (even feedforward networks work well for NLP?)

‣ 2015: explosion of neural nets for everything under the sun

Page 11:

Why didn't they work before?

‣ Datasets too small: for MT, not really better until you have 1M+ parallel sentences (and really need a lot more)

‣ Optimization not well understood: good initialization, per-feature scaling + momentum (Adagrad / Adadelta / Adam) work best out-of-the-box

‣ Regularization: dropout is pretty helpful

‣ Inputs: need word representations to have the right continuous semantics

‣ Computers not big enough: can't run for enough iterations

Page 12:

Neural Net Basics

Page 13:

Neural Networks: motivation

‣ How can we do nonlinear classification? Kernels are too slow…

‣ Want to learn intermediate conjunctive features of the input

‣ Linear classification: argmax_y w^T f(x, y)

  Example: "the movie was not all that good" with feature I[contains not & contains good]

Page 14:

Neural Networks: XOR

‣ Let's see how we can use neural nets to learn a simple nonlinear function

‣ Inputs: x1, x2 (generally x = (x1, ..., xm))

‣ Output: y (generally y = (y1, ..., yn)); here y = x1 XOR x2

  x1   x2   y = x1 XOR x2
  0    0    0
  0    1    1
  1    0    1
  1    1    0

Page 15:

Neural Networks: XOR

  x1   x2   x1 XOR x2
  0    0    0
  0    1    1
  1    0    1
  1    1    0

‣ y = a1 x1 + a2 x2   ✗ (a purely linear function cannot fit XOR)

‣ y = a1 x1 + a2 x2 + a3 tanh(x1 + x2)
  (the tanh term acts like an "or" of the inputs and looks like the action potential in a neuron)

Page 16:

Neural Networks: XOR

  x1   x2   x1 XOR x2
  0    0    0
  0    1    1
  1    0    1
  1    1    0

‣ y = a1 x1 + a2 x2   ✗

‣ y = a1 x1 + a2 x2 + a3 tanh(x1 + x2), with the "or"-like tanh unit; choosing a1 = a2 = −1, a3 = 2 gives

  y = −x1 − x2 + 2 tanh(x1 + x2)

[Plot of this function over the (x1, x2) plane; a quick numerical check follows]
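A small check (a sketch, not from the slides) that this function separates XOR: it is clearly positive exactly on the inputs where x1 XOR x2 = 1 (the 0.25 threshold below is an arbitrary choice for the check):

import math

def y(x1, x2):
    return -x1 - x2 + 2 * math.tanh(x1 + x2)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, x1 ^ x2, round(y(x1, x2), 3), y(x1, x2) > 0.25)
# (0,0) -> 0.0, (0,1) and (1,0) -> ~0.523, (1,1) -> ~-0.072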

Page 17:

Neural Networks: XOR

‣ The same idea on the sentiment example "the movie was not all that good", with the indicator features I[not] and I[good] as the two inputs:

  y = −2 x1 − x2 + 2 tanh(x1 + x2)

[Plot over the two indicator features, with axes labeled [not] and [good]]

Page 18:

Neural Networks

Linear model:  y = w · x + b

Neural network:  y = g(w · x + b), or in matrix form  y = g(Wx + b)
  – Wx: warp space
  – + b: shift
  – g: nonlinear transformation (e.g., tanh)

Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Page 19:

Neural Networks

Linear classifier vs. neural network … the separation is possible because we transformed the space!

Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Page 20:

Deep Neural Networks

  y = g(Wx + b)            (output of first layer)
  z = g(Vy + c)            (second layer on top of it)
  z = g(V g(Wx + b) + c)

"Feedforward" computation (not recurrent)

‣ Check: what happens if there is no nonlinearity, i.e. z = V(Wx + b) + c? More powerful than basic linear models? (See the sketch below.)

Adopted from Chris Dyer
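A numpy sketch of the two-layer computation (shapes and values here are made up), including the check that dropping the nonlinearity collapses everything into a single linear map:

import numpy as np

rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 3)), rng.normal(size=4)   # first layer
V, c = rng.normal(size=(2, 4)), rng.normal(size=2)   # second layer
x = rng.normal(size=3)

y = np.tanh(W @ x + b)      # output of first layer
z = np.tanh(V @ y + c)      # output of second layer

# Without g, the network is just another linear model:
z_linear = V @ (W @ x + b) + c
assert np.allclose(z_linear, (V @ W) @ x + (V @ b + c))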

Page 21:

Deep Neural Networks

Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Page 22:

Feedforward Networks, Backpropagation

Page 23:

Administrivia

‣ Homework 2 is released, due on February 18 (start early!).

‣ Reading: Eisenstein 7.0-7.4, Jurafsky+Martin Chapter 8

Page 24:

Simple Neural Network

‣ Assumes that the labels y are indexed and associated with coordinates in a vector space

[Network diagram: two input units and a bias unit feed two hidden units, which together with another bias unit feed one output unit; edge weights shown: 3.7, 3.7, 2.9, 2.9, -1.5, -4.6, 4.5, -5.2, -2.0]

‣ One innovation: bias units (no inputs, always value 1)

Page 25:

Simple Neural Network

[Same network diagram as on the previous slide]

‣ One innovation: bias units (no inputs, always value 1)

Page 26:

Simple Neural Network

‣ Try out two input values: 1.0 and 0.0

‣ Hidden unit computation:

  sigmoid(1.0 × 3.7 + 0.0 × 3.7 + 1 × (−1.5)) = sigmoid(2.2) = 1 / (1 + e^{−2.2}) = 0.90

  sigmoid(1.0 × 2.9 + 0.0 × 2.9 + 1 × (−4.5)) = sigmoid(−1.6) = 1 / (1 + e^{1.6}) = 0.17

Page 27:

Simple Neural Network

‣ Computed hidden values: .90 and .17, from the hidden unit computation on the previous slide:

  sigmoid(1.0 × 3.7 + 0.0 × 3.7 + 1 × (−1.5)) = sigmoid(2.2) = 1 / (1 + e^{−2.2}) = 0.90

  sigmoid(1.0 × 2.9 + 0.0 × 2.9 + 1 × (−4.5)) = sigmoid(−1.6) = 1 / (1 + e^{1.6}) = 0.17

Page 28:

Simple Neural Network

‣ Output unit computation:

  sigmoid(.90 × 4.5 + .17 × (−5.2) + 1 × (−2.0)) = sigmoid(1.17) = 1 / (1 + e^{−1.17}) = 0.76

(A code sketch of this forward pass follows.)
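The whole forward pass from the last three slides in a few lines of Python (a sketch using the weights from the computations above):

import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

x0, x1 = 1.0, 0.0
h0 = sigmoid(3.7 * x0 + 3.7 * x1 + (-1.5) * 1)    # ≈ 0.90
h1 = sigmoid(2.9 * x0 + 2.9 * x1 + (-4.5) * 1)    # ≈ 0.17
y = sigmoid(4.5 * h0 + (-5.2) * h1 + (-2.0) * 1)  # ≈ 0.76
print(round(h0, 2), round(h1, 2), round(y, 2))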

Page 29:

Output for all Binary Inputs

  Input x0   Input x1   Hidden h0   Hidden h1   Output y
  0          0          0.12        0.02        0.18 → 0
  0          1          0.88        0.27        0.74 → 1
  1          0          0.73        0.12        0.74 → 1
  1          1          0.99        0.73        0.33 → 0

‣ Network implements XOR
  – hidden node h0 is OR
  – hidden node h1 is AND
  – the final layer combines them (roughly "h0 and not h1"), which is XOR

‣ Power of deep neural networks: chaining of processing steps, just as more Boolean circuits → more complex computations possible

Page 30:

Error

‣ Computed output: y = .76

‣ Correct output: t = 1.0

‣ Q: how do we adjust the weights?

Page 31:

Gradient Descent

[Plot of error(λ) against the parameter λ, showing the gradient at the current λ and the optimal λ]

Page 32:

Gradient Descent

[Contour plot over (w1, w2): the gradient for w1 and the gradient for w2 at the current point combine into the combined gradient, which points toward the optimum]

Page 33:

Derivative of Sigmoid

‣ Sigmoid function:  sigmoid(x) = 1 / (1 + e^{−x})

‣ Reminder, quotient rule:  (f(x) / g(x))' = (g(x) f'(x) − f(x) g'(x)) / g(x)^2

‣ Derivative:

  d sigmoid(x)/dx = d/dx [ 1 / (1 + e^{−x}) ]
                  = (0 × (1 + e^{−x}) − (−e^{−x})) / (1 + e^{−x})^2
                  = (1 / (1 + e^{−x})) × (e^{−x} / (1 + e^{−x}))
                  = (1 / (1 + e^{−x})) × (1 − 1 / (1 + e^{−x}))
                  = sigmoid(x) (1 − sigmoid(x))
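A quick numerical sanity check of this identity (a sketch; the test point 0.7 is arbitrary), comparing the closed form against a central finite difference:

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x, eps = 0.7, 1e-6
analytic = sigmoid(x) * (1 - sigmoid(x))
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
assert abs(analytic - numeric) < 1e-8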

Page 34:

Final Layer Update

‣ Linear combination of weights:  s = Σ_k w_k h_k

‣ Activation function:  y = sigmoid(s)

‣ Error (L2 norm):  E = ½ (t − y)^2

‣ Derivative of error with regard to one weight w_k (chain rule):

  dE/dw_k = dE/dy · dy/ds · ds/dw_k

Page 35:

Final Layer Update (1)

‣ Linear combination of weights:  s = Σ_k w_k h_k

‣ Activation function:  y = sigmoid(s)

‣ Error (L2 norm):  E = ½ (t − y)^2

‣ Derivative of error with regard to one weight w_k:  dE/dw_k = dE/dy · dy/ds · ds/dw_k

‣ Error E is defined with respect to y:

  dE/dy = d/dy [ ½ (t − y)^2 ] = −(t − y)

Page 36:

Final Layer Update (2)

‣ Linear combination of weights:  s = Σ_k w_k h_k

‣ Activation function:  y = sigmoid(s)

‣ Error (L2 norm):  E = ½ (t − y)^2

‣ Derivative of error with regard to one weight w_k:  dE/dw_k = dE/dy · dy/ds · ds/dw_k

‣ y with respect to s is sigmoid(s):

  dy/ds = d sigmoid(s)/ds = sigmoid(s) (1 − sigmoid(s)) = y (1 − y)

Page 37:

Final Layer Update (3)

‣ Linear combination of weights:  s = Σ_k w_k h_k

‣ Activation function:  y = sigmoid(s)

‣ Error (L2 norm):  E = ½ (t − y)^2

‣ Derivative of error with regard to one weight w_k:  dE/dw_k = dE/dy · dy/ds · ds/dw_k

‣ s is a weighted linear combination of hidden node values h_k:

  ds/dw_k = d/dw_k [ Σ_k w_k h_k ] = h_k

Page 38:

Putting it All Together

‣ Derivative of error with regard to one weight w_k:

  dE/dw_k = dE/dy · dy/ds · ds/dw_k = −(t − y) · y(1 − y) · h_k
  (the first factor is the error; y(1 − y) is the derivative of the sigmoid, y')

‣ Weight adjustment will be scaled by a fixed learning rate µ:

  Δw_k = µ (t − y) y' h_k
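As a tiny Python sketch of this update rule (the function name is ours, not from the slides):

def final_layer_delta_w(mu, t, y, h_k):
    """Weight change for one final-layer weight: mu * (t - y) * y' * h_k."""
    y_prime = y * (1 - y)   # derivative of the sigmoid at the output y
    return mu * (t - y) * y_prime * h_k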

Page 39:

Multiple Output Nodes

‣ Our example only had one output node; typically neural networks have multiple output nodes

‣ Error is computed over all j output nodes:

  E = Σ_j ½ (t_j − y_j)^2

‣ Weights w_{j←k} are adjusted according to the node they point to:

  Δw_{j←k} = µ (t_j − y_j) y'_j h_k

Page 40:

Hidden Layer Update

‣ In a hidden layer, we do not have a target output value

‣ But we can compute how much each node contributed to downstream error

‣ Definition of the error term of each node:

  δ_j = (t_j − y_j) y'_j

‣ Back-propagate the error term (why this way? there is math to back it up…):

  δ_i = ( Σ_j w_{j←i} δ_j ) y'_i

‣ Universal update formula:

  Δw_{j←k} = µ δ_j h_k

Page 41:

Our Example

[Network diagram as before, with nodes labeled A, B, C (inputs and bias), D, E, F (hidden units and bias), G (output)]

‣ Computed output: y = .76

‣ Correct output: t = 1.0

‣ Final layer weight updates (learning rate µ = 10):
  – δ_G = (t − y) y' = (1 − .76) × 0.181 = .0434
  – Δw_GD = µ δ_G h_D = 10 × .0434 × .90 = .391
  – Δw_GE = µ δ_G h_E = 10 × .0434 × .17 = .074
  – Δw_GF = µ δ_G h_F = 10 × .0434 × 1 = .434

Page 42:

Our Example

(Same computed output, correct output, and final layer weight updates as on the previous slide.)

Page 43:

Our Example

‣ Computed output: y = .76

‣ Correct output: t = 1.0

‣ Final layer weight updates (learning rate µ = 10):
  – δ_G = (t − y) y' = (1 − .76) × 0.181 = .0434
  – Δw_GD = 10 × .0434 × .90 = .391,  so w_GD: 4.5 → 4.891
  – Δw_GE = 10 × .0434 × .17 = .074,  so w_GE: −5.2 → −5.126
  – Δw_GF = 10 × .0434 × 1 = .434,   so w_GF: −2.0 → −1.566
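These numbers can be re-traced in a few lines of Python (a sketch that plugs in the slide's rounded values y = .76, y' = .181, h_D = .90, h_E = .17, h_F = 1, µ = 10):

mu, t, y, y_prime = 10, 1.0, 0.76, 0.181
h = {"D": 0.90, "E": 0.17, "F": 1.0}

delta_G = (t - y) * y_prime                                # ≈ 0.0434
updates = {k: mu * delta_G * h_k for k, h_k in h.items()}
print(round(delta_G, 4), {k: round(v, 3) for k, v in updates.items()})
# 0.0434 {'D': 0.391, 'E': 0.074, 'F': 0.434}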

Page 44:

Hidden Layer Updates

‣ Hidden node D:
  – δ_D = ( Σ_j w_{j←i} δ_j ) y'_D = w_GD δ_G y'_D = 4.5 × .0434 × .0898 = .0175
  – Δw_DA = µ δ_D h_A = 10 × .0175 × 1.0 = .175
  – Δw_DB = µ δ_D h_B = 10 × .0175 × 0.0 = 0
  – Δw_DC = µ δ_D h_C = 10 × .0175 × 1 = .175

‣ Hidden node E:
  – δ_E = ( Σ_j w_{j←i} δ_j ) y'_E = w_GE δ_G y'_E = −5.2 × .0434 × 0.2055 = −.0464
  – Δw_EA = µ δ_E h_A = 10 × (−.0464) × 1.0 = −.464
  – etc.
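The node-D part as a Python sketch (again plugging in the slide's rounded values δ_G ≈ .0434 and y'_D ≈ .0898; h_A = 1.0, h_B = 0.0, h_C = 1 are the inputs feeding node D):

mu, w_GD, delta_G, y_prime_D = 10, 4.5, 0.0434, 0.0898
delta_D = w_GD * delta_G * y_prime_D                       # ≈ 0.0175
h = {"A": 1.0, "B": 0.0, "C": 1.0}
updates = {k: mu * delta_D * h_k for k, h_k in h.items()}
print(round(delta_D, 4), {k: round(v, 3) for k, v in updates.items()})
# 0.0175 {'A': 0.175, 'B': 0.0, 'C': 0.175}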

Page 45:

Logistic Regression with NNs

  P(y|x) = exp(w^T f(x, y)) / Σ_{y'} exp(w^T f(x, y'))    ‣ Single scalar probability

  P(y|x) = softmax( [w^T f(x, y)]_{y ∈ Y} )               ‣ Compute scores for all possible labels at once (returns vector)

  softmax(p)_i = exp(p_i) / Σ_{i'} exp(p_{i'})            ‣ softmax: exps and normalizes a given vector

  P(y|x) = softmax(W f(x))                                ‣ Weight vector per class; W is [num classes x num feats]

  P(y|x) = softmax(W g(V f(x)))                           ‣ Now one hidden layer

Page 46:

Neural Networks for Classification

  P(y|x) = softmax(W g(V f(x)))

[Diagram: f(x) (n features) → V (d x n matrix) → nonlinearity g (tanh, relu, …) → z (d hidden units) → W (num_classes x d matrix) → softmax → probs P(y|x) (num_classes)]

We can think of a neural network classifier with one hidden layer as building a vector z which is a hidden layer representation of the input, and then running standard logistic regression on the features that the network develops in z.
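The same computation as a small numpy sketch (sizes here, n = 5 features, d = 4 hidden units, 3 classes, and the random values are just for illustration):

import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n, d, num_classes = 5, 4, 3
V = rng.normal(size=(d, n))              # d x n
W = rng.normal(size=(num_classes, d))    # num_classes x d

f_x = rng.normal(size=n)                 # feature vector f(x)
z = np.tanh(V @ f_x)                     # hidden representation z = g(V f(x))
probs = softmax(W @ z)                   # P(y|x) = softmax(W z)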

Page 47:

Training Neural Networks

  z = g(V f(x))        P(y|x) = softmax(Wz)

‣ Maximize log likelihood of training data

  L(x, i*) = log P(y = i*|x) = log( softmax(Wz) · e_{i*} )

  L(x, i*) = Wz · e_{i*} − log Σ_j exp(Wz) · e_j

‣ i*: index of the gold label

‣ e_i: 1 in the ith row, zero elsewhere. Dot by this = select ith index

Page 48:

Computing Gradients

  L(x, i*) = Wz · e_{i*} − log Σ_j exp(Wz) · e_j

‣ Gradient with respect to W (a num_classes x d matrix; i indexes the output space Y, i.e. the row of the gold label, and j indexes the vector z):

  ∂L(x, i*)/∂W_ij = z_j − P(y = i|x) z_j   if i = i*
                  = −P(y = i|x) z_j        otherwise

‣ Looks like logistic regression with z as the features!

Page 49:

Neural Networks for Classification

  P(y|x) = softmax(W g(V f(x)))

[Same diagram as before: f(x) → V → g → z → W → softmax → P(y|x), now annotated with the gradient ∂L/∂W flowing back into W from the loss, computed using z]

Compu7ngGradients

‣ Gradientwithrespectto

L(x, i⇤) = Wz · ei⇤ � logX

j

exp(Wz) · ej

@L(x, i⇤)@z

= Wi⇤ �X

j

P (y = j|x)Wj

z

err(root) = ei⇤ � P (y|x)dim=num_classesdim=d

@L(x, i⇤)@z

= err(z) = W>err(root)

[somemath…]

Page 51:

Backpropagation: Picture

  P(y|x) = softmax(W g(V f(x)))

[Diagram: f(x) → V → g → z → W → softmax → P(y|x), annotated with ∂L/∂W and err(root) at the output and err(z) at z]

‣ Can forget everything after z, treat it as the output, and keep backpropping

Compu7ngGradients:Backpropaga7onz = g(V f(x))

Ac7va7onsathiddenlayer

‣ GradientwithrespecttoV:applythechainrule

err(root) = ei⇤ � P (y|x)dim=num_classes dim=d

@L(x, i⇤)@z

= err(z) = W>err(root)

L(x, i⇤) = Wz · ei⇤ � logX

j

exp(Wz) · ej

[somemath…]

@L(x, i⇤)@Vij

=@L(x, i⇤)

@z

@z

@Vij

Page 53:

Backpropagation: Picture

  P(y|x) = softmax(W g(V f(x)))

[Same diagram, now also annotated with ∂z/∂V and err(z) at the hidden layer and f(x) at the input]

Page 54:

Backpropagation

  P(y|x) = softmax(W g(V f(x)))

‣ Step 1: compute err(root) = e_{i*} − P(y|x)   (vector)

‣ Step 2: compute derivatives of W using err(root)   (matrix)

‣ Step 3: compute ∂L(x, i*)/∂z = err(z) = W^T err(root)   (vector)

‣ Step 4: compute derivatives of V using err(z)   (matrix)

‣ Step 5+: continue backpropagation (compute err(f(x)) if necessary…)

(The steps are sketched in code below.)
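A numpy sketch of these steps for a single example (g = tanh; shapes and values are illustrative, and the gradients are of the log likelihood L, so they would be used for gradient ascent):

import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n, d, num_classes, i_star = 5, 4, 3, 1
V, W = rng.normal(size=(d, n)), rng.normal(size=(num_classes, d))
f_x = rng.normal(size=n)

z = np.tanh(V @ f_x)                          # forward: hidden layer
probs = softmax(W @ z)                        # forward: P(y|x)

e_gold = np.eye(num_classes)[i_star]
err_root = e_gold - probs                     # Step 1
grad_W = np.outer(err_root, z)                # Step 2: dL/dW
err_z = W.T @ err_root                        # Step 3
grad_V = np.outer(err_z * (1 - z ** 2), f_x)  # Step 4: dL/dV, chain rule through tanh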

Page 55:

Backpropagation: Takeaways

‣ Gradients of output weights W are easy to compute — looks like logistic regression with hidden layer z as the feature vector

‣ Can compute derivative of loss with respect to z to form an "error signal" for backpropagation

‣ Easy to update parameters based on the "error signal" from the next layer; keep pushing the error signal back as backpropagation

‣ Need to remember the values from the forward computation

Page 56:

Applications

Page 57:

NLP with Feedforward Networks

‣ Part-of-speech tagging with FFNNs

  Example: "Fed raises interest rates in order to …" To tag "interest", f(x) is built from emb(raises), emb(interest), emb(rates) (previous word, current word, next word), plus other words, feats, etc.

‣ Word embeddings for each word form the input

‣ ~1000 features here — smaller feature vector than in sparse models, but every feature fires on every example

‣ Weight matrix learns position-dependent processing of the words

Botha et al. (2017)

Page 58:

NLP with Feedforward Networks

‣ Hidden layer mixes these different signals and learns feature conjunctions

Botha et al. (2017)

Page 59:

NLP with Feedforward Networks

‣ Multilingual tagging results: [results table from Botha et al. (2017)]

‣ Gillick used LSTMs; this is smaller, faster, and better

Botha et al. (2017)

Page 60:

Sentiment Analysis

‣ Deep Averaging Networks: feedforward neural network on average of word embeddings from input

Iyyer et al. (2015)

Page 61:

Sentiment Analysis

[Results table comparing bag-of-words models (Wang and Manning 2012) with Tree RNNs / CNNs / LSTMs (Kim 2014; Iyyer et al. 2015)]

Page 62:

Coreference Resolution

‣ Feedforward networks identify coreference arcs

  "President Obama signed…"   ←?→   "He later gave a speech…"

Clark and Manning (2015), Wiseman et al. (2015)

Page 63:

Implementation Details

Page 64:

Computation Graphs

‣ Computing gradients is hard!

‣ Automatic differentiation: instrument code to keep track of derivatives

  y = x * x   →(codegen)→   (y, dy) = (x * x, 2 * x * dx)

‣ Computation is now something we need to reason about symbolically

‣ Use a library like Pytorch or Tensorflow. This class: Pytorch

Page 65:

Computation Graphs in Pytorch

  P(y|x) = softmax(W g(V f(x)))

‣ Define forward pass for P(y|x)

import torch.nn as nn

class FFNN(nn.Module):
    def __init__(self, inp, hid, out):
        super(FFNN, self).__init__()
        self.V = nn.Linear(inp, hid)
        self.g = nn.Tanh()
        self.W = nn.Linear(hid, out)
        self.softmax = nn.Softmax(dim=0)

    def forward(self, x):
        # P(y|x) = softmax(W g(V x)) for a single (1-D) feature vector x
        return self.softmax(self.W(self.g(self.V(x))))

Page 66:

Computation Graphs in Pytorch

  P(y|x) = softmax(W g(V f(x)))

ffnn = FFNN(inp, hid, out)   # feature, hidden, and output sizes

def make_update(input, gold_label):
    ffnn.zero_grad()   # clear gradient variables
    probs = ffnn.forward(input)
    # gold_label is e_{i*}: one-hot vector of the label (e.g., [0, 1, 0])
    loss = torch.neg(torch.log(probs)).dot(gold_label)
    loss.backward()
    optimizer.step()
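A minimal end-to-end usage sketch (the sizes, optimizer choice, and dummy data below are assumptions, not from the slides):

import torch

ffnn = FFNN(inp=10, hid=8, out=3)
optimizer = torch.optim.Adam(ffnn.parameters(), lr=0.01)

input = torch.rand(10)       # a dummy feature vector f(x)
gold_label = torch.zeros(3)
gold_label[1] = 1.0          # e_{i*}: one-hot gold label
make_update(input, gold_label)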

Page 67:

Training a Model

Define a computation graph

For each epoch:

    For each batch of data:

        Compute loss on batch

        Autograd to compute gradients and take step

Decode test set

Page 68:

Batching

‣ Batching data gives speedups due to more efficient matrix operations

‣ Need to make the computation graph process a batch at the same time

‣ Batch sizes from 1-100 often work well

def make_update(input, gold_label):
    # input is [batch_size, num_feats]
    # gold_label is [batch_size, num_classes]
    ...
    probs = ffnn.forward(input)   # [batch_size, num_classes]
    # elementwise multiply by the one-hot gold labels, then sum over the batch
    loss = torch.sum(torch.neg(torch.log(probs)) * gold_label)
    ...