CSE 490 U: Deep Learning, Spring 2016. Yejin Choi. Some slides from Carlos Guestrin, Andrew Rosenberg, Luke Zettlemoyer.
Transcript
Page 1:

CSE 490 U: Deep Learning, Spring 2016

Yejin Choi

Some slides from Carlos Guestrin, Andrew Rosenberg, Luke Zettlemoyer

Page 2:

Page 3:

Page 4:

Human Neurons

• Switching time: ~0.001 second
• Number of neurons: ~10^10
• Connections per neuron: ~10^4 to 10^5
• Scene recognition time: ~0.1 seconds
• Number of cycles per scene recognition? ~100 → much parallel computation!

Page 5:

Perceptron as a Neural Network

This is one neuron:
– Input edges x1 ... xn, along with a bias
– The sum is represented graphically
– The sum is passed through an activation function g
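A minimal sketch of this single neuron in Python (the step activation, weights, bias, and inputs below are made-up values for illustration, not anything specified on the slide):

import numpy as np

def step(z):
    # Threshold activation: fire (1) if the weighted sum is positive, else 0.
    return 1.0 if z > 0 else 0.0

def neuron(x, w, b, g=step):
    # One neuron: weighted sum of the inputs plus the bias, passed through the activation g.
    return g(np.dot(w, x) + b)

x = np.array([1.0, 0.0, 1.0])
w = np.array([0.5, -0.2, 0.3])
b = -0.4
print(neuron(x, w, b))  # 1.0, since 0.5 + 0.3 - 0.4 = 0.4 > 0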

Page 6:

Sigmoid Neuron

• Just change g!
• Why would we want to do this?
• Notice the new output range [0,1]. What was it before?
• Look familiar?
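A quick sketch of the sigmoid activation and its derivative; the identity g'(x) = g(x)(1 - g(x)) is used in the gradient derivations on the next slides:

import numpy as np

def sigmoid(z):
    # Squashes any real number into (0, 1); the perceptron's step function had range {0, 1}.
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # g'(z) = g(z) * (1 - g(z))
    g = sigmoid(z)
    return g * (1.0 - g)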

Page 7:

Optimizing a neuron
We train to minimize sum-squared error:

$$\frac{\partial \ell}{\partial w_i} = -\sum_j \Big[ y^j - g\big(w_0 + \textstyle\sum_i w_i x_i^j\big) \Big]\, \frac{\partial}{\partial w_i}\, g\big(w_0 + \textstyle\sum_i w_i x_i^j\big)$$

Solution just depends on g': the derivative of the activation function!

$$\frac{\partial}{\partial w_i}\, g\big(w_0 + \textstyle\sum_i w_i x_i^j\big) = x_i^j\, g'\big(w_0 + \textstyle\sum_i w_i x_i^j\big)$$

$$g'(x) = g(x)\,\big(1 - g(x)\big), \qquad \frac{\partial}{\partial x} f(g(x)) = f'(g(x))\, g'(x)$$

Page 8:

Sigmoid units: we have to differentiate g.

$$\frac{\partial \ell}{\partial w_i} = -\sum_j \Big[ y^j - g\big(w_0 + \textstyle\sum_i w_i x_i^j\big) \Big]\, \frac{\partial}{\partial w_i}\, g\big(w_0 + \textstyle\sum_i w_i x_i^j\big)$$

$$\frac{\partial}{\partial w_i}\, g\big(w_0 + \textstyle\sum_i w_i x_i^j\big) = x_i^j\, g'\big(w_0 + \textstyle\sum_i w_i x_i^j\big), \qquad g'(x) = g(x)\,\big(1 - g(x)\big)$$
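A minimal sketch of gradient descent with this gradient for a single sigmoid unit trained with squared error (the toy data, learning rate, and iteration count are arbitrary placeholders):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: a leading column of 1s lets w[0] play the role of the bias w_0.
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])
y = np.array([0.0, 0.0, 0.0, 1.0])   # e.g. logical AND of the two inputs
w = np.zeros(3)
eta = 0.5                            # learning rate

for _ in range(1000):
    pred = sigmoid(X @ w)                    # g(w_0 + sum_i w_i x_i^j) for every example j
    delta = -(y - pred) * pred * (1 - pred)  # -(y^j - g(...)) * g'(...), using g' = g(1 - g)
    w -= eta * (delta @ X)                   # sum the per-example gradients and step downhill
print(w, sigmoid(X @ w))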

Page 9:

Perceptron, linear classification, Boolean functions: xi ∈ {0,1}

• Can it learn x1 ∨ x2?
  • Yes: -0.5 + x1 + x2
• Can it learn x1 ∧ x2?
  • Yes: -1.5 + x1 + x2
• Can it learn any conjunction or disjunction?
  • Disjunction: -0.5 + x1 + … + xn
  • Conjunction: (-n + 0.5) + x1 + … + xn
• Can it learn majority?
  • (-0.5 * n) + x1 + … + xn
• What are we missing? The dreaded XOR!, etc.
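These weight settings can be sanity-checked as linear threshold units; a small sketch (the helper names and n = 3 are our own choices):

import itertools
import numpy as np

def threshold_unit(weights, bias):
    # A perceptron-style unit: fires iff bias + w . x > 0.
    return lambda x: bias + np.dot(weights, x) > 0

n = 3
disjunction = threshold_unit(np.ones(n), -0.5)        # x1 OR ... OR xn
conjunction = threshold_unit(np.ones(n), -n + 0.5)    # x1 AND ... AND xn
majority    = threshold_unit(np.ones(n), -0.5 * n)    # more than half the inputs are 1

for bits in itertools.product([0, 1], repeat=n):
    x = np.array(bits)
    assert disjunction(x) == (x.sum() >= 1)
    assert conjunction(x) == (x.sum() == n)
    assert majority(x)    == (x.sum() > n / 2)
print("all checks pass")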

Page 10:

Going beyond linear classification: solving the XOR problem

y = x1 XOR x2 = (x1 ∧ ¬x2) ∨ (x2 ∧ ¬x1)

v1 = (x1 ∧ ¬x2) = -1.5 + 2·x1 - x2
v2 = (x2 ∧ ¬x1) = -1.5 + 2·x2 - x1
y = v1 ∨ v2 = -0.5 + v1 + v2

[Figure: a two-layer network over inputs x1, x2 and a constant 1, with hidden units v1, v2 feeding the output y; the edge weights are the coefficients above.]
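A sketch of this construction with the hidden units as threshold neurons (the step helper is ours; the weights are the ones on the slide):

import itertools

def step(z):
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    # Hidden layer: v1 = x1 AND (NOT x2), v2 = x2 AND (NOT x1).
    v1 = step(-1.5 + 2 * x1 - x2)
    v2 = step(-1.5 + 2 * x2 - x1)
    # Output layer: y = v1 OR v2.
    return step(-0.5 + v1 + v2)

for x1, x2 in itertools.product([0, 1], repeat=2):
    print(x1, x2, xor_net(x1, x2))   # prints the XOR truth table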

Page 11:

Hidden layer

• Single unit:
• 1-hidden layer:
• No longer a convex function!

Page 12:

Example data for NN with hidden layer

Page 13:

Learned weights for hidden layer

Page 14:

Why "representation learning"?

• MaxEnt (multinomial logistic regression):
  y = softmax(w · f(x, y))
  You design the feature vector.

• NNs:
  y = softmax(w · σ(Ux))
  y = softmax(w · σ(U^(n)(… σ(U^(2) σ(U^(1) x)))))
  Feature representations are "learned" through the hidden layers.

Page 15:

Very deep models in computer vision

Page 16:

RECURRENT NEURAL NETWORKS

Page 17:

Recurrent Neural Networks (RNNs)

[Figure: a chain of RNN units over inputs x1 … xn producing hidden states h1 … hn.]

• Each RNN unit computes a new hidden state using the previous state and a new input
• Each RNN unit (optionally) makes an output using the current hidden state
• Hidden states are continuous vectors
  – Can represent very rich information
  – Possibly the entire history from the beginning
• Parameters are shared (tied) across all RNN units (unlike feedforward NNs)

h_t = f(x_t, h_{t-1}),   h_t ∈ R^D
y_t = softmax(V h_t)

Page 18:

Recurrent Neural Networks (RNNs)

• Generic RNNs:
  h_t = f(x_t, h_{t-1})
  y_t = softmax(V h_t)

• Vanilla RNN:
  h_t = tanh(U x_t + W h_{t-1} + b)
  y_t = softmax(V h_t)

Page 19:

Many uses of RNNs

1. Classification (seq to one)
• Input: a sequence
• Output: one label (classification)
• Example: sentiment classification

h_t = f(x_t, h_{t-1})
y = softmax(V h_n)

Page 20:

Many uses of RNNs

2. One to seq
• Input: one item
• Output: a sequence
• Example: image captioning ("Cat sitting on top of …")

h_t = f(x_t, h_{t-1})
y_t = softmax(V h_t)

Page 21:

Many uses of RNNs

3. Sequence tagging
• Input: a sequence
• Output: a sequence (of the same length)
• Example: POS tagging, Named Entity Recognition
• How about Language Models?
  – Yes! RNNs can be used as LMs!
  – RNNs make the Markov assumption: T/F?

h_t = f(x_t, h_{t-1})
y_t = softmax(V h_t)

Page 22:

Many uses of RNNs

4. Language models
• Input: a sequence of words
• Output: one next word, or a sequence of next words
• During training, x_t is the actual word in the training sentence.
• During testing, x_t is the word predicted from the previous time step.
• Do RNN LMs make the Markov assumption?
  – i.e., that the next word depends only on the previous N words

h_t = f(x_t, h_{t-1})
y_t = softmax(V h_t)
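A rough sketch of the train-time vs. test-time difference described above, reusing the rnn_step sketch from earlier (the embedding table, parameter tuple, and greedy decoding are our own illustrative choices):

import numpy as np

def lm_training_pairs(sentence_ids):
    # Training: the input at step t is the actual word t; the target is word t+1.
    return list(zip(sentence_ids[:-1], sentence_ids[1:]))

def generate(rnn_step, embed, params, start_id, h0, steps=10):
    # Testing / generation: feed the model's own prediction back in as the next input.
    h, w = h0, start_id
    out = []
    for _ in range(steps):
        h, y = rnn_step(embed[w], h, *params)   # y is a distribution over the vocabulary
        w = int(np.argmax(y))                   # greedy choice of the next word
        out.append(w)
    return out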

Page 23:

Many uses of RNNs

5. seq2seq (aka "encoder-decoder")
• Input: a sequence
• Output: a sequence (of different length)
• Examples?

h_t = f(x_t, h_{t-1})
y_t = softmax(V h_t)

Page 24:

Many uses of RNNs

5. seq2seq (aka "encoder-decoder")
• Conversation and dialogue
• Machine translation

Figure from http://www.wildml.com/category/conversational-agents/

Page 25:

Many uses of RNNs

5. seq2seq (aka "encoder-decoder")

"John has a dog"

Parsing! – "Grammar as a Foreign Language" (Vinyals et al., 2015)

Page 26:

Recurrent Neural Networks (RNNs)

• Generic RNNs:
  h_t = f(x_t, h_{t-1})
  y_t = softmax(V h_t)

• Vanilla RNN:
  h_t = tanh(U x_t + W h_{t-1} + b)
  y_t = softmax(V h_t)

Page 27:

Recurrent Neural Networks (RNNs)

• Generic RNNs: h_t = f(x_t, h_{t-1})
• Vanilla RNNs: h_t = tanh(U x_t + W h_{t-1} + b)
• LSTMs (Long Short-term Memory Networks):

  f_t = σ(U^(f) x_t + W^(f) h_{t-1} + b^(f))
  i_t = σ(U^(i) x_t + W^(i) h_{t-1} + b^(i))
  o_t = σ(U^(o) x_t + W^(o) h_{t-1} + b^(o))
  c̃_t = tanh(U^(c) x_t + W^(c) h_{t-1} + b^(c))
  c_t = f_t ∘ c_{t-1} + i_t ∘ c̃_t
  h_t = o_t ∘ tanh(c_t)

c_t : cell state, h_t : hidden state

There are many known variations to this set of equations!
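A sketch of one LSTM step implementing exactly these equations (the shapes, random initialization, and parameter-dictionary layout are illustrative; a real implementation would learn the parameters):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    # p holds one (U, W, b) triple per gate / candidate: "f", "i", "o", "c".
    f = sigmoid(p["Uf"] @ x_t + p["Wf"] @ h_prev + p["bf"])            # forget gate
    i = sigmoid(p["Ui"] @ x_t + p["Wi"] @ h_prev + p["bi"])            # input gate
    o = sigmoid(p["Uo"] @ x_t + p["Wo"] @ h_prev + p["bo"])            # output gate
    c_tilde = np.tanh(p["Uc"] @ x_t + p["Wc"] @ h_prev + p["bc"])      # candidate cell content
    c_t = f * c_prev + i * c_tilde       # mix the old cell state with the new candidate
    h_t = o * np.tanh(c_t)               # expose (part of) the cell as the hidden state
    return h_t, c_t

D, E = 4, 3
rng = np.random.default_rng(0)
p = {}
for gate in "fioc":
    p["U" + gate] = rng.normal(size=(D, E))
    p["W" + gate] = rng.normal(size=(D, D))
    p["b" + gate] = np.zeros(D)

h, c = np.zeros(D), np.zeros(D)
for x in rng.normal(size=(5, E)):        # a length-5 input sequence
    h, c = lstm_step(x, h, c, p)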

Page 28:

LSTMs (Long Short-Term Memory Networks)

[Figure: the LSTM cell, with the cell state c_{t-1} → c_t running along the top and the hidden state h_{t-1} → h_t along the bottom.]

Figure by Christopher Olah (colah.github.io)

Page 29:

LSTMs (Long Short-Term Memory Networks)

sigmoid: [0,1]

f_t = σ(U^(f) x_t + W^(f) h_{t-1} + b^(f))
Forget gate: forget the past or not

Figure by Christopher Olah (colah.github.io)

Page 30:

LSTMs (Long Short-Term Memory Networks)

sigmoid: [0,1]
tanh: [-1,1]

f_t = σ(U^(f) x_t + W^(f) h_{t-1} + b^(f))
Forget gate: forget the past or not

i_t = σ(U^(i) x_t + W^(i) h_{t-1} + b^(i))
Input gate: use the input or not

c̃_t = tanh(U^(c) x_t + W^(c) h_{t-1} + b^(c))
New cell content (temp)

Figure by Christopher Olah (colah.github.io)

Page 31:

LSTMs (Long Short-Term Memory Networks)

sigmoid: [0,1]
tanh: [-1,1]

f_t = σ(U^(f) x_t + W^(f) h_{t-1} + b^(f))
Forget gate: forget the past or not

i_t = σ(U^(i) x_t + W^(i) h_{t-1} + b^(i))
Input gate: use the input or not

c̃_t = tanh(U^(c) x_t + W^(c) h_{t-1} + b^(c))
New cell content (temp)

c_t = f_t ∘ c_{t-1} + i_t ∘ c̃_t
New cell content: mix the old cell with the new temp cell

Figure by Christopher Olah (colah.github.io)

Page 32:

LSTMs (Long Short-Term Memory Networks)

f_t = σ(U^(f) x_t + W^(f) h_{t-1} + b^(f))
Forget gate: forget the past or not

i_t = σ(U^(i) x_t + W^(i) h_{t-1} + b^(i))
Input gate: use the input or not

c̃_t = tanh(U^(c) x_t + W^(c) h_{t-1} + b^(c))
New cell content (temp)

c_t = f_t ∘ c_{t-1} + i_t ∘ c̃_t
New cell content: mix the old cell with the new temp cell

o_t = σ(U^(o) x_t + W^(o) h_{t-1} + b^(o))
Output gate: output from the new cell or not

h_t = o_t ∘ tanh(c_t)
Hidden state

Figure by Christopher Olah (colah.github.io)

Page 33:

LSTMs (Long Short-Term Memory Networks)

f_t = σ(U^(f) x_t + W^(f) h_{t-1} + b^(f))     Forget gate: forget the past or not
i_t = σ(U^(i) x_t + W^(i) h_{t-1} + b^(i))     Input gate: use the input or not
o_t = σ(U^(o) x_t + W^(o) h_{t-1} + b^(o))     Output gate: output from the new cell or not
c̃_t = tanh(U^(c) x_t + W^(c) h_{t-1} + b^(c))  New cell content (temp)
c_t = f_t ∘ c_{t-1} + i_t ∘ c̃_t                New cell content: mix the old cell with the new temp cell
h_t = o_t ∘ tanh(c_t)                           Hidden state

[Figure: the LSTM cell, with cell state c_{t-1} → c_t and hidden state h_{t-1} → h_t.]

Page 34:

The vanishing gradient problem for RNNs

• The shading of the nodes in the unfolded network indicates their sensitivity to the inputs at time one (the darker the shade, the greater the sensitivity).
• The sensitivity decays over time as new inputs overwrite the activations of the hidden layer, and the network 'forgets' the first inputs.

Example from Graves 2012

Page 35:

Preservation of gradient information by LSTM

• For simplicity, all gates are either entirely open ('O') or closed ('—').
• The memory cell 'remembers' the first input as long as the forget gate is open and the input gate is closed.
• The sensitivity of the output layer can be switched on and off by the output gate without affecting the cell.

[Figure labels: Forget gate, Input gate, Output gate]

Example from Graves 2012

Page 36:

Recurrent Neural Networks (RNNs)

• Generic RNNs: h_t = f(x_t, h_{t-1})
• Vanilla RNNs: h_t = tanh(U x_t + W h_{t-1} + b)
• GRUs (Gated Recurrent Units):

  z_t = σ(U^(z) x_t + W^(z) h_{t-1} + b^(z))
  r_t = σ(U^(r) x_t + W^(r) h_{t-1} + b^(r))
  h̃_t = tanh(U^(h) x_t + W^(h) (r_t ∘ h_{t-1}) + b^(h))
  h_t = (1 - z_t) ∘ h_{t-1} + z_t ∘ h̃_t

Fewer parameters than LSTMs. Easier to train for comparable performance!
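A sketch of one GRU step, mirroring the LSTM sketch above (same illustrative parameter-dictionary convention; shapes are placeholders):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    z = sigmoid(p["Uz"] @ x_t + p["Wz"] @ h_prev + p["bz"])               # update gate
    r = sigmoid(p["Ur"] @ x_t + p["Wr"] @ h_prev + p["br"])               # reset gate
    h_tilde = np.tanh(p["Uh"] @ x_t + p["Wh"] @ (r * h_prev) + p["bh"])   # candidate state
    # Interpolate between the old state and the candidate; note there is no separate cell state.
    return (1 - z) * h_prev + z * h_tilde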

Page 37:

Recursive Neural Networks

• Sometimes, inference over a tree structure makes more sense than a sequential structure
• An example of compositionality in ideological bias detection (red → conservative, blue → liberal, gray → neutral), in which modifier phrases and punctuation cause polarity switches at higher levels of the parse tree

Example from Iyyer et al., 2014

Page 38:

Recursive Neural Networks

• NNs connected as a tree
• The tree structure is fixed a priori
• Parameters are shared, similarly to RNNs

Example from Iyyer et al., 2014

Page 39:

LEARNING: BACKPROPAGATION

Page 40:

Error Backpropagation

• Model parameters: θ = {w^(1)_ij, w^(2)_jk, w^(3)_kl}
  For brevity: θ = {w_ij, w_jk, w_kl}

[Figure: a feedforward network with inputs x_0, x_1, x_2, …, x_P and output f(x, θ); the three weight layers are w_ij, w_jk, w_kl.]

Next 10 slides on backpropagation are adapted from Andrew Rosenberg

Page 41:

Error Backpropagation

• Model parameters: θ = {w_ij, w_jk, w_kl}
• Let a and z be the input and output of each node

[Figure: the same network with each node's pre-activation a and activation z labeled: (a_j, z_j), (a_k, z_k), (a_l, z_l).]

Page 42:

Error Backpropagation

a_j = Σ_i w_ij z_i
z_j = g(a_j)

Page 43:

• Let a and z be the input and output of each node

a_j = Σ_i w_ij z_i      a_k = Σ_j w_jk z_j      a_l = Σ_k w_kl z_k
z_j = g(a_j)            z_k = g(a_k)            z_l = g(a_l)

Page 44:

• Let a and z be the input and output of each node

a_j = Σ_i w_ij z_i      a_k = Σ_j w_jk z_j      a_l = Σ_k w_kl z_k
z_j = g(a_j)            z_k = g(a_k)            z_l = g(a_l)

Page 45:

Training: minimize loss

Empirical Risk Function:

$$R(\theta) = \frac{1}{N}\sum_{n=0}^{N} L\big(y_n - f(x_n)\big)
= \frac{1}{N}\sum_{n=0}^{N} \frac{1}{2}\big(y_n - f(x_n)\big)^2
= \frac{1}{N}\sum_{n=0}^{N} \frac{1}{2}\Bigg(y_n - g\bigg(\sum_k w_{kl}\, g\Big(\sum_j w_{jk}\, g\big(\sum_i w_{ij}\, x_{n,i}\big)\Big)\bigg)\Bigg)^2$$

Page 46:

Training: minimize loss

Empirical Risk Function:

$$R(\theta) = \frac{1}{N}\sum_{n=0}^{N} L\big(y_n - f(x_n)\big)
= \frac{1}{N}\sum_{n=0}^{N} \frac{1}{2}\big(y_n - f(x_n)\big)^2
= \frac{1}{N}\sum_{n=0}^{N} \frac{1}{2}\Bigg(y_n - g\bigg(\sum_k w_{kl}\, g\Big(\sum_j w_{jk}\, g\big(\sum_i w_{ij}\, x_{n,i}\big)\Big)\bigg)\Bigg)^2$$

Page 47:

Error Backpropagation

Optimize the last layer weights w_kl:

$$L_n = \frac{1}{2}\big(y_n - f(x_n)\big)^2$$

Calculus chain rule:

$$\frac{\partial R}{\partial w_{kl}} = \frac{1}{N}\sum_n \left[\frac{\partial L_n}{\partial a_{l,n}}\right]\left[\frac{\partial a_{l,n}}{\partial w_{kl}}\right]$$

Page 48:

Error Backpropagation

Optimize the last layer weights w_kl:

$$L_n = \frac{1}{2}\big(y_n - f(x_n)\big)^2$$

Calculus chain rule:

$$\frac{\partial R}{\partial w_{kl}} = \frac{1}{N}\sum_n \left[\frac{\partial L_n}{\partial a_{l,n}}\right]\left[\frac{\partial a_{l,n}}{\partial w_{kl}}\right]
= \frac{1}{N}\sum_n \left[\frac{\partial\, \frac{1}{2}\big(y_n - g(a_{l,n})\big)^2}{\partial a_{l,n}}\right]\left[\frac{\partial a_{l,n}}{\partial w_{kl}}\right]$$

Page 49:

Error Backpropagation

Optimize the last layer weights w_kl:

$$L_n = \frac{1}{2}\big(y_n - f(x_n)\big)^2$$

Calculus chain rule:

$$\frac{\partial R}{\partial w_{kl}} = \frac{1}{N}\sum_n \left[\frac{\partial L_n}{\partial a_{l,n}}\right]\left[\frac{\partial a_{l,n}}{\partial w_{kl}}\right]
= \frac{1}{N}\sum_n \left[\frac{\partial\, \frac{1}{2}\big(y_n - g(a_{l,n})\big)^2}{\partial a_{l,n}}\right]\left[\frac{\partial\, z_{k,n} w_{kl}}{\partial w_{kl}}\right]$$

Page 50:

Error Backpropagation

Optimize the last layer weights w_kl:

$$L_n = \frac{1}{2}\big(y_n - f(x_n)\big)^2$$

Calculus chain rule:

$$\frac{\partial R}{\partial w_{kl}} = \frac{1}{N}\sum_n \left[\frac{\partial L_n}{\partial a_{l,n}}\right]\left[\frac{\partial a_{l,n}}{\partial w_{kl}}\right]
= \frac{1}{N}\sum_n \left[\frac{\partial\, \frac{1}{2}\big(y_n - g(a_{l,n})\big)^2}{\partial a_{l,n}}\right]\left[\frac{\partial\, z_{k,n} w_{kl}}{\partial w_{kl}}\right]
= \frac{1}{N}\sum_n \big[-(y_n - z_{l,n})\, g'(a_{l,n})\big]\, z_{k,n}$$

Page 51:

Error Backpropagation

Optimize the last layer weights w_kl:

$$L_n = \frac{1}{2}\big(y_n - f(x_n)\big)^2$$

Calculus chain rule:

$$\frac{\partial R}{\partial w_{kl}} = \frac{1}{N}\sum_n \left[\frac{\partial L_n}{\partial a_{l,n}}\right]\left[\frac{\partial a_{l,n}}{\partial w_{kl}}\right]
= \frac{1}{N}\sum_n \left[\frac{\partial\, \frac{1}{2}\big(y_n - g(a_{l,n})\big)^2}{\partial a_{l,n}}\right]\left[\frac{\partial\, z_{k,n} w_{kl}}{\partial w_{kl}}\right]
= \frac{1}{N}\sum_n \big[-(y_n - z_{l,n})\, g'(a_{l,n})\big]\, z_{k,n}
= \frac{1}{N}\sum_n \delta_{l,n}\, z_{k,n}$$

Page 52:

Error Backpropagation

Repeat for all previous layers:

$$\frac{\partial R}{\partial w_{kl}} = \frac{1}{N}\sum_n \left[\frac{\partial L_n}{\partial a_{l,n}}\right]\left[\frac{\partial a_{l,n}}{\partial w_{kl}}\right]
= \frac{1}{N}\sum_n \big[-(y_n - z_{l,n})\, g'(a_{l,n})\big]\, z_{k,n}
= \frac{1}{N}\sum_n \delta_{l,n}\, z_{k,n}$$

$$\frac{\partial R}{\partial w_{jk}} = \frac{1}{N}\sum_n \left[\frac{\partial L_n}{\partial a_{k,n}}\right]\left[\frac{\partial a_{k,n}}{\partial w_{jk}}\right]
= \frac{1}{N}\sum_n \Big[\sum_l \delta_{l,n}\, w_{kl}\, g'(a_{k,n})\Big]\, z_{j,n}
= \frac{1}{N}\sum_n \delta_{k,n}\, z_{j,n}$$

$$\frac{\partial R}{\partial w_{ij}} = \frac{1}{N}\sum_n \left[\frac{\partial L_n}{\partial a_{j,n}}\right]\left[\frac{\partial a_{j,n}}{\partial w_{ij}}\right]
= \frac{1}{N}\sum_n \Big[\sum_k \delta_{k,n}\, w_{jk}\, g'(a_{j,n})\Big]\, z_{i,n}
= \frac{1}{N}\sum_n \delta_{j,n}\, z_{i,n}$$

Page 53:

Backprop Recursion

a_j = Σ_i w_ij z_i,   z_j = g(a_j)

$$\frac{\partial R}{\partial w_{jk}} = \frac{1}{N}\sum_n \left[\frac{\partial L_n}{\partial a_{k,n}}\right]\left[\frac{\partial a_{k,n}}{\partial w_{jk}}\right]
= \frac{1}{N}\sum_n \Big[\sum_l \delta_{l,n}\, w_{kl}\, g'(a_{k,n})\Big]\, z_{j,n}
= \frac{1}{N}\sum_n \delta_{k,n}\, z_{j,n}$$

$$\frac{\partial R}{\partial w_{ij}} = \frac{1}{N}\sum_n \left[\frac{\partial L_n}{\partial a_{j,n}}\right]\left[\frac{\partial a_{j,n}}{\partial w_{ij}}\right]
= \frac{1}{N}\sum_n \Big[\sum_k \delta_{k,n}\, w_{jk}\, g'(a_{j,n})\Big]\, z_{i,n}
= \frac{1}{N}\sum_n \delta_{j,n}\, z_{i,n}$$

[Figure: the activations z_i, z_j, z_k flow forward through the weights w_ij, w_jk, while the deltas δ_k, δ_j, δ_i flow backward.]

Page 54:

Learning: Gradient Descent

$$w^{t+1}_{ij} = w^{t}_{ij} - \eta\, \frac{\partial R}{\partial w_{ij}}, \qquad
w^{t+1}_{jk} = w^{t}_{jk} - \eta\, \frac{\partial R}{\partial w_{jk}}, \qquad
w^{t+1}_{kl} = w^{t}_{kl} - \eta\, \frac{\partial R}{\partial w_{kl}}$$
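A compact sketch of the whole recipe for a one-hidden-layer network with sigmoid units and squared error (the toy data, shapes, learning rate, and iteration count are illustrative placeholders; the deltas follow the recursion above):

import numpy as np

def g(a):            # sigmoid activation
    return 1.0 / (1.0 + np.exp(-a))

def g_prime(a):      # g'(a) = g(a)(1 - g(a))
    s = g(a)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                 # N = 20 examples, 3 inputs
y = (X.sum(axis=1) > 0).astype(float)        # a toy target
W1 = rng.normal(size=(3, 4)) * 0.1           # w_ij: input -> hidden
W2 = rng.normal(size=(4, 1)) * 0.1           # w_jk: hidden -> output
eta = 0.5                                    # learning rate

for _ in range(500):
    # Forward sweep: store the intermediate a's and z's.
    a_j = X @ W1;     z_j = g(a_j)
    a_k = z_j @ W2;   z_k = g(a_k)
    # Backward sweep: deltas for the output and hidden layers.
    delta_k = -(y[:, None] - z_k) * g_prime(a_k)     # output-layer delta
    delta_j = (delta_k @ W2.T) * g_prime(a_j)        # hidden-layer delta (the recursion)
    # Gradients of the empirical risk, averaged over the N examples.
    dW2 = z_j.T @ delta_k / len(X)
    dW1 = X.T @ delta_j / len(X)
    # Gradient descent step.
    W2 -= eta * dW2
    W1 -= eta * dW1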

Page 55:

Backpropagation

• Starts with a forward sweep to compute all the intermediate function values
• Through backprop, computes the partial derivatives recursively
• A form of dynamic programming
  – Instead of considering the exponentially many paths between a weight w_ij and the final loss (risk), store and reuse intermediate results
• A type of automatic differentiation (there are other variants, e.g., recursive differentiation only through forward propagation)

[Figure: the forward pass carries z_i forward; the gradient pass carries δ_j and ∂R/∂w_ij backward.]

Page 56:

Backpropagation toolkits (primary interface language):

• TensorFlow (https://www.tensorflow.org/) – Python
• Torch (http://torch.ch/) – Lua
• Theano (http://deeplearning.net/software/theano/) – Python
• CNTK (https://github.com/Microsoft/CNTK) – C++
• cnn (https://github.com/clab/cnn) – C++
• Caffe (http://caffe.berkeleyvision.org/) – C++

Page 57:

Cross Entropy Loss (aka log loss, logistic loss)

• Cross entropy (p: true probability, q: predicted probability):

$$H(p, q) = -\sum_y p(y) \log q(y) = E_p[-\log q] = H(p) + D_{KL}(p\,\|\,q)$$

• Related quantities
  – Entropy:
    $$H(p) = -\sum_y p(y) \log p(y)$$
  – KL divergence (the distance between two distributions p and q):
    $$D_{KL}(p\,\|\,q) = \sum_y p(y) \log \frac{p(y)}{q(y)}$$

• Use cross entropy for models that should have a more probabilistic flavor (e.g., language models)
• Use mean squared error loss for models that focus on correct/incorrect predictions:

$$MSE = \frac{1}{2}\big(y - f(x)\big)^2$$
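A small numeric sketch of these quantities for two discrete distributions (the values are chosen arbitrarily):

import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution
q = np.array([0.5, 0.3, 0.2])   # predicted distribution

entropy = -np.sum(p * np.log(p))            # H(p)
cross_entropy = -np.sum(p * np.log(q))      # H(p, q)
kl = np.sum(p * np.log(p / q))              # D_KL(p || q)

# The identity above: H(p, q) = H(p) + D_KL(p || q)
assert np.isclose(cross_entropy, entropy + kl)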

Page 58:

RNN Learning: Backprop Through Time (BPTT)

• Similar to backprop with non-recurrent NNs
• But unlike feedforward (non-recurrent) NNs, each unit in the computation graph repeats the exact same parameters…
• Backprop gradients of the parameters of each unit as if they were different parameters
• When updating the parameters using the gradients, use the average gradients throughout the entire chain of units

Page 59:

Convergence of backprop

• Without non-linearity or hidden layers, learning is convex optimization
  – Gradient descent reaches global minima
• Multilayer neural nets (with nonlinearity) are not convex
  – Gradient descent gets stuck in local minima
  – Selecting the number of hidden units and layers = fuzzy process
  – NNs have made a HUGE comeback in the last few years
• Neural nets are back with a new name
  – Deep belief networks
  – Huge error reduction when trained with lots of data on GPUs

Page 60:

Overfitting in NNs

• Are NNs likely to overfit?
  – Yes, they can represent arbitrary functions!!!
• Avoiding overfitting?
  – More training data
  – Fewer hidden nodes / better topology
  – Random perturbation to the graph topology ("Dropout")
  – Regularization
  – Early stopping