Page 1
CS343: Artificial Intelligence
Deep Learning
Prof. Scott Niekum, The University of Texas at Austin [These slides based on those of Dan Klein, Pieter Abbeel, Anca Dragan for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]
Page 2
Please fill out course evals online!
Page 3
Review: Linear Classifiers
Page 4
Feature Vectors
Spam example (email text):
Hello,
Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just
Feature vector: # free : 2, YOUR_NAME : 0, MISSPELLED : 2, FROM_FRIEND : 0, ...
Label: SPAM (+)
Digit example (image of a handwritten digit):
Feature vector: PIXEL-7,12 : 1, PIXEL-7,13 : 0, ..., NUM_LOOPS : 1, ...
Label: “2”
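In code, a feature vector is just a map from feature names to values. A tiny Python sketch of the two examples above (the feature names and values come from the slide; the dict representation itself is only illustrative):

```python
# Spam example: features extracted from the email text above, with its label.
email_features = {"free": 2, "YOUR_NAME": 0, "MISSPELLED": 2, "FROM_FRIEND": 0}
email_label = "SPAM"

# Digit example: pixel and shape features for an image of a handwritten digit.
digit_features = {"PIXEL-7,12": 1, "PIXEL-7,13": 0, "NUM_LOOPS": 1}
digit_label = "2"
```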
Page 5
Some (Simplified) Biology
▪ Very loose inspiration: human neurons
Page 6
Linear Classifiers
▪ Inputs are feature values
▪ Each feature has a weight
▪ Sum is the activation
▪ If the activation is:
▪ Positive, output +1
▪ Negative, output -1
[Diagram: inputs f1, f2, f3 weighted by w1, w2, w3 feed a sum Σ, followed by a >0? threshold]
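A minimal Python sketch of this decision rule (the feature and weight values below are placeholders, not from the slides):

```python
def perceptron_classify(f, w):
    """Return +1 if the activation (weighted sum of features) is positive, else -1."""
    activation = sum(wi * fi for wi, fi in zip(w, f))
    return +1 if activation > 0 else -1

# Illustrative feature values f1, f2, f3 and weights w1, w2, w3.
print(perceptron_classify(f=[1.0, 0.0, 2.0], w=[0.5, -1.0, 0.25]))  # prints 1
```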
Page 8
Non-Linear Separators
▪ Data that is linearly separable works out great for linear decision rules:
▪ But what are we going to do if the dataset is just too hard?
▪ How about… mapping data to a higher-dimensional space:
[Figure: 1-D data along the x axis that is not linearly separable becomes separable after mapping each point x to (x, x²)]
This and next slide adapted from Ray Mooney, UT
Page 9
Non-Linear Separators
▪ General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:
Φ: x → φ(x)
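For example, 1-D data that is not separable by a single threshold can become separable after adding a squared feature, as in the figure on the previous slide. A tiny sketch of one such map (this particular φ is just an illustrative choice):

```python
def phi(x):
    """Map a 1-D point x to the higher-dimensional feature vector (x, x**2)."""
    return (x, x * x)

# Points near zero vs. far from zero are not separable by a threshold on x alone,
# but a line in (x, x^2) space (e.g. x^2 = 1) separates them.
print([phi(x) for x in (-2.0, -0.5, 0.5, 2.0)])
```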
Page 12
Manual Feature Design
Page 13
Features and Generalization
[Dalal and Triggs, 2005]
Page 14
Features and Generalization
[Figure: an example image and its HOG (Histogram of Oriented Gradients) features]
Page 15
Manual Feature Design → Deep Learning
▪ Manual feature design requires:
▪ Domain-specific expertise
▪ Domain-specific effort
▪ What if we could learn the features, too?
▪ Deep Learning
Page 16
Perceptron
[Diagram: inputs f1, f2, f3 weighted by w1, w2, w3 feed a sum Σ, followed by a >0? threshold]
Page 17
Two-Layer Perceptron Network
[Diagram: inputs f1, f2, f3 feed three hidden threshold units (weights w11…w33, each a Σ followed by a >0? threshold); the hidden outputs feed a final unit with weights w1, w2, w3 and a >0? threshold]
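A minimal forward pass through such a two-layer network of threshold units (the weight values are placeholders):

```python
def step(z):
    """Threshold unit: +1 if the activation is positive, else -1."""
    return 1 if z > 0 else -1

def two_layer_perceptron(f, W_hidden, w_out):
    """f: input features; W_hidden: one weight vector per hidden unit; w_out: output weights."""
    hidden = [step(sum(w * x for w, x in zip(row, f))) for row in W_hidden]
    return step(sum(w * h for w, h in zip(w_out, hidden)))

# Illustrative weights: three inputs, three hidden units, one output unit.
W_hidden = [[0.5, -1.0, 0.3], [1.2, 0.1, -0.7], [-0.4, 0.9, 0.2]]
w_out = [1.0, -0.5, 0.8]
print(two_layer_perceptron([1.0, 0.0, 2.0], W_hidden, w_out))
```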
Page 18
N-Layer Perceptron Network
[Diagram: inputs f1, f2, f3 pass through multiple layers of Σ / >0? threshold units before a final >0? output unit]
Page 19
Performance
graph credit Matt Zeiler, Clarifai
Page 21
Performance
graph credit Matt Zeiler, Clarifai
AlexNet
Page 24
Speech Recognition
graph credit Matt Zeiler, Clarifai
Page 25
N-Layer Perceptron Network
[Diagram: inputs f1, f2, f3 pass through multiple layers of Σ / >0? threshold units before a final >0? output unit]
Page 26
Local Search
▪ Simple, general idea:
▪ Start wherever
▪ Repeat: move to the best neighboring state
▪ If no neighbors better than current, quit
▪ Neighbors = small perturbations of w
▪ Properties
▪ Plateaus and local optima
How to escape plateaus and find a good local optimum? How to deal with very large parameter vectors? E.g.,
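A Python sketch of this local-search idea applied to a weight vector w (the objective, the neighbor perturbations, and the step size are all illustrative assumptions):

```python
import random

def local_search(objective, w, step=0.1, n_neighbors=20, max_iters=1000):
    """Repeatedly move to the best small perturbation of w; quit when no neighbor is better."""
    for _ in range(max_iters):
        neighbors = [[wi + random.uniform(-step, step) for wi in w] for _ in range(n_neighbors)]
        best = max(neighbors, key=objective)
        if objective(best) <= objective(w):
            return w  # plateau or local optimum: no sampled neighbor improves
        w = best
    return w

# Toy objective with its maximum at w = (1, -2).
obj = lambda w: -((w[0] - 1) ** 2 + (w[1] + 2) ** 2)
print(local_search(obj, [0.0, 0.0]))
```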
Page 27
Perceptron
[Diagram: inputs f1, f2, f3 weighted by w1, w2, w3 feed a sum Σ, followed by a >0? threshold]
▪ Objective: Classification Accuracy
▪ Issue: many plateaus! How to measure incremental progress toward a correct label?
Page 28
Soft-Max
▪ Score for y = +1: $w \cdot f(x)$    Score for y = -1: $-w \cdot f(x)$
▪ Probability of label: $P(y = +1 \mid f(x); w) = \frac{e^{w \cdot f(x)}}{e^{w \cdot f(x)} + e^{-w \cdot f(x)}}$
▪ Objective: $\max_w \prod_i P(y_i \mid f(x_i); w)$
▪ Log: $\max_w \sum_i \log P(y_i \mid f(x_i); w)$
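A small sketch of this two-class soft-max and the resulting log-likelihood objective, assuming the score definitions above (the feature and weight values are placeholders):

```python
import math

def prob_label(y, f, w):
    """P(y | f(x); w) with score +w.f(x) for y = +1 and score -w.f(x) for y = -1."""
    z = sum(wi * fi for wi, fi in zip(w, f))
    score = z if y == +1 else -z
    return math.exp(score) / (math.exp(z) + math.exp(-z))

def log_likelihood(data, w):
    """The 'Log' objective above: sum of log P(y_i | f(x_i); w) over the training set."""
    return sum(math.log(prob_label(y, f, w)) for f, y in data)

data = [([1.0, 0.0], +1), ([0.0, 1.0], -1)]  # illustrative (f(x), y) training pairs
print(log_likelihood(data, w=[2.0, -1.0]))
```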
Page 29
Two-Layer Neural Network
[Diagram: inputs f1, f2, f3 feed three hidden threshold units (weights w11…w33, each a Σ followed by a >0? threshold); the hidden outputs feed a final Σ unit with weights w1, w2, w3 and no threshold]
Page 30
N-Layer Neural Network
[Diagram: inputs f1, f2, f3 pass through multiple layers of Σ / >0? threshold units; the final Σ output unit has no threshold]
Page 31
Our Status
▪ Our objective
▪ Changes smoothly with changes in w
▪ Doesn't suffer from the same plateaus as the perceptron network
▪ Challenge: how to find a good w?
▪ Equivalently: $\min_w\; -\sum_i \log P(y_i \mid f(x_i); w)$
Page 32
1-D Optimization
▪ Could evaluate $g(w_0 + h)$ and $g(w_0 - h)$
▪ Then step in best direction
▪ Or, evaluate derivative: $\frac{\partial g(w_0)}{\partial w} = \lim_{h \to 0} \frac{g(w_0 + h) - g(w_0 - h)}{2h}$
▪ Tells which direction to step into
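A sketch of the derivative option, using a central finite difference (the objective g and the step h are illustrative):

```python
def numerical_derivative(g, w0, h=1e-5):
    """Central-difference estimate of dg/dw at w0; its sign says which way to step."""
    return (g(w0 + h) - g(w0 - h)) / (2 * h)

g = lambda w: (w - 3.0) ** 2          # toy 1-D objective, minimized at w = 3
slope = numerical_derivative(g, 0.0)  # about -6: negative, so increasing w decreases g
print(slope)
```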
Page 33
2-D Optimization
Source: Thomas Jungblut’s Blog
Page 34
Steepest Descent
▪ Idea:
▪ Start somewhere
▪ Repeat: Take a step in the steepest descent direction
Figure source: Mathworks
Page 35
What is the Steepest Descent Direction?
Page 36
What is the Steepest Descent Direction?
▪ Steepest Direction = direction of the gradient, $\nabla g(w) = \left[\frac{\partial g}{\partial w_1}, \ldots, \frac{\partial g}{\partial w_n}\right]$
Page 37
Optimization Procedure 1: Gradient Descent
▪ Init: $w$
▪ For i = 1, 2, …: $w \leftarrow w - \alpha\, \nabla g(w)$
▪ $\alpha$: learning rate, a tweaking parameter that needs to be chosen carefully
▪ How? Try multiple choices
▪ Crude rule of thumb: update changes $w$ about 0.1–1%
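A minimal sketch of this loop (the objective, its gradient, and the learning rate α are illustrative):

```python
def gradient_descent(grad, w, alpha=0.1, iters=100):
    """Repeatedly step opposite the gradient: w <- w - alpha * grad(w)."""
    for _ in range(iters):
        w = [wi - alpha * gi for wi, gi in zip(w, grad(w))]
    return w

# Toy objective g(w) = (w1 - 1)^2 + (w2 + 2)^2 with gradient [2(w1 - 1), 2(w2 + 2)].
grad = lambda w: [2 * (w[0] - 1), 2 * (w[1] + 2)]
print(gradient_descent(grad, [0.0, 0.0]))  # approaches the minimum at (1, -2)
```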
Page 38
Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 6 - 25 Jan 2016
Suppose loss function is steep vertically but shallow horizontally:
Q: What is the trajectory along which we converge towards the minimum with Gradient Descent?
Page 40
Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 6 - 25 Jan 2016
Suppose loss function is steep vertically but shallow horizontally:
Q: What is the trajectory along which we converge towards the minimum with Gradient Descent? A: Very slow progress along the flat direction, jitter along the steep one.
Page 41
Optimization Procedure 2: Momentum
▪ Gradient Descent
▪ Init: $w$
▪ For i = 1, 2, …: $w \leftarrow w - \alpha\, \nabla g(w)$
▪ Momentum
▪ Init: $w$, $v = 0$
▪ For i = 1, 2, …: $v \leftarrow \mu v - \alpha\, \nabla g(w)$, then $w \leftarrow w + v$
- Physical interpretation as ball rolling down the loss function + friction (mu coefficient).
- mu = usually ~0.5, 0.9, or 0.99 (sometimes annealed over time, e.g. from 0.5 -> 0.99)
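A sketch of this momentum update (the mu and alpha values follow the rules of thumb above; the toy objective is the same illustrative one used for plain gradient descent):

```python
def momentum_descent(grad, w, alpha=0.1, mu=0.9, iters=200):
    """Keep a velocity v; the gradient accelerates it and mu acts like friction."""
    v = [0.0] * len(w)
    for _ in range(iters):
        g = grad(w)
        v = [mu * vi - alpha * gi for vi, gi in zip(v, g)]   # v <- mu*v - alpha*grad(w)
        w = [wi + vi for wi, vi in zip(w, v)]                # w <- w + v
    return w

grad = lambda w: [2 * (w[0] - 1), 2 * (w[1] + 2)]  # gradient of (w1-1)^2 + (w2+2)^2
print(momentum_descent(grad, [0.0, 0.0]))          # converges toward (1, -2)
```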
Page 42
Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 6 - 25 Jan 2016
Suppose loss function is steep vertically but shallow horizontally:
Q: What is the trajectory along which we converge towards the minimum with Momentum?
Page 43
How do we actually compute the gradient w.r.t. the weights?
Backpropagation!
Page 44
Backpropagation Learning
15-486/782: Artificial Neural Networks, David S. Touretzky
Fall 2006
Page 45
LMS / Widrow-Hoff Rule
Works fine for a single layer of trainable weights.
What about multi-layer networks?
[Diagram: single linear unit with inputs $x_i$, weights $w_i$, and summed output $y$]
$\Delta w_i = -\eta\,(y - d)\,x_i$
Page 46
With Linear Units, Multiple Layers Don't Add Anything
U : 2×3 matrix
V : 3×4 matrix
$y = U \times (V \times x) = (U \times V)\, x$, where $U \times V$ is a 2×4 matrix
Linear operators are closed under composition. Equivalent to a single layer of weights $W = U \times V$.
But with non-linear units, extra layers add computational power.
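A quick numerical check of this claim with NumPy (the matrix shapes match the slide; the entries are random):

```python
import numpy as np

U = np.random.randn(2, 3)   # 2x3 matrix
V = np.random.randn(3, 4)   # 3x4 matrix
x = np.random.randn(4)      # input vector

y_two_layers = U @ (V @ x)  # two linear layers applied in sequence
W = U @ V                   # composed 2x4 weight matrix
y_one_layer = W @ x         # a single equivalent linear layer

print(np.allclose(y_two_layers, y_one_layer))  # True
```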
Page 47
What Can be Done with Non-Linear (e.g., Threshold) Units?
1 layer of trainable weights
separating hyperplane
Page 48
2 layers of trainable weights
convex polygon region
Page 49
3 layers of trainable weights
composition of polygons: non-convex regions
Page 50
How Do We Train A Multi-Layer Network?
Error = $d - y$ at the output layer
Error = ??? at the hidden layer
Can't use perceptron training algorithm because we don't know the 'correct' outputs for hidden units.
Page 51
How Do We Train A Multi-Layer Network?
Define sum-squared error:
$E = \frac{1}{2}\sum_p \left(d_p - y_p\right)^2$
Use gradient descent error minimization:
$\Delta w_{ij} = -\eta\, \frac{\partial E}{\partial w_{ij}}$
Works if the nonlinear transfer function is differentiable.
Page 52
Deriving the LMS or “Delta” Rule As Gradient Descent Learning
$y = \sum_i w_i x_i$
$E = \frac{1}{2}\sum_p \left(d_p - y_p\right)^2, \qquad \frac{dE}{dy} = y - d$
$\frac{\partial E}{\partial w_i} = \frac{dE}{dy} \cdot \frac{\partial y}{\partial w_i} = (y - d)\, x_i$
$\Delta w_i = -\eta\, \frac{\partial E}{\partial w_i} = -\eta\,(y - d)\, x_i$
[Diagram: linear unit with inputs $x_i$, weights $w_i$, output $y$]
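A one-step sketch of this delta rule for a single linear unit (the learning rate and the training pair are illustrative):

```python
def delta_rule_step(w, x, d, eta=0.1):
    """One gradient step on E = 1/2 (d - y)^2 for the linear unit y = sum_i w_i x_i."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    return [wi - eta * (y - d) * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0, 0.0]
for _ in range(50):                                    # repeated presentations of one pattern
    w = delta_rule_step(w, x=[1.0, 2.0, -1.0], d=1.0)
print(w)  # the output y = w.x now approximates the target d = 1.0
```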
How do we extend this to two layers?
Page 53
Switch to Smooth Nonlinear Units
$\mathrm{net}_j = \sum_i w_{ij}\, y_i$
$y_j = g(\mathrm{net}_j)$
Common choices for g:
$g(x) = \frac{1}{1 + e^{-x}}, \qquad g'(x) = g(x)\,(1 - g(x))$
$g(x) = \tanh(x), \qquad g'(x) = 1/\cosh^2(x)$
g must be differentiable
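Both transfer-function choices and their derivatives, as given above, in a short sketch (the finite-difference check at the end is just a sanity test):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    g = sigmoid(x)
    return g * (1.0 - g)                 # g'(x) = g(x) * (1 - g(x))

def tanh_prime(x):
    return 1.0 / math.cosh(x) ** 2       # g'(x) = 1 / cosh^2(x)

# Sanity check against a central finite difference at x = 0.3.
h = 1e-6
print(sigmoid_prime(0.3), (sigmoid(0.3 + h) - sigmoid(0.3 - h)) / (2 * h))
print(tanh_prime(0.3), (math.tanh(0.3 + h) - math.tanh(0.3 - h)) / (2 * h))
```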
Page 54
Gradient Descent with Nonlinear Units
$y = g(\mathrm{net}) = \tanh\!\left(\sum_i w_i x_i\right)$
$\frac{dE}{dy} = y - d, \qquad \frac{dy}{d\,\mathrm{net}} = 1/\cosh^2(\mathrm{net}), \qquad \frac{\partial\, \mathrm{net}}{\partial w_i} = x_i$
$\frac{\partial E}{\partial w_i} = \frac{dE}{dy} \cdot \frac{dy}{d\,\mathrm{net}} \cdot \frac{\partial\, \mathrm{net}}{\partial w_i} = \frac{y - d}{\cosh^2\!\left(\sum_i w_i x_i\right)} \cdot x_i$
[Diagram: unit computing $\tanh(\sum_i w_i x_i)$ from inputs $x_i$ with weights $w_i$, output $y$]
Page 55
Now We Can Use The Chain Rule
[Diagram: three layers of units with activities $y_i \to y_j \to y_k$, connected by weights $w_{ij}$ (input to hidden) and $w_{jk}$ (hidden to output)]
$\frac{\partial E}{\partial y_k} = y_k - d_k$
$\delta_k = \frac{\partial E}{\partial\, \mathrm{net}_k} = (y_k - d_k)\cdot g'(\mathrm{net}_k)$
$\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial\, \mathrm{net}_k} \cdot \frac{\partial\, \mathrm{net}_k}{\partial w_{jk}} = \frac{\partial E}{\partial\, \mathrm{net}_k} \cdot y_j$
$\frac{\partial E}{\partial y_j} = \sum_k \frac{\partial E}{\partial\, \mathrm{net}_k} \cdot \frac{\partial\, \mathrm{net}_k}{\partial y_j}$
$\delta_j = \frac{\partial E}{\partial\, \mathrm{net}_j} = \frac{\partial E}{\partial y_j} \cdot g'(\mathrm{net}_j)$
$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial\, \mathrm{net}_j} \cdot y_i$
Page 56
Weight Updates
$\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial\, \mathrm{net}_k} \cdot \frac{\partial\, \mathrm{net}_k}{\partial w_{jk}} = \delta_k \cdot y_j$
$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial\, \mathrm{net}_j} \cdot \frac{\partial\, \mathrm{net}_j}{\partial w_{ij}} = \delta_j \cdot y_i$
$\Delta w_{jk} = -\eta \cdot \frac{\partial E}{\partial w_{jk}}$
$\Delta w_{ij} = -\eta \cdot \frac{\partial E}{\partial w_{ij}}$
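Putting the chain-rule deltas and these weight updates together, a minimal sketch of one backprop step for a two-layer network of sigmoid units (the network size, learning rate, and training pair are illustrative assumptions):

```python
import numpy as np

def g(x):
    """Sigmoid transfer function, so g'(net) = y * (1 - y)."""
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(W_ij, W_jk, x, d, eta=0.5):
    """One gradient-descent step on E = 1/2 * sum_k (d_k - y_k)^2."""
    # Forward pass: y_i (= x) -> y_j -> y_k
    y_j = g(W_ij @ x)
    y_k = g(W_jk @ y_j)
    # Backward pass: deltas dE/dnet for the output and hidden layers
    delta_k = (y_k - d) * y_k * (1 - y_k)
    delta_j = (W_jk.T @ delta_k) * y_j * (1 - y_j)
    # Weight updates: Delta w = -eta * dE/dw, with dE/dw_jk = delta_k * y_j, etc.
    W_jk = W_jk - eta * np.outer(delta_k, y_j)
    W_ij = W_ij - eta * np.outer(delta_j, x)
    return W_ij, W_jk

rng = np.random.default_rng(0)
W_ij, W_jk = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))   # 2 inputs, 3 hidden, 1 output
x, d = np.array([0.5, -1.0]), np.array([1.0])
for _ in range(200):
    W_ij, W_jk = backprop_step(W_ij, W_jk, x, d)
print(g(W_jk @ g(W_ij @ x)))   # the output approaches the target d
```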
Page 57
Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 6 - 25 Jan 2016
Deep learning is everywhere
[Krizhevsky 2012]
Classification Retrieval
Page 58
Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 6 - 25 Jan 2016
[Faster R-CNN: Ren, He, Girshick, Sun 2015]
Detection Segmentation
[Farabet et al., 2012]
Deep learning is everywhere
Page 59
Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 6 - 25 Jan 2016
NVIDIA Tegra X1
self-driving cars
Deep learning is everywhere
Page 60
Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 6 - 25 Jan 2016
[Toshev, Szegedy 2014]
[Mnih 2013]
Deep learning is everywhere
Page 61
Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 6 - 25 Jan 2016
[Ciresan et al. 2013] [Sermanet et al. 2011] [Ciresan et al.]
Deep learning is everywhere
Page 62
[Vinyals et al., 2015]
Image Captioning