Page 1
CS343: Artificial Intelligence
Deep Learning
Prof. Scott Niekum, The University of Texas at Austin [These slides based on those of Dan Klein, Pieter Abbeel, Anca Dragan for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]
Page 2
Please fill out course evals online!
Page 3
Review: Linear Classifiers
Page 4
Feature Vectors
Spam example (email text):
Hello,
Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just
Feature vector: # free : 2, YOUR_NAME : 0, MISSPELLED : 2, FROM_FRIEND : 0, ...
Label: SPAM (+)
Digit example (image of a handwritten digit):
Feature vector: PIXEL-7,12 : 1, PIXEL-7,13 : 0, ..., NUM_LOOPS : 1, ...
Label: “2”
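In code, a feature vector is just a map from feature names to values. A tiny Python sketch of the two examples above (the feature names and values come from the slide; the dict representation itself is only illustrative):

```python
# Spam example: features extracted from the email text above, with its label.
email_features = {"free": 2, "YOUR_NAME": 0, "MISSPELLED": 2, "FROM_FRIEND": 0}
email_label = "SPAM"

# Digit example: pixel and shape features for an image of a handwritten digit.
digit_features = {"PIXEL-7,12": 1, "PIXEL-7,13": 0, "NUM_LOOPS": 1}
digit_label = "2"
```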
Page 5
Some (Simplified) Biology
▪ Very loose inspiration: human neurons
Page 6
Linear Classifiers
▪ Inputs are feature values
▪ Each feature has a weight
▪ Sum is the activation
▪ If the activation is:
▪ Positive, output +1
▪ Negative, output -1
[Diagram: inputs f1, f2, f3 weighted by w1, w2, w3 feed a sum Σ, followed by a >0? threshold]
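A minimal Python sketch of this decision rule (the feature and weight values below are placeholders, not from the slides):

```python
def perceptron_classify(f, w):
    """Return +1 if the activation (weighted sum of features) is positive, else -1."""
    activation = sum(wi * fi for wi, fi in zip(w, f))
    return +1 if activation > 0 else -1

# Illustrative feature values f1, f2, f3 and weights w1, w2, w3.
print(perceptron_classify(f=[1.0, 0.0, 2.0], w=[0.5, -1.0, 0.25]))  # prints 1
```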
Page 8
Non-Linear Separators
▪ Data that is linearly separable works out great for linear decision rules:
▪ But what are we going to do if the dataset is just too hard?
▪ How about… mapping data to a higher-dimensional space:
[Figure: 1-D data along the x axis that is not linearly separable becomes separable after mapping each point x to (x, x²)]
This and next slide adapted from Ray Mooney, UT
Page 9
Non-Linear Separators
▪ General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:
Φ: x → φ(x)
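For example, 1-D data that is not separable by a single threshold can become separable after adding a squared feature, as in the figure on the previous slide. A tiny sketch of one such map (this particular φ is just an illustrative choice):

```python
def phi(x):
    """Map a 1-D point x to the higher-dimensional feature vector (x, x**2)."""
    return (x, x * x)

# Points near zero vs. far from zero are not separable by a threshold on x alone,
# but a line in (x, x^2) space (e.g. x^2 = 1) separates them.
print([phi(x) for x in (-2.0, -0.5, 0.5, 2.0)])
```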
Page 12
Manual Feature Design
Page 13
Features and Generalization
[Dalal and Triggs, 2005]
Page 14
Features and Generalization
[Figure: an example image and its HOG (Histogram of Oriented Gradients) features]
Page 15
Manual Feature Design → Deep Learning
▪ Manual feature design requires:
▪ Domain-specific expertise
▪ Domain-specific effort
▪ What if we could learn the features, too?
▪ Deep Learning
Page 16
Perceptron
[Diagram: inputs f1, f2, f3 weighted by w1, w2, w3 feed a sum Σ, followed by a >0? threshold]
Page 17
Two-Layer Perceptron Network
[Diagram: inputs f1, f2, f3 feed three hidden threshold units (weights w11…w33, each a Σ followed by a >0? threshold); the hidden outputs feed a final unit with weights w1, w2, w3 and a >0? threshold]
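A minimal forward pass through such a two-layer network of threshold units (the weight values are placeholders):

```python
def step(z):
    """Threshold unit: +1 if the activation is positive, else -1."""
    return 1 if z > 0 else -1

def two_layer_perceptron(f, W_hidden, w_out):
    """f: input features; W_hidden: one weight vector per hidden unit; w_out: output weights."""
    hidden = [step(sum(w * x for w, x in zip(row, f))) for row in W_hidden]
    return step(sum(w * h for w, h in zip(w_out, hidden)))

# Illustrative weights: three inputs, three hidden units, one output unit.
W_hidden = [[0.5, -1.0, 0.3], [1.2, 0.1, -0.7], [-0.4, 0.9, 0.2]]
w_out = [1.0, -0.5, 0.8]
print(two_layer_perceptron([1.0, 0.0, 2.0], W_hidden, w_out))
```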
Page 18
N-Layer Perceptron Network
[Diagram: inputs f1, f2, f3 pass through multiple layers of Σ / >0? threshold units before a final >0? output unit]
Page 19
Performance
graph credit Matt Zeiler, Clarifai
Page 21
Performance
graph credit Matt Zeiler, Clarifai
AlexNet
Page 24
Speech Recognition
graph credit Matt Zeiler, Clarifai
Page 25
N-Layer Perceptron Network
[Diagram: inputs f1, f2, f3 pass through multiple layers of Σ / >0? threshold units before a final >0? output unit]
Page 26
Local Search
▪ Simple, general idea:
▪ Start wherever
▪ Repeat: move to the best neighboring state
▪ If no neighbors better than current, quit
▪ Neighbors = small perturbations of w
▪ Properties
▪ Plateaus and local optima
How to escape plateaus and find a good local optimum? How to deal with very large parameter vectors? E.g.,
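A Python sketch of this local-search idea applied to a weight vector w (the objective, the neighbor perturbations, and the step size are all illustrative assumptions):

```python
import random

def local_search(objective, w, step=0.1, n_neighbors=20, max_iters=1000):
    """Repeatedly move to the best small perturbation of w; quit when no neighbor is better."""
    for _ in range(max_iters):
        neighbors = [[wi + random.uniform(-step, step) for wi in w] for _ in range(n_neighbors)]
        best = max(neighbors, key=objective)
        if objective(best) <= objective(w):
            return w  # plateau or local optimum: no sampled neighbor improves
        w = best
    return w

# Toy objective with its maximum at w = (1, -2).
obj = lambda w: -((w[0] - 1) ** 2 + (w[1] + 2) ** 2)
print(local_search(obj, [0.0, 0.0]))
```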
Page 27
Perceptron
[Diagram: inputs f1, f2, f3 weighted by w1, w2, w3 feed a sum Σ, followed by a >0? threshold]
▪ Objective: Classification Accuracy
▪ Issue: many plateaus! How to measure incremental progress toward a correct label?
Page 28
Soft-Max
▪ Score for y = +1: $w \cdot f(x)$    Score for y = -1: $-w \cdot f(x)$
▪ Probability of label: $P(y = +1 \mid f(x); w) = \frac{e^{w \cdot f(x)}}{e^{w \cdot f(x)} + e^{-w \cdot f(x)}}$
▪ Objective: $\max_w \prod_i P(y_i \mid f(x_i); w)$
▪ Log: $\max_w \sum_i \log P(y_i \mid f(x_i); w)$
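A small sketch of this two-class soft-max and the resulting log-likelihood objective, assuming the score definitions above (the feature and weight values are placeholders):

```python
import math

def prob_label(y, f, w):
    """P(y | f(x); w) with score +w.f(x) for y = +1 and score -w.f(x) for y = -1."""
    z = sum(wi * fi for wi, fi in zip(w, f))
    score = z if y == +1 else -z
    return math.exp(score) / (math.exp(z) + math.exp(-z))

def log_likelihood(data, w):
    """The 'Log' objective above: sum of log P(y_i | f(x_i); w) over the training set."""
    return sum(math.log(prob_label(y, f, w)) for f, y in data)

data = [([1.0, 0.0], +1), ([0.0, 1.0], -1)]  # illustrative (f(x), y) training pairs
print(log_likelihood(data, w=[2.0, -1.0]))
```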
Page 29
Two-Layer Neural Network
[Diagram: inputs f1, f2, f3 feed three hidden threshold units (weights w11…w33, each a Σ followed by a >0? threshold); the hidden outputs feed a final Σ unit with weights w1, w2, w3 and no threshold]
Page 30
N-Layer Neural Network
[Diagram: inputs f1, f2, f3 pass through multiple layers of Σ / >0? threshold units; the final Σ output unit has no threshold]
Page 31
Our Status
▪ Our objective
▪ Changes smoothly with changes in w
▪ Doesn't suffer from the same plateaus as the perceptron network
▪ Challenge: how to find a good w?
▪ Equivalently: $\min_w\; -\sum_i \log P(y_i \mid f(x_i); w)$
Page 32
1-D Optimization
▪ Could evaluate $g(w_0 + h)$ and $g(w_0 - h)$
▪ Then step in best direction
▪ Or, evaluate derivative: $\frac{\partial g(w_0)}{\partial w} = \lim_{h \to 0} \frac{g(w_0 + h) - g(w_0 - h)}{2h}$
▪ Tells which direction to step into
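A sketch of the derivative option, using a central finite difference (the objective g and the step h are illustrative):

```python
def numerical_derivative(g, w0, h=1e-5):
    """Central-difference estimate of dg/dw at w0; its sign says which way to step."""
    return (g(w0 + h) - g(w0 - h)) / (2 * h)

g = lambda w: (w - 3.0) ** 2          # toy 1-D objective, minimized at w = 3
slope = numerical_derivative(g, 0.0)  # about -6: negative, so increasing w decreases g
print(slope)
```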
Page 33
2-D Optimization
Source: Thomas Jungblut’s Blog
Page 34
Steepest Descent
▪ Idea:
▪ Start somewhere
▪ Repeat: Take a step in the steepest descent direction
Figure source: Mathworks
Page 35
What is the Steepest Descent Direction?
Page 36
What is the Steepest Descent Direction?
▪ Steepest Direction = direction of the gradient, $\nabla g(w) = \left[\frac{\partial g}{\partial w_1}, \ldots, \frac{\partial g}{\partial w_n}\right]$
Page 37
Optimization Procedure 1: Gradient Descent
▪ Init: $w$
▪ For i = 1, 2, …: $w \leftarrow w - \alpha\, \nabla g(w)$
▪ $\alpha$: learning rate, a tweaking parameter that needs to be chosen carefully
▪ How? Try multiple choices
▪ Crude rule of thumb: update changes $w$ about 0.1–1%
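A minimal sketch of this loop (the objective, its gradient, and the learning rate α are illustrative):

```python
def gradient_descent(grad, w, alpha=0.1, iters=100):
    """Repeatedly step opposite the gradient: w <- w - alpha * grad(w)."""
    for _ in range(iters):
        w = [wi - alpha * gi for wi, gi in zip(w, grad(w))]
    return w

# Toy objective g(w) = (w1 - 1)^2 + (w2 + 2)^2 with gradient [2(w1 - 1), 2(w2 + 2)].
grad = lambda w: [2 * (w[0] - 1), 2 * (w[1] + 2)]
print(gradient_descent(grad, [0.0, 0.0]))  # approaches the minimum at (1, -2)
```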
Page 38
Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 6 - 25 Jan 2016
Suppose loss function is steep vertically but shallow horizontally:
Q: What is the trajectory along which we converge towards the minimum with Gradient Descent?
Page 40
Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 6 - 25 Jan 2016
Suppose loss function is steep vertically but shallow horizontally:
Q: What is the trajectory along which we converge towards the minimum with Gradient Descent? A: Very slow progress along the flat direction, jitter along the steep one.
Page 41
Optimization Procedure 2: Momentum
▪ Gradient Descent
▪ Init: $w$
▪ For i = 1, 2, …: $w \leftarrow w - \alpha\, \nabla g(w)$
▪ Momentum
▪ Init: $w$, $v = 0$
▪ For i = 1, 2, …: $v \leftarrow \mu v - \alpha\, \nabla g(w)$, then $w \leftarrow w + v$
- Physical interpretation as ball rolling down the loss function + friction (mu coefficient).
- mu = usually ~0.5, 0.9, or 0.99 (sometimes annealed over time, e.g. from 0.5 -> 0.99)
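A sketch of this momentum update (the mu and alpha values follow the rules of thumb above; the toy objective is the same illustrative one used for plain gradient descent):

```python
def momentum_descent(grad, w, alpha=0.1, mu=0.9, iters=200):
    """Keep a velocity v; the gradient accelerates it and mu acts like friction."""
    v = [0.0] * len(w)
    for _ in range(iters):
        g = grad(w)
        v = [mu * vi - alpha * gi for vi, gi in zip(v, g)]   # v <- mu*v - alpha*grad(w)
        w = [wi + vi for wi, vi in zip(w, v)]                # w <- w + v
    return w

grad = lambda w: [2 * (w[0] - 1), 2 * (w[1] + 2)]  # gradient of (w1-1)^2 + (w2+2)^2
print(momentum_descent(grad, [0.0, 0.0]))          # converges toward (1, -2)
```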
Page 42
Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 6 - 25 Jan 2016
Suppose loss function is steep vertically but shallow horizontally:
Q: What is the trajectory along which we converge towards the minimum with Momentum?
Page 43
How do we actually compute the gradient w.r.t. the weights?
Backpropagation!
Page 44
Backpropagation Learning
15-486/782: Artificial Neural Networks, David S. Touretzky
Fall 2006
Page 45
LMS / Widrow-Hoff Rule
Works fine for a single layer of trainable weights.
What about multi-layer networks?
[Diagram: single linear unit with inputs $x_i$, weights $w_i$, and summed output $y$]
$\Delta w_i = -\eta\,(y - d)\,x_i$
Page 46
With Linear Units, Multiple Layers Don't Add Anything
U : 2×3 matrix
V : 3×4 matrix
$y = U \times (V \times x) = (U \times V)\, x$, where $U \times V$ is a 2×4 matrix
Linear operators are closed under composition. Equivalent to a single layer of weights $W = U \times V$.
But with non-linear units, extra layers add computational power.
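A quick numerical check of this claim with NumPy (the matrix shapes match the slide; the entries are random):

```python
import numpy as np

U = np.random.randn(2, 3)   # 2x3 matrix
V = np.random.randn(3, 4)   # 3x4 matrix
x = np.random.randn(4)      # input vector

y_two_layers = U @ (V @ x)  # two linear layers applied in sequence
W = U @ V                   # composed 2x4 weight matrix
y_one_layer = W @ x         # a single equivalent linear layer

print(np.allclose(y_two_layers, y_one_layer))  # True
```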
Page 47
What Can be Done with Non-Linear (e.g., Threshold) Units?
1 layer of trainable weights
separating hyperplane
Page 48
2 layers of trainable weights
convex polygon region
Page 49
3 layers of trainable weights
composition of polygons: non-convex regions
Page 50
How Do We Train A Multi-Layer Network?
Error = $d - y$ at the output layer
Error = ??? at the hidden layer
Can't use perceptron training algorithm because we don't know the 'correct' outputs for hidden units.
Page 51
How Do We Train A Multi-Layer Network?
Define sum-squared error:
$E = \frac{1}{2}\sum_p \left(d_p - y_p\right)^2$
Use gradient descent error minimization:
$\Delta w_{ij} = -\eta\, \frac{\partial E}{\partial w_{ij}}$
Works if the nonlinear transfer function is differentiable.
Page 52
Deriving the LMS or “Delta” Rule As Gradient Descent Learning
$y = \sum_i w_i x_i$
$E = \frac{1}{2}\sum_p \left(d_p - y_p\right)^2, \qquad \frac{dE}{dy} = y - d$
$\frac{\partial E}{\partial w_i} = \frac{dE}{dy} \cdot \frac{\partial y}{\partial w_i} = (y - d)\, x_i$
$\Delta w_i = -\eta\, \frac{\partial E}{\partial w_i} = -\eta\,(y - d)\, x_i$
[Diagram: linear unit with inputs $x_i$, weights $w_i$, output $y$]
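A one-step sketch of this delta rule for a single linear unit (the learning rate and the training pair are illustrative):

```python
def delta_rule_step(w, x, d, eta=0.1):
    """One gradient step on E = 1/2 (d - y)^2 for the linear unit y = sum_i w_i x_i."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    return [wi - eta * (y - d) * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0, 0.0]
for _ in range(50):                                    # repeated presentations of one pattern
    w = delta_rule_step(w, x=[1.0, 2.0, -1.0], d=1.0)
print(w)  # the output y = w.x now approximates the target d = 1.0
```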
How do we extend this to two layers?
Page 53
Switch to Smooth Nonlinear Units
$\mathrm{net}_j = \sum_i w_{ij}\, y_i$
$y_j = g(\mathrm{net}_j)$
Common choices for g:
$g(x) = \frac{1}{1 + e^{-x}}, \qquad g'(x) = g(x)\,(1 - g(x))$
$g(x) = \tanh(x), \qquad g'(x) = 1/\cosh^2(x)$
g must be differentiable
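Both transfer-function choices and their derivatives, as given above, in a short sketch (the finite-difference check at the end is just a sanity test):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    g = sigmoid(x)
    return g * (1.0 - g)                 # g'(x) = g(x) * (1 - g(x))

def tanh_prime(x):
    return 1.0 / math.cosh(x) ** 2       # g'(x) = 1 / cosh^2(x)

# Sanity check against a central finite difference at x = 0.3.
h = 1e-6
print(sigmoid_prime(0.3), (sigmoid(0.3 + h) - sigmoid(0.3 - h)) / (2 * h))
print(tanh_prime(0.3), (math.tanh(0.3 + h) - math.tanh(0.3 - h)) / (2 * h))
```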
Page 54
Gradient Descent with Nonlinear Units
$y = g(\mathrm{net}) = \tanh\!\left(\sum_i w_i x_i\right)$
$\frac{dE}{dy} = y - d, \qquad \frac{dy}{d\,\mathrm{net}} = 1/\cosh^2(\mathrm{net}), \qquad \frac{\partial\, \mathrm{net}}{\partial w_i} = x_i$
$\frac{\partial E}{\partial w_i} = \frac{dE}{dy} \cdot \frac{dy}{d\,\mathrm{net}} \cdot \frac{\partial\, \mathrm{net}}{\partial w_i} = \frac{y - d}{\cosh^2\!\left(\sum_i w_i x_i\right)} \cdot x_i$
[Diagram: unit computing $\tanh(\sum_i w_i x_i)$ from inputs $x_i$ with weights $w_i$, output $y$]
Page 55
Now We Can Use The Chain Rule
[Diagram: three layers of units with activities $y_i \to y_j \to y_k$, connected by weights $w_{ij}$ (input to hidden) and $w_{jk}$ (hidden to output)]
$\frac{\partial E}{\partial y_k} = y_k - d_k$
$\delta_k = \frac{\partial E}{\partial\, \mathrm{net}_k} = (y_k - d_k)\cdot g'(\mathrm{net}_k)$
$\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial\, \mathrm{net}_k} \cdot \frac{\partial\, \mathrm{net}_k}{\partial w_{jk}} = \frac{\partial E}{\partial\, \mathrm{net}_k} \cdot y_j$
$\frac{\partial E}{\partial y_j} = \sum_k \frac{\partial E}{\partial\, \mathrm{net}_k} \cdot \frac{\partial\, \mathrm{net}_k}{\partial y_j}$
$\delta_j = \frac{\partial E}{\partial\, \mathrm{net}_j} = \frac{\partial E}{\partial y_j} \cdot g'(\mathrm{net}_j)$
$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial\, \mathrm{net}_j} \cdot y_i$
Page 56
Weight Updates
$\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial\, \mathrm{net}_k} \cdot \frac{\partial\, \mathrm{net}_k}{\partial w_{jk}} = \delta_k \cdot y_j$
$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial\, \mathrm{net}_j} \cdot \frac{\partial\, \mathrm{net}_j}{\partial w_{ij}} = \delta_j \cdot y_i$
$\Delta w_{jk} = -\eta \cdot \frac{\partial E}{\partial w_{jk}}$
$\Delta w_{ij} = -\eta \cdot \frac{\partial E}{\partial w_{ij}}$
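Putting the chain-rule deltas and these weight updates together, a minimal sketch of one backprop step for a two-layer network of sigmoid units (the network size, learning rate, and training pair are illustrative assumptions):

```python
import numpy as np

def g(x):
    """Sigmoid transfer function, so g'(net) = y * (1 - y)."""
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(W_ij, W_jk, x, d, eta=0.5):
    """One gradient-descent step on E = 1/2 * sum_k (d_k - y_k)^2."""
    # Forward pass: y_i (= x) -> y_j -> y_k
    y_j = g(W_ij @ x)
    y_k = g(W_jk @ y_j)
    # Backward pass: deltas dE/dnet for the output and hidden layers
    delta_k = (y_k - d) * y_k * (1 - y_k)
    delta_j = (W_jk.T @ delta_k) * y_j * (1 - y_j)
    # Weight updates: Delta w = -eta * dE/dw, with dE/dw_jk = delta_k * y_j, etc.
    W_jk = W_jk - eta * np.outer(delta_k, y_j)
    W_ij = W_ij - eta * np.outer(delta_j, x)
    return W_ij, W_jk

rng = np.random.default_rng(0)
W_ij, W_jk = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))   # 2 inputs, 3 hidden, 1 output
x, d = np.array([0.5, -1.0]), np.array([1.0])
for _ in range(200):
    W_ij, W_jk = backprop_step(W_ij, W_jk, x, d)
print(g(W_jk @ g(W_ij @ x)))   # the output approaches the target d
```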
Page 57
Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 6 - 25 Jan 2016
Deep learning is everywhere
[Krizhevsky 2012]
Classification Retrieval
Page 58
Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 6 - 25 Jan 2016
[Faster R-CNN: Ren, He, Girshick, Sun 2015]
Detection Segmentation
[Farabet et al., 2012]
Deep learning is everywhere
Page 59
Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 6 - 25 Jan 2016
NVIDIA Tegra X1
self-driving cars
Deep learning is everywhere
Page 60
Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 6 - 25 Jan 2016
[Toshev, Szegedy 2014]
[Mnih 2013]
Deep learning is everywhere
Page 61
Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 6 - 25 Jan 2016
[Ciresan et al. 2013] [Sermanet et al. 2011] [Ciresan et al.]
Deep learning is everywhere
Page 62
[Vinyals et al., 2015]
Image Captioning