Page 1

Deep Learning: Restricted Boltzmann Machines & Deep Belief Nets

Based on slides by Geoffrey Hinton, Sue Becker, Yann LeCun, Yoshua Bengio, Frank Wood

Robot Image Credit: Viktoriya Sukhanova © 123RF.com

These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution. Please send comments and corrections to Eric.

Page 2

Neural Networks

[Figure: a feed-forward network with inputs, hidden layers, and outputs. Compare the outputs with the correct answer to get an error signal; back-propagate the error signal to get derivatives for learning.]

Page 3

What is wrong with back-propagation?
• It requires labeled training data
– Almost all data is unlabeled
• The learning time does not scale well
– It is very slow in nets with multiple hidden layers
• It can get stuck in poor local optima
– These are often quite good, but for deep nets they are far from optimal

Page 4

Motivations
• Supervised training of deep models (e.g. many-layered NNets) is difficult (optimization problem)
• Shallow models (SVMs, one-hidden-layer NNets, boosting, etc.) are unlikely candidates for learning the high-level abstractions needed for AI
• Unsupervised learning could do "local learning" (each module tries its best to model what it sees)
• Inference (+ learning) is intractable in directed graphical models with many hidden variables
• Current unsupervised learning methods don't easily extend to learn multiple levels of representation

Page 5

Belief Nets
• A belief net is a directed acyclic graph composed of stochastic variables.
• We can observe some of the variables, and we would like to solve two problems:
– The inference problem: infer the states of the unobserved variables.
– The learning problem: adjust the interactions between variables to make the network more likely to generate the observed data.

[Figure: stochastic hidden causes with directed connections down to visible effects.]

We use nets composed of layers of stochastic binary variables with weighted connections. Later, we will generalize to other types of variable.

Page 6

Explaining away (Judea Pearl)
• Even if two hidden causes are independent, they can become dependent when we observe an effect that they can both influence.
– If we learn that there was an earthquake, it reduces the probability that the house jumped because of a truck.

[Figure: two independent causes, "truck hits house" and "earthquake", both point to the effect "house jumps", with P(T) = e^{-10}, P(E) = e^{-10}, P(J | T) = 0.9, P(J | E) = 0.9, and P(J) = e^{-20} when neither cause is present.]
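To make the effect concrete, here is a minimal sketch (not from the slides) that computes the posterior over "truck" by brute-force enumeration, using the slide's numbers. Since the slide does not give P(J = 1 | T = 1, E = 1), the sketch assumes a noisy-OR combination for that case.

```python
import itertools
import math

p_T = math.exp(-10)   # prior: truck hits house
p_E = math.exp(-10)   # prior: earthquake

def p_jump(t, e):
    """P(house jumps | truck, earthquake)."""
    if t and e:
        return 1 - (1 - 0.9) * (1 - 0.9)   # noisy-OR assumption
    if t or e:
        return 0.9                          # from the slide
    return math.exp(-20)                    # house jumps on its own

def posterior_truck(observed_E=None):
    """P(T = 1 | J = 1), optionally also conditioning on E."""
    num = den = 0.0
    for t, e in itertools.product([0, 1], repeat=2):
        if observed_E is not None and e != observed_E:
            continue
        w = (p_T if t else 1 - p_T) * (p_E if e else 1 - p_E) * p_jump(t, e)
        den += w
        if t:
            num += w
    return num / den

print(posterior_truck())              # ~0.5: either cause is equally plausible
print(posterior_truck(observed_E=1))  # ~e-10: the earthquake explains the truck away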

Page 7

Why multilayer learning is hard in a sigmoid belief net
• To learn Θ, we need the posterior distribution in the first hidden layer.
• Problem 1: The posterior is typically intractable because of "explaining away".
• Problem 2: The posterior depends on the prior created by the higher layers as well as on the likelihood.
– So to learn Θ, we need to know the weights in the higher layers, even if we are only approximating the posterior. All the weights interact.
• Problem 3: We need to integrate over all possible configurations of the higher variables to get the prior for the first hidden layer. Yuk!

[Figure: several layers of hidden variables stacked above the data; the weights Θ between the data and the first hidden layer define the likelihood, and the layers above define the prior.]

Page 8

Stochastic binary neurons
These have a state of 1 or 0, which is a stochastic function of the neuron's bias $b_i$ and the input it receives from other neurons:

$$p(s_i = 1) = \frac{1}{1 + \exp(-b_i - \sum_j s_j \Theta_{ji})}$$

[Figure: the logistic function, rising from 0 to 1 with value 0.5 at zero total input $b_i + \sum_j s_j \Theta_{ji}$.]

Page 9

Stochastic units
Replace the binary threshold units by binary stochastic units that make biased random decisions:

$$P(s_i = 1) = \frac{1}{1 + \exp(-\sum_j s_j \Theta_{ji} / T)} = \frac{1}{1 + \exp(-\Delta E_i / T)}$$

where $T$ is the temperature and the energy gap is $\Delta E_i = E(s_i = 0) - E(s_i = 1)$.
– The temperature controls the amount of noise.
– Decreasing all the energy gaps between configurations is equivalent to raising the noise level.
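The sampling rule on the last two slides is a one-liner in practice. Below is a minimal NumPy sketch (the function and variable names are mine, not the slides'); T = 1 gives the standard logistic unit, and larger T makes the decision noisier.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_unit(b, s, theta, T=1.0):
    """Sample one stochastic binary unit.

    p(on) = sigmoid((b + s . theta) / T), where s holds the states of
    the other neurons and theta the incoming weights.
    """
    p_on = 1.0 / (1.0 + np.exp(-(b + s @ theta) / T))
    return float(rng.random() < p_on)

# A unit with bias 0.5 driven by three neighbours:
state = sample_unit(0.5, np.array([1.0, 0.0, 1.0]), np.array([0.2, -0.3, 0.4]))
```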

Page 10

Restricted Boltzmann Machines
• Restrict the connectivity to make learning easier
– Only one layer of hidden units
• We will deal with more layers later
– No connections between hidden units
• In an RBM, the hidden units are conditionally independent given the visible states
– So we can quickly get an unbiased sample from the posterior distribution when given a data-vector
– This is a big advantage over directed belief nets

[Figure: a bipartite graph, a layer of hidden units j above a layer of visible units i, with connections only between the layers.]
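Conditional independence is what makes the "quick unbiased sample" possible: the whole hidden layer can be sampled with one matrix product. A sketch under the same assumptions as above (W is the visible-by-hidden weight matrix, b_h the hidden biases; both names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hidden(v, W, b_h):
    """Sample all hidden units in parallel given a visible vector v.

    Because hidden units are conditionally independent given v, each
    h_j is an independent coin flip with p(h_j = 1) = sigmoid(b_h + v W)_j.
    """
    p_h = 1.0 / (1.0 + np.exp(-(b_h + v @ W)))
    return (rng.random(p_h.shape) < p_h).astype(float)
```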

Page 11

The energy of a joint configuration (ignoring bias terms)

$$E(v, h) = -\sum_{i,j} v_i h_j \Theta_{ij}$$

where $v_i$ is the binary state of visible unit $i$, $h_j$ is the binary state of hidden unit $j$, and $\Theta_{ij}$ is the weight between units $i$ and $j$. Note that

$$-\frac{\partial E(v, h)}{\partial \Theta_{ij}} = v_i h_j$$
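In matrix form the energy and its weight gradient are two lines of NumPy; this sketch follows the bias-free definition above.

```python
import numpy as np

def energy(v, h, W):
    """E(v, h) = -sum_ij v_i h_j W_ij (bias terms ignored, as above)."""
    return -v @ W @ h

def neg_energy_grad(v, h):
    """-dE/dW is simply the outer product: (-dE/dW)_ij = v_i h_j."""
    return np.outer(v, h)
```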

Page 12

Weights → Energies → Probabilities
• Each possible joint configuration of the visible and hidden units has an energy
– The energy is determined by the weights and biases
• The energy of a joint configuration of the visible and hidden units determines its probability:

$$P(v, h) \propto e^{-E(v, h)}$$

• The probability of a configuration over the visible units is found by summing the probabilities of all the joint configurations that contain it.

Page 13

Using energies to define probabilities
• Probability of a joint configuration over both visible and hidden units:

$$P(v, h) = \frac{e^{-E(v, h)}}{\sum_{u, g} e^{-E(u, g)}}$$

• Probability of a particular configuration of the visible units:

$$P(v) = \frac{\sum_h e^{-E(v, h)}}{\sum_{u, g} e^{-E(u, g)}}$$
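For a toy RBM these sums can be computed exactly by enumerating every configuration, which makes the definitions easy to check (and shows why the partition function is intractable at scale: the loops are exponential in the number of units). A sketch using the bias-free energy from Page 11, with illustrative toy weights:

```python
import itertools
import numpy as np

def all_states(n):
    """All 2^n binary vectors of length n."""
    return [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=n)]

def p_visible(v, W):
    """P(v) = sum_h exp(-E(v,h)) / sum_{u,g} exp(-E(u,g)), with E(v,h) = -v W h."""
    n_v, n_h = W.shape
    Z = sum(np.exp(u @ W @ g) for u in all_states(n_v) for g in all_states(n_h))
    return sum(np.exp(v @ W @ h) for h in all_states(n_h)) / Z

W = np.array([[1.0, -0.5],
              [0.5,  0.2]])            # a 2-visible x 2-hidden toy RBM
print(p_visible(np.array([1.0, 0.0]), W))
```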

Page 14

A picture of the Boltzmann machine learning algorithm for an RBM

Start with a training vector on the visible units. Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel. The state of the chain at t = ∞ is "a fantasy".

$$\frac{\partial \log P(v)}{\partial \Theta_{ij}} = \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^\infty$$

[Figure: the alternating Gibbs chain over the visible and hidden layers at t = 0, 1, 2, …, ∞; the correlation $\langle v_i h_j \rangle^0$ is measured at t = 0 and $\langle v_i h_j \rangle^\infty$ at t = ∞.]

Page 15

A very surprising fact
• Everything that one weight needs to know about the other weights and the data in order to do maximum likelihood learning is contained in the difference of two correlations.

$$\frac{\partial \log P(v)}{\partial \Theta_{ij}} = \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^\infty$$

The left-hand side is the derivative of the log probability of one training vector. $\langle v_i h_j \rangle^0$ is the expected value of the product of states at thermal equilibrium when the training vector is clamped on the visible units; $\langle v_i h_j \rangle^\infty$ is the expected value of the product of states at thermal equilibrium when nothing is clamped.

Page 16

A picture of the Boltzmann machine learning algorithm for an RBM

[Figure: the same alternating Gibbs chain at t = 0, 1, 2, …, ∞, with $\langle v_i h_j \rangle^0$ measured at t = 0 and $\langle v_i h_j \rangle^\infty$ at t = ∞.]

Problem: this Markov chain may take a very long time to converge!
Solution: Contrastive Divergence.

Page 17

Contrastive Divergence Learning: a quick way to learn an RBM

Start with a training vector on the visible units. Update all the hidden units in parallel. Update all the visible units in parallel to get a "reconstruction". Update the hidden units again.

$$\Delta \Theta_{ij} = \epsilon \left[ \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^1 \right]$$

Here $\langle v_i h_j \rangle^0$ is measured on the data (t = 0) and $\langle v_i h_j \rangle^1$ on the reconstruction (t = 1).

This is not following the gradient of the log likelihood. But it works well. It is approximately following the gradient of another objective function (Carreira-Perpiñán & Hinton, 2005).

[Figure: one step of the Gibbs chain, from the data at t = 0 to the reconstruction at t = 1.]
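Putting Pages 10-17 together, one CD-1 update fits in a few lines. This is a hedged sketch rather than the canonical implementation: it adds bias terms (the slides' energy omits them), and it uses hidden probabilities instead of sampled states in the statistics, a common variance-reduction choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b_v, b_h, lr=0.1):
    """One contrastive-divergence (CD-1) update for a binary RBM."""
    # Up-pass: sample hidden states from the data.
    p_h0 = sigmoid(b_h + v0 @ W)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Down-pass: sample a "reconstruction" of the visible units.
    p_v1 = sigmoid(b_v + W @ h0)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    # Up-pass again: hidden probabilities for the reconstruction.
    p_h1 = sigmoid(b_h + v1 @ W)
    # Delta Theta = eps * ( <v h>^0 - <v h>^1 )
    W = W + lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    b_v = b_v + lr * (v0 - v1)
    b_h = b_h + lr * (p_h0 - p_h1)
    return W, b_v, b_h
```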

Page 18

How to learn a set of features that are good for reconstructing images of the digit 2

[Figure: a 16x16 pixel image connected to 50 binary feature neurons. On the data (reality): increment the weights between an active pixel and an active feature. On the reconstruction (better than reality): decrement the weights between an active pixel and an active feature.]

Page 19

The Final 50 x 256 Weights
Each neuron grabs a different feature.

Page 20

How well can we reconstruct the digit images from the binary feature activations?

[Figure: pairs of data and reconstructions from the activated binary features, for (a) new test images from the digit class that the model was trained on, and (b) images from an unfamiliar digit class (the network tries to see every image as a 2).]

Page 21

Using an RBM to learn a model of a digit class

[Figure: an RBM with 256 visible units (pixels) and 100 hidden units (features), run from data to reconstruction; shown are reconstructions of data by a model trained on 2's and by a model trained on 3's.]

Page 22

Training a Deep Belief Network (the main reason RBMs are interesting)
• First train a layer of features that receive input directly from the pixels.
• Then treat the activations of the trained features as if they were pixels and learn features of features in a second hidden layer (a code sketch of this greedy stacking follows below).
• It can be proved that each time we add another layer of features, we improve a variational lower bound on the log probability of the training data.
– The proof is slightly complicated.
– But it is based on a neat equivalence between an RBM and a deep directed model.
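A minimal sketch of the greedy stacking described in the first two bullets, reusing cd1_step from the earlier sketch (the layer sizes, epoch counts, and function names are illustrative assumptions, not the slides'):

```python
import numpy as np

def train_rbm(data, n_hidden, epochs=10, lr=0.1):
    """Train one RBM on a (n_samples, n_visible) binary array with CD-1."""
    n_visible = data.shape[1]
    rng = np.random.default_rng(0)
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        for v0 in data:
            W, b_v, b_h = cd1_step(v0, W, b_v, b_h, lr)
    return W, b_h

def train_dbn(data, layer_sizes=(500, 500, 2000)):
    """Greedy layer-wise training: each RBM's hidden activations
    become the 'pixels' for the next RBM."""
    layers, x = [], data
    for n_hidden in layer_sizes:
        W, b_h = train_rbm(x, n_hidden)
        layers.append((W, b_h))
        x = 1.0 / (1.0 + np.exp(-(b_h + x @ W)))   # up-pass activations
    return layers
```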

Page 23

The Generative Model After Learning 3 Layers

To generate data:
1. Get an equilibrium sample from the top-level RBM by performing alternating Gibbs sampling for a long time.
2. Perform a top-down pass to get states for all the other layers.

So the lower-level bottom-up connections are not part of the generative model. They are just used for inference.

[Figure: a stack of layers h3, h2, h1, data; the top two layers are joined by undirected weights Θ3 (the top-level RBM), while Θ2 and Θ1 are directed, top-down generative weights.]

Page 24

Why does greedy learning work?
• Each RBM converts its data distribution into an aggregated posterior distribution over its hidden units.
• This divides the task of modeling its data into two tasks:
– Task 1: Learn generative weights that can convert the aggregated posterior distribution over the hidden units back into the data distribution, i.e. model P(v | h, Θ).
– Task 2: Learn to model the aggregated posterior distribution over the hidden units, i.e. model P(h | Θ).
– The RBM does a good job of Task 1 and a moderately good job of Task 2.
• Task 2 is easier (for the next RBM) than modeling the original data because the aggregated posterior distribution is closer to a distribution that an RBM can model perfectly.

Page 25

Why does greedy learning work? (continued)
• The weights Θ in the bottom-level RBM define P(v | h), and they also, indirectly, define P(h).
• So we can express the RBM model as

$$P(v) = \sum_h P(v \mid h, \Theta)\, P(h \mid \Theta)$$

• If we leave P(v | h, Θ) alone and improve P(h | Θ), we will improve P(v).
• To improve P(h), we need it to be a better model of the aggregated posterior distribution over hidden vectors produced by applying Θ to the data.
– This is accomplished by the next higher layer.

Page 26

Why greedy learning works
• Each time we learn a new layer, the inference at the layer below becomes incorrect, but the variational bound on the log prob of the data improves (only true in theory).
• Since the bound starts as an equality, learning a new layer never decreases the log prob of the data, provided we start the learning from the tied weights that implement the complementary prior.
• Now that we have a guarantee, we can loosen the restrictions and still feel confident:
– Allow layers to vary in size.
– Do not start the learning at each layer from the weights in the layer below.

Page 27

A neural network model of digit recognition

[Figure: a 28x28 pixel image feeds 500 units, then 500 units, then 2000 top-level units; 10 label units are also connected to the top-level units.]

The model learns a joint density for labels and images. To perform recognition we can start with a neutral state of the label units and do one or two iterations of the top-level RBM. Or we can just compute the free energy of the RBM with each of the 10 labels.

The top two layers form a restricted Boltzmann machine whose free-energy landscape models the low-dimensional manifolds of the digits. The valleys have names: one per digit class.
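The free energy mentioned above has a closed form for a binary RBM, $F(v) = -b_v^\top v - \sum_j \log(1 + e^{(b_h + v^\top W)_j})$, so "compute the free energy with each of the 10 labels" amounts to ten evaluations and an argmin. A hedged sketch: it assumes, purely for illustration, that the top RBM's visible vector is the concatenation of the penultimate features and a one-hot label, and all names are mine.

```python
import numpy as np

def free_energy(v, W, b_v, b_h):
    """F(v) = -b_v . v - sum_j log(1 + exp(b_h + v W)_j); lower F <=> higher P(v)."""
    return -(b_v @ v) - np.sum(np.logaddexp(0.0, b_h + v @ W))

def classify(features, W, b_v, b_h, n_labels=10):
    """Return the label whose clamped configuration has the lowest free energy."""
    scores = []
    for k in range(n_labels):
        label = np.zeros(n_labels)
        label[k] = 1.0    # clamp one candidate label (one-hot, an assumption)
        scores.append(free_energy(np.concatenate([features, label]), W, b_v, b_h))
    return int(np.argmin(scores))
```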

Page 28

Movie of the network generating digits (available at www.cs.toronto.edu/~hinton)

Page 29

Fine-tuning with a contrastive version of the "wake-sleep" algorithm

After learning many layers of features, we can fine-tune the features to improve generation.
1. Do a stochastic bottom-up pass.
– Adjust the top-down weights to be good at reconstructing the feature activities in the layer below.
2. Do a few iterations of sampling in the top-level RBM.
– Adjust the weights in the top-level RBM.
3. Do a stochastic top-down pass.
– Adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above.

Not required! But it helps the recognition rate.

Page 30

Limits of the Generative Model
1. Designed for images where non-binary values can be treated as probabilities.
2. Top-down feedback only in the highest (associative) layer.
3. No systematic way to deal with invariance.
4. Assumes segmentation has already been performed and does not learn to attend to the most informative parts of objects.

Page 31

Deep Net Activation Functions

Page 32

Other Deep Architectures: Convolutional Neural Network

[Image credit: http://timdettmers.com/2015/03/26/convolution-deep-learning/]

Page 33

Other Deep Architectures: Convolutional Neural Network

[Image credit: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/]

[Image credit: http://rnd.azoft.com/wp-content/uploads_rnd/2016/11/overall-1024x256.png]

Page 34

Other Deep Architectures: Long Short-Term Memory (LSTM)

[Image credits: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]

Page 35

Deep Learning in the Headlines

Page 36

Deep Belief Net on Face Images
Based on materials by Andrew Ng

[Figure: the learned feature hierarchy for faces: pixels → edges → object parts (combinations of edges) → object models.]

Page 37

Learning of Object Parts
Examples of learned object parts from object categories: faces, cars, elephants, chairs.
Slide credit: Andrew Ng

Page 38

Training on Multiple Objects
Trained on 4 classes (cars, faces, motorbikes, airplanes). Second layer: shared features and object-specific features. Third layer: more specific features.
Slide credit: Andrew Ng

Page 39

Scene Labeling via Deep Learning
[Farabet et al., ICML 2012, PAMI 2013]

Page 40

Inference from Deep Learned Models
Generating posterior samples from faces by "filling in" experiments (cf. Lee and Mumford, 2003). Combine bottom-up and top-down inference.

[Figure: input images, samples from feedforward inference (control), and samples from full posterior inference.]

Slide credit: Andrew Ng

Page 41

Machine Learning in Automatic Speech Recognition

A Typical Speech Recognition System: ML is used to predict phone states from the sound spectrogram.

Deep learning has state-of-the-art results:

# Hidden Layers      1     2     4     8     10    12
Word Error Rate %   16.0  12.8  11.4  10.9  11.0  11.1

Baseline GMM performance = 15.4% [Zeiler et al., "On rectified linear units for speech recognition", ICASSP 2013]

Page 42

Impact of Deep Learning in Speech Technology
Slide credit: Li Deng, MS Research