Page 1
Deep Learning: Restricted Boltzmann Machines & Deep Belief Nets
Based on slides by Geoffrey Hinton, Sue Becker, Yann LeCun, Yoshua Bengio, Frank Wood
Robot Image Credit: Viktoriya Sukhanova © 123RF.com
These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution. Please send comments and corrections to Eric.
Page 2
Neural Networks

[Figure: a feedforward network. Inputs feed through hidden layers to outputs. Compare the outputs with the correct answer to get an error signal, then back-propagate the error signal to get derivatives for learning.]
Page 3
What is wrong with back-propagation?
• It requires labeled training data
  – Almost all data is unlabeled
• The learning time does not scale well
  – It is very slow in nets with multiple hidden layers
• It can get stuck in poor local optima
  – These are often quite good, but for deep nets they are far from optimal
Page 4
Motivations
• Supervised training of deep models (e.g., many-layered neural nets) is difficult (an optimization problem)
• Shallow models (SVMs, one-hidden-layer neural nets, boosting, etc.) are unlikely candidates for learning the high-level abstractions needed for AI
• Unsupervised learning could do "local learning" (each module tries its best to model what it sees)
• Inference (and learning) is intractable in directed graphical models with many hidden variables
• Current unsupervised learning methods don't easily extend to learning multiple levels of representation
Page 5
Belief Nets
• A belief net is a directed acyclic graph composed of stochastic variables.
• We can observe some of the variables, and we would like to solve two problems:
  – The inference problem: infer the states of the unobserved variables.
  – The learning problem: adjust the interactions between variables to make the network more likely to generate the observed data.

[Figure: stochastic hidden causes with directed connections down to visible effects.]

We use nets composed of layers of stochastic binary variables with weighted connections. Later, we will generalize to other types of variable.
Page 6
Explaining away (Judea Pearl)
• Even if two hidden causes are independent, they can become dependent when we observe an effect that they can both influence.
  – If we learn that there was an earthquake, it reduces the probability that the house jumped because of a truck.

[Figure: two independent causes, "truck hits house" (T) and "earthquake" (E), both pointing to the effect "house jumps" (J), annotated with P(T) = e^{-10}, P(E) = e^{-10}, P(J | T) = 0.9, P(J | E) = 0.9, and P(J) = e^{-20}.]
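The induced dependence can be checked numerically. Below is a minimal Python sketch that enumerates the four (T, E) configurations; it assumes a noisy-OR combination of the two causes with a small leak probability (the combination rule and the leak value are assumptions, since the slide only annotates the individual probabilities):

```python
import itertools
import math

p_T = p_E = math.exp(-10)   # priors on the two independent causes
leak = math.exp(-20)        # assumed probability the house jumps with no cause

def p_jump(t, e):
    """Noisy-OR: each active cause independently fails to trigger J with prob 0.1."""
    return 1.0 - (1.0 - leak) * (0.1 ** t) * (0.1 ** e)

# Joint probability of each (T, E) configuration together with J = 1
joint = {(t, e): (p_T if t else 1 - p_T) * (p_E if e else 1 - p_E) * p_jump(t, e)
         for t, e in itertools.product([0, 1], repeat=2)}

p_J = sum(joint.values())
p_T_given_J = (joint[(1, 0)] + joint[(1, 1)]) / p_J
p_T_given_JE = joint[(1, 1)] / (joint[(1, 1)] + joint[(0, 1)])
print(p_T_given_J)    # ~0.5: either cause could explain the jump
print(p_T_given_JE)   # tiny: the earthquake "explains away" the truck
```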
Page 7
Why multilayer learning is hard in a sigmoid belief net
• To learn Θ, we need the posterior distribution in the first hidden layer.
• Problem 1: the posterior is typically intractable because of "explaining away".
• Problem 2: the posterior depends on the prior created by the higher layers as well as on the likelihood.
  – So to learn Θ, we need to know the weights in the higher layers, even if we are only approximating the posterior. All the weights interact.
• Problem 3: we need to integrate over all possible configurations of the higher variables to get the prior for the first hidden layer. Yuk!

[Figure: a stack of three hidden-variable layers above the data. The weights Θ between the data and the first hidden layer define the likelihood; the layers above create the prior.]
Page 8
Stochastic binary neurons
Have a state of 1 or 0, which is a stochastic function of the neuron's bias b_i and the input states s_j it receives from other neurons:

p(a_i = 1) = \frac{1}{1 + \exp(-b_i - \sum_j s_j \theta_{ji})}

[Figure: the logistic function of the total input b_i + \sum_j s_j \theta_{ji}, rising from 0 to 1 and passing through 0.5 at zero input.]
Page 9
Stochastic units
Replace the binary threshold units by binary stochastic units that make biased random decisions:

P(a_i = 1) = \frac{1}{1 + \exp(-\sum_j s_j \theta_{ji} / T)} = \frac{1}{1 + \exp(-\Delta E_i / T)}

where the energy gap is \Delta E_i = E(a_i = 0) - E(a_i = 1).
– The temperature T controls the amount of noise.
– Decreasing all the energy gaps between configurations is equivalent to raising the noise level.
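As a concrete illustration, here is a minimal NumPy sketch of such a unit (variable names are illustrative; T = 1 recovers the standard stochastic binary neuron from the previous slide):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_unit(b_i, s, theta_i, T=1.0):
    """Sample a stochastic binary unit with bias b_i, input states s, and
    incoming weights theta_i: p(a_i = 1) = sigmoid((b_i + s . theta_i) / T)."""
    p = 1.0 / (1.0 + np.exp(-(b_i + s @ theta_i) / T))
    return int(rng.random() < p)

s = np.array([1.0, 0.0, 1.0])          # states of the other neurons
theta_i = np.array([0.5, -1.0, 2.0])   # weights theta_ji into unit i
print(sample_unit(-1.0, s, theta_i))   # 0 or 1, biased by the total input
```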
Page 10
Restricted Boltzmann Machines
• Restrict the connectivity to make learning easier
  – Only one layer of hidden units (we will deal with more layers later)
  – No connections between hidden units
• In an RBM, the hidden units are conditionally independent given the visible states
  – So we can quickly get an unbiased sample from the posterior distribution when given a data vector
  – This is a big advantage over directed belief nets

[Figure: a bipartite graph with a layer of hidden units j above a layer of visible units i.]
Page 11
The energy of a joint configuration (ignoring bias terms)

E(v, h) = -\sum_{i,j} v_i h_j \theta_{ij}

where v_i is the binary state of visible unit i, h_j is the binary state of hidden unit j, and \theta_{ij} is the weight between units i and j. The energy of the configuration with v on the visible units and h on the hidden units has a simple weight derivative:

-\frac{\partial E(v, h)}{\partial \theta_{ij}} = v_i h_j
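In code, this energy and its weight derivative are one line each. A minimal NumPy sketch (bias terms ignored, as on the slide):

```python
import numpy as np

def rbm_energy(v, h, theta):
    """E(v, h) = -sum_{i,j} v_i h_j theta_ij."""
    return -(v @ theta @ h)

def neg_energy_grad(v, h):
    """-dE/dtheta_ij = v_i h_j: the outer product of the two binary state vectors."""
    return np.outer(v, h)

v = np.array([1.0, 0.0, 1.0])   # binary states of the visible units
h = np.array([1.0, 1.0])        # binary states of the hidden units
theta = np.zeros((3, 2))        # weights between units i and j
print(rbm_energy(v, h, theta), neg_energy_grad(v, h))
```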
Page 12
Weights → Energies → Probabilities
• Each possible joint configuration of the visible and hidden units has an energy
  – The energy is determined by the weights and biases
• The energy of a joint configuration of the visible and hidden units determines its probability:

P(v, h) \propto e^{-E(v, h)}

• The probability of a configuration over the visible units is found by summing the probabilities of all the joint configurations that contain it.
Page 13
Using energies to define probabilities
• Probability of a joint configuration over both visible and hidden units:

P(v, h) = \frac{e^{-E(v, h)}}{\sum_{u,g} e^{-E(u, g)}}

• Probability of a particular configuration of the visible units:

P(v) = \frac{\sum_h e^{-E(v, h)}}{\sum_{u,g} e^{-E(u, g)}}
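For a tiny RBM, these probabilities can be computed exactly by brute-force enumeration, which makes the role of the normalizing sum (the partition function) explicit. A minimal sketch:

```python
import numpy as np
from itertools import product

def exact_p_v(theta):
    """Return P(v) for every visible configuration of a small RBM by
    enumerating all joint (v, h) states and normalizing e^{-E(v, h)}."""
    n_v, n_h = theta.shape
    p_v = {}
    for v in product([0, 1], repeat=n_v):
        v_arr = np.array(v, dtype=float)
        # sum_h e^{-E(v, h)} with E(v, h) = -v' theta h
        p_v[v] = sum(np.exp(v_arr @ theta @ np.array(h, dtype=float))
                     for h in product([0, 1], repeat=n_h))
    Z = sum(p_v.values())   # partition function: sum over all (u, g) of e^{-E(u, g)}
    return {v: w / Z for v, w in p_v.items()}

print(exact_p_v(np.array([[1.0, -1.0], [0.5, 0.0]])))
```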
Page 14
A picture of the Boltzmann machine learning algorithm for an RBM

Start with a training vector on the visible units. Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.

\frac{\partial \log P(v)}{\partial \theta_{ij}} = \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^\infty

[Figure: the alternating Gibbs chain over visible units i and hidden units j at t = 0, 1, 2, ..., ∞. The correlation ⟨v_i h_j⟩⁰ is measured at t = 0, and ⟨v_i h_j⟩^∞ is measured on a "fantasy" at t = ∞.]
Page 15
A very surprising fact
• Everything that one weight needs to know about the other weights and the data in order to do maximum likelihood learning is contained in the difference of two correlations:

\frac{\partial \log P(v)}{\partial \theta_{ij}} = \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^\infty

The left-hand side is the derivative of the log probability of one training vector. The first correlation is the expected value of the product of states at thermal equilibrium when the training vector is clamped on the visible units; the second is the expected value of the product of states at thermal equilibrium when nothing is clamped.
Page 16
A picture of the Boltzmann machine learning algorithm for an RBM

[Figure: the same alternating Gibbs chain at t = 0, 1, 2, ..., ∞, with the correlations ⟨v_i h_j⟩⁰ and ⟨v_i h_j⟩^∞ measured at the two ends.]

Problem: this Markov chain may take a very long time to converge!
Solution: Contrastive Divergence.
Page 17
Contrastive Divergence Learning: A quick way to learn an RBM

Start with a training vector on the visible units. Update all the hidden units in parallel (t = 0). Update all the visible units in parallel to get a "reconstruction" (t = 1). Update the hidden units again.

\Delta \theta_{ij} = \epsilon \left( \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^1 \right)

This is not following the gradient of the log likelihood. But it works well. It is approximately following the gradient of another objective function (Carreira-Perpiñán & Hinton, 2005).
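Putting the pieces together, here is a minimal NumPy sketch of one CD-1 update for a bias-free RBM (an illustrative implementation, not Hinton's original code):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, eps=0.05):
    """One contrastive divergence (CD-1) step. v0: binary data vector,
    W: visible-by-hidden weight matrix (biases omitted for brevity)."""
    p_h0 = sigmoid(v0 @ W)                              # t = 0: hidden probs given data
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)  # sample hidden states
    p_v1 = sigmoid(W @ h0)                              # t = 1: "reconstruction" probs
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W)                              # update the hidden units again
    # Delta theta = eps * ( <v_i h_j>^0 - <v_i h_j>^1 )
    W += eps * (np.outer(v0, h0) - np.outer(v1, p_h1))
    return W
```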
Page 18
How to learn a set of features that are good for reconstructing images of the digit 2

[Figure: 50 binary feature neurons connected to a 16×16 pixel image. For the data (reality), increment the weights between each active pixel and each active feature; for the reconstruction (better than reality), decrement the weights between each active pixel and each active feature.]
Page 19
The Final 50 × 256 Weights

[Figure: each neuron grabs a different feature.]
Page 20
How well can we reconstruct the digit images from the binary feature activations?

[Figure: data and their reconstructions from the activated binary features. Left: new test images from the digit class that the model was trained on. Right: images from an unfamiliar digit class (the network tries to see every image as a 2).]
Page 21
Using an RBM to learn a model of a digit class

[Figure: an RBM with 256 visible units (pixels) and 100 hidden units (features), trained with the data → reconstruction correlations ⟨v_i h_j⟩⁰ and ⟨v_i h_j⟩¹. Shown: reconstructions of data by a model trained on 2s and by a model trained on 3s.]
Page 22
Training a Deep Belief Network (the main reason RBMs are interesting)
• First train a layer of features that receive input directly from the pixels.
• Then treat the activations of the trained features as if they were pixels and learn features of features in a second hidden layer.
• It can be proved that each time we add another layer of features, we improve a variational lower bound on the log probability of the training data.
  – The proof is slightly complicated.
  – But it is based on a neat equivalence between an RBM and a deep directed model.
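A sketch of the greedy layer-wise procedure, built on the cd1_update step above (train_rbm is a hypothetical helper that runs CD-1 over the whole dataset for a few epochs):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_dbn(data, layer_sizes):
    """Greedily stack RBMs: train one layer, then feed its feature
    activations to the next layer as if they were pixels."""
    weights, x = [], data              # x: (n_examples, n_visible)
    for n_hidden in layer_sizes:
        W = train_rbm(x, n_hidden)     # hypothetical: loops cd1_update over x
        weights.append(W)
        x = sigmoid(x @ W)             # this layer's features become the next "data"
    return weights
```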
Page 23
The Generative Model After Learning 3 Layers

To generate data:
1. Get an equilibrium sample from the top-level RBM by performing alternating Gibbs sampling for a long time.
2. Perform a top-down pass to get states for all the other layers.

So the lower-level bottom-up connections are not part of the generative model. They are just used for inference.

[Figure: a stack data, h1, h2, h3 with weights Θ1, Θ2, Θ3. The top two layers h2 and h3 form an RBM with undirected weights Θ3; Θ2 and Θ1 are directed, top-down generative connections.]
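The two-step generation procedure, as a sketch (weights as returned by train_dbn above; biases again omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generate(weights, n_gibbs=1000):
    """1. Alternating Gibbs sampling in the top-level RBM until near equilibrium.
       2. A single top-down pass through the directed layers."""
    W_top = weights[-1]
    v = (rng.random(W_top.shape[0]) < 0.5).astype(float)
    for _ in range(n_gibbs):                   # step 1: top-level RBM
        h = (rng.random(W_top.shape[1]) < sigmoid(v @ W_top)).astype(float)
        v = (rng.random(W_top.shape[0]) < sigmoid(W_top @ h)).astype(float)
    x = v
    for W in reversed(weights[:-1]):           # step 2: top-down generative pass
        x = (rng.random(W.shape[0]) < sigmoid(W @ x)).astype(float)
    return x                                   # a sample on the data layer
```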
Page 24
Why does greedy learning work?
• Each RBM converts its data distribution into an aggregated posterior distribution over its hidden units.
• This divides the task of modeling its data into two tasks:
  – Task 1: learn generative weights that can convert the aggregated posterior distribution over the hidden units back into the data distribution.
  – Task 2: learn to model the aggregated posterior distribution over the hidden units.
  – The RBM does a good job of Task 1 and a moderately good job of Task 2.
• Task 2 is easier (for the next RBM) than modeling the original data because the aggregated posterior distribution is closer to a distribution that an RBM can model perfectly.

[Figure: Task 2 models the aggregated posterior distribution on the hidden units, P(h | Θ); Task 1 maps it back to the data distribution on the visible units via P(v | h, Θ).]
Page 25
Why does greedy learning work?
• The weights Θ in the bottom-level RBM define P(v | h), and they also, indirectly, define P(h).
• So we can express the RBM model as

P(v) = \sum_h P(v \mid h, \Theta) \, P(h \mid \Theta)

• If we leave P(v | h, Θ) alone and improve P(h | Θ), we will improve P(v).
• To improve P(h), we need it to be a better model of the aggregated posterior distribution over hidden vectors produced by applying Θ to the data.
  – This is accomplished by the next higher layer.
Page 26
Why greedy learning works
• Each time we learn a new layer, the inference at the layer below becomes incorrect, but the variational bound on the log probability of the data improves (only true in theory).
• Since the bound starts as an equality, learning a new layer never decreases the log probability of the data, provided we start the learning from the tied weights that implement the complementary prior.
• Now that we have a guarantee, we can loosen the restrictions and still feel confident:
  – Allow layers to vary in size.
  – Do not start the learning at each layer from the weights in the layer below.
Page 27
A neural network model of digit recognition

[Figure: a 28×28 pixel image feeds into 500 units, then 500 units, then 2000 top-level units; 10 label units are connected to the top level.]

The model learns a joint density for labels and images. To perform recognition, we can start with a neutral state of the label units and do one or two iterations of the top-level RBM. Or we can just compute the free energy of the RBM with each of the 10 labels.

The top two layers form a restricted Boltzmann machine whose free-energy landscape models the low-dimensional manifolds of the digits. The valleys have names: the ten digit classes.
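The free-energy route can be sketched directly. For an RBM with visible biases b_v, hidden biases b_h, and weights W, the free energy of a visible vector has the standard closed form F(v) = -b_v·v - Σ_j log(1 + exp(b_h_j + v·W_:j)). A minimal sketch of recognition by free energy (clamp the features, try each label, pick the lowest F; names and layout are illustrative):

```python
import numpy as np

def free_energy(v, W, b_v, b_h):
    """Standard RBM free energy: F(v) = -b_v.v - sum_j log(1 + e^{b_h_j + v.W_:j})."""
    return -(v @ b_v) - np.sum(np.logaddexp(0.0, b_h + v @ W))

def classify(features, n_labels, W, b_v, b_h):
    """Try each of the n_labels one-hot label vectors alongside the clamped
    features; return the label whose configuration has the lowest free energy."""
    scores = []
    for k in range(n_labels):
        label = np.eye(n_labels)[k]
        v = np.concatenate([features, label])  # visible vector of the top RBM
        scores.append(free_energy(v, W, b_v, b_h))
    return int(np.argmin(scores))
```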
Page 28
Movie of the network generating digits
(available at www.cs.toronto.edu/~hinton)
Page 29
Fine-tuning with a contrastive version of the "wake-sleep" algorithm

After learning many layers of features, we can fine-tune the features to improve generation.
1. Do a stochastic bottom-up pass.
   – Adjust the top-down weights to be good at reconstructing the feature activities in the layer below.
2. Do a few iterations of sampling in the top-level RBM.
   – Adjust the weights in the top-level RBM.
3. Do a stochastic top-down pass.
   – Adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above.

Not required! But it helps the recognition rate.
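A high-level sketch of one fine-tuning pass; every helper here is hypothetical and just names the corresponding step (see Hinton, Osindero & Teh, 2006, for the full up-down algorithm):

```python
def up_down_pass(dbn, v, eps=0.01):
    """One contrastive wake-sleep pass over a single data vector v (a sketch)."""
    # 1. Stochastic bottom-up pass; fit the top-down (generative) weights to
    #    reconstruct the sampled feature activities in the layer below.
    wake = stochastic_up_pass(dbn, v)              # hypothetical helper
    update_generative_weights(dbn, wake, eps)      # hypothetical helper
    # 2. A few Gibbs iterations in the top-level RBM, with a CD-style update.
    top = gibbs_top_rbm(dbn, wake[-1], n_iters=3)  # hypothetical helper
    update_top_rbm(dbn, wake[-1], top, eps)        # hypothetical helper
    # 3. Stochastic top-down pass; fit the bottom-up (recognition) weights to
    #    reconstruct the sampled feature activities in the layer above.
    sleep = stochastic_down_pass(dbn, top)         # hypothetical helper
    update_recognition_weights(dbn, sleep, eps)    # hypothetical helper
```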
Page 30
Limits of the Generative Model
1. Designed for images where non-binary values can be treated as probabilities.
2. Top-down feedback only in the highest (associative) layer.
3. No systematic way to deal with invariance.
4. Assumes segmentation has already been performed and does not learn to attend to the most informative parts of objects.
Page 31
Deep Net Activation Functions
Page 32
Other Deep Architectures: Convolutional Neural Network
[Image credit: http://timdettmers.com/2015/03/26/convolution-deep-learning/]
Page 33
Other Deep Architectures: Convolutional Neural Network
[Image credit: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/]
[Image credit: http://rnd.azoft.com/wp-content/uploads_rnd/2016/11/overall-1024x256.png]
Page 34
Other Deep Architectures: Long Short-Term Memory (LSTM)
[Image credits: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]
Page 35
Deep Learning in the Headlines
Page 36
Deep Belief Net on Face Images

[Figure: feature hierarchy learned from face images: pixels → edges → object parts (combinations of edges) → object models.]

Based on materials by Andrew Ng
Page 37
Learning of Object Parts

[Figure: examples of learned object parts from object categories: faces, cars, elephants, chairs.]

Slide credit: Andrew Ng
Page 38
Training on Multiple Objects

Trained on 4 classes (cars, faces, motorbikes, airplanes). Second layer: shared features and object-specific features. Third layer: more specific features.

Slide credit: Andrew Ng
Page 39
Scene Labeling via Deep Learning

[Farabet et al., ICML 2012; PAMI 2013]
Page 40
Inference from Deep Learned Models

Generating posterior samples from faces by "filling in" experiments (cf. Lee and Mumford, 2003). Combine bottom-up and top-down inference.

[Figure: input images; samples from feedforward inference (control); samples from full posterior inference.]

Slide credit: Andrew Ng
Page 41
Machine Learning in Automatic Speech Recognition

A typical speech recognition system: ML is used to predict phone states from the sound spectrogram. Deep learning has state-of-the-art results:

# Hidden Layers     | 1    | 2    | 4    | 8    | 10   | 12
Word Error Rate (%) | 16.0 | 12.8 | 11.4 | 10.9 | 11.0 | 11.1

Baseline GMM performance = 15.4% [Zeiler et al., "On rectified linear units for speech recognition", ICASSP 2013]
Page 42
Impact of Deep Learning in Speech Technology

Slide credit: Li Deng, MS Research