CSE 5526: Introduction to Neural Networks
Deep Belief Networks
Deep circuits can represent logic expressions using exponentially fewer components
• Consider the parity problem (project 1): for $\mathbf{x} \in \{0,1\}^D$,
$$f(\mathbf{x}) = \begin{cases} 1, & \text{if } \sum_i x_i \text{ is even} \\ 0, & \text{otherwise} \end{cases}$$
• The depth-2 circuit to compute $f(\mathbf{x})$ uses $O(2^D)$ AND, OR, and NOT elements
• A depth-$D$ circuit to compute $f(\mathbf{x})$ uses $O(D)$ elements
• In general, a depth-$k$ circuit uses $O\left(D^{(k-2)/(k-1)} \, 2^{D^{1/(k-1)}}\right)$ elements
• See (Håstad, 1987, Thm. 2.2)
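For concreteness, here is the parity function itself in a short Python sketch (illustrative only; the slide's point is the circuit size required to compute it, not this direct summation):

```python
import numpy as np

def parity(x):
    """The parity function from the slide: returns 1 when the number of
    1s in x is even, 0 otherwise. As a chain of XOR gates this costs
    only O(D) elements, but at depth O(D); flattening it to depth 2
    blows the size up to O(2^D)."""
    return 1 if np.sum(x) % 2 == 0 else 0

assert parity(np.array([1, 1, 0])) == 1  # two 1s: even
assert parity(np.array([1, 0, 0])) == 0  # one 1: odd
```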
Backpropagation through deep neural nets leads to the vanishing gradient problem
• Recall the gradient of the error with respect to the weights in layer $\ell$:
$$\frac{\partial E}{\partial w_{ji}^{\ell}} = -\delta_j^{\ell}\, y_i^{\ell-1}, \qquad \delta_j^{\ell} = \varphi'(v_j^{\ell}) \sum_k \delta_k^{\ell+1} w_{kj}^{\ell+1}$$
• In matrix notation, define the vector $\boldsymbol{\delta}^{\ell}$ and the diagonal matrix $\Phi'(\ell)$ with $\varphi'(v_j^{\ell})$ on its diagonal; then
$$\boldsymbol{\delta}^{\ell} = \Phi'(\ell)\, W^{\ell}\, \boldsymbol{\delta}^{\ell+1} = \Phi'(\ell)\, W^{\ell}\, \Phi'(\ell+1)\, W^{\ell+1} \cdots \boldsymbol{\delta}^{L} \approx (\Phi' W)^{L-\ell}\, \boldsymbol{\delta}^{L}$$
• Generally, $(\Phi' W)^{L-\ell}$ goes either to $\infty$ or to $0$ as the depth $L-\ell$ grows
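A minimal numeric sketch of this argument (assumptions: random Gaussian weights scaled by $1/\sqrt{n}$, and logistic derivatives, which never exceed 1/4): repeatedly multiplying a gradient vector by $\Phi'(\ell) W^{\ell}$ drives its norm toward zero, while larger weight scales make it blow up instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Push a stand-in for delta^L backward through L layers by repeatedly
# applying (diagonal matrix of logistic derivatives) x (weight matrix),
# mirroring the (Phi' W)^(L - l) product above.
n, L = 50, 40
delta = rng.normal(size=n)
for _ in range(L):
    W = rng.normal(scale=1.0 / np.sqrt(n), size=(n, n))
    phi_prime = np.diag(rng.uniform(0.0, 0.25, size=n))  # logistic' <= 1/4
    delta = phi_prime @ W @ delta

print(np.linalg.norm(delta))  # shrinks toward 0; scaling W up explodes it instead
```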
Convolutional networks are deep networks that are feasible to train
• Neural network that learns "receptive fields"
• And applies them across different spatial positions
• Weight matrices are very constrained
• Train using standard backprop (a minimal sketch of the shared-weight constraint follows below)
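The sketch below shows the constraint being described: one small kernel (a learned "receptive field") applied with shared weights at every spatial position. The helper name is illustrative, and strictly speaking this computes cross-correlation, which is what most "conv" layers actually do.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Apply one learned 'receptive field' (kernel) at every spatial
    position with shared weights -- the constraint a convolutional
    layer places on its weight matrix ('valid' positions only)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out
```

Because the same few kernel weights are reused everywhere, the equivalent fully-connected weight matrix is sparse and heavily tied, yet standard backprop still applies.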
LeNet-1 zipcode recognizer
• Trained on 7300 digits and tested on 2000 new ones
• 1% error on the training set, 5% error on the test set
• If the network is allowed to make no decision (reject), 1% error on the test set
• Difficult task (see examples)
• Remark: constraining network connectivity is a way of incorporating prior knowledge about a problem
• Backprop applies whether or not the network is constrained
LeNet-1 zipcode recognizer architecture
Another way to train deep neural nets is to use unsupervised pre-training
• Build training up from the bottom
• Train a shallow model to describe the data
• Treat that as a fixed transformation
• Train another shallow model on the transformed data
• Etc.
• No long-distance gradients necessary
• Initialize a deep neural network with these parameters
Restricted Boltzmann machines can be used as building blocks in this way
• A restricted Boltzmann machine (RBM) is a Boltzmann machine with one visible layer and one hidden layer, and no connections within either layer
RBM conditionals are easy to compute
• The energy function is:
$$E(\mathbf{v}, \mathbf{h}) = -\frac{1}{2} \sum_{i,j} w_{ji} v_i h_j = -\frac{1}{2} \mathbf{h}^T W \mathbf{v}$$
• So $P(\mathbf{h} \mid \mathbf{v})$ and $P(\mathbf{v} \mid \mathbf{h})$ are now easy to compute
• No Gibbs sampling necessary:
$$P(\mathbf{h} \mid \mathbf{v}) = \exp\left(\tfrac{1}{2} \mathbf{h}^T W \mathbf{v}\right) \left[ \sum_{\mathbf{h}'} \exp\left(\tfrac{1}{2} \mathbf{h}'^T W \mathbf{v}\right) \right]^{-1} = \prod_j \frac{\exp\left(\tfrac{1}{2} h_j (W\mathbf{v})_j\right)}{\sum_{h_j'} \exp\left(\tfrac{1}{2} h_j' (W\mathbf{v})_j\right)}$$
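A minimal sketch of that factorized conditional, assuming binary 0/1 units, no bias terms, and the common convention that drops the 1/2 from the energy (it would only rescale the sigmoid's argument); the helper names are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(W, v, rng):
    """Because there are no hidden-hidden connections, P(h | v)
    factorizes over units: P(h_j = 1 | v) = sigmoid((W v)_j).
    Returns a binary sample and the probabilities themselves."""
    p = sigmoid(W @ v)
    return (rng.random(p.shape) < p).astype(float), p
```

By the symmetry of the energy, `P(v_i = 1 | h) = sigmoid((W.T @ h)_i)` in exactly the same way.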
RBM training still needs Gibbs sampling
• Setting $T = 1$, we have
$$\frac{\partial L(\mathbf{w})}{\partial w_{ji}} = \rho_{ji}^{+} - \rho_{ji}^{-} = \langle h_j^{(0)} v_i^{(0)} \rangle - \langle h_j^{(\infty)} v_i^{(\infty)} \rangle$$
• The second correlation is computed using alternating Gibbs sampling until thermal equilibrium
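The chain itself, reusing the `sigmoid` and `sample_h_given_v` helpers from the RBM sketch above:

```python
def gibbs_chain(W, v0, steps, rng):
    """Alternating Gibbs sampling: h ~ P(h|v), then v ~ P(v|h),
    repeated. Run long enough to approximate thermal equilibrium,
    <h_j v_i> under the chain estimates the second correlation above."""
    v = v0
    for _ in range(steps):
        h, _ = sample_h_given_v(W, v, rng)
        p_v = sigmoid(W.T @ h)  # P(v_i = 1 | h), by symmetry
        v = (rng.random(p_v.shape) < p_v).astype(float)
    h, _ = sample_h_given_v(W, v, rng)
    return v, h
```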
Contrastive divergence is a quick way to train an RBM
• Contrastive divergence training
• Start at the observed data, sample $\mathbf{h}$, then $\mathbf{v}$, then $\mathbf{h}$:
$$\Delta w_{ji} = \eta \left( \langle h_j^{(0)} v_i^{(0)} \rangle - \langle h_j^{(1)} v_i^{(1)} \rangle \right)$$
• The first term is exact
• The second term approximates a sample from the unclamped joint distribution
• Assuming that $p(\mathbf{v}, \mathbf{h})$ is close to the data distribution, $(\mathbf{v}^{(1)}, \mathbf{h}^{(1)})$ is then a reasonable sample from $p(\mathbf{v}, \mathbf{h})$
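A sketch of one CD-1 update under the same assumptions (binary units, no biases), again reusing the helpers above; using hidden probabilities rather than binary samples in the second term is a common optional variance-reduction choice, not something the slide specifies:

```python
import numpy as np

def cd1_update(W, v0, eta, rng):
    """One contrastive-divergence step: start at the data v0, sample
    h0 ~ P(h|v0), reconstruct v1 ~ P(v|h0), then apply
    Delta w_ji = eta * (h0_j v0_i - h1_j v1_i)."""
    h0, _ = sample_h_given_v(W, v0, rng)
    p_v1 = sigmoid(W.T @ h0)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    _, p_h1 = sample_h_given_v(W, v1, rng)  # probabilities, for less noise
    return W + eta * (np.outer(h0, v0) - np.outer(p_h1, v1))
```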
Logistic belief nets are directed Boltzmann machines
• Each unit is bipolar (binary) and stochastic
• Sampling from the belief net is easy
• Computing probabilities is still hard
Sampling from a logistic belief net
• Given the bipolar states of the units in layer $k$, we generate the state of each unit in layer $k-1$:
$$P(h_j^{(k-1)} = 1) = \varphi\left(\sum_i w_{ji}^{(k)} h_i^{(k)}\right)$$
where the superscript indicates the layer number and
$$\varphi(x) = \frac{1}{1 + \exp(-x)}$$
is the logistic activation function
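A minimal sketch of this top-down (ancestral) pass, treating states as 0/1 and omitting biases; the convention that `weights[k]` maps layer $k+1$ down to layer $k$ is an assumption for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_belief_net(weights, h_top, rng):
    """Ancestral sampling in a logistic belief net: given the states
    of a layer, each unit one layer below turns on with probability
    sigmoid(sum_i w_ji h_i). One downward pass, no inference needed."""
    h = h_top
    for W in reversed(weights):
        p = sigmoid(W @ h)  # P(unit = 1 | layer above)
        h = (rng.random(p.shape) < p).astype(float)
    return h  # a sample of the bottom (visible) layer
```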
Learning rule
• The bottom layer $\mathbf{h}^{(0)}$ is equal to the visible layer $\mathbf{v}$
• Learning in a belief net maximizes the likelihood of generating the input patterns applied to $\mathbf{v}$; we have
$$\Delta w_{ji} = \eta\, h_i^{(k)} \left[ h_j^{(k-1)} - P(h_j^{(k-1)} = 1) \right]$$
• The difference term in the above equation includes an evaluation of the posterior probability given the training data
• Computing posteriors is, unfortunately, very difficult
A special belief net
• However, for a special kind of belief net, computing posteriors is easy
• Consider a logistic belief net with an infinite number of layers and tied weights
• That is, a deep belief net (DBN)
Sampling from an infinite belief net produces samples from the posterior
Learning in this infinite belief net is now easy
• Because of the tied weights, all but two terms cancel each other out:
$$\frac{\partial L(\mathbf{w})}{\partial w_{ji}} = \langle h_j^{(0)} (v_i^{(0)} - v_i^{(1)}) \rangle + \langle v_i^{(1)} (h_j^{(0)} - h_j^{(1)}) \rangle + \langle h_j^{(1)} (v_i^{(1)} - v_i^{(2)}) \rangle + \cdots = \langle h_j^{(0)} v_i^{(0)} \rangle - \langle h_j^{(\infty)} v_i^{(\infty)} \rangle$$
Thus learning in this infinite belief net is equivalent to learning in an RBM
• This rule is exactly the same as the one for the RBM
• Hence the equivalence between learning an infinite belief net and an RBM
• Infinite belief nets are also known as deep belief nets (DBNs)
Training a general deep net layer-by-layer
1. First learn $W$ with all weights tied
2. Freeze (fix) $W$ as $W^{(0)}$, which represents the learned weights for the first hidden layer
3. Learn the weights for the second hidden layer by treating the responses of the first hidden layer to the training data as "input data"
4. Freeze the weights for the second hidden layer
5. Repeat steps 3-4 as many times as the prescribed number of hidden layers (a sketch of this recipe follows below)
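A minimal sketch of this greedy recipe, built on the hypothetical `cd1_update` and `sigmoid` helpers from the earlier sketches (plain per-sample updates; no biases, momentum, or mini-batches):

```python
def train_dbn(data, hidden_sizes, epochs, eta, rng):
    """Greedy layer-wise pre-training, steps 1-5 above: train an RBM
    on the current representation, freeze its weights, push the data
    through it, and repeat for the next layer."""
    weights = []
    x = data  # shape (n_samples, n_visible)
    for n_hidden in hidden_sizes:
        W = 0.01 * rng.normal(size=(n_hidden, x.shape[1]))
        for _ in range(epochs):
            for v in x:
                W = cd1_update(W, v, eta, rng)
        weights.append(W)      # freeze this layer (steps 2 and 4)
        x = sigmoid(x @ W.T)   # hidden responses become "input data" (step 3)
    return weights
```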
Thus an infinite belief network can be implemented with finite computation

[Figure, built up over five slides: the stack grows bottom-up from the data. The visible layer $\mathbf{v}$ is connected to hidden layer $H^0$ by an RBM with weights $W^0$ (its transpose generating downward); a second RBM with weights $W^1$ connects $H^0$ to $H^1$; then $W^2$ connects $H^1$ to $H^2$, and so on, freezing each layer's weights before adding the next.]
Remarks (Hinton, Osindero, & Teh, 2006)
• As the number of layers increases, the maximum-likelihood approximation to the training data improves
• For discriminative training (e.g., for classification), we add an output layer on top of the learned generative model and train the entire net with a discriminative algorithm
• Although much faster than training general Boltzmann machines (e.g., no simulated annealing), pretraining is still quite slow and involves many design choices, as for MLPs
DBNs have been successfully applied to an increasing number of tasks
• Ex: MNIST handwritten digit recognition
• A DBN with two hidden layers achieves a 1.25% error rate, vs. 1.4% for an SVM and 1.5% for an MLP
• Great example animations: http://www.cs.toronto.edu/~hinton/digits.html
Samples from the learned generative model with one label clamped on
Samples with one label clamped on, starting from a randomly initialized image