Deep Networks CSE Course on Artificial Neural Networks & CogSci Course on Cognitive Modeling and AI BraSci Course on Computational Neuroscience Fall 2011 November 3, 2011 Byoung-Tak Zhang Computer Science and Engineering (CSE) & Cognitive Science and Brain Science Programs http://bi.snu.ac.kr/
• Introduction to Deep Networks
• Restricted Boltzmann Machine (RBM)
• Learning RBM: Contrastive Divergence
• Deep Belief Networks (DBNs)
Using a vast majority of slides originally from: Geoffrey Hinton, Sue Becker, Yann Le Cun, Yoshua Bengio, Frank Wood, Honglak Lee, George Taylor
Motivation: why go deep?
• Deep architectures can be representationally efficient
– Fewer computational units for the same function
• Deep representations might allow for a hierarchy of representations
– Allows non-local generalization
– Comprehensibility
• Multiple levels of latent variables allow combinatorial sharing of statistical strength
• Deep architectures work well (vision, audio, NLP, etc.)!
Learning to extract the orientation of a face patch (Salakhutdinov & Hinton, NIPS 2007)
The training and test sets for predicting face orientation: 11,000 unlabeled cases; 100, 500, or 1000 labeled cases; the test set contains face patches from new people.

Root mean squared error in the orientation when combining GPs with deep belief nets:

                                             100 labels   500 labels   1000 labels
GP on the pixels                                22.2         17.9         15.2
GP on top-level features                        17.2         12.7          7.2
GP on top-level features with fine-tuning       16.3         11.2          6.4
Conclusion: The deep features are much better
than the pixels. Fine-tuning helps a lot.
Deep Autoencoders (Hinton & Salakhutdinov, 2006)
• They always looked like a really nice way to do non-linear dimensionality reduction:
– But it is very difficult to optimize deep autoencoders using backpropagation.
• We now have a much better way to optimize them:
– First train a stack of 4 RBMs.
– Then “unroll” them.
– Then fine-tune with backprop.
[Figure: the autoencoder architecture. Encoder: 28x28 image → 1000 neurons → 500 neurons → 250 neurons → 30 linear units (the code layer), using weights W1–W4. Decoder: the mirror image, using the transposed weights W4^T–W1^T, back to a 28x28 reconstruction.]
A comparison of methods for compressing digit images to 30 real numbers.
[Figure: reconstructions from real data, the 30-D deep autoencoder, 30-D logistic PCA, and 30-D PCA.]
Retrieving documents that are similar to a query document
• We can use an autoencoder to find low-dimensional codes for documents that allow fast and accurate retrieval of similar documents from a large set.
• We start by converting each document into a “bag of words”. This is a 2000-dimensional vector that contains the counts for each of the 2000 commonest words.
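The bag-of-words step can be sketched as follows; the tiny corpus, the whitespace tokenizer, and the helper names (`build_vocab`, `bag_of_words`) are illustrative assumptions, not the authors' code.

```python
# Sketch: building bag-of-words count vectors over the commonest words.
# The vocabulary size (2000 in the slides) is a parameter here.
from collections import Counter

def build_vocab(documents, size=2000):
    """Keep the `size` most common words over the whole corpus."""
    counts = Counter(w for doc in documents for w in doc.lower().split())
    return [w for w, _ in counts.most_common(size)]

def bag_of_words(doc, vocab):
    """Count vector: one entry per vocabulary word (0 if absent)."""
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]

docs = ["stocks rose as markets rallied",
        "markets fell on weak earnings"]
vocab = build_vocab(docs, size=2000)   # far fewer than 2000 distinct words here
vec = bag_of_words(docs[0], vocab)
```

With a real corpus, `size=2000` reproduces the 2000-dimensional vectors the slides describe.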
How to compress the count vector
• We train the neural network to reproduce its input vector as its output.
• This forces it to compress as much information as possible into the 10 numbers in the central bottleneck.
• These 10 numbers are then a good way to compare documents.
[Figure: the compression network. Input: 2000 word counts → 500 neurons → 250 neurons → 10-unit bottleneck → 250 neurons → 500 neurons → 2000 reconstructed counts.]
Performance of the autoencoder at document retrieval
• Train on bags of 2000 words for 400,000 training cases of business documents.
– First train a stack of RBMs. Then fine-tune with backprop.
• Test on a separate 400,000 documents.
– Pick one test document as a query. Rank-order all the other test documents by using the cosine of the angle between codes.
– Repeat this using each of the 400,000 test documents as the query (requires 0.16 trillion comparisons).
• Plot the number of retrieved documents against the proportion that are in the same hand-labeled class as the query document.
[Figure: proportion of retrieved documents in the same class as the query vs. number of documents retrieved.]
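The ranking step above (cosine of the angle between codes) can be sketched as follows; the toy 2-D codes stand in for the autoencoder's 10-D bottleneck codes, and the helper names are assumptions.

```python
# Sketch: rank documents by cosine similarity between their codes.
import math

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def rank_by_cosine(query_code, codes):
    """Return document indices sorted by decreasing cosine similarity."""
    sims = [(cosine(query_code, c), i) for i, c in enumerate(codes)]
    return [i for _, i in sorted(sims, reverse=True)]

codes = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]   # toy 2-D codes
order = rank_by_cosine([1.0, 0.05], codes)     # most similar first
```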
[Figure: all documents compressed to 2 numbers using a type of PCA, with different colors for different document categories.]
[Figure: all documents compressed to 2 numbers, with different colors for different document categories.]
RESTRICTED BOLTZMANN MACHINE (RBM)

Restricted Boltzmann Machines
• Restrict the connectivity to make learning easier.
– Only one layer of hidden units.
– No connections between hidden units.
• In an RBM, the hidden units are conditionally independent given the visible states.
– So we can quickly get an unbiased sample from the posterior distribution when given a data-vector.
– This is a big advantage over directed belief nets.
[Figure: an RBM, with a layer of hidden units j above a layer of visible units i and no within-layer connections.]
Stochastic binary neurons
• These have a state of 1 or 0 which is a stochastic function of the neuron’s bias, b, and the input it receives from other neurons:

p(s_i = 1) = \frac{1}{1 + \exp\left(-b_i - \sum_j s_j w_{ji}\right)}

[Figure: the logistic curve — p(s_i = 1) rises from 0 through 0.5 to 1 as the total input b_i + \sum_j s_j w_{ji} increases.]
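A minimal sketch of such a unit, assuming the logistic form above; the toy weights, bias, and neighbor states are made up.

```python
# Sketch: a stochastic binary unit turns on with probability
# sigma(b_i + sum_j s_j * w_ji), where sigma is the logistic function.
import math
import random

def p_on(bias, states, weights):
    """Probability that the unit's state is 1."""
    total_input = bias + sum(s * w for s, w in zip(states, weights))
    return 1.0 / (1.0 + math.exp(-total_input))

def sample(bias, states, weights, rng=random.random):
    """Draw a stochastic binary state (1 or 0)."""
    return 1 if rng() < p_on(bias, states, weights) else 0

states = [1, 0, 1]            # states of the other neurons
weights = [2.0, -1.0, 0.5]    # incoming weights
prob = p_on(-0.5, states, weights)   # sigma(-0.5 + 2.0 + 0.5) = sigma(2.0)
```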
Stochastic units
• Replace the binary threshold units by binary stochastic units that make biased random decisions.
– The temperature T controls the amount of noise.
– Decreasing all the energy gaps between configurations is equivalent to raising the noise level.

\text{Energy gap:}\quad \Delta E_i = E(s_i = 0) - E(s_i = 1) = b_i + \sum_j s_j w_{ij}

p(s_i = 1) = \frac{1}{1 + e^{-\Delta E_i / T}}
The Energy of a joint configuration

E(\mathbf{v}, \mathbf{h}) = -\sum_{i \in \text{units}} s_i^{\mathbf{vh}} b_i \;-\; \sum_{i<j} s_i^{\mathbf{vh}} s_j^{\mathbf{vh}} w_{ij}

where E(v, h) is the energy with configuration v on the visible units and h on the hidden units, s_i^{vh} is the binary state of unit i in joint configuration (v, h), b_i is the bias of unit i, w_{ij} is the weight between units i and j, and the second sum indexes every non-identical pair of i and j once.
Weights → Energies → Probabilities
• Each possible joint configuration of the visible and hidden units has an energy.
– The energy is determined by the weights and biases (as in a Hopfield net).
• The energy of a joint configuration of the visible and hidden units determines its probability:

p(\mathbf{v}, \mathbf{h}) = \frac{e^{-E(\mathbf{v}, \mathbf{h})}}{\sum_{\mathbf{u}, \mathbf{g}} e^{-E(\mathbf{u}, \mathbf{g})}}

• The probability of a configuration over the visible units is found by summing the probabilities of all the joint configurations that contain it.
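For a tiny RBM the energy and the resulting probabilities can be checked by brute-force enumeration; the sizes (2 visible, 1 hidden) and parameter values below are arbitrary toy choices.

```python
# Brute-force illustration of E(v, h) and p(v, h) = exp(-E) / Z
# for a tiny RBM with 2 visible units and 1 hidden unit.
import math
from itertools import product

W = [[0.5], [-0.3]]      # W[i][j]: weight between visible i and hidden j
b_v = [0.1, 0.0]         # visible biases
b_h = [-0.2]             # hidden biases

def energy(v, h):
    e = -sum(b * s for b, s in zip(b_v, v))
    e -= sum(b * s for b, s in zip(b_h, h))
    e -= sum(W[i][j] * v[i] * h[j]
             for i in range(len(v)) for j in range(len(h)))
    return e

# Partition function Z: sum over every joint configuration.
configs = [(v, h) for v in product([0, 1], repeat=2)
                  for h in product([0, 1], repeat=1)]
Z = sum(math.exp(-energy(v, h)) for v, h in configs)

def p_joint(v, h):
    return math.exp(-energy(v, h)) / Z

def p_visible(v):
    """Marginal over the visible units: sum p(v, h) over hidden states."""
    return sum(p_joint(v, h) for h in product([0, 1], repeat=1))
```

Enumerating Z this way only works for tiny models; its infeasibility at realistic sizes is exactly why the learning rules below avoid computing it.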
LEARNING RBM: CONTRASTIVE DIVERGENCE
Maximizing the training data log likelihood
• We want the parameters that maximize the likelihood of the training data.
• Differentiate w.r.t. all parameters and perform gradient ascent to find optimal parameters.
• The derivation is nasty.

\arg\max_{\theta_1, \ldots, \theta_m} \log p(D \mid \theta_1, \ldots, \theta_m) \;=\; \arg\max_{\theta_1, \ldots, \theta_m} \log \prod_{n=1}^{N} \frac{\prod_m f_m(d_n \mid \theta_m)}{\sum_c \prod_m f_m(c \mid \theta_m)}

assuming the d_n are drawn independently from p(·). The fraction is the standard product-of-experts (PoE) form, and the product runs over all training data.
Equilibrium Is Hard to Achieve
• With

\frac{\partial \log p(D \mid \theta_1, \ldots, \theta_m)}{\partial \theta_m} = \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^\infty}

we can now train our PoE model.
• But… there’s a problem:
– The second expectation, taken under the model’s equilibrium distribution P^∞, is computationally infeasible to obtain (especially in an inner gradient ascent loop).
– The sampling Markov chain must converge to the target distribution. Often this takes a very long time!
A very surprising fact
• Everything that one weight needs to know about the other weights and the data in order to do maximum likelihood learning is contained in the difference of two correlations:

\frac{\partial \log p(\mathbf{v})}{\partial w_{ij}} = \langle s_i s_j \rangle_{\mathbf{v}} - \langle s_i s_j \rangle_{\text{free}}

The left side is the derivative of the log probability of one training vector. \langle s_i s_j \rangle_{\mathbf{v}} is the expected value of the product of states at thermal equilibrium when the training vector is clamped on the visible units; \langle s_i s_j \rangle_{\text{free}} is the same expectation when nothing is clamped.
The (theoretical) batch learning algorithm
• Positive phase
– Clamp a data vector on the visible units.
– Let the hidden units reach thermal equilibrium at a temperature of 1.
– Sample \langle s_i s_j \rangle for all pairs of units.
– Repeat for all data vectors in the training set.
• Negative phase
– Do not clamp any of the units.
– Let the whole network reach thermal equilibrium at a temperature of 1 (where do we start?).
– Sample \langle s_i s_j \rangle for all pairs of units.
– Repeat many times to get good estimates.
• Weight updates
– Update each weight by an amount proportional to the difference in \langle s_i s_j \rangle in the two phases.
Solution: Contrastive Divergence!
• Now we don’t have to run the sampling Markov chain to convergence; instead we can stop after 1 iteration (or perhaps a few iterations, more typically):

\frac{\partial \log p(D \mid \theta_1, \ldots, \theta_m)}{\partial \theta_m} \approx \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^1}

• Why does this work?
– It attempts to minimize the ways that the model distorts the data.
Contrastive Divergence
• Maximum likelihood gradient: pull down the energy surface at the examples and pull it up everywhere else, with more emphasis where the model puts more probability mass.
• Contrastive divergence updates: pull down the energy surface at the examples and pull it up in their neighborhood, with more emphasis where the model puts more probability mass.
A picture of the Boltzmann machine learning algorithm for an RBM

\Delta w_{ij} = \varepsilon \left( \langle s_i s_j \rangle^0 - \langle s_i s_j \rangle^\infty \right)

Start with a training vector on the visible units. Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.
[Figure: the alternating Gibbs chain at t = 0, 1, 2, …, ∞; \langle s_i s_j \rangle^0 is measured at t = 0 and \langle s_i s_j \rangle^\infty at equilibrium, where the chain produces a “fantasy”.]
Contrastive divergence learning: a quick way to learn an RBM

\Delta w_{ij} = \varepsilon \left( \langle s_i s_j \rangle^0 - \langle s_i s_j \rangle^1 \right)

Start with a training vector on the visible units. Update all the hidden units in parallel. Update all the visible units in parallel to get a “reconstruction”. Then update the hidden units again.
[Figure: the truncated chain — data at t = 0, reconstruction at t = 1.]
This is not following the gradient of the log likelihood, but it works well. When we consider infinite directed nets it will be easy to see why it works.
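A minimal CD-1 sketch for a binary RBM, following the update Δw_ij = ε(⟨s_i s_j⟩⁰ − ⟨s_i s_j⟩¹); the layer sizes and random data are toy assumptions, and biases are omitted for brevity.

```python
# Sketch: one-step contrastive divergence (CD-1) for a binary RBM,
# using mean-field probabilities for the negative statistics.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, v0, lr=0.1):
    """One CD-1 step on a batch of visible vectors v0 (biases omitted)."""
    # Positive phase: hidden probabilities and a binary sample given data.
    ph0 = sigmoid(v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # One reconstruction step: visible then hidden probabilities.
    pv1 = sigmoid(h0 @ W.T)
    ph1 = sigmoid(pv1 @ W)
    # <s_i s_j>^0 - <s_i s_j>^1, averaged over the batch.
    grad = (v0.T @ ph0 - pv1.T @ ph1) / v0.shape[0]
    return W + lr * grad

W = rng.normal(0, 0.1, size=(6, 3))               # 6 visible, 3 hidden units
data = (rng.random((20, 6)) < 0.5).astype(float)  # toy binary data
for _ in range(10):
    W = cd1_update(W, data)
```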
DEEP BELIEF NETWORKS
Deep Belief Networks (DBNs)
• Probabilistic generative model
• Deep architecture – multiple layers
• Unsupervised pre-training provides a good initialization of the network
– maximizing a lower bound on the log-likelihood of the data
• Supervised fine-tuning
– Generative: up-down algorithm
– Discriminative: backpropagation
DBN structure
Deep Network Training
• Use unsupervised learning (greedy layer-wise training)
– Allows abstraction to develop naturally from one layer to another
– Helps the network initialize with good parameters
• Perform supervised top-down training as the final step
– Refine the features (intermediate layers) so that they become more relevant for the task
DBN Greedy training
• First step:
– Construct an RBM with an input layer v and a hidden layer h.
– Train the RBM.
• Second step:
– Stack another hidden layer on top of the RBM to form a new RBM.
– Fix W^1, sample h^1 from Q(h^1 | v) as input. Train W^2 as an RBM.
• Third step:
– Continue to stack layers on top of the network, training each as in the previous step, with input sampled from Q(h^2 | h^1).
• And so on…
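The greedy stacking loop can be sketched as follows; `train_rbm` is a hypothetical placeholder (it just returns random weights) where contrastive divergence would run in practice, and the sizes and data are toy assumptions.

```python
# Sketch: greedy layer-wise pre-training. Train an RBM on the data, then
# use its hidden probabilities Q(h | v) as "data" for the next RBM, etc.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden):
    """Placeholder RBM trainer: returns a weight matrix of the right shape.
    A real implementation would run CD-1 here."""
    return rng.normal(0, 0.1, size=(data.shape[1], n_hidden))

def greedy_pretrain(data, layer_sizes):
    """Stack RBMs; each layer is trained on activities from the layer below."""
    weights, x = [], data
    for n_hidden in layer_sizes:
        W = train_rbm(x, n_hidden)
        weights.append(W)
        x = sigmoid(x @ W)          # Q(h | v): input for the next RBM
    return weights

data = (rng.random((50, 8)) < 0.5).astype(float)
stack = greedy_pretrain(data, [6, 4, 2])   # three stacked layers
```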
Why greedy learning works
• Each time we learn a new layer, the inference at the layer below becomes incorrect, but the variational bound on the log prob of the data improves (only true in theory).
• Since the bound starts as an equality, learning a new layer never decreases the log prob of the data, provided we start the learning from the tied weights that implement the complementary prior.
• Now that we have a guarantee we can loosen the restrictions and still feel confident.
– Allow layers to vary in size.
– Do not start the learning at each layer from the tied weights that implement the complementary prior.