In Vitro Molecular Evolution · • Introduction to Deep Network • Restricted Boltzmann Machine (RBM) • Learning RBM: Contrastive Divergence ... A picture of the Boltzmann machine

Deep Networks

CSE Course on Artificial Neural Networks &

CogSci Course on Cognitive Modeling and AI

BraSci Course on Computational Neuroscience

Fall 2011

November 3, 2011

Byoung-Tak Zhang

Computer Science and Engineering (CSE) &

Cognitive Science and Brain Science Programs

http://bi.snu.ac.kr/

http://bi.snu.ac.kr/

Overview

• Introduction to Deep Network

• Restricted Boltzmann Machine (RBM)

• Learning RBM: Contrastive Divergence

• Deep Belief Networks (DBN)

• Examples

– Image Reconstruction

– Motion generation

USING A VAST, VAST MAJORITY OF SLIDES ORIGINALLY FROM: Geoffrey Hinton, Sue Becker, Yann Le Cun, Yoshua Bengio, Frank Wood, Honglak Lee, George Taylor

Motivation: why go deep?

• Deep Architectures can be representationally efficient

– Fewer computational units for same function

• Deep Representations might allow for a hierarchy or representation

– Allows non-local generalization

– Comprehensibility

• Multiple levels of latent variables allow combinatorial sharing of statistical strength

• Deep architectures work well (vision, audio, NLP, etc.)!

Learning to extract the orientation of a face patch (Salakhutdinov & Hinton, NIPS 2007)

The training and test sets for predicting

face orientation

11,000 unlabeled cases 100, 500, or 1000 labeled cases

face patches from new people

The root mean squared error in the orientation

when combining GP’s with deep belief nets

22.2 17.9 15.2

17.2 12.7 7.2

16.3 11.2 6.4

GP on

the

pixels

GP on

top-level

features

GP on top-level

features with

fine-tuning

100 labels

500 labels

1000 labels

Conclusion: The deep features are much better

than the pixels. Fine-tuning helps a lot.

Deep Autoencoders (Hinton & Salakhutdinov, 2006)

• They always looked like a really

nice way to do non-linear

dimensionality reduction:

– But it is very difficult to

optimize deep autoencoders

using backpropagation.

• We now have a much better way

to optimize them:

– First train a stack of 4 RBM’s

– Then “unroll” them.

– Then fine-tune with backprop.

1000 neurons

500 neurons

500 neurons

250 neurons

250 neurons

30

1000 neurons

28x28

28x28

1

2

3

4

4

3

2

1

W

W

W

W

W

W

W

W

T

T

T

T

linear

units

A comparison of methods for compressing

digit images to 30 real numbers.

real

data

30-D

deep auto

30-D logistic

PCA

30-D

PCA

Retrieving documents that are similar

to a query document

• We can use an autoencoder to find low-

dimensional codes for documents that allow

fast and accurate retrieval of similar

documents from a large set.

• We start by converting each document into a

“bag of words”. This a 2000 dimensional

vector that contains the counts for each of the

2000 commonest words.

How to compress the count vector

• We train the neural

network to reproduce its

input vector as its output

• This forces it to

compress as much

information as possible

into the 10 numbers in

the central bottleneck.

• These 10 numbers are

then a good way to

compare documents.

2000 reconstructed counts

500 neurons

2000 word counts

500 neurons

250 neurons

250 neurons

10

input

vector

output

vector

Performance of the autoencoder at

document retrieval

• Train on bags of 2000 words for 400,000 training cases of business documents.

– First train a stack of RBM’s. Then fine-tune with backprop.

• Test on a separate 400,000 documents.

– Pick one test document as a query. Rank order all the other test documents by using the cosine of the angle between codes.

– Repeat this using each of the 400,000 test documents as the query (requires 0.16 trillion comparisons).

• Plot the number of retrieved documents against the proportion that are in the same hand-labeled class as the query document.

Proportion of retrieved documents in same class as query

Number of documents retrieved

First compress all documents to 2 numbers using a type of PCA

Then use different colors for different document categories

First compress all documents to 2 numbers.

Then use different colors for different document categories

RESTRICTED BOLTZMANN

MACHINE (RBM)

Restricted Boltzmann Machines

• Restrict the connectivity to make learning

easier.

– Only one layer of hidden units.

– No connections between hidden units.

• In an RBM, the hidden units are

conditionally independent given the

visible states.

– So can quickly get an unbiased

sample from the posterior distribution

when given a data-vector.

– This is a big advantage over directed

belief nets

hidden

i

j

visible

Stochastic binary neurons

• These have a state of 1 or 0 which is a stochastic

function of the neuron’s bias, b, and the input it receives

from other neurons.

0.5

0

0

1

j

jijii

wsbsp

)exp(1)(

11

j

jiji wsb

)( 1isp

Stochastic units

• Replace the binary threshold units by binary stochastic

units that make biased random decisions.

– The temperature controls the amount of noise.

– Decreasing all the energy gaps between configurations

is equivalent to raising the noise level.

)()(

1

1

1

1)(

10

1

iii

TETwsi

sEsEEgapEnergy

ee

spij ijj

temperature

The Energy of a joint configuration

ji

ijji

unitsi

ii wssbsE vhvhvhhv ),(

bias of unit i

weight between units i and j

Energy with configuration v on the visible units and h on the hidden units

binary state of unit i in joint configuration v, h

indexes every non-identical pair of i and j once

v

h

Weights Energies Probabilities

• Each possible joint configuration of the visible and hidden units has an energy – The energy is determined by the weights and

biases (as in a Hopfield net).

• The energy of a joint configuration of the visible and hidden units determines its probability:

• The probability of a configuration over the visible units is found by summing the probabilities of all the joint configurations that contain it.

),(),(

hvEhvp e

LEARNING RBM:

CONTRASTIVE DIVERGENCE

Maximizing the training data log likelihood

• We want maximizing parameters

• Differentiate w.r.t. to all parameters and

perform gradient ascent to find optimal

parameters.

• The derivation is nasty.

D,,

1,, |

|

logmaxarg),,|(Dlogmaxarg11 d

c m

mm

m

mm

ncf

df

pnn

Assuming d’s drawn inde

pendently from p()

Standard PoE form

Over all training data.

PoE: product-of-experts

Equilibrium Is Hard to Achieve

• With:

can now train our PoE model.

• But… there’s a problem: – is computationally infeasible to obtain (esp. in

an inner gradient ascent loop).

– Sampling Markov Chain must converge to target distribution. Often this takes a very long time!

Pm

mm

Pm

mm cfdf |log|log

0

m

np

),,|(Dlog 1

P

A very surprising fact

• Everything that one weight needs to know about

the other weights and the data in order to do

maximum likelihood learning is contained in the

difference of two correlations.

freejijiij

ssssw

p

v

v)(log

Derivative of log probability of one training vector

Expected value of product of states at thermal equilibrium when the training vector is clamped on the visible units

Expected value of product of states at thermal equilibrium when nothing is clamped

The (theoretical) batch learning algorithm

• Positive phase – Clamp a data vector on the visible units. – Let the hidden units reach thermal equilibrium at a

temperature of 1 – Sample for all pairs of units – Repeat for all data vectors in the training set.

• Negative phase – Do not clamp any of the units – Let the whole network reach thermal equilibrium at

a temperature of 1 (where do we start?) – Sample for all pairs of units – Repeat many times to get good estimates

• Weight updates – Update each weight by an amount proportional to

the difference in in the two phases.

jiss

jiss

jiss

Solution: Contrastive Divergence!

• Now we don’t have to run the sampling Markov Chain to convergence, instead we can stop after 1 iteration (or perhaps a few iterations more typically)

• Why does this work? – Attempts to minimize the ways that the model

distorts the data.

10

|log|log

Pm

mm

Pm

mm cfdf

m

np

),,|(Dlog 1

Contrastive Divergence

• Maximum likelihood gradient: pull down energy

surface at the examples and pull it up

everywhere else, with more emphasis where

model puts more probability mass

• Contrastive divergence updates: pull down

energy surface at the examples and pull it up in

their neighborhood, with more emphasis where

model puts more probability mass

A picture of the Boltzmann machine learning

algorithm for an RBM

0 jiss1 jiss jiss

i

j

i

j

i

j

i

j

t = 0 t = 1 t = 2 t = infinity

)( 0 jijiij ssssw

Start with a training vector on the visible units.

Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.

a fantasy

Contrastive divergence learning:

A quick way to learn an RBM

0 jiss 1 jiss

i

j

i

j

t = 0 t = 1

)( 10 jijiij ssssw

Start with a training vector on the visible units.

Update all the hidden units in parallel

Update the all the visible units in parallel to get a “reconstruction”.

Update the hidden units again.

This is not following the gradient of the log likelihood. But it works well.

When we consider infinite directed nets it will be easy to see why it works.

reconstruction data

DEEP BELIEF NETWORKS

Deep Belief Networks(DBNs)

• Probabilistic generative model

• Deep architecture – multiple layers

• Unsupervised pre-learning provides a good

initialization of the network

– maximizing the lower-bound of the log-

likelihood of the data

• Supervised fine-tuning

– Generative: Up-down algorithm

– Discriminative: backpropagation

DBN structure

Deep Network Training

• Use unsupervised learning (greedy layer-wise training)

– Allows abstraction to develop naturally from one layer to another

– Help the network initialize with good parameters

• Perform supervised top-down training as final step

– Refine the features (intermediate layers) so that they become more relevant for the task

DBN Greedy training

• First step:

– Construct an RBM

with an input layer v

and a hidden layer h

– Train the RBM

DBN Greedy training

• Second step:

– Stack another hidden

layer on top of the

RBM to form a new

RBM

– Fix , sample

from as input.

Train as RBM

1W1h

1( | )Q h v2W

DBN Greedy training

• Third step:

– Continue to stack

layers on top of the

network, train it as

previous step, with

sample sampled from

• And so on…

2 1( | )Q h h

Why greedy learning works

• Each time we learn a new layer, the inference at the

layer below becomes incorrect, but the variational bound

on the log prob of the data improves (only true in theory -

ml).

• Since the bound starts as an equality, learning a new

layer never decreases the log prob of the data, provided

we start the learning from the tied weights that

implement the complementary prior.

• Now that we have a guarantee we can loosen the

restrictions and still feel confident.

– Allow layers to vary in size.

– Do not start the learning at each layer from the

weights in the layer below.

The generative model after learning 3 layers

• To generate data:

1. Get an equilibrium sample

from the top-level RBM by

performing alternating Gibbs

sampling for a long time.

2. Perform a top-down pass to

get states for all the other

layers.

So the lower level bottom-up

connections are not part of

the generative model. They

are just used for inference.

h2

data

h1

h3

2W

3W

1W

EXAMPLE:

IMAGE REGENERATION

A model of digit recognition

2000 top-level neurons

500 neurons

500 neurons

28 x 28

pixel

image

10 label

neurons

The model learns to generate combinations of

labels and images.

To perform recognition we start with a neutral

state of the label units and do an up-pass from

the image followed by a few iterations of the

top-level associative memory.

The top two layers form an

associative memory whose

energy landscape models the low

dimensional manifolds of the

digits.

The energy valleys have names

Demo: http://www.cs.toronto.edu/~hinton/adi/index.htm

Samples generated by letting the associative

memory run with one label clamped. There are

1000 iterations of alternating Gibbs sampling

between samples.

Examples of correctly recognized handwritten digits

that the neural network had never seen before

Its very

good

How well does it discriminate on MNIST test set with

no extra information about geometric distortions?

• Generative model based on RBM’s 1.25%

• Support Vector Machine (Decoste et. al.) 1.4%

• Backprop with 1000 hiddens (Platt) ~1.6%

• Backprop with 500 -->300 hiddens ~1.6%

• K-Nearest Neighbor ~ 3.3%

• See Le Cun et. al. 1998 for more results

• Its better than backprop and much more neurally plausible

because the neurons only need to send one kind of signal,

and the teacher can be another sensory input.

EXAMPLE:

MOTION REGENERATION

The conditional RBM model (Sutskever & Hinton 2007)

• Given the data and the previous hidden

state, the hidden units at time t are

conditionally independent.

– So online inference is very easy

• Learning can be done by using

contrastive divergence.

– Reconstruct the data at time t from

the inferred states of the hidden units

and the earlier states of the visibles.

– The temporal connections can be

learned as if they were additional

biases

t-2 t-1 t

t

)( reconjdatajiij sssw

i

j

Generating from a learned model

• The inputs from the earlier states

of the visible units create

dynamic biases for the hidden

and current visible units.

• Perform alternating Gibbs

sampling for a few iterations

between the hidden units and the

current visible units.

– This picks new hidden and

visible states that are

compatible with each other

and with the recent history. t-2 t-1 t

t

i

j

Conditional Deep Belief Network

cDBN as Bayesian filtering

Application for human motion

– Generating realistic human motion sequence

by successively sampling the t-th frame given

the previously sampled k frames

drunk graceful sexy

(Taylor et al, 2007)

In Vitro Molecular Evolution · • Introduction to Deep Network • Restricted Boltzmann Machine (RBM) • Learning RBM: Contrastive Divergence ... A picture of the Boltzmann machine

Documents