© Eric Xing @ CMU, 2006-2010 Machine Learning Neural Networks Eric Xing Lecture 3, August 12, 2010 Reading: Chap. 5 CB

Page 1: Lecture3 xing fei-fei

Machine Learning

Neural Networks

Eric Xing

Lecture 3, August 12, 2010

Reading: Chap. 5 CB

Page 2: Lecture3 xing fei-fei

Learning highly non-linear functions

f: X → Y
• f might be a non-linear function
• X: (vector of) continuous and/or discrete variables
• Y: (vector of) continuous and/or discrete variables

Examples: the XOR gate; speech recognition

Page 3: Lecture3 xing fei-fei

From biological neuron to artificial neuron (perceptron)

• Activation function
• Artificial neuron networks
  – supervised learning
  – gradient descent

Perceptron and Neural Nets

[Figure: a biological neuron (soma) beside an artificial neuron whose output Y is defined piecewise by an activation function; a layered network in which input signals pass through an input layer, a middle layer, and an output layer to produce output signals]

Page 4: Lecture3 xing fei-fei

Connectionist Models

Consider humans:
• Neuron switching time ~ 0.001 second
• Number of neurons ~ 10^10
• Connections per neuron ~ 10^4 to 10^5
• Scene recognition time ~ 0.1 second
• Only about 100 sequential inference steps (0.1 s / 0.001 s) fit in that time, which doesn't seem like enough → much parallel computation

Properties of artificial neural nets (ANN):
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed processes







Page 5: Lecture3 xing fei-fei

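Abdominal Pain Perceptron

[Figure: a single-layer perceptron. Input units (Male, Age, Temp, WBC, Pain Intensity, ...) carry example values and are connected through adjustable weights to output units that include Cholecystitis, Small Bowel Obstruction, Pancreatitis, and Duodenal Ulcer; one output is 1 and the others are 0.]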

Page 6: Lecture3 xing fei-fei

Myocardial Infarction Network

[Figure: a network with inputs Male, Age, Smoker, ECG: ST Elevation, Pain Intensity, and Pain Duration; the output unit Myocardial Infarction reads 0.8, the “probability” of MI.]

Page 7: Lecture3 xing fei-fei


The "Driver" Network

Page 8: Lecture3 xing fei-fei

[Figure: input units (e.g., Cough, Headache) connected to output units (No disease, Pneumonia, Flu, Meningitis)]

• Error: the difference between what we got and what we wanted
• ∆ rule: change weights to decrease the error


Page 9: Lecture3 xing fei-fei

Input units
• Input to unit i: a_i = measured value of variable i

Output units
• Input to unit j: $a_j = \sum_i w_{ij} a_i$
• Output of unit j: $o_j = \dfrac{1}{1 + e^{-(a_j + \theta_j)}}$
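The two formulas for unit j can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `unit_output` and the example numbers are hypothetical and do not come from the lecture.

```python
import math

def unit_output(inputs, weights, theta):
    """Output o_j of one sigmoid unit.

    inputs  : activations a_i of the units feeding unit j
    weights : weights w_ij on those connections
    theta   : the offset theta_j added to the summed input
    """
    a_j = sum(w * a for w, a in zip(weights, inputs))   # a_j = sum_i w_ij * a_i
    return 1.0 / (1.0 + math.exp(-(a_j + theta)))       # o_j = 1 / (1 + e^-(a_j + theta_j))

# Hypothetical example: a_j = 0.5 and theta_j = -0.5, so the sigmoid sees 0
example = unit_output([1.0, 0.0], [0.5, -0.3], theta=-0.5)
print(example)  # 0.5
```

When a_j + θ_j is zero the unit sits exactly at the midpoint 0.5; large positive sums push the output toward 1 and large negative sums toward 0.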



Page 10: Lecture3 xing fei-fei

Jargon Pseudo-Correspondence
• Independent variable = input variable
• Dependent variable = output variable
• Coefficients = “weights”
• Estimates = “targets”

Logistic Regression Model (the sigmoid unit)

[Figure: inputs (independent variables x1, x2, x3, e.g., Age = 34, Stage = 4) are weighted by coefficients a, b, c and fed to a single sigmoid unit; the output (dependent variable p, the prediction) is the “probability of being alive”.]

Page 11: Lecture3 xing fei-fei

The perceptron learning algorithm

• Recall the nice property of the sigmoid function
• Consider a regression problem f: X → Y, for scalar Y
• Let’s maximize the conditional data likelihood

Page 12: Lecture3 xing fei-fei

Gradient Descent

• x_d = input
• t_d = target output
• o_d = observed unit output
• w_i = weight i

Page 13: Lecture3 xing fei-fei

The perceptron learning rules

• x_d = input
• t_d = target output
• o_d = observed unit output
• w_i = weight i

Batch mode: do until convergence:
1. Compute the gradient ∇E_D[w]
2. Update the weights: w ← w − η∇E_D[w]

Incremental mode: do until convergence:
For each training example d in D:
1. Compute the gradient ∇E_d[w]
2. Update the weights: w ← w − η∇E_d[w]
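The difference between the two modes is easiest to see side by side. The sketch below is an illustration, not the lecture's code. It assumes a linear unit o_d = w · x_d and the squared error E_d[w] = ½ (t_d − o_d)², neither of which is shown on the slide; the learning rate `eta`, the toy data, and all function names are hypothetical.

```python
def output(w, x):
    """Linear unit (an assumption): o = w . x"""
    return sum(wi * xi for wi, xi in zip(w, x))

def grad_example(w, x, t):
    """Gradient of E_d[w] = 0.5 * (t - o)^2 for one training example."""
    err = t - output(w, x)
    return [-err * xi for xi in x]

def batch_mode(data, w, eta, epochs):
    """Sum the gradient over all of D, then take one step per pass."""
    for _ in range(epochs):
        g = [0.0] * len(w)
        for x, t in data:
            g = [gi + gd for gi, gd in zip(g, grad_example(w, x, t))]
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

def incremental_mode(data, w, eta, epochs):
    """Take one step after every single training example d."""
    for _ in range(epochs):
        for x, t in data:
            w = [wi - eta * gi for wi, gi in zip(w, grad_example(w, x, t))]
    return w

# Toy data generated by t = 2*x1 - 1; the constant second input acts as a bias.
data = [([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]
w_batch = batch_mode(data, [0.0, 0.0], eta=0.1, epochs=500)
w_inc = incremental_mode(data, [0.0, 0.0], eta=0.1, epochs=500)
print(w_batch, w_inc)  # both close to [2.0, -1.0]
```

On this noise-free toy problem both modes reach the same weights; they differ in how often the weights change per pass over the data.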



Page 14: Lecture3 xing fei-fei

MLE vs MAP

• Maximum conditional likelihood estimate
• Maximum a posteriori estimate
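A worked sketch may help relate the two estimates to squared error. It assumes a deterministic network output $f(x_d; w)$, zero-mean Gaussian noise with variance $\sigma^2$ on the targets, and an independent zero-mean Gaussian prior with variance $\tau^2$ on each weight. The symbols $f$, $\sigma$, $\tau$, and $\lambda$ are introduced here for illustration and are not taken from the slide.

```latex
\begin{align*}
w_{\mathrm{MLE}}
  &= \arg\max_w \sum_d \ln p(t_d \mid x_d, w)
   = \arg\max_w \sum_d \left[ -\frac{\bigl(t_d - f(x_d; w)\bigr)^2}{2\sigma^2} \right]
   = \arg\min_w \sum_d \bigl(t_d - f(x_d; w)\bigr)^2 \\[4pt]
w_{\mathrm{MAP}}
  &= \arg\max_w \left[ \ln p(w) + \sum_d \ln p(t_d \mid x_d, w) \right]
   = \arg\min_w \left[ \sum_d \bigl(t_d - f(x_d; w)\bigr)^2 + \lambda \sum_i w_i^2 \right],
   \qquad \lambda = \frac{\sigma^2}{\tau^2}
\end{align*}
```

Under these assumptions the MLE minimizes the sum of squared errors, and the MAP estimate adds a penalty on squared weights whose strength grows as the prior becomes tighter.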

Page 15: Lecture3 xing fei-fei

What decision surface does a perceptron define?

x  y  Z (color)
0  0  1
0  1  1
1  0  1
1  1  0

[Figure: a single unit with inputs x1, x2, weights w1, w2, and threshold θ = 0.5]

f(x1w1 + x2w2) = y
f(0w1 + 0w2) = 1
f(0w1 + 1w2) = 1
f(1w1 + 0w2) = 1
f(1w1 + 1w2) = 0

f(a) = 1 for a > θ; 0 for a ≤ θ

Some possible values for w1 and w2:

w1  w2


Page 16: Lecture3 xing fei-fei

What decision surface does a perceptron define?

x  y  Z (color)
0  0  0
0  1  1
1  0  1
1  1  0

[Figure: a single unit with inputs x1, x2, weights w1, w2, and threshold θ = 0.5]

f(x1w1 + x2w2) = y
f(0w1 + 0w2) = 0
f(0w1 + 1w2) = 1
f(1w1 + 0w2) = 1
f(1w1 + 1w2) = 0

f(a) = 1 for a > θ; 0 for a ≤ θ

Some possible values for w1 and w2:

w1  w2


Page 17: Lecture3 xing fei-fei

What decision surface does a perceptron define?

x  y  Z (color)
0  0  0
0  1  1
1  0  1
1  1  0

[Figure: a network with a hidden layer and weights w1 through w6]

f(a) = 1 for a > θ; 0 for a ≤ θ

θ = 0.5 for all units

A possible set of values for (w1, w2, w3, w4, w5, w6): (0.6, -0.6, -0.7, 0.8, 1, 1)
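The weight set can be checked by evaluating the network on all four inputs. The sketch below is a hedged illustration: the transcript does not preserve which weight sits on which connection, so the wiring in `two_layer` (w1, w2 into the first hidden unit, w3, w4 into the second, w5, w6 into the output) is an assumption, and the function names are hypothetical.

```python
from itertools import product

THETA = 0.5  # threshold shared by all units, as stated on the slide

def step(a, theta=THETA):
    """Threshold activation: f(a) = 1 for a > theta, 0 otherwise."""
    return 1 if a > theta else 0

def two_layer(x1, x2, w):
    """Assumed wiring: (w1, w2) feed hidden unit 1, (w3, w4) feed hidden unit 2,
    and (w5, w6) connect the two hidden units to the output unit."""
    w1, w2, w3, w4, w5, w6 = w
    h1 = step(x1 * w1 + x2 * w2)
    h2 = step(x1 * w3 + x2 * w4)
    return step(h1 * w5 + h2 * w6)

weights = (0.6, -0.6, -0.7, 0.8, 1, 1)
table = {(x1, x2): two_layer(x1, x2, weights)
         for x1, x2 in product((0, 1), repeat=2)}
print(table)  # {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
```

With this wiring each hidden unit fires for exactly one of the inputs (1, 0) and (0, 1), and the output unit fires if either hidden unit does, which reproduces the table's Z column.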

Page 18: Lecture3 xing fei-fei

Non Linear Separation

[Figure: the four combinations of cough / no cough and headache / no headache, coded 00, 10, 01, 11, with regions labeled No disease, Meningitis, and Flu, and a further split into Treatment / No treatment (codes 000, 100, ...)]

Page 19: Lecture3 xing fei-fei

Neural Network Model

[Figure: independent variables (e.g., Age = 34, Stage = 4) feed a middle layer of units, which feeds the dependent variable, the “probability of being alive”]

Page 20: Lecture3 xing fei-fei

“Combined logistic models”

[Figure: independent variables (e.g., Age = 34, Stage = 4) and the dependent variable, the “probability of being alive”]

Page 21: Lecture3 xing fei-fei

[Figure: independent variables (e.g., Age = 34, Stage = 4) and the dependent variable, the “probability of being alive”]


Page 22: Lecture3 xing fei-fei

[Figure: independent variables (e.g., Age = 34, Stage = 4) and the dependent variable, the “probability of being alive”]


Page 23: Lecture3 xing fei-fei

[Figure: independent variables (e.g., Age = 34, Stage = 4), weights, and the dependent variable, the “probability of being alive”]

Not really, no target for hidden units...

Page 24: Lecture3 xing fei-fei

[Figure: input units (e.g., Cough, Headache) connected to output units (No disease, Pneumonia, Flu, Meningitis)]

• Error: the difference between what we got and what we wanted
• ∆ rule: change weights to decrease the error


Page 25: Lecture3 xing fei-fei

Hidden Units and Backpropagation

[Figure: input units and output units with a layer in between; the error between what we got and what we wanted drives the ∆ rule, which appears twice in the figure]

Page 26: Lecture3 xing fei-fei

Backpropagation Algorithm

Initialize all weights to small random numbers.

Until convergence, do:
1. Input the training example to the network and compute the network outputs
2. For each output unit k, compute its error term
3. For each hidden unit h, compute its error term
4. Update each network weight w_{i,j}

• x_d = input
• t_d = target output
• o_d = observed unit output
• w_i = weight i
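The four steps can be sketched for a network with one hidden layer of sigmoid units. The slide's own update formulas are not preserved in this transcript, so the error terms below are the standard ones for sigmoid units with squared error and should be read as an assumption. The learning rate `eta`, the network size, the OR training data, and all function names are hypothetical.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, w_hid, w_out):
    """Step 1: compute hidden and output activations (last weight in each row is a bias)."""
    xb = x + [1.0]
    h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hid]
    hb = h + [1.0]
    o = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in w_out]
    return h, o

def backprop_step(x, t, w_hid, w_out, eta=0.5):
    """One incremental update for a single training example (x, t), in place."""
    h, o = forward(x, w_hid, w_out)
    # Step 2: error term of each output unit k
    delta_o = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
    # Step 3: error term of each hidden unit (uses the output weights before updating)
    delta_h = [hj * (1 - hj) * sum(w_out[k][j] * delta_o[k] for k in range(len(o)))
               for j, hj in enumerate(h)]
    # Step 4: w_ij <- w_ij + eta * delta_j * (input carried by that weight)
    hb, xb = h + [1.0], x + [1.0]
    for k, row in enumerate(w_out):
        for j in range(len(row)):
            row[j] += eta * delta_o[k] * hb[j]
    for j, row in enumerate(w_hid):
        for i in range(len(row)):
            row[i] += eta * delta_h[j] * xb[i]

def squared_error(data, w_hid, w_out):
    return 0.5 * sum((tk - ok) ** 2
                     for x, t in data
                     for tk, ok in zip(t, forward(x, w_hid, w_out)[1]))

random.seed(0)
n_in, n_hid, n_out = 2, 3, 1
# Initialize all weights to small random numbers
w_hid = [[random.uniform(-0.05, 0.05) for _ in range(n_in + 1)] for _ in range(n_hid)]
w_out = [[random.uniform(-0.05, 0.05) for _ in range(n_hid + 1)] for _ in range(n_out)]
data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]  # logical OR

before = squared_error(data, w_hid, w_out)
for _ in range(2000):            # a fixed number of passes stands in for "until convergence"
    for x, t in data:
        backprop_step(x, t, w_hid, w_out)
after = squared_error(data, w_hid, w_out)
print(before, after)
```

The hidden-unit error terms are computed from the output-unit error terms, which is the sense in which the error is propagated backward.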

Page 27: Lecture3 xing fei-fei

More on Backpropagation

• It is doing gradient descent over the entire network weight vector
• Easily generalized to arbitrary directed graphs
• Will find a local, not necessarily global, error minimum
  – In practice, often works well (can run multiple times)
• Often includes weight momentum α
• Minimizes error over training examples
  – Will it generalize well to subsequent testing examples?
• Training can take thousands of iterations, which is very slow
• Using the network after training is very fast

Page 28: Lecture3 xing fei-fei

Learning Hidden Layer Representation

A network:

A target function:

Can this be learned?

Page 29: Lecture3 xing fei-fei

Learning Hidden Layer Representation

A network:

Learned hidden layer representation:

Page 32: Lecture3 xing fei-fei

Expressive Capabilities of ANNs

Boolean functions:
• Every Boolean function can be represented by a network with a single hidden layer
• But it might require a number of hidden units exponential in the number of inputs

Continuous functions:
• Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al 1989]
• Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988].

Page 33: Lecture3 xing fei-fei

Application: ANN for Face Reco.

[Figure: the model; the learned hidden unit]


Page 34: Lecture3 xing fei-fei

Regression vs. Neural Networks

Regression: Y = a(X1) + b(X2) + c(X3) + d(X1X2) + ...
• Candidate terms: X1, X2, X3, X1X2, X1X3, X2X3, X1X2X3
• (2^3 − 1) possible combinations

Neural network:
[Figure: inputs X1, X2, X3 feed units labeled with terms such as “X1”, “X1X3”, “X1X2X3”]
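The count of (2^3 − 1) candidate terms can be confirmed by enumerating every non-empty subset of the three inputs. This is a small illustrative sketch; the variable names are hypothetical.

```python
from itertools import combinations

variables = ["X1", "X2", "X3"]

# Every non-empty subset of the inputs gives one candidate regression term
terms = ["".join(subset)
         for size in range(1, len(variables) + 1)
         for subset in combinations(variables, size)]

print(terms)       # ['X1', 'X2', 'X3', 'X1X2', 'X1X3', 'X2X3', 'X1X2X3']
print(len(terms))  # 7
```

With n inputs the same enumeration yields 2^n − 1 terms, so listing interactions by hand quickly becomes impractical as n grows.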

Page 35: Lecture3 xing fei-fei

Minimizing the Error

[Figure: an error surface plotted against a weight w. Training starts at w_initial (initial error); where the derivative is negative the weight change is positive; training ends at w_trained (final error), a local minimum.]

Page 36: Lecture3 xing fei-fei

Overfitting in Neural Nets

[Figure: an overfitted model compared with the “real” model]



Page 37: Lecture3 xing fei-fei

Alternative Error Functions

• Penalize large weights
• Train on target slopes as well as values
• Tie together weights
  – E.g., in phoneme recognition

Page 38: Lecture3 xing fei-fei

Artificial neural networks – what you should know

• Highly expressive non-linear functions
• Highly parallel network of logistic function units
• Minimizing the sum of squared training errors
  – Gives MLE estimates of network weights if we assume zero-mean Gaussian noise on output values
• Minimizing the sum of squared errors plus the sum of squared weights (regularization)
  – Gives MAP estimates assuming weight priors are zero-mean Gaussian
• Gradient descent as training procedure
  – How to derive your own gradient descent procedure
• Discover useful representations at hidden units
• Local minima are the greatest problem
• Overfitting, regularization, early stopping

Page 39: Lecture3 xing fei-fei

Neural Network Approaches for Visual Recognition

L. Fei-Fei

Computer Science Dept.

Stanford University

Page 40: Lecture3 xing fei-fei

Machine learning in computer vision

• Aug 12, Lecture 3: Neural Network
  – Convolutional Nets for object recognition
  – Unsupervised feature learning via Deep Belief Net

7 August 2010, L. Fei-Fei, Dragon Star 2010, Stanford

Page 41: Lecture3 xing fei-fei

Machine learning in computer vision

• Aug 12, Lecture 3: Neural Network
  – Convolutional Nets for object recognition (slides courtesy of Yann LeCun (NYU))
  – Unsupervised feature learning via Deep Belief Net

[Reference paper: Yann LeCun et al. Proc. IEEE, 1998]

Page 42: Lecture3 xing fei-fei

Mammalian visual pathway is hierarchical

• The ventral (recognition) pathway in the visual cortex
  – Retina → LGN → V1 → V2 → V4 → PIT → AIT (80-100 ms)

Page 43: Lecture3 xing fei-fei

Some challenges for machine learning in vision

• Efficient Learning Algorithms for multi-stage “Deep Architectures”
  – Multi-stage learning is difficult
• Learning good internal representations of the world (features)
  – Hierarchy of features that captures the relevant information and eliminates irrelevant variabilities
• Learning with unlabeled samples (unsupervised learning)
  – Animals can learn vision by just looking at the world
  – No supervision is required
• Deep learning algorithms that can be applied to other modalities
  – If we can learn feature hierarchies for vision, we can learn feature hierarchies for audition, natural language, ...

Page 44: Lecture3 xing fei-fei

The traditional “shallow” architecture for recognition

[Figure: Pre-processing / Feature Extraction (this part is mostly hand-crafted) → Internal Representation → “Simple” Trainable Classifier]

• The raw input is pre-processed through a hand-crafted feature extractor
• The features are not learned
• The trainable classifier is often generic (task independent) and “simple” (linear classifier, kernel machine, nearest neighbor, ...)
• The most common Machine Learning architecture: the Kernel

Page 45: Lecture3 xing fei-fei

Good representations are hierarchical

• In Language: hierarchy in syntax and semantics
  – Words -> Parts of Speech -> Sentences -> Text
  – Objects, Actions, Attributes... -> Phrases -> Statements -> Stories
• In Vision: part-whole hierarchy
  – Pixels -> Edges -> Textons -> Parts -> Objects -> Scenes

Page 46: Lecture3 xing fei-fei

“Deep” learning: learning hierarchical representations

• Deep Learning: learning a hierarchy of internal representations
• From low-level features to mid-level invariant representations, to object identities
• Representations are increasingly invariant as we go up the layers
• Using multiple stages gets around the specificity/invariance

[Figure label: Learned Internal Representation]

Page 47: Lecture3 xing fei-fei

Do we really need deep architecture?

• We can approximate any function as closely as we want with a shallow architecture. Why would we need deep ones?
  – Kernel machines and 2-layer neural nets are “universal”.
• Deep learning machines
• Deep machines are more efficient for representing certain classes of functions, particularly those involved in visual recognition
  – They can represent more complex functions with less “hardware”
• We need an efficient parameterization of the class of functions that are useful for “AI” tasks.

Page 48: Lecture3 xing fei-fei




Hierarchical/deep architectures for vision

• Multiple Stages
• Each Stage is composed of
  – A bank of local filters (convolutions)
  – A non-linear layer (may include harsh non-linearities, such as rectification, contrast normalization, etc.)
  – A feature pooling layer
• Multiple stages can be stacked to produce high-level representations
  – Each stage makes the representation more global, and more invariant
• The systems can be trained with a combination of unsupervised and supervised methods

Page 49: Lecture3 xing fei-fei

Filtering+NonLinearity+Pooling = 1 stage of a Convolutional Net

• [Hubel & Wiesel 1962]:
  – simple cells detect local features
  – complex cells “pool” the outputs of simple cells within a retinotopic neighborhood.

[Figure: multiple convolutions produce retinotopic feature maps (“simple cells”), followed by pooling / subsampling (“complex cells”)]

Page 50: Lecture3 xing fei-fei

Feature extraction by filtering and pooling

• Biologically-inspired models of low-level feature extraction
  – Inspired by [Hubel and Wiesel 1962]
• Only 3 types of operations needed for traditional ConvNets:
  – Convolution/Filtering
  – Pointwise Non-Linearity
  – Pooling/Subsampling
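The three operations can be chained into one stage in plain Python. This is a minimal sketch under stated assumptions rather than the architecture from the slides: tanh as the pointwise non-linearity and max over non-overlapping 2x2 blocks as the pooling are choices made for illustration, and the image, filter, and function names are hypothetical.

```python
import math

def convolve2d_valid(image, kernel):
    """Filtering: slide the kernel over the image (no padding, stride 1).
    As in most ConvNet code, the kernel is not flipped (cross-correlation)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def pointwise_tanh(fmap):
    """Pointwise non-linearity applied to every entry of the feature map."""
    return [[math.tanh(v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Pooling/subsampling: maximum over non-overlapping size x size blocks."""
    return [[max(fmap[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]

# A 6x6 image with a vertical edge between columns 2 and 3
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
edge_filter = [[-1, 0, 1],
               [-1, 0, 1],
               [-1, 0, 1]]

filtered = convolve2d_valid(image, edge_filter)   # 4x4 map, each row is [0, 3, 3, 0]
stage = max_pool(pointwise_tanh(filtered))        # 2x2 map after pooling
print(stage)
```

Pooling halves the resolution, so the exact column of the edge no longer matters within each 2x2 block, which is the small-shift invariance the slides attribute to complex cells.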


Page 51: Lecture3 xing fei-fei

Convolutional Network: Multi-Stage Trainable Architecture

• Hierarchical Architecture
  – Representations are more global, more invariant, and more abstract as we go up the layers
• Alternated Layers of Filtering and Spatial Pooling
  – Filtering detects conjunctions of features
  – Pooling computes local disjunctions of features
• Fully Trainable
  – All the layers are trainable

[Figure labels: Convolutions, Filtering; Pooling]

Page 52: Lecture3 xing fei-fei


Convolutional network architecture

Page 53: Lecture3 xing fei-fei

Convolutional Nets and other Multistage Hubel-Wiesel Architectures

• Building a complete artificial vision system:
  – Stack multiple stages of simple cells / complex cells layers
  – Higher stages compute more global, more invariant features
  – Stick a classification layer on top
  – [Fukushima 1971-1982]: neocognitron
  – [LeCun 1988-2007]: convolutional net
  – [Poggio 2002-2006]: HMAX
  – [Ullman 2002-2006]: fragment hierarchy
  – [Lowe 2006]: HMAX
• QUESTION: How do we find (or learn) the filters?

Page 54: Lecture3 xing fei-fei

Convolutional Net Training: Gradient Back-Propagation


Page 59: Lecture3 xing fei-fei

Convolutional Net Architecture for Hand-writing recognition

[Figure: feature maps per layer]
• Layer 1: 6@28x28
• Layer 2: 6@14x14
• Layer 3: 12@10x10
• Layer 4
• Layer 5: 100@1x1
• Layer 6: 10

• Convolutional net for handwriting recognition (400,000 synapses)
• Convolutional layers (simple cells): all units in a feature plane share the same weights
• Pooling/subsampling layers (complex cells): for invariance to small distortions
• Supervised gradient-descent learning using back-propagation
• The entire network is trained end-to-end. All the layers are trained simultaneously.
• [LeCun et al. Proc IEEE, 1998]
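The feature-map sizes in the figure follow from simple shape arithmetic. The sketch below is illustrative only and rests on assumptions that the transcript does not state: a 32x32 input, 5x5 convolution kernels with no padding, and 2x2 subsampling. The helper names are hypothetical.

```python
def conv_out(size, kernel):
    """Output width of an unpadded, stride-1 convolution."""
    return size - kernel + 1

def pool_out(size, window):
    """Output width after non-overlapping subsampling."""
    return size // window

size = 32  # assumed input width and height
sizes = []
for kind, k in [("conv", 5), ("pool", 2), ("conv", 5), ("pool", 2), ("conv", 5)]:
    size = conv_out(size, k) if kind == "conv" else pool_out(size, k)
    sizes.append(size)

print(sizes)  # [28, 14, 10, 5, 1]
```

Under these assumptions the widths 28, 14, 10, and 1 match Layers 1, 2, 3, and 5 in the figure, and the arithmetic suggests a 5x5 map for Layer 4, whose size is not preserved here.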


Page 60: Lecture3 xing fei-fei

MNIST Handwritten Digit Dataset

Handwritten Digit Dataset MNIST: 60,000 training samples, 10,000 test samples


Page 61: Lecture3 xing fei-fei

Results on MNIST handwritten Digits

CLASSIFIER | DEFORMATION | PREPROCESSING | ERROR (%) | Reference
linear classifier (1-layer NN) | - | none | 12.00 | LeCun et al. 1998
linear classifier (1-layer NN) | - | deskewing | 8.40 | LeCun et al. 1998
pairwise linear classifier | - | deskewing | 7.60 | LeCun et al. 1998
K-nearest-neighbors, (L2) | - | none | 3.09 | Kenneth Wilder, U. Chicago
K-nearest-neighbors, (L2) | - | deskewing | 2.40 | LeCun et al. 1998
K-nearest-neighbors, (L2) | - | deskew, clean, blur | 1.80 | Kenneth Wilder, U. Chicago
K-NN L3, 2 pixel jitter | - | deskew, clean, blur | 1.22 | Kenneth Wilder, U. Chicago
K-NN, shape context matching | - | shape context feature | 0.63 | Belongie et al. IEEE PAMI 2002
40 PCA + quadratic classifier | - | none | 3.30 | LeCun et al. 1998
1000 RBF + linear classifier | - | none | 3.60 | LeCun et al. 1998
K-NN, Tangent Distance | - | subsamp 16x16 pixels | 1.10 | LeCun et al. 1998
SVM, Gaussian Kernel | - | none | 1.40 |
SVM deg 4 polynomial | - | deskewing | 1.10 | LeCun et al. 1998
Reduced Set SVM deg 5 poly | - | deskewing | 1.00 | LeCun et al. 1998
Virtual SVM deg-9 poly | Affine | none | 0.80 | LeCun et al. 1998
V-SVM, 2-pixel jittered | - | none | 0.68 | DeCoste and Scholkopf, MLJ 2002
V-SVM, 2-pixel jittered | - | deskewing | 0.56 | DeCoste and Scholkopf, MLJ 2002
2-layer NN, 300 HU, MSE | - | none | 4.70 | LeCun et al. 1998
2-layer NN, 300 HU, MSE | Affine | none | 3.60 | LeCun et al. 1998
2-layer NN, 300 HU | - | deskewing | 1.60 | LeCun et al. 1998
3-layer NN, 500+150 HU | - | none | 2.95 | LeCun et al. 1998
3-layer NN, 500+150 HU | Affine | none | 2.45 | LeCun et al. 1998
3-layer NN, 500+300 HU, CE, reg | - | none | 1.53 | Hinton, unpublished, 2005
2-layer NN, 800 HU, CE | - | none | 1.60 | Simard et al., ICDAR 2003
2-layer NN, 800 HU, CE | Affine | none | 1.10 | Simard et al., ICDAR 2003
2-layer NN, 800 HU, MSE | Elastic | none | 0.90 | Simard et al., ICDAR 2003
2-layer NN, 800 HU, CE | Elastic | none | 0.70 | Simard et al., ICDAR 2003
Convolutional net LeNet-1 | - | subsamp 16x16 pixels | 1.70 | LeCun et al. 1998
Convolutional net LeNet-4 | - | none | 1.10 | LeCun et al. 1998
Convolutional net LeNet-5 | - | none | 0.95 | LeCun et al. 1998
Conv. net LeNet-5 | Affine | none | 0.80 | LeCun et al. 1998
Boosted LeNet-4 | Affine | none | 0.70 | LeCun et al. 1998
Conv. net, CE | Affine | none | 0.60 | Simard et al., ICDAR 2003
Conv. net, CE | Elastic | none | 0.40 | Simard et al., ICDAR 2003

Page 62: Lecture3 xing fei-fei

Invariance and Robustness to Noise


Page 63: Lecture3 xing fei-fei

Face Detection and Pose Estimation with Convolutional Nets

• Training: 52,850, 32x32 grey-level images of faces, 52,850 non-faces.

• Each sample: used 5 times with random variation in scale, in-plane rotation, brightness and contrast.

• 2nd phase: half of the initial negative set was replaced by false positives of the initial version of the detector .


Page 64: Lecture3 xing fei-fei

Face Detection: Results

Detector | Detection rates (columns: false positives per image ->)
Our Detector | 90%, 97%, 67%, 83%, 83%, 88%
Jones & Viola (tilted) | 90%, 95%, x, x
Jones & Viola (profile) | x, 70%, 83%, x
Rowley et al | 89%, 96%, x
Schneiderman & Kanade | 86%, 93%, x

Page 65: Lecture3 xing fei-fei

Face Detection and Pose Estimation: Results


Page 66: Lecture3 xing fei-fei

Face Detection with a Convolutional Net


Page 67: Lecture3 xing fei-fei

Yann LeCun

Industrial Applications of ConvNets

• AT&T/Lucent/NCR
  – Check reading, OCR, handwriting recognition (deployed 1996)
• Vidient Inc
  – Vidient Inc's “SmartCatch” system deployed in several airports and facilities around the US for detecting intrusions, tailgating, and abandoned objects (Vidient is a spin-off of NEC)
• NEC Labs
  – Cancer cell detection, automotive applications, kiosks
• Google
  – OCR, face and license plate removal from StreetView
• Microsoft
  – OCR, handwriting recognition, speech detection
• France Telecom
  – Face detection, HCI, cell phone-based applications
• Other projects: HRL (3D vision)....

Page 68: Lecture3 xing fei-fei

Machine learning in computer vision

• Aug 12, Lecture 3: Neural Network
  – Convolutional Nets for object recognition
  – Unsupervised feature learning via Deep Belief Net (slides courtesy of Honglak Lee (Stanford))

Page 69: Lecture3 xing fei-fei

Machine Learning’s Success

• Data mining
  – Web data mining
  – Biomedical data mining
  – Time series data mining
• Artificial Intelligence
  – Computer vision
  – Speech recognition
  – Autonomous car driving

However, machine learning’s success has relied on having a good feature representation of the data.

How can we develop good representations automatically?


Page 70: Lecture3 xing fei-fei

The Learning Pipeline

[Figure: input space with axes pixel 1 and pixel 2; points labeled Motorbikes and “Non”-Motorbikes]

Page 71: Lecture3 xing fei-fei

The Learning Pipeline

[Figure: input space (axes pixel 1, pixel 2) mapped through low-level features (“feature engineering”) to a feature space (e.g., a “handle” feature); points labeled Motorbikes and “Non”-Motorbikes]

Page 72: Lecture3 xing fei-fei

Computer vision features

[Figure: example features: SIFT, Spin image, Textons, GLOH]

Drawbacks of feature engineering
1. Needs expert knowledge
2. Time consuming hand-tuning

Page 73: Lecture3 xing fei-fei

Feature learning from unlabeled data

• Main idea
  – Finding underlying structure (cause) or statistical correlation from the input data.
• Sparse coding [Olshausen and Field, 1997]
  – Objective: Given input data {x}, search for a set of bases {b_j} such that $x = \sum_j a_j b_j$, where the a_j are mostly zeros.

Page 74: Lecture3 xing fei-fei

Sparse coding on images

[Figure: natural images and the learned bases, which look like “edges”]

New example:
x = 0.8 * b36 + 0.3 * b42 + 0.5 * b65

[0, 0, …, 0.8, …, 0.3, …, 0.5, …] = coefficients (feature representation)

Lee, Ng, et al. 2007

Page 75: Lecture3 xing fei-fei

Optimization

Given input data {x^(1), …, x^(m)}, we want to find good bases {b_1, …, b_n}:

$$\min_{a,b} \; \sum_i \Big\| x^{(i)} - \sum_j a_j^{(i)} b_j \Big\|^2 \;+\; \beta \sum_i \big\| a^{(i)} \big\|_1$$

• First term: reconstruction error
• Second term: sparsity penalty
• Normalization constraint: $\forall j: \|b_j\| \le 1$

Solve by alternating minimization:
-- Keep b fixed, find optimal a.
-- Keep a fixed, find optimal b.

Lee, Ng, 2006
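The objective, and the "keep b fixed, find optimal a" half of the alternation, can be sketched for one input in plain Python. This is an illustration under a strong simplifying assumption that is not in the slides: the bases are orthonormal, in which case the optimal coefficients have a closed form (soft-thresholding). Sparse coding in practice uses overcomplete bases and needs an iterative solver for this step. The toy data and all function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def objective(x, bases, coeffs, beta):
    """Reconstruction error ||x - sum_j a_j b_j||^2 plus sparsity penalty beta * ||a||_1."""
    recon = [sum(a * b[d] for a, b in zip(coeffs, bases)) for d in range(len(x))]
    err = sum((xd - rd) ** 2 for xd, rd in zip(x, recon))
    return err + beta * sum(abs(a) for a in coeffs)

def soft_threshold(c, t):
    """Shrink c toward zero by t, setting it to exactly zero when |c| <= t."""
    return math.copysign(max(abs(c) - t, 0.0), c)

def best_coeffs_orthonormal(x, bases, beta):
    """Optimal a for fixed b, valid only when the bases are orthonormal (an assumption)."""
    return [soft_threshold(dot(b, x), beta / 2.0) for b in bases]

bases = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]  # toy orthonormal bases
x = [0.8, 0.05, 0.3, 0.5]
beta = 0.2

a_sparse = best_coeffs_orthonormal(x, bases, beta)   # close to [0.7, 0.0, 0.2, 0.4]
a_dense = [dot(b, x) for b in bases]                 # exact reconstruction, no zeros
print(a_sparse)
print(objective(x, bases, a_sparse, beta), objective(x, bases, a_dense, beta))
```

The small coefficient is driven to exactly zero, and the sparse code attains a lower objective than the exact reconstruction, which is the trade-off that β controls.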


Page 76: Lecture3 xing fei-fei

Image classification

Evaluated on Caltech101 object category dataset.

Algorithm | Accuracy
Baseline (Fei-Fei et al., 2004) | 16%
PCA | 37%
Our method | 47%

36% error reduction

Previous reported results:
• Fei-Fei et al., 2004: 16%
• Berg et al., 2005: 17%
• Holub et al., 2005: 40%
• Serre et al., 2005: 35%
• Berg et al., 2005: 48%
• Lazebnik et al., 2006: 64%
• Varma et al., 2008: 78%

[Figure labels: Input; Image; Features (coefficients); Learned]

Page 77: Lecture3 xing fei-fei

Learning Feature Hierarchy

[Figure: Input image (pixels) → “Sparse coding” (edges) → Higher layer (combinations of edges)]

DBN (Hinton et al., 2006) with additional sparseness constraint.

[Related work: Hinton, Bengio, LeCun, and others.]

Page 78: Lecture3 xing fei-fei

Restricted Boltzmann Machine (RBM)

– Undirected, bipartite graphical model

– Inference is easy

– Training by approximating maximum likelihood (Contrastive Divergence [Hinton, 2002])

[Figure: visible nodes (data) v connected to hidden nodes h]

Weights W (bases) encode the statistical relationship between h and v.

Page 79: Lecture3 xing fei-fei

Deep Belief Network

• Deep Belief Network (DBN) [Hinton et al., 2006]
  – Generative model with multiple hidden layers
  – Successful applications
    • Recognizing handwritten digits
    • Learning motion capture data
    • Collaborative filtering
  – Bottom-up, layer-wise training using Restricted Boltzmann machines

[Figure: layers stacked above the visible nodes (data)]

Page 80: Lecture3 xing fei-fei

Sparse Restricted Boltzmann Machines [NIPS-2008]

• Main idea
  – Constrain the hidden layer nodes to have “sparse” average activation (similar to sparse coding)
  – Regularize with a sparsity penalty

[Equation labels: log-likelihood; sparsity penalty, which involves the average activation and the target sparsity]

Page 81: Lecture3 xing fei-fei

Sparse representation from digits [NIPS 2008]

[Figure: training examples and the learned sparse RBM bases (“pen-strokes”)]

Sparse RBMs often give readily interpretable features and good discriminative power.

Page 82: Lecture3 xing fei-fei

Learning object representations

• Learning objects and parts in images

• Large image patches contain interesting higher-level structures.
  – E.g., object parts and full objects

Page 83: Lecture3 xing fei-fei

Applying DBN to large image

[Figure: a DBN applied to an input image]

Problem:
- Typically, input dimension ~ 1,000 (30x30 pixels)
- Computationally intractable to learn from realistic image sizes (e.g. 200x200 pixels)

Page 84: Lecture3 xing fei-fei

Convolutional architectures

• Weight sharing by convolution (e.g., [Lecun et al., 1989])
• “Max-pooling”
  – Invariance
  – Computational efficiency
  – Deterministic and feed-forward
• We develop convolutional Restricted Boltzmann machine (CRBM).
• We define probabilistic max-pooling that combines bottom-up and top-down information.

[Figure: a convolution filter produces a detection layer; the maximum over each 2x2 grid produces a max-pooling layer; a second detection layer and max-pooling layer follow]

Page 85: Lecture3 xing fei-fei


Convolutional RBM (CRBM) [ICML 2009]

[Figure: input data V (visible layer), detection layer H, max-pooling layer P]

• Detection layer H: hidden nodes (binary)
• “Filter” weights (shared)
• Max-pooling layer P: “max-pooling” node (binary)

For “filter” k,

At most one hidden node is active.

Page 86: Lecture3 xing fei-fei

Probabilistic max pooling

X_j are stochastic binary and mutually exclusive.

[Figure: a pooling node above detection nodes X1, X2, X3, X4]

Configurations of (X1, X2, X3, X4):
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
0 0 0 0

Collapse 2^n configurations into n+1 configurations. Permits bottom-up and top-down inference.

Page 87: Lecture3 xing fei-fei

Probabilistic max pooling

Bottom-up inference

[Figure: a pooling node above detection nodes X1, X2, X3, X4, which receive inputs I1, I2, I3, I4 (the output of convolution W*V from below)]

Probability can be written as a softmax function. Sample from a multinomial distribution.
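Bottom-up inference for one pooling block can be sketched as a softmax over n+1 states: each detection node being the single active one, plus the state where all are off. The sketch below is an illustration, not the authors' code; it assumes the "all off" state has input 0, and the function names and example inputs are hypothetical.

```python
import math
import random

def block_probabilities(inputs):
    """inputs: bottom-up signals I_j for the detection nodes of one pooling block.
    Returns (p_on, p_off) with p_on[j] = P(X_j = 1) and p_off = P(all X_j = 0)."""
    m = max(0.0, max(inputs))              # subtract the max for numerical stability
    on = [math.exp(i - m) for i in inputs]
    off = math.exp(0.0 - m)                # assumed: the "all off" state has input 0
    z = off + sum(on)
    return [e / z for e in on], off / z

def sample_block(inputs, rng):
    """Draw one of the n+1 allowed configurations; the pooling node is the max of the block."""
    p_on, _ = block_probabilities(inputs)
    states = [0] * len(inputs)
    u, acc = rng.random(), 0.0
    for j, p in enumerate(p_on):
        acc += p
        if u < acc:
            states[j] = 1
            break
    return states, max(states)

p_on, p_off = block_probabilities([0.0, 0.0, 0.0, 0.0])
print(p_on, p_off)  # five equally likely states, so each probability is about 0.2

states, pool = sample_block([2.0, -1.0, 0.5, 0.0], random.Random(0))
print(states, pool)
```

Because at most one detection node is on, a block of n nodes has n+1 configurations rather than 2^n, and the pooling node is on exactly when some detection node is.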


Page 88: Lecture3 xing fei-fei

Convolutional Deep Belief Networks

• Bottom-up (greedy), layer-wise training
  – Train one layer (convolutional RBM) at a time.
• Inference (approximate)
  – Undirected connections for all layers (Markov net) [Related work: Salakhutdinov and Hinton, 2009]
  – Block Gibbs sampling or mean-field
  – Hierarchical probabilistic inference

Page 89: Lecture3 xing fei-fei

Unsupervised learning of object-parts

Faces Cars Elephants Chairs


Page 90: Lecture3 xing fei-fei

Object category classification (Caltech 101)

• Our model is comparable to the results using state-of-the-art features (e.g., SIFT).

[Table: our method, CDBN (first layer) and CDBN (first+second layer), compared with Ranzato et al. (2007), Mutch and Lowe (2006), Lazebnik et al. (2006), and Zhang et al. (2006)]

Page 91: Lecture3 xing fei-fei

Unsupervised learning of object-parts

Trained from multiple classes (cars, faces, motorbikes, airplanes)

[Figure: object-specific features and shared features; “grouping” the object parts (highly specific)]

Page 92: Lecture3 xing fei-fei

Review of main contributions

• Developed efficient algorithms for unsupervised feature learning.
• Showed that unsupervised feature learning is useful for many machine learning tasks.
  – Object recognition, image segmentation, audio classification, text classification, robotic perception, and others.

[Figure labels: Object parts; “Filling in”]

Page 93: Lecture3 xing fei-fei

Weaknesses & Criticisms

• Learning everything. Better to encode prior knowledge about structure of images.

A: Compare with machine learning vs. linguists debate in NLP.

• Results not yet competitive with best engineered systems.

A: Agreed. True for some domains.
