Page 1
Analysis of Single-Layer Networks
Presented by Hourieh Fakourfar
Adam Coates [email protected]
Honglak Lee [email protected]
Andrew Y. Ng [email protected]
Computer Science Department, Stanford University, Stanford, CA 94305, USA
An Analysis of Single-Layer Networks in Unsupervised Feature Learning
Page 2
Agenda
• Introduction
• Unsupervised Learning
• Feature Extraction
• Classification
• Experiments & Analysis
• Conclusion
Page 3
Motivation
• Achieve state-of-the-art performance with simple algorithms and a single layer of features
• Avoid complexity and expense without compromising performance
Page 5
Single-Layer Network
• Maps the n-dimensional input space to the m-dimensional output space
• Widely used for linearly separable problems
Input layer Output layer
http://wwwold.ece.utep.edu/
Page 6
Multi-layer Network
• One or more hidden layers
• Higher level of computation
• More complexity
• Cost efficient?
• Performance?
Page 8
Unsupervised Learning
• Find hidden structure in unlabeled data
▫ Learn feature representations from unlabeled data
• Unlike supervised learning, there is no error or reward signal to evaluate a potential solution
• Approaches to unsupervised learning include:
▫ Clustering: k-means, mixture models, hierarchical clustering
▫ Blind signal separation using feature-extraction techniques for dimensionality reduction: PCA, independent component analysis, non-negative matrix factorization, singular value decomposition
Page 10
Benchmark Datasets
• CIFAR-10
• NORB
• STL-10
http://www.idsia.ch/~juergen/vision.html
http://www.stanford.edu/~acoates//stl10/
Page 11
CIFAR-10
• 60000 32x32 color images
▫ 10 classes
▫ 6000 images per class
• 50000 training
• 10000 test images
CIFAR-10: 10 classes with 10 random images from each
http://www.cs.toronto.edu/~kriz/cifar.html
Page 12
NORB
• Images of 50 toys
• 5 generic categories
▫ Four-legged animals
▫ Human figures
▫ Airplanes
▫ Trucks
▫ Cars
• Training set:
▫ 5 instances of each category (instances 4, 6, 7, 8 and 9)
• Test set:
▫ 5 instances of each category (instances 0, 1, 2, 3, and 5)
Page 13
STL-10
• 10 classes
▫ airplane, bird, car, cat, deer, dog, horse, monkey, ship, truck.
• Image size
▫ 96x96 pixels
• Training images
▫ 500 per class (10 pre-defined folds)
• Test images
▫ 800 per class
• 100000 unlabeled images for unsupervised learning
Page 15
Learning Framework
Extract patches
Extract random patches from unlabeled training images.
Pre-processing
Apply a pre-processing stage to the patches.
Feature-mapping
Learn a feature-mapping using an unsupervised learning algorithm.
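As a rough sketch of the patch-extraction step (a NumPy illustration; the function name and array layout are assumptions, not taken from the paper):

import numpy as np

def extract_random_patches(images, num_patches, w):
    """Draw num_patches random w-by-w sub-patches from a set of images.
    images: array of shape (num_images, n, n, channels).
    Returns an array of shape (num_patches, w * w * channels)."""
    num_images, n, _, channels = images.shape
    patches = np.empty((num_patches, w * w * channels))
    for i in range(num_patches):
        img = images[np.random.randint(num_images)]
        r = np.random.randint(n - w + 1)   # random top-left corner
        c = np.random.randint(n - w + 1)
        patches[i] = img[r:r + w, c:c + w, :].reshape(-1)
    return patches

The pre-processing and feature-mapping steps are sketched on the following slides.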
Page 16
Preprocessing
Normalization (Z-scores)
• For images/visual data:
▫ Local brightness normalization
▫ Contrast normalization
• Mean subtraction and scale normalization
Whitening
• Common preprocessing technique
• Decorrelates the data
• Observed vector x is linearly transformed to a new vector
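A minimal sketch of the per-patch normalization (mean subtraction and contrast scaling); the small constant added to the variance is an illustrative choice:

import numpy as np

def normalize_patches(patches, eps=10.0):
    """Per-patch brightness and contrast normalization.
    patches: (num_patches, dim). eps guards against division by zero
    for near-constant patches."""
    mean = patches.mean(axis=1, keepdims=True)
    std = np.sqrt(patches.var(axis=1, keepdims=True) + eps)
    return (patches - mean) / std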
Page 17
Whitening
• Observed vector x is linearly transformed to a new vector whose:
▫ Components are uncorrelated
▫ Variances are equal to unity
• The new (ZCA-whitened) vector is defined by:
▫ x_white = V (D + ε I)^(-1/2) V' x, where V D V' is the eigendecomposition of the covariance of x and ε is a small constant
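A minimal ZCA-whitening sketch under the definitions above (the regularizer eps and the function name are illustrative):

import numpy as np

def zca_whiten(patches, eps=0.1):
    """ZCA whitening: transform patches so components are uncorrelated
    with (approximately) unit variance. patches: (num_patches, dim)."""
    mean = patches.mean(axis=0)
    centered = patches - mean
    cov = np.cov(centered, rowvar=False)
    d, V = np.linalg.eigh(cov)                     # eigendecomposition of the covariance
    W = V @ np.diag(1.0 / np.sqrt(d + eps)) @ V.T  # ZCA whitening matrix
    return centered @ W, mean, W                   # keep mean and W to whiten new data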
Page 18
Feature Mapping
• The unsupervised learning algorithm yields a feature mapping f : R^N -> R^K that takes an N-dimensional input patch to a K-dimensional feature vector
Page 19
Feature Extraction & Classification
Framework
Extract features from equally spaced sub-patches covering the input image
Pool features over local regions to reduce dimensionality
Training and prediction with a linear classifier
Page 20
Feature extraction and classification
• Reduce dimensionality:
▫ Feature mapping yields an (n-w+1)-by-(n-w+1)-by-K image representation
▫ Pool by summing over local regions of y(ij)
▫ Split y(ij) into 4 equal-sized quadrants
▫ Compute the sum of y(ij) in each quadrant
▫ Obtain a 4K-dimensional feature vector for each training image and its label
• Apply (L2) SVM classification
▫ Regularization parameters determined using cross-validation
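A sketch of the quadrant sum-pooling described above, assuming the feature map y(ij) is stored as a (rows, cols, K) NumPy array:

import numpy as np

def quadrant_pool(feature_map):
    """Sum-pool a (rows, cols, K) feature map over its four quadrants,
    producing the 4K-dimensional vector fed to the linear classifier."""
    rows, cols, K = feature_map.shape
    r, c = rows // 2, cols // 2
    quads = [feature_map[:r, :c], feature_map[:r, c:],
             feature_map[r:, :c], feature_map[r:, c:]]
    return np.concatenate([q.sum(axis=(0, 1)) for q in quads])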
Page 22
Feature Learning Algorithms
• Sparse auto-encoder
• Sparse RBMs
• K-means clustering
• Gaussian Mixtures
Page 23
Sparse Auto-encoder
• Feature mapping: f(x) = g(Wx + b)
where g(z) = 1 / (1 + exp(-z)) is the logistic sigmoid function, applied component-wise to the vector z
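A minimal sketch of this feature mapping (the weights W and biases b are assumed to come from an already-trained sparse autoencoder):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_features(X, W, b):
    """Feature mapping f(x) = g(Wx + b).
    X: (num_patches, N), W: (K, N), b: (K,). Returns (num_patches, K)."""
    return sigmoid(X @ W.T + b)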
Page 24
Sparse Auto-encoder
• A nice way to do non-linear dimensionality reduction:
▫ They provide mappings in both directions (encoding and decoding)
▫ Learning time and memory scale linearly with the size of the training set
• Dimensionality reduction facilitates the classification, visualization, communication, and storage of high-dimensional data.
[Diagram: N-dimensional input image mapped via learned parameters W, b to a K-dimensional image representation]
Page 25
Sparse Restricted Boltzmann Machine (RBM)
• Particular form of log-linear Markov Random Field (MRF)
• Energy function is linear in its free parameters:
▫ E(v,h) = - b'v - c'h - h'Wv
▫ W: weights connecting hidden and visible units
▫ b, c: offsets of the visible and hidden layers, respectively
• Sparsity penalty as in the autoencoder
[Diagram: RBM with m = 3 hidden units and n = 4 visible units]
http://deeplearning.net/tutorial/rbm.html
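A small sketch of the energy function above (array shapes are assumptions for illustration):

import numpy as np

def rbm_energy(v, h, W, b, c):
    """Energy of an RBM configuration: E(v, h) = -b'v - c'h - h'Wv.
    v: (n,) visible units, h: (m,) hidden units, W: (m, n), b: (n,), c: (m,)."""
    return -(b @ v) - (c @ h) - (h @ W @ v)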
Page 26
K-means Clustering
• Standard 1-of-K, hard-assignment coding: f_k(x) = 1 if k = argmin_j ||x - c(j)||_2, and 0 otherwise
• A non-linear, distance-based mapping that attempts a "softer" encoding: the "triangle" activation f_k(x) = max{0, μ(z) - z_k}, where z_k = ||x - c(k)||_2 and μ(z) is the mean of the elements of z
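A sketch of both K-means encodings, assuming the centroids have already been learned (function and variable names are illustrative):

import numpy as np

def kmeans_features(X, centroids, hard=False):
    """K-means feature mapping.
    X: (num_patches, N), centroids: (K, N). Returns (num_patches, K).
    hard=True:  1-of-K coding (1 for the closest centroid, 0 elsewhere).
    hard=False: "triangle" soft coding f_k(x) = max(0, mean(z) - z_k),
                where z_k is the distance from x to centroid k."""
    z = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # (num_patches, K)
    if hard:
        f = np.zeros_like(z)
        f[np.arange(len(X)), z.argmin(axis=1)] = 1.0
        return f
    return np.maximum(0.0, z.mean(axis=1, keepdims=True) - z)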
Page 28
Gaussian Mixtures Model (GMM)
• Represents the density of the inputs as a mixture of K Gaussian distributions
• The feature mapping f maps each input to the posterior membership probabilities of the K components
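A sketch of this posterior mapping for a mixture of K Gaussians (diagonal covariances are a common restriction in practice; the full-covariance form, names, and shapes here are illustrative):

import numpy as np

def gmm_features(X, means, covariances, priors):
    """Posterior membership probabilities under a K-component GMM.
    X: (num_patches, N), means: (K, N), covariances: (K, N, N),
    priors: (K,). Returns (num_patches, K)."""
    num_patches, N = X.shape
    K = len(priors)
    log_p = np.empty((num_patches, K))
    for k in range(K):
        diff = X - means[k]
        cov_inv = np.linalg.inv(covariances[k])
        _, logdet = np.linalg.slogdet(covariances[k])
        maha = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)  # Mahalanobis distances
        log_p[:, k] = np.log(priors[k]) - 0.5 * (N * np.log(2 * np.pi) + logdet + maha)
    log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)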
Page 30
Parameters
• Evaluate and assess the effects of change in the following parameters:
▫ Whitened or raw image
▫ Number of features K
▫ Stride size s
▫ Receptive field size w
Page 31
Testing Procedure
• For each unsupervised learning algorithm
▫ Train a single-layer of features
Whitened or raw
Choice of the parameters K, s, and w
▫ Then train a linear classifier
On a holdout set (main analysis)
On test set (for final results)
Page 32
K-means
• Whitening is a crucial pre-processing step, since the clustering algorithm cannot handle the correlations in the data
[Learned bases: without whitening (left) vs. with whitening (right)]
Page 33
GMM
• Whitening is a crucial pre-processing step, since the clustering algorithm cannot handle the correlations in the data
[Learned bases: without whitening (left) vs. with whitening (right)]
Page 34
Sparse Auto-encoder
The effect here is somewhat ambiguous
[Learned bases: without whitening (left) vs. with whitening (right)]
Page 35
Sparse RBM
[Learned bases: without whitening (left) vs. with whitening (right)]
The effect here is somewhat ambiguous
Page 36
Performance for Raw and Whitened Inputs
• Feature representations learned with K = 100, 200, 400, 800, 1200, and 1600
• As expected, all algorithms achieved higher performance by learning more features
Page 37
Performance vs. Feature Stride
• “Stride” s is the spacing between patches where feature values will be extracted
• Number of features fixed at 1600
• Receptive field size of 6 pixels
• Stride varied over 1, 2, 4, and 8
Page 38
Effect of Receptive Field
• Stride = 1; 1600 bases; whitening
• Tested receptive field sizes w = 6, 8, and 12 pixels
• Overall, the 6-pixel receptive field worked best
• 12 pixels performed similarly to or worse than 6 or 8 pixels
• Unlike the other parameters, the receptive field size requires cross-validation to make an informed choice
Page 39
Final Classification Results
Page 40
Conclusions
Mean subtraction, scale normalization, and whitening
+ Large K (#of features)
+ Small s (step size or “stride”)
+ Right patch size w (receptive field size)
+ Simple feature learning algorithm (soft K-means)
=
State-of-the-art results on CIFAR-10 and NORB