Page 1
Analysis of Single-Layer Networks
Presented by Hourieh Fakourfar
Adam Coates [email protected]
Honglak Lee [email protected]
Andrew Y. Ng [email protected]
Computer Science Department, Stanford University, Stanford, CA 94305, USA
An Analysis of Single-Layer Networks in Unsupervised Feature Learning
Page 2
Agenda
• Introduction
• Unsupervised Learning
• Feature Extraction
• Classification
• Experiments & Analysis
• Conclusion
Page 3
Motivation
• Achieve state-of-the-art performance with simple algorithms and a single layer of features
• Avoid complexity and expense without compromising performance
Page 5
Single-Layer Network
• Maps the n-dimensional input space to the m-dimensional output space
• Widely used for linearly separable problems
Input layer Output layer
http://wwwold.ece.utep.edu/
Page 6
Multi-layer Network
• One or more hidden layers
• Higher level of computation
• More complexity
• Cost efficient?
• Performance?
Page 8
Unsupervised Learning
• Find hidden structure in unlabeled data
▫ Learn feature representations from unlabeled data
• Unlike supervised learning, there is no error or reward signal to evaluate a potential solution
• Approaches to unsupervised learning include:
▫ Clustering: k-means, mixture models, hierarchical clustering
▫ Blind signal separation using feature-extraction techniques for dimensionality reduction: PCA, independent component analysis, non-negative matrix factorization, singular value decomposition
Page 10
Benchmark Datasets
• CIFAR-10
• NORB
• STL-10
http://www.idsia.ch/~juergen/vision.html
http://www.stanford.edu/~acoates//stl10/
Page 11
CIFAR-10
• 60000 32x32 color images
▫ 10 classes
▫ 6000 images per class
• 50000 training
• 10000 test images
CIFAR-10: 10 classes with 10 random images from each
http://www.cs.toronto.edu/~kriz/cifar.html
Page 12
NORB
• Images of 50 toys
• 5 generic categories
▫ Four-legged animals
▫ Human figures
▫ Airplanes
▫ Trucks
▫ Cars
• Training set:
▫ 5 instances of each category (instances 4, 6, 7, 8 and 9)
• Test set:
▫ 5 instances of each category (instances 0, 1, 2, 3, and 5)
Page 13
STL-10
• 10 classes
▫ airplane, bird, car, cat, deer, dog, horse, monkey, ship, truck.
• Image size
▫ 96x96 pixels
• Training images
▫ 500 per class (10 pre-defined folds)
• Test images
▫ 800 per class
• 100000 unlabeled images for unsupervised learning
Page 15
Learning Framework
Extract patches
Extract random patches from unlabeled training images.
Pre-processing
Apply a pre-processing stage to the patches.
Feature-mapping
Learn a feature-mapping using an unsupervised learning algorithm.
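As a rough sketch of the patch-extraction step (a NumPy illustration; the function name and array layout are assumptions, not taken from the paper):

import numpy as np

def extract_random_patches(images, num_patches, w):
    """Draw num_patches random w-by-w sub-patches from a set of images.
    images: array of shape (num_images, n, n, channels).
    Returns an array of shape (num_patches, w * w * channels)."""
    num_images, n, _, channels = images.shape
    patches = np.empty((num_patches, w * w * channels))
    for i in range(num_patches):
        img = images[np.random.randint(num_images)]
        r = np.random.randint(n - w + 1)   # random top-left corner
        c = np.random.randint(n - w + 1)
        patches[i] = img[r:r + w, c:c + w, :].reshape(-1)
    return patches

The pre-processing and feature-mapping steps are sketched on the following slides.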
Page 16
Preprocessing
Normalization (Z-scores)
• For images/visual data:
▫ Local brightness normalization
▫ Contrast normalization
• Mean subtraction and scale normalization
Whitening
• Common preprocessing technique
• Decorrelates the data
• Observed vector x is linearly transformed to a new vector
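A minimal sketch of the per-patch normalization (mean subtraction and contrast scaling); the small constant added to the variance is an illustrative choice:

import numpy as np

def normalize_patches(patches, eps=10.0):
    """Per-patch brightness and contrast normalization.
    patches: (num_patches, dim). eps guards against division by zero
    for near-constant patches."""
    mean = patches.mean(axis=1, keepdims=True)
    std = np.sqrt(patches.var(axis=1, keepdims=True) + eps)
    return (patches - mean) / std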
Page 17
Whitening
• Observed vector x is linearly transformed to a new vector whose:
▫ Components are uncorrelated
▫ Variances are equal to unity
• The new (ZCA-whitened) vector is defined by:
▫ x_white = V (D + ε I)^(-1/2) V' x, where V D V' is the eigendecomposition of the covariance of x and ε is a small constant
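A minimal ZCA-whitening sketch under the definitions above (the regularizer eps and the function name are illustrative):

import numpy as np

def zca_whiten(patches, eps=0.1):
    """ZCA whitening: transform patches so components are uncorrelated
    with (approximately) unit variance. patches: (num_patches, dim)."""
    mean = patches.mean(axis=0)
    centered = patches - mean
    cov = np.cov(centered, rowvar=False)
    d, V = np.linalg.eigh(cov)                     # eigendecomposition of the covariance
    W = V @ np.diag(1.0 / np.sqrt(d + eps)) @ V.T  # ZCA whitening matrix
    return centered @ W, mean, W                   # keep mean and W to whiten new data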
Page 18
Feature Mapping
• The unsupervised learning algorithm yields a feature mapping f : R^N -> R^K that takes an N-dimensional input patch to a K-dimensional feature vector
Page 19
Feature Extraction & Classification
Framework
Extract features from equally spaced sub-patches covering the input image
Pool features over local regions to reduce dimensionality
Training and prediction with a linear classifier
Page 20
Feature extraction and classification
• Reduce dimensionality:
▫ Feature mapping yields an (n-w+1)-by-(n-w+1)-by-K image representation
▫ Pool by summing over local regions of y(ij)
▫ Split y(ij) into 4 equal-sized quadrants
▫ Compute the sum of y(ij) in each quadrant
▫ Obtain a 4K-dimensional feature vector for each training image and its label
• Apply (L2) SVM classification
▫ Regularization parameters determined using cross-validation
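A sketch of the quadrant sum-pooling described above, assuming the feature map y(ij) is stored as a (rows, cols, K) NumPy array:

import numpy as np

def quadrant_pool(feature_map):
    """Sum-pool a (rows, cols, K) feature map over its four quadrants,
    producing the 4K-dimensional vector fed to the linear classifier."""
    rows, cols, K = feature_map.shape
    r, c = rows // 2, cols // 2
    quads = [feature_map[:r, :c], feature_map[:r, c:],
             feature_map[r:, :c], feature_map[r:, c:]]
    return np.concatenate([q.sum(axis=(0, 1)) for q in quads])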
Page 22
Feature Learning Algorithms
• Sparse auto-encoder
• Sparse RBMs
• K-means clustering
• Gaussian Mixtures
Page 23
Sparse Auto-encoder
• Feature mapping: f(x) = g(Wx + b)
where g(z) = 1 / (1 + exp(-z)) is the logistic sigmoid function, applied component-wise to the vector z
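A minimal sketch of this feature mapping (the weights W and biases b are assumed to come from an already-trained sparse autoencoder):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_features(X, W, b):
    """Feature mapping f(x) = g(Wx + b).
    X: (num_patches, N), W: (K, N), b: (K,). Returns (num_patches, K)."""
    return sigmoid(X @ W.T + b)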
Page 24
Sparse Auto-encoder
• A nice way to do non-linear dimensionality reduction:
▫ They provide mappings in both directions (encoding and decoding)
▫ Learning time and memory scale linearly with the size of the training set
• Dimensionality reduction facilitates the classification, visualization, communication, and storage of high-dimensional data.
[Diagram: N-dimensional input image mapped via learned parameters W, b to a K-dimensional image representation]
Page 25
Sparse Restricted Boltzmann Machine (RBM)
• Particular form of log-linear Markov Random Field (MRF)
• Energy function is linear in its free parameters:
▫ E(v,h) = - b'v - c'h - h'Wv
▫ W: weights connecting hidden and visible units
▫ b, c: offsets of the visible and hidden layers, respectively
• Sparsity penalty as in the autoencoder
[Diagram: RBM with m = 3 hidden units and n = 4 visible units]
http://deeplearning.net/tutorial/rbm.html
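A small sketch of the energy function above (array shapes are assumptions for illustration):

import numpy as np

def rbm_energy(v, h, W, b, c):
    """Energy of an RBM configuration: E(v, h) = -b'v - c'h - h'Wv.
    v: (n,) visible units, h: (m,) hidden units, W: (m, n), b: (n,), c: (m,)."""
    return -(b @ v) - (c @ h) - (h @ W @ v)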
Page 26
K-means Clustering
• Standard 1-of-K, hard-assignment coding: f_k(x) = 1 if k = argmin_j ||x - c(j)||_2, and 0 otherwise
• A non-linear, distance-based mapping that attempts a "softer" encoding: the "triangle" activation f_k(x) = max{0, μ(z) - z_k}, where z_k = ||x - c(k)||_2 and μ(z) is the mean of the elements of z
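A sketch of both K-means encodings, assuming the centroids have already been learned (function and variable names are illustrative):

import numpy as np

def kmeans_features(X, centroids, hard=False):
    """K-means feature mapping.
    X: (num_patches, N), centroids: (K, N). Returns (num_patches, K).
    hard=True:  1-of-K coding (1 for the closest centroid, 0 elsewhere).
    hard=False: "triangle" soft coding f_k(x) = max(0, mean(z) - z_k),
                where z_k is the distance from x to centroid k."""
    z = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # (num_patches, K)
    if hard:
        f = np.zeros_like(z)
        f[np.arange(len(X)), z.argmin(axis=1)] = 1.0
        return f
    return np.maximum(0.0, z.mean(axis=1, keepdims=True) - z)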
Page 28
Gaussian Mixtures Model (GMM)
• Represents the density of the inputs as a mixture of K Gaussian distributions
• The feature mapping f maps each input to the posterior membership probabilities of the K components
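A sketch of this posterior mapping for a mixture of K Gaussians (diagonal covariances are a common restriction in practice; the full-covariance form, names, and shapes here are illustrative):

import numpy as np

def gmm_features(X, means, covariances, priors):
    """Posterior membership probabilities under a K-component GMM.
    X: (num_patches, N), means: (K, N), covariances: (K, N, N),
    priors: (K,). Returns (num_patches, K)."""
    num_patches, N = X.shape
    K = len(priors)
    log_p = np.empty((num_patches, K))
    for k in range(K):
        diff = X - means[k]
        cov_inv = np.linalg.inv(covariances[k])
        _, logdet = np.linalg.slogdet(covariances[k])
        maha = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)  # Mahalanobis distances
        log_p[:, k] = np.log(priors[k]) - 0.5 * (N * np.log(2 * np.pi) + logdet + maha)
    log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)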
Page 30
Parameters
• Evaluate and assess the effects of change in the following parameters:
▫ Whitened or raw image
▫ Number of features K
▫ Stride size s
▫ Receptive field size w
Page 31
Testing Procedure
• For each unsupervised learning algorithm
▫ Train a single-layer of features
Whitened or raw
Choice of the parameters K, s, and w
▫ Then train a linear classifier
On a holdout set (main analysis)
On test set (for final results)
Page 32
K-means
• Whitening is a crucial pre-processing step, since the clustering algorithm cannot handle the correlations in the data
[Learned bases: without whitening (left) vs. with whitening (right)]
Page 33
GMM
• Whitening is a crucial pre-processing step, since the clustering algorithm cannot handle the correlations in the data
[Learned bases: without whitening (left) vs. with whitening (right)]
Page 34
Sparse Auto-encoder
The effect here is somewhat ambiguous
[Learned bases: without whitening (left) vs. with whitening (right)]
Page 35
Sparse RBM
[Learned bases: without whitening (left) vs. with whitening (right)]
The effect here is somewhat ambiguous
Page 36
Performance for Raw and Whitened Inputs
• Feature representations learned with K = 100, 200, 400, 800, 1200, and 1600
• As expected, all algorithms achieved higher performance by learning more features
Page 37
Performance vs. Feature Stride
• “Stride” s is the spacing between patches where feature values will be extracted
• Number of features fixed at 1600
• Receptive field size of 6 pixels
• Stride varied over 1, 2, 4, and 8
Page 38
Effect of Receptive Field
• Stride = 1; 1600 bases; whitening
• Tested receptive field sizes w = 6, 8, and 12 pixels
• Overall, the 6-pixel receptive field worked best
• 12 pixels performed similarly to or worse than 6 or 8 pixels
• Unlike the other parameters, the receptive field size requires cross-validation to make an informed choice
Page 39
Final Classification Results
Page 40
Conclusions
Mean subtraction, scale normalization, and whitening
+ Large K (#of features)
+ Small s (step size or “stride”)
+ Right patch size w (receptive field size)
+ Simple feature learning algorithm (soft K-means)
=
State-of-the-art results on CIFAR-10 and NORB