Page 1

CVPR12 Tutorial on Deep Learning

Sparse Coding

Kai Yu

[email protected]

Department of Multimedia, Baidu

Page 2

Relentless research on visual recognition


Caltech 101

PASCAL VOC

80 Million Tiny Images

ImageNet

Page 3

The pipeline of machine visual perception


Low-level sensing

Pre-processing

Feature extract.

Feature selection

Inference: prediction, recognition

• Most critical for accuracy
• Accounts for most of the computation at test time
• Most time-consuming in the development cycle
• Often hand-crafted in practice

Most Efforts in Machine Learning

Page 4

Computer vision features

SIFT, Spin image, HoG, RIFT, GLOH

Slide credit: Andrew Ng

Page 5

Learning features from data


Low-level sensing

Pre-processing

Feature extract.

Feature selection

Inference: prediction, recognition

Feature Learning: instead of designing features, let's design feature learners

Machine Learning

Page 6

Learning features from data via sparse coding


Low-level sensing

Pre-processing

Feature extract.

Feature selection

Inference: prediction, recognition

Sparse coding offers an effective building block to learn useful features

Page 7

Outline

1. Sparse coding for image classification

2. Understanding sparse coding

3. Hierarchical sparse coding

4. Other topics: e.g. structured model, scale-up, discriminative training

5. Summary


Page 8

“BoW representation + SPM” Paradigm - I

Figure credit: Fei-Fei Li

Bag-of-visual-words representation (BoW) based on VQ coding

Page 9

“BoW representation + SPM” Paradigm - II

Figure credit: Svetlana Lazebnik

Spatial pyramid matching: pooling at different scales and locations (a pooling sketch follows)
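To make the pooling step concrete, here is a minimal Python sketch of spatial-pyramid pooling, assuming numpy; the function name spm_pool, the grid levels, and the pooling operator are illustrative choices, not code from the tutorial.

```python
import numpy as np

def spm_pool(codes, xy, levels=(1, 2, 4), pool=np.mean):
    """Pool per-descriptor codes over a spatial pyramid.

    codes: (n, k) array, one code per local descriptor
    xy:    (n, 2) descriptor coordinates, normalized to [0, 1)
    Returns one pooled k-vector per grid cell, concatenated.
    """
    feats = []
    k = codes.shape[1]
    for g in levels:                                    # 1x1, 2x2, 4x4 grids
        cell = np.minimum((xy * g).astype(int), g - 1)  # cell index per point
        idx = cell[:, 0] * g + cell[:, 1]
        for c in range(g * g):
            members = codes[idx == c]
            feats.append(pool(members, axis=0) if len(members)
                         else np.zeros(k))
    return np.concatenate(feats)                        # (21 * k,) for (1, 2, 4)
```

With VQ codes and pool=np.mean this reduces to the per-cell histograms of the BoW pipeline.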

Page 10

Image Classification using “BoW + SPM”


Dense SIFT → VQ Coding → Spatial Pooling → Classifier

Page 11

The Architecture of “Coding + Pooling”

• e.g., convolutional neural net, HMAX, BoW, …

Coding → Pooling → Coding → Pooling

Page 12

“BoW+SPM” has two coding+pooling layers

Local Gradients → Pooling (e.g., SIFT, HOG) → VQ Coding → Average Pooling (obtain histogram) → SVM

The SIFT feature itself follows a coding+pooling operation.

Page 13

Develop better coding methods

Better Coding → Better Pooling → Better Coding → Better Pooling → Better Classifier

- Coding: a nonlinear mapping of data into another feature space
- Better coding methods: sparse coding, RBMs, auto-encoders

Page 14

What is sparse coding?

Sparse coding (Olshausen & Field, 1996) was originally developed to explain early visual processing in the brain (edge detection).

Training: given a set of random patches x, learn a dictionary of bases [Φ1, Φ2, …]

Coding: for a data vector x, solve the LASSO problem min_a ‖x − Σ_j a_j Φ_j‖² + λ‖a‖₁ to find the sparse coefficient vector a

Page 15

Sparse coding: training time

Input: image patches x1, x2, …, xm (each in R^d)
Learn: a dictionary of bases Φ1, …, Φk (each also in R^d)

Alternating optimization:

1. Fix the dictionary Φ1, …, Φk, optimize the activations a (a standard LASSO problem)

2. Fix the activations a, optimize the dictionary Φ1, …, Φk (a convex QP problem)
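A minimal sketch of this alternating scheme, assuming numpy and scikit-learn; the function name, dictionary size, and penalty value are illustrative, not the tutorial's settings.

```python
import numpy as np
from sklearn.linear_model import Lasso

def learn_dictionary(X, k=64, lam=0.1, n_iters=20, seed=0):
    """X: (m, d) matrix of image patches, one patch per row."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((k, X.shape[1]))
    D /= np.linalg.norm(D, axis=1, keepdims=True)        # unit-norm bases
    for _ in range(n_iters):
        # Step 1: fix the dictionary, solve one LASSO problem per patch
        # (all patches at once via multi-target Lasso).
        lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=2000)
        A = lasso.fit(D.T, X.T).coef_                    # (m, k) sparse codes
        # Step 2: fix the activations, solve the convex least-squares
        # problem for the dictionary, then re-normalize each basis.
        D, *_ = np.linalg.lstsq(A, X, rcond=None)
        D /= np.maximum(np.linalg.norm(D, axis=1, keepdims=True), 1e-8)
    return D, A
```

In practice one would reach for an off-the-shelf solver such as sklearn.decomposition.DictionaryLearning, which implements the same alternation.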

Page 16

Sparse coding: testing time

Input: a novel image patch x (in R^d) and the previously learned bases Φ1, …, Φk
Output: the sparse representation a = [a1, a2, …, ak] of the patch x

x ≈ 0.8 * [basis] + 0.3 * [basis] + 0.5 * [basis] (basis images omitted)

Represent x as: a = [0, 0, …, 0, 0.8, 0, …, 0, 0.3, 0, …, 0, 0.5, …]
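At test time, encoding reduces to one LASSO solve per patch; a hedged sketch, where D is a learned (k, d) dictionary as above:

```python
import numpy as np
from sklearn.decomposition import sparse_encode

def encode(x, D, lam=0.1):
    """x: (d,) patch, D: (k, d) dictionary -> sparse code a of shape (k,)."""
    a = sparse_encode(x.reshape(1, -1), D, algorithm='lasso_lars', alpha=lam)
    return a.ravel()   # mostly zeros, e.g. 0.8, 0.3, 0.5 at three positions
```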

Page 17

Sparse coding illustration

Natural images → learned bases (Φ1, …, Φ64): “edges”

[Figure: natural image patches and the 64 learned edge-like bases; axis ticks omitted]

Test example:

x ≈ 0.8 * Φ36 + 0.3 * Φ42 + 0.5 * Φ63

[a1, …, a64] = [0, 0, …, 0, 0.8, 0, …, 0, 0.3, 0, …, 0, 0.5, 0] (feature representation)

Compact & easily interpretable

Slide credit: Andrew Ng

Page 18

Self-taught Learning [Raina, Lee, Battle, Packer & Ng, ICML 07]

Motorcycles / Not motorcycles / Unlabeled images

Testing: What is this?

Slide credit: Andrew Ng

Page 19

Classification Result on Caltech 101

Caltech-101: 9K images, 101 classes

- SIFT VQ + nonlinear SVM: 64%
- Pixel sparse coding + linear SVM: ~50%

Page 20

Sparse Coding on SIFT – the ScSPM algorithm

Local Gradients → Pooling (e.g., SIFT, HOG) → Sparse Coding → Max Pooling → Linear Classifier

[Yang, Yu, Gong & Huang, CVPR09]
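The shape of the pipeline, as a hedged sketch: it reuses the spm_pool sketch from the SPM slide with max pooling; the dictionary D, the descriptor arrays, and the penalty are assumptions, not the authors' code.

```python
import numpy as np
from sklearn.decomposition import sparse_encode
from sklearn.svm import LinearSVC

def scspm_feature(sift, xy, D, lam=0.15):
    """sift: (n, 128) dense SIFT; xy: (n, 2) in [0, 1); D: (k, 128)."""
    codes = np.abs(sparse_encode(sift, D, algorithm='lasso_lars', alpha=lam))
    return spm_pool(codes, xy, pool=np.max)    # spatial max pooling per cell

# X = np.stack([scspm_feature(s, p, D) for s, p in zip(sift_list, xy_list)])
# clf = LinearSVC().fit(X, labels)             # a linear SVM suffices on these codes
```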

Page 21

Sparse Coding on SIFT – the ScSPM algorithm [Yang, Yu, Gong & Huang, CVPR09]

Caltech-101:

- SIFT VQ + nonlinear SVM: 64%
- SIFT sparse coding + linear SVM (ScSPM): 73%

Page 22

Summary: Accuracies on Caltech 101

Dense SIFT → VQ Coding → Spatial Pooling → Nonlinear SVM: 64%

Sparse Coding (on raw pixels) → Spatial Max Pooling → Linear SVM: ~50%

Dense SIFT → Sparse Coding → Spatial Max Pooling → Linear SVM: 73%

Key message:
- Deep models are preferred
- Sparse coding is a better building block

Page 23

Outline

1. Sparse coding for image classification

2. Understanding sparse coding

3. Hierarchical sparse coding

4. Other topics: e.g. structured model, scale-up, discriminative training

5. Summary


Page 24

Outline

1. Sparse coding for image classification

2. Understanding sparse coding
– Connections to RBMs, autoencoders, …
– Sparse activations vs. sparse models, …
– Sparsity vs. locality – local sparse coding methods

3. Hierarchical sparse coding

4. Other topics: e.g. structured model, scale-up, discriminative training

5. Summary


Page 25

Classical sparse coding

- a is sparse
- a is often higher-dimensional than x
- the activation a = f(x) is a nonlinear, implicit function of x
- the reconstruction x’ = g(a) is linear and explicit

[Diagram: x → f(x) → a (encoding); a → g(a) → x’ (decoding)]

Page 26

RBM & autoencoders

- also involve an activation and a reconstruction
- but have an explicit f(x)
- do not necessarily enforce sparsity on a
- but if sparsity is imposed on a, results often improve [e.g. sparse RBM, Lee et al. NIPS08]

[Diagram: x → f(x) → a (encoding); a → g(a) → x’ (decoding)]

Page 27

Sparse coding: A broader view

Any feature mapping from x to a, i.e. a = f(x), where

- a is sparse (and often higher-dimensional than x)
- f(x) is nonlinear
- a reconstruction x’ = g(a) exists, such that x’ ≈ x

[Diagram: x → f(x) → a; a → g(a) → x’]

Therefore, sparse RBMs, sparse auto-encoders, and even VQ can be viewed as forms of sparse coding.

Page 28

Outline

1. Sparse coding for image classification

2. Understanding sparse coding
– Connections to RBMs, autoencoders, …
– Sparse activations vs. sparse models, …
– Sparsity vs. locality – local sparse coding methods

3. Hierarchical sparse coding

4. Other topics: e.g. structured model, scale-up, discriminative training

5. Summary


Page 29

Sparse activations vs. sparse models

For a general function learning problem a = f(x):

1. Sparse model: f(x)’s parameters are sparse
   - example: LASSO f(x) = <w, x>, where w is sparse
   - the goal is feature selection: all data points select a common subset of features
   - a hot topic in machine learning

2. Sparse activations: f(x)’s outputs are sparse
   - example: sparse coding a = f(x), where a is sparse
   - the goal is feature learning: different data points activate different feature subsets

A toy sketch contrasting the two follows.
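A toy numpy/scikit-learn sketch contrasting the two notions; all names and values are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.decomposition import sparse_encode

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 6))
y = 0.2 * X[:, 1] + 0.1 * X[:, 3]            # only features 1 and 3 matter

# Sparse model: one sparse weight vector w, shared by every data point.
w = Lasso(alpha=0.01, fit_intercept=False).fit(X, y).coef_
print(w)                                      # same non-zero positions for all x

# Sparse activations: each row of A is sparse, with different non-zeros per row.
D = rng.standard_normal((16, 6))
D /= np.linalg.norm(D, axis=1, keepdims=True)
A = sparse_encode(X, D, algorithm='lasso_lars', alpha=0.3)
print((A != 0).sum(axis=1))                   # few active bases, varying per input
```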


Page 30

Example of sparse models

f(x) = <w, x>, where w = [0, 0.2, 0, 0.1, 0, 0]

• Because the 2nd and 4th elements of w are non-zero, these are the two selected features in x
• Globally-aligned sparse representation: every data point uses the same non-zero dimensions

x1 [ | | | | | | ]  →  [ 0 | 0 | 0 0 ]
x2 [ | | | | | | ]  →  [ 0 | 0 | 0 0 ]
x3 [ | | | | | | ]  →  [ 0 | 0 | 0 0 ]
…
xm [ | | | | | | ]  →  [ 0 | 0 | 0 0 ]

Page 31

Example of sparse activations (sparse coding)

• Different x’s have different dimensions activated

• Locally-shared sparse representation: similar x’s tend to have similar non-zero dimensions

a1 [ 0 | | 0 0 … 0 ]
a2 [ | | 0 0 0 … 0 ]
a3 [ | 0 | 0 0 … 0 ]
…
am [ 0 0 0 | | … 0 ]

Page 32

Example of sparse activations (sparse coding)

• Another example: preserving manifold structure

• More informative in highlighting richer data structures, e.g. clusters and manifolds

a1 [ | | 0 0 0 … 0 ]
a2 [ 0 | | 0 0 … 0 ]
a3 [ 0 0 | | 0 … 0 ]
…
am [ 0 0 0 | | … 0 ]

Page 33

Outline

1. Sparse coding for image classification

2. Understanding sparse coding
– Connections to RBMs, autoencoders, …
– Sparse activations vs. sparse models, …
– Sparsity vs. locality
– Local sparse coding methods

3. Hierarchical sparse coding

4. Other topics: e.g. structured model, scale-up, discriminative training

5. Summary


Page 34

Sparsity vs. Locality

[Diagram: sparse coding vs. local sparse coding]

• Intuition: similar data should get similar activated features

• Local sparse coding:
  - data in the same neighborhood tend to have shared activated features;
  - data in different neighborhoods tend to have different features activated.

Page 35

Sparse coding is not always local: example

Case 1: independent subspaces
• Each basis is a “direction”
• Sparsity: each datum is a linear combination of only several bases

Case 2: data manifold (or clusters)
• Each basis is an “anchor point”
• Sparsity: each datum is a linear combination of neighboring anchors
• Sparsity is caused by locality

Page 36

Two approaches to local sparse coding

Approach 1: coding via local anchor points

Approach 2: coding via local subspaces

Page 37

Classical sparse coding is empirically local

When it works best for classification, the learned codes often turn out to be local.

It is preferable to let similar data have similar non-zero dimensions in their codes.


Page 38

MNIST Experiment: Classification using SC

• 60K training examples, 10K test examples

• Dictionary size k = 512

• Linear SVM on the sparse codes

Try different values of the sparsity penalty λ (a sweep sketch follows)
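A hedged sketch of the sweep's shape in scikit-learn; the data loader, iteration counts, and solver choices are assumptions, and a full-scale run is slow, so subsample for a quick check.

```python
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.decomposition import DictionaryLearning
from sklearn.svm import LinearSVC

X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
X = X / 255.0
Xtr, ytr = X[:60000], y[:60000]               # standard 60K/10K split
Xte, yte = X[60000:], y[60000:]

for lam in (0.0005, 0.005, 0.05, 0.5):
    dl = DictionaryLearning(n_components=512, alpha=lam,
                            transform_algorithm='lasso_lars',
                            transform_alpha=lam, max_iter=10)
    Atr = dl.fit_transform(Xtr)               # sparse codes of training digits
    Ate = dl.transform(Xte)
    acc = LinearSVC().fit(Atr, ytr).score(Ate, yte)
    print(f'lambda={lam}: test accuracy {acc:.4f}')
```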

Page 39

MNIST Experiment: Lambda = 0.0005

Each basis is like a part or direction.

Page 40

MNIST Experiment: Lambda = 0.005

Again, each basis is like a part or direction.

Page 41

MNIST Experiment: Lambda = 0.05

Now, each basis is more like a digit!

Page 42

MNIST Experiment: Lambda = 0.5

Like VQ now!

Page 43

Geometric view of sparse coding

Error: 4.54%   Error: 3.75%   Error: 2.64%

• When sparse coding achieves its best classification accuracy, the learned bases look like digits – each basis has a clear local class association.

Page 44

Distribution of coefficients (MNIST)

Neighboring bases tend to get non-zero coefficients.

Page 45

Distribution of coefficients (SIFT, Caltech-101)

A similar observation here!

Page 46

Outline

1. Sparse coding for image classification

2. Understanding sparse coding
– Connections to RBMs, autoencoders, …
– Sparse activations vs. sparse models, …
– Sparsity vs. locality
– Local sparse coding methods

3. Hierarchical sparse coding

4. Other topics: e.g. structured model, scale-up, discriminative training

5. Summary


Page 47

Why develop local sparse coding methods?

Since locality is a preferred property in sparse coding, let's ensure it explicitly.

The new algorithms can be theoretically well justified.

The new algorithms have computational advantages over classical sparse coding.


Page 48

Two approaches to local sparse coding

Approach 1: coding via local anchor points – local coordinate coding

Approach 2: coding via local subspaces – super-vector coding

References:
- Nonlinear learning using local coordinate coding, Kai Yu, Tong Zhang, and Yihong Gong. NIPS 2009.
- Learning locality-constrained linear coding for image classification, Jinjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, Thomas Huang. CVPR 2010.
- Image classification using super-vector coding of local image descriptors, Xi Zhou, Kai Yu, Tong Zhang, and Thomas Huang. ECCV 2010.
- Large-scale image classification: fast feature extraction and SVM training, Yuanqing Lin, Fengjun Lv, Shenghuo Zhu, Ming Yang, Timothee Cour, Kai Yu, Liangliang Cao, Thomas Huang. CVPR 2011.

Page 49

A function approximation framework to understand coding

• Assumption: image patches x follow a nonlinear manifold, and f(x) is smooth on that manifold.

• Coding: a nonlinear mapping x → a; typically, a is high-dimensional and sparse

• Nonlinear learning: f(x) = <w, a>

Page 50

Local sparse coding

Approach 1: Local coordinate coding

Page 51

Function Interpolation based on LCC

[Figure: data points and bases; the function is approximated as locally linear]

Yu, Zhang & Gong, NIPS 09

Page 52

Local Coordinate Coding (LCC): connect coding to nonlinear function learning

If f(x) is (α, β)-Lipschitz smooth, the function approximation error is bounded by a coding error term plus a locality term.

The key message: a good coding scheme should
1. have a small coding error,
2. and also be sufficiently local.
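The bound itself, a hedged reconstruction from the cited NIPS'09 LCC paper (notation: C is the set of anchor points, a_v the coefficients for x, and γ(x) the coding of x):

```latex
\left| f(x) - \sum_{v \in C} a_v f(v) \right|
\;\le\;
\underbrace{\alpha \left\| x - \gamma(x) \right\|}_{\text{coding error}}
+
\underbrace{\beta \sum_{v \in C} |a_v| \left\| v - \gamma(x) \right\|^{2}}_{\text{locality term}},
\qquad
\gamma(x) = \sum_{v \in C} a_v v .
```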

Page 53

Local Coordinate Coding (LCC)

• Dictionary learning: k-means (or hierarchical k-means)

• Coding for x, to obtain its sparse representation a:

Step 1 – ensure locality: find the K nearest bases

Step 2 – ensure low coding error: minimize the reconstruction error of x over those K bases (a sketch follows)

Yu, Zhang & Gong, NIPS 09; Wang, Yang, Yu, Lv & Huang, CVPR 10
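A hedged sketch of the coding step in the spirit of LLC (Wang et al., CVPR 10), restricting the code to the K nearest bases and solving a small constrained least-squares problem there; K, the regularizer, and the sum-to-one constraint follow that paper, and the names are illustrative.

```python
import numpy as np

def local_code(x, D, K=5, eps=1e-6):
    """x: (d,), D: (k, d) k-means dictionary -> sparse code a of shape (k,)."""
    near = np.argsort(np.linalg.norm(D - x, axis=1))[:K]  # Step 1: locality
    Z = D[near] - x                                       # shifted neighbor bases
    C = Z @ Z.T + eps * np.eye(K)                         # local covariance
    w = np.linalg.solve(C, np.ones(K))                    # analytic LLC-style solution
    w /= w.sum()                                          # enforce sum(a) = 1
    a = np.zeros(D.shape[0])
    a[near] = w                                           # Step 2: low coding error
    return a
```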

Page 54

Local sparse coding

Approach 2: Super-vector coding

Page 55

Function approximation via super-vector coding:

[Figure: data points and cluster centers]

Piecewise locally linear (first-order); local tangents

Zhou, Yu, Zhang, and Huang, ECCV 10

Page 56

Super-vector coding: justification

If f(x) is β-Lipschitz smooth, the function approximation error is bounded in terms of the quantization error of the local tangent approximation (equation image omitted).

Page 57

Super-Vector Coding (SVC)

• Dictionary learning: k-means (or hierarchical k-means)

• Coding for x, to obtain its sparse representation a:

Step 1 – find the nearest basis of x, obtain its VQ coding, e.g. [0, 0, 1, 0, …]

Step 2 – form the super-vector coding, e.g. [0, 0, 1, 0, …, 0, 0, (x − m3), 0, …]

Zhou, Yu, Zhang, and Huang, ECCV 10

Zero-order part + local tangent (first-order) part
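A minimal numpy sketch of the two steps for one descriptor; the scaling constant s on the zero-order part is an illustrative assumption.

```python
import numpy as np

def super_vector(x, M, s=1.0):
    """x: (d,) descriptor, M: (k, d) k-means centers -> (k * (d + 1),) code."""
    j = int(np.argmin(np.linalg.norm(M - x, axis=1)))  # nearest center (VQ)
    k, d = M.shape
    sv = np.zeros(k * (d + 1))
    sv[j * (d + 1)] = s                                # zero-order: s at block j
    sv[j * (d + 1) + 1:(j + 1) * (d + 1)] = x - M[j]   # first-order: (x - m_j)
    return sv
```

Image-level features are then obtained by pooling these super-vectors over the image, as in the earlier pipelines.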

Page 58

Results on ImageNet Challenge Dataset

ImageNet Challenge: 1.4 million images, 1000 classes

- VQ + intersection-kernel SVM: 40%
- LCC + linear SVM: 62%
- SVC + linear SVM: 65%

Page 59

Summary: local sparse coding

Approach 1: local coordinate coding

Approach 2: super-vector coding

- Sparsity achieved by explicitly ensuring locality
- Sound theoretical justifications
- Much simpler to implement and compute
- Strong empirical success

Page 60

Outline

1. Sparse coding for image classification

2. Understanding sparse coding

3. Hierarchical sparse coding

4. Other topics: e.g. structured model, scale-up, discriminative training

5. Summary


Page 61

Hierarchical sparse coding

Sparse Coding → Pooling → Sparse Coding → Pooling

Learning from unlabeled data

Yu, Lin & Lafferty, CVPR 11
Zeiler, Taylor & Fergus, ICCV 11
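The stacked shape, as a hedged scikit-learn sketch; layer sizes, penalties, and the pooling groups are illustrative, not the papers' settings.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

def coding_pooling_layer(patches, k, lam, pool_groups):
    """patches: (n, d); pool_groups: list of index arrays (spatial neighborhoods).

    Sparse-codes the inputs, then max-pools the codes within each group.
    """
    dl = DictionaryLearning(n_components=k, alpha=lam,
                            transform_algorithm='lasso_lars',
                            transform_alpha=lam)
    codes = dl.fit_transform(patches)                  # (n, k) sparse codes
    return np.stack([np.abs(codes[g]).max(axis=0) for g in pool_groups])

# Layer 2 repeats the same coding+pooling operation on the pooled layer-1 outputs.
```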

Page 62

A two-layer sparse coding formulation


Yu, Lin, & Lafferty, CVPR 11

Page 63

MNIST results – classification

HSC vs. CNN: HSC delivers even better performance than CNN; more remarkably, HSC learns its features in an unsupervised manner!

Yu, Lin, & Lafferty, CVPR 11

Page 64

MNIST results – learned dictionary

A hidden unit in the second layer is connected to a group of units in the first layer, giving invariance to translation, rotation, and deformation.

Yu, Lin, & Lafferty, CVPR 11

Page 65

Caltech-101 results – classification

Learned descriptor: performs slightly better than SIFT + SC

Yu, Lin, & Lafferty, CVPR 11

Page 66

Adaptive Deconvolutional Networks for Mid and High Level Feature Learning

• Hierarchical convolutional sparse coding

• Trained to reconstruct the image from all layers (L1–L4)

• Pooling both spatially and among features

• Learns invariant mid-level features

Matthew D. Zeiler, Graham W. Taylor, and Rob Fergus, ICCV 2011

[Architecture diagram: Image → L1 feature maps → L2 feature maps → L3 feature maps → L4 feature maps, with feature-group selection between layers]

Page 67

Outline

1. Sparse coding for image classification

2. Understanding sparse coding

3. Hierarchical sparse coding

4. Other topics: e.g. structured model, scale-up, discriminative training

5. Summary


Page 68

Other topics of sparse coding

Structured sparse coding, for example:
– Group sparse coding [Bengio et al., NIPS 09]
– Learning hierarchical dictionaries [Jenatton, Mairal et al., 2010]

Scaling up sparse coding, for example:
– Feature-sign algorithm [Lee et al., NIPS 07]
– Feed-forward approximation [Gregor & LeCun, ICML 10]
– Online dictionary learning [Mairal et al., ICML 09]

Discriminative training, for example:
– Backprop algorithms [Bradley & Bagnell, NIPS 08; Yang et al., CVPR 10]
– Supervised dictionary training [Mairal et al., NIPS 08]


Page 69

Summary of Sparse Coding

Sparse coding is an effective approach to (unsupervised) feature learning

A building block for deep models

Sparse coding and its local variants (LCC, SVC) have pushed the boundary of accuracies on Caltech101, PASCAL VOC, ImageNet, …

Challenge: discriminative training is not straightforward