Learning Visual Semantics: Models, Massive Computation, and Innovative Applications
Part II: Visual Features and Representations
Liangliang Cao, IBM Watson Research Center
Evolution of Visual Features
• Low level features and spatial histograms
• SIFT and bag-of-words models
• Sparse coding
• Super vector and Fisher vector
• Deep CNN
Moving down this list, the models grow from fewer parameters to more parameters.
Three fundamental techniques have been used extensively throughout this evolution:
1. histogram
2. spatial gridding
3. filter
Low Level Features and Spatial Pyramid

Raw Pixels as Feature: concatenate the raw pixels of an image into a 1D vector.
• Application 1: Face recognition
• Application 2: Handwritten digits
• Tiny Image [Torralba et al 2007]: resize an image to a 32x32 color thumbnail, which corresponds to a 3072-dimensional vector.
Pictures courtesy of Face Research Lab, Antonio Torralba and Sam Roweis
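A minimal sketch of the tiny-image idea (the file name is only a placeholder):

```python
import numpy as np
from PIL import Image

def tiny_image_feature(path):
    """Resize an image to a 32x32 color thumbnail and flatten it
    into a 32 * 32 * 3 = 3072 dimensional vector."""
    img = Image.open(path).convert("RGB").resize((32, 32))
    return np.asarray(img, dtype=np.float32).ravel()

# feat = tiny_image_feature("example.jpg")  # feat.shape == (3072,)
```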
From Pixels to Histograms
The color histogram [Swain and Ballard 91] models the distribution of colors (over the r, g, b channels) in an image.
We can extend the color histogram idea to:
• Edge histogram
• Shape context histogram
• Local binary patterns (LBP)
• Histogram of gradients
Unlike raw pixel based vectors, histograms are not sensitive to:
• misalignment
• scale changes
• global rotation
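A minimal color-histogram sketch (per-channel marginal histograms for brevity; Swain and Ballard's original uses a joint 3D color histogram):

```python
import numpy as np

def color_histogram(img, bins_per_channel=8):
    """Model the distribution of colors in an HxWx3 uint8 image.

    Returns a vector of length 3 * bins_per_channel, L1-normalized
    so that images of different sizes remain comparable."""
    hist = []
    for c in range(3):  # r, g, b channels
        h, _ = np.histogram(img[:, :, c], bins=bins_per_channel, range=(0, 256))
        hist.append(h)
    hist = np.concatenate(hist).astype(np.float64)
    return hist / hist.sum()
```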
From Histogram to Spatialized Histogram
Problem of histograms: no spatial information! Two very different images can produce exactly the same histogram (example thanks to Erik Learned-Miller).
Remedy: compute histograms over spatial cells [Ojala et al, PAMI'02], or use spatial pyramid matching [Lazebnik et al, CVPR'06].
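A sketch of histograms over spatial cells, in the spirit of spatial pyramid matching (the grid sizes and bin counts here are illustrative choices):

```python
import numpy as np

def spatial_histogram(img, levels=(1, 2, 4), bins=8):
    """Concatenate per-channel color histograms over spatial cells.

    At each level the image is split into n x n cells (n = 1, 2, 4),
    so the feature keeps coarse spatial layout instead of discarding it."""
    H, W, _ = img.shape
    feats = []
    for n in levels:
        for i in range(n):
            for j in range(n):
                cell = img[i * H // n:(i + 1) * H // n,
                           j * W // n:(j + 1) * W // n]
                for c in range(3):
                    h, _ = np.histogram(cell[:, :, c], bins=bins, range=(0, 256))
                    feats.append(h / max(h.sum(), 1))
    return np.concatenate(feats)
```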
IBM IMARS Spatial Gridding
First position in the 1st and 2nd ImageCLEF Medical Imaging Classification (http://www.imageclef.org/2012/medical).
Task: determine which modality a medical image belongs to.
- Images from PubMed articles
- 31 categories (x-ray, CT, MRI, ultrasound, etc.)
Image Filters
• In addition to histograms, another group of features can be represented as "filters". For example:
1. Haar-like filters (used in Viola-Jones face detection)
2. Gabor filters (simple cells in the visual cortex can be modeled by Gabor functions); widely used in fingerprint, iris, OCR, texture, and face recognition
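A hand-rolled Gabor filter bank as a sketch (this uses a simplified isotropic Gaussian envelope; practical implementations add an aspect-ratio parameter and tune sigma, wavelength, and kernel size per task):

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(ksize=21, sigma=4.0, theta=0.0, lam=10.0, psi=0.0):
    """A real Gabor kernel: a sinusoid of wavelength lam and phase psi,
    rotated by angle theta and modulated by a Gaussian envelope."""
    half = ksize // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * xr / lam + psi)

def gabor_responses(gray, n_orientations=4):
    """Filter a grayscale image with a bank of oriented Gabor filters."""
    thetas = [k * np.pi / n_orientations for k in range(n_orientations)]
    return [convolve2d(gray, gabor_kernel(theta=t), mode="same") for t in thetas]
```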
SIFT Feature and Bag-of-Words Model

Classical features (before 1999):
• Raw pixels
• Histogram features: color histogram, edge histogram
• Frequency analysis
• Image filters
• Texture features: LBP
• Scene features: GIST
• Shape descriptors, edge detection, corner detection

SIFT features and beyond (1999 onward):
• Detectors: DoG, Hessian detector, Laplacian of Harris, FAST, ORB, …
• Descriptors: SIFT, HOG, SURF, DAISY, BRIEF, …
Scale-Invariant Feature Transform (SIFT)
David G. Lowe
- Distinctive image features from scale-invariant keypoints, IJCV 2004
- Object recognition from local scale-invariant features, ICCV 1999

SIFT Descriptor: histograms of gradient orientations, concatenated over spatial cells.
- Histograms are more robust to position than raw pixels
- Edge gradient is more distinctive than color for local patches

David Lowe's excellent performance tuning:
• Good parameters: 8 orientations, 4 x 4 grid
• Soft assignment to spatial bins
• Gaussian weighting over spatial location
• Reduce the influence of large gradient magnitudes: thresholding + normalization
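A simplified descriptor sketch capturing the idea (Gaussian weighting and soft spatial binning are omitted; only the gradient-magnitude thresholding and renormalization from the list above are kept):

```python
import numpy as np

def sift_like_descriptor(patch, grid=4, n_ori=8):
    """Orientation histograms (n_ori bins) over a grid x grid cell layout
    of a square grayscale patch, concatenated into a 128-dim vector."""
    gy, gx = np.gradient(patch.astype(np.float64))
    mag = np.hypot(gx, gy)
    ori = np.mod(np.arctan2(gy, gx), 2 * np.pi)
    obin = np.minimum((ori / (2 * np.pi) * n_ori).astype(int), n_ori - 1)

    n = patch.shape[0]
    desc = np.zeros((grid, grid, n_ori))
    for i in range(grid):
        for j in range(grid):
            cell = (slice(i * n // grid, (i + 1) * n // grid),
                    slice(j * n // grid, (j + 1) * n // grid))
            for b in range(n_ori):
                desc[i, j, b] = mag[cell][obin[cell] == b].sum()

    desc = desc.ravel()
    desc /= np.linalg.norm(desc) + 1e-12
    desc = np.minimum(desc, 0.2)                  # threshold large gradients
    return desc / (np.linalg.norm(desc) + 1e-12)  # renormalize
```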
SIFT Detector: detect maxima and minima of difference-of-Gaussian (DoG) filters in scale space.
Post-processing: keep corner points but reject low-contrast and edge points.
• In general object recognition, we may combine multiple detectors (e.g., Harris, Hessian), or use dense sampling for good performance.
• Following SIFT, many works including SURF, BRIEF, ORB, BRISK, etc. have been proposed for faster local feature extraction.
Histogram of Local Features and Bag-of-Words Models
Histogram of Local Features
[Figure: local descriptors are quantized to codewords; counting codeword frequencies yields a histogram whose dimension = #codewords.]
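A minimal sketch of this pipeline using k-means for the codebook (the codebook size is an illustrative choice):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors, n_codewords=256, seed=0):
    """Cluster local descriptors (e.g., SIFT) from training images
    into a codebook of visual words with k-means."""
    return KMeans(n_clusters=n_codewords, random_state=seed).fit(descriptors)

def bow_histogram(descriptors, codebook):
    """Quantize each descriptor of an image to its nearest codeword and
    count: the normalized counts form the image-level feature."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(np.float64)
    return hist / hist.sum()
```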
Histogram of Local Features + Spatial Gridding
With spatial gridding, one codeword histogram is computed per grid cell and the results are concatenated, so the feature dimension = #codewords x #grids.
Bag of Words Models
Bag-of-Words Representation
[Figure: in computer vision, an object is represented as a bag of "visual words", in direct analogy to the bag-of-words representation of documents in text and NLP.]
Slide credit: Fei-Fei Li
Topic Models for Bag-of-Words Representation
• Unsupervised classification: Sivic et al. ICCV 2005
• Supervised classification: Fei-Fei et al. CVPR 2005
• Classification + segmentation: Cao and Fei-Fei. ICCV 2007
Pros and Cons of Bag of Words Models
Bag of words models are good at:
- Modeling prior knowledge
- Providing intuitive interpretation
But these models suffer from:
- Loss of spatial information
- Loss of information in the quantization of "visual words"
Images differ from texts! This motivates better coding approaches.
Sparse Coding
• A naïve histogram uses vector quantization (VQ) as a hard assignment, while sparse coding (SC) provides a soft assignment.
• Sparse coding replaces the l0 norm (which would give the sparsest solution) with a convex l1 approximation: min_a ||x - Da||^2 + lambda * ||a||_1, where D is the codebook and a is the code.
• SC works better with max pooling (while traditional VQ uses average pooling).
• References: [M. Ranzato et al, CVPR'07], [J. Yang et al, CVPR'09], [J. Wang et al, CVPR'10], [Y. Boureau et al, CVPR'10]
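A minimal sketch with scikit-learn (the dictionary size, alpha, and random data are placeholders for real settings and descriptors):

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning, SparseCoder

X = np.random.randn(1000, 64)          # stand-in for real local descriptors

# Learn a codebook (dictionary) D from training descriptors
dico = DictionaryLearning(n_components=128, alpha=0.15).fit(X)

# Soft assignment: each descriptor is reconstructed by a few codewords
coder = SparseCoder(dictionary=dico.components_,
                    transform_algorithm="lasso_lars", transform_alpha=0.15)
codes = coder.transform(X)             # (1000, 128), mostly zeros

# Max pooling over all descriptors of one image
image_feature = np.abs(codes).max(axis=0)
```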
Sparse Coding + Spatial Pyramid
Yang et al, Linear Spatial Pyramid Matching using Sparse Coding for Image Classification, CVPR 2009: sparse coding + spatial pyramid + linear SVM.
Efficient Approach: Locality-constrained Linear Coding (LLC) [J. Wang et al CVPR'10]
1. Find the k nearest codewords to the query descriptor
2. Compute the code using only those k neighbors
Significantly faster than naïve SC, e.g., from O(1000a) to O(5a) for a 1000-word codebook with k = 5.
For further speedup, we can use least-squares regression over the neighbors to replace SC, and the top-k nearest-neighbor search can itself be accelerated further; see the sketch below.
Matlab implementation: http://www.ifp.illinois.edu/~jyang29/LLC.htm
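A sketch of the fast approximated LLC coding step for a single descriptor (following the analytical solution in Wang et al; beta is a small regularizer):

```python
import numpy as np

def llc_code(x, D, k=5, beta=1e-4):
    """Code descriptor x with codebook D (n_codewords x n_dims) by
    solving a small regularized least-squares problem over the k
    nearest codewords; all other coefficients stay exactly zero."""
    idx = np.argsort(((D - x) ** 2).sum(axis=1))[:k]   # k nearest codewords
    z = D[idx] - x                                     # shifted neighbors
    C = z @ z.T + beta * np.eye(k)                     # local covariance
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()                                       # codes sum to one
    code = np.zeros(len(D))
    code[idx] = w
    return code
```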
Sparse Codes Are Not Necessarily Sparse
• Hard quantization (VQ) solves min_a ||x - Da||^2 s.t. a is a one-hot 0/1 assignment: the sparsest possible solution.
• Sparse coding relaxes this constraint (l1 instead of l0), so its codes are less sparse than VQ's.
• After pooling, the image-level representation is not sparse at all.
• So, is the success of SC really due to sparsity?
Fisher Vector and Super Vector
Information Loss
• Coding with information loss: both VQ (x ≈ nearest codeword d_k) and sparse coding (x ≈ Da) approximate a descriptor and throw away the residual.
• Lossless coding keeps enough information to recover x, e.g., by storing the residual x - d_k alongside the assignment.
• The difference is significant when the codes are used to approximate a function f(x): with SC or VQ, each codeword contributes only its value f(d_k), a scalar, while with lossless coding each codeword contributes a local expansion of f, a function.
Lossless Coding as Mixture of Experts
• Let's look at each codeword as a "local expert": descriptors are softly divided among experts (Expert 1, Expert 2, Expert 3, ...) by a gating function (e.g., GMM, sparse GMM, harmonic k-means, etc.), and each expert models the function locally around its codeword.
Pooling Towards Image-Level Representation
• Pooling: for each component (Component 1, Component 2, Component 3, ...), sum the contributions of all descriptors assigned to it, then normalize and concatenate the per-component results into one long vector.
• Both the Fisher Vector and the Super Vector can be written in this form (with different subtraction, normalization, and scaling factors).
Related references:
• Fisher Vector [Perronnin et al, ECCV10]
• Supervector [X. Zhou, K. Yu, T. Zhang et al, ECCV10]
• HG [X. Zhou et al, ECCV09]
Pooling Towards Image-Level Representation: A Big Model
The dimension becomes C (#components) x d (#feature dims).
For example, if C = 1000 and d = 128, the final dimension is 128K: 100+ times longer than that from SC or VQ! A pooling sketch follows.
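A simplified super-vector-style pooling sketch (the exact subtraction, scaling, and normalization factors of the Fisher vector and super vector differ; this keeps only the shared structure):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def supervector_like_pooling(X, gmm):
    """Pool descriptors X (n_desc x d): for each GMM component, aggregate
    posterior-weighted residuals around the component mean, then
    normalize and concatenate, giving a C x d dimensional vector."""
    post = gmm.predict_proba(X)                  # (n_desc, C) soft assignments
    parts = []
    for k in range(gmm.n_components):
        resid = X - gmm.means_[k]                # residuals to component mean
        u = (post[:, [k]] * resid).sum(axis=0)   # pooled residual, length d
        u /= np.sqrt(post[:, k].sum() + 1e-12)   # one common scaling choice
        parts.append(u)
    v = np.concatenate(parts)
    return v / (np.linalg.norm(v) + 1e-12)

# gmm = GaussianMixture(n_components=1024).fit(training_descriptors)
# feature = supervector_like_pooling(image_descriptors, gmm)  # C x d dims
```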
Very Long Vectors as Feature Representation
We can generate very long image feature vectors as discussed above. The strong feature we used for ImageNet LSVRC 2010:
– Dense sampling: LBP + HOG, feature dim = 100 (after PCA)
– GMM with 1024 components
– 4 spatial grids (1 + 3x1)
– Dimension of image feature: 100 x 1024 x 4 = 0.41M
[Pipeline: LBP and HOG local features → GMM pooling → image-level feature]
How Do We Train Such Big Models?
For Small Datasets: Use the Kernel Trick!
Kernel trick: the classifier can be learned and evaluated through pairwise similarities K(x_i, x_j) alone, so training only needs the N x N kernel matrix rather than the explicit (very long) feature vectors.
• 10K images => kernel matrix: 10K x 10K ~ 100M entries
• Computational complexity depends on the size of the kernel matrix, i.e., on the number of samples, which here is less than the feature dimension.
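A minimal precomputed-kernel sketch with scikit-learn (the linear kernel and random data are placeholders; any positive semi-definite kernel works):

```python
import numpy as np
from sklearn.svm import SVC

X = np.random.randn(200, 4096)        # stand-in for long image features
y = np.random.randint(0, 2, 200)

K = X @ X.T                           # N x N kernel matrix
clf = SVC(kernel="precomputed").fit(K, y)

# At test time we only need kernel values between test and training samples
X_test = np.random.randn(10, 4096)
pred = clf.predict(X_test @ X.T)      # (n_test, n_train) kernel block
```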
We tried nonlinear kernels for face verification and got good results on the LFW dataset: Learning Locally-Adaptive Decision Functions for Person Verification, CVPR'13 (with Z. Li, S. Chang, F. Liang, T. Huang, J. Smith).
For Large Datasets: Use Stochastic Gradient Descent
• Suppose we are working on ImageNet data using the 0.4M-dimensional feature vectors.
• Total training data: 1.2M x 0.4M ~ 0.5T real values!
– Too big to load into memory
– Too many samples to use kernel tricks
• Solution: Stochastic Gradient Descent (SGD)
– Idea: estimate the gradient on a randomly picked sample
– Compared with gradient descent, which averages the gradient over all N samples per update, SGD updates with the gradient of a single random sample, making each step independent of the dataset size.
SGD Can Be Very Simple To Implement
Shai Shalev-Shwartz's Pegasos algorithm gives a roughly 10-line binary SVM solver; the key ingredient is a decreasing learning rate.
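A Pegasos-style sketch (labels in {-1, +1}; the optional projection step of the original algorithm is omitted):

```python
import numpy as np

def pegasos_svm(X, y, lam=1e-4, n_iters=100_000, seed=0):
    """SGD for a binary linear SVM: pick one random sample per step and
    use the decreasing learning rate eta_t = 1 / (lam * t)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, n_iters + 1):
        i = rng.integers(len(X))
        eta = 1.0 / (lam * t)                   # decreasing learning rate
        if y[i] * (w @ X[i]) < 1:               # hinge loss is active
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:                                   # only the regularizer acts
            w = (1 - eta * lam) * w
    return w
```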
Deep CNN and Related Tech
Deep CNN: A Bigger Model
Motivated by the studies of [Krizhevsky et al, NIPS12] and [Y. LeCun et al, PIEEE98], the deep convolutional neural network (CNN) has become the newest winner of the ImageNet competition. The most popular CNN has:
– 5 convolutional layers to learn filters
– 2 fully connected layers
– 60 million parameters
– Stochastic gradient descent for training (again)
Why can we train such a big model now, and not in the 1990s?
– The rise of big datasets (ImageNet)
– The blessing of GPU computing
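A rough architecture sketch in PyTorch (a modern framework used here purely for illustration; local response normalization, grouped convolutions, and dropout from the original network are omitted):

```python
import torch.nn as nn

class AlexNetLike(nn.Module):
    """Krizhevsky-style network: 5 conv layers that learn filters,
    2 fully connected layers, and a final classification layer."""
    def __init__(self, n_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, 11, stride=4), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(96, 256, 5, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(3, 2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
            nn.Linear(4096, n_classes),
        )

    def forward(self, x):          # x: (batch, 3, 227, 227)
        return self.classifier(self.features(x))
```

Such a network is trained with (mini-batch) SGD, just as on the previous slide.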
Learning Representations From Big Data
Computer vision researchers have seen big performance jumps on large scale datasets like ImageNet. Even earlier, researchers in speech/acoustics saw similar success in LVCSR and related tasks. In another field, text/NLP researchers are also moving quickly to large scale learning: for example, the IBM Watson system used thousands of sub-systems to beat the human champions in the Jeopardy! game.
Watson is hiring (www.ibm.com/watsonjobs)! In particular, we are looking for winter interns to work on vision + NLP problems; contact [email protected].
Conclusion
The mutual evolution of big data and big models:
• Bigger and bigger models: histogram → sparse coding (10K parameters) → super vector / Fisher vector (0.4M parameters) → deep CNN (60M parameters)
• Bigger and bigger datasets: small (e.g., Caltech101, 8K images) → medium (e.g., PASCAL, 10K+) → large (e.g., ImageNet, 1.2M)
Motivating questions:
- How to develop scalable solutions for big data?
- How to deal with situations with limited labeled data?
Please see the following talks for the answers!