Learning Visual Semantics: Models, Massive Computation, and Innovative Applications
Part II: Visual Features and Representations
Liangliang Cao, IBM Watson Research Center
Evolution of Visual Features
• Low level features and spatial histograms
• SIFT and bag-of-words models
• Sparse coding
• Super vector and Fisher vector
• Deep CNN
Moving down this list, the models grow from fewer parameters to more parameters.
Three fundamental techniques have been used extensively throughout this evolution:
1. histogram
2. spatial gridding
3. filter
Low Level Features and Spatial Pyramid

Raw Pixels as Feature: concatenate the raw pixels of an image into a 1D vector.
• Application 1: Face recognition
• Application 2: Handwritten digits
• Tiny Image [Torralba et al 2007]: resize an image to a 32x32 color thumbnail, which corresponds to a 3072-dimensional vector.
Pictures courtesy of Face Research Lab, Antonio Torralba and Sam Roweis
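A minimal sketch of the tiny-image idea (the file name is only a placeholder):

```python
import numpy as np
from PIL import Image

def tiny_image_feature(path):
    """Resize an image to a 32x32 color thumbnail and flatten it
    into a 32 * 32 * 3 = 3072 dimensional vector."""
    img = Image.open(path).convert("RGB").resize((32, 32))
    return np.asarray(img, dtype=np.float32).ravel()

# feat = tiny_image_feature("example.jpg")  # feat.shape == (3072,)
```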
From Pixels to Histograms
The color histogram [Swain and Ballard 91] models the distribution of colors (over the r, g, b channels) in an image.
We can extend the color histogram idea to:
• Edge histogram
• Shape context histogram
• Local binary patterns (LBP)
• Histogram of gradients
Unlike raw pixel based vectors, histograms are not sensitive to:
• misalignment
• scale changes
• global rotation
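A minimal color-histogram sketch (per-channel marginal histograms for brevity; Swain and Ballard's original uses a joint 3D color histogram):

```python
import numpy as np

def color_histogram(img, bins_per_channel=8):
    """Model the distribution of colors in an HxWx3 uint8 image.

    Returns a vector of length 3 * bins_per_channel, L1-normalized
    so that images of different sizes remain comparable."""
    hist = []
    for c in range(3):  # r, g, b channels
        h, _ = np.histogram(img[:, :, c], bins=bins_per_channel, range=(0, 256))
        hist.append(h)
    hist = np.concatenate(hist).astype(np.float64)
    return hist / hist.sum()
```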
From Histogram to Spatialized Histogram
Problem of histograms: no spatial information! Two very different images can produce exactly the same histogram (example thanks to Erik Learned-Miller).
Remedy: compute histograms over spatial cells [Ojala et al, PAMI'02], or use spatial pyramid matching [Lazebnik et al, CVPR'06].
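A sketch of histograms over spatial cells, in the spirit of spatial pyramid matching (the grid sizes and bin counts here are illustrative choices):

```python
import numpy as np

def spatial_histogram(img, levels=(1, 2, 4), bins=8):
    """Concatenate per-channel color histograms over spatial cells.

    At each level the image is split into n x n cells (n = 1, 2, 4),
    so the feature keeps coarse spatial layout instead of discarding it."""
    H, W, _ = img.shape
    feats = []
    for n in levels:
        for i in range(n):
            for j in range(n):
                cell = img[i * H // n:(i + 1) * H // n,
                           j * W // n:(j + 1) * W // n]
                for c in range(3):
                    h, _ = np.histogram(cell[:, :, c], bins=bins, range=(0, 256))
                    feats.append(h / max(h.sum(), 1))
    return np.concatenate(feats)
```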
IBM IMARS Spatial Gridding
First position in the 1st and 2nd ImageCLEF Medical Imaging Classification (http://www.imageclef.org/2012/medical).
Task: determine which modality a medical image belongs to.
- Images from PubMed articles
- 31 categories (x-ray, CT, MRI, ultrasound, etc.)
Image Filters
• In addition to histograms, another group of features can be represented as "filters". For example:
1. Haar-like filters (used in Viola-Jones face detection)
2. Gabor filters (simple cells in the visual cortex can be modeled by Gabor functions); widely used in fingerprint, iris, OCR, texture, and face recognition
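A hand-rolled Gabor filter bank as a sketch (this uses a simplified isotropic Gaussian envelope; practical implementations add an aspect-ratio parameter and tune sigma, wavelength, and kernel size per task):

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(ksize=21, sigma=4.0, theta=0.0, lam=10.0, psi=0.0):
    """A real Gabor kernel: a sinusoid of wavelength lam and phase psi,
    rotated by angle theta and modulated by a Gaussian envelope."""
    half = ksize // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * xr / lam + psi)

def gabor_responses(gray, n_orientations=4):
    """Filter a grayscale image with a bank of oriented Gabor filters."""
    thetas = [k * np.pi / n_orientations for k in range(n_orientations)]
    return [convolve2d(gray, gabor_kernel(theta=t), mode="same") for t in thetas]
```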
SIFT Feature and Bag-of-Words Model

Classical features (before 1999):
• Raw pixels
• Histogram features: color histogram, edge histogram
• Frequency analysis
• Image filters
• Texture features: LBP
• Scene features: GIST
• Shape descriptors, edge detection, corner detection

SIFT features and beyond (1999 onward):
• Detectors: DoG, Hessian detector, Laplacian of Harris, FAST, ORB, …
• Descriptors: SIFT, HOG, SURF, DAISY, BRIEF, …
Scale-Invariant Feature Transform (SIFT)
David G. Lowe
- Distinctive image features from scale-invariant keypoints, IJCV 2004
- Object recognition from local scale-invariant features, ICCV 1999

SIFT Descriptor: histograms of gradient orientations, concatenated over spatial cells.
- Histograms are more robust to position than raw pixels
- Edge gradient is more distinctive than color for local patches

David Lowe's excellent performance tuning:
• Good parameters: 8 orientations, 4 x 4 grid
• Soft assignment to spatial bins
• Gaussian weighting over spatial location
• Reduce the influence of large gradient magnitudes: thresholding + normalization
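A simplified descriptor sketch capturing the idea (Gaussian weighting and soft spatial binning are omitted; only the gradient-magnitude thresholding and renormalization from the list above are kept):

```python
import numpy as np

def sift_like_descriptor(patch, grid=4, n_ori=8):
    """Orientation histograms (n_ori bins) over a grid x grid cell layout
    of a square grayscale patch, concatenated into a 128-dim vector."""
    gy, gx = np.gradient(patch.astype(np.float64))
    mag = np.hypot(gx, gy)
    ori = np.mod(np.arctan2(gy, gx), 2 * np.pi)
    obin = np.minimum((ori / (2 * np.pi) * n_ori).astype(int), n_ori - 1)

    n = patch.shape[0]
    desc = np.zeros((grid, grid, n_ori))
    for i in range(grid):
        for j in range(grid):
            cell = (slice(i * n // grid, (i + 1) * n // grid),
                    slice(j * n // grid, (j + 1) * n // grid))
            for b in range(n_ori):
                desc[i, j, b] = mag[cell][obin[cell] == b].sum()

    desc = desc.ravel()
    desc /= np.linalg.norm(desc) + 1e-12
    desc = np.minimum(desc, 0.2)                  # threshold large gradients
    return desc / (np.linalg.norm(desc) + 1e-12)  # renormalize
```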
SIFT Detector: detect maxima and minima of difference-of-Gaussian (DoG) filters in scale space.
Post-processing: keep corner points but reject low-contrast and edge points.
• In general object recognition, we may combine multiple detectors (e.g., Harris, Hessian), or use dense sampling for good performance.
• Following SIFT, many works including SURF, BRIEF, ORB, BRISK, etc. have been proposed for faster local feature extraction.
Histogram of Local Features and Bag-of-Words Models
Histogram of Local Features
[Figure: local descriptors are quantized to codewords; counting codeword frequencies yields a histogram whose dimension = #codewords.]
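A minimal sketch of this pipeline using k-means for the codebook (the codebook size is an illustrative choice):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors, n_codewords=256, seed=0):
    """Cluster local descriptors (e.g., SIFT) from training images
    into a codebook of visual words with k-means."""
    return KMeans(n_clusters=n_codewords, random_state=seed).fit(descriptors)

def bow_histogram(descriptors, codebook):
    """Quantize each descriptor of an image to its nearest codeword and
    count: the normalized counts form the image-level feature."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(np.float64)
    return hist / hist.sum()
```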
Histogram of Local Features + Spatial Gridding
With spatial gridding, one codeword histogram is computed per grid cell and the results are concatenated, so the feature dimension = #codewords x #grids.
Bag of Words Models
Bag-of-Words Representation
[Figure: in computer vision, an object is represented as a bag of "visual words", in direct analogy to the bag-of-words representation of documents in text and NLP.]
Slide credit: Fei-Fei Li
Topic Models for Bag-of-Words Representation
• Unsupervised classification: Sivic et al. ICCV 2005
• Supervised classification: Fei-Fei et al. CVPR 2005
• Classification + segmentation: Cao and Fei-Fei. ICCV 2007
Pros and Cons of Bag of Words Models
Bag of words models are good at:
- Modeling prior knowledge
- Providing intuitive interpretation
But these models suffer from:
- Loss of spatial information
- Loss of information in the quantization of "visual words"
Images differ from texts! This motivates better coding approaches.
Sparse Coding
• A naïve histogram uses vector quantization (VQ) as a hard assignment, while sparse coding (SC) provides a soft assignment.
• Sparse coding replaces the l0 norm (which would give the sparsest solution) with a convex l1 approximation: min_a ||x - Da||^2 + lambda * ||a||_1, where D is the codebook and a is the code.
• SC works better with max pooling (while traditional VQ uses average pooling).
• References: [M. Ranzato et al, CVPR'07], [J. Yang et al, CVPR'09], [J. Wang et al, CVPR'10], [Y. Boureau et al, CVPR'10]
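A minimal sketch with scikit-learn (the dictionary size, alpha, and random data are placeholders for real settings and descriptors):

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning, SparseCoder

X = np.random.randn(1000, 64)          # stand-in for real local descriptors

# Learn a codebook (dictionary) D from training descriptors
dico = DictionaryLearning(n_components=128, alpha=0.15).fit(X)

# Soft assignment: each descriptor is reconstructed by a few codewords
coder = SparseCoder(dictionary=dico.components_,
                    transform_algorithm="lasso_lars", transform_alpha=0.15)
codes = coder.transform(X)             # (1000, 128), mostly zeros

# Max pooling over all descriptors of one image
image_feature = np.abs(codes).max(axis=0)
```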
Sparse Coding + Spatial Pyramid
Yang et al, Linear Spatial Pyramid Matching using Sparse Coding for Image Classification, CVPR 2009: sparse coding + spatial pyramid + linear SVM.
Efficient Approach: Locality-constrained Linear Coding (LLC) [J. Wang et al CVPR'10]
1. Find the k nearest codewords to the query descriptor
2. Compute the code using only those k neighbors
Significantly faster than naïve SC, e.g., from O(1000a) to O(5a) for a 1000-word codebook with k = 5.
For further speedup, we can use least-squares regression over the neighbors to replace SC, and the top-k nearest-neighbor search can itself be accelerated further; see the sketch below.
Matlab implementation: http://www.ifp.illinois.edu/~jyang29/LLC.htm
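A sketch of the fast approximated LLC coding step for a single descriptor (following the analytical solution in Wang et al; beta is a small regularizer):

```python
import numpy as np

def llc_code(x, D, k=5, beta=1e-4):
    """Code descriptor x with codebook D (n_codewords x n_dims) by
    solving a small regularized least-squares problem over the k
    nearest codewords; all other coefficients stay exactly zero."""
    idx = np.argsort(((D - x) ** 2).sum(axis=1))[:k]   # k nearest codewords
    z = D[idx] - x                                     # shifted neighbors
    C = z @ z.T + beta * np.eye(k)                     # local covariance
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()                                       # codes sum to one
    code = np.zeros(len(D))
    code[idx] = w
    return code
```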
Sparse Codes Are Not Necessarily Sparse
• Hard quantization (VQ) solves min_a ||x - Da||^2 s.t. a is a one-hot 0/1 assignment: the sparsest possible solution.
• Sparse coding relaxes this constraint (l1 instead of l0), so its codes are less sparse than VQ's.
• After pooling, the image-level representation is not sparse at all.
• So, is the success of SC really due to sparsity?
Fisher Vector and Super Vector
Information Loss
• Coding with information loss: both VQ (x ≈ nearest codeword d_k) and sparse coding (x ≈ Da) approximate a descriptor and throw away the residual.
• Lossless coding keeps enough information to recover x, e.g., by storing the residual x - d_k alongside the assignment.
• The difference is significant when the codes are used to approximate a function f(x): with SC or VQ, each codeword contributes only its value f(d_k), a scalar, while with lossless coding each codeword contributes a local expansion of f, a function.
Lossless Coding as Mixture of Experts
• Let's look at each codeword as a "local expert": descriptors are softly divided among experts (Expert 1, Expert 2, Expert 3, ...) by a gating function (e.g., GMM, sparse GMM, harmonic k-means, etc.), and each expert models the function locally around its codeword.
Pooling Towards Image-Level Representation
• Pooling: for each component (Component 1, Component 2, Component 3, ...), sum the contributions of all descriptors assigned to it, then normalize and concatenate the per-component results into one long vector.
• Both the Fisher Vector and the Super Vector can be written in this form (with different subtraction, normalization, and scaling factors).
Related references:
• Fisher Vector [Perronnin et al, ECCV10]
• Supervector [X. Zhou, K. Yu, T. Zhang et al, ECCV10]
• HG [X. Zhou et al, ECCV09]
Pooling Towards Image-Level Representation: A Big Model
The dimension becomes C (#components) x d (#feature dims).
For example, if C = 1000 and d = 128, the final dimension is 128K: 100+ times longer than that from SC or VQ! A pooling sketch follows.
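A simplified super-vector-style pooling sketch (the exact subtraction, scaling, and normalization factors of the Fisher vector and super vector differ; this keeps only the shared structure):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def supervector_like_pooling(X, gmm):
    """Pool descriptors X (n_desc x d): for each GMM component, aggregate
    posterior-weighted residuals around the component mean, then
    normalize and concatenate, giving a C x d dimensional vector."""
    post = gmm.predict_proba(X)                  # (n_desc, C) soft assignments
    parts = []
    for k in range(gmm.n_components):
        resid = X - gmm.means_[k]                # residuals to component mean
        u = (post[:, [k]] * resid).sum(axis=0)   # pooled residual, length d
        u /= np.sqrt(post[:, k].sum() + 1e-12)   # one common scaling choice
        parts.append(u)
    v = np.concatenate(parts)
    return v / (np.linalg.norm(v) + 1e-12)

# gmm = GaussianMixture(n_components=1024).fit(training_descriptors)
# feature = supervector_like_pooling(image_descriptors, gmm)  # C x d dims
```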
Very Long Vectors as Feature Representation
We can generate very long image feature vectors as discussed above. The strong feature we used for ImageNet LSVRC 2010:
– Dense sampling: LBP + HOG, feature dim = 100 (after PCA)
– GMM with 1024 components
– 4 spatial grids (1 + 3x1)
– Dimension of image feature: 100 x 1024 x 4 = 0.41M
[Pipeline: LBP and HOG local features → GMM pooling → image-level feature]
How Do We Train Such Big Models?
For Small Datasets: Use the Kernel Trick!
Kernel trick: the classifier can be learned and evaluated through pairwise similarities K(x_i, x_j) alone, so training only needs the N x N kernel matrix rather than the explicit (very long) feature vectors.
• 10K images => kernel matrix: 10K x 10K ~ 100M entries
• Computational complexity depends on the size of the kernel matrix, i.e., on the number of samples, which here is less than the feature dimension.
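A minimal precomputed-kernel sketch with scikit-learn (the linear kernel and random data are placeholders; any positive semi-definite kernel works):

```python
import numpy as np
from sklearn.svm import SVC

X = np.random.randn(200, 4096)        # stand-in for long image features
y = np.random.randint(0, 2, 200)

K = X @ X.T                           # N x N kernel matrix
clf = SVC(kernel="precomputed").fit(K, y)

# At test time we only need kernel values between test and training samples
X_test = np.random.randn(10, 4096)
pred = clf.predict(X_test @ X.T)      # (n_test, n_train) kernel block
```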
We tried nonlinear kernels for face verification and got good results on the LFW dataset: Learning Locally-Adaptive Decision Functions for Person Verification, CVPR'13 (with Z. Li, S. Chang, F. Liang, T. Huang, J. Smith).
For Large Datasets: Use Stochastic Gradient Descent
• Suppose we are working on ImageNet data using the 0.4M-dimensional feature vectors.
• Total training data: 1.2M x 0.4M ~ 0.5T real values!
– Too big to load into memory
– Too many samples to use kernel tricks
• Solution: Stochastic Gradient Descent (SGD)
– Idea: estimate the gradient on a randomly picked sample
– Compared with gradient descent, which averages the gradient over all N samples per update, SGD updates with the gradient of a single random sample, making each step independent of the dataset size.
SGD Can Be Very Simple To Implement
Shai Shalev-Shwartz's Pegasos algorithm gives a roughly 10-line binary SVM solver; the key ingredient is a decreasing learning rate.
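A Pegasos-style sketch (labels in {-1, +1}; the optional projection step of the original algorithm is omitted):

```python
import numpy as np

def pegasos_svm(X, y, lam=1e-4, n_iters=100_000, seed=0):
    """SGD for a binary linear SVM: pick one random sample per step and
    use the decreasing learning rate eta_t = 1 / (lam * t)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, n_iters + 1):
        i = rng.integers(len(X))
        eta = 1.0 / (lam * t)                   # decreasing learning rate
        if y[i] * (w @ X[i]) < 1:               # hinge loss is active
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:                                   # only the regularizer acts
            w = (1 - eta * lam) * w
    return w
```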
Deep CNN and Related Tech
Deep CNN: A Bigger Model
Motivated by the studies of [Krizhevsky et al, NIPS12] and [Y. LeCun et al, PIEEE98], the deep convolutional neural network (CNN) has become the newest winner of the ImageNet competition. The most popular CNN has:
– 5 convolutional layers to learn filters
– 2 fully connected layers
– 60 million parameters
– Stochastic gradient descent for training (again)
Why can we train such a big model now, and not in the 1990s?
– The rise of big datasets (ImageNet)
– The blessing of GPU computing
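A rough architecture sketch in PyTorch (a modern framework used here purely for illustration; local response normalization, grouped convolutions, and dropout from the original network are omitted):

```python
import torch.nn as nn

class AlexNetLike(nn.Module):
    """Krizhevsky-style network: 5 conv layers that learn filters,
    2 fully connected layers, and a final classification layer."""
    def __init__(self, n_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, 11, stride=4), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(96, 256, 5, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(3, 2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
            nn.Linear(4096, n_classes),
        )

    def forward(self, x):          # x: (batch, 3, 227, 227)
        return self.classifier(self.features(x))
```

Such a network is trained with (mini-batch) SGD, just as on the previous slide.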
Learning Representations From Big Data
Computer vision researchers have seen big performance jumps on large scale datasets like ImageNet. Even earlier, researchers in speech/acoustics saw similar success in LVCSR and related tasks. In another field, text/NLP researchers are also moving quickly to large scale learning: for example, the IBM Watson system used thousands of sub-systems to beat the human champions in the Jeopardy! game.
Watson is hiring (www.ibm.com/watsonjobs)! In particular, we are looking for winter interns to work on vision + NLP problems; contact [email protected].
Conclusion
The mutual evolution of big data and big models:
• Bigger and bigger models: histogram → sparse coding (10K parameters) → super vector / Fisher vector (0.4M parameters) → deep CNN (60M parameters)
• Bigger and bigger datasets: small (e.g., Caltech101, 8K images) → medium (e.g., PASCAL, 10K+) → large (e.g., ImageNet, 1.2M)
Motivating questions:
- How to develop scalable solutions for big data?
- How to deal with situations with limited labeled data?
Please see the following talks for the answers!