Page 1: Bag-of-features  for category recognition

Bag-of-features for category recognition

Cordelia Schmid

Page 2: Bag-of-features  for category recognition

Bag-of-features for image classification

• Origin: texture recognition
• Texture is characterized by the repetition of basic elements or textons

Julesz, 1981; Cula & Dana, 2001; Leung & Malik, 2001; Mori, Belongie & Malik, 2001; Schmid, 2001; Varma & Zisserman, 2002, 2003; Lazebnik, Schmid & Ponce, 2003

Page 3: Bag-of-features  for category recognition

Texture recognition

[Figure: textured images represented as histograms over a universal texton dictionary]

Julesz, 1981; Cula & Dana, 2001; Leung & Malik, 2001; Mori, Belongie & Malik, 2001; Schmid, 2001; Varma & Zisserman, 2002, 2003; Lazebnik, Schmid & Ponce, 2003

Page 4: Bag-of-features  for category recognition

Bag-of-features for image classification

• Origin: bag-of-words
• Orderless document representation: frequencies of words from a dictionary
• Classification to determine document categories

Page 5: Bag-of-features  for category recognition

Bag-of-features for image classification

[Pipeline: extract regions → compute descriptors → find clusters and frequencies → compute distance matrix → SVM classification]

[Nowak, Jurie & Triggs, ECCV’06], [Zhang, Marszalek, Lazebnik & Schmid, IJCV’07]

Page 6: Bag-of-features  for category recognition

Bag-of-features for image classification

[Pipeline: Step 1: extract regions, compute descriptors; Step 2: find clusters and frequencies; Step 3: compute distance matrix, SVM classification]

[Nowak, Jurie & Triggs, ECCV’06], [Zhang, Marszalek, Lazebnik & Schmid, IJCV’07]

Page 7: Bag-of-features  for category recognition

Bag-of-features for image classification

• Excellent results in the presence of background clutter

[Example images: bikes, books, buildings, cars, people, phones, trees]

Page 8: Bag-of-features  for category recognition

Examples of misclassified images

• Books misclassified as faces, faces, buildings
• Buildings misclassified as faces, trees, trees
• Cars misclassified as buildings, phones, phones

Page 9: Bag-of-features  for category recognition

Bag-of-features for image classification

[Pipeline: Step 1: extract regions, compute descriptors; Step 2: find clusters and frequencies; Step 3: compute distance matrix, SVM classification]

[Nowak, Jurie & Triggs, ECCV’06], [Zhang, Marszalek, Lazebnik & Schmid, IJCV’07]

Page 10: Bag-of-features  for category recognition

Step 1: feature extraction

• Scale-invariant image regions + SIFT
  – Selection of characteristic points

[Detectors: Harris-Laplace, Laplacian]

Page 11: Bag-of-features  for category recognition

Step 1: feature extraction

• Scale-invariant image regions + SIFT
  – Robust description of the extracted image regions

• SIFT [Lowe’99]
  – 8 orientations of the gradient
  – 4×4 spatial grid

[Figure: gradients of an image patch accumulated into a 3D histogram over x, y and orientation]
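
To make the two numbers above concrete, here is a toy NumPy sketch of a SIFT-style descriptor (8 orientation bins over a 4×4 grid, hence 128 dimensions). It is illustrative only: it omits real SIFT's Gaussian weighting and trilinear interpolation, and the function name is our own.

```python
import numpy as np

def sift_like_descriptor(patch):
    """Toy SIFT-style descriptor: gradient magnitudes accumulated into
    8 orientation bins over a 4x4 spatial grid -> 128-dim vector.
    Omits real SIFT's Gaussian weighting and trilinear interpolation."""
    gy, gx = np.gradient(patch.astype(float))        # image gradients
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % (2 * np.pi)           # orientation in [0, 2*pi)
    obin = (ang * 8 / (2 * np.pi)).astype(int) % 8   # 8 orientation bins
    h, w = patch.shape                               # grayscale patch assumed
    desc = np.zeros((4, 4, 8))
    for r in range(h):
        for c in range(w):
            desc[min(4 * r // h, 3), min(4 * c // w, 3), obin[r, c]] += mag[r, c]
    desc = desc.ravel()
    return desc / (np.linalg.norm(desc) + 1e-10)     # normalize
```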

Page 12: Bag-of-features  for category recognition

Step 1: feature extraction

• Scale-invariant image regions + SIFT
  – Selection of characteristic points
  – Robust description of these characteristic points
  – Affine invariant regions give “too” much invariance
  – Rotation invariance in many cases “too” much invariance

Page 13: Bag-of-features  for category recognition

Step 1: feature extraction

• Scale-invariant image regions + SIFT
  – Selection of characteristic points
  – Robust description of these characteristic points
  – Affine invariant regions give “too” much invariance
  – Rotation invariance in many cases “too” much invariance

• Dense descriptors
  – Improve results in the context of categories (for most categories)
  – Interest points do not necessarily capture “all” features

Page 14: Bag-of-features  for category recognition

Dense features

- Multi-scale dense grid: extraction of small overlapping patches at multiple scales

- Computation of the SIFT descriptor for each grid cell
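
A minimal sketch of the multi-scale dense grid; the patch sizes and step are illustrative assumptions, not values from the slides.

```python
import numpy as np

def dense_grid_patches(image, sizes=(16, 24, 32), step=8):
    """Yield small overlapping square patches on a regular grid at several
    scales; a descriptor (e.g. SIFT) is then computed for each patch."""
    h, w = image.shape                     # grayscale image assumed
    for size in sizes:
        for y in range(0, h - size + 1, step):
            for x in range(0, w - size + 1, step):
                yield image[y:y + size, x:x + size]
```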

Page 15: Bag-of-features  for category recognition

Step 1: feature extraction

• Scale-invariant image regions + SIFT (see lecture 2)

• Dense descriptors

• Color-based descriptors

• Shape-based descriptors

Page 16: Bag-of-features  for category recognition

Bag-of-features for image classification

[Pipeline: Step 1: extract regions, compute descriptors; Step 2: find clusters and frequencies; Step 3: compute distance matrix, SVM classification]

Page 17: Bag-of-features  for category recognition

Step 2: Quantization

Page 18: Bag-of-features  for category recognition

Step 2: Quantization

Clustering

Page 19: Bag-of-features  for category recognition

Step 2: Quantization

Clustering

Visual vocabulary

Page 20: Bag-of-features  for category recognition

Examples of visual words

Airplanes

Motorbikes

Faces

Wild Cats

Leaves

People

Bikes

Page 21: Bag-of-features  for category recognition

Step 2: Quantization

• Cluster descriptors
  – k-means
  – Gaussian mixture model

• Assign each descriptor to a cluster (visual word)
  – Hard or soft assignment

• Build the frequency histogram

Page 22: Bag-of-features  for category recognition

K-means clustering

• We want to minimize the sum of squared Euclidean distances between points $x_i$ and their nearest cluster centers
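
Written out as an objective (a standard formulation of the sentence above, with cluster centers $m_k$ and $C_k$ the set of points assigned to center $k$):

$$\min_{m_1, \dots, m_K} \; \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - m_k \rVert^2$$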

Algorithm:
• Randomly initialize K cluster centers
• Iterate until convergence:
  – Assign each data point to the nearest center
  – Recompute each cluster center as the mean of all points assigned to it
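
A minimal NumPy sketch of this algorithm; the function name and defaults are our own, not from the slides.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """X: (n, d) array of descriptors. Returns (K, d) centers and the
    (n,) index of the nearest center for each point."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]  # random init
    for _ in range(n_iter):
        # Assignment step: nearest center for every point.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points.
        new = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                        else centers[k] for k in range(K)])
        if np.allclose(new, centers):       # converged
            break
        centers = new
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1)
```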

Page 23: Bag-of-features  for category recognition

K-means clustering

• Local minimum: the solution depends on the initialization

• Initialization is important; run k-means several times and select the best solution (minimum cost)

Page 24: Bag-of-features  for category recognition

From clustering to vector quantization

• Clustering is a common method for learning a visual vocabulary or codebook
  – Unsupervised learning process
  – Each cluster center produced by k-means becomes a codevector
  – Provided the training set is sufficiently representative, the codebook will be “universal”

• The codebook is used for quantizing features
  – A vector quantizer takes a feature vector and maps it to the index of the nearest codevector in the codebook
  – Codebook = visual vocabulary
  – Codevector = visual word
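
A short sketch of the quantization step just described, mapping each descriptor to its nearest codevector and building the normalized frequency histogram; the function name is our own.

```python
import numpy as np

def bag_of_features(descriptors, codebook):
    """Vector-quantize descriptors against the codebook and build the
    normalized frequency histogram over visual words."""
    # Index of the nearest codevector for each descriptor.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()
```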

Page 25: Bag-of-features  for category recognition

Visual vocabularies: Issues

• How to choose vocabulary size?
  – Too small: visual words not representative of all patches
  – Too large: quantization artifacts, overfitting

• Computational efficiency
  – Vocabulary trees (Nister & Stewenius, 2006)

• Soft quantization: Gaussian mixture instead of k-means

Page 26: Bag-of-features  for category recognition

Hard or soft assignment

• K-means: hard assignment
  – Assign each descriptor to the closest cluster center
  – Count the number of descriptors assigned to each center

• Gaussian mixture model: soft assignment
  – Estimate the distance to all centers
  – Sum the soft assignments over all descriptors

• Build the frequency histogram
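
A sketch of the soft-assignment variant using scikit-learn's GaussianMixture; the 100-component vocabulary size and the variable names in the usage comments are placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def soft_bof(descriptors, gmm):
    """Soft-assignment histogram: every descriptor spreads its mass over
    all mixture components according to its posterior probabilities,
    and the contributions are summed over descriptors."""
    posteriors = gmm.predict_proba(descriptors)   # shape (n_descriptors, K)
    hist = posteriors.sum(axis=0)                 # sum over descriptors
    return hist / hist.sum()

# Hypothetical usage: fit a 100-component vocabulary on training descriptors.
# gmm = GaussianMixture(n_components=100).fit(training_descriptors)
# h = soft_bof(image_descriptors, gmm)
```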

Page 27: Bag-of-features  for category recognition

Image representation

[Figure: bag-of-features histogram; x-axis: codewords, y-axis: frequency]

Page 28: Bag-of-features  for category recognition

Bag-of-features for image classification

[Pipeline: Step 1: extract regions, compute descriptors; Step 2: find clusters and frequencies; Step 3: compute distance matrix, SVM classification]

Page 29: Bag-of-features  for category recognition

Step 3: Classification

• Learn a decision rule (classifier) assigning bag-of-features representations of images to different classes

[Figure: bag-of-features vectors of zebra and non-zebra images separated by a decision boundary]

Page 30: Bag-of-features  for category recognition

Classification

• Assign an input vector to one of two or more classes
• Any decision rule divides the input space into decision regions separated by decision boundaries

Page 31: Bag-of-features  for category recognition

Nearest Neighbor Classifier

• Assign label of nearest training data point to each test data point

[Figure: Voronoi partitioning of feature space for 2-category 2-D and 3-D data, from Duda et al.]

Source: D. Lowe

Page 32: Bag-of-features  for category recognition

K-Nearest Neighbors

• For a new point, find the k closest points from the training data
• Labels of the k points “vote” to classify
• Works well provided there is lots of data and the distance function is good

[Figure: k-NN classification with k = 5]

Source: D. Lowe
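
A minimal k-NN classifier matching the description above; names are our own.

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=5):
    """Classify x by a majority vote of its k nearest training points."""
    d2 = ((X_train - x) ** 2).sum(axis=1)   # squared Euclidean distances
    nearest = np.argsort(d2)[:k]            # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]
```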

Page 33: Bag-of-features  for category recognition

Linear classifiers

• Find a linear function (hyperplane) to separate positive and negative examples:

$\mathbf{x}_i$ positive: $\mathbf{w} \cdot \mathbf{x}_i + b \geq 0$
$\mathbf{x}_i$ negative: $\mathbf{w} \cdot \mathbf{x}_i + b < 0$

Which hyperplane is best? The SVM (support vector machine) picks the maximum-margin one.
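
As a concrete baseline, a linear SVM on bag-of-features histograms via scikit-learn; the variable names in the usage comments are placeholders, and later slides replace the linear kernel with histogram-based kernels.

```python
from sklearn.svm import LinearSVC

def train_linear_svm(X_train, y_train, C=1.0):
    """Fit a linear SVM on bag-of-features histograms; the learned
    decision rule is sign(w . x + b)."""
    return LinearSVC(C=C).fit(X_train, y_train)

# Hypothetical usage:
# clf = train_linear_svm(hists_train, labels_train)
# scores = clf.decision_function(hists_test)   # values of w . x + b
# predicted = clf.predict(hists_test)
```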

Page 34: Bag-of-features  for category recognition

Functions for comparing histograms

• L1 distance:

$D(h_1, h_2) = \sum_{i=1}^{N} \big|h_1(i) - h_2(i)\big|$

• χ² distance:

$D(h_1, h_2) = \sum_{i=1}^{N} \frac{\big(h_1(i) - h_2(i)\big)^2}{h_1(i) + h_2(i)}$

• Quadratic distance (cross-bin):

$D(h_1, h_2) = \sum_{i,j} A_{ij} \big(h_1(i) - h_2(j)\big)^2$
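
Direct NumPy translations of the three distances; the `eps` guard against empty bins is our addition.

```python
import numpy as np

def l1_dist(h1, h2):
    # D(h1, h2) = sum_i |h1(i) - h2(i)|
    return np.abs(h1 - h2).sum()

def chi2_dist(h1, h2, eps=1e-10):
    # D(h1, h2) = sum_i (h1(i) - h2(i))^2 / (h1(i) + h2(i))
    return ((h1 - h2) ** 2 / (h1 + h2 + eps)).sum()

def quadratic_dist(h1, h2, A):
    # D(h1, h2) = sum_ij A_ij (h1(i) - h2(j))^2; A encodes cross-bin similarity
    diff = h1[:, None] - h2[None, :]
    return (A * diff ** 2).sum()
```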

Page 35: Bag-of-features  for category recognition

Kernels for bags of features

• Histogram intersection kernel:

$I(h_1, h_2) = \sum_{i=1}^{N} \min\big(h_1(i), h_2(i)\big)$

• Generalized Gaussian kernel:

$K(h_1, h_2) = \exp\Big(-\frac{1}{A} D(h_1, h_2)^2\Big)$

• $D$ can be the Euclidean distance, the χ² distance, the Earth Mover’s Distance, etc.
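
Sketches of the two kernels; the distance function and the scaling parameter A are passed in, and the function names are our own.

```python
import numpy as np

def hist_intersection(h1, h2):
    # I(h1, h2) = sum_i min(h1(i), h2(i))
    return np.minimum(h1, h2).sum()

def generalized_gaussian_kernel(h1, h2, dist, A):
    # K(h1, h2) = exp(-(1/A) * D(h1, h2)^2); dist can be the Euclidean
    # distance, the chi-square distance, the Earth Mover's Distance, ...
    return np.exp(-dist(h1, h2) ** 2 / A)
```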

Page 36: Bag-of-features  for category recognition

Chi-square kernel

• Multi-channel chi-square kernel:

$K(H_i, H_j) = \exp\Big(-\sum_{c} \frac{1}{A_c} D_c(H_i, H_j)\Big)$

  – Channel $c$ is a combination of detector and descriptor
  – $D_c(H_i, H_j)$ is the chi-square distance between histograms:

    $D_c(H_1, H_2) = \frac{1}{2} \sum_{i=1}^{m} \frac{\big[h_1(i) - h_2(i)\big]^2}{h_1(i) + h_2(i)}$

  – $A_c$ is the mean value of the distances between all training samples
  – Extension: learning of the weights, for example with MKL (multiple kernel learning)
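
A sketch of the multi-channel kernel as a precomputed Gram matrix for an SVM. Setting each A_c to the mean of the full distance matrix (including the zero diagonal) is a simplification, and the channel matrices in the usage comments are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

def multichannel_chi2_kernel(channel_dists):
    """channel_dists: list of (n, n) chi-square distance matrices, one per
    channel (detector/descriptor combination). Each A_c is set to the mean
    of that channel's distance matrix."""
    s = np.zeros_like(channel_dists[0])
    for D in channel_dists:
        s += D / D.mean()    # (1 / A_c) * D_c
    return np.exp(-s)

# Hypothetical usage with two precomputed channels and labels y:
# K_train = multichannel_chi2_kernel([D_sift, D_color])
# clf = SVC(kernel="precomputed").fit(K_train, y)
```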

Page 37: Bag-of-features  for category recognition

Pyramid match kernel

• Weighted sum of histogram intersections at multiple resolutions (linear in the number of features instead of cubic)

• Approximates the optimal partial matching between sets of features

Page 38: Bag-of-features  for category recognition

Pyramid Match

[Figure: two feature sets matched via histogram intersection at one pyramid level]

Page 39: Bag-of-features  for category recognition

Pyramid Match

• The difference in histogram intersections across levels counts the number of new pairs matched:

$N_i = I\big(H_i(X), H_i(Y)\big) - I\big(H_{i-1}(X), H_{i-1}(Y)\big)$

(matches at this level minus matches at the previous level)

Page 40: Bag-of-features  for category recognition

Pyramid match kernel

• Weights inversely proportional to bin size, measuring the difficulty of a match at level $i$: $w_i = 1/2^i$

• Kernel over two histogram pyramids: weighted sum of the number of newly matched pairs at each level, $K_\Delta = \sum_i w_i N_i$

• Normalize kernel values to avoid favoring large sets

Page 41: Bag-of-features  for category recognition

Example pyramid match: Level 0

Page 42: Bag-of-features  for category recognition

Example pyramid match: Level 1

Page 43: Bag-of-features  for category recognition

Example pyramid match: Level 2

Page 44: Bag-of-features  for category recognition

Example pyramid match

[Figure: pyramid match score compared to the optimal match]

Page 45: Bag-of-features  for category recognition

Summary: Pyramid match kernel

• Approximates the optimal partial matching between sets of features

• $N_i$: number of new matches at level $i$; $w_i$: difficulty of a match at level $i$

$K_\Delta(X, Y) = \sum_i w_i N_i$
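
A 1-D toy sketch of the pyramid match kernel. The real kernel bins a multi-dimensional feature space and normalizes the result; the value range and level count here are assumptions.

```python
import numpy as np

def pyramid_match(X, Y, L=4, lo=0.0, hi=1.0):
    """Pyramid match kernel sketch for 1-D feature sets X, Y with values in
    [lo, hi]. Level 0 has the finest bins; bin size doubles at each level,
    and the N_i new matches at level i get weight w_i = 1/2**i.
    The set-size normalization is omitted for brevity."""
    k, prev = 0.0, 0
    n_bins = 2 ** L
    for i in range(L + 1):
        edges = np.linspace(lo, hi, n_bins + 1)
        hx, _ = np.histogram(X, bins=edges)
        hy, _ = np.histogram(Y, bins=edges)
        inter = np.minimum(hx, hy).sum()   # matches at this level
        k += (inter - prev) / 2 ** i       # newly matched pairs, weighted
        prev = inter
        n_bins //= 2                       # bins double in size
    return k
```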

Page 46: Bag-of-features  for category recognition

Spatial pyramid matching

• Add spatial information to the bag-of-features

• Perform matching in 2D image space

[Lazebnik, Schmid & Ponce, CVPR 2006]

Page 47: Bag-of-features  for category recognition

Related work

Similar approaches:

• Subblock description [Szummer & Picard, 1997]
• SIFT [Lowe, 1999]
• GIST [Torralba et al., 2003]

Page 48: Bag-of-features  for category recognition

Spatial pyramid representation

Locally orderless representation at several levels of spatial resolution

[Grid: level 0]

Page 49: Bag-of-features  for category recognition

Spatial pyramid representation

Locally orderless representation at several levels of spatial resolution

[Grids: level 0, level 1]

Page 50: Bag-of-features  for category recognition

Spatial pyramid representation

Locally orderless representation at several levels of spatial resolution

[Grids: level 0, level 1, level 2]

Page 51: Bag-of-features  for category recognition

Spatial pyramid matching

• Combination of spatial levels with the pyramid match kernel [Grauman & Darrell’05]
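
A sketch of the spatial pyramid representation: per-cell visual-word histograms concatenated over levels, with the level weights from Lazebnik et al. The function name and the normalization of positions to [0, 1) are our choices.

```python
import numpy as np

def spatial_pyramid(points, words, K, L=2):
    """Spatial pyramid sketch: concatenate visual-word histograms computed
    over 2^l x 2^l image grids for levels l = 0..L.
    points: (n, 2) feature positions normalized to [0, 1); words: (n,)
    visual-word indices; K: vocabulary size. Level weights follow
    Lazebnik et al.: 1/2^L at level 0, 1/2^(L-l+1) at level l >= 1."""
    feats = []
    for l in range(L + 1):
        cells = 2 ** l
        weight = 1.0 / 2 ** L if l == 0 else 1.0 / 2 ** (L - l + 1)
        cx = np.minimum((points[:, 0] * cells).astype(int), cells - 1)
        cy = np.minimum((points[:, 1] * cells).astype(int), cells - 1)
        for ix in range(cells):
            for iy in range(cells):
                in_cell = (cx == ix) & (cy == iy)
                feats.append(weight * np.bincount(words[in_cell], minlength=K))
    return np.concatenate(feats)
```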

Page 52: Bag-of-features  for category recognition

Scene classification

Classification accuracy (%):

L          Single-level    Pyramid
0 (1×1)    72.2 ±0.6       -
1 (2×2)    77.9 ±0.6       79.0 ±0.5
2 (4×4)    79.4 ±0.3       81.1 ±0.3
3 (8×8)    77.2 ±0.4       80.7 ±0.3

Page 53: Bag-of-features  for category recognition

Retrieval examples

Page 54: Bag-of-features  for category recognition

Category classification – CalTech101

Classification accuracy (%):

L          Single-level    Pyramid
0 (1×1)    41.2 ±1.2       -
1 (2×2)    55.9 ±0.9       57.0 ±0.8
2 (4×4)    63.6 ±0.9       64.6 ±0.8
3 (8×8)    60.3 ±0.9       64.6 ±0.7

Bag-of-features approach by Zhang et al.’07: 54%

Page 55: Bag-of-features  for category recognition

CalTech101

Easiest and hardest classes

• Sources of difficulty:
  – Lack of texture
  – Camouflage
  – Thin, articulated limbs
  – Highly deformable shape

Page 56: Bag-of-features  for category recognition

Discussion

• Summary
  – Spatial pyramid representation: appearance of local image patches + coarse global position information
  – Substantial improvement over bag of features
  – Depends on the similarity of image layout

• Extensions
  – Integrating different types of features, learning weights, use of different grids [Zhang’07, Bosch & Zisserman’07, Varma et al.’07, Marszalek et al.’07]
  – Flexible, object-centered grid

Page 57: Bag-of-features  for category recognition

Evaluation of image classification

• Image classification task: PASCAL VOC 2007-2009

• Precision-recall curves for evaluation

• Mean average precision (mAP) as a summary measure
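
A simplified average-precision sketch: the mean of the precision values at the positive ranks. The official VOC protocol uses an interpolated variant, so treat this as illustrative.

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one class: mean of the precision values measured at each
    positive example, with examples ranked by decreasing score.
    scores: (n,) classifier scores; labels: (n,) binary ground truth."""
    order = np.argsort(-scores)
    ranked = np.asarray(labels)[order]
    precision = np.cumsum(ranked) / np.arange(1, len(ranked) + 1)
    return precision[ranked.astype(bool)].mean()

# Mean average precision: the mean of the per-class APs, e.g.
# mAP = np.mean([average_precision(s, y) for s, y in per_class_results])
```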