Compact Representation of Visual Data (BOW, Fisher Vector & VLAD)
We try to understand...
● What is a compact code?
● Why? Its applications
● A couple of such codes: BoV, FV, VLAD, Classemes
● Their application in large-scale image search and classification
Compact Code
● Code: the descriptor (real or binary) that represents an entity/instance
− E.g. entity: message, document, image or video
− E.g. descriptor: BoV, FV, VLAD
● Compact code: a code that is represented efficiently (small memory footprint and easy to search)
Example Descriptor
BoF
[ Figure from SE263:Video Analytics by R Venkatesh Babu ]
BoF
[ Figure from Kristen Grauman's website ]
Example Descriptor
HoG
[ Figure from SE263:Video Analytics by R Venkatesh Babu ]
Example Descriptor
VLAD
[ Figure from Jégou et al., PAMI 2011 ]
Applications (in image/video processing)
● CBIR, large-scale image and video search
● Object recognition
● Image/Video Annotation/Classification
● Event detection
● Detecting partial image duplicates on the web and
deformed copies
Goal and Challenges
● Problem addressed: large-scale image search
− Finding images representing the same object/content
● Constraints:
− Search accuracy
− Efficiency (Search time)
− Memory usage
BOV Model
BoF/BoW
● The success of the BoW model is due to:
− Powerful local descriptors like SIFT
− Comparison is easy (works with standard distances)
− High dimensionality → sparse vectors → inverted lists
can be employed
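To make the last point concrete, here is a minimal, illustrative Python sketch of an inverted list over sparse BoW histograms (names such as bow_db are hypothetical, not from the slides):

from collections import defaultdict

# Minimal inverted-index sketch for sparse BoW histograms (illustrative only).
# bow_db: {image_id: {word_id: count}}, storing only the non-zero bins.
def build_inverted_index(bow_db):
    index = defaultdict(list)               # word_id -> [(image_id, count), ...]
    for image_id, hist in bow_db.items():
        for word_id, count in hist.items():
            index[word_id].append((image_id, count))
    return index

def search(index, query_hist):
    scores = defaultdict(float)             # image_id -> accumulated similarity
    for word_id, q_count in query_hist.items():
        for image_id, count in index.get(word_id, []):
            scores[image_id] += q_count * count   # unnormalized dot product
    return sorted(scores.items(), key=lambda kv: -kv[1])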
Image representation with Fisher Vector for
Semantic Classification and Retrieval
Motivation: Why?
● Consider the BoV representation
● Computing the representation is very expensive
− For each feature, we need the distance to all K cluster centers
− Runtime: O(NKd)
− N: number of features (~10^4 per image, e.g. SIFT)
− K: number of centers (~1000, say, for recognition)
− d: feature dimension (~100 for SIFT)
● In total, on the order of 10^9 multiplications per image to obtain a histogram of 1000 bins (a small sketch of this step follows)
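A minimal numpy sketch of this hard-assignment step (illustrative variable names); the full distance matrix is exactly the O(NKd) cost mentioned above:

import numpy as np

def bov_histogram(descriptors, centers):
    # descriptors: (N, d) local features (e.g. SIFT); centers: (K, d) visual words.
    # Distances from every feature to every center: the O(N*K*d) step.
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                              # hard assignment
    hist = np.bincount(nearest, minlength=len(centers)).astype(float)
    return hist / hist.sum()                                 # L1-normalized BoV histogram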
BOV Model
[ Figure: from http://lear.inrialpes.fr/~verbeek/MLCR.11.12.php by Jakob Verbeek]
Motivation: Why?
● Towards a more efficient representation (limitations of BoV)
− BoV stores only the number of features assigned to each word (0th-order statistics)
− Increasing the number of words directly increases the computations
− It also leads to many empty bins and redundancy
[ Figure: from http://lear.inrialpes.fr/~verbeek/MLCR.11.12.php by Jakob Verbeek ]
Motivation: Why?
● Even when the counts are the same, the position and variance of the points in the cell can vary
[ Figure: from http://lear.inrialpes.fr/~verbeek/MLCR.11.12.php by Jakob Verbeek]
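A small, made-up numerical illustration of this point (the numbers are assumptions, not from the slides): two sets of descriptors can fall into the same cell in equal numbers, yet their mean and variance differ, and the BoV count alone cannot tell them apart:

import numpy as np

rng = np.random.default_rng(0)
center = np.zeros(2)                                    # one visual word / cell center

cell_a = center + rng.normal(0.1, 0.2, size=(50, 2))    # tight cluster near the center
cell_b = center + rng.normal(0.8, 0.6, size=(50, 2))    # spread-out cluster, offset

print(len(cell_a), len(cell_b))                         # BoV sees only the counts: 50 50
print((cell_a - center).mean(axis=0), cell_a.var(axis=0))   # 1st/2nd order stats differ
print((cell_b - center).mean(axis=0), cell_b.var(axis=0))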
A slight deviation...
● Pattern classification techniques can be divided into
− Generative approaches
− Discriminative approaches
● Generative: focuses on modeling the class-conditional probability density functions p(x|y)
● Discriminative: focuses directly on the problem of interest: classification
Discriminative vs. generative methods
● Generative methods
● Say X is the feature and Y the label (simple two-class case)
● Model the class-conditional probabilities p(x|C1) and p(x|C2)
● Estimate the prior probabilities p(y)
● Use Bayes' rule to infer the class, given the input
[ Figure: from http://lear.inrialpes.fr/~verbeek/MLCR.11.12.php by Jakob Verbeek]
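Spelled out for the two-class case (standard Bayes' rule, added here for completeness):
p(C1 | x) = p(x | C1) p(C1) / [ p(x | C1) p(C1) + p(x | C2) p(C2) ]
and x is assigned to C1 whenever p(C1 | x) > p(C2 | x).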
Discriminative vs. generative methods
● Discriminative
● Directly estimate class probability given input: p(y|x)
● Some methods have no probabilistic interpretation,
● e.g. fit a function f(x) and assign to class 1 if f(x) > 0, and to class 2 if f(x) < 0
[ Figure: from http://lear.inrialpes.fr/~verbeek/MLCR.11.12.php by Jakob Verbeek]
Fisher Vector Principles
● Fisher kernels: combine the benefits of generative and discriminative approaches
● Fit a probabilistic model p(X; θ) to the data
● p is a pdf whose parameters are denoted by θ
● Characterize the samples X = { x_t; t = 1, ..., N } with the gradient of the log-likelihood:
G_θ(X) = ∇_θ log p(X; θ)
● Intuition: the gradient describes how the parameters should be modified to better fit the data
● The representation is fixed-size: its dimensionality equals the number of parameters of θ, regardless of the number of samples N
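For reference (following Jaakkola and Haussler, cited in the references), the gradients are compared through the Fisher kernel, which normalizes them by the Fisher information matrix F_θ:
K(X, Y) = G_θ(X)^T F_θ^(-1) G_θ(Y),   with F_θ = E[ ∇_θ log p(x; θ) ∇_θ log p(x; θ)^T ]
Writing F_θ^(-1) = L_θ^T L_θ, the Fisher vector is the normalized gradient L_θ G_θ(X), so the kernel reduces to a dot product between fixed-size vectors.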
Fisher Vector Principles
● A GMM is generally used to model the distribution of the SIFT features
Fisher Vector Principles
● In total, a K(1 + 2D)-dimensional representation (K mixture weights, KD means, KD variances)
Fisher Vector Principles (optional slide)
● Generally, a mixture of Gaussians with (assumed) diagonal covariance matrices is used to model the local (SIFT) descriptors
[Figure: Garg V et al., Sparse Discriminative Fisher Vectors in Visual Classification, ICVGIP 2012]
Fisher Vector Principles
[Figure: Garg V et al., Sparse Discriminative Fisher Vectors in Visual Classification, ICVGIP 2012]
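A minimal numpy/scikit-learn sketch of these gradients (means and variances only, weights omitted; the 1/sqrt(w_k) factors follow the usual closed-form approximation for diagonal covariances, and the power/L2 normalization at the end is common post-processing, not something stated on these slides):

import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    # descriptors: (N, D) local features; gmm: fitted diagonal-covariance GaussianMixture.
    N, D = descriptors.shape
    gamma = gmm.predict_proba(descriptors)                    # (N, K) soft assignments
    w, mu, var = gmm.weights_, gmm.means_, gmm.covariances_   # (K,), (K, D), (K, D)
    diff = (descriptors[:, None, :] - mu[None, :, :]) / np.sqrt(var)[None, :, :]
    g_mu = (gamma[:, :, None] * diff).sum(0) / (N * np.sqrt(w))[:, None]
    g_var = (gamma[:, :, None] * (diff ** 2 - 1)).sum(0) / (N * np.sqrt(2 * w))[:, None]
    fv = np.concatenate([g_mu.ravel(), g_var.ravel()])        # 2*K*D dimensions
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                    # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)                  # L2 normalization

# Usage sketch (sizes are assumptions): fit the probabilistic vocabulary, then encode.
# gmm = GaussianMixture(n_components=256, covariance_type='diag').fit(training_descriptors)
# fv = fisher_vector(image_descriptors, gmm)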
BOV vs. FV
● BOV
− Fits k-means clustering to the data
− Represents the image as a histogram of words
− Considers only 0th-order statistics
● FV
− Fits a GMM to the local descriptors
− Represents the image with the derivative of the log-likelihood
− Also considers 1st- and 2nd-order statistics
● Computation
− Both compare N descriptors to K visual words (centers/Gaussians)
● Memory usage
− Higher for FV, by a factor of (2D+1); for K = 1000, ~1 MB (worked example below)
− However, because more information is stored per visual word, the same or better performance can be obtained
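As a worked example (D = 64 is an assumption, matching the PCA-reduced SIFT used later): the FV has K(2D + 1) = 1000 × 129 = 129,000 dimensions for K = 1000, i.e. about 0.5 MB per image in single precision and about 1 MB in double precision, versus a 1000-bin BoV histogram.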
BoV, FV and VLAD
● VLAD : FV :: k-means : GMM clustering (VLAD is the non-probabilistic counterpart of the FV)
References
● F. Perronnin and C. Dance, "Fisher kernels on visual vocabularies for image categorization," CVPR 2007
● http://lear.inrialpes.fr/~verbeek/MLCR.11.12.php
● T. Jaakkola and D. Haussler, "Exploiting generative models in discriminative classifiers," NIPS 1998
● H. Jégou, F. Perronnin, M. Douze, J. Sánchez, C. Schmid, and P. Pérez, "Aggregating local descriptors into a compact image representation," PAMI 2011
VLAD: Aggregating local descriptors into a compact image representation
Jégou et al., CVPR 2010, PAMI 2011
Fisher Vector
● Perronnin et al. [3] applied the Fisher kernel to image classification
● Visual words are modeled with a GMM restricted to diagonal variance matrices (a probabilistic visual vocabulary)
● A d × k dimensional vector is derived, considering only the means or only the variances
● Compared to BoW, fewer visual words are required
− k is varied from 16 to 256
Towards Efficiency
● Performance is achieved by optimizing
− The representation : aggregating local image
descriptors
− Dimensionality reduction of these vectors
− Indexing them
● These are dependent steps
Dimensionality
● High-dimensional representation
− Better exhaustive-search results
− Difficult to index
● Low-dimensional representation
− Can be indexed efficiently
− Lower discriminative power
VLAD: a non-probabilistic Fisher kernel
● Proposed by Jégou et al. in the CVPR 2010 version of the paper
● Each descriptor is assigned to its nearest centroid (learned with k-means), and the residuals (descriptor minus centroid) are accumulated per centroid and concatenated into a K × d vector (a minimal sketch follows after the figure)
VLAD
Images and corresponding VLAD descriptors, for K=16 centroids. The components of the descriptor are represented like SIFT, with negative components in red.
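A minimal numpy sketch of this construction (illustrative names; the centroids are assumed to come from k-means):

import numpy as np

def vlad(descriptors, centroids):
    # descriptors: (N, d) local features; centroids: (K, d) k-means visual words.
    K, d = centroids.shape
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                           # hard assignment, as in BoV
    v = np.zeros((K, d))
    for k in range(K):
        assigned = descriptors[nearest == k]
        if len(assigned):
            v[k] = (assigned - centroids[k]).sum(axis=0)  # accumulate residuals
    v = v.ravel()                                         # K*d-dimensional vector
    return v / (np.linalg.norm(v) + 1e-12)                # L2 normalization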
Dimensionality reduction on local descriptors
● Applying the Fisher Kernel framework directly on local descriptors leads to suboptimal results
● Apply a PCA on the SIFT descriptors to reduce them from 128D to d = 64
● Two reasons may explain the positive impact of this PCA:
1. De-correlated data can be fitted more accurately by a GMM with diagonal covariance matrices
2. The GMM estimation is noisy for the less energetic components, which the PCA removes
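A minimal scikit-learn sketch of this step (random arrays stand in for real SIFT descriptors):

import numpy as np
from sklearn.decomposition import PCA

train_sift = np.random.rand(10000, 128)     # placeholder: SIFT from a training set
image_sift = np.random.rand(500, 128)       # placeholder: SIFT from one image

pca = PCA(n_components=64).fit(train_sift)  # learn the 128-D -> 64-D projection
reduced = pca.transform(image_sift)         # (500, 64), decorrelated components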
Evaluation of the Aggregation Methods
● Evaluation is performed (on the Holidays dataset) without the subsequent indexing step
Evaluation of the Aggregation Methods
● Inferences
− Results are similar if these representations are learned and computed
on the plain SIFT descriptors
− FV+PCA outperforms VLAD by a few points of mAP
− The larger the number of centroids, the better the performance
● For K = 4096 → mAP = 68.9%, which outperforms any result reported for standard BOW on this dataset ([1] reports mAP = 57.2% with a 200k vocabulary)
Comparison of BOW/VLAD/FV