Compact Representation of Visual Data (BOW, Fisher Vector & VLAD)
We try to understand...
● What is a compact code?
● Why? Its applications
● A couple of such codes: BoV, FV, VLAD, Classemes
● Their application in large-scale image search and classification
Compact Code
● Code: the descriptor (real or binary) that represents an entity/instance
− E.g. entity: message, document, image or video
− E.g. descriptor: BoV, FV, VLAD
● Compact code: a code that is represented efficiently (small memory footprint and easy to search)
Example Descriptor
BoF
[ Figure from SE263:Video Analytics by R Venkatesh Babu ]
BoF
[ Figure from Kristen Grauman's website ]
Example Descriptor
HoG
[ Figure from SE263:Video Analytics by R Venkatesh Babu ]
Example Descriptor
VLAD
[ Figure from Jégou et al., PAMI 2011 ]
Applications (in image/video processing)
● CBIR, large-scale image and video search
● Object recognition
● Image/Video Annotation/Classification
● Event detection
● Detecting partial image duplicates on the web and
deformed copies
Goal and Challenges
● Problem addressed: large-scale image search
− Finding images representing the same object/content
● Constraints:
− Search accuracy
− Efficiency (Search time)
− Memory usage
BOV Model
BoF/BoW
● The success of the BoW model is due to:
− Powerful local descriptors like SIFT
− Comparison is easy (works with standard distances)
− High dimensionality → sparse vectors → inverted lists
can be employed
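To make the last point concrete, here is a minimal, illustrative Python sketch of an inverted list over sparse BoW histograms (names such as bow_db are hypothetical, not from the slides):

from collections import defaultdict

# Minimal inverted-index sketch for sparse BoW histograms (illustrative only).
# bow_db: {image_id: {word_id: count}}, storing only the non-zero bins.
def build_inverted_index(bow_db):
    index = defaultdict(list)               # word_id -> [(image_id, count), ...]
    for image_id, hist in bow_db.items():
        for word_id, count in hist.items():
            index[word_id].append((image_id, count))
    return index

def search(index, query_hist):
    scores = defaultdict(float)             # image_id -> accumulated similarity
    for word_id, q_count in query_hist.items():
        for image_id, count in index.get(word_id, []):
            scores[image_id] += q_count * count   # unnormalized dot product
    return sorted(scores.items(), key=lambda kv: -kv[1])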
Image representation with Fisher Vector for
Semantic Classification and Retrieval
Motivation: Why?
● Consider the BoV representation
● Computing the representation is very expensive
− For each feature, we need the distance to all K cluster centers
− Runtime: O(NKd)
− N: number of features (~10^4 per image, e.g. SIFT)
− K: number of centers (~1000, say, for recognition)
− d: feature dimension (~100 for SIFT)
● In total, on the order of 10^9 multiplications per image to obtain a histogram of 1000 bins (a small sketch of this step follows)
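A minimal numpy sketch of this hard-assignment step (illustrative variable names); the full distance matrix is exactly the O(NKd) cost mentioned above:

import numpy as np

def bov_histogram(descriptors, centers):
    # descriptors: (N, d) local features (e.g. SIFT); centers: (K, d) visual words.
    # Distances from every feature to every center: the O(N*K*d) step.
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                              # hard assignment
    hist = np.bincount(nearest, minlength=len(centers)).astype(float)
    return hist / hist.sum()                                 # L1-normalized BoV histogram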
BOV Model
[ Figure: from http://lear.inrialpes.fr/~verbeek/MLCR.11.12.php by Jakob Verbeek]
Motivation: Why?
● Towards a more efficient representation (limitations of BoV)
− BoV stores only the number of features assigned to each word (0th-order statistics)
− Increasing the number of words directly increases the computations
− It also leads to many empty bins and redundancy
[ Figure: from http://lear.inrialpes.fr/~verbeek/MLCR.11.12.php by Jakob Verbeek ]
Motivation: Why?
● Even when the counts are the same, the position and variance of the points in the cell can vary
[ Figure: from http://lear.inrialpes.fr/~verbeek/MLCR.11.12.php by Jakob Verbeek]
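A small, made-up numerical illustration of this point (the numbers are assumptions, not from the slides): two sets of descriptors can fall into the same cell in equal numbers, yet their mean and variance differ, and the BoV count alone cannot tell them apart:

import numpy as np

rng = np.random.default_rng(0)
center = np.zeros(2)                                    # one visual word / cell center

cell_a = center + rng.normal(0.1, 0.2, size=(50, 2))    # tight cluster near the center
cell_b = center + rng.normal(0.8, 0.6, size=(50, 2))    # spread-out cluster, offset

print(len(cell_a), len(cell_b))                         # BoV sees only the counts: 50 50
print((cell_a - center).mean(axis=0), cell_a.var(axis=0))   # 1st/2nd order stats differ
print((cell_b - center).mean(axis=0), cell_b.var(axis=0))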
A slight deviation...
● Pattern classification techniques can be divided into
− Generative approaches
− Discriminative approaches
● Generative: focuses on modeling the class-conditional probability density functions p(x|y)
● Discriminative: focuses directly on the problem of interest: classification
Discriminative vs. generative methods
● Generative methods
● Say X is the feature and Y the label (simple two-class case)
● Model the class-conditional probabilities p(x|C1) and p(x|C2)
● Estimate the prior probabilities p(y)
● Use Bayes' rule to infer the class, given the input
[ Figure: from http://lear.inrialpes.fr/~verbeek/MLCR.11.12.php by Jakob Verbeek]
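Spelled out for the two-class case (standard Bayes' rule, added here for completeness):
p(C1 | x) = p(x | C1) p(C1) / [ p(x | C1) p(C1) + p(x | C2) p(C2) ]
and x is assigned to C1 whenever p(C1 | x) > p(C2 | x).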
Discriminative vs. generative methods
● Discriminative
● Directly estimate class probability given input: p(y|x)
● Some methods have no probabilistic interpretation,
● e.g. fit a function f(x) and assign to class 1 if f(x) > 0, and to class 2 if f(x) < 0
[ Figure: from http://lear.inrialpes.fr/~verbeek/MLCR.11.12.php by Jakob Verbeek]
Fisher Vector Principles
● Fisher kernels: combine the benefits of generative and discriminative approaches
● Fit a probabilistic model p(X; θ) to the data
● p is a pdf whose parameters are denoted by θ
● Characterize the samples X = { x_t; t = 1, ..., N } with the gradient of the log-likelihood:
G_θ(X) = ∇_θ log p(X; θ)
● Intuition: the gradient describes how the parameters should be modified to better fit the data
● The representation is fixed-size: its dimensionality equals the number of parameters of θ, regardless of the number of samples N
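For reference (following Jaakkola and Haussler, cited in the references), the gradients are compared through the Fisher kernel, which normalizes them by the Fisher information matrix F_θ:
K(X, Y) = G_θ(X)^T F_θ^(-1) G_θ(Y),   with F_θ = E[ ∇_θ log p(x; θ) ∇_θ log p(x; θ)^T ]
Writing F_θ^(-1) = L_θ^T L_θ, the Fisher vector is the normalized gradient L_θ G_θ(X), so the kernel reduces to a dot product between fixed-size vectors.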
Fisher Vector Principles
● A GMM is generally used to model the distribution of the SIFT features
Fisher Vector Principles
● In total, a K(1 + 2D)-dimensional representation (K mixture weights, KD means, KD variances)
Fisher Vector Principles (optional slide)
● Generally, a mixture of Gaussians with (assumed) diagonal covariance matrices is used to model the local (SIFT) descriptors
[Figure: Garg V et al., Sparse Discriminative Fisher Vectors in Visual Classification, ICVGIP 2012]
Fisher Vector Principles
[Figure: Garg V et al., Sparse Discriminative Fisher Vectors in Visual Classification, ICVGIP 2012]
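A minimal numpy/scikit-learn sketch of these gradients (means and variances only, weights omitted; the 1/sqrt(w_k) factors follow the usual closed-form approximation for diagonal covariances, and the power/L2 normalization at the end is common post-processing, not something stated on these slides):

import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    # descriptors: (N, D) local features; gmm: fitted diagonal-covariance GaussianMixture.
    N, D = descriptors.shape
    gamma = gmm.predict_proba(descriptors)                    # (N, K) soft assignments
    w, mu, var = gmm.weights_, gmm.means_, gmm.covariances_   # (K,), (K, D), (K, D)
    diff = (descriptors[:, None, :] - mu[None, :, :]) / np.sqrt(var)[None, :, :]
    g_mu = (gamma[:, :, None] * diff).sum(0) / (N * np.sqrt(w))[:, None]
    g_var = (gamma[:, :, None] * (diff ** 2 - 1)).sum(0) / (N * np.sqrt(2 * w))[:, None]
    fv = np.concatenate([g_mu.ravel(), g_var.ravel()])        # 2*K*D dimensions
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                    # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)                  # L2 normalization

# Usage sketch (sizes are assumptions): fit the probabilistic vocabulary, then encode.
# gmm = GaussianMixture(n_components=256, covariance_type='diag').fit(training_descriptors)
# fv = fisher_vector(image_descriptors, gmm)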
BOV vs. FV
● BOV
− Fits k-means clustering to the data
− Represents the image as a histogram of words
− Considers only 0th-order statistics
● FV
− Fits a GMM to the local descriptors
− Represents the image with the derivative of the log-likelihood
− Also considers 1st- and 2nd-order statistics
● Computation
− Both compare N descriptors to K visual words (centers/Gaussians)
● Memory usage
− Higher for FV, by a factor of (2D+1); for K = 1000, ~1 MB (worked example below)
− However, because more information is stored per visual word, the same or better performance can be obtained
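As a worked example (D = 64 is an assumption, matching the PCA-reduced SIFT used later): the FV has K(2D + 1) = 1000 × 129 = 129,000 dimensions for K = 1000, i.e. about 0.5 MB per image in single precision and about 1 MB in double precision, versus a 1000-bin BoV histogram.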
BoV, FV and VLAD
● VLAD : FV :: k-means : GMM clustering (VLAD is the non-probabilistic counterpart of the FV)
References
● F. Perronnin and C. Dance, "Fisher kernels on visual vocabularies for image categorization," CVPR 2007
● http://lear.inrialpes.fr/~verbeek/MLCR.11.12.php
● T. Jaakkola and D. Haussler, "Exploiting generative models in discriminative classifiers," NIPS 1998
● H. Jégou, F. Perronnin, M. Douze, J. Sánchez, C. Schmid, and P. Pérez, "Aggregating local descriptors into a compact image representation," PAMI 2011
VLAD: Aggregating local descriptors into a compact image representation
Jégou et al., CVPR 2010, PAMI 2011
Fisher Vector
● Perronnin et al. [3] applied the Fisher kernel to image classification
● Visual words are modeled with a GMM restricted to diagonal variance matrices (a probabilistic visual vocabulary)
● A d × k dimensional vector is derived, considering only the means or only the variances
● Compared to BoW, fewer visual words are required
− k is varied from 16 to 256
Towards Efficiency
● Performance is achieved by optimizing
− The representation : aggregating local image
descriptors
− Dimensionality reduction of these vectors
− Indexing them
● These are dependent steps
Dimensionality
● High-dimensional representation
− Better exhaustive-search results
− Difficult to index
● Low-dimensional representation
− Can be indexed efficiently
− Lower discriminative power
VLAD: a non-probabilistic Fisher kernel
● Proposed by Jégou et al. in the CVPR 2010 version of the paper
● Each descriptor is assigned to its nearest centroid (learned with k-means), and the residuals (descriptor minus centroid) are accumulated per centroid and concatenated into a K × d vector (a minimal sketch follows after the figure)
VLAD
Images and corresponding VLAD descriptors, for K=16 centroids. The components of the descriptor are represented like SIFT, with negative components in red.
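A minimal numpy sketch of this construction (illustrative names; the centroids are assumed to come from k-means):

import numpy as np

def vlad(descriptors, centroids):
    # descriptors: (N, d) local features; centroids: (K, d) k-means visual words.
    K, d = centroids.shape
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                           # hard assignment, as in BoV
    v = np.zeros((K, d))
    for k in range(K):
        assigned = descriptors[nearest == k]
        if len(assigned):
            v[k] = (assigned - centroids[k]).sum(axis=0)  # accumulate residuals
    v = v.ravel()                                         # K*d-dimensional vector
    return v / (np.linalg.norm(v) + 1e-12)                # L2 normalization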
Dimensionality reduction on local descriptors
● Applying the Fisher Kernel framework directly on local descriptors leads to suboptimal results
● Apply a PCA on the SIFT descriptors to reduce them from 128D to d = 64
● Two reasons may explain the positive impact of this PCA:
1. De-correlated data can be fitted more accurately by a GMM with diagonal covariance matrices
2. The GMM estimation is noisy for the less energetic components, which the PCA removes
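A minimal scikit-learn sketch of this step (random arrays stand in for real SIFT descriptors):

import numpy as np
from sklearn.decomposition import PCA

train_sift = np.random.rand(10000, 128)     # placeholder: SIFT from a training set
image_sift = np.random.rand(500, 128)       # placeholder: SIFT from one image

pca = PCA(n_components=64).fit(train_sift)  # learn the 128-D -> 64-D projection
reduced = pca.transform(image_sift)         # (500, 64), decorrelated components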
Evaluation of the Aggregation Methods
● Evaluation is performed (on the Holidays dataset) without the subsequent indexing step
Evaluation of the Aggregation Methods
● Inferences
− Results are similar if these representations are learned and computed
on the plain SIFT descriptors
− FV+PCA outperforms VLAD by a few points of mAP
− The larger the number of centroids, the better the performance
● For K = 4096 → mAP = 68.9%, which outperforms any result reported for standard BOW on this dataset ([1] reports mAP = 57.2% with a 200k vocabulary)
Comparison of BOW/VLAD/FV