Compact Representation of Visual Data (BOW, Fisher Vector & VLAD)

Compact Representation of Visual Data (BOW, Fisher Vector ...val.serc.iisc.ernet.in/DAV/CompactRepresentation.pdf · Evaluation of the Aggregation Methods Inferences − Results are

Feb 23, 2020


Compact Representation of Visual Data

(BOW, Fisher Vector & VLAD)

We try to understand...

● What is a compact code?

● Why? Its applications

● A couple of such codes: BOV, FV, VLAD, Classemes

● Its application in large-scale image search and classification

Compact Code

● Code: the descriptor (real or binary) that represents an entity/instance

− E.g. entity: message, document, image or video

− E.g. descriptor: BoV, FV, VLAD

● Compact code: a code that is represented efficiently (small memory footprint and easy to search)

Example Descriptor

BoF

[ Figure from SE263:Video Analytics by R Venkatesh Babu ]

BoF

[ Figure from Kristen Grauman's website ]

Example Descriptor

HoG

[ Figure from SE263:Video Analytics by R Venkatesh Babu ]

Example Descriptor

VLAD

[ Figure from Jégou et al., PAMI 2011 ]

Applications (in image and video processing)

● Content-based image retrieval (CBIR), large-scale image and video search

● Object recognition

● Image/video annotation and classification

● Event detection

● Detecting partial image duplicates and deformed copies on the web

Goal and Challenges

● Problem addressed: large-scale image search

− Finding images that represent the same object/content

● Constraints:

− Search accuracy

− Efficiency (Search time)

− Memory usage

BOV Model

BoF/BoW

● The success of the BoW model is due to

− Powerful local descriptors such as SIFT

− Easy comparison (works with standard distances)

− High dimensionality → sparse vectors → inverted lists can be employed
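
The link between sparsity and inverted lists can be illustrated with a minimal sketch. The function names `build_inverted_index` and `score_query`, the toy histograms, and the use of a plain dot product as the similarity are all assumptions made for illustration; they are not taken from the slides.

```python
from collections import defaultdict

def build_inverted_index(histograms):
    """Map each visual word to the (image_id, count) pairs where its count is non-zero."""
    index = defaultdict(list)
    for image_id, hist in enumerate(histograms):
        for word, count in enumerate(hist):
            if count:
                index[word].append((image_id, count))
    return index

def score_query(index, query_hist, num_images):
    """Dot-product similarity that only visits the lists of words present in the query."""
    scores = [0] * num_images
    for word, q_count in enumerate(query_hist):
        if q_count:
            for image_id, count in index.get(word, []):
                scores[image_id] += q_count * count
    return scores

# Toy database: three images, a vocabulary of six words, mostly empty bins.
database = [
    [2, 0, 0, 1, 0, 0],
    [0, 0, 3, 0, 0, 0],
    [1, 0, 0, 0, 0, 4],
]
index = build_inverted_index(database)
scores = score_query(index, [1, 0, 0, 2, 0, 0], len(database))
print(scores)
```

Because only the lists of the query's non-zero words are traversed, the work grows with the number of shared words rather than with the vocabulary size.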

Image representation with Fisher Vector for Semantic Classification and Retrieval

Motivation : Why ??

● Consider the BOV representation

● Computing the representation is expensive

− For each feature, we need the distance to every cluster center

− Runtime: O(NKd)

− N: number of features (~10^4 per image, say SIFT)

− K: number of centers (~1000, say, for recognition)

− d: dimension of a feature (~100 for SIFT)

● In total, on the order of 10^9 multiplications per image to obtain a histogram of 1000 bins
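
The cost estimate above can be checked with a small sketch of hard assignment. The helper names, the 2-D toy features and the three centers are hypothetical; only the values N ≈ 10^4, K ≈ 10^3 and d ≈ 10^2 come from the slide.

```python
def squared_distance(a, b):
    """Squared Euclidean distance: one multiplication per dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def bov_histogram(features, centers):
    """Hard-assign every feature to its nearest center and count per center."""
    hist = [0] * len(centers)
    for f in features:
        nearest = min(range(len(centers)), key=lambda k: squared_distance(f, centers[k]))
        hist[nearest] += 1
    return hist

def assignment_cost(n_features, n_centers, dim):
    """Multiplications needed to compare every feature with every center."""
    return n_features * n_centers * dim

centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
features = [(1.0, 1.0), (9.0, 1.0), (0.5, 9.0), (-1.0, 0.0), (8.0, -1.0)]
hist = bov_histogram(features, centers)
cost = assignment_cost(10**4, 10**3, 10**2)
print(hist, cost)
```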

BOV Model

[ Figure: BOV example showing the values 20, 5, 38 and 10, from http://lear.inrialpes.fr/~verbeek/MLCR.11.12.php by Jakob Verbeek ]

Motivation : Why ??

● For a more efficient representation (using BOV)

− BOV stores only the number of features assigned to each word (0th-order statistics)

− Increasing the number of words directly increases the computation

− A larger vocabulary also leads to many empty bins and redundancy

[ Figure: from http://lear.inrialpes.fr/~verbeek/MLCR.11.12.php by Jakob Verbeek]

Motivation : Why ??

● Even when the counts are the same, the position and variance of the points in the cell can vary
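
A toy 1-D example may make this concrete. The two cells below are made-up data; they hold the same number of points, so a count-only representation cannot tell them apart, while the mean and variance clearly differ.

```python
from statistics import mean, pvariance

# Two hypothetical cells with the same number of assigned points.
cell_a = [0.9, 1.0, 1.1, 1.0]   # tightly clustered around 1.0
cell_b = [-1.0, 3.0, 0.0, 6.0]  # spread out, centered around 2.0

summary_a = (len(cell_a), mean(cell_a), pvariance(cell_a))
summary_b = (len(cell_b), mean(cell_b), pvariance(cell_b))
print(summary_a)
print(summary_b)
```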

[ Figure: from http://lear.inrialpes.fr/~verbeek/MLCR.11.12.php by Jakob Verbeek]

A slight digression

● Pattern classification techniques can be divided into

− Generative approaches

− Discriminative approaches

● Generative: focuses on modeling the class-conditional probability density functions p(x|y)

● Discriminative: focuses directly on the problem of interest, classification

Discriminative vs generative methods

● Generative methods

● Say x is the feature and y is the label (simple two-class case)

● Model the class-conditional densities p(x|C1) and p(x|C2)

● Estimate the prior probabilities p(y)

● Use Bayes' rule to infer the class given the input
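
The generative recipe above can be sketched in a few lines. The 1-D Gaussian class models, their parameters and the equal priors are assumptions chosen only to keep the example small.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian, used here as the class-conditional p(x|C)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posterior(x, class_models, priors):
    """Bayes' rule: p(C|x) = p(x|C) p(C) / sum over classes of p(x|C') p(C')."""
    joint = {c: gaussian_pdf(x, m, s) * priors[c] for c, (m, s) in class_models.items()}
    evidence = sum(joint.values())
    return {c: v / evidence for c, v in joint.items()}

# Hypothetical two-class problem: (mean, std) per class and equal priors.
class_models = {"C1": (0.0, 1.0), "C2": (3.0, 1.0)}
priors = {"C1": 0.5, "C2": 0.5}

post = posterior(1.0, class_models, priors)
predicted = max(post, key=post.get)
print(predicted, post)
```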

[ Figure: from http://lear.inrialpes.fr/~verbeek/MLCR.11.12.php by Jakob Verbeek]

Discriminative vs generative methods

● Discriminative methods

● Directly estimate the class probability given the input: p(y|x)

● Some methods have no probabilistic interpretation

● E.g. fit a function f(x), and assign to class 1 if f(x) > 0 and to class 2 if f(x) < 0
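
As one possible instance of "fit a function f(x)", the sketch below uses a perceptron with f(x) = w·x + b. The perceptron is a stand-in chosen for brevity, not a method named in the slides, and the training points are made up.

```python
def fit_perceptron(points, labels, epochs=100):
    """Fit f(x) = w.x + b; labels are +1 (class 1) or -1 (class 2)."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            f = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * f <= 0:  # misclassified (or on the boundary): nudge the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

def classify(w, b, x):
    """Class 1 if f(x) > 0, otherwise class 2."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 2

points = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (-2.0, -2.0), (-3.0, -1.0), (-1.0, -3.0)]
labels = [1, 1, 1, -1, -1, -1]
w, b = fit_perceptron(points, labels)
print(classify(w, b, (2.5, 2.5)), classify(w, b, (-2.0, -2.5)))
```

No class densities or priors are modeled here; the decision comes directly from the sign of f(x).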

[ Figure: from http://lear.inrialpes.fr/~verbeek/MLCR.11.12.php by Jakob Verbeek]

Fisher Vector Principles

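● Fisher kernels combine the benefits of generative and discriminative approaches

● Fit a probabilistic model p(X; θ) to the data

● p is a pdf whose parameters are denoted by θ

● Characterize the samples X = { x_t; t = 1, ..., N } with the gradient vector of the log-likelihood: $\nabla_\theta \log p(X; \theta)$

● Intuition: the gradient indicates how the parameters should change to better fit the data

● Fixed size: its dimension depends only on the number of parameters in θ, not on N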
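
The fixed-size property can be seen on the simplest possible model. The sketch below assumes a single 1-D Gaussian with θ = (mu, sigma) and i.i.d. samples; the sample values are arbitrary.

```python
def gaussian_score(samples, mu, sigma):
    """Gradient of sum_t log N(x_t; mu, sigma^2) with respect to (mu, sigma)."""
    d_mu = sum((x - mu) / sigma**2 for x in samples)
    d_sigma = sum((x - mu) ** 2 / sigma**3 - 1.0 / sigma for x in samples)
    return [d_mu, d_sigma]

# Two sample sets of different size give gradient vectors of the same length.
small = gaussian_score([0.5, 1.5], mu=0.0, sigma=1.0)
large = gaussian_score([0.5, 1.5, -0.2, 0.7, 2.0, 1.1], mu=0.0, sigma=1.0)
print(small, large)

# At the maximum-likelihood mean the mu-component of the gradient vanishes.
at_ml_mean = gaussian_score([1.0, 3.0], mu=2.0, sigma=1.0)
```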

Fisher Vector Principles

● A GMM is the distribution generally used to model the SIFT features

Fisher Vector Principles

● In total, a K(1+2D)-dimensional representation (K Gaussians, D-dimensional descriptors)
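
The K(1+2D) count can be traced to the per-parameter gradients of a diagonal-covariance GMM. The slide's own equations did not survive extraction, so the following is a standard derivation under assumed notation: weights $w_k$, means $\mu_k^d$, standard deviations $\sigma_k^d$, and soft assignments $\gamma_t(k)$. The sum-to-one constraint on the weights is ignored here; enforcing it removes one degree of freedom.

```latex
% Soft assignment of descriptor x_t to Gaussian k
\gamma_t(k) = \frac{w_k \, \mathcal{N}(x_t; \mu_k, \Sigma_k)}
                   {\sum_{j=1}^{K} w_j \, \mathcal{N}(x_t; \mu_j, \Sigma_j)}

% Log-likelihood of the sample set (descriptors assumed independent)
\mathcal{L}(X; \theta) = \sum_{t=1}^{N} \log \sum_{k=1}^{K} w_k \, \mathcal{N}(x_t; \mu_k, \Sigma_k)

% K weight gradients, K D mean gradients, K D standard-deviation gradients
\frac{\partial \mathcal{L}}{\partial w_k} = \sum_{t=1}^{N} \frac{\gamma_t(k)}{w_k}
\qquad
\frac{\partial \mathcal{L}}{\partial \mu_k^d} = \sum_{t=1}^{N} \gamma_t(k) \, \frac{x_t^d - \mu_k^d}{(\sigma_k^d)^2}
\qquad
\frac{\partial \mathcal{L}}{\partial \sigma_k^d} = \sum_{t=1}^{N} \gamma_t(k) \left[ \frac{(x_t^d - \mu_k^d)^2}{(\sigma_k^d)^3} - \frac{1}{\sigma_k^d} \right]
```

Counting components gives K + KD + KD = K(1+2D).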

Fisher Vector Principles (optional slide)

● Generally, a mixture of Gaussians with (assumed) diagonal covariance matrices is used to model the local (SIFT) descriptors

[Figure: Garg V et al., Sparse Discriminative Fisher Vectors in Visual Classification, ICVGIP 2012]

Fisher Vector Principles

[Figure: Garg V et al., Sparse Discriminative Fisher Vectors in Visual Classification, ICVGIP 2012]

BOV vs FV

● BOV

− Fits k-means clustering to the data

− Represents the image as a histogram of words

− Considers only the 0th-order statistics

● FV

− Fits a GMM to the local descriptors

− Represents the image with the derivative of the log-likelihood

− Also considers the 1st- and 2nd-order statistics

● Computation

− Both compare N descriptors to K visual words (centers/Gaussians)

● Memory usage

− Higher for FV: a factor of (2D+1) larger

− For K = 1000, about 1 MB

− However, because more information is stored per visual word, FV can obtain the same or better performance
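
The memory figures can be reproduced with a short calculation. D = 128 (raw SIFT) and 4 bytes per value are assumptions; the slide does not state either.

```python
def fv_dimension(num_gaussians, descriptor_dim):
    """Size of the Fisher vector: one weight, D mean and D variance terms per Gaussian."""
    return num_gaussians * (1 + 2 * descriptor_dim)

def size_in_megabytes(num_values, bytes_per_value=4):
    """Storage for a dense vector, assuming 4-byte floats by default."""
    return num_values * bytes_per_value / 1e6

K, D = 1000, 128          # assumed: 1000 visual words, 128-D SIFT
bov_dim = K               # BOV keeps one count per visual word
fv_dim = fv_dimension(K, D)
fv_mb = size_in_megabytes(fv_dim)
print(bov_dim, fv_dim, fv_mb)
```

Under these assumptions the FV has 257 times as many components as the BOV histogram and takes roughly 1 MB, in line with the slide.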

BoV, FV and VLAD

● VLAD is to FV what k-means is to GMM clustering

References

● F. Perronnin and C. Dance, “Fisher kernels on visual vocabularies for image categorization,” in CVPR, 2007.

● http://lear.inrialpes.fr/~verbeek/MLCR.11.12.php

● T. Jaakkola and D. Haussler, “Exploiting generative models in discriminative classifiers,” in NIPS, 1998.

● H. Jégou, F. Perronnin, M. Douze, J. Sánchez, C. Schmid, and P. Pérez, “Aggregating local descriptors into a compact image representation,” in PAMI, 2011.

VLAD: Aggregating local descriptors into a compact image representation

Jégou et al., CVPR 2010, PAMI 2011

Fisher Vector

● Perronnin et al. [3] applied the Fisher kernel to image classification

● Model visual words with a GMM restricted to diagonal covariance matrices (probabilistic visual vocabulary)

● Derive a d × k dimensional vector by considering only the means or the variances

● Compared to BoW, fewer visual words are required

− k varied from 16 to 256

Towards Efficiency

● Performance is achieved by optimizing

− The representation: aggregating local image descriptors

− Dimensionality reduction of these vectors

− Indexing them

● These steps are interdependent

Dimensionality

High-Dimension

● Better exhaustive search results

● Difficult to index

Low-Dimension

● Indexed efficiently

● Low discriminative power

VLAD: non probabilistic Fisher Kernel

● Proposed by Jégou et al. in the CVPR 2010 paper
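
A minimal sketch of the VLAD aggregation, assuming the commonly described form: assign each descriptor to its nearest k-means centroid, sum the residuals per centroid, concatenate, and L2-normalize. The function name `vlad` and the 2-D toy data are hypothetical, and any further normalization steps used in the papers are omitted.

```python
import math

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def vlad(descriptors, centroids):
    """Sum residuals (x - c) per nearest centroid, flatten to k*d values, L2-normalize."""
    k, d = len(centroids), len(centroids[0])
    residual_sums = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        nearest = min(range(k), key=lambda i: squared_distance(x, centroids[i]))
        for j in range(d):
            residual_sums[nearest][j] += x[j] - centroids[nearest][j]
    flat = [v for row in residual_sums for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat

centroids = [(0.0, 0.0), (10.0, 10.0)]
descriptors = [(1.0, 0.0), (2.0, 0.0), (10.0, 14.0)]
v = vlad(descriptors, centroids)
print(v)
```

Unlike a BOV histogram, the output keeps first-order information (where descriptors lie relative to each centroid), and its components can be negative.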

VLAD

Images and corresponding VLAD descriptors, for K=16 centroids. The components of the descriptor are represented like SIFT, with negative components in red.

Dimensionality reduction on local descriptors

● Applying the Fisher kernel framework directly to local descriptors leads to suboptimal results

● Apply PCA to the SIFT descriptors to reduce them from 128 dimensions to d = 64

● Two reasons may explain the positive impact of this PCA:

1. Decorrelated data can be fitted more accurately by a GMM with diagonal covariance matrices

2. The GMM estimation is noisy for the less energetic components
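
Both reasons can be seen in a 2-D toy version of PCA, used here in place of the 128-D to 64-D case because the 2 × 2 eigenproblem has a closed form. The data points and helper names are made up for illustration.

```python
import math

def covariance_2d(points):
    """Return the mean and the population (co)variances sxx, syy, sxy."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return (mx, my), sxx, syy, sxy

def pca_rotate_2d(points):
    """Center the data and rotate it onto the principal axes of its covariance."""
    (mx, my), sxx, syy, sxy = covariance_2d(points)
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)  # angle of the first principal axis
    c, s = math.cos(theta), math.sin(theta)
    return [((x - mx) * c + (y - my) * s, -(x - mx) * s + (y - my) * c) for x, y in points]

points = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.1), (4.0, 3.8), (5.0, 5.1)]
rotated = pca_rotate_2d(points)
_, var_u, var_v, cov_uv = covariance_2d(rotated)
print(var_u, var_v, cov_uv)
```

After the rotation the covariance is (numerically) zero, which suits a diagonal-covariance GMM, and the second component carries little energy, so dropping it corresponds to the dimensionality reduction step.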

Evaluation of the Aggregation Methods

● Evaluation is performed (on the Holidays dataset) without the subsequent indexing

Evaluation of the Aggregation Methods

● Inferences

− Results are similar if these representations are learned and computed on the plain SIFT descriptors

− FV+PCA outperforms VLAD by a few points of mAP

− The larger the number of centroids, the better the performance

● For K = 4096, mAP = 68.9%, which outperforms any result reported for standard BOW on this dataset ([1] reports mAP = 57.2% with a 200k vocabulary)
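
For readers unfamiliar with the metric, a minimal sketch of mean average precision (mAP) follows. It uses the common definition (mean over queries of the average of precision values at the ranks of relevant results); the exact protocol of the Holidays benchmark may differ in details, and the two toy queries are made up.

```python
def average_precision(ranked_relevance, num_relevant):
    """Average of precision@rank taken at each rank where a relevant item appears."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / num_relevant if num_relevant else 0.0

def mean_average_precision(queries):
    """Mean of the per-query average precision."""
    return sum(average_precision(r, n) for r, n in queries) / len(queries)

# Each query: (relevance flags of the ranked results, number of relevant images).
queries = [
    ([True, False, True, False], 2),   # AP = (1/1 + 2/3) / 2
    ([False, True, False, False], 1),  # AP = 1/2
]
m = mean_average_precision(queries)
print(m)
```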

Comparison of BOW/VLAD/FV
