To aggregate or not to aggregate: Selective match kernels for image search
Giorgos Tolias
INRIA Rennes, NTUA
Yannis Avrithis
NTUA
Hervé Jégou
INRIA Rennes
Abstract
This paper considers a family of metrics to compare im-
ages based on their local descriptors. It encompasses the
VLAD descriptor and matching techniques such as Ham-
ming Embedding. Making the bridge between these ap-
proaches leads us to propose a match kernel that takes the
best of existing techniques by combining an aggregation
procedure with a selective match kernel. Finally, the rep-
resentation underpinning this kernel is approximated, pro-
viding a large scale image search both precise and scalable,
as shown by our experiments on several benchmarks.
1. Introduction
This paper is interested in improving visual recognition
of objects, locations and scenes. The best existing ap-
proaches rely on local descriptors [14, 15]. Most of them
inherit from the seminal Bag-of-Words (BOW) representa-
tion [25, 7]. It employs a visual vocabulary to quantize a
set of local descriptors and to produce a single vector that
represents the image. This offers several desirable properties.
For image classification [7], it is compatible with powerful
machine learning techniques such as support vector machines.
In this case, it is usually employed with relatively small visual
vocabularies. In a query-by-content scenario [25], which is the
focus of our paper, large vocabularies make the search efficient
[17, 22, 16], thanks to inverted
file structures [24] that exploit the sparsity of the representa-
tion. The methods relying on these ingredients are typically
able to search in millions of images in a few seconds or less.
Several researchers have built upon this approach to de-
sign better systems. In particular, the search is advanta-
geously refined by re-ranking approaches, which operate on
an initial short-list. This is done by exploiting additional ge-
ometrical information [22, 18, 26] or applying query expan-
sion techniques [6, 27]. This paper focuses on improving
the quality of the initial result set. Re-ranking approaches
are complementary stages that are subsequently applied.
This work was done in the context of the Project Fire-ID, supported
by the Agence Nationale de la Recherche (ANR-12-CORD-0016).
Another important improvement is obtained by reduc-
ing the quantization noise. This is done by multiple assign-
ment [23, 12], or by exploiting a more precise representa-
tion of the individual local descriptors, such as binary codes
in the so-called Hamming Embedding (HE) method [12],
or by integrating some information about the neighborhood
of the descriptor [31]. All these approaches implicitly rely
on approximate pair-wise matching of the query descriptors
with those of the database images.
In a concurrent effort to scale to even larger databases,
recent encoding techniques such as Fisher kernels [19, 21],
local linear coding [30] or the “vector of locally aggregated
descriptors” (VLAD) [13], depart from the initial BOW
framework by introducing alternative encoding schemes.
By compressing the resulting vector representation [13, 20],
the local descriptors are not considered individually. Im-
ages can be represented by a small number of bytes, simi-
lar to coded global descriptors [29], but with the advantage
of preserving some key properties inherited from local de-
scriptors, such as rotation and scale invariance.
Our paper introduces a framework to bridge the gap be-
tween the “matching-based” approaches, such as HE, and
the recent aggregated representations, in particular VLAD.
For this purpose, we introduce in Section 2 a class of match
kernels that includes both matching-based and aggregated
methods for unsupervised image search.
We then discuss and analyze in Section 3 two key differ-
ences between matching-based and aggregated approaches.
First, we consider the selectivity of the matching function,
i.e., the property that a correspondence established between
two patches contributes to the image-level similarity only if
the confidence is high enough. It is explicitly exploited in
matching-based approaches only.
Second, the aggregation (or pooling) operator used in
BoW, VLAD or in the Fisher vector, is not considered in
pure matching approaches such as HE. We show that it is
worth doing it even in matching-based approaches, and dis-
cuss its relationship with other methods (e.g., [11, 21]) in-
troduced to handle the non-iid statistical behavior of local
descriptors, also called the burstiness phenomenon [11].
This leads us to conclude that none of the existing
schemes combines the best ingredients required to achieve
the best possible retrieval quality. As a result, we introduce
a new method that exploits the best of both worlds to pro-
duce a strong image representation and its corresponding
kernel between images. It combines an aggregation scheme
with a selective kernel. This vector representation is ad-
vantageously compressed to drastically reduce the memory
requirements, while also improving the search efficiency.
Section 4 shows that our method significantly outper-
forms the state of the art in a comparable setup, i.e., when
comparing the quality of the initial result set produced when
searching a large collection.
2. A framework for match kernels
This section first describes the class of match kernels that
we will analyze in this paper. This framework encompasses
several popular techniques published in the literature. In the
following, we denote the cardinality of a set A by #A.
Let us assume that an image is described by a set $\mathcal{X} = \{x_1, \ldots, x_n\}$ of $n$ $d$-dimensional local descriptors. The descriptors are quantized by a k-means quantizer
$$ q : \mathbb{R}^d \to \mathcal{C} \subset \mathbb{R}^d, \qquad x \mapsto q(x) \tag{1} $$
where $\mathcal{C} = \{c_1, \ldots, c_k\}$ is a codebook comprising $k = \#\mathcal{C}$ vectors, which are referred to as visual words. We denote by $\mathcal{X}_c = \{x \in \mathcal{X} : q(x) = c\}$ the subset of descriptors in $\mathcal{X}$ that are assigned to a particular visual word $c$. In order to compare two image representations $\mathcal{X}$ and $\mathcal{Y}$, we consider a family of set similarity functions $K$ of the general form
$$ K(\mathcal{X}, \mathcal{Y}) = \gamma(\mathcal{X})\, \gamma(\mathcal{Y}) \sum_{c \in \mathcal{C}} w_c\, M(\mathcal{X}_c, \mathcal{Y}_c), \tag{2} $$
where function M is defined between two sets of descriptors
Xc,Yc assigned to the same visual word. Depending on the
definition of M, the set similarity function K is or is not a
positive-definite kernel.
The scalar $w_c$ is a constant that depends on visual word $c$; for instance, it integrates the inverse document frequency (IDF) weighting term. The normalization factor $\gamma(\cdot)$ is typically computed as
$$ \gamma(\mathcal{X}) = \Big( \sum_{c \in \mathcal{C}} w_c\, M(\mathcal{X}_c, \mathcal{X}_c) \Big)^{-1/2}, \tag{3} $$
such that the self-similarity of an image is $K(\mathcal{X}, \mathcal{X}) = 1$.
Several popular methods of the literature can be described
by the framework of Equation (2).
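As a rough illustration of Equations (2) and (3), the sketch below evaluates the set similarity for match functions that depend only on the cell populations, using the BOW choice M(Xc, Yc) = #Xc × #Yc as the default. The function names, the list-of-word-ids input and the optional `idf` dictionary are assumptions made for this example, not part of the paper.

```python
from collections import Counter
from math import sqrt


def bow_match(count_x, count_y):
    """BOW match function: product of the two cell populations."""
    return count_x * count_y


def set_similarity(words_x, words_y, idf=None, match=bow_match):
    """Normalized set similarity K(X, Y) for count-based match functions.

    words_x, words_y: visual-word ids, one per local descriptor (assumed input).
    idf: optional dict word id -> weight w_c; missing words default to 1.0.
    """
    cx, cy = Counter(words_x), Counter(words_y)

    def weight(c):
        return 1.0 if idf is None else idf.get(c, 1.0)

    def raw(a, b):
        # Only visual words present in both sets can contribute.
        return sum(weight(c) * match(a[c], b[c]) for c in a.keys() & b.keys())

    norm_x, norm_y = raw(cx, cx), raw(cy, cy)
    if norm_x == 0 or norm_y == 0:
        return 0.0
    # gamma(X) * gamma(Y) = 1 / sqrt(raw(X, X) * raw(Y, Y))
    return raw(cx, cy) / sqrt(norm_x * norm_y)
```

With the default match function this reduces to the cosine similarity between the two word histograms, so an image compared with itself scores 1.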
Bag-of-words. The BOW representation [25, 7] represents
each local descriptor x solely by its visual word. As no-
ticed in [3, 12], bag-of-words with cosine similarity can be
expressed in terms of Equation (2), by defining
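$$ M(\mathcal{X}_c, \mathcal{Y}_c) = \#\mathcal{X}_c \times \#\mathcal{Y}_c = \sum_{x \in \mathcal{X}_c} \sum_{y \in \mathcal{Y}_c} 1. \tag{4} $$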
Other comparison metrics are also possible in this frame-
work. For instance, the histogram intersection would use
min(#Xc, #Yc) instead. In the case of max-pooling [4],
M(Xc,Yc) would be equal to 1 if both Xc,Yc are non-
empty, and zero otherwise.
Hamming Embedding (HE) [10, 12] is a matching model that extends BOW by representing each local descriptor $x$ with both its quantized value $q(x)$ and a binary code $b_x$ of $B$ bits. It computes the scores between all pairs of descriptors assigned to the same visual word, as
$$ M(\mathcal{X}_c, \mathcal{Y}_c) = \sum_{x \in \mathcal{X}_c} \sum_{y \in \mathcal{Y}_c} w\big(h(b_x, b_y)\big), \tag{5} $$
where $h$ is the Hamming distance and $w$ is a weighting function that associates a weight to each of the $B + 1$ possible distance values. This function was first defined as binary [10], such that $w(h) = 1$ if $h \le \tau$, and 0 otherwise. A smoother weighting scheme is a better choice [11, 12], such as the (thresholded) Gaussian function [11]
$$ w(h) = \begin{cases} e^{-h^2/\sigma^2}, & h \le \tau \\ 0, & \text{otherwise.} \end{cases} \tag{6} $$
We assume that binary codes lie in the Hamming space $\{-1, +1\}^B$ and use the Hamming inner product
$$ \langle a, b \rangle_h = \frac{a^\top b}{B} = \hat{a}^\top \hat{b} \in [-1, 1] \tag{7} $$
instead of the Hamming distance presented in the original HE paper [10]. Here $\hat{a}$ denotes the $\ell_2$-normalized counterpart of vector $a$. The two choices are equivalent since $2h(a, b) = B(1 - \langle a, b \rangle_h)$.
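The relation between the Hamming distance and the Hamming inner product, together with the thresholded Gaussian weighting, can be checked with a small sketch. Codes are assumed to be Python lists with entries in {-1, +1}; the function names and the values of sigma and tau used in any call are illustrative assumptions.

```python
from math import exp


def hamming_distance(a, b):
    """Number of positions where two equal-length codes differ."""
    return sum(1 for ai, bi in zip(a, b) if ai != bi)


def hamming_inner_product(a, b):
    """Inner product of two {-1, +1} codes divided by the code length B."""
    return sum(ai * bi for ai, bi in zip(a, b)) / len(a)


def gaussian_weight(h, sigma, tau):
    """Thresholded Gaussian weight: exp(-h^2 / sigma^2) if h <= tau, else 0."""
    return exp(-(h * h) / (sigma * sigma)) if h <= tau else 0.0
```

For any pair of codes of length B, `2 * hamming_distance(a, b)` equals `B * (1 - hamming_inner_product(a, b))`, which is why the two formulations can be used interchangeably.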
VLAD [13] aggregates the descriptors associated with a given visual word to produce a $d \times k$ vector representation. This vector is constructed as the concatenation $\mathcal{V}(\mathcal{X}) \propto [V(\mathcal{X}_{c_1}), \ldots, V(\mathcal{X}_{c_k})]$ of $d$-dimensional vectors, where
$$ V(\mathcal{X}_c) = \sum_{x \in \mathcal{X}_c} r(x), \tag{8} $$
and
$$ r(x) = x - q(x) \tag{9} $$
is the residual vector of x. Since the similarity of two
VLADs is measured by the dot product, it is easy to show
that VLAD corresponds to a match kernel of the form pro-
posed in Equation (2):
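$$ \mathcal{V}(\mathcal{X})^\top \mathcal{V}(\mathcal{Y}) = \gamma(\mathcal{X})\, \gamma(\mathcal{Y}) \sum_{c \in \mathcal{C}} V(\mathcal{X}_c)^\top V(\mathcal{Y}_c), \tag{10} $$
where Equation (3) determines the normalization factors. Then it appears that
$$ M(\mathcal{X}_c, \mathcal{Y}_c) = V(\mathcal{X}_c)^\top V(\mathcal{Y}_c) \tag{11} $$
$$ \phantom{M(\mathcal{X}_c, \mathcal{Y}_c)} = \sum_{x \in \mathcal{X}_c} \sum_{y \in \mathcal{Y}_c} r(x)^\top r(y). \tag{12} $$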
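The equality between the dot product of two aggregated residual sums and the double sum over pairwise residual dot products follows from linearity, and can be verified numerically. The sketch below is a minimal check for a single cell; the helper names and the list-based vectors are assumptions for illustration.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def residual(x, centroid):
    """Residual r(x) = x - q(x), with q(x) given as the cell centroid."""
    return [xi - ci for xi, ci in zip(x, centroid)]


def vlad_cell(descriptors, centroid):
    """Sum of the residuals of all descriptors assigned to one cell."""
    total = [0.0] * len(centroid)
    for x in descriptors:
        for i, ri in enumerate(residual(x, centroid)):
            total[i] += ri
    return total


def pairwise_match(xs, ys, centroid):
    """Double sum of residual dot products over all pairs in the cell."""
    return sum(dot(residual(x, centroid), residual(y, centroid))
               for x in xs for y in ys)
```

Calling `dot(vlad_cell(xs, c), vlad_cell(ys, c))` and `pairwise_match(xs, ys, c)` on the same inputs should give the same value up to floating-point rounding.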
The power-law normalization proposed for Fisher vec-
tors [21] is also integrated in this framework by modifying
the definition of V , however it cannot be expanded as Equa-
tion (12). Its effect is similar to burstiness handling in [11].
Burstiness [11] refers to the phenomenon whereby a visual
word appears more times in an image than what a statisti-
cally independent model would predict. It tends to corrupt
the visual similarity measure. Once individual contributions
are aggregated per cell as in the HE model of Equation (5),
one solution is to down-weight highly populated cells.
For instance, one of the most effective burst weighting models of [11] assumes that the outer sum in Equation (5) refers to query descriptors $\mathcal{X}_c$ in the cell and down-weights the inner sum of the descriptors $\mathcal{Y}_c$ of a given database image by $(\#\mathcal{Y}_c(x))^{-1/2}$, where
$$ \mathcal{Y}_c(x) = \{ y \in \mathcal{Y}_c : w(h(b_x, b_y)) \neq 0 \} \tag{13} $$
is the subset of descriptors in $\mathcal{Y}_c$ that match with $x$. A more radical option is $(\#\mathcal{Y}_c(x))^{-1}$, effectively removing multiple matches within cells, similarly to max-pooling [4].
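A minimal sketch of this down-weighting for one cell is given below, assuming binary codes stored as lists and a caller-supplied weighting function of the Hamming distance. The function name and the `power` parameter (0.5 for the square-root variant, 1.0 for the more radical one) are naming choices made for this example.

```python
def hamming_distance(a, b):
    """Number of positions where two equal-length codes differ."""
    return sum(1 for ai, bi in zip(a, b) if ai != bi)


def he_cell_with_burst(codes_x, codes_y, weight, power=0.5):
    """HE cell score where each query descriptor's inner sum is divided
    by (number of database descriptors it matches) ** power."""
    total = 0.0
    for bx in codes_x:
        scores = [weight(hamming_distance(bx, by)) for by in codes_y]
        matched = sum(1 for s in scores if s != 0)
        if matched:
            total += sum(scores) / matched ** power
    return total
```

With four identical database codes matching one query code under a binary weight, the unnormalized score of 4 drops to 2 with `power=0.5` and to 1 with `power=1.0`.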
3. Investigating selectivity and aggregation
The three match kernels presented above share some
similarities, in particular the fact that the set of descriptors
is partitioned into cells and that only vectors lying in the
same cell contribute to the overall similarity. VLAD and
HE have key characteristics that we discuss in this section.
This leads us to explore new possible kernels that are thor-
oughly evaluated in Section 4. We first develop a common
model assuming that full descriptors are available in both
images, i.e., uncompressed, and then consider the case of
binarized representations.
3.1. Towards a common model
The non-aggregated kernels individually match all the el-
ements occurring in the same Voronoi cell. They are defined
as the set of kernels M of the form
$$ M_N(\mathcal{X}_c, \mathcal{Y}_c) = \sum_{x \in \mathcal{X}_c} \sum_{y \in \mathcal{Y}_c} \sigma\big( \phi(x)^\top \phi(y) \big). \tag{14} $$
This equation encompasses all the variants discussed so
far, excluding the burstiness post-processing considered in
Equation (13). Here φ is an arbitrary vector representation
function, possibly non-linear or including normalization,
and σ : R → R is a scalar selectivity function. Options
for these functions are presented in Table 1 and discussed
later in this section.

| Model | $M(\mathcal{X}_c, \mathcal{Y}_c)$ | $\phi(x)$ | $\sigma(u)$ | $\psi(z)$ | $\Phi(\mathcal{X}_c)$ |
|---|---|---|---|---|---|
| BOW (4) | $M_N$ or $M_A$ | $1$ | $u$ | $z$ | $\#\mathcal{X}_c$ |
| HE (5) | $M_N$ | $\hat{b}_x$ | $w\big(\tfrac{B}{2}(1 - u)\big)$ | - | - |
| VLAD (12) | $M_N$ or $M_A$ | $r(x)$ | $u$ | $z$ | $V(\mathcal{X}_c)$ |
| SMK (20) | $M_N$ | $\hat{r}(x)$ | $\sigma_\alpha(u)$ | - | - |
| ASMK (22) | $M_A$ | $r(x)$ | $\sigma_\alpha(u)$ | $\hat{z}$ | $\hat{V}(\mathcal{X}_c)$ |
| SMK⋆ (23) | $M_N$ | $\hat{b}_x$ | $\sigma_\alpha(u)$ | - | - |
| ASMK⋆ (24) | $M_A$ | $r(x)$ | $\sigma_\alpha(u)$ | $\hat{b}(z)$ | $\hat{b}(V(\mathcal{X}_c))$ |

Table 1. Existing and new solutions for the match kernel M. They
are classified as non-aggregated $M_N$ (14) and aggregated kernels
$M_A$ (15), or possibly both. φ(x): scalar or vector representation
of descriptor x. σ(u): scalar selectivity of u, where u is assumed
normalized in [−1, 1]. ψ(z): representation of aggregated descriptor
z per cell. Φ(Xc) (17): equivalent representation of descriptor
set Xc per cell. Given any vector x, we denote by $\hat{x} = x/\|x\|$ its
ℓ2-normalized counterpart.
The aggregated kernels, in contrast, are written as
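$$ M_A(\mathcal{X}_c, \mathcal{Y}_c) = \sigma\Big( \psi\Big( \sum_{x \in \mathcal{X}_c} \phi(x) \Big)^{\!\top} \psi\Big( \sum_{y \in \mathcal{Y}_c} \phi(y) \Big) \Big) \tag{15} $$
$$ \phantom{M_A(\mathcal{X}_c, \mathcal{Y}_c)} = \sigma\big( \Phi(\mathcal{X}_c)^\top \Phi(\mathcal{Y}_c) \big), \tag{16} $$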
where ψ is another vector representation function, again
possibly non-linear or including normalization. Φ(Xc) is
the aggregated vector representation of a set Xc of descrip-
tors in a cell, such that Φ(∅) = 0 and
$$ \Phi(\mathcal{X}_c) = \psi\Big( \sum_{x \in \mathcal{X}_c} \phi(x) \Big). \tag{17} $$
This formulation suggests other potential strategies. In
contrast to Equation (14), there is at most a single match
between aggregated representations Φ(Xc) and Φ(Yc), and
selectivity σ is applied after aggregation.
Of the variants discussed so far, BOW and VLAD both
fit into Equation (15), with σ simply being identity. This is
not the case for HE matching. Note that the aggregation,
i.e., computing Φ(Xc), is an off-line operation.
3.2. Non-aggregated matching SMK
We introduce a selective match kernel (SMK) in this sub-
section. It is motivated by the observation that VLAD em-
ploys a linear weighting scheme in Equation (12) for the
contribution of individual matching pairs (x, y) to M, while
HE applies a non-linear weighting function σ to the similar-
ity φ(x)⊤φ(y) between a pair of descriptor x and y.
[Figure 1 panels: α = 1, τ = 0.0; α = 1, τ = 0.25; α = 3, τ = 0.0; α = 3, τ = 0.25]
Figure 1. Matching features with descriptors assigned to the same
visual word and similarity above the threshold. Examples for different
values of α and τ. Color denotes descriptor similarity defined
by $\sigma_\alpha(\hat{r}(x)^\top \hat{r}(y))$, with yellow corresponding to 0 and red
to the maximum similarity per image pair.
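Choice of selectivity function σ. Without loss of generality, we consider a thresholded polynomial selectivity function $\sigma_\alpha : \mathbb{R} \to \mathbb{R}^+$ of the form
$$ \sigma_\alpha(u) = \begin{cases} \operatorname{sign}(u)\,|u|^\alpha & \text{if } u > \tau \\ 0 & \text{otherwise,} \end{cases} \tag{18} $$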
and typically set α = 3. In all our experiments we have used
τ ≥ 0. It plays the same role as the weighting function w in
Equation (5), applied to similarities instead of distances.
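The thresholded polynomial selectivity function is simple enough to write down directly. The sketch below is one possible transcription of Equation (18); the function name and the default values (α = 3 as the typical setting mentioned above, τ = 0) are choices made for this example.

```python
from math import copysign


def selectivity(u, alpha=3.0, tau=0.0):
    """Thresholded polynomial selectivity: sign(u) * |u|**alpha if u > tau, else 0."""
    if u <= tau:
        return 0.0
    return copysign(abs(u) ** alpha, u)
```

With τ ≥ 0, as used in the experiments, only positive similarities pass the threshold, so the sign term matters only if a negative threshold is chosen. A similarity of 0.5 is mapped to 0.125 for α = 3, which shows how weaker matches are suppressed relative to strong ones.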
Figure 1 shows the effect of this function σα when
matching features between two images, for different values
of the exponent α and of the threshold τ . The descriptor
similarity, now measured by σα, is displayed in different
colors. A larger α increases the selectivity and drastically
down-weights false correspondences. This advantageously
replaces hard thresholding as initially proposed in HE [10].
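Choice of φ. We consider a non-approximate representation of the intermediate vector representation φ(x) in Equation (14), and adopt a choice similar to VLAD by using the ℓ2-normalized residual $\hat{r}(x)$, defined as
$$ \hat{r}(x) = \frac{x - q(x)}{\|x - q(x)\|}. \tag{19} $$
Our SMK kernel is obtained by setting $\sigma = \sigma_\alpha$ and $\phi = \hat{r}$ in Equation (14), as
$$ \mathrm{SMK}(\mathcal{X}_c, \mathcal{Y}_c) = \sum_{x \in \mathcal{X}_c} \sum_{y \in \mathcal{Y}_c} \sigma_\alpha\big( \hat{r}(x)^\top \hat{r}(y) \big). \tag{20} $$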
It differs from HE in that it uses the normalized resid-
ual instead of binary vectors. It also differs from VLAD,
considered as a matching function, by the selectivity func-
tion σ and because we normalize the residual vector. These
differences are summarized in Table 1.
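For a single visual word, the SMK score can be sketched as follows. This is an illustrative transcription of Equations (19) and (20) for uncompressed descriptors stored as Python lists, assuming τ ≥ 0; the function names are not from the paper, and a real system would evaluate this through an inverted file rather than nested loops.

```python
from math import sqrt


def normalized_residual(x, centroid):
    """l2-normalized residual of x with respect to its cell centroid."""
    r = [xi - ci for xi, ci in zip(x, centroid)]
    norm = sqrt(sum(ri * ri for ri in r))
    return [ri / norm for ri in r] if norm > 0 else r


def selectivity(u, alpha=3.0, tau=0.0):
    """Selectivity function, assuming tau >= 0 so that u > tau implies u > 0."""
    return u ** alpha if u > tau else 0.0


def smk_cell(xs, ys, centroid, alpha=3.0, tau=0.0):
    """Sum of selective similarities over all descriptor pairs in one cell."""
    rx = [normalized_residual(x, centroid) for x in xs]
    ry = [normalized_residual(y, centroid) for y in ys]
    return sum(selectivity(sum(a * b for a, b in zip(u, v)), alpha, tau)
               for u in rx for v in ry)
```

Because every pair contributes separately, a cell containing many near-identical descriptors contributes many times to the score, which is the burstiness issue addressed by the aggregated variant.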
3.3. Aggregated selective match kernel ASMK
SMK weights the contributions of individual matches
with a non-linear function. We now propose to apply a se-
lective function after aggregating the different vectors per
cell. Aggregating the vectors per cell has the advantage of
producing a more compact representation.
Our ASMK kernel is constructed as follows. The residual vectors are summed as in VLAD, producing a single representative descriptor per cell. This sum is subsequently ℓ2-normalized. The ℓ2-normalization ensures that the similarity given as input to σ always lies in the range [−1, +1]. It means that
$$ \Phi(\mathcal{X}_c) = \hat{V}(\mathcal{X}_c) = V(\mathcal{X}_c) / \|V(\mathcal{X}_c)\| \tag{21} $$
describes all the descriptors assigned to the cell c. The selectivity function σα is applied after aggregation and normalization, therefore the matching kernel $M_A$ becomes
$$ \mathrm{ASMK}(\mathcal{X}_c, \mathcal{Y}_c) = \sigma_\alpha\big( \hat{V}(\mathcal{X}_c)^\top \hat{V}(\mathcal{Y}_c) \big). \tag{22} $$
The database vectors $\hat{V}(\mathcal{X}_c)$ are computed off-line.
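The aggregation, normalization and selectivity steps for one cell can be sketched as below. This is a minimal illustration under the assumptions that descriptors are plain Python lists, that τ ≥ 0, and that an empty cell maps to the zero vector; the function names are hypothetical.

```python
from math import sqrt


def l2_normalize(v):
    """Return v scaled to unit l2 norm (zero vectors are returned unchanged)."""
    norm = sqrt(sum(vi * vi for vi in v))
    return [vi / norm for vi in v] if norm > 0 else list(v)


def aggregate_cell(descriptors, centroid):
    """l2-normalized sum of residuals of the descriptors in one cell."""
    total = [0.0] * len(centroid)
    for x in descriptors:
        for i, (xi, ci) in enumerate(zip(x, centroid)):
            total[i] += xi - ci
    return l2_normalize(total)


def asmk_cell(xs, ys, centroid, alpha=3.0, tau=0.0):
    """Selectivity applied to the single match between aggregated vectors."""
    vx = aggregate_cell(xs, centroid)
    vy = aggregate_cell(ys, centroid)
    u = sum(a * b for a, b in zip(vx, vy))
    return u ** alpha if u > tau else 0.0
```

Repeating the same descriptor ten times in a cell leaves the normalized aggregate, and hence the score, unchanged, whereas a pairwise kernel would count ten matches.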
Figure 2 illustrates several examples of features that are
aggregated. They commonly correspond to repeated struc-
ture and textured regions. Such bursty features appear in
most urban images, and their matches usually dominate the
image level similarity. ASMK handles this by keeping only
one representative instance of all bursty descriptors, which,
due to normalization, is equal to the normalized mean residual.
Normalization per visual word was recently proposed
by a concurrent work [2] with comparatively small vocabularies.
The choice of normalizing our vector representation
resembles binary BOW [25] or max pooling [4], which both
tackle burstiness by accounting for at most one vote per visual
word. Aggregating without normalizing still allows bursty
features to dominate the total similarity score.
Figure 2. Examples of features mapped to the same visual word, finally being aggregated. Each visual word is drawn with a different color.
Top 25 visual words are drawn, based on the number of features mapped to them.
3.4. Binarization SMK⋆ and ASMK⋆
HE relies on the binary vector bx instead of residual
r(x) = x − q(x). Although the choice of binarization
was adopted for the sake of compactness, a question arises:
What is the performance of the kernel if the full vectors are
employed instead? This is what has motivated us to develop
the SMK and ASMK match kernels, which rely on full d-
dimensional descriptors. However, these kernels are costly
in terms of memory. That is why we also develop their bi-
nary versions (denoted with an additional ∗) in this section.
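SMK⋆ and ASMK⋆. The approximated version SMK⋆ of SMK is similar to HE; the only differences are the inner product formulation and the choice of the selectivity function σα in Equation (18):
$$ \mathrm{SMK}^\star(\mathcal{X}_c, \mathcal{Y}_c) = \sum_{x \in \mathcal{X}_c} \sum_{y \in \mathcal{Y}_c} \sigma_\alpha\big( \hat{b}_x^\top \hat{b}_y \big). \tag{23} $$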
It is an approximation of the full descriptor model of Equa-
tion (20), which uses the binary vector b instead of r.
Similarly, the approximation ASMK⋆ of the aggregated
version ASMK is obtained by binarizing V (Xc) before ap-
plying the selectivity function:
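$$ \mathrm{ASMK}^\star(\mathcal{X}_c, \mathcal{Y}_c) = \sigma_\alpha\Big( \hat{b}\Big( \sum_{x \in \mathcal{X}_c} r(x) \Big)^{\!\top} \hat{b}\Big( \sum_{y \in \mathcal{Y}_c} r(y) \Big) \Big), \tag{24} $$
where b is an element-wise binarization function, b(x) = +1 if x ≥ 0 and −1 otherwise. Note that the residual is here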
computed with respect to the median as in HE, and not the
centroid. Moreover, in SMK⋆ and ASMK⋆ all descriptors
are projected using the same projection matrix as in HE.
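A possible sketch of the binarized aggregated kernel for one cell follows. It assumes the residuals passed in have already been projected and computed with respect to the median as described above, so that step is left to the caller; the function names are illustrative and τ ≥ 0 is assumed.

```python
def binarize(v):
    """Element-wise binarization: +1 if the component is >= 0, else -1."""
    return [1 if vi >= 0 else -1 for vi in v]


def asmk_star_cell(residuals_x, residuals_y, alpha=3.0, tau=0.0):
    """Binarize the summed residuals of each set, then apply the selectivity
    function to the normalized inner product of the two binary codes."""
    if not residuals_x or not residuals_y:
        return 0.0
    sum_x = [sum(col) for col in zip(*residuals_x)]
    sum_y = [sum(col) for col in zip(*residuals_y)]
    bx, by = binarize(sum_x), binarize(sum_y)
    # Dividing by the code length is the same as l2-normalizing both codes.
    u = sum(a * b for a, b in zip(bx, by)) / len(bx)
    return u ** alpha if u > tau else 0.0
```

Only one binary code per cell needs to be stored for each database image, which is where the memory saving over the full-descriptor kernels comes from.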
Remark: In LSH, the Hamming distance gives an esti-
mate of the cosine similarity [5] between original vectors
(through arccos function). The differences with HE are that
(i) LSH is based on a set of random projections, whereas HE
uses a randomly oriented orthogonal basis; (ii) HE binarizes
the vectors according to their projected median values.
4. Experiments
This section describes some implementation details and
introduces the datasets and evaluation protocol used in our
experiments. We further present experiments for measuring
the impact of the kernel parameters, and finally compare
our methods against state-of-the-art methods. Most of our
results are presented without spatial verification or query
expansion (QE) to focus on the quality of the initial ranking,
before re-ranking by these complementary methods.
4.1. Implementation and experimental setup
Datasets. We evaluate the proposed methods on 3 publicly
available datasets, namely Holidays [12], Oxford Buildings
[22] and Paris [23]. The evaluation measure is the mean
Average Precision (mAP). Due to the randomness intro-
duced to the binarized methods (SMK⋆ and ASMK⋆) by
the random projection matrix, the same as the one used in
the original Hamming Embedding, we create 3 independent
inverted files and measure the average performance.
Features. We have used the Hessian-Affine detector to ex-
tract local features. For Oxford and Paris datasets, we have
used the modified Hessian-Affine detector of Perdoch et
al. [18], which includes the gravity vector assumption and
improves retrieval performance. Most of our experiments
use the default detector threshold value. We also consider
the use of lower threshold values to derive larger sets of fea-
tures, and show the corresponding benefit in search quality,
at the cost of a memory and computational overhead.
We use SIFT descriptors and apply component-wise
square-rooting [1, 8]. This has proven to yield superior performance
at no cost. In more detail, we follow the approach of [8],
in which component-wise square-rooting is applied
and the final vector is ℓ2-normalized. We also center
the SIFT descriptors. Our SIFT descriptor post-processing
is the same as the one of Tolias and Jegou [27].
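The square-rooting and normalization step can be sketched as follows. The sketch assumes non-negative SIFT components given as a Python list and leaves out the centering step, whose details are not specified here; the function name is a label chosen for this example.

```python
from math import sqrt


def square_root_and_normalize(descriptor):
    """Component-wise square root followed by l2-normalization.

    Assumes all components are non-negative, as is the case for raw SIFT.
    """
    rooted = [sqrt(v) for v in descriptor]
    norm = sqrt(sum(v * v for v in rooted))
    return [v / norm for v in rooted] if norm > 0 else rooted
```

Square-rooting compresses large components, so a few dominant histogram bins have less influence on distances between descriptors.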
Vocabularies. We have used flat k-means to create our vi-
sual vocabularies. These are always trained on an indepen-
dent dataset, different from the one indexed and used for
evaluation each time. Using visual vocabularies trained on
the evaluation dataset yields superior performance [22, 1]
but is more prone to over-fitting. Vocabularies used for Ox-
ford are trained on Paris, and vice versa, while the ones used
for Holidays are trained on an independent set of images
downloaded from Flickr. Unless stated otherwise, we use a
vocabulary of 65k visual words.
Inverted files. In contrast to VLAD, we apply our meth-
ods with relatively large vocabularies aiming at best per-
formance for object retrieval, and use an inverted file struc-
ture to exploit the sparsity of the BOW based representation.
With SMK and ASMK, each dimension of vectors φ(x) or
Φ(Xc) respectively, is uniformly quantized with 8 bits and
stored in the inverted file. Correspondingly, a binary vector
of 128 dimensions is stored along with SMK⋆ and ASMK⋆.
Multiple assignment. We further combine our proposed
method with multiple assignment (MA), which is applied
on query side only [12]. We replicate each descriptor vector
and assign each instance to a different visual word. When it
is stated that multiple assignment is used in our experiment,
5 nearest visual words are used. Single assignment will be
referred to as SA.
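Query-side multiple assignment amounts to finding the few nearest visual words of each query descriptor. The sketch below uses an exhaustive search over a small codebook stored as a list of lists, which is an assumption for illustration; a 65k-word vocabulary would normally call for an approximate nearest-neighbor search.

```python
def nearest_words(x, codebook, n=5):
    """Indices of the n codebook vectors closest to x (squared l2 distance)."""
    dists = sorted(
        (sum((xi - ci) ** 2 for xi, ci in zip(x, c)), idx)
        for idx, c in enumerate(codebook)
    )
    return [idx for _, idx in dists[:n]]
```

With `n=1` this reduces to single assignment; with `n=5` the descriptor is replicated and one copy is assigned to each of the five returned words.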
Burstiness. The non-aggregated versions of the proposed
methods allow multiple matches for a given visual word.
Thus, we combine them with the intra-image burstiness
normalization [11]. This is done to compare with our ag-
gregated methods which also deal with the burstiness phe-
nomenon. We will refer to burstiness normalization as
BURST in the experiments.
Query expansion. We combine our methods with local
visual query expansion [27] to further improve the performance.
A brief description follows. Similarly to other visual
query expansion methods [6, 1], we apply spatial verification
[22] to the 100 top ranked images in order to identify
the ones that are truly relevant. Images are considered as
verified when they are found to have at least 5 inliers with
single and 8 with multiple assignment. The geometric
transformation estimated for verified images is used
to back-project features of database images to the query
image. Features projected out of the query region are discarded.
We collect visual words of all retained features, sort
them based on the number of verified images in which they
appear and select the top ranked ones. We select them in a
way such that the number of new visual words that are not
present in the query image is equal to the number of original
visual words of the query image. Descriptors assigned
to those visual words are merged with the query features,
and aggregation per visual word is applied once more. The
new expanded query is of the same nature as the original
one and can be issued to the same indexing structure.
[Figure 3 plots: mAP versus α (1 to 7) in four panels, SMK, SMK*, ASMK and ASMK*, with curves for Oxford5k, Paris6k and Holidays.]
Figure 3. Impact of parameter α for SMK and ASMK (left) and
their binarized counterparts (right). In these experiments, τ = 0.
Aggregation. For the aggregated methods, descriptors of
database images are aggregated off-line and then stored in
the inverted file. At query time, query descriptors are
aggregated in the same way. In the case of multiple assignment,
aggregation is similarly applied once the aforementioned
replication of descriptors is performed.
Query expansion uses spatial verification, which requires
the construction of tentative correspondences. In the aggregated
scheme, when two aggregated features are matched,
correspondences are formed between all the original features
of the query and database images that were aggregated into them.
4.2. Impact of the parameters
Parameter α. Figure 3 shows the impact of the parame-
ter α associated with our selectivity function. It controls
the balance between strong and weaker matches. Setting
[Figure 4 plots: mAP versus τ (0 to 0.5) in two panels, single assignment and multiple assignment, with curves for SMK, SMK-BURST and ASMK.]
Figure 4. Impact of threshold value τ on Oxford dataset for SMK,
SMK with burstiness normalization and ASMK. Results for single