Efficient Similarity Search via Sparse Coding

Technical Report

Department of Computer Science and Engineering

University of Minnesota

4-192 Keller Hall

200 Union Street SE

Minneapolis, MN 55455-0159 USA

TR 11-028

Efficient Similarity Search via Sparse Coding

Anoop Cherian, Vassilios Morellas, and Nikos Papanikolopoulos

November 21, 2011


Efficient Similarity Search via Sparse Coding

Anoop Cherian, Vassilios Morellas, Nikolaos Papanikolopoulos

{cherian, morellas, npapas}@cs.umn.edu

Abstract

This work presents a new indexing method using sparse coding for fast approximate Nearest Neighbor (NN) retrieval on high dimensional image data. To begin with, we sparse code the data using a learned basis dictionary, and an index of the dictionary's support set is then used to generate one compact identifier for each data point. As the number of basis combinations increases exponentially with the support set size, each data point is likely to get a unique identifier that can be used to index a hash table for fast NN operations. When dealing with real world data, the identifiers corresponding to the query point and the true nearest neighbors in the database seldom match exactly (due to image noise, distortion, etc.). To accommodate these near matches, we propose a novel extension of the framework that utilizes the regularization path of the LASSO formulation to create robust hash codes. Experiments are conducted on large datasets and demonstrate that our algorithm rivals state-of-the-art NN techniques in search time, accuracy and memory usage.

1 Introduction

The majority of vision tasks in recent times require efficient strategies to search for the Nearest Neighbors (NN) of a query point in a database. Examples include, but are not limited to, face recognition, object tracking, multiview 3D reconstruction [38] and video search engines [37]. Image data are generally high dimensional; Scale Invariant Feature Transforms (SIFT) [25], generalized image descriptors [39], shape contexts [30], Histograms of Oriented Gradients (HOG) [5], etc. are a few examples. When the data is high dimensional, sophisticated data structures might need to be designed to make the computation of NNs efficient. The problem is not restricted to computer vision, but manifests itself in many other domains, such as document search, content-based information retrieval [7], multimedia databases [2], etc.

In this paper, we develop algorithms for fast NN operations. Our approach is motivated by recent advances in the theories of sparse signal processing and compressive sensing. The latter deals with the idea of sampling and reconstructing signals that are sparse in a specific overcomplete basis, such that perfect reconstruction can be achieved from only very few samples (as compared to the number of samples prescribed by Shannon's sampling theorem) [8]. For general data


descriptors (e.g., SIFT descriptors), it is non-trivial to find this overcomplete basis dictionary. To meet this challenge, dictionary learning techniques have been suggested [32, 11], whereby a basis dictionary is learned from the data itself by solving an L1-regularized least-squares problem (namely, the LASSO). This idea of dictionary learning has been leveraged for very successful applications such as image denoising, object classification, and recognition [28]. A natural next question, however, is whether sparsity itself can be used to achieve fast NN retrieval on high dimensional data. This paper examines this possibility in detail and proposes algorithms applicable to the visual data domain.

Before proceeding further, we briefly list the primary contributions of this paper:

• We propose a novel tuple representation, the Subspace Combination Tuple (SCT), for high dimensional data using a learned dictionary, which assists in indexing the data vectors through a hash table.

• Utilizing the regularization path of the LARS algorithm for solving the LASSO formulation, we propose a simple and effective method for generating robust hash codes on the above SCTs for fast and accurate NN retrieval.

To set the stage for our discussion of Approximate Nearest Neighbors (ANN), we review NN algorithms and some previous literature in the next section, followed by an overview of dictionary learning in Section 3. Section 4 combines the paradigms of sparsity and NN by introducing our indexable representation of the data. Section 5 describes robustness algorithms for the new representation. The algorithms are analyzed in Section 6, while Section 7 provides details of our experiments. Our conclusions comprise Section 8.

2 Related Work

The general problem of NN search is defined as follows:

Definition 1. Given a dataset D ⊂ Rd, a distance function dist : D × D → R, and a query vector q ∈ Rd, we define the Nearest Neighbor (NN) of q as:

NN(q) := {x′ ∈ D | ∀x ∈ D, dist(q, x′) ≤ dist(q, x)}    (1)

Traditionally, NN retrieval algorithms operate on data organized into multidimensional indexing structures. The techniques generally fall into two categories: (i) space partitioning methods, and (ii) data partitioning methods. The former type of methods partitions the indexing space into subspaces along predefined boundaries without incorporating the distribution of the data within this space. Data structures based on KD-trees [36], grid files [33], etc. belong to this category. Data partitioning methods take into account the distribution of the data along with its spatial organization for better NN performance. R-trees, R*-trees, R+-trees [17], etc. are a few illustrative algorithms based on this idea.


A comprehensive survey of data structure based NN algorithms can be found in [19, 29]. As the dimensionality of the data increases, it is well known that the computational cost of exact NN retrieval grows as O(n|D|) due to the curse of dimensionality [13, 41], making NN retrieval no more efficient than a linear search over all the data points in all the dimensions.

In the recent past, research has focused more on devising Approximate Nearest Neighbor (ANN) algorithms that relax the requirement of retrieving the true nearest neighbor and instead permit a neighbor in the proximity of the true neighbor, while at the same time providing huge gains in computational cost. ANN algorithms fall into two categories: (i) (1 + ǫ)-approximate nearest neighbors and (ii) Locality Sensitive Hashing (LSH). (1 + ǫ)-approximation methods relax the definition in (1) to the following ANN criterion:

ANN(q) := {x′ ∈ D, ǫ > 0 | ∀x ∈ D, dist(q, x′) ≤ (1 + ǫ) dist(q, x)}    (2)

Here, instead of the exact NN, we seek a neighbor x′ of q that lies within an ǫ-ball of the true nearest neighbor. Papers such as [13, 23] propose various heuristics to find such approximate neighbors.

Another intuitive way to improve NN retrieval is to utilize properties of the data such that data points with similar properties are clustered together. LSH casts this idea in a probabilistic setting such that, given a hash table H, a similarity function f : D × D → R and a hash function h, for x, x′ ∈ D,

f(x, x′) ∝ Prob (h(x) = h(x′)) . (3)

That is, similar data points have a greater probability of being hashed to the same hash bucket. Papers such as [12, 6, 18, 21] propose different strategies to build the hash functions for a variety of applications.

A drawback of LSH techniques is that knowledge of the data and its properties is assumed to be given, which is seldom true in practice. For example, spectral hashing [42] assumes that the data comes from a known probability distribution. When dealing with very high dimensional image descriptors, as in the case of computer vision problems, this can be a difficult assumption to accommodate. Another promising technique is to use kernel functions to project the data into low dimensional Hamming spaces. In [44], an SVM based discriminative NN algorithm is suggested. Kernel based hash functions are described in [45]. The basic LSH framework is extended to arbitrary kernel functions in [20], termed Kernelized LSH (KLSH). A semi-supervised metric learning strategy for learning the hash functions is suggested in [40]. A general trend in all these methods is the selection of a set of random hyperplanes onto which the data is projected, later using these projections for hashing in the binary Hamming space. It is often seen that the selection of these hyperplanes has a tremendous impact on the accuracy of the respective algorithm. Deciding on the best ANN algorithm for a task is an important problem that one needs to consider for the application at hand. Muja and Lowe look at this problem in [31] and propose an algorithm combining two popular techniques: (i) hierarchical


k-means and (ii) randomized kd-trees. This hybrid algorithm is well known as FLANN (Fast Library for Approximate Nearest Neighbors). The advantage of this algorithm is that the user can specify the targeted accuracy, and the technique automatically decides the right sub-algorithm and the set of parameters that guarantee this accuracy. A drawback of the method is that when higher precision is required, the computational speedup is not very impressive. Moreover, the method uses a priority search step in which a potential neighbor is compared against possible cluster centroids; for higher accuracy more data points need to be stored in memory, limiting the applicability of the method to large datasets.

A paper that is closer in spirit to our strategy is by Jegou et al. [15]. The paper describes a method to build a set of codebook centroids over the database points based on subspace product quantization. They treat a given high dimensional data vector as being made up of a set of short length codes from different subspaces. They learn a clustering framework based on k-means for each of the subspaces from the training data. Given a query vector, a set of cluster centers at minimum Euclidean distance from the query is found in each of these subspaces, and an inverted file system is used for faster lookup. Our proposed method fundamentally differs from this technique in that, rather than quantizing the data vector as a Cartesian product of multiple short length codes, we learn a dictionary that finds a sparse subspace combination characterizing the data as a whole, thus avoiding redundancy in the representational power of the short length codes. A direct consequence of our approach is that it generates shorter codes than the product quantization strategy.

A related method that was proposed recently is ANN using anchor graphs [24], the anchors being the cluster centroids of an approximate k-means algorithm on the training set. The algorithm embeds the data in a Hamming space by building adjacency matrices via the anchors. The anchors help improve the computational complexity of the graph construction from the training set. An out-of-sample extension of the graph for a query point against the anchors still costs O(k|D|), where k is the length of the hash code, which implies that the algorithm suffers when the dataset scales. Another work related to ours that uses sparse data representations for ANN is by Zepada et al. [43]. In this work, an inverted file system is built from the support set of sparse codes similar to our approach, except for one key difference: our representation utilizes the theoretical properties of the LASSO solution to construct a sparse descriptor that is robust to noise, resulting in better retrieval accuracy.

All of the above hashing algorithms learn linear subspaces onto which the data is projected; the sign of each projection is then utilized to create a bit-vector hash code. At a fundamental level our algorithm is not much different from this basic strategy. We also learn an overcomplete basis set onto which the data is projected to create the hash code, but with the significant difference that we do not use all the bases to hash a given data point (contrary to the popular approach); rather, we select a small set of bases from the large collection using the properties of the data point, with the assumption that data with similar properties will select similar bases. This active basis subset selection is based


on the sparsity of the data. Thus, our approach combines the Hamming embedding strategies of present day state-of-the-art algorithms with a data driven, adaptive hash function selection mechanism to achieve superior NN performance.

3 Background

3.1 Sparse Coding and Dictionary Learning

Compressive sensing deals with the acquisition and reconstruction of signals that are sparse in an appropriate overcomplete basis [9, 4], so that only a very small number of samples (compared to those suggested by the Shannon-Nyquist theorem) are sufficient to reconstruct the signal. The use of overcomplete bases for image models was introduced in [34] for modeling the spatial receptive fields of the mammalian visual cortex. The main motivation behind sparse coding is that the coherence (refer to Definition 2 below) between the dictionary atoms incorporates a degree of non-linearity between the input and output signals; thus sparsifying the input signal will select only those basis elements from the dictionary that are necessary for the signal representation.

Formally, let x ∈ Rn be the input signal, and let Φ ∈ O(n) be an orthonormal basis such that x = Φa, for a sparse coefficient vector a ∈ Rn. Compressive sensing says that given a sensing matrix Ψ ∈ Rd×n (where d < n), a low dimensional sample v ∈ Rd, where v = ΨΦa, is enough to recover the underlying signal x. This reconstruction of x from v can be cast as the following optimization problem:

min_a ‖a‖₀   subject to   ‖v − ΨΦa‖₂² ≤ ǫ,    (4)

where ǫ ≥ 0 accounts for potentially noisy measurements. The formulation (4) has two practical limitations: (i) it is known to be NP-hard; and (ii) knowing the sensing matrix Ψ or the orthonormal matrix Φ is not trivial for a given signal class. The first limitation is tackled by relaxing the ℓ0 quasi-norm to the ℓ1 norm; the resulting problem is convex, and under the so-called restricted isometry conditions, its solution perfectly recovers the desired signal [4]. To mitigate the second issue, dictionary learning methods have been suggested [32], where an incoherent overcomplete dictionary B with column vectors b1, b2, · · · , bn is learned directly from a collection of data samples vk, (k = 1, . . . , N), by solving the following optimization problem:

min_{B,a}  Σ_{i=1}^{N} ‖vi − Σ_j ai^j bj‖₂² + λ‖ai‖₁

subject to ‖bj‖₂ ≤ 1, ∀j ∈ {1, ..., n}    (5)

where the vector ai is called the activation vector or the coefficient vector, and the notation ai^j stands for the jth component of the ith activation vector that


sparsifies the data vector vi. The objective function in (5) balances two terms: (i) the quadratic term minimizes the L2 error between the sparse representation and the data vector vi, and (ii) the L1 term imposes sparsity on the weights ai. The parameter λ regularizes the penalty imposed by the L1 constraint. The problem is convex in either B or ai separately, but not in both together, suggesting an alternating minimization strategy which solves each of the convex subproblems iteratively towards a local minimum. The K-SVD algorithm [1] is one such algorithm popularly used in the dictionary learning literature. Once the overcomplete basis B is learned for the data vectors, the following formulation can be used to find a sparse activation vector w given a data vector v:

gλ(v) := argmin_w ‖v − Bw‖₂² + λ‖w‖₁    (6)

In some parts of this paper, we will refer to (6) in a slightly different form, defined as follows:

gλ(v) := argmin_w ‖w‖₁   subject to   ‖v − Bw‖₂² ≤ δ²    (7)

for a reconstruction error bound δ > 0. It should be noted that (6) and (7) are equivalent and are commonly referred to in the statistics literature as the L1-regularized least squares or LASSO formulation.

A quantity that we frequently allude to in this document is the coherence of the dictionary, which is defined as follows:

Definition 2. Coherence: Suppose B = {b1, b2, · · · , bn} is a dictionary where each bi ∈ Rd, i = 1, 2, · · · , n. Then we define the coherence µ of B as:

µ = max_{i<j} |bi^T bj|, ∀i, j ∈ {1, 2, · · · , n}.    (8)

The angle between the two dictionary bases that correspond to µ is called the minimum angle of this dictionary. Analogously, we define the maximum angle of a dictionary as the angle corresponding to the least correlated bases. We use the notation ∠(p, q) to denote the minimum absolute angle between any two unit norm vectors p and q.
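For concreteness, the coherence in Definition 2 can be computed directly from the Gram matrix of the dictionary; the short numpy sketch below assumes the dictionary is stored column-wise with unit-norm atoms (the dictionary here is random, purely for illustration).

import numpy as np

def coherence(B):
    # Mutual coherence of a dictionary B (d x n, unit-norm columns):
    # the largest absolute inner product between two distinct atoms.
    G = np.abs(B.T @ B)        # absolute correlations between atoms
    np.fill_diagonal(G, 0.0)   # ignore self-correlations
    return G.max()

# toy example: a random 128 x 2048 dictionary with normalized columns
rng = np.random.default_rng(0)
B = rng.standard_normal((128, 2048))
B /= np.linalg.norm(B, axis=0)
print(coherence(B))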

3.2 Least Angle Regression (LARS)

One of the most popular algorithms for solving the LASSO formulation is LARS [10]. Since we use some of the properties of the LARS solution path for generating robust sparse codes, we provide a brief review here. Starting with a zero activation vector, for a given input vector v, the LARS algorithm builds the active basis set in steps, each step finding the basis from the dictionary B that is most correlated with the residual at that step; that is, C = B^T v,


a1 = argmax_j |Cj|, where ak denotes the active set at the kth LARS step (k also being the size of the active basis set). Next, LARS performs a gradient descent in this direction (v′ = v + γ b_{a1}) for a step γ, which is precomputed such that after the descent, the correlation of the residual becomes equal both to the current active set and to another basis bj that is not yet included in the active set. At this step, LARS adds the new basis to the active set, a2 = a1 ∪ {j}, discards the previous descent direction, and starts to follow a descent direction that is equiangular to both the bases in a2. At the kth step, LARS will have added k bases into the active set and will be following a direction that makes equal angles with all k bases in the set so far. In [10], efficient algorithms are suggested to compute the equiangular directions.
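The activation order used later for our SCTs can be read off any LARS implementation that exposes its path. As a hedged illustration (not the implementation used in this report), scikit-learn's lars_path can serve as a stand-in; the order is recovered from the coefficient path by noting the first path point at which each atom becomes nonzero.

import numpy as np
from sklearn.linear_model import lars_path

def activation_order(B, v, max_steps=5):
    # B: d x n dictionary (columns are atoms), v: d-dimensional data vector.
    # Returns the atom indices in the order they enter the LARS/LASSO active set.
    _, _, coefs = lars_path(B, v, method='lasso', max_iter=max_steps)
    first_nz = np.argmax(coefs != 0, axis=1)       # first path point an atom is active
    active = np.flatnonzero(coefs.any(axis=1))     # atoms that ever become active
    return active[np.argsort(first_nz[active])]

rng = np.random.default_rng(1)
B = rng.standard_normal((128, 256))
B /= np.linalg.norm(B, axis=0)
v = B[:, [3, 17, 99]] @ np.array([1.0, 0.7, 0.4])  # synthetic 3-sparse signal
print(activation_order(B, v))                      # atoms printed in activation order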

There are a few important facts about this algorithm that need to be pointed out, as they are important from the perspective of sparse coding for NN retrieval.

P1: For a non-orthonormal dictionary, the coefficients associated with the active bases can decrease in magnitude along the LARS regularization path.

P2: The naive LARS algorithm does not provide a solution to the LASSO problem; the main issue arises when the path of a coefficient of an active basis crosses the axis during a gradient descent update. To mitigate this issue, [10] suggests a proactive check for any possible zero-crossings in the solution paths of the active set, upon which the corresponding basis is temporarily removed from the active set.

In the next section, we propose our tuple representation of the sparse code that enables fast NN retrieval.

4 Subspace Combination Tuple

Once we have a dictionary to describe a given data vector as a sparse linear combination of the basis vectors, the next step is to formulate a representation of the data vectors in terms of their sparsity. Recall that there is a large set of basis vectors (due to the high dimensionality of the data and the overcompleteness of the dictionary), of which only very few need to be active for sparse coding a data vector. Referring to (6), assume that the data vectors v are k-sparse in the dictionary (i.e., have no more than k non-zero sparse coefficients) and the dictionary has n bases. If each basis is equally likely to be in the active set, then we have C = (n choose k) = n!/(k!(n − k)!) unique active basis combinations possible. For a dictionary of size, say, 2048, just having an active set of size 10 leads to approximately 3 × 10^26 unique combinations; a strong motivation for utilizing sparse coding for NN retrieval. In this section, we build on this idea to propose a simple representation for the sparse codes.

Consider the function h : R → Z defined as follows: if wi denotes the ith coordinate of a vector w ∈ Rn, then

h(wi) := i if wi ≠ 0, and h(wi) := 0 otherwise.    (9)


Let hgλ(v) := h ◦ gλ(v), where gλ is the LASSO formulation defined in (6). Overloading notation, h applied to a vector collects the nonzero integer coordinates produced by the coordinate-wise h into a tuple as follows:

h(w) := ⋃_{i=1}^{n} { h(wi) | h(wi) > 0 }    (10)

The crux of our algorithm for NN retrieval via sparse coding is based on the observation that for two data points x and x′:

sim(x, x′) ∝ |hgλ(x) ∩ hgλ(x′)| / |hgλ(x) ∪ hgλ(x′)|,    (11)

that is, the similarity is proportional to the number of subspaces overlapping between the sparse representations of the two data points. The notation | · | refers to set cardinality. This similarity measure is generally referred to as the Jaccard index [14] in the domain of information retrieval.

Utilizing the definitions above, we formulate our tuple representation as follows:

Definition 3. Subspace Combination Tuple (SCT1): Given a data vector v and a dictionary B, suppose w∗ = gλ(v). A tuple of integers h(w∗) = 〈i1, i2, · · · , ik〉 is defined as an SCT if i1, i2, · · · , ik are the unique identifiers of the bases and are arranged in the order in which the bases are activated by the LARS algorithm.

Although the definition of the SCT looks unconventional, its significance will become apparent soon. From an implementation point of view, Definition 3 poses two technical difficulties: (i) it requires a modification of existing LARS implementations so that they provide the indices of the bases activated along the regularization path, and (ii) it does not apply if an algorithm other than LARS, one that does not directly provide a regularization path, is used for sparse coding.

Thus we suggest an approximation to the SCT definition above, which is slightly more expensive to compute but achieves the same effect in practice.

Definition 4. Subspace Combination Tuple (SCT2): Given a data vector v and a dictionary B, suppose w∗ = gλ(v). Let h(w∗) = 〈i1, i2, · · · , ik〉 be a tuple of unique identifiers corresponding to active bases bi1, bi2, · · · , bik with associated coefficients ai1, ai2, · · · , aik respectively. Then h(w∗) is called an SCT if |ai1| ≥ |ai2| ≥ · · · ≥ |aik|. That is, the basis identifiers are arranged in monotonically decreasing order of the absolute sparse coefficients.

In practice, it is likely that the magnitudes of the active coefficients follow the order in which the bases are activated, and thus this formulation provides a good approximation to SCT1. Unfortunately, due to property P1 of LARS (see Section 3.2), this is not guaranteed, in which case we need to take all possible permutations of the integers in SCT2 and query the respective hash buckets (which can be done in parallel). This approach might not scale when the size



Figure 1: (a) Normalized SIFT descriptor, (b) the sparse representation of the SIFT descriptor in (a) using a dictionary of 2048 bases, (c) an inverted file organization: the hash key is the SCT and the hash bucket holds the active coefficients of the bases in the order of their entry in the SCT. The descriptors are arranged in a linked list in case of bucket collisions.

of the support set is large (fortunately, as we will see, we work with smaller support sets).

Representing the data vector as a tuple enables the use of a hash table for fast NN retrieval. Let us see how. Figure 1 illustrates the idea for a typical zero-mean normalized SIFT descriptor. Each column of the dictionary is identified by its location index in the dictionary, and thus a hash key is a set of integers, which can be encoded as a bit stream. To tackle collisions in the hash buckets, the colliding descriptors are organized using an efficient data structure, such as a linked list (or a KD-tree for efficiency).
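One simple way to realize the bit-stream encoding mentioned above is to pack the basis indices of an SCT into a single integer key; the sketch below is only an assumed encoding, not necessarily the exact scheme of our implementation, and allots ⌈log2 n⌉ bits per index. In practice the tuple length should be stored alongside the key, or the tuple itself can be used directly as a dictionary key.

def pack_sct(sct, n_bases=2048):
    # Encode an ordered tuple of basis indices as one integer hash key,
    # using ceil(log2(n_bases)) bits per index (11 bits for 2048 atoms).
    bits = (n_bases - 1).bit_length()
    key = 0
    for i in sct:
        key = (key << bits) | i
    return key

print(pack_sct((5, 91, 700, 1333, 2000)))   # packs 5 indices into 5 * 11 = 55 bits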

Given a query descriptor, we use (6) to first get its sparse activation vector; the SCT is then created from the activations, and the corresponding hash bucket in the table is queried. If there is more than one entry in this hash bucket, then a suitable distance measure (cosine similarity or Euclidean distance) is used to find the descriptor that most closely matches the query. The following theorem relates the Euclidean distance between two data points to their distance in the sparse space.

Figure 2: With reference to Theorem 1: O is the origin, Bα1 and Bα2 are two approximate reconstructions of the two zero-mean data vectors v1 and v2 respectively. The circles (hyperspheres in the general case) of radius δ and centers O1 and O2 represent the loci of the two data points given their approximate reconstructions.

Theorem 1. Let v1, v2 ∈ Rn be two zero-mean data points and let B be a learned dictionary. Suppose α1 and α2 are the sparse coefficient vectors reconstructing v1 and v2 (as per the LASSO formulation (7)) respectively, such that ‖v1 − Bα1‖₂² ≤ δ² and ‖v2 − Bα2‖₂² ≤ δ². Assuming their supports coincide (as they are hashed to the same hash bucket), we have:

‖v1 − v2‖₂ ≤ 2δ + µ‖α1 − α2‖₂    (12)

where µ is the maximum coherence of the dictionary atoms.

Proof. With reference to Figure 2, let Bα1 and Bα2 be the two reconstructions corresponding to the data vectors v1 and v2 respectively. From the figure, it is clear that the two circles (hyperspheres in the general case) with centers O1 and O2 represent the loci of v1 and v2, given their respective reconstruction vectors. The bound follows by applying the triangle inequality along the path v1 → Bα1 → Bα2 → v2 and bounding ‖B(α1 − α2)‖₂ in terms of the coherence of the dictionary atoms. �

4.1 Discussion

Space Partitioning: To provide the reader with intuition into the type of hashing generated by sparse coding, Figure 3 illustrates the partitioning introduced by the bases as the support size increases. Since the data, and thus the dictionary, is assumed to be zero-mean, all the bases are shown passing through the origin. The red points show 2D data points for which a dictionary of size 2 × 7 was learned. A 1-sparse reconstruction is essentially the bases themselves, while higher support sizes introduce all possible two-basis combinations in equiangular directions, such that we have a large number of subspace combinations for each data point. For example, there are 21 bases for the 2-sparse case, as shown in Figure 3(b), and this grows to 35 bases in Figure 3(c). Intuitively, an SCT represents


Figure 3: Data space partitioning induced by sparse coding; panels (a), (b), (c) show the 1-sparse, 2-sparse and 3-sparse cases. The red points represent 2D data points and the lines denote the dictionary bases from a 2 × 7 learned dictionary. As the support size increases, multiple combinations of the dictionary bases in equiangular directions are generated, spanning smaller and smaller subspaces.

one of the lines, and all the data points falling on this line are mapped to the corresponding hash bucket. As the support size increases, more and more of the data are spanned by the lines, each line covering a small portion of the dataset. The number of such lines increases exponentially with the support size, as is apparent from the illustration in Figure 3(c).

5 Robust Sparse Coding

There are a few problems with the direct application of the SCT formulation for efficient NN retrieval; these problems have been the primary bottlenecks for the adoption of sparse coding in the NN retrieval community. The first problem is apparent from property P2 (see Section 3.2): basis vectors getting dropped from the solution path, leading to non-matching SCTs. Another important issue arises when the data is noisy, which is most often the case; sparse coding is very sensitive to noise. In the presence of noise, some of the bases in the dictionary will be activated to represent the noise, leading to different SCTs even for highly correlated data points. A third problem, which is more algorithmic in nature, arises when the hash bucket for a given SCT is empty; that is, which bucket needs to be looked at next to guarantee the retrieval of an NN. In the following sections, we address all these issues.

5.1 Basis Drop

As we discussed in the review of the LARS algorithm in Section 3.2, at times the descent direction of the LARS algorithm will be against the direction of one of the active bases; i.e., if w_i^k is the ith coordinate of the coefficient vector at the kth step and if γ is the step size, then the descent at this step follows:

w_i^{k+1} = w_i^k + γ b_i^{a_k}    (13)



Figure 4: Percentage of 100K SIFT descriptors that had at least one zero-crossing in their sparse regularization path, plotted against the allowed support set size (L).

where b_i^{a_k} is the ith coordinate of the direction that is equiangular to all the bases in the active set a_k. Since the dictionary bases are not assumed to be orthogonal (due to the overcompleteness assumption), w_i^{k+1} crosses zero at γ = −w_i^k / b_i^{a_k}. At this stage, LARS drops the active basis bi from the active set temporarily and adds it back in the next step (refer to [10] for details). By definition, our SCT representation¹ cannot take care of the dropping of bases, and thus the generated hash code fails to match.

To understand the extent of this issue, we studied the regularization paths of the LARS solution on a part of our SIFT dataset consisting of approximately 100K SIFT descriptors. We computed the regularization paths of these descriptors by increasing the support set size from 1 to L, for L = {5, 10, 15, 20}. For example, for L = 5, we computed LARS 5 times on a SIFT descriptor, each time incrementing the allowed number of supports by one (L is also the number of iterations of the LARS algorithm); the sign of each of the active bases in the set was then checked to see if any of the coefficients changed sign. Figure 4 shows the result of this experiment. The x-axis is the active set size (L) and the y-axis plots the percentage of the dataset that had a zero-crossing at least once. We used a dictionary of size 2048 learned from 1M SIFT descriptors for this experiment, and the regularization parameter λ was fixed at 0.5.

As is clear from the plot, the number of zero-crossings increases as we allow the activated basis atoms to live longer in the LARS process. For L = 5, we observed that less than 0.2% of the dataset produced zero-crossings, while this increased to approximately 50% for L = 20. This result suggests that if we restrict the support set size to less than 5, the effect of the zero-crossing problem is negligible and can practically be ignored. Based on this observation, we assume from now on that no basis gets dropped in our setting. The theoretical results that follow in this paper make this fundamental assumption.

¹Note that this is not a serious impediment to our representation, as this will be taken care of through the robust sparse code formulation that we propose later.
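The zero-crossing study above can be reproduced in spirit with any LASSO path solver; the hedged sketch below uses scikit-learn's lars_path as a stand-in for our LARS implementation and flags descriptors for which some atom enters the active set and is later dropped (its coefficient returning to zero) within the first L steps.

import numpy as np
from sklearn.linear_model import lars_path

def has_basis_drop(B, v, L):
    # True if some atom becomes active and its coefficient later returns to zero
    # within the first L LARS/LASSO path points (the zero-crossing event above).
    _, _, coefs = lars_path(B, v, method='lasso', max_iter=L)
    nz = coefs != 0                               # (n_atoms, n_path_points)
    was_active = np.maximum.accumulate(nz, axis=1)
    return bool((was_active & ~nz).any())

def drop_rate(B, V, L):
    # Fraction of descriptors (rows of V) with at least one such drop.
    return float(np.mean([has_basis_drop(B, v, L) for v in V]))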


5.2 Noisy Data

Next, let us consider the second problem in detail. For data points v1 and v2, assume that T1 = hgλ(v1) and T2 = hgλ(v2). Assume that v1 = v + ǫ1 and v2 = v + ǫ2, where ǫ1 and ǫ2 represent random noise and v is the underlying noise-free data vector. Suppose we apply (6) to v1 and v2 using a dictionary B and a regularization λ; due to noise, there is no guarantee that T1 = T2, as some of the atoms in the dictionary might be used to represent the noise, thus defeating our SCT based hash table indexing. This indeed happens in practice, and Figure 5 illustrates this scenario for two SIFT vectors that correspond to the same keypoint in two images, each having a different amount of noise.

Figure 5: (a) Two SIFT descriptors that correspond to the same keypoint in two images. (b) The sparse representations of the two SIFT descriptors using a 2048-basis dictionary. As is clear from the figure, a few basis vectors are activated for only one of the descriptors (activations that are only blue (-) or only red (.)).

The basic motivation for our approach to overcome this difficulty comes from ideas in wavelet transform based multiscale analysis, in which a given signal is approximated to various error levels by restricting the choice of the wavelet scales used. That is, the signal is approximated poorly at coarse scales and the error residual decreases when using finer and finer scales. Assuming that the signal-to-noise ratio is high, we propose a novel algorithm, Multi-Regularization Sparse Coding (MRSC), to generate robust sparse codes based on the properties of the regularization of the LASSO solutions. To set the stage for the algorithm, the following section reviews the concept of regularization and builds up a few theoretical results that come in useful when putting forth the main arguments of this paper.

5.2.1 Regularization

With reference to the LASSO formulation given in (6), it is clear that the regularization λ weighs the L1 constraint with respect to the L2 least squares constraint. This regularization has two effects: (i) model selection and (ii) shrinkage estimation. That is, the L1 constraint selects the subset of the basis vectors that best represent the given data vector (setting the coefficients for


other bases to zero) and, at the same time, the selected non-zero coefficients are shrunk towards zero.

Lemma 1. Assuming the LASSO formulation in (6), if w is a minimizer of gλ(v) for a data vector v and regularization λ > 0, and if wi ≠ 0, then sign(wi) = sign(bi^T v), where wi is the ith coordinate of w and bi is the ith basis in the dictionary.

Proof. Under optimality,

0 ∈ ∂gλ(v) ⇒ −B^T (v − Bw) + λ u(w) = 0    (14)

where the ith coordinate of u is given by

u(wi) = 1 if wi > 0;  u(wi) = −1 if wi < 0;  u(wi) ∈ [−1, 1] if wi = 0.    (15)

Given that wi ≠ 0, (14) implies

bi^T v = wi + u(wi) λ = sign(wi) (|wi| + λ)    (16)

since bi^T bi = 1. Now, assume bi^T v = 0 and let wi ≠ 0; w.l.o.g. let wi > 0. Then u(wi) λ < 0. Given λ > 0, we arrive at a contradiction with the definition of u(wi). Thus wi = 0 and u(wi) = 0. �

Lemma 2. Assuming the LASSO formulation in (6), if wi = 0 for a given regularization λ > 0, then wi remains zero for λ′ = λ + δ, for any δ > 0.

Proof. Assume that for λ we have wi = 0, so that using the definition of u(wi) from (15) we have

|bi^T v| ≤ λ.    (17)

Now, for λ′ = λ + δ, let us assume that wi ≠ 0. Then we have bi^T v − wi = u(wi)(λ + δ), where from Lemma 1 we have sign(bi^T v) = sign(wi) = u(wi). Using (17), we have bi^T v − u(wi)λ = wi + u(wi)δ. If bi^T v < 0, then bi^T v − u(wi)λ > 0, but wi + u(wi)δ < 0, and we arrive at a contradiction. The same argument applies when bi^T v > 0. Thus wi = 0 for all δ > 0.

Lemma 3. Assuming (6), w = 0 if λ ≥ ‖B^T v‖∞.

Proof. From the optimality condition (14) and using Lemma 2, we have that for wi = 0, λ ≥ |bi^T v|. Now, for w = 0, λ ≥ max_i |bi^T v| over all bi ∈ B. �

Theorem 2. Assuming the LASSO formulation in (6), for any finite data vector v, increasing the regularization λ increases the sparsity of w.

Proof. Lemma 3 shows that for any finite v, if gλ(v) ≠ 0, then there exists a λ′ with λ < λ′ < ∞ for which every wi goes to zero. From Lemma 2, we have that once a coefficient wi = 0, it will not become nonzero for any further increase in λ.


Lemma 4. If bi ∈ B is the most correlated basis to a data vector v, then bi will be selected in the first step of LARS.

Proof. This is true by the definition of the LARS algorithm. �

Theorem 2 suggests that by adjusting the regularization, the sparsity of the LASSO solution can be adjusted; that is, the size of the support set (as in the LARS solution path) is inversely related to the regularization value. This concept ties LARS to the LASSO solution path, and the following theorem follows:

Theorem 3. Assume v1, v2 are two zero-mean data vectors, and B is an overcomplete dictionary learned from data, with µ the maximum coherence of B. Suppose T1 = hgλ(v1) and T2 = hgλ(v2) are the two SCTs (SCT1) found by (6) for a regularization λ. If T1 ≠ T2, then there exists a λ′, where

λ < λ′ ≤ max(‖B^T v1‖∞, ‖B^T v2‖∞),

such that the new tuples T′1 = hgλ′(v1) and T′2 = hgλ′(v2) generated by (6) will satisfy:

I. T′1 ⊆ T1 and T′2 ⊆ T2;

II. T′1 = T′2 iff there exists a bi ∈ B such that |v^T bi| > √((1 + µ)/2), ∀v ∈ {v1, v2}.

Proof. I. This is a direct result of the continuity properties of the LASSO path, as implied by Theorem 2.

II. Assume, for ease of analysis, that the data vectors are unit normalized. Let cos θ = µ, so that √((1 + µ)/2) = cos(θ/2). In the following we assume the LARS algorithm for sparse coding and use the previously introduced notation ∠(a, b) to denote the minimum angle between two vectors a and b.

To prove the if part: suppose ∃ bi ∈ B such that |bi^T v| > √((1 + µ)/2); then ∠(bi, v) < θ/2. This condition means that bi is the most correlated dictionary atom to v and thus will be selected by LARS in the first step. If not, there would exist another basis bj satisfying the same condition, in which case bi^T bj > µ, which contradicts the condition that the maximum coherence of the dictionary is µ. Now, using the definition of SCT1, combined with Lemma 3, Lemma 4 and Theorem 2, we have that ∃ λ < λ′ < ‖B^T v‖∞ such that T′1 = T′2.

To prove the only if part: assume that there exists a 0 ≤ λ′ < ∞ such that T′1 = T′2. Let T′1 = T′2 = 〈i1, i2, · · · , ik〉; then by the definition of SCT1, we have that bi1 is the most correlated basis to each data vector v ∈ {v1, v2}, and thus, using Lemma 4, it will be selected at the first step of LARS, from which the result follows. �

The following result generalizes the above theorem.


Corollary 1. Suppose T1 = T2 = 〈i1, i2, · · · , ik〉 are the (identical) SCTs of two data points v1 and v2 respectively. If bk is a unit vector equiangular to the k support bases, then ∠(v, bk) < θ/2^k, where v ∈ {v1, v2} and θ = min_{i,j} ∠(bi, bj).

Proof. The proof is a direct extension of Theorem 3.

Corollary 2. With regard to the LASSO formulation in (7), for a data vector v ∈ Rn, let T = hgλ(v) and let |T| = ℓ, where | · | denotes the length of the SCT. Then every support size ℓ with 0 ≤ ℓ ≤ n corresponds to a regularization λ with 0 ≤ λ ≤ ‖B^T v‖∞, and vice versa.

Proof. This is implied by the equivalence of the two LASSO formulations in (6) and (7); the proof follows from the Lagrangian of (7).

Next, we prove a few results relating the data vectors to their SCTs.

Theorem 4. Let θ = max_{i,j} ∠(bi, bj) for bi, bj ∈ B. If ∠(v1, v2) > θ, then T1 ≠ T2 with probability one.

Proof. Since the maximum angle between dictionary atoms is θ, we arrive at a contradiction if T1 = T2, as per the definition of SCT1. �

The next theorem relates the correlation between two zero-mean data vectors and the probability that they will have the same SCT for some λ.

Theorem 5. Let v1, v2 ∈ Rd be two zero-mean data vectors and let T1, T2 be the respective SCTs. Let θ = max_{i,j} ∠(bi, bj) for bi, bj ∈ B. If ∠(v1, v2) ≤ θ, then Prob(T1 = T2) ≥ 1/2 for any 0 ≤ λ ≤ ‖B^T v‖∞.

Proof. Without loss of generality, we assume that the data vectors are unit normalized, so that they live on the surface of an n-dimensional unit ball. We have

P(T1 = T2) = 1 − P(T1 ≠ T2).    (18)

To understand what is going on, let us consider the 2D case of a circle as shown in Figure 6. Let bi and bj be two adjacent bases such that ∠(bi, bj) ≤ θ (that is, v1 and v2 lie in the interior of the sector defined by v1 O v2). The arcs AB and BC represent regions around the two bases bi and bj respectively, such that if both v1 and v2 fall in either of these regions, they will have the same SCT. Now, to generate the loci of all the SCTs that do not match, we rotate the sector generated by v1 O v2, as shown in the figure, from A to C, such that at some point we have ∠(v1, B) = θ − α and ∠(B, v2) = α, for some α ≤ θ. Then,

P(T1 ≠ T2) = Prob(v1 ∈ S(θ − α)) Prob(v2 ∈ S(α)) + Prob(v2 ∈ S(θ − α)) Prob(v1 ∈ S(α))    (19)
           = 2 SA(θ − α) SA(α) / SA(θ)²    (20)


Figure 6: A simplified schema illustrating Theorem 5. O is the center of the unit ball; bi and bj are the two bases under consideration. AB and BC are sectors such that if the two data points v1 and v2 (with ∠(v1, v2) ≤ θ) both fall into one of them, they will have the same SCT. To determine the probability of the data points having different SCTs, we rotate a sector of size θ (represented by the arc v1 O v2) from A to C; an intermediate position of this arc is shown that makes an apex angle of α in region BC and θ − α in AB.

where S(α) is the hyperspherical cap with an apex angle of α and SA(α) represents its surface area.

Now, as the area is an increasing function of the apex angle, and since the total area of the sub-cap that we are interested in is SA(θ), we can rewrite (20) as:

f(α) = SA(α) (SA(θ) − SA(α)) / SA(θ)²    (21)

From [22], we have

SA(φ) = (1/2) An I_{sin²(φ)}((d − 1)/2, 1/2)    (22)

where φ is the apex angle of the cone, An is the surface area of an n-dimensional unit ball, and I_x(a, b) is the regularized incomplete beta function given by:

I_x(a, b) = (1/B(a, b)) ∫₀ˣ t^{a−1} (1 − t)^{b−1} dt.    (23)

To obtain an upper bound on (21), we maximize its right-hand side with respect to α. Substituting (22) into (21), applying the generalized Leibniz rule for differentiating under the integral sign, and equating the derivative to zero, we get

SA(α) = SA(θ)/2    (24)


which implies α = θ/2. Thus we have:

P(T1 ≠ T2) ≤ 2 SA(θ/2)² / SA(θ)²    (25)
           = 1/2,    (26)

which implies P(T1 = T2) ≥ 1/2. �

5.2.2 Multi-Regularization Sparse Coding

From Theorem 3 we have that if the SCT representations of two data vectors that make an angle less than θ/2 with the same basis do not coincide at one regularization value, there exists a larger L1 penalty at which the two SCTs will accord. This implies that if we use multiple SCT codes for a set of regularizers, with the condition that at least one of them will match up with a query SCT, then the hashing can be successful. This is exactly the idea in the Multi-Regularization Sparse Coding (MRSC) algorithm we propose below. There is one key difference though: instead of running the LARS algorithm in multiple passes, one for each regularization value, we make a single pass, keeping track of the index of the active basis support at each LARS iteration and incrementally building the SCT. As implied by Corollary 2, this is theoretically equivalent to running the algorithm for different regularizations. Using this idea, we can create multiple SCTs for a sequence of support sizes: L := {Lmin, Lmin + 1, Lmin + 2, · · · , Lmax}. For each data vector v, we define the hash code set Hg(v) as follows:

Hg(v) := { hgLmin(v), hgLmin+1(v), · · · , hgLmax(v) }.    (27)

Given a query vector vq, we compute Hg(vq) for the same support sizes, but query the hash table in reverse order; that is, the hash code corresponding to hgLmax is queried first. If an SCT cannot match an NN at one regularization value, we go to the next L, which is one less than the previous one, so that a coarser subspace is checked for a neighbor. The process is repeated until the required number of NNs is retrieved. Algorithm 1 and Algorithm 2 provide the details.

5.3 Hash Bucket Empty

The third problem that we need to consider when using sparse coding for NN retrieval is the question of where to look next if we find a hash bucket to be empty. This is partially addressed in the previous section through the use of multiple sparse codes. The question still to be answered is where to query if the database point and the query point make an angle with each other that is greater than the θ of Theorem 4. In the following subsections, we provide a few heuristics to tackle this scenario which have been found effective in practice.


Algorithm 1 MRSC Hashing

Require: B, D
1: Initialize hash table H
2: for vi ∈ D do
3:   w ⇐ SparseCode(vi, B, Lmax)  {run LARS for Lmax steps}
4:   c = {wi}, I = {i}, ∀ wi ≠ 0  {all nonzero coefficients in w and their indices}
5:   [αi, scti] ⇐ SortAndReorder(c, I, 'descend')
6:   for j = Lmin → Lmax do
7:     hij ⇐ Compress(scti,(Lmin..j))  {(Lmin..j) denotes coordinates Lmin through j of scti}
8:     H ⇐ AddToHashTable(hij, αi,(Lmin..j), H)
9:   end for
10: end for
11: return H

Algorithm 2 MRSC Query

Require: B, H, vq
1: Initialize NNq ⇐ 0
2: w ⇐ SparseCode(vq, B, Lmax)
3: c = {wi}, I = {i}, ∀ wi ≠ 0  {all nonzero coefficients in w and their indices}
4: [αq, sctq] ⇐ SortAndReorder(c, I, 'descend')
5: for j = Lmax → Lmin do
6:   hq ⇐ Compress(sctq,(Lmin..j))
7:   NNq ⇐ Query(hq, αq, H)  {finds NNq in bucket hq such that ‖αNNq − αq‖₂ is minimal}
8:   if NNq ≠ 0 then
9:     return NNq
10:  end if
11: end for
12: return NNq
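For readers who prefer runnable code, the following is a minimal Python counterpart to Algorithms 1 and 2 (a sketch, not our C++ implementation). It assumes a hypothetical sparse_code helper that runs LARS for Lmax steps and returns the full coefficient vector, and it keys buckets by SCT prefixes of lengths Lmin..Lmax; the exact indexing convention of the Compress step above may differ.

import numpy as np
from collections import defaultdict

def sct_prefixes(w, l_min, l_max):
    # SCT prefixes of lengths l_min..l_max, atoms ordered by decreasing |coefficient| (SCT2).
    idx = np.flatnonzero(w)
    idx = idx[np.argsort(-np.abs(w[idx]))][:l_max]
    return [tuple(int(i) for i in idx[:l]) for l in range(l_min, min(l_max, len(idx)) + 1)]

def mrsc_hash(codes, l_min=2, l_max=5):
    # Algorithm 1 analogue: insert every data point under each of its SCT prefixes.
    H = defaultdict(list)
    for i, w in enumerate(codes):
        for key in sct_prefixes(w, l_min, l_max):
            H[key].append((i, w[list(key)]))
    return H

def mrsc_query(H, wq, l_min=2, l_max=5):
    # Algorithm 2 analogue: probe buckets from the longest (finest) SCT down to the shortest.
    for key in reversed(sct_prefixes(wq, l_min, l_max)):
        bucket = H.get(key)
        if bucket:
            aq = wq[list(key)]
            return min(bucket, key=lambda e: np.linalg.norm(e[1] - aq))[0]
    return None

# usage (sparse_code is hypothetical): H = mrsc_hash([sparse_code(v) for v in D])
#                                      nn = mrsc_query(H, sparse_code(vq))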


5.3.1 Cyclic Subspace Combination Tuples

It often happens that the database point and the query point are noise corrupted versions of an underlying true data point. Let v be a noisy data point and let T = 〈i1, i2, · · · , ik〉 be its associated SCT. Let j∗ = argmax_{j∈T} ∠(bij, v) denote the index of the maximum angle that the data point makes with any basis in the active support. Then, as long as the noise component ǫ (with v′ = v + ǫ) satisfies ∠(v′, bij∗) ≤ θ, we have T ∩ T′ ≠ ∅, where T′ is the SCT of v′. A schematic illustration of this idea is given in Figure 7. A data point A and a query point B are assumed to be noise corrupted versions of each other, such that the noise displaces the spatial location of A to B. Note that point A has SCT_A = 〈j, k〉, while point B has SCT_B = 〈k, i〉. As long as their displacements lie in a ball of radius β described by the largest angle that A makes with any of the bases in its active set, they will have at least one subspace overlapping.

Motivated by this idea, we propose the cyclic SCT, in which, instead of generating a single MRSC code, we perform cyclic rotations of the MRSC hash codes, making each of the bases the first entry in the respective SCT cycle, so that the MRSC algorithm queries neighboring subspaces for a given query rather than constraining itself to the subspace majored by the basis most correlated with the query point. If the SCTs are short, such that computing all permutations is not costly, we can in fact look at every permutation of the MRSC hash code, which provides a faster search of the subspaces and can be implemented in parallel. After retrieving the neighbors from each of the SCTs in the cycle (or permutations), the NNs are decided by sorting the candidate neighbors. Referring back to Algorithm 2, cyclic MRSC has an additional loop at line 5, where sctq is rotated (or permuted). Also, before returning NNq, the algorithm accumulates the candidate neighbors found over the cycle and returns the closest neighbor from this set.
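A minimal sketch of the cyclic (and, for short tuples, fully permuted) key generation described above:

from itertools import permutations

def cyclic_scts(sct):
    # All cyclic rotations of an SCT; each active basis becomes the leading entry.
    return [sct[i:] + sct[:i] for i in range(len(sct))]

def permuted_scts(sct):
    # For short SCTs, every permutation can be probed instead (see text).
    return list(permutations(sct))

print(cyclic_scts((5, 91, 700)))   # [(5, 91, 700), (91, 700, 5), (700, 5, 91)]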

5.3.2 Hierarchical Sparse Coding

It is possible that the given data vectors have only minor correlation, such that there is no overlap between their dictionary supports. A way to tackle this issue within our framework is to use hierarchical sparse coding, in which we learn multiple dictionaries, each with the number of basis vectors reduced by half. That is, instead of sparse coding the data point v ∈ Rn with a single dictionary of m bases, we generate k different sparse codes using dictionaries of sizes {m, m/2, m/4, · · · , m/2^k}, k ≤ n. Intuitively, the coherence of the dictionaries (and thereby the subspaces covered) decreases as the dictionary size decreases, such that each dictionary covers a larger subspace, addressing the issue of finding the nearest neighbor. This is a topic to be analyzed on its own, and we consider it as part of a future extension to this paper.


Figure 7: Illustration of the cyclic MRSC algorithm. Assuming an orthogonal subspace for clarity of presentation, a data point A inhabiting a subspace spanned by bases bj and bk gets displaced due to noise to a spatial location B spanned by bases bi and bk. As long as the noise strength is less than the coherence of the data point with its strongest active basis, there will be an SCT overlap when using the cyclic MRSC algorithm.

6 Algorithm Analysis

Computational Complexity: Let us assume that we use a maximum of k iterations of the LARS algorithm for sparse coding a data point v ∈ Rd using a dictionary B ∈ Rd×n. Further, assuming we use Cholesky factorization and updating for computing the least angle direction in each iteration of the LARS algorithm, each direction computation takes O(j²) for the jth LARS step, leading to O(k³) time for k steps. To compute the most correlated dictionary atom, over k steps we require nd + (n − 1)d + (n − 2)d + ... + (n − k)d = O(ndk) computations. So the total computational complexity is O(k³ + ndk). As small codes (k ≤ 5) are sufficient for our algorithm, the computation is dominated by the O(ndk) component, which is the cost of computing k rounds of inner products with the entire dictionary B.

Space complexity: Applying Theorem 1, we need to store only the coefficients corresponding to the support in the SCT for computing distances. Thus the space required is k memory locations for storing the coefficients and k log₂(n) bits for storing the SCT codes. Technically, as the active coefficients are stored as 4-byte floating point numbers, the total space is 32k + k log₂(n) bits. For example, using a support size of 5 and 2048 bases, we need 215 bits per data point, of which 55 bits are used for hashing.
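A quick arithmetic check of this figure:

import math

def storage_bits(k, n, coeff_bits=32):
    # k floating-point coefficients plus k basis indices of ceil(log2 n) bits each.
    return coeff_bits * k + k * math.ceil(math.log2(n))

print(storage_bits(5, 2048))   # 215 bits per point, 55 of which form the hash key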

Query Complexity: Assume the data D is uniformly distributed into the hash buckets by the dictionary B ∈ Rd×n. Then, for an SCT of size k, each hash bucket contains s = |D| / (n choose k) items. Assuming we use linear search for resolving hash conflicts, a query takes O(s) time. If the hash bucket corresponding to an SCT of size k is empty, then we need to retreat to an SCT of size k − 1, and so on; the extra time for the linear search can be derived as follows. Let sj denote the time for a linear search over a bucket for an SCT of size j; then s_{k−1}/s_k = (n choose k) / (n choose k−1), which leads to a complexity of s_{k−1} = O(k(n − k) s_k).

7 Experiments

In this section, we detail our experimental setup: the datasets used and the performance metrics against which the methods are evaluated. Our algorithms were implemented in C++ for speed, with MATLAB interfaces for ease of data handling. All the experiments, except the one on speed of retrieval for an increasing database size, used a 2GHz 64-bit machine with 4GB RAM.

7.1 Datasets

SIFT Dataset: Our experiments were primarily based on the SIFT descriptors created from the INRIA Holidays and the INRIA Copydays datasets². We used approximately 400M SIFT descriptors created from 200K images from these datasets.

Spin Image Dataset: 3D shape recognition and retrieval has recently been gaining a lot of attention due to the introduction of cheap 3D sensors. Since these sensors produce millions of 3D points per video frame, retrieving corresponding points across frames for object recognition is a challenging operation. One of the most popular 3D point cloud descriptors is the spin image [16]. Each spin image is a 2D histogram of the changes in the 3D surface structure of the point cloud, computed in cylindrical coordinates around a given point. We used a 20 × 20 histogram, thus forming 400D spin vectors for each 3D point in the cloud. The descriptors were generated from the public SHREC dataset³ [3], consisting of 50 object classes, of which we used 10 classes for computing the spin images, amounting to a total of 1M descriptors.

7.2 Performance Metrics

This subsection details the metrics against which we analyze the performance of the sparse coding algorithms.

Recall@K: Given a dataset D and a query set Q, we define:

Recall@K = (1/|Q|) Σ_{q∈Q} |NN^A_K(q, D) ∩ NN^gt_1(q, D)|    (28)

²http://lear.inrialpes.fr/~jegou/data.php
³http://www.itl.nist.gov/iad/vug/sharp/contest/2011


where NN^A_K(q, D) retrieves the K nearest neighbors of a query point q from D using the respective algorithm A, and NN^gt_1(q, D) is the ground truth nearest neighbor computed using a linear scan. This definition of recall has been used in many papers, such as [15], as it provides a sense of approximately how many NNs have to be retrieved by a given algorithm so that there is a good chance that the true NN is retrieved.

Precision@K: Given a dataset D and a query set Q, we define:

Precision@K = \frac{1}{|Q|} \sum_{q \in Q} \left| NN^A_1(q, D) \cap NN^{gt}_K(q, D) \right|        (29)

where NN^{gt}_K retrieves the K nearest neighbors of a query point q from D using a linear scan, and NN^A_1 is the nearest neighbor found by the algorithm A. This definition of precision provides an intuition about the quality of the NNs retrieved by the respective algorithm; essentially, it measures whether the dataset point returned by algorithm A belongs to the first K ground truth nearest neighbors.
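For concreteness, a minimal sketch of how both metrics can be computed from the retrieved and ground-truth neighbor lists is given below; the data layout and function names are illustrative, not part of our released code.

    #include <vector>

    // For each query q:
    //  - algoNN[q] : neighbor ids returned by the algorithm, best first;
    //  - gtNN[q]   : ground-truth neighbor ids from a Euclidean linear scan.
    // Recall@K checks whether the single true NN (gtNN[q][0]) is among the K
    // returned points; Precision@K checks whether the first returned point
    // (algoNN[q][0]) is among the K true nearest neighbors. Both return percentages.

    double recallAtK(const std::vector<std::vector<int>>& algoNN,
                     const std::vector<std::vector<int>>& gtNN, int K) {
      int hits = 0;
      for (size_t q = 0; q < algoNN.size(); ++q)
        for (int i = 0; i < K && i < (int)algoNN[q].size(); ++i)
          if (algoNN[q][i] == gtNN[q][0]) { ++hits; break; }
      return 100.0 * hits / algoNN.size();
    }

    double precisionAtK(const std::vector<std::vector<int>>& algoNN,
                        const std::vector<std::vector<int>>& gtNN, int K) {
      int hits = 0;
      for (size_t q = 0; q < algoNN.size(); ++q)
        for (int i = 0; i < K && i < (int)gtNN[q].size(); ++i)
          if (gtNN[q][i] == algoNN[q][0]) { ++hits; break; }
      return 100.0 * hits / gtNN.size();
    }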

7.3 Dictionary Learning

Learning the overcomplete basis dictionary is the first step in sparse coding; deciding the number of bases is a critical component with regard to the accuracy and speed of sparse coding and NN retrieval. A dictionary with a large number of bases could increase the coherence between the atoms, so that the sparse codes become too sensitive to noise or even ambiguous (when the coherence reaches unity). Too few dictionary atoms might result in less expressive SCTs. Thus, in this paper we approach the problem from a data-driven perspective and use cross-validation against recall@1 to decide the optimum dictionary size.

Towards this end, we randomly chose 1M SIFT descriptors from the dataset. The dictionary learning was performed using the SPAMS toolbox [27, 26] for an increasing number of bases. SIFT descriptors typically have integer-valued entries, which pose significant problems for standard dictionary learning algorithms, especially the ill-conditioning of the matrices involved. Thus we scaled down all the descriptors by dividing them by 255 and then normalized each descriptor to have zero mean. Figure 8(a) shows the recall@1 performance of NN retrieval over SIFT descriptors for varying dictionary sizes on a small database of 10K descriptors and a query size of 1K.
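The descriptor preprocessing described above amounts to the following simple routine (a sketch; the in-place vector layout is an assumption made for illustration):

    #include <vector>

    // Preprocessing of a 128-D SIFT descriptor before dictionary learning:
    // scale the integer-valued entries to [0, 1] and remove the descriptor mean.
    void preprocess(std::vector<float>& sift) {
      float mean = 0.f;
      for (float& v : sift) { v /= 255.f; mean += v; }
      mean /= sift.size();
      for (float& v : sift) v -= mean;
    }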

The larger the dictionary, the more time is taken to solve the LASSO, while too few bases lead to too many activations and hence many hash collisions, which reduces performance. As is clear from the plot, the best recall was observed for a dictionary of size 2048 for the SIFT descriptors, and we therefore use this dictionary for the subsequent experiments in this paper. Figure 9 shows a typical 256-basis SIFT dictionary. Sparse coding a SIFT descriptor using a 2048-basis dictionary took approximately 100µs on average. We fixed the LASSO regularization parameter to 0.5 for dictionary learning. As for the spin image dataset, we followed suit and learned a dictionary of size 400×1600 using 100K descriptors from the dataset.

Figure 8: A plot of the recall performance of the SIFT descriptors for varying sizes of the dictionary (500–4500 bases). The dataset was of size 10K sparse-coded SIFT descriptors and the query had 1K descriptors.

Figure 9: A sample SIFT dictionary with 256 bases learned from 1M training SIFT descriptors. Each basis is reshaped as a 16×8 patch in the display above.

7.4 Active Support Size

Another important parameter that needs to be estimated in our algorithm is the size of the active support. As was seen in Section 5.1, an active set size of less than 10 provides robustness against problems such as the basis drop in the LARS regularization path. To estimate the recall performance of NN retrieval against various support sizes, we performed the following cross-validation experiment. On a subset of our SIFT dataset consisting of 1M descriptors, we computed the average recall@1 using 1K SIFT queries for two cases: (i) using the SCT-generated hash codes and (ii) using the MRSC algorithm with the active support size varying from 2 to L; recall that the latter is required in case an empty hash bucket is encountered. Figures 10(a) and 10(b) plot these two cases respectively, and Figures 10(c) and 10(d) plot the corresponding average query times.


Figure 10: (a) Recall for an increasing support size, (b) recall using the MRSC algorithm for the support size varying from 2 to L (L being varied), (c) query time against increasing support size, (d) query time for the MRSC algorithm for the support size varying from 2 to L. The experiment used 1M SIFT descriptors and the performance was averaged over 1K query descriptors.

As is clear from the plots, an active set size in the range of 4–7 provides the best recall and the best query time. The exact match of the query SCT with the ground truth happens less than 30% of the time, and thus the recall shown in Figure 10(a) is low. On the other hand, using multiple regularizations helps address the issue of empty hash buckets, thereby improving the recall as the size of the support increases, as shown in Figure 10(b). Although the query time goes down as the size of the hash code increases, as shown in Figure 10(c), it increases after a certain point (L=5) in the MRSC algorithm. Based on these plots, we choose L in the range of 2–5 in this paper. A point to note is that the recall is low when L=2: there are too many hash collisions, and the number of coefficients available to resolve a collision is too small.

7.5 Recall Experiments

In this subsection, we compare the recall performance of the MRSC algorithm (cyclic MRSC) against state-of-the-art techniques. We chose six other popular NN techniques to compare against: (i) the Product Quantization (PQ) algorithm [15], (ii) KD-Trees (KDT), (iii) Euclidean Locality Sensitive Hashing (E2LSH) based on [6], (iv) Kernelized LSH (KLSH) [20], (v) Spectral Hashing (SH) [42], and (vi) Shift Invariant Kernel Hashing (SIKH) [35]. Spectral hashing used 40-bit hash codes, as it was observed that a higher number of bits reduced the recall performance; this is expected because SH internally uses a PCA step, and the sensitivity of the higher-order bits increases with the length of the hash code as more and more bases corresponding to smaller eigenvalues are included. We used 40 bits for SIKH as well. The other algorithms used the default parameters provided by their implementations. For KLSH, we used the RBF kernel and 1K descriptors randomly selected from the dataset for learning the kernel. As for the PQ algorithm, we used the IVFADC variant as it showed the best performance in both speed and accuracy.

Recall on SIFT: We used approximately 2M SIFT descriptors as the database and 1K query points. The average recall performance over 10 trials for an increasing number of retrieved points is shown in Figure 11(a). As is clear, the MRSC algorithm outperforms all the other algorithms in recall, especially for recall@1. Although SIKH performs very close to MRSC, the performance of the other algorithms, especially KLSH, was found to be very low. We found that the selection of the training set for building the kernel had a major impact on the performance of this algorithm; we therefore report the best performance of KLSH observed over 10 different kernel initializations.

Recall on Spin Images: We used approximately 1M spin images (400D) as the database and 1K descriptors, excluded from the dataset, as the query points. Since the KDT algorithm needs to store all the data points in memory, we reduced these descriptors to 40D through a PCA projection, as suggested in [16]. The average recall performance over 10 trials is shown in Figure 11(b) for an increasing number of retrieved points. As with the SIFT dataset, MRSC is superior to all the other algorithms in the quality of the retrieved points.

Figure 11: Recall@K for SIFT (left) and spin images (right) compared against the state-of-the-art methods (PQ, KDT, E2LSH, KLSH, SH, and SIKH). The comparisons were carried out on 1M points from the SIFT and spin image datasets respectively, and the recall was computed for a query size of 1K descriptors averaged over 10 trials.


Figure 12: (a) plots the search time per query (for MRSC and PQ) against an increasing database size, (b) plots the recall@1 against an increasing database size.

7.6 Webscale Datasets

Retrieval Speed: Scalability to large datasets is an important property that any NN algorithm should possess if it is to work at web scale. In this subsection, we show the NN retrieval performance of MRSC (cyclic) against an increasing database size. For this experiment we used the 400M SIFT dataset mentioned previously. The database was gradually grown from 1M towards 400M descriptors; each time, 1K descriptors were randomly chosen from the dataset (and thus excluded from the database) and recall@1 was computed. Figure 12(a) plots the average search time per query for our algorithm compared against the PQ algorithm. Since the optimized implementation of the PQ algorithm is licensed, we compare against the results reported in [15] using the same hardware platform suggested by the authors. Clearly, as seen from the plot, our implementation rivals the state of the art in search speed.

Recall Performance: Next, we compared the recall@1 performance of MRSC against an increasing database size. As increasing the database leads to increased hash table collisions, the purpose of this experiment was to analyze how good the approximate sparse Euclidean distance computation is. We used the same 400M SIFT dataset and computed recall@1 for 1K descriptors while gradually increasing the database size, as in the previous experiment. Figure 12(b) shows the result. Even though there is a slight drop in recall as the database size increases, the drop is minimal and, on average, the recall stays close to the recall@1 performance seen in Figure 11(a). This is explained by Figure 13(a), which shows that as the database size increases, the new points fall into new hash buckets, so the relative average increase in the size of the existing buckets is minimal. The 2-sparse plots are omitted from Figure 13(a) as they grow at a faster rate than the larger support sizes; for example, the average bucket size is 27.36 for 40M SIFT descriptors, while it increases to 143.3 for 400M descriptors.


7.7 Precision

Finally, we compare Precision@K for the SIFT and spin image datasets. For this experiment, we used the same setup as for the recall experiments (as discussed in Section 7.5), except that instead of comparing the K NNs found by the algorithm against one ground truth neighbor, we compared the first NN found by the algorithm against the K nearest neighbors found using Euclidean linear scans. Figure 13(b) plots the mean average precision (mAP) computed for 1K queries on the SIFT and spin image datasets, averaged over 10 trials. As is clear from the plot, the NN retrieved by the cyclic MRSC algorithm lies within the top 50 true nearest neighbors for more than 90% of the queries.

Figure 13: (a) Average hash bucket size (for 3-, 4-, and 5-sparse codes) against increasing database size, (b) mean average precision for the SIFT and spin image datasets for 1K query descriptors compared against an increasing number of K-NNs found from the dataset using linear scans.

7.8 Robustness to Distortions

SIFT descriptors are commonly employed for finding feature correspondences across images in applications such as motion estimation, 3D reconstruction, etc. The images used to compute the correspondences are generally distorted in multiple ways, and thus robust computation of the matching descriptors is essential for the task. In this subsection, we explicitly evaluate the robustness of the MRSC algorithm in computing SIFT correspondences across images undergoing various commonly seen distortions. For this experiment, we use the SIFT benchmark dataset4, consisting of eight categories of images, each containing six images exhibiting a specific type of distortion. The distortion categories are: (i) BARK, with images undergoing magnification, (ii) BIKES, with fogged images, (iii) LEUVEN, with illumination differences, (iv) TREES, with Gaussian blur, (v) BOAT, with rotation, (vi) GRAPHIC, with 3D camera transformation, (vii) UBC, with granularity introduced by JPEG compression, and (viii) WALL, with affine transformations. Sample images from each category are shown in Figure 14 in the order of increasing distortion.

4 www.robots.ox.ac.uk/~vgg/research/affine/index.html


Figure 14: Sample images from each distortion category used to evaluate the recall@1 performance of SIFT descriptors.

SIFT descriptors were computed for each image in a category; the descriptors from the first image in each category formed the query set, while the database at step i consisted of the descriptors from image i, where i varies from 2 to 6. The recall@1 performance was computed against a ground truth obtained by a Euclidean linear scan. As there was too much distortion in the images for i = 6 in each category, we slightly relaxed the nearest neighbor criterion and declared a candidate point to be a neighbor if it was inside an ε-neighborhood of the true neighbor, where we used ε = 1.12 (i.e., a 90% neighbor). We also pruned out ground truth correspondences whose Euclidean distance exceeded a threshold, as there was a high chance of them being false positives. We compared the performance against three other state-of-the-art NN retrieval algorithms for the task: KLSH, KDT and PQ, with the same configurations as before. The results of this experiment are shown in Figures 15(a)–15(h), with the recall averaged over all the queries in a category. The plots clearly show the robustness of cyclic MRSC against the other approaches.
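For clarity, the relaxed criterion can be read as the following test; this is a sketch of one plausible interpretation of the ε-neighborhood, in which a candidate is accepted if its distance to the query is within ε times the true nearest neighbor distance, and the names are illustrative.

    // Relaxed correspondence test used in the distortion experiments (sketch):
    // distRetrieved is the query-to-candidate distance, distTrueNN the
    // query-to-true-NN distance from the linear scan.
    bool isApproxNeighbor(double distRetrieved, double distTrueNN,
                          double eps = 1.12) {
      return distRetrieved <= eps * distTrueNN;
    }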

8 Conclusion and Future Work

This paper introduced a novel application of sparse coding based on dictionary learning for approximate nearest neighbor searches. A new representation of the data was proposed, which makes the data directly indexable using an inverted file system. To make this representation cope with noisy data, we proposed an extension that utilizes the regularization path of the LASSO solution via the LARS algorithm. Extensive experiments were conducted to evaluate our algorithm on web scale datasets and demonstrated superior performance. Going forward, we plan to investigate the following ideas: rather than performing sparse coding over multiple regularizations, we plan to modify the algorithm so that the learned dictionary itself is made robust to perturbations in the data, leading to a robust dictionary learning framework. Using a hierarchical sparse coding framework with multiple learned dictionaries of various sizes to further improve robustness is another direction we plan to pursue.

Acknowledgements

We are indebted to three anonymous reviewers whose insights and suggestions have helped us improve this paper. We would like to thank Mr. Duc Fehr, University of Minnesota, for helping us with the spin image dataset. This material is based upon work supported in part by the U.S. Army Research Laboratory and the U.S. Army Research Office under contract #911NF-08-1-0463 (Proposal 55111-CI), and the National Science Foundation through grants #IIP-0443945, #CNS-082-1474, #IIP-0934327, #CNS-1039741, and #SMA-1028-076.

References

[1] Aharon, M., Elad, M., Bruckstein, A.: K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing 54(11), 4311 (2006)

[2] Bohm, C., Berchtold, S., Keim, D.: Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Computing Surveys (CSUR) 33(3), 322–373 (2001)

[3] Boyer, E., Bronstein, A., Bronstein, M., Bustos, B., Darom, T., Horaud, R., Hotz, I., Keller, Y., Keustermans, J., Kovnatsky, A., et al.: SHREC 2011: robust feature detection and description benchmark. Arxiv preprint arXiv:1102.4258 (2011)

[4] Candes, E.: Compressive sampling. In: Proceedings of the International Congress of Mathematicians, vol. 3, pp. 1433–1452 (2006)

[5] Dalal, N., Triggs, B., Schmid, C.: Human detection using oriented histograms of flow and appearance. pp. 428–441. Springer (2006)

[6] Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.: Locality-sensitive hashing scheme based on p-stable distributions. Proceedings of the Twentieth Annual Symposium on Computational Geometry, pp. 253–262 (2004)

[7] Datta, R., Joshi, D., Li, J., Wang, J.: Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys (CSUR) 40(2), 1–60 (2008)


[8] Donoho, D.: Compressed sensing. Technical Report, Department of Statistics, Stanford University (2004)

[9] Donoho, D.: Compressed sensing. IEEE Transactions on Information Theory 52(4), 1289–1306 (2006)

[10] Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least angle regression. Annals of Statistics 32(2), 407–451 (2004)

[11] Elad, M., Aharon, M.: Image denoising via learned dictionaries and sparse representation. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 895–900. IEEE (2006)

[12] Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. Proceedings of the 25th International Conference on Very Large Data Bases, pp. 518–529 (1999)

[13] Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pp. 604–613 (1998)

[14] Jaccard, P.: Etude comparative de la distribution florale dans une portion des Alpes et du Jura. Bulletin de la Société vaudoise des Sciences Naturelles 37, 547–579 (1901)

[15] Jegou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(1), 117–128 (2011)

[16] Johnson, A.: Spin-images: A representation for 3-D surface matching. Ph.D. thesis, Carnegie Mellon University (1997)

[17] Katayama, N., Satoh, S.: The SR-tree: An index structure for high-dimensional nearest neighbor queries. Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, pp. 369–380 (1997)

[18] Kleinberg, J.: Two algorithms for nearest-neighbor search in high dimensions. Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, pp. 599–608 (1997)

[19] Knuth, D.: The Art of Computer Programming. Vol. 3, Sorting and Searching. Addison-Wesley, Reading, MA (1973)

[20] Kulis, B., Grauman, K.: Kernelized locality-sensitive hashing for scalable image search. In: 12th International Conference on Computer Vision, pp. 2130–2137. IEEE (2009)

[21] Kushilevitz, E., Ostrovsky, R., Rabani, Y.: Efficient search for approximate nearest neighbor in high dimensional spaces. Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pp. 614–623 (1998)


[22] Li, S.: Concise formulas for the area and volume of a hyperspherical cap. Asian Journal of Mathematics and Statistics 4, 66–70 (2011)

[23] Liu, T., Moore, A., Gray, A., Yang, K.: An investigation of practical approximate nearest neighbor algorithms. Advances in Neural Information Processing Systems (2004)

[24] Liu, W., Wang, J., Kumar, S., Chang, S.F.: Hashing with graphs. In: L. Getoor, T. Scheffer (eds.) Proceedings of the 28th International Conference on Machine Learning, ICML '11, pp. 1–8. ACM, New York, NY, USA (2011)

[25] Lowe, D.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)

[26] Mairal, J., Bach, F., Ponce, J., Sapiro, G.: Online dictionary learning for sparse coding. International Conference on Machine Learning, Montreal, Canada (2009)

[27] Mairal, J., Bach, F., Ponce, J., Sapiro, G.: Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research 11, 19–60 (2010)

[28] Mairal, J., Bach, F., Ponce, J., Sapiro, G., Zisserman, A.: Supervised dictionary learning. Adv. NIPS 21 (2009)

[29] Mehlhorn, K.: Data Structures and Algorithms 3: Multi-dimensional Searching and Computational Geometry. Springer-Verlag, New York, NY, USA (1984)

[30] Mori, G., Belongie, S., Malik, J.: Shape contexts enable efficient retrieval of similar shapes. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 1, 723–730 (2001)

[31] Muja, M., Lowe, D.G.: Fast approximate nearest neighbors with automatic algorithm configuration. In: International Conference on Computer Vision Theory and Applications (VISSAPP '09), pp. 331–340. INSTICC Press (2009)

[32] Murray, J., Kreutz-Delgado, K.: Sparse image coding using learned overcomplete dictionaries. Machine Learning for Signal Processing, pp. 579–588 (2004)

[33] Nievergelt, J., Hinterberger, H., Sevcik, K.: The grid file: An adaptable, symmetric multi-key file structure. Trends in Information Processing Systems, pp. 236–251 (1981)

[34] Olshausen, B., Field, D.: Sparse coding with an overcomplete basis set: A strategy employed by V1. Vision Research 37(23), 3311–3325 (1997)


[35] Raginsky, M., Lazebnik, S.: Locality-sensitive binary codes from shift-invariant kernels. In: Advances in Neural Information Processing Systems (2009)

[36] Robinson, J.: The KDB-tree: a search structure for large multidimensional dynamic indexes. Proceedings of the 1981 ACM SIGMOD International Conference on Management of Data, pp. 10–18 (1981)

[37] Sivic, J., Zisserman, A.: Video Google: A text retrieval approach to object matching in videos. In: Proceedings of the International Conference on Computer Vision, vol. 2, pp. 1470–1477 (2003)

[38] Snavely, N., Seitz, S., Szeliski, R.: Photo tourism: exploring photo collections in 3D. ACM (2006)

[39] Torralba, A., Fergus, R., Weiss, Y.: Small codes and large image databases for recognition. Proceedings of the 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2008)

[40] Wang, J., Kumar, S., Chang, F.: Semi-supervised hashing for scalable image retrieval. In: Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, San Francisco, USA (2010)

[41] Weber, R., Schek, H., Blott, S.: A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. Proceedings of the International Conference on Very Large Data Bases, pp. 194–205 (1998)

[42] Weiss, Y., Torralba, A., Fergus, R.: Spectral hashing. pp. 1753–1760. Citeseer (2009)

[43] Zepeda, J., Kijak, E., Guillemot, C.: Approximate nearest neighbors using sparse representations. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2370–2373 (2010)

[44] Zhang, H., Berg, A.C., Maire, M., Malik, J.: SVM-KNN: Discriminative nearest neighbor classification for visual category recognition. In: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2126–2136. IEEE Computer Society, Washington, DC, USA (2006)

[45] Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C.: Local features and kernels for classification of texture and object categories: A comprehensive study. In: Conference on Computer Vision and Pattern Recognition Workshop, pp. 13–13 (2006)


Figure 15: Recall@1 for SIFT descriptors undergoing various image distortions. Panels (a)–(h) correspond to the LEUVEN, UBC, WALL, BARK, BIKES, BOAT, GRAPHIC and TREES categories, each comparing KLSH, KDT, PQ and MRSC.