Discriminative Decorrelation for Clustering and Classification⋆

Bharath Hariharan1, Jitendra Malik1, and Deva Ramanan2

1 University of California at Berkeley, Berkeley, CA, USA
{bharath2,malik}@cs.berkeley.edu
2 University of California at Irvine, Irvine, CA, USA
[email protected]
Abstract. Object detection has over the past few years converged on using linear SVMs over HOG features. Training linear SVMs however is quite expensive, and can become intractable as the number of categories increases. In this work we revisit a much older technique, viz. Linear Discriminant Analysis, and show that LDA models can be trained almost trivially, and with little or no loss in performance. The covariance matrices we estimate capture properties of natural images. Whitening HOG features with these covariances thus removes naturally occurring correlations between the HOG features. We show that these whitened features (which we call WHO) are considerably better than the original HOG features for computing similarities, and prove their usefulness in clustering. Finally, we use our findings to produce an object detection system that is competitive on PASCAL VOC 2007 while being considerably easier to train and test.
1 Introduction
Over the last decade, object detection approaches have converged on a single dominant paradigm: that of using HOG features and linear SVMs. HOG features were first introduced by Dalal and Triggs [1] for the task of pedestrian detection. More contemporary approaches build on top of these HOG features by allowing for parts and small deformations [2], training separate HOG detectors for separate poses and parts [3] or even training separate HOG detectors for each training exemplar [4].
Figure 1(a) shows an example image patch of a bicycle, and a visualization of the corresponding HOG feature vector. Note that while the HOG feature vector does capture the gradients of the bicycle, it is dominated by the strong contours of the fence in the background. Figure 1(b) shows an SVM trained using just this image patch as a positive, and large numbers of background patches as negatives [4]. As is clear from the figure, the SVM learns that the gradients of the fence are unimportant, while the gradients of the bicycle are important.
⋆ This work was funded by ONR-MURI Grant N00014-10-1-0933 and NSF Grant 0954083.
[Fig. 1 panels: (a) Image (left) and HOG (right); (b) SVM; (c) PCA; (d) LDA]
Fig. 1. Object detection systems typically use HOG features, as in (a). HOG features however are often swamped out by background gradients. A linear SVM learns to stress the object contours and suppress background gradients, as in (b), but requires extensive training. An LDA model, shown in (d), has a similar effect but with negligible training. PCA on the other hand completely kills discriminative gradients, (c). The PCA, LDA and SVM visualizations show the positive and negative components separately, with the positive components on the left and negative on the right.
However, training linear SVMs is expensive. Training involves expensive bootstrapping rounds where the detector is run in a scanning window over multiple negative images to collect "hard negative" examples. While this is feasible for training detectors for a few tens of categories, it will be challenging when the number of object categories is of the order of tens of thousands, which is the scale at which humans operate.
However, linear SVMs aren't the only linear classifiers around. Indeed, Fisher proposed his linear discriminant as far back as 1936 [5]. Fisher discriminant analysis tries to find the direction that maximizes the ratio of the between-class variance to the within-class variance. Linear discriminant analysis (LDA) is a generative model for classification that is equivalent to Fisher's discriminant analysis if the class covariances are assumed to be equal. Textbook accounts of LDA can be found, for example, in [6, 7]. Given a training dataset of positive and negative features (x, y) with y ∈ {0, 1}, LDA models the data x as generated from class-conditional Gaussians:

P(x, y) = P(x|y)P(y), where P(y = 1) = π and P(x|y) = N(x; µy, Σ)

where the means µy are class-dependent but the covariance matrix Σ is class-independent. A novel feature x is classified as positive if P(y = 1|x) > P(y = 0|x), which is equivalent to a linear classifier with weights given by w = Σ−1(µ1 − µ0). Figure 1(d) shows the LDA model trained with the bicycle image patch as positive and generic image patches as background. Clearly, like the SVM, the LDA model suppresses the contours of the background, while enhancing the gradients of the
bicycle. LDA has been used before in computer vision, one of the earliest and most popular applications being face recognition [8].
Training an LDA model requires figuring out the means µy and Σ. However, unlike an SVM which has to be trained from scratch for every object category, we show that µ0 (corresponding to the background class) and Σ can be estimated just once, and reused for all object categories, making training almost trivial. Intuitively, LDA computes the average positive feature µ1, centers it with µ0, and "whitens" it with Σ−1 to remove correlations. The matrix Σ acts as a model of HOG patches of natural images. For instance, as we show in section 2, this matrix captures the fact that adjacent HOG cells are highly correlated owing to curvilinear continuity. Thus, not all of the strong vertical gradients in the HOG cells of Figure 1(a) are important: many of them merely reflect the continuity of contours. Removing these correlations therefore leaves behind just the discriminative gradients.
The LDA model is just the difference of means in a space that has been whitened using the covariance matrix Σ. This suggests that this whitened space might be significant outside of just training HOG classifiers. In fact, we find that dot products in this whitened space are more indicative of visual similarity than dot products in HOG space. Consequently, clustering whitened HOG feature vectors (which we call WHO, for Whitened Histogram of Orientations) gives more coherent and often semantically meaningful clusters.
Principal components analysis (PCA) is a related method that has been explored for tasks such as face recognition [9] and as a tool for dimensionality reduction in object recognition [10]. In particular, Ke and Sukthankar [11] and Schwartz et al. [12] examine (linear) low-dimensional projections of oriented gradient features. In PCA, the data is projected onto the directions of most variation, and the directions of least variation are ignored. However, for our purposes, the directions that are ignored are often those that are the most discriminative. Figure 1(c) shows the result of projecting the data down to the top 30 principal components. Clearly, this is even worse than the original HOG space: contours of the bicycle are more or less completely discarded. Our observations mirror those of Belhumeur et al. [8], who showed that in the context of face recognition, the directions retained by PCA often correspond to variations in illumination and viewing direction, rather than variations that would be discriminative of the identity of the face. [8] conclude that Fisher's discriminant analysis outperforms PCA on face recognition tasks. In section 4 we show concretely that the low-dimensional subspace chosen by PCA is significantly worse than whitened HOG as far as computing similarity is concerned.
Our aim in this paper is therefore to explore the advantages provided by whitened HOG features for clustering and classification. In section 2 we go into the details of our LDA models, describing how we obtain our covariance matrix, and the properties of the matrix. Section 3 describes our first set of experiments on the INRIA pedestrian detection task, showing that LDA models can be competitive with linear SVMs. Section 4 outlines how WHO features can be used for clustering exemplars. We then use these clusters to train detectors, and evaluate the performance of the LDA model vis-à-vis SVMs and other choices in section 5. In section 6 we tie it all together to produce a full object detection system that performs competitively on the PASCAL VOC 2007 dataset, while being orders-of-magnitude faster to train (due to our LDA classifiers) and orders-of-magnitude faster to test (due to our clustered representations).
2 Linear Discriminant Analysis
In this section, we describe our model of image gradients based on LDA. For our HOG implementation, we use the augmented HOG features of [2]. Briefly, given an image window of fixed size, the window is divided into a grid of 8 × 8 cells. From each cell we extract a feature vector xij of gradient orientations of dimensionality d = 31. We write x = [xij] for the final window descriptor obtained by concatenating features across all locations within the window. If there are N cells in the window, the feature vector has dimensionality Nd.
The LDA model is a linear classifier over x with weights given by w = Σ−1(µ1 − µ0). Here Σ is an Nd × Nd matrix, and a naive approach would require us to estimate this matrix again for every value of N and also for every object category. In what follows we describe a simple procedure that allows us to learn a Σ and a µ0 (corresponding to the background) once, and then reuse it for every window size N and for every object category. Given a new object category, we need only a set of positive features which are averaged, centered, and whitened to compute the final linear classifier.
2.1 Estimating µ0 and Σ
Object-independent backgrounds: Consider the task of learning K 1-vs-all LDA models from a multi-class training set spanning K objects and background windows. One can show that the maximum likelihood estimate of Σ is the sample covariance estimated across the entire training set, ignoring class labels. If we assume that the number of instances of any one object is small compared to the total number of windows, we can similarly define a generic µ0 that is independent of object type. This means that we can learn a generic µ0 and Σ from unlabeled windows, and this need not be done anew for every object category.
Marginalization: We are now left with the task of estimating a µ0 and Σ for every value of the window size N. However, note that the statistics of smaller-size windows can be obtained by marginalizing out statistics of larger-size windows. Gaussian distributions can be marginalized by simply dropping the marginalized variables from µ0 and Σ. This means that we can learn a single µ0 and Σ for the largest possible window of N0 cells, and generate means and covariances for smaller window sizes "on-the-fly" by selecting subpartitions of µ0 and Σ. This reduces the number of parameters to be estimated to an N0d-dimensional µ0 and an N0d × N0d matrix Σ.
Scale and translation invariance: Image statistics are largely scale and translation invariant [13]. We achieve such invariance by including training windows extracted from different scales and translations. We can further exploit translation invariance, or stationarity in statistical terms, to reduce the number of model parameters. To encode a stationary µ0, we compute the mean HOG feature µ = E[xij], averaged over all features x and cell locations (i, j). µ0 is just µ replicated over all N0 cells.

Write Σ as a block matrix with blocks Σ(ij),(lk) = E[xij xᵀlk]. We then incorporate assumptions of translation invariance by modeling Σ with a spatial autocorrelation function [14]:

Σ(ij),(lk) = Γ(i−l),(j−k) = E[xuv xᵀ(u+i−l),(v+j−k)]   (1)

where the expectation is over cell locations (u, v) and gradient features x. In other words, we assume that Σ(ij),(lk) depends only on the relative offsets (i − l) and (j − k). Thus instead of estimating an N0d × N0d matrix Σ, we only have to estimate the d × d matrices Γs,t for every offset (s, t). For a spatial window with N0 cells, there exist only N0 distinct relative offsets. Thus we only need to estimate O(N0d²) parameters.

We now estimate µ and the matrices Γs,t from all subwindows extracted from a large set of 10,000 unlabeled natural images (the PASCAL VOC 2010 dataset). This computation can be done once and for all, and the resulting µ and Γ stored. Then, given a new object category, µ0 can be reconstructed by replicating µ over all the cells in the window and Σ can be reconstructed from Γ using (1).
Regularization: Even given this large training set and our O(N) parametrization, we found Σ to be low-rank and non-invertible. This implies that it would be even more difficult to learn a separate covariance matrix for each positive class, because we have far fewer positive examples, further motivating a single-covariance assumption. In general, it is difficult to learn high-dimensional covariance matrices [14]. For typical values of N, Σ can grow to a 10,000 × 10,000 matrix. One solution is to enforce conditional independence assumptions with a Gaussian Markov random field; we discuss this further below. In practice, we regularized the sample covariance by adding a small value (λ = 0.01) to its diagonal, corresponding to an isotropic prior on Σ.
2.2 Properties of the covariance matrix
WHO: We define a whitened histograms of orientations (WHO) descriptor as x̂ = Σ−1/2(x − µ0). The transformed feature vector x̂ then has an isotropic covariance matrix. An alternative interpretation of the linear discriminant is that w computes the difference between the average positive and negative features in WHO space. Such descriptors may be useful for clustering because Euclidean distances are more meaningful in this space. We explore this further in section 4. We use a Cholesky decomposition RRᵀ = Σ and Gaussian elimination (Matlab's backslash) to efficiently compute this whitening transformation.
Analysis: We examine the structure of Σ in Fig. 2. Intuitively, Σ encodes generic spatial statistics about oriented gradients. For example, due to curvilinear continuity, we expect a strong horizontal gradient response to be correlated with a strong response at a horizontally-adjacent location. Multiplying gradient features by Σ−1 subtracts off such correlated measurements. Because Σ−1 is sparse, features need only be de-correlated with adjacent or nearby spatial locations. This in turn suggests that image gradients can be fit well with a 3rd- or 4th-order spatial Markov model, which may make for easier estimation and faster computations. A spatial Markov assumption makes intuitive sense; given that we see a strong horizontal gradient at a particular location, we expect to see a strong gradient to its right regardless of the statistics to its left. We experimented with such sparse models [15], but found an unrestricted Σ to work well and to be simpler to implement.
Implications: Our statistical model, though quite simple, has several implications for scanning-window templates. (1) One should learn templates of larger spatial extent than the object. For example, a 2nd-order spatial Markov model implies that one should score gradient features two cells away from the object border in order to de-correlate features. Intuitively, this makes sense; a pedestrian template wants to find vertical edges at the side of the face, but if it also finds vertical edges above the face, then this evidence may be better explained by the vertical contour of a tree or doorway. Dalal and Triggs actually made the empirical observation that larger templates perform better, but attributed this to local context [1]; our analysis suggests that decorrelation may be a better explanation. (2) Current strategies for modeling occlusion/truncation by "zero"ing regions of a template may not suffice [16, 17]. Rather, our model allows us to properly marginalize out such regions from µ and Σ. The resulting template w will not be equivalent to a zeroed-out version of the original template, because the de-correlation operation must change for gradient features near the occluded/truncated regions.
[Fig. 2 panels: Σ, Σ−1, Σ−1 > ε, and Σ−1 < −ε, plotted over cell offsets x−2,0 … x2,0]
Fig. 2. We visualize correlations between 9 orientation features in horizontally-adjacent HOG cells as a concatenated set of 9 × 9 matrices. Light pixels are positive while dark pixels are negative. We plot the covariance and precision matrix on the left, and the positive and negative values of the precision matrix on the right. Multiplying a HOG vector with Σ−1 decorrelates it, subtracting off gradient measurements from adjacent orientations and locations. The sparsity pattern of Σ−1 suggests that one needs to decorrelate features only a few cells away, indicating that gradients may be well modeled by a low-order spatial Markov model.
[Fig. 3 panels: (a) AP; (b) Centered; (c) LDA]
Fig. 3. The performance (AP) of the LDA model and the centered model (LDA without whitening) vis-à-vis a standard linear SVM on HOG features. We also show the detectors for the centered model and the LDA model.
3 Pedestrian detection
HOG feature vectors were first described in detail in [1], where they were shown to significantly outperform other competing features in the task of pedestrian detection. This is a relatively easy detection task, since pedestrians don't vary significantly in pose. Our local implementation of the Dalal-Triggs detector achieves an average precision (AP) of 79.66% on the INRIA dataset, outperforming the original AP of 76.2% reported in Dalal's thesis [18]. We think this difference is due to our SVM solver, which implements multiple passes of data-mining for hard negatives. We choose this task as our first test bed for WHO features.
We use our LDA model to train a detector and evaluate its performance. Figure 3 shows our performance compared to that of a standard linear SVM on HOG features. We achieve an AP of 75.10%. This is slightly lower than the SVM performance, but nearly equivalent to the original performance of [18]. However, note that compared to the SVM model, the LDA model is estimated only from a few positive image patches and neither requires access to large pools of negative images nor involves any costly bootstrapping steps. Given this overwhelmingly reduced computation, this performance is impressive.
Constructing our LDA model from HOG feature vectors involves two steps, i.e., subtracting µ0 (centering) and multiplying by Σ−1 (whitening). To tease out the contribution of whitening, we also evaluate the performance when the whitening step is removed. In other words, we consider the detector formed by simply taking the mean of the centered positive feature vectors. We call this the "centered model", and its performance is indicated by the black curve in Figure 3. It achieves an AP of less than 10%, indicating that whitening is crucial to performance. We also show the detectors in Figure 3, and it can be clearly seen that the LDA model does a better job of identifying the discriminative contours (the characteristic shape of the head and shoulders) compared to simple centering.
4 Clustering in WHO space
Owing to large intra-class variations in pose and appearance, a single linear classifier over HOG feature vectors can hardly be expected to do well for generic object detection. Hence many state-of-the-art methods train multiple "mixture components", multiple "parts" or both [3, 2]. These mixture components and parts are either determined based on extra annotations [3], or inferred as latent variables during training [2]. [4] take an extreme approach and consider each positive example as its own mixture component, training a separate HOG detector for each example.
In this section we consider a cheaper and simpler strategy of producing components by simply clustering the feature vectors. As a test bed we use the PASCAL VOC 2007 object detection dataset (train+val) [19]. We first cluster the exemplars of a category using kmeans on aspect ratio. Then for each cluster, we resize the exemplars in that cluster to a common aspect ratio, compute feature vectors on the resulting image patches and finally subdivide the clusters using recursive normalized cuts [20]. The affinity we use for N-cuts is the exponential of the cosine of the angle between the two feature vectors.
We can either cluster using HOG feature vectors or using WHO feature vectors (x̂ = Σ−1/2(x − µ0), see section 2). Alternatively, we can use PCA to project HOG features down to a low-dimensional space (we use 30 dimensions), and cluster in that space. Figure 4 shows an example cluster obtained in each case for the 'bus' category. The cluster based on WHO features is in fact semantically meaningful, capturing buses in a particular pose. HOG-based clustering produces less coherent results, and the cluster becomes significantly worse when performed in the dimensionality-reduced space. This is because, as Figure 1 shows, HOG overstresses background, whereas whitening removes the correlations common in natural images, leaving behind only discriminative gradients. PCA goes the opposite way and in fact removes discriminative directions, making matters worse. Figure 5 shows some more examples of HOG-based clusters and WHO-based clusters. Clearly, the WHO-based clusters are significantly more coherent.
5 Training each cluster
We now turn to the task of training detectors for each cluster. Following our experiments in section 3, we have several choices:

1. Train a linear SVM for each cluster, using the images of the cluster as positives, and image patches from other categories/background as negatives (SVM on cluster).
2. Train an LDA model on the cluster, i.e., use w = Σ−1(xmean − µ0) (LDA on cluster).
3. Take the mean of the centered HOG features of the patches in the cluster, i.e., use w = xmean − µ0 ("centered model" on cluster).
[Fig. 4 panels: (a) HOG; (b) PCA; (c) WHO]
Fig. 4. Clusters obtained using N-cuts on HOG feature vectors, HOG vectors projected to a PCA basis, and WHO feature vectors. Observe that while all clusters make mistakes, the HOG-based cluster is much less coherent than the WHO-based cluster. The PCA cluster is even less coherent than the HOG-based cluster.
[4] treat each exemplar separately, and get their boost from training to discriminate each exemplar from the background. On the other hand, we believe that we can get bigger potential gains by averaging over multiple positive examples. In order to evaluate this, we also consider the following choices, sketched in code after the list:

4. Train an LDA model on just the medoid, i.e., w = Σ−1(xmedoid − µ0) (LDA on the medoid).
5. Take the medoid of the cluster and train a linear SVM, using the medoid as positive and image patches from other categories/background as negative.
We take the clusters obtained as described in the previous section for three categories: horse, motorbike and bus. For each cluster we train detectors according to the five schemes above. We then run each detector on the test set of PASCAL VOC 2007, and compute its AP. The ground truth for each cluster consists of all objects of that category.
Table 1 shows a summary comparison of the five schemes, and Figure 6 compares the performance of the LDA model with the other four schemes in more detail. First note that both single-example schemes perform worse than the LDA model. Indeed, for all but 6 of the 77 clusters tested, the LDA model achieves a higher AP than a single SVM trained using the medoid. This clearly shows that simple averaging over similar positive examples helps more than explicitly training to discriminate single exemplars from the background. This also provides an indirect validation of our clustering step, since it indicates that each cluster is coherent enough to be better than any single individual example. In our experimental results, we further quantitatively evaluate our clusters by demonstrating that they perform similarly to "brute-force" methods that train a separate exemplar template for every member of every cluster [4]. Our clustered representation performs similarly while being faster to evaluate.
[Fig. 5 panels: (a) horse; (b) aeroplane]
Fig. 5. Examples of clusters obtained for aeroplane and horse using HOG feature vectors (left) and WHO feature vectors (right). Note how the clusters based on WHO are significantly more coherent than the clusters based on HOG.
Secondly, observe that on average the performance of the LDA model is very similar to the performance of a linear SVM, and is also highly correlated with it. This reiterates our observations on the pedestrian detection task in section 3. This also indicates that our LDA model can be used in place of SVMs for HOG-based detectors with little or no loss in performance, at a fraction of the computational cost and with very little training data.
Finally, the performance of the centered model without whitening is much lower than the LDA model, and is in fact significantly worse than even the single-example models. This again shows that decorrelation, and not just centering, is crucial for performance.
6 Combining across clusters
In this section we attempt to tie the previous two sections together to produce a full object detection system. We compare here to the approach of [4], who show competitive performance on PASCAL VOC 2007 by simply training one linear SVM per exemplar. This performance is impressive given that they use only HOG features and do not have any parts [2, 3].
            LDA on cluster  SVM on cluster  LDA on medoid  SVM on medoid  Centered
Mean AP      7.59 ± 4.86     6.75 ± 4.80     4.84 ± 4.13    4.05 ± 4.12   0.74 ± 2.02
Median AP    9.25 ± 3.86     9.16 ± 4.04     4.65 ± 3.71    2 ± 3.6       0.06 ± 0.7

Table 1. Mean and median AP (in %) of the different models.
Fig. 6. Performance (AP) of the LDA model compared to (from left to right) an SVM trained on the cluster, the centered model trained on the cluster, an SVM trained on the medoid, and an LDA model trained on the medoid. The blue line is the y = x line. The LDA model performs significantly better than both the single-example approaches and is comparable to an SVM trained on the cluster.
We agree with them on the fact that using multiple components instead of single monolithic detectors is necessary for handling the large intra-class variation. However, training a separate SVM for each positive example entails a huge computational complexity. Because the negative class for each model is essentially the background, one would ideally learn background statistics just once, and simply plug them in for each model.
LDA allows us to do precisely that. Background statistics in the form of Σ and µ are computed just once, and training only involves computing the mean of the positive examples. This reduces the computational complexity drastically: using LDA we can train all exemplar models of a particular category on a single machine in a few minutes. Table 2 shows how exemplar-LDA models compare to exemplar-SVMs [4]. As can be seen, there is little or no drop in performance.
Replacing SVMs by LDA significantly reduces the complexity at train time. However, at test time the computational complexity is still high, because one has to run a very large number of detectors over the image. We can reduce this computational complexity considerably by first clustering the positive examples as described in section 4. We then train one detector for each cluster, resulting in far fewer detectors. For instance, the 'horse' category has 403 exemplars but only 29 clusters.
To build a full object detection system, we need to combine these cluster detector outputs in a sensible way. Following [4], we train a set of rescoring functions that rescore the detections of each detector. Note that only detections that score above a threshold are rescored, while the rest are discarded.
We train a separate rescoring function for each cluster. For each detection, we construct two kinds of features. The first set of features consists of the dot products of the WHO feature vector of the detection window with the WHO feature vector of every exemplar in the cluster. This gives us as many features as there are examples in the cluster. These features encode the similarity of the detection window with the purported "siblings" of the detection window, namely the exemplars in the cluster.
The second set of features is similar to the context features described in [4, 3]. We consider every other cluster and record its highest-scoring detection that overlaps by more than 50% with this detection window. These features record the similarity of the detection window to other clusters and allow us to boost scores of similar clusters and suppress scores of dissimilar clusters.
These features, together with the original score given by the detector, form the feature vector for the detection window. We then train a linear SVM to predict which detection windows are indeed true positives, and fit a logistic to the SVM scores. At test time the detections of each cluster detector are rescored using these second-level classifiers, and then standard non-max suppression is performed to produce the final, sparse set of detections. Note that this second-level rescoring is relatively cheap since only detection windows that score above a threshold are rescored. Indeed, our cluster detectors can be thought of as the first step of a cascade, and significantly more sophisticated methods can be used to rescore these detection windows.
As shown in Table 2, our performance is very close to the performance of the exemplar SVMs. This is in spite of the fact that our first-stage detectors require no training at all, and our second-stage rescoring functions have an order of magnitude fewer parameters than ESVM+Co-occ [4] (for instance, for the horse category, in the second stage we have fewer than 2,000 parameters, while ESVM+Co-occ has more than 100,000). Although our performance is lower than part-based models [2], one could combine such approaches and possibly train parts with LDA.
Finally, each detection of ours is associated with a cluster of training exemplars. We can go further and associate each detection with the closest exemplar in the cluster, where distance is defined as cosine distance in WHO space. This allows us to match each detection to an exemplar, as in [4]. Figure 7 shows examples of detections and the training exemplars they are associated with. As can be seen, the detections are matched to very similar and semantically related exemplars.

Fig. 7. Detection and appearance transfer. The top row shows detections, while in the bottom row the detected objects have been replaced by the most similar exemplars.
              ESVM     ESVM     ELDA     Ours-    Ours-    Ours-
             +Calibr  +Co-occ  +Calibr  only 1   only 2   full
aeroplane     20.4     20.8     18.4     17.4     22.1     23.3
bicycle       40.7     48.0     39.9     35.5     37.4     41.0
bird           9.3      7.7      9.6      9.7      9.8      9.9
boat          10.0     14.3     10.0     10.9     11.1     11.0
bottle        10.3     13.1     11.3     15.4     14.0     17.0
bus           31.0     39.7     39.6     17.2     18.0     37.8
car           40.1     41.1     42.1     40.3     36.8     38.4
cat            9.6      5.2     10.7     10.6      6.5     11.5
chair         10.4     11.6      6.1     10.3     11.2     11.8
cow           14.7     18.6     12.1     14.3     13.5     14.5
diningtable    2.3     11.1      3.0      4.1     12.1     12.2
dog            9.7      3.1     10.6      1.8     10.5     10.2
horse         38.4     44.7     38.1     39.7     43.1     44.8
motorbike     32.0     39.4     30.7     26.0     25.8     27.9
person        19.2     16.9     18.2     23.1     21.3     22.4
pottedplant    9.6     11.2      1.4      4.9      5.1      3.1
sheep         16.7     22.6     12.2     14.1     13.8     16.3
sofa          11.0     17.0     11.1      8.7     12.2      8.9
train         29.1     36.9     27.6     22.1     30.6     30.3
tvmonitor     31.5     30.0     30.2     15.2     12.8     28.8
Mean          19.8     22.6     19.1     17.0     18.3     21.0

Table 2. Our performance on VOC 2007, reported as AP in %. We compare with ESVM+Calibr and ESVM+Co-occ [4]. "ELDA+Calibr" constructs exemplar models using LDA, followed by a simple calibration step [4]. The last three columns show the performance using our clusters instead of individual exemplars. "Ours-only 1" is our performance using only the "sibling" features, while "Ours-only 2" is our performance using only the context features. Clearly both sets of features give us a boost. Our full model performs similarly to [4], but is much faster to train and test.

7 Conclusion

Correlations are naturally present in the features used in object detection, and we have shown that significant advantages can be derived by accounting for these correlations. In particular, LDA models trained using these correlations can be used as a highly efficient alternative to SVMs, without sacrificing performance. Decorrelated features can also be used for clustering examples, and we have
shown that the combination of these two ideas allows us to build a competitive object detection system that is significantly faster not just at train time but also at run time. Our work can be built upon to produce state-of-the-art object detection systems, mirroring the developments in SVM-based approaches [2, 3]. Our statistical models also suggest that natural image statistics, largely ignored in the field of object detection, are worth (re)visiting. For example, gradient statistics may be better modeled with heavy-tailed distributions instead of our Gaussian models [13]. However, the ideas expressed here are quite general, and as we have shown, can also be applied to tasks other than object detection, such as clustering.
References
1. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR. (2005)
2. Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part-based models. TPAMI 32 (2010)
3. Bourdev, L., Malik, J.: Poselets: Body part detectors trained using 3D human pose annotations. In: ICCV. (2009)
4. Malisiewicz, T., Gupta, A., Efros, A.A.: Ensemble of exemplar-SVMs for object detection and beyond. In: ICCV. (2011)
5. Fisher, R.: The use of multiple measurements in taxonomic problems. Annals of Human Genetics (1936)
6. Hastie, T., Tibshirani, R., Friedman, J.H.: The elements of statistical learning. Springer (2009)
7. Duda, R., Hart, P.: Pattern recognition and scene analysis (1973)
8. Belhumeur, P., Hespanha, J., Kriegman, D.: Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. TPAMI 19 (1997)
9. Turk, M., Pentland, A.: Eigenfaces for recognition. Journal of Cognitive Neuroscience (1991)
10. Murase, H., Nayar, S.: Visual learning and recognition of 3-D objects from appearance. IJCV 14 (1995)
11. Ke, Y., Sukthankar, R.: PCA-SIFT: A more distinctive representation for local image descriptors. In: CVPR. (2004)
12. Schwartz, W., Kembhavi, A., Harwood, D., Davis, L.: Human detection using partial least squares analysis. In: ICCV. (2009)
13. Hyvärinen, A., Hurri, J., Hoyer, P.: Natural Image Statistics: A probabilistic approach to early computational vision. (2009)
14. Rue, H., Held, L.: Gaussian Markov random fields: theory and applications. (2005)
15. Marlin, B., Schmidt, M., Murphy, K.: Group sparse priors for covariance estimation. In: UAI. (2009)
16. Vedaldi, A., Zisserman, A.: Structured output regression for detection with partial truncation. In: NIPS. (2009)
17. Gao, T., Packer, B., Koller, D.: A segmentation-aware object detection model with occlusion handling. In: CVPR. (2011)
18. Dalal, N.: Finding people in Images and Videos. PhD thesis, INRIA (2006)
19. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. (http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html)
20. Shi, J., Malik, J.: Normalized cuts and image segmentation. TPAMI 22 (2000)