A Boosting Framework for Visuality-Preserving Distance Metric Learning and Its Application to Medical Image Retrieval

Liu Yang, Student Member, IEEE, Rong Jin, Lily Mummert, Member, IEEE, Rahul Sukthankar, Member, IEEE, Adam Goode, Member, IEEE, Bin Zheng, Steven C.H. Hoi, Member, IEEE, and Mahadev Satyanarayanan, Fellow, IEEE
Abstract—Similarity measurement is a critical component in content-based image retrieval systems, and learning a good distance metric can significantly improve retrieval performance. However, despite extensive study, there are several major shortcomings with the existing approaches for distance metric learning that can significantly affect their application to medical image retrieval. In particular, similarity can mean very different things in image retrieval: resemblance in visual appearance (e.g., two images that look like one another) or similarity in semantic annotation (e.g., two images of tumors that look quite different yet are both malignant). Current approaches for distance metric learning typically address only one goal without consideration of the other. This is problematic for medical image retrieval where the goal is to assist doctors in decision making. In these applications, given a query image, the goal is to retrieve similar images from a reference library whose semantic annotations could provide the medical professional with greater insight into the possible interpretations of the query image. If the system were to retrieve images that did not look like the query, then users would be less likely to trust the system; on the other hand, retrieving images that appear superficially similar to the query but are semantically unrelated is undesirable because that could lead users toward an incorrect diagnosis. Hence, learning a distance metric that preserves both visual resemblance and semantic similarity is important. We emphasize that, although our study is focused on medical image retrieval, the problem addressed in this work is critical to many image retrieval systems. We present a boosting framework for distance metric learning that aims to preserve both visual and semantic similarities. The boosting framework first learns a binary representation using side information, in the form of labeled pairs, and then computes the distance as a weighted Hamming distance using the learned binary representation. A boosting algorithm is presented to efficiently learn the distance function. We evaluate the proposed algorithm on a mammographic image reference library with an Interactive Search-Assisted Decision Support (ISADS) system and on the medical image data set from ImageCLEF. Our results show that the boosting framework compares favorably to state-of-the-art approaches for distance metric learning in retrieval accuracy, with much lower computational cost. Additional evaluation with the COREL collection shows that our algorithm works well for regular image data sets.
Index Terms—Machine learning, image retrieval, distance metric learning, boosting.
1 INTRODUCTION
TODAY, medical diagnosis remains both art and science. Doctors draw upon both experience and intuition, using analysis and heuristics to render diagnoses [1]. When doctors augment personal expertise with research, the medical literature is typically indexed by disease rather than by relevance to the current case. The goal of interactive search-assisted decision support (ISADS) is to enable doctors to make better decisions about a given case by retrieving a selection of similar annotated cases from large medical image repositories.
A fundamental challenge in developing such systems is the identification of similar cases, not simply in terms of superficial image characteristics, but in a medically relevant sense. This involves two tasks: extracting a representative set of features and identifying an appropriate measure of similarity in the high-dimensional feature space. The former has been an active research area for several decades. The latter, largely ignored by the medical imaging community, is the focus of this paper.
In an ISADS system, each case maps to a point in a high-dimensional feature space, and similar cases to the current case (query) correspond to near neighbors in this space. The neighborhood of a point is defined by a distance metric, such as the euclidean distance. Our previous work showed that the choice of distance metric affects the accuracy of an ISADS system and that machine learning enables the construction of effective domain-specific distance metrics [2]. In a learned distance metric, data points with the same labels (e.g., malignant masses) are closer than data points with different labels (e.g., malignant versus benign).
Thus, the labels of the near neighbors of the query are likely to be informative.
1.1 Distance Metric Learning with Side Information
Research in distance metric learning is driven by the need to find meaningful low-dimensional manifolds that capture the intrinsic structure of high-dimensional data. Distance metric learning has been successfully applied to a variety of applications, such as content-based image retrieval [3] and text categorization [4].
Most distance metric learning techniques can be classified into two categories: unsupervised distance metric learning and supervised distance metric learning. The former aims to construct a low-dimensional manifold where geometric relationships between most data points are largely preserved. Supervised distance metric learning makes use of class-label information and identifies the dimensions that are most informative to the classes of examples. A brief overview of the related work is provided in Section 2.
Learning an effective distance metric with side information has recently attracted increasing interest. Typically, the side information is cast in the form of pairwise constraints between data elements, and the goal is to identify features that are maximally consistent with these constraints. In general, there are two types of pairwise constraints: 1) equivalence constraints specifying that the two given elements belong to the same class and 2) inequivalence constraints indicating that the given elements are from different classes. The optimal distance metric is learned by keeping the data elements of equivalence constraints close to each other while separating the data elements of inequivalence constraints apart. A number of approaches have been developed to learn distance metrics from the pairwise constraints. We refer to Section 2 for a comprehensive review.
One of the key challenges in learning a distance metric is its computational cost. This is because many approaches are designed to learn a full distance metric matrix whose size scales with the square of the data dimension. In addition to its large size, the requirement that the metric matrix be positive semidefinite further increases the computational cost [5]. Although several algorithms have been proposed to improve the computational efficiency (e.g., [6]), they still tend to be computationally prohibitive when the number of dimensions is large. To address the computational issue, we propose a boosting framework that can efficiently learn distance metrics for high-dimensional data.
1.2 Semantic Relevance and Visual Similarity
Most distance metric learning algorithms aim to construct distance functions that are consistent with the given pairwise constraints. Since these constraints are usually based on the semantic categories of the data, the learned distance metric essentially preserves only the semantic relevance among data points. Thus, a drawback with these approaches is that, when they are applied to image retrieval problems, images ranked at the top of a retrieval list may not be visually similar to the query image, due to the gap between semantic relevance and visual similarity. For instance, a doughnut and a tire have similar shapes, yet belong to different concept categories; a solar car looks almost nothing like a regular car, though functionally, they both belong to the same object category. Since most users of image retrieval applications seek images that are both semantically and visually close to the query image, we must learn distance functions that preserve both semantic relevance and visual resemblance. This issue is of particular importance in medical image retrieval. If the system were to retrieve images that did not look like the query, then doctors would be less likely to trust the system; on the other hand, retrieving images that appear superficially similar to the query but are semantically unrelated is undesirable because that could lead doctors toward an incorrect diagnosis.
We address the challenge by automatically generating links that pair images with high visual resemblance. These visual pairs, together with the provided side information, are used to train a distance function that preserves both visual similarity and semantic relevance between images. The trade-off between semantic relevance and visual similarity can be easily adjusted by the number of visual pairs. A detailed discussion of how these visual pairs are generated is given in Section 4.
The remainder of the paper is organized as follows: Section 2 reviews the work related to ISADS, distance metric learning, and boosting. Section 3 describes the boosting framework for distance metric learning. Section 4 presents the application of the proposed algorithm to retrieval of both medical images and regular images.
2 RELATED WORK
Over the last decade, the increasing availability of powerful computing platforms and high-capacity storage hardware has driven the creation of large, searchable image databases, such as digitized medical image reference libraries. These libraries have been used to train and validate computer-aided diagnosis (CAD) systems in a variety of medical domains, including breast cancer. However, the value of CAD in clinical practice is controversial, due to its black-box nature and lack of reasoning ability [7], [8], [9], [10], [11], despite significant recent progress [12], [13], [14], [15], [16], [17], [18], [19], [20] in both automated detection and characterization of breast masses. An alternative approach, espoused by efforts such as ISADS [2], eschews automated diagnosis in favor of providing medical professionals with additional context about the current case that could enable them to make a more informed decision. This is done by retrieving medically relevant cases from the reference library and displaying their outcomes. Earlier work [2] has demonstrated that learning domain-specific distance metrics significantly improves the quality of such searches.
In general, methods for distance metric learning fall into two categories: supervised and unsupervised learning. Since our work is most closely related to supervised distance metric learning, we omit the discussion of unsupervised distance metric learning and refer readers to a recent survey [21].
In supervised distance metric learning, most algorithms learn a distance metric from a set of equivalence constraints and inequivalence constraints between data objects. The optimal distance metric is found by keeping objects in equivalence constraints close and objects in inequivalence constraints well separated. Xing et al. [22] formulate distance metric learning as a constrained convex programming problem by minimizing the distance between the data points in the same classes under the constraint that the data points from different classes are well separated.
This algorithm is extended to the nonlinear case by the introduction of kernels [23]. Local Linear Discriminative Analysis [24] estimates a local distance metric using local linear discriminant analysis. Relevant Components Analysis (RCA) [25] learns a global linear transformation from the equivalence constraints. Discriminative Component Analysis (DCA) and Kernel DCA [26] improve RCA by exploring inequivalence constraints and capturing nonlinear transformations via contextual information. Local Fisher Discriminant Analysis (LFDA) [27] extends classical LDA to the case when the side information is in the form of pairwise constraints. Kim et al. [28] provide an efficient incremental learning method for LDA by adopting a sufficient spanning set approximation for each update step. Schultz and Joachims [29] extend the support vector machine to distance metric learning by encoding the pairwise constraints into a set of linear inequalities. Neighborhood Component Analysis (NCA) [30] learns a distance metric by extending the nearest neighbor classifier. The large margin nearest neighbor (LMNN) classifier [6] extends NCA through a maximum margin framework. Yang et al. [31] propose a Local Distance Metric (LDM) that addresses multimodal data distributions in distance metric learning by optimizing local compactness and local separability in a probabilistic framework. Finally, a number of recent studies [28], [32], [33], [34], [35], [38], [39], [40], [41], [42], [43], [44], [45], [46], [47], [48], [49] focus on examining and exploring the relationship among metric learning, dimensionality reduction, kernel learning, semi-supervised learning, and Bayesian learning.
Learning distance metrics by a boosting framework was first presented by Hertz et al. [50], [51]. In addition, in [36], [37], [52], different boosting strategies are presented to learn distance functions from labeled data. Although all of these algorithms employ a boosting strategy to learn a distance function, our algorithm differs from the existing work in that earlier algorithms for distance function learning closely follow AdaBoost [53] without considering the optimization of the specified objective functions. Some of the existing methods (e.g., [52]) do not have a well-specified objective function; therefore, the convergence of their algorithms and the optimality of the resulting distance function are unclear. In contrast, our algorithm is based on the optimization of the objective function specified in our study. Our contributions include a theoretical analysis of the convergence condition of our algorithm and the optimality of the resulting distance function. We believe that the theoretical analysis of the proposed algorithm is important and could be instrumental to the performance of our boosting framework.

We would also like to mention some recent developments in nonmetric distance learning, such as Generalized Nonmetric Multidimensional Scaling [54]. Although nonmetric distance learning appears to be more flexible than metric distance learning, we believe that metric distance, in general, is not only more intuitive but also more robust to data noise due to the constraints imposed by the triangle inequality.
3 A BOOSTING FRAMEWORK FOR DISTANCE METRIC LEARNING
In this section, we present a novel boosting framework, termed BDM (we follow the terminology from [2]), that automatically learns a distance function from a given set of pairwise constraints. The main idea is to iteratively generate a set of binary features from the side information. The learned binary features are used for data representation, and the distance is computed as a weighted Hamming distance based on the learned binary data representation.
3.1 Preliminaries
We denote by $\mathcal{D} = \{x_1, x_2, \ldots, x_n\}$ the collection of data points. Each $x \in \mathbb{R}^d$ is a vector of $d$ dimensions. We denote by $X = (x_1, x_2, \ldots, x_n)$ the data matrix containing the input features of both the labeled and the unlabeled examples. Following [22], we assume that the side information is available in the form of labeled pairs, i.e., whether or not two examples are in the same semantic category. For convenience of discussion, below we refer to examples in the same category as similar examples and examples in different categories as dissimilar examples. Let the set of labeled example pairs be denoted by

$$\mathcal{P} = \left\{(x_i, x_j, y_{i,j}) \mid x_i \in \mathcal{D},\, x_j \in \mathcal{D},\, y_{i,j} \in \{+1, 0, -1\}\right\},$$

where the class label $y_{i,j}$ is positive (i.e., $+1$) when $x_i$ and $x_j$ are similar, and negative (i.e., $-1$) when $x_i$ and $x_j$ are different. $y_{i,j}$ is set to zero when the example pair $(x_i, x_j)$ is unlabeled. Finally, we denote by $d(x_i, x_j)$ the distance function between $x_i$ and $x_j$. Our goal is to learn a distance function that is consistent with the labeled pairs in $\mathcal{P}$.

Remark 1. Note that standard labeled examples can always be converted into a set of labeled pairs by assigning two data points from the same category to the positive class and two data points from different categories to the negative class. Similar pairwise class labels are commonly employed in multiclass multimedia retrieval applications [55], [56]. It is important to emphasize that the reverse is typically difficult, i.e., it is usually difficult to infer the unique category labels of examples from the labeled pairs [57].
Remark 2. We label two images in the training set as similar if they either match in semantic category or appear visually related, as our goal is to simultaneously preserve both the semantic relevance and the visual similarity. For instance, two images could be defined to be similar only if they belonged to the same semantic category, or similarity could be defined based on the images' visual similarity according to human perception.
3.2 Definition of Distance Function
Before presenting the boosting algorithm, we need to define a distance function $d(x_i, x_j)$ that is nonnegative and satisfies the triangle inequality. A typical definition of the distance function used by several distance metric learning algorithms (e.g., [22], [31]) is

$$d(x_i, x_j) = \sqrt{(x_i - x_j)^\top A\, (x_i - x_j)}, \qquad (1)$$
where $A \in \mathbb{R}^{d \times d}$ is a positive semidefinite matrix that specifies the distance metric. One drawback with the definition in (1) arises from its high computational cost due to the size of $A$ and the constraint that the matrix $A$ has to be positive semidefinite. This is observed in our empirical study: when the dimensionality $d \geq 500$, we find that estimating $A$ in (1) is computationally very expensive.
In order to address the above problems, we present here a nonlinear distance function defined as follows:

$$d(x_i, x_j) = \sum_{t=1}^{T} \alpha_t \left(f_t(x_i) - f_t(x_j)\right)^2, \qquad (2)$$

where each $f(x): \mathbb{R}^d \to \{-1, +1\}$ is a binary classification function (note that we define the image of the binary $f$ to be $\{-1, +1\}$ instead of $\{0, 1\}$ for a more concise presentation below) and $\alpha_t > 0$, $t = 1, 2, \ldots, T$, are the combination weights. The key idea behind the above definition is to first generate a binary representation $(f_1(x), \ldots, f_T(x))$ by applying the classification functions $\{f_i(x)\}_{i=1}^{T}$ to $x$. Then, the distance between $x_i$ and $x_j$ is computed as a weighted Hamming distance between the binary representations of the two examples. Compared to (1), (2) is advantageous in that it allows for a nonlinear distance function. Furthermore, the iterative updates of the binary data representation, and consequently, the distance function, are the key to the efficient algorithm that is presented in the next section. We emphasize that although (2) appears to be similar to the distance function defined in [36], [37], it differs from the existing work in that each binary function takes into account all of the features. In contrast, each binary function in [36], [37] is limited to a single feature and therefore is significantly less general than the proposed algorithm.
The following theorem shows that the distance function defined in (2) is indeed a pseudometric, i.e., it satisfies all the conditions of a distance metric except for $d(x, y) = 0 \Leftrightarrow x = y$. More specifically, we have the following theorem:

Theorem 3.1. The distance function defined in (2) satisfies all the properties of a pseudometric, i.e., 1) $d(x_i, x_j) = d(x_j, x_i)$, 2) $d(x_i, x_j) \geq 0$, and 3) $d(x_i, x_j) \leq d(x_i, x_k) + d(x_k, x_j)$.

The first and second properties are easy to verify. To prove the third property, i.e., the triangle inequality, in Theorem 3.1, we need the following lemma:
Lemma 3.2. The following inequality

$$\left(f(x_i) - f(x_j)\right)^2 \leq \left(f(x_i) - f(x_k)\right)^2 + \left(f(x_k) - f(x_j)\right)^2 \qquad (3)$$

holds for any binary function $f: \mathbb{R}^d \to \{-1, +1\}$.

The proof of the above lemma can be found in Appendix A. It is straightforward to show the triangle inequality in Theorem 3.1 using Lemma 3.2 since $d(x_i, x_j)$ is a linear combination of the terms $(f_k(x_i) - f_k(x_j))^2$.

3.3 Objective Function
The first step toward learning a distance function is to define an appropriate objective function. The criterion employed by most distance metric learning algorithms is to identify a distance function $d(x_i, x_j)$ that gives a small value when $x_i$ and $x_j$ are similar and a large value when they are different. We can generalize this criterion by stating that, for any data point, its distance to a similar example should be significantly smaller than the distance to an example that is not similar. This generalized criterion is cast into the following objective function, i.e.,

$$\mathrm{err}(\mathcal{P}) = \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} I(y_{i,j} = -1)\, I(y_{i,k} = +1)\, I\!\left(d(x_i, x_k) > d(x_i, x_j)\right), \qquad (4)$$
where the indicator $I(x)$ outputs 1 when the Boolean variable $x$ is true and zero otherwise. In the above, we use $I(y_{i,j} = -1)$ to select the pairs of dissimilar examples and $I(y_{i,k} = +1)$ to select the pairs of similar examples. Every triple $(x_i, x_j, x_k)$ is counted as an error when the distance between the similar pair $(x_i, x_k)$ is larger than the distance between the dissimilar pair $(x_i, x_j)$. Hence, the objective function $\mathrm{err}(\mathcal{P})$ essentially measures the number of errors when comparing the distance between a pair of similar examples to the distance between a pair of dissimilar examples.

Although the classification error $\mathrm{err}(\mathcal{P})$ seems to be a natural choice for the objective function, it has two shortcomings when used to learn a distance function.

- It is well known in the study of machine learning that directly minimizing the training error tends to produce a model that overfits the training data.
- The objective function $\mathrm{err}(\mathcal{P})$ is a nonsmooth function due to the indicator $I(d(x_i, x_k) > d(x_i, x_j))$ and therefore is difficult to optimize.
To overcome the shortcomings of $\mathrm{err}(\mathcal{P})$, we propose the following objective function for distance metric learning:

$$F(\mathcal{P}) = \sum_{i,j,k=1}^{n} I(y_{i,j} = -1)\, I(y_{i,k} = +1)\, \exp\!\left(d(x_i, x_k) - d(x_i, x_j)\right). \qquad (5)$$
The key difference between $F(\mathcal{P})$ and $\mathrm{err}(\mathcal{P})$ is that $I(d(x_i, x_k) > d(x_i, x_j))$ is replaced with $\exp(d(x_i, x_k) - d(x_i, x_j))$. Since $\exp(d(x_i, x_k) - d(x_i, x_j)) \geq I(d(x_i, x_k) > d(x_i, x_j))$, by minimizing the objective function $F(\mathcal{P})$, we are able to effectively reduce the classification error $\mathrm{err}(\mathcal{P})$. The advantages of using $\exp(d(x_i, x_k) - d(x_i, x_j))$ versus $I(d(x_i, x_k) > d(x_i, x_j))$ are twofold.

- Since $\exp(d(x_i, x_k) - d(x_i, x_j))$ is a smooth function, the objective function $F(\mathcal{P})$ can, in general, be minimized effectively using standard optimization techniques.
- Similarly to AdaBoost [58], by minimizing the exponential loss function in $F(\mathcal{P})$, we are able to maximize the classification margin and therefore reduce the generalized classification error according to [53].
Despite the advantages stated above, we note that the number of terms in (5) is on the order of $O(n^3)$, potentially creating an expensive optimization problem. This observation motivates the development of a computationally efficient algorithm.
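The sketch below (our illustration; helper names are not from the paper) evaluates both (4) and (5) by brute force over all triples, which makes the $O(n^3)$ cost noted above explicit:

```python
import numpy as np

def objectives(D, y):
    # D[i, j]: pairwise distances; y[i, j] in {+1, 0, -1} labels the pair.
    # A triple (i, j, k) with y[i, j] = -1 (dissimilar) and y[i, k] = +1
    # (similar) is an error when d(x_i, x_k) > d(x_i, x_j).
    n = D.shape[0]
    err = F = 0.0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if y[i, j] == -1 and y[i, k] == +1:
                    err += float(D[i, k] > D[i, j])     # err(P) of (4)
                    F += np.exp(D[i, k] - D[i, j])      # F(P) of (5)
    return err, F
```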
3.4 Optimization Algorithm
Given the distance function in (2), our goal is to learn appropriate classifiers $\{f_t(x)\}_{t=1}^{T}$ and combination weights $\{\alpha_t\}_{t=1}^{T}$. In order to efficiently learn the parameters and functions, we follow the idea of boosting and adopt a greedy approach for optimization. More specifically, we start with a constant function for distance, i.e., $d_0(x_i, x_j) = 0$, and learn a distance function $d_1(x_i, x_j) = d_0(x_i, x_j) + \alpha_1 (f_1(x_i) - f_1(x_j))^2$. Using this distance function, the objective function in (5) becomes a function of $\alpha_1$ and $f_1(x)$, and can be optimized efficiently using bound optimization [59] as described later. In general, given a distance function $d_{t-1}(x_i, x_j)$ that is learned in iteration $t-1$, we learn $\alpha_t$ and $f_t(x)$ by using the following distance function $d_t(x_i, x_j)$:

$$d_t(x_i, x_j) = d_{t-1}(x_i, x_j) + \alpha_t \left(f_t(x_i) - f_t(x_j)\right)^2.$$

Using the above expression for the distance function, the objective function at iteration $t$, denoted by $F_t(\mathcal{P})$, in (5) becomes a function of $\alpha_t$ and $f_t(x)$, i.e.,

$$F_t(\mathcal{P}) = \sum_{i,j,k=1}^{n} I(y_{i,j} = -1)\, I(y_{i,k} = +1)\, \exp\!\left(d_{t-1}(x_i, x_k) - d_{t-1}(x_i, x_j)\right) \exp\!\left(\alpha_t\left[(f_t(x_i) - f_t(x_k))^2 - (f_t(x_i) - f_t(x_j))^2\right]\right).$$
To simplify our expression, we introduce the following notations:

$$d_{i,j} = d_{t-1}(x_i, x_j), \qquad (6)$$
$$f_i = f_t(x_i), \qquad (7)$$
$$\gamma_{i,j}^{\pm} = I(y_{i,j} = \pm 1)\, \exp(\pm d_{i,j}). \qquad (8)$$

Using the above notations, $F_t(\mathcal{P})$ is expressed as follows:

$$F_t(\mathcal{P}) = \sum_{i,j,k=1}^{n} \gamma_{i,j}^{-}\, \gamma_{i,k}^{+}\, \exp\!\left(\alpha_t (f_i - f_k)^2 - \alpha_t (f_i - f_j)^2\right). \qquad (9)$$
Hence, the key question is how to find the classifier $f(x)$ and the weight $\alpha$ that minimize the objective function in (9). For convenience of discussion, we drop the index $t$ for $\alpha_t$ and $f_t(x)$, i.e., $\alpha_t \to \alpha$ and $f_t(x) \to f(x)$. Now, we apply the bound optimization algorithm [59] to optimize $F_t(\mathcal{P})$ with respect to $\alpha$ and $f(x)$. The main idea is to approximate the difference between the objective functions of the current iteration and the previous iteration by a convex upper bound that has a closed-form solution. As shown in [59], the bound optimization is guaranteed to find a locally optimal solution.
Like most bound optimization algorithms, instead of minimizing $F_t(\mathcal{P})$ in (9), we will minimize the difference between the objective functions from two consecutive iterations, i.e.,

$$\Delta(\alpha, f) = F_t(\mathcal{P}) - F_{t-1}(\mathcal{P}), \qquad (10)$$

where $f = (f_1, \ldots, f_n)$ and $F_{t-1}(\mathcal{P}) = \sum_{i,j,k=1}^{n} \gamma_{i,j}^{-}\, \gamma_{i,k}^{+}$ is the objective function of the first $t-1$ iterations. Note that $\Delta(\alpha, f) = 0$ when $\alpha = 0$. This condition guarantees that when we minimize $\Delta(\alpha, f)$, the resulting $F_t(\mathcal{P})$ is smaller than $F_{t-1}(\mathcal{P})$, and therefore, the objective function will monotonically decrease through iterations. In addition, as shown in [59], minimizing the bound is guaranteed to find a locally optimal solution.
First, in the following lemma, we construct an upper bound for $\Delta(\alpha, f)$ that decouples the interaction between $\alpha$ and $f$. Before stating the result, we introduce the concept of a graph Laplacian for readers who may not be familiar with the term. A graph Laplacian for a similarity matrix $S$, denoted by $L(S)$, is defined as $L = \mathrm{diag}(S\mathbf{1}) - S$, where $\mathbf{1}$ is an all-one vector and the operator $\mathrm{diag}(v)$ turns vector $v$ into a diagonal matrix.
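The graph Laplacian is easy to compute directly; a minimal sketch (ours) follows:

```python
import numpy as np

def graph_laplacian(S):
    # L(S) = diag(S 1) - S for a symmetric similarity matrix S.
    return np.diag(S.sum(axis=1)) - S

S = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])          # a single link between points 0 and 1
f = np.array([1.0, -1.0, 1.0])
# f^T L f grows with the S-weighted disagreement (f_i - f_j)^2 across links,
# so it is large when linked points receive opposite signs.
print(f @ graph_laplacian(S) @ f)        # 4.0
```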
Lemma 3.3. For any $\alpha > 0$ and binary vector $f \in \{-1, +1\}^n$, the following inequality holds:

$$\Delta(\alpha, f) \leq \frac{\exp(8\alpha) - 1}{8}\, f^\top L^{+} f + \frac{\exp(-8\alpha) - 1}{8}\, f^\top L^{-} f, \qquad (11)$$

where $L^{+}$ and $L^{-}$ are the graph Laplacians for the similarity matrices $S^{+}$ and $S^{-}$, respectively, defined as

$$S_{i,j}^{+} = \frac{1}{2}\, \gamma_{i,j}^{+}\left(\gamma_i^{-} + \gamma_j^{-}\right), \qquad S_{i,j}^{-} = \frac{1}{2}\, \gamma_{i,j}^{-}\left(\gamma_i^{+} + \gamma_j^{+}\right), \qquad (12)$$

where $\gamma_i^{\pm}$ is defined as

$$\gamma_i^{\pm} = \sum_{j=1}^{n} \gamma_{i,j}^{\pm}. \qquad (13)$$

Recall that $\gamma_{i,j}^{\pm}$ is defined as $\gamma_{i,j}^{\pm} = I(y_{i,j} = \pm 1)\exp(\pm d_{i,j})$ in (8).
The detailed proof of this lemma is given in Appendix B.
Remark. Since $\gamma_{i,j}^{+} \propto I(y_{i,j} = +1)$ by (8), the similarity matrix $S^{+}$ depends only on the data points from the must-link pairs (equivalence constraints). Hence, $f^\top L^{+} f$ in (11) essentially measures the inconsistency between the binary vector $f$ and the must-link constraints. Similarly, $f^\top L^{-} f$ in (11) measures the inconsistency between $f$ and the cannot-link pairs (inequivalence constraints). Hence, the upper bound in (11) essentially computes the overall inconsistency between the labeled pairs and the binary vector $f$.
Next, using Lemma 3.3, we derive additional bounds for $\Delta(\alpha, f)$ by removing $\alpha$. This result is summarized in the following theorem.
Theorem 3.4. For any binary vector $f \in \{-1, +1\}^n$, the following inequality holds:

$$\min_{\alpha \geq 0} \Delta(\alpha, f) \leq -\frac{1}{8}\left[\max\!\left(0,\, \sqrt{f^\top L^{-} f} - \sqrt{f^\top L^{+} f}\right)\right]^2 \qquad (14)$$

$$\leq -\frac{\left[\max\!\left(0,\, f^\top L^{-} f - f^\top L^{+} f\right)\right]^2}{8n\left(\sqrt{\lambda_{\max}(L^{-})} + \sqrt{\lambda_{\max}(L^{+})}\right)^2}, \qquad (15)$$

where $\lambda_{\max}(S)$ is the maximum eigenvalue of matrix $S$.
The proof of this theorem can be found in Appendix C. In the following discussion, we will focus on minimizing the upper bound of the objective function stated in Theorem 3.4, which allows us to reduce the computational cost dramatically.
In order to search for the optimal binary solution $f$ that minimizes the upper bound of $\Delta(\alpha, f)$, we decide to first search for a continuous solution for $f$ and then convert the continuous $f$ into a binary one by comparing to a threshold $b$. In particular, we divide the optimization procedure into two steps:

- searching for the continuous $f$ that minimizes the upper bound in (15) and
- searching for the threshold $b$ that minimizes the upper bound in (14) for a continuous solution $f$.
To distinguish it from the continuous solution $f$, we denote the binary solution by $\hat{f}$. It is important to note that the two steps use different upper bounds: the looser upper bound in (15) allows for efficient computation of the continuous solution $f$, while the tighter upper bound in (11) allows for a more accurate estimation of the threshold $b$.

Finally, the optimization problems related to the two steps are summarized as follows, respectively:

$$\max_{f \in \mathbb{R}^n} \; f^\top \left(L^{-} - L^{+}\right) f, \qquad (16)$$

and

$$\max_{b \in \mathbb{R}} \; \sqrt{\hat{f}^\top L^{-} \hat{f}} - \sqrt{\hat{f}^\top L^{+} \hat{f}} \quad \text{s.t.} \quad \hat{f}_i = \begin{cases} +1, & f_i > b, \\ -1, & f_i \leq b. \end{cases} \qquad (17)$$
It is clear that the optimal solution to (16) is the maximum eigenvector of the matrix $L^{-} - L^{+}$, and therefore can be computed very efficiently. To find the $b$ that optimizes the problem in (17), it is sufficient to consider $f_1, f_2, \ldots, f_n$, in turn, as the candidate solutions.

Given the optimal $f = (f_1, \ldots, f_n)$, the next question is how to learn a classification function $f(x)$ to approximate $f$. Here, we consider two cases: the linear classifier and the nonlinear classifier. In the first case, we assume that the classification function $f(x)$ is based on a linear transformation of $x$, i.e., $f(x) = u^\top x$, where $u$ is a projection vector that needs to be determined. Then, the optimization problem in (16) is converted into the following problem:

$$\max_{u^\top u = 1} \; u^\top X \left(L^{-} - L^{+}\right) X^\top u. \qquad (18)$$
It is not difficult to see that the optimal projection $u$ that maximizes (18) is the maximum eigenvector of $X(L^{-} - L^{+})X^\top$. In the second case, we exploit the kernel trick. Specifically, we introduce a nonlinear kernel function $k(x, x')$ and assume the classification function $f(x)$ has the form

$$f(x) = \sum_{i=1}^{n} k(x_i, x)\, u_i.$$

Similarly to the linear case, we calculate the optimal projection $u = (u_1, \ldots, u_n)$ by computing the maximum eigenvector of $K(L^{-} - L^{+})K^\top$, where $K$ is a nonlinear kernel similarity matrix with $K_{i,j} = k(x_i, x_j)$. Fig. 1 summarizes the proposed boosted distance metric learning algorithm for both the linear and the nonlinear cases.
To further ensure that our algorithm is effective in
reducing the objective function despite being designed to
minimize the upper bound of the objective function, we
present the following theorem:
Theorem 3.5. Let $S_t^{+}, S_t^{-}$, $t = 1, \ldots, T$, be the similarity matrices that are computed by running the boosting algorithm (in Fig. 1) using (12). Let $L_t^{+}$ and $L_t^{-}$ be the corresponding graph Laplacians. Then, the objective function at the $T+1$ iteration, i.e., $F_{T+1}(\mathcal{P})$, is bounded as follows:

$$F_{T+1}(\mathcal{P}) \leq F_0(\mathcal{P}) \prod_{t=0}^{T} (1 - \delta_t), \qquad (19)$$

where

$$F_0 = \sum_{i,j,k=1}^{n} I(y_{i,j} = -1)\, I(y_{i,k} = +1),$$

$$\delta_t = \frac{\left[\lambda_{\max}\!\left(L_t^{-} - L_t^{+}\right)\right]^2}{8\,\lambda_{\max}\!\left(S_t^{-} + S_t^{+}\right)\left(\lambda_{\max}(L_t^{-}) + \lambda_{\max}(L_t^{+})\right)}.$$
The proof of this theorem can be found in Appendix D. Note that $\delta_t$ is bounded between 0 and $1/8$. As revealed in the above theorem, although we only aim to minimize the upper bound of the objective function, that upper bound decreases by a factor of $1 - \delta_t$ in each iteration, and therefore the objective function will, in general, decrease rapidly. This claim is supported by our experimental results below.
3.5 Preserving Visual Similarity
As pointed out in Section 1, most distance metric learning algorithms learn a distance metric that only preserves the semantic similarity without taking into account the visual resemblance between images. Fig. 2 shows a pair of images whose distance is very small according to a distance metric learned from the labeled examples. Note that, although both images are malignant according to the medical annotation, their appearances are rather different. By retrieving images that are only medically relevant, the system is poorly suited for assisting doctors in providing the necessary context for informed decision making.
Fig. 1. Distance metric learning algorithm in a boosting framework.
To address this problem, we introduce additional pairwise constraints to reflect the requirement of visual similarity. These additional pairwise constraints, referred to as visual pairs, are combined with the equivalence and inequivalence constraints to train a distance metric using the boosting algorithm described above. Ideally, the visual pairs would be specified manually by domain experts. However, in the absence of such labels, we represent an image by a vector of visual features and approximate the visual pairs by the pairs of images that fall within a small euclidean distance in the space of visual features. By incorporating the visual pairs as a part of the pairwise constraints, the resulting distance function will reflect not only the semantic relevance among images, but also the visual similarity between images. Furthermore, the trade-off between visual and semantic similarity in learning a distance function can be adjusted by varying the number of visual pairs. As shown in our experiments, employing a large number of visual pairs biases the learned metric toward preserving visual similarity. Finally, we note that the same set of low-level image features is used to assess the medical relevance of images and to generate visual pairs. The key difference is that, in generating visual pairs, every feature is treated with equal importance; in contrast, the semantic relevance between two images is judged by a weighted distance, and therefore, only a subset or combination of image features determines the semantic relevance of images.
We can also interpret visual pairs from the viewpoint of Bayesian learning. In particular, introducing visual pairs into our learning scheme is essentially equivalent to introducing a Bayesian prior for the target distance function. Note that 1) the same set of visual features is used to judge the semantic relevance and visual similarity and 2) visual pairs are generated by the euclidean distance. Hence, the introduction of visual pairs serves as a regularizer that keeps the learned distance function close to the euclidean distance. We emphasize the importance of regularization in distance metric learning, particularly when the number of pairwise constraints is limited. Since most distance functions involve a large number of parameters, overfitting is likely in the absence of appropriate regularization; the resulting distance functions are likely to fit the training data very well, yet will fail to correctly predict the distances between the examples in the testing set. This issue is examined further in our experiments below.
4 APPLICATIONS
This section presents evaluations of the proposed boosting framework for learning distance functions in the context of both medical and nonmedical image retrieval applications. We denote the basic algorithm by BDM and the algorithm augmented with automatically generated visual pairs by BDM+V. The first set of experiments employs our method in an ISADS application for breast cancer. The ISADS application allows a radiologist examining a suspicious mass in a mammogram to retrieve and study similar masses with outcomes before determining whether to recommend a biopsy. We first describe the image repository used by the application. We then empirically examine and evaluate different properties of the proposed algorithm, including the convergence of the proposed algorithm, the effect of visual pairs on the performance of image retrieval and classification, and the impact of training set size. Finally, we also evaluate the proposed algorithm using the medical image data set from ImageCLEF [60]. To demonstrate BDM+V's generalized efficacy on regular image data sets beyond the medical domain, we also present retrieval and classification results on the standard Corel data set.
4.1 Reference Library: UPMC Mammograms Data Set
We used an image reference library based on digitized mammograms created by the Imaging Research Center of the Department of Radiology at the University of Pittsburgh. The library consists of 2,522 mass regions of interest (ROI), including 1,800 pathology-verified malignant masses and 722 CAD-cued benign masses. Each mass ROI is represented by a vector of 38 morphological and intensity distribution-related features, within which nine features are computed from the whole breast area depicted in the digitized mammogram (global features) and the remaining features are computed from the segmented mass region and its surrounding background tissue (local features). The extracted visual features are further normalized by the mean and the standard deviation computed from the 2,522 selected mass regions in the image data set. A detailed description of the features, the normalization step, and the region segmentation is given in [61], [62]. Fig. 3 shows a significant overlap between the two classes in the space spanned by the first three principal eigenvectors computed by Principal Component Analysis (PCA). This result illustrates the difficulty of separating the classes using simple methods.
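A sketch of the normalization and projection used for this visualization (ours; the paper does not give code) follows:

```python
import numpy as np

def pca_project(X, n_components=3):
    # Normalize the 38 mass features by mean and standard deviation, then
    # project onto the leading principal eigenvectors (cf. Fig. 3).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Eigenvectors of the covariance matrix; eigh sorts eigenvalues ascending.
    w, V = np.linalg.eigh(np.cov(Z, rowvar=False))
    return Z @ V[:, -n_components:][:, ::-1]
```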
Fig. 3. Three-dimensional PCA representation of the malignant (red) class and benign (blue) class.
Fig. 2. Two images with the same semantic label (malignant masses in this example) can look very different visually. In an ISADS application, it is important for the system to retrieve examples that are both visually and semantically similar.
4.2 Experiment Setup
We randomly select 600 images from the reference library to serve as the training data set. Among them, 300 images depict malignant masses and 300 depict CAD-generated benign mass regions. The remaining 1,922 images are used for testing. Throughout these experiments, unless specified, the linear BDM (described in Fig. 1) is used for evaluation.
We evaluate the proposed algorithm in the context of ISADS using two metrics. The first metric, classification accuracy, indicates the extent to which malignant images can be detected based on the images that are retrieved by the system [18], [19]. We compute classification accuracy with the K Nearest Neighbor (KNN) classifier: Given a test example x, we first identify the K training examples that have the shortest distance to x, where distance is computed using the metric learned from training examples; we then compute the probability that x is malignant based on the percentage of its K nearest neighbors that belong to the malignant class. These probabilities for the test examples are used to generate the Receiver Operating Characteristic (ROC) curve by varying the threshold of the probability for predicting malignancy. Finally, the classification accuracy is assessed by the area under the ROC curve. As has been pointed out by many studies, the ROC curve is a better metric for evaluating classification accuracy than error rate, particularly when the populations of classes are skewed. Cross validation has indicated that the optimal number of nearest neighbors (i.e., K) in KNN is 10. Every experiment is repeated 10 times with randomly selected training images and the final result is computed as an average over these 10 runs. Both the mean and standard deviation of the area under the ROC curve are reported in the study.
The second metric, retrieval accuracy, reflects the proportion of retrieved images that are medically relevant (i.e., in the same semantic class) to the given query [16], [17]. Unlike classification accuracy, where only a single value is calculated, retrieval accuracy is computed as a function of the number of retrieved images and thus provides a more comprehensive picture of the performance of ISADS. We evaluate retrieval accuracy in a leave-one-out manner, i.e., using one medical image in the test data set as the query and the remaining images in the test data set as the gallery when we conduct the experiment of image retrieval. For a given test image, we rank the images in the gallery in ascending order of their distance to the query image. We define the retrieval accuracy for the ith test query image at rank position k, denoted by $r_i^{q}(k)$, as the percentage of the first k ranked images that share the same semantic class (i.e., benign or malignant) as the query image:

$$r_i^{q}(k) = \frac{\sum_{j=1}^{k} I(y_i = y_j)}{k}, \qquad (20)$$

where $j$ in the summation refers to the indices of the top $k$ ranked images. The overall retrieval accuracy at each rank position is an average over all images in the testing set.
4.3 Empirical Evaluation of the Proposed Algorithm (BDM+V)
In this section, we study the convergence of the proposed algorithm, the performance of the proposed algorithm for both image classification and retrieval, and, furthermore, the effect of visual pairs on image retrieval.
4.3.1 Convergence of the Objective Function
Fig. 4a shows the reduction of the objective function (5) and Fig. 4b shows the reduction of the error rate $\mathrm{err}(\mathcal{P})$ in (4), both as a function of the number of iterations. The number of iterations in Fig. 4 corresponds to the $T$ from (2) and Fig. 1. Recall that the error rate $\mathrm{err}(\mathcal{P})$ measures the number of errors when comparing the distance between a pair of similar examples to the distance between a pair of dissimilar examples. We also compare the change of the two in the same figure (see Fig. 4c). The iteration stops when the relative change in the objective function is smaller than a specified threshold ($10^{-5}$ in our study).
First, we clearly observe that the value of the objective function drops at a rapid rate, which confirms the theoretical analysis stated in Theorem 3.5. Second, we observe that the overall error rate is also reduced significantly, and indeed, is upper bounded by the objective function in (5), as discussed in Section 3, although the bound is rather loose.
4.3.2 Effect of Visual Pairs
We first evaluate how the visual pairs affect the retrieval accuracy of BDM. Fig. 5 summarizes the retrieval accuracy of BDM+V and BDM (i.e., with and without using the visual pairs). For the purpose of comparison, we also include the retrieval accuracy for the euclidean distance. The standard deviation in the retrieval accuracy is illustrated by the vertical bars. First, we observe that the retrieval accuracy of both variants of BDM exceeds that of the euclidean distance metric, indicating that BDM is effective in learning appropriate distance functions. Second, we observe that the incorporation of visual pairs improves retrieval accuracy. This improvement can be explained from the viewpoint of Bayesian statistics since the visual pairs can be viewed as a Bayesian prior, as discussed above. Hence, BDM with visual pairs can be interpreted as Maximum A Posteriori (MAP) estimation, while BDM without visual pairs can be interpreted as Maximum-Likelihood Estimation (MLE). It is well known that MAP-based approaches typically outperform MLE-based approaches. This is particularly true when the number of training examples is not large in comparison to the number of parameters, allowing the target classification model to overfit the training examples. By introducing a Bayesian prior, we are able to regularize the fitting of the target classification model to the given training examples, thus alleviating the problem of overfitting.
Fig. 4. Reduction of objective function and error rate over iterations (312 iterations in total). (a) Objective function. (b) Error rate err(P). (c) Objective function versus error rate err(P).
In the second experiment, we evaluate the effect of the visual pairs on classification accuracy. We compute the area under the ROC curve (AUR), which is a common metric for evaluating classification accuracy. Table 1 shows the AUR results for BDM+V and BDM (i.e., with and without visual pairs) and the euclidean distance metric. Similarly to the previous experiment, we observe that the areas under the ROC curves of the two variants of BDM are significantly larger than that of the euclidean distance, showing that BDM achieves better classification accuracy than the euclidean distance metric. Similarly to retrieval accuracy, we observe that the incorporation of visual pairs noticeably improves the classification accuracy.
The final experiment in this section is designed to study how different numbers of visual pairs affect the classification and retrieval performance. We vary the size of the neighborhood from 1, 5, 10, and 15 to 20 when generating visual pairs. The larger the neighborhood size, the more visual pairs are generated. Fig. 6 and Table 2 show the retrieval accuracy and the area under the ROC curve for BDM+V using different neighborhood sizes for generating visual pairs. We observe that the five different neighborhood sizes result in similar performance in both classification and retrieval. We thus conclude that BDM+V is overall insensitive to the number of visual pairs. Note that our study is limited to a modest range of visual pairs. The size of the euclidean near neighborhood should be controlled; otherwise, this approximation fails to capture visual similarity between images.
4.4 Comparison to State-of-the-Art Algorithms for Distance Metric Learning
We compare BDM+V to three state-of-the-art algorithms for learning distance functions and distance metrics: Linear Boost Distance (denoted as DistBoost) [50], the Large Margin Nearest Neighbor classifier (denoted as LMNN) [6], and Neighborhood Component Analysis (denoted as NCA) [30]. Euclidean distance is included as a comparative reference (denoted as euclidean).
Fig. 5. Comparison of retrieval accuracy. The learned metrics significantly outperform euclidean; adding visual pairs (BDM+V) consistently improves retrieval.
TABLE 1. Comparison of the Classification Accuracy. The learned metrics result in better classification and the addition of visual pairs (BDM+V) is significant.
TABLE 2. Classification Results for BDM+V Using Different Numbers of Near Neighbors for Visual Pair Generation. BDM+V is relatively insensitive to the number of visual pairs used.
Fig. 6. Retrieval accuracy curves for BDM+V using different numbers of near neighbors to generate visual pairs. Retrieval is relatively insensitive to the number of visual pairs used in BDM+V.
4.4.1 Results on UPMC Mammograms Data Set
Fig. 7 shows the retrieval accuracy curves for BDM+V and the three comparative algorithms for distance metric learning. First, we observe that all of the distance learning algorithms outperform the euclidean distance metric except for the DistBoost algorithm, which performs considerably worse than the euclidean distance metric. Second, BDM+V and NCA perform consistently better than the other algorithms across all the ranking positions. Table 3 shows the area under the ROC curve for BDM+V and the baseline methods. The proposed algorithm has the largest area under the ROC curve, followed by LMNN, euclidean, NCA, and finally, DistBoost. It is interesting to observe that although NCA achieves a better retrieval accuracy than the euclidean distance, its classification accuracy is considerably lower than that of the euclidean distance.
4.4.2 Results on the ImageCLEF Data Set
To assess the generality of the proposed algorithm, we further evaluate it on the medical image data set provided by the ImageCLEF conference [60]. This is a popular benchmark data set used to evaluate automated medical image categorization and retrieval. It consists of 15 medical image categories with a total of 2,785 images. All of the medical images in this experiment are X-ray images collected from plain radiography. Fig. 8 shows a few examples of medical images in our testbed. The category information can be found on the conference Web site.
Following the typical practice in ImageCLEF, we process each medical image using a bank of Gabor wavelet filters [63] to extract texture features. More specifically, each image is first scaled to the size of 128 × 128. Then, the Gabor wavelet transform is applied to the scaled image at five scale levels and eight orientations, which results in a total of 40 subimages. Every subimage is further normalized into 8 × 8 = 64 features, which results in a total of 64 × 40 = 2,560 visual features for each medical image. PCA is used to reduce the dimensionality from 2,560 to 200. We select a total of 1,100 images from 11 categories in ImageCLEF for our experiments. We randomly selected 40 percent of the images for the training data set and the remaining images serve as test queries.
The retrieval accuracy, defined in (20), is reported in Fig. 9. It is interesting to observe that NCA, which achieves high retrieval accuracy on the UPMC Mammograms Data Set, now performs significantly worse than the euclidean distance metric. On the other hand, DistBoost, which performed poorly on the UPMC data set, is one of the best algorithms here. This result indicates that some of the state-of-the-art distance metric learning algorithms are sensitive to the characteristics of the data sets and their performance is usually data-dependent. In contrast, BDM+V achieves good retrieval accuracy on both data sets, indicating the robustness of the proposed algorithm.
Fig. 8. Examples of medical images in the ImageCLEF testbed.

Fig. 9. Retrieval accuracy by different distance metric learning algorithms on the ImageCLEF data set.

Fig. 7. Retrieval accuracy of distance metric learning algorithms on the mammogram data set.

TABLE 3. Classification Accuracy on the Mammogram Data Set.
We also conduct the classification experiment using the ImageCLEF data set. Table 4 summarizes the area under the ROC curve for each of the 11 classes separately. As we observe, for most classes, BDM+V achieves a performance that is comparable to LMNN, the best among the five competitors.
4.5 Computational Cost
As discussed in Section 1, high computational cost is one of the major challenges in learning distance metrics. Many approaches aim to learn a full matrix and therefore become computationally expensive as the dimensionality grows. BDM+V reduces the computational cost by learning a binary representation in a boosting framework, from which a weighted Hamming distance is computed. Table 5 shows the running time of the proposed algorithm and the baseline methods for different dimensionality using the ImageCLEF data set. Note that the different numbers of dimensions are created by applying PCA to the images in the database and selecting the top eigenvectors for representing images.
First, the proposed algorithm is considerably faster than the three competitors when each image is represented by more than 200 features. Second, the time consumed by the proposed algorithm does not increase dramatically as the number of dimensions increases from 100 to 500; in contrast, for the three baseline algorithms, we observe a significant increase in the computational time as the dimension grows beyond 300. For instance, DistBoost is impressively fast (524.1 seconds) with 200 dimensions but falls behind BDM+V when the dimension increases to 300, and this gap widens in the case of 400 and 500 dimensions. NCA is the most computationally expensive among the four competitors, starting at 1,896.1 seconds for 100 dimensions and rising rapidly to 84,016.9 seconds for 500 dimensions. From these experiments, it is evident that, for all of the baseline methods, the efficiency issue becomes severe with higher dimensionality. In contrast, due to its efficient design, the computational time of the proposed method increases only linearly with respect to the dimensionality.
4.6 Regular Image Retrieval on the COREL Data Set
To demonstrate the efficacy of BDM+V for regular image retrieval, we test the proposed algorithm on the COREL data set. We randomly choose 10 categories from the COREL data set and randomly select 100 examples from each category, resulting in an image collection of 1,000 images. Each image is represented by 36 visual features that belong to three categories: color, edge, and texture. The details of the visual features used to represent the COREL data set can be found in [31].
The retrieval accuracy is reported in Fig. 10. Although the proposed algorithm BDM+V is outperformed overall by LMNN and DistBoost, we observe that BDM+V surpasses DistBoost at the first rank and outperforms LMNN after rank 14.

Table 6 reports the area under the ROC curve for each of the 10 classes separately. BDM+V performs comparably to LMNN, which achieves the best results across the 10 classes. The other three competitors, i.e., DistBoost, NCA, and euclidean, often perform significantly worse than LMNN and the proposed algorithm. Moreover, the standard deviation of BDM+V and LMNN is, in general, smaller than that of the three baselines, indicating the robustness of the proposed algorithm.
TABLE 5. Computation Time (Seconds) for the Proposed and Baseline Algorithms as the Number of Dimensions Varies from 100 to 500.

TABLE 4. Area under the ROC Curve on the ImageCLEF Data Set, Obtained by the Proposed and Baseline Algorithms.
5 CONCLUSION AND DISCUSSIONS
In this paper, we present a novel framework that learns a distance metric from side information. Unlike other distance metric learning algorithms that are designed to learn a full distance metric matrix, and therefore suffer from computational difficulty, the proposed algorithm first learns a binary representation for the data and then computes a weighted Hamming distance based on the learned representation. A boosting algorithm is presented to facilitate the learning of the binary representation and the weights that are used to form the Hamming distance. In addition to its computational efficiency, another advantage of the proposed algorithm is that it is able to preserve both semantic relevance and visual similarity. This is realized through the introduction of links that pair visually similar images. By training over the combination of visual pairs and pairwise constraints that are generated based on semantic relevance, the resulting distance metric is able to preserve both the visual similarity and the semantic relevance. In contrast, previous work on distance metric learning tends to focus only on the semantic relevance. We demonstrate the effectiveness of the proposed algorithm in the context of an ISADS system for breast cancer and on two standard image data sets (ImageCLEF and Corel).
APPENDIX A
PROOF OF LEMMA 3.2
To prove the inequality, we consider the following two cases:

- $f(x_i) = f(x_j)$: In this case, the inequality in (3) holds because the left side of the inequality is zero and the right side is guaranteed to be nonnegative.
- $f(x_i) \neq f(x_j)$: In this case, $f(x_k)$ will be equal to either $f(x_i)$ or $f(x_j)$ since $f(x)$ is a binary function. Hence, both sides of the inequality are equal to 4, and therefore, the inequality in (3) holds.
APPENDIX B
PROOF OF LEMMA 3.3
To prove the inequality in (11), we first bound $\exp(\alpha (f_i - f_k)^2 - \alpha (f_i - f_j)^2)$ by the following expression:

$$\exp\!\left(\alpha (f_i - f_k)^2 - \alpha (f_i - f_j)^2\right) \leq \frac{\exp\!\left(2\alpha (f_i - f_k)^2\right) + \exp\!\left(-2\alpha (f_i - f_j)^2\right)}{2}.$$

Since $f_i^2 = 1$ for any example $x_i$, we have

$$\frac{(f_i - f_j)^2}{4} \in \{0, 1\}.$$

Hence, $\exp(\pm 2\alpha (f_i - f_j)^2)$ can be upper bounded as follows:

$$\exp\!\left(\pm 2\alpha (f_i - f_j)^2\right) = \exp\!\left(\pm 8\alpha\, \frac{(f_i - f_j)^2}{4}\right) \leq \frac{(f_i - f_j)^2}{4}\exp(\pm 8\alpha) + 1 - \frac{(f_i - f_j)^2}{4} = \frac{(f_i - f_j)^2}{4}\left(\exp(\pm 8\alpha) - 1\right) + 1.$$
TABLE 6. Area under the ROC Curve on the Corel Data Set, Obtained by the Proposed and Baseline Algorithms.
Fig. 10. Retrieval accuracy on the Corel data set.
Using the above inequality, we have the objective function $F(\alpha, f)$ in (9) upper bounded as follows:

$$F(\alpha, f) = \sum_{i,j,k=1}^{n} \gamma_{i,j}\,\gamma_{i,k} \exp\left(\alpha (f_i - f_j)^2 - \alpha (f_i - f_k)^2\right) \le \sum_{i,j,k=1}^{n} \gamma_{i,j}\,\gamma_{i,k} + \frac{\exp(8\alpha) - 1}{8} \sum_{i,j=1}^{n} \left(\gamma_{i,j} \sum_{k=1}^{n} \gamma_{i,k}\right)(f_i - f_j)^2 + \frac{\exp(-8\alpha) - 1}{8} \sum_{i,k=1}^{n} \left(\gamma_{i,k} \sum_{j=1}^{n} \gamma_{i,j}\right)(f_i - f_k)^2 = \sum_{i,j,k=1}^{n} \gamma_{i,j}\,\gamma_{i,k} + \frac{\exp(8\alpha) - 1}{8} f^\top L^A f + \frac{\exp(-8\alpha) - 1}{8} f^\top L^B f,$$

where $L^A$ and $L^B$ denote the graph Laplacians of the weight matrices $S^A_{i,j} = \gamma_{i,j} \sum_{k} \gamma_{i,k}$ and $S^B_{i,k} = \gamma_{i,k} \sum_{j} \gamma_{i,j}$, respectively. The last step of the above derivation is based on the following equality:

$$f^\top L_S f = \sum_{i,j=1}^{n} S_{i,j} (f_i - f_j)^2.$$

Finally, noting that $\tilde{F}(P)$, i.e., the objective function of the previous iteration, is equal to $\sum_{i,j,k=1}^{n} \gamma_{i,j}\,\gamma_{i,k}$, we have $\Delta(\alpha, f) = F(\alpha, f) - \tilde{F}(P)$ upper bounded as follows:

$$\Delta(\alpha, f) \le \frac{\exp(8\alpha) - 1}{8} f^\top L^A f + \frac{\exp(-8\alpha) - 1}{8} f^\top L^B f.$$
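The two elementary bounds used in this proof, the arithmetic-geometric-mean step and the convexity step, can be sanity-checked numerically. The snippet below follows the reconstructed reading of the inequalities above, with u and v standing for (f_i - f_j)^2 and (f_i - f_k)^2, which take values in {0, 4}; it is a verification sketch, not part of the algorithm.

    import numpy as np

    # Spot-check of the Lemma 3.3 bounds for alpha >= 0 and u, v in {0, 4}:
    #   exp(a*u - a*v) <= (exp(2*a*u) + exp(-2*a*v)) / 2        (AM-GM)
    #   exp(2*a*u)     <= (u/4) * (exp(8*a) - 1) + 1            (convexity)
    rng = np.random.default_rng(1)
    for _ in range(10_000):
        a = rng.uniform(0.0, 2.0)
        u, v = rng.choice([0.0, 4.0], size=2)
        tol = 1.0 + 1e-12  # allow floating-point slack at equality
        assert np.exp(a*u - a*v) <= tol * (np.exp(2*a*u) + np.exp(-2*a*v)) / 2
        assert np.exp(2*a*u) <= tol * ((u/4) * (np.exp(8*a) - 1) + 1)
    print("Lemma 3.3 bounds verified on random samples")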
APPENDIX C
PROOF OF THEOREM 3.4
We first denote by $g(\alpha, f)$ the right-hand side of the inequality in (11), i.e.,

$$g(\alpha, f) = \frac{\exp(8\alpha) - 1}{8} f^\top L^A f + \frac{\exp(-8\alpha) - 1}{8} f^\top L^B f.$$

Note that $g(\alpha, f)$ is a convex function of the parameter $\alpha$. We then compute $\min_{\alpha \ge 0} g(\alpha, f)$ by setting the first-order derivative with respect to $\alpha$ to zero, i.e.,

$$\frac{\partial g(\alpha, f)}{\partial \alpha} = \exp(8\alpha)\, f^\top L^A f - \exp(-8\alpha)\, f^\top L^B f = 0.$$

We obtain the optimal $\alpha$ by solving the above equation, which is

$$\alpha = \max\left(0,\; \frac{1}{16}\log f^\top L^B f - \frac{1}{16}\log f^\top L^A f\right).$$

Substituting the above expression for $\alpha$, we have

$$\min_{\alpha \ge 0} g(\alpha, f) = -\frac{1}{8}\left[\max\left(0,\; \sqrt{f^\top L^B f} - \sqrt{f^\top L^A f}\right)\right]^2 = -\frac{\left[\max\left(0,\; f^\top (L^B - L^A) f\right)\right]^2}{8\left(\sqrt{f^\top L^B f} + \sqrt{f^\top L^A f}\right)^2} \le -\frac{\left[\max\left(0,\; f^\top (L^B - L^A) f\right)\right]^2}{8 n \left(\sqrt{\lambda_{\max}(L^B)} + \sqrt{\lambda_{\max}(L^A)}\right)^2},$$

where the last step uses $f^\top L f \le n\,\lambda_{\max}(L)$ for $f \in \{-1, +1\}^n$. Since $\Delta(\alpha, f) \le g(\alpha, f)$, we have the bound in (15).
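The closed-form step size can be checked against a direct numeric minimization of g. In the sketch below, a and b stand for f^T L^A f and f^T L^B f in the notation above; the toy values are arbitrary.

    import numpy as np

    # g(alpha) from Theorem 3.4, with a = f^T L^A f and b = f^T L^B f.
    def g(alpha, a, b):
        return (np.exp(8*alpha) - 1) / 8 * a + (np.exp(-8*alpha) - 1) / 8 * b

    def optimal_alpha(a, b):
        # alpha* = max(0, (1/16) log(b/a)); positive only when b > a.
        return max(0.0, np.log(b / a) / 16.0)

    a, b = 1.3, 4.7                     # toy values with b > a
    alpha_star = optimal_alpha(a, b)
    closed_form = -(np.sqrt(b) - np.sqrt(a)) ** 2 / 8   # predicted minimum
    grid = np.linspace(0.0, 1.0, 100_001)
    assert abs(g(alpha_star, a, b) - closed_form) < 1e-9
    assert g(alpha_star, a, b) <= g(grid, a, b).min() + 1e-8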
APPENDIX D
PROOF OF THEOREM 3.5
According to Theorem 3.4, we have

$$\frac{F_{t+1}(P)}{F_t(P)} \le 1 - \frac{\left[\max\left(0,\; f^\top (L_t^B - L_t^A) f\right)\right]^2}{8 n\, F_t(P) \left(\sqrt{\lambda_{\max}(L_t^B)} + \sqrt{\lambda_{\max}(L_t^A)}\right)^2}. \qquad (21)$$

Since we choose $f$ to maximize $f^\top (L_t^B - L_t^A) f$, we have

$$\max\left(0,\; \max_{f} f^\top (L_t^B - L_t^A) f\right) \ge \lambda_{\max}(L_t^B - L_t^A). \qquad (22)$$

The above derivation uses the following fact:

$$\lambda_{\max}(L_t^B - L_t^A) \ge \frac{1}{n}\,\mathbf{1}^\top (L_t^B - L_t^A)\,\mathbf{1} = 0.$$

We can further simplify the bound in (21) by noting that

$$\left(\sqrt{\lambda_{\max}(L_t^B)} + \sqrt{\lambda_{\max}(L_t^A)}\right)^2 \le 2\left(\lambda_{\max}(L_t^B) + \lambda_{\max}(L_t^A)\right). \qquad (23)$$

Finally, we can upper bound $F_t(P)$ as follows:

$$F_t(P) = \sum_{i,j,k=1}^{n} \gamma_{i,j}\,\gamma_{i,k} = \frac{1}{2}\,\mathbf{1}^\top (S_t^A + S_t^B)\,\mathbf{1} \le \frac{n}{2}\,\lambda_{\max}(S_t^A + S_t^B). \qquad (24)$$

Putting the inequalities in (22), (23), and (24) together, we have

$$\frac{F_{t+1}(P)}{F_t(P)} \le 1 - \frac{\lambda_{\max}(L_t^B - L_t^A)^2}{8 n^2\, \lambda_{\max}(S_t^A + S_t^B)\left(\lambda_{\max}(L_t^B) + \lambda_{\max}(L_t^A)\right)} = 1 - \delta_t.$$

Using the above inequality, we can bound $F_{T+1}(P)$ as follows:

$$F_{T+1}(P) = F_0(P) \prod_{t=0}^{T} \frac{F_{t+1}(P)}{F_t(P)} \le F_0(P) \prod_{t=0}^{T} (1 - \delta_t).$$
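Under the same reconstructed notation, the per-iteration contraction factor delta_t is computable directly from the two weight matrices. The sketch below builds the Laplacians, takes the largest eigenvalues with numpy, and evaluates delta_t; the weight matrices are random stand-ins, and the Laplacian normalization follows the common L = D - S convention, which may differ from the paper's by a constant factor.

    import numpy as np

    # Sketch: evaluating the contraction factor delta_t from the bound above.
    # S_A, S_B are (hypothetical) symmetric nonnegative pair-weight matrices.
    def sym(M):
        return (M + M.T) / 2

    def laplacian(S):
        # Common convention L = D - S, with D the diagonal degree matrix.
        return np.diag(S.sum(axis=1)) - S

    def delta_t(S_A, S_B):
        n = S_A.shape[0]
        L_A, L_B = laplacian(S_A), laplacian(S_B)
        lam = lambda M: np.linalg.eigvalsh(M).max()   # largest eigenvalue
        num = lam(L_B - L_A) ** 2
        den = 8 * n**2 * lam(S_A + S_B) * (lam(L_B) + lam(L_A))
        return num / den

    rng = np.random.default_rng(2)
    S_A = sym(rng.uniform(size=(6, 6)))
    S_B = sym(rng.uniform(size=(6, 6)))
    print("delta_t =", delta_t(S_A, S_B))  # bound: F_{t+1} <= (1 - delta_t) F_t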
ACKNOWLEDGMENTS
This work was supported by the US National Science Foundation (NSF) under grant IIS-0643494 and by the National Center for Research Resources (NCRR) under grant No. 1 UL1 RR024153. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF, NCRR, Intel, Michigan State University, or Carnegie Mellon University.
REFERENCES
[1] P. Croskerry, "The Theory and Practice of Clinical Decision-Making," Canadian J. Anesthesia, vol. 52, no. 6, pp. R1-R8, 2005.
[2] L. Yang, R. Jin, R. Sukthankar, B. Zheng, L. Mummert, M. Satyanarayanan, M. Chen, and D. Jukic, "Learning Distance Metrics for Interactive Search-Assisted Diagnosis of Mammograms," Proc. SPIE Conf. Medical Imaging, 2007.
[3] A.W.M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, "Content-Based Image Retrieval at the End of the Early Years," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 12, pp. 1349-1380, Dec. 2000.
[4] H. Kim, P. Howland, and H. Park, "Dimension Reduction in Text Classification with Support Vector Machines," J. Machine Learning Research, vol. 6, pp. 37-53, 2005.
[5] L. Vandenberghe and S. Boyd, "Semidefinite Programming," SIAM Rev., vol. 38, no. 1, pp. 49-95, 1996.
[6] K. Weinberger, J. Blitzer, and L. Saul, "Distance Metric Learning for Large Margin Nearest Neighbor Classification," Advances in Neural Information Processing Systems, MIT Press, http://www.seas.upenn.edu/kilianw/lmnn, 2006.
[7] D. Gur, J.H. Sumkin, L.A. Hardesty, and H.E. Rockette, "Computer-Aided Detection of Breast Cancer: Has Promise Outstripped Performance?" J. Nat'l Cancer Inst., vol. 96, pp. 717-718, 2004.
[8] R.M. Nishikawa and M. Kallergi, "Computer-Aided Detection in Its Present Form Is Not an Effective Aid for Screening Mammography," Medical Physics, vol. 33, pp. 811-814, 2006.
[9] T.M. Freer and M.J. Ulissey, "Screening Mammography with Computer-Aided Detection: Prospective Study of 12,860 Patients in a Community Breast Center," Radiology, vol. 220, pp. 781-786, 2001.
[10] L.A. Khoo, P. Taylor, and R.M. Given-Wilson, "Computer-Aided Detection in the United Kingdom National Breast Screening Programme: Prospective Study," Radiology, vol. 237, pp. 444-449, 2005.
[11] J.M. Ko, M.J. Nicholas, J.B. Mendel, and P.J. Slanetz, "Prospective Assessment of Computer-Aided Detection in Interpretation of Screening Mammograms," Am. J. Roentgenology, vol. 187, pp. 1483-1491, 2006.
[12] M.L. Giger, Z. Huo, C.J. Vyborny, L. Lam, I. Bonta, K. Horsch, R.M. Nishikawa, and I. Rosenbourgh, "Intelligent CAD Workstation for Breast Imaging Using Similarity to Known Lesions and Multiple Visual Prompt Aides," Proc. SPIE Conf. Medical Imaging '02: Image Processing, pp. 768-773, 2002.
[13] I. El-Naqa, Y. Yang, N.P. Galatsanos, R.M. Nishikawa, and M.N. Wernick, "A Similarity Learning Approach to Content-Based Image Retrieval: Application to Digital Mammography," IEEE Trans. Medical Imaging, vol. 23, no. 10, pp. 1233-1244, Oct. 2004.
[14] C. Wei, C. Li, and R. Wilson, "A General Framework for Content-Based Medical Image Retrieval with Its Application to Mammograms," Proc. SPIE Conf. Medical Imaging '05: PACS and Imaging Informatics, pp. 134-143, 2005.
[15] H. Alto, R.M. Rangayyan, and J.E. Desautels, "Content-Based Retrieval and Analysis of Mammographic Masses," J. Electronic Imaging, vol. 14, pp. 023016-1-023016-17, 2005.
[16] C. Muramatsu, Q. Li, K. Suzuki, R.A. Schmidt, J. Shiraishi, G.M. Newstead, and K. Doi, "Investigation of Psychophysical Measure for Evaluation of Similar Images for Mammographic Masses: Preliminary Results," Medical Physics, vol. 32, pp. 2295-2304, 2005.
[17] B. Zheng et al., "Interactive Computer Aided Diagnosis of Breast Masses: Computerized Selection of Visually Similar Image Sets from a Reference Library," Academic Radiology, vol. 14, no. 8, pp. 917-927, 2007.
[18] G.D. Tourassi, B. Harrawood, S. Singh, J.Y. Lo, and C.E. Floyd, "Evaluation of Information-Theoretic Similarity Measures for Content-Based Retrieval and Detection of Masses in Mammograms," Medical Physics, vol. 34, pp. 140-150, 2007.
[19] Y. Tao, S.B. Lo, M.T. Freedman, and J. Xuan, "A Preliminary Study of Content-Based Mammographic Masses Retrieval," Proc. SPIE Conf. Medical Imaging '07, 2007.
[20] R.M. Nishikawa, "Current Status and Future Directions of Computer-Aided Diagnosis in Mammography," Computerized Medical Imaging and Graphics, vol. 31, pp. 224-235, 2007.
[21] L. Yang and R. Jin, "Distance Metric Learning: A Comprehensive Survey," technical report, Michigan State Univ., 2006.
[22] E. Xing, A. Ng, M. Jordan, and S. Russell, "Distance Metric Learning with Application to Clustering with Side Information," Advances in Neural Information Processing Systems, MIT Press, 2003.
[23] J.T. Kwok and I.W. Tsang, "Learning with Idealized Kernels," Proc. Int'l Conf. Machine Learning, 2003.
[24] T. Hastie and R. Tibshirani, "Discriminant Adaptive Nearest Neighbor Classification," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 18, no. 6, pp. 607-616, June 1996.
[25] A.B. Hillel, T. Hertz, N. Shental, and D. Weinshall, "Learning Distance Functions Using Equivalence Relations," Proc. Int'l Conf. Machine Learning, 2003.
[26] S.C.H. Hoi, W. Liu, M.R. Lyu, and W.-Y. Ma, "Learning Distance Metrics with Contextual Constraints for Image Retrieval," Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2006.
[27] M. Sugiyama, "Local Fisher Discriminant Analysis for Supervised Dimensionality Reduction," Proc. Int'l Conf. Machine Learning, 2006.
[28] T.-K. Kim, S.-F. Wong, B. Stenger, J. Kittler, and R. Cipolla, "Incremental Linear Discriminant Analysis Using Sufficient Spanning Set Approximations," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2007.
[29] M. Schultz and T. Joachims, "Learning a Distance Metric from Relative Comparisons," Advances in Neural Information Processing Systems, MIT Press, 2004.
[30] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov, "Neighbourhood Components Analysis," Advances in Neural Information Processing Systems, MIT Press, 2005.
[31] L. Yang, R. Jin, R. Sukthankar, and Y. Liu, "An Efficient Algorithm for Local Distance Metric Learning," Proc. Nat'l Conf. Artificial Intelligence, 2006.
[32] A.B. Hillel and D. Weinshall, "Learning Distance Function by Coding Similarity," Proc. Int'l Conf. Machine Learning, 2007.
[33] A. Woznica, A. Kalousis, and M. Hilario, "Learning to Combine Distances for Complex Representations," Proc. Int'l Conf. Machine Learning, 2007.
[34] W. Zhang, X. Xue, Z. Sun, Y. Guo, and H. Lu, "Optimal Dimensionality of Metric Space for Classification," Proc. Int'l Conf. Machine Learning, 2007.
[35] H. Wang, H. Zha, and H. Qin, "Dirichlet Aggregation: Unsupervised Learning Towards an Optimal Metric for Proportional Data," Proc. Int'l Conf. Machine Learning, 2007.
[36] S. Zhou, B. Georgescu, D. Comaniciu, and J. Shao, "BoostMotion: Boosting a Discriminative Similarity Function for Motion Estimation," Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2006.
[37] B. Babenko, P. Dollar, and S. Belongie, "Task Specific Local Region Matching," Proc. Int'l Conf. Computer Vision, 2007.
[38] F. Li, J. Yang, and J. Wang, "A Transductive Framework of Distance Metric Learning by Spectral Dimensionality Reduction," Proc. Int'l Conf. Machine Learning, 2007.
[39] P. Dollar, V. Rabaud, and S. Belongie, "Non-Isometric Manifold Learning: Analysis and an Algorithm," Proc. Int'l Conf. Machine Learning, 2007.
[40] J.V. Davis, B. Kulis, P. Jain, S. Sra, and I.S. Dhillon, "Information-Theoretic Metric Learning," Proc. Int'l Conf. Machine Learning, 2007.
[41] J. Dillon, Y. Mao, G. Lebanon, and J. Zhang, "Statistical Translation, Heat Kernels, and Expected Distance," Proc. Conf. Uncertainty in Artificial Intelligence, 2007.
[42] L. Yang, R. Jin, and R. Sukthankar, "Bayesian Active Distance Metric Learning," Proc. Conf. Uncertainty in Artificial Intelligence, 2007.
[43] L. Torresani and K. Lee, "Large Margin Component Analysis," Advances in Neural Information Processing Systems, MIT Press, 2007.
[44] K.Q. Weinberger, F. Sha, Q. Zhu, and L.K. Saul, "Graph Laplacian Regularization for Large-Scale Semidefinite Programming," Advances in Neural Information Processing Systems, MIT Press, 2007.
[45] D. Zhou, J. Huang, and B. Scholkopf, "Learning with Hypergraphs: Clustering, Classification, and Embedding," Advances in Neural Information Processing Systems, MIT Press, 2007.
[46] Z. Zhang and J. Wang, "MLLE: Modified Locally Linear Embedding Using Multiple Weights," Advances in Neural Information Processing Systems, MIT Press, 2007.
[47] A. Frome, Y. Singer, and J. Malik, "Image Retrieval and Classification Using Local Distance Functions," Advances in Neural Information Processing Systems, MIT Press, 2007.
[48] O. Boiman and M. Irani, "Similarity by Composition," Advances in Neural Information Processing Systems, MIT Press, 2007.
[49] M. Belkin and P. Niyogi, "Convergence of Laplacian Eigenmaps," Advances in Neural Information Processing Systems, MIT Press, 2007.
[50] T. Hertz, A.B. Hillel, and D. Weinshall, "Boosting Margin Based Distance Functions for Clustering," Proc. Int'l Conf. Machine Learning, http://www.cs.huji.ac.il/daphna/code/DistBoost.zip, 2004.
[51] T. Hertz, A.B. Hillel, and D. Weinshall, "Learning a Kernel Function for Classification with Small Training Samples," Proc. Int'l Conf. Machine Learning, 2006.
[52] G. Shakhnarovich, "Learning Task-Specific Similarity," PhD thesis, Massachusetts Inst. of Technology, 2005.
[53] Y. Freund and R.E. Schapire, "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting," J. Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.
[54] S. Agarwal, J. Wills, L. Cayton, G. Lanckriet, D. Kriegman, and S. Belongie, "Generalized Non-Metric Multidimensional Scaling," Proc. Int'l Conf. Artificial Intelligence and Statistics, 2007.
[55] B. Moghaddham and M.-H. Yang, "Gender Classification with Support Vector Machines," Proc. Int'l Conf. Face and Gesture Recognition, 2000.
[56] Y. Ke, D. Hoiem, and R. Sukthankar, "Computer Vision for Music Identification," Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2005.
[57] J. Zhang and R. Yan, "On the Value of Pairwise Constraints in Classification and Consistency," Proc. Int'l Conf. Machine Learning, 2007.
[58] R.E. Schapire, "Theoretical Views of Boosting and Applications," Proc. Int'l Conf. Algorithmic Learning Theory, 1999.
[59] R. Salakhutdinov, S. Roweis, and Z. Ghahramani, "On the Convergence of Bound Optimization Algorithms," Proc. Conf. Uncertainty in Artificial Intelligence, 2003.
[60] ImageCLEF, http://ir.shef.ac.uk/imageclef/, 2009.
[61] B. Zheng, J.K. Leader, G. Abrams, B. Shindel, V. Catullo, W.F. Good, and D. Gur, "Computer-Aided Detection Schemes: The Effect of Limiting the Number of Cued Regions in Each Case," Am. J. Roentgenology, vol. 182, pp. 579-583, 2004.
[62] B. Zheng, A. Lu, L.A. Hardesty, J.H. Sumkin, C.M. Kakim, M.A. Ganott, and D. Gur, "A Method to Improve Visual Similarity of Breast Masses for an Interactive Computer-Aided Diagnosis Environment," Medical Physics, vol. 33, pp. 111-117, 2006.
[63] M. Lades, J.C. Vorbruggen, J. Buhmann, J. Lange, C. von der Malsburg, R. Wurtz, and W. Konen, "Distortion Invariant Object Recognition in the Dynamic Link Architecture," IEEE Trans. Computers, vol. 42, no. 3, pp. 300-311, Mar. 1993.
Liu Yang received the BS degree in electronics and information engineering from Huazhong University of Science and Technology, China. She is currently working toward the PhD degree in the Machine Learning Department of the School of Computer Science at Carnegie Mellon University. Her research interest is primarily in semi-supervised learning, distance metric learning, information retrieval, and object recognition. She was selected as the Machine Learning Department nominee from CMU for the IBM Fellowship. She is a student member of the IEEE.
Rong Jin received the BA degree in engineering from Tianjin University, the MS degree in physics from Beijing University, and the MS and PhD degrees in computer science from Carnegie Mellon University. He has been an associate professor in the Computer Science and Engineering Department at Michigan State University since 2008. His research is focused on statistical machine learning and its application to large-scale information management. He has published more than 80 conference and journal articles on the related topics. He received the US National Science Foundation (NSF) Career Award in 2006.
Lily Mummert received the PhD degree in computer science from Carnegie Mellon University in 1996. She is a research scientist at Intel Research Pittsburgh, working in the area of distributed systems. Before joining Intel in 2006, she was a research staff member at the IBM T.J. Watson Research Center, where she worked on problems in enterprise system management and contributed to several products. Her current research is focused on enabling interactive applications that process data from heterogeneous, potentially high-data-rate sensors such as video and audio. She is a member of the IEEE.
Rahul Sukthankar received the PhD degree in robotics from Carnegie Mellon University and the BSE degree in computer science from Princeton. He is a senior principal research scientist at Intel Research Pittsburgh and an adjunct research professor in the Robotics Institute at Carnegie Mellon. He was previously a senior researcher at HP/Compaq's Cambridge Research Lab and a research scientist at Just Research. His current research focuses on computer vision and machine learning, particularly in the areas of object recognition and information retrieval in medical imaging. He is a member of the IEEE.
Adam Goode received the bachelor's degree in computer science and psychology from Rensselaer Polytechnic Institute and the master's degree from the Human-Computer Interaction Institute at Carnegie Mellon University. He is a project scientist at Carnegie Mellon, working on Diamond, a system for interactive search. He has been working as a research staff member at Carnegie Mellon since 2001. He is a member of the IEEE.
Bin Zheng received the PhD degree in electrical engineering from the University of Delaware. Currently, he is a research associate professor in the Department of Radiology at the University of Pittsburgh. He is also the principal investigator on a number of biomedical imaging research projects funded by the US National Institutes of Health (NIH). His research projects and interests include computer-aided detection and diagnosis (CAD) of medical images, content-based image retrieval, machine learning, and receiver operating characteristic (ROC)-type observer performance studies and data analysis. He and his colleagues have published more than 60 refereed articles on developing and evaluating CAD schemes and systems for digitized mammograms, lung CT images, and digital microscopic pathology images.
Steven C.H. Hoi received the BS degree in computer science from Tsinghua University, Beijing, China, and the MS and PhD degrees in computer science and engineering from the Chinese University of Hong Kong. He is currently an assistant professor in the School of Computer Engineering of Nanyang Technological University, Singapore. His research interests include statistical machine learning, multimedia information retrieval, Web search, and data mining. He is a member of the IEEE.
Mahadev Satyanarayanan received the bachelor's and master's degrees from the Indian Institute of Technology, Madras, and the PhD degree in computer science from Carnegie Mellon. He is the Carnegie Group professor of computer science at Carnegie Mellon University. From May 2001 to May 2004, he served as the founding director of Intel Research Pittsburgh, one of four university-affiliated research labs established worldwide by Intel to create disruptive information technologies through its Open Collaborative Research model. He is a fellow of the ACM and the IEEE and was the founding editor-in-chief of IEEE Pervasive Computing.