A Boosting Framework for Visuality-Preserving Distance Metric Learning and Its Application to Medical Image Retrieval

Liu Yang, Student Member, IEEE, Rong Jin, Lily Mummert, Member, IEEE, Rahul Sukthankar, Member, IEEE, Adam Goode, Member, IEEE, Bin Zheng, Steven C.H. Hoi, Member, IEEE, and Mahadev Satyanarayanan, Fellow, IEEE
Abstract—Similarity measurement is a critical component in content-based image retrieval systems, and learning a good distance metric can significantly improve retrieval performance. However, despite extensive study, there are several major shortcomings with the existing approaches for distance metric learning that can significantly affect their application to medical image retrieval. In particular, similarity can mean very different things in image retrieval: resemblance in visual appearance (e.g., two images that look like one another) or similarity in semantic annotation (e.g., two images of tumors that look quite different yet are both malignant). Current approaches for distance metric learning typically address only one goal without consideration of the other. This is problematic for medical image retrieval where the goal is to assist doctors in decision making. In these applications, given a query image, the goal is to retrieve similar images from a reference library whose semantic annotations could provide the medical professional with greater insight into the possible interpretations of the query image. If the system were to retrieve images that did not look like the query, then users would be less likely to trust the system; on the other hand, retrieving images that appear superficially similar to the query but are semantically unrelated is undesirable because that could lead users toward an incorrect diagnosis. Hence, learning a distance metric that preserves both visual resemblance and semantic similarity is important. We emphasize that, although our study is focused on medical image retrieval, the problem addressed in this work is critical to many image retrieval systems. We present a boosting framework for distance metric learning that aims to preserve both visual and semantic similarities. The boosting framework first learns a binary representation using side information, in the form of labeled pairs, and then computes the distance as a weighted Hamming distance using the learned binary representation. A boosting algorithm is presented to efficiently learn the distance function. We evaluate the proposed algorithm on a mammographic image reference library with an Interactive Search-Assisted Decision Support (ISADS) system and on the medical image data set from ImageCLEF. Our results show that the boosting framework compares favorably to state-of-the-art approaches for distance metric learning in retrieval accuracy, with much lower computational cost. Additional evaluation with the COREL collection shows that our algorithm works well for regular image data sets.
Index Terms—Machine learning, image retrieval, distance metric learning, boosting.
1 INTRODUCTION
TODAY, medical diagnosis remains both art and science. Doctors draw upon both experience and intuition, using analysis and heuristics to render diagnoses [1]. When doctors augment personal expertise with research, the medical literature is typically indexed by disease rather than by relevance to the current case. The goal of interactive search-assisted decision support (ISADS) is to enable doctors to make better decisions about a given case by retrieving a selection of similar annotated cases from large medical image repositories.
A fundamental challenge in developing such systems is the identification of similar cases, not simply in terms of superficial image characteristics, but in a medically relevant sense. This involves two tasks: extracting a representative set of features and identifying an appropriate measure of similarity in the high-dimensional feature space. The former has been an active research area for several decades. The latter, largely ignored by the medical imaging community, is the focus of this paper.
In an ISADS system, each case maps to a point in a high-dimensional feature space, and similar cases to the current case (query) correspond to near neighbors in this space. The neighborhood of a point is defined by a distance metric, such as the euclidean distance. Our previous work showed that the choice of distance metric affects the accuracy of an ISADS system and that machine learning enables the construction of effective domain-specific distance metrics [2]. In a learned distance metric, data points with the same labels (e.g., malignant masses) are closer than data points with different labels (e.g., malignant versus benign).
Thus, the labels of the near neighbors of the query are likely to be informative.
1.1 Distance Metric Learning with Side Information
Research in distance metric learning is driven by the need to find meaningful low-dimensional manifolds that capture the intrinsic structure of high-dimensional data. Distance metric learning has been successfully applied to a variety of applications, such as content-based image retrieval [3] and text categorization [4].
Most distance metric learning techniques can be classified into two categories: unsupervised distance metric learning and supervised distance metric learning. The former aims to construct a low-dimensional manifold where geometric relationships between most data points are largely preserved. Supervised distance metric learning makes use of class-label information and identifies the dimensions that are most informative to the classes of examples. A brief overview of the related work is provided in Section 2.
Learning an effective distance metric with side information has recently attracted increasing interest. Typically, the side information is cast in the form of pairwise constraints between data elements, and the goal is to identify features that are maximally consistent with these constraints. In general, there are two types of pairwise constraints: 1) equivalence constraints specifying that the two given elements belong to the same class and 2) inequivalence constraints indicating that the given elements are from different classes. The optimal distance metric is learned by keeping the data elements of equivalence constraints close to each other while separating the data elements of inequivalence constraints apart. A number of approaches have been developed to learn distance metrics from the pairwise constraints. We refer to Section 2 for a comprehensive review.
One of the key challenges in learning a distance metric is its computational cost. This is because many approaches are designed to learn a full distance metric matrix whose size scales with the square of the data dimension. In addition to its large size, the requirement that the metric matrix be positive semidefinite further increases the computational cost [5]. Although several algorithms have been proposed to improve the computational efficiency (e.g., [6]), they still tend to be computationally prohibitive when the number of dimensions is large. To address the computational issue, we propose a boosting framework that can efficiently learn distance metrics for high-dimensional data.
1.2 Semantic Relevance and Visual Similarity
Most distance metric learning algorithms aim to construct distance functions that are consistent with the given pairwise constraints. Since these constraints are usually based on the semantic categories of the data, the learned distance metric essentially preserves only the semantic relevance among data points. Thus, a drawback with these approaches is that, when they are applied to image retrieval problems, images ranked at the top of a retrieval list may not be visually similar to the query image, due to the gap between semantic relevance and visual similarity. For instance, a doughnut and a tire have similar shapes, yet belong to different concept categories; a solar car looks almost nothing like a regular car, though functionally, they both belong to the same object category. Since most users of image retrieval applications seek images that are both semantically and visually close to the query image, we must learn distance functions that preserve both semantic relevance and visual resemblance. This issue is of particular importance in medical image retrieval. If the system were to retrieve images that did not look like the query, then doctors would be less likely to trust the system; on the other hand, retrieving images that appear superficially similar to the query but are semantically unrelated is undesirable because that could lead doctors toward an incorrect diagnosis.
We address the challenge by automatically generating links that pair images with high visual resemblance. These visual pairs, together with the provided side information, are used to train a distance function that preserves both visual similarity and semantic relevance between images. The trade-off between semantic relevance and visual similarity can be easily adjusted by the number of visual pairs. A detailed discussion of how these visual pairs are generated is given in Section 4.
The remainder of the paper is organized as follows: Section 2 reviews the work related to ISADS, distance metric learning, and boosting. Section 3 describes the boosting framework for distance metric learning. Section 4 presents the application of the proposed algorithm to retrieval of both medical images and regular images.
2 RELATED WORK
Over the last decade, the increasing availability of powerful computing platforms and high-capacity storage hardware has driven the creation of large, searchable image databases, such as digitized medical image reference libraries. These libraries have been used to train and validate computer-aided diagnosis (CAD) systems in a variety of medical domains, including breast cancer. However, the value of CAD in clinical practice is controversial, due to its black-box nature and lack of reasoning ability [7], [8], [9], [10], [11], despite significant recent progress [12], [13], [14], [15], [16], [17], [18], [19], [20] in both automated detection and characterization of breast masses. An alternative approach, espoused by efforts such as ISADS [2], eschews automated diagnosis in favor of providing medical professionals with additional context about the current case that could enable them to make a more informed decision. This is done by retrieving medically relevant cases from the reference library and displaying their outcomes. Earlier work [2] has demonstrated that learning domain-specific distance metrics significantly improves the quality of such searches.
In general, methods for distance metric learning fall into two categories: supervised and unsupervised learning. Since our work is most closely related to supervised distance metric learning, we omit the discussion of unsupervised distance metric learning and refer readers to a recent survey [21].
In supervised distance metric learning, most algorithms learn a distance metric from a set of equivalence constraints and inequivalence constraints between data objects. The optimal distance metric is found by keeping objects in equivalence constraints close and objects in inequivalence constraints well separated. Xing et al. [22] formulate distance metric learning as a constrained convex programming problem by minimizing the distance between the data points in the same classes under the constraint that the data points from different classes are well separated.
This algorithm is extended to the nonlinear case by the introduction of kernels [23]. Local Linear Discriminative Analysis [24] estimates a local distance metric using local linear discriminant analysis. Relevant Components Analysis (RCA) [25] learns a global linear transformation from the equivalence constraints. Discriminative Component Analysis (DCA) and Kernel DCA [26] improve RCA by exploring inequivalence constraints and capturing nonlinear transformations via contextual information. Local Fisher Discriminant Analysis (LFDA) [27] extends classical LDA to the case when the side information is in the form of pairwise constraints. Kim et al. [28] provide an efficient incremental learning method for LDA by adopting a sufficient spanning set approximation for each update step. Schultz and Joachims [29] extend the support vector machine to distance metric learning by encoding the pairwise constraints into a set of linear inequalities. Neighborhood Component Analysis (NCA) [30] learns a distance metric by extending the nearest neighbor classifier. The large margin nearest neighbor (LMNN) classifier [6] extends NCA through a maximum margin framework. Yang et al. [31] propose a Local Distance Metric (LDM) that addresses multimodal data distributions in distance metric learning by optimizing local compactness and local separability in a probabilistic framework. Finally, a number of recent studies [28], [32], [33], [34], [35], [38], [39], [40], [41], [42], [43], [44], [45], [46], [47], [48], [49] focus on examining and exploring the relationship among metric learning, dimensionality reduction, kernel learning, semi-supervised learning, and Bayesian learning.
Learning distance metrics by a boosting framework was first presented by Hertz et al. [50], [51]. In addition, in [36], [37], [52], different boosting strategies are presented to learn distance functions from labeled data. Although all of these algorithms employ a boosting strategy to learn a distance function, our algorithm differs from the existing work in that earlier algorithms for distance function learning closely follow AdaBoost [53] without considering the optimization of the specified objective functions. Some of the existing methods (e.g., [52]) do not have a well-specified objective function; therefore, the convergence of their algorithms and the optimality of the resulting distance function are unclear. In contrast, our algorithm is based on the optimization of the objective function specified in our study. Our contributions include a theoretical analysis of the convergence condition of our algorithm and the optimality of the resulting distance function. We believe that the theoretical analysis of the proposed algorithm is important and could be instrumental to the performance of our boosting framework.

We would also like to mention some recent developments in nonmetric distance learning, such as Generalized Nonmetric Multidimensional Scaling [54]. Although nonmetric distance learning appears to be more flexible than metric distance learning, we believe that metric distance, in general, is not only more intuitive but also more robust to data noise due to the constraints imposed by the triangle inequality.
3 A BOOSTING FRAMEWORK FOR DISTANCE METRIC LEARNING
In this section, we present a novel boosting framework, termed BDM (we follow the terminology from [2]), that automatically learns a distance function from a given set of pairwise constraints. The main idea is to iteratively generate a set of binary features from the side information. The learned binary features are used for data representation, and the distance is computed as a weighted Hamming distance based on the learned binary data representation.
3.1 Preliminaries
We denote by $\mathcal{D} = \{x_1, x_2, \ldots, x_n\}$ the collection of data points. Each $x \in \mathbb{R}^d$ is a vector of $d$ dimensions. We denote by $X = (x_1, x_2, \ldots, x_n)$ the data matrix containing the input features of both the labeled and the unlabeled examples. Following [22], we assume that the side information is available in the form of labeled pairs, i.e., whether or not two examples are in the same semantic category. For convenience of discussion, below we refer to examples in the same category as similar examples and examples in different categories as dissimilar examples. Let the set of labeled example pairs be denoted by

$$\mathcal{P} = \left\{(x_i, x_j, y_{i,j}) \mid x_i \in \mathcal{D},\, x_j \in \mathcal{D},\, y_{i,j} \in \{+1, 0, -1\}\right\},$$

where the class label $y_{i,j}$ is positive (i.e., $+1$) when $x_i$ and $x_j$ are similar, and negative (i.e., $-1$) when $x_i$ and $x_j$ are different. $y_{i,j}$ is set to zero when the example pair $(x_i, x_j)$ is unlabeled. Finally, we denote by $d(x_i, x_j)$ the distance function between $x_i$ and $x_j$. Our goal is to learn a distance function that is consistent with the labeled pairs in $\mathcal{P}$.

Remark 1. Note that standard labeled examples can always be converted into a set of labeled pairs by assigning two data points from the same category to the positive class and two data points from different categories to the negative class. Similar pairwise class labels are commonly employed in multiclass multimedia retrieval applications [55], [56]. It is important to emphasize that the reverse is typically difficult, i.e., it is usually difficult to infer the unique category labels of examples from the labeled pairs [57].
Remark 2. We label two images in the training set as similar if they either match in semantic category or appear visually related, as our goal is to simultaneously preserve both the semantic relevance and the visual similarity. For instance, two images could be defined to be similar only if they belonged to the same semantic category, or similarity could be defined based on the images' visual similarity according to human perception.
3.2 Definition of Distance Function
Before presenting the boosting algorithm, we need to define a distance function $d(x_i, x_j)$ that is nonnegative and satisfies the triangle inequality. A typical definition of the distance function used by several distance metric learning algorithms (e.g., [22], [31]) is

$$d(x_i, x_j) = \sqrt{(x_i - x_j)^\top A\, (x_i - x_j)}, \qquad (1)$$
where $A \in \mathbb{R}^{d \times d}$ is a positive semidefinite matrix that specifies the distance metric. One drawback with the definition in (1) arises from its high computational cost due to the size of $A$ and the constraint that the matrix $A$ has to be positive semidefinite. This is observed in our empirical study: when the dimensionality $d \geq 500$, we find that estimating $A$ in (1) is computationally very expensive.
In order to address the above problems, we present here a nonlinear distance function defined as follows:

$$d(x_i, x_j) = \sum_{t=1}^{T} \alpha_t \left(f_t(x_i) - f_t(x_j)\right)^2, \qquad (2)$$

where each $f(x): \mathbb{R}^d \to \{-1, +1\}$ is a binary classification function (note that we define the image of the binary $f$ to be $\{-1, +1\}$ instead of $\{0, 1\}$ for a more concise presentation below) and $\alpha_t > 0$, $t = 1, 2, \ldots, T$, are the combination weights. The key idea behind the above definition is to first generate a binary representation $(f_1(x), \ldots, f_T(x))$ by applying the classification functions $\{f_i(x)\}_{i=1}^{T}$ to $x$. Then, the distance between $x_i$ and $x_j$ is computed as a weighted Hamming distance between the binary representations of the two examples. Compared to (1), (2) is advantageous in that it allows for a nonlinear distance function. Furthermore, the iterative updates of the binary data representation, and consequently, the distance function, are the key to the efficient algorithm that is presented in the next section. We emphasize that although (2) appears to be similar to the distance function defined in [36], [37], it differs from the existing work in that each binary function takes into account all of the features. In contrast, each binary function in [36], [37] is limited to a single feature and therefore is significantly less general than the proposed algorithm.
The following theorem shows that the distance function defined in (2) is indeed a pseudometric, i.e., it satisfies all the conditions of a distance metric except for $d(x, y) = 0 \Leftrightarrow x = y$. More specifically, we have the following theorem:

Theorem 3.1. The distance function defined in (2) satisfies all the properties of a pseudometric, i.e., 1) $d(x_i, x_j) = d(x_j, x_i)$, 2) $d(x_i, x_j) \geq 0$, and 3) $d(x_i, x_j) \leq d(x_i, x_k) + d(x_k, x_j)$.

The first and second properties are easy to verify. To prove the third property, i.e., the triangle inequality, in Theorem 3.1, we need the following lemma:
Lemma 3.2. The following inequality

$$\left(f(x_i) - f(x_j)\right)^2 \leq \left(f(x_i) - f(x_k)\right)^2 + \left(f(x_k) - f(x_j)\right)^2 \qquad (3)$$

holds for any binary function $f: \mathbb{R}^d \to \{-1, +1\}$.

The proof of the above lemma can be found in Appendix A. It is straightforward to show the triangle inequality in Theorem 3.1 using Lemma 3.2 since $d(x_i, x_j)$ is a linear combination of the terms $(f_k(x_i) - f_k(x_j))^2$.

3.3 Objective Function
The first step toward learning a distance function is to define an appropriate objective function. The criterion employed by most distance metric learning algorithms is to identify a distance function $d(x_i, x_j)$ that gives a small value when $x_i$ and $x_j$ are similar and a large value when they are different. We can generalize this criterion by stating that, for any data point, its distance to a similar example should be significantly smaller than the distance to an example that is not similar. This generalized criterion is cast into the following objective function, i.e.,

$$\mathrm{err}(\mathcal{P}) = \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} I(y_{i,j} = -1)\, I(y_{i,k} = +1)\, I\!\left(d(x_i, x_k) > d(x_i, x_j)\right), \qquad (4)$$
where the indicator $I(x)$ outputs 1 when the Boolean variable $x$ is true and zero otherwise. In the above, we use $I(y_{i,j} = -1)$ to select the pairs of dissimilar examples and $I(y_{i,k} = +1)$ to select the pairs of similar examples. Every triple $(x_i, x_j, x_k)$ is counted as an error when the distance between the similar pair $(x_i, x_k)$ is larger than the distance between the dissimilar pair $(x_i, x_j)$. Hence, the objective function $\mathrm{err}(\mathcal{P})$ essentially measures the number of errors when comparing the distance between a pair of similar examples to the distance between a pair of dissimilar examples.

Although the classification error $\mathrm{err}(\mathcal{P})$ seems to be a natural choice for the objective function, it has two shortcomings when used to learn a distance function.

- It is well known in the study of machine learning that directly minimizing the training error tends to produce a model that overfits the training data.
- The objective function $\mathrm{err}(\mathcal{P})$ is a nonsmooth function due to the indicator $I(d(x_i, x_k) > d(x_i, x_j))$ and therefore is difficult to optimize.
To overcome the shortcomings of $\mathrm{err}(\mathcal{P})$, we propose the following objective function for distance metric learning:

$$F(\mathcal{P}) = \sum_{i,j,k=1}^{n} I(y_{i,j} = -1)\, I(y_{i,k} = +1)\, \exp\!\left(d(x_i, x_k) - d(x_i, x_j)\right). \qquad (5)$$
The key difference between $F(\mathcal{P})$ and $\mathrm{err}(\mathcal{P})$ is that $I(d(x_i, x_k) > d(x_i, x_j))$ is replaced with $\exp(d(x_i, x_k) - d(x_i, x_j))$. Since $\exp(d(x_i, x_k) - d(x_i, x_j)) \geq I(d(x_i, x_k) > d(x_i, x_j))$, by minimizing the objective function $F(\mathcal{P})$, we are able to effectively reduce the classification error $\mathrm{err}(\mathcal{P})$. The advantages of using $\exp(d(x_i, x_k) - d(x_i, x_j))$ versus $I(d(x_i, x_k) > d(x_i, x_j))$ are twofold.

- Since $\exp(d(x_i, x_k) - d(x_i, x_j))$ is a smooth function, the objective function $F(\mathcal{P})$ can, in general, be minimized effectively using standard optimization techniques.
- Similarly to AdaBoost [58], by minimizing the exponential loss function in $F(\mathcal{P})$, we are able to maximize the classification margin and therefore reduce the generalized classification error according to [53].
Despite the advantages stated above, we note that the number of terms in (5) is on the order of $O(n^3)$, potentially creating an expensive optimization problem. This observation motivates the development of a computationally efficient algorithm.
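The sketch below (our illustration; helper names are not from the paper) evaluates both (4) and (5) by brute force over all triples, which makes the $O(n^3)$ cost noted above explicit:

```python
import numpy as np

def objectives(D, y):
    # D[i, j]: pairwise distances; y[i, j] in {+1, 0, -1} labels the pair.
    # A triple (i, j, k) with y[i, j] = -1 (dissimilar) and y[i, k] = +1
    # (similar) is an error when d(x_i, x_k) > d(x_i, x_j).
    n = D.shape[0]
    err = F = 0.0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if y[i, j] == -1 and y[i, k] == +1:
                    err += float(D[i, k] > D[i, j])     # err(P) of (4)
                    F += np.exp(D[i, k] - D[i, j])      # F(P) of (5)
    return err, F
```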
3.4 Optimization Algorithm
Given the distance function in (2), our goal is to learn appropriate classifiers $\{f_t(x)\}_{t=1}^{T}$ and combination weights $\{\alpha_t\}_{t=1}^{T}$. In order to efficiently learn the parameters and functions, we follow the idea of boosting and adopt a greedy approach for optimization. More specifically, we start with a constant function for distance, i.e., $d_0(x_i, x_j) = 0$, and learn a distance function $d_1(x_i, x_j) = d_0(x_i, x_j) + \alpha_1 (f_1(x_i) - f_1(x_j))^2$. Using this distance function, the objective function in (5) becomes a function of $\alpha_1$ and $f_1(x)$, and can be optimized efficiently using bound optimization [59] as described later. In general, given a distance function $d_{t-1}(x_i, x_j)$ that is learned in iteration $t-1$, we learn $\alpha_t$ and $f_t(x)$ by using the following distance function $d_t(x_i, x_j)$:

$$d_t(x_i, x_j) = d_{t-1}(x_i, x_j) + \alpha_t \left(f_t(x_i) - f_t(x_j)\right)^2.$$

Using the above expression for the distance function, the objective function at iteration $t$, denoted by $F_t(\mathcal{P})$, in (5) becomes a function of $\alpha_t$ and $f_t(x)$, i.e.,

$$F_t(\mathcal{P}) = \sum_{i,j,k=1}^{n} I(y_{i,j} = -1)\, I(y_{i,k} = +1)\, \exp\!\left(d_{t-1}(x_i, x_k) - d_{t-1}(x_i, x_j)\right) \exp\!\left(\alpha_t\left[(f_t(x_i) - f_t(x_k))^2 - (f_t(x_i) - f_t(x_j))^2\right]\right).$$
To simplify our expression, we introduce the following notations:

$$d_{i,j} = d_{t-1}(x_i, x_j), \qquad (6)$$
$$f_i = f_t(x_i), \qquad (7)$$
$$\gamma_{i,j}^{\pm} = I(y_{i,j} = \pm 1)\, \exp(\pm d_{i,j}). \qquad (8)$$

Using the above notations, $F_t(\mathcal{P})$ is expressed as follows:

$$F_t(\mathcal{P}) = \sum_{i,j,k=1}^{n} \gamma_{i,j}^{-}\, \gamma_{i,k}^{+}\, \exp\!\left(\alpha_t (f_i - f_k)^2 - \alpha_t (f_i - f_j)^2\right). \qquad (9)$$
Hence, the key question is how to find the classifier $f(x)$ and the weight $\alpha$ that minimize the objective function in (9). For convenience of discussion, we drop the index $t$ for $\alpha_t$ and $f_t(x)$, i.e., $\alpha_t \to \alpha$ and $f_t(x) \to f(x)$. Now, we apply the bound optimization algorithm [59] to optimize $F_t(\mathcal{P})$ with respect to $\alpha$ and $f(x)$. The main idea is to approximate the difference between the objective functions of the current iteration and the previous iteration by a convex upper bound that has a closed-form solution. As shown in [59], the bound optimization is guaranteed to find a locally optimal solution.
Like most bound optimization algorithms, instead of minimizing $F_t(\mathcal{P})$ in (9), we will minimize the difference between the objective functions from two consecutive iterations, i.e.,

$$\Delta(\alpha, f) = F_t(\mathcal{P}) - F_{t-1}(\mathcal{P}), \qquad (10)$$

where $f = (f_1, \ldots, f_n)$ and $F_{t-1}(\mathcal{P}) = \sum_{i,j,k=1}^{n} \gamma_{i,j}^{-}\, \gamma_{i,k}^{+}$ is the objective function of the first $t-1$ iterations. Note that $\Delta(\alpha, f) = 0$ when $\alpha = 0$. This condition guarantees that when we minimize $\Delta(\alpha, f)$, the resulting $F_t(\mathcal{P})$ is smaller than $F_{t-1}(\mathcal{P})$, and therefore, the objective function will monotonically decrease through iterations. In addition, as shown in [59], minimizing the bound is guaranteed to find a locally optimal solution.
First, in the following lemma, we construct an upper bound for $\Delta(\alpha, f)$ that decouples the interaction between $\alpha$ and $f$. Before stating the result, we introduce the concept of a graph Laplacian for readers who may not be familiar with the term. A graph Laplacian for a similarity matrix $S$, denoted by $L(S)$, is defined as $L = \mathrm{diag}(S\mathbf{1}) - S$, where $\mathbf{1}$ is an all-one vector and the operator $\mathrm{diag}(v)$ turns vector $v$ into a diagonal matrix.
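The graph Laplacian is easy to compute directly; a minimal sketch (ours) follows:

```python
import numpy as np

def graph_laplacian(S):
    # L(S) = diag(S 1) - S for a symmetric similarity matrix S.
    return np.diag(S.sum(axis=1)) - S

S = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])          # a single link between points 0 and 1
f = np.array([1.0, -1.0, 1.0])
# f^T L f grows with the S-weighted disagreement (f_i - f_j)^2 across links,
# so it is large when linked points receive opposite signs.
print(f @ graph_laplacian(S) @ f)        # 4.0
```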
Lemma 3.3. For any $\alpha > 0$ and binary vector $f \in \{-1, +1\}^n$, the following inequality holds:

$$\Delta(\alpha, f) \leq \frac{\exp(8\alpha) - 1}{8}\, f^\top L^{+} f + \frac{\exp(-8\alpha) - 1}{8}\, f^\top L^{-} f, \qquad (11)$$

where $L^{+}$ and $L^{-}$ are the graph Laplacians for the similarity matrices $S^{+}$ and $S^{-}$, respectively, defined as

$$S_{i,j}^{+} = \frac{1}{2}\, \gamma_{i,j}^{+}\left(\gamma_i^{-} + \gamma_j^{-}\right), \qquad S_{i,j}^{-} = \frac{1}{2}\, \gamma_{i,j}^{-}\left(\gamma_i^{+} + \gamma_j^{+}\right), \qquad (12)$$

where $\gamma_i^{\pm}$ is defined as

$$\gamma_i^{\pm} = \sum_{j=1}^{n} \gamma_{i,j}^{\pm}. \qquad (13)$$

Recall that $\gamma_{i,j}^{\pm}$ is defined as $\gamma_{i,j}^{\pm} = I(y_{i,j} = \pm 1)\exp(\pm d_{i,j})$ in (8).
The detailed proof of this lemma is given in Appendix B.
Remark. Since $\gamma_{i,j}^{+} \propto I(y_{i,j} = +1)$ by (8), the similarity matrix $S^{+}$ depends only on the data points from the must-link pairs (equivalence constraints). Hence, $f^\top L^{+} f$ in (11) essentially measures the inconsistency between the binary vector $f$ and the must-link constraints. Similarly, $f^\top L^{-} f$ in (11) measures the inconsistency between $f$ and the cannot-link pairs (inequivalence constraints). Hence, the upper bound in (11) essentially computes the overall inconsistency between the labeled pairs and the binary vector $f$.
Next, using Lemma 3.3, we derive additional bounds for $\Delta(\alpha, f)$ by removing $\alpha$. This result is summarized in the following theorem.
Theorem 3.4. For any binary vector $f \in \{-1, +1\}^n$, the following inequality holds:

$$\min_{\alpha \geq 0} \Delta(\alpha, f) \leq -\frac{1}{8}\left[\max\!\left(0,\, \sqrt{f^\top L^{-} f} - \sqrt{f^\top L^{+} f}\right)\right]^2 \qquad (14)$$

$$\leq -\frac{\left[\max\!\left(0,\, f^\top L^{-} f - f^\top L^{+} f\right)\right]^2}{8n\left(\sqrt{\lambda_{\max}(L^{-})} + \sqrt{\lambda_{\max}(L^{+})}\right)^2}, \qquad (15)$$

where $\lambda_{\max}(S)$ is the maximum eigenvalue of matrix $S$.
The proof of this theorem can be found in Appendix C. In the following discussion, we will focus on minimizing the upper bound of the objective function stated in Theorem 3.4, which allows us to reduce the computational cost dramatically.
In order to search for the optimal binary solution $f$ that minimizes the upper bound of $\Delta(\alpha, f)$, we decide to first search for a continuous solution for $f$ and then convert the continuous $f$ into a binary one by comparing to a threshold $b$. In particular, we divide the optimization procedure into two steps:

- searching for the continuous $f$ that minimizes the upper bound in (15) and
- searching for the threshold $b$ that minimizes the upper bound in (14) for a continuous solution $f$.
To distinguish it from the continuous solution $f$, we denote the binary solution by $\hat{f}$. It is important to note that the two steps use different upper bounds: the looser upper bound in (15) allows for efficient computation of the continuous solution $f$, while the tighter upper bound in (11) allows for a more accurate estimation of the threshold $b$.

Finally, the optimization problems related to the two steps are summarized as follows, respectively:

$$\max_{f \in \mathbb{R}^n} \; f^\top \left(L^{-} - L^{+}\right) f, \qquad (16)$$

and

$$\max_{b \in \mathbb{R}} \; \sqrt{\hat{f}^\top L^{-} \hat{f}} - \sqrt{\hat{f}^\top L^{+} \hat{f}} \quad \text{s.t.} \quad \hat{f}_i = \begin{cases} +1, & f_i > b, \\ -1, & f_i \leq b. \end{cases} \qquad (17)$$
It is clear that the optimal solution to (16) is the maximum eigenvector of the matrix $L^{-} - L^{+}$, and therefore can be computed very efficiently. To find the $b$ that optimizes the problem in (17), it is sufficient to consider $f_1, f_2, \ldots, f_n$, in turn, as the candidate solutions.

Given the optimal $f = (f_1, \ldots, f_n)$, the next question is how to learn a classification function $f(x)$ to approximate $f$. Here, we consider two cases: the linear classifier and the nonlinear classifier. In the first case, we assume that the classification function $f(x)$ is based on a linear transformation of $x$, i.e., $f(x) = u^\top x$, where $u$ is a projection vector that needs to be determined. Then, the optimization problem in (16) is converted into the following problem:

$$\max_{u^\top u = 1} \; u^\top X \left(L^{-} - L^{+}\right) X^\top u. \qquad (18)$$
It is not difficult to see that the optimal projection $u$ that maximizes (18) is the maximum eigenvector of $X(L^{-} - L^{+})X^\top$. In the second case, we exploit the kernel trick. Specifically, we introduce a nonlinear kernel function $k(x, x')$ and assume the classification function $f(x)$ has the form

$$f(x) = \sum_{i=1}^{n} k(x_i, x)\, u_i.$$

Similarly to the linear case, we calculate the optimal projection $u = (u_1, \ldots, u_n)$ by computing the maximum eigenvector of $K(L^{-} - L^{+})K^\top$, where $K$ is a nonlinear kernel similarity matrix with $K_{i,j} = k(x_i, x_j)$. Fig. 1 summarizes the proposed boosted distance metric learning algorithm for both the linear and the nonlinear cases.
To further ensure that our algorithm is effective in
reducing the objective function despite being designed to
minimize the upper bound of the objective function, we
present the following theorem:
Theorem 3.5. Let $S_t^{+}, S_t^{-}$, $t = 1, \ldots, T$, be the similarity matrices that are computed by running the boosting algorithm (in Fig. 1) using (12). Let $L_t^{+}$ and $L_t^{-}$ be the corresponding graph Laplacians. Then, the objective function at the $T+1$ iteration, i.e., $F_{T+1}(\mathcal{P})$, is bounded as follows:

$$F_{T+1}(\mathcal{P}) \leq F_0(\mathcal{P}) \prod_{t=0}^{T} (1 - \delta_t), \qquad (19)$$

where

$$F_0 = \sum_{i,j,k=1}^{n} I(y_{i,j} = -1)\, I(y_{i,k} = +1),$$

$$\delta_t = \frac{\left[\lambda_{\max}\!\left(L_t^{-} - L_t^{+}\right)\right]^2}{8\,\lambda_{\max}\!\left(S_t^{-} + S_t^{+}\right)\left(\lambda_{\max}(L_t^{-}) + \lambda_{\max}(L_t^{+})\right)}.$$
The proof of this theorem can be found in Appendix D. Note that $\delta_t$ is bounded between 0 and $1/8$. As revealed in the above theorem, although we only aim to minimize the upper bound of the objective function, that upper bound decreases by a factor of $1 - \delta_t$ in each iteration, and therefore the objective function will, in general, decrease rapidly. This claim is supported by our experimental results below.
3.5 Preserving Visual Similarity
As pointed out in Section 1, most distance metric learning algorithms learn a distance metric that only preserves the semantic similarity without taking into account the visual resemblance between images. Fig. 2 shows a pair of images whose distance is very small according to a distance metric learned from the labeled examples. Note that, although both images are malignant according to the medical annotation, their appearances are rather different. By retrieving images that are only medically relevant, the system is poorly suited for assisting doctors in providing the necessary context for informed decision making.
Fig. 1. Distance metric learning algorithm in a boosting framework.
To address this problem, we introduce additional pairwise constraints to reflect the requirement of visual similarity. These additional pairwise constraints, referred to as visual pairs, are combined with the equivalence and inequivalence constraints to train a distance metric using the boosting algorithm described above. Ideally, the visual pairs would be specified manually by domain experts. However, in the absence of such labels, we represent an image by a vector of visual features and approximate the visual pairs by the pairs of images that fall within a small euclidean distance in the space of visual features. By incorporating the visual pairs as a part of the pairwise constraints, the resulting distance function will reflect not only the semantic relevance among images, but also the visual similarity between images. Furthermore, the trade-off between visual and semantic similarity in learning a distance function can be adjusted by varying the number of visual pairs. As shown in our experiments, employing a large number of visual pairs biases the learned metric toward preserving visual similarity. Finally, we note that the same set of low-level image features is used to assess the medical relevance of images and to generate visual pairs. The key difference is that, in generating visual pairs, every feature is treated with equal importance; in contrast, the semantic relevance between two images is judged by a weighted distance, and therefore, only a subset or combination of image features determines the semantic relevance of images.
We can also interpret visual pairs from the viewpoint of Bayesian learning. In particular, introducing visual pairs into our learning scheme is essentially equivalent to introducing a Bayesian prior for the target distance function. Note that 1) the same set of visual features is used to judge the semantic relevance and visual similarity and 2) visual pairs are generated by the euclidean distance. Hence, the introduction of visual pairs serves as a regularizer that keeps the learned distance function close to the euclidean distance. We emphasize the importance of regularization in distance metric learning, particularly when the number of pairwise constraints is limited. Since most distance functions involve a large number of parameters, overfitting is likely in the absence of appropriate regularization; the resulting distance functions are likely to fit the training data very well, yet will fail to correctly predict the distances between the examples in the testing set. This issue is examined further in our experiments below.
4 APPLICATIONS
This section presents evaluations of the proposed boosting framework for learning distance functions in the context of both medical and nonmedical image retrieval applications. We denote the basic algorithm by BDM and the algorithm augmented with automatically generated visual pairs by BDM+V. The first set of experiments employs our method in an ISADS application for breast cancer. The ISADS application allows a radiologist examining a suspicious mass in a mammogram to retrieve and study similar masses with outcomes before determining whether to recommend a biopsy. We first describe the image repository used by the application. We then empirically examine and evaluate different properties of the proposed algorithm, including the convergence of the proposed algorithm, the effect of visual pairs on the performance of image retrieval and classification, and the impact of training set size. Finally, we also evaluate the proposed algorithm using the medical image data set from ImageCLEF [60]. To demonstrate BDM+V's generalized efficacy on regular image data sets beyond the medical domain, we also present retrieval and classification results on the standard Corel data set.
4.1 Reference Library: UPMC Mammograms Data Set
We used an image reference library based on digitized mammograms created by the Imaging Research Center of the Department of Radiology at the University of Pittsburgh. The library consists of 2,522 mass regions of interest (ROI), including 1,800 pathology-verified malignant masses and 722 CAD-cued benign masses. Each mass ROI is represented by a vector of 38 morphological and intensity distribution-related features, within which nine features are computed from the whole breast area depicted in the digitized mammogram (global features) and the remaining features are computed from the segmented mass region and its surrounding background tissue (local features). The extracted visual features are further normalized by the mean and the standard deviation computed from the 2,522 selected mass regions in the image data set. A detailed description of the features, the normalization step, and the region segmentation is given in [61], [62]. Fig. 3 shows a significant overlap between the two classes in the space spanned by the first three principal eigenvectors computed by Principal Component Analysis (PCA). This result illustrates the difficulty of separating the classes using simple methods.
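A sketch of the normalization and projection used for this visualization (ours; the paper does not give code) follows:

```python
import numpy as np

def pca_project(X, n_components=3):
    # Normalize the 38 mass features by mean and standard deviation, then
    # project onto the leading principal eigenvectors (cf. Fig. 3).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Eigenvectors of the covariance matrix; eigh sorts eigenvalues ascending.
    w, V = np.linalg.eigh(np.cov(Z, rowvar=False))
    return Z @ V[:, -n_components:][:, ::-1]
```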
Fig. 3. Three-dimensional PCA representation of the malignant (red) class and benign (blue) class.
Fig. 2. Two images with the same semantic label (malignant masses in this example) can look very different visually. In an ISADS application, it is important for the system to retrieve examples that are both visually and semantically similar.
4.2 Experiment Setup
We randomly select 600 images from the reference library to serve as the training data set. Among them, 300 images depict malignant masses and 300 depict CAD-generated benign mass regions. The remaining 1,922 images are used for testing. Throughout these experiments, unless specified, the linear BDM (described in Fig. 1) is used for evaluation.
We evaluate the proposed algorithm in the context of ISADS using two metrics. The first metric, classification accuracy, indicates the extent to which malignant images can be detected based on the images that are retrieved by the system [18], [19]. We compute classification accuracy with the K Nearest Neighbor (KNN) classifier: Given a test example x, we first identify the K training examples that have the shortest distance to x, where distance is computed using the metric learned from training examples; we then compute the probability that x is malignant based on the percentage of its K nearest neighbors that belong to the malignant class. These probabilities for the test examples are used to generate the Receiver Operating Characteristic (ROC) curve by varying the threshold of the probability for predicting malignancy. Finally, the classification accuracy is assessed by the area under the ROC curve. As has been pointed out by many studies, the ROC curve is a better metric for evaluating classification accuracy than error rate, particularly when the populations of classes are skewed. Cross validation has indicated that the optimal number of nearest neighbors (i.e., K) in KNN is 10. Every experiment is repeated 10 times with randomly selected training images and the final result is computed as an average over these 10 runs. Both the mean and standard deviation of the area under the ROC curve are reported in the study.
The second metric, retrieval accuracy, reflects the proportion of retrieved images that are medically relevant (i.e., in the same semantic class) to the given query [16], [17]. Unlike classification accuracy, where only a single value is calculated, retrieval accuracy is computed as a function of the number of retrieved images and thus provides a more comprehensive picture of the performance of ISADS. We evaluate retrieval accuracy in a leave-one-out manner, i.e., using one medical image in the test data set as the query and the remaining images in the test data set as the gallery when we conduct the experiment of image retrieval. For a given test image, we rank the images in the gallery in ascending order of their distance to the query image. We define the retrieval accuracy for the ith test query image at rank position k, denoted by $r_i^{q}(k)$, as the percentage of the first k ranked images that share the same semantic class (i.e., benign or malignant) as the query image:

$$r_i^{q}(k) = \frac{\sum_{j=1}^{k} I(y_i = y_j)}{k}, \qquad (20)$$

where $j$ in the summation refers to the indices of the top $k$ ranked images. The overall retrieval accuracy at each rank position is an average over all images in the testing set.
4.3 Empirical Evaluation of the Proposed Algorithm (BDM+V)
In this section, we study the convergence of the proposed algorithm, the performance of the proposed algorithm for both image classification and retrieval, and, furthermore, the effect of visual pairs on image retrieval.
4.3.1 Convergence of the Objective Function
Fig. 4a shows the reduction of the objective function (5) and Fig. 4b shows the reduction of the error rate $\mathrm{err}(\mathcal{P})$ in (4), both as a function of the number of iterations. The number of iterations in Fig. 4 corresponds to the $T$ from (2) and Fig. 1. Recall that the error rate $\mathrm{err}(\mathcal{P})$ measures the number of errors when comparing the distance between a pair of similar examples to the distance between a pair of dissimilar examples. We also compare the change of the two in the same figure (see Fig. 4c). The iteration stops when the relative change in the objective function is smaller than a specified threshold ($10^{-5}$ in our study).
First, we clearly observe that the value of the objective function drops at a rapid rate, which confirms the theoretical analysis stated in Theorem 3.5. Second, we observe that the overall error rate is also reduced significantly, and indeed, is upper bounded by the objective function in (5), as discussed in Section 3, although the bound is rather loose.
4.3.2 Effect of Visual Pairs
We first evaluate how the visual pairs affect the retrieval accuracy of BDM. Fig. 5 summarizes the retrieval accuracy of BDM+V and BDM (i.e., with and without using the visual pairs). For the purpose of comparison, we also include the retrieval accuracy for the euclidean distance. The standard deviation in the retrieval accuracy is illustrated by the vertical bars. First, we observe that the retrieval accuracy of both variants of BDM exceeds that of the euclidean distance metric, indicating that BDM is effective in learning appropriate distance functions. Second, we observe that the incorporation of visual pairs improves retrieval accuracy. This improvement can be explained from the viewpoint of Bayesian statistics since the visual pairs can be viewed as a Bayesian prior, as discussed above. Hence, BDM with visual pairs can be interpreted as Maximum A Posteriori (MAP) estimation, while BDM without visual pairs can be interpreted as Maximum-Likelihood Estimation (MLE). It is well known that MAP-based approaches typically outperform MLE-based approaches. This is particularly true when the number of training examples is not large in comparison to the number of parameters, allowing the target classification model to overfit the training examples. By introducing a Bayesian prior, we are able to regularize the fitting of the target classification model to the given training examples, thus alleviating the problem of overfitting.
Fig. 4. Reduction of objective function and error rate over iterations (312 iterations in total). (a) Objective function. (b) Error rate err(P). (c) Objective function versus error rate err(P).
In the second experiment, we evaluate the effect of the visual pairs on classification accuracy. We compute the area under the ROC curve (AUR), which is a common metric for evaluating classification accuracy. Table 1 shows the AUR results for BDM+V and BDM (i.e., with and without visual pairs) and the euclidean distance metric. Similarly to the previous experiment, we observe that the areas under the ROC curves of the two variants of BDM are significantly larger than that of the euclidean distance, showing that BDM achieves better classification accuracy than the euclidean distance metric. Similarly to retrieval accuracy, we observe that the incorporation of visual pairs noticeably improves the classification accuracy.
The final experiment in this section is designed to study how different numbers of visual pairs affect the classification and retrieval performance. We vary the size of the neighborhood from 1, 5, 10, and 15 to 20 when generating visual pairs. The larger the neighborhood size, the more visual pairs are generated. Fig. 6 and Table 2 show the retrieval accuracy and the area under the ROC curve for BDM+V using different neighborhood sizes for generating visual pairs. We observe that the five different neighborhood sizes result in similar performance in both classification and retrieval. We thus conclude that BDM+V is overall insensitive to the number of visual pairs. Note that our study is limited to a modest range of visual pairs. The size of the euclidean near neighborhood should be controlled; otherwise, this approximation fails to capture visual similarity between images.
4.4 Comparison to State-of-the-Art Algorithms for Distance Metric Learning
We compare BDM+V to three state-of-the-art algorithms for learning distance functions and distance metrics: Linear Boost Distance (denoted as DistBoost) [50], the Large Margin Nearest Neighbor classifier (denoted as LMNN) [6], and Neighborhood Component Analysis (denoted as NCA) [30]. Euclidean distance is included as a comparative reference (denoted as euclidean).
Fig. 5. Comparison of retrieval accuracy. The learned metrics significantly outperform euclidean; adding visual pairs (BDM+V) consistently improves retrieval.
TABLE 1. Comparison of the Classification Accuracy. The learned metrics result in better classification and the addition of visual pairs (BDM+V) is significant.
TABLE 2. Classification Results for BDM+V Using Different Numbers of Near Neighbors for Visual Pair Generation. BDM+V is relatively insensitive to the number of visual pairs used.
Fig. 6. Retrieval accuracy curves for BDM+V using different numbers of near neighbors to generate visual pairs. Retrieval is relatively insensitive to the number of visual pairs used in BDM+V.
4.4.1 Results on UPMC Mammograms Data Set
Fig. 7 shows the retrieval accuracy curves for BDM+V and the three comparative algorithms for distance metric learning. First, we observe that all of the distance learning algorithms outperform the euclidean distance metric except for the DistBoost algorithm, which performs considerably worse than the euclidean distance metric. Second, BDM+V and NCA perform consistently better than the other algorithms across all the ranking positions. Table 3 shows the area under the ROC curve for BDM+V and the baseline methods. The proposed algorithm has the largest area under the ROC curve, followed by LMNN, euclidean, NCA, and finally, DistBoost. It is interesting to observe that although NCA achieves a better retrieval accuracy than the euclidean distance, its classification accuracy is considerably lower than that of the euclidean distance.
4.4.2 Results on the ImageCLEF Data Set
To assess the generality of the proposed algorithm, we further evaluate it on the medical image data set provided by the ImageCLEF conference [60]. This is a popular benchmark data set used to evaluate automated medical image categorization and retrieval. It consists of 15 medical image categories with a total of 2,785 images. All of the medical images in this experiment are X-ray images collected from plain radiography. Fig. 8 shows a few examples of medical images in our testbed. The category information can be found on the conference Web site.
Following the typical practice in ImageCLEF, we process each medical image using a bank of Gabor wavelet filters [63] to extract texture features. More specifically, each image is first scaled to the size of 128 × 128. Then, the Gabor wavelet transform is applied to the scaled image at five scale levels and eight orientations, which results in a total of 40 subimages. Every subimage is further normalized into 8 × 8 = 64 features, which results in a total of 64 × 40 = 2,560 visual features for each medical image. PCA is used to reduce the dimensionality from 2,560 to 200. We select a total of 1,100 images from 11 categories in ImageCLEF for our experiments. We randomly selected 40 percent of the images for the training data set and the remaining images serve as test queries.
The retrieval accuracy, defined in (20), is reported in Fig. 9. It is interesting to observe that NCA, which achieves high retrieval accuracy on the UPMC Mammograms Data Set, now performs significantly worse than the euclidean distance metric. On the other hand, DistBoost, which performed poorly on the UPMC data set, is one of the best algorithms here. This result indicates that some of the state-of-the-art distance metric learning algorithms are sensitive to the characteristics of the data sets and their performance is usually data-dependent. In contrast, BDM+V achieves good retrieval accuracy on both data sets, indicating the robustness of the proposed algorithm.
Fig. 8. Examples of medical images in the ImageCLEF testbed.

Fig. 9. Retrieval accuracy by different distance metric learning algorithms on the ImageCLEF data set.

Fig. 7. Retrieval accuracy of distance metric learning algorithms on the mammogram data set.

TABLE 3. Classification Accuracy on the Mammogram Data Set.
We also conduct the classification experiment using the ImageCLEF data set. Table 4 summarizes the area under the ROC curve for each of the 11 classes separately. As we observe, for most classes, BDM+V achieves a performance that is comparable to LMNN, the best among the five competitors.
4.5 Computational Cost
As discussed in Section 1, high computational cost is one of the major challenges in learning distance metrics. Many approaches aim to learn a full matrix and therefore become computationally expensive as the dimensionality grows. BDM+V reduces the computational cost by learning a binary representation in a boosting framework, from which a weighted Hamming distance is computed. Table 5 shows the running time of the proposed algorithm and the baseline methods for different dimensionality using the ImageCLEF data set. Note that the different numbers of dimensions are created by applying PCA to the images in the database and selecting the top eigenvectors for representing images.
First, the proposed algorithm is considerably faster than the three competitors when each image is represented by more than 200 features. Second, the time consumed by the proposed algorithm does not increase dramatically as the number of dimensions increases from 100 to 500; in contrast, for the three baseline algorithms, we observe a significant increase in the computational time as the dimension grows beyond 300. For instance, DistBoost is impressively fast (524.1 seconds) with 200 dimensions but falls behind BDM+V when the dimension increases to 300, and this gap widens in the case of 400 and 500 dimensions. NCA is the most computationally expensive among the four competitors, starting at 1,896.1 seconds for 100 dimensions and rising rapidly to 84,016.9 seconds for 500 dimensions. From these experiments, it is evident that, for all of the baseline methods, the efficiency issue becomes severe with higher dimensionality. In contrast, due to its efficient design, the computational time of the proposed method increases only linearly with respect to the dimensionality.
4.6 Regular Image Retrieval on the COREL Data Set
To demonstrate the efficacy of BDM+V for regular image retrieval, we test the proposed algorithm on the COREL data set. We randomly choose 10 categories from the COREL data set and randomly select 100 examples from each category, resulting in an image collection of 1,000 images. Each image is represented by 36 visual features that belong to three categories: color, edge, and texture. The details of the visual features used to represent the COREL data set can be found in [31].
The retrieval accuracy is reported in Fig. 10. Although the proposed algorithm BDM+V is outperformed overall by LMNN and DistBoost, we observe that BDM+V surpasses DistBoost at the first rank and outperforms LMNN after rank 14.

Table 6 reports the area under the ROC curve for each of the 10 classes separately. BDM+V performs comparably to LMNN, which achieves the best results across the 10 classes. The other three competitors, i.e., DistBoost, NCA, and euclidean, often perform significantly worse than LMNN and the proposed algorithm. Moreover, the standard deviation of BDM+V and LMNN is, in general, smaller than that of the three baselines, indicating the robustness of the proposed algorithm.
TABLE 5. Computation Time (Seconds) for the Proposed and Baseline Algorithms as the Number of Dimensions Varies from 100 to 500.

TABLE 4. Area under the ROC Curve on the ImageCLEF Data Set, Obtained by the Proposed and Baseline Algorithms.
5 CONCLUSION AND DISCUSSIONS
In this paper, we present a novel framework that learns a distance metric from side information. Unlike other distance metric learning algorithms that are designed to learn a full distance metric matrix, and therefore suffer from computational difficulty, the proposed algorithm first learns a binary representation for the data and then computes a weighted Hamming distance based on the learned representation. A boosting algorithm is presented to facilitate the learning of the binary representation and the weights that are used to form the Hamming distance. In addition to its computational efficiency, another advantage of the proposed algorithm is that it is able to preserve both semantic relevance and visual similarity. This is realized through the introduction of links that pair visually similar images. By training over the combination of visual pairs and pairwise constraints that are generated based on semantic relevance, the resulting distance metric is able to preserve both the visual similarity and the semantic relevance. In contrast, previous work on distance metric learning tends to focus only on the semantic relevance. We demonstrate the effectiveness of the proposed algorithm in the context of an ISADS system for breast cancer and on two standard image data sets (ImageCLEF and Corel).
APPENDIX A
PROOF OF LEMMA 3.2
To prove the inequality, we consider the following two cases:

- $f(x_i) = f(x_j)$: In this case, the inequality in (3) holds because the left side of the inequality is zero and the right side is guaranteed to be nonnegative.
- $f(x_i) \neq f(x_j)$: In this case, $f(x_k)$ will be equal to either $f(x_i)$ or $f(x_j)$ since $f(x)$ is a binary function. Hence, both sides of the inequality are equal to 4, and therefore, the inequality in (3) holds.
APPENDIX B
PROOF OF LEMMA 3.3
To prove the inequality in (11), we first bound $\exp(\alpha (f_i - f_k)^2 - \alpha (f_i - f_j)^2)$ by the following expression:

$$\exp\!\left(\alpha (f_i - f_k)^2 - \alpha (f_i - f_j)^2\right) \leq \frac{\exp\!\left(2\alpha (f_i - f_k)^2\right) + \exp\!\left(-2\alpha (f_i - f_j)^2\right)}{2}.$$

Since $f_i^2 = 1$ for any example $x_i$, we have

$$\frac{(f_i - f_j)^2}{4} \in \{0, 1\}.$$

Hence, $\exp(\pm 2\alpha (f_i - f_j)^2)$ can be upper bounded as follows:

$$\exp\!\left(\pm 2\alpha (f_i - f_j)^2\right) = \exp\!\left(\pm 8\alpha\, \frac{(f_i - f_j)^2}{4}\right) \leq \frac{(f_i - f_j)^2}{4}\exp(\pm 8\alpha) + 1 - \frac{(f_i - f_j)^2}{4} = \frac{(f_i - f_j)^2}{4}\left(\exp(\pm 8\alpha) - 1\right) + 1.$$
TABLE 6. Area under the ROC Curve on the Corel Data Set, Obtained by the Proposed and Baseline Algorithms.
Fig. 10. Retrieval accuracy on the Corel data set.
Using the above inequality, we have the objective function $F(\alpha, f)$ in (9) upper bounded as follows:

$$F(\alpha, f) = \sum_{i,j,k=1}^{n} \gamma_{i,j}\,\gamma_{i,k} \exp\left(\alpha (f_i - f_j)^2 - \alpha (f_i - f_k)^2\right) \le \sum_{i,j,k=1}^{n} \gamma_{i,j}\,\gamma_{i,k} + \frac{\exp(8\alpha) - 1}{8} \sum_{i,j=1}^{n} \left(\gamma_{i,j} \sum_{k=1}^{n} \gamma_{i,k}\right)(f_i - f_j)^2 + \frac{\exp(-8\alpha) - 1}{8} \sum_{i,k=1}^{n} \left(\gamma_{i,k} \sum_{j=1}^{n} \gamma_{i,j}\right)(f_i - f_k)^2 = \sum_{i,j,k=1}^{n} \gamma_{i,j}\,\gamma_{i,k} + \frac{\exp(8\alpha) - 1}{8} f^\top L^A f + \frac{\exp(-8\alpha) - 1}{8} f^\top L^B f,$$

where $L^A$ and $L^B$ denote the graph Laplacians of the weight matrices $S^A_{i,j} = \gamma_{i,j} \sum_{k} \gamma_{i,k}$ and $S^B_{i,k} = \gamma_{i,k} \sum_{j} \gamma_{i,j}$, respectively. The last step of the above derivation is based on the following equality:

$$f^\top L_S f = \sum_{i,j=1}^{n} S_{i,j} (f_i - f_j)^2.$$

Finally, noting that $\tilde{F}(P)$, i.e., the objective function of the previous iteration, is equal to $\sum_{i,j,k=1}^{n} \gamma_{i,j}\,\gamma_{i,k}$, we have $\Delta(\alpha, f) = F(\alpha, f) - \tilde{F}(P)$ upper bounded as follows:

$$\Delta(\alpha, f) \le \frac{\exp(8\alpha) - 1}{8} f^\top L^A f + \frac{\exp(-8\alpha) - 1}{8} f^\top L^B f.$$
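The two elementary bounds used in this proof, the arithmetic-geometric-mean step and the convexity step, can be sanity-checked numerically. The snippet below follows the reconstructed reading of the inequalities above, with u and v standing for (f_i - f_j)^2 and (f_i - f_k)^2, which take values in {0, 4}; it is a verification sketch, not part of the algorithm.

    import numpy as np

    # Spot-check of the Lemma 3.3 bounds for alpha >= 0 and u, v in {0, 4}:
    #   exp(a*u - a*v) <= (exp(2*a*u) + exp(-2*a*v)) / 2        (AM-GM)
    #   exp(2*a*u)     <= (u/4) * (exp(8*a) - 1) + 1            (convexity)
    rng = np.random.default_rng(1)
    for _ in range(10_000):
        a = rng.uniform(0.0, 2.0)
        u, v = rng.choice([0.0, 4.0], size=2)
        tol = 1.0 + 1e-12  # allow floating-point slack at equality
        assert np.exp(a*u - a*v) <= tol * (np.exp(2*a*u) + np.exp(-2*a*v)) / 2
        assert np.exp(2*a*u) <= tol * ((u/4) * (np.exp(8*a) - 1) + 1)
    print("Lemma 3.3 bounds verified on random samples")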
APPENDIX C
PROOF OF THEOREM 3.4
We first denote by $g(\alpha, f)$ the right-hand side of the inequality in (11), i.e.,

$$g(\alpha, f) = \frac{\exp(8\alpha) - 1}{8} f^\top L^A f + \frac{\exp(-8\alpha) - 1}{8} f^\top L^B f.$$

Note that $g(\alpha, f)$ is a convex function of the parameter $\alpha$. We then compute $\min_{\alpha \ge 0} g(\alpha, f)$ by setting the first-order derivative with respect to $\alpha$ to zero, i.e.,

$$\frac{\partial g(\alpha, f)}{\partial \alpha} = \exp(8\alpha)\, f^\top L^A f - \exp(-8\alpha)\, f^\top L^B f = 0.$$

We obtain the optimal $\alpha$ by solving the above equation, which is

$$\alpha = \max\left(0,\; \frac{1}{16}\log f^\top L^B f - \frac{1}{16}\log f^\top L^A f\right).$$

Substituting the above expression for $\alpha$, we have

$$\min_{\alpha \ge 0} g(\alpha, f) = -\frac{1}{8}\left[\max\left(0,\; \sqrt{f^\top L^B f} - \sqrt{f^\top L^A f}\right)\right]^2 = -\frac{\left[\max\left(0,\; f^\top (L^B - L^A) f\right)\right]^2}{8\left(\sqrt{f^\top L^B f} + \sqrt{f^\top L^A f}\right)^2} \le -\frac{\left[\max\left(0,\; f^\top (L^B - L^A) f\right)\right]^2}{8 n \left(\sqrt{\lambda_{\max}(L^B)} + \sqrt{\lambda_{\max}(L^A)}\right)^2},$$

where the last step uses $f^\top L f \le n\,\lambda_{\max}(L)$ for $f \in \{-1, +1\}^n$. Since $\Delta(\alpha, f) \le g(\alpha, f)$, we have the bound in (15).
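The closed-form step size can be checked against a direct numeric minimization of g. In the sketch below, a and b stand for f^T L^A f and f^T L^B f in the notation above; the toy values are arbitrary.

    import numpy as np

    # g(alpha) from Theorem 3.4, with a = f^T L^A f and b = f^T L^B f.
    def g(alpha, a, b):
        return (np.exp(8*alpha) - 1) / 8 * a + (np.exp(-8*alpha) - 1) / 8 * b

    def optimal_alpha(a, b):
        # alpha* = max(0, (1/16) log(b/a)); positive only when b > a.
        return max(0.0, np.log(b / a) / 16.0)

    a, b = 1.3, 4.7                     # toy values with b > a
    alpha_star = optimal_alpha(a, b)
    closed_form = -(np.sqrt(b) - np.sqrt(a)) ** 2 / 8   # predicted minimum
    grid = np.linspace(0.0, 1.0, 100_001)
    assert abs(g(alpha_star, a, b) - closed_form) < 1e-9
    assert g(alpha_star, a, b) <= g(grid, a, b).min() + 1e-8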
APPENDIX D
PROOF OF THEOREM 3.5
According to Theorem 3.4, we have

$$\frac{F_{t+1}(P)}{F_t(P)} \le 1 - \frac{\left[\max\left(0,\; f^\top (L_t^B - L_t^A) f\right)\right]^2}{8 n\, F_t(P) \left(\sqrt{\lambda_{\max}(L_t^B)} + \sqrt{\lambda_{\max}(L_t^A)}\right)^2}. \qquad (21)$$

Since we choose $f$ to maximize $f^\top (L_t^B - L_t^A) f$, we have

$$\max\left(0,\; \max_{f} f^\top (L_t^B - L_t^A) f\right) \ge \lambda_{\max}(L_t^B - L_t^A). \qquad (22)$$

The above derivation uses the following fact:

$$\lambda_{\max}(L_t^B - L_t^A) \ge \frac{1}{n}\,\mathbf{1}^\top (L_t^B - L_t^A)\,\mathbf{1} = 0.$$

We can further simplify the bound in (21) by noting that

$$\left(\sqrt{\lambda_{\max}(L_t^B)} + \sqrt{\lambda_{\max}(L_t^A)}\right)^2 \le 2\left(\lambda_{\max}(L_t^B) + \lambda_{\max}(L_t^A)\right). \qquad (23)$$

Finally, we can upper bound $F_t(P)$ as follows:

$$F_t(P) = \sum_{i,j,k=1}^{n} \gamma_{i,j}\,\gamma_{i,k} = \frac{1}{2}\,\mathbf{1}^\top (S_t^A + S_t^B)\,\mathbf{1} \le \frac{n}{2}\,\lambda_{\max}(S_t^A + S_t^B). \qquad (24)$$

Putting the inequalities in (22), (23), and (24) together, we have

$$\frac{F_{t+1}(P)}{F_t(P)} \le 1 - \frac{\lambda_{\max}(L_t^B - L_t^A)^2}{8 n^2\, \lambda_{\max}(S_t^A + S_t^B)\left(\lambda_{\max}(L_t^B) + \lambda_{\max}(L_t^A)\right)} = 1 - \delta_t.$$

Using the above inequality, we can bound $F_{T+1}(P)$ as follows:

$$F_{T+1}(P) = F_0(P) \prod_{t=0}^{T} \frac{F_{t+1}(P)}{F_t(P)} \le F_0(P) \prod_{t=0}^{T} (1 - \delta_t).$$
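Under the same reconstructed notation, the per-iteration contraction factor delta_t is computable directly from the two weight matrices. The sketch below builds the Laplacians, takes the largest eigenvalues with numpy, and evaluates delta_t; the weight matrices are random stand-ins, and the Laplacian normalization follows the common L = D - S convention, which may differ from the paper's by a constant factor.

    import numpy as np

    # Sketch: evaluating the contraction factor delta_t from the bound above.
    # S_A, S_B are (hypothetical) symmetric nonnegative pair-weight matrices.
    def sym(M):
        return (M + M.T) / 2

    def laplacian(S):
        # Common convention L = D - S, with D the diagonal degree matrix.
        return np.diag(S.sum(axis=1)) - S

    def delta_t(S_A, S_B):
        n = S_A.shape[0]
        L_A, L_B = laplacian(S_A), laplacian(S_B)
        lam = lambda M: np.linalg.eigvalsh(M).max()   # largest eigenvalue
        num = lam(L_B - L_A) ** 2
        den = 8 * n**2 * lam(S_A + S_B) * (lam(L_B) + lam(L_A))
        return num / den

    rng = np.random.default_rng(2)
    S_A = sym(rng.uniform(size=(6, 6)))
    S_B = sym(rng.uniform(size=(6, 6)))
    print("delta_t =", delta_t(S_A, S_B))  # bound: F_{t+1} <= (1 - delta_t) F_t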
ACKNOWLEDGMENTS
This work was supported by the US National Science Foundation (NSF) under grant IIS-0643494 and by the National Center for Research Resources (NCRR) under grant No. 1 UL1 RR024153. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF, NCRR, Intel, Michigan State University, or Carnegie Mellon University.
REFERENCES
[1] P. Croskerry, "The Theory and Practice of Clinical Decision-Making," Canadian J. Anesthesia, vol. 52, no. 6, pp. R1-R8, 2005.
[2] L. Yang, R. Jin, R. Sukthankar, B. Zheng, L. Mummert, M. Satyanarayanan, M. Chen, and D. Jukic, "Learning Distance Metrics for Interactive Search-Assisted Diagnosis of Mammograms," Proc. SPIE Conf. Medical Imaging, 2007.
[3] A.W.M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, "Content-Based Image Retrieval at the End of the Early Years," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 12, pp. 1349-1380, Dec. 2000.
[4] H. Kim, P. Howland, and H. Park, "Dimension Reduction in Text Classification with Support Vector Machines," J. Machine Learning Research, vol. 6, pp. 37-53, 2005.
[5] L. Vandenberghe and S. Boyd, "Semidefinite Programming," SIAM Rev., vol. 38, no. 1, pp. 49-95, 1996.
[6] K. Weinberger, J. Blitzer, and L. Saul, "Distance Metric Learning for Large Margin Nearest Neighbor Classification," Advances in Neural Information Processing Systems, MIT Press, http://www.seas.upenn.edu/kilianw/lmnn, 2006.
[7] D. Gur, J.H. Sumkin, L.A. Hardesty, and H.E. Rockette, "Computer-Aided Detection of Breast Cancer: Has Promise Outstripped Performance?" J. Nat'l Cancer Inst., vol. 96, pp. 717-718, 2004.
[8] R.M. Nishikawa and M. Kallergi, "Computer-Aided Detection in Its Present Form Is Not an Effective Aid for Screening Mammography," Medical Physics, vol. 33, pp. 811-814, 2006.
[9] T.M. Freer and M.J. Ulissey, "Screening Mammography with Computer-Aided Detection: Prospective Study of 12,860 Patients in a Community Breast Center," Radiology, vol. 220, pp. 781-786, 2001.
[10] L.A. Khoo, P. Taylor, and R.M. Given-Wilson, "Computer-Aided Detection in the United Kingdom National Breast Screening Programme: Prospective Study," Radiology, vol. 237, pp. 444-449, 2005.
[11] J.M. Ko, M.J. Nicholas, J.B. Mendel, and P.J. Slanetz, "Prospective Assessment of Computer-Aided Detection in Interpretation of Screening Mammograms," Am. J. Roentgenology, vol. 187, pp. 1483-1491, 2006.
[12] M.L. Giger, Z. Huo, C.J. Vyborny, L. Lam, I. Bonta, K. Horsch, R.M. Nishikawa, and I. Rosenbourgh, "Intelligent CAD Workstation for Breast Imaging Using Similarity to Known Lesions and Multiple Visual Prompt Aides," Proc. SPIE Conf. Medical Imaging '02: Image Processing, pp. 768-773, 2002.
[13] I. El-Naqa, Y. Yang, N.P. Galatsanos, R.M. Nishikawa, and M.N. Wernick, "A Similarity Learning Approach to Content-Based Image Retrieval: Application to Digital Mammography," IEEE Trans. Medical Imaging, vol. 23, no. 10, pp. 1233-1244, Oct. 2004.
[14] C. Wei, C. Li, and R. Wilson, "A General Framework for Content-Based Medical Image Retrieval with Its Application to Mammograms," Proc. SPIE Conf. Medical Imaging '05: PACS and Imaging Informatics, pp. 134-143, 2005.
[15] H. Alto, R.M. Rangayyan, and J.E. Desautels, "Content-Based Retrieval and Analysis of Mammographic Masses," J. Electronic Imaging, vol. 14, pp. 023016-1-023016-17, 2005.
[16] C. Muramatsu, Q. Li, K. Suzuki, R.A. Schmidt, J. Shiraishi, G.M. Newstead, and K. Doi, "Investigation of Psychophysical Measure for Evaluation of Similar Images for Mammographic Masses: Preliminary Results," Medical Physics, vol. 32, pp. 2295-2304, 2005.
[17] B. Zheng et al., "Interactive Computer Aided Diagnosis of Breast Masses: Computerized Selection of Visually Similar Image Sets from a Reference Library," Academic Radiology, vol. 14, no. 8, pp. 917-927, 2007.
[18] G.D. Tourassi, B. Harrawood, S. Singh, J.Y. Lo, and C.E. Floyd, "Evaluation of Information-Theoretic Similarity Measures for Content-Based Retrieval and Detection of Masses in Mammograms," Medical Physics, vol. 34, pp. 140-150, 2007.
[19] Y. Tao, S.B. Lo, M.T. Freedman, and J. Xuan, "A Preliminary Study of Content-Based Mammographic Masses Retrieval," Proc. SPIE Conf. Medical Imaging '07, 2007.
[20] R.M. Nishikawa, "Current Status and Future Directions of Computer-Aided Diagnosis in Mammography," Computerized Medical Imaging and Graphics, vol. 31, pp. 224-235, 2007.
[21] L. Yang and R. Jin, "Distance Metric Learning: A Comprehensive Survey," technical report, Michigan State Univ., 2006.
[22] E. Xing, A. Ng, M. Jordan, and S. Russell, "Distance Metric Learning with Application to Clustering with Side Information," Advances in Neural Information Processing Systems, MIT Press, 2003.
[23] J.T. Kwok and I.W. Tsang, "Learning with Idealized Kernels," Proc. Int'l Conf. Machine Learning, 2003.
[24] T. Hastie and R. Tibshirani, "Discriminant Adaptive Nearest Neighbor Classification," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 18, no. 6, pp. 607-616, June 1996.
[25] A.B. Hillel, T. Hertz, N. Shental, and D. Weinshall, "Learning Distance Functions Using Equivalence Relations," Proc. Int'l Conf. Machine Learning, 2003.
[26] S.C.H. Hoi, W. Liu, M.R. Lyu, and W.-Y. Ma, "Learning Distance Metrics with Contextual Constraints for Image Retrieval," Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2006.
[27] M. Sugiyama, "Local Fisher Discriminant Analysis for Supervised Dimensionality Reduction," Proc. Int'l Conf. Machine Learning, 2006.
[28] T.-K. Kim, S.-F. Wong, B. Stenger, J. Kittler, and R. Cipolla, "Incremental Linear Discriminant Analysis Using Sufficient Spanning Set Approximations," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2007.
[29] M. Schultz and T. Joachims, "Learning a Distance Metric from Relative Comparisons," Advances in Neural Information Processing Systems, MIT Press, 2004.
[30] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov, "Neighbourhood Components Analysis," Advances in Neural Information Processing Systems, MIT Press, 2005.
[31] L. Yang, R. Jin, R. Sukthankar, and Y. Liu, "An Efficient Algorithm for Local Distance Metric Learning," Proc. Nat'l Conf. Artificial Intelligence, 2006.
[32] A.B. Hillel and D. Weinshall, "Learning Distance Function by Coding Similarity," Proc. Int'l Conf. Machine Learning, 2007.
[33] A. Woznica, A. Kalousis, and M. Hilario, "Learning to Combine Distances for Complex Representations," Proc. Int'l Conf. Machine Learning, 2007.
[34] W. Zhang, X. Xue, Z. Sun, Y. Guo, and H. Lu, "Optimal Dimensionality of Metric Space for Classification," Proc. Int'l Conf. Machine Learning, 2007.
[35] H. Wang, H. Zha, and H. Qin, "Dirichlet Aggregation: Unsupervised Learning Towards an Optimal Metric for Proportional Data," Proc. Int'l Conf. Machine Learning, 2007.
[36] S. Zhou, B. Georgescu, D. Comaniciu, and J. Shao, "BoostMotion: Boosting a Discriminative Similarity Function for Motion Estimation," Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2006.
[37] B. Babenko, P. Dollar, and S. Belongie, "Task Specific Local Region Matching," Proc. Int'l Conf. Computer Vision, 2007.
[38] F. Li, J. Yang, and J. Wang, "A Transductive Framework of Distance Metric Learning by Spectral Dimensionality Reduction," Proc. Int'l Conf. Machine Learning, 2007.
[39] P. Dollar, V. Rabaud, and S. Belongie, "Non-Isometric Manifold Learning: Analysis and an Algorithm," Proc. Int'l Conf. Machine Learning, 2007.
[40] J.V. Davis, B. Kulis, P. Jain, S. Sra, and I.S. Dhillon, "Information-Theoretic Metric Learning," Proc. Int'l Conf. Machine Learning, 2007.
[41] J. Dillon, Y. Mao, G. Lebanon, and J. Zhang, "Statistical Translation, Heat Kernels, and Expected Distance," Proc. Conf. Uncertainty in Artificial Intelligence, 2007.
[42] L. Yang, R. Jin, and R. Sukthankar, "Bayesian Active Distance Metric Learning," Proc. Conf. Uncertainty in Artificial Intelligence, 2007.
[43] L. Torresani and K. Lee, "Large Margin Component Analysis," Advances in Neural Information Processing Systems, MIT Press, 2007.
[44] K.Q. Weinberger, F. Sha, Q. Zhu, and L.K. Saul, "Graph Laplacian Regularization for Large-Scale Semidefinite Programming," Advances in Neural Information Processing Systems, MIT Press, 2007.
[45] D. Zhou, J. Huang, and B. Scholkopf, "Learning with Hypergraphs: Clustering, Classification, and Embedding," Advances in Neural Information Processing Systems, MIT Press, 2007.
[46] Z. Zhang and J. Wang, "MLLE: Modified Locally Linear Embedding Using Multiple Weights," Advances in Neural Information Processing Systems, MIT Press, 2007.
[47] A. Frome, Y. Singer, and J. Malik, "Image Retrieval and Classification Using Local Distance Functions," Advances in Neural Information Processing Systems, MIT Press, 2007.
[48] O. Boiman and M. Irani, "Similarity by Composition," Advances in Neural Information Processing Systems, MIT Press, 2007.
[49] M. Belkin and P. Niyogi, "Convergence of Laplacian Eigenmaps," Advances in Neural Information Processing Systems, MIT Press, 2007.
[50] T. Hertz, A.B. Hillel, and D. Weinshall, "Boosting Margin Based Distance Functions for Clustering," Proc. Int'l Conf. Machine Learning, http://www.cs.huji.ac.il/daphna/code/DistBoost.zip, 2004.
[51] T. Hertz, A.B. Hillel, and D. Weinshall, "Learning a Kernel Function for Classification with Small Training Samples," Proc. Int'l Conf. Machine Learning, 2006.
[52] G. Shakhnarovich, "Learning Task-Specific Similarity," PhD thesis, Massachusetts Inst. of Technology, 2005.
[53] Y. Freund and R.E. Schapire, "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting," J. Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.
[54] S. Agarwal, J. Wills, L. Cayton, G. Lanckriet, D. Kriegman, and S. Belongie, "Generalized Non-Metric Multidimensional Scaling," Proc. Int'l Conf. Artificial Intelligence and Statistics, 2007.
[55] B. Moghaddham and M.-H. Yang, "Gender Classification with Support Vector Machines," Proc. Int'l Conf. Face and Gesture Recognition, 2000.
[56] Y. Ke, D. Hoiem, and R. Sukthankar, "Computer Vision for Music Identification," Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2005.
[57] J. Zhang and R. Yan, "On the Value of Pairwise Constraints in Classification and Consistency," Proc. Int'l Conf. Machine Learning, 2007.
[58] R.E. Schapire, "Theoretical Views of Boosting and Applications," Proc. Int'l Conf. Algorithmic Learning Theory, 1999.
[59] R. Salakhutdinov, S. Roweis, and Z. Ghahramani, "On the Convergence of Bound Optimization Algorithms," Proc. Conf. Uncertainty in Artificial Intelligence, 2003.
[60] ImageCLEF, http://ir.shef.ac.uk/imageclef/, 2009.
[61] B. Zheng, J.K. Leader, G. Abrams, B. Shindel, V. Catullo, W.F. Good, and D. Gur, "Computer-Aided Detection Schemes: The Effect of Limiting the Number of Cued Regions in Each Case," Am. J. Roentgenology, vol. 182, pp. 579-583, 2004.
[62] B. Zheng, A. Lu, L.A. Hardesty, J.H. Sumkin, C.M. Kakim, M.A. Ganott, and D. Gur, "A Method to Improve Visual Similarity of Breast Masses for an Interactive Computer-Aided Diagnosis Environment," Medical Physics, vol. 33, pp. 111-117, 2006.
[63] M. Lades, J.C. Vorbruggen, J. Buhmann, J. Lange, C. von der Malsburg, R. Wurtz, and W. Konen, "Distortion Invariant Object Recognition in the Dynamic Link Architecture," IEEE Trans. Computers, vol. 42, no. 3, pp. 300-311, Mar. 1993.
Liu Yang received the BS degree in electronics and information engineering from Huazhong University of Science and Technology, China. She is currently working toward the PhD degree in the Machine Learning Department of the School of Computer Science at Carnegie Mellon University. Her research interest is primarily in semi-supervised learning, distance metric learning, information retrieval, and object recognition. She was selected as the Machine Learning Department nominee from CMU for the IBM Fellowship. She is a student member of the IEEE.
Rong Jin received the BA degree in engineering from Tianjin University, the MS degree in physics from Beijing University, and the MS and PhD degrees in computer science from Carnegie Mellon University. He has been an associate professor in the Computer Science and Engineering Department at Michigan State University since 2008. His research is focused on statistical machine learning and its application to large-scale information management. He has published more than 80 conference and journal articles on the related topics. He received the US National Science Foundation (NSF) Career Award in 2006.
Lily Mummert received the PhD degree in computer science from Carnegie Mellon University in 1996. She is a research scientist at Intel Research Pittsburgh, working in the area of distributed systems. Before joining Intel in 2006, she was a research staff member at the IBM T.J. Watson Research Center, where she worked on problems in enterprise system management and contributed to several products. Her current research is focused on enabling interactive applications that process data from heterogeneous, potentially high-data-rate sensors such as video and audio. She is a member of the IEEE.
Rahul Sukthankar received the PhD degree in robotics from Carnegie Mellon University and the BSE degree in computer science from Princeton. He is a senior principal research scientist at Intel Research Pittsburgh and an adjunct research professor in the Robotics Institute at Carnegie Mellon. He was previously a senior researcher at HP/Compaq's Cambridge Research Lab and a research scientist at Just Research. His current research focuses on computer vision and machine learning, particularly in the areas of object recognition and information retrieval in medical imaging. He is a member of the IEEE.
Adam Goode received the bachelor's degree in computer science and psychology from Rensselaer Polytechnic Institute and the master's degree from the Human-Computer Interaction Institute at Carnegie Mellon University. He is a project scientist at Carnegie Mellon, working on Diamond, a system for interactive search. He has been working as a research staff member at Carnegie Mellon since 2001. He is a member of the IEEE.
Bin Zheng received the PhD degree in electrical engineering from the University of Delaware. Currently, he is a research associate professor in the Department of Radiology at the University of Pittsburgh. He is also the principal investigator on a number of biomedical imaging research projects funded by the US National Institutes of Health (NIH). His research projects and interests include computer-aided detection and diagnosis (CAD) of medical images, content-based image retrieval, machine learning, and receiver operating characteristic (ROC)-type observer performance studies and data analysis. He and his colleagues have published more than 60 refereed articles on developing and evaluating CAD schemes and systems for digitized mammograms, lung CT images, and digital microscopic pathology images.
Steven C.H. Hoi received the BS degree in computer science from Tsinghua University, Beijing, China, and the MS and PhD degrees in computer science and engineering from the Chinese University of Hong Kong. He is currently an assistant professor in the School of Computer Engineering of Nanyang Technological University, Singapore. His research interests include statistical machine learning, multimedia information retrieval, Web search, and data mining. He is a member of the IEEE.
Mahadev Satyanarayanan received the bachelor's and master's degrees from the Indian Institute of Technology, Madras, and the PhD degree in computer science from Carnegie Mellon. He is the Carnegie Group professor of computer science at Carnegie Mellon University. From May 2001 to May 2004, he served as the founding director of Intel Research Pittsburgh, one of four university-affiliated research labs established worldwide by Intel to create disruptive information technologies through its Open Collaborative Research model. He is a fellow of the ACM and the IEEE and was the founding editor-in-chief of IEEE Pervasive Computing.