Optimal Transport in Reproducing Kernel Hilbert Spaces: Theory and Applications
Zhen Zhang, Student Member, IEEE, Mianzhi Wang, Student Member, IEEE, and Arye Nehorai, Life Fellow, IEEE
Abstract—In this paper, we present a mathematical and computational framework for comparing and matching distributions in reproducing kernel Hilbert spaces (RKHS). This framework, called optimal transport in RKHS, is a generalization of the optimal transport problem in input spaces to (potentially) infinite-dimensional feature spaces. We provide a computable formulation of Kantorovich's optimal transport in RKHS. In particular, we explore the case in which data distributions in RKHS are Gaussian, obtaining closed-form expressions of both the estimated Wasserstein distance and the optimal transport map via kernel matrices. Based on these expressions, we generalize the Bures metric on covariance matrices to infinite-dimensional settings, providing a new metric between covariance operators. Moreover, we extend the correlation alignment problem to Hilbert spaces, giving a new strategy for matching distributions in RKHS. Empirically, we apply the derived formulas under the Gaussianity assumption to image classification and domain adaptation. In both tasks, our algorithms yield state-of-the-art performance, demonstrating the effectiveness and potential of our framework.
Index Terms—Optimal transport, reproducing kernel Hilbert spaces, kernel methods, optimal transport map, Wasserstein distance, Wasserstein geometry, covariance operator, image classification, domain adaptation
1 INTRODUCTION
THE popularity of optimal transport (OT) has grown dramatically in recent years. Techniques built upon optimal transport have achieved great success in many applications, including computer vision [1], [2], [3], [4], [5], statistical machine learning [6], [7], [8], [9], geometry processing [10], [11], [12], fluid mechanics [13], and optimal control.
As the name suggests, OT aims at finding an optimal strategy of transporting mass from source locations to target locations. More specifically, assume we are given a pile of sand, modeled by the probability measure $\mu$, and a hole with the same volume, modeled by the probability measure $\nu$ (see Fig. 1). We also have a cost function $c(x, y)$ (usually a distance function named the "ground distance") describing how much it costs to move one unit of mass from location $x$ to location $y$. The OT problem corresponds to finding the optimal transport map $T$ (or plan) to minimize the total cost of filling up the hole.
Given the two probability measures $\mu$ and $\nu$, the optimal transport map can be considered as the most efficient map transferring $\mu$ to $\nu$, in the sense of minimizing the total transport cost. This map has been successfully applied to color transfer [3], Bayesian inference [14], and domain adaptation [7], [8]. The total minimal cost can be viewed as the discrepancy, the so-called Wasserstein distance, between $\mu$ and $\nu$. Intuitively, if $\mu$ and $\nu$ are similar, the transportation cost will be small. Different from other discrepancies, such as the K-L divergence and the $L_2$ distance, the Wasserstein distance incorporates the geometric information of the underlying support through the cost function. Because of its geometric characteristics, the Wasserstein distance provides a powerful framework for comparing and analyzing probability distributions [1], [15]. Moreover, in some machine learning problems, it has also been used to define a loss function for generative models to improve their stability and interpretability [6], [16]. Several references exploit the case where OT operates on Gaussian measures. In [17], textures are modeled by Gaussian measures, and synthetic textures are obtained via OT mixing. In [18], an elegant framework is proposed for comparing and interpolating Gaussian mixture models.
All the works mentioned above exploit the machinery of OT in original input spaces (usually Euclidean spaces $\mathbb{R}^n$). However, the OT problem in reproducing kernel Hilbert spaces (RKHS) has not been widely investigated. In this paper, we propose a theoretical and computational framework to bridge this gap. The motivations are the following.
1) There are various ways to represent data, such as strings [19], graphs [20], proteins [21], automata [22], and lattices [23]. For some of these representations, we have access only to the data-dependent kernel functions characterizing the affinity relations between examples, instead of the ground distance or the cost function. Thus, it is not straightforward to formulate the OT problem for such datasets. Sometimes, even
• The authors are with the Department of Electrical and Systems Engineering, Washington University in St. Louis, Saint Louis, MO 63130. E-mail: {zhen.zhang, mianzhi.wang, nehorai}@wustl.edu.
Manuscript received 2 Oct. 2017; revised 13 Jan. 2019; accepted 27 Feb. 2019. Date of publication 4 Mar. 2019; date of current version 3 June 2020. (Corresponding author: Arye Nehorai.) Recommended for acceptance by C. H. Lampert. Digital Object Identifier no. 10.1109/TPAMI.2019.2903050
for metric spaces (like Riemannian manifolds), kernels are more powerful than distance functions in measuring the similarity between points [24].
2) There is a huge number of machine learning algorithms formulated in RKHS, due to its capability of capturing nonlinear structures. The performance of these algorithms depends highly on the data distributions in feature spaces. Hence, it is of vital importance to develop a general framework to analyze and match RKHS probability measures.
Following the common procedure for kernel-based methods, we first map the data into an RKHS through a feature map $\phi$, and then formulate the OT problem in the resulting space. Because of the implicitness of feature maps, we have no access to the pushforward measures¹ on the RKHS, which makes this different from the OT problem in the original input spaces. The key point of our work is taking advantage of the interplay between kernel functions and probability measures to develop computable formulations and expressions.
Since the straightforward formulation of OT in RKHS involves the implicit feature map, we propose an equivalent and computable formulation in which the problem of OT between pushforward measures on RKHS can be fully determined by the kernel function. It will be seen that the alternative formulation can be viewed as the OT problem in the original space, with the cost function induced by the kernel. We name the corresponding Wasserstein distance the "kernel Wasserstein distance" (KW for short).
For the case in which the pushforward measures are Gaussian, we use kernel matrices to derive closed-form expressions of the empirical Wasserstein distance and optimal transport map, which we term the "kernel Gauss-Wasserstein distance" (KGW for short) and the "kernel Gauss-optimal transport map" (KGOT for short), respectively. If the expectations of two Gaussian measures are the same, then KGW introduces a distance between covariance operators, generalizing the Bures metric on covariance matrices to infinite-dimensional settings. We term this distance the "kernel Bures distance" (KB for short). More interestingly, the KB distance does not require covariance operators to be strictly positive (or invertible), which makes it rather appealing, since the estimated covariance operators from finite samples are always rank-deficient. The KGOT map is a continuous linear operator. It introduces a new alignment strategy for RKHS distributions by forcing the covariance operator of the source distribution to approach that of the target distribution.
Empirically, we apply the tools developed under the Gaussianity assumption to image classification and domain adaptation tasks. In image classification, we represent each image with a collection of feature samples (the so-called "ensemble" [25]), then employ the KGW or KB distance to quantify the difference between them. In domain adaptation, we solve the domain shift issue in RKHS. That is, we use the KGOT map to transport the samples in the source domain to the target domain to reduce the distribution difference. The promising results for both tasks demonstrate the strong capability of our framework in comparing and matching distributions.
Here, we provide insights on our strategy in the above applications. Our approaches are based on the results obtained from optimal transport between Gaussian distributions on RKHS. As mentioned above, one favorable property of RKHS Gaussian distributions is that we can obtain closed-form solutions. Moreover, it has been both numerically and theoretically justified that after nonlinear kernel (e.g., RBF kernel) transformations, data are more likely to be Gaussian [26]. This phenomenon is exploited by many kernel-based methods. For example, in [27], [28], probabilistic kernel PCA is formulated based on a latent Gaussian model in RKHS. In [26], Fisher discriminant analysis is implemented in feature spaces by assuming that RKHS samples belonging to different classes follow Gaussian distributions with the same covariance operator but different means. In [29], the Gaussianity of RKHS data is assumed in order to compute the mutual information. More detailed discussions of this assumption can be found in [25], [30], and [26]. On the other hand, our approaches can also be interpreted from the perspective of Hilbert space embeddings, without the Gaussianity assumption in RKHS. The KGW distance and the KGOT map operate only on RKHS means and covariance operators, which are informative enough to characterize data distributions. Therefore, the problem of comparing and matching distributions can be naturally solved by comparing and aligning kernel means and covariance operators.
Contributions. The contributions of our work are summarized as follows. (1) We introduce a systematic framework for optimal transport in RKHS, including both theoretical and computational formulations. (2) Assuming Gaussianity in RKHS, we derive closed-form expressions of the estimated Wasserstein distance and optimal transport map via Gram matrices. (3) We apply our formulations to the tasks of image classification and domain adaptation. On several challenging datasets, our methods outperform state-of-the-art approaches, demonstrating the effectiveness and potential of our framework.
Related Work. From the mathematical perspective, our work lies at the intersection of two topics: reproducing kernel Hilbert spaces [31] and optimal transport [32]. The topological properties of RKHS, which are the cornerstones of our work, are systematically characterized in [31]. Formulating OT in abstract spaces is considered in [33], [34], and [35]. In [34] and [35], general expressions of the Wasserstein distance between Gaussian measures on Hilbert spaces are derived. All the works above provide rigorous foundations for our framework. We will show how the theorems from
Fig. 1. Illustration of the optimal transport problem.
1. Given a probability measure $\mu$ on the input space and mapping the data through the implicit map $\phi$, we are interested in the data distribution in RKHS. Such a distribution is called the pushforward measure, denoted as $\phi_\#\mu$, satisfying that for any subset $A$ in RKHS, $\phi_\#\mu(A) = \mu(\phi^{-1}(A))$.
RKHS and OT elegantly interact with each other to advance the construction of "OT in RKHS". In fact, RKHS provides a platform for the theory of OT in abstract spaces to be applied to real-world problems. A recent, quite relevant work can be found in [36], where the authors proposed a Wasserstein distance based framework for statistical analysis of Gaussian processes (GPs). They formulated the OT problem in the space of GPs, which is essentially an RKHS.
From the empirical perspective, there are several related approaches for image classification and domain adaptation.
In image classification, the strategy of representing images with collections of feature vectors has attracted increasing attention. The subsequent procedure of quantifying the dissimilarities between such ensembles is actually the crucial problem in image classification. The related algorithms dealing with this problem can be roughly categorized into two classes: covariance matrix-based approaches and covariance operator-based approaches. The methods belonging to the first class, such as [37] and [38], exploit the second-order statistics constructed in the original input spaces, characterizing the differences by comparing covariance matrices. The methods in the second class, such as [39], [40], and [41], encode ensembles with infinite-dimensional RKHS covariance operators, and compute the kernelized versions of divergences or distances between them. Covariance operator-based approaches usually achieve better performance since covariance operators can capture nonlinear correlations. Remarkably, all the above approaches take advantage of the non-Euclidean geometry of covariance matrices and covariance operators, which is usually quite favorable in computer vision problems [42]. In our work, we derive a computable expression of the kernel Bures distance between covariance operators, which generalizes the Wasserstein geometry to the infinite-dimensional RKHS. Moreover, the KB distance also achieves promising results.
Domain shift, which occurs when the training (source) and testing (target) datasets follow different distributions, usually results in poor performance of the trained model on the target domain. It is a fundamental problem in statistics and machine learning, and frequently arises in real-world applications. There are many strategies to deal with this issue. For example, the methods in [43], [44] aim at identifying a domain-invariant subspace where the source and target distributions are similar. The works in [45] and [46] exploit intermediate subspaces treated as points on a geodesic curve of the Grassmann manifold. The authors either sample a finite number of subspaces or integrate along the geodesic to model the domain shift. In [47], an algorithm is introduced for minimizing the distribution difference through reweighing samples. More recently, OT-based methods [7], [8] have been proposed. In [7], the authors use OT to find a non-rigid transformation to align source and target distributions. They propose several regularization schemes to improve the regularity of the learned mapping. In [8], the authors formulate an optimization problem to learn an explicit transformation to approximate the OT map, so that it can generalize to out-of-sample patterns. We develop our method from a significantly different view. The methodological difference is that we match distributions in RKHS, while all the works mentioned above attempt to reduce the dissimilarity of distributions in the original input space. Thanks to the Gaussianity of data in RKHS, we can conduct the alignment with the KGOT map, a continuous linear operator having an explicit expression. Regularity can be guaranteed by its continuity and linearity. In [48], the task of matching RKHS distributions is formulated as aligning kernel matrices. However, kernel matrices may have different sizes, and their rows/columns do not necessarily correspond. To tackle such problems, the authors introduce the "surrogate kernel". Different from [48], our KGOT map directly operates on covariance operators, which is more intuitive and straightforward, entirely avoiding the above problems. In addition, if we select the linear kernel, i.e., $k(x, y) = x^T y$, our approach degenerates to aligning covariance matrices, which is similar to CORAL [49].
Organization. In Section 2, we provide the background on RKHS and optimal transport in Euclidean spaces. Sections 3 and 4 form the core of our work, where we develop the computational framework of OT in RKHS, together with closed-form expressions of the empirical KGW distance and KGOT map. In Section 5, we describe the details of applying the derived formulas to image classification and domain adaptation, respectively. In Section 6, we report the experimental results on real datasets. In the Supplementary Material, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2019.2903050, we provide the proofs of all mathematical results in this paper, along with further technical discussions and more experimental results. In Table 1, we list the notations introduced in the paper.
2 BACKGROUND
In this section, we first introduce the reproducing kernel Hilbert space. Next, we review two classical formulations of the optimal transport problem in $\mathbb{R}^n$. We then discuss the relevant conclusions for the special case in which the probability measures are Gaussian.
We use $\|\cdot\|_2$ to denote the Euclidean distance. We use $\Pr(\mathbb{R}^n)$ to indicate the set of Borel probability measures on
TABLE 1
Notations

Symbol | Acronym | Meaning
$d_W(\mu, \nu)$ | – | The Wasserstein distance between probability measures $\mu$ and $\nu$ on $\mathbb{R}^n$.
$d_{GaW}(\mu, \nu)$ | GaW | A pseudo-metric on measures with finite first- and second-order moments. If $\mu$ and $\nu$ are Gaussian, $d_{GaW}(\mu, \nu)$ is just the corresponding Wasserstein distance between $\mu$ and $\nu$.
$d_B(\Sigma_1, \Sigma_2)$ | – | The Bures metric between positive semidefinite matrices $\Sigma_1$ and $\Sigma_2$.
$T_G$ | – | The optimal transport map between Gaussian measures on $\mathbb{R}^n$.
$d_{HW}(\mu, \nu)$ | KW | The Wasserstein distance between probability measures $\phi_\#\mu$ and $\phi_\#\nu$ on RKHS.
$d_{HGW}(\mu, \nu)$ | KGW | The Wasserstein distance between Gaussian measures $\phi_\#\mu$ and $\phi_\#\nu$ on RKHS.
$d_{HB}(R_1, R_2)$ | KB | The kernel Bures distance between RKHS covariance operators $R_1$ and $R_2$.
$T^H_G$ | KGOT | The optimal transport map between Gaussian measures on RKHS.
$\mathbb{R}^n$, and use $\Pr(\mathbb{R}^n \times \mathbb{R}^n)$ to indicate the set of Borel probability measures on the product space $\mathbb{R}^n \times \mathbb{R}^n$.
2.1 Reproducing Kernel Hilbert Spaces
Let $X$ be a nonempty set, and let $H$ be a Hilbert space of $\mathbb{R}$-valued functions defined on $X$. A function $k : X \times X \to \mathbb{R}$ is called a reproducing kernel of $H$, and $H$ is a reproducing kernel Hilbert space, if $k$ satisfies:
1) $\forall x \in X$, $k(\cdot, x) \in H$,
2) $\forall x \in X$, $f \in H$, $\langle f, k(\cdot, x)\rangle_H = f(x)$.
Define the implicit feature map $\phi : X \to H$ as $\phi(x) = k(\cdot, x)$. Then, we have $\langle \phi(x), \phi(y)\rangle_H = k(x, y)$, $\forall x, y \in X$.
It can be easily shown that $k$ is positive definite. On the other hand, the Moore-Aronszajn theorem says that any positive definite kernel $k$ is associated with a unique RKHS.
2.2 Optimal Transport in $\mathbb{R}^n$
2.2.1 Two Formulations
Monge's Formulation. Given two probability measures $\mu, \nu \in \Pr(\mathbb{R}^n)$, Monge's problem is to find a transport map $T : \mathbb{R}^n \to \mathbb{R}^n$ that pushes $\mu$ to $\nu$ (denoted as $T_\#\mu = \nu$) to minimize the total transport cost. The problem is formulated as

$$\inf_{T_\#\mu = \nu} \int_{\mathbb{R}^n} \|\tilde{x} - T(\tilde{x})\|_2^2 \, d\mu(\tilde{x}), \qquad (1)$$

where $\|\tilde{x} - T(\tilde{x})\|_2^2$ is the cost function, reflecting the geometric information of the underlying supports. The physical meaning of Monge's formulation is illustrated in Fig. 1.
However, in some cases, this formulation is ill-posed, in the sense that the existence of $T$ cannot be guaranteed. A typical example is where $\mu$ is a Dirac measure but $\nu$ is not. There is no such $T$ transferring $\mu$ to $\nu$. To tackle this issue, Kantorovich gives a relaxed version of OT.
Kantorovich's Formulation. Kantorovich's formulation of OT is a relaxation of Monge's. In Kantorovich's formulation, the objective function is minimized over all transport plans instead of transport maps. It can be written as follows:

$$\inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathbb{R}^n \times \mathbb{R}^n} \|\tilde{x} - \tilde{y}\|_2^2 \, d\pi(\tilde{x}, \tilde{y}), \qquad (2)$$

where $\Pi(\mu, \nu)$ is the set of joint probability measures on $\mathbb{R}^n \times \mathbb{R}^n$ with marginals $\mu$ and $\nu$.
The transport plan $\pi(\tilde{x}, \tilde{y})$ is a joint probability measure describing the amount of mass transported from location $\tilde{x}$ to location $\tilde{y}$. Different from Monge's problem, Kantorovich's formulation allows splitting the mass. That is, the mass at one location can be divided and transported to multiple destinations. It can be proved [32] that the square root of the minimal cost of (2) defines a metric on $\Pr(\mathbb{R}^n)$. This metric is the so-called Wasserstein distance, denoted as $d_W(\mu, \nu)$. That is,

$$d_W(\mu, \nu) \triangleq \Big[\inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathbb{R}^n \times \mathbb{R}^n} \|\tilde{x} - \tilde{y}\|_2^2 \, d\pi(\tilde{x}, \tilde{y})\Big]^{\frac{1}{2}}. \qquad (3)$$
2.2.2 OT between Gaussian Measures on $\mathbb{R}^n$
The following theorem provides a lower bound for the Wasserstein distance between arbitrary measures $\mu$ and $\nu$, together with a condition under which the lower bound is achieved. The lower bound is just the Wasserstein distance between Gaussian measures, named the "Gauss-Wasserstein distance" (GaW for short, not to be confused with the Gromov-Wasserstein distance, denoted by GW).
Theorem 1 (See [50]). Let $\mu$ and $\nu$ be two probability measures on $\mathbb{R}^n$ with finite first- and second-order moments. Let $\tilde{m}_\mu$ and $\tilde{m}_\nu$, and $\Sigma_\mu$ and $\Sigma_\nu$, be the corresponding expectations and covariance matrices, respectively. Write

$$d_{GaW}(\mu, \nu) = \Big[\|\tilde{m}_\mu - \tilde{m}_\nu\|_2^2 + \operatorname{tr}(\Sigma_\mu + \Sigma_\nu - 2\Sigma_{\mu\nu})\Big]^{\frac{1}{2}}, \qquad (4)$$

where $\Sigma_{\mu\nu} = (\Sigma_\mu^{\frac{1}{2}} \Sigma_\nu \Sigma_\mu^{\frac{1}{2}})^{\frac{1}{2}}$. Then,
1) $d_{GaW}(\mu, \nu) \le d_W(\mu, \nu)$, and
2) the equality holds if both $\mu$ and $\nu$ are Gaussian.
Remark 1. $(\cdot)^{\frac{1}{2}}$ denotes the principal matrix square root, i.e., for any positive semi-definite (PSD) matrix $\Sigma$, write the eigendecomposition $\Sigma = U\Lambda U^T$; then $\Sigma^{\frac{1}{2}} = U\Lambda^{\frac{1}{2}} U^T$. The function $d_{GaW}$ can be considered as a pseudo-metric on probability measures with finite first- and second-order moments. Based on conclusion (2), we see that if $\mu$ and $\nu$ are Gaussian, then $d_{GaW}(\mu, \nu)$ is just the corresponding Wasserstein distance. Hence, $d_{GaW}$ defines a metric on the set of all Gaussian measures, which are uniquely characterized by their first two order statistics. In the case that $\mu$ and $\nu$ have the same expectation, $d_{GaW}$ introduces a metric on covariance matrices, which is known as the Bures metric [51].
Corollary 1. Let $\mathrm{Sym}^+(n)$ be the set of all positive semi-definite matrices of size $n \times n$. Then,

$$d_B(\Sigma_1, \Sigma_2) = \Big[\operatorname{tr}(\Sigma_1 + \Sigma_2 - 2\Sigma_{12})\Big]^{\frac{1}{2}}, \qquad (5)$$

where $\Sigma_{12} = (\Sigma_1^{\frac{1}{2}} \Sigma_2 \Sigma_1^{\frac{1}{2}})^{\frac{1}{2}}$, defines a metric on $\mathrm{Sym}^+(n)$.
Note that $d_B$ defines a metric on PSD matrices, including rank-deficient ones. This is a rather desirable property in practice, because the dimension of the samples is sometimes larger than their number, which results in rank-deficiency of the estimated covariance matrices. $d_B$ is well-defined on such matrices, without any regularization operations.
Usually, given PSD matrices $\Sigma_1$ and $\Sigma_2$, we have $d_B(\Sigma_1, \Sigma_2) \ne d_B(0, \Sigma_1 - \Sigma_2)$, which implies that the Bures metric exploits the non-Euclidean geometry of $\mathrm{Sym}^+(n)$. Such a geometry is the so-called Wasserstein geometry [52], in which $d_B$ is just the geodesic distance function.
Different from the Wasserstein distance, the optimal transport map between Gaussian measures usually needs to take the rank of the covariance matrices into account. We start from the ideal case where the covariance matrices of both $\mu$ and $\nu$ are of full rank.
Theorem 2 (See [50]). Let $\mu$ and $\nu$ be two Gaussian measures on $\mathbb{R}^n$ whose covariance matrices are of full rank. Let $\tilde{m}_\mu$ and $\tilde{m}_\nu$, and $\Sigma_\mu$ and $\Sigma_\nu$, denote the respective expectations and covariance matrices. Then the optimal transport map $T_G$ between $\mu$ and $\nu$ exists, and can be written as

$$T_G(\tilde{x}) = \Sigma_\mu^{-\frac{1}{2}} \Sigma_{\mu\nu} \Sigma_\mu^{-\frac{1}{2}} (\tilde{x} - \tilde{m}_\mu) + \tilde{m}_\nu. \qquad (6)$$
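Both formulas are straightforward to evaluate numerically. The following is a minimal NumPy/SciPy sketch (our own illustration, not code from the paper) of the Gauss-Wasserstein distance (4) and the affine map (6); it assumes full-rank covariance matrices and uses `scipy.linalg.sqrtm` for the principal matrix square root.

```python
import numpy as np
from scipy.linalg import sqrtm

def gauss_wasserstein(m1, S1, m2, S2):
    """Gauss-Wasserstein distance d_GaW of Eq. (4) between N(m1, S1) and N(m2, S2)."""
    S1_half = sqrtm(S1).real                     # principal matrix square root
    S12 = sqrtm(S1_half @ S2 @ S1_half).real     # Sigma_{mu nu}
    return np.sqrt(np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2.0 * S12))

def gauss_ot_map(m1, S1, m2, S2):
    """Affine optimal transport map T_G of Eq. (6); assumes S1 is full rank."""
    S1_half = sqrtm(S1).real
    S1_half_inv = np.linalg.inv(S1_half)
    S12 = sqrtm(S1_half @ S2 @ S1_half).real
    A = S1_half_inv @ S12 @ S1_half_inv
    return lambda x: (x - m1) @ A.T + m2         # rows of x are samples

# Sanity check: pushing samples of N(m1, S1) through T_G should roughly
# reproduce the target covariance S2.
rng = np.random.default_rng(0)
m1, m2 = np.zeros(3), np.ones(3)
S1, S2 = np.diag([1.0, 2.0, 3.0]), np.diag([2.0, 0.5, 1.0])
X = rng.multivariate_normal(m1, S1, size=5000)
Y = gauss_ot_map(m1, S1, m2, S2)(X)
print(gauss_wasserstein(m1, S1, m2, S2), np.cov(Y.T).round(2))
```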
We can see that in the full-rank case, the most "efficient" map transferring one Gaussian measure to another is affine. However, if the covariance matrix is rank-deficient, which corresponds to the case where the Gaussian measure concentrates on a low-dimensional affine subspace of $\mathbb{R}^n$, the conclusions in the above theorem do not necessarily hold. Even the existence of the optimal transport map cannot be guaranteed. A simple example is that if $\Sigma_\mu$ is rank-deficient but $\Sigma_\nu$ is of full rank, it is impossible to find an affine map to transfer the Gaussian measure $\mu$ to $\nu$. To tackle this issue, we first project the data with distribution $\nu$ onto the range space of $\Sigma_\mu$, where the Gaussian measure $\mu$ concentrates and $\Sigma_\mu$ is regular, and then formulate the OT problem, as described in the next theorem.
Theorem 3. Let $\mu$ and $\nu$ be two Gaussian measures defined on $\mathbb{R}^n$. Let $\bar{\mu}$ and $\bar{\nu}$ be the corresponding centered Gaussian measures, derived from $\mu$ and $\nu$, respectively, by translation. Let $P_\mu$ be the projection matrix onto $\mathrm{Im}(\Sigma_\mu)$. Then the optimal transport map $T_G$ from $\bar{\mu}$ to $P_{\mu\#}\bar{\nu}$ is linear and self-adjoint, and can be written as

$$T_G(\tilde{x}) = (\Sigma_\mu^{\frac{1}{2}})^\dagger \Sigma_{\mu\nu} (\Sigma_\mu^{\frac{1}{2}})^\dagger \tilde{x}. \qquad (7)$$
Remark 2. "$\dagger$" denotes the Moore-Penrose inverse. $\mathrm{Im}(\Sigma)$ denotes the image of the linear transform $\Sigma$, i.e., $\mathrm{Im}(\Sigma) = \{\Sigma\tilde{x} : \tilde{x} \in \mathbb{R}^n\}$.
Generally speaking, different from Theorem 2, the map $T_G$ in (7) in fact transfers $\bar{\mu}$ to $P_{\mu\#}\bar{\nu}$, the projected version of $\bar{\nu}$, instead of to $\bar{\nu}$ itself. In the special case where the Gaussian measures $\mu$ and $\nu$ satisfy $\mathrm{Im}(\Sigma_\nu) \subseteq \mathrm{Im}(\Sigma_\mu)$, $\bar{\nu}$ remains the same under the projection onto $\mathrm{Im}(\Sigma_\mu)$, i.e., $P_{\mu\#}\bar{\nu} = \bar{\nu}$. So $T_G$ in (7) is just the optimal transport map from $\bar{\mu}$ to $\bar{\nu}$. This result, as an extended version of Theorem 2, was also developed in [35].
3 KANTOROVICH'S OT IN RKHS
This section introduces Kantorovich's optimal transport problem in RKHS. In the first part, we provide an equivalent and computable formulation of this problem. In the second part, we discuss the OT optimization problem on empirical distributions.
3.1 The Formulation of OT in RKHS
Let the input space $(X, \mathcal{B}_X)$ be a measurable space with a Borel $\sigma$-algebra $\mathcal{B}_X$, and let $\Pr(X)$ be the set of Borel probability measures on $X$. Let $k$ be a positive definite kernel on $X \times X$, and let $(H_K, \mathcal{B}_{H_K})$ be the reproducing kernel Hilbert space generated by $k$. Let $\phi : X \to H_K$ be the corresponding feature map. For any $\mu \in \Pr(X)$, let $\phi_\#\mu$ be the pushforward measure of $\mu$.
Given $\mu, \nu \in \Pr(X)$, the Kantorovich optimal transport between pushforward measures $\phi_\#\mu$ and $\phi_\#\nu$ on $H_K$ is written as

$$d_W(\phi_\#\mu, \phi_\#\nu) = \Big[\inf_{\pi_K \in \Pi(\phi_\#\mu, \phi_\#\nu)} \int_{H_K \times H_K} \|u - v\|_{H_K}^2 \, d\pi_K(u, v)\Big]^{\frac{1}{2}}, \qquad (8)$$

where $\Pi(\phi_\#\mu, \phi_\#\nu)$ is the set of joint probability measures on $H_K \times H_K$ with marginals $\phi_\#\mu$ and $\phi_\#\nu$.
Eq. (8) is a natural analogy of (3). However, (8) is formulated through an implicit nonlinear map, whose expression we usually cannot access, making it difficult to use directly. We next provide an equivalent and computable formulation, the form of which is fully determined by the kernel function.
Theorem 4. Let $(X, \mathcal{B}_X)$ be a Borel space, and let the reproducing kernel $k$ be measurable. Given $\mu, \nu \in \Pr(X)$, we write

$$d_{HW}(\mu, \nu) = \Big[\inf_{\pi \in \Pi(\mu, \nu)} \int_{X \times X} d^2(x, y) \, d\pi(x, y)\Big]^{\frac{1}{2}}, \qquad (9)$$

where $d^2(x, y) = \|\phi(x) - \phi(y)\|_{H_K}^2 = k(x, x) + k(y, y) - 2k(x, y)$. Then,
1) $d_{HW}(\mu, \nu) = d_W(\phi_\#\mu, \phi_\#\nu)$, and
2) if $\pi^*$ is a minimizer of (9), then $(\phi, \phi)_\#\pi^*$ is a minimizer of (8), where $(\phi, \phi) : X \times X \to H_K \times H_K$ is defined as $(\phi, \phi)(x, y) = (\phi(x), \phi(y))$.
If the feature map $\phi$ is injective, the equivalence between (9) and (8) can be easily justified by applying the measure transform formula twice. In addition, with injective $\phi$, $d(x, y)$ is a distance function on $X$. Consequently, $d_{HW}(\mu, \nu)$ defines a metric on $\Pr(X)$. However, in many cases, feature maps are not injective. For example, consider the kernel $k(\tilde{x}, \tilde{y}) = \exp\big(-\frac{\|A\tilde{x} - A\tilde{y}\|_2^2}{2\sigma^2}\big)$ with $\ker(A) \ne \{0\}$, which is quite common in the setting of Mahalanobis metric learning. The corresponding feature map $\phi$ is non-injective, since for any $\tilde{x}$, $\tilde{y}$ satisfying $\tilde{x} - \tilde{y} \in \ker(A)$, we have $\|\phi(\tilde{x}) - \phi(\tilde{y})\|_{H_K}^2 = 0$. In Theorem 4, we in fact present a more general conclusion, only requiring the feature map to be measurable. The central idea behind Theorem 4 is applying the "transformation-invariant property of minimal metrics" [33]. We provide the detailed proof in the supplementary material.
3.2 Discrete Optimal Transport
In most applications, we have access only to the empirical measures or histograms, $\hat{\mu} = \sum_{i=1}^n \hat{\mu}_i \delta_{x_i}$ and $\hat{\nu} = \sum_{j=1}^m \hat{\nu}_j \delta_{y_j}$, where $\delta_{x_i}$ (or $\delta_{y_j}$) is the Dirac measure centered at $x_i$ (or $y_j$), and $\hat{\mu}_i$ (or $\hat{\nu}_j$) is the probability mass associated with $x_i$ (or $y_j$). The discrete version of (9) can be written as

$$\min_{P \in U_{nm}} \operatorname{tr}(P^T D), \qquad (10)$$

where $U_{nm}$ denotes the set of $n \times m$ nonnegative matrices representing the probabilistic couplings whose marginals are $\hat{\mu}$ and $\hat{\nu}$, i.e., $U_{nm} = \{P \in \mathbb{R}^{n \times m}_+ \mid P\mathbf{1}_m = \hat{\boldsymbol{\mu}},\ P^T\mathbf{1}_n = \hat{\boldsymbol{\nu}}\}$, and $D$ denotes the $n \times m$ cost matrix, with $D_{i,j} = k(x_i, x_i) + k(y_j, y_j) - 2k(x_i, y_j)$.
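For illustration, here is a small sketch (ours, not the authors' implementation) of solving the discrete problem (10); it assumes an RBF kernel and the POT library's exact solver `ot.emd2` (an entropic solver such as `ot.sinkhorn2` could be substituted).

```python
import numpy as np
import ot  # POT: Python Optimal Transport (pip install pot)

def rbf_kernel(X, Y, sigma=2.0):
    """Gaussian RBF kernel matrix k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq / (2.0 * sigma**2))

def kernel_wasserstein(X, Y, sigma=2.0, mu=None, nu=None):
    """Empirical KW distance: discrete OT (10) with the kernel-induced cost
    D_ij = k(x_i, x_i) + k(y_j, y_j) - 2 k(x_i, y_j)."""
    n, m = len(X), len(Y)
    mu = np.full(n, 1.0 / n) if mu is None else mu
    nu = np.full(m, 1.0 / m) if nu is None else nu
    Kxx, Kyy, Kxy = rbf_kernel(X, X, sigma), rbf_kernel(Y, Y, sigma), rbf_kernel(X, Y, sigma)
    D = np.diag(Kxx)[:, None] + np.diag(Kyy)[None, :] - 2.0 * Kxy
    cost = ot.emd2(mu, nu, D)          # exact linear-program solution of (10)
    return np.sqrt(max(cost, 0.0))

X = np.random.randn(100, 2) + 1.0
Y = np.random.randn(100, 2) - 1.0
print(kernel_wasserstein(X, Y))
```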
4 OT BETWEEN GAUSSIAN MEASURES ON RKHS
In this section, we provide the mathematical computations of OT under the condition that the pushforward measures on RKHS are Gaussian.
Let $\mu$ be a Borel probability measure on $X$. We assume that the mean, $m_\mu = \mathbb{E}_{X \sim \mu}(\phi(X))$, and the covariance
operator, $R_\mu = \mathbb{E}_{X \sim \mu}\big((\phi(X) - m_\mu) \otimes (\phi(X) - m_\mu)\big)$,² exist and are bounded with respect to the Hilbert norm and the Hilbert-Schmidt norm (see [53]), respectively. We note that $m_\mu$ is an element in $H_K$, and $R_\mu$ is a self-adjoint, nonnegative operator on $H_K$, belonging to the tensor product space $H_K \otimes H_K$. If the data distributions in RKHS (the corresponding pushforward measures) are Gaussian, the conclusions in RKHS are similar to the ones in Euclidean spaces.
Proposition 1. Assume that the hypotheses in Theorem 4 hold. Let $\mu, \nu \in \Pr(X)$. Let $m_\mu$ and $m_\nu$, and $R_\mu$ and $R_\nu$, be the corresponding means and covariance operators, respectively. Write

$$d_{HGW}(\mu, \nu) = \Big[\|m_\mu - m_\nu\|_{H_K}^2 + \operatorname{tr}(R_\mu + R_\nu - 2R_{\mu\nu})\Big]^{\frac{1}{2}}, \qquad (11)$$

where $R_{\mu\nu} = (R_\mu^{\frac{1}{2}} R_\nu R_\mu^{\frac{1}{2}})^{\frac{1}{2}}$. Then,
(1) $d_{HGW}(\mu, \nu) \le d_{HW}(\mu, \nu)$, and
(2) the equality holds if both $\phi_\#\mu$ and $\phi_\#\nu$ are Gaussian.
Remark 3.
(1) The square root of a nonnegative, self-adjoint, and compact operator $G$ is defined as $G^{\frac{1}{2}} = \sum_{i} \sqrt{\lambda_i(G)}\, \varphi_i(G) \otimes \varphi_i(G)$, where $\lambda_i(G)$ and $\varphi_i(G)$ are the eigenvalues and eigenfunctions of $G$.
(2) The trace of a trace-class operator $G$ on a separable Hilbert space $H$ is defined as $\operatorname{tr}(G) = \sum_{i=1}^{\dim(H)} \langle G e_i, e_i\rangle$, where $\{e_i\}_{i=1}^{\dim(H)}$ is an orthonormal system of $H$.
It can be seen that KGW serves as a lower bound for KW, which reveals the connection between the general and Gaussian cases of the Wasserstein distance in RKHS. Analogous to Corollary 1, we generalize the Wasserstein geometry assigned on PSD matrices to infinite-dimensional settings, and obtain the kernel Bures distance, $d_{HB}$, between RKHS covariance operators.
Corollary 2. Let $\mathrm{Sym}^+(H_K) \subset H_K \otimes H_K$ be the set of nonnegative, self-adjoint, and trace-class operators on $H_K$. Then

$$d_{HB}(R_1, R_2) = \Big[\operatorname{tr}(R_1 + R_2 - 2R_{12})\Big]^{\frac{1}{2}}, \qquad (12)$$

where $R_{12} = (R_1^{\frac{1}{2}} R_2 R_1^{\frac{1}{2}})^{\frac{1}{2}}$, defines a metric on $\mathrm{Sym}^+(H_K)$.
The kernel Gauss-Wasserstein distance, $d_{HGW}$, consists of two terms. The first term is just the squared maximum mean discrepancy (MMD) [54], i.e., $\mathrm{MMD}^2(\mu, \nu) = \|m_\mu - m_\nu\|_{H_K}^2$, measuring the distance between the centers of the data in RKHS. The second term, $d_{HB}$, quantifies the difference between the dispersions of the data in RKHS.
If the kernel $k$ is characteristic [54], KGW actually induces a metric on $\Pr(X)$, which can be concluded from the perspective of kernel embeddings of distributions. Because $k$ is characteristic [54], the kernel mean embedding of any $\mu \in \Pr(X)$, i.e., $\mu \mapsto m_\mu \in H_K$, is injective, which leads to the injectiveness of the embedding $\mu \mapsto (m_\mu, R_\mu) \in H_K \times \mathrm{Sym}^+(H_K)$. Since KGW is a metric on $H_K \times \mathrm{Sym}^+(H_K)$, KGW induces a metric on $\Pr(X)$. In the next part, we explore the informativeness of covariance operators, and discuss how $d_{HB}$ quantifies the discrepancy between distributions. To do this, we first introduce the 3-splitting property of measures.
Definition 1. Let $\mu \in \Pr(X)$. If there exist disjoint subsets $V_1$, $V_2$, and $V_3$, satisfying $X = V_1 \cup V_2 \cup V_3$ and $\mu(V_1), \mu(V_2), \mu(V_3) > 0$, then we say $\mu$ satisfies the 3-splitting property.
Note that the 3-splitting property is rather mild, in the sense that it precludes only measures concentrating on one or two singletons, i.e., $\mu = \epsilon\delta_x + (1 - \epsilon)\delta_y$, $\epsilon \in [0, 1]$. Let $\Pr_s(X)$ be the set of Borel measures satisfying the 3-splitting property. The following theorem presents the injectiveness of the mapping from $\Pr_s(X)$ to $\mathrm{Sym}^+(H_K)$. Consequently, combining this with the fact that KB is a metric on $\mathrm{Sym}^+(H_K)$, we conclude that KB induces a metric on $\Pr_s(X)$.
Theorem 5. Let the measurable space $(X, \mathcal{B}_X)$ be locally compact and Hausdorff. Let $k$ be a $c_0$-universal reproducing kernel.³ Then, the embedding $\mu \mapsto R_\mu$, $\forall \mu \in \Pr_s(X)$, is injective.
To demonstrate why the 3-splitting property is required, we provide a counterexample. For $\epsilon \in [0, 1]$, let $\mu = \epsilon\delta_x + (1 - \epsilon)\delta_y$ and $\nu = (1 - \epsilon)\delta_x + \epsilon\delta_y$. Clearly, neither $\mu$ nor $\nu$ satisfies the 3-splitting property, and the corresponding RKHS covariance operators are the same, i.e., $R_\mu = R_\nu = \epsilon(1 - \epsilon)\big(\phi(x) - \phi(y)\big) \otimes \big(\phi(x) - \phi(y)\big)$. Thus, in this case, $d_{HB}$ cannot distinguish $\mu$ and $\nu$. We also note that if $X$ is $\mathbb{R}^n$, many popular kernels, such as the Gaussian, Laplacian, and B1-spline kernels, are $c_0$-universal [55].
As for the optimal transport map, we consider the rank-deficient case, since the ranks of the estimated covariance operators are always finite. The conclusions are quite similar to those of Theorem 3. The only difference is that we are working on the pushforward measures on RKHS.
Proposition 2. Given $\mu, \nu \in \Pr(X)$, assume the pushforward measures $\phi_\#\mu$ and $\phi_\#\nu$ on RKHS are Gaussian. Let $\bar{\mu}_\phi$ and $\bar{\nu}_\phi$ be the respective centered measures of $\phi_\#\mu$ and $\phi_\#\nu$. Let $P_\mu$ be the projection operator onto $\mathrm{Im}(R_\mu)$. Then the kernel Gauss-optimal transport map $T^H_G$ between $\bar{\mu}_\phi$ and $P_{\mu\#}(\bar{\nu}_\phi)$ is a linear and self-adjoint operator, and can be written as

$$T^H_G(u) = (R_\mu^{\frac{1}{2}})^\dagger R_{\mu\nu} (R_\mu^{\frac{1}{2}})^\dagger u, \quad \forall u \in H_K. \qquad (13)$$
For almost all kernel methods, the core task is transferring the expressions involving implicit feature maps into kernel-based expressions. After doing this, one can carry out computations using the "kernel trick". In the next two subsections, we provide explicit expressions of the estimated KGW distance (11) and KGOT map (13) via kernel matrices, which are two of the main contributions of this paper.
4.1 The Empirical Estimation of the KGW Distance
Let $X = [x_1, x_2, \ldots, x_n]$ and $Y = [y_1, y_2, \ldots, y_m]$ be two sample matrices from two probability measures $\mu$ and $\nu$,
2. The tensor product of Hilbert spaces $H \otimes H$ is isomorphic to the space of Hilbert-Schmidt operators, and is defined such that $(u \otimes v)w = \langle v, w\rangle_H u$, $\forall u, v, w \in H$.
3. We refer to [55] or the supplementary material, available online, for the definition of the $c_0$-universal kernel.
respectively. Let $\Phi_X = [\phi(x_1), \phi(x_2), \ldots, \phi(x_n)]$ and $\Phi_Y = [\phi(y_1), \phi(y_2), \ldots, \phi(y_m)]$ be the two corresponding mapped data matrices. Let $K_{XX}$, $K_{XY}$, and $K_{YY}$ be the kernel matrices defined by $(K_{XX})_{ij} = k(x_i, x_j)$, $(K_{XY})_{ij} = k(x_i, y_j)$, and $(K_{YY})_{ij} = k(y_i, y_j)$. Let $H_n = I_{n \times n} - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^T$ and $H_m = I_{m \times m} - \frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T$ be two centering matrices. Then the empirical means, $\hat{m}_\mu$ and $\hat{m}_\nu$, are estimated as $\hat{m}_\mu = \frac{1}{n}\Phi_X\mathbf{1}_n$ and $\hat{m}_\nu = \frac{1}{m}\Phi_Y\mathbf{1}_m$. The empirical covariance operators, $\hat{R}_\mu$ and $\hat{R}_\nu$, are estimated as $\hat{R}_\mu = \frac{1}{n}\Phi_X H_n \Phi_X^T$ and $\hat{R}_\nu = \frac{1}{m}\Phi_Y H_m \Phi_Y^T$.
Proposition 3. The empirical kernel Gauss-Wasserstein distance is

$$\hat{d}_{HGW}(\mu, \nu) = \Big[\tfrac{1}{n}\operatorname{tr}(K_{XX}) + \tfrac{1}{m}\operatorname{tr}(K_{YY}) - \tfrac{2}{mn}\mathbf{1}_n^T K_{XY}\mathbf{1}_m - \tfrac{2}{\sqrt{mn}}\|H_n K_{XY} H_m\|_*\Big]^{\frac{1}{2}}. \qquad (14)$$

The kernel Bures distance between $\hat{R}_\mu$ and $\hat{R}_\nu$ is

$$d_{HB}(\hat{R}_\mu, \hat{R}_\nu) = \Big[\tfrac{1}{n}\operatorname{tr}(K_{XX}H_n) + \tfrac{1}{m}\operatorname{tr}(K_{YY}H_m) - \tfrac{2}{\sqrt{mn}}\|H_n K_{XY} H_m\|_*\Big]^{\frac{1}{2}}. \qquad (15)$$
Remark 4. $\|\cdot\|_*$ denotes the nuclear norm, i.e., $\|A\|_* = \sum_{i=1}^r \sigma_i(A)$, where $\sigma_i(A)$ are the singular values of the matrix $A$.
Computational Complexity. For convenience, we assume the sample sizes are the same, i.e., $m = n$. It takes $O(n^2)$ operations to compute the first three terms of $\hat{d}_{HGW}$. If we write $\operatorname{tr}(K_{XX}H_n) = \operatorname{tr}(K_{XX}) - \frac{1}{n}\mathbf{1}_n^T K_{XX}\mathbf{1}_n$ (similarly for $K_{YY}$), it takes $O(n^2)$ operations to compute the first two terms of $\hat{d}_{HB}$. Now consider the last term, $\|H_n K_{XY} H_n\|_*$, in both $\hat{d}_{HGW}$ and $\hat{d}_{HB}$. To avoid large-scale matrix multiplications, we write $H_n K_{XY} H_n = K_{XY} + \frac{1}{n^2}(\mathbf{1}_n^T K_{XY}\mathbf{1}_n)\mathbf{1}_n\mathbf{1}_n^T - \frac{1}{n}\mathbf{1}_n(\mathbf{1}_n^T K_{XY}) - \frac{1}{n}(K_{XY}\mathbf{1}_n)\mathbf{1}_n^T$, whose complexity is $O(n^2)$. Moreover, the nuclear norm requires $O(n^3)$ operations.
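As a concrete illustration of Eqs. (14) and (15), the following NumPy sketch (our own, assuming an RBF kernel; not code released with the paper) computes both quantities directly from the three kernel matrices, evaluating the nuclear norm through the singular values of $H_n K_{XY} H_m$.

```python
import numpy as np

def rbf_kernel(X, Y, sigma=2.0):
    """RBF kernel matrix; rows of X and Y are samples."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq / (2.0 * sigma**2))

def kgw_and_kb(X, Y, sigma=2.0):
    """Empirical KGW distance, Eq. (14), and kernel Bures distance, Eq. (15)."""
    n, m = len(X), len(Y)
    Kxx, Kyy, Kxy = rbf_kernel(X, X, sigma), rbf_kernel(Y, Y, sigma), rbf_kernel(X, Y, sigma)
    Hn = np.eye(n) - np.ones((n, n)) / n          # centering matrices
    Hm = np.eye(m) - np.ones((m, m)) / m
    nuc = np.sum(np.linalg.svd(Hn @ Kxy @ Hm, compute_uv=False))   # nuclear norm
    mmd2 = Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()              # squared MMD (mean term)
    kb2 = (np.trace(Kxx @ Hn) / n + np.trace(Kyy @ Hm) / m
           - 2.0 * nuc / np.sqrt(n * m))                           # squared KB (dispersion term)
    return np.sqrt(max(mmd2 + kb2, 0.0)), np.sqrt(max(kb2, 0.0))

X = np.random.randn(200, 3)
Y = np.random.randn(150, 3) + 0.5
print(kgw_and_kb(X, Y))
```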
4.2 The Empirical Estimation of the KGOT Map
Proposition 4. Let $X$ and $Y$ be data matrices sampled from $\mu$ and $\nu$, respectively. Then the empirical projection operator onto $\mathrm{Im}(\hat{R}_\mu)$ is

$$\hat{P}_\mu = \Phi_X H_n C_{XX}^\dagger H_n \Phi_X^T, \qquad (16)$$

and the empirical Gauss-optimal transport map from $\bar{\mu}_\phi$ to $P_{\mu\#}(\bar{\nu}_\phi)$ is

$$\hat{T}^H_G = \sqrt{\tfrac{n}{m}}\, \Phi_X H_n C_{XX}^\dagger C_{XYYX}^{\frac{1}{2}} C_{XX}^\dagger H_n \Phi_X^T, \qquad (17)$$

where

$$C_{XX} = H_n K_{XX} H_n, \qquad (18a)$$
$$C_{XYYX} = H_n K_{XY} H_m K_{YX} H_n. \qquad (18b)$$
Both (16) and (17) are computable expressions. That is, given any element $u \in H_K$, we can directly apply our formulations to obtain the corresponding images $\hat{P}_\mu(u)$ and $\hat{T}^H_G(u)$. Moreover, we emphasize that the estimated KGOT map plays the role of aligning RKHS covariance operators, as summarized in the next proposition.
Proposition 5.

$$\hat{T}^H_G \hat{R}_\mu \hat{T}^H_G = \hat{P}_\mu \hat{R}_\nu \hat{P}_\mu. \qquad (19)$$
Eq. (19) can be interpreted in the following way. First, data sampled from $\bar{\nu}_\phi$ are projected onto $\mathrm{Im}(\hat{R}_\mu)$, the image of $\hat{R}_\mu$. The resultant covariance operator is $\hat{P}_\mu \hat{R}_\nu \hat{P}_\mu$. Next, data sampled from $\bar{\mu}_\phi$, which already concentrate on $\mathrm{Im}(\hat{R}_\mu)$, are "transported" by the KGOT map $\hat{T}^H_G$. The corresponding covariance operator becomes $\hat{T}^H_G \hat{R}_\mu \hat{T}^H_G$. By doing this, the covariance operators are aligned, which leads to similar distributions of these two transformed datasets in RKHS.
Regularization. Since the smallest eigenvalues of the kernel matrix $C_{XX}$ are usually close to zero, the Moore-Penrose inverse $C_{XX}^\dagger$ is ill-conditioned. There are several methods that can be used to deal with this issue. (1) One can invert only the top $d$ eigenvalues of $C_{XX}$, and set the other eigenvalues to zero. However, the drawback is that it is usually difficult to select such a cutoff. (2) One can use a regularized version of $C_{XX}$. That is, we can use $(C_{XX} + \epsilon I)^{-1}$ to replace $C_{XX}^\dagger$. This is an ad hoc strategy that is common in practice because the regularizer $\epsilon$ is easy to select. However, this method destroys the low-rank structure of $C_{XX}^\dagger$. (3) One can also use $(C_{XX}^2 + \epsilon I)^{-1} C_{XX}$ [56] to approximate $C_{XX}^\dagger$, based on the fact that $\lim_{\epsilon \to 0}(C_{XX}^2 + \epsilon I)^{-1} C_{XX} = C_{XX}^\dagger$. In our experiments, we use this strategy, since it not only is efficient to implement, but also preserves $C_{XX}^\dagger$'s low-rank structure.
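As an illustration of strategy (3) and of how the coefficient matrix of Eq. (17) might be assembled from kernel matrices, here is a hedged NumPy/SciPy sketch (the helper names `reg_pinv` and `kgot_coefficients` and the default $\epsilon$ are our own choices, not from the paper).

```python
import numpy as np
from scipy.linalg import sqrtm

def reg_pinv(C, eps=1e-6):
    """Approximate the Moore-Penrose inverse of a symmetric PSD matrix C by
    (C^2 + eps*I)^{-1} C, which converges to C^+ as eps -> 0 (strategy (3))."""
    return np.linalg.solve(C @ C + eps * np.eye(C.shape[0]), C)

def kgot_coefficients(Kxx, Kxy, Kyy, eps=1e-6):
    """Coefficient matrix B such that the estimated KGOT map of Eq. (17) acts
    as T(u) = Phi_X H_n B H_n Phi_X^T u, built only from kernel matrices."""
    n, m = Kxx.shape[0], Kyy.shape[0]
    Hn = np.eye(n) - np.ones((n, n)) / n
    Hm = np.eye(m) - np.ones((m, m)) / m
    Cxx = Hn @ Kxx @ Hn                       # Eq. (18a)
    Cxyyx = Hn @ Kxy @ Hm @ Kxy.T @ Hn        # Eq. (18b), using K_YX = K_XY^T
    Cxx_pinv = reg_pinv(Cxx, eps)
    Cxyyx_half = sqrtm(Cxyyx).real            # principal square root
    return np.sqrt(n / m) * Cxx_pinv @ Cxyyx_half @ Cxx_pinv
```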
5 APPLICATIONS
In this section, we apply the developed formulas (14), (15), and (17) to image classification and domain adaptation, respectively.
5.1 Image Classification
5.1.1 Proposed Approach
Each image is represented by a collection of (pixel-wise) feature samples, which can be low-level features or learned features extracted from deep neural networks. We apply the KGW (or the KB) distance to solve the core problem of measuring the difference between image representations. In other words, the distance between two images is the kernel Gauss-Wasserstein (or the kernel Bures) distance between the two corresponding feature collections. After obtaining the distances between all pairs of images, we employ a kernel SVM as the final classifier. Our approach is schematized in Fig. 2. Note that the above procedures make up a two-layer kernel machine. The first-layer kernel, $K_1$, is used to compute the KGW (or the KB) distance, while the second-layer kernel, $K_2$, is for the kernel SVM.
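A minimal sketch of this two-layer pipeline is given below (our own illustration; `pairwise_distances`, `train_two_layer_svm`, and the use of scikit-learn's precomputed-kernel SVC are assumptions, and `dist_fn` stands in for a level-1 distance such as the KGW sketch from Section 4.1).

```python
import numpy as np
from sklearn.svm import SVC

def pairwise_distances(images, dist_fn):
    """Level-1 stage: distance between every pair of per-image feature
    collections; `dist_fn(A_i, A_j)` returns e.g. the KGW or KB distance."""
    N = len(images)
    D = np.zeros((N, N))
    for i in range(N):
        for j in range(i + 1, N):
            D[i, j] = D[j, i] = dist_fn(images[i], images[j])
    return D

def train_two_layer_svm(images, labels, dist_fn, sigma2, C=10.0, gamma_reg=1e-4):
    """Level-2 stage: RBF kernel K2 = exp(-d^2 / sigma2^2) on the level-1
    distances, plus a small diagonal ridge since K2 may be indefinite,
    followed by a precomputed-kernel SVM."""
    D = pairwise_distances(images, dist_fn)
    K2 = np.exp(-D**2 / sigma2**2) + gamma_reg * np.eye(len(images))
    clf = SVC(C=C, kernel="precomputed").fit(K2, labels)
    return clf, D
```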
5.2 Domain Adaptation
5.2.1 Problem Formulation
A domain adaptation task involves two data domains: the source domain and the target domain. The source domain is composed of labeled data $\{X^s, l^s\} = \{(x^s_i, l^s_i)\}_{i=1}^{N_s}$, which can
be used to train a reliable classifier. The target domain refers to the unlabeled data $Y^t = \{y^t_j\}_{j=1}^{N_t}$, whose statistical and geometrical characteristics are different. Domain adaptation aims to adapt the classifier trained on the source domain to the target domain.
5.2.2 Proposed Approach
Central Idea. We employ the OT map to transport the RKHS data in the source domain to the target domain, and then we train a classifier based on the transported data. We adopt the Gaussianity assumption in RKHS. Hence, the problem of matching distributions in RKHS can be solved by aligning the corresponding covariance operators. Because of the rank-deficiency issue, we first project the target data onto $\mathrm{Im}(\hat{R}_s)$, and then apply the estimated KGOT map (17). The procedure is schematized in Fig. 3.
Preprocessing with PCA. The source and target data are acquired in different scenarios, which probably results in geometrical distortions, especially for visual datasets [44]. To alleviate this issue, we first apply Principal Component Analysis (PCA) to the raw data to construct consistent feature representations. That is, we concatenate the source and target samples to form a large data matrix, from which we obtain the joint principal components. We use the scores of the data points on these principal components as the new representations. Note that many state-of-the-art algorithms for visual datasets, like domain invariant projection [44] and subspace alignment [57], accept PCA as a preprocessing procedure. And some algorithms, like transfer subspace learning [58] and joint distribution alignment [43], are formulated by solving optimizations that are motivated by PCA and its variants. We emphasize that with the PCA preprocessing procedure, the subspace mismatch issues might be reduced, since the joint principal subspaces involve the geometrical information of both the source and target samples. However, the statistical distribution difference may still be large. We solve the distribution mismatch problem in RKHS with the KGOT map.
Technical Details.⁴ Let $\Phi^s_X$ and $\Phi^t_Y$ denote the source and target samples in RKHS, respectively. Then $\Phi^s_X H_{N_s}$ and $\Phi^t_Y H_{N_t}$ are the corresponding centered samples. After being projected onto $\mathrm{Im}(\hat{R}_s)$, the projection of the target data is

$$\hat{P}_s(\Phi^t_Y H_{N_t}) = \Phi^s_X H_{N_s} C_{XX}^\dagger C_{XY}. \qquad (20)$$

After being transported to the target domain, the source data becomes

$$\hat{T}^H_G(\Phi^s_X H_{N_s}) = \sqrt{\tfrac{N_s}{N_t}}\, \Phi^s_X H_{N_s} C_{XX}^\dagger C_{XYYX}^{\frac{1}{2}} C_{XX}^\dagger C_{XX}. \qquad (21)$$

Then, the inner product matrix between the projected target samples and the transported source samples is

$$\mathrm{Inn}_{ts} = \big(\hat{P}_s(\Phi^t_Y H_{N_t})\big)^T \big(\hat{T}^H_G(\Phi^s_X H_{N_s})\big) = \sqrt{\tfrac{N_s}{N_t}}\, C_{YX} C_{XX}^\dagger C_{XYYX}^{\frac{1}{2}} C_{XX}^\dagger C_{XX}. \qquad (22)$$

Similarly,

$$\mathrm{Inn}_{ss} = \tfrac{N_s}{N_t}\, C_{XYYX}^{\frac{1}{2}} C_{XX}^\dagger C_{XYYX}^{\frac{1}{2}}, \qquad (23a)$$
$$\mathrm{Inn}_{tt} = C_{YX} C_{XX}^\dagger C_{XY}. \qquad (23b)$$

So, after distribution matching, we obtain a domain-invariant kernel matrix, $K_{New}$, and a distance matrix, $\mathrm{Dist}_{ts}$, i.e.,

$$K_{New} = \begin{bmatrix} \mathrm{Inn}_{ss} & (\mathrm{Inn}_{ts})^T \\ \mathrm{Inn}_{ts} & \mathrm{Inn}_{tt} \end{bmatrix}, \qquad (24)$$

$$\mathrm{Dist}_{ts} = \mathbf{1}_{N_t}\big(\operatorname{diag}(\mathrm{Inn}_{ss})\big)^T + \big(\operatorname{diag}(\mathrm{Inn}_{tt})\big)\mathbf{1}_{N_s}^T - 2\,\mathrm{Inn}_{ts}. \qquad (25)$$
Domain-Invariant Kernel Machines. After nonlinear correlation alignment, the new kernel matrix (24) can be used in any kernel-based learning algorithm. For example, in kernel ridge regression, the predicted labels for the target dataset $Y^t$ are

$$\tilde{l}_Y = (\mathrm{Inn}_{ts})(\mathrm{Inn}_{ss} + \gamma I_{N_s})^{-1}\tilde{l}_X. \qquad (26)$$
Fig. 2. We first represent each image $I_i$ by a collection of feature samples $A_i$. Next, we compute the KGW (or the KB) distances between all pairs of images. Finally, we apply kernel SVM to conduct classification.
Fig. 3. (a) The labeled dataset, $X^s$, in the source domain and the unlabeled dataset, $Y^t$, in the target domain. Dots and stars represent different classes. (b) Map $X^s$ and $Y^t$ to the RKHS $H_K$, and center the mapped data. (The centered source dataset $\Phi^s_X H_{N_s}$ lies in $\mathrm{Im}(\hat{R}_s)$.) (c) Project the target dataset $\Phi^t_Y H_{N_t}$ onto $\mathrm{Im}(\hat{R}_s)$. The projection is $\hat{P}_s(\Phi^t_Y H_{N_t})$. (d) Apply the KGOT map to transport the source data to the target domain. The transported data is $\hat{T}^H_G(\Phi^s_X H_{N_s})$. Now $\hat{T}^H_G(\Phi^s_X H_{N_s})$ and $\hat{P}_s(\Phi^t_Y H_{N_t})$ are similarly distributed on $\mathrm{Im}(\hat{R}_s) \subset H_K$. Finally, train a classifier using $\hat{T}^H_G(\Phi^s_X H_{N_s})$, and apply the resultant classifier to $\hat{P}_s(\Phi^t_Y H_{N_t})$.
4. We provide the detailed derivation of the mathematical results (20), (21), (22), and (23) in the Supplementary Material, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2019.2903050.
For the kernel support vector machine, after training a classifier on the source partition $(\mathrm{Inn}_{ss}, \tilde{l}_X)$, we can predict the labels of the target by

$$\tilde{l}_Y = (\mathrm{Inn}_{ts})(\tilde{\alpha} \circ \tilde{l}_X) + \tilde{b}, \qquad (27)$$

where $\tilde{\alpha}$ is the Lagrangian multiplier, $\circ$ is the Hadamard product, and $\tilde{b}$ is the bias.
With (24) or (25), the K-nearest neighbors classifier in RKHS can also be constructed. There are two ways to quantify the affinity between points in RKHS: the inner product and the distance. That is, given any target data point $y^t_j$, we can identify its nearest neighbors by finding the maximal values in the $j$th row of $\mathrm{Inn}_{ts}$ or the minimal values in the $j$th row of $\mathrm{Dist}_{ts}$. In practice, given different datasets and kernel functions, the best choice of affinity characterization method is also different. We view the choice as a "hyperparameter" and use a cross-validation strategy to choose $\mathrm{Inn}_{ts}$ or $\mathrm{Dist}_{ts}$.
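Putting Eqs. (20) through (26) together, the following NumPy/SciPy sketch (our own reading of the formulas; the function names, the RBF kernel, and the regularized pseudo-inverse from Section 4.2 are assumptions rather than released code) builds the aligned inner-product matrices, the domain-invariant kernel matrix, the cross-domain distance matrix, and the kernel ridge predictor.

```python
import numpy as np
from scipy.linalg import sqrtm

def rbf_kernel(X, Y, sigma):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq / (2.0 * sigma**2))

def kgot_align(Xs, Yt, sigma, eps=1e-6):
    """Return Inn_ss, Inn_ts, Inn_tt, K_New, Dist_ts of Eqs. (22)-(25)."""
    Ns, Nt = len(Xs), len(Yt)
    Hs = np.eye(Ns) - np.ones((Ns, Ns)) / Ns
    Ht = np.eye(Nt) - np.ones((Nt, Nt)) / Nt
    Kxx, Kxy = rbf_kernel(Xs, Xs, sigma), rbf_kernel(Xs, Yt, sigma)
    Cxy = Hs @ Kxy @ Ht                          # centered cross kernel C_XY
    Cxx = Hs @ Kxx @ Hs                          # Eq. (18a)
    Cxyyx = Cxy @ Cxy.T                          # Eq. (18b)
    Cxx_pinv = np.linalg.solve(Cxx @ Cxx + eps * np.eye(Ns), Cxx)   # ~ C_XX^+
    S = sqrtm(Cxyyx).real
    Inn_ts = np.sqrt(Ns / Nt) * Cxy.T @ Cxx_pinv @ S @ Cxx_pinv @ Cxx   # Eq. (22)
    Inn_ss = (Ns / Nt) * S @ Cxx_pinv @ S                               # Eq. (23a)
    Inn_tt = Cxy.T @ Cxx_pinv @ Cxy                                     # Eq. (23b)
    K_new = np.block([[Inn_ss, Inn_ts.T], [Inn_ts, Inn_tt]])            # Eq. (24)
    Dist_ts = np.diag(Inn_ss)[None, :] + np.diag(Inn_tt)[:, None] - 2.0 * Inn_ts  # Eq. (25)
    return Inn_ss, Inn_ts, Inn_tt, K_new, Dist_ts

def kernel_ridge_predict(Inn_ss, Inn_ts, l_source, gamma=0.1):
    """Kernel ridge regression on the aligned source data, Eq. (26)."""
    return Inn_ts @ np.linalg.solve(Inn_ss + gamma * np.eye(Inn_ss.shape[0]), l_source)
```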
6 EXPERIMENTS
This section is divided into three parts. The experiments in the first part numerically demonstrate and validate the mathematical results developed in the paper. In the second part, we study the behavior of the kernel Gauss-Wasserstein distance and the kernel Bures distance on virus, texture, material, and scene classification. In the third part, we evaluate our approach for domain adaptation on three benchmark datasets in the context of object recognition and document classification.
6.1 Toy Examples
This section includes two experiments with simulated data. In the first experiment, we numerically demonstrate our claim that the KGW distance is a lower bound of the KW distance (see Proposition 1). In the second experiment, we demonstrate that the KGOT map can match the data distributions in RKHS. In both experiments, we use the RBF kernel $k(\tilde{x}, \tilde{y}) = \exp\big(-\frac{\|\tilde{x} - \tilde{y}\|_2^2}{2\sigma^2}\big)$, and choose $\sigma = 2$.
6.1.1 Synthetic Data I
We consider two classes of Gaussian distributions $\mu(m) = N(m\mathbf{1}, I)$ and $\nu(m) = N(-m\mathbf{1}, I)$ on $\mathbb{R}^2$, parameterized by a real number $m$ taking values in $\{0.1, 0.2, \ldots, 3\}$. For each $m$, we draw 100 independent samples from $\mu(m)$, denoted as $X(m) = [\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_{100}](m)$, and 100 independent samples from $\nu(m)$, denoted as $Y(m) = [\tilde{y}_1, \tilde{y}_2, \ldots, \tilde{y}_{100}](m)$. We use expression (14) to compute the empirical KGW distance, and use (10) to compute the empirical KW distance. For each $m$, the results are averaged over 50 repetitions. The results are shown in Fig. 4. Clearly, KGW is less than KW.
6.1.2 Synthetic Data II
In this experiment, we construct the source data matrix $X^s = [\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_{500}] \in \mathbb{R}^{3 \times 500}$ by independently drawing 1500 samples from the exponential distribution $p(x) = \exp(-x)$ and arranging them in a $3 \times 500$ matrix. We construct the target data matrix $Y^t = [\tilde{y}_1, \tilde{y}_2, \ldots, \tilde{y}_{500}] \in \mathbb{R}^{3 \times 500}$ by independently drawing samples from the uniform distribution on $[-2, -1]^3$. These two datasets are visualized in Fig. 5a. Mapping these samples to the RKHS, we investigate the performance of the KGOT map. The centered source and target sample sets in RKHS are $\Phi^s_X H_{500}$ and $\Phi^t_Y H_{500}$, respectively. We aim to numerically demonstrate that $\hat{T}^H_G(\Phi^s_X H_{500})$ (see (21)) and $\hat{P}_s(\Phi^t_Y H_{500})$ (see (20)) are similarly distributed in RKHS. For the sake of visualization, we choose a coordinate system $(l_i)_{i=1}^3$, in which $l_i$ is taken to be the evaluation functional at point $x_i$, i.e., $l_i(f) = f(x_i) = \langle \phi(x_i), f\rangle_{H_K}$, $\forall f \in H_K$. The coordinates of $\hat{T}^H_G(\Phi^s_X H_{500})$ are
Fig. 4. The estimated KGW and KW distances between the Gaussian distributions $N(m\mathbf{1}, I)$ and $N(-m\mathbf{1}, I)$.
Fig. 5. (a) The source dataset $X^s$ and the target dataset $Y^t$. (b) The representations of the datasets $\hat{T}^H_G(\Phi^s_X H_{N_s})$ and $\hat{P}_s(\Phi^t_Y H_{N_t})$ under the coordinate system $(l_i)_{i=1}^3$.
$$\tilde{X}^s = \begin{bmatrix} \phi^T(x_1) \\ \phi^T(x_2) \\ \phi^T(x_3) \end{bmatrix} \Phi^s_X H_{500} C_{XX}^\dagger C_{XYYX}^{\frac{1}{2}} C_{XX}^\dagger C_{XX} = K^{13}_{XX} H_{500} C_{XX}^\dagger C_{XYYX}^{\frac{1}{2}} C_{XX}^\dagger C_{XX} \in \mathbb{R}^{3 \times 500}, \qquad (28)$$

where $K^{13}_{XX}$ denotes the first three rows of $K_{XX}$. The coordinates of $\hat{P}_s(\Phi^t_Y H_{500})$ are

$$\tilde{Y}^t = \begin{bmatrix} \phi^T(x_1) \\ \phi^T(x_2) \\ \phi^T(x_3) \end{bmatrix} \Phi^s_X H_{500} C_{XX}^\dagger C_{XY} = K^{13}_{XX} H_{500} C_{XX}^\dagger C_{XY} \in \mathbb{R}^{3 \times 500}. \qquad (29)$$

We visualize the new data points $\tilde{X}^s$ and $\tilde{Y}^t$ in Fig. 5b. It can be seen that the distributions of $\tilde{X}^s$ and $\tilde{Y}^t$ are quite close to each other.
6.2 Image Classification
In this section, we evaluate the performance of the KGW distance and the KB distance on the multi-category image classification task. As described in Section 5.1, our method involves two kernels, for both of which we employ the RBF kernel, i.e., $k_1(\tilde{x}, \tilde{y}) = \exp\big(-\frac{\|\tilde{x} - \tilde{y}\|_2^2}{\sigma_1^2}\big)$ and $k_2(I_1, I_2) = \exp\big(-\frac{(\hat{d}_{HGW})^2}{\sigma_2^2}\big)$ (or $k_2(I_1, I_2) = \exp\big(-\frac{(\hat{d}_{HB})^2}{\sigma_2^2}\big)$). Note that $k_2$ is not necessarily positive definite. We regularize the corresponding kernel matrices by adding a small diagonal term, $\gamma I$, as in [59]. For the hyperparameters, we set $\sigma_1^2$ to be the median of the squared Euclidean distances between all the samples, and we set $\gamma = 10^{-4}$. In each experiment, we choose a small subset of the training dataset to tune $\sigma_2^2$, which takes values in $\{0.1, 0.2, 0.6, 1, 2, 4\} \times M$, where $M$ is the median of all the squared KGW or KB distances. The tradeoff parameter $C$ of the SVM is taken in $\{0.1, 1, 10, 100, 1000\}$.
6.2.1 Data Preparation
We use four benchmark image datasets: the Kylberg virus dataset [60], the Kylberg texture dataset [61], the UIUC dataset [62], and the TinyGraz03 dataset [63]. We consider both low-level features and deep features.
Low-level features. The Kylberg virus dataset contains 15 classes of viruses. Each class has 100 grayscale images of $41 \times 41$ pixels. We follow the experimental protocol in [39]. At each pixel $(u, v)$, we extract a 25-dimensional feature vector, i.e.,

$$\tilde{x}_{u,v} = \Big[I_{u,v},\ \frac{\partial I}{\partial u},\ \frac{\partial I}{\partial v},\ \frac{\partial^2 I}{\partial u^2},\ \frac{\partial^2 I}{\partial v^2},\ \big|G^{0,0}_{u,v}\big|,\ \ldots,\ \big|G^{3,4}_{u,v}\big|\Big]^T,$$

where $I_{u,v}$ is the intensity at $(u, v)$, $\frac{\partial I}{\partial u}$ ($\frac{\partial I}{\partial v}$) is the derivative of $I$ in the horizontal (vertical) direction, $\frac{\partial^2 I}{\partial u^2}$ ($\frac{\partial^2 I}{\partial v^2}$) is the second-order derivative in the horizontal (vertical) direction, $G^{O,S}_{u,v}$ is the response of the Gabor wavelet [64] with orientation $O$ taking values in $\{0, 1, 2, 3\}$ and scale $S$ taking values in $\{0, 1, 2, 3, 4\}$, and $|\cdot|$ denotes the magnitude. To reduce the computational burden, for each image, we use 1000 samples out of the total $41 \times 41 = 1681$ observations as the representation. In each class of viruses, we randomly select 90 images as the training set and use the remaining ones as the testing set.
The Kylberg texture dataset contains 28 categories of textures. Each category has 160 grayscale images taken with and without rotation. Following the protocol in [39], we resize each image to $128 \times 128$ pixels, and compute 1024 observations on a coarse grid (i.e., every 4 pixels in the horizontal and vertical directions). At each pixel $(u, v)$, we extract a 5-dimensional feature vector, i.e.,

$$\tilde{x}_{u,v} = \Big[I_{u,v},\ \frac{\partial I}{\partial u},\ \frac{\partial I}{\partial v},\ \frac{\partial^2 I}{\partial u^2},\ \frac{\partial^2 I}{\partial v^2}\Big]^T.$$

We randomly select five images in each category as the training set and use the remaining ones as the testing set.
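A minimal sketch of how such pixel-wise features could be assembled with NumPy finite differences is shown below (our own illustration; the paper's exact derivative filters, magnitude conventions, and Gabor responses may differ).

```python
import numpy as np

def lowlevel_features(image, step=4):
    """Extract the 5-dimensional pixel-wise features [I, dI/du, dI/dv,
    d2I/du2, d2I/dv2] on a coarse grid (one sample every `step` pixels)."""
    I = image.astype(float)
    dIdu, dIdv = np.gradient(I)          # first derivatives (rows, columns)
    d2Idu2 = np.gradient(dIdu, axis=0)   # second derivatives
    d2Idv2 = np.gradient(dIdv, axis=1)
    stack = np.stack([I, dIdu, dIdv, d2Idu2, d2Idv2], axis=-1)
    return stack[::step, ::step].reshape(-1, 5)

feats = lowlevel_features(np.random.rand(128, 128))
print(feats.shape)   # (1024, 5): 32 x 32 grid points for a 128 x 128 image
```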
The UIUC dataset contains 18 categories of materials, each of which has 12 images. Following [65], at each pixel, we extract a 19-dimensional feature vector, i.e.,

$$\tilde{x}_{u,v} = \Big[I^R_{u,v},\ I^G_{u,v},\ I^B_{u,v},\ \frac{\partial I}{\partial u},\ \frac{\partial I}{\partial v},\ \frac{\partial^2 I}{\partial u^2},\ \frac{\partial^2 I}{\partial v^2},\ \big|G^{0,0}_{u,v}\big|,\ \ldots,\ \big|G^{2,3}_{u,v}\big|\Big]^T.$$
We randomly select 1000 feature vectors of each image as its representation. As in [65], we randomly select half of the images in each class as the training data, and use the rest as the testing data.
For all the above three datasets, we repeat the corresponding random training/testing split procedure 10 times and report the average accuracy and the standard deviation.
The TinyGraz03 dataset contains 20 classes of outdoor scenes, each of which has at least 40 images of size $32 \times 32$. Following [65], at each pixel, we extract a 7-dimensional feature vector, i.e.,

$$\tilde{x}_{u,v} = \Big[I^R_{u,v},\ I^G_{u,v},\ I^B_{u,v},\ \frac{\partial I}{\partial u},\ \frac{\partial I}{\partial v},\ \frac{\partial^2 I}{\partial u^2},\ \frac{\partial^2 I}{\partial v^2}\Big]^T.$$
We use the training/testing split recommended in [63].
Deep features. For the Kylberg virus, UIUC, and TinyGraz03 datasets, we also conduct experiments using the hypercolumn descriptor [66] extracted from a deep convolutional neural network. To obtain the hypercolumn descriptors, we normalize and resize each image to a fixed size of $224 \times 224 \times 3$ (in the format of $W \times H \times C$) and feed it into a pre-trained AlexNet [67], [68]. We then extract the feature maps from the maxpool2 layer and the conv4 layer. The sizes of these features are $13 \times 13 \times 192$ and $13 \times 13 \times 256$, respectively. We concatenate the feature maps extracted from the maxpool2 and conv4 layers. As a result, each image is represented by $13 \times 13 = 169$ feature vectors of dimension $192 + 256 = 448$.
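As a hedged sketch (ours; it assumes a recent torchvision and that slicing AlexNet's `features` module at indices 6 and 9 reproduces the maxpool2 and conv4 outputs), the hypercolumn construction could look as follows.

```python
import torch
import torchvision

alexnet = torchvision.models.alexnet(
    weights=torchvision.models.AlexNet_Weights.DEFAULT).eval()

def hypercolumn(image_batch):
    """image_batch: float tensor of shape (B, 3, 224, 224), ImageNet-normalized."""
    with torch.no_grad():
        pool2 = alexnet.features[:6](image_batch)    # (B, 192, 13, 13), after the second max-pool
        conv4 = alexnet.features[:9](image_batch)    # (B, 256, 13, 13), after conv4
    hyper = torch.cat([pool2, conv4], dim=1)         # (B, 448, 13, 13)
    return hyper.flatten(2).transpose(1, 2)          # (B, 169, 448): 169 feature vectors per image

print(hypercolumn(torch.randn(1, 3, 224, 224)).shape)
```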
6.2.2 Experimental Results
We compare our approaches with the following state-of-the-art approaches: (1) MMD-based methods [69], denoted as MMD1 and MMD2, where the level-1 kernels (i.e., the embedding kernels) of both MMD1 and MMD2 are the RBF kernel, and the level-2 kernels of MMD1 and MMD2 are linear and RBF, respectively; (2) RKHS Bregman divergence [39], denoted as SH; (3) covariance discriminative learning [37], denoted as CDL. For all the approaches, we use the SVM as the final classifier.
We report the classification results in Table 2. For most tasks, our approaches KGW-SVM and KB-SVM outperform
baseline methods. We see that covariance operator-based approaches, like the kernel Bures distance and the kernel Bregman divergence, have superior performance to MMD-based methods. One reason is that covariance operator-based methods exploit the intrinsic Riemannian structure of positive operators, which is usually favorable for computer vision. Note that by integrating with the deep hypercolumn descriptor, our KB-SVM approach achieves a very high classification accuracy of 72 percent on the challenging TinyGraz03 dataset, whose correct recognition rate by humans is 30 percent.
6.3 Domain Adaptation
In this section, we conduct experiments on visual object recognition and document classification to evaluate our approach.
6.3.1 Datasets
Three benchmark datasets are considered: COIL20, Office-Caltech, and Reuters-21578. In total, we have 32 adaptation tasks.
The COIL20 dataset contains a total of 1,440 grayscale images of 20 classes of objects. The images of each object were taken at a pose interval of 5 degrees. Consequently, each object has 72 images. Each image in COIL20 is $32 \times 32$ pixels with 256 gray levels. We adopt the public dataset released by Long [43]. The total dataset is partitioned into two subsets, COIL1 and COIL2. COIL1 consists of all the images taken in the directions of $[0^\circ, 85^\circ]$ or $[180^\circ, 265^\circ]$. COIL2 consists of all the images taken in the directions of $[90^\circ, 175^\circ]$ or $[270^\circ, 355^\circ]$. There are two domain adaptation tasks, i.e., C1 → C2 and C2 → C1.
The Office-Caltech dataset is an increasingly popular benchmark dataset for visual domain adaptation. It contains the images of ten classes of objects taken from four domains: 958 images downloaded from Amazon, 1,123 images gathered from web image search (Caltech-256), 157 images taken with a DSLR camera, and 295 images from webcams. In total, they form 12 domain adaptation tasks, e.g., A → C, A → D, ..., W → D. We consider two types of features: the SURF features and the DeCAF6 deep learning features. The SURF features represent each image with an 800-bin normalized histogram whose codebook is trained from a subset of the Amazon images. We use the public dataset released by Gong [45]. The DeCAF6 features [70], extracted from the 6th layer of a convolutional neural network, represent each image with a 4,096-dimensional vector.
The Reuters-21578 dataset has three top categories, i.e., orgs, places, and people, each of which has many subcategories. Samples that belong to different subcategories are treated as drawn from different domains. Therefore, we can construct six cross-domain document datasets: orgs vs people, people vs orgs, orgs vs places, places vs orgs, people vs places, and places vs people. We adopt the preprocessed version of Reuters-21578, which contains 3,461 documents represented by 4,771-dimensional features.
In summary, we have constructed 2 + 12 × 2 + 6 = 32 domain adaptation tasks.
6.3.2 Methods
We compare our approach with many state-of-the-art algorithms: (1) 1-Nearest neighbor classifier without adaptation (NN), (2) standard support vector machine (SVM), (3) Principal component analysis (PCA), (4) Optimal transport with entropy regularization (OT-IT) [7], (5) Geodesic flow kernel (GFK) [45], (6) Joint distribution alignment (JDA) [43], (7) Correlation alignment (CORAL) [49], (8) Transfer component analysis (TCA) [71], (9) Subspace alignment (SA) [57], (10) Domain invariant projection (DIP) [44], (11) Surrogate kernel machine (SKM) [48], and (12) Kernel mean matching (KMM) [47].
In the object recognition tasks, we apply all the algorithms to the data after PCA preprocessing, and use NN as the final classifier. Note that the choice of Inn_ts or Dist_ts for KGOT and CORAL is marked by a subscript. In the document classification tasks, we apply all the algorithms to the raw data, and use SVM as the final classifier.
6.3.3 Implementation Details
In order to fairly compare the above methods, we adopt the evaluation protocol introduced in [71] and [43]. That is, we use the whole labeled data in the source domain for training a classifier ("full training" protocol). To choose hyperparameters for all the methods, we randomly select a very small subset of the target samples to tune parameters. We consider the following parameter ranges. For algorithms involving subspace learning, we search for the best dimension k in {10, 15, ..., 40}. For algorithms involving regularization parameters, we search for the best ones in {0.01, 0.1, 1, 2, 10, 100}. For the tradeoff parameter C in SVM, we select the best C in {0.01, 0.1, 1, 10, 50, 100, 1000}.
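For reference, the search grids above can be written down directly; the snippet below is only a sketch of the tuning protocol (fit on the labeled source data, score on the small labeled target subset), not the authors' script, and the helper pick_best is hypothetical.

subspace_dims = list(range(10, 45, 5))              # k in {10, 15, ..., 40}
reg_params    = [0.01, 0.1, 1, 2, 10, 100]          # regularization parameters
svm_C_values  = [0.01, 0.1, 1, 10, 50, 100, 1000]   # SVM tradeoff parameter C

def pick_best(grid, fit_and_score):
    """Exhaustive search: fit_and_score maps one parameter setting to its
    accuracy on the small held-out target subset; the best setting wins."""
    import itertools
    keys = list(grid)
    candidates = (dict(zip(keys, values)) for values in itertools.product(*grid.values()))
    return max(candidates, key=fit_and_score)

# e.g., pick_best({"k": subspace_dims, "C": svm_C_values}, my_fit_and_score)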
6.3.4 Experimental Results
The experimental results on the 32 domain adaptation tasks are reported in Tables 3, 4, 5, and 6. For each task, the best result is highlighted in bold. Overall, our KGOT-based approaches achieve better performance than the baseline methods.
TABLE 2
Classification Accuracy (in %) on the Kylberg Virus, Texture, UIUC, and TinyGraz03 Datasets

              ------- low-level features -------       ----- deep features -----
Methods       Virus     Texture   UIUC      TinyGraz03  Virus     UIUC      TinyGraz03  Mean
KGW-SVM       80.4±3.8  93.1±0.9  48.4±3.6  61          84.1±1.3  60.3±1.7  70          71.0
KB-SVM        78.7±3.1  93.9±1.2  48.2±2.5  58          82.4±1.8  59.1±2.0  72          70.3
MMD1-SVM      43.3±6.2  59.3±2.3  23.3±3.8  32          68.7±0.7  51.4±3.7  47          46.4
MMD2-SVM      71.8±3.1  93.3±1.1  43.3±6.1  51          82.9±1.5  56.5±3.9  70          67.0
Bregman-SVM   81.2±2.9  91.4±1.3  45.4±3.0  59          82.7±1.4  53.9±2.8  68          68.8
CDL-SVM       69.5±3.1  79.9±1.1  36.3±2.0  41          83.7±1.3  58.3±2.9  70          62.7
On the Office-Caltech dataset with the SURF features, the average recognition accuracy of our approach is 18.04 percent higher than that of the 1NN algorithm without domain adaptation, which demonstrates the power of aligning RKHS covariance operators in tackling domain shift issues. On the Reuters-21578 dataset, KGOT's average classification accuracy significantly exceeds the best competing method's by 3.62 percent. On average, KGOT has superior performance to CORAL, because KGOT aligns the covariance descriptors in the nonlinear feature space, which can capture high-order statistics.
TABLE 3
Recognition Accuracies (in %) on the COIL20 Dataset

Task      NN     PCA    OT-IT  GFK    JDA    CORAL_Dist  TCA    SA     KGOT_Dist
C1 → C2   83.33  86.53  85.69  88.06  90.28  89.31       89.31  88.19  93.06
C2 → C1   84.72  87.22  86.39  88.33  88.06  89.17       89.72  88.75  89.72
Mean      84.03  86.88  86.04  88.20  89.17  89.24       89.52  88.47  91.39
TABLE 4
Recognition Accuracies (in %) on the Office-Caltech Dataset with the SURF Features

Task     NN     PCA    OT-IT  GFK    JDA    CORAL_Inn  TCA    DIP    KGOT_Inn
A → C    26.00  35.98  36.24  42.68  37.67  34.73      42.74  39.98  39.89
A → D    25.48  32.48  35.03  40.52  36.94  28.66      37.58  39.49  42.04
A → W    29.83  34.24  42.71  42.37  40.34  35.93      40.00  38.64  42.03
C → A    23.70  37.89  45.41  41.13  43.42  46.45      46.76  41.75  49.37
C → D    25.48  39.49  45.86  45.22  52.87  43.95      47.13  45.22  50.96
C → W    25.76  34.92  42.71  40.34  43.05  36.27      40.68  37.29  43.05
D → A    28.50  33.72  33.92  34.34  33.29  34.13      34.86  33.82  37.06
D → C    26.27  31.17  31.43  33.48  30.99  31.61      32.95  30.99  34.64
D → W    63.39  81.36  87.46  85.08  92.20  83.73      91.19  84.41  87.46
W → A    22.96  30.79  37.58  33.09  37.06  39.46      29.44  30.38  38.00
W → C    19.86  30.19  32.41  30.90  29.92  33.66      30.72  26.09  36.60
W → D    59.24  80.89  89.17  90.45  89.81  78.34      89.15  91.72  91.72
Mean     31.37  41.93  46.66  46.63  47.30  43.91      46.93  44.98  49.41
TABLE 5
Recognition Accuracies (in %) on the Office-Caltech Dataset with the Deep Features

Task     NN     PCA    OT-IT  GFK    JDA    CORAL_Inn  TCA    SA     KGOT_Inn
A → C    83.70  79.43  83.26  78.09  83.26  85.31      83.08  80.59  85.66
A → D    80.25  80.89  84.08  84.71  80.25  80.80      82.17  89.17  86.62
A → W    74.58  70.85  77.29  76.27  77.97  76.27      80.34  83.05  82.37
C → A    89.98  89.46  88.73  89.14  90.08  91.13      90.50  89.35  91.44
C → D    86.62  87.90  90.45  88.54  91.08  86.62      86.62  90.45  92.36
C → W    78.64  81.36  88.47  80.34  83.73  81.12      79.66  81.36  87.12
D → A    85.70  89.14  83.30  89.04  91.54  88.73      91.65  87.06  91.75
D → C    79.16  78.01  83.97  78.36  82.37  80.41      83.53  81.39  85.57
D → W    99.66  98.64  98.31  99.32  100    99.32      98.98  99.32  99.32
W → A    77.14  83.30  88.94  83.92  88.62  82.05      83.72  83.72  89.67
W → C    74.80  78.72  79.07  76.22  81.30  78.72      79.79  79.79  84.95
W → D    100    100    99.36  100    100    100        100    100    100
Mean     84.19  84.81  87.10  85.33  87.52  85.87      86.67  87.10  89.74
TABLE 6
Recognition Accuracies (in %) on the Reuters-21578 Dataset

Tasks             SVM    PCA    OT-IT  GFK    TCA    CORAL  SKM    KMM    KGOT
Orgs vs People    77.57  80.87  80.96  82.04  81.37  76.07  79.55  77.81  82.04
People vs Orgs    80.43  81.81  84.31  82.30  84.39  76.71  82.22  82.94  85.77
Orgs vs Places    69.89  73.63  76.22  73.92  74.59  73.15  74.50  71.52  76.22
Places vs Orgs    65.16  65.35  75.10  76.57  73.33  64.66  69.88  67.91  77.95
People vs Places  61.19  62.95  66.85  64.90  62.12  59.05  63.60  61.65  72.98
Places vs People  60.26  68.89  61.75  65.55  60.54  60.26  60.35  59.52  71.96
Mean              69.08  72.25  74.20  74.21  72.72  68.32  71.68  70.23  77.82
7 CONCLUSION AND FUTURE WORK
In this paper, we presented a novel, theoretically robust computational framework, namely optimal transport in reproducing kernel Hilbert spaces, for comparing and matching distributions in RKHS. Assuming Gaussianity in RKHS, we obtained closed-form expressions of both the empirical Wasserstein distance and the optimal transport map, which respectively generalize the covariance descriptor comparison and alignment problems from Euclidean spaces to (potentially) infinite-dimensional feature spaces. Empirically, we apply our formulations to image classification and domain adaptation. For both tasks, our approaches achieve state-of-the-art results.
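To make the finite-dimensional objects that these expressions generalize concrete, the sketch below computes the classical Bures(-Wasserstein) distance between two covariance matrices and the optimal transport map between the corresponding zero-mean Gaussians. These are the Euclidean-space formulas only; the RKHS versions derived in the paper are expressed through kernel matrices instead, and this SciPy-based code is our illustrative sketch, not the authors' implementation.

# Bures-Wasserstein distance between covariance matrices A and B:
#   d^2(A, B) = tr(A) + tr(B) - 2 tr((A^{1/2} B A^{1/2})^{1/2}),
# and the optimal transport map between N(0, A) and N(0, B):
#   T = A^{-1/2} (A^{1/2} B A^{1/2})^{1/2} A^{-1/2}.
import numpy as np
from scipy.linalg import sqrtm, inv

def bures_distance(A: np.ndarray, B: np.ndarray) -> float:
    cross = sqrtm(sqrtm(A) @ B @ sqrtm(A))
    # Guard against tiny negative values caused by numerical error.
    return float(np.sqrt(max(np.trace(A) + np.trace(B) - 2 * np.trace(cross).real, 0.0)))

def gaussian_ot_map(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Linear map T with T A T = B, optimal for transporting N(0, A) to N(0, B)."""
    A_half = sqrtm(A)
    A_half_inv = inv(A_half)
    return (A_half_inv @ sqrtm(A_half @ B @ A_half) @ A_half_inv).real

rng = np.random.default_rng(0)
X, Y = rng.standard_normal((100, 5)), rng.standard_normal((100, 5))
A, B = np.cov(X, rowvar=False), np.cov(Y, rowvar=False)
T = gaussian_ot_map(A, B)
print(bures_distance(A, B), np.allclose(T @ A @ T, B, atol=1e-6))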
Our approaches are rather flexible in the sense that they can be naturally integrated with other machine learning topics, such as kernel learning, metric learning, and subspace/manifold learning. Moreover, our approaches support various data representations, such as proteins, strings, and graphs. Therefore, they have great potential to succeed in many applications where kernel functions are well-defined.
In future work, we intend to conduct ensemble classification and transfer learning on other types of datasets. We are also interested in further improving the performance of the proposed approaches for domain adaptation. We plan to modify our formulation of OT in RKHS, enabling it to align the joint distributions of features and labels between different domains.
ACKNOWLEDGMENTS
This work was supported in part by the AFOSR grant FA9550-16-1-0386.
REFERENCES
[1] Y. Rubner, C. Tomasi, and L. J. Guibas, "The earth mover's distance as a metric for image retrieval," Int. J. Comput. Vis., vol. 40, no. 2, pp. 99–121, 2000.
[2] O. Pele and M. Werman, "Fast and robust earth mover's distances," in Proc. IEEE 12th Int. Conf. Comput. Vis., 2009, pp. 460–467.
[3] S. Ferradans, N. Papadakis, J. Rabin, G. Peyré, and J.-F. Aujol, "Regularized discrete optimal transport," in Proc. Int. Conf. Scale Space Variational Methods Comput. Vis., 2013, pp. 428–439.
[4] S. Kolouri, Y. Zou, and G. K. Rohde, "Sliced Wasserstein kernels for probability distributions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 5258–5267.
[5] A. Gramfort, G. Peyré, and M. Cuturi, "Fast optimal transport averaging of neuroimaging data," in Proc. Int. Conf. Inf. Process. Med. Imag., 2015, pp. 261–272.
[6] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein generative adversarial networks," in Proc. Int. Conf. Mach. Learn., 2017, pp. 214–223.
[7] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy, "Optimal transport for domain adaptation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 9, pp. 1853–1865, Sep. 2017.
[8] M. Perrot, N. Courty, R. Flamary, and A. Habrard, "Mapping estimation for discrete optimal transport," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 4197–4205.
[9] C. Frogner, C. Zhang, H. Mobahi, M. Araya, and T. A. Poggio, "Learning with a Wasserstein loss," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 2053–2061.
[10] J. Solomon, F. De Goes, G. Peyré, M. Cuturi, A. Butscher, A. Nguyen, T. Du, and L. Guibas, "Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains," ACM Trans. Graph., vol. 34, no. 4, 2015, Art. no. 66.
[11] N. Bonneel, J. Rabin, G. Peyré, and H. Pfister, "Sliced and Radon Wasserstein barycenters of measures," J. Math. Imag. Vis., vol. 51, no. 1, pp. 22–45, 2015.
[12] G. Peyré, M. Cuturi, and J. Solomon, "Gromov-Wasserstein averaging of kernel and distance matrices," in Proc. Int. Conf. Mach. Learn., 2016, pp. 2664–2672.
[13] J. A. Carrillo, L. C. Ferreira, and J. C. Precioso, "A mass-transportation approach to a one dimensional fluid mechanics model with nonlocal velocity," Adv. Math., vol. 231, no. 1, pp. 306–327, 2012.
[14] T. A. El Moselhy and Y. M. Marzouk, "Bayesian inference with optimal maps," J. Comput. Phys., vol. 231, no. 23, pp. 7815–7850, 2012.
[15] M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger, "From word embeddings to document distances," in Proc. Int. Conf. Mach. Learn., 2015, pp. 957–966.
[16] G. Montavon, K.-R. Müller, and M. Cuturi, "Wasserstein training of restricted Boltzmann machines," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 3718–3726.
[17] S. Ferradans, G.-S. Xia, G. Peyré, and J.-F. Aujol, "Static and dynamic texture mixing using optimal transport," in Proc. Int. Conf. Scale Space Variational Methods Comput. Vis., 2013, pp. 137–148.
[18] Y. Chen, T. T. Georgiou, and A. Tannenbaum, "Optimal transport for Gaussian mixture models," IEEE Access, vol. 7, pp. 6269–6278, 2019.
[19] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins, "Text classification using string kernels," J. Mach. Learn. Res., vol. 2, pp. 419–444, 2002.
[20] R. I. Kondor and J. D. Lafferty, "Diffusion kernels on graphs and other discrete input spaces," in Proc. 19th Int. Conf. Mach. Learn., 2002, pp. 315–322.
[21] A. Ben-Hur and W. S. Noble, "Kernel methods for predicting protein-protein interactions," Bioinf., vol. 21, Suppl 1, pp. i38–46, Jun. 2005.
[22] C. Cortes, P. Haffner, and M. Mohri, "Rational kernels: Theory and algorithms," J. Mach. Learn. Res., vol. 5, pp. 1035–1062, Aug. 2004.
[23] C. Cortes, P. Haffner, and M. Mohri, "Lattice kernels for spoken-dialog classification," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Apr. 2003, vol. 1, pp. I–628.
[24] Z. Zhang, M. Wang, Y. Xiang, and A. Nehorai, "Geometry-adapted Gaussian random field regression," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2017, pp. 6528–6532.
[25] S. K. Zhou and R. Chellappa, "From sample similarity to ensemble similarity: Probabilistic distance measures in reproducing kernel Hilbert space," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 6, pp. 917–929, Jun. 2006.
[26] S.-Y. Huang and C.-R. Hwang, "Kernel Fisher discriminant analysis in Gaussian reproducing kernel Hilbert spaces," Inst. Stat. Sci., Academia Sinica, Taipei, Taiwan, Tech. Rep., 2006.
[27] Z. Zhang, G. Wang, D.-Y. Yeung, and J. T. Kwok, "Probabilistic kernel principal component analysis," Dept. Comput. Sci., The Hong Kong Univ. Sci. Technol., Hong Kong, Tech. Rep. HKUST-CS04-03, 2004.
[28] M. Alvarez and R. Henao, "Probabilistic kernel principal component analysis through time," in Proc. Int. Conf. Neural Inf. Process., 2006, pp. 747–754.
[29] F. R. Bach and M. I. Jordan, "Learning graphical models with Mercer kernels," in Proc. Adv. Neural Inf. Process. Syst., 2003, pp. 1033–1040.
[30] R. Kondor and T. Jebara, "A kernel between sets of vectors," in Proc. 20th Int. Conf. Mach. Learn., 2003, pp. 361–368.
[31] A. Berlinet and C. Thomas-Agnan, Reproducing Kernel Hilbert Spaces in Probability and Statistics. Berlin, Germany: Springer, 2011.
[32] C. Villani, Topics in Optimal Transportation. Providence, RI, USA: American Mathematical Society, 2003, vol. 58.
[33] S. T. Rachev and L. Ruschendorf, "A transformation property of minimal metrics," Theory Probability Appl., vol. 35, no. 1, pp. 110–117, 1991.
[34] J. Cuesta-Albertos, C. Matrán-Bea, and A. Tuero-Diaz, "On lower bounds for the L2-Wasserstein metric in a Hilbert space," J. Theoretical Probability, vol. 9, no. 2, pp. 263–283, 1996.
[35] M. Gelbrich, "On a formula for the L2 Wasserstein metric between measures on Euclidean and Hilbert spaces," Mathematische Nachrichten, vol. 147, no. 1, pp. 185–203, 1990.
[36] A. Mallasto and A. Feragen, "Learning from uncertain curves: The 2-Wasserstein metric for Gaussian processes," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5665–5674.
[37] R. Wang, H. Guo, L. S. Davis, and Q. Dai, "Covariance discriminative learning: A natural and efficient approach to image set classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012, pp. 2496–2503.
[38] O. Tuzel, F. Porikli, and P. Meer, "Pedestrian detection via classification on Riemannian manifolds," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 10, pp. 1713–1727, Oct. 2008.
[39] M. Harandi, M. Salzmann, and F. Porikli, "Bregman divergences for infinite dimensional covariance matrices," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 1003–1010.
[40] M. H. Quang, M. San Biagio, and V. Murino, "Log-Hilbert-Schmidt metric between positive definite operators on Hilbert spaces," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 388–396.
[41] H. Q. Minh, "Affine-invariant Riemannian distance between infinite-dimensional covariance operators," in Proc. Int. Conf. Netw. Geometric Sci. Inf., 2015, pp. 30–38.
[42] X. Pennec, P. Fillard, and N. Ayache, "A Riemannian framework for tensor computing," Int. J. Comput. Vis., vol. 66, no. 1, pp. 41–66, 2006.
[43] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu, "Transfer feature learning with joint distribution adaptation," in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 2200–2207.
[44] M. Baktashmotlagh, M. T. Harandi, B. C. Lovell, and M. Salzmann, "Unsupervised domain adaptation by domain invariant projection," in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 769–776.
[45] B. Gong, Y. Shi, F. Sha, and K. Grauman, "Geodesic flow kernel for unsupervised domain adaptation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012, pp. 2066–2073.
[46] R. Gopalan, R. Li, and R. Chellappa, "Domain adaptation for object recognition: An unsupervised approach," in Proc. IEEE Int. Conf. Comput. Vis., 2011, pp. 999–1006.
[47] J. Huang, A. Gretton, K. M. Borgwardt, B. Schölkopf, and A. J. Smola, "Correcting sample selection bias by unlabeled data," in Proc. Adv. Neural Inf. Process. Syst., 2007, pp. 601–608.
[48] K. Zhang, V. Zheng, Q. Wang, J. Kwok, Q. Yang, and I. Marsic, "Covariate shift in Hilbert space: A solution via surrogate kernels," in Proc. Int. Conf. Mach. Learn., 2013, pp. 388–395.
[49] B. Sun, J. Feng, and K. Saenko, "Return of frustratingly easy domain adaptation," in Proc. 30th AAAI Conf. Artif. Intell., 2016, pp. 2058–2065.
[50] D. Dowson and B. Landau, "The Fréchet distance between multivariate normal distributions," J. Multivariate Anal., vol. 12, no. 3, pp. 450–455, 1982.
[51] R. Bhatia, T. Jain, and Y. Lim, "On the Bures-Wasserstein distance between positive definite matrices," Expositiones Mathematicae, 2018.
[52] A. Takatsu et al., "Wasserstein geometry of Gaussian measures," Osaka J. Math., vol. 48, no. 4, pp. 1005–1026, 2011.
[53] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf, "Measuring statistical dependence with Hilbert-Schmidt norms," in Proc. 16th Int. Conf. Algorithmic Learn. Theory, 2005, pp. 63–77.
[54] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, "A kernel two-sample test," J. Mach. Learn. Res., vol. 13, pp. 723–773, 2012.
[55] B. K. Sriperumbudur, K. Fukumizu, and G. R. Lanckriet, "Universality, characteristic kernels and RKHS embedding of measures," J. Mach. Learn. Res., vol. 12, pp. 2389–2410, Jul. 2011.
[56] K. Fukumizu, L. Song, and A. Gretton, "Kernel Bayes' rule: Bayesian inference with positive definite kernels," J. Mach. Learn. Res., vol. 14, no. 1, pp. 3753–3783, 2013.
[57] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars, "Unsupervised visual domain adaptation using subspace alignment," in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 2960–2967.
[58] Y. Xu, X. Fang, J. Wu, X. Li, and D. Zhang, "Discriminative transfer subspace learning via low-rank and sparse representation," IEEE Trans. Image Process., vol. 25, no. 2, pp. 850–863, Feb. 2016.
[59] M. Cuturi, "Sinkhorn distances: Lightspeed computation of optimal transport," in Proc. Adv. Neural Inf. Process. Syst., 2013, pp. 2292–2300.
[60] G. Kylberg, M. Uppström, K.-O. Hedlund, G. Borgefors, and I.-M. Sintorn, "Segmentation of virus particle candidates in transmission electron microscopy images," J. Microscopy, vol. 245, no. 2, pp. 140–147, 2012.
[61] G. Kylberg, "The Kylberg texture dataset v. 1.0," Centre for Image Analysis, Swedish Univ. Agricultural Sci. and Uppsala Univ., Uppsala, Sweden, External report (Blue series) 35, Sept. 2011. [Online]. Available: http://www.cb.uu.se/~gustaf/texture/
[62] Z. Liao, J. Rock, Y. Wang, and D. Forsyth, "Non-parametric filtering for geometric detail extraction and material representation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 963–970.
[63] A. Wendel and A. Pinz, "Scene categorization from tiny images," in Workshop Austrian Assoc. Pattern Recognit., 2007, pp. 49–56.
[64] T. S. Lee, "Image representation using 2D Gabor wavelets," IEEE Trans. Pattern Anal. Mach. Intell., vol. 18, no. 10, pp. 959–971, Oct. 1996.
[65] M. Faraki, M. T. Harandi, and F. Porikli, "Approximate infinite-dimensional region covariance descriptors for image classification," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2015, pp. 1364–1368.
[66] B. Hariharan, P. Arbelez, R. Girshick, and J. Malik, "Hypercolumns for object segmentation and fine-grained localization," in Proc. Conf. Pattern Recognit., Jun. 2015, pp. 447–456.
[67] A. Krizhevsky, "One weird trick for parallelizing convolutional neural networks," CoRR, vol. abs/1404.5997, Apr. 2014.
[68] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[69] K. Muandet, K. Fukumizu, F. Dinuzzo, and B. Schölkopf, "Learning from distributions via support measure machines," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 10–18.
[70] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, "DeCAF: A deep convolutional activation feature for generic visual recognition," in Proc. Int. Conf. Mach. Learn., 2014, pp. 647–655.
[71] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, "Domain adaptation via transfer component analysis," IEEE Trans. Neural Netw., vol. 22, no. 2, pp. 199–210, Feb. 2011.
Zhen Zhang (S'17) received the BSc degree from the University of Science and Technology of China in 2014. He is currently working toward the PhD degree in the Preston M. Green Department of Electrical and Systems Engineering, Washington University in St. Louis, St. Louis, MO, under the guidance of Dr. A. Nehorai. His research interests include the areas of machine learning and computer vision. He is a student member of the IEEE.
Mianzhi Wang (S'15) received the BSc degree in electronic engineering from Fudan University, Shanghai, China, in 2013. He is currently working toward the PhD degree in the Preston M. Green Department of Electrical and Systems Engineering, Washington University in St. Louis, St. Louis, MO, under the guidance of Dr. A. Nehorai. His research interests include the areas of statistical signal processing for sensor arrays, optimization, and machine learning. He is a student member of the IEEE.
Arye Nehorai (S'80-M'83-SM'0-F'94-LF'17) received the BSc and MSc degrees from the Technion, Israel, and the PhD degree from Stanford University, California. He is the Eugene and Martha Lohman Professor of Electrical Engineering in the Preston M. Green Department of Electrical and Systems Engineering (ESE) at Washington University in St. Louis (WUSTL). He served as chair of this department from 2006 to 2016. Under his leadership, the undergraduate enrollment has more than tripled and the masters enrollment has grown seven-fold. He is also professor in the Division of Biology and Biomedical Sciences (DBBS), the Division of Biostatistics, the Department of Biomedical Engineering, and the Department of Computer Science and Engineering, and director of the Center for Sensor Signal and Information Processing at WUSTL. Prior to serving at WUSTL, he was a faculty member at Yale University and the University of Illinois at Chicago. He served as editor-in-chief of the IEEE Transactions on Signal Processing from 2000 to 2002. From 2003 to 2005 he was the vice president (Publications) of the IEEE Signal Processing Society (SPS), the chair of the Publications Board, and a member of the Executive Committee of this Society. He was the founding editor of the special columns on Leadership Reflections in IEEE Signal Processing Magazine from 2003 to 2006. He received the 2006 IEEE SPS Technical Achievement Award and the 2010 IEEE SPS Meritorious Service Award. He was elected distinguished lecturer of the IEEE SPS for a term lasting from 2004 to 2005. He received several best paper awards in IEEE journals and conferences. In 2001 he was named University Scholar of the University of Illinois. He was the principal investigator of the Multidisciplinary University Research Initiative (MURI) project titled Adaptive Waveform Diversity for Full Spectral Dominance from 2005 to 2010. He is a life fellow of the IEEE since 1994, fellow of the Royal Statistical Society since 1996, and fellow of AAAS since 2012.
" For more information on this or any other computing
topic,please visit our Digital Library at
www.computer.org/csdl.
1754 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE
INTELLIGENCE, VOL. 42, NO. 7, JULY 2020
Authorized licensed use limited to: WASHINGTON UNIVERSITY
LIBRARIES. Downloaded on June 06,2020 at 03:01:16 UTC from IEEE
Xplore. Restrictions apply.
http: //www.cb.uu.se/~gustaf/texture/