
Wasserstein Riemannian Geometry of Positive-definite Matrices∗

Luigi Malagò†, Luigi Montrucchio‡, and Giovanni Pistone§

Abstract. The Wasserstein distance on multivariate non-degenerate Gaussian densities is a Riemannian distance. After reviewing the properties of the distance and the metric geodesic, we derive an explicit form of the Riemannian metric on positive-definite matrices and compute its tensor form with respect to the trace scalar product. The tensor is a matrix which is the solution of a Lyapunov equation. We compute the explicit form of the Riemannian exponential, the normal coordinate charts, and the Riemannian gradient, and discuss the gradient flow of the entropy. Finally, the Levi-Civita covariant derivative is computed in matrix form, together with the differential equation for the parallel transport. While all computations are given in matrix form, nonetheless we discuss the use of a special moving frame. Applications are briefly discussed.

Key words. Gaussian distribution, Wasserstein distance, Riemannian metric, Natural gradient, Riemannian exponential, Normal coordinates, Levi-Civita covariant derivative, Optimization on positive-definite symmetric matrices, Information Geometry.

AMS subject classifications. 15B48, 53C23, 53C25, 60D05

1. Introduction and overview. Given two Gaussian distributions N_n(µ_i, Σ_i), i = 1, 2, consider a Gaussian vector (X_1, X_2) where each block has the assigned distribution. The use of the number

(1) G² = inf E(‖X_1 − X_2‖²)

as an index of dissimilarity between distributions has been considered by many classical authors: C. Gini (1914), P. Lévy, and M. R. Fréchet. There is a considerable contemporary literature on this problem, where the index is called the L²-Wasserstein distance and is discussed in general, outside our special case of Gaussian distributions. Among the literature most relevant for the approach used in this paper we mention Y. Brenier [9], R. J. McCann [22], and F. Otto [24]. The explicit computation of the value of W² is known, together with an explicit formula for the metric geodesic.

The value of the index of Eq. (1) as a function of the means and dispersion matrices has been computed by several authors, in particular I. Olkin and F. Pukelsheim [23], D. C. Dowson and B. V. Landau [12], C. R. Givens and R. M. Shortt [15], M. Gelbrich [14], and R. Bhatia et al. [8].

∗Compiled Thursday 25th January, 2018, 10:46.
Funding: L. Malagò acknowledges the support of RIST. L. Montrucchio acknowledges the support of Collegio Carlo Alberto. G. Pistone is a member of GNAFA-INdAM and acknowledges the support of de Castro Statistics and Collegio Carlo Alberto.
†Romanian Institute of Science and Technology ([email protected]).
‡Collegio Carlo Alberto, Piazza Vincenzo Arbarello 8, 10122 Torino, Italy ([email protected]).
§de Castro Statistics, Collegio Carlo Alberto, Piazza Vincenzo Arbarello 8, 10122 Torino, Italy ([email protected]).


They found

(2) G² = ‖µ_1 − µ_2‖² + Tr(Σ_1 + Σ_2 − 2 (Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2})

(3)    = ‖µ_1 − µ_2‖² + Tr(Σ_1 + Σ_2 − 2 (Σ_1 Σ_2)^{1/2}).
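As a sanity check, Eqs. (2) and (3) can be verified numerically. The sketch below is not from the paper; it assumes NumPy and SciPy are available, and the helper names `sqrtm_spd`, `rand_spd`, and `gini_index_sq` are introduced here purely for illustration.

```python
import numpy as np
from scipy.linalg import sqrtm  # general matrix square root, used for (S1 S2)^{1/2}

def sqrtm_spd(S):
    # Symmetric square root of a symmetric non-negative definite matrix.
    w, Q = np.linalg.eigh(S)
    return (Q * np.sqrt(np.clip(w, 0.0, None))) @ Q.T

def rand_spd(n, rng):
    # Random, well-conditioned positive-definite matrix.
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)

def gini_index_sq(mu1, S1, mu2, S2):
    # Eq. (2): G^2 = |mu1 - mu2|^2 + Tr(S1 + S2 - 2 (S1^{1/2} S2 S1^{1/2})^{1/2}).
    R = sqrtm_spd(S1)
    return np.sum((mu1 - mu2) ** 2) + np.trace(S1 + S2 - 2.0 * sqrtm_spd(R @ S2 @ R))

rng = np.random.default_rng(0)
n = 4
mu1, mu2 = rng.standard_normal(n), rng.standard_normal(n)
S1, S2 = rand_spd(n, rng), rand_spd(n, rng)

g2 = gini_index_sq(mu1, S1, mu2, S2)
# Eq. (3): the same value via the non-symmetric square root (S1 S2)^{1/2}.
g2_alt = np.sum((mu1 - mu2) ** 2) + np.trace(S1 + S2) \
         - 2.0 * np.real(np.trace(sqrtm(S1 @ S2)))
```

The two expressions agree to numerical precision, the index is symmetric in its arguments, and it vanishes when the parameters coincide.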

The W² index is actually the square of a distance, hence a measure of divergence between distributions. We can mimic the argument of a famous paper by Amari [3], which derives the notions of both the Fisher metric and the natural gradient from the second-order approximation of the Kullback-Leibler divergence. In fact, we show that for a small symmetric increment H, the divergence between Σ + H and Σ satisfies

(4) Tr(Σ + (Σ + H) − 2 (Σ^{1/2} (Σ + H) Σ^{1/2})^{1/2}) ≃ (1/2) Tr(L_Σ[H] H),

where L_Σ[H] Σ + Σ L_Σ[H] = H. According to a standard argument, for any smooth function f, as H → 0, the increment f(Σ + H) − f(Σ) is maximized in the direction

(5) grad f(Σ) = ∇f(Σ) Σ + Σ ∇f(Σ).

The quadratic form in Eq. (4) is the natural candidate for the Riemannian metric associated to the given distance, and the gradient in Eq. (5) is the natural gradient.
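For a concrete illustration of Eq. (5) (an example chosen here, not taken from the paper): take f(Σ) = log det Σ, whose Euclidean gradient is ∇f(Σ) = Σ^{-1}, so the natural gradient ∇f(Σ)Σ + Σ∇f(Σ) equals 2I at every Σ. A minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
M = rng.standard_normal((n, n))
Sigma = M @ M.T + n * np.eye(n)      # a positive-definite matrix

euclid_grad = np.linalg.inv(Sigma)   # Euclidean gradient of log det at Sigma
# Eq. (5): natural gradient = (grad) Sigma + Sigma (grad).
nat_grad = euclid_grad @ Sigma + Sigma @ euclid_grad
```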

The fact that the L²-Wasserstein geometry is actually Riemannian on a suitable subset of distributions has been stated in general in [24, §4] and has been developed in the Gaussian case by A. Takatsu [27]. In the present paper, we proceed along these lines by deriving explicit forms of the Riemannian metric, the Riemannian exponential, the Levi-Civita (covariant) derivative, the Riemannian parallel transport, and the Riemannian Hessian. The perspective is dictated by the authors' interests in Machine Learning, Manifold Optimization, and Information Geometry. Further relevant, more technical, references are cited in the text when needed.

1.1. Overview. In Sec. 2 we review the properties of the space of symmetric matrices we are going to use, in particular the trace norm, the Riccati equation, the Lyapunov equation, and the calculus of the mapping σ: A ↦ AA^*, where A is a non-singular square matrix.

The set of positive-definite matrices is seen as an elementary manifold, as it is an open set of a Euclidean space. In this context, the mapping σ is a submersion and we compute the horizontal vectors at each point. The manifold we deal with is finite dimensional. However, all operations of interest being matrix operations, there is no need to choose a basis, so we use the language of the non-parametric differential geometry of W. Klingenberg [17] and S. Lang [18]. A short review of the matrix algebra we need is given in Sec. 2.

In Sec. 3 we review, for reference purposes, well-known results about the metric geometry induced by the dissimilarity index. We re-state the result as Prop. 3.2 and, for the sake of completeness, we provide in the Appendix a further proof inspired by [12]. The index itself turns out to be a distance. Its value is attained on a jointly degenerate distribution and it is possible to write down an explicit metric geodesic, if the distributions at the end-points are not both


singular. This has been done by R. J. McCann [22, Example 1.7] and is restated as Prop. 3.4.

The space of nondegenerate Gaussian measures (or, equivalently, the space of positive-definite matrices) can be endowed with a Riemannian structure that induces the L²-Wasserstein distance. This is discussed in Sec. 4. We use the presentation given in [27], cf. also [8], which in turn adapts to the Gaussian case the original work [24, §4]. We add a detailed explanation of the "coordinate system" associated with the Riemannian geometry. We show that the Wasserstein Riemannian metric is given at each dispersion matrix Σ by

(6) W_Σ(U, V) = Tr(L_Σ[U] Σ L_Σ[V]) = (1/2) Tr(L_Σ[U] V),

where U, V are symmetric matrices and L_Σ[U] is the solution X of the matrix equation XΣ + ΣX = U.

The form of the Riemannian exponential is derived in Sec. 5. The natural gradient, some applications to optimization, and the example of the gradient flow of the entropy are discussed in Sec. 6.

The analysis of the second-order geometry is treated in Sec. 7, where we compute the Levi-Civita covariant derivative, the Riemannian Hessian, and other related topics.

2. Symmetric matrices. Given a mean vector µ ∈ R^n and a symmetric n × n non-negative definite dispersion matrix Σ ∈ Sym^+(n), there exists a unique Gaussian distribution on R^n, denoted N_n(µ, Σ), with the given parameters, and conversely. The set G_n of Gaussian distributions on R^n is in 1-to-1 correspondence with the space of its parameters,

(7) G_n ∋ N_n(µ, Σ) ↔ (µ, Σ) ∈ R^n × Sym^+(n).

Moreover, G_n is closed for weak convergence and the identification is continuous in both directions. A reference for Gaussian distributions is T. W. Anderson [5].

For ease of later reference, we review below a few results on spaces of matrices. General references are the monographs by P. R. Halmos [16], J. R. Magnus and H. Neudecker [19], and R. Bhatia [7].

The vector space of n × m real matrices is denoted by M(n × m), while square matrices are denoted M(n) = M(n × n). It is a Euclidean space of dimension nm, and the vectorization mapping M(n × m) ∋ A ↦ vec(A) ∈ R^{nm} is an isometry, 〈A, B〉 = (vec(A))^*(vec(B)) = Tr(AB^*). The norm is denoted by ‖A‖ = √Tr(AA^*).

Symmetric matrices Sym(n) form a vector subspace of M(n) whose orthogonal complement is the vector space of anti-symmetric matrices Sym^⊥(n). We will find it convenient to use, with reference to symmetric matrices, the equivalent scalar product 〈A, B〉_2 = (1/2) Tr(AB^*), e.g., see Eq. (6).

The closed pointed cone of non-negative-definite symmetric matrices is Sym^+(n), and its interior, the open cone of positive-definite symmetric matrices, is Sym^{++}(n).

Given A, B ∈ Sym(n), the equation TAT = B is called a Riccati equation. If A ∈ Sym^{++}(n) and B ∈ Sym^+(n), then the equation TAT = B has a unique solution T ∈ Sym^+(n). In fact, from TAT = B it follows that A^{1/2}TA^{1/2}A^{1/2}TA^{1/2} = A^{1/2}BA^{1/2} and, in turn, A^{1/2}TA^{1/2} = (A^{1/2}BA^{1/2})^{1/2} because T ∈ Sym^+(n); hence the solution is

(8) T = A^{-1/2} (A^{1/2} B A^{1/2})^{1/2} A^{-1/2}.

Notice that det(T) = det(A)^{-1/2} det(B)^{1/2}, hence det(T) > 0 if det(B) > 0. In terms of random variables, if X ∼ N_n(0, A) and Y ∼ N_n(0, B), then T is the unique matrix in Sym^+(n) such that Y ∼ TX.

More compact closed-form solutions of the Riccati equation are available. Given A ∈ Sym^{++}(n) and B ∈ Sym^+(n), observe that AB = A^{1/2}(A^{1/2}BA^{1/2})A^{-1/2}. By similarity, we see that the eigenvalues of AB are non-negative, hence the square root

(9) (AB)^{1/2} = A^{1/2} (A^{1/2} B A^{1/2})^{1/2} A^{-1/2}

is well defined, see [7, Ex. 4.5.2]. Now we can re-write Eq. (8) as

(10) T = A^{-1} A^{1/2} (A^{1/2} B A^{1/2})^{1/2} A^{-1/2} = A^{-1} (AB)^{1/2}.

As AB = A(BA)A^{-1}, the eigenvalues of AB and BA are the same, so that the same argument used before shows that

(11) T = (BA)^{1/2} A^{-1}.
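The three closed forms (8), (10), and (11) of the Riccati solution can be compared numerically. A sketch assuming NumPy/SciPy; `sqrtm_spd` is a helper introduced here, not from the paper:

```python
import numpy as np
from scipy.linalg import sqrtm   # principal square root of a general matrix

def sqrtm_spd(S):
    # Symmetric square root via the spectral decomposition.
    w, Q = np.linalg.eigh(S)
    return (Q * np.sqrt(np.clip(w, 0.0, None))) @ Q.T

rng = np.random.default_rng(2)
n = 4
mk = lambda: (lambda M: M @ M.T + n * np.eye(n))(rng.standard_normal((n, n)))
A, B = mk(), mk()                 # A, B in Sym^{++}(n)

Ah = sqrtm_spd(A)                 # A^{1/2}
Ahi = np.linalg.inv(Ah)           # A^{-1/2}
T8 = Ahi @ sqrtm_spd(Ah @ B @ Ah) @ Ahi          # Eq. (8)
T10 = np.linalg.inv(A) @ np.real(sqrtm(A @ B))   # Eq. (10)
T11 = np.real(sqrtm(B @ A)) @ np.linalg.inv(A)   # Eq. (11)
```

All three expressions produce the same symmetric positive-definite solution of TAT = B.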

The square-of-the-matrix mapping σ: A ↦ A² is a bijection of Sym^{++}(n) onto itself. The derivative is d_X σ(A) = XA + AX and the derivative operator dσ(A) is invertible. An alternative notation for the derivative, which we will occasionally find convenient, is d_X σ(A) = dσ(A)[X].

For each assigned matrix V ∈ Sym(n), the matrix X = (dσ(A))^{-1} V is the unique solution X in the space Sym(n) to the Lyapunov equation

(12) V = XA + AX.

We will write X = L_A[V]. In the paper, we are going to use the obvious equations

(13) V = L_A[V] A + A L_A[V],
(14) X = L_A[XA + AX].

The Lyapunov operator itself can be seen as a derivative. In fact, the inverse of the mapping σ is σ^{-1}: Σ ↦ Σ^{1/2}. By the derivative-of-the-inverse rule,

(15) d_V σ^{-1}(Σ) = (dσ(σ^{-1}(Σ)))^{-1}[V] = L_A[V], with A = Σ^{1/2}.

If Σ is the dispersion matrix of a non-singular Gaussian distribution, then C = Σ^{-1} ∈ Sym^{++}(n) is the concentration matrix and represents an alternative and useful parameterization. From the Lyapunov equation V = XΣ + ΣX we obtain Σ^{-1}VΣ^{-1} = Σ^{-1}X + XΣ^{-1}, hence

(16) L_Σ[V] = L_{Σ^{-1}}[Σ^{-1} V Σ^{-1}] and L_{Σ^{-1}}[U] = L_Σ[Σ U Σ].

Page 6: Wasserstein Riemannian Geometry of Positive …...WASSERSTEIN RIEMANNIAN GEOMETRY OF POSITIVE-DEFINITE MATRICES 3 singular. This has been done by R. J. McCann [22, Example 1.7] and

WASSERSTEIN RIEMANNIAN GEOMETRY OF POSITIVE-DEFINITE MATRICES 5

In a similar way, we obtain another useful equation,

(17) L_Σ[V] = Σ^{-1/2} L_Σ[Σ^{1/2} V Σ^{1/2}] Σ^{-1/2}.
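The identities (16) and (17) can be checked numerically with SciPy's dedicated solver: `scipy.linalg.solve_continuous_lyapunov(a, q)` solves AX + XA^H = Q, which for symmetric A is exactly Eq. (12). The helpers `sqrtm_spd` and `L` below are introduced here for illustration:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def sqrtm_spd(S):
    # Symmetric square root via the spectral decomposition.
    w, Q = np.linalg.eigh(S)
    return (Q * np.sqrt(w)) @ Q.T

def L(A, V):
    # L_A[V]: the unique symmetric solution X of X A + A X = V, Eq. (12).
    return solve_continuous_lyapunov(A, V)

rng = np.random.default_rng(3)
n = 4
M = rng.standard_normal((n, n))
Sigma = M @ M.T + n * np.eye(n)
V = rng.standard_normal((n, n))
V = V + V.T                          # a symmetric right-hand side

Si = np.linalg.inv(Sigma)
Sh = sqrtm_spd(Sigma)
Shi = np.linalg.inv(Sh)

lhs = L(Sigma, V)
rhs16 = L(Si, Si @ V @ Si)                  # Eq. (16)
rhs17 = Shi @ L(Sigma, Sh @ V @ Sh) @ Shi   # Eq. (17)
```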

There is a relation between the Lyapunov equation and the trace. From XΣ + ΣX = V, we get Σ^{-1}XΣ + X = Σ^{-1}V, then

(18) Tr(L_Σ[V]) = (1/2) Tr(Σ^{-1} V).

We will need later the derivative of A ↦ L_A[V] for a fixed V. We compute it by differentiating the identity (13) in the direction U with respect to A. We have

(19) 0 = d_U L_A[V] A + L_A[V] U + U L_A[V] + A d_U L_A[V].

Hence d_U L_A[V] is the solution to the Lyapunov equation

(20) d_U L_A[V] A + A d_U L_A[V] = −(L_A[V] U + U L_A[V]),

so that

(21) d_U L_A[V] = −L_A[L_A[V] U + U L_A[V]].

We shall also need the second derivative of the function σ^{-1}: Σ ↦ Σ^{1/2}. From Eq. (15) we have dσ^{-1}(Σ)[V] = L_{Σ^{1/2}}[V]; then, from Eq. (21) and the chain rule,

(22) d²σ^{-1}(Σ)[U, V] = −L_{Σ^{1/2}}[L_{Σ^{1/2}}[V] L_{Σ^{1/2}}[U] + L_{Σ^{1/2}}[U] L_{Σ^{1/2}}[V]].

The Lyapunov equation is of crucial importance for us, as the linear operator L_A enters in the expression of the Riemannian metric of interest with respect to the standard scalar product, see Eq. (6). In fact, the numerical implementation of the scalar product W_Σ(U, V) requires the computation of the matrix L_Σ[U].

There are many ways to write down the closed-form solution to Eq. (12). The integral-form solution is

(23) L_A[V] = ∫_0^∞ e^{-tA} V e^{-tA} dt.

The vectorized form of the Lyapunov equation (12) is

(24) vec(V) = (A ⊗ I + I ⊗ A) vec(X),

and so

(25) L_A[V] = bind((A ⊗ I + I ⊗ A)^{-1} vec(V)),

where bind denotes the inverse of the vectorization mapping vec.

A solution based on the spectral decomposition A = OΛO^*, Λ = diag(λ_j : j = 1, …, n) and O^*O = I, is computed using Hadamard's matrix product [a_{ij}] ∘ [b_{ij}] = [a_{ij} b_{ij}] as

(26) O^* X O = [1/(λ_i + λ_j)]_{i,j=1}^n ∘ (O^* V O).
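The Kronecker form (25) and the spectral form (26) each take only a few lines and can be compared against a dedicated solver. A sketch assuming NumPy/SciPy; NumPy's row-major `reshape` plays the role of vec, which is legitimate here because the coefficient matrix A ⊗ I + I ⊗ A is the same under either stacking convention when A is symmetric:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

rng = np.random.default_rng(4)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # A in Sym^{++}(n)
V = rng.standard_normal((n, n))
V = V + V.T

# Eq. (25): vectorized Kronecker form.
I = np.eye(n)
X_kron = np.linalg.solve(np.kron(A, I) + np.kron(I, A),
                         V.reshape(-1)).reshape(n, n)

# Eq. (26): spectral form with the Hadamard (elementwise) product.
lam, O = np.linalg.eigh(A)
X_spec = O @ ((O.T @ V @ O) / np.add.outer(lam, lam)) @ O.T

# Reference: dedicated solver for A X + X A^H = V.
X_ref = solve_continuous_lyapunov(A, V)
```

The Kronecker form costs O(n⁶) if solved naively, which is why, as noted below, practical software uses specialized algorithms instead.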


These closed-form solutions are discussed in [7]. Efficient numerical solutions are not based on the closed forms above, but rely on specialized numerical algorithms, as discussed by E. L. Wachspress [28] and by V. Simoncini [26].

We now study the square-of-a-matrix operation when acting on general invertible matrices. We show in the next proposition that this operation is a submersion. We recall the definition, see [11, Ch. 8, Ex. 8–10] or [18, §II.2]. Let O be an open set of the Hilbert space H and f: O → N a smooth surjection from the Hilbert space H onto a manifold N, i.e., assume that for each A ∈ O the derivative at A, df(A): H → T_{f(A)}N, is surjective. In such a case, for each C ∈ N, the fiber f^{-1}(C) is a sub-manifold. Given a point A ∈ f^{-1}(C), a vector U ∈ H is called vertical if it is tangent to the manifold f^{-1}(C). Each such tangent vector U is the velocity at t = 0 of some smooth curve t ↦ γ(t) with γ(0) = A and γ̇(0) = U. Precisely, from f(γ(t)) = C for all t we derive the characterization of vertical vectors: we have df(A)[γ̇(0)] = 0, i.e., the tangent space at A is T_A f^{-1}(f(A)) = Ker(df(A)). The orthogonal space to the tangent space T_A f^{-1}(f(A)) is called the space of horizontal vectors at A,

(27) H_A = Ker(df(A))^⊥ = Im(df(A)^*).

Let us apply the general theory to our case of interest. We denote by GL(n) ⊂ M(n) the open set of invertible matrices; O(n) is the subgroup of GL(n) of orthogonal matrices; Sym^⊥(n) is the subspace of M(n) of anti-symmetric matrices; and Sym^*(n) = Sym(n) ∩ GL(n). We are going to show that the mapping

(28) σ: GL(n) ∋ A ↦ AA^* ∈ Sym^{++}(n)

is indeed a submersion. The fiber σ^{-1}(C) = {A ∈ GL(n) | AA^* = C} is characterized as follows. From AA^* = C = C^{1/2}C^{1/2}, we get

(29) C^{-1/2} A A^* C^{-1/2} = I,

hence there exists an orthogonal matrix R = A^*C^{-1/2} ∈ O(n) such that A = C^{1/2}R^*. Conversely, AA^* = C if A = C^{1/2}R^*.

Proposition 2.1. 1) For each given A ∈ GL(n) we have the orthogonal splitting

(30) M(n) = Sym(n)A ⊕ Sym^⊥(n)(A^*)^{-1}.

2) The mapping

(31) σ: GL(n) ∋ A ↦ AA^* ∈ Sym^{++}(n)

is a submersion with fibers

(32) σ^{-1}(C) = {C^{1/2}R | R ∈ O(n)},

and its differential at A is d_X σ(A) = XA^* + AX^*. The kernel of the differential is

(33) Ker(dσ(A)) = Sym^⊥(n)(A^*)^{-1}

and its orthogonal complement, H_A = Ker(dσ(A))^⊥, is

(34) H_A = Sym(n)A.

3) The orthogonal projection of X ∈ M(n) onto H_A is L_{AA^*}[XA^* + AX^*] A.


Proof. 1) Assume 〈B, CA〉 = 0 for all C ∈ Sym(n), that is, for all CA ∈ Sym(n)A. Then Tr(BA^*C) = 0, so that BA^* ∈ Sym^⊥(n), that is, B ∈ Sym^⊥(n)(A^*)^{-1}.

2) Consider the matrix A as a point in the fiber manifold σ^{-1}(AA^*). The derivative of σ at A, X ↦ d_X σ(A) = XA^* + AX^*, is surjective, because for each W ∈ Sym(n) we have dσ(A)[(1/2) W (A^*)^{-1}] = W; hence dσ(A) is surjective and the fiber σ^{-1}(AA^*) is a sub-manifold of GL(n). Let us compute the splitting of M(n) into the kernel of dσ(A) and its orthogonal, M(n) = Ker(dσ(A)) ⊕ H_A. The vector space tangent to σ^{-1}(AA^*) at A is the kernel of the derivative at A:

Ker(dσ(A)) = {X ∈ M(n) | XA^* + AX^* = 0} = {X ∈ M(n) | (AX^*)^* = −AX^*}.

Hence, X ∈ Ker(dσ(A)) ⇐⇒ AX^* ∈ Sym^⊥(n), namely Ker(dσ(A)) = Sym^⊥(n)(A^*)^{-1}. But we have just proved that this implies H_A = Sym(n)A.

3) Consider the decomposition of X into its horizontal and vertical parts, X = CA + D(A^*)^{-1}, with C ∈ Sym(n) and D ∈ Sym^⊥(n). By transposition, we get X^* = A^*C − A^{-1}D. From the previous two equations, we obtain XA^* = C(AA^*) + D and AX^* = (AA^*)C − D. The sum of these two equations is XA^* + AX^* = C(AA^*) + (AA^*)C, which is a Lyapunov equation, whence C = L_{AA^*}[XA^* + AX^*].
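The splitting of Prop. 2.1 is easy to verify numerically: the symmetric matrix C = L_{AA^*}[XA^* + AX^*] gives the horizontal component CA, and the remainder is vertical, i.e., it lies in the kernel of dσ(A). A sketch assuming NumPy/SciPy; the shift by 3I below only keeps the random A comfortably invertible:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

rng = np.random.default_rng(5)
n = 4
A = rng.standard_normal((n, n)) + 3.0 * np.eye(n)    # a generic element of GL(n)
X = rng.standard_normal((n, n))

S = A @ A.T                                          # AA^*
C = solve_continuous_lyapunov(S, X @ A.T + A @ X.T)  # C = L_{AA^*}[XA^* + AX^*]
H = C @ A                                            # horizontal part, in Sym(n)A
R = X - H                                            # vertical remainder
```

The remainder R satisfies dσ(A)[R] = RA^* + AR^* = 0 and is trace-orthogonal to the horizontal part.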

3. Wasserstein distance. The aim of this section is to present the derivation of the Wasserstein distance in the Gaussian case and the equation of the metric geodesic.

3.1. Block-Gaussian. Let us consider the case where the dispersion matrix Σ ∈ Sym^+(2n) is partitioned into n × n blocks; we consider random variables X and Y such that

(35) [X; Y] ∼ N_{2n}(µ, Σ), Σ = [ Σ_1 K ; K^* Σ_2 ],

so that K_{ij} = Cov(X_i, Y_j) for i, j = 1, …, n. It follows that K_{ij}² ≤ (Σ_1)_{ii} (Σ_2)_{jj}, which in turn implies the bounds

(36) ‖K‖² ≤ Tr(Σ_1) Tr(Σ_2) and sup_{ij} |K_{ij}| ≤ (1/2)(Tr(Σ_1) + Tr(Σ_2)).

Assigned the mean vectors µ_1, µ_2 ∈ R^n and the dispersion matrices Σ_1, Σ_2 ∈ Sym^+(n), we define the set of jointly Gaussian distributions with the given marginals to be

(37) G((µ_1, Σ_1), (µ_2, Σ_2)) = { N_{2n}([µ_1; µ_2], [ Σ_1 K ; K^* Σ_2 ]) },

and the Gini dissimilarity index to be

(38) G²((µ_1, Σ_1), (µ_2, Σ_2)) = inf { E[‖X − Y‖²] : [X; Y] ∼ γ, γ ∈ G((µ_1, Σ_1), (µ_2, Σ_2)) }
= ‖µ_1 − µ_2‖² + Tr(Σ_1) + Tr(Σ_2) − 2 sup_K { Tr(K) : [ Σ_1 K ; K^* Σ_2 ] ∈ Sym^+(2n) }.


Actually, because of the bound in Eq. (36), the set G((µ_1, Σ_1), (µ_2, Σ_2)) is compact and the inf is attained.

It is easy to verify that the relation

(39) G((µ_1, Σ_1), (µ_2, Σ_2)) = √min { E[‖X − Y‖²] : [X; Y] ∼ γ, γ ∈ G((µ_1, Σ_1), (µ_2, Σ_2)) }

defines a distance on G_n ≃ R^n × Sym^+(n).

Actually, the symmetry of G is clear, and the triangle inequality is easily obtained by considering Gaussian distributions on R^n × R^n × R^n with given marginals. To conclude, assume that the min is attained at some γ. Then

(40) 0 = G((µ_1, Σ_1), (µ_2, Σ_2)) ⇔ E_γ[‖X − Y‖²] = 0 ⇔ µ_1 = µ_2 and Σ_1 = Σ_2.

A further observation is that the distance G is homogeneous, i.e.,

(41) G((λµ_1, λ²Σ_1), (λµ_2, λ²Σ_2)) = λ G((µ_1, Σ_1), (µ_2, Σ_2)), λ ≥ 0.

3.2. Computing the quadratic dissimilarity index. Now we proceed to determine the Gini index for multivariate Gaussian distributions. We will present a proof as given by Dowson and Landau [12], but with some corrections.

Given Σ_1, Σ_2 ∈ Sym^+(n), the admissible K's in (38) belong to a compact set of M(n) because of the bound (36), so the maximum of the function 2 Tr(K) is attained. We are thus led to study:

(42) α(Σ_1, Σ_2) = max_{K ∈ M(n)} 2 Tr(K) subject to Σ = [ Σ_1 K ; K^* Σ_2 ] ∈ Sym^+(2n).

Similarly, the value of the min problem will be denoted by β(Σ_1, Σ_2). The next result provides a solution to problem (42).

Proposition 3.1. 1) Let Σ_1, Σ_2 ∈ Sym^+(n). Then

(43) α(Σ_1, Σ_2) = 2 Tr((Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2}) and β(Σ_1, Σ_2) = −α(Σ_1, Σ_2).

2) If moreover det(Σ_1) > 0, then

(44) α(Σ_1, Σ_2) = 2 Tr((Σ_1 Σ_2)^{1/2}).

Proof. 1) The lengthy proof is gathered in the final App. 9.1.
2) From Eq. (9) we obtain

(45) Tr((Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2}) = Tr(Σ_1^{1/2} (Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2} Σ_1^{-1/2}) = Tr((Σ_1 Σ_2)^{1/2}).


The following result enables us to easily find the exact lower and upper bounds of E[‖X − Y‖²].

Proposition 3.2. Let X, Y be multivariate Gaussian random variables taking values in R^n, having means µ_1 and µ_2 and dispersion matrices Σ_1 and Σ_2, respectively. Then

‖µ_1 − µ_2‖² + Tr(Σ_1 + Σ_2 − 2 (Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2}) ≤ E[‖X − Y‖²] ≤ ‖µ_1 − µ_2‖² + Tr(Σ_1 + Σ_2 + 2 (Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2}).

If det(Σ_1) ≠ 0, then the extremal values are attained at the joint distributions of

(46) [ X ; µ_2 ± T(X − µ_1) ] ∼ N_{2n}([µ_1; µ_2], [ Σ_1 ±TΣ_1 ; ±Σ_1T Σ_2 ]) = N_{2n}([µ_1; µ_2], [ Σ_1 ±(Σ_2Σ_1)^{1/2} ; ±(Σ_1Σ_2)^{1/2} Σ_2 ]),

respectively, where T ∈ Sym^+(n) is the solution to the Riccati equation TΣ_1T = Σ_2.

Proof. From Proposition 3.1 and Eq. (38), it follows that

min E[‖X − Y‖²] = ‖µ_1 − µ_2‖² + Tr(Σ_1) + Tr(Σ_2) − 2 Tr((Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2}),
max E[‖X − Y‖²] = ‖µ_1 − µ_2‖² + Tr(Σ_1) + Tr(Σ_2) + 2 Tr((Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2}).

To check the extremal points, it suffices to observe that, in view of relation (8),

(47) Tr(TΣ_1) = Tr(Σ_1^{-1/2} (Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2} Σ_1^{1/2}) = Tr((Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2}).

Hence it is verified that the extremal values are attained at Y = µ_2 ± T(X − µ_1). In the second form of the distribution we are using Eq. (10) and Eq. (11).

The fact that the Gini dissimilarity is a distance which makes R^n × Sym^+(n) a metric space is formally stated in the next proposition.

Proposition 3.3. The relation

(48) G((µ_1, Σ_1), (µ_2, Σ_2)) = √(‖µ_1 − µ_2‖² + Tr(Σ_1 + Σ_2 − 2 (Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2}))

defines a distance on R^n × Sym^+(n).

Let us now find the geodesic in the metric space (R^n × Sym^{++}(n), G).

Proposition 3.4. The geodesic from (µ_1, Σ_1) to (µ_2, Σ_2), with (µ_1, Σ_1), (µ_2, Σ_2) ∈ R^n × Sym^{++}(n), is the curve

(49) Γ: [0, 1] ∋ t ↦ (µ(t), Σ(t)),

where µ(t) = (1 − t)µ_1 + tµ_2 and

(50) Σ(t) = ((1 − t)I + tT) Σ_1 ((1 − t)I + tT) = (1 − t)²Σ_1 + t²Σ_2 + t(1 − t)((Σ_1Σ_2)^{1/2} + (Σ_2Σ_1)^{1/2}),

where T is the (unique) non-negative definite solution to the Riccati equation TΣ_1T = Σ_2.

Proof. Clearly, Γ(0) = (µ_1, Σ_1) and Γ(1) = (µ_2, Σ_2). Let us compute the distance between Γ(0) and the point

(51) Γ(t) = (µ(t), Σ(t)) = (µ_1 + t(µ_2 − µ_1), ((1 − t)I + tT) Σ_1 ((1 − t)I + tT)).

We have

Σ_1^{1/2} Σ(t) Σ_1^{1/2} = Σ_1^{1/2} ((1 − t)I + tT) Σ_1 ((1 − t)I + tT) Σ_1^{1/2}
= (Σ_1^{1/2} ((1 − t)I + tT) Σ_1^{1/2}) (Σ_1^{1/2} ((1 − t)I + tT) Σ_1^{1/2}),

so that

(52) (Σ_1^{1/2} Σ(t) Σ_1^{1/2})^{1/2} = Σ_1^{1/2} ((1 − t)I + tT) Σ_1^{1/2},

and hence

(53) Tr((Σ_1^{1/2} Σ(t) Σ_1^{1/2})^{1/2}) = Tr(Σ_1^{1/2} ((1 − t)I + tT) Σ_1^{1/2}) = (1 − t) Tr(Σ_1) + t Tr(TΣ_1).

We have

Tr(Σ(t)) = Tr(((1 − t)I + tT) Σ_1 ((1 − t)I + tT))
= (1 − t)² Tr(Σ_1) + 2t(1 − t) Tr(TΣ_1) + t² Tr(Σ_2).

Collecting all the above results,

Tr(Σ_1 + Σ(t) − 2 (Σ_1^{1/2} Σ(t) Σ_1^{1/2})^{1/2})
= Tr(Σ_1) + (1 − t)² Tr(Σ_1) + 2t(1 − t) Tr(TΣ_1) + t² Tr(Σ_2) − 2(1 − t) Tr(Σ_1) − 2t Tr(TΣ_1)
= t² Tr(Σ_1) + t² Tr(Σ_2) − 2t² Tr(TΣ_1) = t² Tr(Σ_1 + Σ_2 − 2 (Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2}).

In conclusion,

G(Γ(0), Γ(t)) = √(‖µ(0) − µ(t)‖² + Tr(Σ(0) + Σ(t) − 2 (Σ(0)^{1/2} Σ(t) Σ(0)^{1/2})^{1/2})) = t G(Γ(0), Γ(1)).
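The proportionality G(Γ(0), Γ(t)) = t G(Γ(0), Γ(1)) just proved can be confirmed numerically. A sketch for the 0-mean case, assuming NumPy; `wdist` and `sqrtm_spd` are helper names introduced here:

```python
import numpy as np

def sqrtm_spd(S):
    # Symmetric square root via the spectral decomposition.
    w, Q = np.linalg.eigh(S)
    return (Q * np.sqrt(np.clip(w, 0.0, None))) @ Q.T

def wdist(S1, S2):
    # Eq. (48) with mu1 = mu2 = 0.
    R = sqrtm_spd(S1)
    return np.sqrt(max(np.trace(S1 + S2 - 2.0 * sqrtm_spd(R @ S2 @ R)), 0.0))

rng = np.random.default_rng(6)
n = 4
mk = lambda: (lambda M: M @ M.T + n * np.eye(n))(rng.standard_normal((n, n)))
S1, S2 = mk(), mk()

# T solves T S1 T = S2, via Eq. (8); geodesic via Eq. (50).
R = sqrtm_spd(S1)
Ri = np.linalg.inv(R)
T = Ri @ sqrtm_spd(R @ S2 @ R) @ Ri
Sig = lambda t: ((1 - t) * np.eye(n) + t * T) @ S1 @ ((1 - t) * np.eye(n) + t * T)
```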


A few remarks are in order.
1. Clearly, Proposition 3.4 still holds under the sole assumption that Σ_1 is non-singular.
2. The definition of geodesic in metric spaces we use here is related to the Menger convexity property, see [25, p. 78]. A stronger definition requires proportionality of the distance between any pair of points on the curve, i.e.,

(54) G(Γ(s), Γ(t)) = |t − s| G(Γ(0), Γ(1)),

for s, t ∈ [0, 1]. It will be proved later that our geodesic in fact enjoys this stronger property.

3.3. Degenerate distributions. A few results formulated in the previous section required the dispersion matrices to be non-singular. It is interesting to analyze the opposite case, in which both matrices Σ_1 and Σ_2 are singular.

The simplest case occurs when the two subspaces Range Σ_1 and Range Σ_2 are orthogonal. Under any joint distribution of the random vector (X, Y) with marginals X ∼ N_n(0, Σ_1) and Y ∼ N_n(0, Σ_2), the values of X and Y lie in orthogonal subspaces, so that X^*Y = 0. Hence ‖X − Y‖² = ‖X‖² + ‖Y‖², and

(55) E‖X − Y‖² = E‖X‖² + E‖Y‖² = Tr(Σ_1) + Tr(Σ_2).

So any joint distribution of (X, Y) attains the optimal value √(Tr(Σ_1) + Tr(Σ_2)). If we now define X(t) = (1 − t)X + tY, then

(56) E[‖X − X(t)‖²] = E[t² ‖X − Y‖²] = t² [Tr(Σ_1) + Tr(Σ_2)],

and so we have the geodesic joining the two random vectors X and Y.

The previous example can be extended by taking two singular matrices

(57) Σ_1 = σ_1² vv^* and Σ_2 = σ_2² ww^*,

where v ≠ w ∈ R^n and ‖v‖ = ‖w‖ = 1. Clearly, Range Σ_1 ∩ Range Σ_2 = {0}, and these ranges are the one-dimensional spaces spanned by the vectors v and w, respectively (it is not restrictive to assume v^*w ≥ 0, too). By (48), it follows that the distance between the two matrices is

(58) G(Σ_1, Σ_2) = √(σ_1² + σ_2² − 2σ_1σ_2 v^*w).

Despite the singularity of these matrices, the point realizing the minimum in (38) can be found directly. It is given by the singular matrix in Sym^+(2n):

(59) [ σ_1²vv^* σ_1σ_2vw^* ; σ_1σ_2wv^* σ_2²ww^* ] = [ σ_1v ; σ_2w ] [ σ_1v^* σ_2w^* ].


4. Wasserstein Riemannian geometry. We have seen how to compute the geodesic of the distance given by the Gini dissimilarity. As the component R^n carries the standard Euclidean geometry, we focus on the geometry of the matrix part, i.e., we restrict our analysis to 0-mean distributions N_n(0, Σ). Moreover, we consider positive-definite dispersion matrices. The purpose of this section is to endow the open set Sym^{++}(n) with a structure of Riemannian manifold by deriving a metric whose distance is equal to the Wasserstein distance. The Riemannian metric is obtained by pushing forward the Euclidean geometry of square matrices to the space of dispersion matrices via the mapping σ: A ↦ AA^* = Σ, cf. [27].

In view of Prop. 2.1, the mapping σ: GL(n) → Sym^{++}(n) ⊂ M(n) is a submersion, the space Sym^⊥(n)(A^*)^{-1} is the space of vertical vectors at A, and the space H_A = Sym(n)A is the set of horizontal vectors at A.

We recall that a submersion f: GL(n) → Sym^{++}(n) is called Riemannian if for all A the differential restricted to horizontal vectors, df(A): H_A → T_{f(A)} Sym^{++}(n), is an isometry, i.e.,

(60) U, V ∈ H_A ⇒ 〈df(A)[U], df(A)[V]〉_{f(A)} = 〈U, V〉.

A linear isometry is always 1-to-1 and, if it is onto, we can write backward that

(61) X, Y ∈ T_{f(A)} Sym^{++}(n) ⇒ 〈X, Y〉_{f(A)} = 〈(df(A)|_{H_A})^{-1} X, (df(A)|_{H_A})^{-1} Y〉.

A Riemannian submersion preserves the length of horizontal curves. Let [0, 1] ∋ t ↦ γ(t) be a smooth curve with horizontal velocity and consider its image [0, 1] ∋ t ↦ f(γ(t)). The velocity of the image is t ↦ df(γ(t))[γ̇(t)] and its length is

(62) ∫_0^1 ‖df(γ(t))[γ̇(t)]‖_{f(γ(t))} dt = ∫_0^1 ‖γ̇(t)‖_H dt.

Fix a matrix A ∈ GL(n) such that σ(A) = AA^* = Σ, and consider the open convex cone

(63) H_A^{++} = Sym^{++}(n)A ⊂ H_A.

We denote by σ_A the restriction of σ to H_A^{++}.

Proposition 4.1. The mapping

(64) σ_A: H_A^{++} ∋ B ↦ BB^* = C ∈ Sym^{++}(n)

is globally invertible, the solution to σ_A(B) = C being

(65) B = Σ^{-1/2} (Σ^{1/2} C Σ^{1/2})^{1/2} Σ^{-1/2} A.

Proof. For each C ∈ Sym^{++}(n), the equation

(66) C = BB^* = (BA^{-1}A)(BA^{-1}A)^* = (BA^{-1}) Σ (BA^{-1})^*

is a Riccati equation for BA^{-1}. As B ∈ Sym^{++}(n)A, we have BA^{-1} ∈ Sym^{++}(n) and, by Eq. (8),

(67) BA^{-1} = Σ^{-1/2} (Σ^{1/2} C Σ^{1/2})^{1/2} Σ^{-1/2}

is the unique solution.
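Prop. 4.1 admits a direct numerical check: with A = Σ^{1/2} (any A with AA^* = Σ would do), the matrix B of Eq. (65) lies in Sym^{++}(n)A and satisfies BB^* = C. A sketch assuming NumPy; `sqrtm_spd` is a helper introduced here:

```python
import numpy as np

def sqrtm_spd(S):
    # Symmetric square root via the spectral decomposition.
    w, Q = np.linalg.eigh(S)
    return (Q * np.sqrt(np.clip(w, 0.0, None))) @ Q.T

rng = np.random.default_rng(9)
n = 4
mk = lambda: (lambda M: M @ M.T + n * np.eye(n))(rng.standard_normal((n, n)))
Sigma, C = mk(), mk()
A = sqrtm_spd(Sigma)                  # one point of the fiber over Sigma

# Eq. (65): the symmetric factor S = B A^{-1} solves S Sigma S = C, via Eq. (8).
R = sqrtm_spd(Sigma)
Ri = np.linalg.inv(R)
S = Ri @ sqrtm_spd(R @ C @ R) @ Ri
B = S @ A                             # B in Sym^{++}(n) A
```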


We come now to the main point, i.e., the construction of a metric based on the horizontal vectors at a point.

Proposition 4.2. The scalar product

(68) 〈U, V〉_Σ ≡ W_Σ(U, V) = Tr(L_Σ[U] Σ L_Σ[V]), U, V ∈ Sym(n),

is the isometric push-forward of the metric on Sym^*(n) by the mapping σ: A ↦ AA^* = Σ.

Proof. Let X ∈ M(n) and consider the decomposition X = X_V + X_H, with X_V vertical at A and X_H horizontal at A. Then dσ(A)[X] = dσ(A)[X_H], and the restriction of the derivative dσ(A) to the vector space H_A of horizontal vectors at A is 1-to-1 onto the tangent space of Sym^{++}(n) at AA^*, that is, Sym(n). For such a restriction we have, for each H ∈ H_A,

U = dσ(A)[H] = HA^* + AH^* = HA^{-1}AA^* + A(HA^{-1}A)^*
= (HA^{-1})AA^* + AA^*(HA^{-1})^* = (HA^{-1})AA^* + AA^*(HA^{-1}),

so that the inverse mapping of the restriction is given by

(69) H = (dσ(A)|_{H_A})^{-1}[U] = L_{AA^*}[U] A.

Let us push forward the inner product from H_A to T_{AA^*} Sym^{++}(n). From Eq. (69), we have

W_{AA^*}(U, V) = 〈(dσ(A)|_{H_A})^{-1}[U], (dσ(A)|_{H_A})^{-1}[V]〉
= 〈L_{AA^*}[U] A, L_{AA^*}[V] A〉 = Tr(L_{AA^*}[U] AA^* L_{AA^*}[V]),

which depends on AA^* = Σ only.

There is a tensorial form of the Wasserstein Riemannian metric, which is useful because it represents the bilinear form of the metric in the standard scalar product.

Proposition 4.3. It holds that

(70) W_Σ(U, V) = (1/2) Tr(L_Σ[U] V) = (1/2) 〈L_Σ[U], V〉 ≡ 〈L_Σ[U], V〉_2.

Proof. We have

(71) Tr(L_Σ[U] Σ L_Σ[V]) = Tr(L_Σ[V] Σ L_Σ[U]) = Tr(L_Σ[U] L_Σ[V] Σ),

and, taking the semi-sum of the first and the last term of the previous equation, we have

(72) W_Σ(U, V) = (1/2) Tr(L_Σ[U] [L_Σ[V] Σ + Σ L_Σ[V]]) = (1/2) Tr(L_Σ[U] V),

where the last equality follows from Eq. (13).
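Eqs. (68) and (70) give two routes to the same number, and the tensorial form needs only one Lyapunov solve. A numerical check assuming NumPy/SciPy:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

rng = np.random.default_rng(7)
n = 4
M = rng.standard_normal((n, n))
Sigma = M @ M.T + n * np.eye(n)
sym = lambda X: X + X.T
U, V = sym(rng.standard_normal((n, n))), sym(rng.standard_normal((n, n)))

LU = solve_continuous_lyapunov(Sigma, U)   # L_Sigma[U]
LV = solve_continuous_lyapunov(Sigma, V)   # L_Sigma[V]

w_def = np.trace(LU @ Sigma @ LV)          # Eq. (68)
w_tensor = 0.5 * np.trace(LU @ V)          # Eq. (70)
w_tensor_sym = 0.5 * np.trace(LV @ U)      # symmetry of the bilinear form
```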


We have seen that there is a metric geodesic for the Wasserstein distance connecting a pair of matrices Σ_1, Σ_2 ∈ Sym^{++}(n). We now show that the same curve is the Wasserstein Riemannian geodesic.

Recall that the symmetric matrix

(73) T = Σ_1^{-1/2} (Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2} Σ_1^{-1/2}

is the unique solution in Sym^+(n) of the Riccati equation TΣ_1T = Σ_2. Note further that

(74) det(T) = det(Σ_2)^{1/2} det(Σ_1)^{-1/2} > 0.

Proposition 4.4. The parametric curve

(75) t ↦ Σ(t) = ((1−t)I + tT)Σ_1((1−t)I + tT) ∈ Sym++ (n)

is the unique Wasserstein Riemannian geodesic from Σ_1 to Σ_2.

Proof. Set A_1 = Σ_1^{1/2} and A_2 = TΣ_1^{1/2}. We have

(76) A_1, A_2 ∈ H^{++}_{Σ_1^{1/2}} = Sym++ (n) Σ_1^{1/2} .

Actually, A_1 = IΣ_1^{1/2} and A_2 = TΣ_1^{1/2}, with I, T ∈ Sym++ (n). Therefore the straight line from A_1 to A_2,

(77) t ↦ A(t) = (1−t)A_1 + tA_2 ∈ H^{++}_{Σ_1^{1/2}} , t ∈ [0, 1] ,

stays in H^{++}_{Σ_1^{1/2}}. As a consequence, t ↦ A(t)A(t)^* = Σ(t) is a geodesic in the Wasserstein Riemannian metric connecting Σ_1 to Σ_2 = A(1)A(1)^*. This way, we get the curve

(78) t ↦ Σ(t) = A(t)A(t)^* = (I + t(T−I))Σ_1(I + t(T−I)) ∈ Sym++ (n) ,

which agrees with Eq. (49).
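Eqs. (73)–(75) translate directly into code. A sketch (helper names ours) that builds T from Σ_1, Σ_2 and checks the Riccati equation and the geodesic endpoints:

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(1)
n = 4

def random_spd(n):
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

S1, S2 = random_spd(n), random_spd(n)

# T = S1^{-1/2} (S1^{1/2} S2 S1^{1/2})^{1/2} S1^{-1/2}, Eq. (73)
R = np.real(sqrtm(S1))
Rinv = np.linalg.inv(R)
T = Rinv @ np.real(sqrtm(R @ S2 @ R)) @ Rinv

def geodesic(t):
    # Eq. (75): ((1-t)I + tT) S1 ((1-t)I + tT)
    M = (1 - t) * np.eye(n) + t * T
    return M @ S1 @ M
```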

Remark 4.5. The equivalence between the Riemannian distance and the Gini index (or Wasserstein distance) can be confirmed by computing the length in Sym (n) of the geodesic t ↦ A(t), t ∈ [0, 1]. As Ȧ(t) = (T − I)Σ_1^{1/2}, we have

‖Ȧ(t)‖ = √Tr(Ȧ(t)Ȧ(t)^*) = √Tr((T−I)Σ_1(T−I)) = √(Tr(Σ_1) + Tr(Σ_2) − Tr(TΣ_1) − Tr(Σ_1T)) = √(Tr(Σ_1) + Tr(Σ_2) − Tr((Σ_2Σ_1)^{1/2}) − Tr((Σ_1Σ_2)^{1/2})) .

In the last equality we have used Eq. (10) and Eq. (11).


5. Wasserstein Riemannian exponential. We aim now at reformulating the Riemannian geodesic in terms of the exponential map. The purpose is to write the geodesic arc passing through a given point with a given velocity at that point.

The velocity of the geodesic (75),

(79) Σ̇(t) = (T−I)Σ_1 + Σ_1(T−I) + 2t(T−I)Σ_1(T−I) ,

is affine in t. In particular, its initial velocity is

(80) Σ̇(0) = (T−I)Σ_1 + Σ_1(T−I) .

By inverting the Lyapunov equation (80), we get T − I = L_{Σ(0)}(Σ̇(0)). Therefore,

(81) Σ(t) = Σ(0) + t[(T−I)Σ(0) + Σ(0)(T−I)] + t^2(T−I)Σ(0)(T−I) = Σ(0) + tΣ̇(0) + t^2 L_{Σ(0)}(Σ̇(0)) Σ(0) L_{Σ(0)}(Σ̇(0)) .

We are so led to the following definition (see [1, p. 101–102]).

Definition 5.1. For any C ∈ Sym++ (n) and V ∈ Sym (n) ≅ T_C Sym++ (n), the Wasserstein Riemannian exponential is

(82) Exp_C(V) = C + V + L_C(V)CL_C(V) = (L_C(V) + I)C(L_C(V) + I) .

The next proposition concerns properties of the Riemannian exponential.

Proposition 5.2. 1) All geodesics emanating from a point C ∈ Sym++ (n) are of the form Σ(t) = Exp_C(tV), with t ∈ J_V, where J_V is the open interval about the origin

(83) J_V = {t ∈ R : I + tL_C(V) ∈ Sym++ (n)} .

2) The map V ↦ Exp_C(V), restricted to the open set

(84) Θ = {V ∈ Sym (n) : I + L_C(V) ∈ Sym++ (n)} ,

is a diffeomorphism of Θ onto Sym++ (n) with inverse

(85) Log_C(B) = (BC)^{1/2} + (CB)^{1/2} − 2C .

3) The derivative of the Riemannian exponential is

(86) d_X(V ↦ Exp_C(V)) = X + L_C(X)CL_C(V) + L_C(V)CL_C(X) .

Remark 5.3. Clearly, 0 ∈ J_V. Moreover, the interval J_V is unbounded from the right, that is, of the form J_V = (t_*, +∞) for some t_* < 0, provided V ∈ Sym+ (n). Likewise, J_V = (−∞, t_*) if V ∈ Sym− (n).

Analogously, Θ is an open set containing the origin, so V ↦ Exp_C(V) is a local diffeomorphism around the origin.

Since the geodesics are not defined for all values of the parameter t ∈ R, the Riemannian manifold Sym++ (n) is geodesically incomplete. Of course this is not a surprising fact: Sym++ (n) is not a complete metric space, and hence the Hopf-Rinow theorem implies that it is not geodesically complete (see [11]).


Proof of Prop. 5.2. 1) Let

(87) Σ(t) = Exp_C(tV) = C + tV + t^2 L_C(V)CL_C(V) , t ∈ J_V .

Clearly, Σ(0) = C and Σ̇(0) = V. Pick a scalar t̄ ∈ J_V and consider the two matrices Σ(0) and Σ(t̄) on the curve Σ. Introduce the new parametrization Σ̄(τ) = Σ(τt̄), so that Σ̄(0) = Σ(0) and Σ̄(1) = Σ(t̄). We have

(88) Σ̄(τ) = C + τt̄V + τ^2t̄^2 L_C(V)CL_C(V) .

By the relation (13) and setting T − I = t̄L_C(V), the equation above becomes

(89) Σ̄(τ) = C + τt̄L_C(V)C + τt̄CL_C(V) + τ^2t̄^2 L_C(V)CL_C(V)
(90) = C + τ(T−I)C + τC(T−I) + τ^2(T−I)C(T−I)
(91) = [(1−τ)I + τT] C [(1−τ)I + τT] .

On the other hand, t̄ ∈ J_V implies that T = I + t̄L_C(V) ∈ Sym++ (n). Moreover, Σ̄(1) = TCT = T Σ̄(0) T. In view of Eq. (75), the curve Σ(t) = Exp_C(tV), with t ∈ [0, t̄] (or [t̄, 0]), is a portion of the geodesic Σ(t), t ∈ J_V.

2) By Eq. (82) we have the Riccati equation

(92) Exp_C(V) = (I + L_C(V))C(I + L_C(V)) = B ,

whose solution is

(93) I + L_C(V) = C^{-1/2}(C^{1/2}BC^{1/2})^{1/2}C^{-1/2} ,

provided I + L_C(V) ∈ Sym++ (n). This is true on a sufficiently small neighborhood ‖V‖ < r of the origin. The inversion of the operator L_C(·) and Eq. (9) provide the desired formula for Log_C(B).

3) The derivative follows from a simple bilinear computation.
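As a numerical sanity check of Prop. 5.2 (a sketch under our naming): for V small enough that I + L_C(V) is positive definite, Log_C(Exp_C(V)) recovers V.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, sqrtm

rng = np.random.default_rng(2)
n = 4

A = rng.standard_normal((n, n))
C = A @ A.T + n * np.eye(n)
Vs = rng.standard_normal((n, n))
V = 0.1 * (Vs + Vs.T)                      # small symmetric tangent vector

LCV = solve_continuous_lyapunov(C, V)      # L_C(V): C X + X C = V
assert np.all(np.linalg.eigvalsh(np.eye(n) + LCV) > 0)   # V lies in the set Theta, Eq. (84)

ExpV = (np.eye(n) + LCV) @ C @ (np.eye(n) + LCV)         # Exp_C(V), Eq. (82)

def Log(C, B):
    # Eq. (85): (BC)^{1/2} + (CB)^{1/2} - 2C
    return np.real(sqrtm(B @ C)) + np.real(sqrtm(C @ B)) - 2 * C
```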

6. Natural gradient. First, let us study the second-order approximation of the matrix function in Eq. (2). For a given Σ ∈ Sym++ (n), let H ∈ Sym (n) be such that (Σ ± H) ∈ Sym++ (n). It follows that Σ + θH ∈ Sym++ (n) for all θ ∈ [−1, +1]. Consider the divergence in Eq. (2) with µ_1 = µ_2 = 0, Σ_1 = Σ, Σ_2 = Σ + θH, and the parametric function

(94) θ ↦ W^2(Σ, Σ + θH) = Tr(Σ + (Σ + θH) − 2(Σ^{1/2}(Σ + θH)Σ^{1/2})^{1/2}) = 2 Tr(Σ) + θ Tr(H) − 2 Tr((Σ^2 + θΣ^{1/2}HΣ^{1/2})^{1/2}) .

The first derivative is computed using Eq. (15) and Eq. (18):

(95) (d/dθ) W^2(Σ, Σ + θH) = Tr(H) − 2 Tr(L_{(Σ^2 + θΣ^{1/2}HΣ^{1/2})^{1/2}}[Σ^{1/2}HΣ^{1/2}]) = Tr(H) − Tr((Σ^2 + θΣ^{1/2}HΣ^{1/2})^{-1/2}(Σ^{1/2}HΣ^{1/2})) .


We observe that

(96) W^2(Σ, Σ + θH)|_{θ=0} = (d/dθ) W^2(Σ, Σ + θH)|_{θ=0} = 0 ,

and proceed to the computation of the second derivative. Differentiating the composite function, we find that

(97) (d^2/dθ^2) W^2(Σ, Σ + θH) = Tr((Σ^2 + θΣ^{1/2}HΣ^{1/2})^{-1/2} L_{(Σ^2 + θΣ^{1/2}HΣ^{1/2})^{1/2}}[Σ^{1/2}HΣ^{1/2}] (Σ^2 + θΣ^{1/2}HΣ^{1/2})^{-1/2} (Σ^{1/2}HΣ^{1/2})) ,

so that

(98) (d^2/dθ^2) W^2(Σ, Σ + θH)|_{θ=0} = Tr(Σ^{-1}L_Σ[Σ^{1/2}HΣ^{1/2}]Σ^{-1}Σ^{1/2}HΣ^{1/2}) = Tr(Σ^{-1/2}L_Σ[Σ^{1/2}HΣ^{1/2}]Σ^{-1/2}H) = Tr(L_Σ[H]H) ,

where we have used Eq. (17). In conclusion, we have shown that

(99) W^2(Σ, Σ + θH) = (θ^2/2) Tr(L_Σ[H]H) + o(θ^2) .

The equation above confirms that the Riemannian metric associated to the Wasserstein distance is the metric introduced above. Moreover, the solution of the problem

(100) max f(X + H) − f(X) subject to W^2(X, X + H) = ε (small and fixed)

identifies the direction of maximal increase of the function f as the natural gradient, according to the name introduced by Amari [3], i.e., the Riemannian gradient as defined below.

The Riemannian gradient is the gradient with respect to the scalar product of the metric. We denote by ∇ the gradient with respect to the scalar product ⟨·, ·⟩_2 and by grad the gradient with respect to the Riemannian metric. By Prop. 4.3 we know that W_Σ(X, Y) = ⟨L_Σ[X], Y⟩_2, hence for each smooth scalar field φ we have

(101) grad φ(Σ) = L_Σ^{-1}(∇φ(Σ)) = ∇φ(Σ)Σ + Σ∇φ(Σ) ,

where the second equality follows from Eq. (14). Conversely,

(102) L_Σ[grad φ(Σ)] = ∇φ(Σ) .
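In code, the natural gradient is two matrix products away from the Euclidean gradient, and Eq. (102) provides the consistency check. A sketch with the concrete choice φ(Σ) = (1/2) log det Σ, for which ∇φ(Σ) = Σ^{-1} with respect to ⟨·, ·⟩_2:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

rng = np.random.default_rng(3)
n = 4
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)

# Euclidean gradient of phi(Sigma) = (1/2) log det Sigma w.r.t. <.,.>_2
nabla = np.linalg.inv(Sigma)

# natural (Wasserstein) gradient, Eq. (101); for this phi it is identically 2I
grad = nabla @ Sigma + Sigma @ nabla
```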


The gradient flow of a smooth scalar field φ is the flow generated by the vector field γ ↦ (γ, −grad φ(γ)), that is, the flow of the differential equation

(103) γ̇(t) = −grad φ(γ(t)) = −(∇φ(γ(t))γ(t) + γ(t)∇φ(γ(t))) .

We discuss below two relevant examples of gradient flow. With reference to the full Gaussian distribution, one considers smooth functions defined on R^n × Sym++ (n). The first component of the gradient does not require a special treatment, as the Riemannian structure there is the Euclidean one. The full gradient will thus have two components:

(104) grad φ(µ, Σ) = (∇_1φ(µ, Σ), grad_2 φ(µ, Σ)) = (∇_1φ(µ, Σ), ∇_2φ(µ, Σ)Σ + Σ∇_2φ(µ, Σ)) .

The first example, in Sec. 6.1, refers to the gradient flow of the mean value of an objective function f : R^n → R. Its Euler scheme is used in optimization, see [1, Ch. 4] and [20]. In the second example, in Sec. 6.2, we discuss the gradient flow of the entropy function of a centered Gaussian.

6.1. Relaxed function. We call relaxation to the full Gaussian model of the objective function f : R^n → R the function

(105) φ(µ, Σ) = E[f(X)] , X ∼ N_n(µ, Σ) .

If we included the Dirac measures in the Gaussian model, then f(x) = φ(x, 0) and the function φ would actually be an extension of the given function. However, we consider only Σ ∈ Sym++ (n) in order to work with a function defined on our manifold.

There are two ways to calculate the expected value as a function of µ and Σ. Each of them leads to its own expression of the natural gradient.

The first one arises from the relation

(106) φ(µ, Σ) = E[f(Σ^{1/2}Z + µ)] , Z ∼ N_n(0, I) ,

which will lead to an equation for the gradient involving the derivatives of f. The second one uses

(107) φ(µ, Σ) = ∫ f(x)(2π)^{-n/2} det(Σ)^{-1/2} exp(−(1/2)(x − µ)^*Σ^{-1}(x − µ)) dx .

In this second case, the natural gradient is obtained by an equation not involving the gradient of the function f. Both forms have their own field of application.

Let us start with Case (106). Under standard conditions regarding differentiation under the expectation sign, we have

(108) ∇_1φ(µ, Σ) = E[∇f(Σ^{1/2}Z + µ)] = E[∇f(X)] .

By means of (15), it is straightforward to calculate the derivative d_U(Σ ↦ φ(µ, Σ)).


Note that ∇f is a column vector, so ∇^*f is a row vector. We have

d_Uφ(µ, Σ) = E[df(Σ^{1/2}Z + µ)[L_{Σ^{1/2}}(U)Z]] = E[∇^*f(Σ^{1/2}Z + µ)L_{Σ^{1/2}}(U)Z] = E[Tr(∇^*f(Σ^{1/2}Z + µ)L_{Σ^{1/2}}(U)Z)] .

Under symmetrization (and setting X = Σ^{1/2}Z + µ):

d_Uφ(µ, Σ) = (1/2) E[Tr(L_{Σ^{1/2}}(U)(Z∇^*f(X) + ∇f(X)Z^*))]
= ⟨U, E(Z∇^*f(X) + ∇f(X)Z^*)⟩_{Σ^{1/2}}
= (1/2) E[Tr(L_{Σ^{1/2}}(Z∇^*f(X) + ∇f(X)Z^*)U)]
= ⟨E L_{Σ^{1/2}}(Z∇^*f(X) + ∇f(X)Z^*), U⟩_2 .

It follows that

(109) ∇_2φ(µ, Σ) = E L_{Σ^{1/2}}(Z∇^*f(X) + ∇f(X)Z^*) .

Calculating the natural gradient:

grad_2 φ(µ, Σ) = Σ E L_{Σ^{1/2}}(Z∇^*f(X) + ∇f(X)Z^*) + E L_{Σ^{1/2}}(Z∇^*f(X) + ∇f(X)Z^*) Σ .

If we set Ξ = E[Z∇^*f(X) + ∇f(X)Z^*], the natural gradient admits the representation

(110) grad_2 φ(µ, Σ) = Σ L_{Σ^{1/2}}(Ξ) + L_{Σ^{1/2}}(Ξ) Σ .
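For a quadratic objective f(x) = x^*Bx one has ∇f(x) = 2Bx and, from the computation above, Ξ = 2(Σ^{1/2}B + BΣ^{1/2}), so L_{Σ^{1/2}}(Ξ) = 2B serves as ground truth for a Monte Carlo estimate of Eq. (110). A sketch; the sample size and tolerance are arbitrary choices of ours:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, sqrtm

rng = np.random.default_rng(4)
n = 3
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)
Bs = rng.standard_normal((n, n))
B = (Bs + Bs.T) / 2
mu = rng.standard_normal(n)
root = np.real(sqrtm(Sigma))                  # Sigma^{1/2}

# Monte Carlo estimate of Xi = E[Z grad f(X)^* + grad f(X) Z^*], X = Sigma^{1/2} Z + mu
N = 200_000
Z = rng.standard_normal((N, n))
X = Z @ root + mu                             # rows are samples of N(mu, Sigma)
G = 2 * X @ B                                 # rows are grad f(X) = 2 B X
Xi_mc = (Z.T @ G + G.T @ Z) / N

Xi_exact = 2 * (root @ B + B @ root)

# natural gradient, Eq. (110)
LXi = solve_continuous_lyapunov(root, Xi_mc)  # L_{Sigma^{1/2}}(Xi)
grad2 = Sigma @ LXi + LXi @ Sigma
```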

We move on to consider the second Case (107). Following the standard computation of the Fisher score and starting from the log-density p(x; µ, Σ) of N_n(µ, Σ), we have

(111) log p(x; µ, Σ) = −(n/2) log 2π − (1/2) log det Σ − (1/2)(x − µ)^*Σ^{-1}(x − µ) = −(n/2) log 2π − (1/2) log det Σ − (1/2) Tr(Σ^{-1}(x − µ)(x − µ)^*) .

Denoting the partial derivative d_u(µ ↦ log p(x; µ, Σ)) by d_u log p(x; µ, Σ), and d_U(Σ ↦ log p(x; µ, Σ)) by d_U log p(x; µ, Σ), we get:

d_u log p(x; µ, Σ) = (x − µ)^*Σ^{-1}u = ⟨Σ^{-1}(x − µ), u⟩

d_U log p(x; µ, Σ) = −(1/2) Tr(Σ^{-1}U) + (1/2) Tr(Σ^{-1}UΣ^{-1}(x − µ)(x − µ)^*) = (1/2) ⟨Σ^{-1}(x − µ)(x − µ)^*Σ^{-1} − Σ^{-1}, U⟩ = ⟨Σ^{-1}((x − µ)(x − µ)^* − Σ)Σ^{-1}, U⟩_2 .


Therefore,

d_uφ(µ, Σ) = ∫ f(x) d_u log p(x; µ, Σ) p(x; µ, Σ) dx = ⟨Σ^{-1} ∫ f(x)(x − µ) p(x; µ, Σ) dx, u⟩

and

d_Uφ(µ, Σ) = ∫ f(x) d_U log p(x; µ, Σ) p(x; µ, Σ) dx = ⟨Σ^{-1} ∫ f(x)((x − µ)(x − µ)^* − Σ) p(x; µ, Σ) dx Σ^{-1}, U⟩_2 .

Finally, thanks to (104), the natural gradient of φ(µ, Σ) is

∇_1φ(µ, Σ) = Σ^{-1} ∫ f(x)(x − µ) p(x; µ, Σ) dx

grad_2 φ(µ, Σ) = ∫ f(x)((x − µ)(x − µ)^* − Σ) p(x; µ, Σ) dx Σ^{-1} + Σ^{-1} ∫ f(x)((x − µ)(x − µ)^* − Σ) p(x; µ, Σ) dx .

6.2. Entropy flow. The flow of the entropy can be easily calculated from Eq. (111). We have

E(µ, Σ) = −∫ log p(x; µ, Σ) p(x; µ, Σ) dx = (n/2) log 2π + (1/2) log det Σ + (1/2) Tr(Σ^{-1}Σ) = (n/2)(log 2π + 1) + (1/2) log det Σ .

The entropy does not depend on µ, so that ∇_1E(µ, Σ) = 0. Moreover (see [19, §8.3]) we know that ∇E(Σ) = Σ^{-1}, so that

(112) grad E(Σ) = Σ^{-1}Σ + ΣΣ^{-1} = 2I .

The entropic flow is the solution to the equations

(113) µ̇(t) = 0 , Σ̇(t) + 2I = 0 ,

that is,

(114) µ(t) = µ(0) , Σ(t) = Σ(0) − 2tI .

The integral curve is defined for all t such that 2t < λ_*, λ_* being the minimum of the spectrum of Σ(0).
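Along the flow Σ(t) = Σ(0) − 2tI, the entropy decreases at the rate −W_Σ(grad E, grad E) = −Tr(Σ(t)^{-1}). A finite-difference sketch (step size and seed are ours):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
A = rng.standard_normal((n, n))
S0 = A @ A.T + n * np.eye(n)

def entropy(S):
    # E = n/2 (log 2 pi + 1) + 1/2 log det S
    return 0.5 * n * (np.log(2 * np.pi) + 1) + 0.5 * np.linalg.slogdet(S)[1]

def flow(t):
    # Eq. (114): Sigma(t) = Sigma(0) - 2 t I
    return S0 - 2 * t * np.eye(n)

lam_min = np.linalg.eigvalsh(S0)[0]
t = 0.05 * lam_min                 # well inside the interval of definition 2t < lam_min

h = 1e-6
dE_num = (entropy(flow(t + h)) - entropy(flow(t - h))) / (2 * h)
dE_exact = -np.trace(np.linalg.inv(flow(t)))
```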


7. Second order geometry. Recall that Sym++ (n) is an open set of the Hilbert space Sym (n) endowed with the scalar product ⟨X, Y⟩_2 = (1/2) Tr(XY). We have shown in Prop. 4.3 that the Wasserstein Riemannian metric W is expressed in terms of the scalar product of Sym (n) by

(115) W_Σ(X, Y) = ⟨X, Y⟩_Σ = ⟨L_Σ[X], Y⟩_2 ,

for each (Σ, X), (Σ, Y) in the trivial tangent bundle T Sym++ (n) ≅ Sym++ (n) × Sym (n). In the equation above, L : Sym++ (n) → L(Sym (n), Sym (n)) is the operator expressing the Wasserstein metric in the standard scalar product.

In the trivial chart, a smooth vector field X is a smooth mapping X : Sym++ (n) → Sym (n). The action of the vector field X on the scalar field f, that is, Xf, is expressed in the trivial chart by d_Xf, that is, the scalar field whose value at the point Σ is the derivative of f in the direction X(Σ). Similarly, we denote by d_YX the vector field whose value at the point Σ is the derivative at Σ of X in the direction Y(Σ). The Lie bracket [X, Y] of two smooth vector fields X, Y is expressed by d_XY − d_YX.

While we prefer to express our computations in matrix algebra, in some cases it is useful to use a vector basis. We discuss below a field of vector bases of particular interest for us.

The set of symmetric matrices

(116) E^{p,q} = e_pe_q^* + e_qe_p^* , p, q = 1, . . . , n ,

spans the vector space Sym (n). There are repeated elements, and a unique enumeration is given by the set A of subsets of {1, . . . , n} with 1 or 2 elements. This generating set is related to the matrix product by the equation

(117) E^{p,q}E^{r,s} + E^{r,s}E^{p,q} = δ_{q,r}E^{p,s} + δ_{q,s}E^{p,r} + δ_{p,r}E^{q,s} + δ_{p,s}E^{q,r} ,

where δ is the Kronecker symbol. In particular, we can take the trace of the equation above to get

(118) ⟨E^{p,q}, E^{r,s}⟩ = δ_{p,r}δ_{q,s} + δ_{p,s}δ_{q,r} ,

which in turn implies

(119) ⟨E^{p,q}, E^{r,s}⟩ = 0 if {p, q} ≠ {r, s}; 1 if {p, q} = {r, s} and p ≠ q; 2 if {p, q} = {r, s} and p = q.

We can select an orthogonal basis by canceling the repeated elements. In the sequel, we denote by (E^α)_{α∈A} the vector basis above, properly normalized to obtain an orthonormal basis. We do not show the normalizing constants, in order to simplify the notation.
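The generating set (116) and the relations (117)–(119) can be materialized directly (a sketch; the enumeration by ordered pairs is our choice):

```python
import numpy as np

n = 3
I = np.eye(n)

def E(p, q):
    # E^{p,q} = e_p e_q^* + e_q e_p^*, Eq. (116)
    M = np.outer(I[p], I[q])
    return M + M.T

# unique enumeration: subsets {p, q}, listed as pairs with p <= q
pairs = [(p, q) for p in range(n) for q in range(p, n)]

# Gram matrix in the scalar product <X, Y>_2 = (1/2) Tr(XY)
G = np.array([[0.5 * np.trace(E(p, q) @ E(r, s)) for (r, s) in pairs]
              for (p, q) in pairs])

# product identity (117), checked over all index quadruples
d = lambda a, b: 1.0 if a == b else 0.0
identity_ok = all(
    np.allclose(E(p, q) @ E(r, s) + E(r, s) @ E(p, q),
                d(q, r) * E(p, s) + d(q, s) * E(p, r)
                + d(p, r) * E(q, s) + d(p, s) * E(q, r))
    for p in range(n) for q in range(n) for r in range(n) for s in range(n))
```

The Gram matrix is diagonal with entries 1 (for p ≠ q) and 2 (for p = q), as in (119).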

For each Σ ∈ Sym++ (n) the sequence

(120) E_α(Σ) = E^αΣ + ΣE^α , α ∈ A ,

is a vector basis of Sym (n) ≅ T_Σ Sym++ (n), because it is the image of a vector basis under a linear mapping which is onto. We will call such a sequence of vector fields the (principal) moving frame. Notice the following properties: E_α(Σ) = d_{E^α}Σ^2; L_Σ[E_α(Σ)] = E^α; E_α(I) = 2E^α.

At a generic Σ, we can express each E_α in the orthonormal basis (E^β)_β as

(121) E_α(Σ) = ∑_β 2g_{α,β}(Σ)E^β , g_{α,β}(Σ) = Tr(E^αΣE^β) ,

which is verified by right-multiplying Eq. (120) by E^γ and taking the trace of the resulting equation. As

(122) W_Σ(E_α, E_β) = Tr(L_Σ[E_α(Σ)]ΣL_Σ[E_β(Σ)]) = Tr(E^αΣE^β) ,

the matrix [g_{α,β}]_{α,β} is the expression of the Riemannian metric in the moving frame. In fact, if X, Y are vector fields expressed in the moving frame, X = ∑_α x_αE_α, Y = ∑_β y_βE_β, then

(123) W_Σ(X, Y) = ∑_{α,β} x_α(Σ)y_β(Σ)g_{α,β}(Σ) = Tr((∑_α x_α(Σ)E^α) Σ (∑_β y_β(Σ)E^β)) .

This expression of the scalar product is to be compared with that used in [27].

Any vector field X has two representations, one with respect to the moving frame (E_α)_α and another one with respect to the basis (E^α)_α. The two representations are related to each other as follows. We have

(124) X = ∑_α x_αE_α = ∑_α x_α ∑_β 2g_{α,β}E^β = ∑_β (∑_α 2x_αg_{α,β}) E^β ,

so that

(125) ⟨X, E^γ⟩_2 = (1/2) Tr(XE^γ) = ∑_β (∑_α x_αg_{α,β}) Tr(E^βE^γ) = ∑_α x_αg_{α,γ} ,

hence, by applying the inverse matrix [g^{α,β}(Σ)] = [g_{α,β}(Σ)]^{-1}, we have

(126) x_α = ∑_γ g^{α,γ} ⟨X, E^γ⟩_2 .

For example, L_Σ[V] = ∑_α ℓ^α_Σ(V)E_α(Σ), with

(127) ℓ^α_Σ(V) = ∑_γ g^{α,γ}(Σ) ⟨L_Σ[V], E^γ⟩_2 = W_Σ(V, ∑_γ g^{α,γ}E^γ) .

The next example is related to the discussion of covariant derivatives to follow. Consider a third vector field Z = ∑_γ z_γE_γ. As the mapping Σ ↦ g_{α,β}(Σ) is the restriction of a linear mapping on Sym (n), we have

(128) d_Z W_Σ(X, Y) = ∑_{α,β} ((d_Zx_α(Σ)y_β(Σ) + x_α(Σ)d_Zy_β(Σ))g_{α,β}(Σ) + x_α(Σ)y_β(Σ) Tr(E^αZ(Σ)E^β)) = ∑_{α,β} (d_Zx_α(Σ)y_β(Σ) + x_α(Σ)d_Zy_β(Σ))g_{α,β}(Σ) + ∑_{α,β,γ} x_α(Σ)y_β(Σ)z_γ(Σ) Tr(E^αE_γ(Σ)E^β) ,

where d_Zx_α(Σ) and d_Zy_β(Σ) can be expanded with respect to the components z_γ of Z.

For a pair of vector fields X, Y, we denote by D_YX the action of a covariant derivative, namely a bilinear operator satisfying, for each scalar field f,
(CD1) D_{fY}X = fD_YX,
(CD2) D_Y(fX) = (d_Yf)X + fD_YX,
see e.g. [11, Sect. 3] or [18, Ch. 8.4].

A convenient way to express a covariant derivative in the moving frame (120) is to define the Christoffel symbols

(129) D_{E_α}E_β = ∑_γ Γ^γ_{α,β}E_γ .

For X = ∑_α x_αE_α and Y = ∑_β y_βE_β, by using (CD1), (CD2), and Eq. (121), we obtain

(130) D_XY = ∑_{α,β} x_αD_{E_α}(y_βE_β) = ∑_{α,β} x_α((d_{E_α}y_β)E_β + y_β(D_{E_α}E_β)) = ∑_{α,γ} x_αd_{E_α}y_γ E_γ + ∑_{α,β,γ} x_αy_βΓ^γ_{α,β}E_γ = ∑_γ (∑_α x_αd_{E_α}y_γ + ∑_{α,β} x_αy_βΓ^γ_{α,β}) E_γ .

The scalar product of D_XY with Z = ∑_δ z_δE_δ is

(131) ⟨D_XY, Z⟩_Σ = ∑_{α,γ,δ} x_α(d_{E_α}y_γ + ∑_β y_βΓ^γ_{α,β}) g_{δ,γ}z_δ .

In Sec. 7.1 below we compute the Levi-Civita (covariant) derivative of a vector field, that is, the unique covariant derivative D which, for all vector fields X, Y, Z, is:
(LC1) compatible with the metric, d_XW(Y, Z) = W(D_XY, Z) + W(Y, D_XZ);
(LC2) torsion-free, D_XY − D_YX = [X, Y] = d_XY − d_YX.

In Sec. 7.1.2 we compute the Riemannian parallel transport of vectors. In Sec. 7.2 we compute the Hessian of a scalar field.

All the computations are straightforward in matrix form. Lang [18, Ch. 8.4], in his statement MD3, provides an equation to compute the Levi-Civita derivative without actually introducing any basis of the vector space of coordinates. In this formalism we use the Christoffel symbol, that is, minus the spray in Lang's language. This allows us to write the differential equation of the parallel transport without introducing a vector basis. Given the Levi-Civita derivative, the Riemannian Hessian of a scalar field f is given by the bilinear operator D_Xd_Yf. We also provide some of the expressions in the moving frame.

Page 25: Wasserstein Riemannian Geometry of Positive …...WASSERSTEIN RIEMANNIAN GEOMETRY OF POSITIVE-DEFINITE MATRICES 3 singular. This has been done by R. J. McCann [22, Example 1.7] and

24 L. MALAGO, L. MONTRUCCHIO, AND G. PISTONE

7.1. Levi-Civita derivative. In order to have a compact notation, it will be convenient to write the symmetrization of a matrix A ∈ M(n) as {A}_S = (1/2)(A + A^*). If either A or B is symmetric, then Tr({A}_S B) = Tr(AB). We denote by X, Y, Z smooth vector fields on Sym++ (n). We shall repeatedly use the expression for the derivative of the vector field Σ ↦ L_Σ[X]. In view of Eq. (21) and our notation for the symmetrization, it holds

(132) d_Y L_Σ[X] = −2L_Σ[{L_Σ[X]Y}_S] .

Proposition 7.1. The Levi-Civita derivative D_XY is implicitly defined by

(133) ⟨D_XY, Z⟩_Σ = ⟨d_XY, Z⟩_Σ + ⟨X, {L_Σ[Y]Z}_S⟩_Σ − ⟨X, {L_Σ[Z]Y}_S⟩_Σ − ⟨Y, {L_Σ[Z]X}_S⟩_Σ = ⟨d_XY, Z⟩_Σ + (1/2) Tr(L_Σ[X]ZL_Σ[Y]) − (1/2) Tr(L_Σ[X]Y L_Σ[Z]) − (1/2) Tr(L_Σ[Y]XL_Σ[Z]) ,

while the Levi-Civita derivative itself is given by

(134) D_XY = d_XY − {L_Σ[X]Y + L_Σ[Y]X}_S + {ΣL_Σ[X]L_Σ[Y] + ΣL_Σ[Y]L_Σ[X]}_S .

Proof. In our case, Eq. MD3 of [18, p. 205] becomes

(135) 2⟨D_XY, L_Σ[Z]⟩_2 = 2⟨d_XY, L_Σ[Z]⟩_2 + ⟨Y, d_XL_Σ[Z]⟩_2 + ⟨X, d_Y L_Σ[Z]⟩_2 − ⟨X, d_ZL_Σ[Y]⟩_2 .

By Eq. (21) we have

(136) ⟨Y, d_XL_Σ[Z]⟩_2 = −2⟨Y, L_Σ[{L_Σ[Z]X}_S]⟩_2 = −2⟨Y, {L_Σ[Z]X}_S⟩_Σ ,

and, analogously,

(137) ⟨X, d_Y L_Σ[Z]⟩_2 = −2⟨X, {L_Σ[Z]Y}_S⟩_Σ , ⟨X, d_ZL_Σ[Y]⟩_2 = −2⟨X, {L_Σ[Y]Z}_S⟩_Σ .

This way, Eq. (135) becomes the first part of Eq. (133). The second part of Eq. (133) is then easily obtained. For instance,

(138) ⟨X, {L_Σ[Y]Z}_S⟩_Σ = (1/2) Tr(L_Σ[X]{ZL_Σ[Y]}_S) = (1/2) Tr(L_Σ[X]ZL_Σ[Y]) .

Regarding the explicit formula (134) of the Levi-Civita derivative, observe that

(139) (1/2) Tr(L_Σ[X]ZL_Σ[Y]) = (1/2) Tr(L_Σ[Y]L_Σ[X]Z) = (1/2) Tr({L_Σ[X]L_Σ[Y]}_S Z) = (1/2) Tr(L_Σ[{L_Σ[X]L_Σ[Y]}_S Σ + Σ{L_Σ[X]L_Σ[Y]}_S]Z) = ⟨{L_Σ[X]L_Σ[Y]}_S Σ + Σ{L_Σ[X]L_Σ[Y]}_S, Z⟩_Σ = ⟨{ΣL_Σ[X]L_Σ[Y]}_S + {ΣL_Σ[Y]L_Σ[X]}_S, Z⟩_Σ = ⟨{ΣL_Σ[X]L_Σ[Y] + ΣL_Σ[Y]L_Σ[X]}_S, Z⟩_Σ .


Moreover,

(140) (1/2) Tr(L_Σ[X]Y L_Σ[Z]) + (1/2) Tr(L_Σ[Y]XL_Σ[Z]) = (1/2) Tr({L_Σ[X]Y + L_Σ[Y]X}_S L_Σ[Z]) = ⟨{L_Σ[X]Y + L_Σ[Y]X}_S, Z⟩_Σ .

Therefore, Eq. (133) can be written as

(141) ⟨D_XY, Z⟩_Σ = ⟨d_XY − {L_Σ[X]Y + L_Σ[Y]X}_S + {ΣL_Σ[X]L_Σ[Y] + ΣL_Σ[Y]L_Σ[X]}_S, Z⟩_Σ ,

and the desired result follows.

The equation we have used for computing the Levi-Civita derivative is proved in the given reference. However, in order to have a self-contained presentation, we proceed by checking Eq. (133) against conditions (LC1) and (LC2). In fact, we have

(142) ⟨D_XY, Z⟩_Σ + ⟨Y, D_XZ⟩_Σ = ⟨d_XY, Z⟩_Σ + ⟨Y, d_XZ⟩_Σ − Tr(L_Σ[Y]XL_Σ[Z]) .

On the other hand, if we compute the derivative of W(Y, Z) at Σ, we obtain

(143) d_X⟨Y, Z⟩_Σ = d_X Tr(L_Σ[Y]ΣL_Σ[Z]) = Tr(d_XL_Σ[Y] ΣL_Σ[Z]) + Tr(L_Σ[Y]Σ d_XL_Σ[Z]) + Tr(L_Σ[Y]XL_Σ[Z]) .

Since

(144) d_XL_Σ[Y] = L_Σ[d_XY] − 2L_Σ[{L_Σ[Y]X}_S] ,

we have

(145) Tr(d_XL_Σ[Y] ΣL_Σ[Z]) = ⟨d_XY, Z⟩_Σ − 2⟨{L_Σ[Y]X}_S, Z⟩_Σ = ⟨d_XY, Z⟩_Σ − Tr({L_Σ[Y]X}_S L_Σ[Z]) = ⟨d_XY, Z⟩_Σ − Tr(L_Σ[Y]XL_Σ[Z]) ,

and, in a similar way, we can write

(146) Tr(L_Σ[Y]Σ d_XL_Σ[Z]) = ⟨Y, d_XZ⟩_Σ − Tr(L_Σ[Z]XL_Σ[Y]) .

Substitution of the equations above in Eq. (143) proves (LC1). Condition (LC2) is checked by

(147) ⟨D_XY − D_YX, Z⟩_Σ = ⟨d_XY − d_YX, Z⟩_Σ = ⟨[X, Y], Z⟩_Σ .


7.1.1. Levi-Civita derivative in the moving frame. Let us express the Levi-Civita derivative in the moving frame (120). Consider the vector field X(Σ) = E_α(Σ) = E^αΣ + ΣE^α and the vector field Y(Σ) = E_β(Σ) = E^βΣ + ΣE^β. By Eq. (134), we have

(148) D_{E_α}E_β = d_{E_α}E_β − {L_Σ[E_α]E_β + L_Σ[E_β]E_α}_S + {ΣL_Σ[E_α]L_Σ[E_β] + ΣL_Σ[E_β]L_Σ[E_α]}_S .

We are going to compute one by one the three terms in the equation above. The first term in Eq. (148) is

(149) d_{E_α}E_β = d_{(E^αΣ+ΣE^α)}(E^βΣ + ΣE^β) = E^β(E^αΣ + ΣE^α) + (E^αΣ + ΣE^α)E^β = E^βE^αΣ + E^βΣE^α + E^αΣE^β + ΣE^αE^β .

The second one is

(150) −{L_Σ[E_α]E_β + L_Σ[E_β]E_α}_S = −{E^α(E^βΣ + ΣE^β) + E^β(E^αΣ + ΣE^α)}_S = −{E^αE^βΣ + E^αΣE^β + E^βE^αΣ + E^βΣE^α}_S = −(1/2)(E^αE^βΣ + E^βE^αΣ + ΣE^βE^α + ΣE^αE^β) − (E^αΣE^β + E^βΣE^α) .

The sum is

(151) (1/2)(E^βE^αΣ + ΣE^αE^β) − (1/2)(E^αE^βΣ + ΣE^βE^α) .

The third term is

(152) {ΣL_Σ[E_α]L_Σ[E_β] + ΣL_Σ[E_β]L_Σ[E_α]}_S = {ΣE^αE^β + ΣE^βE^α}_S = (1/2)(ΣE^αE^β + ΣE^βE^α + E^βE^αΣ + E^αE^βΣ) .

In conclusion,

(153) D_{E_α}E_β = E^βE^αΣ + ΣE^αE^β .

The computation of the Christoffel symbols, ∑_γ Γ^γ_{α,β}E_γ = D_{E_α}E_β, would require the solution of the equations

(154) E^βE^αΣ + ΣE^αE^β = ∑_γ Γ^γ_{α,β}(Σ)(E^γΣ + ΣE^γ) .

We do not do that here.
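Formula (153) can be cross-checked against the general expression (134) evaluated on the fields E_α(Σ) = E^αΣ + ΣE^α; a numerical sketch (names ours):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

rng = np.random.default_rng(6)
n = 3
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)
I = np.eye(n)

def L(S, U):
    # L_S(U): S X + X S = U
    return solve_continuous_lyapunov(S, U)

def sym_part(M):
    return (M + M.T) / 2

def basis(p, q):
    M = np.outer(I[p], I[q])
    return M + M.T

Ea, Eb = basis(0, 1), basis(1, 2)
Xa = Ea @ Sigma + Sigma @ Ea          # value of the field E_alpha at Sigma
Xb = Eb @ Sigma + Sigma @ Eb          # value of the field E_beta at Sigma

# derivative of S -> Eb S + S Eb in the direction Xa
dXY = Eb @ Xa + Xa @ Eb

LX, LY = L(Sigma, Xa), L(Sigma, Xb)   # equal to Ea and Eb, respectively

# Eq. (134)
D = dXY - sym_part(LX @ Xb + LY @ Xa) + sym_part(Sigma @ LX @ LY + Sigma @ LY @ LX)

# Eq. (153)
D_formula = Eb @ Ea @ Sigma + Sigma @ Ea @ Eb
```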


Instead, let us now take X = x_αE_α and Y = y_βE_β. From the properties (CD1) and (CD2), we have

(155) D_{x_αE_α}(y_βE_β) = x_αD_{E_α}(y_βE_β) = x_α(d_{E_α}y_β E_β + y_βD_{E_α}E_β) = x_αd_{E_α}y_β E_β + x_αy_β(E^βE^αΣ + ΣE^αE^β) .

Finally, for general X and Y,

(156) D_XY = ∑_{α,β} x_αd_{E_α}y_β E_β + ∑_{α,β} x_αy_β(E^βE^αΣ + ΣE^αE^β) .

7.1.2. Parallel transport. The expression of the Levi-Civita derivative in Eq. (133) can be rewritten as

(157) ⟨D_XY, Z⟩_Σ = ⟨d_XY, Z⟩_Σ + ⟨Γ(Σ; X, Y), Z⟩_Σ ,

where Γ(Σ; ·, ·) is the symmetric tensor field defined by

(158) ⟨Γ(Σ; X, Y), Z⟩_Σ = (1/2) Tr(L_Σ[X]ZL_Σ[Y]) − (1/2) Tr(L_Σ[X]Y L_Σ[Z]) − (1/2) Tr(L_Σ[Y]XL_Σ[Z]) = (1/2) Tr(L_Σ[Y]L_Σ[X]Z) − (1/2) Tr((L_Σ[X]Y + L_Σ[Y]X)L_Σ[Z]) = (1/2) Tr(L_Σ[Y]L_Σ[X](L_Σ[Z]Σ + ΣL_Σ[Z])) − (1/2) Tr((L_Σ[X]Y + L_Σ[Y]X)L_Σ[Z]) = (1/2) Tr((ΣL_Σ[Y]L_Σ[X] + L_Σ[Y]L_Σ[X]Σ − L_Σ[X]Y − L_Σ[Y]X)L_Σ[Z]) = ⟨{ΣL_Σ[Y]L_Σ[X] + L_Σ[Y]L_Σ[X]Σ − L_Σ[X]Y − L_Σ[Y]X}_S, Z⟩_Σ .

We have

(159) Γ(Σ; X, Y) = {ΣL_Σ[Y]L_Σ[X] + L_Σ[Y]L_Σ[X]Σ − L_Σ[X]Y − L_Σ[Y]X}_S ,

and, on the diagonal,

(160) Γ(Σ; X, X) = ΣL_Σ[X]L_Σ[X] + L_Σ[X]L_Σ[X]Σ − L_Σ[X]X − XL_Σ[X] .

Γ(Σ; X, Y) is the expression in the trivial chart of the Christoffel symbol of the Levi-Civita derivative, as in [17]. In [18], −Γ is called the spray of the Levi-Civita derivative. Given the Christoffel symbol, we can write the linear differential equation of the parallel transport along a curve t ↦ Σ(t) as

(161) U̇_V(t) + Γ(Σ(t); Σ̇(t), U_V(t)) = 0 , U_V(0) = V ,

see [18, VIII, §3 and §4]. Recall that the parallel transport for the Levi-Civita derivative is isometric.
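A sketch of the transport equation (161) along the geodesic of Prop. 4.4, integrated with SciPy's initial-value solver; the isometry of the Levi-Civita transport supplies the test (tolerances ours):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, sqrtm
from scipy.integrate import solve_ivp

rng = np.random.default_rng(7)
n = 3

def random_spd(n):
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

def L(S, U):
    return solve_continuous_lyapunov(S, U)   # S X + X S = U

def W(S, U, V):
    return 0.5 * np.trace(L(S, U) @ V)       # metric, Eq. (70)

def sym_part(M):
    return (M + M.T) / 2

def Gamma(S, X, Y):
    # Christoffel operator, Eq. (159)
    LX, LY = L(S, X), L(S, Y)
    return sym_part(S @ LY @ LX + LY @ LX @ S - LX @ Y - LY @ X)

S1, S2 = random_spd(n), random_spd(n)
R = np.real(sqrtm(S1))
Rinv = np.linalg.inv(R)
T = Rinv @ np.real(sqrtm(R @ S2 @ R)) @ Rinv
D = T - np.eye(n)

def Sigma(t):                                # geodesic, Eq. (75)
    M = np.eye(n) + t * D
    return M @ S1 @ M

def Sigma_dot(t):
    M = np.eye(n) + t * D
    return D @ S1 @ M + M @ S1 @ D

V0s = rng.standard_normal((n, n))
V0 = (V0s + V0s.T) / 2                       # vector to be transported

def rhs(t, u):
    # Eq. (161): dU/dt = -Gamma(Sigma(t); Sigma_dot(t), U)
    U = u.reshape(n, n)
    return -Gamma(Sigma(t), Sigma_dot(t), U).ravel()

sol = solve_ivp(rhs, (0.0, 1.0), V0.ravel(), rtol=1e-10, atol=1e-12)
U1 = sol.y[:, -1].reshape(n, n)

w0 = W(Sigma(0.0), V0, V0)
w1 = W(Sigma(1.0), U1, U1)
```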


We do not discuss here the representation of Eq. (161) in the moving frame. We limit ourselves to mention that the action of the Christoffel symbol on vector fields expressed in the moving frame can be computed from

(162) Γ(Σ; E_α, E_β) = {ΣL_Σ[E_β]L_Σ[E_α] + L_Σ[E_β]L_Σ[E_α]Σ − L_Σ[E_α]E_β − L_Σ[E_β]E_α}_S = {ΣE^βE^α + E^βE^αΣ − E^α(E^βΣ + ΣE^β) − E^β(E^αΣ + ΣE^α)}_S = {ΣE^βE^α + E^βE^αΣ − E^αE^βΣ − E^αΣE^β − E^βE^αΣ − E^βΣE^α}_S = −(E^αΣE^β + E^βΣE^α) .

7.2. Riemannian Hessian. According to [1, Def. 5.5.1] and [11, p. 141], the Riemannian Hessian of a smooth scalar field φ : Sym++ (n) → R is the Levi-Civita covariant derivative of grad φ, namely, for each vector field X, it is the vector field Hess_X φ whose value at Σ is

(163) Hess_X φ(Σ) = D_X(grad φ)(Σ) = D_X(∇φ(Σ)Σ + Σ∇φ(Σ)) .

The associated symmetric bilinear form is (see [1, Proposition 5.5.3])

(164) Hess φ(Σ)(X, Y) = ⟨D_X(grad φ)(Σ), Y⟩_Σ .

It is enough to compute the diagonal of the symmetric form, hence we let X = Z = V in the second part of Eq. (133) to obtain, with Y = grad φ(Σ),

Hess φ(Σ)(V, V) = ⟨d_VY, V⟩_Σ + (1/2) Tr[L_Σ(V)V L_Σ(Y)] − (1/2) Tr[L_Σ(V)Y L_Σ(V)] − (1/2) Tr[L_Σ(Y)V L_Σ(V)] = ⟨d_VY, V⟩_Σ − (1/2) Tr[L_Σ(V)Y L_Σ(V)] .

After plugging Y = grad φ(Σ) = Σ∇φ(Σ) + ∇φ(Σ)Σ into it, we easily get

(165) Hess φ(Σ)(V, V) = ⟨∇²_Vφ(Σ)Σ + Σ∇²_Vφ(Σ), V⟩_Σ + Tr[∇φ(Σ)V L_Σ(V)] − Tr[L_Σ(V)∇φ(Σ)ΣL_Σ(V)] .

Plugging V = L_Σ[V]Σ + ΣL_Σ[V] into the second term of the right-hand side, we have at last

(166) Hess φ(Σ)(V, V) = ⟨∇²_Vφ(Σ)Σ + Σ∇²_Vφ(Σ), V⟩_Σ + Tr[∇φ(Σ)L_Σ(V)ΣL_Σ(V)] .

Relation (166) substantiates the following important property, which links the Hessian to the second derivative along a geodesic (see the proof of Proposition 5.5.4 of [1]).


Proposition 7.2. Let φ : Sym++ (n) → R be a smooth scalar field and define

(167) ϕ(t) = φ(Exp_Σ(tV)) .

It holds

(168) ϕ̈(0) = Hess φ(Σ)(V, V) .

Proof. By Proposition 5.2,

(169) Σ(t) = Exp_Σ(tV) = Σ + tV + t^2 L_Σ(V)ΣL_Σ(V) ,

where Σ(0) = Σ and Σ̇(0) = V. Hence ϕ̇(t) = ⟨∇φ(Σ(t)), Σ̇(t)⟩_2, and

(170) ϕ̈(t) = ⟨∇²φ(Σ(t))[Σ̇(t)], Σ̇(t)⟩_2 + ⟨∇φ(Σ(t)), Σ̈(t)⟩_2 .

At t = 0,

(171) ϕ̈(0) = ⟨∇²φ(Σ)[V], V⟩_2 + 2⟨∇φ(Σ), L_Σ(V)ΣL_Σ(V)⟩_2 .

In view of Eq. (166),

(172) Hess φ(Σ)(V, V) = ⟨∇²_Vφ(Σ), V⟩_2 + 2⟨∇φ(Σ), L_Σ(V)ΣL_Σ(V)⟩_2 = ϕ̈(0) .

8. Discussion. The present paper is intended as an introduction to a research project in progress. It contains both a review of the literature and novel results. The issue of a comparison between the Fisher and Wasserstein metrics is not discussed here as it is, for example, in Chevallier et al. [10].

It is the Authors' plan to investigate the following developments and applications.
1. Push-back of the geometry with the mapping Sym (n) ∋ A ↦ e^A ∈ Sym++ (n).
2. Computation of the curvature tensor.
3. Numerical solution and simulation methods for the relevant equations of the geometry, namely: geodesics, parallel transport, Hessian.
4. Linear optimization methods using the natural gradient as the direction of increase, with the Riemannian exponential as a retraction. Cf. [1] and the Amari monograph [4].
5. Second-order optimization methods with the Riemannian Hessian and the Riemannian exponential. Cf. [1] and [4].
6. Diffusions on Sym++ (n) with the method of J. Armstrong and D. Brigo [6].
7. Sub-manifold of the correlation matrices, i.e., with diagonal elements all equal to 1. In this case, the tangent space at each point is the space of symmetric matrices with zero diagonal.
8. Sub-manifold of trace-1 matrices. This application possibly requires the generalization to complex Gaussians, see e.g. Fassino et al. [13], and to Hermitian matrices as in Bhatia et al. [8].


9. Sub-manifold of the concentration matrices with a given sparsity pattern. In this case the Wasserstein distance interpretation is not available, but see the Bhatia interpretation of the distance [8].

REFERENCES

[1] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization algorithms on matrix manifolds, Princeton University Press, 2008. With a foreword by Paul Van Dooren.

[2] C. D. Aliprantis and K. C. Border, Infinite dimensional analysis, Springer, Berlin, third ed., 2006. A hitchhiker's guide.

[3] S.-I. Amari, Natural gradient works efficiently in learning, Neural Computation, 10 (1998), pp. 251–276, https://doi.org/10.1162/089976698300017746.

[4] S.-I. Amari, Information geometry and its applications, vol. 194 of Applied Mathematical Sciences, Springer, [Tokyo], 2016, https://doi.org/10.1007/978-4-431-55978-8.

[5] T. W. Anderson, An introduction to multivariate statistical analysis, Wiley Series in Probability and Statistics, Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, third ed., 2003.

[6] J. Armstrong and D. Brigo, Coordinate-free stochastic differential equations as jets, arXiv:1602.03931, Feb. 2016.

[7] R. Bhatia, Positive definite matrices, Princeton Series in Applied Mathematics, Princeton University Press, Princeton, NJ, 2007.

[8] R. Bhatia, T. Jain, and Y. Lim, On the Bures-Wasserstein distance between positive definite matrices, arXiv:1712.01504, Dec. 2017.

[9] Y. Brenier, Polar factorization and monotone rearrangement of vector-valued functions, Comm. Pure Appl. Math., 44 (1991), pp. 375–417, https://doi.org/10.1002/cpa.3160440402.

[10] E. Chevallier, E. Kalunga, and J. Angulo, Kernel density estimation on spaces of Gaussian distributions and symmetric positive definite matrices, SIAM J. Imaging Sci., 10 (2017), pp. 191–215, https://doi.org/10.1137/15M1053566.

[11] M. P. do Carmo, Riemannian geometry, Mathematics: Theory & Applications, Birkhäuser Boston Inc., 1992. Translated from the second Portuguese edition by Francis Flaherty.

[12] D. C. Dowson and B. V. Landau, The Fréchet distance between multivariate normal distributions, J. Multivariate Anal., 12 (1982), pp. 450–455, https://doi.org/10.1016/0047-259X(82)90077-X.

[13] C. Fassino, G. Pistone, E. Riccomagno, and M. P. Rogantin, Moments of the multivariate Gaussian complex random variable, arXiv:1708.09622, Aug. 2017.

[14] M. Gelbrich, On a formula for the L2 Wasserstein metric between measures on Euclidean and Hilbert spaces, Math. Nachr., 147 (1990), pp. 185–203, https://doi.org/10.1002/mana.19901470121.

[15] C. R. Givens and R. M. Shortt, A class of Wasserstein metrics for probability distributions, Michigan Math. J., 31 (1984), pp. 231–240, https://doi.org/10.1307/mmj/1029003026.

[16] P. R. Halmos, Finite-dimensional vector spaces, The University Series in Undergraduate Mathematics, D. Van Nostrand Co., Inc., Princeton-Toronto-New York-London, 1958. 2nd ed.

[17] W. P. A. Klingenberg, Riemannian geometry, vol. 1 of De Gruyter Studies in Mathematics, Walter de Gruyter & Co., Berlin, second ed., 1995, https://doi.org/10.1515/9783110905120.

[18] S. Lang, Differential and Riemannian manifolds, vol. 160 of Graduate Texts in Mathematics, Springer-Verlag, third ed., 1995.

[19] J. R. Magnus and H. Neudecker, Matrix differential calculus with applications in statistics and econometrics, Wiley Series in Probability and Statistics, John Wiley & Sons, Ltd., Chichester, 1999. Revised reprint of the 1988 original.

[20] L. Malago and G. Pistone, Combinatorial optimization with information geometry: Newton method, Entropy, 16 (2014), pp. 4260–4289.

[21] O. L. Mangasarian and S. Fromovitz, The Fritz John necessary optimality conditions in the presence of equality and inequality constraints, J. Math. Anal. Appl., 17 (1967), pp. 37–47, https://doi.org/10.1016/0022-247X(67)90163-1.

Page 32: Wasserstein Riemannian Geometry of Positive …...WASSERSTEIN RIEMANNIAN GEOMETRY OF POSITIVE-DEFINITE MATRICES 3 singular. This has been done by R. J. McCann [22, Example 1.7] and

WASSERSTEIN RIEMANNIAN GEOMETRY OF POSITIVE-DEFINITE MATRICES 31

[22] R. J. McCann, Polar factorization of maps on Riemannian manifolds, Geom. Funct. Anal., 11 (2001),pp. 589–608, https://doi.org/10.1007/PL00001679.

[23] I. Olkin and F. Pukelsheim, The distance between two random vectors with given dispersion matrices,Linear Algebra Appl., 48 (1982), pp. 257–263, https://doi.org/10.1016/0024-3795(82)90112-4.

[24] F. Otto, The geometry of dissipative evolution equations: the porous medium equation, Comm. PartialDifferential Equations, 26 (2001), pp. 101–174, https://doi.org/10.1081/PDE-100002243.

[25] A. Papadopoulos, Metric spaces, convexity and non-positive curvature, vol. 6 of IRMA Lectures inMathematics and Theoretical Physics, European Mathematical Society (EMS), Zurich, second ed.,2014, https://doi.org/10.4171/132.

[26] V. Simoncini, Computational methods for linear matrix equations, SIAM Rev., 58 (2016), pp. 377–441,https://doi.org/10.1137/130912839.

[27] A. Takatsu, Wasserstein Geometry of Gaussian measures, Osaka J. Math., 48 (2011), pp. 1005–1026.[28] E. L. Wachspress, Trail to a Lyapunov equation solver, Comput. Math. Appl., 55 (2008), pp. 1653–1659,

https://doi.org/10.1016/j.camwa.2007.04.048.


9. Appendix.

9.1. Proof of Proposition 3.1. A symmetric matrix Σ ∈ Sym (2n) is non-negative definite if, and only if, it is of the form Σ = SS∗, with S ∈ M (2n). In our case, given the block structure of Σ in (42), we can write

(173)  [ Σ1  K ; K∗  Σ2 ] = [ A ; B ][ A∗  B∗ ] = [ AA∗  AB∗ ; BA∗  BB∗ ],

where A and B are two matrices in M(n × 2n). Therefore, problem (42) becomes

(174)  α(Σ1, Σ2) = max_{A,B ∈ M(n×2n)} 2 Tr (AB∗)  subject to  Σ1 = AA∗, Σ2 = BB∗.

We have already observed that the optimum exists, so the necessary conditions of the Lagrange theorem allow us to characterize it. However, the two constraints Σ1 = AA∗ and Σ2 = BB∗ are not necessarily regular at every point (i.e., the Jacobian of the transformation may fail to be of full rank at some points), so we must take into account that the optimum could be attained at an irregular point. To this purpose, as is customary, we adopt the Fritz John first-order formulation of the Lagrangian (see [21]).
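As a numerical illustration of problem (174), the following sketch (assuming NumPy; the helpers `spd` and `sqrtm_spd` are ad hoc names, not from the paper) samples feasible pairs (A, B) and checks that the objective 2 Tr (AB∗) never exceeds the closed-form value stated in (186).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3

def spd(rng, n):
    # random symmetric positive-definite matrix (illustrative helper)
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)

def sqrtm_spd(S):
    # symmetric square root via the spectral decomposition
    w, U = np.linalg.eigh(S)
    return (U * np.sqrt(w)) @ U.T

Sigma1, Sigma2 = spd(rng, n), spd(rng, n)
S1h, S2h = sqrtm_spd(Sigma1), sqrtm_spd(Sigma2)
alpha = 2.0 * np.trace(sqrtm_spd(S1h @ Sigma2 @ S1h))  # closed form of (186)

best = -np.inf
for _ in range(200):
    # every A with AA* = Sigma1 is of the form A = Sigma1^{1/2} Q,
    # where Q in M(n x 2n) has orthonormal rows; likewise for B
    Q1, _ = np.linalg.qr(rng.standard_normal((2 * n, n)))
    Q2, _ = np.linalg.qr(rng.standard_normal((2 * n, n)))
    A, B = S1h @ Q1.T, S2h @ Q2.T
    best = max(best, 2.0 * np.trace(A @ B.T))  # objective of (174)

assert best <= alpha + 1e-8
print(best, "<=", alpha)
```

Random feasible points stay below the bound; the maximum is attained only at the particular pair characterized in the proof.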

We shall initially assume that both Σ1 and Σ2 are non-singular. Let then (ν0, Λ, Γ) ∈ {0, 1} × Sym (n) × Sym (n), with (ν0, Λ, Γ) ≠ (0, 0, 0), where the symmetric matrices Λ and Γ are the Lagrange multipliers. The Lagrangian function is

L = 2ν0 Tr (AB∗) − Tr (ΛAA∗) − Tr (ΓBB∗) = 2ν0 Tr (AB∗) − Tr (A∗ΛA) − Tr (B∗ΓB).

The critical points of L lead to the following first-order conditions:

(175)  ν0B = ΛA,  ν0A = ΓB,  Σ1 = AA∗,  Σ2 = BB∗.

In the case ν0 = 1, i.e., the case of regular stationary points, Eq. (175) becomes

(176)  B = ΛA,  A = ΓB,  Σ1 = AA∗,  Σ2 = BB∗,

which in turn implies

(177)  ΛΣ1Λ = Σ2,  ΓΣ2Γ = Σ1,  Λ, Γ ∈ Sym (n),


and further

(178) K = Σ1Λ = ΓΣ2.

Of course, Eqs. (177) may be more general than Eqs. (176) and thus possibly contain undesirable solutions. In this light, we establish the following facts, in which both matrices Σ1 and Σ2 are assumed nonsingular. Notice that in this case Eqs. (177) imply that both Λ and Γ are nonsingular as well.

Claim 1: If (Λ, Γ) is a solution to (177) and Λ−1 = Γ, then the pair (Λ, Γ) provides Lagrange multipliers for Problem (42).

Indeed, let Σ1 = AA∗, with A ∈ M(n × 2n), be any representation of the matrix Σ1. Define B = ΛA, so that A = Λ−1B = ΓB. Moreover,

(179)  BB∗ = ΛAA∗Λ = ΛΣ1Λ = Σ2,

and so (Λ, Γ) are multipliers associated with the feasible point (A, B).

Claim 2: The set of solutions to (177) such that Γ−1 = Λ is not empty. In particular, there is a unique pair (Λ, Γ) in which both Λ and Γ are positive definite.

We have already observed that Eqs. (177) imply that Λ and Γ are nonsingular. Moreover, ΓΣ2Γ = Σ1 gives Γ−1Σ1Γ−1 = Σ2, so both Λ and Γ−1 solve the Riccati equation XΣ1X = Σ2. Recalling that Riccati's equation has one and only one solution in the class of positive-definite matrices, we get X = Λ = Γ−1.

We now proceed to study the solutions to ΛΣ1Λ = Σ2 and shall show that Eq. (177) has infinitely many solutions. For each such Λ, the value of the objective function is 2 Tr (K) = 2 Tr (Σ1Λ). Therefore, we must select the matrix Λ for which Tr (Σ1Λ) is maximized.
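The closed-form positive-definite Riccati solution used in Claim 2 is easy to check numerically. A minimal sketch, assuming NumPy; the helper names `sqrtm_spd` and `riccati_spd` are illustrative, not from the paper.

```python
import numpy as np

def sqrtm_spd(S):
    # symmetric square root via the spectral decomposition
    w, U = np.linalg.eigh(S)
    return (U * np.sqrt(w)) @ U.T

def riccati_spd(Sigma1, Sigma2):
    # X = Sigma1^{-1/2} (Sigma1^{1/2} Sigma2 Sigma1^{1/2})^{1/2} Sigma1^{-1/2},
    # the unique positive-definite solution of X Sigma1 X = Sigma2
    S1h = sqrtm_spd(Sigma1)
    S1h_inv = np.linalg.inv(S1h)
    return S1h_inv @ sqrtm_spd(S1h @ Sigma2 @ S1h) @ S1h_inv

rng = np.random.default_rng(1)
M1, M2 = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
Sigma1 = M1 @ M1.T + 3 * np.eye(3)
Sigma2 = M2 @ M2.T + 3 * np.eye(3)

X = riccati_spd(Sigma1, Sigma2)
assert np.allclose(X @ Sigma1 @ X, Sigma2)   # solves the Riccati equation
assert np.all(np.linalg.eigvalsh(X) > 0)     # and is positive definite
```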

Following [12], we define

(180)  R = Σ1^{1/2} Λ Σ1^{1/2} ∈ Sym (n),

so that, in view of (177), we have

(181)  R² = Σ1^{1/2} Λ Σ1^{1/2} Σ1^{1/2} Λ Σ1^{1/2} = Σ1^{1/2} Λ Σ1 Λ Σ1^{1/2} = Σ1^{1/2} Σ2 Σ1^{1/2} ∈ Sym+ (n).

Moreover,

(182)  Tr (R) = Tr (Σ1^{1/2} Λ Σ1^{1/2}) = Tr (Σ1^{1/2} Σ1^{1/2} Λ) = Tr (Σ1Λ) = Tr (K).

Eq. (181) shows that, though the Lagrangian can have many rest points (i.e., many solutions Λ), the matrix R² = Σ1^{1/2} Σ2 Σ1^{1/2} ∈ Sym+ (n) remains constant. Not so the value of the objective function Tr (K) = Tr (R), which depends on R (i.e., on Λ).


Let

(183)  R² = ∑_k λk Ek

denote the spectral decomposition of R²; then the square roots of R² are

(184)  R = ∑_k εk λk^{1/2} Ek,

with εk = ±1. Hence Tr (K) = Tr (R) is maximized whenever εk ≡ 1, in which case R ∈ Sym+ (n). Clearly the objective function is minimized if εk ≡ −1; from here on, the proof of the min statement follows similarly.
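The sign choices in (184) can be enumerated explicitly for small n. A sketch, assuming NumPy; the symmetric matrix `R2` below plays the role of R² in (183).

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
n = 3
M = rng.standard_normal((n, n))
R2 = M @ M.T + n * np.eye(n)        # a positive-definite stand-in for R^2
lam, U = np.linalg.eigh(R2)         # spectral decomposition (183)

traces = []
for eps in product([1.0, -1.0], repeat=n):
    # square root (184) with sign pattern eps: R = sum_k eps_k lam_k^{1/2} E_k
    R = (U * (np.array(eps) * np.sqrt(lam))) @ U.T
    assert np.allclose(R @ R, R2)   # every sign choice is a square root of R^2
    traces.append(np.trace(R))

# product() yields (+1,...,+1) first and (-1,...,-1) last
assert max(traces) == traces[0]     # Tr(R) maximal for eps_k = +1
assert min(traces) == traces[-1]    # and minimal for eps_k = -1
```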

It follows that the maximum of the trace occurs at

(185)  R = (Σ1^{1/2} Σ2 Σ1^{1/2})^{1/2},

namely at Λ = Σ1^{−1/2} (Σ1^{1/2} Σ2 Σ1^{1/2})^{1/2} Σ1^{−1/2}. Thanks to Claims 1–2 this matrix is a multiplier of the Lagrangian, and so we would have

(186)  α(Σ1, Σ2) = 2 Tr (Σ1^{1/2} Σ2 Σ1^{1/2})^{1/2},

as long as the optimum is attained at a regular point. In fact, to complete the proof, we must still examine the case ν0 = 0, for which Eq. (175) becomes

(187)  ΛA = 0,  ΓB = 0.

It follows that

ΛΣ1 = ΛAA∗ = 0,  ΓΣ2 = ΓBB∗ = 0,

and consequently Λ = Γ = 0. Therefore there is no irregular point, provided Σ1 and Σ2 are nonsingular matrices. This proves relation (186) under the above assumptions.
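The closed form (186) is straightforward to evaluate. A sketch, assuming NumPy, and using the identity W²(Σ1, Σ2) = Tr Σ1 + Tr Σ2 − α(Σ1, Σ2) for the squared Wasserstein distance between centered Gaussians; the helper names are illustrative.

```python
import numpy as np

def sqrtm_spd(S):
    # symmetric square root via the spectral decomposition
    w, U = np.linalg.eigh(S)
    return (U * np.sqrt(w)) @ U.T

def alpha(S1, S2):
    # alpha(Sigma1, Sigma2) = 2 Tr (Sigma1^{1/2} Sigma2 Sigma1^{1/2})^{1/2}, Eq. (186)
    S1h = sqrtm_spd(S1)
    return 2.0 * np.trace(sqrtm_spd(S1h @ S2 @ S1h))

def w2(S1, S2):
    # squared Wasserstein distance between centered Gaussian densities
    return np.trace(S1) + np.trace(S2) - alpha(S1, S2)

S_a, S_b = np.diag([1.0, 4.0]), np.diag([9.0, 16.0])
assert np.isclose(alpha(S_a, S_a), 2.0 * np.trace(S_a))  # alpha(S, S) = 2 Tr S
assert np.isclose(w2(S_a, S_a), 0.0)                     # zero distance to itself
# commuting (diagonal) case: W^2 = sum_i (sqrt(lambda_i) - sqrt(mu_i))^2
assert np.isclose(w2(S_a, S_b), (3.0 - 1.0) ** 2 + (4.0 - 2.0) ** 2)
```

In the commuting case the distance reduces to the Euclidean distance between the vectors of eigenvalue square roots, which is a convenient sanity check.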

The last step is to extend the result to the case in which the matrices Σ1 and Σ2 are not both nonsingular.

Given the two matrices Σ1, Σ2 ∈ Sym+ (n), set

(188)  Σ1(ε) = Σ1 + εIn  and  Σ2(ε) = Σ2 + εIn,  with ε ∈ [0, 1].

If ε > 0, then

(189)  det (Σi + εI) = ∏_{j=1}^{n} (λi,j + ε) > 0,  i = 1, 2,

where λi,j, j = 1, . . . , n, are the eigenvalues of Σi, i = 1, 2. Let us consider the parametric programming problem

(190)  α(Σ1(ε), Σ2(ε)) = max_{K ∈ M(n)} 2 Tr (K)  subject to  [ Σ1(ε)  K ; K∗  Σ2(ε) ] ∈ Sym+ (2n).

Observe that the feasible region is contained in a compact set independent of ε ∈ [0, 1], because of the bound (36).

Now the continuity of the optimal value ε ↦ α(Σ1(ε), Σ2(ε)) follows easily from Berge's maximum theorem; see for instance [2, Th. 17.31]. Hence

(191)  α(Σ1, Σ2) = lim_{ε→0} α(Σ1(ε), Σ2(ε)) = 2 Tr (Σ1^{1/2} Σ2 Σ1^{1/2})^{1/2},

and the assertion is proved for any Σ1, Σ2 ∈ Sym+ (n).
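The regularization (188) can also be observed numerically: for a singular Σ1, the perturbed value α(Σ1(ε), Σ2(ε)) approaches the directly evaluated limit as ε → 0. A sketch, assuming NumPy; the eigenvalue clipping in `sqrtm_psd` (an illustrative helper) guards against tiny negative round-off eigenvalues.

```python
import numpy as np

def sqrtm_psd(S):
    # symmetric square root for a positive-semidefinite matrix
    w, U = np.linalg.eigh(S)
    return (U * np.sqrt(np.clip(w, 0.0, None))) @ U.T

def alpha(S1, S2):
    # 2 Tr (S1^{1/2} S2 S1^{1/2})^{1/2}, as in (191)
    S1h = sqrtm_psd(S1)
    return 2.0 * np.trace(sqrtm_psd(S1h @ S2 @ S1h))

Sigma1 = np.diag([2.0, 0.0])      # rank-deficient covariance
Sigma2 = np.diag([3.0, 5.0])
target = alpha(Sigma1, Sigma2)    # direct evaluation of the limit in (191)

for eps in [1e-1, 1e-2, 1e-4, 1e-6]:
    val = alpha(Sigma1 + eps * np.eye(2), Sigma2 + eps * np.eye(2))
    print(eps, val)

assert abs(val - target) < 1e-2   # epsilon = 1e-6 is already close to the limit
```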