
arXiv:1111.5280v4 [math.OC] 19 Nov 2013

Stochastic gradient descent on Riemannian manifolds

S. Bonnabel∗

Abstract

Stochastic gradient descent is a simple approach to find the local minima of a cost function whose evaluations are corrupted by noise. In this paper, we develop a procedure extending stochastic gradient descent algorithms to the case where the function is defined on a Riemannian manifold. We prove that, as in the Euclidean case, the gradient descent algorithm converges to a critical point of the cost function. The algorithm has numerous potential applications, and is illustrated here by four examples. In particular, a novel gossip algorithm on the set of covariance matrices is derived and tested numerically.

1 Introduction

Stochastic approximation provides a simple approach, of great practical importance, to find the local minima of a function whose evaluations are corrupted by noise. It has had a long history in optimization and control with numerous applications (e.g. [19, 6, 36]). To demonstrate the main ideas on a toy example, we briefly mention a traditional procedure to optimize the ballistic trajectory of a projectile in a fluctuating wind. Successive gradient corrections (i.e. corrections proportional to the distance between the projectile impact and the target) are performed on the angle at which the projectile is launched. With a decreasing step size tending to zero, one can reasonably hope the launching angle will converge to a fixed value such that the corresponding impacts are centered on the target on average. One of the first formal algorithms of this kind is the Robbins-Monro algorithm [29], which dates back to the 1950s. It proves that for a smooth cost function $C(w)$ having a unique minimum, the algorithm $w_{t+1} = w_t - \gamma_t h_t(w_t)$, where $h_t(w_t)$ is a noisy evaluation of the gradient of $C$ at $w_t$, converges in quadratic mean to the minimum, under specific conditions on the sequence $\gamma_t$.

Although stochastic gradient has found applications in control, system identification, and filtering theories (for instance a Kalman filter for noisy observations of a constant process is a Robbins-Monro algorithm), new challenging applications stem from the active machine learning community. The work of L. Bottou, a decade ago [9], has popularized the stochastic gradient approach, both to address the online learning problem (identification of a constant parameter in real time from noisy output measurements) and large-scale learning (with

∗Robotics lab, Mathématiques et systèmes, Mines ParisTech, 75272 Paris CEDEX, France (e-mail: silvere.bonnabel@mines-paristech.fr).


ever-increasing data sets, approximating the cost function with a simpler appropriate stochastic function can lead to a reduced numerical complexity). Some recent problems have been strong drivers for the development of new estimation methods, such as the one proposed in the present paper, dealing with stochastic gradient descent on manifolds.

The paper is organized as follows. In Section 2, general stochastic gradient descent algorithms on Riemannian manifolds are introduced. The algorithms already used in [23, 22, 3, 4, 26] can all be cast in the proposed general framework. The main algorithms are completely intrinsic, i.e. they do not depend on a specific embedding of the manifold or on a choice of local coordinates.

In Section 3, the convergence properties of the algorithms are analyzed. In the Euclidean case, almost sure (a.s.) convergence of the parameter to a critical point of the cost function is well established under reasonable assumptions (see e.g. [9]), but this result has never been proven to hold for non-Euclidean spaces. In this paper, almost sure convergence of the proposed algorithms is obtained under several assumptions, extending the results of the Euclidean case to the Riemannian case.

In Section 4, the algorithms and the convergence results of the preceding sections are applied to four examples. The first example revisits the celebrated Oja algorithm [26] for online principal component analysis (PCA). This algorithm can be cast in our versatile framework, and its convergence properties immediately follow from the theorems of Section 3. Moreover, the other results of the present paper allow one to define alternative algorithms for online PCA with guaranteed convergence properties. The second example is concerned with the randomized computation of intrinsic means on a hyperbolic space, the Poincaré disk. This is a rather tutorial example, meant to illustrate the assumptions and results of the third theorem of Section 3. The convergence follows from this theorem. The last two examples are more detailed and include numerical experiments. The third example is concerned with a particular algorithm of [23]. The goal is to identify a positive semi-definite matrix (a kernel or a Mahalanobis distance) from noisy measurements. The theoretical convergence results of Section 3 allow us to complete the work of [23], and simulations illustrate the convergence properties. The last example is concerned with a consensus application on the set of covariance matrices (see e.g. [20]). A novel randomized gossip algorithm based on the Fisher Information Metric is proposed. The algorithm has a meaningful statistical interpretation, and admits several invariance and guaranteed convergence properties that follow from the results of Section 3. As the state space is convex, the usual gossip algorithm [10] is well defined and can be implemented on this space. Simulations indicate the proposed Riemannian consensus algorithm converges faster than the usual gossip algorithm.

Appendix A briefly presents some links with information geometry and Amari's natural gradient. Appendix B contains a brief recap of differential geometry. Preliminary results can be found in [8, 7].


2 Stochastic gradient on Riemannian manifolds

2.1 Standard stochastic gradient in $\mathbb{R}^n$

Let $C(w) = \mathbb{E}_z Q(z,w) = \int Q(z,w)\,dP(z)$ be a three times continuously differentiable cost function, where $w \in \mathbb{R}^n$ is a minimization parameter, and $dP$ is a probability measure on a measurable space $\mathcal{Z}$. Consider the optimization problem

$$\min_w C(w) \qquad (1)$$

In stochastic approximation, the cost function cannot be computed explicitly as the distribution $dP$ is assumed to be unknown. Instead, one has access to a sequence of independent observations $z_1, z_2, \cdots$ of a random variable drawn with probability law $dP$. At each time step $t$, the user can compute the so-called loss function $Q(z_t, w)$ for any parameter $w \in \mathbb{R}^n$. The loss can be viewed as an approximation of the (average) cost function $C(w)$ evaluated under the input $z_t \in \mathcal{Z}$. Stochastic gradient descent is a standard technique to treat this problem. At each step the algorithm receives an input $z_t$ drawn according to $dP$, and performs a gradient descent on the approximated cost $w_{t+1} = w_t - \gamma_t H(z_t, w_t)$, where $H(z,w)$ can be viewed as the gradient of the loss, i.e., on average $\mathbb{E}_z H(z,w) = \int H(z,w)\,dP(z) = \nabla C(w)$. As $C$ is not convex in many applications, one cannot hope for a much better result than almost sure (a.s.) convergence of $C(w_t)$ to some value $C_\infty$, and convergence of $\nabla C(w_t)$ to 0. Such a result holds under a set of standard assumptions, summarized in e.g. [9]. Note that a.s. convergence is a very desirable property, for instance in online estimation, as it ensures asymptotic convergence is always achieved in practice.
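To fix ideas, the following minimal NumPy sketch implements the Euclidean update $w_{t+1} = w_t - \gamma_t H(z_t, w_t)$ described above. The function names, the sampling interface, and the default step-size constants are illustrative assumptions, not part of the original text.

```python
import numpy as np

def sgd_euclidean(w0, grad_loss, sample, n_steps, a=0.1, b=0.01):
    """Plain stochastic gradient descent in R^n.

    grad_loss(z, w) -- noisy gradient H(z, w) with E_z[H(z, w)] = grad C(w)
    sample()        -- draws one observation z_t with law dP
    """
    w = np.asarray(w0, dtype=float).copy()
    for t in range(n_steps):
        z = sample()
        gamma = a / (1.0 + b * t)          # step sizes satisfying condition (4)
        w = w - gamma * grad_loss(z, w)    # w_{t+1} = w_t - gamma_t H(z_t, w_t)
    return w
```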

2.2 Limits of the approach: a motivating example

A topical problem that has attracted a lot of attention in the machine learning community over the last years is low-rank matrix estimation (or matrix completion, which can be viewed as the matrix counterpart of sparse approximation problems), and in particular the collaborative filtering problem: given a matrix $W^*_{ij}$ containing the preference ratings of users about items (movies, books), the goal is to compute personalized recommendations of these items. Only a small subset of entries $(i,j) \in \Xi$ is known, and there are many ways to complete the matrix. A standard approach to overcome this ambiguity, and to filter the noise, is to constrain the state space by assuming the tastes of the users are explained by a reduced number of criteria (say, $r$). This yields the following non-linear optimization problem

$$\min_{W \in \mathbb{R}^{d_1 \times d_2}} \sum_{(i,j)\in\Xi} (W^*_{ij} - W_{ij})^2 \quad \text{s.t.} \quad \mathrm{rank}(W) = r$$

The matrix being potentially of high dimension ($d_1 \simeq 10^5$, $d_2 \simeq 10^6$ in the so-called Netflix prize problem), a standard method to reduce the computational burden is to draw random elements of $\Xi$, and perform gradient descent ignoring the remaining entries. Unfortunately the updated matrix $W - \gamma_t \nabla_W (W^*_{ij} - W_{ij})^2$ does not have rank $r$. Seeking the matrix of rank $r$ which best approximates it can be numerically costly, especially for very large $d_1, d_2$. A more natural way to enforce the rank constraint is to endow the parameter space with a


Riemannian metric, and to perform a gradient step within the manifold of fixed-rank matrices. In [22] this approach has led to stochastic gradient algorithms that compete with state-of-the-art methods. Yet a convergence proof is still lacking. The convergence results below are general, and in Section 4.3 they will be shown to apply to this problem for the particular case of $W^*$ being symmetric positive definite.

2.3 Proposed general stochastic gradient algorithm on Riemannian manifolds

In this paper we propose a new procedure to address problem (1) where $C(w) = \mathbb{E}_z Q(z,w)$ is a three times continuously differentiable cost function and where $w$ is now a minimization parameter belonging to a smooth connected Riemannian manifold $\mathcal{M}$. On $\mathcal{M}$, we propose to replace the usual update with the following update

$$w_{t+1} = \exp_{w_t}\!\big(-\gamma_t H(z_t, w_t)\big) \qquad (2)$$

where $\exp_w$ is the exponential map at $w$, and $H(z,w)$ can be viewed as the Riemannian gradient of the loss, i.e., we have on average $\mathbb{E}_z H(z,w) = \int H(z,w)\,dP(z) = \nabla C(w)$, where $\nabla C(w)$ denotes the Riemannian gradient of $C$ at $w \in \mathcal{M}$. The proposed update (2) is a straightforward transposition of the standard gradient update in the Euclidean case. Indeed, $H(z,w)$ is a tangent vector to the manifold that describes the direction of steepest descent for the loss. In update (2), the parameter moves along the geodesic emanating from the current parameter position $w_t$, in the direction defined by $H(z_t,w_t)$ and with intensity $\|H(z_t,w_t)\|$. If the manifold at hand is $\mathbb{R}^n$ equipped with the usual Euclidean scalar product, the geodesics are straight lines, and the definitions coincide. Note that the procedure here is totally intrinsic, i.e. the algorithm is completely independent of the choice of local coordinates on the manifold.
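For illustration, here is a minimal, manifold-agnostic sketch of update (2). The exponential map and the Riemannian gradient are left as user-supplied callables, since they depend on the manifold at hand; all names are illustrative assumptions.

```python
def riemannian_sgd(w0, exp_map, riem_grad, sample, gammas):
    """Riemannian SGD, update (2): w_{t+1} = exp_{w_t}(-gamma_t H(z_t, w_t)).

    exp_map(w, v)   -- exponential map at w applied to the tangent vector v
    riem_grad(z, w) -- Riemannian gradient H(z, w) of the loss at w
    sample()        -- draws one observation z_t with law dP
    gammas          -- iterable of step sizes satisfying condition (4)
    """
    w = w0
    for gamma in gammas:
        z = sample()
        w = exp_map(w, -gamma * riem_grad(z, w))
    return w
```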

In many cases, the exponential map is not easy to compute (a calculus of variations problem must be solved, or the Christoffel symbols need to be known), and it is much easier and much faster to use a first-order approximation of the exponential, called a retraction. Indeed, a retraction $R_w(v) : T_w\mathcal{M} \to \mathcal{M}$ maps the tangent space at $w$ to the manifold, and it is such that $d(R_w(tv), \exp_w(tv)) = O(t^2)$. It yields the alternative update

$$w_{t+1} = R_{w_t}\!\big(-\gamma_t H(z_t, w_t)\big) \qquad (3)$$

Let us give a simple example to illustrate the ideas: if the manifold were the sphere $S^{n-1}$ endowed with the natural metric inherited through immersion in $\mathbb{R}^n$, a retraction would consist of a simple addition in the ambient space $\mathbb{R}^n$ followed by a projection onto the sphere. This is a numerically very simple operation that avoids calculating the geodesic distance explicitly. See the Appendix for more details on Riemannian manifolds.
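A minimal sketch of update (3) on the unit sphere, using the projection retraction just described (a Euclidean step followed by renormalization), is given below; the gradient and sampling callables are placeholders to be supplied by the user.

```python
import numpy as np

def retraction_sphere(w, v):
    """Projection retraction on S^{n-1}: R_w(v) = (w + v) / ||w + v||."""
    u = w + v
    return u / np.linalg.norm(u)

def sgd_on_sphere(w0, riem_grad, sample, gammas):
    """Update (3) on the sphere: w_{t+1} = R_{w_t}(-gamma_t H(z_t, w_t))."""
    w = np.asarray(w0, float) / np.linalg.norm(w0)
    for gamma in gammas:
        z = sample()
        w = retraction_sphere(w, -gamma * riem_grad(z, w))
    return w
```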

3 Convergence results

In this section, the convergence of the proposed algorithms (2) and (3) is analyzed. The parameter is proved to converge almost surely to a critical point of the cost function in various cases and under various conditions. More specifically, three general results are derived.


In Subsection 3.1, a first general result is derived: when the parameter $w \in \mathcal{M}$ is proved to remain in a compact set, the algorithm (2) converges a.s. under standard conditions on the step size sequence. This theorem applies in particular to all connected compact manifolds. Important examples of such manifolds in applications are the orthogonal group, the group of rotations, the sphere, the real projective space, the Grassmann and the Stiefel manifolds. In Subsection 3.2, the result is proved to hold when a twice continuously differentiable retraction is used instead of the exponential map.

Finally, in Subsection 3.3, we consider a slightly modified version of algorithm (2) on specific non-positively curved Riemannian manifolds. The step size $\gamma_t$ is adapted at each step in order to take into account the effects of negative curvature that tend to destabilize the algorithm. Under a set of mild assumptions naturally extending those of the Euclidean case, the parameter is proved to a.s. remain in a compact set, and thus a.s. convergence is proved. Important examples of such manifolds are the Poincaré disk or the Poincaré half-plane, and the space of real symmetric positive definite matrices $P^+(n)$. The sequence of step sizes $(\gamma_t)_{t\geq 0}$ will satisfy the usual condition in stochastic approximation:

$$\sum_t \gamma_t^2 < \infty \quad \text{and} \quad \sum_t \gamma_t = +\infty \qquad (4)$$

3.1 Convergence on compact sets

The following theorem proves the a.s. convergence of the algorithm under some assumptions when the trajectories have been proved to remain in a predefined compact set at all times. This is of course the case if $\mathcal{M}$ is compact.

Theorem 1. Consider the algorithm (2) on a connected Riemannian manifold $\mathcal{M}$ with injectivity radius uniformly bounded from below by $I > 0$. Assume the sequence of step sizes $(\gamma_t)_{t\geq 0}$ satisfies the standard condition (4). Suppose there exists a compact set $K$ such that $w_t \in K$ for all $t \geq 0$. We also suppose that the gradient is bounded on $K$, i.e. there exists $A > 0$ such that for all $w \in K$ and $z \in \mathcal{Z}$ we have $\|H(z,w)\| \leq A$. Then $C(w_t)$ converges a.s. and $\nabla C(w_t) \to 0$ a.s.

Proof. The proof builds upon the usual proof in the Euclidean case (see e.g. [9]). As the parameter is proved to remain in a compact set, all continuous functions of the parameter can be bounded. Moreover, as $\gamma_t \to 0$, there exists $t_0$ such that for $t \geq t_0$ we have $\gamma_t A < I$. Suppose now that $t \geq t_0$; then there exists a geodesic $\exp_{w_t}(-s\gamma_t H(z_t,w_t))_{0\leq s\leq 1}$ linking $w_t$ and $w_{t+1}$, as $d(w_t, w_{t+1}) < I$. $C(\exp_{w_t}(-\gamma_t H(z_t,w_t))) = C(w_{t+1})$, and thus the Taylor formula implies that (see Appendix)

$$C(w_{t+1}) - C(w_t) \leq -\gamma_t \langle H(z_t,w_t), \nabla C(w_t)\rangle + \gamma_t^2 \|H(z_t,w_t)\|^2 k_1 \qquad (5)$$

where $k_1$ is an upper bound on the Riemannian Hessian of $C$ in the compact set $K$. Let $\mathcal{F}_t$ be the increasing sequence of $\sigma$-algebras generated by the variables available just before time $t$:

$$\mathcal{F}_t = \{z_0, \cdots, z_{t-1}\}$$


$w_t$, being computed from $z_0, \cdots, z_{t-1}$, is $\mathcal{F}_t$-measurable. As $z_t$ is independent from $\mathcal{F}_t$, we have $\mathbb{E}[\langle H(z_t,w_t), \nabla C(w_t)\rangle \,|\, \mathcal{F}_t] = \mathbb{E}_z[\langle H(z,w_t), \nabla C(w_t)\rangle] = \|\nabla C(w_t)\|^2$. Thus

$$\mathbb{E}\big(C(w_{t+1}) - C(w_t) \,|\, \mathcal{F}_t\big) \leq -\gamma_t \|\nabla C(w_t)\|^2 + \gamma_t^2 A^2 k_1 \qquad (6)$$

as $\|H(z_t,w_t)\| \leq A$. As $C(w_t) \geq 0$, this proves $C(w_t) + \sum_{k=t}^{\infty} \gamma_k^2 A^2 k_1$ is a nonnegative supermartingale, hence it converges a.s., implying that $C(w_t)$ converges a.s. Moreover, summing the inequalities we have

$$\sum_{t\geq t_0} \gamma_t \|\nabla C(w_t)\|^2 \leq -\sum_{t\geq t_0} \mathbb{E}\big(C(w_{t+1}) - C(w_t) \,|\, \mathcal{F}_t\big) + \sum_{t\geq t_0} \gamma_t^2 A^2 k_1 \qquad (7)$$

Here we would like to prove the right term is bounded so that the left term converges. But the fact that $C(w_t)$ converges does not imply it has bounded variations. However, as in the Euclidean case, we can use a theorem by D.L. Fisk [15] ensuring that $C(w_t)$ is a quasi-martingale, i.e., it can be decomposed into a sum of a martingale and a process whose trajectories are of bounded variation. For a random variable $X$, let $X^+$ denote the quantity $\max(X, 0)$.

Proposition 1 (Fisk (1965)). Let $(X_n)_{n\in\mathbb{N}}$ be a non-negative stochastic process with bounded positive variations, i.e., such that $\sum_{0}^{\infty} \mathbb{E}\big([\mathbb{E}(X_{n+1} - X_n \,|\, \mathcal{F}_n)]^+\big) < \infty$. Then the process is a quasi-martingale, i.e.

$$\sum_{0}^{\infty} \big|\mathbb{E}[X_{n+1} - X_n \,|\, \mathcal{F}_n]\big| < \infty \;\text{ a.s.}, \quad \text{and } X_n \text{ converges a.s.}$$

Summing (6) over $t$, it is clear that $C(w_t)$ satisfies the proposition's assumptions, and thus $C(w_t)$ is a quasi-martingale, implying that $\sum_{t\geq t_0}\gamma_t\|\nabla C(w_t)\|^2$ converges a.s. because of inequality (7), where the central term can be bounded by its absolute value, which is convergent thanks to the proposition. But, as $\gamma_t \to 0$, this does not prove $\|\nabla C(w_t)\|$ converges a.s. However, if $\|\nabla C(w_t)\|$ is proved to converge a.s., it can only converge to 0 a.s. because of condition (4).

Now consider the nonnegative process $p_t = \|\nabla C(w_t)\|^2$. Bounding the second derivative of $\|\nabla C\|^2$ by $k_2$ along the geodesic linking $w_t$ and $w_{t+1}$, a Taylor expansion yields $p_{t+1} - p_t \leq -2\gamma_t \langle \nabla C(w_t), (\nabla^2_{w_t} C)\, H(z_t,w_t)\rangle + \gamma_t^2 \|H(z_t,w_t)\|^2 k_2$, and thus, bounding from below the Hessian of $C$ on the compact set by $-k_3$, we have $\mathbb{E}(p_{t+1} - p_t \,|\, \mathcal{F}_t) \leq 2\gamma_t \|\nabla C(w_t)\|^2 k_3 + \gamma_t^2 A^2 k_2$. We just proved the sum of the right term is finite. This implies $p_t$ is a quasi-martingale, hence a.s. convergence of $p_t$ towards a value $p_\infty$, which can only be 0.

3.2 Convergence with a retraction

In this section, we prove Theorem 1 still holds when a retraction is used instead of the exponential map.


Theorem 2. Let $\mathcal{M}$ be a connected Riemannian manifold with injectivity radius uniformly bounded from below by $I > 0$. Let $R_w$ be a twice continuously differentiable retraction, and consider the update (3). Assume the sequence of step sizes $(\gamma_t)_{t\geq 0}$ satisfies the standard condition (4). Suppose there exists a compact set $K$ such that $w_t \in K$ for all $t \geq 0$. We also suppose that the gradient is bounded in $K$, i.e. for $w \in K$ we have $\|H(z,w)\| \leq A$ for all $z$ and some $A > 0$. Then $C(w_t)$ converges a.s. and $\nabla C(w_t) \to 0$ a.s.

Proof. Let $w^{\exp}_{t+1} = \exp_{w_t}(-\gamma_t H(z_t,w_t))$. The proof essentially relies on the fact that the points $w_{t+1}$ and $w^{\exp}_{t+1}$ are close to each other on the manifold for sufficiently large $t$. Indeed, as the retraction is twice continuously differentiable, there exists $r > 0$ such that $d(R_w(sv), \exp_w(sv)) \leq rs^2$ for $s$ sufficiently small, $\|v\| = 1$, and $w \in K$. As for $t$ sufficiently large $\gamma_t A$ can be made arbitrarily small (in particular smaller than the injectivity radius), this implies $d(w^{\exp}_{t+1}, w_{t+1}) \leq \gamma_t^2 r A^2$.

We can now reiterate the proof of Theorem 1. We have $C(w_{t+1}) - C(w_t) \leq |C(w_{t+1}) - C(w^{\exp}_{t+1})| + C(w^{\exp}_{t+1}) - C(w_t)$. The term $C(w^{\exp}_{t+1}) - C(w_t)$ can be bounded as in (5), whereas we have just proved $|C(w_{t+1}) - C(w^{\exp}_{t+1})|$ is bounded by $k_1 r \gamma_t^2 A^2$, where $k_1$ is a bound on the Riemannian gradient of $C$ in $K$. Thus $C(w_t)$ is a quasi-martingale and $\sum_{1}^{\infty} \gamma_t \|\nabla C(w_t)\|^2 < \infty$. It means that if $\|\nabla C(w_t)\|$ converges, it can only converge to zero.

Let us consider the variations of the function $p(w) = \|\nabla C(w)\|^2$. Writing $p(w_{t+1}) - p(w_t) \leq |p(w_{t+1}) - p(w^{\exp}_{t+1})| + p(w^{\exp}_{t+1}) - p(w_t)$ and bounding the first term of the right-hand side by $k_3 r \gamma_t^2 A^2$, where $k_3$ is a bound on the gradient of $p$, we see the inequalities of Theorem 1 are unchanged up to second-order terms in $\gamma_t$. Thus $p(w_t)$ is a quasi-martingale and thus converges.

3.3 Convergence on Hadamard manifolds

In the previous section, we proved convergence as long as the parameter is known to remain in a compact set. For some manifolds, the algorithm can be proved to converge without this assumption. This is the case for instance in the Euclidean space, where the trajectories can be proved to be confined to a compact set under a set of conditions [9]. In this section, we extend those conditions to the important class of Hadamard manifolds, and we prove convergence. Hadamard manifolds are complete, simply connected Riemannian manifolds with nonpositive sectional curvature. In order to account for curvature effects, the step size must be slightly adapted at each iteration. This step adaptation yields a more flexible algorithm, and allows one to relax one of the standard conditions even in the Euclidean case.

Hadamard manifolds have strong properties. In particular, the exponential map at any point is globally invertible (e.g. [27]). Let $D(w_1, w_2) = d^2(w_1, w_2)$ be the squared geodesic distance. Consider the following assumptions, which can be viewed as an extension of the usual ones in the Euclidean case:

1. There is a point $v \in \mathcal{M}$ and $S > 0$ such that the opposite of the gradient points towards $v$ when $d(w,v)$ becomes larger than $\sqrt{S}$, i.e.

$$\inf_{D(w,v) > S} \langle \exp^{-1}_w(v),\, \nabla C(w)\rangle < 0$$

2. There exists a lower bound on the sectional curvature, denoted by $\kappa < 0$.


3. There exists a continuous function $f : \mathcal{M} \to \mathbb{R}$ that satisfies

$$f(w)^2 \geq \max\Big\{1,\;\; \mathbb{E}_z\Big(\|H(z,w)\|^2\big(1 + \sqrt{|\kappa|}\,(\sqrt{D(w,v)} + \|H(z,w)\|)\big)\Big),\;\; \mathbb{E}_z\Big(\big(2\|H(z,w)\|\sqrt{D(w,v)} + \|H(z,w)\|^2\big)^2\Big)\Big\}$$

Theorem 3. Let $\mathcal{M}$ be a Hadamard manifold. Consider the optimization problem (1). Under Assumptions 1-3, the modified algorithm

$$w_{t+1} = \exp_{w_t}\!\Big(-\frac{\gamma_t}{f(w_t)}\, H(z_t, w_t)\Big) \qquad (8)$$

is such that $C(w_t)$ converges a.s. and $\nabla C(w_t) \to 0$ a.s.

Assumptions 1-3 are mild assumptions that encompass the Euclidean case. In this latter case, Assumption 3 is usually replaced with the stronger condition $\mathbb{E}_z(\|H(z,w)\|^k) \leq A + B\|w\|^k$ for $k = 2, 3, 4$ (note that this condition immediately implies the existence of the function $f$). Indeed, on the one hand our general procedure based on the adaptive step $\gamma_t/f(w_t)$ allows one to relax this standard condition, also in the Euclidean case, as will be illustrated by the example of Section 4.3. On the other hand, contrarily to the Euclidean case, one could object that the user must provide at each step an upper bound on a function of $D(w,v)$, where $v$ is the point appearing in Assumption 1, which requires some knowledge of $v$. This can appear to be a limitation, but in fact finding a point $v$ fulfilling Assumption 1 may be quite obvious in practice, and may be far from requiring direct knowledge of the point the algorithm is supposed to converge to, as illustrated by the example of Section 4.2.

Proof. The following proof builds upon the Euclidean case [9]. We are first going to prove that the trajectories asymptotically remain in a compact set. Theorem 1 will then easily apply. A second-order Taylor expansion yields

$$D(w_{t+1}, v) - D(w_t, v) \leq \frac{2\gamma_t}{f(w_t)} \langle H(z_t,w_t),\, \exp^{-1}_{w_t}(v)\rangle + \Big(\frac{\gamma_t}{f(w_t)}\Big)^2 \|H(z_t,w_t)\|^2 k_1 \qquad (9)$$

where $k_1$ is an upper bound on the operator norm of half of the Riemannian Hessian of $D(\cdot, v)$ along the geodesic joining $w_t$ to $w_{t+1}$ (see the Appendix). If the sectional curvature is bounded from below by $\kappa < 0$ we have ([12] Lemma 3.12)

$$\lambda_{\max}\big(\nabla^2_w (D(w,v)/2)\big) \leq \frac{\sqrt{|\kappa|\, D(w,v)}}{\tanh\!\big(\sqrt{|\kappa|\, D(w,v)}\big)}$$

where $\nabla^2_w (D(w,v)/2)$ is the Hessian of the half squared distance and $\lambda_{\max}(\cdot)$ denotes the largest eigenvalue of an operator. This implies that $\lambda_{\max}\big(\nabla^2_w (D(w,v)/2)\big) \leq \sqrt{|\kappa|\, D(w,v)} + 1$. Moreover, along the geodesic linking $w_t$ and $w_{t+1}$, the triangle inequality implies $\sqrt{D(w,v)} \leq \sqrt{D(w_t,v)} + \|H(z_t,w_t)\|$, as $f(w_t) \geq 1$ and there exists $t_0$ such that $\gamma_t \leq 1$ for $t \geq t_0$. Thus


$k_1 \leq \beta(z_t, w_t)$ for $t \geq t_0$, where $\beta(z_t,w_t) = 1 + \sqrt{|\kappa|}\big(\sqrt{D(w_t,v)} + \|H(z_t,w_t)\|\big)$. Let $\mathcal{F}_t$ be the increasing sequence of $\sigma$-algebras generated by the variables available just before time $t$: $\mathcal{F}_t = \{z_0, \cdots, z_{t-1}\}$. As $z_t$ is independent from $\mathcal{F}_t$, and $w_t$ is $\mathcal{F}_t$-measurable, we have $\mathbb{E}\big[\big(\tfrac{\gamma_t}{f(w_t)}\big)^2 \|H(z_t,w_t)\|^2 k_1 \,\big|\, \mathcal{F}_t\big] \leq \big(\tfrac{\gamma_t}{f(w_t)}\big)^2 \mathbb{E}_z\big(\|H(z,w_t)\|^2 \beta(z,w_t)\big)$. Conditioning (9) on $\mathcal{F}_t$, and using Assumption 3:

$$\mathbb{E}\big[D(w_{t+1},v) - D(w_t,v) \,|\, \mathcal{F}_t\big] \leq \frac{2\gamma_t}{f(w_t)} \langle \nabla C(w_t),\, \exp^{-1}_{w_t}(v)\rangle + \gamma_t^2 \qquad (10)$$

Let $\phi : \mathbb{R}^+ \to \mathbb{R}^+$ be a smooth function such that

• $\phi(x) = 0$ for $0 \leq x \leq S$

• $0 < \phi''(x) \leq 2$ for $S < x \leq S+1$

• $\phi'(x) = 1$ for $x \geq S+1$

and let $h_t = \phi(D(w_t, v))$. Let us prove it converges a.s. to 0. As $\phi''(x) \leq 2$ for all $x \geq 0$, a second-order Taylor expansion on $\phi$ yields

$$h_{t+1} - h_t \leq \big[D(w_{t+1},v) - D(w_t,v)\big]\,\phi'(D(w_t,v)) + \big(D(w_{t+1},v) - D(w_t,v)\big)^2$$

Because of the triangle inequality we have $d(w_{t+1},v) \leq d(w_t,v) + \frac{\gamma_t}{f(w_t)}\|H(z_t,w_t)\|$. Thus $D(w_{t+1},v) - D(w_t,v) \leq 2 d(w_t,v)\frac{\gamma_t}{f(w_t)}\|H(z_t,w_t)\| + \big(\frac{\gamma_t}{f(w_t)}\big)^2\|H(z_t,w_t)\|^2$, which is less than $\frac{\gamma_t}{f(w_t)}\big[2 d(w_t,v)\|H(z_t,w_t)\| + \|H(z_t,w_t)\|^2\big]$ for $t \geq t_0$. Using Assumption 3 and the fact that $w_t$ is $\mathcal{F}_t$-measurable, we have $\mathbb{E}[h_{t+1} - h_t \,|\, \mathcal{F}_t] \leq \phi'(D(w_t,v))\,\mathbb{E}[D(w_{t+1},v) - D(w_t,v)\,|\,\mathcal{F}_t] + \gamma_t^2$. Using (10) we have

$$\mathbb{E}[h_{t+1} - h_t \,|\, \mathcal{F}_t] \leq \frac{2\gamma_t}{f(w_t)} \langle \nabla C(w_t),\, \exp^{-1}_{w_t}(v)\rangle\, \phi'(D(w_t,v)) + 2\gamma_t^2 \qquad (11)$$

as $\phi'$ is positive, and less than 1. Either $D(w_t,v) \leq S$, and then we have $\phi'(D(w_t,v)) = 0$ and thus $\mathbb{E}[h_{t+1} - h_t\,|\,\mathcal{F}_t] \leq 2\gamma_t^2$. Or $D(w_t,v) > S$, and Assumption 1 ensures $\langle \nabla C(w_t), \exp^{-1}_{w_t}(v)\rangle$ is negative. As $\phi' \geq 0$, (11) implies $\mathbb{E}[h_{t+1} - h_t\,|\,\mathcal{F}_t] \leq 2\gamma_t^2$. In both cases $\mathbb{E}[h_{t+1} - h_t\,|\,\mathcal{F}_t] \leq 2\gamma_t^2$, proving $h_t + 2\sum_{k=t}^{\infty}\gamma_k^2$ is a positive supermartingale, hence it converges a.s. Let us prove it necessarily converges to 0. We have $\sum_{t_0}^{\infty}\mathbb{E}\big([\mathbb{E}(h_{t+1} - h_t\,|\,\mathcal{F}_t)]^+\big) \leq 2\sum_t \gamma_t^2 < \infty$. Proposition 1 proves that $h_t$ is a quasi-martingale. Using (11) we have the inequality

$$-2\sum_{t_0}^{\infty} \frac{\gamma_t}{f(w_t)} \langle \nabla C(w_t),\, \exp^{-1}_{w_t}(v)\rangle\, \phi'(D(w_t,v)) \leq 2\sum_{t_0}^{\infty} \gamma_t^2 - \sum_{t_0}^{\infty} \mathbb{E}[h_{t+1} - h_t\,|\,\mathcal{F}_t]$$

and as $h_t$ is a quasi-martingale we have a.s.

$$-\sum_{t_0}^{\infty} \frac{\gamma_t}{f(w_t)} \langle \nabla C(w_t),\, \exp^{-1}_{w_t}(v)\rangle\, \phi'(D(w_t,v)) \leq 2\Big|\sum_{t_0}^{\infty} \gamma_t^2\Big| + \sum_{t_0}^{\infty} \big|\mathbb{E}[h_{t+1} - h_t\,|\,\mathcal{F}_t]\big| < \infty \qquad (12)$$


Consider a sample trajectory for which $h_t$ converges to $\alpha > 0$. It means that for $t$ large enough, $D(w_t,v) > S$ and thus $\phi'(D(w_t,v)) > \epsilon_1 > 0$. Because of Assumption 1 we also have $\langle \nabla C(w_t), \exp^{-1}_{w_t}(v)\rangle < -\epsilon_2 < 0$. This contradicts (12) as $\sum_{t_0}^{\infty} \frac{\gamma_t}{f(w_t)} = \infty$. The last equality comes from (4) and the fact that $f$ is continuous and thus bounded along the trajectory.

It has been proved that almost every trajectory asymptotically enters the ball of center $v$ and radius $\sqrt{S}$ and stays inside of it. Let us prove that we can work on a fixed compact set. Let $G_n = \bigcap_{t>n}\{D(w_t,v) \leq S\}$. We have just proved $P(\cup\, G_n) = 1$. Thus, to prove a.s. convergence, it is sufficient to prove a.s. convergence on each of those sets. We assume from now on that the trajectories all belong to the ball of center $v$ and radius $\sqrt{S}$. As this is a compact set, all continuous functions of the parameter can be bounded. In particular $\gamma_t/k_3 \leq \gamma_t/f(w_t) \leq \gamma_t$ for some $k_3 > 0$, and thus the modified step size verifies the conditions of Theorem 1. Moreover, $\mathbb{E}_z(\|H(z,w)\|^2) \leq A^2$ for some $A > 0$ on the compact set, as it is dominated by $f(w)^2$. As there is no cut locus, this weaker condition is sufficient, since it implies that (6) holds. The proof follows from a mere application of Theorem 1 on this compact set.

Note that it would be possible to derive an analogous result when a retraction is used instead of the exponential map, using the ideas of the proof of Theorem 2. However, due to a lack of relevant examples, this result is not presented.

4 Examples

Four application examples are presented. The first two examples are rather tutorial. The first one illustrates Theorems 1 and 2. The second one provides a graphical interpretation of Theorem 3 and its assumptions. The third and fourth examples are more detailed and include numerical experiments. Throughout this section, $\gamma_t$ is a sequence of positive step sizes satisfying the usual condition (4).

4.1 Subspace tracking

We propose to first revisit, in the light of the preceding results, the well-known subspace tracking algorithm of Oja [26], which is a generalization of the power method for computing the dominant eigenvector. In several applications, one wants to compute the $r$ principal eigenvectors, i.e. perform principal component analysis (PCA) of an $n \times n$ covariance matrix $A$, where $r \leq n$. Furthermore, for computational reasons or for adaptiveness, the measurements are supposed to be a stream of $n$-dimensional data vectors $z_1, \cdots, z_t, \cdots$ where $\mathbb{E}(z_t z_t^T) = A$ (online estimation). The problem boils down to estimating an element of the Grassmann manifold Gr$(r,n)$ of $r$-dimensional subspaces in an $n$-dimensional ambient space, which can be identified with the set of rank-$r$ projectors:

$$\mathrm{Gr}(r,n) = \{P \in \mathbb{R}^{n\times n} \text{ s.t. } P^T = P,\; P^2 = P,\; \mathrm{Tr}(P) = r\}.$$


Those projectors can be represented by matrices $WW^T$ where $W$ belongs to the Stiefel manifold St$(r,n)$, i.e., the set of matrices of $\mathbb{R}^{n\times r}$ whose columns are orthonormal. Define the cost function

$$C(W) = -\frac{1}{2}\,\mathbb{E}_z\big[z^TWW^Tz\big] = -\frac{1}{2}\,\mathrm{Tr}\big(W^TAW\big)$$

which is minimal when $W$ is a basis of the dominant subspace of the covariance matrix $A$. It is invariant to rotations $W \mapsto WO$, $O \in O(r)$. The state space is therefore the set of equivalence classes $[W] = \{WO \text{ s.t. } O \in O(r)\}$. This set is denoted by St$(r,n)/O(r)$. It is a quotient representation of the Grassmann manifold Gr$(r,n)$. This quotient geometry has been well studied in e.g. [13]. The Riemannian gradient under the event $z$ is $H(z,W) = (I - WW^T)zz^TW$. We have the following result.

Proposition 2. Suppose $z_1, z_2, \cdots$ are uniformly bounded. Consider the stochastic Riemannian gradient algorithm

$$W_{t+1} = W_tV_t \cos(\gamma_t\Theta_t)V_t^T + U_t \sin(\gamma_t\Theta_t)V_t^T \qquad (13)$$

where $U_t\Theta_tV_t^T$ is the compact SVD of the matrix $(I - W_tW_t^T)z_tz_t^TW_t$. Then $W_t$ converges a.s. to an invariant subspace of the covariance matrix $A$.

Proof. The proof is a straightforward application of Theorem 1. Indeed, the update (13) corresponds to (2), as it states that $W_{t+1}$ is on the geodesic emanating from $W_t$ with tangent vector $H(z_t,W_t)$, at a distance $\gamma_t\|H(z_t,W_t)\|$ from $W_t$. As the input sequence is bounded, so is the sequence of gradients. The injectivity radius of the Grassmann manifold is $\pi/2$, and is thus bounded away from zero, and the Grassmann manifold is compact. Thus Theorem 1 proves that $W_t$ a.s. converges to a point such that $\nabla C(W) = 0$, i.e. $AW = WW^TAW$. For such points there exists $M$ such that $AW = WM$, proving $W$ is an invariant subspace of $A$. A local analysis proves the dominant subspace of $A$ (i.e. the subspace associated with the first $r$ eigenvalues) is the only stable subspace of the averaged algorithm [26] under basic assumptions.

We also have the following result

Proposition 3. Consider a twice differentiable retraction $R_W$. The algorithm

$$W_{t+1} = R_{W_t}\!\big(\gamma_t(I - W_tW_t^T)z_tz_t^TW_t\big) \qquad (14)$$

converges a.s. to an invariant subspace of the covariance matrix $A$.

The result is a mere application of Theorem 2. Consider in particular the following retraction: $R_W(\gamma H) = \mathrm{qf}(W + \gamma H)$, where qf() extracts the orthogonal factor in the QR decomposition of its argument. For small $\gamma_t$, this retraction amounts to following the gradient in the Euclidean ambient space $\mathbb{R}^{n\times r}$, and then orthonormalizing the matrix at each step. It is an infinitely differentiable retraction [1]. The algorithm (14) with this particular retraction is known as Oja's vector field for subspace tracking and has already been proved to converge in [25]. Using the general framework proposed in the present paper, we see this convergence result directly stems from Theorem 2.


Figure 1: The Poincaré disk. The boundary is at infinite distance from the center. The geodesics (solid lines) are either arcs of circles perpendicular to the boundary of the disk, or diameters. The dashed circle is the boundary of a geodesic ball centered at 0. Assumption 1 is obviously verified: if a point $w_t$ outside the ball makes a small move towards any point $z_t$ inside the ball along the geodesic linking them, its distance to 0 decreases.

This example clearly illustrates the benefits of using a retraction. Indeed, from a numerical viewpoint, the geodesic update (13) requires performing an SVD at each time step, i.e. $O(nr^2) + O(r^3)$ operations, whereas update (14) only requires an orthonormalization of the columns, which has a lower computational cost of order $O(nr^2)$; this can be very advantageous, especially when $r$ is large.

4.2 Randomized computation of a Karcher mean on a hyperbolic space

We propose to illustrate Theorem 3, and the assumptions it relies on, on a well-known and tutorial manifold. Consider the unit disk $D = \{x \in \mathbb{R}^2 : \|x\| < 1\}$ with the Riemannian metric defined on the tangent plane at $x$ by

$$\langle \xi, \eta\rangle_x = \frac{4\,\xi\cdot\eta}{(1 - \|x\|^2)^2}$$

where "$\cdot$" represents the conventional scalar product in $\mathbb{R}^2$. The metric tensor is thus diagonal, so the angles between two intersecting curves in the Riemannian metric are the same as in the Euclidean space. However, the distances differ: as a point moves closer to the boundary of the disk, the distances are dilated, so that the boundary cannot be reached in finite time. As illustrated on the figure, the geodesics are either arcs of circles that are orthogonal to the boundary circle, or diameters. The Poincaré disk equipped with its metric is a Hadamard manifold.

The Karcher (or Fréchet) mean on a Riemannian manifold is defined as the minimizer of $w \mapsto \sum_{i=1}^N d^2(w, z_i)$. It can be viewed as a natural extension of the usual Euclidean barycenter to the Riemannian case. It is intrinsically defined, and it is unique on Hadamard manifolds.


There has been growing interest in computing Karcher means recently, in particular for filtering on manifolds, see e.g. [2, 5, 4]. On the Poincaré disk we propose to compute the mean of $N$ points in a randomized way. The method is as follows, and is closely related to the approach of [4]. Let $w_t$ be the optimization parameter. The goal is to find the minimum of the cost function

$$C(w) = \frac{1}{2N}\sum_{i=1}^N d^2(w, z_i)$$

At each time step, a point $z_i$ is randomly picked with a uniform probability law. The loss function is $Q(z_t, w_t) = \frac{1}{2}d^2(w_t, z_i)$, and $H(z_t, w_t)$ is the Riemannian gradient of the half squared geodesic distance $\frac{1}{2}D(z_t, w_t) = \frac{1}{2}d^2(z_t, w_t)$. On the Poincaré disk, the distance function is defined by $d(z,w) = \cosh^{-1}(1 + \delta(z,w))$, where $\delta(z,w) = 2\frac{\|z-w\|^2}{(1-\|z\|^2)(1-\|w\|^2)}$. As the metric tensor is diagonal, the Riemannian gradient and the Euclidean gradient have the same direction. Moreover, its norm is simply $d(z_t, w_t)$ (see the Appendix). It is thus easy to prove that

$$H(z_t, w_t) = \frac{(1 - \|w_t\|^2)(w_t - z_t) + \|w_t - z_t\|^2\, w_t}{\big\|(1 - \|w_t\|^2)(w_t - z_t) + \|w_t - z_t\|^2\, w_t\big\|}\; d(z_t, w_t) \qquad (15)$$

When there is a lot of redundancy in the data, i.e. when some points are very close to each other, a randomized algorithm may be much more efficient numerically than a batch algorithm. This becomes obvious in the extreme case where the $z_i$'s are all equal. In this case, the approximated gradient $H(z_t, w_t)$ coincides with the (Riemannian) gradient of the cost $C(w_t)$. However, computing this latter quantity requires $N$ times more operations than computing the approximated gradient. When $N$ is large and when there is a lot of redundancy in the data, we thus see a randomized algorithm can lead to a drastic reduction in the computational burden. Besides, note that the stochastic algorithm can also be used in order to filter a stream of noisy measurements of a single point on the manifold (and thus track this point in case it slowly moves). Indeed, it is easily seen that if $\mathcal{M} = \mathbb{R}^n$ and $d$ is the Euclidean distance, the proposed update boils down to a first-order discrete low-pass filter, as it computes a weighted mean between the current update $w_t$ and the new measurement $z_t$.
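As a small illustration, the following NumPy sketch implements the Poincaré distance and the stochastic gradient (15) used by the randomized Karcher-mean iteration. It is a sketch of these two formulas only: the exponential map of the disk required by update (8), and the adaptive step $f(w_t)$ of Proposition 4 below, are not spelled out in the text and are therefore not included here.

```python
import numpy as np

def poincare_distance(z, w):
    """Geodesic distance on the Poincare disk: d(z, w) = arccosh(1 + delta(z, w))."""
    delta = 2.0 * np.sum((z - w) ** 2) / ((1.0 - np.sum(z ** 2)) * (1.0 - np.sum(w ** 2)))
    return np.arccosh(1.0 + delta)

def karcher_gradient(z, w):
    """Stochastic gradient H(z, w) of the half squared distance, formula (15)."""
    direction = (1.0 - np.sum(w ** 2)) * (w - z) + np.sum((w - z) ** 2) * w
    norm = np.linalg.norm(direction)
    if norm == 0.0:                 # w == z: the gradient vanishes
        return np.zeros_like(w)
    return direction / norm * poincare_distance(z, w)
```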

Proposition 4. Suppose at each time a point $z_t$ is randomly drawn. Let $S > 0$ be such that $S > (\max\{d(z_1,0), \cdots, d(z_N,0)\})^2$ and let $\alpha(w_t) = d(w_t,0) + \sqrt{S}$. Consider the algorithm (8) where $H(z_t, w_t)$ is given by (15) and

$$f(w_t)^2 = \max\big\{1,\;\; \alpha(w_t)^2\big(1 + d(w_t,0) + \alpha(w_t)\big),\;\; \big(2\alpha(w_t)\,d(w_t,0) + \alpha(w_t)^2\big)^2\big\}$$

Then $w_t$ converges a.s. to the Karcher mean of the points $z_1, \cdots, z_N$.

Proof. The conditions of Theorem 3 are easily checked. Assumption 1: it is easy to see on the figure that Assumption 1 is verified with $v = 0$, and $\sqrt{S}$ being the radius of an open geodesic ball centered at 0 and containing all the points $z_1, \cdots, z_N$. More technically, suppose $d(w,0) > \sqrt{S}$. The quantity $\langle \exp^{-1}_w(0), H(z_i,w)\rangle_w$ is equal to $-d(w,0)\,H(z_i,w)\cdot w/(1-\|w\|^2)^2 = -\lambda\big((1-\|w\|^2)(\|w\|^2 - z_i\cdot w) + \|w - z_i\|^2\|w\|^2\big)$, where $\lambda$ is a positive quantity bounded away from zero for $d(w,0) > \sqrt{S}$. As there exists $\beta > 0$ such that $\|w\| - \|z_i\| \geq \beta$, and $\|w - z_i\| \geq \|w\| - \|z_i\|$, the term $\langle \exp^{-1}_w(0), H(z_i,w)\rangle_w$ is negative and bounded away from zero, and so is its average over the $z_i$'s. Assumption 2: in dimension 2, the sectional curvature is known to be identically equal to $-1$. Assumption 3 is obviously satisfied as $\|H(z,w)\| = d(z,w) \leq d(z,0) + d(0,w) \leq \sqrt{S} + d(0,w) = \alpha(w)$ by the triangle inequality.

Note that one could object that, in general, finding a function $f(w_t)$ satisfying Assumption 3 of Theorem 3 requires knowing $d(w_t, v)$ and thus requires some knowledge of the point $v$ of Assumption 1. However, we claim (without proof) that in applications on Hadamard manifolds, there may be obvious choices for $v$. This is the case in the present example, where finding $v$ such that Assumptions 1-3 are satisfied requires very little (or no) knowledge of the point the algorithm is supposed to converge to. Indeed, $v = 0$ is a straightforward choice that always fulfills the assumptions. This choice is convenient for calculations as the geodesics emanating from 0 are radii of the disk, but many other choices would have been possible.

4.3 Identification of a fixed rank symmetric positive semi-definite matrix

To illustrate the benefits of the approach on a recent non-linear problem, we focus in this section on an algorithm of [23], and we prove new rigorous convergence results. Least Mean Squares (LMS) filters have been extensively utilized in adaptive filtering for online regression. Let $x_t \in \mathbb{R}^n$ be the input vector, and $y_t$ be the output defined by $y_t = w^Tx_t + \nu_t$, where the unknown vector $w \in \mathbb{R}^n$ is to be identified (filter weights), and $\nu_t$ is a noise. At each step we let $z_t = (x_t, y_t)$ and the approximated cost function is $Q(z_t, w_t) = \frac{1}{2}(w_t^Tx_t - y_t)^2$. Applying the steepest descent leads to the stochastic gradient algorithm known as LMS: $w_{t+1} = w_t - \gamma_t(w_t^Tx_t - y_t)x_t$.

We now consider a non-linear generalization of this problem coming from the machine learning field (see e.g. [37]), where $x_t \in \mathbb{R}^n$ is the input, $y_t \in \mathbb{R}$ is the output, and the matrix counterpart of the linear model is

$$y_t = \mathrm{Tr}\big(Wx_tx_t^T\big) = x_t^TWx_t \qquad (16)$$

where $W \in \mathbb{R}^{n\times n}$ is an unknown positive semi-definite matrix to be identified. In data mining, positive semi-definite matrices $W$ represent a kernel or a Mahalanobis distance, i.e. $W_{ij}$ is the scalar product, or the distance, between instances $i$ and $j$. We assume at each step an expert provides an estimation of $W_{ij}$, which can be viewed as a random output. The goal is to estimate the matrix $W$ online. We let $z_t = (x_t, y_t)$ and we will apply our stochastic gradient method to the cost function $C(W) = \mathbb{E}_z Q(z,W)$ where $Q(z_t, W_t) = \frac{1}{2}(x_t^TW_tx_t - y_t)^2 = \frac{1}{2}(\hat{y}_t - y_t)^2$, with $\hat{y}_t = x_t^TW_tx_t$ the predicted output.

Due to the large amount of data available nowadays, matrix classification algorithms tend to be applied to computational problems of ever-increasing size. Yet, they need to be adapted to remain tractable, and the matrices' dimensions need to be reduced so that the matrices are storable. A widespread topical method consists of working with low-rank approximations. Any rank-$r$ approximation of a positive definite matrix can be factored as $A = GG^T$ where $G \in \mathbb{R}^{n\times r}$. It is then greatly reduced in size if $r \ll n$, leading to a reduction of the numerical cost of typical matrix operations from $O(n^3)$ to $O(nr^2)$, i.e. linear complexity. This fact has motivated the development of low-rank kernel and Mahalanobis distance learning [18], and geometric understanding of the set of positive semi-definite matrices of fixed rank:

$$S^+(r,n) = \{W \in \mathbb{R}^{n\times n} \text{ s.t. } W = W^T \succeq 0,\; \mathrm{rank}(W) = r\}.$$

4.3.1 Proposed algorithm and convergence results

To endow $S^+(r,n)$ with a metric we start from the square-root factorization $W = GG^T$, where $G \in \mathbb{R}^{n\times r}_*$, i.e. has rank $r$. Because the factorization is invariant by rotation, the search space is identified with the quotient $S^+(r,n) \simeq \mathbb{R}^{n\times r}_*/O(r)$, which represents the set of equivalence classes

$$[G] = \{GO \text{ s.t. } O \in O(r)\}.$$

The Euclidean metric $g_G(\Delta_1, \Delta_2) = \mathrm{Tr}\big(\Delta_1^T\Delta_2\big)$, for $\Delta_1, \Delta_2 \in \mathbb{R}^{n\times r}$ tangent vectors at $G$, is invariant along the equivalence classes. It thus induces a well-defined metric $g_{[G]}(\xi, \zeta)$ on the quotient, i.e. for $\xi, \zeta$ tangent vectors at $[G]$ in $S^+(r,n)$. Classically [1], the tangent vectors of the quotient space $S^+(r,n)$ are identified with the projection onto the horizontal space (the orthogonal space to $[G]$) of tangent vectors of the total space $\mathbb{R}^{n\times r}_*$. So tangent vectors at $[G]$ are represented by the set of horizontal tangent vectors $\{\mathrm{Sym}(\Delta)G,\; \Delta \in \mathbb{R}^{n\times n}\}$, where $\mathrm{Sym}(A) = (A + A^T)/2$. The horizontal gradient of $Q(z_t, G_t)$ is the unique horizontal vector $H(z_t, G_t)$ that satisfies the definition of the Riemannian gradient. In the sequel we will systematically identify an element $G$ with its equivalence class $[G]$, which is a matrix of $S^+(r,n)$. For more details on this manifold see [17]. Elementary computations yield $H(z_t, G_t) = 2(\hat{y}_t - y_t)x_tx_t^TG_t$, and (8) writes

$$G_{t+1} = G_t - \frac{\gamma_t}{f(G_t)}\big(\|G_t^Tx_t\|^2 - y_t\big)\,x_tx_t^TG_t \qquad (17)$$

where we choose $f(G_t) = \max(1, \|G_t\|^6)$ and where the sequence $(\gamma_t)_{t\geq 0}$ satisfies condition (4). This non-linear algorithm is well-defined on the set of equivalence classes, and automatically enforces the rank and positive semi-definiteness constraints of the parameter $G_tG_t^T = W_t$.
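For concreteness, a minimal NumPy sketch of update (17) is given below. It assumes the Frobenius norm in $f(G_t) = \max(1, \|G_t\|^6)$ and uses the gain family (18) introduced in Section 4.3.2; the default values of $a$ and $b$ are illustrative placeholders, not taken from the paper's experiments.

```python
import numpy as np

def identify_low_rank_psd(G0, data_stream, a=0.001, b=1.0 / 5000.0):
    """Riemannian stochastic gradient (17) for W = G G^T, with G in R^{n x r}.

    data_stream -- iterable of pairs (x_t, y_t) with y_t approximately x_t^T W x_t
    Step sizes follow the family (18): gamma_t = a / (1 + b * sqrt(t)).
    """
    G = np.asarray(G0, float).copy()
    for t, (x, y) in enumerate(data_stream):
        gamma = a / (1.0 + b * np.sqrt(t))
        f = max(1.0, np.linalg.norm(G) ** 6)            # adaptive step of Theorem 3
        err = np.dot(G.T @ x, G.T @ x) - y              # ||G_t^T x_t||^2 - y_t
        G = G - (gamma / f) * err * np.outer(x, x @ G)  # update (17)
    return G @ G.T                                      # estimated W
```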

Proposition 5. Let $(x_t)_{t\geq 0}$ be a sequence of zero-centered random vectors of $\mathbb{R}^n$ with independent, identically and normally distributed components. Suppose $y_t = x_t^TVx_t$ is generated by some unknown parameter $V \in S^+(r,n)$. The Riemannian gradient descent (17) is such that $G_tG_t^T = W_t \to W_\infty$ and $\nabla C(W_t) \to 0$ a.s. Moreover, if $W_\infty$ has rank $r = n$, necessarily $W_\infty = V$ a.s. If $r < n$, necessarily $W_t$ a.s. converges to an invariant subspace of $V$ of dimension $r$. If this is the dominant subspace of $V$, then $W_\infty = V$.

The last proposition can be completed with the following fact: it can be easily proved that the dominant subspace of $V$ is a stable equilibrium of the averaged algorithm. As concerns the other invariant subspaces of $V$, simulations indicate they are unstable equilibria. The convergence to $V$ is thus always expected in simulations.

Proof. As here the Euclidean gradient of the loss with respect to the parameter $G_t$ coincides with its projection onto the horizontal space $H(z_t, G_t)$, and is thus a tangent vector to the manifold, we propose to apply Theorem 3 to the Euclidean space $\mathbb{R}^{n\times r}$, which is of course a Hadamard manifold. This is a simple way to avoid computing the sectional curvature of $S^+(r,n)$. Note that the adaptive step $f(G_t)$ introduced in Theorem 3 and the results of Theorem 3 are nevertheless needed, as the usual assumption $\mathbb{E}_x\|H(x,G)\|^k \leq A + B\|G\|^k$ of the Euclidean case is violated. In fact the proposition can be proved under slightly more general assumptions: suppose the components of the input vectors have moments up to the order 8, with second and fourth moments denoted by $a = \mathbb{E}[(x_i)^2]$ and $b = \mathbb{E}[(x_i)^4]$ for $1 \leq i \leq n$, such that $b > a^2 > 0$. We begin with a preliminary result:

Lemma 1. Consider the linear (matrix) map $U : M \mapsto \mathbb{E}_x\big(\mathrm{Tr}(xx^TM)\,xx^T\big)$. $U(M)$ is the matrix whose coordinates are $a^2(M_{ij} + M_{ji})$ for $i \neq j$, and $\mathrm{Tr}(M)\,a^2 + M_{ii}(b - a^2)$ for $i = j$.

Assumption 1: We let $v = 0$. Let "$\cdot$" denote the usual scalar product in $\mathbb{R}^{n\times r}$. For $\|G\|^2$ sufficiently large, $\nabla C(G)\cdot(v - G) = -\mathrm{Tr}\big(\big[\mathbb{E}\big(\mathrm{Tr}(xx^T(GG^T - V))\,xx^TG\big)\big]G^T\big) < -\epsilon < 0$, which means the gradient tends to make the norm of the parameter decrease on average when it is far from the origin. Indeed, let $P = GG^T$. We want to prove that $\mathrm{Tr}(U(P)P) > \mathrm{Tr}(U(V)P) + \epsilon$ for sufficiently large $\|G\|^2 = \mathrm{Tr}(P)$. If we choose a basis in which $P$ is diagonal we have $\mathrm{Tr}(U(P)P) = a^2\,\mathrm{Tr}(P)^2 + (b - a^2)\,\mathrm{Tr}(P^2) = a^2(\sum\lambda_i)^2 + (b - a^2)(\sum\lambda_i^2)$, where $\lambda_1, \ldots, \lambda_n$ are the eigenvalues of $P$. We also have $\mathrm{Tr}(U(V)P) = a^2\,\mathrm{Tr}(P)\,\mathrm{Tr}(V) + (b - a^2)\sum(\lambda_i V_{ii})$. For sufficiently large $\mathrm{Tr}(P)$, $\mathrm{Tr}(P)^2$ is arbitrarily larger than $\mathrm{Tr}(P)\,\mathrm{Tr}(V)$. We also have $\big(\sum(\lambda_i^2)\sum(V_{ii}^2)\big)^{1/2} \geq \sum(\lambda_i V_{ii})$ and $(\sum\lambda_i^2)^{1/2} \geq \frac{1}{n}\mathrm{Tr}(P)$ by the Cauchy-Schwarz inequality. Thus for $\mathrm{Tr}(P) \geq n\,(\sum V_{ii}^2)^{1/2}$, we have $(\sum\lambda_i^2)^{1/2} \geq (\sum V_{ii}^2)^{1/2}$ and thus $\sum\lambda_i^2 \geq \big(\sum(\lambda_i^2)\sum(V_{ii}^2)\big)^{1/2} \geq \sum(\lambda_i V_{ii})$. Assumption 2 is satisfied as the curvature of a Euclidean space is zero. For Assumption 3, using the fact that for $P, Q$ positive semi-definite $\mathrm{Tr}(PQ) \leq \big(\mathrm{Tr}(P^2)\mathrm{Tr}(Q^2)\big)^{1/2} \leq \mathrm{Tr}(P)\,\mathrm{Tr}(Q)$, and that $\mathrm{Tr}(GG^T) = \|G\|^2$ and $\mathrm{Tr}(xx^T) = \|x\|^2$, it is easy to prove there exists $B > 0$ such that $\|H(x,G)\|^2 = (\|G^Tx\|^2 - y)^2\|xx^TG\|^2 \leq (\|G\|^6 + B\|G\|^2)\|x\|^8$. Thus there exists $\mu > 0$ such that $[\max(1, \|G\|^3)]^2$ is greater than $\mu\,\mathbb{E}_x\|H(x,G)\|^2$. On the other hand, there exists $\lambda$ such that $\lambda\,\mathbb{E}_x\big(2\|H(x,G)\|\|G\| + \|H(x,G)\|^2\big)^2 \leq \max(1, \|G\|^6)$. But the alternative step $\max(\mu, \lambda)\gamma_t$ satisfies condition (4).

Let us analyze the set of possible asymptotic values. It is characterized by $U(GG^T - V)G = 0$. Let $M$ be the symmetric matrix $GG^T - V$. If $G$ is invertible, this means $U(M) = 0$. Using the lemma above, we see that the off-diagonal terms of $M$ are equal to 0, and summing the diagonal terms, $\big((n-1)a^2 + b\big)\mathrm{Tr}(M) = 0$, thus $\mathrm{Tr}(M) = 0$, which then implies $M = 0$ as $b > a^2$. Now suppose $r < n$. If $b = 3a^2$, as for the normal distribution, $U(M) = 2M + \mathrm{Tr}(M)\,I$ and $U(GG^T - V)G = 0$ implies $G(kI + 2G^TG) = 2VG$ for some $k \in \mathbb{R}$. Thus (as in example 4.1) it implies that $G$ spans an invariant subspace of $V$. Similarly to the full-rank case, it is easy to prove $W_\infty = V$ when $V$ and $G$ span the same subspace.

4.3.2 Gain tuning

The condition (4) is common in stochastic approximation. As in the standard filtering problem, or in Kalman filter theory, the more noisy observations of a constant process one gets, the weaker the gain of the filter becomes. It is generally recommended to set $\gamma_t = a/(1 + b\,t^{1/2+\epsilon})$, where in theory $\epsilon > 0$, but in practice we propose to take $\epsilon = 0$, leading to the family of gains

$$\gamma_t = \frac{a}{1 + b\,t^{1/2}} \qquad (18)$$


If the gain remains too high, the noise will make the estimator oscillate around the solution, while a low gain leads to slow convergence. The coefficient $a$ represents the initial gain. It must be high enough to ensure sufficiently fast convergence, but not so large that it amplifies the noise. The coefficient $b$ governs the asymptotic behavior of the algorithm and must be set such that the algorithm is insensitive to noise in the final iterations (a high noise could destabilize the final matrix identified over a given training set). In practice, $a$ is generally set experimentally using a reduced number of iterations, and $b$ must be such that the variance of the gradient is very small compared to the entries of $G_t$ for large $t$.

4.3.3 Simulation results

Asymptotic convergence of $GG^T$ to the true value $V$ is always achieved in simulations. When $t$ becomes large, the behavior of the stochastic algorithm is very close to the behavior of the averaged gradient descent algorithm $J_{t+1} = J_t - \frac{\gamma_t}{f(J_t)}\mathbb{E}_z H(z, J_t)$, as illustrated in Figure 2. This latter algorithm has a well-characterized behavior in simulations: in a first phase the estimation error decreases rapidly, and in a second phase it slowly converges to zero. As the number of iterations increases, the estimation error becomes arbitrarily small.

In all the experiments, the estimated matrices have an initial norm equal to $\|V\|$. This is because a large initial norm discrepancy induces a rapid decrease in the estimation error, which then would seem to tend very quickly to zero compared to its initial value. Thus, a fair experiment requires comparable initial norms. In the first set of numerical experiments, Gaussian input vectors $x_t \in \mathbb{R}^{100}$ with zero mean and identity covariance matrix are generated. The output is generated via model (16), where $V \in \mathbb{R}^{100\times 100}$ is a symmetric positive semi-definite matrix with rank $r = 3$. The results are illustrated on Figure 2 and indicate the matrix $V$ is asymptotically well identified.

In order to compare the proposed method with another algorithm, we propose to focus on the full-rank case, and compare the algorithm with a naive but efficient technique. Indeed, when $r = n$, the cost function $C(W)$ becomes convex in the parameter $W \in P^+(n)$ and the only difficulty is to numerically maintain $W$ as positive semi-definite. Thus, a simple method to attack the problem of identifying $V$ is to derive a stochastic gradient algorithm in $\mathbb{R}^{n\times n}$, and to project the iterate at each step onto the cone of positive semi-definite matrices, i.e.,

$$P_0 \in S^+(n,n), \qquad P_{t+1} = \pi\big(P_t - \gamma_t\nabla Q(z_t, P_t)\big) \qquad (19)$$

where $\pi$ is the projection on the cone. It has been proved in [16] that this projection can be performed by diagonalizing the matrix and setting all the negative eigenvalues equal to zero. Figure 3 illustrates the results. Both algorithms have comparable performances. However, the proposed algorithm (17) is a little slower than the stochastic algorithm (19), which is as expected, since this latter algorithm takes advantage of the convexity of the averaged cost function in the full-rank case.
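For comparison, here is a minimal sketch of the projected stochastic gradient (19) in the same NumPy style. The eigenvalue clipping implements the projection $\pi$ described above, and the Euclidean gradient of $Q(z,P) = \frac{1}{2}(x^TPx - y)^2$ is written in closed form; all function names are illustrative.

```python
import numpy as np

def project_psd(P):
    """Projection pi onto the PSD cone: zero out the negative eigenvalues."""
    eigval, eigvec = np.linalg.eigh((P + P.T) / 2.0)
    return (eigvec * np.maximum(eigval, 0.0)) @ eigvec.T

def identify_full_rank_psd(P0, data_stream, gammas):
    """Projected stochastic gradient, update (19), in the full-rank case."""
    P = np.asarray(P0, float).copy()
    for gamma, (x, y) in zip(gammas, data_stream):
        grad = (x @ P @ x - y) * np.outer(x, x)   # Euclidean gradient of Q(z, P)
        P = project_psd(P - gamma * grad)
    return P
```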

However, the true advantage of the proposed approach is essentially computational, and becomes more apparent when the rank is low. Indeed, when $n$ is very large and $r \ll n$, (17) has linear complexity in $n$, whereas a method based on diagonalization requires at least $O(n^2)$ operations and may become intractable. Moreover, the problem is not convex anymore due to the rank constraint, and an approximation-then-projection technique can lead to degraded


Figure 2: Identification of a rank-3 matrix $V$ of dimension $100\times 100$ with algorithm (17). Top plot: output (or classification) error $x^TGG^Tx - y$ versus the number of iterations. Bottom plot: estimation error $\|G_tG_t^T - V\|$ for the stochastic algorithm (solid line) and estimation error $\|JJ^T - V\|$ for the deterministic averaged algorithm $J_{t+1} = J_t - \frac{\gamma_t}{f(J_t)}\mathbb{E}_z H(z, J_t)$ (dashed line). The curves nearly coincide. The chosen gain is $\gamma_t = .001/(1 + t/5000)^{1/2}$.


Figure 3: Comparison with a simple algorithm in the full-rank case, with $r = n = 20$. Plot of the estimation error $\|GG^T - V\|$ for algorithm (17) (solid line) and $\|P - V\|$ for algorithm (19) (dashed line), versus the number of iterations. The gain is $\gamma_t = .01/(1 + t/500)^{1/2}$.

performance. Thus, comparing both techniques (17) and a technique based on diagonalization such as (19) is pointless for low-rank applications, and finding relevant algorithms is an involved task that has been recently addressed in several papers, see e.g. [23, 33]. Since in the present paper the emphasis is put on mathematical convergence results, the interested reader can refer to [21], where (17) and its variants have been extensively tested on several databases. They are shown to compete with state-of-the-art methods, and to scale very well when the matrix is of very large dimension (a variant was tested on the Netflix prize database). They have also been recently compared to more involved Riemannian methods in [33].

4.4 Non-linear gossip algorithm for decentralized covariance matrix estimation

The problem of computing distributed averages on a network appears in many applications, such as multi-agent systems, distributed data fusion, and decentralized optimization. The underlying idea is to replace expensive wiring by a network where the information exchange is reduced owing to various limitations in data communication. A way to compute distributed averages that has gained popularity over the last years is the so-called gossip algorithm [10]. It is a randomized procedure where at each iteration, a node communicates with one neighbor and both nodes set their value equal to the average of their current values. The goal is for all nodes to reach a common intermediate value as quickly as possible, with little computational power. Gossip algorithms are a special case of distributed optimization algorithms, where stochastic gradient descent plays a central role (see e.g. [36]). When applied to multi-agent systems, this algorithm allows the agents to reach a consensus, i.e. agree on a common quantity. This is the well-known consensus problem [24]. When the consensus space is not linear (for instance a group of agents wants to agree on a common direction of motion, or a group of oscillators on a common phase), the methods need to be adapted to the non-linearities of the problem (see e.g. [32]). Consensus on manifolds has recently received increasing attention,

19

Page 20: Stochastic gradient descent on Riemannian manifolds · Stochastic gradient descent on Riemannian manifolds ... To demon-strate the main ideas on a toy example, we briefly mention

see e.g. the recent work of [35, 31] for deterministic procedures. In [30], a gossip algorithmon the circle has been proposed and analyzed.

In this section, we address the problem of estimating a covariance matrix $W$ on a sensor network in a decentralized way (see e.g. [20]): suppose each node $i$ provides measurements $y_i = [y_i^1, \cdots, y_i^m] \in \mathbb{R}^{n\times m}$, where the vectors $y_i^j \in \mathbb{R}^n$ are zero-centered normally distributed random vectors with a covariance matrix to be estimated. After a local computation, each node is assumed to possess an initial estimated covariance matrix $W_{i,0}$. Neighboring nodes are allowed to exchange information at random times. We assume the nodes are labeled according to their proximity as follows: for $i \leq m-1$ the nodes $i$ and $i+1$ are neighbors. At each time step $t$, we suppose a node $i < m$ is picked randomly with probability $p_i > 0$ (where $p_i$ represents for instance the frequency of availability of the communication channel between nodes $i$ and $i+1$) and the nodes in question update their covariance estimates $W_{i,t}$ and $W_{i+1,t}$. Our goal is that they reach consensus on a common intermediate value. To do so, the procedure (2) is implemented using an interesting alternative metric on the cone of positive definite matrices $P^+(n)$.
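As a minimal sketch of this setup (our own illustrative code, not the paper's; the sample size and the helper name local_estimate are assumptions), each node draws a few zero-mean Gaussian samples with the common covariance and forms its local empirical estimate $W_{i,0}$:

```python
import numpy as np

rng = np.random.default_rng(42)
n, m = 10, 6                                  # dimension and number of nodes
A = rng.standard_normal((n, n))
W_true = A @ A.T + np.eye(n)                  # unknown covariance to be estimated

def local_estimate(n_samples=50):
    """Empirical covariance W_{i,0} built from zero-mean samples drawn at one node."""
    y = rng.multivariate_normal(np.zeros(n), W_true, size=n_samples)  # rows = samples y_i^j
    return y.T @ y / n_samples

W0 = [local_estimate() for _ in range(m)]     # initial estimates, one per node
```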

4.4.1 A statistically meaningful distance on P+(n)

Information geometry allows one to define Riemannian distances between probability distributions that are very meaningful from a statistical point of view. On the manifold of symmetric positive definite matrices $P^+(n)$, the so-called Fisher Information Metric for two tangent vectors $X_1, X_2$ at $P \in P^+(n)$ is given by
$$\langle X_1, X_2\rangle_P = \mathrm{Tr}\left(X_1 P^{-1} X_2 P^{-1}\right) \qquad (20)$$

It defines an infinitesimal distance that agrees with the celebrated Kullback-Leibler divergence between probability distributions: up to third-order terms in a small symmetric matrix, say, $X$, we have
$$KL(\mathcal{N}(0, P)\,\|\,\mathcal{N}(0, P + X)) = \langle X, X\rangle_P$$
for any $P \in P^+(n)$, where $\mathcal{N}(0, P)$ denotes a Gaussian distribution of zero mean and covariance matrix $P$ and $KL$ denotes the Kullback-Leibler divergence. The geodesic distance writes $d(P, Q) = (\sum_{k=1}^n \log^2(\lambda_k))^{1/2}$, where $\lambda_1, \cdots, \lambda_n$ are the eigenvalues of the matrix $PQ^{-1}$, and it represents the amount of information that separates $P$ from $Q$.

The introduced notion of statistical information is easily understood from a simple variance estimation problem with $n = 1$. Indeed, consider the problem of estimating the variance of a random variable $y \sim \mathcal{N}(0, \sigma)$. In statistics, the Cramer-Rao bound provides a lower bound on the accuracy of any unbiased estimator $\hat\sigma$ of the variance $\sigma$: here it states $E(\hat\sigma - \sigma)^2 \geq \sigma^2$. Thus the smaller $\sigma$ is, the more potential information the distribution contains about $\sigma$. As a result, two samples drawn respectively from, say, the distributions $\mathcal{N}(0, 1000)$ and $\mathcal{N}(0, 1001)$ look much more similar than samples drawn respectively from the distributions $\mathcal{N}(0, 0.1)$ and $\mathcal{N}(0, 1.1)$. In other words, a unit increment in the variance will have a small impact on the corresponding distributions if initially $\sigma = 1000$, whereas it will have a high impact if $\sigma = 0.1$. Identifying zero-mean Gaussian distributions with their variances, the Fisher Information Metric accounts for that statistical discrepancy, as (20) writes $\langle d\sigma, d\sigma\rangle_\sigma = (d\sigma/\sigma)^2$.


But the Euclidian distance does not, as the Euclidian distance between the variances is equal to 1 in both cases.

The metric (20) is also known as the natural metric on $P^+(n)$, and admits strong invariance properties (see Proposition 6 below). These properties make the Karcher mean associated with this distance more robust to outliers than the usual arithmetic mean. This is the main reason why this distance has attracted ever increasing attention in medical imaging applications (see e.g. [28]) and radar processing [5] over the last few years.
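As an illustration (a sketch of ours, not code from the paper; the helper name fisher_distance is an assumption), the geodesic distance $d(P, Q) = (\sum_k \log^2 \lambda_k(PQ^{-1}))^{1/2}$ can be computed from generalized eigenvalues, and the scalar case recovers the discrepancy discussed above: the pair $(1000, 1001)$ is far closer in the Fisher geometry than $(0.1, 1.1)$, although both pairs are at Euclidian distance 1.

```python
import numpy as np
from scipy.linalg import eigvals

def fisher_distance(P, Q):
    """Fisher (affine-invariant) distance on P+(n):
    d(P, Q) = sqrt(sum_k log^2 lambda_k), with lambda_k the eigenvalues of P Q^{-1}."""
    lam = np.real(eigvals(P, Q))   # generalized eigenvalues = spectrum of Q^{-1} P = spectrum of P Q^{-1}
    return np.sqrt(np.sum(np.log(lam) ** 2))

print(fisher_distance(np.array([[1000.0]]), np.array([[1001.0]])))  # ~ 0.001
print(fisher_distance(np.array([[0.1]]),    np.array([[1.1]])))     # ~ 2.4
```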

4.4.2 A novel randomized algorithm

We propose the following randomized procedure to tackle the problem above. At each step $t$, a node $i < m$ is picked randomly with probability $p_i > 0$, and both neighboring nodes $i$ and $i+1$ move their values towards each other along the geodesic linking them, to a distance $\gamma_t\, d(W_{i,t}, W_{i+1,t})$ from their current position. Note that, for $\gamma_t = 1/2$, the updated matrix $W_{i,t+1}$ is at exactly half Fisher information (geodesic) distance between $W_{i,t}$ and $W_{i+1,t}$. This is an application of update (2) where $z_t$ denotes the selected node at time $t$ and has probability distribution $(p_1, \cdots, p_{m-1})$, and where the average cost function writes
$$C(W_1, \cdots, W_m) = \sum_{i=1}^{m-1} p_i\, d^2(W_i, W_{i+1})$$
on the manifold $P^+(n) \times \cdots \times P^+(n)$. Using the explicit expression of the geodesics [14], update (2) writes

$$
\begin{aligned}
W_{i,t+1} &= W_{i,t}^{1/2} \exp\!\big(\gamma_t \log(W_{i,t}^{-1/2}\, W_{i+1,t}\, W_{i,t}^{-1/2})\big)\, W_{i,t}^{1/2},\\
W_{i+1,t+1} &= W_{i+1,t}^{1/2} \exp\!\big(\gamma_t \log(W_{i+1,t}^{-1/2}\, W_{i,t}\, W_{i+1,t}^{-1/2})\big)\, W_{i+1,t}^{1/2}
\end{aligned}
\qquad (21)
$$
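A minimal sketch of update (21) (our own code, not the author's implementation; the helper names and the use of SciPy's matrix functions sqrtm, logm and expm are our choices):

```python
import numpy as np
from scipy.linalg import sqrtm, logm, expm, inv

def geodesic_step(A, B, gamma):
    """Point at fraction gamma along the Fisher geodesic from A towards B."""
    A_half = np.real(sqrtm(A))
    A_half_inv = inv(A_half)
    M = np.real(logm(A_half_inv @ B @ A_half_inv))
    return A_half @ np.real(expm(gamma * M)) @ A_half

def gossip_update(W, i, gamma):
    """Update (21): nodes i and i+1 move towards each other along the geodesic linking them."""
    Wi, Wj = W[i], W[i + 1]
    W[i], W[i + 1] = geodesic_step(Wi, Wj, gamma), geodesic_step(Wj, Wi, gamma)
    return W
```

For $\gamma_t = 1/2$ both calls return the same matrix, the Fisher geodesic midpoint (geometric mean) of $W_{i,t}$ and $W_{i+1,t}$.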

This algorithm has several theoretical advantages. First, it is based on the Fisher information metric, and thus is natural from a statistical viewpoint. Then, it has several nice properties, as illustrated by the following two results:

Proposition 6. The algorithm (21) is invariant to the action of $GL(n)$ on $P^+(n)$ by congruence.

Proof. This is merely a consequence of the invariance of the metric, and of the geodesic distance identity $d(GPG^T, GQG^T) = d(P, Q)$ for any $P, Q \in P^+(n)$ and $G \in GL(n)$.
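A quick numerical sanity check of this invariance (our own sketch, reusing the fisher_distance helper above; the random matrices are purely illustrative):

```python
import numpy as np   # fisher_distance from the sketch above is assumed to be in scope

rng = np.random.default_rng(1)
P = rng.standard_normal((5, 5)); P = P @ P.T + 5 * np.eye(5)
Q = rng.standard_normal((5, 5)); Q = Q @ Q.T + 5 * np.eye(5)
G = rng.standard_normal((5, 5))              # invertible with probability one

d1 = fisher_distance(P, Q)
d2 = fisher_distance(G @ P @ G.T, G @ Q @ G.T)
print(abs(d1 - d2))                          # ~ 1e-12: invariance by congruence
```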

The proposition has the following meaning: after a linear change of coordinates, if all the measurements $y_i^j$ are transformed into new measurements $G y_i^j$, where $G$ is an invertible matrix on $\mathbb{R}^n$, and the corresponding estimated initial covariance matrices are accordingly transformed into $G W_{i,0} G^T$, then the algorithm (21) is unchanged, i.e., for any node $i$ the algorithm with initial values $G W_{i,0} G^T$ will yield at time $t$ the matrix $G W_{i,t} G^T$, where $W_{i,t}$ is the updated value corresponding to the initial values $W_{i,0}$. As a result, the algorithm will perform equally well for a given problem independently of the choice of coordinates in which the covariance matrices are expressed (this implies in particular invariance to the orientation of the axes, and invariance to changes of units, such as meters versus feet, which can be desirable and physically meaningful in some applications). The most important result is merely an application of the theorems of the present paper:


Figure 4: Entries of the matrices at each node versus the number of iterations over a single run. The matrices are of dimension $2\times 2$ and the graph has 6 nodes. Convergence to a common (symmetric) matrix is observed.

Proposition 7. If the sequence $\gamma_t$ satisfies the usual assumption (4), and is upper bounded by $1/2$, the covariance matrices at each node converge a.s. to a common value.

Proof. This is simply an application of Theorem 1. Indeed, $P^+(n)$ endowed with the natural metric is a complete manifold, and thus one can find a geodesic ball containing the initial values $W_1, \cdots, W_m$. In this manifold, geodesic balls are convex (see e.g. [2]). At each time step two points move towards each other along the geodesic linking them, but as $\gamma_t \leq 1/2$ their updated values lie between their current values, so they remain in the ball by convexity. Thus, the values belong to a compact set at all times. Moreover the injectivity radius is bounded away from zero, and the gradient is bounded in the ball, so Theorem 1 can be applied. $\nabla_{W_m} C = 0$ implies that $d(W_{m-1}, W_m) = 0$, as $\nabla_{W_m} C = -2 p_{m-1} \exp^{-1}_{W_m}(W_{m-1})$ (see Appendix). Thus $W_{m-1}$ and $W_m$ a.s. converge to the same value. But as $\nabla_{W_{m-1}} C = 0$, this implies $W_{m-2}$ converges a.s. to the same value. By the same token we see all nodes converge to the same value a.s.

4.4.3 Simulation results

As the cone of positive definite matrices $P^+(n)$ is convex, the standard gossip algorithm is well defined on $P^+(n)$. If node $i$ is drawn, it simply consists of the update
$$W_{i,t+1} = W_{i+1,t+1} = (W_{i,t} + W_{i+1,t})/2 \qquad (22)$$
In the following numerical experiments we let $n = 10$, $m = 6$, and nodes are drawn with uniform probability. The step $\gamma_t$ is fixed equal to $1/2$ over the experiment so that (21) can be viewed as a Riemannian gossip algorithm (the condition (4) is only concerned with the asymptotic behavior of $\gamma_t$ and can thus be satisfied even if $\gamma_t$ is fixed over a finite number of iterations). Simulations show that both algorithms always converge. Convergence of the Riemannian algorithm (21) is illustrated in Figure 4.
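A sketch of this experiment under our own assumptions (reusing the local_estimate and gossip_update helpers above; the paper's exact data are not reproduced):

```python
import numpy as np   # local_estimate and gossip_update from the sketches above are assumed in scope

rng = np.random.default_rng(2)
m, T = 6, 15
W_riem = [local_estimate() for _ in range(m)]          # initial node estimates W_{i,0}
W_eucl = [Wi.copy() for Wi in W_riem]

for t in range(T):
    i = int(rng.integers(m - 1))                       # uniform edge selection
    W_riem = gossip_update(W_riem, i, gamma=0.5)       # Riemannian gossip, update (21)
    mean = (W_eucl[i] + W_eucl[i + 1]) / 2.0           # Euclidian gossip, update (22)
    W_eucl[i], W_eucl[i + 1] = mean, mean.copy()

diameter = lambda W: max(np.linalg.norm(W[a] - W[b]) for a in range(m) for b in range(m))
print(diameter(W_riem), diameter(W_eucl))              # diameter criterion of Figure 6
```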


In Figures 5 and 6, the two algorithms are compared. Due to the stochastic nature of the algorithm, the simulation results are averaged over 50 runs. Simulations show that the Riemannian algorithm (21) converges faster on average than the usual gossip algorithm (22). The convergence is slightly faster when the initial matrices $W_{1,0}, \cdots, W_{m,0}$ have approximately the same norm. But when the initial matrices are far from each other, the Riemannian consensus algorithm outperforms the usual gossip algorithm. In Figure 5, the evolution of the cost $C(W_{1,t}, \cdots, W_{m,t})^{1/2}$ is plotted versus the number of iterations. In Figure 6, the diameter of the convex hull of the matrices $W_{1,t}, \cdots, W_{m,t}$ is considered as an alternative convergence criterion. We see the superiority of the Riemannian algorithm is particularly striking with respect to this convergence criterion. It can also be observed in simulations that the Riemannian algorithm is more robust to outliers. Together with its statistical motivations, and its invariance and guaranteed convergence properties, this makes it an interesting procedure for decentralized covariance estimation, or more generally randomized consensus on $P^+(n)$.

Figure 5: Comparison of Riemannian (solid line) and Euclidian (dashed line) gossip for covariance matrices of dimension $10\times 10$ with a 6-node graph. The plotted curves (vertical axis $(\sum_i p_i \|W_{i+1} - W_i\|^2)^{1/2}$) represent the square root of the averaged cost $C(W_{1,t}, \cdots, W_{m,t})^{1/2}$ versus the number of iterations, averaged over 50 runs, with initial matrices all having unit norm (top plot) and with initial matrices having norms ranging from 0.1 to 10 (bottom plot). The Riemannian algorithm converges faster (top plot). Its superiority is particularly striking when the nodes have heterogeneous initial values (bottom plot).

5 Conclusion

In this paper we proposed a stochastic gradient algorithm on Riemannian manifolds. Under reasonable assumptions the convergence of the algorithm was proved. Moreover the convergence results are proved to hold when a retraction is used, a feature of great practical interest. The approach is versatile, and potentially applicable to numerous non-linear problems in several fields of research, such as control, machine learning, and signal processing, where the manifold approach is often used either to enforce a constraint or to derive an intrinsic algorithm.


Figure 6: Comparison of Riemannian (solid line) and Euclidian (dashed line) gossip for covariance matrices of dimension $10\times 10$ with a 6-node graph, with another convergence criterion. The plotted curves represent the diameter of the convex hull, $\max_{i,j}\|W_{i,t} - W_{j,t}\|$, versus the number of iterations, averaged over 50 runs, with initial matrices all having unit norm (top plot) and with initial matrices having norms ranging from 0.1 to 10 (bottom plot). The Riemannian algorithm converges faster (top plot). It outperforms the Euclidian algorithm when the nodes have heterogeneous initial values (bottom plot).

Another important connection with the literature concerns Amari's natural gradient [3], a technique that has led to substantial gains in the blind source separation problem, and that can be cast in our framework. Indeed, the idea is to consider successive realizations $z_1, z_2, \cdots$ of a parametric model with parameter $w \in \mathbb{R}^n$ and joint probability $p(z, w)$. The goal is to estimate the parameter $w$ online. Amari proposes to use algorithm (3) where the Riemannian metric is the Fisher Information Metric associated with the parametric model, the loss is the log-likelihood $Q(z, w) = \log p(z, w)$, and the retraction is the mere addition in $\mathbb{R}^n$. The resulting algorithm, the so-called natural gradient, is proved to be asymptotically efficient, i.e., to reach an asymptotic Cramer-Rao lower bound. Using the true exponential map, and thus algorithm (2), would result in a different (intrinsic) update. In [34], S. Smith has proposed an intrinsic Cramer-Rao bound based on the Fisher metric. In future work, one could explore whether the intrinsic algorithm (2) asymptotically reaches the intrinsic Cramer-Rao bound. More details are given in Appendix A.

In the future we would also like to explore two of the aforementioned applications. First, the matrix completion problem. Proving the convergence of the stochastic gradient algorithm [22] requires studying the critical points of the averaged cost function. This leads to proving mathematically involved results on low-rank matrix identifiability, possibly extending the non-trivial work of [11]. Then, we would like to prove more general results for non-linear consensus on complete manifolds with a stochastic communication graph. In particular we hope to extend or improve the convergence bounds of the gossip algorithms in the Euclidian case [10] to problems such as the ones described in [32, 30], and also to understand to what extent gossip algorithms for consensus can be faster in a hyperbolic geometry.


Acknowledgements

The author would like to thank Gilles Meyer and Rodolphe Sepulchre for early collaboration on the subject, and Leon Bottou for interesting discussions.

Appendix A: Links with information geometry

An important concept in information geometry is the natural gradient. Let us show it is related to the method proposed in this paper. Suppose now that the $z_t$ are realizations of a parametric model with parameter $w \in \mathbb{R}^n$ and joint probability density function $p(z, w)$. Now let
$$Q(z, w) = l(z, w) = \log(p(z, w))$$
be the log-likelihood of the parametric law $p$. If $\hat w$ is an estimator of the true parameter $w^*$ based on $k$ realizations of the process $z_1, \cdots, z_k$, the covariance matrix is larger than the Cramer-Rao bound:
$$E[(\hat w - w^*)(\hat w - w^*)^T] \geq \frac{1}{k}\, G(w^*)^{-1}$$
with $G$ the Fisher information matrix $G(w) = E_z[(\nabla^E_w l(z, w))(\nabla^E_w l(z, w))^T]$, where $\nabla^E$ denotes the conventional gradient in Euclidian spaces. As $G(w)$ is a positive definite matrix, it defines a Riemannian structure on the state space $M = \mathbb{R}^n$, known as the Fisher information metric. In this chart the Riemannian gradient of $Q(z, w)$ writes $G^{-1}(w)\nabla^E_w l(z, w)$. As $M = \mathbb{R}^n$, a simple retraction is the addition $R_w(u) = w + u$. Taking $\gamma_t = 1/t$, which is compatible with assumption 1, update (3) writes $w_{t+1} = w_t - \frac{1}{t}\, G^{-1}(w_t)\nabla^E_w l(z_t, w_t)$. This is the celebrated natural gradient of Amari [3].
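As a minimal scalar illustration (our own sketch, not the paper's code, and using the common convention of descending the negative log-likelihood): when estimating the variance $v$ of a scalar Gaussian $\mathcal{N}(0, v)$, for which the Fisher information is $G(v) = 1/(2v^2)$, the natural gradient update with $\gamma_t = 1/t$ reduces to the running mean of the squared samples, i.e. the efficient estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
v_true, v = 2.0, 0.5          # true and initial variance of the scalar model N(0, v)

for t in range(1, 20001):
    z = rng.normal(0.0, np.sqrt(v_true))
    grad = 0.5 / v - z**2 / (2.0 * v**2)   # Euclidean gradient of -log p(z, v)
    G = 1.0 / (2.0 * v**2)                 # Fisher information of N(0, v) w.r.t. v
    v -= (1.0 / t) * grad / G              # natural gradient step, gamma_t = 1/t
                                           # (equivalently: v -= (v - z**2) / t)

print(v)   # running mean of z^2, which attains the Cramer-Rao bound asymptotically
```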

Assuming $w_t$ converges to the true parameter $w^*$ generating the data, Amari proves it is an asymptotically efficient estimator. Indeed, letting $V_t = E[(w_t - w^*)(w_t - w^*)^T]$ we have
$$V_{t+1} = V_t - 2E\Big[\frac{1}{t}\, G^{-1}\nabla^E_w l(z_t, w_t)\,(w_t - w^*)^T\Big] + \frac{1}{t^2}\, G^{-1} G G^{-1} + O\Big(\frac{1}{t^3}\Big)$$
But up to second-order terms $\nabla^E_w l(z_t, w_t) = \nabla^E_w l(z_t, w^*) + (\nabla^E_w)^2 l(z_t, w^*)(w_t - w^*)$, with $E[\nabla^E_w l(z, w^*)] = 0$ as $w^*$ achieves a maximum of the expected log-likelihood, where the expectation is with respect to the law $p(z, w^*)$, and $G(w) = E[(\nabla^E_w)^2 l(z_t, w)]$ because of the basic properties of the Cramer-Rao bound. Finally $V_{t+1} = V_t - 2V_t/t + G^{-1}/t^2$, up to terms whose average can be neglected. The asymptotic solution of this equation is $V_t = G^{-1}/t + O(1/t^2)$, proving statistical efficiency.

[34] has recently derived an intrinsic Cramer-Rao bound. The bound does not depend onany non-trivial choice of coordinates, i.e. the estimationerror‖w − w‖ is replaced with theRiemannian distance associated to the Fisher information metric. In the same way, the usualnatural gradient updatewt+1 = wt− 1

tG−1∇E

w l(zt, wt) could be replaced with its intrinsic ver-sion (2) proposed in this paper. It can be conjectured this estimator achieves Fisher efficiency

25

Page 26: Stochastic gradient descent on Riemannian manifolds · Stochastic gradient descent on Riemannian manifolds ... To demon-strate the main ideas on a toy example, we briefly mention

i.e. reaches theintrinsic Cramer-Rao bound as defined in [34]. Such a result in the theory ofinformation geometry goes beyond the scope of this paper andis left for future research.

Appendix B: Riemannian geometry background

Let $(M, g)$ be a connected Riemannian manifold (see e.g. [1] for basic Riemannian geometry definitions). It carries the structure of a metric space whose distance function is the arc length of a minimizing path between two points. The length $L$ of a curve $c(t) \in M$ is defined by
$$L = \int_a^b \sqrt{g(\dot c(t), \dot c(t))}\, dt = \int_a^b \|\dot c(t)\|\, dt$$

If $y$ is sufficiently close to $x \in M$, there is a unique path of minimal length linking $x$ and $y$. It is called a geodesic. The exponential map is defined as follows: $\exp_x(v)$ is the point $z \in M$ situated on the geodesic with initial position-velocity $(x, v)$ at distance $\|v\|$ of $x$. We also define $\exp^{-1}_x(z) = v$. The cut locus of $x$ is, roughly speaking, the set where the geodesics starting at $x$ stop being paths of minimal length (for example $\pi$ on the circle for $x = 0$). The least distance to the cut locus is the so-called injectivity radius $I$ at $x$. A geodesic ball is a ball with radius less than the injectivity radius at its center.
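As a concrete instance (our own illustration, not taken from the paper), on the unit sphere the exponential and logarithm maps have closed forms, and the cut locus of $x$ is its antipode, at distance $\pi$:

```python
import numpy as np

def sphere_exp(x, v):
    """exp_x(v) on the unit sphere, v a tangent vector at x (v . x = 0)."""
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return x
    return np.cos(nv) * x + np.sin(nv) * v / nv

def sphere_log(x, z):
    """exp_x^{-1}(z): tangent vector at x pointing towards z, with norm d(x, z)."""
    theta = np.arccos(np.clip(np.dot(x, z), -1.0, 1.0))   # geodesic distance
    if theta < 1e-12:
        return np.zeros_like(x)
    u = z - np.cos(theta) * x
    return theta * u / np.linalg.norm(u)

x = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 0.5, 0.0])
z = sphere_exp(x, v)
print(np.linalg.norm(sphere_log(x, z) - v))   # ~ 0 while ||v|| < pi (the injectivity radius)
```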

For $f : M \to \mathbb{R}$ twice continuously differentiable, one can define the Riemannian gradient as the tangent vector at $x$ satisfying $\frac{d}{dt}\big|_{t=0} f(\exp_x(tv)) = \langle v, \nabla f(x)\rangle_g$, and the hessian as the operator such that $\frac{d}{dt}\big|_{t=0}\langle \nabla f(\exp_x(tv)), \nabla f(\exp_x(tv))\rangle_g = 2\langle \nabla f(x), (\nabla^2_x f)\, v\rangle_g$. For instance, if $f(x) = \frac{1}{2} d^2(p, x)$ is half the squared distance to a point $p$, the Riemannian gradient is $\nabla_x f = -\exp^{-1}_x(p)$ (consistently with the expression of $\nabla_{W_m} C$ used in the proof of Proposition 7), i.e. it is a tangent vector at $x$ collinear to the geodesic linking $x$ and $p$, with norm $d(p, x)$. Letting $c(t) = \exp_x(tv)$ we have

+

∫ t

0

(t− s)〈 dds

c(s), (∇2c(s)f)

d

dsc(s)〉gds.

and thusf(expx(tv)) − f(x) ≤ t〈v,∇f(x)〉g + t2

2‖v‖2gk, wherek is a bound on the hessian

along the geodesic.

References

[1] P.A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2007.

[2] B. Afsari. Riemannian L^p center of mass: existence, uniqueness, and convexity. Proceedings of the American Mathematical Society, 139:655–673, 2011.

[3] S.I. Amari. Natural gradient works efficiently in learning. Neural Computation, MIT Press, 1998.


[4] M. Arnaudon, C. Dombry, A. Phan, and Le Yang. Stochastic algorithms for computing means of probability measures. Stochastic Processes and their Applications, 122:1437–1455, 2012.

[5] F. Barbaresco. Innovative tools for radar signal processing based on Cartan's geometry of symmetric positive-definite matrices and information geometry. In IEEE Radar Conference, 2008.

[6] A. Benveniste, M. Goursat, and G. Ruget. Analysis of stochastic approximation schemes with discontinuous and dependent forcing terms with applications to data communication algorithms. IEEE Trans. on Automatic Control, 25(6):1042–1058, 1980.

[7] S. Bonnabel. Convergence des méthodes de gradient stochastique sur les variétés riemanniennes. In XIIIème Colloque GRETSI, Bordeaux, 2011.

[8] S. Bonnabel, G. Meyer, and R. Sepulchre. Adaptive filtering for estimation of a low-rank positive semidefinite matrix. In International Symposium on Mathematical Theory of Networks and Systems (MTNS), 2010.

[9] L. Bottou. Online Algorithms and Stochastic Approximations. Online Learning and Neural Networks, Edited by David Saad, Cambridge University Press, 1998.

[10] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms. IEEE Trans. on Information Theory, 52(6):2508–2530, 2006.

[11] E.J. Candes and Y. Plan. Tight oracle bounds for low-rank matrix recovery from a minimal number of random measurements. IEEE Trans. on Information Theory, 57(4):2342–2359, 2009.

[12] D. Cordero-Erausquin, R.J. McCann, and M. Schmuckenschlager. A Riemannian interpolation inequality à la Borell, Brascamp and Lieb. Invent. Math., 146:219–257, 2001.

[13] A. Edelman, T.A. Arias, and S.T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.

[14] J. Faraut and A. Koranyi. Analysis on Symmetric Cones. Oxford Univ. Press, London, U.K., 1994.

[15] D.L. Fisk. Quasi-martingales. Trans. of the American Mathematical Society, 120(3), 1965.

[16] N.J. Higham. Matrix nearness problems and applications. Oxford University Press, 1989.

[17] M. Journée, F. Bach, P.-A. Absil, and R. Sepulchre. Low-rank optimization on the cone of positive semidefinite matrices. SIAM Journal on Optimization, 20(5):2327–2351, 2010.

[18] B. Kulis, M. Sustik, and I.S. Dhillon. Learning low-rank kernel matrices. Proceedings of the 23rd International Conference on Machine Learning (ICML), 2006.


[19] L. Ljung. Analysis of recursive stochastic algorithms. IEEE Trans. on Automatic Control, 22(4):551–575, 1977.

[20] L. Ljung, H. Hjalmarsson, and H. Ohlsson. Four encounters with system identification. European Journal of Control, pages 449–471, 2011.

[21] G. Meyer. Geometric optimization algorithms for linear regression on fixed-rank matrices. PhD thesis, University of Liege, 2011.

[22] G. Meyer, S. Bonnabel, and R. Sepulchre. Linear regression under fixed-rank constraints: a Riemannian approach. In Proc. of the 28th International Conference on Machine Learning (ICML), 2011.

[23] G. Meyer, S. Bonnabel, and R. Sepulchre. Regression on fixed-rank positive semidefinite matrices: a Riemannian approach. Journal of Machine Learning Research (JMLR), 12:593–625, 2011.

[24] L. Moreau. Stability of multiagent systems with time-dependent communication links. IEEE Trans. on Automatic Control, 50(2):169–182, 2005.

[25] E. Oja. Subspace methods of pattern recognition. Research Studies Press, 1983.

[26] E. Oja. Principal components, minor components, and linear neural networks. Neural Networks, 5:927–935, 1992.

[27] B. O'Neill. Semi-Riemannian geometry. Pure and applied mathematics. Academic Press Inc., New York, 1983.

[28] X. Pennec, P. Fillard, and N. Ayache. A Riemannian framework for tensor computing. International Journal of Computer Vision, 66:41–66, 2006.

[29] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22, 1951.

[30] A. Sarlette, S.E. Tuna, V.D. Blondel, and R. Sepulchre. Global synchronization on the circle. In Proceedings of the 17th IFAC World Congress, 2008.

[31] A. Sarlette, S. Bonnabel, and R. Sepulchre. Coordinated motion design on Lie groups. IEEE Trans. on Automatic Control, 55(5):1047–1058, 2010.

[32] R. Sepulchre, D.A. Paley, and N.E. Leonard. Stabilization of planar collective motion with limited communication. IEEE Trans. on Automatic Control, 53(3):706–719, 2008.

[33] U. Shalit, D. Weinshall, and G. Chechik. Online learning in the embedded manifold of low-rank matrices. The Journal of Machine Learning Research (JMLR), 13:429–458, 2012.

[34] S.T. Smith. Covariance, subspace, and intrinsic Cramer-Rao bounds. IEEE Trans. on Signal Processing, 53(5):1610–1629, 2005.


[35] R. Tron, B. Afsari, and R. Vidal. Average consensus on Riemannian manifolds with bounded curvature. In Proceedings of the 50th IEEE Conference on Decision and Control, pages 7855–7862, 2011.

[36] J.N. Tsitsiklis, D.P. Bertsekas, and M. Athans. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Trans. Autom. Control, 31(9):803–812, 1986.

[37] K. Tsuda, G. Rätsch, and M. Warmuth. Matrix exponentiated gradient updates for online learning and Bregman projection. Journal of Machine Learning Research, 6:995–1018, 2005.
