Riemannian adaptive stochastic gradient algorithms on matrix manifolds
Hiroyuki Kasai (The University of Electro-Communications, Japan), Pratik Jawanpuria (Microsoft, India), and Bamdev Mishra (Microsoft, India)

Problem of interest

• Consider the problem [1] of min_{x∈M} f(x), where M is a Riemannian manifold.
• Elements of M are represented as matrices of size n × r.
• Promising applications include, e.g., matrix/tensor completion and subspace tracking.

Contributions

• Propose modeling of adaptive weight matrices for the row and column subspaces, exploiting the geometry of the manifold.
• Develop efficient Riemannian adaptive stochastic gradient algorithms (RASA).
• Achieve a convergence rate of order O(log(T)/√T) for non-convex stochastic optimization under mild conditions.
• Show the efficiency of RASA through numerical experiments on several applications.

Preliminaries

• Riemannian stochastic gradient update:
  (RSGD) x_{t+1} = R_{x_t}(−α_t gradf_t(x_t)),
  where R_{x_t} is the retraction and gradf_t(x_t) is the Riemannian stochastic gradient.
• R_x(ζ) maps ζ ∈ T_xM (the tangent space) onto M.
• When M = R^d with the standard Euclidean inner product, the RSGD update reduces to
  (SGD) x_{t+1} = x_t − α_t ∇f_t(x_t).
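To make the RSGD update concrete, the following is a minimal MATLAB sketch on the unit sphere S^{n−1} for a rank-one PCA-type cost. The toy data, step size, tangent-space projection, and normalization retraction are illustrative assumptions, not settings taken from the poster.

```matlab
% RSGD on the unit sphere for max_x x'*C*x with C = Z'*Z/N (written as a
% minimization of the negated cost). One random sample is used per step.
n = 100;  N = 1000;
Z = randn(N, n);                           % toy data matrix
x = randn(n, 1);  x = x / norm(x);         % x_1 on the manifold
alpha = 1e-2;                              % fixed step size (illustrative)
for t = 1:2000
    z = Z(randi(N), :)';                   % draw one sample z_i
    egrad = -2 * (z' * x) * z;             % Euclidean stochastic gradient
    rgrad = egrad - (x' * egrad) * x;      % project onto T_x S^{n-1}
    x = x - alpha * rgrad;                 % move along the tangent direction
    x = x / norm(x);                       % retraction: renormalize onto S^{n-1}
end
```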

• Euclidean adaptive stochastic gradient updates:
• Rescale the learning rate based on past gradients as
  x_{t+1} = x_t − α_t V_t^{−1/2} ∇f_t(x_t).
• V_t = Diag(v_t) is a diagonal matrix, such as
  (AdaGrad) v_t = Σ_{k=1}^{t} ∇f_k(x_k) ◦ ∇f_k(x_k),
  (RMSProp) v_t = β v_{t−1} + (1 − β) ∇f_t(x_t) ◦ ∇f_t(x_t).
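For concreteness, here is a minimal MATLAB sketch of the rescaled Euclidean update with an RMSProp-style accumulator on a toy least-squares problem. The damping constant eps0 (added for numerical stability) and all problem and step-size settings are illustrative assumptions.

```matlab
% Adaptive (RMSProp-style) gradient descent on min_x ||A*x - b||^2.
A = randn(50, 10);  b = randn(50, 1);      % toy least-squares problem
x = zeros(10, 1);
v = zeros(10, 1);                          % running second-moment estimate
alpha = 1e-2;  beta = 0.99;  eps0 = 1e-8;
for t = 1:500
    g = 2 * A' * (A * x - b);              % gradient of the cost
    v = beta * v + (1 - beta) * (g .* g);  % v_t = beta*v_{t-1} + (1-beta) g∘g
    x = x - alpha * g ./ sqrt(v + eps0);   % x_{t+1} = x_t - alpha_t V_t^{-1/2} g
end
```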

MATLAB source code
The code, which is compliant with Manopt (https://www.manopt.org/), is available at https://github.com/hiroyuki-kasai/RSOpt/.

RASA: Riemannian Adaptive Stochastic gradient Algorithms

Exploit the matrix structure of the Riemannian gradient G_t (= gradf_t(x_t) ∈ R^{n×r}) by maintaining separate adaptive weight matrices for the row subspace (L_t) and the column subspace (R_t).
c.f. [2] views G_t as a vector in R^{nr}.

• Exponentially weighted matrices:
  L_t = β L_{t−1} + (1 − β) G_t G_t^⊤ / r, (∈ R^{n×n})
  R_t = β R_{t−1} + (1 − β) G_t^⊤ G_t / n, (∈ R^{r×r})
  where β ∈ (0, 1) is a hyper-parameter.
• Adaptive Riemannian gradient G̃_t:
  G̃_t = L_t^{−1/4} G_t R_t^{−1/4}.

• Full-matrix update:
  x_{t+1} = R_{x_t}(−α_t P_{x_t}(G̃_t)).
• P_x, a linear operator, projects onto the tangent space T_xM.
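The full-matrix step can be sketched in MATLAB as below on the Stiefel manifold St(n, r), assuming a QR-based retraction, the tangent-space projection for the embedded metric, and a small damping term delta for invertibility; none of these choices are prescribed by the poster, and the inverse fourth roots are taken via an eigenvalue decomposition of the symmetric weight matrices.

```matlab
% One full-matrix adaptive step: update L_t, R_t, precondition the
% Riemannian gradient G, project back to the tangent space, and retract.
% Save as rasa_full_step.m; X is n-by-r with orthonormal columns.
function [X, L, R] = rasa_full_step(X, G, L, R, alpha, beta)
    [n, r] = size(X);  delta = 1e-10;                % damping (assumption)
    L = beta * L + (1 - beta) * (G * G') / r;        % row-subspace weights
    R = beta * R + (1 - beta) * (G' * G) / n;        % column-subspace weights
    Gt = inv_quarter(L + delta*eye(n)) * G * inv_quarter(R + delta*eye(r));
    S  = X' * Gt;
    P  = Gt - X * (S + S') / 2;                      % project onto T_X St(n,r)
    [X, ~] = qr(X - alpha * P, 0);                   % QR-based retraction
end

function W = inv_quarter(M)                          % symmetric M^{-1/4}
    [U, D] = eig((M + M') / 2);
    W = U * diag(diag(D) .^ (-1/4)) * U';
end
```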

• Diagonal modeling of {L_t, R_t} as vectors {l_t, r_t}:
  l_t = β l_{t−1} + (1 − β) diag(G_t G_t^⊤), (∈ R^n)
  r_t = β r_{t−1} + (1 − β) diag(G_t^⊤ G_t). (∈ R^r)
• diag(·) returns the diagonal vector of a square matrix.
• Maximum operator for convergence:
  l̂_t = max(l̂_{t−1}, l_t), r̂_t = max(r̂_{t−1}, r_t).

Alg.1: RASA

Require: Step size {α_t}_{t=1}^{T}, hyper-parameter β.
1: Initialize x_1 ∈ M, l_0 = l̂_0 = 0_n, r_0 = r̂_0 = 0_r.
2: for t = 1, 2, ..., T do
3:   Compute the Riemannian stochastic gradient G_t = gradf_t(x_t).
4:   Update l_t = β l_{t−1} + (1 − β) diag(G_t G_t^⊤)/r.
5:   Calculate l̂_t = max(l̂_{t−1}, l_t).
6:   Update r_t = β r_{t−1} + (1 − β) diag(G_t^⊤ G_t)/n.
7:   Calculate r̂_t = max(r̂_{t−1}, r_t).
8:   x_{t+1} = R_{x_t}(−α_t P_{x_t}(Diag(l̂_t^{−1/4}) G_t Diag(r̂_t^{−1/4}))).
9: end for

• RASA variants:
  • RASA-L adapts only the row subspace.
  • RASA-R adapts only the column subspace.
  • RASA-LR adapts both the row and column subspaces (see the MATLAB sketch below).
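Below is a minimal end-to-end MATLAB sketch of Alg.1 in its RASA-LR form on the Stiefel manifold St(n, r) for a streaming PCA-style cost. The toy data, QR retraction, damping constant eps0, and the step-size schedule α_t = α_0/√t are illustrative assumptions; the authors' Manopt-compliant implementation is available at https://github.com/hiroyuki-kasai/RSOpt/.

```matlab
% Diagonal RASA-LR (Alg.1) on St(n, r) with stochastic gradients of a
% PCA-type cost f_i(X) = -||z_i'*X||^2, one sample per iteration.
n = 100;  r = 5;  N = 5000;
Z = randn(N, n);                                  % toy samples z_i
[X, ~] = qr(randn(n, r), 0);                      % x_1 on St(n, r)
l  = zeros(n, 1);  lhat = zeros(n, 1);            % l_0 = lhat_0 = 0_n
rr = zeros(r, 1);  rhat = zeros(r, 1);            % r_0 = rhat_0 = 0_r
alpha0 = 1e-2;  beta = 0.99;  eps0 = 1e-10;
for t = 1:10000
    z = Z(randi(N), :)';                          % draw one sample
    egrad = -2 * z * (z' * X);                    % Euclidean stochastic gradient
    S = X' * egrad;
    G = egrad - X * (S + S') / 2;                 % Riemannian gradient G_t (step 3)
    l  = beta * l  + (1 - beta) * sum(G.^2, 2) / r;   % diag(G*G')/r   (step 4)
    rr = beta * rr + (1 - beta) * sum(G.^2, 1)' / n;  % diag(G'*G)/n   (step 6)
    lhat = max(lhat, l);  rhat = max(rhat, rr);       % max operators  (steps 5, 7)
    Gt = diag((lhat + eps0).^(-1/4)) * G * diag((rhat + eps0).^(-1/4));
    S = X' * Gt;
    P = Gt - X * (S + S') / 2;                    % project weighted gradient
    [X, ~] = qr(X - (alpha0 / sqrt(t)) * P, 0);   % retraction (step 8)
end
```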

Convergence rate analysis

Extend the existing convergence analysis in the Euclidean space, e.g., [3], to the Riemannian setting. Additionally, we need to take care of (i) the upper bound of v̂_t (Lem.4.3) for the update, and (ii) the projection P_x of the weighted gradient onto T_xM.

• For the analysis, we use the additional notations
  • x_{t+1} = R_{x_t}(−α_t P_{x_t}(V̂_t^{−1/2} g_t(x_t))) for step 8 in Alg.1,
  • V̂_t = Diag(v̂_t), where v̂_t = r̂_t^{1/2} ⊗ l̂_t^{1/2}, and
  • g_t(x) as the vectorized representation of gradf_t(x).

• Definition, assumptions, and lemma:
  Def.4.1 (Upper-Hessian bounded). There exists a constant L > 0 such that d²f(R_x(tη))/dt² ≤ L, for x ∈ U ⊂ M and η ∈ T_xM with ‖η‖_x = 1, and all t such that R_x(τη) ∈ U for τ ∈ [0, t].
  Asm.1.1. f is continuously differentiable and is lower bounded, i.e., f(x*) > −∞.
  Asm.1.2. f has an H-bounded Riemannian stochastic gradient, i.e., ‖gradf_i(x)‖_F ≤ H or ‖g_i(x)‖_2 ≤ H.
  Asm.1.3. f is upper-Hessian bounded (Def.4.1).
  Lem.4.2. Under Asm.1 and L > 0 in Def.4.1, we have f(z) ≤ f(x) + ⟨gradf(x), ξ⟩_2 + (L/2)‖ξ‖²_2, for x ∈ M, where ξ ∈ T_xM and R_x(ξ) = z.

• Obtained results:
  Thm.4.4. Let {x_t} and {v̂_t} be the sequences generated by Alg.1. Then, under Asm.1, we have
  E[ Σ_{t=2}^{T} α_{t−1} ⟨g(x_t), g(x_t)/√v̂_{t−1}⟩_2 ] ≤ C + E[ (L/2) Σ_{t=1}^{T} ‖α_t g_t(x_t)/√v̂_t‖²_2 + H² Σ_{t=2}^{T} ‖α_t/√v̂_t − α_{t−1}/√v̂_{t−1}‖_1 ],
  where C is a constant term independent of T.

  Cor.4.5. Let α_t = 1/√t and let min_{j∈[d]} √(v̂_1)_j be lower-bounded by a constant c > 0, where d is the dimension of M. Then, under Asm.1, the output x_t of Alg.1 satisfies
  min_{t∈[2,...,T]} E‖gradf(x_t)‖²_F ≤ (1/√(T−1)) (Q_1 + Q_2 log(T)),
  where Q_2 = LH³/(2c²) and Q_1 = Q_2 + 2dH³/c + H·E[f(x_1) − f(x*)].

Numerical evaluations

• PCA problem
[Figures (a)–(c): optimality gap vs. number of iterations, comparing RSGD, cRMSProp, cRMSProp-M, Radagrad, Radam, Ramsgrad, RASA-R, RASA-L, and RASA-LR; the tuned step size of each algorithm is reported in the legend.]
(a) Case P1: Synthetic dataset. (b) Case P2: MNIST dataset. (c) Case P3: COIL100 dataset.

• Matrix completion problem
[Figures (a)–(d): root mean squared error on the training/test sets vs. number of iterations, comparing RSGD, Radagrad, Radam, Ramsgrad, and RASA-LR; the tuned step size of each algorithm is reported in the legend of the training plots.]
(a) MovieLens-1M (train). (b) MovieLens-1M (test). (c) MovieLens-10M (train). (d) MovieLens-10M (test).

• ICA problem
[Figures (a)–(b): relative optimality gap vs. number of iterations, comparing RSGD, cRMSProp, cRMSProp-M, Radagrad, Radam, Ramsgrad, RASA-R, RASA-L, and RASA-LR; the tuned step size of each algorithm is reported in the legend.]
(a) Case I1: YaleB dataset. (b) Case I2: COIL100 dataset.

References

[1] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.

[2] S. K. Roy, Z. Mhammedi, and M. Harandi, Geometry aware constrained optimization techniques for deep learning, CVPR, 2018.

[3] X. Chen, S. Liu, R. Sun, and M. Hong, On the convergence of a class of Adam-type algorithms for non-convex optimization, ICLR, 2019.
