
Riemannian adaptive stochastic gradient algorithms on matrix manifolds
Hiroyuki Kasai (The University of Electro-Communications, Japan), Pratik Jawanpuria (Microsoft, India), and Bamdev Mishra (Microsoft, India)

Problem of interest

• Consider the problem [1] of min_{x ∈ M} f(x), where M is a Riemannian manifold.
• Elements of M are represented as matrices of size n × r (a concrete example follows this list).
• Promising applications include, e.g., matrix/tensor completion and subspace tracking.
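As a concrete instance of a manifold whose points are represented by n × r matrices (an illustrative example; the poster does not fix a particular manifold here), consider the Stiefel manifold of orthonormal r-frames in R^n,

  St(n, r) = { X ∈ R^{n×r} : X^⊤ X = I_r },  r ≤ n,

which arises in applications such as PCA and subspace tracking.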

Contributions

• Propose a modeling of adaptive weight matrices for the row and column subspaces that exploits the geometry of the manifold.
• Develop efficient Riemannian adaptive stochastic gradient algorithms (RASA).
• Achieve a convergence rate of order O(log(T)/√T) for non-convex stochastic optimization under mild conditions.
• Show the efficiency of RASA through numerical experiments on several applications.

Preliminaries

• Riemannian stochastic gradient update:

  (RSGD)  x_{t+1} = R_{x_t}(−α_t grad f_t(x_t)),

  where R_{x_t} is the retraction and grad f_t(x_t) is the Riemannian stochastic gradient.
• R_x(ζ) maps ζ ∈ T_xM (the tangent space) onto M.
• When M = R^d with the standard Euclidean inner product, the RSGD update reduces to

  (SGD)  x_{t+1} = x_t − α_t ∇f_t(x_t).

• Euclidean adaptive stochastic gradient updates rescale the learning rate based on past gradients (see the sketch after this list) as

  x_{t+1} = x_t − α_t V_t^{−1/2} ∇f_t(x_t),

  where V_t = Diag(v_t) is a diagonal matrix such that

  (AdaGrad)  v_t = Σ_{k=1}^{t} ∇f_k(x_k) ∘ ∇f_k(x_k),
  (RMSProp)  v_t = β v_{t−1} + (1 − β) ∇f_t(x_t) ∘ ∇f_t(x_t).
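For concreteness, a minimal MATLAB sketch of these Euclidean diagonal updates (illustrative only; the toy quadratic objective, the variable names, and the damping constant epsilon are our assumptions, not part of the poster):

% Minimal sketch of Euclidean SGD-style AdaGrad / RMSProp on a toy quadratic,
% f_i(x) = 0.5 * (A(i,:)*x - b(i))^2 with one randomly sampled row per step.
rng(0);
d = 10; N = 100;
A = randn(N, d); b = randn(N, 1);
x = zeros(d, 1);
v = zeros(d, 1);                 % accumulated (squared-gradient) statistics v_t
alpha = 1e-2; beta = 0.99; epsilon = 1e-8;   % epsilon: damping (an assumption)
method = 'rmsprop';              % 'adagrad' or 'rmsprop'
for t = 1:1000
    i = randi(N);                            % sample one component f_i
    g = A(i, :)' * (A(i, :) * x - b(i));     % stochastic Euclidean gradient
    switch method
        case 'adagrad'
            v = v + g .* g;                  % v_t = sum of squared gradients
        case 'rmsprop'
            v = beta * v + (1 - beta) * (g .* g);
    end
    x = x - alpha * g ./ (sqrt(v) + epsilon);    % x_{t+1} = x_t - alpha * V_t^{-1/2} * g
end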

MATLAB source code
The code, which is compliant with Manopt (https://www.manopt.org/), is available at https://github.com/hiroyuki-kasai/RSOpt/.

RASA: Riemannian Adaptive Stochastic gradient Algorithms

Exploit the matrix structure of the Riemannian gradient G_t (= grad f_t(x_t) ∈ R^{n×r}) by maintaining separate adaptive weight matrices for the row subspace (L_t) and the column subspace (R_t).

Cf. [2], which views G_t as a vector in R^{nr}.

• Exponentially weighted matrices (β ∈ (0, 1): hyper-parameter):

  L_t = β L_{t−1} + (1 − β) G_t G_t^⊤ / r,  (∈ R^{n×n})
  R_t = β R_{t−1} + (1 − β) G_t^⊤ G_t / n.  (∈ R^{r×r})

• Adaptive Riemannian gradient:

  G̃_t = L_t^{−1/4} G_t R_t^{−1/4}.

• Full-matrix update:

  x_{t+1} = R_{x_t}(−α_t P_{x_t}(G̃_t)).

  • P_x, a linear operator, projects onto the tangent space T_xM.

• Diagonal modeling of {L_t, R_t} as vectors {l_t, r_t}:

  l_t = β l_{t−1} + (1 − β) diag(G_t G_t^⊤),  (∈ R^n)
  r_t = β r_{t−1} + (1 − β) diag(G_t^⊤ G_t).  (∈ R^r)

  • diag(·) returns the diagonal vector of a square matrix.

• Maximum operator for convergence:

  l̂_t = max(l̂_{t−1}, l_t),  r̂_t = max(r̂_{t−1}, r_t).

  (Both weightings are illustrated in the sketch after this list.)
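A minimal MATLAB sketch of one adaptive-weighting step, given a Riemannian gradient matrix G; the damping constant epsilon and the eigendecomposition-based matrix fourth roots are illustrative choices, not prescribed by the poster:

% One adaptive-weighting step for a given Riemannian gradient G (n x r).
rng(0);
n = 50; r = 5; beta = 0.99; epsilon = 1e-8;     % epsilon: damping (an assumption)
G = randn(n, r);                                % stand-in for G_t = grad f_t(x_t)
L_prev = zeros(n, n);  R_prev = zeros(r, r);    % L_{t-1}, R_{t-1}
l_prev = zeros(n, 1);  r_prev = zeros(r, 1);    % l_{t-1}, r_{t-1}
lhat   = zeros(n, 1);  rhat   = zeros(r, 1);    % \hat l_{t-1}, \hat r_{t-1}

% Full-matrix weighting: Gtilde = L_t^{-1/4} * G_t * R_t^{-1/4}
L = beta * L_prev + (1 - beta) * (G * G') / r;
R = beta * R_prev + (1 - beta) * (G' * G) / n;
[UL, DL] = eig((L + L') / 2);                   % symmetric eigendecompositions
[UR, DR] = eig((R + R') / 2);
Linv4 = UL * diag((diag(DL) + epsilon).^(-1/4)) * UL';
Rinv4 = UR * diag((diag(DR) + epsilon).^(-1/4)) * UR';
Gtilde_full = Linv4 * G * Rinv4;

% Diagonal modeling with the maximum operator (1/r, 1/n scaling as in Alg.1)
l    = beta * l_prev + (1 - beta) * sum(G.^2, 2) / r;   % = diag(G*G')/r
rvec = beta * r_prev + (1 - beta) * sum(G.^2, 1)' / n;  % = diag(G'*G)/n
lhat = max(lhat, l);  rhat = max(rhat, rvec);
Gtilde_diag = diag((lhat + epsilon).^(-1/4)) * G * diag((rhat + epsilon).^(-1/4));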

Alg.1: RASA

Require: Step sizes {α_t}_{t=1}^T, hyper-parameter β.
1: Initialize x_1 ∈ M, l_0 = l̂_0 = 0_n, r_0 = r̂_0 = 0_r.
2: for t = 1, 2, . . . , T do
3:   Compute the Riemannian stochastic gradient G_t = grad f_t(x_t).
4:   Update l_t = β l_{t−1} + (1 − β) diag(G_t G_t^⊤)/r.
5:   Calculate l̂_t = max(l̂_{t−1}, l_t).
6:   Update r_t = β r_{t−1} + (1 − β) diag(G_t^⊤ G_t)/n.
7:   Calculate r̂_t = max(r̂_{t−1}, r_t).
8:   x_{t+1} = R_{x_t}(−α_t P_{x_t}(Diag(l̂_t^{−1/4}) G_t Diag(r̂_t^{−1/4}))).
9: end for
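A minimal MATLAB sketch of Alg.1 with diagonal weights (the RASA-LR variant) on the Stiefel manifold, standing in for, but not reproducing, the Manopt-based implementation in the authors' repository; the PCA-style cost, the QR retraction, the projection formula, and the constant epsilon are illustrative choices:

% RASA-LR sketch on the Stiefel manifold St(n, r) = {X : X'*X = I_r},
% for a stochastic PCA-type cost f_i(X) = -0.5 * norm(z_i' * X)^2.
rng(0);
n = 100; r = 5; N = 1000; T = 5000;
Z = randn(N, n);                        % data samples (one per row)
[X, ~] = qr(randn(n, r), 0);            % x_1: a random point on St(n, r)
l = zeros(n, 1);   rv = zeros(r, 1);    % l_0, r_0
lhat = zeros(n, 1); rhat = zeros(r, 1); % \hat l_0, \hat r_0
beta = 0.99; epsilon = 1e-8;            % epsilon: damping (not in the poster)
sym  = @(A) (A + A') / 2;
proj = @(X, U) U - X * sym(X' * U);     % P_X: projection onto T_X St(n, r)

for t = 1:T
    alpha = 1 / sqrt(t);                % step size alpha_t = 1/sqrt(t)
    z  = Z(randi(N), :)';               % pick one sample
    eg = -z * (z' * X);                 % Euclidean gradient of f_i at X
    G  = proj(X, eg);                   % Riemannian stochastic gradient G_t
    l    = beta * l  + (1 - beta) * sum(G.^2, 2) / r;    % step 4
    lhat = max(lhat, l);                                 % step 5
    rv   = beta * rv + (1 - beta) * sum(G.^2, 1)' / n;   % step 6
    rhat = max(rhat, rv);                                % step 7
    Gs = diag((lhat + epsilon).^(-1/4)) * G * diag((rhat + epsilon).^(-1/4));
    [Q, Rq] = qr(X - alpha * proj(X, Gs), 0);            % step 8: QR retraction
    X = Q * diag(sign(sign(diag(Rq)) + 0.5));            % sign fix for uniqueness
end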

• RASA variants:
  • RASA-L adapts only the row subspace.
  • RASA-R adapts only the column subspace.
  • RASA-LR adapts both the row and column subspaces.

Convergence rate analysis

We extend existing convergence analyses in the Euclidean space, e.g., [3], to the Riemannian setting. Additionally, we need to take care of (i) the upper bound of v̂_t (Lem.4.3) for the update, and (ii) the projection P_x of the weighted gradient onto T_xM.

• For the analysis, we use the additional notations:
  • x_{t+1} = R_{x_t}(−α_t P_{x_t}(V̂_t^{−1/2} g_t(x_t))) for step 8 in Alg.1,
  • V̂_t = Diag(v̂_t), where v̂_t = r̂_t^{1/2} ⊗ l̂_t^{1/2}, and
  • g_t(x) as the vectorized representation of grad f_t(x); the equivalence of the vectorized and matrix forms is checked numerically below.
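The vectorized and matrix forms of the weighted gradient coincide through the identity vec(D_1 G D_2) = (D_2 ⊗ D_1) vec(G). A quick MATLAB check of this equivalence (purely illustrative; the random sizes and the vec helper are our choices):

% Check that Diag(lhat^{-1/4}) * G * Diag(rhat^{-1/4}) vectorizes to
% Vhat^{-1/2} * vec(G), where vhat = kron(rhat^{1/2}, lhat^{1/2}).
rng(0);
n = 6; r = 3;
G    = randn(n, r);
lhat = rand(n, 1) + 0.1;                    % strictly positive weights
rhat = rand(r, 1) + 0.1;
vec  = @(A) A(:);

lhs  = vec(diag(lhat.^(-1/4)) * G * diag(rhat.^(-1/4)));
vhat = kron(rhat.^(1/2), lhat.^(1/2));      % v_hat = r_hat^{1/2} (x) l_hat^{1/2}
rhs  = vec(G) ./ sqrt(vhat);                % Vhat^{-1/2} * vec(G), elementwise
disp(norm(lhs - rhs));                      % ~ 1e-16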

• Definition, assumptions, and lemma:

Def.4.1. (Upper-Hessian bounded) There exists a constant L > 0 such that d²f(R_x(tη))/dt² ≤ L for x ∈ U ⊂ M and η ∈ T_xM with ∥η∥_x = 1, and all t such that R_x(τη) ∈ U for τ ∈ [0, t].

Asm.1.1. f is continuously differentiable and lower bounded, i.e., f(x*) > −∞.

Asm.1.2. f has an H-bounded Riemannian stochastic gradient, i.e., ∥grad f_i(x)∥_F ≤ H or ∥g_i(x)∥_2 ≤ H.

Asm.1.3. f is upper-Hessian bounded (Def.4.1).

Lem.4.2. Under Asm.1 and L > 0 in Def.4.1, we have f(z) ≤ f(x) + ⟨grad f(x), ξ⟩_2 + (L/2)∥ξ∥_2², for x ∈ M, where ξ ∈ T_xM and R_x(ξ) = z. (Its use in the analysis is sketched below.)
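To see how Lem.4.2 enters the analysis, substitute the step-8 update direction ξ_t = −α_t P_{x_t}(V̂_t^{−1/2} g_t(x_t)) into the lemma. A sketch of the resulting one-step bound (our paraphrase of the standard argument, not a statement taken from the poster):

  f(x_{t+1}) ≤ f(x_t) − α_t ⟨grad f(x_t), P_{x_t}(V̂_t^{−1/2} g_t(x_t))⟩_2 + (L/2) ∥α_t P_{x_t}(V̂_t^{−1/2} g_t(x_t))∥_2²
            ≤ f(x_t) − α_t ⟨grad f(x_t), P_{x_t}(V̂_t^{−1/2} g_t(x_t))⟩_2 + (L/2) ∥α_t g_t(x_t)/√v̂_t∥_2²,

where the second inequality uses that the orthogonal projection P_{x_t} is non-expansive. Taking expectations and summing over t produces the terms that appear in Thm.4.4.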

• Obtained results:

Thm.4.4. Let {x_t} and {v̂_t} be the sequences generated by Alg.1. Then, under Asm.1, we have

  E[ Σ_{t=2}^T α_{t−1} ⟨g(x_t), g(x_t)/√v̂_{t−1}⟩_2 ] ≤ C + E[ (L/2) Σ_{t=1}^T ∥α_t g_t(x_t)/√v̂_t∥_2² + H² Σ_{t=2}^T ∥α_t/√v̂_t − α_{t−1}/√v̂_{t−1}∥_1 ],

where C is a constant term independent of T.

Cor.4.5. Let α_t = 1/√t and suppose min_{j∈[d]} √(v̂_1)_j is lower-bounded by a constant c > 0, where d is the dimension of M. Then, under Asm.1, the output x_t of Alg.1 satisfies

  min_{t∈[2,...,T]} E∥grad f(x_t)∥_F² ≤ (1/√(T − 1)) (Q_1 + Q_2 log(T)),

where Q_2 = LH³/(2c²) and Q_1 = Q_2 + 2dH³/c + H E[f(x_1) − f(x*)].
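Where the log(T) factor comes from (an explanatory note, not taken from the poster): with α_t = 1/√t, the squared-step term in Thm.4.4 is controlled by the harmonic-sum bound

  Σ_{t=1}^T α_t² = Σ_{t=1}^T 1/t ≤ 1 + log(T),

while the weights α_{t−1} on the left-hand side sum to order √T, which together yield the O(log(T)/√T) bound of Cor.4.5.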

Numerical evaluations

• PCA problem

[Figure: optimality gap versus number of iterations. (a) Case P1: Synthetic dataset. (b) Case P2: MNIST dataset. (c) Case P3: COIL100 dataset. Compared methods (step sizes shown in parentheses in the legends): RSGD, cRMSProp, cRMSProp-M, Radagrad, Radam, Ramsgrad, RASA-R, RASA-L, RASA-LR.]

• Matrix completion problem

[Figure: root mean squared error versus number of iterations. (a) Movie-Lens-1M (training set). (b) Movie-Lens-1M (test set). (c) Movie-Lens-10M (training set). (d) Movie-Lens-10M (test set). Compared methods (step sizes shown in parentheses in the legends): RSGD, Radagrad, Radam, Ramsgrad, RASA-LR.]

• ICA problem

[Figure: relative optimality gap versus number of iterations. (a) Case I1: YaleB dataset. (b) Case I2: COIL100 dataset. Compared methods (step sizes shown in parentheses in the legends): RSGD, cRMSProp, cRMSProp-M, Radagrad, Radam, Ramsgrad, RASA-R, RASA-L, RASA-LR.]

References

[1] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.

[2] S. K. Roy, Z. Mhammedi, and M. Harandi, Geometry aware constrained optimization techniques for deep learning, CVPR, 2018.

[3] X. Chen, S. Liu, R. Sun, and M. Hong, On the convergence of a class of Adam-type algorithms for non-convex optimization, ICLR, 2019.
