Riemannian adaptive stochastic gradient algorithms on matrix manifolds

Hiroyuki Kasai (The University of Electro-Communications, Japan), Pratik Jawanpuria (Microsoft, India), and Bamdev Mishra (Microsoft, India)

Problem of interest
• Consider the problem [1] of min_{x∈M} f(x), where M is a Riemannian manifold.
• Elements of M are represented as matrices of size n × r.
• Promising applications include, e.g., matrix/tensor completion and subspace tracking.

Contributions
• Propose a modeling of adaptive weight matrices for the row and column subspaces that exploits the geometry of the manifold.
• Develop efficient Riemannian adaptive stochastic gradient algorithms (RASA).
• Achieve a convergence rate of order O(log(T)/√T) for non-convex stochastic optimization under mild conditions.
• Show the efficiency of RASA through numerical experiments on several applications.

Preliminaries
• Riemannian stochastic gradient update:
  (RSGD) x_{t+1} = R_{x_t}(−α_t grad f_t(x_t)),
  where grad f_t(x_t) is the Riemannian stochastic gradient and the retraction R_x(ζ) maps ζ ∈ T_x M (tangent space) onto M.
• When M = R^d with the standard Euclidean inner product, the RSGD update reduces to
  (SGD) x_{t+1} = x_t − α_t ∇f_t(x_t).
• Euclidean adaptive stochastic gradient updates rescale the learning rate based on past gradients as
  x_{t+1} = x_t − α_t V_t^{−1/2} ∇f_t(x_t),
  where V_t = Diag(v_t) is a diagonal matrix such as
  (AdaGrad) v_t = Σ_{k=1}^{t} ∇f_k(x_k) ∘ ∇f_k(x_k),
  (RMSProp) v_t = β v_{t−1} + (1 − β) ∇f_t(x_t) ∘ ∇f_t(x_t).

MATLAB source code
The code, which is compliant with Manopt (https://www.manopt.org/), is available at https://github.com/hiroyuki-kasai/RSOpt/.

RASA: Riemannian Adaptive Stochastic gradient Algorithms
Exploit the matrix structure of the Riemannian gradient G_t (= grad f_t(x_t) ∈ R^{n×r}) by separate adaptive weight matrices for the row subspace (L_t) and the column subspace (R_t); cf. [2], which views G_t as a vector in R^{nr}.
• Exponentially weighted matrices (β ∈ (0, 1): hyper-parameter):
  L_t = β L_{t−1} + (1 − β) G_t G_t^⊤ / r ∈ R^{n×n},
  R_t = β R_{t−1} + (1 − β) G_t^⊤ G_t / n ∈ R^{r×r}.
• Adaptive Riemannian gradient: G̃_t = L_t^{−1/4} G_t R_t^{−1/4}.
• Full-matrix update: x_{t+1} = R_{x_t}(−α_t P_{x_t}(G̃_t)), where the linear operator P_x projects onto the tangent space T_x M.
• Diagonal modeling of {L_t, R_t} as vectors {l_t, r_t}:
  l_t = β l_{t−1} + (1 − β) diag(G_t G_t^⊤) ∈ R^n,
  r_t = β r_{t−1} + (1 − β) diag(G_t^⊤ G_t) ∈ R^r,
  where diag(·) returns the diagonal vector of a square matrix.
• Maximum operator for convergence: l̂_t = max(l̂_{t−1}, l_t), r̂_t = max(r̂_{t−1}, r_t).

Alg.1: RASA
Require: Step sizes {α_t}_{t=1}^{T}, hyper-parameter β.
1: Initialize x_1 ∈ M, l_0 = l̂_0 = 0_n, r_0 = r̂_0 = 0_r.
2: for t = 1, 2, ..., T do
3:   Compute the Riemannian stochastic gradient G_t = grad f_t(x_t).
4:   Update l_t = β l_{t−1} + (1 − β) diag(G_t G_t^⊤)/r.
5:   Calculate l̂_t = max(l̂_{t−1}, l_t).
6:   Update r_t = β r_{t−1} + (1 − β) diag(G_t^⊤ G_t)/n.
7:   Calculate r̂_t = max(r̂_{t−1}, r_t).
8:   Update x_{t+1} = R_{x_t}(−α_t P_{x_t}(Diag(l̂_t^{−1/4}) G_t Diag(r̂_t^{−1/4}))).
9: end for

• RASA variants (a MATLAB sketch of the RASA-LR update follows this list):
  • RASA-L adapts only the row subspace.
  • RASA-R adapts only the column subspace.
  • RASA-LR adapts both the row and column subspaces.
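To make the update concrete, below is a minimal MATLAB sketch of the RASA-LR update (Alg.1) applied to a stochastic PCA problem on the Stiefel manifold St(n, r), written to be self-contained (no Manopt dependency). The synthetic data Z, the QR-based retraction, the step-size schedule α_t = α_0/√t, and the small constant eps_reg are illustrative assumptions, not prescriptions of Alg.1; the authors' Manopt-compliant implementation is the RSOpt code linked above.

```matlab
% Minimal sketch of RASA-LR (Alg.1) for stochastic PCA on St(n, r):
% min over X'X = I of -E[ ||X' z||^2 ]. Illustrative only, not the RSOpt code.
n = 100; r = 5; N = 1000;
Z = randn(n, N);                           % synthetic samples (assumption)
alpha0 = 0.05; beta = 0.99; T = 5000;      % alpha_t = alpha0/sqrt(t), hyper-parameter beta
eps_reg = 1e-8;                            % avoids division by zero early on (assumption)

[X, ~] = qr(randn(n, r), 0);               % x_1: a random point on St(n, r)
l = zeros(n, 1); l_hat = zeros(n, 1);      % l_0 = l_hat_0 = 0_n
rv = zeros(r, 1); r_hat = zeros(r, 1);     % r_0 = r_hat_0 = 0_r

for t = 1:T
    z = Z(:, randi(N));                    % draw one sample
    egrad = -2 * (z * (z' * X));           % Euclidean gradient of f_t(X) = -||X'z||^2
    G = egrad - X * ((X' * egrad + egrad' * X) / 2);  % Riemannian gradient: project onto T_X St(n, r)
    % Steps 4-7: adaptive weight vectors for row and column subspaces
    l  = beta * l  + (1 - beta) * sum(G.^2, 2) / r;   % diag(G*G')/r
    rv = beta * rv + (1 - beta) * sum(G.^2, 1)' / n;  % diag(G'*G)/n
    l_hat = max(l_hat, l);
    r_hat = max(r_hat, rv);
    % Step 8: weight the gradient, project onto the tangent space, retract
    Gt = diag((l_hat + eps_reg).^(-0.25)) * G * diag((r_hat + eps_reg).^(-0.25));
    xi = Gt - X * ((X' * Gt + Gt' * X) / 2);          % P_x: tangent-space projection
    [Q, R] = qr(X - (alpha0 / sqrt(t)) * xi, 0);      % QR-based retraction R_x
    X = Q * diag(sign(diag(R)) + (diag(R) == 0));     % fix signs of the Q factor
end
```

Note that only the n- and r-dimensional weight vectors l̂_t and r̂_t are stored and updated, which is what makes the diagonal modeling cheap compared with the full-matrix update.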
Convergence rate analysis
Extend the existing convergence analysis in the Euclidean space, e.g., [3], to the Riemannian setting. Additionally, we need to take care of (i) the upper bound of v̂_t (Lem.4.3) for the update, and (ii) the projection P_x of the weighted gradient onto T_x M.
• For the analysis, we use the additional notation
  • x_{t+1} = R_{x_t}(−α_t P_{x_t}(V̂_t^{−1/2} g_t(x_t))) for step 8 in Alg.1,
  • V̂_t = Diag(v̂_t), where v̂_t = r̂_t^{1/2} ⊗ l̂_t^{1/2}, and
  • g_t(x) as the vectorized representation of grad f_t(x).
• Definition, assumptions, and lemma:
  Def.4.1 (Upper-Hessian bounded). There exists a constant L > 0 such that d²f(R_x(tη))/dt² ≤ L for all x ∈ U ⊂ M and η ∈ T_x M with ∥η∥_x = 1, and all t such that R_x(τη) ∈ U for τ ∈ [0, t].
  Asm.1.1. f is continuously differentiable and lower bounded, i.e., f(x*) > −∞.
  Asm.1.2. f has an H-bounded Riemannian stochastic gradient, i.e., ∥grad f_i(x)∥_F ≤ H (equivalently, ∥g_i(x)∥_2 ≤ H).
  Asm.1.3. f is upper-Hessian bounded (Def.4.1).
  Lem.4.2. Under Asm.1 and L > 0 in Def.4.1, we have f(z) ≤ f(x) + ⟨grad f(x), ξ⟩_2 + (L/2)∥ξ∥_2² for x ∈ M, where ξ ∈ T_x M and R_x(ξ) = z.
• Obtained results:
  Thm.4.4. Let {x_t} and {v̂_t} be the sequences generated by Alg.1. Then, under Asm.1, we have
    E[ Σ_{t=2}^{T} α_{t−1} ⟨ g(x_t), g(x_t)/√(v̂_{t−1}) ⟩_2 ]
      ≤ C + E[ (L/2) Σ_{t=1}^{T} ∥ α_t g_t(x_t)/√(v̂_t) ∥_2² + H² Σ_{t=2}^{T} ∥ α_t/√(v̂_t) − α_{t−1}/√(v̂_{t−1}) ∥_1 ],
    where C is a constant term independent of T.
  Cor.4.5. Let α_t = 1/√t and suppose min_{j∈[d]} √((v̂_1)_j) is lower bounded by a constant c > 0, where d is the dimension of M. Then, under Asm.1, the output x_t of Alg.1 satisfies
    min_{t∈[2,...,T]} E[∥grad f(x_t)∥_F²] ≤ (1/√(T−1)) (Q_1 + Q_2 log(T)),
    where Q_2 = LH³/(2c²) and Q_1 = Q_2 + 2dH³/c + H E[f(x_1) − f(x*)].
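As a worked complement, the LaTeX snippet below sketches, up to constants, how the O(log(T)/√T) rate of Cor.4.5 follows from Thm.4.4 with α_t = 1/√t. This is our reading of the statements on the poster, not the authors' proof: it assumes the entrywise bounds c² ≤ (v̂_t)_j (from the max operator and the assumption of Cor.4.5) and (v̂_t)_j ≤ H² (the role played by Lem.4.3), and uses ∥g(x_t)∥_2 = ∥grad f(x_t)∥_F.

```latex
% Sketch (our reading, not the authors' proof) of the O(log(T)/sqrt(T)) rate,
% assuming c^2 <= (v_hat_t)_j <= H^2 entrywise and alpha_t = 1/sqrt(t).
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
\begin{align*}
\intertext{First sum on the right of Thm.4.4, with $\|g_t(x_t)\|_2 \le H$ and $\alpha_t^2 = 1/t$:}
\frac{L}{2}\sum_{t=1}^{T}\Big\|\tfrac{\alpha_t\, g_t(x_t)}{\sqrt{\hat v_t}}\Big\|_2^2
 &\le \frac{L H^2}{2c^2}\sum_{t=1}^{T}\frac{1}{t}
  \le \frac{L H^2}{2c^2}\,\bigl(1+\log T\bigr).\\
\intertext{Second sum: $\alpha_t/\sqrt{\hat v_t}$ is entrywise nonincreasing (max operator), so it telescopes:}
H^2\sum_{t=2}^{T}\Big\|\tfrac{\alpha_t}{\sqrt{\hat v_t}}-\tfrac{\alpha_{t-1}}{\sqrt{\hat v_{t-1}}}\Big\|_1
 &\le H^2\Big\|\tfrac{\alpha_1}{\sqrt{\hat v_1}}\Big\|_1 \le \frac{d H^2}{c}.\\
\intertext{Left-hand side, with $\sqrt{\hat v_{t-1}} \le H$ entrywise and $\sum_{s=1}^{T-1} 1/\sqrt{s} \ge \sqrt{T-1}$:}
\mathbb{E}\sum_{t=2}^{T}\alpha_{t-1}\Big\langle g(x_t),\tfrac{g(x_t)}{\sqrt{\hat v_{t-1}}}\Big\rangle_2
 &\ge \frac{1}{H}\sum_{t=2}^{T}\frac{\mathbb{E}\|g(x_t)\|_2^2}{\sqrt{t-1}}
  \ge \frac{\sqrt{T-1}}{H}\min_{t\in[2,\dots,T]}\mathbb{E}\|g(x_t)\|_2^2.
\end{align*}
Combining the three bounds gives
$\min_{t}\mathbb{E}\|\mathrm{grad}\,f(x_t)\|_F^2 \le (Q_1+Q_2\log T)/\sqrt{T-1}$
with $Q_2 = LH^3/(2c^2)$, matching Cor.4.5 up to the constants collected in $Q_1$.
\end{document}
```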
Numerical evaluations
• PCA problem: optimality gap vs. number of iterations; compared methods are RSGD, cRMSProp, cRMSProp-M, Radagrad, Radam, Ramsgrad, RASA-R, RASA-L, and RASA-LR (best-tuned step sizes shown in the figure legends).
  [Figure: (a) Case P1: Synthetic dataset. (b) Case P2: MNIST dataset. (c) Case P3: COIL100 dataset.]
• Matrix completion problem: root mean squared error on the training and test sets vs. number of iterations; compared methods are RSGD, Radagrad, Radam, Ramsgrad, and RASA-LR.
  [Figure: (a) MovieLens-1M (train). (b) MovieLens-1M (test). (c) MovieLens-10M (train). (d) MovieLens-10M (test).]
• ICA problem: relative optimality gap vs. number of iterations; the compared methods are the same as in the PCA problem.
  [Figure: (a) Case I1: YaleB dataset. (b) Case I2: COIL100 dataset.]

References
[1] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.
[2] S. K. Roy, Z. Mhammedi, and M. Harandi, Geometry aware constrained optimization techniques for deep learning, CVPR, 2018.
[3] X. Chen, S. Liu, R. Sun, and M. Hong, On the convergence of a class of Adam-type algorithms for non-convex optimization, ICLR, 2019.