On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization

Dongruo Zhou*†, Jinghui Chen*‡, Yuan Cao*§, Yiqi Tang¶, Ziyan Yang‖, Quanquan Gu**

arXiv:1808.05671v3 [cs.LG] 19 Oct 2020

Abstract
Adaptive gradient methods are workhorses in deep learning. However, the convergence
guarantees of adaptive gradient methods for nonconvex optimization have not been thoroughly
studied. In this paper, we provide a fine-grained convergence analysis for a general class of
adaptive gradient methods including AMSGrad, RMSProp and AdaGrad. For smooth nonconvex
functions, we prove that adaptive gradient methods in expectation converge to a first-order
stationary point. Our convergence rate is better than existing results for adaptive gradient
methods in terms of dimension, and is strictly faster than stochastic gradient descent (SGD)
when the stochastic gradients are sparse. To the best of our knowledge, this is the first result
showing the advantage of adaptive gradient methods over SGD in the nonconvex setting. In addition,
we also prove high probability bounds on the convergence rates of AMSGrad, RMSProp as
well as AdaGrad, which have not been established before. Our analyses shed light on better
understanding the mechanism behind adaptive gradient methods in optimizing nonconvex
objectives.
1 Introduction
Stochastic gradient descent (SGD) (Robbins and Monro, 1951) and its variants have been widely
used in training deep neural networks. Among those variants, adaptive gradient methods (AdaGrad)
(Duchi et al., 2011; McMahan and Streeter, 2010), which scale each coordinate of the gradient by a
function of past gradients, can achieve better performance than vanilla SGD in practice when the
gradients are sparse. An intuitive explanation for the success of AdaGrad is that it automatically
adjusts the learning rate for each feature based on the partial gradient, which accelerates the
convergence. However, AdaGrad was later found to demonstrate degraded performance especially in
*Equal Contribution
†Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail: [email protected]
‡Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail: [email protected]
§Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail: [email protected]
¶Department of Computer Science, Ohio State University, Columbus, OH 43210, USA; e-mail: [email protected]
‖Department of Computer Science, University of Virginia, Charlottesville, VA 22904, USA; e-mail: [email protected]
**Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail: [email protected]
$t = 1, \ldots, T$, $0 \le s \le 1/2$. Then for any $\delta > 0$, under Assumptions 5.1, 5.2 and 6.1, with probability at least $1 - \delta$, the iterates $x_t$ of AMSGrad satisfy
$$\frac{1}{T-1}\sum_{t=2}^{T}\|\nabla f(x_t)\|_2^2 \le \frac{M_1}{T\alpha} + \frac{M_2 d}{T} + \frac{M_3 d\alpha}{T^{1/2-s}}, \qquad (6.1)$$
where $\{M_i\}_{i=1}^{3}$ are defined as follows:
$$M_1 = 4G_\infty \Delta_f + C' G_\infty^2 \epsilon^{-1}\sigma^2 \log(2/\delta), \qquad M_2 = \frac{4G_\infty^3 \epsilon^{-1/2}}{1-\beta_1} + 4G_\infty^2,$$
$$M_3 = \frac{4 L G_\infty^2}{\epsilon^{1/2}(1-\beta_2)^{1/2}\big(1-\beta_1/\beta_2^{1/2}\big)}\Big(1 + \frac{2\beta_1^2}{1-\beta_1}\Big),$$
and $\Delta_f = f(x_1) - \inf_x f(x)$.
Remark 6.4. Similar to the discussion in Remark 5.5, we can choose $\alpha = \Theta\big((d^{1/2} T^{1/4+s/2})^{-1}\big)$ to achieve an $O\big(d^{1/2}/T^{3/4-s/2} + d/T\big)$ convergence rate. When $s < 1/2$, this rate of AMSGrad is strictly better than that of nonconvex SGD (Ghadimi and Lan, 2016).
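As a quick numerical illustration of Remark 6.4 (our own, not from the paper), the two rates can be compared directly; with all constants set to one, the AMSGrad rate eventually drops below the SGD rate whenever $s < 1/2$. The dimension $d$ and sparsity level $s$ below are arbitrary illustrative choices.

```python
# Compare the AMSGrad high-probability rate d^(1/2)/T^(3/4 - s/2) + d/T from
# Remark 6.4 with the O(1/sqrt(T)) rate of nonconvex SGD. All hidden constants
# are set to 1 purely for illustration.

def amsgrad_rate(T, d, s):
    return d ** 0.5 / T ** (0.75 - s / 2) + d / T

def sgd_rate(T):
    return 1.0 / T ** 0.5

d, s = 100, 0.25                     # hypothetical dimension and sparsity
for T in (10 ** 4, 10 ** 6, 10 ** 10):
    print(T, amsgrad_rate(T, d, s), sgd_rate(T))
```

For small $T$ the dimension-dependent constants dominate, but as $T$ grows the extra factor $T^{1/4 - s/2}$ in the denominator wins out.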
We also have the following corollaries characterizing the high probability bounds for RMSProp
and AdaGrad.
Corollary 6.5 (corrected version of RMSProp). Under the same conditions of Theorem 6.3, if $\alpha_t = \alpha \le 1/2$ and $\|g_{1:T,i}\|_2 \le G_\infty T^s$ for $t = 1, \ldots, T$, $0 \le s \le 1/2$, then for any $\delta > 0$, with probability at least $1 - \delta$, the iterates $x_t$ of RMSProp satisfy
$$\frac{1}{T-1}\sum_{t=2}^{T}\|\nabla f(x_t)\|_2^2 \le \frac{M_1}{T\alpha} + \frac{M_2 d}{T} + \frac{M_3 d\alpha}{T^{1/2-s}}, \qquad (6.2)$$
where $\{M_i\}_{i=1}^{3}$ are defined as follows:
$$M_1 = 4G_\infty \Delta + C' G_\infty^2 \epsilon^{-1}\sigma^2 \log(2/\delta), \qquad M_2 = 4G_\infty^3 \epsilon^{-1/2} + 4G_\infty^2, \qquad M_3 = \frac{4 L G_\infty^2}{\epsilon^{1/2}(1-\beta)^{1/2}},$$
and $\Delta = f(x_1) - \inf_x f(x)$.
Corollary 6.6 (AdaGrad). Under the same conditions of Theorem 6.3, if $\alpha_t = \alpha \le 1/2$ and $\|g_{1:T,i}\|_2 \le G_\infty T^s$ for $t = 1, \ldots, T$, $0 \le s \le 1/2$, then for any $\delta > 0$, with probability at least $1 - \delta$, the iterates $x_t$ of AdaGrad satisfy
$$\frac{1}{T-1}\sum_{t=2}^{T}\|\nabla f(x_t)\|_2^2 \le \frac{M_1}{T\alpha} + \frac{M_2 d}{T} + \frac{M_3 d\alpha}{T^{1/2-s}}, \qquad (6.3)$$
where $\{M_i\}_{i=1}^{3}$ are defined as follows:
$$M_1 = 4G_\infty \Delta + C' G_\infty^2 \epsilon^{-1}\sigma^2 \log(2/\delta), \qquad M_2 = 4G_\infty^3 \epsilon^{-1/2} + 4G_\infty^2, \qquad M_3 = \frac{4 L G_\infty^2}{\epsilon^{1/2}},$$
and $\Delta = f(x_1) - \inf_x f(x)$.
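For readers less familiar with the three algorithms treated above, the per-coordinate second-moment estimates that distinguish them can be sketched as follows. This follows the standard definitions of AMSGrad, RMSProp, and AdaGrad, not code from the paper, and the function names are illustrative.

```python
# Second-moment ("v_t") constructions of the three algorithms, per coordinate.

def amsgrad_step(v, v_hat, g, beta2=0.99):
    v = beta2 * v + (1 - beta2) * g * g
    return v, max(v_hat, v)                # AMSGrad keeps a running maximum

def rmsprop_step(v, g, beta=0.99):
    return beta * v + (1 - beta) * g * g   # exponential moving average

def adagrad_step(v, g):
    return v + g * g                       # cumulative sum of squares

grads = [0.5, -2.0, 1.0, 0.1]
va = 0.0
for g in grads:
    va = adagrad_step(va, g)
print(va == sum(g * g for g in grads))     # AdaGrad accumulates exactly sum g^2
```

The running maximum in AMSGrad guarantees a nonincreasing effective step size per coordinate, which is what makes the telescoping arguments in the proofs below go through.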
7 Proof Sketch of the Main Theory
In this section, we provide a proof sketch of Theorems 5.3 and 6.3; the complete proofs, as well as the proofs of the other corollaries and technical lemmas, can be found in the appendix. Compared with the analysis of standard stochastic gradient descent, the main difficulty in analyzing the convergence rate of adaptive gradient methods is the presence of the stochastic momentum $m_t$ and the adaptive stochastic gradient $V_t^{-1/2} g_t$. To address this challenge, following Yang et al. (2016), we define an auxiliary sequence $z_t$: let $x_0 = x_1$, and for each $t \ge 1$,
$$z_t = x_t + \frac{\beta_1}{1-\beta_1}(x_t - x_{t-1}) = \frac{1}{1-\beta_1} x_t - \frac{\beta_1}{1-\beta_1} x_{t-1}. \qquad (7.1)$$
The following lemma shows that $z_{t+1} - z_t$ can be represented in terms of $m_t$, $g_t$ and $V_t^{-1/2}$. This indicates that by considering the sequence $z_t$, it is possible to analyze algorithms with stochastic momentum, such as AMSGrad.
Lemma 7.1. Let $z_t$ be defined in (7.1). Then for $t \ge 2$, we have the following expression for $z_{t+1} - z_t$:
$$z_{t+1} - z_t = \frac{\beta_1}{1-\beta_1}\Big[I - \big(\alpha_t V_t^{-1/2}\big)\big(\alpha_{t-1} V_{t-1}^{-1/2}\big)^{-1}\Big](x_{t-1} - x_t) - \alpha_t V_t^{-1/2} g_t.$$
We can also represent $z_{t+1} - z_t$ as follows:
$$z_{t+1} - z_t = \frac{\beta_1}{1-\beta_1}\big(\alpha_{t-1} V_{t-1}^{-1/2} - \alpha_t V_t^{-1/2}\big) m_{t-1} - \alpha_t V_t^{-1/2} g_t.$$
For $t = 1$, we have $z_2 - z_1 = -\alpha_1 V_1^{-1/2} g_1$.
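The second identity in Lemma 7.1 is purely algebraic, so it can be verified numerically. Below is a one-dimensional check under the standard AMSGrad-style recursion $x_{t+1} = x_t - \alpha_t V_t^{-1/2} m_t$; the values of $\beta_1$, $\beta_2$, $\epsilon$, $\alpha$ and the gradient stream are arbitrary illustrative choices, not taken from the paper.

```python
import math, random

# Numerical check (1-D) of the second identity in Lemma 7.1.
random.seed(0)
beta1, beta2, eps, alpha = 0.9, 0.99, 1e-8, 0.1
c = beta1 / (1 - beta1)           # the coefficient beta1/(1-beta1) in (7.1)
x_prev = x = 1.0
m = v = vhat = 0.0
m_prev = inv_prev = 0.0
max_err = 0.0

for t in range(1, 51):
    g = random.uniform(-1, 1)                 # stand-in stochastic gradient
    m = beta1 * m + (1 - beta1) * g           # momentum m_t
    v = beta2 * v + (1 - beta2) * g * g
    vhat = max(vhat, v)                       # AMSGrad running maximum
    inv = 1.0 / math.sqrt(vhat + eps)         # scalar analogue of V_t^{-1/2}
    x_new = x - alpha * inv * m               # x_{t+1}
    z = x + c * (x - x_prev)                  # z_t as defined in (7.1)
    z_new = x_new + c * (x_new - x)           # z_{t+1}
    if t >= 2:
        rhs = c * (alpha * inv_prev - alpha * inv) * m_prev - alpha * inv * g
        max_err = max(max_err, abs((z_new - z) - rhs))
    x_prev, x = x, x_new
    m_prev, inv_prev = m, inv

print(max_err)   # identity holds up to floating-point rounding
```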
With Lemma 7.1, we have the following two lemmas giving upper bounds for $\|z_{t+1} - z_t\|_2$ and $\|\nabla f(z_t) - \nabla f(x_t)\|_2$, which are useful in the proof of the main theorem.
Lemma 7.2. Let $z_t$ be defined in (7.1). For $t \ge 2$, we have
$$\|z_{t+1} - z_t\|_2 \le \big\|\alpha_t V_t^{-1/2} g_t\big\|_2 + \frac{\beta_1}{1-\beta_1}\|x_{t-1} - x_t\|_2.$$
Lemma 7.3. Let $z_t$ be defined in (7.1). For $t \ge 2$, we have
$$\|\nabla f(z_t) - \nabla f(x_t)\|_2 \le L\Big(\frac{\beta_1}{1-\beta_1}\Big)\|x_t - x_{t-1}\|_2.$$
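To make Lemma 7.3 concrete, here is a minimal check on a toy $L$-smooth objective $f(x) = (L/2)x^2$ (our own example, not from the paper); since the gradient of this quadratic is exactly $L$-Lipschitz, the bound holds with equality here.

```python
# Lemma 7.3 on the toy L-smooth objective f(x) = (L/2) x^2, grad f(x) = L x.
# Here z - x = beta1/(1-beta1) * (x - x_prev), so the bound is an equality.
L_smooth, beta1 = 2.0, 0.9
c = beta1 / (1 - beta1)
x_prev, x = 0.3, -0.5
z = x + c * (x - x_prev)
grad = lambda u: L_smooth * u
lhs = abs(grad(z) - grad(x))
rhs = L_smooth * c * abs(x - x_prev)
assert abs(lhs - rhs) < 1e-9
```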
We also need the following lemma to bound $\|\nabla f(x)\|_\infty$, $\|v_t\|_\infty$ and $\|m_t\|_\infty$. Basically, it shows that these quantities can be bounded by $G_\infty$.
Lemma 7.4. Let $v_t$ and $m_t$ be as defined in Algorithm 1. Then under Assumption 5.1, we have $\|\nabla f(x)\|_\infty \le G_\infty$, $\|v_t\|_\infty \le G_\infty^2$ and $\|m_t\|_\infty \le G_\infty$.
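The intuition behind Lemma 7.4 is that $m_t$ and $v_t$ are convex combinations of quantities bounded by $G_\infty$ and $G_\infty^2$ respectively, so the bounds follow by induction. A one-dimensional numerical sanity check (with arbitrary illustrative parameters of our choosing):

```python
import random

# Sanity check of Lemma 7.4 in 1-D: if every stochastic gradient satisfies
# |g| <= G_inf, the moving averages m_t and v_t stay bounded by G_inf and
# G_inf^2 respectively.
random.seed(1)
G_inf, beta1, beta2 = 2.0, 0.9, 0.99
m = v = 0.0
for _ in range(1000):
    g = random.uniform(-G_inf, G_inf)    # gradient with |g| <= G_inf
    m = beta1 * m + (1 - beta1) * g      # |m_t| <= G_inf by induction
    v = beta2 * v + (1 - beta2) * g * g  # v_t <= G_inf^2 by induction
    assert abs(m) <= G_inf and v <= G_inf ** 2
```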
Last, we need the following lemma, which provides upper bounds for $\|V_t^{-1/2} m_t\|_2$ and $\|V_t^{-1/2} g_t\|_2$. More specifically, it shows that both quantities can be bounded in terms of $\sum_{i=1}^{d} \|g_{1:T,i}\|_2$.
Lemma 7.5. Let $\beta_1, \beta_2$ be the weight parameters and $\alpha_t$, $t = 1, \ldots, T$, be the step sizes in Algorithm 1. Denote $\gamma = \beta_1/\beta_2^{1/2}$. Suppose that $\alpha_t = \alpha$ and $\gamma \le 1$. Then under Assumption 5.1, we have the following two results:
$$\sum_{t=1}^{T} \alpha_t^2 \big\|V_t^{-1/2} m_t\big\|_2^2 \le \frac{T^{1/2}\alpha^2(1-\beta_1)}{2\epsilon^{1/2}(1-\beta_2)^{1/2}(1-\gamma)} \sum_{i=1}^{d} \|g_{1:T,i}\|_2,$$
and
$$\sum_{t=1}^{T} \alpha_t^2 \big\|V_t^{-1/2} g_t\big\|_2^2 \le \frac{T^{1/2}\alpha^2}{2\epsilon^{1/2}(1-\beta_2)^{1/2}(1-\gamma)} \sum_{i=1}^{d} \|g_{1:T,i}\|_2.$$
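The second bound of Lemma 7.5 can be spot-checked numerically in one dimension. This is our own illustration under an assumed form $V_t^{-1/2} = (\hat v_t + \epsilon)^{-1/2}$ of the AMSGrad weight matrix, with arbitrary parameter values; the bound is loose, so the check passes with a wide margin.

```python
import math, random

# Spot-check (1-D) of the second bound in Lemma 7.5:
#   sum_t alpha^2 |g_t|^2 / (vhat_t + eps)
#     <= T^(1/2) alpha^2 / (2 eps^(1/2) (1-beta2)^(1/2) (1-gamma)) * ||g_{1:T}||_2
random.seed(2)
beta1, beta2, eps, alpha, T = 0.9, 0.99, 1e-3, 0.01, 200
gamma = beta1 / math.sqrt(beta2)
v = vhat = 0.0
lhs = sq_sum = 0.0
for _ in range(T):
    g = random.uniform(-1, 1)
    v = beta2 * v + (1 - beta2) * g * g
    vhat = max(vhat, v)                         # AMSGrad running maximum
    lhs += alpha ** 2 * g * g / (vhat + eps)    # left-hand side of the bound
    sq_sum += g * g                             # accumulates ||g_{1:T}||_2^2
rhs = (math.sqrt(T) * alpha ** 2
       / (2 * math.sqrt(eps) * math.sqrt(1 - beta2) * (1 - gamma))
       * math.sqrt(sq_sum))
print(lhs <= rhs)
```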
With all the lemmas in place, we are now ready to present the proof sketch of Theorem 5.3.
Proof Sketch of Theorem 5.3. Since $f$ is $L$-smooth, we have
$$f(z_{t+1}) \le f(z_t) + \nabla f(z_t)^\top (z_{t+1} - z_t) + \frac{L}{2}\|z_{t+1} - z_t\|_2^2$$
$$= f(z_t) + \underbrace{\nabla f(x_t)^\top (z_{t+1} - z_t)}_{I_1} + \underbrace{(\nabla f(z_t) - \nabla f(x_t))^\top (z_{t+1} - z_t)}_{I_2} + \underbrace{\frac{L}{2}\|z_{t+1} - z_t\|_2^2}_{I_3}. \qquad (7.2)$$
In the following, we bound I1, I2 and I3 separately.
Bounding term $I_1$: We can prove that when $t = 1$,
$$\nabla f(x_1)^\top (z_2 - z_1) = -\nabla f(x_1)^\top \alpha_1 V_1^{-1/2} g_1. \qquad (7.3)$$
For $t \ge 2$, by Lemma 7.1, we can prove the following result:
$$\nabla f(x_t)^\top (z_{t+1} - z_t) \le \frac{G_\infty^2}{1-\beta_1}\Big(\big\|\alpha_{t-1} v_{t-1}^{-1/2}\big\|_1 - \big\|\alpha_t v_t^{-1/2}\big\|_1\Big) - \nabla f(x_t)^\top \alpha_{t-1} V_{t-1}^{-1/2} g_t. \qquad (7.4)$$
Bounding term $I_2$: For $t \ge 1$, by Lemma 7.1 and Lemma 7.2, we can prove that
$$\big(\nabla f(z_t) - \nabla f(x_t)\big)^\top (z_{t+1} - z_t) \le L\big\|\alpha_t V_t^{-1/2} g_t\big\|_2^2 + 2L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2 \|x_t - x_{t-1}\|_2^2. \qquad (7.5)$$
Bounding term $I_3$: For $t \ge 1$, by Lemma 7.1, we have
$$\frac{L}{2}\|z_{t+1} - z_t\|_2^2 \le L\big\|\alpha_t V_t^{-1/2} g_t\big\|_2^2 + 2L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2 \|x_{t-1} - x_t\|_2^2. \qquad (7.6)$$
Now we get back to (7.2). We provide upper bounds of (7.2) for $t = 1$ and $t > 1$ separately. For $t = 1$, substituting (7.3), (7.5) and (7.6) into (7.2), taking expectation and rearranging terms, we have
$$\mathbb{E}[f(z_2) - f(z_1)] \le \mathbb{E}\Big[d\alpha_1 G_\infty + 2L\big\|\alpha_1 V_1^{-1/2} g_1\big\|_2^2\Big]. \qquad (7.7)$$
For $t \ge 2$, substituting (7.4), (7.5) and (7.6) into (7.2), taking expectation and rearranging terms, we have
$$\mathbb{E}\bigg[f(z_{t+1}) + \frac{G_\infty^2 \big\|\alpha_t v_t^{-1/2}\big\|_1}{1-\beta_1}\bigg] - \mathbb{E}\bigg[f(z_t) + \frac{G_\infty^2 \big\|\alpha_{t-1} v_{t-1}^{-1/2}\big\|_1}{1-\beta_1}\bigg]$$
$$\le \mathbb{E}\bigg[-\alpha_{t-1}\big\|\nabla f(x_t)\big\|_2^2 G_\infty^{-1} + 2L\big\|\alpha_t V_t^{-1/2} g_t\big\|_2^2 + 4L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2 \big\|\alpha_{t-1} V_{t-1}^{-1/2} m_{t-1}\big\|_2^2\bigg]. \qquad (7.8)$$
We now telescope (7.8) for $t = 2$ to $T$, and add it to (7.7). Rearranging, we have
$$G_\infty^{-1} \sum_{t=2}^{T} \alpha_{t-1} \mathbb{E}\big\|\nabla f(x_t)\big\|_2^2 \le \mathbb{E}\bigg[\Delta_f + \frac{G_\infty^2 \alpha_1 \epsilon^{-1/2} d}{1-\beta_1} + d\alpha_1 G_\infty\bigg] + 2L \sum_{t=1}^{T} \mathbb{E}\big\|\alpha_t V_t^{-1/2} g_t\big\|_2^2 + 4L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2 \sum_{t=1}^{T} \mathbb{E}\big\|\alpha_t V_t^{-1/2} m_t\big\|_2^2. \qquad (7.9)$$
By Lemma 7.5, we can further bound $\sum_{t=1}^{T} \mathbb{E}\|\alpha_t V_t^{-1/2} g_t\|_2^2$ and $\sum_{t=1}^{T} \mathbb{E}\|\alpha_t V_t^{-1/2} m_t\|_2^2$ in (7.9) by $\sum_{i=1}^{d} \|g_{1:T,i}\|_2$, which yields
$$\mathbb{E}\|\nabla f(x_{\text{out}})\|_2^2 \le \frac{2G_\infty \Delta_f}{T\alpha} + \frac{2}{T}\bigg(\frac{G_\infty^3 \epsilon^{-1/2} d}{1-\beta_1} + dG_\infty^2\bigg) + \frac{2G_\infty L \alpha}{T^{1/2}\epsilon^{1/2}(1-\gamma)(1-\beta_2)^{1/2}} \mathbb{E}\bigg(\sum_{i=1}^{d} \|g_{1:T,i}\|_2\bigg)\bigg(1 + 2(1-\beta_1)\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\bigg). \qquad (7.10)$$
Finally, rearranging (7.10) and using the theorem condition $\|g_{1:T,i}\|_2 \le G_\infty T^s$, we obtain
$$\mathbb{E}\|\nabla f(x_{\text{out}})\|_2^2 \le \frac{M_1}{T\alpha} + \frac{M_2 d}{T} + \frac{M_3 d\alpha}{T^{1/2-s}},$$
where $\{M_i\}_{i=1}^{3}$ are defined in Theorem 5.3. This completes the proof.
We then present the proof sketch for the high probability result, i.e., Theorem 6.3.
Proof Sketch of Theorem 6.3. We follow the same procedure as in the proof of Theorem 5.3 up to (7.6). For $t = 1$, substituting (7.3), (7.5) and (7.6) into (7.2) and rearranging terms, we have
$$f(z_2) - f(z_1) \le d\alpha_1 G_\infty + 2L\big\|\alpha_1 V_1^{-1/2} g_1\big\|_2^2. \qquad (7.11)$$
For $t \ge 2$, substituting (7.4), (7.5) and (7.6) into (7.2) and rearranging terms, we have
$$f(z_{t+1}) + \frac{G_\infty^2 \big\|\alpha_t V_t^{-1/2}\big\|_{1,1}}{1-\beta_1} - \bigg(f(z_t) + \frac{G_\infty^2 \big\|\alpha_{t-1} V_{t-1}^{-1/2}\big\|_{1,1}}{1-\beta_1}\bigg)$$
$$\le -\nabla f(x_t)^\top \alpha_{t-1} V_{t-1}^{-1/2} g_t + 2L\big\|\alpha_t V_t^{-1/2} g_t\big\|_2^2 + 4L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2 \big\|\alpha_{t-1} V_{t-1}^{-1/2} m_{t-1}\big\|_2^2. \qquad (7.12)$$
We now telescope (7.12) for $t = 2$ to $T$ and add it to (7.11). Rearranging, we have
$$\sum_{t=2}^{T} \alpha_{t-1} \nabla f(x_t)^\top V_{t-1}^{-1/2} g_t \le \Delta_f + \frac{G_\infty^2 \alpha_1 \epsilon^{-1/2} d}{1-\beta_1} + d\alpha_1 G_\infty + 2L \sum_{t=1}^{T} \big\|\alpha_t V_t^{-1/2} g_t\big\|_2^2 + 4L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2 \sum_{t=1}^{T} \big\|\alpha_t V_t^{-1/2} m_t\big\|_2^2. \qquad (7.13)$$
Now consider the filtration $\mathcal{F}_t = \sigma(\xi_1, \ldots, \xi_t)$. Since $x_t$ and $V_{t-1}^{-1/2}$ only depend on $\xi_1, \ldots, \xi_{t-1}$, by Assumption 6.1 and a martingale concentration argument, we obtain
$$\bigg|\sum_{t=2}^{T} \alpha_{t-1} \nabla f(x_t)^\top V_{t-1}^{-1/2} g_t - \sum_{t=2}^{T} \alpha_{t-1} \nabla f(x_t)^\top V_{t-1}^{-1/2} \nabla f(x_t)\bigg| \le G_\infty^{-1} \sum_{t=2}^{T} \alpha_{t-1}^2 \|\nabla f(x_t)\|_2^2 + C\epsilon^{-1}\sigma^2 G_\infty \log(2/\delta). \qquad (7.14)$$
By Lemma 7.5 and substituting (7.14) into (7.13), we have
$$\sum_{t=2}^{T} \alpha_{t-1} \nabla f(x_t)^\top V_{t-1}^{-1/2} \nabla f(x_t) \le \Delta_f + \frac{G_\infty^2 \alpha_1 \epsilon^{-1/2} d}{1-\beta_1} + d\alpha_1 G_\infty + G_\infty^{-1} \sum_{t=2}^{T} \alpha_{t-1}^2 \|\nabla f(x_t)\|_2^2$$
$$+ \frac{L T^{1/2} \alpha^2}{\epsilon^{1/2}(1-\beta_2)^{1/2}(1-\gamma)} \sum_{i=1}^{d} \|g_{1:T,i}\|_2 + C\epsilon^{-1}\sigma^2 G_\infty \log(2/\delta) + \Big(\frac{\beta_1}{1-\beta_1}\Big)^2 \frac{2L T^{1/2} \alpha^2 (1-\beta_1)}{\epsilon^{1/2}(1-\beta_2)^{1/2}(1-\gamma)} \sum_{i=1}^{d} \|g_{1:T,i}\|_2.$$
Moreover, by Lemma 7.4, we have $\nabla f(x_t)^\top V_{t-1}^{-1/2} \nabla f(x_t) \ge G_\infty^{-1} \|\nabla f(x_t)\|_2^2$, and therefore by choosing $\alpha_t = \alpha \le 1/2$ and rearranging terms, we have
$$\frac{1}{T-1}\sum_{t=2}^{T} \|\nabla f(x_t)\|_2^2 \le \frac{4G_\infty}{T\alpha}\Delta_f + \frac{4G_\infty^3 \epsilon^{-1/2}}{1-\beta_1}\cdot\frac{d}{T} + 4G_\infty^2 \cdot \frac{d}{T} + \frac{4G_\infty L \alpha}{\epsilon^{1/2}(1-\beta_2)^{1/2}(1-\gamma) T^{1/2}} \sum_{i=1}^{d} \|g_{1:T,i}\|_2$$
$$+ \Big(\frac{\beta_1}{1-\beta_1}\Big)^2 \frac{8G_\infty L \alpha (1-\beta_1)}{\epsilon^{1/2}(1-\beta_2)^{1/2}(1-\gamma) T^{1/2}} \sum_{i=1}^{d} \|g_{1:T,i}\|_2 + \frac{C' G_\infty^2 \epsilon^{-1} \sigma^2 \log(2/\delta)}{T\alpha}, \qquad (7.15)$$
where $C'$ is an absolute constant. Finally, rearranging (7.15) and using the theorem condition $\|g_{1:T,i}\|_2 \le G_\infty T^s$, we have
$$\frac{1}{T-1}\sum_{t=2}^{T} \|\nabla f(x_t)\|_2^2 \le \frac{M_1}{T\alpha} + \frac{M_2 d}{T} + \frac{M_3 d\alpha}{T^{1/2-s}},$$
where $\{M_i\}_{i=1}^{3}$ are defined in Theorem 6.3. This completes the proof.
8 Conclusions
In this paper, we provided a fine-grained analysis of a general class of adaptive gradient methods and proved their convergence rates for smooth nonconvex optimization. Our results give faster convergence rates for AMSGrad, the corrected version of RMSProp, and AdaGrad on smooth nonconvex optimization than previous works. In addition, we also proved high probability bounds on the convergence rates of AMSGrad, RMSProp and AdaGrad, which had not been established before.
A Detailed Proof of the Main Theory
Here we provide the detailed proof of the main theorem.
A.1 Proof of Theorem 5.3
Let x0 = x1. To prove Theorem 5.3, we need the following lemmas:
Lemma A.1 (Restatement of Lemma 7.4). Let $v_t$ and $m_t$ be as defined in Algorithm 1. Then under Assumption 5.1, we have $\|\nabla f(x)\|_\infty \le G_\infty$, $\|v_t\|_\infty \le G_\infty^2$ and $\|m_t\|_\infty \le G_\infty$.
Lemma A.2 (Generalized version of Lemma 7.5). Let $\beta_1, \beta_2, \beta_1', \beta_2'$ be the weight parameters such that
$$m_t = \beta_1 m_{t-1} + (1-\beta_1') g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2') g_t^2,$$
and let $\alpha_t$, $t = 1, \ldots, T$, be the step sizes. Denote $\gamma = \beta_1/\beta_2^{1/2}$. Suppose that $\alpha_t = \alpha$ and $\gamma \le 1$. Then under Assumption 5.1, we have the following two results:
$$\sum_{t=1}^{T} \alpha_t^2 \big\|V_t^{-1/2} m_t\big\|_2^2 \le \frac{T^{1/2}\alpha^2(1-\beta_1')}{2\epsilon^{1/2}(1-\beta_2')^{1/2}(1-\gamma)} \sum_{i=1}^{d} \|g_{1:T,i}\|_2,$$
and
$$\sum_{t=1}^{T} \alpha_t^2 \big\|V_t^{-1/2} g_t\big\|_2^2 \le \frac{T^{1/2}\alpha^2}{2\epsilon^{1/2}(1-\beta_2')^{1/2}(1-\gamma)} \sum_{i=1}^{d} \|g_{1:T,i}\|_2.$$
Note that Lemma A.2 is general and applicable to various algorithms. Specifically, setting $\beta_1' = \beta_1$ and $\beta_2' = \beta_2$, we recover the case in Algorithm 1. Further setting $\beta_1 = 0$, we recover the case in Algorithm 2. Setting $\beta_1' = \beta_1 = 0$, $\beta_2 = 1$ and $\beta_2' = 0$, we recover the case in Algorithm 3.
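These parameter settings can be checked directly on the unified recursion; the sketch below is our own illustration with illustrative names. In the AdaGrad setting, $v_t$ reduces to the cumulative sum of squared gradients and $m_t$ to the raw gradient.

```python
# Unified moment recursion from Lemma A.2:
#   m_t = beta1 * m_{t-1} + (1 - beta1p) * g_t
#   v_t = beta2 * v_{t-1} + (1 - beta2p) * g_t^2
# beta1p = beta1, beta2p = beta2 recovers Algorithm 1; additionally beta1 = 0
# recovers Algorithm 2; beta1p = beta1 = 0, beta2 = 1, beta2p = 0 recovers
# Algorithm 3 (AdaGrad).

def run(grads, beta1, beta1p, beta2, beta2p):
    m = v = 0.0
    for g in grads:
        m = beta1 * m + (1 - beta1p) * g
        v = beta2 * v + (1 - beta2p) * g * g
    return m, v

grads = [0.5, -1.0, 2.0]
m, v = run(grads, 0.0, 0.0, 1.0, 0.0)   # the AdaGrad setting
assert v == sum(g * g for g in grads)   # v_t = cumulative sum of g^2
assert m == grads[-1]                   # m_t reduces to the raw gradient g_t
```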
To deal with the stochastic momentum $m_t$ and the stochastic weight $V_t^{-1/2}$, following Yang et al. (2016), we define an auxiliary sequence $z_t$ as follows: let $x_0 = x_1$, and for each $t \ge 1$,
$$z_t = x_t + \frac{\beta_1}{1-\beta_1}(x_t - x_{t-1}) = \frac{1}{1-\beta_1} x_t - \frac{\beta_1}{1-\beta_1} x_{t-1}. \qquad (A.1)$$
Lemma A.3 shows that zt+1 − zt can be represented in two different ways.
Lemma A.3 (Restatement of Lemma 7.1). Let $z_t$ be defined in (A.1). For $t \ge 2$, we have
$$z_{t+1} - z_t = \frac{\beta_1}{1-\beta_1}\Big[I - \big(\alpha_t V_t^{-1/2}\big)\big(\alpha_{t-1} V_{t-1}^{-1/2}\big)^{-1}\Big](x_{t-1} - x_t) - \alpha_t V_t^{-1/2} g_t, \qquad (A.2)$$
and
$$z_{t+1} - z_t = \frac{\beta_1}{1-\beta_1}\big(\alpha_{t-1} V_{t-1}^{-1/2} - \alpha_t V_t^{-1/2}\big) m_{t-1} - \alpha_t V_t^{-1/2} g_t. \qquad (A.3)$$
For $t = 1$, we have
$$z_2 - z_1 = -\alpha_1 V_1^{-1/2} g_1. \qquad (A.4)$$
By Lemma A.3, we connect $z_{t+1} - z_t$ with $x_{t+1} - x_t$ and $\alpha_t V_t^{-1/2} g_t$. The following two lemmas give bounds on $\|z_{t+1} - z_t\|_2$ and $\|\nabla f(z_t) - \nabla f(x_t)\|_2$, which play important roles in our proof.
Lemma A.4 (Restatement of Lemma 7.2). Let $z_t$ be defined in (A.1). For $t \ge 2$, we have
$$\|z_{t+1} - z_t\|_2 \le \big\|\alpha_t V_t^{-1/2} g_t\big\|_2 + \frac{\beta_1}{1-\beta_1}\|x_{t-1} - x_t\|_2.$$
Lemma A.5 (Restatement of Lemma 7.3). Let $z_t$ be defined in (A.1). For $t \ge 2$, we have
$$\|\nabla f(z_t) - \nabla f(x_t)\|_2 \le L\Big(\frac{\beta_1}{1-\beta_1}\Big)\|x_t - x_{t-1}\|_2.$$
Now we are ready to prove Theorem 5.3.
Proof of Theorem 5.3. Since $f$ is $L$-smooth, we have
$$f(z_{t+1}) \le f(z_t) + \nabla f(z_t)^\top (z_{t+1} - z_t) + \frac{L}{2}\|z_{t+1} - z_t\|_2^2$$
$$= f(z_t) + \underbrace{\nabla f(x_t)^\top (z_{t+1} - z_t)}_{I_1} + \underbrace{(\nabla f(z_t) - \nabla f(x_t))^\top (z_{t+1} - z_t)}_{I_2} + \underbrace{\frac{L}{2}\|z_{t+1} - z_t\|_2^2}_{I_3}. \qquad (A.5)$$
In the following, we bound I1, I2 and I3 separately.
Bounding term $I_1$: When $t = 1$, we have
$$\nabla f(x_1)^\top (z_2 - z_1) = -\nabla f(x_1)^\top \alpha_1 V_1^{-1/2} g_1. \qquad (A.6)$$
For $t \ge 2$, we have
$$\nabla f(x_t)^\top (z_{t+1} - z_t) = \nabla f(x_t)^\top \bigg[\frac{\beta_1}{1-\beta_1}\big(\alpha_{t-1} V_{t-1}^{-1/2} - \alpha_t V_t^{-1/2}\big) m_{t-1} - \alpha_t V_t^{-1/2} g_t\bigg]$$
$$= \frac{\beta_1}{1-\beta_1}\nabla f(x_t)^\top \big(\alpha_{t-1} V_{t-1}^{-1/2} - \alpha_t V_t^{-1/2}\big) m_{t-1} - \nabla f(x_t)^\top \alpha_t V_t^{-1/2} g_t, \qquad (A.7)$$
where the first equality holds due to (A.3) in Lemma A.3. For the term $\nabla f(x_t)^\top (\alpha_{t-1} V_{t-1}^{-1/2} - \alpha_t V_t^{-1/2}) m_{t-1}$ in (A.7), we have
$$\nabla f(x_t)^\top \big(\alpha_{t-1} V_{t-1}^{-1/2} - \alpha_t V_t^{-1/2}\big) m_{t-1} \le \|\nabla f(x_t)\|_\infty \cdot \big\|\alpha_{t-1} V_{t-1}^{-1/2} - \alpha_t V_t^{-1/2}\big\|_{1,1} \cdot \|m_{t-1}\|_\infty$$
$$\le G_\infty^2 \Big[\big\|\alpha_{t-1} V_{t-1}^{-1/2}\big\|_{1,1} - \big\|\alpha_t V_t^{-1/2}\big\|_{1,1}\Big]. \qquad (A.8)$$
The first inequality holds because for a positive diagonal matrix $A$, we have $x^\top A y \le \|x\|_\infty \cdot \|A\|_{1,1} \cdot \|y\|_\infty$. The second inequality holds due to $\alpha_{t-1} V_{t-1}^{-1/2} \succeq \alpha_t V_t^{-1/2} \succ 0$. Next we bound $-\nabla f(x_t)^\top \alpha_t V_t^{-1/2} g_t$. We have
$$-\nabla f(x_t)^\top \alpha_t V_t^{-1/2} g_t = -\nabla f(x_t)^\top \alpha_{t-1} V_{t-1}^{-1/2} g_t - \nabla f(x_t)^\top \big(\alpha_t V_t^{-1/2} - \alpha_{t-1} V_{t-1}^{-1/2}\big) g_t$$
$$\le -\nabla f(x_t)^\top \alpha_{t-1} V_{t-1}^{-1/2} g_t + \|\nabla f(x_t)\|_\infty \cdot \big\|\alpha_t V_t^{-1/2} - \alpha_{t-1} V_{t-1}^{-1/2}\big\|_{1,1} \cdot \|g_t\|_\infty$$
$$\le -\nabla f(x_t)^\top \alpha_{t-1} V_{t-1}^{-1/2} g_t + G_\infty^2 \Big(\big\|\alpha_{t-1} V_{t-1}^{-1/2}\big\|_{1,1} - \big\|\alpha_t V_t^{-1/2}\big\|_{1,1}\Big). \qquad (A.9)$$
The first inequality holds because for a positive diagonal matrix $A$, we have $x^\top A y \le \|x\|_\infty \cdot \|A\|_{1,1} \cdot \|y\|_\infty$. The second inequality holds due to $\alpha_{t-1} V_{t-1}^{-1/2} \succeq \alpha_t V_t^{-1/2} \succ 0$. Substituting (A.8) and (A.9) into (A.7), we have
$$\nabla f(x_t)^\top (z_{t+1} - z_t) \le -\nabla f(x_t)^\top \alpha_{t-1} V_{t-1}^{-1/2} g_t + \frac{G_\infty^2}{1-\beta_1}\Big(\big\|\alpha_{t-1} V_{t-1}^{-1/2}\big\|_{1,1} - \big\|\alpha_t V_t^{-1/2}\big\|_{1,1}\Big). \qquad (A.10)$$
Bounding term $I_2$: For $t \ge 1$, we have
$$\big(\nabla f(z_t) - \nabla f(x_t)\big)^\top (z_{t+1} - z_t) \le \big\|\nabla f(z_t) - \nabla f(x_t)\big\|_2 \cdot \|z_{t+1} - z_t\|_2$$
$$\le \bigg(\big\|\alpha_t V_t^{-1/2} g_t\big\|_2 + \frac{\beta_1}{1-\beta_1}\|x_{t-1} - x_t\|_2\bigg) \cdot \frac{\beta_1}{1-\beta_1} \cdot L\|x_t - x_{t-1}\|_2$$
$$= \frac{L\beta_1}{1-\beta_1}\big\|\alpha_t V_t^{-1/2} g_t\big\|_2 \cdot \|x_t - x_{t-1}\|_2 + L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2 \|x_t - x_{t-1}\|_2^2$$
$$\le L\big\|\alpha_t V_t^{-1/2} g_t\big\|_2^2 + 2L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2 \|x_t - x_{t-1}\|_2^2, \qquad (A.11)$$
where the second inequality holds because of Lemma A.4 and Lemma A.5, and the last inequality holds due to Young's inequality.
Bounding term $I_3$: For $t \ge 1$, we have
$$\frac{L}{2}\|z_{t+1} - z_t\|_2^2 \le \frac{L}{2}\bigg[\big\|\alpha_t V_t^{-1/2} g_t\big\|_2 + \frac{\beta_1}{1-\beta_1}\|x_{t-1} - x_t\|_2\bigg]^2 \le L\big\|\alpha_t V_t^{-1/2} g_t\big\|_2^2 + 2L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2 \|x_{t-1} - x_t\|_2^2. \qquad (A.12)$$
The first inequality follows from Lemma A.4, and the second from the fact that $(a+b)^2 \le 2a^2 + 2b^2$.
For t = 1, substituting (A.6), (A.11) and (A.12) into (A.5), taking expectation and rearranging