Katalyst: Boosting Convex Katayusha for Non-Convex Problems with a Large Condition Number

Zaiyi Chen, Yi Xu, Haoyuan Hu, Tianbao Yang ([email protected])
CAINIAO.AI & USTC & UIowa

2019-06-10
Overview
1 Introduction
2 Katalyst Algorithm and Theoretical Guarantee
3 Experiments
Problem Definition
$$\min_{x\in\mathbb{R}^d}\; \phi(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x) + \psi(x) \tag{1}$$
Via variance-reduced (SVRG-type) methods (Johnson & Zhang, 2013), we can obtain a better gradient complexity with respect to the sample size n and the accuracy ε.
We name the proposed algorithm Katalyst after Katyusha (Allen-Zhu, 2017) and Catalyst (Lin et al., 2015).
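For context, recall the variance-reduced gradient estimator at the heart of SVRG-type methods (a standard construction, restated here as a reminder rather than taken from this slide): given a snapshot point $\tilde{x}$ with full gradient $\nabla f(\tilde{x})$ precomputed,

$$\widetilde{\nabla} = \nabla f_i(x) - \nabla f_i(\tilde{x}) + \nabla f(\tilde{x}), \qquad i \sim \mathrm{Unif}\{1,\dots,n\},$$

which is unbiased for $\nabla f(x)$ and whose variance vanishes as $x$ and $\tilde{x}$ approach a common point; this is what improves over the $O(1/\varepsilon^4)$ complexity of plain SGD.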
Assumptions
Each $f_i$ is $L$-smooth.
$\psi$ may be non-smooth, but it is convex.
$\phi$ is $\mu$-weakly convex.
Definition 1 (L-smoothness)
A function $f$ is Lipschitz smooth with constant $L$ if its gradient is Lipschitz continuous with constant $L$, that is,
$$\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|, \quad \forall x, y \in \mathbb{R}^d.$$

Definition 2 (Weak convexity)
A function $\phi$ is $\mu$-weakly convex if $\phi(x) + \frac{\mu}{2}\|x\|^2$ is convex.
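As a standard example (a well-known fact, not stated on the slide): any $L$-smooth function is automatically $L$-weakly convex, since $L$-smoothness gives the lower quadratic bound

$$f(y) \ge f(x) + \langle\nabla f(x), y - x\rangle - \frac{L}{2}\|y - x\|^2,$$

which is exactly convexity of $f(x) + \frac{L}{2}\|x\|^2$. So under the assumptions above one may always take $\mu \le L$, and the interesting regime is a large condition number $L/\mu$.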
Comparisons with Related Work
Table 1: Comparison of gradient complexities of variance-reduction-based algorithms for finding an ε-stationary point of (1). ∗ marks a result that is only valid when L/µ ≤ √n.

| Algorithms | L/µ ≥ Ω(n) | L/µ ≤ O(n) | Non-smooth ψ |
|---|---|---|---|
| SAGA (Reddi et al., 2016) | O(n^{2/3}L/ε²) | O(n^{2/3}L/ε²) | Yes |
| RapGrad (Lan & Yang, 2018) | O(√(nLµ)/ε²) | O((µn + √(nLµ))/ε²) | indicator function |
| SVRG (Reddi et al., 2016) | O(n^{2/3}L/ε²) | O(n^{2/3}L/ε²) | Yes |
| Natasha1 (Allen-Zhu, 2017) | NA | O(n^{2/3}L^{2/3}µ^{1/3}/ε²)∗ | Yes |
| RepeatSVRG (Allen-Zhu, 2017) | O(n^{3/4}√(Lµ)/ε²) | O((µn + n^{3/4}√(Lµ))/ε²) | Yes |
| 4WD-Catalyst (Paquette et al., 2018) | O(nL/ε²) | O(nL/ε²) | Yes |
| SPIDER (Fang et al., 2018) | O(√n·L/ε²) | O(√n·L/ε²) | No |
| SNVRG (Zhou et al., 2018) | O(√n·L/ε²) | O(√n·L/ε²) | No |
| Katalyst (this work) | O(√(nLµ)/ε²) | O((µn + L)/ε²) | Yes |
Our bound is shown to be optimal up to a logarithmic factor by recent work (Zhou & Gu, 2019).
Overview
1 Introduction
2 Katalyst Algorithm and Theoretical Guarantee
3 Experiments
Interpretation - Our Basic Idea
[Figure: Step 1 of the basic idea, moving from x0 to x1 on a one-dimensional non-convex objective.]
Interpretation - Our Basic Idea
[Figure: subsequent steps (Step > 1) of the basic idea, moving from x1 to x2.]
A Unified Framework
Meta Algorithm
Algorithm 1: Stagewise-SA$(x_0, \{\eta_s\}, \mu, \{w_s\})$
Input: a non-decreasing sequence $\{w_s\}$, $x_0 \in \mathrm{dom}(\psi)$, $\gamma = (2\mu)^{-1}$
1: for $s = 1, \dots, S$ do
2:    $f_s(\cdot) = \phi(\cdot) + \frac{1}{2\gamma}\|\cdot - x_{s-1}\|^2$
3:    $x_s = \mathrm{Katyusha}(f_s, x_{s-1}, K_s, \mu, L + \mu)$  // $x_s$ is usually an averaged solution
4: end for
Output: $x_\tau$, where $\tau$ is randomly chosen from $\{0, \dots, S\}$ according to the probabilities $p_\tau = w_{\tau+1} / \sum_{k=0}^{S} w_{k+1}$, $\tau = 0, \dots, S$

Each stage objective decomposes into an $n$-term smooth finite sum and a strongly convex composite term, which is exactly the setting Katyusha handles:

$$f_s(x) = \frac{1}{n}\sum_{i=1}^{n}\underbrace{\Big(f_i(x) + \frac{\mu}{2}\|x - x_{s-1}\|^2\Big)}_{\hat f_i(x)} + \underbrace{\frac{\gamma^{-1} - \mu}{2}\|x - x_{s-1}\|^2 + \psi(x)}_{\hat\psi(x)}$$
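To make the control flow concrete, here is a minimal Python sketch of the stagewise scheme, assuming a `katyusha` inner solver (sketched after Algorithm 2 below). All names and interfaces are ours; for simplicity the entire proximal quadratic is folded into the smooth part rather than split as in the decomposition above.

```python
import numpy as np

def stagewise_sa(grad_fi, n, prox_psi, x0, mu, L, S, Ks=5, alpha=1.0, rng=None):
    """Sketch of Algorithm 1 (Stagewise-SA).

    grad_fi(i, x): gradient of f_i at x.
    prox_psi(v, eta): argmin_x 1/(2*eta) * ||x - v||^2 + psi(x).
    Ks is a fixed inner epoch budget here; the paper picks K_s per stage.
    """
    rng = np.random.default_rng() if rng is None else rng
    gamma = 1.0 / (2.0 * mu)
    xs = [np.asarray(x0, dtype=float)]
    for s in range(1, S + 1):
        x_prev = xs[-1]
        # Stage objective f_s = phi + (1/(2*gamma)) * ||x - x_prev||^2.
        # Folding the quadratic into each component makes the smooth part
        # (L + 2*mu)-smooth while f_s stays mu-strongly convex overall.
        grad_hat = lambda i, x, xp=x_prev: grad_fi(i, x) + 2.0 * mu * (x - xp)
        xs.append(katyusha(grad_hat, n, prox_psi, x_prev,
                           K=Ks, sigma=mu, L_bar=L + 2.0 * mu, rng=rng))
    # Output stage sampled with p_tau proportional to w_{tau+1} = (tau+1)^alpha.
    w = (np.arange(S + 1) + 1.0) ** alpha
    return xs[rng.choice(S + 1, p=w / w.sum())]
```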
Algorithm
Algorithm 2: Katyusha$(f, x_0, K, \sigma, \bar L)$
Initialize: $\tau_2 = \frac{1}{2}$, $\tau_1 = \min\{\sqrt{\frac{n\sigma}{3\bar L}}, \frac{1}{2}\}$, $\eta = \frac{1}{3\tau_1\bar L}$, $\theta = 1 + \eta\sigma$, $m = \big\lceil\frac{\log(2\tau_1 + 2/\theta - 1)}{\log\theta}\big\rceil + 1$, $y_0 = \zeta_0 = \tilde{x}_0 = x_0$
1: for $k = 0, \dots, K - 1$ do
2:    $u_k = \nabla f(\tilde{x}_k)$
3:    for $t = 0, \dots, m - 1$ do
4:       $j = km + t$
5:       $x_{j+1} = \tau_1\zeta_j + \tau_2\tilde{x}_k + (1 - \tau_1 - \tau_2)y_j$
6:       $\nabla_{j+1} = u_k + \nabla f_i(x_{j+1}) - \nabla f_i(\tilde{x}_k)$, with $i$ drawn uniformly from $\{1, \dots, n\}$
7:       $\zeta_{j+1} = \arg\min_\zeta \frac{1}{2\eta}\|\zeta - \zeta_j\|^2 + \langle\nabla_{j+1}, \zeta\rangle + \psi(\zeta)$
8:       $y_{j+1} = \arg\min_y \frac{3\bar L}{2}\|y - x_{j+1}\|^2 + \langle\nabla_{j+1}, y\rangle + \psi(y)$
9:    end for
10:   $\tilde{x}_{k+1} = \sum_{t=0}^{m-1}\theta^t y_{km+t+1} \,\big/\, \sum_{t=0}^{m-1}\theta^t$
11: end for
Output: $\tilde{x}_K$
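Again for illustration only, a NumPy sketch of this inner solver under the same assumed interfaces as above (`grad_i`, `prox_psi`); the two prox calls below are just the closed-form solutions of lines 7 and 8:

```python
import numpy as np

def katyusha(grad_i, n, prox_psi, x0, K, sigma, L_bar, rng=None):
    """Sketch of Algorithm 2 (Katyusha) for a sigma-strongly convex objective
    f(x) = (1/n) sum_i f_i(x) + psi(x) with L_bar-smooth components f_i."""
    rng = np.random.default_rng() if rng is None else rng
    tau1 = min(np.sqrt(n * sigma / (3.0 * L_bar)), 0.5)
    tau2 = 0.5
    eta = 1.0 / (3.0 * tau1 * L_bar)
    theta = 1.0 + eta * sigma
    m = int(np.ceil(np.log(2.0 * tau1 + 2.0 / theta - 1.0) / np.log(theta))) + 1
    x_tilde = np.asarray(x0, dtype=float)
    z, y = x_tilde.copy(), x_tilde.copy()
    for _ in range(K):
        u = np.mean([grad_i(i, x_tilde) for i in range(n)], axis=0)  # snapshot gradient
        y_avg, w_sum = np.zeros_like(y), 0.0
        for t in range(m):
            x = tau1 * z + tau2 * x_tilde + (1.0 - tau1 - tau2) * y
            i = rng.integers(n)
            g = u + grad_i(i, x) - grad_i(i, x_tilde)        # variance-reduced gradient
            z = prox_psi(z - eta * g, eta)                   # line 7: mirror step
            y = prox_psi(x - g / (3.0 * L_bar), 1.0 / (3.0 * L_bar))  # line 8: gradient step
            y_avg += theta ** t * y
            w_sum += theta ** t
        x_tilde = y_avg / w_sum                              # line 10: weighted snapshot
    return x_tilde
```

The snapshot term $\tau_2\tilde{x}_k$ in line 5 is Katyusha momentum: it anchors the iterates to the snapshot point, which is what makes acceleration compatible with variance reduction.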
Theory
Theorem 3
Let $w_s = s^\alpha$ with $\alpha > 0$, $\gamma = \frac{1}{2\mu}$, $\bar L = L + \mu$, $\sigma = \mu$, and in each call of Katyusha let $\tau_1 = \min\{\sqrt{\frac{n\sigma}{3\bar L}}, \frac{1}{2}\}$, step size $\eta = \frac{1}{3\tau_1\bar L}$, $\tau_2 = 1/2$, $\theta = 1 + \eta\sigma$, $K_s = \big\lceil\frac{\log(D_s)}{m\log\theta}\big\rceil$, and $m = \big\lfloor\frac{\log(2\tau_1 + 2/\theta - 1)}{\log\theta}\big\rfloor + 1$, where $D_s = \max\{4L/\mu,\ L^3/\mu^3,\ L^2 s/\mu^2\}$. Then we have

$$\max\big\{\mathbb{E}[\|\nabla\phi_\gamma(x_{\tau+1})\|^2],\ \mathbb{E}[L^2\|x_{\tau+1} - z_{\tau+1}\|^2]\big\} \le \frac{34\mu\Delta_\phi(\alpha+1)}{S+1} + \frac{98\mu\Delta_\phi(\alpha+1)}{(S+1)^\alpha}\,\mathbb{I}_{\alpha<1},$$

where $z = \mathrm{prox}_{\gamma\phi}(x)$ and $\tau$ is randomly chosen from $\{0, \dots, S\}$ according to the probabilities $p_\tau = w_{\tau+1}/\sum_{k=0}^{S} w_{k+1}$, $\tau = 0, \dots, S$. Furthermore, the total gradient complexity for finding $x_{\tau+1}$ such that $\max\big(\mathbb{E}[\|\nabla\phi_\gamma(x_{\tau+1})\|^2],\ L^2\mathbb{E}[\|x_{\tau+1} - z_{\tau+1}\|^2]\big) \le \varepsilon^2$ is

$$N(\varepsilon) = \begin{cases} O\Big((\mu n + \sqrt{n\mu L})\log\big(\tfrac{L}{\mu\varepsilon}\big)\tfrac{1}{\varepsilon^2}\Big), & n \ge \tfrac{3L}{4\mu},\\[4pt] O\Big(\sqrt{nL\mu}\,\log\big(\tfrac{L}{\mu\varepsilon}\big)\tfrac{1}{\varepsilon^2}\Big), & n \le \tfrac{3L}{4\mu}. \end{cases}$$
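To read this guarantee (a standard fact about the Moreau envelope of a weakly convex function, added here for intuition): for $\gamma < 1/\mu$ the envelope $\phi_\gamma$ is differentiable with

$$\nabla\phi_\gamma(x) = \gamma^{-1}\big(x - \mathrm{prox}_{\gamma\phi}(x)\big),$$

so a small $\|\nabla\phi_\gamma(x)\|$ certifies that $x$ is close to $z = \mathrm{prox}_{\gamma\phi}(x)$, which is itself a near-stationary point of $\phi$. This is why the theorem controls both $\mathbb{E}[\|\nabla\phi_\gamma(x_{\tau+1})\|^2]$ and $L^2\mathbb{E}[\|x_{\tau+1} - z_{\tau+1}\|^2]$.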
Theory
Theorem 4
Suppose $\psi = 0$. With the same parameter values as in Theorem 3, except that $K = \big\lceil\frac{\log(D)}{m\log\theta}\big\rceil$ where $D = \max(48L/\mu,\ 2L^3/\mu^3)$, the total gradient complexity for finding $x_{\tau+1}$ such that $\mathbb{E}[\|\nabla\phi(x_{\tau+1})\|^2] \le \varepsilon^2$ is

$$N(\varepsilon) = \begin{cases} O\Big((\mu n + \sqrt{n\mu L})\log\big(\tfrac{L}{\mu}\big)\tfrac{1}{\varepsilon^2}\Big), & n \ge \tfrac{3L}{4\mu},\\[4pt] O\Big(\sqrt{nL\mu}\,\log\big(\tfrac{L}{\mu}\big)\tfrac{1}{\varepsilon^2}\Big), & n \le \tfrac{3L}{4\mu}. \end{cases}$$
Overview
1 Introduction
2 Katalyst Algorithm and Theoretical Guarantee
3 Experiments
Experiments I
Task: squared hinge loss plus a non-convex regularizer, either the log-sum penalty (LSP) or the transformed ℓ1 penalty (TL1).
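For reference, the standard forms of these two penalties from the sparse-regularization literature (the slide itself does not spell out the parameterization, so take the exact constants below as assumptions):

$$\text{LSP: } \sum_{i=1}^{d} \lambda\log\Big(1 + \frac{|x_i|}{\theta}\Big), \qquad \text{TL1: } \sum_{i=1}^{d} \lambda\,\frac{(a+1)\,|x_i|}{a + |x_i|}, \qquad \theta, a > 0.$$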
[Figure 1 panels: log10(objective) vs. number of gradients/n (0 to 200), over the eight settings {TL1, LSP} × {rcv1, realsim} × {λ = 1/n, λ = 0.1/n}; curves: Katalyst, proxSVRG, proxSVRG-mb, 4WD-Catalyst.]
Figure 1: Comparison of different algorithms for two tasks on different datasets.
Experiments II
We use the smoothed SCAD penalty given in (Lan & Yang, 2018):
$$R_{\lambda,\gamma,\epsilon}(x) = \begin{cases} \lambda(x^2 + \epsilon)^{\frac{1}{2}}, & \text{if } (x^2 + \epsilon)^{\frac{1}{2}} \le \lambda,\\[4pt] \dfrac{2\gamma\lambda(x^2 + \epsilon)^{\frac{1}{2}} - (x^2 + \epsilon) - \lambda^2}{2(\gamma - 1)}, & \text{if } \lambda < (x^2 + \epsilon)^{\frac{1}{2}} < \gamma\lambda,\\[4pt] \dfrac{\lambda^2(\gamma + 1)}{2}, & \text{otherwise}, \end{cases}$$

where $\gamma > 2$, $\lambda > 0$, and $\epsilon > 0$. Then the problem is

$$\min_{x\in\mathbb{R}^d}\; \phi(x) := \frac{1}{2n}\sum_{i=1}^{n}(a_i^\top x - b_i)^2 + \frac{\rho}{2}\sum_{i=1}^{d} R_{\lambda,\gamma,\epsilon}(x_i).$$
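A direct NumPy transcription of this smoothed penalty and objective, as our own illustrative sketch (parameter names follow the formula above; nothing here is from the paper's code):

```python
import numpy as np

def smoothed_scad(x, lam, gamma, eps):
    """Smoothed SCAD penalty R_{lam,gamma,eps}, applied elementwise."""
    u = np.sqrt(x ** 2 + eps)                     # smoothed |x|
    small = lam * u                               # region: u <= lam
    mid = (2 * gamma * lam * u - (x ** 2 + eps) - lam ** 2) / (2 * (gamma - 1))
    large = lam ** 2 * (gamma + 1) / 2            # region: u >= gamma * lam
    return np.where(u <= lam, small, np.where(u < gamma * lam, mid, large))

def objective(x, A, b, lam, gamma, eps, rho):
    """phi(x) = (1/2n) * ||Ax - b||^2 + (rho/2) * sum_i R(x_i)."""
    n = A.shape[0]
    fit = ((A @ x - b) ** 2).sum() / (2 * n)
    return fit + (rho / 2) * smoothed_scad(x, lam, gamma, eps).sum()
```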
Experiments II.1
Figure 2: Performance of RapGrad and Katalyst with theory-suggested parameter settings.
Experiments II.2
Figure 3: Empirical performance of RapGrad and Katalyst with early termination.
The End
References I
Allen-Zhu, Z. Natasha: Faster non-convex stochastic optimization via strongly non-convex parameter. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 89–97, 2017.

Allen-Zhu, Z. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pp. 1200–1205, 2017.

Fang, C., Li, C. J., Lin, Z., and Zhang, T. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In NeurIPS, pp. 687–697, 2018.

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pp. 315–323, 2013.

Lan, G. and Yang, Y. Accelerated stochastic algorithms for nonconvex finite-sum and multi-block optimization. CoRR, abs/1805.05411, 2018.
References II
Lin, H., Mairal, J., and Harchaoui, Z. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pp. 3384–3392, 2015.

Paquette, C., Lin, H., Drusvyatskiy, D., Mairal, J., and Harchaoui, Z. Catalyst for gradient-based nonconvex optimization. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84, pp. 613–622, 2018.

Reddi, S. J., Sra, S., Poczos, B., and Smola, A. J. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems, pp. 1145–1153, 2016.

Zhou, D. and Gu, Q. Lower bounds for smooth nonconvex finite-sum optimization. arXiv preprint arXiv:1901.11224, 2019.

Zhou, D., Xu, P., and Gu, Q. Stochastic nested variance reduced gradient descent for nonconvex optimization. In NeurIPS, pp. 3925–3936, 2018.