Katalyst: Boosting Convex Katayusha for Non-Convex Problems with a Large Condition Number

Zaiyi Chen, Yi Xu, Haoyuan Hu, Tianbao Yang

[email protected]

2019-06-10

CAINIAO.AI & USTC & UIowa


Overview

1 Introduction

2 Katalyst Algorithm and Theoretical Guarantee

3 Experiments



Problem Definition

$$\min_{\mathbf{x}\in\mathbb{R}^d}\ \phi(\mathbf{x}) := \frac{1}{n}\sum_{i=1}^{n} f_i(\mathbf{x}) + \psi(\mathbf{x}) \qquad (1)$$

We can obtain a better gradient complexity w.r.t. the sample size n and the accuracy ε via variance-reduction methods (Johnson & Zhang, 2013) (SVRG-type).

We name the proposed algorithm Katalyst after Katyusha (Allen-Zhu, 2017) and Catalyst (Lin et al., 2015).
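As a concrete instance of problem (1), here is a minimal NumPy sketch with a least-squares loss as the f_i and an ℓ1 regularizer as ψ; these are illustrative choices only, not the losses and penalties used later in the talk's experiments.

```python
import numpy as np

def phi(x, A, b, lam):
    """Composite objective (1): (1/n) * sum_i f_i(x) + psi(x), with the
    illustrative choices f_i(x) = 0.5 * (a_i^T x - b_i)^2 and
    psi(x) = lam * ||x||_1 (convex, non-smooth)."""
    n = A.shape[0]
    return 0.5 * np.sum((A @ x - b) ** 2) / n + lam * np.sum(np.abs(x))

def prox_l1(v, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding); psi enters the
    algorithms below only through such a prox step."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
```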




Assumptions

Each f_i is L-smooth.

ψ can be non-smooth but convex.

φ is µ-weakly convex.

Definition 1

(L-smoothness) A function $f$ is Lipschitz smooth with constant $L$ if its gradient is Lipschitz continuous with constant $L$, that is,

$$\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\| \le L\,\|\mathbf{x} - \mathbf{y}\|, \quad \forall\, \mathbf{x}, \mathbf{y} \in \mathbb{R}^d.$$

Definition 2

(Weak convexity) A function $\phi$ is $\mu$-weakly convex if $\phi(\mathbf{x}) + \frac{\mu}{2}\|\mathbf{x}\|^2$ is convex.
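As a quick illustration of Definition 2, the following sketch numerically estimates the weak-convexity modulus of a toy 1-D non-convex function; the function and the grid are illustrative and not taken from the talk.

```python
import numpy as np

# Toy function phi(x) = x^2 / (1 + x^2): smooth, bounded, non-convex.
# phi is mu-weakly convex as soon as phi''(x) + mu >= 0 for all x,
# i.e. mu >= -min_x phi''(x).
xs = np.linspace(-5.0, 5.0, 10001)
phi_second = (2.0 - 6.0 * xs**2) / (1.0 + xs**2) ** 3   # closed-form phi''(x)
mu = max(0.0, -phi_second.min())                         # about 0.5 on this grid
print(f"min phi'' = {phi_second.min():.3f}  =>  phi is about {mu:.3f}-weakly convex")

# Sanity check: phi(x) + (mu/2) * x^2 has non-negative curvature on the grid.
assert np.all(phi_second + mu >= -1e-12)
```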



Comparisons with Related Work

Table 1: Comparison of gradient complexities of variance-reduction based algorithms for finding an ε-stationary point of (1). ∗ marks a result that is only valid when L/µ ≤ √n.

| Algorithms | L/µ ≥ Ω(n) | L/µ ≤ O(n) | Non-smooth ψ |
| SAGA (Reddi et al., 2016) | O(n^{2/3} L / ε²) | O(n^{2/3} L / ε²) | Yes |
| RapGrad (Lan & Yang, 2018) | O(√(nLµ) / ε²) | O((µn + √(nLµ)) / ε²) | indicator function |
| SVRG (Reddi et al., 2016) | O(n^{2/3} L / ε²) | O(n^{2/3} L / ε²) | Yes |
| Natasha1 (Allen-Zhu, 2017) | NA | O(n^{2/3} L^{2/3} µ^{1/3} / ε²)∗ | Yes |
| RepeatSVRG (Allen-Zhu, 2017) | O(n^{3/4} √(Lµ) / ε²) | O((µn + n^{3/4} √(Lµ)) / ε²) | Yes |
| 4WD-Catalyst (Paquette et al., 2018) | O(nL / ε²) | O(nL / ε²) | Yes |
| SPIDER (Fang et al., 2018) | O(√n L / ε²) | O(√n L / ε²) | No |
| SNVRG (Zhou et al., 2018) | O(√n L / ε²) | O(√n L / ε²) | No |
| Katalyst (this work) | O(√(nLµ) / ε²) | O((µn + L) / ε²) | Yes |

Our bound is proved to be optimal up to a logarithmic factor by a recent work (Zhou & Gu, 2019).




Interpretation - Our Basic Idea

[Figure: 1-D illustration of the first stage (Step 1): from x0 to x1.]



Interpretation - Our Basic Idea

[Figure: 1-D illustration of a later stage (Step > 1): from x1 to x2.]



A Unified Framework

Meta Algorithm

Algorithm 1: Stagewise-SA(x_0, {η_s}, µ, {w_s})
Input: a non-increasing sequence {w_s}, x_0 ∈ dom(ψ), γ = (2µ)^{-1}
1: for s = 1, ..., S do
2:     f_s(·) = φ(·) + (1 / (2γ)) ‖· − x_{s−1}‖²
3:     x_s = Katyusha(f_s, x_{s−1}, K_s, µ, L + µ)   // x_s is usually an averaged solution
4: end for
Output: x_τ, where τ is randomly chosen from {0, ..., S} according to the probabilities p_τ = w_{τ+1} / Σ_{k=0}^{S} w_{k+1}, τ = 0, ..., S

The subproblem f_s can be rewritten in the composite finite-sum form (1):

$$f_s(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^{n}\underbrace{\Big(f_i(\mathbf{x}) + \frac{\mu}{2}\|\mathbf{x}-\mathbf{x}_{s-1}\|^2\Big)}_{\hat{f}_i(\mathbf{x})} + \underbrace{\frac{\gamma^{-1}-\mu}{2}\|\mathbf{x}-\mathbf{x}_{s-1}\|^2 + \psi(\mathbf{x})}_{\hat{\psi}(\mathbf{x})}$$
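A minimal NumPy sketch of this outer loop. The inner call is abstracted as plain proximal gradient on the convexified subproblem rather than Katyusha, the weighted random output uses w_s = s^α as in Theorem 3, and all function and parameter names (grad_f, prox_psi, inner_steps, inner_lr) are illustrative assumptions.

```python
import numpy as np

def stagewise_katalyst(grad_f, prox_psi, x0, mu, n_stages, inner_steps,
                       inner_lr, alpha=1.0, rng=None):
    """Sketch of Algorithm 1: at stage s, add (1/(2*gamma)) * ||x - x_{s-1}||^2
    with gamma = 1/(2*mu) to the objective and approximately solve the
    resulting convex subproblem.  The inner solver here is plain proximal
    gradient for simplicity; the talk uses Katyusha for acceleration."""
    rng = rng or np.random.default_rng(0)
    gamma = 1.0 / (2.0 * mu)
    iterates = [x0]
    x_prev = x0
    for s in range(1, n_stages + 1):
        x = x_prev.copy()
        for _ in range(inner_steps):
            # Gradient of the smooth part of f_s = f + (1/(2*gamma))||x - x_prev||^2.
            g = grad_f(x) + (x - x_prev) / gamma
            x = prox_psi(x - inner_lr * g, inner_lr)
        iterates.append(x)
        x_prev = x
    # Output: an iterate chosen at random with weights w_{tau+1} = (tau+1)^alpha.
    w = np.array([(t + 1) ** alpha for t in range(len(iterates))], dtype=float)
    tau = rng.choice(len(iterates), p=w / w.sum())
    return iterates[tau]
```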



Algorithm

Algorithm 2: Katyusha(f, x_0, K, σ, L)
Initialize: τ₂ = 1/2, τ₁ = min{√(nσ/(3L)), 1/2}, η = 1/(3τ₁L), θ = 1 + ησ, m = ⌈log(2τ₁ + 2/θ − 1) / log θ⌉ + 1, y_0 = ζ_0 = x̃_0 = x_0
1: for k = 0, ..., K − 1 do
2:     u_k = ∇f(x̃_k)
3:     for t = 0, ..., m − 1 do
4:         j = km + t
5:         x_{j+1} = τ₁ ζ_j + τ₂ x̃_k + (1 − τ₁ − τ₂) y_j
6:         ∇_{j+1} = u_k + ∇f_i(x_{j+1}) − ∇f_i(x̃_k)   (i sampled uniformly from {1, ..., n})
7:         ζ_{j+1} = argmin_ζ { (1/(2η)) ‖ζ − ζ_j‖² + ⟨∇_{j+1}, ζ⟩ + ψ(ζ) }
8:         y_{j+1} = argmin_y { (3L/2) ‖y − x_{j+1}‖² + ⟨∇_{j+1}, y⟩ + ψ(y) }
9:     end for
10:    x̃_{k+1} = ( Σ_{t=0}^{m−1} θ^t y_{km+t+1} ) / ( Σ_{t=0}^{m−1} θ^t )
11: end for
Output: x̃_K
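A minimal NumPy sketch of this inner loop, with the proximal updates of lines 7-8 expressed through the prox of ψ. The epoch length is simplified to m = n, and the callable signatures (grad_fi, full_grad, prox_psi) are assumptions for illustration, not the paper's interface.

```python
import numpy as np

def katyusha(grad_fi, full_grad, prox_psi, x0, n, K, sigma, L, rng=None):
    """Sketch of Algorithm 2 applied to f(x) = (1/n) sum_i f_i(x) + psi(x),
    with f sigma-strongly convex and each f_i L-smooth.  prox_psi(v, t)
    computes the prox of t * psi at v."""
    rng = rng or np.random.default_rng(0)
    tau2 = 0.5
    tau1 = min(np.sqrt(n * sigma / (3.0 * L)), 0.5)
    eta = 1.0 / (3.0 * tau1 * L)
    theta = 1.0 + eta * sigma
    m = n                                     # simplified epoch length (assumption)
    x_tilde, y, zeta = x0.copy(), x0.copy(), x0.copy()
    for k in range(K):
        u = full_grad(x_tilde)                # full gradient at the snapshot
        y_sum, weight = np.zeros_like(x0), 0.0
        for t in range(m):
            x = tau1 * zeta + tau2 * x_tilde + (1 - tau1 - tau2) * y
            i = rng.integers(n)
            g = u + grad_fi(i, x) - grad_fi(i, x_tilde)       # variance-reduced gradient
            zeta = prox_psi(zeta - eta * g, eta)              # line 7: mirror/prox step
            y = prox_psi(x - g / (3.0 * L), 1.0 / (3.0 * L))  # line 8: gradient/prox step
            y_sum += (theta ** t) * y
            weight += theta ** t
        x_tilde = y_sum / weight              # line 10: theta-weighted epoch average
    return x_tilde
```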



Theory

Theorem 3

Let $w_s = s^\alpha$ with $\alpha > 0$, $\gamma = \frac{1}{2\mu}$, $\bar{L} = L + \mu$, $\sigma = \mu$, and in each call of Katyusha let $\tau_1 = \min\{\sqrt{\tfrac{n\sigma}{3\bar{L}}}, \tfrac{1}{2}\}$, step size $\eta = \frac{1}{3\tau_1\bar{L}}$, $\tau_2 = 1/2$, $\theta = 1 + \eta\sigma$, $K_s = \big\lceil \frac{\log(D_s)}{m\log(\theta)} \big\rceil$, and $m = \big\lfloor \frac{\log(2\tau_1 + 2/\theta - 1)}{\log\theta} \big\rfloor + 1$, where $D_s = \max\{4L/\mu,\ L^3/\mu^3,\ L^2 s/\mu^2\}$. Then we have that

$$\max\big\{\mathbb{E}[\|\nabla\phi_\gamma(\mathbf{x}_{\tau+1})\|^2],\ \mathbb{E}[L^2\|\mathbf{x}_{\tau+1}-\mathbf{z}_{\tau+1}\|^2]\big\} \le \frac{34\,\mu\,\Delta_\phi\,(\alpha+1)}{S+1} + \frac{98\,\mu\,\Delta_\phi\,(\alpha+1)}{(S+1)^{\alpha}}\,\mathbb{I}_{\alpha<1},$$

where $\mathbf{z} = \mathrm{prox}_{\gamma\phi}(\mathbf{x})$ and $\tau$ is randomly chosen from $\{0, \dots, S\}$ according to the probabilities $p_\tau = \frac{w_{\tau+1}}{\sum_{k=0}^{S} w_{k+1}}$, $\tau = 0, \dots, S$. Furthermore, the total gradient complexity for finding $\mathbf{x}_{\tau+1}$ such that $\max\big(\mathbb{E}[\|\nabla\phi_\gamma(\mathbf{x}_{\tau+1})\|^2],\ L^2\,\mathbb{E}[\|\mathbf{x}_{\tau+1}-\mathbf{z}_{\tau+1}\|^2]\big) \le \varepsilon^2$ is

$$N(\varepsilon) = \begin{cases} O\Big((\mu n + \sqrt{n\mu L})\,\log\big(\tfrac{L}{\mu\varepsilon}\big)\,\tfrac{1}{\varepsilon^2}\Big), & n \ge \tfrac{3L}{4\mu},\\ O\Big(\sqrt{nL\mu}\,\log\big(\tfrac{L}{\mu\varepsilon}\big)\,\tfrac{1}{\varepsilon^2}\Big), & n \le \tfrac{3L}{4\mu}. \end{cases}$$
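The near-stationarity measure $\|\nabla\phi_\gamma(\mathbf{x})\|$ above is the gradient of the Moreau envelope: since $\gamma = 1/(2\mu) < 1/\mu$, the proximal subproblem defining $\mathbf{z} = \mathrm{prox}_{\gamma\phi}(\mathbf{x})$ is strongly convex and $\nabla\phi_\gamma(\mathbf{x}) = (\mathbf{x} - \mathbf{z})/\gamma$. Below is a minimal sketch that estimates this quantity for a smooth φ (ψ = 0), using plain gradient descent as a stand-in for the inner solver; the function names and iteration budget are illustrative.

```python
import numpy as np

def moreau_grad_norm(grad_phi, x, gamma, L, inner_steps=200):
    """Estimate ||grad phi_gamma(x)|| for an L-smooth phi with psi = 0.

    Approximates z = prox_{gamma*phi}(x) by gradient descent on the
    strongly convex subproblem  min_u phi(u) + (1/(2*gamma)) * ||u - x||^2,
    then uses grad phi_gamma(x) = (x - z) / gamma."""
    lr = 1.0 / (L + 1.0 / gamma)   # safe step size: the subproblem is (L + 1/gamma)-smooth
    z = x.copy()
    for _ in range(inner_steps):
        z = z - lr * (grad_phi(z) + (z - x) / gamma)
    return np.linalg.norm(x - z) / gamma
```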



Theory

Theorem 4

Suppose $\psi = 0$. With the same parameter values as in Theorem 3, except that $K = \big\lceil \frac{\log(D)}{m\log(\theta)} \big\rceil$ with $D = \max(48L/\mu,\ 2L^3/\mu^3)$, the total gradient complexity for finding $\mathbf{x}_{\tau+1}$ such that $\mathbb{E}[\|\nabla\phi(\mathbf{x}_{\tau+1})\|^2] \le \varepsilon^2$ is

$$N(\varepsilon) = \begin{cases} O\Big((\mu n + \sqrt{n\mu L})\,\log\big(\tfrac{L}{\mu}\big)\,\tfrac{1}{\varepsilon^2}\Big), & n \ge \tfrac{3L}{4\mu},\\ O\Big(\sqrt{nL\mu}\,\log\big(\tfrac{L}{\mu}\big)\,\tfrac{1}{\varepsilon^2}\Big), & n \le \tfrac{3L}{4\mu}. \end{cases}$$




Experiments I

Squared hinge loss + log-sum penalty (LSP) or transformed ℓ1 penalty (TL1).
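For reference, a NumPy sketch of this objective. The LSP and TL1 forms below use commonly seen parameterizations; the specific constants (θ, a) are illustrative assumptions, since the slide does not state them.

```python
import numpy as np

def squared_hinge(A, b, x):
    """Average squared hinge loss (1/n) * sum_i max(0, 1 - b_i * a_i^T x)^2,
    with labels b_i in {-1, +1}."""
    margins = np.maximum(0.0, 1.0 - b * (A @ x))
    return np.mean(margins ** 2)

def lsp(x, lam, theta=1.0):
    """Log-sum penalty, a common form: lam * sum_i log(1 + |x_i| / theta)."""
    return lam * np.sum(np.log1p(np.abs(x) / theta))

def tl1(x, lam, a=1.0):
    """Transformed l1 penalty, a common form: lam * sum_i (a+1)|x_i| / (a+|x_i|)."""
    return lam * np.sum((a + 1.0) * np.abs(x) / (a + np.abs(x)))
```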

[Figure 1: Comparison of different algorithms for two tasks on different datasets. Eight panels (TL1 and LSP regularizers, rcv1 and realsim datasets, λ = 1/n and λ = 0.1/n) plot log10(objective) versus number of gradients / n for Katalyst, proxSVRG, proxSVRG-mb, and 4WD-Catalyst.]



Experiments II

We use the smoothed SCAD penalty given in (Lan & Yang, 2018),

$$R_{\lambda,\gamma,\varepsilon}(x) = \begin{cases} \lambda\,(x^2+\varepsilon)^{\frac12}, & \text{if } (x^2+\varepsilon)^{\frac12} \le \lambda,\\[4pt] \dfrac{2\gamma\lambda\,(x^2+\varepsilon)^{\frac12} - (x^2+\varepsilon) - \lambda^2}{2(\gamma-1)}, & \text{if } \lambda < (x^2+\varepsilon)^{\frac12} < \gamma\lambda,\\[4pt] \dfrac{\lambda^2(\gamma+1)}{2}, & \text{otherwise,} \end{cases}$$

where γ > 2, λ > 0, and ε > 0. Then the problem is

$$\min_{\mathbf{x}\in\mathbb{R}^d}\ \phi(\mathbf{x}) := \frac{1}{2n}\sum_{i=1}^{n}\big(\mathbf{a}_i^{\top}\mathbf{x} - b_i\big)^2 + \frac{\rho}{2}\sum_{i=1}^{d} R_{\lambda,\gamma,\varepsilon}(x_i)$$
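A minimal NumPy sketch of the smoothed SCAD regularizer and the resulting least-squares objective above; the default values of λ, γ, ε in the example are illustrative, not the experimental settings.

```python
import numpy as np

def smoothed_scad(x, lam, gamma, eps):
    """Smoothed SCAD penalty R_{lam,gamma,eps}, applied elementwise.
    Requires gamma > 2, lam > 0, eps > 0."""
    u = np.sqrt(x ** 2 + eps)                       # smoothed |x|
    small = lam * u                                 # region u <= lam
    middle = (2 * gamma * lam * u - (x ** 2 + eps) - lam ** 2) / (2 * (gamma - 1))
    large = lam ** 2 * (gamma + 1) / 2              # region u >= gamma * lam
    return np.where(u <= lam, small, np.where(u < gamma * lam, middle, large))

def objective(A, b, x, rho, lam=0.1, gamma=2.5, eps=1e-3):
    """Least squares + (rho/2) * sum_i R(x_i), as in the problem above."""
    n = A.shape[0]
    return (0.5 * np.sum((A @ x - b) ** 2) / n
            + 0.5 * rho * np.sum(smoothed_scad(x, lam, gamma, eps)))
```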



Experiments II.1

[Figure 2: Theoretical performances of RapGrad and Katalyst.]



Experiments II.2

[Figure 3: Empirical performances of RapGrad and Katalyst with early termination.]



The End



References I

Allen-Zhu, Z. Natasha: Faster non-convex stochastic optimization via strongly non-convex parameter. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 89–97, 2017.

Allen-Zhu, Z. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pp. 1200–1205, 2017.

Fang, C., Li, C. J., Lin, Z., and Zhang, T. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In NeurIPS, pp. 687–697, 2018.

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pp. 315–323, 2013.

Lan, G. and Yang, Y. Accelerated stochastic algorithms for nonconvex finite-sum and multi-block optimization. CoRR, abs/1805.05411, 2018.



References II

Lin, H., Mairal, J., and Harchaoui, Z. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pp. 3384–3392, 2015.

Paquette, C., Lin, H., Drusvyatskiy, D., Mairal, J., and Harchaoui, Z. Catalyst for gradient-based nonconvex optimization. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84, pp. 613–622, 2018.

Reddi, S. J., Sra, S., Poczos, B., and Smola, A. J. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems, pp. 1145–1153, 2016.

Zhou, D. and Gu, Q. Lower bounds for smooth nonconvex finite-sum optimization. arXiv preprint arXiv:1901.11224, 2019.

Zhou, D., Xu, P., and Gu, Q. Stochastic nested variance reduced gradient descent for nonconvex optimization. In NeurIPS, pp. 3925–3936, 2018.
