Katalyst: Boosting Convex Katayusha for Non-Convex Problems with a Large Condition Number

Zaiyi Chen, Yi Xu, Haoyuan Hu, Tianbao Yang ([email protected])
CAINIAO.AI & USTC & UIowa

2019-06-10
Overview
1 Introduction
2 Katalyst Algorithm and Theoretical Guarantee
3 Experiments
Problem Definition
$$\min_{x\in\mathbb{R}^d}\; \phi(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x) + \psi(x) \tag{1}$$
Via variance-reduced (SVRG-type) methods (Johnson & Zhang, 2013), we can obtain a better gradient complexity with respect to the sample size n and the accuracy ε.
We name the proposed algorithm Katalyst after Katyusha (Allen-Zhu, 2017) and Catalyst (Lin et al., 2015).
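For context, recall the variance-reduced gradient estimator at the heart of SVRG-type methods (a standard construction, restated here as a reminder rather than taken from this slide): given a snapshot point $\tilde{x}$ with full gradient $\nabla f(\tilde{x})$ precomputed,

$$\widetilde{\nabla} = \nabla f_i(x) - \nabla f_i(\tilde{x}) + \nabla f(\tilde{x}), \qquad i \sim \mathrm{Unif}\{1,\dots,n\},$$

which is unbiased for $\nabla f(x)$ and whose variance vanishes as $x$ and $\tilde{x}$ approach a common point; this is what improves over the $O(1/\varepsilon^4)$ complexity of plain SGD.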
Assumptions
Each $f_i$ is $L$-smooth.
$\psi$ may be non-smooth, but it is convex.
$\phi$ is $\mu$-weakly convex.
Definition 1 (L-smoothness)
A function $f$ is Lipschitz smooth with constant $L$ if its gradient is Lipschitz continuous with constant $L$, that is,
$$\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|, \quad \forall x, y \in \mathbb{R}^d.$$

Definition 2 (Weak convexity)
A function $\phi$ is $\mu$-weakly convex if $\phi(x) + \frac{\mu}{2}\|x\|^2$ is convex.
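As a standard example (a well-known fact, not stated on the slide): any $L$-smooth function is automatically $L$-weakly convex, since $L$-smoothness gives the lower quadratic bound

$$f(y) \ge f(x) + \langle\nabla f(x), y - x\rangle - \frac{L}{2}\|y - x\|^2,$$

which is exactly convexity of $f(x) + \frac{L}{2}\|x\|^2$. So under the assumptions above one may always take $\mu \le L$, and the interesting regime is a large condition number $L/\mu$.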
Comparisons with Related Work
Table 1: Comparison of gradient complexities of variance-reduction-based algorithms for finding an ε-stationary point of (1). ∗ marks a result that is only valid when L/µ ≤ √n.

| Algorithms | L/µ ≥ Ω(n) | L/µ ≤ O(n) | Non-smooth ψ |
|---|---|---|---|
| SAGA (Reddi et al., 2016) | O(n^{2/3}L/ε²) | O(n^{2/3}L/ε²) | Yes |
| RapGrad (Lan & Yang, 2018) | O(√(nLµ)/ε²) | O((µn + √(nLµ))/ε²) | indicator function |
| SVRG (Reddi et al., 2016) | O(n^{2/3}L/ε²) | O(n^{2/3}L/ε²) | Yes |
| Natasha1 (Allen-Zhu, 2017) | NA | O(n^{2/3}L^{2/3}µ^{1/3}/ε²)∗ | Yes |
| RepeatSVRG (Allen-Zhu, 2017) | O(n^{3/4}√(Lµ)/ε²) | O((µn + n^{3/4}√(Lµ))/ε²) | Yes |
| 4WD-Catalyst (Paquette et al., 2018) | O(nL/ε²) | O(nL/ε²) | Yes |
| SPIDER (Fang et al., 2018) | O(√n·L/ε²) | O(√n·L/ε²) | No |
| SNVRG (Zhou et al., 2018) | O(√n·L/ε²) | O(√n·L/ε²) | No |
| Katalyst (this work) | O(√(nLµ)/ε²) | O((µn + L)/ε²) | Yes |
Our bound is shown to be optimal up to a logarithmic factor by recent work (Zhou & Gu, 2019).
Overview
1 Introduction
2 Katalyst Algorithm and Theoretical Guarantee
3 Experiments
Interpretation - Our Basic Idea
[Figure: Step 1 of the basic idea, moving from x0 to x1 on a one-dimensional non-convex objective.]
Interpretation - Our Basic Idea
[Figure: subsequent steps (Step > 1) of the basic idea, moving from x1 to x2.]
A Unified Framework
Meta Algorithm
Algorithm 1: Stagewise-SA$(x_0, \{\eta_s\}, \mu, \{w_s\})$
Input: a non-decreasing sequence $\{w_s\}$, $x_0 \in \mathrm{dom}(\psi)$, $\gamma = (2\mu)^{-1}$
1: for $s = 1, \dots, S$ do
2:    $f_s(\cdot) = \phi(\cdot) + \frac{1}{2\gamma}\|\cdot - x_{s-1}\|^2$
3:    $x_s = \mathrm{Katyusha}(f_s, x_{s-1}, K_s, \mu, L + \mu)$  // $x_s$ is usually an averaged solution
4: end for
Output: $x_\tau$, where $\tau$ is randomly chosen from $\{0, \dots, S\}$ according to the probabilities $p_\tau = w_{\tau+1} / \sum_{k=0}^{S} w_{k+1}$, $\tau = 0, \dots, S$

Each stage objective decomposes into an $n$-term smooth finite sum and a strongly convex composite term, which is exactly the setting Katyusha handles:

$$f_s(x) = \frac{1}{n}\sum_{i=1}^{n}\underbrace{\Big(f_i(x) + \frac{\mu}{2}\|x - x_{s-1}\|^2\Big)}_{\hat f_i(x)} + \underbrace{\frac{\gamma^{-1} - \mu}{2}\|x - x_{s-1}\|^2 + \psi(x)}_{\hat\psi(x)}$$
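To make the control flow concrete, here is a minimal Python sketch of the stagewise scheme, assuming a `katyusha` inner solver (sketched after Algorithm 2 below). All names and interfaces are ours; for simplicity the entire proximal quadratic is folded into the smooth part rather than split as in the decomposition above.

```python
import numpy as np

def stagewise_sa(grad_fi, n, prox_psi, x0, mu, L, S, Ks=5, alpha=1.0, rng=None):
    """Sketch of Algorithm 1 (Stagewise-SA).

    grad_fi(i, x): gradient of f_i at x.
    prox_psi(v, eta): argmin_x 1/(2*eta) * ||x - v||^2 + psi(x).
    Ks is a fixed inner epoch budget here; the paper picks K_s per stage.
    """
    rng = np.random.default_rng() if rng is None else rng
    gamma = 1.0 / (2.0 * mu)
    xs = [np.asarray(x0, dtype=float)]
    for s in range(1, S + 1):
        x_prev = xs[-1]
        # Stage objective f_s = phi + (1/(2*gamma)) * ||x - x_prev||^2.
        # Folding the quadratic into each component makes the smooth part
        # (L + 2*mu)-smooth while f_s stays mu-strongly convex overall.
        grad_hat = lambda i, x, xp=x_prev: grad_fi(i, x) + 2.0 * mu * (x - xp)
        xs.append(katyusha(grad_hat, n, prox_psi, x_prev,
                           K=Ks, sigma=mu, L_bar=L + 2.0 * mu, rng=rng))
    # Output stage sampled with p_tau proportional to w_{tau+1} = (tau+1)^alpha.
    w = (np.arange(S + 1) + 1.0) ** alpha
    return xs[rng.choice(S + 1, p=w / w.sum())]
```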
Algorithm
Algorithm 2: Katyusha$(f, x_0, K, \sigma, \bar L)$
Initialize: $\tau_2 = \frac{1}{2}$, $\tau_1 = \min\{\sqrt{\frac{n\sigma}{3\bar L}}, \frac{1}{2}\}$, $\eta = \frac{1}{3\tau_1\bar L}$, $\theta = 1 + \eta\sigma$, $m = \big\lceil\frac{\log(2\tau_1 + 2/\theta - 1)}{\log\theta}\big\rceil + 1$, $y_0 = \zeta_0 = \tilde{x}_0 = x_0$
1: for $k = 0, \dots, K - 1$ do
2:    $u_k = \nabla f(\tilde{x}_k)$
3:    for $t = 0, \dots, m - 1$ do
4:       $j = km + t$
5:       $x_{j+1} = \tau_1\zeta_j + \tau_2\tilde{x}_k + (1 - \tau_1 - \tau_2)y_j$
6:       $\nabla_{j+1} = u_k + \nabla f_i(x_{j+1}) - \nabla f_i(\tilde{x}_k)$, with $i$ drawn uniformly from $\{1, \dots, n\}$
7:       $\zeta_{j+1} = \arg\min_\zeta \frac{1}{2\eta}\|\zeta - \zeta_j\|^2 + \langle\nabla_{j+1}, \zeta\rangle + \psi(\zeta)$
8:       $y_{j+1} = \arg\min_y \frac{3\bar L}{2}\|y - x_{j+1}\|^2 + \langle\nabla_{j+1}, y\rangle + \psi(y)$
9:    end for
10:   $\tilde{x}_{k+1} = \sum_{t=0}^{m-1}\theta^t y_{km+t+1} \,\big/\, \sum_{t=0}^{m-1}\theta^t$
11: end for
Output: $\tilde{x}_K$
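Again for illustration only, a NumPy sketch of this inner solver under the same assumed interfaces as above (`grad_i`, `prox_psi`); the two prox calls below are just the closed-form solutions of lines 7 and 8:

```python
import numpy as np

def katyusha(grad_i, n, prox_psi, x0, K, sigma, L_bar, rng=None):
    """Sketch of Algorithm 2 (Katyusha) for a sigma-strongly convex objective
    f(x) = (1/n) sum_i f_i(x) + psi(x) with L_bar-smooth components f_i."""
    rng = np.random.default_rng() if rng is None else rng
    tau1 = min(np.sqrt(n * sigma / (3.0 * L_bar)), 0.5)
    tau2 = 0.5
    eta = 1.0 / (3.0 * tau1 * L_bar)
    theta = 1.0 + eta * sigma
    m = int(np.ceil(np.log(2.0 * tau1 + 2.0 / theta - 1.0) / np.log(theta))) + 1
    x_tilde = np.asarray(x0, dtype=float)
    z, y = x_tilde.copy(), x_tilde.copy()
    for _ in range(K):
        u = np.mean([grad_i(i, x_tilde) for i in range(n)], axis=0)  # snapshot gradient
        y_avg, w_sum = np.zeros_like(y), 0.0
        for t in range(m):
            x = tau1 * z + tau2 * x_tilde + (1.0 - tau1 - tau2) * y
            i = rng.integers(n)
            g = u + grad_i(i, x) - grad_i(i, x_tilde)        # variance-reduced gradient
            z = prox_psi(z - eta * g, eta)                   # line 7: mirror step
            y = prox_psi(x - g / (3.0 * L_bar), 1.0 / (3.0 * L_bar))  # line 8: gradient step
            y_avg += theta ** t * y
            w_sum += theta ** t
        x_tilde = y_avg / w_sum                              # line 10: weighted snapshot
    return x_tilde
```

The snapshot term $\tau_2\tilde{x}_k$ in line 5 is Katyusha momentum: it anchors the iterates to the snapshot point, which is what makes acceleration compatible with variance reduction.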
Theory
Theorem 3
Let $w_s = s^\alpha$ with $\alpha > 0$, $\gamma = \frac{1}{2\mu}$, $\bar L = L + \mu$, $\sigma = \mu$, and in each call of Katyusha let $\tau_1 = \min\{\sqrt{\frac{n\sigma}{3\bar L}}, \frac{1}{2}\}$, step size $\eta = \frac{1}{3\tau_1\bar L}$, $\tau_2 = 1/2$, $\theta = 1 + \eta\sigma$, $K_s = \big\lceil\frac{\log(D_s)}{m\log\theta}\big\rceil$, and $m = \big\lfloor\frac{\log(2\tau_1 + 2/\theta - 1)}{\log\theta}\big\rfloor + 1$, where $D_s = \max\{4L/\mu,\ L^3/\mu^3,\ L^2 s/\mu^2\}$. Then we have

$$\max\big\{\mathbb{E}[\|\nabla\phi_\gamma(x_{\tau+1})\|^2],\ \mathbb{E}[L^2\|x_{\tau+1} - z_{\tau+1}\|^2]\big\} \le \frac{34\mu\Delta_\phi(\alpha+1)}{S+1} + \frac{98\mu\Delta_\phi(\alpha+1)}{(S+1)^\alpha}\,\mathbb{I}_{\alpha<1},$$

where $z = \mathrm{prox}_{\gamma\phi}(x)$ and $\tau$ is randomly chosen from $\{0, \dots, S\}$ according to the probabilities $p_\tau = w_{\tau+1}/\sum_{k=0}^{S} w_{k+1}$, $\tau = 0, \dots, S$. Furthermore, the total gradient complexity for finding $x_{\tau+1}$ such that $\max\big(\mathbb{E}[\|\nabla\phi_\gamma(x_{\tau+1})\|^2],\ L^2\mathbb{E}[\|x_{\tau+1} - z_{\tau+1}\|^2]\big) \le \varepsilon^2$ is

$$N(\varepsilon) = \begin{cases} O\Big((\mu n + \sqrt{n\mu L})\log\big(\tfrac{L}{\mu\varepsilon}\big)\tfrac{1}{\varepsilon^2}\Big), & n \ge \tfrac{3L}{4\mu},\\[4pt] O\Big(\sqrt{nL\mu}\,\log\big(\tfrac{L}{\mu\varepsilon}\big)\tfrac{1}{\varepsilon^2}\Big), & n \le \tfrac{3L}{4\mu}. \end{cases}$$
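To read this guarantee (a standard fact about the Moreau envelope of a weakly convex function, added here for intuition): for $\gamma < 1/\mu$ the envelope $\phi_\gamma$ is differentiable with

$$\nabla\phi_\gamma(x) = \gamma^{-1}\big(x - \mathrm{prox}_{\gamma\phi}(x)\big),$$

so a small $\|\nabla\phi_\gamma(x)\|$ certifies that $x$ is close to $z = \mathrm{prox}_{\gamma\phi}(x)$, which is itself a near-stationary point of $\phi$. This is why the theorem controls both $\mathbb{E}[\|\nabla\phi_\gamma(x_{\tau+1})\|^2]$ and $L^2\mathbb{E}[\|x_{\tau+1} - z_{\tau+1}\|^2]$.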
Theory
Theorem 4
Suppose $\psi = 0$. With the same parameter values as in Theorem 3, except that $K = \big\lceil\frac{\log(D)}{m\log\theta}\big\rceil$ where $D = \max(48L/\mu,\ 2L^3/\mu^3)$, the total gradient complexity for finding $x_{\tau+1}$ such that $\mathbb{E}[\|\nabla\phi(x_{\tau+1})\|^2] \le \varepsilon^2$ is

$$N(\varepsilon) = \begin{cases} O\Big((\mu n + \sqrt{n\mu L})\log\big(\tfrac{L}{\mu}\big)\tfrac{1}{\varepsilon^2}\Big), & n \ge \tfrac{3L}{4\mu},\\[4pt] O\Big(\sqrt{nL\mu}\,\log\big(\tfrac{L}{\mu}\big)\tfrac{1}{\varepsilon^2}\Big), & n \le \tfrac{3L}{4\mu}. \end{cases}$$
Overview
1 Introduction
2 Katalyst Algorithm and Theoretical Guarantee
3 Experiments
Experiments I
Task: squared hinge loss plus a non-convex regularizer, either the log-sum penalty (LSP) or the transformed ℓ1 penalty (TL1).
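For reference, the standard forms of these two penalties from the sparse-regularization literature (the slide itself does not spell out the parameterization, so take the exact constants below as assumptions):

$$\text{LSP: } \sum_{i=1}^{d} \lambda\log\Big(1 + \frac{|x_i|}{\theta}\Big), \qquad \text{TL1: } \sum_{i=1}^{d} \lambda\,\frac{(a+1)\,|x_i|}{a + |x_i|}, \qquad \theta, a > 0.$$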
[Figure 1 panels: log10(objective) vs. number of gradients/n (0 to 200), over the eight settings {TL1, LSP} × {rcv1, realsim} × {λ = 1/n, λ = 0.1/n}; curves: Katalyst, proxSVRG, proxSVRG-mb, 4WD-Catalyst.]
Figure 1: Comparison of different algorithms for two tasks on different datasets.
Experiments II
We use the smoothed SCAD penalty given in (Lan & Yang, 2018):
$$R_{\lambda,\gamma,\epsilon}(x) = \begin{cases} \lambda(x^2 + \epsilon)^{\frac{1}{2}}, & \text{if } (x^2 + \epsilon)^{\frac{1}{2}} \le \lambda,\\[4pt] \dfrac{2\gamma\lambda(x^2 + \epsilon)^{\frac{1}{2}} - (x^2 + \epsilon) - \lambda^2}{2(\gamma - 1)}, & \text{if } \lambda < (x^2 + \epsilon)^{\frac{1}{2}} < \gamma\lambda,\\[4pt] \dfrac{\lambda^2(\gamma + 1)}{2}, & \text{otherwise}, \end{cases}$$

where $\gamma > 2$, $\lambda > 0$, and $\epsilon > 0$. Then the problem is

$$\min_{x\in\mathbb{R}^d}\; \phi(x) := \frac{1}{2n}\sum_{i=1}^{n}(a_i^\top x - b_i)^2 + \frac{\rho}{2}\sum_{i=1}^{d} R_{\lambda,\gamma,\epsilon}(x_i).$$
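A direct NumPy transcription of this smoothed penalty and objective, as our own illustrative sketch (parameter names follow the formula above; nothing here is from the paper's code):

```python
import numpy as np

def smoothed_scad(x, lam, gamma, eps):
    """Smoothed SCAD penalty R_{lam,gamma,eps}, applied elementwise."""
    u = np.sqrt(x ** 2 + eps)                     # smoothed |x|
    small = lam * u                               # region: u <= lam
    mid = (2 * gamma * lam * u - (x ** 2 + eps) - lam ** 2) / (2 * (gamma - 1))
    large = lam ** 2 * (gamma + 1) / 2            # region: u >= gamma * lam
    return np.where(u <= lam, small, np.where(u < gamma * lam, mid, large))

def objective(x, A, b, lam, gamma, eps, rho):
    """phi(x) = (1/2n) * ||Ax - b||^2 + (rho/2) * sum_i R(x_i)."""
    n = A.shape[0]
    fit = ((A @ x - b) ** 2).sum() / (2 * n)
    return fit + (rho / 2) * smoothed_scad(x, lam, gamma, eps).sum()
```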
Experiments II.1
Figure 2: Performance of RapGrad and Katalyst with theory-suggested parameter settings.
Experiments II.2
Figure 3: Empirical performance of RapGrad and Katalyst with early termination.
The End
References I
Allen-Zhu, Z. Natasha: Faster non-convex stochastic optimization via strongly non-convex parameter. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 89–97, 2017.

Allen-Zhu, Z. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pp. 1200–1205, 2017.

Fang, C., Li, C. J., Lin, Z., and Zhang, T. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In NeurIPS, pp. 687–697, 2018.

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pp. 315–323, 2013.

Lan, G. and Yang, Y. Accelerated stochastic algorithms for nonconvex finite-sum and multi-block optimization. CoRR, abs/1805.05411, 2018.
References II
Lin, H., Mairal, J., and Harchaoui, Z. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pp. 3384–3392, 2015.

Paquette, C., Lin, H., Drusvyatskiy, D., Mairal, J., and Harchaoui, Z. Catalyst for gradient-based nonconvex optimization. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84, pp. 613–622, 2018.

Reddi, S. J., Sra, S., Poczos, B., and Smola, A. J. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems, pp. 1145–1153, 2016.

Zhou, D. and Gu, Q. Lower bounds for smooth nonconvex finite-sum optimization. arXiv preprint arXiv:1901.11224, 2019.

Zhou, D., Xu, P., and Gu, Q. Stochastic nested variance reduced gradient descent for nonconvex optimization. In NeurIPS, pp. 3925–3936, 2018.