Transcript
Page 1

Stochastic Newton and quasi-Newton Methods for Large-Scale Convex Optimization

Donald Goldfarb

Department of Industrial Engineering and Operations Research, Columbia University

Joint with Robert Gower and Peter Richtárik

Optimization Without Borders 2016, Les Houches, February 7-12, 2016

Page 2

Outline

Newton-like and quasi-Newton methods for convex stochastic optimization problems using limited-memory block BFGS updates.

In the class of problems of interest, the objective functions can be expressed as the sum of a huge number of functions of an extremely large number of variables.

We present preliminary numerical results on problems from machine learning.

Page 3

Related work on L-BFGS for Stochastic Optimization

P1 N. N. Schraudolph, J. Yu and S. Günter. A stochastic quasi-Newton method for online convex optimization. Int'l. Conf. on AI & Statistics, 2007.

P2 A. Bordes, L. Bottou and P. Gallinari. SGD-QN: Careful quasi-Newton stochastic gradient descent. JMLR, vol. 10, 2009.

P3 R. H. Byrd, S. L. Hansen, J. Nocedal and Y. Singer. A stochastic quasi-Newton method for large-scale optimization. arXiv:1401.7020v2, 2014.

P4 A. Mokhtari and A. Ribeiro. RES: Regularized stochastic BFGS algorithm. IEEE Trans. Signal Process., no. 10, 2014.

P5 A. Mokhtari and A. Ribeiro. Global convergence of online limited memory BFGS. To appear in J. Mach. Learn. Res., 2015.

P6 P. Moritz, R. Nishihara and M. I. Jordan. A linearly-convergent stochastic L-BFGS algorithm. arXiv:1508.02087v1, 2015.

P7 X. Wang, S. Ma, D. Goldfarb and W. Liu. Stochastic quasi-Newton methods for nonconvex stochastic optimization. Submitted, 2015.

(The first six papers are for strongly convex problems; the last one is for nonconvex problems.)

Page 4

Stochastic optimization

Stochastic optimization:
\[
\min_x \; f(x) = \mathbb{E}[f(x, \xi)], \qquad \xi \ \text{a random variable},
\]
or a finite sum (with f_i(x) ≡ f(x, ξ_i) for i = 1, ..., n and very large n):
\[
\min_x \; f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x).
\]

f and ∇f are very expensive to evaluate; e.g., SGD methods randomly choose a subset S ⊂ [n] and evaluate
\[
f_S(x) = \frac{1}{|S|}\sum_{i \in S} f_i(x) \quad \text{and} \quad \nabla f_S(x) = \frac{1}{|S|}\sum_{i \in S} \nabla f_i(x).
\]

Essentially, only noisy information about f, ∇f and ∇²f is available.

Challenge: how to design a method that takes advantage of noisy 2nd-order information?
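To make the subsampling concrete, here is a minimal NumPy sketch of f_S and ∇f_S for an assumed L2-regularized logistic regression loss (the loss itself and the names A, b, reg are illustrative, not from the slides):

```python
import numpy as np

def subsampled_loss_and_grad(x, A, b, idx, reg=1e-4):
    """f_S(x) and grad f_S(x) for an assumed L2-regularized logistic regression.

    A holds the feature rows, b the +/-1 labels, idx the index set S;
    these are illustrative choices, the slides only require some finite sum.
    """
    As, bs = A[idx], b[idx]                      # the rows i in the subsample S
    z = bs * (As @ x)                            # per-sample margins
    f = np.mean(np.logaddexp(0.0, -z)) + 0.5 * reg * (x @ x)
    g = As.T @ (-bs / (1.0 + np.exp(z))) / len(idx) + reg * x
    return f, g
```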

Page 5

Using 2nd-order information

Assumption: f(x) = (1/n) ∑_{i=1}^{n} f_i(x) is strongly convex and twice continuously differentiable.

Choose (compute) a sketching matrix S_k (the columns of S_k are a set of directions).

Following Byrd, Hansen, Nocedal and Singer, we do not use differences in noisy gradients to estimate curvature, but rather compute the action of the sub-sampled Hessian on S_k, i.e., compute
\[
Y_k = \frac{1}{|T|}\sum_{i \in T} \nabla^2 f_i(x)\, S_k, \qquad T \subset [n].
\]

We choose T = S.
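A minimal sketch of forming Y_k as the action of a sub-sampled Hessian on S_k, again assuming the logistic-regression setup of the earlier sketch (names A, b, reg are illustrative):

```python
import numpy as np

def subsampled_hessian_action(x, A, b, idx, S, reg=1e-4):
    """Y_k = (1/|T|) * sum_{i in T} (Hessian of f_i at x) @ S_k.

    Only Hessian-matrix products with the d x q sketch S are formed;
    the d x d Hessian itself is never built.
    """
    At, bt = A[idx], b[idx]
    p = 1.0 / (1.0 + np.exp(-bt * (At @ x)))     # sigmoid of the margins
    w = p * (1.0 - p)                            # per-sample curvature weights
    return At.T @ (w[:, None] * (At @ S)) / len(idx) + reg * S
```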

Page 6

Block BFGS

Given H_k = B_k^{-1}, the block BFGS method computes a "least change" update to the current approximation H_k to the inverse of the Hessian matrix ∇²f(x) at the current point x, by solving
\[
\min_H \ \|H - H_k\| \quad \text{s.t.} \quad H = H^\top, \quad H Y_k = S_k.
\]

This gives the updating formula (analogous to the updates derived by Broyden, Fletcher, Goldfarb and Shanno):
\[
H_{k+1} = \left(I - S_k [S_k^\top Y_k]^{-1} Y_k^\top\right) H_k \left(I - Y_k [S_k^\top Y_k]^{-1} S_k^\top\right) + S_k [S_k^\top Y_k]^{-1} S_k^\top,
\]
or, by the Sherman-Morrison-Woodbury formula:
\[
B_{k+1} = B_k - B_k S_k [S_k^\top B_k S_k]^{-1} S_k^\top B_k + Y_k [S_k^\top Y_k]^{-1} Y_k^\top.
\]
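A dense sketch of this update, for illustration only (the limited-memory scheme on the following slides never forms H explicitly):

```python
import numpy as np

def block_bfgs_update(H, S, Y):
    """Direct transcription of the H_{k+1} formula above (H is d x d, S and Y are d x q)."""
    d, q = S.shape
    L = np.linalg.inv(S.T @ Y)                 # [S_k^T Y_k]^{-1}, a small q x q matrix
    E1 = np.eye(d) - S @ L @ Y.T               # I - S_k [S_k^T Y_k]^{-1} Y_k^T
    E2 = np.eye(d) - Y @ L @ S.T               # I - Y_k [S_k^T Y_k]^{-1} S_k^T
    return E1 @ H @ E2 + S @ L @ S.T
```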

Page 7

Limited Memory Block BFGS

After M block BFGS steps starting from H_{k+1−M}, one can express H_{k+1} as
\[
\begin{aligned}
H_{k+1} &= V_k H_k V_k^\top + S_k \Lambda_k S_k^\top \\
        &= V_k V_{k-1} H_{k-1} V_{k-1}^\top V_k^\top + V_k S_{k-1} \Lambda_{k-1} S_{k-1}^\top V_k^\top + S_k \Lambda_k S_k^\top \\
        &\ \ \vdots \\
        &= V_{k:k+1-M}\, H_{k+1-M}\, V_{k:k+1-M}^\top \;+\; \sum_{i=k+1-M}^{k} V_{k:i+1}\, S_i \Lambda_i S_i^\top\, V_{k:i+1}^\top,
\end{aligned}
\]
where
\[
V_k = I - S_k \Lambda_k Y_k^\top, \qquad \Lambda_k = (S_k^\top Y_k)^{-1}, \qquad V_{k:i} = V_k \cdots V_i. \qquad (1)
\]

Page 8

Limited Memory Block BFGS

Hence, when the number of variables d is large, instead of storing the d × d matrix H_k, we store the previous M block curvature pairs
\[
(S_{k+1-M}, Y_{k+1-M}),\ \ldots,\ (S_k, Y_k),
\]
and the Cholesky factors of the matrices (S_i^T Y_i) = Λ_i^{-1} for i = k+1−M, ..., k.

Then, analogously to the standard L-BFGS method, for any vector v ∈ R^d, H_k v can be computed efficiently using a two-loop block recursion, in Mp(4d + 2p) + O(p) operations, if all S_i ∈ R^{d×p}; a minimal sketch of this recursion is given below.

Intuition

Limited memory: the least-change aspect of BFGS is important.

Each block update acts like a sketching procedure.
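A minimal sketch of the block two-loop recursion implied by the compact representation above, assuming H_{k+1−M} = I and storing Λ_i = (S_i^T Y_i)^{-1} directly instead of its Cholesky factors; it relies on the symmetry of S_i^T Y_i:

```python
import numpy as np

def two_loop_block_recursion(v, pairs):
    """Compute H_k @ v from M stored block curvature pairs.

    `pairs` is a list [(S_i, Lam_i, Y_i), ...] ordered from oldest to newest,
    with Lam_i = (S_i^T Y_i)^{-1}; H_{k+1-M} is taken to be the identity.
    """
    q = np.array(v, dtype=float)
    alphas = []
    for S, Lam, Y in reversed(pairs):          # first loop: newest to oldest
        a = Lam @ (S.T @ q)
        q -= Y @ a                             # q <- V_i^T q  (uses symmetry of S_i^T Y_i)
        alphas.append(a)
    r = q                                      # r = H_{k+1-M} q with H_{k+1-M} = I
    for (S, Lam, Y), a in zip(pairs, reversed(alphas)):   # second loop: oldest to newest
        b = Lam @ (Y.T @ r)
        r += S @ (a - b)                       # r <- V_i r + S_i Lam_i S_i^T q
    return r
```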


Page 9

Choices for the Sketching Matrix S_k

We employ one of the following strategies:

Gaussian: S_k ∼ N(0, I) has Gaussian entries sampled i.i.d. at each iteration.

Previous search directions, delayed: store the previous L search directions S_k = [s_{k+1−L}, ..., s_k], then update H_k only once every L iterations.

Self-conditioning: sample the columns of the Cholesky factor L_k of H_k (i.e., L_k L_k^T = H_k) uniformly at random. Fortunately, we can maintain and update L_k efficiently with limited memory.

The matrix S is a sketching matrix in the sense that we are sketching the possibly very large equation ∇²f(x) H = I, whose solution is the inverse Hessian. Left multiplying by S^T compresses/sketches the equation, yielding S^T ∇²f(x) H = S^T.
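A minimal sketch of the first two sketching strategies (the self-conditioning option, which samples columns of the maintained Cholesky factor L_k, is omitted here); the function names are illustrative:

```python
import numpy as np

def gaussian_sketch(d, q, rng):
    """Gaussian strategy: S_k has i.i.d. N(0,1) entries, redrawn each iteration."""
    return rng.standard_normal((d, q))

def previous_directions_sketch(search_directions):
    """Delayed-directions strategy: stack the last L search directions as columns of S_k."""
    return np.column_stack(search_directions)
```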

Page 10

Stochastic Variance Reduced Gradients

Stochastic methods converge slowly near the optimum due to the variance of the gradient estimates ∇f_S(x), hence requiring a decreasing step size.

We use the control variates approach of Johnson and Zhang (2013) for their SGD method SVRG.

It uses ∇f_S(x_t) − ∇f_S(w_k) + ∇f(w_k), where w_k is a reference point, in place of ∇f_S(x_t).

w_k, and the full gradient, are computed after each full pass over the data, hence doubling the work of computing stochastic gradients.

Other recently proposed SGD variance reduction techniques, such as SAG, SAGA, SDCA and S2GD, can be used in place of SVRG.
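A minimal sketch of the SVRG gradient estimate; sub_grad and full_grad_w are assumed helper names, not names from the slides:

```python
def svrg_search_gradient(sub_grad, full_grad_w, x_t, w_k, idx):
    """Variance-reduced gradient: grad f_S(x_t) - grad f_S(w_k) + grad f(w_k).

    sub_grad(x, idx) returns the subsampled gradient on the index set idx;
    full_grad_w is the full gradient at the reference point w_k, recomputed
    once per pass over the data.
    """
    return sub_grad(x_t, idx) - sub_grad(w_k, idx) + full_grad_w
```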

Page 11

The Basic Algorithm


Page 12

Algorithm 0.1: Stochastic Variable Metric Learning with SVRG

Input: H_{-1} ∈ R^{d×d}, w_0 ∈ R^d, η ∈ R_+, s = subsample size, q = sketch (sample action) size, and m

for k = 0, ..., max_iter do
    μ = ∇f(w_k)
    x_0 = w_k
    for t = 0, ..., m − 1 do
        Sample S_t, T_t ⊆ [n] i.i.d. from a distribution S
        Compute the sketching matrix S_t ∈ R^{d×q}
        Compute ∇²f_{T_t}(x_t) S_t
        H_t = update_metric(H_{t−1}, S_t, ∇²f_{T_t}(x_t) S_t)
        d_t = −H_t (∇f_{S_t}(x_t) − ∇f_{S_t}(w_k) + μ)
        x_{t+1} = x_t + η d_t
    end
    Option I:  w_{k+1} = x_m
    Option II: w_{k+1} = x_i, with i selected uniformly at random from [m]
end
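A sketch of the loop structure of Algorithm 0.1, with the problem-specific pieces passed in as callables; all names and signatures here are illustrative assumptions, and only Option I for the reference-point update is shown:

```python
import numpy as np

def stochastic_block_bfgs_svrg(full_grad, sub_grad, hess_action, make_sketch,
                               update_metric, w0, eta, m, max_iter, n, s):
    """Outer/inner loops of Algorithm 0.1 (illustrative sketch, dense metric)."""
    rng = np.random.default_rng(0)
    w = np.array(w0, dtype=float)
    H = np.eye(len(w))                         # stands in for the input H_{-1}
    for k in range(max_iter):
        mu = full_grad(w)                      # mu = grad f(w_k): one full data pass
        x = w.copy()
        for t in range(m):
            idx = rng.choice(n, size=s, replace=False)     # S_t = T_t, as on Page 5
            S = make_sketch(x)                 # sketching matrix S_t, d x q
            Y = hess_action(x, idx, S)         # grad^2 f_{T_t}(x_t) @ S_t
            H = update_metric(H, S, Y)         # e.g. the dense block BFGS update above
            g = sub_grad(x, idx) - sub_grad(w, idx) + mu   # SVRG gradient estimate
            x = x - eta * (H @ g)              # d_t = -H_t g;  x_{t+1} = x_t + eta d_t
        w = x                                  # Option I: w_{k+1} = x_m
    return w
```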

Page 13

Convergence - Assumptions

There exist constants λ, Λ ∈ R_+ such that

f is λ-strongly convex:
\[
f(w) \ \geq\ f(x) + \nabla f(x)^\top (w - x) + \frac{\lambda}{2}\|w - x\|_2^2, \qquad (2)
\]

f is Λ-smooth:
\[
f(w) \ \leq\ f(x) + \nabla f(x)^\top (w - x) + \frac{\Lambda}{2}\|w - x\|_2^2. \qquad (3)
\]

These assumptions imply that
\[
\lambda I \ \preceq\ \nabla^2 f_S(w) \ \preceq\ \Lambda I, \quad \text{for all } w \in \mathbb{R}^d,\ S \subseteq [n], \qquad (4)
\]
from which we can prove that there exist constants γ, Γ ∈ R_+ such that for all k we have
\[
\gamma I \ \preceq\ H_k \ \preceq\ \Gamma I. \qquad (5)
\]

Page 14

Linear Convergence

Theorem

Suppose that the Assumptions hold. Let w_* be the unique minimizer of f(w). Then in our Algorithm, we have for all k ≥ 0 that
\[
\mathbb{E} f(w_k) - f(w_*) \ \leq\ \rho^k \left( \mathbb{E} f(w_0) - f(w_*) \right),
\]
where the convergence rate is given by
\[
\rho = \frac{1/(2m\eta) + \eta \Gamma^2 \Lambda (\Lambda - \lambda)}{\gamma\lambda - \eta \Gamma^2 \Lambda^2} \ <\ 1,
\]
assuming we have chosen η < γλ/(2Γ²Λ²) and that we choose m large enough to satisfy
\[
m \ \geq\ \frac{1}{2\eta\left(\gamma\lambda - \eta \Gamma^2 \Lambda (2\Lambda - \lambda)\right)},
\]
which is a positive lower bound given our restriction on η.

Page 15

Upper and lower bounds on eigenvalues of H_k

Under the assumption that
\[
\lambda I \ \preceq\ \nabla^2 f_T(x) \ \preceq\ \Lambda I, \quad \forall x \in \mathbb{R}^d, \qquad (6)
\]
there exist constants γ, Γ ∈ R_+ such that for all k we have
\[
\gamma I \ \preceq\ H_k \ \preceq\ \Gamma I, \qquad (7)
\]
where
\[
\gamma \ \geq\ \frac{1}{1 + M\Lambda}, \qquad
\Gamma \ \leq\ (1 + \sqrt{\kappa})^{2M}\left(1 + \frac{1}{\lambda\,(2\sqrt{\kappa} + \kappa)}\right), \qquad
\kappa \equiv \Lambda/\lambda. \qquad (8)
\]

Previously derived bounds depend on the problem dimension d; e.g., Γ ∼ ((d + M)κ)^(d+M).

Page 16

gisette-scale, d = 5,000, n = 6,000

Figure: gisette. Error vs. time (s) and vs. data passes for gauss_18_M_5, prev_18_M_5, fact_18_M_3, MNJ_bH_330, and SVRG.

Page 17

covtype-libsvm-binary, d = 54, n = 581,012

Figure: covtype.libsvm.binary. Error vs. time (s) and vs. data passes for gauss_8_M_5, prev_8_M_5, fact_8_M_3, MNJ_bH_3815, and SVRG.

Page 18

Higgs, d = 28, n = 11,000,000

Figure: HIGGS. Error vs. time (s) and vs. data passes for gauss_4_M_5, prev_4_M_5, fact_4_M_3, MNJ_bH_16585, and SVRG.

Page 19

SUSY, d = 18, n = 3,548,466

Figure: SUSY. Error vs. time (s) and vs. data passes for gauss_3_M_5, prev_3_M_5, fact_3_M_3, MNJ_bH_9420, and SVRG.

Page 20

epsilon-normalized, d = 2,000, n = 400,000

Figure: epsilon-normalized. Error vs. time (s) and vs. data passes for gauss_45_M_5, prev_45_M_5, fact_45_M_3, MNJ_bH_3165, and SVRG.

Page 21

rcv1-training, d = 47,236, n = 20,242

Figure: rcv1-train. Error vs. time (s) and vs. data passes for gauss_10_M_5, prev_10_M_5, fact_10_M_3, MNJ_bH_715, and SVRG.

Page 22

url-combined, d = 3,231,961, n = 2,396,130

Figure: url-combined. Error vs. time (s) and vs. data passes for gauss_2_M_3, prev_2_M_3, fact_2_M_3, MNJ_bH_7740, and SVRG.

Page 23

Contributions

New metric learning framework: a block BFGS framework for gradually learning the metric of the underlying function using a sketched form of the subsampled Hessian matrix.

New limited-memory block BFGS method, which may also be of interest for non-stochastic optimization.

New limited-memory factored-form block BFGS method.

Several sketching matrix possibilities.

Linear convergence rate proof for our methods.

Tighter upper and lower bounds on the eigenvalues of the variable metric.