Transcript
Page 1: Title

A Stochastic Quasi-Newton Method for Large-Scale Learning

Jorge Nocedal, Northwestern University
With S. Hansen, R. Byrd and Y. Singer

IPAM, UCLA, Feb 2014

Page 2: Goal

Propose a robust quasi-Newton method that operates in the stochastic approximation regime:

• a purely stochastic method (not batch), to compete with the stochastic gradient (SG) method
• a full (non-diagonal) Hessian approximation
• scalable to millions of parameters

w_{k+1} = w_k - \alpha_k H_k \nabla F(w_k)

Page 3: Outline

Are iterations of the following form viable?

w_{k+1} = w_k - \alpha_k H_k \nabla f(w_k)

- theoretical considerations; iteration costs
- differencing noisy gradients?

Key ideas: compute curvature information pointwise at regular intervals, building on the strength of BFGS updating and recalling that it is an overwriting (not an averaging) process.

- results on text and speech problems
- examine both training and testing errors

Page 4: Problem

\min_{w \in \mathbb{R}^n} F(w) = E[\, f(w;\xi)\,]

A stochastic function of the variable w with random variable ξ.

Applications:
• Simulation optimization
• Machine learning: ξ = selection of an input-output pair (x, z)

f(w;\xi) = f(w; x_i, z_i) = \ell(h(w; x_i), z_i)

F(w) = \frac{1}{N} \sum_{i=1}^{N} f(w; x_i, z_i)

The algorithm is not (yet) applicable to simulation-based optimization.

Page 5: Stochastic gradient method

For the loss function

F(w) = \frac{1}{N} \sum_{i=1}^{N} f(w; x_i, z_i)

the Robbins-Monro or stochastic gradient method is

w_{k+1} = w_k - \alpha_k \widehat{\nabla} F(w_k), \quad \alpha_k = O(1/k)

using the stochastic gradient (estimator) \widehat{\nabla} F(w_k), which is unbiased: E[\widehat{\nabla} F(w)] = \nabla F(w).

With a mini-batch S:

\widehat{\nabla} F(w_k) = \frac{1}{b} \sum_{i \in S} \nabla f(w; x_i, z_i), \quad |S| = b, \quad b = |S| \ll N
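
For concreteness, here is a minimal sketch of this mini-batch SGD iteration. It assumes a binary logistic loss with labels in {-1, +1} and NumPy arrays X, z; these choices, and all names, are illustrative rather than taken from the slides.

```python
import numpy as np

def sgd(X, z, w0, beta, n_iters, b, seed=0):
    """Minimal mini-batch SGD for F(w) = (1/N) sum_i f(w; x_i, z_i), with
    f the binary logistic loss log(1 + exp(-z_i * x_i.w)), z_i in {-1, +1},
    and the O(1/k) step size alpha_k = beta / k used on the slides."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    w = np.asarray(w0, dtype=float).copy()
    for k in range(1, n_iters + 1):
        S = rng.choice(N, size=b, replace=False)      # mini-batch, |S| = b
        m = z[S] * (X[S] @ w)                         # margins z_i * x_i.w
        # gradient of the logistic loss, averaged over the mini-batch
        g = -(X[S] * (z[S] / (1.0 + np.exp(m)))[:, None]).mean(axis=0)
        w -= (beta / k) * g                           # alpha_k = beta / k
    return w
```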

Page 6: Why it won't work …

w_{k+1} = w_k - \alpha_k H_k \nabla F(w_k)

1. Is there any reason to think that including a Hessian approximation will improve upon the stochastic gradient method?

2. Iteration costs are so high that even if the method is faster than SG in terms of training costs, it will be a weaker learner.

Page 7: Theoretical Considerations

w_{k+1} = w_k - \alpha_k B_k^{-1} \nabla F(w_k), \quad \alpha_k = O(1/k)

Number of iterations needed to compute an epsilon-accurate solution:

• B_k = I: depends on the condition number of the Hessian at the true solution.
• B_k = \nabla^2 F(w^*): depends on the Hessian at the true solution and the gradient covariance matrix; completely removes the dependency on the condition number.

(Murata 98); cf. Bottou-Bousquet

Motivation for choosing B_k \to \nabla^2 F: quasi-Newton.

Page 8: Computational cost

w_{k+1} = w_k - \alpha_k H_k \nabla f(w_k)

Assuming we obtain the efficiencies of classical quasi-Newton methods in limited memory form:

• Each iteration requires 4Md operations
• M = memory in the limited memory implementation; M = 5
• d = dimension of the optimization problem

4Md + d = 21d (SQN) vs. d (stochastic gradient method), where d is the cost of computing \nabla f(w_k).
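
The 4Md figure is the cost of the standard L-BFGS two-loop recursion, which applies H_k to a vector using the last M correction pairs. A minimal sketch (names are illustrative):

```python
import numpy as np

def lbfgs_two_loop(grad, s_list, y_list):
    """Compute H_k @ grad with the L-BFGS two-loop recursion.

    s_list and y_list hold the last M correction pairs (s_j, y_j), oldest
    first; the work is roughly 4*M*d multiplications, the 4Md of the slide.
    """
    q = np.array(grad, dtype=float)
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alphas = []
    for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    if s_list:                                # initial scaling H0 = gamma * I
        s, y = s_list[-1], y_list[-1]
        q *= (s @ y) / (y @ y)
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        b = rho * (y @ q)
        q += (a - b) * s
    return q                                  # = H_k @ grad
```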

Page 9: Mini-batching

\widehat{\nabla} F(w_k) = \frac{1}{b} \sum_{i \in S} \nabla f(w; x_i, z_i), \quad |S| = b

• assuming a mini-batch b = 50
• cost of the stochastic gradient = 50d

4Md + bd = 20d + 50d = 70d vs. 50d

Use of small mini-batches will be a game-changer: b = 10, 50, 100.

Page 10: Game changer? Not quite…

Mini-batching makes the operation counts favorable but does not resolve the challenges related to noise.

1. Avoid differencing noise
• Curvature estimates cannot suffer from sporadic spikes in noise (Schraudolph et al. (99), Ribeiro et al. (2013))
• Quasi-Newton updating is an overwriting process, not an averaging process
• Control the quality of curvature information

2. Cost of curvature computation

Use of small mini-batches will be a game-changer: b = 10, 50, 100.

Page 11: Design of a Stochastic Quasi-Newton Method

w_{k+1} = w_k - \alpha_k H_k \nabla f(w_k)

Propose a method based on the famous BFGS formula
• all components seem to fit together well
• numerical performance appears to be strong

Propose a new quasi-Newton updating formula
• specifically designed to deal with noisy gradients
• work in progress

Page 12: Review of the deterministic BFGS method

w_{k+1} = w_k - \alpha_k H_k \nabla F(w_k), \quad H_k = B_k^{-1}

At every iteration compute and save

y_k = \nabla F(w_{k+1}) - \nabla F(w_k), \quad s_k = w_{k+1} - w_k

H_{k+1} = (I - \rho_k s_k y_k^T)\, H_k\, (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T, \quad \rho_k = \frac{1}{s_k^T y_k}

The correction pairs (s_k, y_k) uniquely determine the BFGS update.
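
A direct, dense transcription of this update, for illustration only (the limited-memory form used later never builds H explicitly); the function and variable names are ours:

```python
import numpy as np

def bfgs_update(H, s, y):
    """Dense BFGS update of the inverse Hessian approximation:
    H_{k+1} = (I - rho s y^T) H (I - rho y s^T) + rho s s^T,  rho = 1/(s^T y)."""
    rho = 1.0 / (s @ y)
    I = np.eye(len(s))
    V = I - rho * np.outer(y, s)
    return V.T @ H @ V + rho * np.outer(s, s)
```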

Page 13: The remarkable properties of the BFGS method (convex case)

\mathrm{tr}(B_{k+1}) = \mathrm{tr}(B_k) - \frac{\|B_k s_k\|^2}{s_k^T B_k s_k} + \frac{\|y_k\|^2}{y_k^T s_k}

\det(B_{k+1}) = \det(B_k)\, \frac{y_k^T s_k}{s_k^T B_k s_k}

\frac{\|(B_k - \nabla^2 F(x^*))\, s_k\|}{\|s_k\|} \to 0

If the algorithm takes a bad step, the matrix is corrected.

Superlinear convergence; global convergence for strongly convex problems; self-correction properties.

Only need to approximate the Hessian in a subspace.

Powell 76; Byrd-N 89

Page 14: Adaptation to the stochastic setting

We cannot mimic the classical approach and update after each iteration:

y_k = \nabla F(w_{k+1}) - \nabla F(w_k), \quad s_k = w_{k+1} - w_k

Since the batch size b is small, this would yield highly noisy curvature estimates.

Instead: use a collection of iterates to define the correction pairs.

Page 15: Stochastic BFGS: Approach 1

Define two collections of size L:

\{\, (w_i, \nabla F(w_i)) \mid i \in I \,\}, \quad \{\, (w_j, \nabla F(w_j)) \mid j \in J \,\}, \quad |I| = |J| = L

Define the average iterate and gradient:

\bar{w}_I = \frac{1}{|I|} \sum_{i \in I} w_i, \quad \bar{g}_I = \frac{1}{|I|} \sum_{i \in I} \nabla F(w_i)

New curvature pair:

s = \bar{w}_I - \bar{w}_J, \quad y = \bar{g}_I - \bar{g}_J
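
A minimal sketch of how this correction pair would be formed from two consecutive collections of iterates and their stochastic gradients; the list-of-arrays interface is an assumption for illustration:

```python
import numpy as np

def correction_pair(ws_I, grads_I, ws_J, grads_J):
    """Approach 1: s = wbar_I - wbar_J, y = gbar_I - gbar_J, where each
    collection holds L iterates and their stochastic gradients
    (lists of 1-D NumPy arrays)."""
    s = np.mean(ws_I, axis=0) - np.mean(ws_J, axis=0)
    y = np.mean(grads_I, axis=0) - np.mean(grads_J, axis=0)
    return s, y
```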

Page 16: Stochastic L-BFGS: First Approach

[Schematic] The iteration alternates between w_{k+1} = w_k - \alpha_k \nabla F(w_k) (before curvature information is available) and w_{k+1} = w_k - \alpha_k H_t \nabla F(w_k), while successive collections \{(w_i, \nabla F(w_i)) \mid i \in I\} and \{(w_j, \nabla F(w_j)) \mid j \in J\} are gathered; each consecutive pair of collections defines a correction pair s = \bar{w}_I - \bar{w}_J, y = \bar{g}_I - \bar{g}_J, yielding the successive approximations H_1, H_2, H_3, …

Page 17: Stochastic BFGS: Approach 1

s = \bar{w}_I - \bar{w}_J, \quad y = \bar{g}_I - \bar{g}_J

We could not make this work in a robust manner!

1. Two sources of error
• sample variance
• lack of sample uniformity

2. Initial reaction
• control the quality of the average gradients
• use of the sample variance … dynamic sampling

Proposed solution: control the quality of the curvature estimate y directly.

Page 18: Key idea: avoid differencing

The standard definition

y = \nabla F(\bar{w}_I) - \nabla F(\bar{w}_J)

arises from

\nabla F(\bar{w}_I) - \nabla F(\bar{w}_J) \approx \nabla^2 F(\bar{w}_I)(\bar{w}_I - \bar{w}_J)

Hessian-vector products are often available. Define the curvature vector for L-BFGS via a Hessian-vector product, performed only every L iterations:

y \leftarrow \nabla^2 F(\bar{w}_I)\, s

Page 19: Structure of the Hessian-vector product

y = \nabla^2 F(\bar{w}_I)\, s

The Hessian-vector product is sub-sampled in the same mini-batch fashion as the stochastic gradient:

\widehat{\nabla}^2 F(w_k)\, s = \frac{1}{b_H} \sum_{i \in S_H} \nabla^2 f(w; x_i, z_i)\, s, \quad |S_H| = b_H

1. Code the Hessian-vector product directly
2. Achieve sample uniformity automatically (cf. Schraudolph)
3. Avoid numerical problems when ||s|| is small
4. Control the cost of the y computation
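
As an illustration of point 1, here is a sketch of a directly coded, sub-sampled Hessian-vector product assuming binary logistic regression, where \nabla^2 f(w; x_i, z_i) = \sigma_i (1 - \sigma_i)\, x_i x_i^T; the data layout and names are assumptions, not the authors' code.

```python
import numpy as np

def hessian_vector_product(X, w, s, idx):
    """Sub-sampled Hessian-vector product for binary logistic regression:

        (1 / b_H) * sum_{i in S_H} grad^2 f(w; x_i, z_i) @ s,
        grad^2 f(w; x_i, z_i) = sigma_i * (1 - sigma_i) * x_i x_i^T,

    where idx indexes the Hessian batch S_H (|idx| = b_H). The Hessian is
    never formed: only two matrix-vector products with X[idx] are needed."""
    Xs = X[idx]
    sigma = 1.0 / (1.0 + np.exp(-(Xs @ w)))   # sigma_i = sigma(x_i . w)
    d = sigma * (1.0 - sigma)                 # per-example curvature weights
    return Xs.T @ (d * (Xs @ s)) / len(idx)
```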

Page 20: The Proposed Algorithm

[Schematic] As on Page 16, iterates follow w_{k+1} = w_k - \alpha_k H_t \nabla F(w_k) while collections \{w_i \mid i \in I\} and \{w_j \mid j \in J\} are gathered; every L iterations a new correction pair

s = \bar{w}_I - \bar{w}_J, \quad y = \nabla^2 F(\bar{w}_I)\, s

is formed, yielding the successive approximations H_1, H_2, H_3, …

Page 21: Algorithmic Parameters

• b: stochastic gradient batch size
• b_H: Hessian-vector batch size
• L: controls the frequency of quasi-Newton updating
• M: memory parameter in L-BFGS updating (M = 5); use the limited memory form
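
A compact sketch of how these parameters fit together in the overall SQN loop, reusing the `lbfgs_two_loop` helper sketched after Page 8; the gradient and Hessian-vector oracles (`stoch_grad`, `hess_vec`) and the step constant `beta` are placeholders, and this is a paraphrase under the stated assumptions, not the authors' reference implementation.

```python
import numpy as np
from collections import deque

def sqn(w0, stoch_grad, hess_vec, beta, n_iters, L=20, M=5):
    """Stochastic quasi-Newton sketch: SGD-like steps scaled through an L-BFGS
    matrix whose correction pairs are rebuilt every L iterations from averaged
    iterates and a sub-sampled Hessian-vector product y = hess_vec(w_bar, s)."""
    w = np.asarray(w0, dtype=float).copy()
    pairs = deque(maxlen=M)              # last M correction pairs (s, y)
    w_sum = np.zeros_like(w)             # running sum of iterates in this block
    w_bar_prev = None                    # average iterate of the previous block
    for k in range(1, n_iters + 1):
        g = stoch_grad(w)                # mini-batch gradient, batch size b
        if pairs:                        # H_k g via the two-loop recursion
            s_list = [p[0] for p in pairs]
            y_list = [p[1] for p in pairs]
            d = lbfgs_two_loop(g, s_list, y_list)
        else:
            d = g                        # plain SGD until curvature is available
        w = w - (beta / k) * d           # alpha_k = beta / k
        w_sum += w
        if k % L == 0:                   # curvature update every L iterations
            w_bar = w_sum / L
            if w_bar_prev is not None:
                s = w_bar - w_bar_prev
                y = hess_vec(w_bar, s)   # sub-sampled, Hessian batch size b_H
                if s @ y > 1e-10:        # keep only safely positive curvature
                    pairs.append((s, y))
            w_bar_prev, w_sum = w_bar, np.zeros_like(w)
    return w
```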

Page 22: Need a Hessian to implement a quasi-Newton method? Are you out of your mind?

We don't need the Hessian-vector product, but it has many advantages: complete freedom in sampling and accuracy.

Page 23: Numerical Tests

Stochastic gradient method (SGD):

w_{k+1} = w_k - \alpha_k \nabla f(w_k), \quad \alpha_k = \beta / k

Stochastic quasi-Newton method (SQN):

w_{k+1} = w_k - \alpha_k H_k \nabla f(w_k), \quad \alpha_k = \beta / k

The parameter β is fixed at the start for each method and found by a bisection procedure.

It is well known that SGD is highly sensitive to the choice of steplength, and so will be the SQN method (though perhaps less so).

Page 24: RCV1 Problem (n = 112919, N = 688329)

• b = 50, 300, 1000
• M = 5, L = 20, b_H = 1000

[Figure] Training curves for SGD and SQN plotted against accessed data points (the counts include the Hessian-vector products).

Page 25: Speech Problem (n = 30315, N = 191607)

• b = 100, 500; M = 5, L = 20, b_H = 1000

[Figure] Training curves for SGD and SQN.

Page 26: Varying the Hessian batch size b_H: RCV1 (b = 300)

[Figure]

Page 27: Varying the memory size M in limited memory BFGS: RCV1

[Figure]

Page 28: Varying the L-BFGS memory size M: synthetic problem

[Figure]

Page 29: Generalization Error: RCV1 Problem

[Figure] Test error for SGD and SQN.

Page 30: Test Problems

• Synthetically generated logistic regression (Singer et al.)
  - n = 50, N = 7000
  - Training data: x_i \in \mathbb{R}^n, z_i \in [0,1]

• RCV1 dataset
  - n = 112919, N = 688329
  - Training data: x_i \in [0,1]^n, z_i \in [0,1]

• SPEECH dataset
  - NF = 235, |C| = 129
  - n = NF × |C| = 30315, N = 191607
  - Training data: x_i \in \mathbb{R}^{NF}, z_i \in [1,129]

Page 31: Iteration Costs

SGD:
• mini-batch stochastic gradient

SQN:
• mini-batch stochastic gradient
• Hessian-vector product every L iterations
• matrix-vector product

Page 32: Iteration Costs: Typical Parameter Values

SGD: bn        SQN: bn + b_H n / L + 4Mn

• b = 50-1000
• b_H = 100-1000
• L = 10-20
• M = 3-20

For b = 300, b_H = 1000, L = 20, M = 5: 300n (SGD) vs. 370n (SQN).
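
To spell out the SQN figure for those values (a one-line check of the 370n count):

```latex
bn + \frac{b_H n}{L} + 4Mn
  = 300n + \frac{1000\,n}{20} + 4 \cdot 5 \cdot n
  = 300n + 50n + 20n = 370n,
\qquad \text{vs. } bn = 300n \text{ for SGD.}
```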

Page 33: Hasn't this been done before?

Hessian-free Newton method: Martens (2010), Byrd et al. (2011)
- claim: stochastic Newton is not competitive with stochastic BFGS

Prior work: Schraudolph et al.
- similar, but cannot ensure the quality of y
- changes the BFGS formula into a one-sided form

Page 34: Supporting theory?

Work in progress: Figen Oztoprak, Byrd, Solntsev
- combine the classical analysis (Murata, Nemirovsky et al.)
- with asymptotic quasi-Newton theory
- effect on the constants (condition number)
- invoke the self-correction properties of BFGS

Practical implementation: limited memory BFGS
- loses the superlinear convergence property
- enjoys the self-correction mechanisms

Page 35: Small batches: RCV1 Problem

SGD: b adp (accessed data points) per iteration; bn work per iteration
SQN: b + b_H / L adp per iteration; bn + b_H n / L + 4Mn work per iteration

b_H = 1000, M = 5, L = 200

The parameters L, M and b_H provide freedom in adapting the SQN method to a specific application.
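
In terms of accessed data points, the quoted parameters give, for example with b = 10 (one of the small batch sizes considered earlier; this particular choice is only illustrative):

```latex
\text{SGD: } b = 10 \ \text{adp/iter}, \qquad
\text{SQN: } b + \frac{b_H}{L} = 10 + \frac{1000}{200} = 15 \ \text{adp/iter}.
```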

Page 36: Alternative quasi-Newton framework

The BFGS method was not derived with noisy gradients in mind
- how do we know it is an appropriate framework?

Start from scratch:
- derive quasi-Newton updating formulas tolerant to noise

Page 37: Foundations

Define a quadratic model around a reference point z:

q_z(w) = f(z) + g^T (w - z) + \tfrac{1}{2} (w - z)^T B (w - z)

Using a collection indexed by I, it is natural to require

\sum_{j \in I} \bigl[ \nabla q_z(w_j) - \nabla f(w_j) \bigr] = 0

i.e. the residuals are zero in expectation. This is not enough information to determine the whole model.

Page 38: Mean square error

Given a collection I, choose the model q to minimize the mean square error

F(g, B) = \sum_{j \in I} \| \nabla q_z(w_j) - \nabla f(w_j) \|^2

With q_z(w) = g^T (w - z) + \tfrac{1}{2}(w - z)^T B (w - z) and s_j = w_j - z, restate the problem as

\min_{g, B} \sum_{j \in I} \| B s_j + g - \nabla f(w_j) \|^2

Differentiating with respect to g:

\sum_{j \in I} \{ B s_j + g - \nabla f(w_j) \} = 0

Encouraging: we obtain the residual condition.
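
Carrying out the differentiation explicitly confirms the residual condition and pins down g in terms of B (a short check, using the notation above):

```latex
\frac{\partial}{\partial g}\sum_{j\in I}\|B s_j + g - \nabla f(w_j)\|^2
  = 2\sum_{j\in I}\bigl(B s_j + g - \nabla f(w_j)\bigr) = 0
\;\Longrightarrow\;
  g = \frac{1}{|I|}\sum_{j\in I}\nabla f(w_j) - B\,\bar{s},
  \qquad \bar{s} = \frac{1}{|I|}\sum_{j\in I} s_j .
```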

Page 39: The End