Optimization in the “Big Data” Regime 2: SVRG & Tradeoffs in Large Scale Learning.

Sham M. Kakade

Machine Learning for Big Data, CSE547/STAT548

University of Washington


Announcements...

Work on your project milestones:
• read/related work summary
• some empirical work

Today:
• Review: optimization of finite sums, (dual) coordinate ascent
• New: SVRG (for sums of loss functions); tradeoffs in large scale learning

How do we optimize in the “big data” regime?


Machine Learning and the Big Data Regime...

goal: find a d-dimensional parameter vector which minimizes the loss on n training examples.

have n training examples (x_1, y_1), ..., (x_n, y_n)

have a parametric classifier h(x, w), where w is a d-dimensional vector.

min_w L(w)   where   L(w) = ∑_i loss(h(x_i, w), y_i)

“Big Data Regime”: How do you optimize this when n and d are large? Memory? Parallelization?

Can we obtain linear time algorithms to find an ε-accurate solution? I.e. find w so that

L(w) − min_w L(w) ≤ ε
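As a concrete (hypothetical) instance of this setup, here is a minimal sketch for linear regression with squared loss; the 1/n and 1/2 scalings, the function names, and the use of a direct least-squares solve as the reference minimizer are my own choices, not from the slides.

```python
import numpy as np

def finite_sum_loss(w, X, y):
    """L(w) = (1/n) * sum_i loss(h(x_i, w), y_i), with h(x, w) = x . w and squared loss."""
    residuals = X @ w - y
    return 0.5 * np.mean(residuals ** 2)

def is_eps_accurate(w, X, y, eps=1e-3):
    """Check L(w) - min_w L(w) <= eps, using a direct least-squares solve as the reference."""
    w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
    return finite_sum_loss(w, X, y) - finite_sum_loss(w_star, X, y) <= eps
```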


Review: Stochastic Gradient Descent (SGD)

SGD update rule: at each time t,
• sample a point (x_i, y_i)
• w ← w − η (w · x_i − y_i) x_i

Problem: even if w = w∗, the update changes w.
Rate: convergence rate is O(1/ε), with decaying η.

simple algorithm, light on memory, but poor convergence rate
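A minimal sketch of this update loop (the written update is the stochastic gradient of the squared loss ½(w·x_i − y_i)²); the data layout, step-size schedule, and function name below are illustrative choices rather than anything prescribed by the slides.

```python
import numpy as np

def sgd_least_squares(X, y, eta0=0.1, n_steps=10_000, seed=0):
    """Plain SGD for the squared loss: w <- w - eta_t * (w . x_i - y_i) * x_i."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(n_steps):
        i = rng.integers(n)                      # sample a point (x_i, y_i)
        grad_i = (X[i] @ w - y[i]) * X[i]        # stochastic gradient at that point
        w -= eta0 / np.sqrt(t + 1) * grad_i      # decaying step size eta_t
    return w
```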


SDCA advantages/disadvantages

What about more general convex problems? e.g.

min_w L(w)   where   L(w) = ∑_i loss(h(x_i, w), y_i)

the basic idea (formalized with duality) is pretty general for convex loss(·); it works very well in practice

memory: SDCA needs O(n + d) memory, while SGD is only O(d)

What about an algorithm for non-convex problems? SDCA seems heavily tied to the convex case. Is there an algorithm that is highly accurate in the convex case and sensible in the non-convex case?


L-smooth and µ-strongly convex case


Review: Stochastic Gradient Descent

Suppose L(w) is µ-strongly convex.
Suppose each loss(·) is L-smooth.

# iterations to get ε-accuracy:  L/(µε)
(see related work for precise problem-dependent parameters)

Computation time to get ε-accuracy:  (L/(µε)) · d
(assuming O(d) cost per gradient evaluation)


(another idea) Stochastic Variance Reduced Gradient (SVRG)

1. Exact gradient computation: at stage s, using w_s, compute

   ∇L(w_s) = (1/n) ∑_{i=1}^n ∇loss(h(x_i, w_s), y_i)

2. Variance reduction + SGD: initialize w ← w_s. For m steps:
   • sample a point (x, y)
   • w ← w − η (∇loss(h(x, w), y) − ∇loss(h(x, w_s), y) + ∇L(w_s))

3. Update and repeat: w_{s+1} ← w.
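A minimal sketch of these three steps for least-squares; the snapshot length m, the step size, and the stage count below are illustrative defaults, not the tuned choices from the analysis.

```python
import numpy as np

def svrg_least_squares(X, y, eta=0.1, m=None, n_stages=20, seed=0):
    """SVRG for L(w) = (1/2n) ||Xw - y||^2 (squared loss, linear predictor)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m or n                                   # inner-loop length (theory suggests m ~ L/µ)
    w_s = np.zeros(d)                            # stage snapshot w_s
    for _ in range(n_stages):
        full_grad = X.T @ (X @ w_s - y) / n      # 1. exact gradient at the snapshot
        w = w_s.copy()                           # 2. variance-reduced SGD steps
        for _ in range(m):
            i = rng.integers(n)
            g_w  = (X[i] @ w   - y[i]) * X[i]    #    gradient of loss_i at w
            g_ws = (X[i] @ w_s - y[i]) * X[i]    #    gradient of loss_i at the snapshot
            w -= eta * (g_w - g_ws + full_grad)
        w_s = w                                  # 3. update the snapshot
    return w_s
```

Each stage costs n gradient evaluations for the snapshot plus O(m) stochastic gradients, which matches the per-stage (n + L/µ) count in the guarantee a few slides later.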


Properties of SVRG

unbiased updates: what is the mean of the correction term? (see the check below) I.e.,

E[∇loss(h(x, w_s), y) − ∇L(w_s)] = ?

where the expectation is over a random sample (x, y).

• If w = w∗, then no update.
• Memory is O(d).
• No “dual” variables.
• Applicable to non-convex optimization.
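To answer the unbiasedness question above: since (x, y) is sampled uniformly from the n training points, the sampled gradient has mean ∇L(w_s), so the correction has mean zero. A short check in LaTeX:

```latex
\begin{align*}
\mathbb{E}_{(x,y)}\big[\nabla \mathrm{loss}(h(x, w_s), y)\big]
  &= \frac{1}{n}\sum_{i=1}^{n} \nabla \mathrm{loss}(h(x_i, w_s), y_i)
   = \nabla L(w_s) \\
\Longrightarrow \quad
\mathbb{E}\big[\nabla \mathrm{loss}(h(x, w_s), y) - \nabla L(w_s)\big] &= 0 .
\end{align*}
```

Hence the expected SVRG step direction equals ∇L(w), just as for plain SGD, but with smaller variance once w and the snapshot w_s are close to the optimum.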


Guarantees of SVRG

Set m = L/µ.

# of gradient computations to get ε accuracy:

(n + L/µ) · log(1/ε)


Comparisons

A gradient evaluation is at a single point (x, y).

SVRG: # of gradient computations to get ε accuracy:

(n + L/µ) · log(1/ε)

# of gradient evaluations for batch gradient descent:

n · (L/µ) · log(1/ε)

where L is the smoothness of L(w).

# of gradient computations for SGD:

L/(µε)
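As a back-of-the-envelope illustration (the values n = 10^6, L/µ = 10^4, and ε = 10^-4 below are made up, not from the lecture), the three counts compare as follows:

```python
import math

n, kappa, eps = 10**6, 10**4, 1e-4           # illustrative values only

svrg = (n + kappa) * math.log(1 / eps)       # (n + L/µ) log(1/ε)   ≈ 9.3e6
gd   = n * kappa * math.log(1 / eps)         # n (L/µ) log(1/ε)     ≈ 9.2e10
sgd  = kappa / eps                           # L/(µε)               = 1e8

print(f"SVRG ≈ {svrg:.1e}, batch GD ≈ {gd:.1e}, SGD ≈ {sgd:.1e} gradient evaluations")
```

With these numbers, SVRG needs orders of magnitude fewer gradient evaluations than either batch gradient descent or plain SGD.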


Non-convex comparisons

How many gradient evaluations does it take to find w so that:

‖∇L(w)‖² ≤ ε²

(i.e. “close” to a stationary point)

Rates: the number of gradient evaluations, at a point (x, y), is:
• GD: O(n/ε)
• SGD: O(1/ε²)
• SVRG: O(n + n^{2/3}/ε)

Does SVRG work well in practice?


Tradeoffs in Large Scale Learning.

Many sources of “error”:
• approximation error: our choice of a hypothesis class
• estimation error: we only have n samples
• optimization error: computing exact (or near-exact) minimizers can be costly

How do we think about these issues?


The true objective

Hypotheses map x ∈ X to y ∈ Y.
We have n training examples (x_1, y_1), ..., (x_n, y_n) sampled i.i.d. from D.

Training objective: with a set of parametric predictors {h(x, w) : w ∈ W},

min_{w ∈ W} L_n(w)   where   L_n(w) = (1/n) ∑_{i=1}^n loss(h(x_i, w), y_i)

True objective: to generalize to D,

min_{w ∈ W} L(w)   where   L(w) = E_{(X,Y)∼D} loss(h(X, w), Y)

Optimization: Can we obtain linear time algorithms to find an ε-accurate solution? I.e. find w so that

L(w) − min_{w ∈ W} L(w) ≤ ε


Definitions

Let h∗ be the Bayes optimal hypothesis, over all functions from X → Y:

h∗ ∈ argmin_h L(h)

Let w∗ be the best-in-class hypothesis:

w∗ ∈ argmin_{w ∈ W} L(w)

Let ŵ_n be the empirical risk minimizer:

ŵ_n ∈ argmin_{w ∈ W} L_n(w)

Let w̃_n be what our algorithm returns.


Loss decomposition

Observe:

L(w̃_n) − L(h∗) =  [L(w∗) − L(h∗)]      (approximation error)
                 + [L(ŵ_n) − L(w∗)]      (estimation error)
                 + [L(w̃_n) − L(ŵ_n)]     (optimization error)
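As a quick check, the right-hand side telescopes (ŵ_n is the empirical risk minimizer and w̃_n the algorithm's output, as on the previous slide):

```latex
\[
\bigl(L(w^*) - L(h^*)\bigr) + \bigl(L(\hat{w}_n) - L(w^*)\bigr) + \bigl(L(\tilde{w}_n) - L(\hat{w}_n)\bigr)
  = L(\tilde{w}_n) - L(h^*),
\]
```

since L(w∗) and L(ŵ_n) each appear once with a plus sign and once with a minus sign.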

Three parts determine our performance.
• Optimization algorithms with the “best” accuracy dependence on L_n may not be best overall.
• Forcing one error to decrease much faster than the others may be wasteful.


Time to a fixed accuracy

[figure: test error versus training time]


Comparing sample sizes

[figure: test error versus training time]

• Vary the number of examples


Comparing sample sizes and models

[figure: test error versus training time]

• Vary the number of examples


Optimal choices

[figure: test error versus training time, with the “good combinations” marked]

• Optimal combination depends on training time budget.


Estimation error: simplest case

Measuring a mean: L(µ) = E[(µ − y)²]

The minimum is at µ = E[y].
With n samples, the Bayes optimal estimator is the sample mean: µ_n = (1/n) ∑_i y_i.

The error is:

E[L(µ_n)] − L(E[y]) = σ²/n

where σ² is the variance of y and the expectation is with respect to the n samples.

How many samples do we need for ε error?
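A short derivation of this rate (the cross term drops out because the fresh sample y is independent of µ_n and E[µ_n] = E[y]):

```latex
\begin{align*}
\mathbb{E}[L(\mu_n)] - L(\mathbb{E}[y])
  &= \mathbb{E}\big[(\mu_n - y)^2\big] - \mathbb{E}\big[(y - \mathbb{E}[y])^2\big] \\
  &= \mathbb{E}\big[(\mu_n - \mathbb{E}[y])^2\big]
   = \operatorname{Var}(\mu_n)
   = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(y_i)
   = \frac{\sigma^2}{n}.
\end{align*}
```

So reaching ε estimation error takes roughly n ≈ σ²/ε samples.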


Let’s compare:

• SGD: Is O(1/ε) reasonable?
• GD: Is log(1/ε) needed?
• SDCA/SVRG: These are also log(1/ε), but much faster.


Statistical Optimality

Can we generalize as well as the sample minimizer ŵ_n (without computing it exactly)?

For a wide class of models (linear regression, logistic regression, etc.), the estimation error is:

E[L(ŵ_n)] − L(w∗) = σ²_opt / n

where σ²_opt is a problem-dependent constant.

What is the computational cost of achieving exactly this rate, say for large n?


Averaged SGD

SGD:  w_{t+1} ← w_t − η_t ∇loss(h(x, w_t), y)

An (asymptotically) optimal algorithm:
• Have η_t go to 0 (sufficiently slowly).
• (iterate averaging) Maintain a running average:

w̄_n = (1/n) ∑_{t≤n} w_t

(Polyak & Juditsky, 1992): for large enough n, and with one pass of SGD over the dataset,

E[L(w̄_n)] − L(w∗) ≈ σ²_opt / n   (asymptotically, as n → ∞)
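A minimal sketch of this recipe for least-squares; the 1/√t step-size schedule and the single shuffled pass are illustrative choices (the result only requires η_t → 0 sufficiently slowly):

```python
import numpy as np

def averaged_sgd_least_squares(X, y, eta0=0.1, seed=0):
    """One pass of SGD with decaying step size; returns the average of the iterates."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    w_bar = np.zeros(d)
    for t, i in enumerate(rng.permutation(n), start=1):
        eta_t = eta0 / np.sqrt(t)                 # eta_t -> 0 sufficiently slowly
        w -= eta_t * (X[i] @ w - y[i]) * X[i]     # SGD step on example (x_i, y_i)
        w_bar += (w - w_bar) / t                  # running average (1/t) * sum_{s<=t} w_s
    return w_bar
```

Note that the Polyak–Juditsky guarantee concerns the averaged iterate w_bar, not the last iterate w.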


Acknowledgements

Some slides are from “Large-scale machine learning revisited”, Léon Bottou, 2013.
