
Machine Learning (CSE 446): Practical issues: optimization and learning

Sham M. Kakade, © 2018
University of Washington
cse446-staff@cs.washington.edu


Announcements

- Midterm summary:
  - stats: 71.5, std: 18
- Office hours today: 1:15-2:30 (no office hours on Monday)
- Monday: John Thickstun guest lecture
- Grading:
  - HW3 posted (will be periodically updated for typos/clarifications)
  - extra credit posted soon
- Today:
  - Midterm review
  - GD/SGD: practical issues

Midterm


What is a good model of this distribution?

“A mixture of Gaussians”


Midterm Q4: scratch space


Midterm Q5: scratch space


Today


The “general” Loss Minimization Problem

w^* = \arg\min_w \frac{1}{N} \sum_{n=1}^{N} \underbrace{\ell(x_n, y_n, w)}_{\ell_n(w)} + R(w)

How do we run GD? SGD? Which one to use?

How do we run them?

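As a reference point (a minimal sketch, not course-provided code, assuming X is an N×d NumPy array, Y a length-N vector, and `grad_loss`/`grad_reg` are callables returning the per-example loss gradient and the regularizer gradient):

```python
import numpy as np

def gradient_descent(X, Y, grad_loss, grad_reg, eta, K):
    """Plain GD on F(w) = (1/N) * sum_n loss(x_n, y_n, w) + R(w).

    grad_loss(x, y, w) returns the gradient of the per-example loss;
    grad_reg(w) returns the gradient of the regularizer; eta is a fixed
    step size and K the number of iterations.
    """
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(K):
        # full gradient: average the per-example gradients, add the regularizer
        g = np.mean([grad_loss(X[n], Y[n], w) for n in range(N)], axis=0)
        w = w - eta * (g + grad_reg(w))
    return w
```

SGD differs only in replacing the full average with the gradient at one (or a few) randomly sampled example(s) per step.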

Our running example

\arg\min_w \frac{1}{N} \sum_{n=1}^{N} \frac{1}{2} (y_n - w \cdot x_n)^2 + \frac{\lambda}{2} \|w\|^2

- GD? SGD?
- Note we are computing an average. What is a crude way to estimate an average?

Will it converge?

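As a concrete sketch of GD on this objective (my illustration; X assumed to be an N×d NumPy array, Y a length-N vector), using the gradient −(1/N) Xᵀ(Y − Xw) + λw:

```python
import numpy as np

def gd_ridge(X, Y, lam, eta, K):
    """GD on (1/N) * sum_n (1/2)(y_n - w·x_n)^2 + (lam/2) * ||w||^2."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(K):
        residual = Y - X @ w                     # the N values (y_n - w·x_n)
        grad = -(X.T @ residual) / N + lam * w   # gradient of the full objective
        w = w - eta * grad
    return w
```

A crude way to estimate the average over n is to sample a single term of the sum; that estimate is exactly what SGD uses in place of the full gradient.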

How does GD behave? A 1-dim example

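The 1-dim example itself was worked on the board (the annotations are not reproduced here). As an assumed stand-in, consider GD on the scalar quadratic f(w) = (a/2)·w², whose update w ← (1 − ηa)·w converges for 0 < η < 2/a and oscillates or diverges otherwise:

```python
def gd_1d(a, eta, w0, K):
    """GD on f(w) = 0.5 * a * w**2; the gradient is a*w, so w <- (1 - eta*a) * w."""
    w = w0
    for _ in range(K):
        w = w - eta * a * w
    return w

# e.g. gd_1d(a=1.0, eta=0.1, w0=5.0, K=100)  -> essentially 0 (converges)
#      gd_1d(a=1.0, eta=2.5, w0=5.0, K=100)  -> huge (step size too large)
```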


GD: How do we set the step sizes?

- Theory: (one standard prescription for the square loss is sketched below)
  - square loss:
  - more generally:
- Practice:
  - square loss:
  - more generally:
- Do we decay the stepsize?

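The theory/practice entries above were filled in during lecture. As one illustrative choice consistent with standard smoothness-based theory (an assumption, not the slide's own answer), a constant step size η = 1/L is safe for the ridge objective, where L = λ_max((1/N)·XᵀX) + λ:

```python
import numpy as np

def safe_step_size(X, lam):
    """Constant step size eta = 1/L for the ridge objective, with
    L = lambda_max((1/N) X^T X) + lam its smoothness constant."""
    N = X.shape[0]
    L = np.linalg.eigvalsh(X.T @ X / N).max() + lam
    return 1.0 / L
```

In practice a common recipe is to try step sizes spaced by factors of 10 and keep the largest one for which the training loss still decreases.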


SGD for the square loss

Data: step sizes ⟨η^(1), . . . , η^(K)⟩
Result: parameter w
initialize: w^(0) = 0;
for k ∈ {1, . . . , K} do
    n ∼ Uniform({1, . . . , N});
    w^(k) = w^(k−1) + η^(k) (y_n − w^(k−1) · x_n) x_n;
end
return w^(K);
Algorithm 1: SGD

- where did the N go?
- regularization?
- minibatching?

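A runnable sketch of Algorithm 1 (my translation, assuming X is an N×d NumPy array and Y a length-N label vector):

```python
import numpy as np

def sgd_square_loss(X, Y, etas, seed=0):
    """SGD for the square loss: at step k, sample one example n uniformly
    and take a step on its loss alone (Algorithm 1)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for eta in etas:                            # etas = <eta^(1), ..., eta^(K)>
        n = rng.integers(N)                     # n ~ Uniform({1, ..., N})
        w = w + eta * (Y[n] - w @ X[n]) * X[n]  # single-example update
    return w
```

The 1/N disappears because each step uses the gradient of a single ℓ_n, which is an unbiased estimate of the full average gradient. One common way to include the ℓ2 regularizer is to also subtract η^(k)·λ·w^(k−1) in each update, and minibatching replaces the single sampled example with the average update over a small random subset.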


SGD: How do we set the step sizes?

- Theory:
- Practice:
  - How do we start it?
  - When do we decay it?


Stochastic Gradient Descent: Convergence

w^* = \arg\min_w \frac{1}{N} \sum_{n=1}^{N} \ell_n(w)

- w^(k): our parameter after k updates.
- Thm: Suppose ℓ(·) is convex (and satisfies mild regularity conditions). There is a decreasing sequence of step sizes η^(k) so that our function value, F(w^(k)), converges to the minimal function value, F(w^*).
- GD vs SGD: we need to turn down our step sizes over time! (one such schedule is sketched below)

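As one commonly used decreasing schedule (an illustration, not necessarily the schedule discussed in lecture), take η^(k) = η₀/√k:

```python
def decaying_steps(eta0, K):
    """Step sizes eta^(k) = eta0 / sqrt(k), k = 1..K: larger steps early,
    then a slow decay so the noise in SGD's updates is averaged out."""
    return [eta0 / (k ** 0.5) for k in range(1, K + 1)]

# e.g. (with hypothetical data X, Y):
#   w = sgd_square_loss(X, Y, decaying_steps(eta0=0.1, K=10000))
```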

Making features: scratch space

