
Machine Learning (CSE 446): Practical issues: optimization and learning

Sham M. Kakade, © 2018
University of Washington
cse446-staff@cs.washington.edu


Announcements

- Midterm summary:
  - stats: 71.5, std: 18
- Office hours today: 1:15-2:30 (no office hours on Monday)
- Monday: John Thickstun guest lecture
- Grading:
  - HW3 posted (will be periodically updated for typos/clarifications)
  - extra credit posted soon
- Today:
  - Midterm review
  - GD/SGD: practical issues

Midterm


What is a good model of this distribution?

“A mixture of Gaussians”


Midterm Q4: scratch space


Midterm Q5: scratch space


Today


The “general” Loss Minimization Problem

w^* = \arg\min_w \frac{1}{N} \sum_{n=1}^{N} \underbrace{\ell(x_n, y_n, w)}_{\ell_n(w)} + R(w)

How do we run GD? SGD? Which one to use?

How do we run them?

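As a reference point (a minimal sketch, not course-provided code, assuming X is an N×d NumPy array, Y a length-N vector, and `grad_loss`/`grad_reg` are callables returning the per-example loss gradient and the regularizer gradient):

```python
import numpy as np

def gradient_descent(X, Y, grad_loss, grad_reg, eta, K):
    """Plain GD on F(w) = (1/N) * sum_n loss(x_n, y_n, w) + R(w).

    grad_loss(x, y, w) returns the gradient of the per-example loss;
    grad_reg(w) returns the gradient of the regularizer; eta is a fixed
    step size and K the number of iterations.
    """
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(K):
        # full gradient: average the per-example gradients, add the regularizer
        g = np.mean([grad_loss(X[n], Y[n], w) for n in range(N)], axis=0)
        w = w - eta * (g + grad_reg(w))
    return w
```

SGD differs only in replacing the full average with the gradient at one (or a few) randomly sampled example(s) per step.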

Our running example

\arg\min_w \frac{1}{N} \sum_{n=1}^{N} \frac{1}{2} (y_n - w \cdot x_n)^2 + \frac{\lambda}{2} \|w\|^2

- GD? SGD?
- Note we are computing an average. What is a crude way to estimate an average?

Will it converge?

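As a concrete sketch of GD on this objective (my illustration; X assumed to be an N×d NumPy array, Y a length-N vector), using the gradient −(1/N) Xᵀ(Y − Xw) + λw:

```python
import numpy as np

def gd_ridge(X, Y, lam, eta, K):
    """GD on (1/N) * sum_n (1/2)(y_n - w·x_n)^2 + (lam/2) * ||w||^2."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(K):
        residual = Y - X @ w                     # the N values (y_n - w·x_n)
        grad = -(X.T @ residual) / N + lam * w   # gradient of the full objective
        w = w - eta * grad
    return w
```

A crude way to estimate the average over n is to sample a single term of the sum; that estimate is exactly what SGD uses in place of the full gradient.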

How does GD behave? A 1-dim example

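The 1-dim example itself was worked on the board (the annotations are not reproduced here). As an assumed stand-in, consider GD on the scalar quadratic f(w) = (a/2)·w², whose update w ← (1 − ηa)·w converges for 0 < η < 2/a and oscillates or diverges otherwise:

```python
def gd_1d(a, eta, w0, K):
    """GD on f(w) = 0.5 * a * w**2; the gradient is a*w, so w <- (1 - eta*a) * w."""
    w = w0
    for _ in range(K):
        w = w - eta * a * w
    return w

# e.g. gd_1d(a=1.0, eta=0.1, w0=5.0, K=100)  -> essentially 0 (converges)
#      gd_1d(a=1.0, eta=2.5, w0=5.0, K=100)  -> huge (step size too large)
```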


GD: How do we set the step sizes?

- Theory: (one standard prescription for the square loss is sketched below)
  - square loss:
  - more generally:
- Practice:
  - square loss:
  - more generally:
- Do we decay the stepsize?

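The theory/practice entries above were filled in during lecture. As one illustrative choice consistent with standard smoothness-based theory (an assumption, not the slide's own answer), a constant step size η = 1/L is safe for the ridge objective, where L = λ_max((1/N)·XᵀX) + λ:

```python
import numpy as np

def safe_step_size(X, lam):
    """Constant step size eta = 1/L for the ridge objective, with
    L = lambda_max((1/N) X^T X) + lam its smoothness constant."""
    N = X.shape[0]
    L = np.linalg.eigvalsh(X.T @ X / N).max() + lam
    return 1.0 / L
```

In practice a common recipe is to try step sizes spaced by factors of 10 and keep the largest one for which the training loss still decreases.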


SGD for the square loss

Data: step sizes ⟨η^(1), . . . , η^(K)⟩
Result: parameter w
initialize: w^(0) = 0;
for k ∈ {1, . . . , K} do
    n ∼ Uniform({1, . . . , N});
    w^(k) = w^(k−1) + η^(k) (y_n − w^(k−1) · x_n) x_n;
end
return w^(K);
Algorithm 1: SGD

- where did the N go?
- regularization?
- minibatching?

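A runnable sketch of Algorithm 1 (my translation, assuming X is an N×d NumPy array and Y a length-N label vector):

```python
import numpy as np

def sgd_square_loss(X, Y, etas, seed=0):
    """SGD for the square loss: at step k, sample one example n uniformly
    and take a step on its loss alone (Algorithm 1)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for eta in etas:                            # etas = <eta^(1), ..., eta^(K)>
        n = rng.integers(N)                     # n ~ Uniform({1, ..., N})
        w = w + eta * (Y[n] - w @ X[n]) * X[n]  # single-example update
    return w
```

The 1/N disappears because each step uses the gradient of a single ℓ_n, which is an unbiased estimate of the full average gradient. One common way to include the ℓ2 regularizer is to also subtract η^(k)·λ·w^(k−1) in each update, and minibatching replaces the single sampled example with the average update over a small random subset.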


SGD: How do we set the step sizes?

- Theory:
- Practice:
  - How do we start it?
  - When do we decay it?


Stochastic Gradient Descent: Convergence

w^* = \arg\min_w \frac{1}{N} \sum_{n=1}^{N} \ell_n(w)

- w^(k): our parameter after k updates.
- Thm: Suppose ℓ(·) is convex (and satisfies mild regularity conditions). There is a decreasing sequence of step sizes η^(k) so that our function value, F(w^(k)), converges to the minimal function value, F(w^*).
- GD vs SGD: we need to turn down our step sizes over time! (one such schedule is sketched below)

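As one commonly used decreasing schedule (an illustration, not necessarily the schedule discussed in lecture), take η^(k) = η₀/√k:

```python
def decaying_steps(eta0, K):
    """Step sizes eta^(k) = eta0 / sqrt(k), k = 1..K: larger steps early,
    then a slow decay so the noise in SGD's updates is averaged out."""
    return [eta0 / (k ** 0.5) for k in range(1, K + 1)]

# e.g. (with hypothetical data X, Y):
#   w = sgd_square_loss(X, Y, decaying_steps(eta0=0.1, K=10000))
```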

Making features: scratch space

