Page 1:

Gradient Descent
Lyle Ungar, University of Pennsylvania
In part from slides written jointly with Zack Ives

Learning objectives:
- Know standard, coordinate, stochastic gradient, and minibatch gradient descent
- Adagrad: core idea

Page 2:

Gradient Descent
- We almost always want to minimize some loss function
- Example: Sum of Squared Errors (SSE):

$$\mathrm{SSE}(\theta) = \frac{1}{n} \sum_{i=1}^{n} r_i(\theta)^2, \qquad r_i(\theta) = y^{(i)} - \hat{y}\!\left(\mathbf{x}^{(i)}; \theta\right)$$
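Not from the slides, but as a concrete anchor for the notation: a minimal NumPy sketch of $r_i(\theta)$ and $\mathrm{SSE}(\theta)$, assuming the linear model $\hat{y}(\mathbf{x};\theta)=\theta^{\top}\mathbf{x}$; the function and array names (residuals, sse, X, y) are hypothetical.

```python
import numpy as np

def residuals(theta, X, y):
    """r_i(theta) = y^(i) - yhat(x^(i); theta), assuming a linear model yhat = X @ theta."""
    return y - X @ theta

def sse(theta, X, y):
    """SSE(theta) = (1/n) * sum_i r_i(theta)^2, as defined on the slide."""
    r = residuals(theta, X, y)
    return np.mean(r ** 2)
```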

Page 3:

Mean Squared Error
- In one dimension, looks like a parabola centered around the optimal value $\mu$
- (Generalizes to d dimensions)

$$\mathrm{SSE}(\theta) = \frac{1}{n} \sum_{i=1}^{n} r_i(\theta)^2$$

[Figure: parabola of the loss as a function of $\theta$; from http://www.math.uah.edu/stat/expect/Variance.html]

Page 4:

Getting Closer
- Current value $\theta$
- What if we use the slope of the tangent to decide where to "go next"?

The gradient:

$$\nabla \mathrm{SSE}(\theta) = \lim_{d \to 0} \frac{\mathrm{SSE}(\theta + d) - \mathrm{SSE}(\theta)}{d}, \qquad \mathrm{SSE}(\theta) = \frac{1}{n} \sum_{i=1}^{n} r_i(\theta)^2$$

Update rule (cf. theory.stanford.edu/~tim/s15/l/l15.pdf):

$$\theta := \theta - \eta \, \nabla \mathrm{SSE}(\theta)$$

[Figure: the loss parabola with the tangent at the current value of $\theta$; from http://www.math.uah.edu/stat/expect/Variance.html]
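A minimal sketch (not from the slides) of the numerical route: approximate each component of $\nabla\mathrm{SSE}(\theta)$ with a finite difference and take a step of size $\eta$. The helper name numerical_gradient, and the reuse of the hypothetical sse function sketched earlier, are assumptions.

```python
import numpy as np

def numerical_gradient(f, theta, d=1e-6):
    """Finite-difference approximation of each partial derivative of f at theta."""
    grad = np.zeros_like(theta, dtype=float)
    for j in range(len(theta)):
        step = np.zeros_like(theta, dtype=float)
        step[j] = d
        grad[j] = (f(theta + step) - f(theta)) / d
    return grad

# One gradient-descent step, theta := theta - eta * grad SSE(theta),
# reusing the hypothetical sse(theta, X, y) from the earlier sketch:
# theta = theta - eta * numerical_gradient(lambda th: sse(th, X, y), theta)
```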

Page 5:

Getting Closer
- We can compute the gradient numerically…
- But sometimes it is better to use an analytic derivative (calculus)!
- Current value $\theta$

$$\mathrm{SSE}(\theta) = \frac{1}{n} \sum_{i=1}^{n} r_i(\theta)^2$$

$$\nabla \mathrm{SSE}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \frac{d}{d\theta} \, r_i(\theta)^2$$

Page 6:

Getting Closer
- We can compute the gradient numerically…
- But sometimes it is better to use an analytic derivative (calculus)!
- Current value $\theta$

$$\mathrm{SSE}(\theta) = \frac{1}{n} \sum_{i=1}^{n} r_i(\theta)^2$$

$$\nabla \mathrm{SSE}(\theta) = \frac{1}{n} \sum_{i=1}^{n} 2 \, r_i(\theta) \cdot \frac{\partial r_i(\theta)}{\partial \theta}$$

Page 7:

Getting Closer
- We can compute the gradient numerically…
- But sometimes it is better to use an analytic derivative (calculus)!
- Current value $\theta$

$$\mathrm{SSE}(\theta) = \frac{1}{n} \sum_{i=1}^{n} r_i(\theta)^2$$

$$\nabla \mathrm{SSE}(\theta) = \frac{1}{n} \sum_{i=1}^{n} 2 \, r_i(\theta) \cdot \frac{\partial r_i(\theta)}{\partial \theta}$$

For a linear model $\hat{y}(\mathbf{x}; \theta) = \theta^{\top}\mathbf{x}$, each residual is $r_i(\theta) = y^{(i)} - \theta^{\top}\mathbf{x}^{(i)}$, so

$$\frac{\partial r_i}{\partial \theta_j} = -x_j^{(i)}$$

Page 8:

Getting Closer
- We can compute the gradient numerically…
- But sometimes it is better to use an analytic derivative (calculus)!
- Current value $\theta$

$$\mathrm{SSE}(\theta) = \frac{1}{n} \sum_{i=1}^{n} r_i(\theta)^2, \qquad \frac{\partial r_i}{\partial \theta_j} = -x_j^{(i)}$$

$$\nabla \mathrm{SSE}(\theta) = -\frac{1}{n} \sum_{i=1}^{n} 2 \, r_i(\theta) \cdot \mathbf{x}^{(i)}$$

Page 9:

Getting Closer
- We can compute the gradient numerically…
- But sometimes it is better to use an analytic derivative (calculus)!
- Current value $\theta$
- Step size $\eta$

$$\mathrm{SSE}(\theta) = \frac{1}{n} \sum_{i=1}^{n} r_i(\theta)^2, \qquad \frac{\partial r_i}{\partial \theta_j} = -x_j^{(i)}$$

$$\nabla \mathrm{SSE}(\theta) = -\frac{2}{n} \sum_{i=1}^{n} r_i(\theta) \cdot \mathbf{x}^{(i)}$$

$$\theta := \theta - \eta \, \nabla \mathrm{SSE}(\theta)$$
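Putting the pieces together: a minimal sketch of full-batch gradient descent for the linear model, using the analytic gradient derived above. The defaults eta=0.01 and n_steps=1000 are illustrative choices, not from the slides.

```python
import numpy as np

def grad_sse(theta, X, y):
    """Analytic gradient: -(2/n) * sum_i r_i(theta) * x^(i) for the linear model."""
    r = y - X @ theta
    return -(2.0 / len(y)) * (X.T @ r)

def gradient_descent(X, y, eta=0.01, n_steps=1000):
    """Full-batch gradient descent: theta := theta - eta * grad SSE(theta)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        theta = theta - eta * grad_sse(theta, X, y)
    return theta
```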

Page 10:

Key questions
- How big a step $\eta$ to take?
  - Too small and it takes a long time
  - Too big and it will be unstable
- "Optimal": scale $\eta \sim 1/\sqrt{\text{iteration}}$
- Adaptive (a simple version), sketched below:
  - E.g., each time, increase the step size by 10%
  - If the error ever increases, cut the step size in half

$$\theta := \theta - \eta \, \nabla \mathrm{SSE}(\theta)$$
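One reasonable reading of the adaptive rule above, as a minimal sketch: grow the step by 10% after each successful update and halve it whenever the error increases. Rejecting the failed step, and the hypothetical sse and grad_sse helpers from the earlier sketches, are assumptions.

```python
import numpy as np

def adaptive_gradient_descent(X, y, eta=0.01, n_steps=1000):
    """Uses the hypothetical sse() and grad_sse() helpers sketched earlier."""
    theta = np.zeros(X.shape[1])
    err = sse(theta, X, y)
    for _ in range(n_steps):
        candidate = theta - eta * grad_sse(theta, X, y)
        new_err = sse(candidate, X, y)
        if new_err <= err:
            theta, err = candidate, new_err
            eta *= 1.1   # each time the step succeeds, increase the step size by 10%
        else:
            eta *= 0.5   # if the error ever increases, cut the step size in half
    return theta
```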

Page 11:

For $\|\mathbf{w}\|_1$ or $\|\mathbf{y} - \hat{\mathbf{y}}\|_1$, use coordinate descent
(https://en.wikipedia.org/wiki/Coordinate_descent)

Repeat:
  For j = 1..p:
    theta_j := theta_j - eta * dErr/dtheta_j
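A minimal runnable version of the pseudocode above, assuming the squared-error objective from the earlier slides; the fixed step size eta and the sweep count are illustrative.

```python
import numpy as np

def coordinate_descent(X, y, eta=0.01, n_sweeps=100):
    """Repeat: for j = 1..p, theta_j := theta_j - eta * dErr/dtheta_j."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            r = y - X @ theta                           # residuals at the current theta
            dErr_dtheta_j = -(2.0 / n) * (X[:, j] @ r)  # partial derivative of SSE w.r.t. theta_j
            theta[j] -= eta * dErr_dtheta_j
    return theta
```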

Page 12:

Elastic net parameter search

[Figure: size of coefficients vs. (inverse) regularization penalty; Zou and Hastie]

Page 13:

Stochastic Gradient Descent
- If we have a very large data set, update the model after observing each single observation
  - "online" or "streaming" learning

$$\mathrm{SSE}(\theta) = \frac{1}{n} \sum_{i=1}^{n} r_i(\theta)^2, \qquad \nabla \mathrm{SSE}_i(\theta) = \frac{d}{d\theta} \, r_i(\theta)^2$$

$$\theta := \theta - \eta \, \nabla \mathrm{SSE}_i(\theta)$$
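A minimal sketch of this update for the linear-model SSE: visit one observation at a time and step along $\nabla\mathrm{SSE}_i(\theta)$. The random shuffling and the epoch count are assumptions, not from the slides.

```python
import numpy as np

def sgd(X, y, eta=0.01, n_epochs=10):
    """Stochastic gradient descent: update theta after each single observation."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_epochs):
        for i in np.random.permutation(n):
            r_i = y[i] - X[i] @ theta        # residual for one observation
            grad_i = -2.0 * r_i * X[i]       # d/dtheta r_i(theta)^2 for the linear model
            theta = theta - eta * grad_i     # theta := theta - eta * grad SSE_i(theta)
    return theta
```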

Page 14:

Mini-batch
- Update the model every k observations
  - Batch size k (e.g. 50)
- More efficient than pure stochastic gradient or full gradient descent

$$\mathrm{SSE}(\theta) = \frac{1}{n} \sum_{i=1}^{n} r_i(\theta)^2, \qquad \nabla \mathrm{SSE}_k(\theta) = \frac{1}{k} \sum_{i=1}^{k} \frac{d}{d\theta} \, r_i(\theta)^2 \quad \text{(sum over the k observations in the mini-batch)}$$

$$\theta := \theta - \eta \, \nabla \mathrm{SSE}_k(\theta)$$
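A minimal mini-batch sketch under the same assumptions as the earlier sketches: average the per-observation gradients over each batch of k observations before stepping. The defaults are illustrative.

```python
import numpy as np

def minibatch_sgd(X, y, eta=0.01, k=50, n_epochs=10):
    """Mini-batch gradient descent: update theta every k observations."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_epochs):
        order = np.random.permutation(n)
        for start in range(0, n, k):
            batch = order[start:start + k]
            r = y[batch] - X[batch] @ theta
            grad_k = -(2.0 / len(batch)) * (X[batch].T @ r)  # average gradient over the batch
            theta = theta - eta * grad_k                     # theta := theta - eta * grad SSE_k(theta)
    return theta
```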

Page 15:

Adagrad
- Define a per-feature learning rate for feature j as:

$$\eta_{t,j} = \frac{\eta}{\sqrt{G_{t,j}}}, \qquad G_{t,j} = \sum_{k=1}^{t} g_{k,j}^2, \qquad g_{k,j} = \frac{\partial}{\partial \theta_j} \mathrm{cost}_\theta(x_k, y_k)$$

- $G_{t,j}$ is the sum of squares of the gradients of feature j over time t
- Frequently occurring features in the gradients get small learning rates; rare features get higher ones
- Key idea: "learn slowly" from frequent features but "pay attention" to rare but informative features

Page 16:

Adagrad
- In practice, add a small constant $\zeta > 0$ to prevent dividing by zero:

$$\eta_{t,j} = \frac{\eta}{\sqrt{G_{t,j}}}, \qquad G_{t,j} = \sum_{k=1}^{t} g_{k,j}^2$$

$$\theta_j \leftarrow \theta_j - \frac{\eta}{\sqrt{G_{t,j} + \zeta}} \, g_{t,j}$$
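A minimal sketch of the Adagrad update above, applied to the same linear-model squared error used in the earlier sketches; the stochastic (one observation at a time) setting and the eta and zeta defaults are assumptions.

```python
import numpy as np

def adagrad(X, y, eta=0.1, zeta=1e-8, n_epochs=10):
    """Adagrad: per-feature learning rate eta / sqrt(G_j + zeta)."""
    n, p = X.shape
    theta = np.zeros(p)
    G = np.zeros(p)                               # G_j: running sum of squared gradients for feature j
    for _ in range(n_epochs):
        for i in np.random.permutation(n):
            r_i = y[i] - X[i] @ theta
            g = -2.0 * r_i * X[i]                 # per-feature gradient g_{t,j}
            G += g ** 2                           # G_{t,j} = sum_k g_{k,j}^2
            theta -= eta / np.sqrt(G + zeta) * g  # theta_j := theta_j - eta/sqrt(G_{t,j}+zeta) * g_{t,j}
    return theta
```

Frequently updated features accumulate a large $G_{t,j}$ and so take smaller steps, while rare features keep larger effective learning rates, matching the key idea on the previous slide.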

Page 17:

Recap: Gradient Descent
- "Follow the slope" towards a minimum
  - Analytical or numerical derivative
  - Need to pick a step size
    - Larger = faster convergence but instability
- Lots of variations
  - Coordinate descent
  - Stochastic gradient descent or mini-batch
- Can get caught in local minima
  - An alternative, simulated annealing, uses randomness