

Gradient Descent
Lyle Ungar, University of Pennsylvania
In part from slides written jointly with Zack Ives

Learning objectives
• Know standard, coordinate, stochastic gradient, and mini-batch gradient descent
• Adagrad: core idea

Gradient Descent
• We almost always want to minimize some loss function
• Example: Sum of Squared Errors (SSE):

  $SSE(\theta) = \frac{1}{n} \sum_{i=1}^{n} r_i(\theta)^2, \qquad r_i(\theta) = y^{(i)} - \hat{y}(x^{(i)}; \theta)$

  (with the 1/n factor, this is the Mean Squared Error)

http://www.math.uah.edu/stat/expect/Variance.html

• In one dimension, the loss looks like a parabola centered around the optimal value μ
• (Generalizes to d dimensions)
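As a concrete illustration of this loss, here is a minimal Python sketch that computes the residuals and the SSE; the `predict` argument and the toy numbers are illustrative placeholders, not from the slides.

```python
import numpy as np

def sse(theta, X, y, predict):
    """SSE(theta) = (1/n) * sum_i (y_i - yhat(x_i; theta))^2."""
    residuals = y - predict(X, theta)   # r_i(theta) = y^(i) - yhat(x^(i); theta)
    return np.mean(residuals ** 2)

# Toy example with a linear predictor yhat(x; theta) = theta^T x
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([5.0, 7.0, 9.0])
print(sse(np.array([1.0, 2.0]), X, y, lambda X, th: X @ th))  # 0.0: perfect fit
```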


Getting Closer
http://www.math.uah.edu/stat/expect/Variance.html
• Current value θ
• What if we use the slope of the tangent to decide where to "go next"?
• The gradient (in one dimension, the derivative):

  $\nabla SSE(\theta) = \lim_{d \to 0} \frac{SSE(\theta + d) - SSE(\theta)}{d}$

theory.stanford.edu/~tim/s15/l/l15.pdf

• Update rule: $\theta := \theta - \eta \, \nabla SSE(\theta)$
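The limit above suggests a numerical approximation: evaluate the loss at θ and at θ + d for a small but nonzero d. A minimal sketch of that finite-difference idea; the names (`numerical_gradient`, `d`) and the toy parabola are illustrative, not from the slides.

```python
import numpy as np

def numerical_gradient(f, theta, d=1e-6):
    """Approximate each partial derivative of f at theta with a forward
    difference, mirroring the limit definition above with a small d."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = d
        grad[j] = (f(theta + step) - f(theta)) / d
    return grad

# Example: gradient of a 1-D parabola with minimum at mu = 3
print(numerical_gradient(lambda t: (t[0] - 3.0) ** 2, [0.0]))  # ~[-6.0]
```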

Getting Closer
http://www.math.uah.edu/stat/expect/Variance.html
• We can compute the gradient numerically... but it is often better to use the analytical derivative (calculus)!
• Starting from the current value θ and

  $SSE(\theta) = \frac{1}{n} \sum_{i=1}^{n} r_i(\theta)^2$

  the chain rule gives

  $\nabla SSE(\theta) = \frac{1}{n} \sum_{i=1}^{n} \frac{d}{d\theta} r_i(\theta)^2 = \frac{1}{n} \sum_{i=1}^{n} 2\, r_i(\theta)\, \frac{\partial r_i(\theta)}{\partial \theta}$

• For a linear model $\hat{y}(x; \theta) = \theta^{\top} x$, the residual $r_i(\theta) = y^{(i)} - \theta^{\top} x^{(i)}$ has $\frac{\partial r_i}{\partial \theta_j} = -x_j^{(i)}$, so

  $\nabla SSE(\theta) = -\frac{2}{n} \sum_{i=1}^{n} r_i(\theta)\, x^{(i)}$

Step size η

  $\theta := \theta - \eta \, \nabla SSE(\theta)$

Key questions
• How big a step η to take?
  - Too small and it takes a long time
  - Too big and it will be unstable
• "Optimal": scale η ~ 1/sqrt(iteration)
• Adaptive (a simple version), as sketched below:
  - E.g., each time, increase the step size by 10%
  - If the error ever increases, cut the step size in half
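A minimal NumPy sketch of gradient descent with this adaptive step-size heuristic, applied to the linear least-squares loss derived above; the function names and default values (`adaptive_gradient_descent`, `eta`, `n_iters`) are illustrative, not from the slides.

```python
import numpy as np

def sse(theta, X, y):
    """SSE with the 1/n factor, as on the slides."""
    r = y - X @ theta
    return np.mean(r ** 2)

def grad_sse(theta, X, y):
    """Analytic gradient: -(2/n) * sum_i r_i(theta) * x^(i)."""
    r = y - X @ theta
    return -2.0 / len(y) * X.T @ r

def adaptive_gradient_descent(X, y, eta=0.01, n_iters=100):
    """Gradient descent with the simple adaptive rule above:
    grow eta by 10% after a good step; halve it if the error increases."""
    theta = np.zeros(X.shape[1])
    prev_err = sse(theta, X, y)
    for _ in range(n_iters):
        theta_new = theta - eta * grad_sse(theta, X, y)
        err = sse(theta_new, X, y)
        if err > prev_err:
            eta *= 0.5                       # error went up: shrink the step, reject it
        else:
            theta, prev_err = theta_new, err # error went down: accept and grow the step
            eta *= 1.1
    return theta

# Example usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
print(adaptive_gradient_descent(X, y))
```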

Coordinate descent
• For L1 terms such as ||w||1 or ||y − ŷ||1, use coordinate descent

https://en.wikipedia.org/wiki/Coordinate_descent

  Repeat:
    For j = 1..p:
      θj := θj − η ∂Err/∂θj
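A minimal sketch of this coordinate-wise loop, taking one gradient step per coordinate on each sweep; the names and the toy objective are illustrative, not from the slides.

```python
import numpy as np

def coordinate_descent(grad_j, theta, eta=0.1, n_sweeps=50):
    """Cyclic coordinate descent as in the pseudocode above: repeatedly sweep
    over j = 1..p and step each coordinate along its own partial derivative."""
    theta = np.array(theta, dtype=float)
    for _ in range(n_sweeps):
        for j in range(theta.size):
            theta[j] -= eta * grad_j(theta, j)
    return theta

# Toy example: Err(theta) = (theta_0 - 1)^2 + (theta_1 + 2)^2
grad_j = lambda th, j: 2 * (th[j] - [1.0, -2.0][j])
print(coordinate_descent(grad_j, [0.0, 0.0]))  # ~[1.0, -2.0]
```

In practice, coordinate descent for L1-penalized problems often minimizes each coordinate exactly (e.g., by soft-thresholding); this sketch simply follows the per-coordinate gradient step shown in the pseudocode above.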

[Figure: elastic net parameter search; size of coefficients vs. (inverse) regularization penalty. Zou and Hastie]

Stochastic Gradient Descent
• If we have a very large data set, update the model after observing each single observation
  - "online" or "streaming" learning

  $SSE(\theta) = \frac{1}{n} \sum_{i=1}^{n} r_i(\theta)^2, \qquad \nabla SSE_i(\theta) = \frac{d}{d\theta} r_i(\theta)^2$

  $\theta := \theta - \eta \, \nabla SSE_i(\theta)$
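A minimal sketch of this per-observation update for the linear least-squares loss; the names and defaults (`sgd`, `eta`, `n_epochs`) are illustrative, not from the slides.

```python
import numpy as np

def sgd(X, y, eta=0.01, n_epochs=5, seed=0):
    """Stochastic gradient descent: update theta after every single observation,
    using the gradient of that one example's squared residual."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):        # stream through the data in random order
            r_i = y[i] - X[i] @ theta            # residual for one observation
            theta -= eta * (-2.0 * r_i * X[i])   # gradient of r_i(theta)^2
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=500)
print(sgd(X, y))
```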

Mini-batch
• Update the model every k observations
  - Batch size k (e.g. 50)
• More efficient than pure stochastic gradient descent or full gradient descent

  $SSE(\theta) = \frac{1}{n} \sum_{i=1}^{n} r_i(\theta)^2, \qquad \nabla SSE_k(\theta) = \frac{1}{k} \sum_{i=1}^{k} \frac{d}{d\theta} r_i(\theta)^2$

  $\theta := \theta - \eta \, \nabla SSE_k(\theta)$
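A minimal sketch of the mini-batch variant, averaging the per-example gradients over each batch of k observations; the names and defaults are illustrative, not from the slides.

```python
import numpy as np

def minibatch_gd(X, y, k=50, eta=0.05, n_epochs=20, seed=0):
    """Mini-batch gradient descent: update theta every k observations,
    using the average gradient over the batch."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        order = rng.permutation(len(y))
        for start in range(0, len(y), k):
            batch = order[start:start + k]
            r = y[batch] - X[batch] @ theta             # residuals on the batch
            grad = -2.0 / len(batch) * X[batch].T @ r   # (1/k) * sum of per-example gradients
            theta -= eta * grad
    return theta

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)
print(minibatch_gd(X, y))
```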

Adagrad
• Define a per-feature learning rate for feature j as:

  $\eta_{t,j} = \frac{\eta}{\sqrt{G_{t,j}}}, \qquad G_{t,j} = \sum_{k=1}^{t} g_{k,j}^2, \qquad g_{k,j} = \frac{\partial}{\partial \theta_j} \mathrm{cost}_{\theta}(x_k, y_k)$

• G_{t,j} is the sum of squares of the gradients of feature j over time t
• Frequently occurring features in the gradients get small learning rates; rare features get higher ones
• Key idea: "learn slowly" from frequent features but "pay attention" to rare but informative features

Adagrad
• In practice, add a small constant ζ > 0 to prevent dividing by zero:

  $\eta_{t,j} = \frac{\eta}{\sqrt{G_{t,j} + \zeta}}, \qquad G_{t,j} = \sum_{k=1}^{t} g_{k,j}^2$

• Update each parameter with its own learning rate:

  $\theta_j \leftarrow \theta_j - \frac{\eta}{\sqrt{G_{t,j} + \zeta}} \, g_{t,j}$
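A minimal sketch of these per-feature Adagrad updates, with the accumulated squared gradients G and the ζ constant described above, again on the linear least-squares loss; the step size and other defaults are illustrative, not from the slides.

```python
import numpy as np

def adagrad(X, y, eta=0.5, zeta=1e-8, n_epochs=5, seed=0):
    """Adagrad on single observations: coordinate j gets its own learning rate
    eta / sqrt(G_j + zeta), where G_j accumulates the squared gradients g_{k,j}."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    G = np.zeros(X.shape[1])                       # running sum of squared gradients per feature
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):
            r_i = y[i] - X[i] @ theta
            g = -2.0 * r_i * X[i]                  # g_{k,j}: gradient of this example's squared residual
            G += g ** 2
            theta -= eta / np.sqrt(G + zeta) * g   # per-feature step sizes
    return theta

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=500)
print(adagrad(X, y))
```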

Recap: Gradient Descent
• "Follow the slope" towards a minimum
  - Analytical or numerical derivative
  - Need to pick a step size
    - Larger = faster convergence, but risk of instability
• Lots of variations
  - Coordinate descent
  - Stochastic gradient descent or mini-batch
• Can get caught in local minima
  - An alternative, simulated annealing, uses randomness
