

Gradient Descent
Lyle Ungar, University of Pennsylvania
In part from slides written jointly with Zack Ives

Learning objectives
• Know standard, coordinate, stochastic gradient, and mini-batch gradient descent
• Adagrad: core idea

Gradient Descent
• We almost always want to minimize some loss function
• Example: Sum of Squared Errors (SSE):

  $SSE(\theta) = \frac{1}{n} \sum_{i=1}^{n} r_i(\theta)^2, \qquad r_i(\theta) = y^{(i)} - \hat{y}(x^{(i)}; \theta)$

  (with the 1/n factor, this is the Mean Squared Error)

http://www.math.uah.edu/stat/expect/Variance.html

• In one dimension, the loss looks like a parabola centered around the optimal value μ
• (Generalizes to d dimensions)
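As a concrete illustration of this loss, here is a minimal Python sketch that computes the residuals and the SSE; the `predict` argument and the toy numbers are illustrative placeholders, not from the slides.

```python
import numpy as np

def sse(theta, X, y, predict):
    """SSE(theta) = (1/n) * sum_i (y_i - yhat(x_i; theta))^2."""
    residuals = y - predict(X, theta)   # r_i(theta) = y^(i) - yhat(x^(i); theta)
    return np.mean(residuals ** 2)

# Toy example with a linear predictor yhat(x; theta) = theta^T x
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([5.0, 7.0, 9.0])
print(sse(np.array([1.0, 2.0]), X, y, lambda X, th: X @ th))  # 0.0: perfect fit
```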


Getting Closer
http://www.math.uah.edu/stat/expect/Variance.html
• Current value θ
• What if we use the slope of the tangent to decide where to "go next"?
• The gradient (in one dimension, the derivative):

  $\nabla SSE(\theta) = \lim_{d \to 0} \frac{SSE(\theta + d) - SSE(\theta)}{d}$

theory.stanford.edu/~tim/s15/l/l15.pdf

• Update rule: $\theta := \theta - \eta \, \nabla SSE(\theta)$
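The limit above suggests a numerical approximation: evaluate the loss at θ and at θ + d for a small but nonzero d. A minimal sketch of that finite-difference idea; the names (`numerical_gradient`, `d`) and the toy parabola are illustrative, not from the slides.

```python
import numpy as np

def numerical_gradient(f, theta, d=1e-6):
    """Approximate each partial derivative of f at theta with a forward
    difference, mirroring the limit definition above with a small d."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = d
        grad[j] = (f(theta + step) - f(theta)) / d
    return grad

# Example: gradient of a 1-D parabola with minimum at mu = 3
print(numerical_gradient(lambda t: (t[0] - 3.0) ** 2, [0.0]))  # ~[-6.0]
```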

Getting Closer
http://www.math.uah.edu/stat/expect/Variance.html
• We can compute the gradient numerically... but it is often better to use the analytical derivative (calculus)!
• Starting from the current value θ and

  $SSE(\theta) = \frac{1}{n} \sum_{i=1}^{n} r_i(\theta)^2$

  the chain rule gives

  $\nabla SSE(\theta) = \frac{1}{n} \sum_{i=1}^{n} \frac{d}{d\theta} r_i(\theta)^2 = \frac{1}{n} \sum_{i=1}^{n} 2\, r_i(\theta)\, \frac{\partial r_i(\theta)}{\partial \theta}$

• For a linear model $\hat{y}(x; \theta) = \theta^{\top} x$, the residual $r_i(\theta) = y^{(i)} - \theta^{\top} x^{(i)}$ has $\frac{\partial r_i}{\partial \theta_j} = -x_j^{(i)}$, so

  $\nabla SSE(\theta) = -\frac{2}{n} \sum_{i=1}^{n} r_i(\theta)\, x^{(i)}$

Step size η

  $\theta := \theta - \eta \, \nabla SSE(\theta)$

Key questions
• How big a step η to take?
  - Too small and it takes a long time
  - Too big and it will be unstable
• "Optimal": scale η ~ 1/sqrt(iteration)
• Adaptive (a simple version), as sketched below:
  - E.g., each time, increase the step size by 10%
  - If the error ever increases, cut the step size in half
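A minimal NumPy sketch of gradient descent with this adaptive step-size heuristic, applied to the linear least-squares loss derived above; the function names and default values (`adaptive_gradient_descent`, `eta`, `n_iters`) are illustrative, not from the slides.

```python
import numpy as np

def sse(theta, X, y):
    """SSE with the 1/n factor, as on the slides."""
    r = y - X @ theta
    return np.mean(r ** 2)

def grad_sse(theta, X, y):
    """Analytic gradient: -(2/n) * sum_i r_i(theta) * x^(i)."""
    r = y - X @ theta
    return -2.0 / len(y) * X.T @ r

def adaptive_gradient_descent(X, y, eta=0.01, n_iters=100):
    """Gradient descent with the simple adaptive rule above:
    grow eta by 10% after a good step; halve it if the error increases."""
    theta = np.zeros(X.shape[1])
    prev_err = sse(theta, X, y)
    for _ in range(n_iters):
        theta_new = theta - eta * grad_sse(theta, X, y)
        err = sse(theta_new, X, y)
        if err > prev_err:
            eta *= 0.5                       # error went up: shrink the step, reject it
        else:
            theta, prev_err = theta_new, err # error went down: accept and grow the step
            eta *= 1.1
    return theta

# Example usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
print(adaptive_gradient_descent(X, y))
```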

Coordinate descent
• For L1 terms such as ||w||1 or ||y − ŷ||1, use coordinate descent

https://en.wikipedia.org/wiki/Coordinate_descent

  Repeat:
    For j = 1..p:
      θj := θj − η ∂Err/∂θj
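A minimal sketch of this coordinate-wise loop, taking one gradient step per coordinate on each sweep; the names and the toy objective are illustrative, not from the slides.

```python
import numpy as np

def coordinate_descent(grad_j, theta, eta=0.1, n_sweeps=50):
    """Cyclic coordinate descent as in the pseudocode above: repeatedly sweep
    over j = 1..p and step each coordinate along its own partial derivative."""
    theta = np.array(theta, dtype=float)
    for _ in range(n_sweeps):
        for j in range(theta.size):
            theta[j] -= eta * grad_j(theta, j)
    return theta

# Toy example: Err(theta) = (theta_0 - 1)^2 + (theta_1 + 2)^2
grad_j = lambda th, j: 2 * (th[j] - [1.0, -2.0][j])
print(coordinate_descent(grad_j, [0.0, 0.0]))  # ~[1.0, -2.0]
```

In practice, coordinate descent for L1-penalized problems often minimizes each coordinate exactly (e.g., by soft-thresholding); this sketch simply follows the per-coordinate gradient step shown in the pseudocode above.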

[Figure: elastic net parameter search; size of coefficients vs. (inverse) regularization penalty. Zou and Hastie]

Stochastic Gradient Descent
• If we have a very large data set, update the model after observing each single observation
  - "online" or "streaming" learning

  $SSE(\theta) = \frac{1}{n} \sum_{i=1}^{n} r_i(\theta)^2, \qquad \nabla SSE_i(\theta) = \frac{d}{d\theta} r_i(\theta)^2$

  $\theta := \theta - \eta \, \nabla SSE_i(\theta)$
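A minimal sketch of this per-observation update for the linear least-squares loss; the names and defaults (`sgd`, `eta`, `n_epochs`) are illustrative, not from the slides.

```python
import numpy as np

def sgd(X, y, eta=0.01, n_epochs=5, seed=0):
    """Stochastic gradient descent: update theta after every single observation,
    using the gradient of that one example's squared residual."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):        # stream through the data in random order
            r_i = y[i] - X[i] @ theta            # residual for one observation
            theta -= eta * (-2.0 * r_i * X[i])   # gradient of r_i(theta)^2
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=500)
print(sgd(X, y))
```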

Mini-batch
• Update the model every k observations
  - Batch size k (e.g. 50)
• More efficient than pure stochastic gradient descent or full gradient descent

  $SSE(\theta) = \frac{1}{n} \sum_{i=1}^{n} r_i(\theta)^2, \qquad \nabla SSE_k(\theta) = \frac{1}{k} \sum_{i=1}^{k} \frac{d}{d\theta} r_i(\theta)^2$

  $\theta := \theta - \eta \, \nabla SSE_k(\theta)$
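A minimal sketch of the mini-batch variant, averaging the per-example gradients over each batch of k observations; the names and defaults are illustrative, not from the slides.

```python
import numpy as np

def minibatch_gd(X, y, k=50, eta=0.05, n_epochs=20, seed=0):
    """Mini-batch gradient descent: update theta every k observations,
    using the average gradient over the batch."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        order = rng.permutation(len(y))
        for start in range(0, len(y), k):
            batch = order[start:start + k]
            r = y[batch] - X[batch] @ theta             # residuals on the batch
            grad = -2.0 / len(batch) * X[batch].T @ r   # (1/k) * sum of per-example gradients
            theta -= eta * grad
    return theta

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)
print(minibatch_gd(X, y))
```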

Adagrad
• Define a per-feature learning rate for feature j as:

  $\eta_{t,j} = \frac{\eta}{\sqrt{G_{t,j}}}, \qquad G_{t,j} = \sum_{k=1}^{t} g_{k,j}^2, \qquad g_{k,j} = \frac{\partial}{\partial \theta_j} \mathrm{cost}_{\theta}(x_k, y_k)$

• G_{t,j} is the sum of squares of the gradients of feature j over time t
• Frequently occurring features in the gradients get small learning rates; rare features get higher ones
• Key idea: "learn slowly" from frequent features but "pay attention" to rare but informative features

Adagrad
• In practice, add a small constant ζ > 0 to prevent dividing by zero:

  $\eta_{t,j} = \frac{\eta}{\sqrt{G_{t,j} + \zeta}}, \qquad G_{t,j} = \sum_{k=1}^{t} g_{k,j}^2$

• Update each parameter with its own learning rate:

  $\theta_j \leftarrow \theta_j - \frac{\eta}{\sqrt{G_{t,j} + \zeta}} \, g_{t,j}$
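A minimal sketch of these per-feature Adagrad updates, with the accumulated squared gradients G and the ζ constant described above, again on the linear least-squares loss; the step size and other defaults are illustrative, not from the slides.

```python
import numpy as np

def adagrad(X, y, eta=0.5, zeta=1e-8, n_epochs=5, seed=0):
    """Adagrad on single observations: coordinate j gets its own learning rate
    eta / sqrt(G_j + zeta), where G_j accumulates the squared gradients g_{k,j}."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    G = np.zeros(X.shape[1])                       # running sum of squared gradients per feature
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):
            r_i = y[i] - X[i] @ theta
            g = -2.0 * r_i * X[i]                  # g_{k,j}: gradient of this example's squared residual
            G += g ** 2
            theta -= eta / np.sqrt(G + zeta) * g   # per-feature step sizes
    return theta

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=500)
print(adagrad(X, y))
```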

Recap: Gradient Descent
• "Follow the slope" towards a minimum
  - Analytical or numerical derivative
  - Need to pick a step size
    - Larger = faster convergence, but risk of instability
• Lots of variations
  - Coordinate descent
  - Stochastic gradient descent or mini-batch
• Can get caught in local minima
  - An alternative, simulated annealing, uses randomness
