Gradient Ascent
Chris Piech, CS109, Stanford University

Gradient Ascent - Stanford University

Dec 29, 2021

Transcript
Page 1: Gradient Ascent - Stanford University

Gradient Ascent
Chris Piech

CS109, Stanford University

Page 2: Gradient Ascent - Stanford University

Our Path

Parameter Estimation

Linear Regression
Naïve Bayes
Logistic Regression

Deep Learning

Page 3: Gradient Ascent - Stanford University

Our Path

Linear Regression
Naïve Bayes
Logistic Regression

Deep Learning

Unbiased estimators
Maximizing likelihood
Bayesian estimation

Page 4: Gradient Ascent - Stanford University

Review

Page 5: Gradient Ascent - Stanford University
Page 6: Gradient Ascent - Stanford University

• Consider n I.I.D. random variables X1, X2, ..., Xn

§ Xi is a sample from density function f(Xi | θ)

§ What is the best choice of parameters θ?

Parameter Learning

Page 7: Gradient Ascent - Stanford University


Likelihood (of data given parameters):

L(\theta) = \prod_{i=1}^{n} f(X_i \mid \theta)

Page 8: Gradient Ascent - Stanford University


Maximum Likelihood Estimation

L(\theta) = \prod_{i=1}^{n} f(X_i \mid \theta)

LL(\theta) = \sum_{i=1}^{n} \log f(X_i \mid \theta)

\hat{\theta} = \operatorname*{argmax}_{\theta} \; LL(\theta)

Page 9: Gradient Ascent - Stanford University

Argmax?

Page 10: Gradient Ascent - Stanford University

Option #1: Straight optimization

Page 11: Gradient Ascent - Stanford University

• General approach for finding MLE of θ
§ Determine formula for LL(θ)

§ Differentiate LL(θ) w.r.t. (each) θ

§ To maximize, set the derivative to 0

§ Solve the resulting (simultaneous) equations to get θMLE
o Make sure the derived θMLE is actually a maximum (and not a minimum or saddle point), e.g., check LL(θMLE ± ε) < LL(θMLE)
• This step is often ignored in expository derivations
• So, we'll ignore it here too (and won't require it in this class)

\frac{\partial LL(\theta)}{\partial \theta} \qquad \frac{\partial LL(\theta)}{\partial \theta} = 0 \qquad \theta_{MLE}

Computing the MLE

Page 12: Gradient Ascent - Stanford University


End Review

Page 13: Gradient Ascent - Stanford University

Maximizing Likelihood with Bernoulli

• Consider I.I.D. random variables X1, X2, ..., Xn
§ Xi ~ Ber(p)
§ Probability mass function, f(Xi | p):

Page 14: Gradient Ascent - Stanford University

• Consider I.I.D. random variables X1, X2, ..., Xn
§ Xi ~ Ber(p)
§ Probability mass function, f(Xi | p):

Maximizing Likelihood with Bernoulli

[Figure: PMF of Bernoulli, bars of height 1 − p at x = 0 and p at x = 1]

f(X_i \mid p) = p^{x_i}(1-p)^{1-x_i}, \quad \text{where } x_i = 0 \text{ or } 1

PMF of Bernoulli (p = 0.2):

f(x) = 0.2^{x}(1-0.2)^{1-x}

Page 15: Gradient Ascent - Stanford University

Bernoulli PMF

X \sim \text{Ber}(p)

f(X = x \mid p) = p^{x}(1-p)^{1-x}

Page 16: Gradient Ascent - Stanford University

• Consider I.I.D. random variables X1, X2, ..., Xn

§ Xi ~ Ber(p)

§ Probability mass function, f(Xi | p), can be written as:

§ Likelihood:

§ Log-likelihood:

§ Differentiate w.r.t. p, and set to 0:

f(X_i \mid p) = p^{x_i}(1-p)^{1-x_i}, \quad \text{where } x_i = 0 \text{ or } 1

L(p) = \prod_{i=1}^{n} p^{X_i}(1-p)^{1-X_i}

LL(p) = \sum_{i=1}^{n} \log\left(p^{X_i}(1-p)^{1-X_i}\right) = \sum_{i=1}^{n} \left[ X_i \log p + (1 - X_i)\log(1-p) \right]

= Y \log p + (n - Y)\log(1-p), \quad \text{where } Y = \sum_{i=1}^{n} X_i

\frac{\partial LL(p)}{\partial p} = \frac{Y}{p} - \frac{n - Y}{1-p} = 0 \;\Rightarrow\; p_{MLE} = \frac{Y}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i

Maximizing Likelihood with Bernoulli
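The closed-form answer above says p_MLE is just the fraction of 1s in the sample. A minimal sketch in Python (the helper name is illustrative, not from the slides):

```python
def bernoulli_mle(samples):
    # p_MLE = (1/n) * sum(X_i): the fraction of 1s, i.e., the sample mean
    return sum(samples) / len(samples)

# Seven flips with three heads gives p_MLE = 3/7
p_hat = bernoulli_mle([1, 0, 0, 1, 0, 1, 0])
```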

Page 17: Gradient Ascent - Stanford University

Isn’t that the same as the sample mean?

Page 18: Gradient Ascent - Stanford University

Yes. For Bernoulli.

Page 19: Gradient Ascent - Stanford University
Page 20: Gradient Ascent - Stanford University

Maximum Likelihood Algorithm

1. Decide on a model for the distribution of your samples. Define the PMF / PDF for your sample.

2. Write out the log likelihood function.

3. State that the optimal parameters are the argmax of the log likelihood function.

4. Use an optimization algorithm to calculate the argmax.

Page 21: Gradient Ascent - Stanford University
Page 22: Gradient Ascent - Stanford University

• Consider I.I.D. random variables X1, X2, ..., Xn
§ Xi ~ Poi(λ)

§ PMF:

§ Likelihood:

§ Log-likelihood:

§ Differentiate w.r.t. λ, and set to 0:

f(X_i \mid \lambda) = \frac{e^{-\lambda}\lambda^{x_i}}{x_i!}

L(\lambda) = \prod_{i=1}^{n} \frac{e^{-\lambda}\lambda^{X_i}}{X_i!}

LL(\lambda) = \sum_{i=1}^{n} \log\left(\frac{e^{-\lambda}\lambda^{X_i}}{X_i!}\right) = \sum_{i=1}^{n} \left[ -\lambda + X_i \log\lambda - \log(X_i!) \right]

= -n\lambda + \log\lambda \sum_{i=1}^{n} X_i - \sum_{i=1}^{n} \log(X_i!)

\frac{\partial LL(\lambda)}{\partial \lambda} = -n + \frac{1}{\lambda}\sum_{i=1}^{n} X_i = 0 \;\Rightarrow\; \lambda_{MLE} = \frac{1}{n}\sum_{i=1}^{n} X_i

Maximizing Likelihood with Poisson
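The derivation above can be sanity-checked numerically: λ_MLE is the sample mean, and the log-likelihood should dip if we nudge λ away from it (the LL(θ ± ε) < LL(θ) check mentioned earlier). A sketch in Python:

```python
import math

def poisson_ll(lam, xs):
    # LL(λ) = -nλ + (log λ) Σ X_i − Σ log(X_i!)
    n = len(xs)
    return (-n * lam + math.log(lam) * sum(xs)
            - sum(math.log(math.factorial(x)) for x in xs))

xs = [2, 3, 4, 3, 5, 1]
lam_mle = sum(xs) / len(xs)  # closed-form MLE: the sample mean

# Nudging λ in either direction lowers the log-likelihood
assert poisson_ll(lam_mle, xs) > poisson_ll(lam_mle + 0.1, xs)
assert poisson_ll(lam_mle, xs) > poisson_ll(lam_mle - 0.1, xs)
```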

Page 23: Gradient Ascent - Stanford University

It is so general!

Page 24: Gradient Ascent - Stanford University

• Consider I.I.D. random variables X1, X2, ..., Xn

§ Xi ~ Uni(a, b)

§ PDF:

§ Likelihood:

o Constraint a ≤ x1, x2, …, xn ≤ b makes differentiation tricky

o Intuition: we want the interval size (b – a) to be as small as possible, to maximize the likelihood of each data point

o But we need to make sure all observed data are contained in the interval
• If any observed data point is not in the interval, then L(θ) = 0

§ Solution: aMLE = min(x1, …, xn), bMLE = max(x1, …, xn)

f(X_i \mid a, b) = \begin{cases} \dfrac{1}{b-a} & a \le x_i \le b \\ 0 & \text{otherwise} \end{cases}

L(\theta) = \begin{cases} \left(\dfrac{1}{b-a}\right)^{n} & a \le x_1, x_2, \ldots, x_n \le b \\ 0 & \text{otherwise} \end{cases}

Maximizing Likelihood with Uniform
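Since differentiation fails here, the MLE comes straight from the min and max of the data. A minimal sketch (the function name is illustrative):

```python
def uniform_mle(xs):
    # The tightest interval containing all the data maximizes (1/(b-a))^n
    return min(xs), max(xs)

# Using the sample data from the slides
a_mle, b_mle = uniform_mle([0.15, 0.20, 0.30, 0.40, 0.65, 0.70, 0.75])
```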

Page 25: Gradient Ascent - Stanford University

• Consider I.I.D. random variables X1, X2, ..., Xn
§ Xi ~ Uni(0, 1)

§ Observe data:
o 0.15, 0.20, 0.30, 0.40, 0.65, 0.70, 0.75

[Figure: likelihood L(a, 1) plotted as a function of a, and L(0, b) as a function of b]

Understanding MLE with Uniform

Page 26: Gradient Ascent - Stanford University

• How do small samples affect MLE?

§ In many cases, \theta_{MLE} = \frac{1}{n}\sum_{i=1}^{n} X_i = sample mean
o Unbiased. Not too shabby…

§ Estimating a Normal, \sigma^2_{MLE} = \frac{1}{n}\sum_{i=1}^{n} (X_i - \mu_{MLE})^2
o Biased. Underestimates for small n (e.g., 0 for n = 1)

§ As seen with Uniform, aMLE ≥ a and bMLE ≤ b
o Biased. Problematic for small n (e.g., aMLE = bMLE when n = 1)

§ Small sample phenomena intuitively make sense:
o Maximum likelihood ⇒ best explain the data we've seen
o Does not attempt to generalize to unseen data

Small Samples = Problems
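The underestimate of σ² for small n is easy to see by simulation. A sketch, assuming standard-normal data with n = 2, where E[σ²_MLE] = (n − 1)/n · σ² = 0.5 (the seed value is arbitrary):

```python
import random

random.seed(109)
n, trials = 2, 10_000
avg = 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    mu_mle = sum(xs) / n
    avg += sum((x - mu_mle) ** 2 for x in xs) / n  # σ²_MLE for this sample
avg /= trials
# avg lands near 0.5, well below the true variance of 1.0
```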

Page 27: Gradient Ascent - Stanford University

• Maximum Likelihood Estimators are generally:

§ Consistent: for ε > 0, \lim_{n \to \infty} P(|\hat{\theta} - \theta| < \varepsilon) = 1

§ Potentially biased (though asymptotically less so)

§ Asymptotically optimal
o Has the smallest variance of "good" estimators for large samples

§ Often used in practice where sample size is large relative to parameter space
o But be careful: there are some very large parameter spaces

Properties of MLE

Page 28: Gradient Ascent - Stanford University


Maximum Likelihood Estimation

L(\theta) = \prod_{i=1}^{n} f(X_i \mid \theta)

LL(\theta) = \sum_{i=1}^{n} \log f(X_i \mid \theta)

\hat{\theta} = \operatorname*{argmax}_{\theta} \; LL(\theta)

Page 29: Gradient Ascent - Stanford University
Page 30: Gradient Ascent - Stanford University

Argmax 2: Gradient Ascent

Page 31: Gradient Ascent - Stanford University

Argmax Option #1: Straight optimization

Page 32: Gradient Ascent - Stanford University

Argmax Option #2: Gradient Ascent

Page 33: Gradient Ascent - Stanford University

Gradient Ascent

Walk uphill and you will find a local maximum (if your step size is small enough)

[Figure: likelihood surface p(samples | θ) with the argmax marked]


Page 35: Gradient Ascent - Stanford University

Gradient Ascent

Walk uphill and you will find a local maximum (if your step size is small enough)

Especially good if the function is convex

[Figure: likelihood surface p(samples | θ) with two starting points θ1, θ2]

Page 36: Gradient Ascent - Stanford University

\theta_j^{\text{new}} = \theta_j^{\text{old}} + \eta \cdot \frac{\partial LL(\theta^{\text{old}})}{\partial \theta_j^{\text{old}}}

Gradient Ascent

Repeat many times

Walk uphill and you will find a local maximum (if your step size is small enough)

This is some profound life philosophy

Page 37: Gradient Ascent - Stanford University


Gradient ascent is your bread-and-butter algorithm for optimization (e.g., argmax)

Page 38: Gradient Ascent - Stanford University

Initialize: θj = 0 for all 0 ≤ j ≤ m

Gradient Ascent

Calculate all θj

Page 39: Gradient Ascent - Stanford University

Initialize: θj = 0 for all 0 ≤ j ≤ m

Gradient Ascent

Repeat many times:

Calculate all gradient[j]’s based on data

θj += η * gradient[j] for all 0 ≤ j ≤ m

gradient[j] = 0 for all 0 ≤ j ≤ m
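The loop above can be written out as a short generic Python sketch; the function maximized here is a stand-in example, not from the slides:

```python
def gradient_ascent(grad, theta, eta=0.01, steps=2000):
    # Repeatedly step uphill along the gradient
    for _ in range(steps):
        theta += eta * grad(theta)
    return theta

# Maximize f(θ) = -(θ - 3)², whose gradient is -2(θ - 3); the argmax is θ = 3
theta_hat = gradient_ascent(lambda t: -2 * (t - 3), theta=0.0)
```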

Page 40: Gradient Ascent - Stanford University

Review: Maximum Likelihood Algorithm

1. Decide on a model for the likelihood of your samples. This often uses a PMF or PDF.

2. Write out the log likelihood function.

3. State that the optimal parameters are the argmax of the log likelihood function.

4. Use an optimization algorithm to calculate the argmax.

Page 41: Gradient Ascent - Stanford University

Review: Maximum Likelihood Algorithm

1. Decide on a model for the likelihood of your samples. This often uses a PMF or PDF.

2. Write out the log likelihood function.

3. State that the optimal parameters are the argmax of the log likelihood function.

4. Calculate the derivative of LL with respect to θ.

5. Use an optimization algorithm to calculate the argmax.

Page 42: Gradient Ascent - Stanford University

Linear Regression Lite

Page 43: Gradient Ascent - Stanford University

Predicting Warriors

X1 = Opposing team ELO

X2 = Points in last game

X3 = Curry playing?

X4 = Playing at home?

Y = Warriors points

Page 44: Gradient Ascent - Stanford University

Predicting CO2 (simple)

(x(1), y(1)), (x(2), y(2)), . . . (x(n), y(n))

N training datapoints

Linear Regression Lite Model

Y = \theta \cdot X + Z, \qquad Z \sim N(0, \sigma^2), \qquad Y \mid X \sim N(\theta X, \sigma^2)

X = CO2 level

Y = Average Global Temperature

Page 45: Gradient Ascent - Stanford University

1) Write Likelihood Fn

(x(1), y(1)), (x(2), y(2)), . . . (x(n), y(n))

N training datapoints. Model: Y \mid X \sim N(\theta X, \sigma^2)

First, calculate Likelihood of the data

Shorthand for:

Page 46: Gradient Ascent - Stanford University

1) Write Likelihood Fn

(x(1), y(1)), (x(2), y(2)), . . . (x(n), y(n))

N training datapoints. Model: Y \mid X \sim N(\theta X, \sigma^2)

First, calculate Likelihood of the data

Page 47: Gradient Ascent - Stanford University

2) Write Log Likelihood Fn

Second, calculate Log Likelihood of the data

Likelihood function:

(x(1), y(1)), (x(2), y(2)), . . . (x(n), y(n))N training datapoints:
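The formulas on this slide are images in the original deck; reconstructed from the stated model Y | X ~ N(θX, σ²), the likelihood and log-likelihood are:

```latex
L(\theta) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}
    \exp\!\left( -\frac{\left(y^{(i)} - \theta x^{(i)}\right)^2}{2\sigma^2} \right)

LL(\theta) = \sum_{i=1}^{n} \left[ -\log\!\left(\sigma\sqrt{2\pi}\right)
    - \frac{\left(y^{(i)} - \theta x^{(i)}\right)^2}{2\sigma^2} \right]
= -n\log\!\left(\sigma\sqrt{2\pi}\right)
    - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y^{(i)} - \theta x^{(i)} \right)^2
```

Only the squared-error term depends on θ, so maximizing LL is the same as minimizing the sum of squared errors.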

Page 48: Gradient Ascent - Stanford University

3) State MLE as Optimization

Third, celebrate!

Log Likelihood:

(x(1), y(1)), (x(2), y(2)), . . . (x(n), y(n))N training datapoints:

Page 49: Gradient Ascent - Stanford University

4) Find derivative

Fourth, optimize!

Goal:(x(1), y(1)), (x(2), y(2)), . . . (x(n), y(n))N training datapoints:
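The derivative on this slide is also an image; reconstructed from the log-likelihood of the model Y | X ~ N(θX, σ²), where only the squared-error term depends on θ:

```latex
\frac{\partial LL(\theta)}{\partial \theta}
= \frac{1}{\sigma^2} \sum_{i=1}^{n} \left( y^{(i)} - \theta x^{(i)} \right) x^{(i)}
```

This is the gradient that each gradient-ascent step follows.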

Page 50: Gradient Ascent - Stanford University

5) Run optimization code

(x(1), y(1)), (x(2), y(2)), . . . (x(n), y(n))N training datapoints:

Page 51: Gradient Ascent - Stanford University

Initialize: θj = 0 for all 0 ≤ j ≤ m

Gradient Ascent

Repeat many times:

Calculate all gradient[j]’s based on data and current setting of theta

θj += η * gradient[j] for all 0 ≤ j ≤ m

gradient[j] = 0 for all 0 ≤ j ≤ m

Page 52: Gradient Ascent - Stanford University

Initialize: θ = 0

Gradient Ascent

Repeat many times:

Calculate gradient based on data

θ += η * gradient

gradient = 0

Linear Regression (simple)

Page 53: Gradient Ascent - Stanford University

Initialize: θ = 0

Repeat many times:

For each training example (x, y):

θ += η * gradient

gradient = 0

Update gradient for current training example

Linear Regression (simple)

Page 54: Gradient Ascent - Stanford University

Initialize: θ = 0

Repeat many times:

For each training example (x, y):

θ += η * gradient

gradient = 0

gradient += 2(y - θ x)(x)

Linear Regression (simple)
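The pseudocode above, as a runnable Python sketch (the learning rate and epoch count are illustrative choices, not from the slides):

```python
def fit_simple(data, eta=0.001, epochs=500):
    theta = 0.0
    for _ in range(epochs):
        gradient = 0.0
        for x, y in data:
            gradient += 2 * (y - theta * x) * x  # d/dθ of -(y - θx)²
        theta += eta * gradient                  # step uphill
    return theta

# Data generated by y = 2x, so θ converges to 2
theta_hat = fit_simple([(1, 2), (2, 4), (3, 6)])
```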

Page 55: Gradient Ascent - Stanford University

Linear Regression

Page 56: Gradient Ascent - Stanford University

Predicting CO2

X1 = Temperature

X2 = Elevation

X3 = CO2 level yesterday

X4 = GDP of region

X5 = Acres of forest growth

Y = CO2 levels

Page 57: Gradient Ascent - Stanford University

Training: Gradient ascent to choose the best thetas to describe your data

\theta_{MLE} = \operatorname*{argmax}_{\theta} \; -\sum_{i=1}^{n} \left( y^{(i)} - \theta^T x^{(i)} \right)^2

Problem: Predict real value Y based on observing variable X

Linear Regression

Model: Linear weight every feature

Y = \theta_1 X_1 + \cdots + \theta_m X_m + \theta_{m+1} = \theta^T X

Page 58: Gradient Ascent - Stanford University

Initialize: θj = 0 for all 0 ≤ j ≤ m

Repeat many times:

For each training example (x, y):

For each parameter j:

θj += η * gradient[j] for all 0 ≤ j ≤ m

gradient[j] = 0 for all 0 ≤ j ≤ m

gradient[j] += (y – θTx)(x[j])

Linear Regression
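A runnable sketch of the full version, with the intercept θm+1 handled by appending a constant 1 feature; it uses per-example updates with the positive x[j] sign, which steps uphill on the log-likelihood (learning rate and epoch count are illustrative):

```python
def fit_linear(xs, ys, eta=0.01, epochs=2000):
    m = len(xs[0])
    theta = [0.0] * (m + 1)              # last entry is the intercept θ_{m+1}
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            xb = x + [1.0]               # constant feature for the intercept
            pred = sum(t * xi for t, xi in zip(theta, xb))
            for j in range(m + 1):
                theta[j] += eta * (y - pred) * xb[j]
    return theta

# Data generated by y = 3·x1 + 1
theta_hat = fit_linear([[0.0], [1.0], [2.0], [3.0]], [1.0, 4.0, 7.0, 10.0])
```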

Page 59: Gradient Ascent - Stanford University

Predicting Warriors
Y = Warriors points

X1 = Opposing team ELO

X2 = Points in last game

X3 = Curry playing?

X4 = Playing at home?

θ1 = -2.3
θ2 = +1.2
θ3 = +10.2
θ4 = +3.3
θ5 = +95.4

Y = \theta_1 X_1 + \cdots + \theta_m X_m + \theta_{m+1} = \theta^T X

Page 60: Gradient Ascent - Stanford University

§ Training data: set of N pre-classified data instances
o N training pairs: (x(1), y(1)), (x(2), y(2)), …, (x(n), y(n))
• Use superscripts to denote the i-th training instance

§ Learning algorithm: method for determining g(X)
o Given a new input observation of x = x1, x2, …, xm
o Use g(x) to compute a corresponding output (prediction)

[Diagram: Training data → Learning algorithm → g(X) (Classifier); input X → g(X) → Output (Class)]

The Machine Learning Process