Page 1
cs229.stanford.edu/section/mid_review_sp2020_annotated.pdf

Midterm Reviews (CS 229/ STATS 229)

Zhihan Xiong

Stanford University

15th May, 2020


Page 2

Supervised Learning

Outline

1 Supervised Learning

Discriminative Algorithms

Generative Algorithms

Kernel and SVM

2 Neural Networks

3 Unsupervised Learning

4 Practice Exam Problem (If time permits)


Page 3

Supervised Learning / Discriminative Algorithms

Outline

1 Supervised Learning

Discriminative Algorithms

Generative Algorithms

Kernel and SVM

2 Neural Networks

3 Unsupervised Learning

4 Practice Exam Problem (If time permits)


Page 4

Supervised Learning / Discriminative Algorithms

Optimization Methods

Gradient and Hessian (differentiable function $f : \mathbb{R}^d \mapsto \mathbb{R}$):

\[
\nabla_x f = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \dots & \frac{\partial f}{\partial x_d} \end{bmatrix}^T \in \mathbb{R}^d \quad \text{(Gradient)}
\]

\[
\nabla_x^2 f = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \dots & \frac{\partial^2 f}{\partial x_1 \partial x_d} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_d \partial x_1} & \dots & \frac{\partial^2 f}{\partial x_d^2} \end{bmatrix} \in \mathbb{R}^{d \times d} \quad \text{(Hessian)}
\]

Gradient Descent and Newton's Method (objective function $J(\theta)$):

\[
\theta^{(t+1)} = \theta^{(t)} - \alpha \nabla_\theta J(\theta^{(t)}) \quad \text{(Gradient descent)}
\]

\[
\theta^{(t+1)} = \theta^{(t)} - \left[\nabla_\theta^2 J(\theta^{(t)})\right]^{-1} \nabla_\theta J(\theta^{(t)}) \quad \text{(Newton's method)}
\]
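Below is a minimal NumPy sketch of the two update rules; the gradient `grad_J` and Hessian `hess_J` are hypothetical callables standing in for whatever objective is being minimized.

```python
import numpy as np

def gradient_descent_step(theta, grad_J, alpha):
    # theta^(t+1) = theta^(t) - alpha * grad J(theta^(t))
    return theta - alpha * grad_J(theta)

def newton_step(theta, grad_J, hess_J):
    # theta^(t+1) = theta^(t) - [Hessian]^{-1} grad J(theta^(t)).
    # Solving the linear system avoids forming the inverse explicitly.
    return theta - np.linalg.solve(hess_J(theta), grad_J(theta))
```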


Page 5

Supervised Learning Discriminative Algorithms

Least Square—Gradient Descent

Model: h✓ (x) = ✓T x

Training data:��

x (i), y (i)� n

i=1, x (i) 2 Rd

Loss: J (✓) = 12

Pni=1

�h✓(x (i))� y (i)

�2

Update rule:

✓(t+1)= ✓(t) � ↵

nX

i=1

⇣h✓(x

(i))� y (i)

⌘x (i)

Stochastic Gradient Descent (SGD)

Pick one data point x (i) and then update:

✓(t+1)= ✓(t) � ↵

⇣h✓(x

(i))� y (i)

⌘x (i)
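A minimal sketch of both updates, assuming the data are stacked as an $n \times d$ matrix `X` and an $n$-vector `y`:

```python
import numpy as np

def lstsq_gd_step(theta, X, y, alpha):
    # Full-batch update: theta - alpha * sum_i (h(x_i) - y_i) x_i
    residuals = X @ theta - y            # shape (n,)
    return theta - alpha * (X.T @ residuals)

def lstsq_sgd_step(theta, x_i, y_i, alpha):
    # Same update using a single example (x_i, y_i).
    return theta - alpha * (x_i @ theta - y_i) * x_i
```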


Page 6

Supervised Learning / Discriminative Algorithms

Least Squares—Closed Form

Loss in matrix form: $J(\theta) = \frac{1}{2}\|X\theta - y\|_2^2$, where $X \in \mathbb{R}^{n \times d}$, $y \in \mathbb{R}^n$

Normal equation (set the gradient to 0):
\[
X^T (X\theta^\star - y) = 0
\]

Closed-form solution:
\[
\theta^\star = \left(X^T X\right)^{-1} X^T y
\]

Connection to Newton's method: from any starting point $\theta$,
\[
\theta^\star = \theta - \left[\nabla_\theta^2 J\right]^{-1} \nabla_\theta J(\theta)
\]
Newton's method is exact for a 2nd-order objective, so it reaches $\theta^\star$ in a single step.
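In code, the closed-form solution is best computed by solving the least-squares system rather than inverting $X^T X$; a small sketch:

```python
import numpy as np

def lstsq_closed_form(X, y):
    # np.linalg.lstsq solves min ||X theta - y||^2 via a stable factorization,
    # equivalent to theta = (X^T X)^{-1} X^T y when X has full column rank.
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta
```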


Page 7

Supervised Learning / Discriminative Algorithms

Logistic Regression

A binary classification model with $y^{(i)} \in \{0, 1\}$. Assumed model:
\[
p(y \mid x; \theta) =
\begin{cases}
g_\theta(x) & \text{if } y = 1 \\
1 - g_\theta(x) & \text{if } y = 0
\end{cases},
\quad \text{where } g_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}
\]

Log-likelihood function:
\[
\ell(\theta) = \sum_{i=1}^n \log p(y^{(i)} \mid x^{(i)}; \theta)
= \sum_{i=1}^n \left[y^{(i)} \log g_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - g_\theta(x^{(i)})\right)\right]
\]

Find the parameters by maximizing the log-likelihood, $\max_\theta \ell(\theta)$ (in Pset1).
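A minimal gradient-ascent sketch for $\max_\theta \ell(\theta)$, using the gradient $\nabla_\theta \ell(\theta) = \sum_i \left(y^{(i)} - g_\theta(x^{(i)})\right) x^{(i)}$ derived in Pset1:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_ascent_step(theta, X, y, alpha):
    # Ascent (not descent): we are maximizing the log-likelihood.
    g = sigmoid(X @ theta)               # g_theta(x_i) for every row of X
    return theta + alpha * (X.T @ (y - g))
```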


Page 8

Supervised Learning / Discriminative Algorithms

The Exponential Family

Definition

A probability distribution (with natural parameter $\eta$) whose density (or mass function) can be written in the following form:
\[
p(y; \eta) = b(y) \exp\left(\eta^T T(y) - a(\eta)\right)
\]

Example

Bernoulli distribution:
\[
p(y; \phi) = \phi^y (1 - \phi)^{1-y}
= \exp\left(\left(\log\frac{\phi}{1-\phi}\right) y + \log(1 - \phi)\right)
\]
\[
\implies b(y) = 1, \quad T(y) = y, \quad \eta = \log\frac{\phi}{1-\phi}, \quad a(\eta) = \log(1 + e^\eta)
\]
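A quick numeric sanity check of the rewrite above (a sketch; $\phi = 0.3$ is arbitrary):

```python
import numpy as np

phi = 0.3
eta = np.log(phi / (1 - phi))            # natural parameter
a = np.log(1 + np.exp(eta))              # log-partition function

for y in (0, 1):
    direct = phi**y * (1 - phi)**(1 - y)
    exp_family = 1.0 * np.exp(eta * y - a)   # b(y) = 1, T(y) = y
    assert np.isclose(direct, exp_family)
```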


Page 9

Supervised Learning / Discriminative Algorithms

The Exponential Family

More Examples

Categorical distribution, Poisson distribution, (multivariate) normal distribution, etc.

Properties (in Pset1)
\[
\mathbb{E}[T(Y); \eta] = \nabla_\eta a(\eta), \qquad \mathrm{Var}(T(Y); \eta) = \nabla_\eta^2 a(\eta)
\]

Non-Exponential-Family Distribution

Uniform distribution over the interval $[a, b]$:
\[
p(y; a, b) = \frac{1}{b - a} \cdot 1\{a \le y \le b\}
\]
Reason: $b(y)$ cannot depend on the parameters, but here the indicator $1\{a \le y \le b\}$ would have to.
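For instance, the first property recovers the Bernoulli mean from the log-partition function $a(\eta) = \log(1 + e^\eta)$ derived on the previous slide:
\[
\nabla_\eta a(\eta) = \frac{e^\eta}{1 + e^\eta} = \phi = \mathbb{E}[Y]
\]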



Page 11

Supervised Learning / Discriminative Algorithms

The Generalized Linear Model (GLM)

Components

Assumed model: $p(y \mid x; \theta) \sim \text{ExponentialFamily}(\eta)$ with $\eta = \theta^T x$

Predictor: $h(x) = \mathbb{E}[T(Y); \eta] = \nabla_\eta a(\eta)$

Fitting through maximum likelihood:
\[
\max_\theta \ell(\theta) = \max_\theta \sum_{i=1}^n \log p(y^{(i)} \mid x^{(i)}; \eta)
\]

Examples

GLM under Bernoulli distribution: logistic regression

GLM under Poisson distribution: Poisson regression (in Pset1)

GLM under normal distribution: linear regression

GLM under categorical distribution: softmax regression
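As an illustration of the Poisson case, here is a gradient-ascent sketch. It leans on the canonical-link fact (shown in Pset1) that the log-likelihood gradient of a GLM takes the form $\sum_i \left(y^{(i)} - h(x^{(i)})\right) x^{(i)}$; for Poisson regression, $h(x) = e^{\theta^T x}$.

```python
import numpy as np

def poisson_glm_ascent_step(theta, X, y, alpha):
    # Canonical link for Poisson: h(x) = exp(theta^T x).
    h = np.exp(X @ theta)
    # Same (y - h) x gradient form as logistic regression.
    return theta + alpha * (X.T @ (y - h))
```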


Page 12

Supervised Learning / Generative Algorithms

Outline

1 Supervised Learning

Discriminative Algorithms

Generative Algorithms

Kernel and SVM

2 Neural Networks

3 Unsupervised Learning

4 Practice Exam Problem (If time permits)


Page 13

Supervised Learning / Generative Algorithms

Gaussian Discriminant Analysis (GDA)

Generative Algorithm for Classification

Learn $p(x \mid y)$ and $p(y)$

Classify through Bayes' rule: $\arg\max_y p(y \mid x) = \arg\max_y p(x \mid y)\, p(y)$

GDA Formulation

Assume $p(x \mid y) \sim \mathcal{N}(\mu_y, \Sigma)$ for some $\mu_y \in \mathbb{R}^d$ and $\Sigma \in \mathbb{R}^{d \times d}$

Estimate $\mu_y$, $\Sigma$ and $p(y)$ through maximum likelihood,
\[
\max \sum_{i=1}^n \left[\log p(x^{(i)} \mid y^{(i)}) + \log p(y^{(i)})\right]
\]
which gives
\[
p(y) = \frac{\sum_{i=1}^n 1\{y^{(i)} = y\}}{n}, \quad
\mu_y = \frac{\sum_{i=1}^n 1\{y^{(i)} = y\}\, x^{(i)}}{\sum_{i=1}^n 1\{y^{(i)} = y\}}, \quad
\Sigma = \frac{1}{n} \sum_{i=1}^n (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^T
\]
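These plug-in estimates translate directly into code; a sketch assuming integer labels $0, \dots, C-1$:

```python
import numpy as np

def fit_gda(X, y, num_classes):
    # Maximum-likelihood estimates for GDA with a shared covariance.
    n, d = X.shape
    priors = np.array([(y == c).mean() for c in range(num_classes)])
    means = np.array([X[y == c].mean(axis=0) for c in range(num_classes)])
    centered = X - means[y]              # subtract each row's class mean
    Sigma = centered.T @ centered / n
    return priors, means, Sigma
```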


Page 14

Supervised Learning / Generative Algorithms

Naive Bayes

Formulation

Assume $p(x \mid y) = \prod_{j=1}^d p(x_j \mid y)$

Estimate $p(x_j \mid y)$ and $p(y)$ through maximum likelihood, which gives
\[
p(x_j \mid y) = \frac{\sum_{i=1}^n 1\{x_j^{(i)} = x_j,\; y^{(i)} = y\}}{\sum_{i=1}^n 1\{y^{(i)} = y\}}, \qquad
p(y) = \frac{\sum_{i=1}^n 1\{y^{(i)} = y\}}{n}
\]

Laplace Smoothing

Assuming $x_j$ takes values in $\{1, 2, \dots, k\}$, the corresponding modified estimator is
\[
p(x_j \mid y) = \frac{1 + \sum_{i=1}^n 1\{x_j^{(i)} = x_j,\; y^{(i)} = y\}}{k + \sum_{i=1}^n 1\{y^{(i)} = y\}}
\]
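A small sketch of the smoothed estimator for one feature; note the $k$ probabilities sum to 1 by construction:

```python
import numpy as np

def laplace_smoothed_pxj_given_y(Xj, y, k, label):
    # Xj: length-n array of feature j, with values in {1, ..., k}.
    in_class = (y == label)
    counts = np.array([np.sum(in_class & (Xj == v)) for v in range(1, k + 1)])
    return (1 + counts) / (k + in_class.sum())
```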


Page 15

Supervised Learning / Kernel and SVM

Outline

1 Supervised Learning

Discriminative Algorithms

Generative Algorithms

Kernel and SVM

2 Neural Networks

3 Unsupervised Learning

4 Practice Exam Problem (If time permits)


Page 16

Supervised Learning / Kernel and SVM

Kernel

Motivation

Feature map: $\phi : \mathbb{R}^d \mapsto \mathbb{R}^p$

Fitting a linear model with gradient descent gives us $\theta = \sum_{i=1}^n \beta_i \phi(x^{(i)})$

Predicting a new example $z$: $h_\theta(z) = \sum_{i=1}^n \beta_i \phi(x^{(i)})^T \phi(z) = \sum_{i=1}^n \beta_i K(x^{(i)}, z)$

This brings nonlinearity without much sacrifice in efficiency, as long as $K(\cdot, \cdot)$ can be computed efficiently.

Definition

$K(x, z) : \mathbb{R}^d \times \mathbb{R}^d \mapsto \mathbb{R}$ is a valid kernel if there exists $\phi : \mathbb{R}^d \mapsto \mathbb{R}^p$ for some $p \ge 1$ such that
\[
K(x, z) = \phi(x)^T \phi(z)
\]
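The prediction rule only ever touches the data through $K$; a one-function sketch:

```python
def kernel_predict(betas, X_train, z, K):
    # h(z) = sum_i beta_i K(x_i, z), without ever forming phi explicitly.
    return sum(b * K(x, z) for b, x in zip(betas, X_train))
```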


Page 17

Supervised Learning / Kernel and SVM

Kernel (Continued)

Examples

Polynomial kernels: $K(x, z) = \left(x^T z + c\right)^d$, for all $c \ge 0$ and $d \in \mathbb{N}$

Gaussian kernels: $K(x, z) = \exp\left(-\frac{\|x - z\|_2^2}{2\sigma^2}\right)$, for all $\sigma^2 > 0$

More in Pset2...

Theorem

$K(x, z)$ is a valid kernel if and only if for any set $\{x^{(1)}, \dots, x^{(n)}\}$, its Gram matrix, defined as
\[
G = \begin{bmatrix}
K(x^{(1)}, x^{(1)}) & \dots & K(x^{(1)}, x^{(n)}) \\
\vdots & \ddots & \vdots \\
K(x^{(n)}, x^{(1)}) & \dots & K(x^{(n)}, x^{(n)})
\end{bmatrix} \in \mathbb{R}^{n \times n}
\]
is positive semi-definite.
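The theorem is easy to probe empirically; a sketch that builds a Gaussian-kernel Gram matrix on random points and checks its eigenvalues:

```python
import numpy as np

def gaussian_gram(X, sigma2):
    # G[a, b] = exp(-||x_a - x_b||^2 / (2 sigma^2))
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * sigma2))

X = np.random.randn(50, 3)
G = gaussian_gram(X, sigma2=1.0)
# Eigenvalues of a valid kernel's Gram matrix are nonnegative (up to round-off).
print(np.linalg.eigvalsh(G).min() >= -1e-8)   # True
```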


Page 18

Supervised Learning / Kernel and SVM

SVM

Formulation ($y \in \{-1, 1\}$)

\[
\begin{aligned}
\min_{w, b} \quad & \tfrac{1}{2}\|w\|_2^2 \\
\text{subject to} \quad & y^{(i)}(w^T x^{(i)} + b) \ge 1, \;\forall\, i \in \{1, \dots, n\}
\end{aligned}
\tag{Hard-SVM}
\]

\[
\begin{aligned}
\min_{w, b, \xi} \quad & \tfrac{1}{2}\|w\|_2^2 + C \sum_{i=1}^n \xi_i \\
\text{subject to} \quad & y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi_i, \;\forall\, i \in \{1, \dots, n\} \\
& \xi_i \ge 0, \;\forall\, i \in \{1, \dots, n\}
\end{aligned}
\tag{Soft-SVM}
\]

Properties

The optimal solution has the form $w^\star = \sum_{i=1}^n \alpha_i y^{(i)} x^{(i)}$ and thus can be kernelized.

The soft-SVM can be treated as minimization of the hinge loss plus $\ell_2$ regularization:
\[
\min_{w, b} \sum_{i=1}^n \max\left\{0,\; 1 - y^{(i)}(w^T x^{(i)} + b)\right\} + \lambda \|w\|_2^2
\]
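The hinge-loss view suggests a simple subgradient method; a sketch of one step (only examples with margin below 1 contribute):

```python
import numpy as np

def hinge_subgradient_step(w, b, X, y, lam, alpha):
    # Subgradient of sum_i max{0, 1 - y_i (w^T x_i + b)} + lam ||w||^2,
    # with labels y in {-1, +1}.
    margins = y * (X @ w + b)
    active = margins < 1                     # examples with nonzero hinge loss
    grad_w = -(X[active].T @ y[active]) + 2 * lam * w
    grad_b = -np.sum(y[active])
    return w - alpha * grad_w, b - alpha * grad_b
```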


Page 19

Neural Networks

Outline

1 Supervised Learning

Discriminative Algorithms

Generative Algorithms

Kernel and SVM

2 Neural Networks

3 Unsupervised Learning

4 Practice Exam Problem (If time permits)


Page 20

Neural Networks

Model Formulation

Multi-layer Fully-Connected Neural Network (with activation function $f$)
\[
\begin{aligned}
a^{[1]} &= f\left(W^{[1]} x + b^{[1]}\right) \\
a^{[2]} &= f\left(W^{[2]} a^{[1]} + b^{[2]}\right) \\
&\;\vdots \\
a^{[r-1]} &= f\left(W^{[r-1]} a^{[r-2]} + b^{[r-1]}\right) \\
h_\theta(x) &= a^{[r]} = W^{[r]} a^{[r-1]} + b^{[r]}
\end{aligned}
\]

Possible Activation Functions

ReLU: $f(z) = \mathrm{ReLU}(z) := \max\{z, 0\}$

Sigmoid: $f(z) = \frac{1}{1 + e^{-z}}$

Hyperbolic tangent: $f(z) = \tanh(z) := \frac{e^z - e^{-z}}{e^z + e^{-z}}$
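A minimal forward-pass sketch matching the equations above (activation on every layer except the linear output):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def forward(x, Ws, bs, f=relu):
    # Ws, bs: lists of weight matrices and bias vectors for layers 1..r.
    a = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        a = f(W @ a + b)                 # hidden layers
    return Ws[-1] @ a + bs[-1]           # linear output layer
```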


Page 21

Neural Networks

Backpropagation

Let $J$ be the loss function and $z^{[k]} = W^{[k]} a^{[k-1]} + b^{[k]}$. By the chain rule, we have
\[
\frac{\partial J}{\partial W^{[r]}_{ij}}
= \frac{\partial J}{\partial z^{[r]}_i} \frac{\partial z^{[r]}_i}{\partial W^{[r]}_{ij}}
= \frac{\partial J}{\partial z^{[r]}_i} a^{[r-1]}_j
\implies
\frac{\partial J}{\partial W^{[r]}} = \frac{\partial J}{\partial z^{[r]}}\, a^{[r-1]T}, \qquad
\frac{\partial J}{\partial b^{[r]}} = \frac{\partial J}{\partial z^{[r]}}
\]
\[
\frac{\partial J}{\partial a^{[r-1]}_i}
= \sum_{j=1}^{d_r} \frac{\partial J}{\partial z^{[r]}_j} \frac{\partial z^{[r]}_j}{\partial a^{[r-1]}_i}
= \sum_{j=1}^{d_r} \frac{\partial J}{\partial z^{[r]}_j} W^{[r]}_{ji}
\implies
\frac{\partial J}{\partial a^{[r-1]}} = W^{[r]T} \frac{\partial J}{\partial z^{[r]}}
\]
\[
\frac{\partial J}{\partial z^{[r]}} := \delta^{[r]}
\implies
\frac{\partial J}{\partial z^{[r-1]}} = \left(W^{[r]T} \delta^{[r]}\right) \odot f'\left(z^{[r-1]}\right) := \delta^{[r-1]}
\]
\[
\implies
\frac{\partial J}{\partial W^{[r-1]}} = \delta^{[r-1]} a^{[r-2]T}, \qquad
\frac{\partial J}{\partial b^{[r-1]}} = \delta^{[r-1]}
\]
Continue for layers $r-2, \dots, 1$.
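These recursions map almost line-for-line onto code; a sketch for the architecture of the previous slide, given $\delta^{[r]} = \partial J / \partial z^{[r]}$ at the linear output layer:

```python
import numpy as np

def backprop(x, Ws, bs, dJ_dz_r, f, f_prime):
    # Forward pass, caching pre-activations z^[k] and activations a^[k].
    a, As, Zs = x, [x], []
    for k, (W, b) in enumerate(zip(Ws, bs)):
        z = W @ a + b
        Zs.append(z)
        a = f(z) if k < len(Ws) - 1 else z   # last layer is linear
        As.append(a)
    # Backward pass: delta^[k-1] = (W^[k]^T delta^[k]) * f'(z^[k-1]).
    grads_W, grads_b = [], []
    delta = dJ_dz_r                          # delta^[r] = dJ/dz^[r]
    for k in reversed(range(len(Ws))):
        grads_W.insert(0, np.outer(delta, As[k]))   # dJ/dW^[k] = delta a^[k-1]^T
        grads_b.insert(0, delta)                    # dJ/db^[k] = delta
        if k > 0:
            delta = (Ws[k].T @ delta) * f_prime(Zs[k - 1])
    return grads_W, grads_b
```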


Page 22

Unsupervised Learning

Outline

1 Supervised Learning

Discriminative Algorithms

Generative Algorithms

Kernel and SVM

2 Neural Networks

3 Unsupervised Learning

4 Practice Exam Problem (If time permits)


Page 23

Unsupervised Learning

k-means

Algorithm 1: k-means

Input: training data $\{x^{(1)}, \dots, x^{(n)}\}$; number of clusters $k$
1 Initialize $c^{(1)}, \dots, c^{(k)} \in \mathbb{R}^d$ as cluster centers
2 while not converged do
3   Assign each $x^{(i)}$ to its closest cluster center $c^{(j)}$
4   Take the mean of each cluster as the new cluster center
5 end

Property

k-means approximately minimizes the following loss function:
\[
\min_{\{c^{(1)}, \dots, c^{(k)}\}} \sum_{i=1}^n \left\|x^{(i)} - c^{(j(i))}\right\|_2^2,
\quad \text{where } j(i) = \operatorname*{arg\,min}_{j' \in \{1, \dots, k\}} \left\|x^{(i)} - c^{(j')}\right\|_2^2
\]
However, it is not guaranteed to find the global minimum.
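A direct NumPy sketch of Algorithm 1 (it ignores the empty-cluster edge case for brevity):

```python
import numpy as np

def kmeans(X, k, num_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(num_iters):
        # Assignment step: squared distance from every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each center moves to the mean of its cluster.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break                        # converged
        centers = new_centers
    return centers, labels
```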


Page 24

Unsupervised Learning

Gaussian Mixture Model (GMM)

Formulation

Assume each data point $x^{(i)}$ is generated independently through the following procedure:

1. Sample $z^{(i)} \sim \text{Multinomial}(\phi)$, where $\sum_{j=1}^k \phi_j = 1$
2. Sample $x^{(i)} \sim \mathcal{N}(\mu_{z^{(i)}}, \Sigma_{z^{(i)}})$

How can we estimate the parameters $\phi$, $\{\mu_1, \dots, \mu_k\}$ and $\{\Sigma_1, \dots, \Sigma_k\}$ if $z^{(i)}$ cannot be observed?

Maximum Likelihood
\[
\ell(\theta) = \sum_{i=1}^n \log\left(\sum_{j=1}^k \phi_j\, p(x^{(i)}; \mu_j, \Sigma_j)\right),
\]
where
\[
p(x^{(i)}; \mu_j, \Sigma_j) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_j|}} \exp\left(-\frac{1}{2}(x^{(i)} - \mu_j)^T \Sigma_j^{-1} (x^{(i)} - \mu_j)\right)
\]
This is too complicated to optimize directly!


Page 25

Unsupervised Learning

Expectation-Maximization (EM)

Jensen's Inequality

By Jensen's inequality, for any distribution $Q_i$ over $\{1, \dots, k\}$, we have
\[
\sum_{i=1}^n \log\left(\sum_{j=1}^k Q_i(j)\, \frac{p(x^{(i)}, z^{(i)} = j; \theta)}{Q_i(j)}\right)
\ge \sum_{i=1}^n \sum_{j=1}^k Q_i(j) \log \frac{p(x^{(i)}, z^{(i)} = j; \theta)}{Q_i(j)}
:= \text{ELBO}(\theta)
\]

Theorem

If we take $Q_i(j) = p(z^{(i)} = j \mid x^{(i)}; \theta^{(t)})$ and let $\theta^{(t+1)} := \arg\max_\theta \text{ELBO}(\theta)$, we then have $\ell(\theta^{(t+1)}) \ge \ell(\theta^{(t)})$ (proved in lecture).

Algorithm 2: EM Algorithm

Input: training data $\{x^{(1)}, \dots, x^{(n)}\}$
1 Initialize $\theta^{(0)}$ by some random guess
2 for $t = 0, 1, 2, \dots$ do
3   Set $Q_i(j) = p(z^{(i)} = j \mid x^{(i)}; \theta^{(t)})$ for each $i, j$ // E-step
4   Set $\theta^{(t+1)} = \arg\max_\theta \text{ELBO}(\theta)$ // M-step
5 end


Page 26

Unsupervised Learning

EM in GMM

Posterior of $z^{(i)}$
\[
p(z^{(i)} = j \mid x^{(i)}; \theta^{(t)}) = \frac{\phi_j^{(t)}\, p(x^{(i)}; \mu_j^{(t)}, \Sigma_j^{(t)})}{\sum_{j'=1}^k \phi_{j'}^{(t)}\, p(x^{(i)}; \mu_{j'}^{(t)}, \Sigma_{j'}^{(t)})}
\]

GMM Update Rules

Defining $w_j^{(i)} = p(z^{(i)} = j \mid x^{(i)}; \theta^{(t)})$, we have, for all $j \in \{1, \dots, k\}$,
\[
\phi_j^{(t+1)} = \frac{\sum_{i=1}^n w_j^{(i)}}{n}, \qquad
\mu_j^{(t+1)} = \frac{\sum_{i=1}^n w_j^{(i)} x^{(i)}}{\sum_{i=1}^n w_j^{(i)}}, \qquad
\Sigma_j^{(t+1)} = \frac{\sum_{i=1}^n w_j^{(i)} (x^{(i)} - \mu_j^{(t+1)})(x^{(i)} - \mu_j^{(t+1)})^T}{\sum_{i=1}^n w_j^{(i)}}
\]
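One full E-step plus M-step in NumPy (a sketch; SciPy's multivariate_normal supplies the Gaussian density):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm_step(X, phi, mus, Sigmas):
    # X: (n, d); phi: (k,); mus: (k, d); Sigmas: (k, d, d)
    n, k = len(X), len(phi)
    # E-step: responsibilities w[i, j] = p(z_i = j | x_i; theta).
    w = np.column_stack([
        phi[j] * multivariate_normal.pdf(X, mean=mus[j], cov=Sigmas[j])
        for j in range(k)
    ])
    w /= w.sum(axis=1, keepdims=True)
    # M-step: the weighted updates from the slide.
    Nj = w.sum(axis=0)                      # effective cluster sizes
    phi_new = Nj / n
    mus_new = (w.T @ X) / Nj[:, None]
    Sigmas_new = np.stack([
        ((X - mus_new[j]).T * w[:, j]) @ (X - mus_new[j]) / Nj[j]
        for j in range(k)
    ])
    return phi_new, mus_new, Sigmas_new
```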


Page 27

Practice Exam Problem (If time permits)

Outline

1 Supervised Learning

Discriminative Algorithms

Generative Algorithms

Kernel and SVM

2 Neural Networks

3 Unsupervised Learning

4 Practice Exam Problem (If time permits)
