Machine Learning, CUNY Graduate Center

Lecture 3: Linear Regression

Feb 24, 2016

Transcript
Page 1: Lecture 3: Linear Regression

Machine LearningCUNY Graduate Center

Lecture 3: Linear Regression

Page 2: Lecture 3: Linear Regression

2

Today

• Calculus
  – Lagrange Multipliers

• Linear Regression

Page 3: Lecture 3: Linear Regression

3

Optimization with constraints

• What if I want to constrain the parameters of the model?
  – E.g., the mean is less than 10.

• Find the best likelihood, subject to a constraint.

• Two functions:
  – An objective function to maximize
  – An inequality that must be satisfied

Page 4: Lecture 3: Linear Regression

4

Lagrange Multipliers

• Find maxima of f(x,y) subject to a constraint.

Page 5: Lecture 3: Linear Regression

5

General form

• Maximizing:

• Subject to:

• Introduce a new variable, and find a maximum.
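
The equations on this slide did not survive the transcript; the standard form of the method, with an equality constraint g(x, y) = 0 and λ as the new variable, looks like this:

```latex
% Maximize f(x, y) subject to the constraint g(x, y) = 0.
% Introduce the multiplier \lambda and form the Lagrangian:
\Lambda(x, y, \lambda) = f(x, y) + \lambda\, g(x, y)
% A constrained extremum satisfies the stationarity conditions:
\nabla_{x, y} \Lambda = 0, \qquad \frac{\partial \Lambda}{\partial \lambda} = g(x, y) = 0
```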

Page 6: Lecture 3: Linear Regression

6

Example

• Maximizing:

• Subject to:

• Introduce a new variable, and find a maximum.

Page 7: Lecture 3: Linear Regression

7

Example

Now we have 3 equations with 3 unknowns.

Page 8: Lecture 3: Linear Regression

8

Example

Eliminate lambda, substitute, and solve.
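
The slide's own equations did not survive the transcript; as an illustration of the eliminate-lambda, substitute, and solve steps, here is a small example of the same shape (not necessarily the one used in class): maximize f(x, y) = x + y subject to x² + y² = 1.

```latex
\Lambda(x, y, \lambda) = x + y + \lambda\,(x^2 + y^2 - 1)
% Three equations, three unknowns:
\partial_x \Lambda = 1 + 2\lambda x = 0, \qquad
\partial_y \Lambda = 1 + 2\lambda y = 0, \qquad
\partial_\lambda \Lambda = x^2 + y^2 - 1 = 0
% Eliminate lambda: the first two equations give x = y = -1/(2\lambda).
% Substitute and solve: 2x^2 = 1, so x = y = 1/\sqrt{2} at the maximum.
```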

Page 9: Lecture 3: Linear Regression

9

Basics of Linear Regression

• Regression algorithm
• Supervised technique
• In one dimension:
  – Identify
• In D dimensions:
  – Identify
• Given training data:
  – And targets:
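
The regression functions and data notation on this slide were equation images that did not survive the transcript; assuming the standard notation used later in the lecture, they are:

```latex
% One dimension: identify
f(x) = w_0 + w_1 x
% D dimensions: identify
f(\mathbf{x}) = w_0 + \mathbf{w}^{\mathsf{T}} \mathbf{x}
% Training data and targets:
\{\mathbf{x}_n\}_{n=1}^{N}, \qquad \{t_n\}_{n=1}^{N}
```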

Page 10: Lecture 3: Linear Regression

10

Graphical Example of Regression

?

Page 11: Lecture 3: Linear Regression

11

Graphical Example of Regression

Page 12: Lecture 3: Linear Regression

12

Graphical Example of Regression

Page 13: Lecture 3: Linear Regression

13

Definition

• In linear regression, we assume that the model that generates the data involves only a linear combination of the input variables,

where w is a vector of weights that defines the D parameters of the model.
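
The model equation itself was an image on the slide; in the standard notation (an assumption, consistent with the slide text) it is:

```latex
y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \dots + w_D x_D = \mathbf{w}^{\mathsf{T}} \mathbf{x}
% (with x augmented by a constant 1 so that w_0 is absorbed into the inner product)
```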

Page 14: Lecture 3: Linear Regression

14

Evaluation

• How can we evaluate the performance of a regression solution?

• Error functions (or loss functions)
  – Squared error
  – Linear error
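
The two losses were shown as equations on the slide; in standard form, with target t_n and prediction y(x_n, w), they are:

```latex
% Squared error
E_{\text{sq}}(\mathbf{w}) = \sum_{n=1}^{N} \left( t_n - y(\mathbf{x}_n, \mathbf{w}) \right)^2
% Linear (absolute) error
E_{\text{abs}}(\mathbf{w}) = \sum_{n=1}^{N} \left| t_n - y(\mathbf{x}_n, \mathbf{w}) \right|
```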

Page 15: Lecture 3: Linear Regression

15

Regression Error

Page 16: Lecture 3: Linear Regression

16

Empirical Risk

• Empirical risk is the measure of the loss from data.

• By minimizing risk on the training data, we optimize the fit with respect to the loss function.
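
A common way to write this (my notation; the slide's equation is not in the transcript), with loss L and N training points:

```latex
R_{\text{emp}}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} L\!\left( t_n,\; y(\mathbf{x}_n, \mathbf{w}) \right)
```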

Page 17: Lecture 3: Linear Regression

17

Model Likelihood and Empirical Risk

• Two related but distinct ways to look at a model:

1. Model Likelihood
   – “What is the likelihood that a model generated the observed data?”

2. Empirical Risk
   – “How much error does the model have on the training data?”

Page 18: Lecture 3: Linear Regression

18

Model Likelihood

Assuming independent and identically distributed (i.i.d.) data.
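
The likelihood itself was an equation image; under the i.i.d. assumption it factors over the training points:

```latex
p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}) = \prod_{n=1}^{N} p(t_n \mid \mathbf{x}_n, \mathbf{w})
```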

Page 19: Lecture 3: Linear Regression

19

Understanding Model Likelihood

Substitute the equation of a Gaussian.

Apply a log function.

Let the log dissolve products into sums.

Page 20: Lecture 3: Linear Regression

20

Understanding Model Likelihood

Optimize the weights (Maximum Likelihood Estimation).

Log Likelihood

Empirical Risk w/ Squared Loss Function
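
The equations on this and the preceding slide are images; assuming a Gaussian noise model with variance σ², the standard steps are:

```latex
% Gaussian noise model for each target:
p(t_n \mid \mathbf{x}_n, \mathbf{w}) = \mathcal{N}\!\left(t_n \mid y(\mathbf{x}_n, \mathbf{w}),\, \sigma^2\right)
% The log of the i.i.d. product becomes a sum:
\ln p(\mathbf{t} \mid \mathbf{X}, \mathbf{w})
  = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} \left(t_n - y(\mathbf{x}_n, \mathbf{w})\right)^2
    - \frac{N}{2} \ln\!\left(2\pi\sigma^2\right)
% Maximizing the log likelihood in w is therefore the same as
% minimizing the empirical risk with a squared loss function.
```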

Page 21: Lecture 3: Linear Regression

21

Maximizing Log Likelihood (1-D)

• Find the optimal settings of w.

Page 22: Lecture 3: Linear Regression

22

Maximizing Log Likelihood

Partial derivative

Set to zero

Separate the sum to isolate w0

Page 23: Lecture 3: Linear Regression

23

Maximizing Log Likelihood

Partial derivative

Set to zero

Separate the sum to isolate w0

Page 24: Lecture 3: Linear Regression

24

Maximizing Log Likelihood

From previous partial

From prev. slide

Substitute

Isolate w1
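
The algebra on slides 22-24 is in image form; carrying the standard 1-D derivation for y(x) = w_0 + w_1 x to the end gives the familiar closed form, with x̄ and t̄ the sample means:

```latex
\frac{\partial}{\partial w_0} \sum_n (t_n - w_0 - w_1 x_n)^2 = 0
  \;\Rightarrow\; w_0 = \bar{t} - w_1 \bar{x}
\\
\frac{\partial}{\partial w_1} \sum_n (t_n - w_0 - w_1 x_n)^2 = 0
  \;\Rightarrow\; w_1 = \frac{\sum_n (x_n - \bar{x})(t_n - \bar{t})}{\sum_n (x_n - \bar{x})^2}
```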

Page 25: Lecture 3: Linear Regression

25

Maximizing Log Likelihood

• Clean and easy.

• Or not…

• Apply some linear algebra.

Page 26: Lecture 3: Linear Regression

26

Likelihood using linear algebra

• Representing the linear regression function in terms of vectors.

Page 27: Lecture 3: Linear Regression

27

Likelihood using linear algebra

• Stack the xᵀ vectors into a matrix of data points, X.

Representation as vectors

Stack the data into a matrix and use the norm operation to handle the sum.
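
Written out (assuming the rows of X are the xᵀ vectors and t stacks the targets):

```latex
% Each prediction is an inner product:
y(\mathbf{x}_n, \mathbf{w}) = \mathbf{w}^{\mathsf{T}} \mathbf{x}_n
% Stacking the \mathbf{x}_n^{\mathsf{T}} as rows of X and the targets into t,
% the summed squared error becomes a norm:
E(\mathbf{w}) = \left\lVert \mathbf{t} - \mathbf{X}\mathbf{w} \right\rVert^2
              = (\mathbf{t} - \mathbf{X}\mathbf{w})^{\mathsf{T}} (\mathbf{t} - \mathbf{X}\mathbf{w})
```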

Page 28: Lecture 3: Linear Regression

28

Likelihood in multiple dimensions

• This representation of risk has no inherent dimensionality.

Page 29: Lecture 3: Linear Regression

29

Maximum Likelihood Estimation redux

Decompose the norm (FOIL, linear algebra style)

Differentiate

Combine terms

Isolate w
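
The steps named on this slide, written out (a standard derivation; the slide's own equations are not in the transcript):

```latex
% Decompose the norm (FOIL):
E(\mathbf{w}) = \mathbf{t}^{\mathsf{T}}\mathbf{t}
  - 2\,\mathbf{w}^{\mathsf{T}}\mathbf{X}^{\mathsf{T}}\mathbf{t}
  + \mathbf{w}^{\mathsf{T}}\mathbf{X}^{\mathsf{T}}\mathbf{X}\mathbf{w}
% Differentiate and set to zero:
\nabla_{\mathbf{w}} E = -2\,\mathbf{X}^{\mathsf{T}}\mathbf{t} + 2\,\mathbf{X}^{\mathsf{T}}\mathbf{X}\mathbf{w} = 0
% Isolate w (the normal equations):
\mathbf{w} = \left(\mathbf{X}^{\mathsf{T}}\mathbf{X}\right)^{-1} \mathbf{X}^{\mathsf{T}} \mathbf{t}
```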

Page 30: Lecture 3: Linear Regression

30

Extension to polynomial regression

Page 31: Lecture 3: Linear Regression

31

Extension to polynomial regression

• Polynomial regression is the same as linear regression in D dimensions

Page 32: Lecture 3: Linear Regression

32

Generate new features

Standard polynomial with coefficients, w

Risk

Page 33: Lecture 3: Linear Regression

33

Generate new features

Feature trick: to fit a D-dimensional polynomial, create a D-element vector from xi.

Then standard linear regression in D dimensions.
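
A minimal numpy sketch of the feature trick, using the closed-form solution from slide 29; the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def polynomial_features(x, degree):
    """Map each scalar x_i to the vector (1, x_i, x_i^2, ..., x_i^degree)."""
    return np.vander(x, N=degree + 1, increasing=True)

def fit_linear_regression(Phi, t):
    """Least-squares weights via the normal equations: w = (Phi^T Phi)^{-1} Phi^T t."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Example: fit a cubic to noisy samples of a smooth function.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.shape)

Phi = polynomial_features(x, degree=3)   # "new features" built from the 1-D input
w = fit_linear_regression(Phi, t)        # standard linear regression in D dimensions
print(w)
```

np.linalg.solve is used rather than explicitly inverting ΦᵀΦ, which is the numerically safer way to apply the normal equations.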

Page 34: Lecture 3: Linear Regression

34

How is this still linear regression?

• The regression is linear in the parameters, despite projecting xi from one dimension to D dimensions.

• Now we fit a plane (or hyperplane) to a representation of xi in a higher dimensional feature space.

• This generalizes to any set of functions

Page 35: Lecture 3: Linear Regression

35

Basis functions as feature extraction

• These functions are called basis functions.
  – They define the bases of the feature space.

• They allow a linear decomposition of any type of function fit to the data points.

• Common choices:
  – Polynomial
  – Gaussian
  – Sigmoid
  – Wave functions (sine, etc.)
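
In general form (the slide's equations are not in the transcript), the model becomes a linear combination of M basis functions φ_j, for example polynomial or Gaussian bases:

```latex
y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j\, \phi_j(\mathbf{x}) = \mathbf{w}^{\mathsf{T}} \boldsymbol{\phi}(\mathbf{x}),
\qquad
\phi_j(x) = x^j \;\text{(polynomial)}, \quad
\phi_j(x) = \exp\!\left(-\frac{(x - \mu_j)^2}{2 s^2}\right) \;\text{(Gaussian)}
```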

Page 36: Lecture 3: Linear Regression

36

Training data vs. Testing Data

• Evaluating the performance of a classifier on training data is meaningless.

• With enough parameters, a model can simply memorize (encode) every training point

• To evaluate performance, data is divided into training and testing (or evaluation) data.
  – Training data is used to learn model parameters.
  – Testing data is used to evaluate performance.

Page 37: Lecture 3: Linear Regression

37

Overfitting

Page 38: Lecture 3: Linear Regression

38

Overfitting

Page 39: Lecture 3: Linear Regression

39

Overfitting performance

Page 40: Lecture 3: Linear Regression

40

Definition of overfitting

• When the model describes the noise, rather than the signal.

• How can you tell the difference between overfitting and a bad model?

Page 41: Lecture 3: Linear Regression

41

Possible detection of overfitting

• Stability
  – An appropriately fit model is stable under different samples of the training data.
  – An overfit model generates inconsistent performance.

• Performance
  – A good model has low test error.
  – A bad model has high test error.

Page 42: Lecture 3: Linear Regression

42

What is the optimal model size?

• The best model size is the one that generalizes best to unseen data.

• Approximate this by testing error.
• One way to optimize parameters is to minimize testing error.
  – This operation uses testing data as tuning or development data.
  – Sacrifices training data in favor of parameter optimization.
• Can we do this without explicit evaluation data?

Page 43: Lecture 3: Linear Regression

43

Context for linear regression

• Simple approach
• Efficient learning
• Extensible
• Regularization provides robust models

Page 44: Lecture 3: Linear Regression

44

Break

Coffee. Stretch.

Page 45: Lecture 3: Linear Regression

45

Linear Regression

• Identify the best parameters, w, for a regression function

Page 46: Lecture 3: Linear Regression

46

Overfitting

• Recall: overfitting happens when a model is capturing idiosyncrasies of the data rather than generalities.
  – Often caused by too many parameters relative to the amount of training data.
  – E.g., an order-N polynomial can intersect any N+1 data points.

Page 47: Lecture 3: Linear Regression

47

Dealing with Overfitting

• Use more data
• Use a tuning set
• Regularization
• Be a Bayesian

Page 48: Lecture 3: Linear Regression

48

Regularization

• In a linear regression model, overfitting is characterized by large weights.

Page 49: Lecture 3: Linear Regression

49

Penalize large weights

• Introduce a penalty term in the loss function.

Regularized Regression (L2-Regularization or Ridge Regression)
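
In standard form (the slide's equations are images), the L2-regularized objective and its closed-form solution are:

```latex
E(\mathbf{w}) = \left\lVert \mathbf{t} - \mathbf{X}\mathbf{w} \right\rVert^2
              + \lambda \left\lVert \mathbf{w} \right\rVert^2
\qquad\Rightarrow\qquad
\mathbf{w} = \left(\mathbf{X}^{\mathsf{T}}\mathbf{X} + \lambda \mathbf{I}\right)^{-1} \mathbf{X}^{\mathsf{T}} \mathbf{t}
```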

Page 50: Lecture 3: Linear Regression

50

Regularization Derivation

Page 51: Lecture 3: Linear Regression

51

Page 52: Lecture 3: Linear Regression

52

Regularization in Practice

Page 53: Lecture 3: Linear Regression

53

Regularization Results

Page 54: Lecture 3: Linear Regression

54

More regularization

• The penalty term defines the styles of regularization

• L2-Regularization
• L1-Regularization
• L0-Regularization
  – The L0-norm is the optimal subset of features.
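
The three penalties differ only in which norm of w is added to the loss; schematically:

```latex
\text{L2: } \lambda \sum_d w_d^2, \qquad
\text{L1: } \lambda \sum_d |w_d|, \qquad
\text{L0: } \lambda \sum_d \mathbb{1}[w_d \neq 0]
```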

Page 55: Lecture 3: Linear Regression

55

Curse of dimensionality

• Increasing the dimensionality of the features increases the data requirements exponentially.
• For example, if a single feature can be accurately approximated with 100 data points, optimizing the joint over two features requires 100*100 data points.
• Models should be small relative to the amount of available data.
• Dimensionality reduction techniques (feature selection) can help.
  – L0-regularization is explicit feature selection.
  – L1- and L2-regularization approximate feature selection.

Page 56: Lecture 3: Linear Regression

56

Bayesians vs. Frequentists

• What is a probability?

• Frequentists
  – A probability is the likelihood that an event will happen.
  – It is approximated by the ratio of the number of observed events to the number of total events.
  – Assessment is vital to selecting a model.
  – Point estimates are absolutely fine.

• Bayesians
  – A probability is a degree of believability of a proposition.
  – Bayesians require that probabilities be prior beliefs conditioned on data.
  – The Bayesian approach “is optimal”, given a good model, a good prior, and a good loss function. Don’t worry so much about assessment.
  – If you are ever making a point estimate, you’ve made a mistake. The only valid probabilities are posteriors based on evidence given some prior.

Page 57: Lecture 3: Linear Regression

57

Bayesian Linear Regression

• The previous MLE derivation of linear regression uses point estimates for the weight vector, w.

• Bayesians say, “hold it right there”.
  – Use a prior distribution over w to estimate parameters.

• Alpha is a hyperparameter of the prior over w: its precision, or inverse variance.

• Now optimize:
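
Written out in the usual notation (the slide's equations are images), the prior and the quantity to optimize are:

```latex
% Gaussian prior over the weights with precision \alpha:
p(\mathbf{w} \mid \alpha) = \mathcal{N}\!\left(\mathbf{w} \mid \mathbf{0},\, \alpha^{-1}\mathbf{I}\right)
% Optimize the posterior, which by Bayes' rule is proportional to
% likelihood times prior:
p(\mathbf{w} \mid \mathbf{t}, \mathbf{X}, \alpha) \;\propto\;
  p(\mathbf{t} \mid \mathbf{X}, \mathbf{w})\; p(\mathbf{w} \mid \alpha)
```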

Page 58: Lecture 3: Linear Regression

58

Optimize the Bayesian posterior

As usual it’s easier to optimize after a log transform.

Page 59: Lecture 3: Linear Regression

59

Optimize the Bayesian posterior

As usual it’s easier to optimize after a log transform.

Page 60: Lecture 3: Linear Regression

60

Optimize the Bayesian posterior

Ignoring terms that do not depend on w

IDENTICAL formulation to L2-regularization
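
Taking the log and dropping terms that do not depend on w makes the correspondence explicit, assuming the Gaussian likelihood with variance σ² used earlier:

```latex
\ln p(\mathbf{w} \mid \mathbf{t}, \mathbf{X}, \alpha)
  = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} \left(t_n - \mathbf{w}^{\mathsf{T}}\mathbf{x}_n\right)^2
    - \frac{\alpha}{2}\,\mathbf{w}^{\mathsf{T}}\mathbf{w}
    + \text{const}
% Maximizing this in w minimizes squared error plus an L2 penalty
% with \lambda = \alpha\sigma^2: the same objective as ridge regression.
```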

Page 61: Lecture 3: Linear Regression

61

Context

• Overfitting is bad.

• Bayesians vs. Frequentists
  – Is one better?
  – Machine Learning uses techniques from both camps.

Page 62: Lecture 3: Linear Regression

62

Next Time

• Logistic Regression

• Read Sections 4.1 and 4.3