Linear Regression Anna Leontjeva [email protected]

Linear Regression - ut · 2012. 2. 13. · Quiz: What does it mean: linear? 0 10 20 30 40 50 60 70 80 90 1st Qtr 2nd Qtr 3rd Qtr 4th Qtr East West North. Dummy variables • Are binary

Sep 16, 2020


dariahiddleston
Page 1:

Linear Regression Anna Leontjeva

[email protected]

Page 2:

Quiz: Which of the following is most related to linear regression?

1) Information Gain
2) Linear Atavism
3) Regression to Mean
4) Method of Least Squares

Page 3:

Introduction to Linear Regression

Linear regression is an approach to modeling the relationship between a response variable Y and one or more explanatory variables X (predictors), i.e., regression is the study of dependence.

The response variable Y must be continuous.

The case of one explanatory variable is called simple regression.

The case of more than one explanatory variable is called multiple regression.

Page 4:

Scatterplot

Page 5:

Scatterplot

Page 6:

Short Quiz

Sketch on each plot what you think is the best-fitting line for predicting y from x.

[Figure: four plots, labeled 1) to 4)]

Page 7:

Short quiz

Plot   y prediction   Residual sum of squares
1
2
3
4

Page 8:

• Mark a cross at the average y-value for each x and draw the best-fitting line through the crosses.
• Re-compute the y prediction and the sum of squared errors.

[Figure: four plots, labeled 1) to 4)]

Page 9:

Linear Regression Function

Page 11:

Linear Regression Function

Mean function

Intercept

Slope

The intercept and slope are unknown; we want to estimate them from the data.
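The formula that these labels annotate did not survive as text. A likely reading, assuming the standard simple-regression notation (the symbols \beta_0, \beta_1 and x are assumptions, not taken from the slide), is:

```latex
\[
  \mathrm{E}(Y \mid X = x) = \beta_0 + \beta_1 x
\]
```

Under this reading, \beta_0 is the intercept (the mean of Y at x = 0) and \beta_1 is the slope (the change in the mean of Y per unit change in x).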

Page 12:

Linear regression function

Page 13:

Residuals (Errors)

Page 14:

Objective function

Residual sum of squares (RSS, also written SSE): the sum of the squared residuals.

Ordinary Least Squares (OLS): choose the intercept and slope that minimize the RSS.
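A minimal sketch of the objective function in R, assuming synthetic data (the numbers below are illustrative and are not the data behind the slides). The helper `rss()` is a hypothetical name introduced here; it computes the residual sum of squares for any candidate line, so the OLS line can be compared with a slightly shifted one.

```r
# Synthetic data (illustrative only; not the data from the slides)
set.seed(1)
x <- runif(50, 150, 200)
y <- 76 + 0.54 * x + rnorm(50, sd = 5)

# Residual sum of squares of the candidate line y = b0 + b1 * x
rss <- function(b0, b1, x, y) {
  residuals <- y - (b0 + b1 * x)
  sum(residuals^2)
}

fit <- lm(y ~ x)
rss_ols   <- rss(coef(fit)[1], coef(fit)[2], x, y)
rss_other <- rss(coef(fit)[1] + 1, coef(fit)[2], x, y)  # same slope, shifted intercept

rss_ols < rss_other  # TRUE: any other line has a larger RSS than the OLS line
```

For a fitted `lm` object, `deviance(fit)` returns the same RSS value.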

Page 15:

Minimization
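The derivation on this slide was lost in extraction. A sketch of the usual argument, assuming the notation b_0 and b_1 for the estimated intercept and slope (as in the R example on the following slides), sets both partial derivatives of the RSS to zero:

```latex
\begin{align*}
\mathrm{RSS}(b_0, b_1) &= \sum_{i=1}^{n} \bigl(y_i - b_0 - b_1 x_i\bigr)^2 \\
\frac{\partial\,\mathrm{RSS}}{\partial b_0} &= -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0
  \;\Longrightarrow\; b_0 = \bar{y} - b_1 \bar{x} \\
\frac{\partial\,\mathrm{RSS}}{\partial b_1} &= -2 \sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i) = 0
  \;\Longrightarrow\; b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\end{align*}
```

The second line uses the first result: substituting b_0 = \bar{y} - b_1 \bar{x} and noting that \sum_i \bar{x}(y_i - \bar{y}) = 0 gives the centered form of the slope.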

Page 17:

Example

# Slope: sum of cross-deviations over the sum of squared deviations of x
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b1
# [1] 0.541747

# Intercept: the fitted line passes through (mean(x), mean(y))
b0 <- mean(y) - b1 * mean(x)
b0
# [1] 75.99029

# Equivalent form of the slope
b1 <- cov(x, y) / var(x)
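The three routes to the coefficients can be checked against each other. The sketch below assumes synthetic height-like data, since the `x` and `y` behind the slide are not available; the variable names `b1_sums` and `b1_cov` are introduced here for illustration only.

```r
# Synthetic heights in cm (illustrative only; not the data from the slides)
set.seed(42)
x <- rnorm(100, mean = 178, sd = 7)
y <- 76 + 0.54 * x + rnorm(100, sd = 4)

b1_sums <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b1_cov  <- cov(x, y) / var(x)   # the (n - 1) factors cancel
b0      <- mean(y) - b1_sums * mean(x)

fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0, b1_sums))  # TRUE
all.equal(b1_sums, b1_cov)                    # TRUE
```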

Page 18:

Example

lm(y ~ x)

Page 19:

Example

Page 20:

Example

y = 75.99 + 0.54 M_height_cm

Page 21:

Multiple regression

Usually we have more than one variable:

or in matrix notation:
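The two formulas on this slide were lost in extraction. A plausible reconstruction, chosen to agree with the dimensions given on the next slide (n observations, p explanatory variables, and an error vector e), is:

```latex
\[
  y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + e_i,
  \qquad i = 1, \dots, n
\]
\[
  Y = X\beta + e
\]
```

Here the first column of X is assumed to be a column of ones, so that \beta_0 acts as the intercept.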

Page 22:

Matrix notation

n observations, p explanatory variables:

dim(Y) = n × 1, dim(X) = n × (p+1), dim(β) = (p+1) × 1, dim(e) = n × 1
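These dimensions can be checked in R. The sketch below assumes synthetic data with n = 20 and p = 2; `model.matrix()` builds the design matrix that `lm()` uses internally, including the column of ones that accounts for the "+1".

```r
# Synthetic data with n = 20 observations and p = 2 explanatory variables
set.seed(7)
n  <- 20
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

# model.matrix() prepends the column of ones for the intercept
X <- model.matrix(~ x1 + x2)
dim(X)                   # 20 3, i.e. n x (p + 1)
head(X, 3)

fit <- lm(y ~ x1 + x2)
length(coef(fit))        # 3, i.e. p + 1
length(residuals(fit))   # 20, i.e. n
```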

Page 23:

OLS for multiple regression
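The formula on this slide was lost in extraction. A sketch of the standard result, consistent with the R expression `solve(t(X) %*% X) %*% t(X) %*% y` used in the example that follows and assuming X^T X is invertible:

```latex
\begin{align*}
\mathrm{RSS}(\beta) &= (Y - X\beta)^{\top}(Y - X\beta) \\
\frac{\partial\,\mathrm{RSS}}{\partial \beta} &= -2 X^{\top}(Y - X\beta) = 0
  \;\Longrightarrow\; X^{\top} X \hat{\beta} = X^{\top} Y \\
\hat{\beta} &= (X^{\top} X)^{-1} X^{\top} Y
\end{align*}
```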

Page 25:

Example

# Normal equations; X includes a column of ones for the intercept
b <- solve(t(X) %*% X) %*% t(X) %*% y
# Same estimate via the pseudoinverse; ginv() is in the MASS package
b <- MASS::ginv(X) %*% y
# lm() adds the intercept itself, so here X should hold only the predictor columns
lm(y ~ X)
lm(y ~ x1 + x2)
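A runnable comparison of these approaches, assuming synthetic data and the MASS package (one of R's recommended packages). The names `b_solve`, `b_ginv`, `b_lm` and `b_stable` are introduced here for illustration.

```r
library(MASS)  # provides ginv()

set.seed(3)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(n)

X <- cbind(1, x1, x2)  # design matrix with an explicit column of ones

b_solve <- solve(t(X) %*% X) %*% t(X) %*% y
b_ginv  <- ginv(X) %*% y
b_lm    <- coef(lm(y ~ x1 + x2))

# All three give the same coefficients up to rounding error
cbind(solve = as.vector(b_solve), ginv = as.vector(b_ginv), lm = unname(b_lm))

# Avoids forming the explicit inverse: solve the normal equations directly
b_stable <- solve(crossprod(X), crossprod(X, y))
```

In practice `lm()` is usually the better choice, since it relies on a QR decomposition rather than inverting X^T X.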

Page 26:

Types of predictors

• The intercept (the model can be fitted with or without it):
  lm(y ~ x1 + x2 - 1)
• Transformations of predictors:
  lm(y ~ x1 + log(x2))
• Polynomials:
  lm(y ~ x1 + I(x2^2))
• Interactions and other combinations of predictors (in formula syntax, x1/x2 expands to x1 + x1:x2; it is not a division):
  lm(y ~ x1/x2)
• Dummy variables and factors:
  lm(y ~ is_male)
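To see which terms each formula actually produces, one can inspect the coefficient names. The sketch below assumes a synthetic data frame `d`; the helper `terms_of()` is a hypothetical name introduced here.

```r
set.seed(5)
d <- data.frame(x1 = runif(30, 1, 10), x2 = runif(30, 1, 10))
d$is_male <- rbinom(30, 1, 0.5)
d$y <- 1 + 0.5 * d$x1 + 2 * log(d$x2) + 3 * d$is_male + rnorm(30)

# Names of the coefficients that a formula gives rise to
terms_of <- function(f) names(coef(lm(f, data = d)))

terms_of(y ~ x1 + x2 - 1)      # "x1" "x2"                         (no intercept)
terms_of(y ~ x1 + log(x2))     # "(Intercept)" "x1" "log(x2)"
terms_of(y ~ x1 + I(x2^2))     # "(Intercept)" "x1" "I(x2^2)"
terms_of(y ~ x1/x2)            # "(Intercept)" "x1" "x1:x2"
terms_of(y ~ x1 * x2)          # "(Intercept)" "x1" "x2" "x1:x2"
terms_of(y ~ is_male)          # "(Intercept)" "is_male"
```

The `I()` wrapper is what makes `^` mean arithmetic squaring; inside a formula, a bare `^` denotes crossing of terms.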

Page 27:

Polynomials

Page 28:

Polynomials

m2 <- lm(Salary ~ Experience + I(Experience^2), data = prof)
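The `prof` data set is not included in this transcript, so the sketch below assumes a synthetic stand-in called `prof_sim` (a hypothetical name) with the same two columns. It compares the straight-line fit with the quadratic one.

```r
# Synthetic stand-in for the prof data (illustrative only)
set.seed(11)
prof_sim <- data.frame(Experience = runif(80, 0, 35))
prof_sim$Salary <- 40000 + 3000 * prof_sim$Experience -
  50 * prof_sim$Experience^2 + rnorm(80, sd = 3000)

m1 <- lm(Salary ~ Experience, data = prof_sim)
m2 <- lm(Salary ~ Experience + I(Experience^2), data = prof_sim)

# The model is still linear in its coefficients, even though it is curved in Experience
coef(m2)

# F-test of whether the quadratic term improves on the straight line
anova(m1, m2)

# Predictions along a grid, e.g. for drawing the fitted curve
grid <- data.frame(Experience = seq(0, 35, by = 5))
curve_pred <- predict(m2, newdata = grid)
```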

Page 29:

Quiz: What does "linear" mean?

In which case can we not use linear regression?

Page 30:

Quiz: What does "linear" mean?

Page 31:

Dummy variables

• Are binary variables (i.e., 0 or 1) created from a categorical variable, one indicator per category:

Eye color   Code
Brown       1
Blue        2
Grey        3

Eye color   Is_Brown   Is_Blue   Is_Grey
Brown       1          0         0
Blue        0          1         0
Grey        0          0         1
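A minimal sketch of how R produces such indicator columns, assuming a small hypothetical `eye` factor. `model.matrix()` is the function `lm()` uses internally to expand factors.

```r
eye <- factor(c("Brown", "Blue", "Grey", "Blue"),
              levels = c("Brown", "Blue", "Grey"))

# One indicator column per category (no intercept), as in the table above
full <- model.matrix(~ eye - 1)
colnames(full) <- paste0("Is_", levels(eye))
full

# Default coding inside lm(): intercept plus k - 1 dummies,
# with the first level (Brown) as the reference category
reduced <- model.matrix(~ eye)
colnames(reduced)   # "(Intercept)" "eyeBlue" "eyeGrey"
```

One dummy is dropped when an intercept is present because all k indicators sum to the column of ones, which would make the design matrix rank-deficient.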

Page 33:

Example

Salary for males:   85181.8 + 958.1 * yrs.since.phd + 7923.6 * 1 = 93105.4 + 958.1 * yrs.since.phd

Salary for females: 85181.8 + 958.1 * yrs.since.phd + 7923.6 * 0 = 85181.8 + 958.1 * yrs.since.phd
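The arithmetic above can be reproduced directly from the quoted coefficients. In this sketch the name `is_male` for the dummy and the helper `predict_salary()` are assumptions made for illustration; only the three numbers come from the slide.

```r
# Coefficients as quoted on the slide
b <- c(intercept = 85181.8, yrs.since.phd = 958.1, is_male = 7923.6)

predict_salary <- function(yrs.since.phd, is_male) {
  unname(b["intercept"] + b["yrs.since.phd"] * yrs.since.phd + b["is_male"] * is_male)
}

predict_salary(0, is_male = 1)   # 93105.4, the intercept of the male line
predict_salary(0, is_male = 0)   # 85181.8, the intercept of the female line

# The dummy shifts the line up by its coefficient; the slope is unchanged
gap <- predict_salary(10, 1) - predict_salary(10, 0)   # 7923.6
```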

Page 34:

Diagnostics

Page 35:

Leverage points

Demo: http://www.stat.sc.edu/~west/javahtml/Regression.html

Page 36:

• A leverage point is an observation that has an extreme value on one or more explanatory variables.
• A leverage point is a bad leverage point if its Y-value does not follow the pattern set by the other data points.
• In other words, a bad leverage point is a leverage point that is also an outlier.
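A small sketch of good versus bad leverage points, assuming synthetic data in which the last observation has an extreme x. The cutoff of twice the average hat value is a common rule of thumb, not something stated on the slide.

```r
set.seed(9)
x <- c(rnorm(30, mean = 10, sd = 1), 25)   # the last x is extreme
y_good <- 2 + 3 * x + rnorm(31)            # last point follows the pattern
y_bad  <- y_good
y_bad[31] <- 10                            # last point breaks the pattern

fit_good <- lm(y_good ~ x)
fit_bad  <- lm(y_bad ~ x)

# Leverage depends on x only, so it is identical in both fits
h <- hatvalues(fit_good)
which(h > 2 * mean(h))    # rule of thumb: flag h_ii > 2 (p + 1) / n

# A good leverage point barely moves the slope; a bad one drags it
coef(fit_good)["x"]
coef(fit_bad)["x"]
```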

Page 37:

Standardized residuals
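The slide's formula is not in the transcript. A common definition, assumed here, divides each residual by its estimated standard deviation: r_i = e_i / (s * sqrt(1 - h_ii)), where s is the residual standard error and h_ii the leverage. The sketch checks this against R's `rstandard()` on synthetic data with one planted outlier.

```r
set.seed(21)
x <- runif(40, 0, 10)
y <- 1 + 2 * x + rnorm(40)
y[5] <- y[5] + 8          # plant one outlier

fit <- lm(y ~ x)

e <- residuals(fit)
h <- hatvalues(fit)
s <- summary(fit)$sigma   # residual standard error

r_manual <- e / (s * sqrt(1 - h))
all.equal(r_manual, rstandard(fit))   # TRUE

which(abs(rstandard(fit)) > 2)        # rule-of-thumb outlier candidates
```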

Page 38:

Goodness-of-fit measures

• R-squared (the square of the sample correlation coefficient between the outcomes and their predicted values)
• Coefficient significance (tests the null hypothesis that the true value of the coefficient is zero; rejecting it supports the claim that the explanatory variable really belongs in the model)

----------------------------------------

• Measures on the test set (RSS, R-squared)
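These measures can be computed as follows. The sketch assumes synthetic data and a simple 150/50 train/test split; the test-set R-squared uses one common convention (1 minus test RSS over the test total sum of squares), which is an assumption, since the slide does not define it.

```r
set.seed(8)
n <- 200
d <- data.frame(x = runif(n, 0, 10))
d$y <- 5 + 1.2 * d$x + rnorm(n, sd = 2)

train <- d[1:150, ]
test  <- d[151:200, ]

fit <- lm(y ~ x, data = train)

# R-squared as the squared correlation between outcomes and fitted values
r2_cor <- cor(train$y, fitted(fit))^2
all.equal(r2_cor, summary(fit)$r.squared)   # TRUE

# Coefficient significance: t-tests of H0 "coefficient = 0"
coef_table <- summary(fit)$coefficients
coef_table

# Test-set measures
pred     <- predict(fit, newdata = test)
rss_test <- sum((test$y - pred)^2)
r2_test  <- 1 - rss_test / sum((test$y - mean(test$y))^2)
```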

Page 39:

Over- and underfitting
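The figure for this slide is missing. As a rough illustration, assuming synthetic data from a sine curve and polynomial fits of increasing degree, the training RSS can only go down as the model grows, while the test RSS typically starts to rise once the model is too flexible. The helper `rss_for_degree()` is a hypothetical name.

```r
set.seed(13)
n <- 60
x <- runif(n, 0, 1)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
train <- data.frame(x = x[1:30],  y = y[1:30])
test  <- data.frame(x = x[31:60], y = y[31:60])

# Training and test RSS of a polynomial fit of the given degree
rss_for_degree <- function(degree) {
  fit <- lm(y ~ poly(x, degree), data = train)
  c(degree = degree,
    train  = sum(residuals(fit)^2),
    test   = sum((test$y - predict(fit, newdata = test))^2))
}

# Degree 1 underfits the sine shape; a very high degree tends to overfit
res <- t(sapply(c(1, 3, 12), rss_for_degree))
res
```

Comparing the `train` and `test` columns side by side is the simplest way to spot the point at which extra flexibility stops helping.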

Page 41:

Regularization

• Simple objective function:
  min(Error)
• … with regularization:
  min(Error + λ · Complexity)

Penalty for more complex models: the larger the value of λ, the greater the penalty and the more compact the model.

Page 42:

Regularization

• OLS objective function:
  $\min \sum e^2$
• OLS with regularization (Ridge regression):
  $\min \left( \sum e^2 + \lambda \sum_i \beta_i^2 \right)$
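A minimal sketch of ridge regression via its closed-form solution, (X^T X + λI)^{-1} X^T y. It assumes synthetic data with standardized predictors and a centered response, so that no intercept needs to be estimated or penalized; the function `ridge()` is a hypothetical helper, not a library call.

```r
set.seed(17)
n <- 100
X <- scale(matrix(rnorm(n * 3), n, 3))   # centered, unit-variance predictors
y <- as.vector(X %*% c(2, -1, 0.5) + rnorm(n))
y <- y - mean(y)                          # centered response, so no intercept

ridge <- function(X, y, lambda) {
  p <- ncol(X)
  # Minimizes sum(e^2) + lambda * sum(beta^2)
  as.vector(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

b_ols  <- ridge(X, y, lambda = 0)     # lambda = 0 reproduces OLS
b_10   <- ridge(X, y, lambda = 10)
b_1000 <- ridge(X, y, lambda = 1000)

# The coefficient vector shrinks toward zero as lambda grows
norms <- sapply(list(b_ols, b_10, b_1000), function(b) sum(b^2))
norms
```

Standardizing the predictors first matters because the penalty treats all coefficients alike, so predictors on larger scales would otherwise be penalized less.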

Page 44:

Literature

• A Modern Approach to Regression with R, Simon Sheather
• Applied Linear Regression, Sanford Weisberg