Class 5: Thurs., Sep. 23 • Example of using regression to make predictions and understand the likely errors in the predictions: salaries of teachers and experience • Normal distribution calculations • R squared • Checking the assumptions of the simple linear regression model: residual plots.

Dec 20, 2015

Class 5: Thurs., Sep. 23

• Example of using regression to make predictions and understand the likely errors in the predictions: salaries of teachers and experience

• Normal distribution calculations

• R squared

• Checking the assumptions of the simple linear regression model: residual plots.

Teachers’ Salaries and Dating

• In U.S. culture, it is usually considered impolite to ask how much money a person makes.

• However, suppose that you are single and are interested in dating a particular person.

• Of course, salary isn’t the most important factor when considering whom to date but it certainly is nice to know (especially if it is high!)

• In this case, the person you are interested in happens to be a high school teacher, so you know a high salary isn’t an issue.

• Still you would like to know how much she or he makes, so you take an informal survey of 11 high school teachers that you know.

Distributions: Salary

[Figure: histogram of Salary, axis ticks at 35000, 50000, 60000]

Moments

Mean 50881.818
Std Dev 6491.1968
Std Err Mean 1957.1695
upper 95% Mean 55242.664
lower 95% Mean 46520.973
N 11

Based on this data, what can you conclude? Absent any other information, the best guess for the teacher’s salary is the mean salary, $50,882. But it is likely that this estimate will not be correct. To get an idea of how far off you might be, you can calculate the standard deviation:

\[
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{11}(y_i - \bar{y})^2} = \sqrt{\frac{421437378}{10}} = 6491.82
\]

The standard deviation is the “typical” amount by which an observation deviates from the mean. Thus, your best estimate for your potential date’s salary is $50,882 but a typical estimate will be off by about $6,500.
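As a quick arithmetic check, the standard deviation calculation can be reproduced in a few lines of Python. This is only a sketch: the helper name `sample_sd_from_sum_of_squares` is made up for illustration, and the inputs are the sum of squared deviations (421437378) and the sample size (11) quoted on the slide, since the 11 individual salaries are not listed.

```python
import math

def sample_sd_from_sum_of_squares(sum_sq_dev, n):
    """Sample standard deviation from the sum of squared deviations
    about the mean, using the usual n - 1 divisor.
    (Hypothetical helper, not part of the course materials.)"""
    return math.sqrt(sum_sq_dev / (n - 1))

# Values quoted on the slide: sum of squared deviations and sample size.
s = sample_sd_from_sum_of_squares(421437378, 11)
print(f"s = {s:.2f}")  # about 6,500 dollars, the 'typical' deviation from the mean
```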

• You happen to know that the person you are interested in has been teaching for 8 years.

• How can you use this information to better predict your potential date’s salary?

• Regression Analysis to the Rescue!

• You go back to each of the original 11 teachers you surveyed and ask them for their years of experience.

• Simple Linear Regression Model: \(E(Y|X) = \beta_0 + \beta_1 X\), the distribution of Y given X is normal with mean \(\beta_0 + \beta_1 X\) and standard deviation \(\sigma\).

[Figure: Bivariate Fit of Salary By Years of Experience (scatterplot of Salary versus Years of Experience)]

[Figure: Bivariate Fit of Salary By Years of Experience (scatterplot of Salary versus Years of Experience with linear fit)]

Linear Fit

Salary = 40612.135 + 1686.0674 Years of Experience

Summary of Fit

RSquare 0.545881
RSquare Adj 0.495423
Root Mean Square Error 4610.93
Mean of Response 50881.82
Observations (or Sum Wgts) 11

Linear Fit

Salary = 40612.135 + 1686.0674 Years of Experience

Summary of Fit

RSquare 0.545881
RSquare Adj 0.495423
Root Mean Square Error 4610.93

• Predicted salary of your potential date who has been a teacher for 8 years = Estimated mean salary for teachers with 8 years of experience = 40612.135 + 1686.0674*8 = $54,100

• How far off will your estimate typically be? Root mean square error = Estimated standard deviation of Y|X = $4,610.93.

• Notice that the typical error of your estimate of teacher salary using experience, $4,610.93, is less than that of using only information on mean teacher salary, $6,491.20.

• Regression analysis enables you to better predict your potential date’s salary.
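The prediction and the comparison of typical errors can be sketched in Python as follows. The coefficients, root mean square error, and standard deviation are the values reported on the slides; the helper name `predict_salary` is hypothetical and only serves the illustration. The unrounded prediction comes out slightly above the slide's rounded $54,100.

```python
# Fitted least squares line reported by JMP on the slide.
INTERCEPT = 40612.135   # estimated mean salary at 0 years of experience
SLOPE = 1686.0674       # estimated increase in mean salary per year

def predict_salary(years_of_experience):
    """Estimated mean salary for teachers with the given experience.
    (Hypothetical helper name, used only for this illustration.)"""
    return INTERCEPT + SLOPE * years_of_experience

predicted = predict_salary(8)

# Typical prediction errors quoted on the slides.
rmse = 4610.93        # using experience (root mean square error)
sd_salary = 6491.20   # using only the mean salary (standard deviation)

print(f"Predicted salary at 8 years: ${predicted:,.0f}")
print(f"Typical error with regression: ${rmse:,.2f}")
print(f"Typical error with mean only:  ${sd_salary:,.2f}")
print(f"Reduction in typical error: {1 - rmse / sd_salary:.0%}")
```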

More Information About Your Potential Date’s Salary

• From the regression model, you predict that your potential date’s salary is $54,100 and the typical error you expect to make in your prediction is $4,611.

• Suppose you want to know an interval that will most of the time (say 95% of the time) contain your date’s salary. What’s the chance that your date will make more than $60,000? What’s the chance that your date will make less than $50,000?

• We can answer these questions by using the fact that under the simple linear regression model, the distribution of Y|X is normal. Here, the subpopulation of teachers with 8 years of experience has a normal distribution with mean $54,100 and standard deviation $4,611.

• 95% interval: For the subpopulation of teachers with 8 years of experience, 95% of the salaries will be within two SDs of the mean. An interval that will contain the salary of a randomly chosen teacher with 8 years of experience 95% of the time is: $54,100 ± 2*$4,611 = ($44,878, $63,322).

• What’s the probability that your date will make more than $60,000? If you don’t have any additional information about your date other than his or her number of years of teaching, we can assume that your date is a random draw from the subpopulation of teachers with 8 years of teaching.

• According to the simple linear regression model, the subpopulation of teachers with 8 years of experience is estimated to have a normal distribution with mean $54,100 and standard deviation $4,611.
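The two-SD interval can be checked with a short Python sketch. The mean and standard deviation are the rounded values from the slide; the function name `two_sd_interval` is an assumption made for illustration. The last lines use the standard library's `statistics.NormalDist` to confirm that "within two SDs" covers roughly 95% of a normal distribution.

```python
from statistics import NormalDist

def two_sd_interval(mean, sd, k=2.0):
    """Interval mean ± k*sd (hypothetical helper; k = 2 as on the slide)."""
    return mean - k * sd, mean + k * sd

# Estimated subpopulation of teachers with 8 years of experience.
low, high = two_sd_interval(54100, 4611)
print(f"Interval: (${low:,.0f}, ${high:,.0f})")

# Share of a normal distribution lying within two SDs of its mean.
coverage = NormalDist().cdf(2) - NormalDist().cdf(-2)
print(f"Coverage of a two-SD interval: {coverage:.3f}")
```

The coverage is slightly above 95%, which is why "two SDs" is used as a convenient rule of thumb.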

Properties of the Normal Distribution (Section 1.3)

• Suppose a variable Y has a normal distribution with mean \(\mu\) and standard deviation \(\sigma\). Then

\[
Z = \frac{Y - \mu}{\sigma}
\]

follows a standard normal distribution.

• Then the probability that Y is greater than a number c equals

\[
P(Y > c) = P\left(\frac{Y - \mu}{\sigma} > \frac{c - \mu}{\sigma}\right) = P\left(Z > \frac{c - \mu}{\sigma}\right)
\]

where Z has a standard normal distribution with mean 0 and SD 1.

The probabilities for a standard normal distribution can be found in Table A.

Review Section 1.3 on using the normal tables.

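• Probability that a teacher with 8 years of experience has salary > $60,000:

\[
P(Y > 60{,}000) = P\left(\frac{Y - 54{,}100}{4{,}611} > \frac{60{,}000 - 54{,}100}{4{,}611}\right) = P(Z > 1.28) = 1 - P(Z \le 1.28) = 1 - 0.8997 = 0.1003
\]

• Probability that a teacher with 8 years of experience has salary < $50,000:

\[
P(Y < 50{,}000) = P\left(\frac{Y - 54{,}100}{4{,}611} < \frac{50{,}000 - 54{,}100}{4{,}611}\right) = P(Z < -0.89) = 0.1867
\]

• Probability that a teacher with 8 years of experience has salary between $52,000 and $56,000:

\[
P(52{,}000 < Y < 56{,}000) = P\left(\frac{52{,}000 - 54{,}100}{4{,}611} < \frac{Y - 54{,}100}{4{,}611} < \frac{56{,}000 - 54{,}100}{4{,}611}\right) = P(-0.46 < Z < 0.41) = P(Z < 0.41) - P(Z < -0.46) = 0.6591 - 0.3228 = 0.3363
\]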
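These three probabilities can also be computed without Table A. The sketch below uses `statistics.NormalDist` from the Python standard library with the estimated mean and standard deviation from the slides; the helper `z_score` and the variable names are assumptions for illustration. Because the code does not round z to two decimals before the lookup, the results differ from the table-based values in the third or fourth decimal place.

```python
from statistics import NormalDist

MEAN, SD = 54100, 4611  # estimated mean and SD of salary given 8 years of experience

salary = NormalDist(mu=MEAN, sigma=SD)   # distribution of Y | X = 8
standard_normal = NormalDist()           # Z, with mean 0 and SD 1

def z_score(c):
    """Standardize a salary value c: (c - mean) / SD. (Hypothetical helper.)"""
    return (c - MEAN) / SD

# P(Y > 60,000), P(Y < 50,000), and P(52,000 < Y < 56,000)
p_above_60k = 1 - salary.cdf(60000)
p_below_50k = salary.cdf(50000)
p_between = salary.cdf(56000) - salary.cdf(52000)

# The same first probability via standardization: P(Z > (c - mean) / SD)
p_above_60k_via_z = 1 - standard_normal.cdf(z_score(60000))

print(f"z for $60,000: {z_score(60000):.2f}")
print(f"P(Y > 60,000) = {p_above_60k:.4f}")
print(f"P(Y < 50,000) = {p_below_50k:.4f}")
print(f"P(52,000 < Y < 56,000) = {p_between:.4f}")
```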

R Squared

• How much better are the predictions of your potential date’s salary from the simple linear regression model than those from just using the mean teacher’s salary?

• This is the question that R squared addresses.

• R squared: Number between 0 and 1 that measures how much of the variability in the response the regression model explains.

• R squared close to 0 means that using regression for predicting Y|X isn’t much better than using the mean of Y; R squared close to 1 means that regression is much better than the mean of Y for predicting Y|X.

Summary of Fit

RSquare 0.545881
RSquare Adj 0.495423
Root Mean Square Error 4610.93

R Squared Formula

\[
R^2 = \frac{\text{Total sum of squares} - \text{Residual sum of squares}}{\text{Total sum of squares}}
\]

• Total sum of squares = \(\sum_{i=1}^{n} (Y_i - \bar{Y})^2\) = the sum of squared prediction errors for using the sample mean of Y to predict Y

• Residual sum of squares = \(\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2\), where \(\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i\) is the prediction of \(Y_i\) from the least squares line.
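The R squared formula can be checked against the JMP output with a small Python sketch. The raw salaries are not listed, so the sums of squares are recovered from the reported summary statistics under two standard assumptions that the slides do not state explicitly: the sample standard deviation is the square root of (total sum of squares)/(n - 1), and the root mean square error is the square root of (residual sum of squares)/(n - 2). The function name `r_squared` is hypothetical.

```python
def r_squared(total_ss, residual_ss):
    """R squared = (total SS - residual SS) / total SS. (Hypothetical helper.)"""
    return (total_ss - residual_ss) / total_ss

n = 11                 # number of teachers surveyed
sd_salary = 6491.1968  # Std Dev of Salary from the JMP Moments output
rmse = 4610.93         # Root Mean Square Error from the JMP Summary of Fit

# Assumed relationships (standard definitions, not stated on the slides):
#   sd   = sqrt(total SS / (n - 1))
#   rmse = sqrt(residual SS / (n - 2))
total_ss = (n - 1) * sd_salary ** 2
residual_ss = (n - 2) * rmse ** 2

r2 = r_squared(total_ss, residual_ss)
print(f"R squared = {r2:.4f}")  # compare with RSquare in the Summary of Fit
```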

What’s a good R squared?

• As with correlation, a good R squared depends on the context. In precise laboratory work, R squared values under 90% might be too low, but in social science contexts, when a single variable rarely explains a great deal of variation in the response, R squared values of 50% may be considered remarkably good.

• The best measure of whether the regression model is providing predictions of Y|X that are accurate enough to be useful is the root mean square error, which tells us the typical error in using the regression to predict Y from X.

Checking the model

• The simple linear regression model is a great tool but its answers will only be useful if it is the right model for the data. We need to check the assumptions before using the model.

• Assumptions of the simple linear regression model:

1. Linearity: The mean of Y|X is a straight line.

2. Constant variance: The standard deviation of Y|X is constant.

3. Normality: The distribution of Y|X is normal.

4. Independence: The observations are independent.

Checking that the mean of Y|X is a straight line

1. Scatterplot: Look at whether the mean of Y given X appears to increase or decrease in a straight line.

[Figure: Bivariate Fit of Salary By Years of Experience (scatterplot of Salary versus Years of Experience)]

[Figure: Bivariate Fit of Heart Disease Mortality By Wine Consumption (scatterplot of Heart Disease Mortality versus Wine Consumption)]

Residual Plot

• Residuals: Prediction error of using regression to predict \(Y_i\) for observation i:

\[
res_i = Y_i - \hat{Y}_i, \quad \text{where } \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i
\]

• Residual plot: Plot with residuals on the y axis and the explanatory variable (or some other variable) on the x axis.

[Figure: residual plot of Residual versus Wine Consumption]

[Figure: residual plot of Residual versus Years of Experience]

• Residual Plot in JMP: After doing Fit Line, click red triangle next to Linear Fit and then click Plot Residuals.

• What should the residual plot look like if the simple linear regression model holds? Under the simple linear regression model, the residuals

\[
res_i = Y_i - \hat{Y}_i = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i)
\]

should have approximately a normal distribution with mean zero and a standard deviation which is the same for all X.

• Simple linear regression model: Residuals should appear as a “swarm” of randomly scattered points about their mean (which is always zero).

• A pattern in the residual plot that for a certain range of X the residuals tend to be greater than zero or tend to be less than zero indicates that the mean of Y|X is not a straight line.
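To make the residual checks concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the data are simulated (they are not the course data sets), and the helper names `fit_line` and `residuals` are made up. The first data set is generated from a straight line plus normal noise; the second follows a curve, so the residuals from a straight-line fit are positive at the ends of the X range and negative in the middle.

```python
import random

def fit_line(xs, ys):
    """Least squares intercept and slope for a simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def residuals(xs, ys, intercept, slope):
    """res_i = Y_i - (intercept + slope * X_i) for each observation."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

random.seed(0)  # fixed seed so the simulated example is repeatable
xs = [float(x) for x in range(0, 101, 5)]

# Simulated data that satisfy the model: straight-line mean plus normal noise.
ys_linear = [10.0 + 0.9 * x + random.gauss(0, 1) for x in xs]
b0, b1 = fit_line(xs, ys_linear)
res_linear = residuals(xs, ys_linear, b0, b1)
mean_res = sum(res_linear) / len(res_linear)
print(f"Mean residual (linear data): {mean_res:.2e}")  # zero up to rounding error

# Simulated data with a curved mean: a straight line is the wrong model.
ys_curved = [(x - 50.0) ** 2 / 100.0 for x in xs]
c0, c1 = fit_line(xs, ys_curved)
res_curved = residuals(xs, ys_curved, c0, c1)
print(f"Residual at X = 0:   {res_curved[0]:.2f}")
print(f"Residual at X = 50:  {res_curved[10]:.2f}")
print(f"Residual at X = 100: {res_curved[-1]:.2f}")
```

Plotting `res_curved` against `xs` would show the kind of systematic pattern that signals a nonlinear mean, while `res_linear` would look like a random swarm about zero.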

[Figure: Bivariate Fit of Mileage By Speed (scatterplot of Mileage versus Speed with linear fit)]

Linear Fit

Mileage = 23.266776 - 0.0012701 Speed

[Figure: residual plot of Residual versus Speed]

Data Simulated From A Simple Linear Regression Model (Idealreg.JMP)

[Figure: Bivariate Fit of Y By X (scatterplot of Y versus X)]

[Figure: residual plot of Residual versus X]

Summary

• Normal distribution can be used to calculate probability that Y takes on certain values given X

• R squared: measure of how much regression improves on ignoring X when predicting Y.

• Assumptions of simple linear regression model must be checked in order for model to be used. Residual plots can be used to check the linearity assumption.

• Tuesday’s class: Section 2.4 (more on checking assumptions, outliers and influential points, lurking variables).