Class 5: Thurs., Sep. 23 Example of using regression to make predictions and understand the likely errors in the predictions: salaries of teachers and.

Post on 20-Dec-2015


Class 5: Thurs., Sep. 23

• Example of using regression to make predictions and understand the likely errors in the predictions: salaries of teachers and experience

• Normal distribution calculations

• R squared

• Checking the assumptions of the simple linear regression model: residual plots.

Teachers’ Salaries and Dating

• In U.S. culture, it is usually considered impolite to ask how much money a person makes.

• However, suppose that you are single and are interested in dating a particular person.

• Of course, salary isn’t the most important factor when considering whom to date, but it certainly is nice to know (especially if it is high!)

• In this case, the person you are interested in happens to be a high school teacher, so you know a high salary isn’t an issue.

• Still you would like to know how much she or he makes, so you take an informal survey of 11 high school teachers that you know.

Distributions: Salary

[Figure: histogram of Salary, with axis ticks at 35000, 50000 and 60000.]

Moments

Mean 50881.818
Std Dev 6491.1968
Std Err Mean 1957.1695
upper 95% Mean 55242.664
lower 95% Mean 46520.973
N 11

Based on this data, what can you conclude? Absent any other information, the best guess for a teacher’s salary is the mean salary, $50,882. But it is likely that this estimate will not be correct. To get an idea of how far off you might be, you can calculate the standard deviation:

\[ s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2} = \sqrt{\frac{421437378}{10}} \approx 6491.82 \]

The standard deviation is the “typical” amount by which an observation deviates from the mean. Thus, your best estimate for your potential date’s salary is $50,882, but a typical estimate will be off by about $6,500.

• You happen to know that the person you are interested in has been teaching for 8 years.

• How can you use this information to better predict your potential date’s salary?

• Regression Analysis to the Rescue!

• You go back to each of the original 11 teachers you surveyed and ask them for their years of experience.

• Simple Linear Regression Model: \(E(Y|X) = \beta_0 + \beta_1 X\); the distribution of Y given X is normal with mean \(\beta_0 + \beta_1 X\) and standard deviation \(\sigma\).

[Figure: Bivariate Fit of Salary By Years of Experience. Scatterplot with linear fit; y axis: Salary (35000 to 65000), x axis: Years of Experience (0 to 12.5).]

Linear Fit

Salary = 40612.135 + 1686.0674 Years of Experience

Summary of Fit

RSquare 0.545881
RSquare Adj 0.495423
Root Mean Square Error 4610.93
Mean of Response 50881.82
Observations (or Sum Wgts) 11

• Predicted salary of your potential date who has been a teacher for 8 years = Estimated mean salary for teachers with 8 years of experience = 40612.135 + 1686.0674*8 = $54,100

• How far off will your estimate typically be? Root mean square error = Estimated standard deviation of Y|X = $4,610.93.

• Notice that the typical error of your estimate of teacher salary using experience, $4,610.93, is less than that of using only information on mean teacher salary, $6,491.20.

• Regression analysis enables you to better predict your potential date’s salary.
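The arithmetic in the bullets above can be sketched in a few lines of Python. The coefficients, root mean square error, and standard deviation are copied from the JMP output in the slides; the function name `predict_salary` and the constant names are hypothetical, introduced only for illustration.

```python
# Values copied from the JMP output in the slides.
INTERCEPT = 40612.135   # estimated mean salary at 0 years of experience
SLOPE = 1686.0674       # estimated increase in mean salary per year of experience
RMSE = 4610.93          # typical prediction error when using experience
SD_SALARY = 6491.1968   # typical prediction error when using the mean salary alone


def predict_salary(years_of_experience):
    """Estimated mean salary from the fitted line (hypothetical helper)."""
    return INTERCEPT + SLOPE * years_of_experience


predicted = predict_salary(8)
error_reduction = SD_SALARY - RMSE

print(f"Predicted salary at 8 years: ${predicted:,.2f}")
print(f"Typical error shrinks by about ${error_reduction:,.2f}")
```

The slides round the prediction to $54,100; the unrounded value from the fitted line is slightly higher.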

More Information About Your Potential Date’s Salary

• From the regression model, you predict that your potential date’s salary is $54,100 and the typical error you expect to make in your prediction is $4,611.

• Suppose you want to know an interval that will most of the time (say 95% of the time) contain your date’s salary? What’s the chance that your date will make more than $60,000? What’s the chance that your date will make less than $50,000?

• We can answer these questions by using the fact that under the simple linear regression model, the distribution of Y|X is normal. Here, the subpopulation of teachers with 8 years of experience has a normal distribution with mean $54,100 and standard deviation $4,611.

• 95% interval: For the subpopulation of teachers with 8 years of experience, 95% of the salaries will be within two SDs of the mean. An interval that will contain the salary of a randomly chosen teacher with 8 years of experience 95% of the time is: $54,100 ± 2*$4,611 = ($44,878, $63,322).

• What’s the probability that your date will make more than $60,000? If you don’t have any additional information about your date other than his or her number of years of teaching, we can assume that your date is a random draw from the subpopulation of teachers with 8 years of teaching.

• According to the simple linear regression model, the subpopulation of teachers with 8 years of experience is estimated to have a normal distribution with mean $54,100 and standard deviation $4,611.
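As a rough check on the two-SD interval, the sketch below uses Python's standard-library `statistics.NormalDist`. It assumes, as the slides do, that the estimated mean ($54,100) and standard deviation ($4,611) can be treated as the true values for teachers with 8 years of experience; the variable names are illustrative only.

```python
from statistics import NormalDist

# Estimated subpopulation of teachers with 8 years of experience (values from the slides).
MEAN_8_YEARS = 54100
SD_8_YEARS = 4611

salary_dist = NormalDist(mu=MEAN_8_YEARS, sigma=SD_8_YEARS)

# Rule-of-thumb interval from the slides: mean plus or minus two SDs.
lower = MEAN_8_YEARS - 2 * SD_8_YEARS
upper = MEAN_8_YEARS + 2 * SD_8_YEARS

# Share of the normal distribution that the two-SD interval covers.
coverage = salary_dist.cdf(upper) - salary_dist.cdf(lower)

# Interval with exactly 95% coverage under the same normal model.
exact_lower = salary_dist.inv_cdf(0.025)
exact_upper = salary_dist.inv_cdf(0.975)

print(f"Two-SD interval: (${lower:,}, ${upper:,}), coverage {coverage:.4f}")
print(f"Exact 95% interval: (${exact_lower:,.0f}, ${exact_upper:,.0f})")
```

The two-SD rule covers slightly more than 95% of a normal distribution, so the exact 95% interval comes out a little narrower.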

Properties of the Normal Distribution (Section 1.3)

• Suppose a variable Y has a normal distribution with mean \(\mu\) and standard deviation \(\sigma\). Then

\[ Z = \frac{Y - \mu}{\sigma} \]

follows a standard normal distribution.

• Then the probability that Y is greater than a number c equals

\[ P(Y > c) = P\left(\frac{Y - \mu}{\sigma} > \frac{c - \mu}{\sigma}\right) = P\left(Z > \frac{c - \mu}{\sigma}\right), \]

where Z has a standard normal distribution with mean 0 and SD 1.

The probabilities for a standard normal distribution can be found in Table A.

Review Section 1.3 on using the normal tables.

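• Probability that a teacher with 8 years of experience has salary > $60,000:

\[ P(Y > 60{,}000) = P\left(\frac{Y - 54{,}100}{4{,}611} > \frac{60{,}000 - 54{,}100}{4{,}611}\right) = P(Z > 1.28) = 1 - P(Z < 1.28) = 1 - 0.8997 = 0.1003 \]

• Probability that a teacher with 8 years of experience has salary < $50,000:

\[ P(Y < 50{,}000) = P\left(\frac{Y - 54{,}100}{4{,}611} < \frac{50{,}000 - 54{,}100}{4{,}611}\right) = P(Z < -0.89) = 0.1867 \]

• Probability that a teacher with 8 years of experience has salary between $52,000 and $56,000:

\[ P(52{,}000 < Y < 56{,}000) = P\left(\frac{52{,}000 - 54{,}100}{4{,}611} < \frac{Y - 54{,}100}{4{,}611} < \frac{56{,}000 - 54{,}100}{4{,}611}\right) = P(-0.46 < Z < 0.41) = P(Z < 0.41) - P(Z < -0.46) = 0.6591 - 0.3228 = 0.3363 \]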
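The three table lookups above can be cross-checked in software. This is a minimal sketch using Python's standard-library `statistics.NormalDist` in place of Table A; the mean and standard deviation are the estimates from the slides, and the variable names are illustrative. Because the code does not round the z-scores to two decimals, its answers differ slightly from the table-based ones.

```python
from statistics import NormalDist

# Estimated distribution of salary for teachers with 8 years of experience.
salary_dist = NormalDist(mu=54100, sigma=4611)

# P(Y > 60,000): one minus the cumulative probability at 60,000.
p_above_60k = 1 - salary_dist.cdf(60000)

# P(Y < 50,000): the cumulative probability at 50,000.
p_below_50k = salary_dist.cdf(50000)

# P(52,000 < Y < 56,000): difference of two cumulative probabilities.
p_between = salary_dist.cdf(56000) - salary_dist.cdf(52000)

print(f"P(salary > 60,000)          = {p_above_60k:.4f}")
print(f"P(salary < 50,000)          = {p_below_50k:.4f}")
print(f"P(52,000 < salary < 56,000) = {p_between:.4f}")
```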

R Squared

• How much better predictions of your potential date’s salary does the simple linear regression model provide than just using the mean teacher’s salary?

• This is the question that R squared addresses.

• R squared: Number between 0 and 1 that measures how much of the variability in the response the regression model explains.

• R squared close to 0 means that using regression for predicting Y|X isn’t much better than using the mean of Y; R squared close to 1 means that regression is much better than the mean of Y for predicting Y|X.

Summary of Fit

RSquare 0.545881
RSquare Adj 0.495423
Root Mean Square Error 4610.93

R Squared Formula

\[ R^2 = \frac{\text{Total sum of squares} - \text{Residual sum of squares}}{\text{Total sum of squares}} \]

• Total sum of squares = \(\sum_{i=1}^{n} (Y_i - \bar{Y})^2\) = the sum of squared prediction errors for using sample mean of Y to predict Y

• Residual sum of squares = \(\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2\), where \(\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i\) is the prediction of \(Y_i\) from the least squares line.
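The formula can be checked against the JMP output without the raw data. The sketch below assumes the usual definitions: the sample standard deviation satisfies SD² = (total sum of squares)/(n − 1), and in simple linear regression the root mean square error satisfies RMSE² = (residual sum of squares)/(n − 2). The constant and variable names are illustrative.

```python
# Summary statistics copied from the JMP output in the slides.
N = 11
SD_SALARY = 6491.1968   # sample standard deviation of salary
RMSE = 4610.93          # root mean square error of the linear fit

# Recover the two sums of squares from the summary statistics
# (assumes SD^2 = TSS / (n - 1) and RMSE^2 = RSS / (n - 2)).
total_ss = (N - 1) * SD_SALARY ** 2
residual_ss = (N - 2) * RMSE ** 2

# R squared: share of the total sum of squares removed by the regression.
r_squared = (total_ss - residual_ss) / total_ss

print(f"Total sum of squares:    {total_ss:,.0f}")
print(f"Residual sum of squares: {residual_ss:,.0f}")
print(f"R squared:               {r_squared:.4f}")
```

Under these assumptions the result agrees with the RSquare value in the Summary of Fit to about four decimal places.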

What’s a good R squared?

• As with correlation, a good R² depends on the context. In precise laboratory work, R² values under 90% might be too low, but in social science contexts, where a single variable rarely explains a great deal of the variation in the response, R² values of 50% may be considered remarkably good.

• The best measure of whether the regression model is providing predictions of Y|X that are accurate enough to be useful is the root mean square error, which tells us the typical error in using the regression to predict Y from X.

Checking the model

• The simple linear regression model is a great tool but its answers will only be useful if it is the right model for the data. We need to check the assumptions before using the model.

• Assumptions of the simple linear regression model:

1. Linearity: The mean of Y|X is a straight line.

2. Constant variance: The standard deviation of Y|X is constant.

3. Normality: The distribution of Y|X is normal.

4. Independence: The observations are independent.

Checking that the mean of Y|X is a straight line

1. Scatterplot: Look at whether the mean of Y given X appears to increase or decrease in a straight line.

[Figure: Bivariate Fit of Salary By Years of Experience. Scatterplot; y axis: Salary, x axis: Years of Experience.]

[Figure: Bivariate Fit of Heart Disease Mortality By Wine Consumption. Scatterplot; y axis: Heart Disease Mortality, x axis: Wine Consumption.]

Residual Plot

• Residuals: Prediction error of using regression to predict \(Y_i\) for observation i:

\[ res_i = Y_i - \hat{Y}_i, \quad \text{where } \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i \]

• Residual plot: Plot with residuals on the y axis and the explanatory variable (or some other variable) on the x axis.

[Figure: Residual plot for the heart disease data; y axis: Residual, x axis: Wine Consumption.]

[Figure: Residual plot for the salary data; y axis: Residual, x axis: Years of Experience.]

• Residual Plot in JMP: After doing Fit Line, click red triangle next to Linear Fit and then click Plot Residuals.

• What should the residual plot look like if the simple linear regression model holds? Under the simple linear regression model, the residuals

\[ res_i = Y_i - \hat{Y}_i = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i) \]

should have approximately a normal distribution with mean zero and a standard deviation which is the same for all X.

• Simple linear regression model: Residuals should appear as a “swarm” of randomly scattered points about their mean (which is always zero).

• A pattern in the residual plot in which, for a certain range of X, the residuals tend to be greater than zero or tend to be less than zero indicates that the mean of Y|X is not a straight line.
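The kind of pattern described above can be illustrated with a small sketch. The data below are hypothetical (not from the lecture): the mean of y follows a curve rather than a line. The helpers `fit_line` and `residuals` are illustrative names for a plain least squares fit and the residual formula above.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for a simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def residuals(xs, ys, intercept, slope):
    """Residual for each observation: observed y minus predicted y."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical data with a curved mean (NOT the lecture's data).
xs = list(range(11))
ys = [30 - (x - 5) ** 2 for x in xs]

b0, b1 = fit_line(xs, ys)
res = residuals(xs, ys, b0, b1)

# Sign of each residual, ordered by x: runs of the same sign signal curvature.
signs = "".join("+" if r > 0 else "-" for r in res)
print(signs)
print(f"Sum of residuals: {sum(res):.6f}")
```

The residuals still sum to zero, as they always do for a least squares line with an intercept, but they are negative at both ends of the x range and positive in the middle, which is the warning sign that the mean of Y|X is not a straight line.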

[Figure: Bivariate Fit of Mileage By Speed. Scatterplot with linear fit; y axis: Mileage, x axis: Speed.]

Linear Fit

Mileage = 23.266776 - 0.0012701 Speed

[Figure: Residual plot for the linear fit of Mileage on Speed; y axis: Residual, x axis: Speed.]

Data Simulated From A Simple Linear Regression Model (Idealreg.JMP)

[Figure: Bivariate Fit of Y By X. Scatterplot; y axis: Y, x axis: X.]

[Figure: Residual plot for the simulated data; y axis: Residual, x axis: X.]

Summary

• Normal distribution can be used to calculate probability that Y takes on certain values given X

• R squared: measure of how much regression improves on ignoring X when predicting Y.

• Assumptions of simple linear regression model must be checked in order for model to be used. Residual plots can be used to check the linearity assumption.

• Tuesday’s class: Section 2.4 (more on checking assumptions, outliers and influential points, lurking variables).
