Chapter 8: Simple Linear Regression

Yang Zhenlin
[email protected]
http://www.mysmu.edu/faculty/zlyang/
Jan 18, 2016

STAT306, Term II, 09/10
STAT151, Term I 2015-16 © Zhenlin Yang, SMU
Learning Objectives
Describing the Relationship between Two Variables
-- Scatter plot
-- Numerical measures
Simple Linear Regression Model
Least Squares Method for Model Estimation
A Measure of Goodness of Fit: R-Square
Inference about the Regression Coefficients
Predictions
-- Predicting the value of a future observation
-- Predicting the mean of future observations
Introduction
We are interested in the relationship between two numerical variables X and Y.
One of these variables, say X, is known in advance and is called the explanatory variable, or independent variable. The other variable, Y, is a random variable whose values, or general random behavior, are of interest; Y is therefore called the response variable, or dependent variable. If there is a strong relationship between X and Y, one can predict a future value of Y based on the known future value of X through this relationship. To study the relation, n pairs of observations on (X, Y) are collected, denoted (X1, Y1), (X2, Y2), . . . , (Xn, Yn).
The Least Squares Method helps find such a relation.
Describing the Relationship
Example 8.1. Prices of used cars and the odometer readings. A car dealer wants to find the relationship between the odometer reading and the selling price of used cars. A random sample of 100 cars is selected, and the data recorded. Construct a scatter plot of the data.

Car   Odometer (X)   Price (Y)
1     37388          14636
2     44758          14122
3     45833          14016
4     30862          15590
5     31705          15568
6     34010          14718
...   ...            ...

(First six of the 100 observations shown; the full data set contains all 100 cars.)
Scatter diagram: plot of the pairs of observed values (x1, y1) , (x2, y2) , . . . , (xn, yn) of variables X and Y. It is a very effective graphical tool for “revealing” the relationship between variables.
Describing the Relationship
The plot indeed shows a negative linear relation between the price and the odometer reading.
Describing the Relationship

Besides the graphical display of the data, some numerical measures, such as the sample covariance and the sample coefficient of correlation, can be used to measure the direction and strength of the linear relationship between two variables.
Sample means:
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \quad\text{and}\quad \bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$$

Sample variances:
$$s_X^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2 \quad\text{and}\quad s_Y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2$$

Sample covariance:
$$\mathrm{Cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})$$

Sample correlation coefficient:
$$r = \frac{\mathrm{Cov}(X, Y)}{s_X\, s_Y}$$

This is called the 'five statistics summary' of the data.
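The formulas above can be computed directly in a few lines of Python. This is a sketch with made-up data; the function name `five_stats` and the data values are my own, not from the course.

```python
# A minimal sketch: compute the "five statistics summary" (two means,
# two variances, the covariance) plus the correlation coefficient r.

def five_stats(x, y):
    n = len(x)
    xbar = sum(x) / n                                        # sample mean of X
    ybar = sum(y) / n                                        # sample mean of Y
    sx2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)        # s_X^2
    sy2 = sum((yi - ybar) ** 2 for yi in y) / (n - 1)        # s_Y^2
    cov = sum((xi - xbar) * (yi - ybar)
              for xi, yi in zip(x, y)) / (n - 1)             # Cov(X, Y)
    return xbar, ybar, sx2, sy2, cov

# Illustrative data only (not the used-car data set).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
xbar, ybar, sx2, sy2, cov = five_stats(x, y)
r = cov / (sx2 ** 0.5 * sy2 ** 0.5)   # correlation from the five statistics
print(xbar, ybar, round(r, 4))
```

The sign of `r` matches the sign of the covariance, as the definitions above require.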
Example 8.2. Continuing on the Example 8.1, find the five statistics summary and comment on the linear relationship between price and odometer reading.
Shortcut formulas:
$$\mathrm{Cov}(X,Y) = \frac{1}{n-1}\left[\sum_{i=1}^{n} X_i Y_i - \frac{1}{n}\sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i\right];$$
$$s_X^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} X_i^2 - \frac{1}{n}\Big(\sum_{i=1}^{n} X_i\Big)^2\right]; \quad s_Y^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} Y_i^2 - \frac{1}{n}\Big(\sum_{i=1}^{n} Y_i\Big)^2\right].$$

Solution:
$$\bar{X} = 36{,}009.45; \quad \bar{Y} = 14{,}822.823;$$
$$s_X^2 = 43{,}528{,}690; \quad s_Y^2 = 259{,}996;$$
$$\mathrm{Cov}(X,Y) = -2{,}712{,}511, \quad\text{or}\quad r = -0.8063.$$

As r = -0.8063, there exists a strong negative linear relation between price and odometer reading.
Describing the Relationship
Cov(X, Y) > 0 (r near +1): Strong positive linear relationship. The scatter diagram shows a clear upward trend.

Cov(X, Y) = 0 (r near 0): No linear relationship. The scatter diagram shows either no pattern, or a non-linear pattern.

Cov(X, Y) < 0 (r near -1): Strong negative linear relationship. The scatter diagram shows a clear downward trend.

The sample coefficient of correlation r always lies between -1 and +1.
Simple Linear Regression Model

The simple linear regression model takes the form:

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

where
Y = dependent variable
X = independent variable
$\beta_0$ = y-intercept
$\beta_1$ = slope of the line (rise/run)
$\varepsilon$ = error variable

$\beta_0$ and $\beta_1$ are unknown population parameters, and therefore need to be estimated from the data.

As the scatter diagram given in Example 8.1 shows, although there is a general trend that as the odometer reading increases the price of the used car decreases, the relation is not deterministic, as cars with the same odometer reading can have different prices. Thus, price can also be altered by some unknown random errors!
Simple Linear Regression Model
To learn this theoretical relationship, in particular to estimate the parameters $\beta_0$ and $\beta_1$, a random sample of n experimental units is selected, and the values of (Y, X) for each unit are observed, giving (X1, Y1), (X2, Y2), . . . , (Xn, Yn).

These n pairs of observations satisfy:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, 2, \ldots, n.$$

As Y is a random variable, so is $\varepsilon$. Due to the random sampling mechanism, the $\{Y_i\}$ are independent, and so are the $\{\varepsilon_i\}$. Further, it is reasonable to assume that

$$E(\varepsilon_i) = 0, \quad i = 1, 2, \ldots, n,$$

for if the error means were a nonzero constant, that constant could be absorbed into $\beta_0$. Thus,

$$E(Y_i) = \beta_0 + \beta_1 X_i, \quad i = 1, 2, \ldots, n.$$
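The model above can be made concrete by simulating from it. The sketch below generates synthetic (Xi, Yi) pairs satisfying Yi = β0 + β1 Xi + εi with mean-zero errors; all parameter values are illustrative choices of mine, loosely echoing the used-car example, not estimates from the data.

```python
# A minimal sketch: simulate n observations from the simple linear
# regression model Y_i = beta0 + beta1 * X_i + eps_i, E(eps_i) = 0.

import random

random.seed(0)
beta0, beta1, sigma, n = 17000.0, -0.06, 300.0, 100   # illustrative values

X = [random.uniform(20000, 50000) for _ in range(n)]   # X known in advance
eps = [random.gauss(0.0, sigma) for _ in range(n)]     # mean-zero errors
Y = [beta0 + beta1 * xi + e for xi, e in zip(X, eps)]  # response values

print(len(X), len(Y))
```

Each Yi differs from the deterministic line β0 + β1 Xi only through its own independent error, which is exactly why cars with the same odometer reading can have different prices.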
Least Squares Estimation
Based on the observed data, we are seeking a line that best fits the data when two variables are related to one another.
We define “best fit line” as a line for which the sum of squared differences between it and the data points is minimized.
Different lines generate different errors, and thus different sums of squared errors. There is a line that minimizes the sum of squared errors, and in this sense it is the best line.
Let $\hat{Y} = b_0 + b_1 X$ be a fitted line. To find the best line that minimizes the sum of squared errors, it is equivalent to find the intercept $b_0$ and the slope $b_1$ that

$$\text{minimize} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2,$$

where $Y_i$ is the actual Y value of point i, and $\hat{Y}_i = b_0 + b_1 X_i$ is the value of point i calculated from the equation. That is, to minimize

$$SS(b_0, b_1) = \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2.$$
Taking partial derivatives and setting them to zero:

$$\frac{\partial SS(b_0, b_1)}{\partial b_0} = -2\sum_{i=1}^{n}(Y_i - b_0 - b_1 X_i) = 0$$

$$\frac{\partial SS(b_0, b_1)}{\partial b_1} = -2\sum_{i=1}^{n} X_i (Y_i - b_0 - b_1 X_i) = 0$$

These give the normal equations:

$$\sum_{i=1}^{n} Y_i = n b_0 + b_1 \sum_{i=1}^{n} X_i$$

$$\sum_{i=1}^{n} X_i Y_i = b_0 \sum_{i=1}^{n} X_i + b_1 \sum_{i=1}^{n} X_i^2$$

The first equation leads to $b_0 = \bar{Y} - b_1 \bar{X}$. Substituting into the second:

$$\sum_{i=1}^{n} X_i Y_i - (\bar{Y} - b_1 \bar{X})\sum_{i=1}^{n} X_i - b_1 \sum_{i=1}^{n} X_i^2 = 0.$$
Since $\sum_{i=1}^{n} X_i = n\bar{X}$, this simplifies to

$$\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y} - b_1\Big(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\Big) = 0.$$

And the solutions:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2} = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2} = \frac{\mathrm{Cov}(X, Y)}{s_X^2}$$

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X},$$

which gives the least squares equation: $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$.
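The closed-form solution just derived is easy to implement directly. This is a sketch in plain Python with my own function name (`least_squares`); the (n-1) factors in the covariance and variance cancel, so only the raw sums of cross-products are needed.

```python
# A minimal sketch of the least squares solution derived above:
#   b1 = Cov(X, Y) / s_X^2  (the (n-1) factors cancel),
#   b0 = Ybar - b1 * Xbar.

def least_squares(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)                  # (n-1) * s_X^2
    sxy = sum((xi - xbar) * (yi - ybar)
              for xi, yi in zip(x, y))                       # (n-1) * Cov(X, Y)
    b1 = sxy / sxx                                           # slope estimate
    b0 = ybar - b1 * xbar                                    # intercept estimate
    return b0, b1

# Points lying exactly on y = 3 + 2x should be recovered exactly.
b0, b1 = least_squares([0.0, 1.0, 2.0, 3.0], [3.0, 5.0, 7.0, 9.0])
print(b0, b1)   # 3.0 2.0
```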
Example 8.3. Continuing with Example 8.2, find the least squares line relating odometer reading to the price of the used car.

Solution: The estimated coefficients are

$$\hat{\beta}_1 = \frac{\mathrm{Cov}(X, Y)}{s_X^2} = \frac{-2{,}712{,}511}{43{,}528{,}690} = -0.06232$$

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = 14{,}822.82 - (-0.06232)(36{,}009.45) = 17{,}067$$

The least squares equation is

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X = 17{,}067 - 0.0623X$$

Interpretation of $\hat{\beta}_1 = -0.0623$: for each additional mile on the odometer, it is estimated that the average price of the cars decreases by $0.0623.
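Example 8.3's coefficients can be checked numerically from the summary statistics quoted in Example 8.2. The sketch below is just that arithmetic; the variable names are mine.

```python
# Numeric check of Example 8.3 from Example 8.2's summary statistics.

cov_xy = -2_712_511.0      # Cov(X, Y)
s2_x   = 43_528_690.0      # s_X^2
xbar   = 36_009.45         # mean odometer reading
ybar   = 14_822.82         # mean price

b1 = cov_xy / s2_x         # slope:     about -0.06232
b0 = ybar - b1 * xbar      # intercept: about 17,067
print(round(b1, 5), round(b0))
```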
Interpreting the Linear Regression Equation

[Figure: Odometer Line Fit Plot, Price versus Odometer, with the fitted line $\hat{Y} = 17{,}067 - 0.0623X$.]

The estimated slope of the line is -0.0623: for each additional mile on the odometer, the price decreases by an average of $0.0623.

The intercept is estimated as $17,067. However, there are no data near X = 0, so do not interpret the intercept as the "price of cars that have not been driven".
Properties of the Least Squares Estimators. For the simple linear regression model

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, 2, \ldots, n,$$

where the $\{\varepsilon_i\}$ are independent with $E(\varepsilon_i) = 0$, the least squares estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ are unbiased estimators of $\beta_0$ and $\beta_1$. To see this, note that $E(Y_i) = \beta_0 + \beta_1 X_i$, so that $E(\bar{Y}) = \beta_0 + \beta_1 \bar{X}$. More on white board in class.
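Unbiasedness can also be illustrated by simulation: repeatedly generate data from the model, refit by least squares, and average the estimates. The sketch below uses illustrative parameter values of my own; the averages should land near the true β0 and β1.

```python
# A small Monte Carlo illustration of unbiasedness (illustrative values).

import random

random.seed(1)
beta0, beta1, sigma = 2.0, -0.5, 1.0
X = [float(i) for i in range(1, 21)]                 # fixed design points
xbar = sum(X) / len(X)
sxx = sum((xi - xbar) ** 2 for xi in X)

b0s, b1s = [], []
for _ in range(2000):
    Y = [beta0 + beta1 * xi + random.gauss(0.0, sigma) for xi in X]
    ybar = sum(Y) / len(Y)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(X, Y)) / sxx
    b0s.append(ybar - b1 * xbar)                     # intercept estimate
    b1s.append(b1)                                   # slope estimate

b0_avg = sum(b0s) / len(b0s)                         # should be near beta0
b1_avg = sum(b1s) / len(b1s)                         # should be near beta1
print(round(b0_avg, 2), round(b1_avg, 2))
```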
Measure of Goodness of Fit

Sum of Squares due to Errors (SSE). This is the sum of squared differences between the points and the regression line. It can serve as a measure of how well the line fits the data. SSE is defined by

$$SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2.$$

A shortcut formula:

$$SSE = (n-1)\left[s_Y^2 - \frac{\mathrm{Cov}(X, Y)^2}{s_X^2}\right].$$
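The shortcut formula can be verified against the direct residual sum of squares. The sketch below uses made-up data and my own variable names; for a least squares fit the two computations agree exactly (up to floating-point error).

```python
# A minimal sketch checking the SSE shortcut formula:
#   SSE = (n-1) * (s_Y^2 - Cov(X, Y)^2 / s_X^2).

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

sxx = sum((xi - xbar) ** 2 for xi in x) / (n - 1)            # s_X^2
syy = sum((yi - ybar) ** 2 for yi in y) / (n - 1)            # s_Y^2
sxy = sum((xi - xbar) * (yi - ybar)
          for xi, yi in zip(x, y)) / (n - 1)                 # Cov(X, Y)

b1 = sxy / sxx                                               # least squares fit
b0 = ybar - b1 * xbar
sse_direct = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
sse_shortcut = (n - 1) * (syy - sxy ** 2 / sxx)
print(abs(sse_direct - sse_shortcut) < 1e-9)
```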
Coefficient of Determination R². This is a measure of the strength of the linear relationship between the response Y and the explanatory variable(s) X, and is defined as

$$R^2 = 1 - \frac{SSE}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2} \quad\text{or}\quad R^2 = \frac{\mathrm{Cov}(X, Y)^2}{s_X^2\, s_Y^2}.$$

The first definition is a general one and applies to linear regression models with multiple predictors. It simplifies to the second definition when there is only one predictor X. In the case of simple linear regression, R² is also the square of the sample correlation coefficient r.
• To understand the significance of the coefficient of determination, note:

$$\underbrace{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}_{SST} = \underbrace{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2}_{SSR} + \underbrace{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}_{SSE}$$

SST: total variation (sum of squares) in Y; SSR: sum of squares due to regression; SSE: sum of squares due to error.
• It follows that R² = 1 - SSE/SST = SSR/SST.
• R² measures the proportion of the variation in Y that is explained by the variation in X, or by the model.
• R² takes on any value between zero and one.
  R² = 1: perfect match between the line and the data points.
  R² = 0: there is no linear relationship between X and Y.
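The decomposition SST = SSR + SSE holds exactly for any least squares fit with an intercept, and can be checked numerically. The data below and the variable names are illustrative choices of mine.

```python
# A minimal sketch verifying SST = SSR + SSE and R^2 = SSR / SST.

x = [1.0, 2.0, 3.0, 4.0]
y = [1.2, 1.9, 3.2, 3.7]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]                      # fitted values

sst = sum((yi - ybar) ** 2 for yi in y)                # total variation
ssr = sum((yh - ybar) ** 2 for yh in yhat)             # explained by the line
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))   # left unexplained

r2 = ssr / sst
print(abs(sst - (ssr + sse)) < 1e-9, 0.0 <= r2 <= 1.0)
```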
Inferences for the Model
Required Conditions on the Error Variable

The error $\varepsilon$ is a critical part of the regression model. For formal statistical inferences about the model, four requirements involving the distribution of $\varepsilon$ must be satisfied:
- The probability distribution of $\varepsilon$ is normal.
- The mean of $\varepsilon$ is zero: $E(\varepsilon) = 0$.
- The standard deviation of $\varepsilon$ is $\sigma$, a constant, for all values of X.
- The errors associated with different observations on Y are all independent.

It follows that the response Y is normally distributed with mean $E(Y) = \beta_0 + \beta_1 X$ and standard deviation $\sigma$, and that the random sample of n observations {Y1, Y2, . . . , Yn} made on Y are independent.
Normality of $\varepsilon$

[Figure: normal curves for y centered at $E(y|x_1) = \beta_0 + \beta_1 x_1$, $E(y|x_2) = \beta_0 + \beta_1 x_2$, and $E(y|x_3) = \beta_0 + \beta_1 x_3$. The standard deviation remains constant, but the mean value changes with x.]

Changing the X value increases (or decreases, if $\beta_1 < 0$) the mean of Y, but does not change the shape of its distribution.
Estimate of the Error Standard Deviation

The mean error is equal to zero. If $\sigma$ is small, the errors tend to be close to zero (close to the mean error), and the model fits the data well. Therefore, $\sigma$ can also serve as a measure of the suitability of a linear model. However, $\sigma$ is unknown and has to be estimated. As SSE is the sum of squared errors, it leads naturally to the estimator

$$s_\varepsilon = \sqrt{\frac{SSE}{n-2}} \quad\text{(the Standard Error of Estimate)}.$$

It can be shown that $s_\varepsilon^2 = SSE/(n-2)$ is an unbiased estimator of $\sigma^2$.
Example 8.4. Calculate the estimated error standard deviation and the coefficient of determination for Example 8.1, and describe what they tell you about the model fit.

Solution:

$$s_Y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2 = 259{,}996 \quad\text{(calculated earlier)}$$

$$SSE = (n-1)\left[s_Y^2 - \frac{\mathrm{Cov}(X, Y)^2}{s_X^2}\right] = 99\left[259{,}996 - \frac{(-2{,}712{,}511)^2}{43{,}528{,}690}\right] = 9{,}005{,}450$$

$$s_\varepsilon = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{9{,}005{,}450}{98}} = 303.13$$

It is hard to assess the model based on $s_\varepsilon$ alone, even when compared with the mean value of Y: $s_\varepsilon = 303.1$, $\bar{Y} = 14{,}823$.
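Example 8.4's numbers can be reproduced from the summary statistics. The sketch below is plain arithmetic; the results agree with the text up to rounding of the quoted inputs.

```python
# Reproducing Example 8.4: SSE from the shortcut formula, then the
# standard error of estimate s = sqrt(SSE / (n - 2)).

n      = 100
s2_y   = 259_996.0         # s_Y^2
s2_x   = 43_528_690.0      # s_X^2
cov_xy = -2_712_511.0      # Cov(X, Y)

sse = (n - 1) * (s2_y - cov_xy ** 2 / s2_x)   # about 9,005,450
s = (sse / (n - 2)) ** 0.5                    # about 303.1
print(round(sse), round(s, 1))
```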
$$SST = (n-1)\,s_Y^2 = 99 \times 259{,}996 = 25{,}739{,}604$$

$$R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{9{,}005{,}450}{25{,}739{,}604} = 0.6501$$

65% of the variation in the auction selling price is explained by the variation in odometer reading. The rest (35%) remains unexplained by this model.
Some Theoretical Results. If the errors $\{\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n\}$ are independent and identically distributed as $N(0, \sigma^2)$, then we have

(a) $\hat{\beta}_1 \sim N\!\left(\beta_1,\; \sigma^2/[(n-1)s_X^2]\right)$,

(b) $(n-2)\,s_\varepsilon^2/\sigma^2 \sim \chi^2_{n-2}$,

(c) $\hat{\beta}_1$ and $s_\varepsilon^2$ are independent.
Testing the Slope

We can draw inference about $\beta_1$ from $\hat{\beta}_1$ by testing

H0: $\beta_1 = 0$ versus H1: $\beta_1 \neq 0$ (or < 0, or > 0).

The implication of this test is clear: if H0 is rejected, one can conclude that there is sufficient evidence that Y and X are linearly related; otherwise, there is not. The same question can be answered by constructing a confidence interval for $\beta_1$.

From the theoretical results given earlier and the results presented in Chapter 5b regarding the t-distribution, it is immediate that

$$\frac{\hat{\beta}_1 - \beta_1}{s_\varepsilon/\sqrt{(n-1)s_X^2}} \sim t_{n-2},$$

a statistic for testing the slope parameter or constructing a confidence interval for it.
A 100(1-α)% confidence interval for $\beta_1$ is given as

$$\hat{\beta}_1 \pm t_{\alpha/2}(n-2)\,\frac{s_\varepsilon}{\sqrt{(n-1)s_X^2}}.$$

Apparently, the quantity $s_\varepsilon/\sqrt{(n-1)s_X^2}$ is an estimate of the standard deviation of $\hat{\beta}_1$, and is thus referred to as the estimated standard error of $\hat{\beta}_1$.

Inference concerning the intercept parameter $\beta_0$ can be carried out in a similar manner, but it is not as interesting or important as for the slope parameter $\beta_1$.
Example 8.5. Test to determine whether there is enough evidence to infer that there is a linear relationship between the car auction price and the odometer reading for all three-year-old Tauruses, in Example 8.4. Use α = 5%.

Solution: H0: $\beta_1 = 0$ vs H1: $\beta_1 \neq 0$.

$$s.e.(\hat{\beta}_1) = \frac{s_\varepsilon}{\sqrt{(n-1)s_X^2}} = \frac{303.1}{\sqrt{(99)(43{,}528{,}690)}} = 0.00462$$

$$t = \frac{\hat{\beta}_1 - 0}{s.e.(\hat{\beta}_1)} = \frac{-0.0623}{0.00462} = -13.49$$

With df = n - 2 = 98, the rejection region is t > t98(.025) or t < -t98(.025), where t.025 ≈ 1.984. As t = -13.49 < -1.984, reject H0 at the 5% level of significance. Yes, there is enough evidence to conclude that price and odometer reading are linearly related.

A 95% CI for $\beta_1$: $-0.0623 \pm 1.984 \times 0.00462 = (-0.0715, -0.0531)$.
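Example 8.5's test statistic and confidence interval follow directly from the quantities quoted in the text. The sketch below reproduces them; the variable names are mine, and the critical value 1.984 is the approximate t98(.025) used in the text.

```python
# Reproducing Example 8.5: standard error of the slope, t statistic,
# and the 95% confidence interval for beta1.

b1     = -0.0623               # estimated slope
s      = 303.1                 # standard error of estimate
n      = 100
s2_x   = 43_528_690.0          # s_X^2
t_crit = 1.984                 # approximate t_{98}(.025)

se_b1 = s / ((n - 1) * s2_x) ** 0.5          # about 0.00462
t = (b1 - 0.0) / se_b1                       # about -13.49
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
print(round(se_b1, 5), round(t, 2), [round(v, 4) for v in ci])
```

Since |t| far exceeds the critical value, the conclusion of the test (reject H0) is immediate from the output.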
Predictions

• Before using the regression model, we need to assess how well it fits the data.
• If we are satisfied with how well the model fits the data, we can use it to predict a future value of Y0, or the mean of Y0, based on the future value of X0. This is in fact an important application of a regression model.
• The simple linear regression model can easily be extended to include more predictor variables; e.g., in the examples presented, the price of a used car is affected not only by its odometer reading, but also by its 'age', color, etc.
• These constitute important topics in an advanced course: Applied Regression Methods (STAT312).

The end. Thank you.