Chapter 13: Simple Regression and Correlation Analysis 1
Chapter 13 Simple Regression and Correlation Analysis
LEARNING OBJECTIVES
The overall objective of this chapter is to give you an understanding of bivariate regression and correlation analysis, thereby enabling you to:
1. Compute the equation of a simple regression line from a sample of data and interpret the slope and intercept of the equation.
2. Understand the usefulness of residual analysis in testing the assumptions underlying regression analysis and in examining the fit of the regression line to the data.
3. Compute a standard error of the estimate and interpret its meaning.
4. Compute a coefficient of determination and interpret it.
5. Test hypotheses about the slope of the regression model and interpret the results.
6. Estimate values of y using the regression model.
CHAPTER TEACHING STRATEGY
This chapter is about all aspects of simple (bivariate, linear) regression. Early in
the chapter through scatter plots, the student begins to understand that the object of simple regression is to fit a line through the points. Fairly soon in the process, the student learns how to solve for slope and y intercept and develop the equation of the regression line. Most of the remaining material on simple regression is to determine how good the fit of the line is and if assumptions underlying the process are met.
The student begins to understand that by entering values of the independent
variable into the regression model, predicted values can be determined. The question
then becomes: Are the predicted values good estimates of the actual dependent values? One rule to emphasize is that the regression model should not be used to predict for independent variable values that are outside the range of values used to construct the model. MINITAB issues a warning when such a prediction is attempted. There are many instances where the relationship between x and y is linear over a given interval but becomes curvilinear or unpredictable outside that interval. Of course, even with this caution, many forecasters use such regression models to extrapolate to values of x outside the domain of those used to construct the model. Whether the forecasts obtained under such conditions are any better than "seat of the pants" or "crystal ball" estimates remains to be seen.
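The rule against extrapolation can be sketched in a few lines of Python. This is only an illustration, not MINITAB's behavior: the function name `predict` is hypothetical, and the coefficients and x range are borrowed from problem 13.43 later in this chapter (fitted line ŷ = -11.335 + 0.355x, with x observed from 41 to 62).

```python
import warnings

# Fitted line and observed x range borrowed from problem 13.43 (illustrative).
B0, B1 = -11.335, 0.355
X_MIN, X_MAX = 41, 62


def predict(x0, b0=B0, b1=B1, x_min=X_MIN, x_max=X_MAX):
    """Return the predicted y for x0; warn if x0 lies outside the fitted range."""
    if not (x_min <= x0 <= x_max):
        warnings.warn(
            f"x0 = {x0} is outside the fitted range [{x_min}, {x_max}]; "
            "the prediction is an extrapolation.",
            UserWarning,
        )
    return b0 + b1 * x0


inside = predict(50)    # within the range used to build the model
```

Calling `predict(100)` still returns a number, but it also emits the warning, which mirrors the caution above: the arithmetic works for any x, while the model is only supported inside the observed range.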
Residual analysis is a good way to show, graphically and numerically, how the model relates to the data and that it fits some points more closely than others. A graphical or numerical analysis of residuals demonstrates that the regression line fits the data in a manner analogous to the way a mean fits a set of numbers. The regression line passes through the points such that the vertical distances from the points to the line (the residuals) sum to zero. The fact that the residuals sum to zero points out the need to square the errors (residuals) in order to get a handle on total error. This leads to the sum of squares of error and then on to the standard error of the estimate. In addition, students can learn why the process is called least squares analysis (the slope and intercept formulas are derived by calculus such that the sum of squares of error is minimized, hence "least squares"). Students can learn that by examining the values of se, the residuals, r², and the t ratio to test the slope, they can begin to make a judgment about the fit of the model to the data. Many of the chapter problems ask the student to comment on these items (se, r², etc.).

It is my view that for many of these students, an important facet of this chapter lies in understanding the "buzz" words of regression such as standard error of the estimate, coefficient of determination, etc. They may well encounter regression again only as some type of computer printout to be deciphered. The concepts then become as important as, or perhaps more important than, the calculations.
CHAPTER OUTLINE
13.1 Introduction to Simple Regression Analysis
13.2 Determining the Equation of the Regression Line
13.3 Residual Analysis
     Using Residuals to Test the Assumptions of the Regression Model
     Using the Computer for Residual Analysis
13.4 Standard Error of the Estimate
13.5 Coefficient of Determination
     Relationship Between r and r²
13.6 Hypothesis Tests for the Slope of the Regression Model and Testing the Overall Model
     Testing the Slope
     Testing the Overall Model
13.7 Estimation
     Confidence Intervals to Estimate the Conditional Mean of y: µy/x
     Prediction Intervals to Estimate a Single Value of y
13.8 Interpreting Computer Output
KEY TERMS
Coefficient of Determination (r²)
Confidence Interval
Dependent Variable
Deterministic Model
Heteroscedasticity
Homoscedasticity
Independent Variable
Least Squares Analysis
Outliers
Prediction Interval
Probabilistic Model
Regression Analysis
Residual
Residual Plot
Scatter Plot
Simple Regression
Standard Error of the Estimate (se)
Sum of Squares of Error (SSE)
13.16 There appears to be a nonconstant error variance.

13.17 There appears to be nonlinear regression.

13.18 The MINITAB Residuals vs. Fits graphic is strongly indicative of a violation of the homoscedasticity assumption of regression. Because the residuals are very close together for small values of x, there is little variability in the residuals at the left end of the graph. On the other hand, for larger values of x, the graph flares out, indicating a much greater variability at the upper end. Thus, there is a lack of homogeneity of error across the values of the independent variable.
Approximately 68% of the residuals should fall within ±1se. 3 out of 5 or 60% of the actual residuals in 11.13 fell within ±1se.

13.20 SSE = Σy² - b0Σy - b1Σxy = 45,154 - 144.414(498) - (-.89824)(30,099) = 272.0

$$s_e = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{272.0}{5}} = 7.376$$
6 out of 7 = 85.7% fall within ±1se
7 out of 7 = 100% fall within ±2se

13.21 SSE = Σy² - b0Σy - b1Σxy = 1,587,328 - (-46.29)(2,670) - 15.24(107,610.4) = 70,940

$$s_e = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{70,940}{6}} = 108.7$$
Six out of eight (75%) of the sales estimates are within 108.7 (million dollars) of the actual values.

13.22 SSE = Σy² - b0Σy - b1Σxy = 524 - 15.46(48) - (-0.71462)(333) = 19.8885

$$s_e = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{19.8885}{3}} = 2.575$$

Four out of five (80%) of the estimates are within 2.575 of the actual rate for bonds. This amount of error is probably not acceptable to financial analysts.
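The two-step calculation used in 13.20 through 13.22 (the computational formula for SSE, then the standard error of the estimate) can be checked with a short Python sketch. The function names are hypothetical; the inputs are the sums and coefficients quoted in 13.22.

```python
import math


def sse_from_sums(sum_y2, sum_y, sum_xy, b0, b1):
    """Computational formula: SSE = sum(y^2) - b0*sum(y) - b1*sum(xy)."""
    return sum_y2 - b0 * sum_y - b1 * sum_xy


def standard_error(sse, n):
    """Standard error of the estimate: sqrt(SSE / (n - 2))."""
    return math.sqrt(sse / (n - 2))


# Values quoted in problem 13.22 (n = 5).
sse = sse_from_sums(sum_y2=524, sum_y=48, sum_xy=333, b0=15.46, b1=-0.71462)
se = standard_error(sse, n=5)
print(round(sse, 4), round(se, 3))
```

Because the formula subtracts large, nearly equal quantities, b0 and b1 should be carried to several decimal places, as the solutions above do.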
The model produces estimates that are ±.1017 or within about 10 cents 68% of the time. However, the range of milk costs is only 45 cents for this data.
This is a relatively large standard error of the estimate given the sales values (ranging from 10.5 to 123.8).
13.26

$$r^2 = 1 - \frac{SSE}{\sum y^2 - \frac{(\sum y)^2}{n}} = 1 - \frac{46.6399}{1,935 - \frac{(97)^2}{5}} = .123$$
This is a low value of r2
13.27

$$r^2 = 1 - \frac{SSE}{\sum y^2 - \frac{(\sum y)^2}{n}} = 1 - \frac{272.12}{45,154 - \frac{(498)^2}{7}} = .972$$
This is a high value of r2
13.28

$$r^2 = 1 - \frac{SSE}{\sum y^2 - \frac{(\sum y)^2}{n}} = 1 - \frac{70,940}{1,587,328 - \frac{(2,670)^2}{8}} = .898$$
This value of r2 is relatively high
13.29

$$r^2 = 1 - \frac{SSE}{\sum y^2 - \frac{(\sum y)^2}{n}} = 1 - \frac{19.8885}{524 - \frac{(48)^2}{5}} = .685$$
This value of r2 is a modest value.
68.5% of the variation of y is accounted for by x but 31.5% is unaccounted for.
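The coefficient of determination calculations in 13.26 through 13.30 all follow the same pattern, sketched here in Python with the 13.29 values. The function name `r_squared` is a hypothetical helper, not something from the text.

```python
def r_squared(sse, sum_y2, sum_y, n):
    """r^2 = 1 - SSE / SSyy, where SSyy = sum(y^2) - (sum(y))^2 / n."""
    ss_yy = sum_y2 - sum_y ** 2 / n
    return 1 - sse / ss_yy


# Values quoted in problem 13.29.
r2 = r_squared(sse=19.8885, sum_y2=524, sum_y=48, n=5)
unexplained = 1 - r2   # share of the variation in y not accounted for by x
print(round(r2, 3), round(unexplained, 3))
```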
13.30

$$r^2 = 1 - \frac{SSE}{\sum y^2 - \frac{(\sum y)^2}{n}} = 1 - \frac{77.1384}{5,837 - \frac{(173)^2}{6}} = .909$$
This value is a relatively high value of r². Almost 91% of the variability of y is accounted for by the x values.

13.31

CCI      Median Income
116.8    37.415
 91.5    36.770
 68.5    35.501
 61.6    35.047
 65.9    34.700
 90.6    34.942
100.0    35.887
104.6    36.306
125.4    37.005

Σx = 323.573   Σy = 824.9   Σx² = 11,640.93413   Σy² = 79,718.79   Σxy = 29,804.4505   n = 9
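As a rough check on the sums for 13.31, the slope and intercept can be computed directly from the listed data. The sketch below is in Python and assumes, as the sums imply, that x is median income and y is CCI. The function name `fit_line` is hypothetical.

```python
# Data from problem 13.31: x = median income, y = CCI (as implied by the sums).
income = [37.415, 36.770, 35.501, 35.047, 34.700, 34.942, 35.887, 36.306, 37.005]
cci = [116.8, 91.5, 68.5, 61.6, 65.9, 90.6, 100.0, 104.6, 125.4]


def fit_line(x, y):
    """Least squares intercept and slope computed from the usual sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(v * v for v in x)
    sum_xy = sum(a * b for a, b in zip(x, y))
    ss_xy = sum_xy - sum_x * sum_y / n    # SSxy
    ss_xx = sum_x2 - sum_x ** 2 / n       # SSxx
    b1 = ss_xy / ss_xx
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


b0, b1 = fit_line(income, cci)
print(f"y-hat = {b0:.3f} + {b1:.3f} x")
```

The same function reproduces the sums printed in the solution (Σx, Σy, Σx², Σxy) as intermediate values, which is a convenient way to catch data-entry errors.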
Since the observed t = -2.56 > t.025,3 = -3.182, the decision is to fail to reject the null hypothesis.

13.36 Analysis of Variance

SOURCE        df       SS         MS       F
Regression     1     5,165    5,165.00    1.95
Error          7    18,554    2,650.57
Total          8    23,718

Let α = .05   F.05,1,7 = 5.59

Since observed F = 1.95 < F.05,1,7 = 5.59, the decision is to fail to reject the null hypothesis. There is no overall predictability in this model.

$$t = \sqrt{F} = \sqrt{1.95} = 1.40$$

t.025,7 = 2.365
Since t = 1.40 < t.025,7 = 2.365, the decision is to fail to reject the null hypothesis.
The slope is not significantly different from zero.
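The ANOVA arithmetic in 13.36, and the fact that t equals the square root of F in simple regression, can be sketched as follows in Python. The variable names are illustrative; the sums of squares and degrees of freedom are those in the table above.

```python
import math

# Sums of squares and degrees of freedom from the 13.36 ANOVA table.
ss_reg, df_reg = 5165, 1
ss_err, df_err = 18554, 7

ms_reg = ss_reg / df_reg        # mean square regression
ms_err = ss_err / df_err        # mean square error
f_stat = ms_reg / ms_err        # observed F
t_stat = math.sqrt(f_stat)      # magnitude of the slope t ratio in simple regression

print(round(ms_err, 2), round(f_stat, 2), round(t_stat, 2))
```

The square root gives only the magnitude of t; its sign follows the sign of the slope.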
13.37 F = 8.26 with a p-value of .021. The overall model is significant at α = .05 but not at α = .01. For simple regression,

$$t = \sqrt{F} = 2.8674$$

t.05,5 = 2.015 but t.01,5 = 3.365. The slope is significant at α = .05 but not at α = .01.
Σx = 571   Σx² = 58,293   se = 7.377

For x0 = 100, ŷ = 144.414 - .898(100) = 54.614

$$\hat{y} \pm t_{\alpha/2,\,n-2}\; s_e \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum x^2 - \frac{(\sum x)^2}{n}}}$$

$$= 54.614 \pm 2.015(7.377)\sqrt{1 + \frac{1}{7} + \frac{(100 - 81.57143)^2}{58,293 - \frac{(571)^2}{7}}}$$

= 54.614 ± 2.015(7.377)(1.08252) = 54.614 ± 16.091

38.523 < y < 70.705

For x0 = 130, ŷ = 144.414 - .898(130) = 27.674

$$\hat{y} \pm t_{\alpha/2,\,n-2}\; s_e \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum x^2 - \frac{(\sum x)^2}{n}}}$$

$$= 27.674 \pm 2.015(7.377)\sqrt{1 + \frac{1}{7} + \frac{(130 - 81.57143)^2}{58,293 - \frac{(571)^2}{7}}}$$

= 27.674 ± 2.015(7.377)(1.1589) = 27.674 ± 17.227

10.447 < y < 44.901

The interval for y at x0 = 130 is wider than the interval for y at x0 = 100 because x0 = 100 is nearer to the mean x̄ = 81.57 than is x0 = 130.
Σx = 199.5   Σx² = 7,667.15   se = 108.8

ŷ = -46.29 + 15.24(20) = 258.51

$$\hat{y} \pm t_{\alpha/2,\,n-2}\; s_e \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum x^2 - \frac{(\sum x)^2}{n}}}$$

$$= 258.51 \pm (3.143)(108.8)\sqrt{\frac{1}{8} + \frac{(20 - 24.9375)^2}{7,667.15 - \frac{(199.5)^2}{8}}}$$

= 258.51 ± (3.143)(108.8)(0.36614) = 258.51 ± 125.20

133.31 < E(y20) < 383.71

For single y value:

$$\hat{y} \pm t_{\alpha/2,\,n-2}\; s_e \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum x^2 - \frac{(\sum x)^2}{n}}}$$

$$= 258.51 \pm (3.143)(108.8)\sqrt{1 + \frac{1}{8} + \frac{(20 - 24.9375)^2}{7,667.15 - \frac{(199.5)^2}{8}}}$$

= 258.51 ± (3.143)(108.8)(1.06492) = 258.51 ± 364.16

-105.65 < y < 622.67

The confidence interval for the single value of y is wider than the confidence interval for the average value of y because the average is more towards the middle and individual values of y can vary more than values of the average.
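The two interval formulas differ only by the extra 1 under the square root. The following Python sketch makes that explicit. The function name and the `single_value` flag are hypothetical, and the critical t value is passed in from a table rather than computed. The inputs are the values used in the solution above.

```python
import math


def interval_half_width(x0, n, sum_x, sum_x2, se, t_crit, single_value=False):
    """Half-width of the interval for y at x0.

    single_value=False: confidence interval for the conditional mean of y.
    single_value=True:  prediction interval for a single value of y.
    """
    x_bar = sum_x / n
    ss_xx = sum_x2 - sum_x ** 2 / n
    inside = 1 / n + (x0 - x_bar) ** 2 / ss_xx
    if single_value:
        inside += 1          # extra term for the variation of one observation
    return t_crit * se * math.sqrt(inside)


y_hat = -46.29 + 15.24 * 20
args = dict(x0=20, n=8, sum_x=199.5, sum_x2=7667.15, se=108.8, t_crit=3.143)
mean_hw = interval_half_width(**args)
single_hw = interval_half_width(**args, single_value=True)
print(round(y_hat - mean_hw, 2), round(y_hat + mean_hw, 2))
print(round(y_hat - single_hw, 2), round(y_hat + single_hw, 2))
```

Because the (x0 - x̄)² term grows as x0 moves away from the mean of x, both intervals widen toward the edges of the data.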
Σx = 41   Σx² = 421   se = 2.575

ŷ = 15.46 - 0.715(10) = 8.31

$$\hat{y} \pm t_{\alpha/2,\,n-2}\; s_e \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum x^2 - \frac{(\sum x)^2}{n}}}$$

$$= 8.31 \pm 5.841(2.575)\sqrt{\frac{1}{5} + \frac{(10 - 8.2)^2}{421 - \frac{(41)^2}{5}}}$$

= 8.31 ± 5.841(2.575)(.488065) = 8.31 ± 7.34

0.97 < E(y10) < 15.65

If the prime interest rate is 10%, we are 99% confident that the average bond rate is between 0.97% and 15.65%.
Since the observed t = 3.74 < t.005,4 = 4.604, the decision is to fail to reject the null hypothesis. f) The r2 = 77.74% is modest. There appears to be some prediction with this model.
The slope of the regression line is not significantly different from zero using α = .01. However, for α = .05, the null hypothesis of a zero slope is rejected. The standard error of the estimate, se = 3.661 is not particularly small given the range of values for y (11 - 3 = 8).
13.43

 x     y
53     5
47     5
41     7
50     4
58    10
62    12
45     3
60    11

Σx = 416   Σx² = 22,032   Σy = 57   Σy² = 489   Σxy = 3,106   n = 8
b1 = 0.355   b0 = -11.335

a) ŷ = -11.335 + 0.355 x
b) and c)

ŷ (Predicted Values)   (y - ŷ) residuals   (y - ŷ)²
       7.48                 -2.48            6.1504
       5.35                 -0.35             .1225
       3.22                  3.78           14.2884
       6.415                -2.415           5.8322
       9.255                 0.745            .5550
      10.675                 1.325           1.7556
       4.64                 -1.64            2.6896
       9.965                 1.035           1.0712

SSE = 32.4649
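The table in parts (b) and (c) can be reproduced from the raw 13.43 data. The Python sketch below fits the line, computes the residuals, and confirms two points made in the teaching strategy: the residuals sum to zero, so the errors must be squared to measure total error. Variable names are illustrative.

```python
# Data from problem 13.43.
x = [53, 47, 41, 50, 58, 62, 45, 60]
y = [5, 5, 7, 4, 10, 12, 3, 11]
n = len(x)

# Least squares slope and intercept from the usual sums.
ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
ss_xx = sum(a * a for a in x) - sum(x) ** 2 / n
b1 = ss_xy / ss_xx
b0 = sum(y) / n - b1 * sum(x) / n

predicted = [b0 + b1 * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]
sse = sum(r * r for r in residuals)     # sum of squares of error

for p, r in zip(predicted, residuals):
    print(f"{p:8.3f} {r:8.3f} {r * r:8.4f}")
print("sum of residuals:", round(sum(residuals), 10))
print("SSE:", round(sse, 4))
```

Any small difference from the printed SSE of 32.4649 comes from rounding the individual squared residuals to four decimal places in the table.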
$$t = \frac{b_1 - \beta_1}{s_b} = \frac{0.355 - 0}{.116305} = 3.05$$
Since the observed t = 3.05 > t.025,6 = 2.447, the decision is to reject the null hypothesis. The population slope is different from zero.

g) This model produces only a modest r² = .608. Almost 40% of the variance of Y is unaccounted for by X. The range of Y values is 12 - 3 = 9 and the standard error of the estimate is 2.33. Given this small range, the se is not small.

13.44 Σx = 1,263   Σx² = 268,295   Σy = 417   Σy² = 29,135   Σxy = 88,288   n = 6
b0 = 25.42778   b1 = 0.209369

SSE = Σy² - b0Σy - b1Σxy = 29,135 - (25.42778)(417) - (0.209369)(88,288) = 46.845468
29.341 ± 2.447(3.201)(1.06567) = 29.341 ± 8.347

20.994 < y < 37.688

c) The interval in part (b) is much wider because part (b) is for a single value of y, which has a much greater possible variation. In actuality, x0 = 70 in part (b) is slightly closer to the mean (x̄) than x0 = 60. However, the width of the single-value interval is much greater than that of the interval for the average or expected y value in part (a).
If a regression model had been developed to predict the number of cars sold by the number of sales people, the model would have had an r² of 82.6%. The same would hold true for a model to predict the number of sales people by the number of cars sold.

13.47 n = 12   Σx = 548   Σx² = 26,592   Σy = 5940   Σy² = 3,211,546   Σxy = 287,908
b1 = 10.626383   b0 = 9.728511

ŷ = 9.728511 + 10.626383 x

SSE = Σy² - b0Σy - b1Σxy = 3,211,546 - (9.728511)(5940) - (10.626383)(287,908) = 94,337.9762

$$s_e = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{94,337.9762}{10}} = 97.1277$$

$$r^2 = 1 - \frac{SSE}{\sum y^2 - \frac{(\sum y)^2}{n}} = 1 - \frac{94,337.9762}{271,246} = .652$$
$$t = \frac{10.626383 - 0}{\dfrac{97.1277}{\sqrt{26,592 - \frac{(548)^2}{12}}}} = 4.33$$
If α = .01, then t.005,10 = 3.169. Since the observed t = 4.33 > t.005,10 = 3.169, the decision is to reject the null hypothesis.

13.48

Sales (y)   Number of Units (x)
  17.1          12.4
   7.9           7.5
   4.8           6.8
   4.7           8.7
   4.6           4.6
   4.0           5.1
   2.9          11.2
   2.7           5.1
   2.7           2.9

Σy = 51.4   Σy² = 460.1   Σx = 64.3   Σx² = 538.97   Σxy = 440.46   n = 9
b1 = 0.92025   b0 = -0.863565

SSE = Σy² - b0Σy - b1Σxy = 460.1 - (-0.863565)(51.4) - (0.92025)(440.46) = 99.153926

$$r^2 = 1 - \frac{SSE}{\sum y^2 - \frac{(\sum y)^2}{n}} = 1 - \frac{99.153926}{166.55} = .405$$
r² for this model is .002858. There is no predictability in this model.

Test for slope: t = 0.18 with a p-value of 0.8623. Not significant.

13.52 Σx = 323.3   Σy = 6765.8   Σx² = 29,629.13   Σy² = 7,583,144.64   Σxy = 339,342.76   n = 7

$$b_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} = \frac{339,342.76 - \frac{(323.3)(6765.8)}{7}}{29,629.13 - \frac{(323.3)^2}{7}} = 1.82751$$

$$b_0 = \frac{\sum y}{n} - b_1\frac{\sum x}{n} = \frac{6765.8}{7} - (1.82751)\frac{323.3}{7} = 882.138$$

ŷ = 882.138 + 1.82751 x

SSE = Σy² - b0Σy - b1Σxy
Since the observed t = 0.50 < t.025,5 = 2.571, the decision is to fail to reject the null hypothesis.

13.53 Let Water use = y and Temperature = x

Σx = 608   Σx² = 49,584   Σy = 1,025   Σy² = 152,711   Σxy = 86,006   n = 8
b1 = 2.40107   b0 = -54.35604

ŷ = -54.35604 + 2.40107 x
$$r^2 = 1 - \frac{SSE}{\sum y^2 - \frac{(\sum y)^2}{n}} = 1 - \frac{1,919.5145}{152,711 - \frac{(1025)^2}{8}} = 1 - .09 = .91$$

Testing the slope:

H0: β1 = 0
Ha: β1 ≠ 0
α = .01

Since this is a two-tailed test, α/2 = .005   df = n - 2 = 8 - 2 = 6

t.005,6 = ±3.707

$$s_b = \frac{s_e}{\sqrt{\sum x^2 - \frac{(\sum x)^2}{n}}} = \frac{17.886}{\sqrt{49,584 - \frac{(608)^2}{8}}} = .30783$$

$$t = \frac{b_1 - \beta_1}{s_b} = \frac{2.40107 - 0}{.30783} = 7.80$$

Since the observed t = 7.80 > t.005,6 = 3.707, the decision is to reject the null hypothesis.
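The slope test for 13.53 chains three quantities: se from SSE, then sb, then t. A Python sketch of that chain is shown below. The function name `slope_t_test` is hypothetical, and the critical value 3.707 is taken from the t table as in the solution rather than computed.

```python
import math


def slope_t_test(b1, sse, n, sum_x, sum_x2, beta1=0.0):
    """Return (t, sb, se) for testing H0: slope = beta1 in simple regression."""
    se = math.sqrt(sse / (n - 2))          # standard error of the estimate
    ss_xx = sum_x2 - sum_x ** 2 / n        # SSxx
    sb = se / math.sqrt(ss_xx)             # standard error of the slope
    return (b1 - beta1) / sb, sb, se


# Values quoted in problem 13.53.
t_obs, sb, se = slope_t_test(b1=2.40107, sse=1919.5145, n=8, sum_x=608, sum_x2=49584)
t_crit = 3.707                             # t.005,6 from the table
reject_null = abs(t_obs) > t_crit          # two-tailed decision rule
print(round(se, 3), round(sb, 5), round(t_obs, 2), reject_null)
```

The decision rule compares the magnitude of the observed t with the critical value, so the null hypothesis is rejected when the observed t is larger, not smaller.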
13.54 a) The regression equation is: ŷ = 67.2 - 0.0565 x

b) For every unit of increase in the value of x, the predicted value of y will decrease by .0565.
c) The t ratio for the slope is -5.50 with an associated p-value of .000. This is significant at α = .10. The t ratio is negative because the slope is negative and the numerator of the t ratio formula equals the slope minus zero.
d) r² is .627; that is, 62.7% of the variability of y is accounted for by x. This is only a modest proportion of predictability. The standard error of the estimate is 10.32. This is best interpreted in light of the data and the magnitude of the data.
e) The F value which tests the overall predictability of the model is 30.25. For simple regression analysis, this equals the value of t², which is (-5.50)².
f) The negative value of r is not a surprise because the slope of the regression line is also
negative indicating an inverse relationship between x and y. In addition, taking the square root of r2 which is .627 yields .7906 which is the magnitude of the value of r considering rounding error.
13.55 The F value for overall predictability is 7.12 with an associated p-value of .0205 which is significant at α = .05. It is not significant at alpha of .01. The coefficient of determination is .372 with an adjusted r2 of .32. This represents very modest predictability. The standard error of the estimate is 982.219 which in units of 1,000 laborers means that about 68% of the predictions are within 982,219 of the actual figures. The regression model is: Number of Union Members = 22,348.97 - 0.0524 Labor Force. For a labor force of 100,000 (thousand, actually 100 million), substitute x = 100,000 and get a predicted value of 17,108.97 (thousand) which is actually 17,108,970 union members.
13.56 The Residual Model Diagnostics from MINITAB indicate a relatively healthy set
of residuals. The Histogram indicates that the error terms are generally normally distributed. This is confirmed by the nearly straight line Normal Plot of Residuals. The I Chart indicates a relatively homogeneous set of error terms throughout the domain of x values. This is confirmed by the Residuals vs. Fits graph. This residual diagnosis indicates no assumption violations.