PubH 7405: REGRESSION ANALYSIS SLR: INFERENCES, Part II
We cover the topic of inference in two sessions; the first session focused on inferences concerning the slope and the intercept; this session is a continuation covering estimation of the mean response – and more. Applications concerning the slope and the intercept are based on the following four (4) theorems.
SAMPLING DISTRIBUTION OF SLOPE

Theorem 1A: Under the "Normal Error Regression Model"

Y = β0 + β1x + ε, ε ∈ N(0, σ²)

the sampling distribution of the estimated slope b1 is Normal with Mean and Variance:

E(b1) = β1
σ²(b1) = σ²/Σ(xi − x̄)²
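Theorem 1A can be checked empirically: simulate repeated samples from a normal error regression model with known β1 and σ, refit the slope each time, and compare the empirical standard deviation of b1 with σ/√Σ(xi − x̄)². A minimal sketch in Python; the parameter values and the fixed x grid below are illustrative, not from the course data:

```python
import math
import random

random.seed(1)
beta0, beta1, sigma = 2.0, 0.5, 3.0      # illustrative true parameters
x = [10, 20, 30, 40, 50, 60, 70, 80]     # fixed X values (known constants)
xbar = sum(x) / len(x)
ssx = sum((xi - xbar) ** 2 for xi in x)  # Σ(xi − x̄)²

slopes = []
for _ in range(5000):
    y = [beta0 + beta1 * xi + random.gauss(0, sigma) for xi in x]
    ybar = sum(y) / len(y)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ssx
    slopes.append(b1)

mean_b1 = sum(slopes) / len(slopes)
sd_b1 = math.sqrt(sum((b - mean_b1) ** 2 for b in slopes) / (len(slopes) - 1))
theory_sd = sigma / math.sqrt(ssx)       # Theorem 1A: σ(b1) = σ/√Σ(xi − x̄)²

print(mean_b1, sd_b1, theory_sd)
```

With the seed fixed, the empirical mean of b1 lands very close to β1 = 0.5 and the empirical standard deviation close to the theoretical value.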
IMPLICATION

(b1 − β1)/s(b1) = [(b1 − β1)/σ(b1)] ÷ [s(b1)/σ(b1)]

The numerator (b1 − β1)/σ(b1) is distributed as N(0,1), and (n − 2)MSE/σ² is distributed as χ² with df = n − 2.

Theorem 1B: (b1 − β1)/s(b1) is distributed as "t" with (n − 2) degrees of freedom.
CONFIDENCE INTERVALS

Theorem 1B: (b1 − β1)/s(b1) is distributed as "t" with (n − 2) degrees of freedom.

The (1 − α)100% Confidence Interval for β1 is:

b1 ± t(1 − α/2; n − 2) s(b1)

where t(1 − α/2; n − 2) is the 100(1 − α/2) percentile of the "t" distribution with (n − 2) degrees of freedom.
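This interval is easy to compute from summary statistics alone. A sketch in Python, plugging in the birth-weight summary figures reported in Example #1 below (Slope = −1.737, MSE = 75.982, SS of X = 2,156.913, n = 12) and t(0.975; 10) = 2.228 from the t-table:

```python
import math

def slope_ci(b1, mse, ss_x, t_crit):
    """(1 − α)100% CI for β1: b1 ± t(1 − α/2; n − 2)·s(b1), with s(b1) = sqrt(MSE/SS_x)."""
    s_b1 = math.sqrt(mse / ss_x)
    return b1 - t_crit * s_b1, b1 + t_crit * s_b1

# Birth-weight summary figures (Example #1 below); t(0.975; 10) = 2.228
lo, hi = slope_ci(b1=-1.737, mse=75.982, ss_x=2156.913, t_crit=2.228)
print(lo, hi)
```

The interval excludes zero, consistent with the significant F test in the Example #1 ANOVA output later in these notes.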
SAMPLING DISTRIBUTION OF INTERCEPT

Theorem 2A: Under the "Normal Error Regression Model"

Y = β0 + β1x + ε, ε ∈ N(0, σ²)

the sampling distribution of the estimated intercept b0 is Normal with Mean and Variance:

E(b0) = β0
σ²(b0) = σ² [1/n + x̄²/Σ(xi − x̄)²]
IMPLICATION

(b0 − β0)/s(b0) = [(b0 − β0)/σ(b0)] ÷ [s(b0)/σ(b0)]

The numerator (b0 − β0)/σ(b0) is distributed as N(0,1), and (n − 2)MSE/σ² is distributed as χ² with df = n − 2.

Theorem 2B: (b0 − β0)/s(b0) is distributed as "t" with (n − 2) degrees of freedom.
CONFIDENCE INTERVALS

Theorem 2B: (b0 − β0)/s(b0) is distributed as "t" with (n − 2) degrees of freedom.

The (1 − α)100% Confidence Interval for β0 is:

b0 ± t(1 − α/2; n − 2) s(b0)

where t(1 − α/2; n − 2) is the 100(1 − α/2) percentile of the "t" distribution with (n − 2) degrees of freedom.
The Mean Response: E(Y|X = x) = β0 + β1x
A common objective in regression analysis is to estimate the mean response. For example: (1) we may want to know the average blood pressure for women at a certain age and how to estimate it using the relationship between SBP and Age, and (2) in a study of the relationship between level of pay (salary, X) and worker productivity (Y), the mean productivity at high, medium, and low levels of pay may be of particular interest for any company.
POINT ESTIMATE
The Mean Response: E(Y|X = x) = β0 + β1x
Let X = xh denote the level of X for which we wish to estimate the mean response, i.e. E(Y|X=xh); this xh may be a value which occurred in the sample, or it may be some other value of the predictor variable within the scope of the model. The point estimate of the response is:
Point Estimate: Ŷh = Ê(Y|X = xh) = b0 + b1xh
SAMPLING DISTRIBUTION

Theorem #3A: Under the "Normal Error Regression Model"

Y = β0 + β1x + ε, ε ∈ N(0, σ²)

the sampling distribution of the estimated Mean Response Ŷh is Normal with Mean and Variance:

E(Ŷh) = E(Y|X = xh) = β0 + β1xh
σ²(Ŷh) = σ² [1/n + (xh − x̄)²/Σ(xi − x̄)²]
b1 = Σ ki yi, where ki = (xi − x̄)/Σ(xi − x̄)²
b0 = (1/n)Σ yi − b1x̄
Ŷh = b0 + b1xh = Σ [1/n + (xh − x̄)ki] yi
The sampling distribution of Ŷh is "normal" because this estimated mean response, like the intercept and the slope, is a linear combination of the observations yi, and the distribution of each observation is normal under the "normal error regression model":
The estimated mean response is unbiased because the estimated intercept and estimated slope are both unbiased:
E(Ŷh) = E(b0 + b1xh) = E(b0) + E(b1)xh = β0 + β1xh = E(Y|X = xh)
Var(Ŷh) = Var(Σ [1/n + (xh − x̄)ki] yi)
        = σ² Σ [1/n + (xh − x̄)ki]²
        = σ² [Σ(1/n²) + (2/n)(xh − x̄)Σ ki + (xh − x̄)² Σ ki²]
        = σ² [1/n + (xh − x̄)²/Σ(xi − x̄)²]

(using Σ ki = 0 and Σ ki² = 1/Σ(xi − x̄)²)
Var(Ŷh) = σ² [1/n + (xh − x̄)²/Σ(xi − x̄)²]
s²(Ŷh) = MSE [1/n + (xh − x̄)²/Σ(xi − x̄)²]
Taking square root to get Standard Error
SE(Ŷh) = sqrt{ MSE [1/n + (xh − x̄)²/Σ(xi − x̄)²] }
Implication:
Our estimates are less precise toward the ends
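The variance formula makes this implication concrete: s(Ŷh) is smallest at xh = x̄ and grows as xh moves toward either end of the X range. A quick numerical check, plugging in the Example #1 summary figures (MSE = 75.982, n = 12, x̄ = 100.58, SS of X = 2,156.913) reported later in these notes:

```python
import math

def se_mean_response(mse, n, xh, xbar, ss_x):
    """SE(Ŷh) = sqrt(MSE · [1/n + (xh − x̄)²/SS_x])."""
    return math.sqrt(mse * (1.0 / n + (xh - xbar) ** 2 / ss_x))

# Example #1 (birth weight) summary figures
stats = dict(mse=75.982, n=12, xbar=100.58, ss_x=2156.913)
for xh in (80, 90, 100.58, 110, 120):
    print(xh, round(se_mean_response(xh=xh, **stats), 3))
```

The printed standard errors bottom out at xh = x̄ = 100.58 and increase symmetrically in the squared distance from the mean.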
MORE ON SAMPLING DISTRIBUTION

[Ŷh − E(Ŷh)]/s(Ŷh) = {[Ŷh − E(Ŷh)]/σ(Ŷh)} ÷ [s(Ŷh)/σ(Ŷh)]

The numerator is distributed as N(0,1), and (n − 2)MSE/σ² is distributed as χ² with df = n − 2.

Theorem #3B: [Ŷh − E(Ŷh)]/s(Ŷh) is distributed as "t" with (n − 2) degrees of freedom.
CONFIDENCE INTERVALS

Theorem #3B: [Ŷh − E(Ŷh)]/s(Ŷh) is distributed as "t" with (n − 2) degrees of freedom.

The (1 − α)100% Confidence Interval for E(Y|X = xh) is:

Ŷh ± t(1 − α/2; n − 2) s(Ŷh)

where t(1 − α/2; n − 2) is the 100(1 − α/2) percentile of the "t" distribution with (n − 2) degrees of freedom.
x (oz)   y (%)
112       63
111       66
107       72
119       52
 92       75
 80      118
 81      120
 84      114
118       42
106       72
103       90
 94       91
EXAMPLE #1: Birth weight data: Intercept = 255.972; Slope = −1.737; MSE = 75.982; Mean of X = 100.58; SS of X = 2,156.913. For children with birth weight of xh = 95 ounces, the point estimate and 95% Confidence Interval for the Mean growth between 70-100 days as % of BW are:

Ŷ = 255.972 + (−1.737)(95) = 90.96%
s²(Ŷ) = (75.982)[1/12 + (95 − 100.58)²/2,156.913] = 7.429
90.96 ± (2.228)√7.429 = 90.96 ± 6.07 = (84.89%, 97.03%)

(t(0.975; 10) = 2.228)
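Example #1 can be reproduced end-to-end from the raw birth-weight data; full-precision arithmetic agrees with the rounded summary figures to the last digit or so. A sketch in Python, taking t(0.975; 10) = 2.228 from the t-table:

```python
import math

# Birth-weight data from these notes: x = birth weight (oz), y = growth (%)
x = [112, 111, 107, 119, 92, 80, 81, 84, 118, 106, 103, 94]
y = [63, 66, 72, 52, 75, 118, 120, 114, 42, 72, 90, 91]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
ss_x = sum((xi - xbar) ** 2 for xi in x)                      # Σ(xi − x̄)²
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ss_x
b0 = ybar - b1 * xbar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)

xh, t_crit = 95, 2.228                                        # t(0.975; 10)
yhat = b0 + b1 * xh
se = math.sqrt(mse * (1 / n + (xh - xbar) ** 2 / ss_x))       # SE(Ŷh)
print(round(b0, 3), round(b1, 3), round(mse, 3))
print(round(yhat, 2), round(yhat - t_crit * se, 2), round(yhat + t_crit * se, 2))
```

The computed intercept, slope, and MSE match the summary line above, and the interval reproduces the 95% CI for the mean response at xh = 95.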
EXAMPLE #2: Age and SBP

Age (x)  SBP (y)
42       130
46       115
42       148
71       100
80       156
74       162
70       151
80       156
85       162
72       158
64       155
81       160
41       125
61       150
75       165
Intercept = 99.958; Slope = 0.705; MSE = 278.554; Mean of X = 65.6; SS of X = 3,403.6. For women xh = 60 years old, the point estimate and 95% Confidence Interval for the Mean SBP are:
Ŷ = 99.958 + (0.705)(60) = 142.26
s²(Ŷ) = (278.554)[1/15 + (60 − 65.6)²/3,403.6] = 21.137
142.3 ± (2.160)√21.137 = 142.3 ± 9.93 = (132.4, 152.2)

(t(0.975; 13) = 2.160)
EXAMPLE #3: Toluca Company Data

LotSize  WorkHours
 80      399
 30      121
 50      221
 90      376
 70      361
 60      224
120      546
 80      352
100      353
 50      157
 40      160
 70      252
 90      389
 20      113
110      435
100      420
 30      212
 50      268
 90      377
110      421
 30      273
 90      468
 40      244
 80      342
 70      323
Intercept = 62.366; Slope = 3.570; MSE = 2,384; Mean of X = 70.0; SS of X = 19,800. For lots of size xh = 65 units, the point estimate and 90% Confidence Interval for the Mean Work Hours are:
Ŷ = 62.37 + (3.57)(65) = 294.4
s²(Ŷ) = (2,384)[1/25 + (65 − 70.0)²/19,800] = 98.37
294.4 ± (1.714)√98.37 = 294.4 ± 17.0 = (277.4, 311.4)

(t(0.95; 23) = 1.714)
In regression analysis, besides estimating the mean response, sometimes one may want to estimate a new individual response. For example: (1) in addition to estimating the average blood pressure for women at a certain age using the relationship between SBP and Age, we may be interested in estimating the SBP of a particular woman/patient at that age; and (2) in a study of the relationship between pay (salary, X) and worker productivity (Y), the interest may focus on the productivity of one particular worker.
POINT ESTIMATE Let X = xh denote the level of X under investigation, at which the mean response is E(Y|X=xh). Let Yh(new) be the value of the new individual response of interest. This new observation of Y to be predicted is often viewed as the result of a new trial independent of the trials on which the regression line is formed. The point estimate is still the same as that of the mean response:
E(Y|X = xh) = β0 + β1xh
Ŷh = b0 + b1xh
Ŷh(new) = Ŷh

Same as the point estimate of the mean response.
VARIANCE The point estimates of the mean response and of an individual response are the same, but the variances are different. In estimating an individual response, there are two layers of variation: (a) variation in the "position of the distribution" (that is, of the mean response), and (b) variation within that distribution (that is, from the individual response to the mean response).
Theorem #4A: Under the "Normal Error Regression Model", the sampling distribution of Ŷh(new) is normal, with variance:

Var(Ŷh(new)) = Var(Y) + Var(Ŷh)
             = σ² + σ² [1/n + (xh − x̄)²/Σ(xi − x̄)²]
             = σ² [1 + 1/n + (xh − x̄)²/Σ(xi − x̄)²]
Var(Ŷh(new)) = σ² [1 + 1/n + (xh − x̄)²/Σ(xi − x̄)²]
s²(Ŷh(new)) = MSE [1 + 1/n + (xh − x̄)²/Σ(xi − x̄)²]
Taking square root to get Standard Error
MORE ON SAMPLING DISTRIBUTION

Theorem #4B: [Yh(new) − Ŷh]/s(Ŷh(new)) is distributed as "t" with (n − 2) degrees of freedom.
Inferences on a new individual response are based on the following results:
s²(Ŷh(new)) = MSE [1 + 1/n + (xh − x̄)²/Σ(xi − x̄)²]
SE(Ŷh(new)) = sqrt{ MSE [1 + 1/n + (xh − x̄)²/Σ(xi − x̄)²] }
Again:
Our estimates are less precise toward the ends
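The only change from the mean-response interval is the extra "1" inside the bracket, which is why the prediction interval is always wider. A sketch contrasting the two, plugging in the Toluca summary figures (MSE = 2,384, n = 25, x̄ = 70.0, SS of X = 19,800, Ŷ = 294.4 at xh = 65, t(0.95; 23) = 1.714):

```python
import math

def interval(yhat, mse, n, xh, xbar, ss_x, t_crit, new=False):
    """CI for the mean response (new=False) or prediction interval for a new
    individual response (new=True); the latter adds 1 inside the bracket."""
    bracket = (1.0 if new else 0.0) + 1.0 / n + (xh - xbar) ** 2 / ss_x
    half = t_crit * math.sqrt(mse * bracket)
    return yhat - half, yhat + half

# Toluca Company summary figures (Example #3); t(0.95; 23) = 1.714
args = dict(yhat=294.4, mse=2384, n=25, xh=65, xbar=70.0, ss_x=19800, t_crit=1.714)
ci = interval(**args)             # 90% CI for the mean work hours
pi = interval(new=True, **args)   # 90% prediction interval, always wider
print(ci, pi)
```

The 90% CI reproduces (277.4, 311.4) from Example #3, while the prediction interval for a single new lot is roughly five times as wide.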
Normal Error Regression Model: Y = β0 + β1x + ε, ε ∈ N(0, σ²)

{ei} is a sample with mean zero; σ̂² = MSE

Theorem #5:
E(MSE) = σ²
SSE/σ² is distributed as χ² with df = n − 2
THE TEST FOR INDEPENDENCE

The Mean Response: E(Y|X = x) = β0 + β1x

H0: β1 = 0; "t" test at (n − 2) degrees of freedom:

t = b1/s(b1)

which is identical to the test using "r":

t = r√(n − 2)/√(1 − r²)
The method we use most often is this "Test for Independence", which we now approach in a different way: ANOVA.
COMPONENTS OF VARIATION
• The variation in Y is conventionally measured in terms of the deviations (Yi − Ȳ); the total variation, denoted by SST, is the sum of squared deviations: SST = Σ(Yi − Ȳ)². For example, SST = 0 when all observations are the same; SST is the numerator of the sample variance of Y: the greater SST, the greater the variation among the Y-values.
• In the regression analysis, the variation in Y is decomposed into two components: (Yi − Ȳ) = (Yi − Ŷi) + (Ŷi − Ȳ)
DECOMPOSITION OF SST
• In the decomposition: (Yi − Ȳ) = (Yi − Ŷi) + (Ŷi − Ȳ)
• The first term (RHS) reflects the variation around the regression line, the part that cannot be explained by the regression itself, with the sum of squared errors SSE = Σ(Yi − Ŷi)².
• The difference between the above two sums of squares, SSR = SST − SSE = Σ(Ŷi − Ȳ)², is called the regression sum of squares; SSR may be considered a measure of the variation in Y associated with, or explained by, the regression line.
Regression helps to "improve" the estimate of Y from Ȳ (without any information) to Ŷ (with information provided by knowing X).
SST = Σ(Yi − Ȳ)²
    = Σ[(Yi − Ŷi) + (Ŷi − Ȳ)]²
    = Σ(Yi − Ŷi)² + Σ(Ŷi − Ȳ)² + 2Σ(Yi − Ŷi)(Ŷi − Ȳ)
    = SSE + SSR + 2Σ ei(Ŷi − Ȳ)

Since Σ ei(Ŷi − Ȳ) = 0:

SST = SSE + SSR
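The decomposition, and the vanishing cross-product term, can be verified numerically on the birth-weight data of Example #1:

```python
# Birth-weight data from Example #1: x = birth weight (oz), y = growth (%)
x = [112, 111, 107, 119, 92, 80, 81, 84, 118, 106, 103, 94]
y = [63, 66, 72, 52, 75, 118, 120, 114, 42, 72, 90, 91]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
     / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
fitted = [b0 + b1 * xi for xi in x]                            # Ŷi

sst = sum((yi - ybar) ** 2 for yi in y)                        # Σ(Yi − Ȳ)²
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))         # Σ(Yi − Ŷi)²
ssr = sum((fi - ybar) ** 2 for fi in fitted)                   # Σ(Ŷi − Ȳ)²
cross = sum((yi - fi) * (fi - ybar) for yi, fi in zip(y, fitted))  # Σ ei(Ŷi − Ȳ)

print(round(sst, 1), round(sse, 1), round(ssr, 1), round(cross, 10))
```

SST comes out near 7268, SSE near 759.8, and SSR near 6508, matching the Example #1 ANOVA output below, with the cross term zero up to floating-point error.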
ANALYSIS OF VARIANCE
• SST measures the "total variation" in the sample (of values of the dependent variable) with (n − 1) degrees of freedom, n being the sample size. It is decomposed into: SST = SSE + SSR
• (1) SSE measures the variation that cannot be explained by the regression, with (n − 2) degrees of freedom, and
• (2) SSR measures the variation in Y associated with or explained by the regression line, with 1 degree of freedom (representing the slope).
SSR = Σ(Ŷi − Ȳ)²
    = Σ(b0 + b1xi − Ȳ)²
    = Σ[(Ȳ − b1x̄) + b1xi − Ȳ]²
    = b1² Σ(xi − x̄)²

E(MSR) = E[b1² Σ(xi − x̄)²]
       = Σ(xi − x̄)² E(b1²)
       = Σ(xi − x̄)² [σ²(b1) + β1²]
       = σ² + β1² Σ(xi − x̄)²

using Var(X) = E(X²) − {E(X)}², i.e. E(X²) = Var(X) + {E(X)}²
"ANOVA" TABLE
• The breakdown of the total sum of squares and its associated degrees of freedom is displayed in the form of an "analysis of variance table" (ANOVA table) for regression analysis as follows:

Source of Variation   SS    df      MS    F Statistic   p-value
Regression            SSR   1       MSR   MSR/MSE
Error                 SSE   n − 2   MSE
Total                 SST   n − 1
• Recall: MSE, the “error mean square”, serves as an estimate of the constant variance σ2 as stipulated by the regression model.
E(MSE) = σ²
E(MSR) = σ² + β1² Σ(xi − x̄)²
Under the Null Hypothesis H0: β1 = 0, E(MSE) = E(MSR) so that F=MSR/MSE is expected to be near 1.0
Theorem 6: F is distributed, under H0, as F(1,n-2) following a theorem by Cochran.
THE F-TEST
The test statistic F for the above analysis of variance approach compares MSR and MSE; a value near 1 supports the null hypothesis of independence. In fact, we have F = t², where t is the test statistic for testing whether or not β1 = 0; the F-test is equivalent to the two-sided t-test when referred to the F-table in Appendix B (Table B.4) with (1, n − 2) degrees of freedom.
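The equivalence F = t² can be checked with the Example #2 (Age & SBP) summary figures (Slope = 0.705, MSE = 278.554, SS of X = 3,403.6, SST = 5,312, n = 15), whose ANOVA output below reports F = 6.071:

```python
import math

# Example #2 (Age & SBP) summary figures from these notes
b1, mse, ss_x, sst, n = 0.705, 278.554, 3403.6, 5312, 15

ssr = b1 ** 2 * ss_x                 # SSR = b1²·Σ(xi − x̄)²
f_stat = (ssr / 1) / mse             # F = MSR/MSE, with MSR = SSR/1

t_stat = b1 / math.sqrt(mse / ss_x)  # t = b1/s(b1)

r = math.sqrt(ssr / sst)             # r² = SSR/SST (slope is positive here)
t_from_r = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(round(f_stat, 3), round(t_stat ** 2, 3), round(t_from_r, 3))
```

F and t² agree exactly (they are the same quantity algebraically), and the r-based t statistic matches up to the rounding in the reported SST.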
THE TEST FOR INDEPENDENCE
The Null Hypothesis: H0: β1 = 0. Two identical choices:

(1) "t" test at (n − 2) degrees of freedom:
    t = b1/s(b1) = r√(n − 2)/√(1 − r²)

(2) "F" test at (1, n − 2) degrees of freedom:
    F = MSR/MSE
COEFFICIENT OF DETERMINATION
• We can express the coefficient of determination (the square of the coefficient of correlation r) as:

r² = SSR/SST

• That is the portion of the total variation attributable to the regression. Regression helps to "improve" the estimate of Y from Ȳ (without any information) to Ŷ (with information provided by knowing X), reducing the total variation by (100)(r²)%.
EXAMPLE #1: Birth Weight Data
SUMMARY OUTPUT

Regression Statistics
R Square        0.89546
Observations    12

ANOVA
             df    SS      MS      F       Significance F
Regression    1    6508    6508    85.66   3.21622E-06
Residual     10    759.8   75.98
Total        11    7268
EXAMPLE #2: AGE & SBP
SUMMARY OUTPUT

Regression Statistics
R Square        0.3183
Observations    15

ANOVA
             df    SS     MS      F       Significance F
Regression    1    1691   1691    6.071   0.028453563
Residual     13    3621   278.6
Total        14    5312
EXAMPLE #3: Toluca Company Data
SUMMARY OUTPUT

Regression Statistics
R Square        0.8215
Observations    25

ANOVA
             df    SS        MS        F        Significance F
Regression    1    252,378   252,378   105.88   <0.0001
Residual     23    54,825    2,384
Total        24    307,203
Normal Error Regression Model:

Y = β0 + β1x + ε, ε ∈ N(0, σ²)
The normal regression model assumes that the X values are known constants; we do not impose any distribution on the x-values.
In many cases, this is not true; for example, if we study the relationship between "height of a person" and "weight of a person", a sample of persons is taken but both measurements are random. Rather than a regression model, one should consider a "correlation model"; the most widely used is the "Bivariate Normal Distribution" with density:
f(x, y) = [1/(2π σx σy √(1 − ρ²))] exp{ −[1/(2(1 − ρ²))] [((x − μx)/σx)² − 2ρ((x − μx)/σx)((y − μy)/σy) + ((y − μy)/σy)²] }

ρ = σxy/(σx σy)
σxy = Cov(X, Y) = E[(X − μx)(Y − μy)]
σxy is the Covariance and ρ is the Coefficient of Correlation between the two random variables X and Y; ρ is estimated by the (sample) Coefficient of Correlation r.
CORRELATION MODEL “Correlation Data” are often cross-sectional or observational. Instead of a regression model, one should consider a “correlation model”; the most widely used is the “Bivariate Normal Distribution” with density:
The Coefficient of Correlation ρ between the two random variables X and Y is estimated by the (sample) Coefficient of Correlation r, but the sampling distribution of r is far from normal. Confidence intervals for ρ are obtained by first making "Fisher's z transformation"; the distribution of z is approximately normal if the sample size is not too small.
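A sketch of the Fisher z interval: transform r to z′ = arctanh(r), build a normal interval for z′ with standard error 1/√(n − 3), and back-transform the endpoints with tanh. The r and n values below are illustrative (r ≈ 0.56 is roughly the correlation in Example #2, where n = 15):

```python
import math
from statistics import NormalDist

def rho_ci(r, n, conf=0.95):
    """Fisher z interval for ρ: z' = arctanh(r) is approximately
    N(arctanh(ρ), 1/(n − 3)); back-transform the endpoints with tanh."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    zcrit = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # e.g. 1.96 for 95%
    return math.tanh(z - zcrit * se), math.tanh(z + zcrit * se)

lo, hi = rho_ci(r=0.564, n=15)
print(round(lo, 3), round(hi, 3))
```

Note how asymmetric the interval is around r = 0.564 at this small sample size; that asymmetry is exactly what the transformation is for.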
CONDITIONAL DISTRIBUTION
Theorem: The conditional distribution of Y for any given X = x is normal, with mean β0 + β1x and standard deviation σy|x, where:

β0 = μy − μx ρ(σy/σx)
β1 = ρ(σy/σx)
σ²y|x = σ²y (1 − ρ²)
Again, since Var(Y|X) = (1 − ρ²)Var(Y), ρ is both a measure of linear association and a measure of "variance reduction" (in Y, associated with knowledge of X) – that is why we call r², an estimate of ρ², the "coefficient of determination".
Readings & Exercises • Readings: A thorough reading of the text’s
sections 2.4-2.5 (pp. 52-61), 2.7 (pp. 63-71), and 2.11 (pp. 78-82) is highly recommended.
• Exercises: The following exercises are good for practice, all from chapter 2 of text: 2.13, 2.23, 2.24, 2.28, and 2.29.
Due As Homework
#9.1 Refer to dataset "Cigarettes", Y = Cotinine & X = CPD:
a) Obtain the 95% confidence interval for the mean Cotinine level for subjects who consumed X = 30 cigarettes per day and give your interpretation.
b) Obtain the 95% confidence interval for the Cotinine level of a subject who consumed 30 cigarettes per day; why is the result different from (a)?
c) Plot the residuals against X. What would be your conclusion about their possible linear relationship? What would be the average residual?
d) Set up the ANOVA table and test whether or not a linear association exists between Cotinine and CPD.
#9.2 Answer the 4 questions of Exercise 9.1 using dataset “Vital Capacity” with X = Age and Y = (100)(Vital Capacity); use X= 35 years for questions (a) and (b).