
PubH 7405: REGRESSION ANALYSIS

SLR: INFERENCES, Part II

We cover the topic of inference in two sessions; the first session focused on inferences concerning the slope and the intercept, and this session continues with estimating the mean response, and more. Applications concerning the slope and the intercept are based on the following four (4) theorems.

SAMPLING DISTRIBUTION OF SLOPE

Theorem 1A: Under the "Normal Error Regression Model"

$$Y = \beta_0 + \beta_1 x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$

the sampling distribution of the estimated slope $b_1$ is Normal with mean and variance:

$$E(b_1) = \beta_1, \qquad \sigma^2(b_1) = \frac{\sigma^2}{\sum_i (x_i - \bar{x})^2}$$

IMPLICATION

$$\frac{b_1 - \beta_1}{\sigma(b_1)} \sim N(0,1), \qquad \frac{(n-2)\,s^2(b_1)}{\sigma^2(b_1)} = \frac{SSE}{\sigma^2} \sim \chi^2 \text{ with } df = n-2$$

$$\frac{b_1 - \beta_1}{s(b_1)} = \frac{b_1 - \beta_1}{\sigma(b_1)} \div \sqrt{\frac{s^2(b_1)}{\sigma^2(b_1)}}$$

Theorem 1B: $\dfrac{b_1 - \beta_1}{s(b_1)}$ is distributed as "t" with $(n-2)$ degrees of freedom.

CONFIDENCE INTERVALS

Theorem 1B: $\dfrac{b_1 - \beta_1}{s(b_1)}$ is distributed as "t" with $(n-2)$ degrees of freedom.

$t(1-\alpha/2;\, n-2)$ is the $100(1-\alpha/2)$ percentile of the "t" distribution with $(n-2)$ degrees of freedom.

The $(1-\alpha)100\%$ Confidence Interval for $\beta_1$ is:

$$b_1 \pm t(1-\alpha/2;\, n-2)\, s(b_1)$$

SAMPLING DISTRIBUTION OF INTERCEPT

Theorem 2A: Under the "Normal Error Regression Model"

$$Y = \beta_0 + \beta_1 x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$

the sampling distribution of the estimated intercept $b_0$ is Normal with mean and variance:

$$E(b_0) = \beta_0, \qquad \sigma^2(b_0) = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_i (x_i - \bar{x})^2} \right]$$

IMPLICATION

$$\frac{b_0 - \beta_0}{\sigma(b_0)} \sim N(0,1), \qquad \frac{(n-2)\,s^2(b_0)}{\sigma^2(b_0)} = \frac{SSE}{\sigma^2} \sim \chi^2 \text{ with } df = n-2$$

$$\frac{b_0 - \beta_0}{s(b_0)} = \frac{b_0 - \beta_0}{\sigma(b_0)} \div \sqrt{\frac{s^2(b_0)}{\sigma^2(b_0)}}$$

Theorem 2B: $\dfrac{b_0 - \beta_0}{s(b_0)}$ is distributed as "t" with $(n-2)$ degrees of freedom.

CONFIDENCE INTERVALS

Theorem 2B: $\dfrac{b_0 - \beta_0}{s(b_0)}$ is distributed as "t" with $(n-2)$ degrees of freedom.

$t(1-\alpha/2;\, n-2)$ is the $100(1-\alpha/2)$ percentile of the "t" distribution with $(n-2)$ degrees of freedom.

The $(1-\alpha)100\%$ Confidence Interval for $\beta_0$ is:

$$b_0 \pm t(1-\alpha/2;\, n-2)\, s(b_0)$$
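As a concrete companion to Theorems 1B and 2B, here is a minimal Python sketch (our own illustration, assuming numpy and scipy are available) that computes b0, b1, their standard errors, and both confidence intervals from raw (x, y) data:

```python
import numpy as np
from scipy import stats

def slope_intercept_ci(x, y, alpha=0.05):
    """(1-alpha)100% confidence intervals for the slope b1 and
    intercept b0 of a simple linear regression (Theorems 1B and 2B)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    xbar = x.mean()
    Sxx = np.sum((x - xbar) ** 2)                    # SS of X
    b1 = np.sum((x - xbar) * (y - y.mean())) / Sxx   # estimated slope
    b0 = y.mean() - b1 * xbar                        # estimated intercept
    mse = np.sum((y - b0 - b1 * x) ** 2) / (n - 2)   # estimate of sigma^2
    s_b1 = np.sqrt(mse / Sxx)                        # s(b1)
    s_b0 = np.sqrt(mse * (1 / n + xbar ** 2 / Sxx))  # s(b0)
    t = stats.t.ppf(1 - alpha / 2, df=n - 2)         # t(1-alpha/2; n-2)
    return (b1 - t * s_b1, b1 + t * s_b1), (b0 - t * s_b0, b0 + t * s_b0)
```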

The Mean Response: $E(Y \mid X = x) = \beta_0 + \beta_1 x$

A common objective in regression analysis is to estimate the mean response. For example: (1) we may want to know the average blood pressure for women at a certain age and how to estimate it using the relationship between SBP and Age; and (2) in a study of the relationship between level of pay (salary, X) and worker productivity (Y), the mean productivity at high, medium, and low levels of pay may be of particular interest for any company.

POINT ESTIMATE

The Mean Response: $E(Y \mid X = x) = \beta_0 + \beta_1 x$

Let X = xh denote the level of X for which we wish to estimate the mean response, i.e. E(Y|X=xh); this xh may be a value that occurred in the sample, or it may be some other value of the predictor variable within the scope of the model. The point estimate of the mean response is:

$$\text{Point Estimate:} \quad \hat{Y}_h = \hat{E}(Y \mid X = x_h) = b_0 + b_1 x_h$$

SAMPLING DISTRIBUTION

Theorem 3A: Under the "Normal Error Regression Model"

$$Y = \beta_0 + \beta_1 x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$

the sampling distribution of the estimated Mean Response $\hat{Y}_h$ is Normal with mean and variance:

$$E(\hat{Y}_h) = E(Y \mid X = x_h) = \beta_0 + \beta_1 x_h, \qquad \sigma^2(\hat{Y}_h) = \sigma^2 \left[ \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_i (x_i - \bar{x})^2} \right]$$

Writing $b_1 = \sum_i k_i y_i$ (with weights $k_i = (x_i - \bar{x}) / \sum_j (x_j - \bar{x})^2$) and $b_0 = \frac{1}{n}\sum_i y_i - b_1 \bar{x}$, we have:

$$\hat{Y}_h = b_0 + b_1 x_h = \sum_i \left[ \frac{1}{n} + k_i (x_h - \bar{x}) \right] y_i$$

The sampling distribution of Ŷh is "normal" because this estimated mean response, like the intercept and the slope, is a linear combination of the observations yi, and the distribution of each observation is normal under the "normal error regression model".
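The linear-combination identity can be verified numerically; a small Python sketch (our own, with made-up numbers, not data from the text):

```python
import numpy as np

# Hypothetical data, used only to check the identity numerically.
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([3.1, 4.9, 6.2, 7.8, 10.1])
xh = 6.0

n, xbar = len(x), x.mean()
Sxx = np.sum((x - xbar) ** 2)
k = (x - xbar) / Sxx                            # weights with b1 = sum(k_i*y_i)
b1 = np.sum(k * y)
b0 = y.mean() - b1 * xbar

direct = b0 + b1 * xh                           # Yhat_h computed directly
linear = np.sum((1 / n + k * (xh - xbar)) * y)  # Yhat_h as a linear combination
assert np.isclose(direct, linear)               # the two agree
```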

The estimated mean response is unbiased because the estimated intercept and estimated slope are both unbiased:

$$E(\hat{Y}_h) = E(b_0 + b_1 x_h) = E(b_0) + E(b_1)\,x_h = \beta_0 + \beta_1 x_h = E(Y \mid X = x_h)$$

$$\hat{Y}_h = \sum_i \left[ \frac{1}{n} + k_i (x_h - \bar{x}) \right] y_i$$

$$\operatorname{Var}(\hat{Y}_h) = \sigma^2 \sum_i \left[ \frac{1}{n} + k_i (x_h - \bar{x}) \right]^2 = \sigma^2 \sum_i \left[ \frac{1}{n^2} + \frac{2}{n} k_i (x_h - \bar{x}) + k_i^2 (x_h - \bar{x})^2 \right]$$

Since $\sum_i k_i = 0$ and $\sum_i k_i^2 = 1 / \sum_i (x_i - \bar{x})^2$:

$$\operatorname{Var}(\hat{Y}_h) = \sigma^2 \left[ \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_i (x_i - \bar{x})^2} \right]$$

$$\operatorname{Var}(\hat{Y}_h) = \sigma^2 \left[ \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_i (x_i - \bar{x})^2} \right] \quad\Longrightarrow\quad s^2(\hat{Y}_h) = MSE \left[ \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_i (x_i - \bar{x})^2} \right]$$

Taking square root to get Standard Error

$$SE(\hat{Y}_h) = \sqrt{MSE \left[ \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_i (x_i - \bar{x})^2} \right]}$$

Implication:

Our estimates are less precise toward the ends of the range of X, that is, the farther $x_h$ is from $\bar{x}$.

MORE ON SAMPLING DISTRIBUTION

$$\frac{\hat{Y}_h - E(\hat{Y}_h)}{\sigma(\hat{Y}_h)} \sim N(0,1), \qquad \frac{(n-2)\,s^2(\hat{Y}_h)}{\sigma^2(\hat{Y}_h)} = \frac{SSE}{\sigma^2} \sim \chi^2 \text{ with } df = n-2$$

$$\frac{\hat{Y}_h - E(\hat{Y}_h)}{s(\hat{Y}_h)} = \frac{\hat{Y}_h - E(\hat{Y}_h)}{\sigma(\hat{Y}_h)} \div \sqrt{\frac{s^2(\hat{Y}_h)}{\sigma^2(\hat{Y}_h)}}$$

Theorem 3B: $\dfrac{\hat{Y}_h - E(\hat{Y}_h)}{s(\hat{Y}_h)}$ is distributed as "t" with $(n-2)$ degrees of freedom.

CONFIDENCE INTERVALS

Theorem 3B: $\dfrac{\hat{Y}_h - E(\hat{Y}_h)}{s(\hat{Y}_h)}$ is distributed as "t" with $(n-2)$ degrees of freedom.

$t(1-\alpha/2;\, n-2)$ is the $100(1-\alpha/2)$ percentile of the "t" distribution with $(n-2)$ degrees of freedom.

The $(1-\alpha)100\%$ Confidence Interval for the Mean Response $E(Y \mid X = x_h)$ is:

$$\hat{Y}_h \pm t(1-\alpha/2;\, n-2)\, s(\hat{Y}_h)$$

x (oz)   y (%)
112       63
111       66
107       72
119       52
 92       75
 80      118
 81      120
 84      114
118       42
106       72
103       90
 94       91

EXAMPLE #1: Birth weight data: Intercept = 256.972; Slope = -1.737; MSE = 75.982; Mean of X = 100.58; SS of X = 2,156.913. For children with birth weight xh = 95 ounces, the point estimate and 95% Confidence Interval for the mean growth between 70 and 100 days (as % of birth weight) are:

$$\hat{Y}_h = 256.972 + (-1.737)(95) = 91.757\%$$

$$s^2(\hat{Y}_h) = (75.982)\left[ \frac{1}{12} + \frac{(95 - 100.58)^2}{2{,}156.913} \right] = 7.429$$

$$91.76 \pm 2.228\sqrt{7.429} = (85.69\%,\ 97.83\%)$$

where $2.228 = t(0.975;\, 10)$.
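The Example #1 numbers can be reproduced directly from the data table above; a minimal Python sketch (assuming numpy and scipy are available):

```python
import numpy as np
from scipy import stats

# Birth weight data from Example #1.
x = np.array([112, 111, 107, 119, 92, 80, 81, 84, 118, 106, 103, 94], float)
y = np.array([63, 66, 72, 52, 75, 118, 120, 114, 42, 72, 90, 91], float)
xh, n = 95.0, len(x)

xbar = x.mean()                                  # 100.58
Sxx = np.sum((x - xbar) ** 2)                    # 2,156.913
b1 = np.sum((x - xbar) * (y - y.mean())) / Sxx   # -1.737
b0 = y.mean() - b1 * xbar                        # 256.972
mse = np.sum((y - b0 - b1 * x) ** 2) / (n - 2)   # 75.982

yhat = b0 + b1 * xh                              # 91.757
s2 = mse * (1 / n + (xh - xbar) ** 2 / Sxx)      # 7.429
t = stats.t.ppf(0.975, df=n - 2)                 # 2.228
print(yhat - t * np.sqrt(s2), yhat + t * np.sqrt(s2))  # ~ (85.69, 97.83)
```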

EXAMPLE #2: Age and SBP

Age (x)   SBP (y)
42        130
46        115
42        148
71        100
80        156
74        162
70        151
80        156
85        162
72        158
64        155
81        160
41        125
61        150
75        165

Intercept = 99.958; Slope = 0.705; MSE = 278.554; Mean of X = 65.6; SS of X = 3,403.6. For women aged xh = 60 years, the point estimate and 95% Confidence Interval for the mean SBP are:

$$\hat{Y}_h = 99.958 + (0.705)(60) = 142.26$$

$$s^2(\hat{Y}_h) = (278.554)\left[ \frac{1}{15} + \frac{(60 - 65.6)^2}{3{,}403.6} \right] = 21.137$$

$$142.3 \pm 2.160\sqrt{21.137} = (132.4,\ 152.2)$$

where $2.160 = t(0.975;\, 13)$.

EXAMPLE #3: Toluca Company Data

LotSize   WorkHours
 80       399
 30       121
 50       221
 90       376
 70       361
 60       224
120       546
 80       352
100       353
 50       157
 40       160
 70       252
 90       389
 20       113
110       435
100       420
 30       212
 50       268
 90       377
110       421
 30       273
 90       468
 40       244
 80       342
 70       323

Intercept = 62.366; Slope = 3.570; MSE = 2,384; Mean of X = 70.0; SS of X = 19,800. For lots of size xh = 65 units, the point estimate and 90% Confidence Interval for the mean work hours are:

$$\hat{Y}_h = 62.37 + (3.57)(65) = 294.4$$

$$s^2(\hat{Y}_h) = (2{,}384)\left[ \frac{1}{25} + \frac{(65 - 70.0)^2}{19{,}800} \right] = 98.47$$

$$294.4 \pm 1.714\sqrt{98.47} = (277.4,\ 311.4)$$

where $1.714 = t(0.95;\, 23)$.

In regression analysis, besides estimating the mean response, sometimes one may want to estimate a new individual response. For example: (1) in addition to estimating the average blood pressure for women at a certain age using the relationship between SBP and Age, we may be interested in estimating the SBP of a particular woman/patient at that age; and (2) in a study of the relationship between pay (salary, X) and worker productivity (Y), the interest may focus on the productivity of a particular worker.

POINT ESTIMATE Let X = xh denote the level of X under investigation, at which the mean response is E(Y|X=xh). Let Yh(new) be the value of the new individual response of interest. This new observation of Y to be predicted is often viewed as the result of a new trial, independent of the trials on which the regression line is based. The point estimate is the same as that of the mean response:

$$E(Y \mid X = x_h) = \beta_0 + \beta_1 x_h$$

$$\hat{Y}_{h(new)} = \hat{Y}_h = b_0 + b_1 x_h$$

Same as the mean response.

VARIANCE The point estimates of the mean response and of an individual response are the same, but the variances are different. In estimating an individual response, there are two layers of variation: (a) variation in the "position of the distribution" (that is, of the mean response), and (b) variation within that distribution (that is, from the individual response to the mean response).

Theorem 4A: Under the "Normal Error Regression Model", the sampling distribution of $\hat{Y}_{h(new)}$ is normal, with variance:

$$\operatorname{Var}(\hat{Y}_{h(new)}) = \sigma^2 + \operatorname{Var}(\hat{Y}_h) = \sigma^2 \left[ 1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_i (x_i - \bar{x})^2} \right]$$

$$s^2(\hat{Y}_{h(new)}) = MSE \left[ 1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_i (x_i - \bar{x})^2} \right]$$

Taking square root to get Standard Error

MORE ON SAMPLING DISTRIBUTION

Theorem 4B: $\dfrac{Y_{h(new)} - \hat{Y}_h}{s(\hat{Y}_{h(new)})}$ is distributed as "t" with $(n-2)$ degrees of freedom.

Inferences on a new individual response are based on the following results:

$$SE(\hat{Y}_{h(new)}) = \sqrt{MSE \left[ 1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_i (x_i - \bar{x})^2} \right]}$$

Again:

Our estimates are less precise toward the ends
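To make the contrast with the mean-response interval concrete, here is a minimal Python sketch (our own, assuming scipy; the summary statistics are taken from the Toluca Company example above) computing both the 90% confidence interval for the mean response and the 90% prediction interval for a new response at xh = 65:

```python
import numpy as np
from scipy import stats

# Summary statistics from the Toluca Company example.
b0, b1, mse = 62.366, 3.570, 2384.0
n, xbar, Sxx = 25, 70.0, 19800.0
xh, alpha = 65.0, 0.10

yhat = b0 + b1 * xh                            # point estimate, ~294.4
t = stats.t.ppf(1 - alpha / 2, df=n - 2)       # t(0.95; 23) = 1.714
lev = 1 / n + (xh - xbar) ** 2 / Sxx

se_mean = np.sqrt(mse * lev)                   # SE of the mean response
se_new = np.sqrt(mse * (1 + lev))              # SE of a new response: extra "1 +"
print(yhat - t * se_mean, yhat + t * se_mean)  # ~ (277.4, 311.4)
print(yhat - t * se_new, yhat + t * se_new)    # wider prediction interval
```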

Theorem 5: Under the Normal Error Regression Model

$$Y = \beta_0 + \beta_1 x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$

the residuals $\{e_i\}$ are a sample with mean zero and $\hat{\sigma}^2 = MSE$:

$$\frac{SSE}{\sigma^2} \text{ is distributed as } \chi^2 \text{ with } df = n-2, \qquad E(MSE) = \sigma^2$$

THE TEST FOR INDEPENDENCE

The Mean Response: $E(Y \mid X = x) = \beta_0 + \beta_1 x$; under $H_0$: $\beta_1 = 0$, the mean response does not depend on $x$.

$H_0$: $\beta_1 = 0$; "t" test at $(n-2)$ degrees of freedom:

$$t = \frac{b_1}{s(b_1)}$$

which is identical to the test using "r":

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

The method we use most often is this "Test for Independence", which we now approach in a different way: ANOVA.

COMPONENTS OF VARIATION
• The variation in Y is conventionally measured in terms of the deviations $(Y_i - \bar{Y})$; the total variation, denoted by SST, is the sum of squared deviations: $SST = \sum (Y_i - \bar{Y})^2$. For example, SST = 0 when all observations are the same; SST is the numerator of the sample variance of Y: the greater the SST, the greater the variation among the Y-values.

• In regression analysis, the variation in Y is decomposed into two components: $(Y_i - \bar{Y}) = (Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{Y})$

DECOMPOSITION OF SST
• In the decomposition: $(Y_i - \bar{Y}) = (Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{Y})$
• The first term (RHS) reflects the variation around the regression line, the part that cannot be explained by the regression itself, with the sum of squared errors $SSE = \sum (Y_i - \hat{Y}_i)^2$.
• The difference between the above two sums of squares, $SSR = SST - SSE = \sum (\hat{Y}_i - \bar{Y})^2$, is called the regression sum of squares; SSR may be considered a measure of the variation in Y associated with or explained by the regression line.

Regression helps to "improve" the estimate of Y: from $\bar{Y}$ (without any information) to $\hat{Y}$ (with information provided by knowing X).

$$SST = \sum (Y_i - \bar{Y})^2 = \sum [(Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{Y})]^2$$

$$= \sum (Y_i - \hat{Y}_i)^2 + \sum (\hat{Y}_i - \bar{Y})^2 + 2\sum (Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y}) = SSE + SSR + 2\sum e_i (\hat{Y}_i - \bar{Y})$$

The cross-product term vanishes because $\sum e_i = 0$ and $\sum e_i \hat{Y}_i = 0$, so:

$$SST = SSE + SSR$$
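The decomposition is easy to check numerically; a Python sketch (our own, with made-up numbers):

```python
import numpy as np

# Hypothetical data, used only to verify SST = SSE + SSR.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.7, 10.3, 11.8])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)     # total variation
sse = np.sum((y - yhat) ** 2)         # unexplained by the regression
ssr = np.sum((yhat - y.mean()) ** 2)  # explained by the regression
assert np.isclose(sst, sse + ssr)     # the cross-product term vanishes
print(ssr / sst)                      # r^2, the coefficient of determination
```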

ANALYSIS OF VARIANCE
• SST measures the "total variation" in the sample (of values of the dependent variable) with (n-1) degrees of freedom, where n is the sample size. It is decomposed into: SST = SSE + SSR.
• (1) SSE measures the variation that cannot be explained by the regression, with (n-2) degrees of freedom, and
• (2) SSR measures the variation in Y associated with or explained by the regression line, with 1 degree of freedom (representing the slope).

$$SSR = \sum (\hat{Y}_i - \bar{Y})^2 = \sum [(b_0 + b_1 x_i) - (b_0 + b_1 \bar{x})]^2 = b_1^2 \sum (x_i - \bar{x})^2$$

since $\hat{Y}_i = b_0 + b_1 x_i$ and $\bar{Y} = b_0 + b_1 \bar{x}$. Using $E(b_1^2) = \operatorname{Var}(b_1) + \{E(b_1)\}^2$:

$$E(MSR) = E(SSR) = \sigma^2 + \beta_1^2 \sum (x_i - \bar{x})^2$$

$$\operatorname{Var}(X) = E(X^2) - \{E(X)\}^2 \quad\Longleftrightarrow\quad E(X^2) = \operatorname{Var}(X) + \{E(X)\}^2$$

"ANOVA" TABLE
• The breakdowns of the total sum of squares and its associated degrees of freedom are displayed in the form of an "analysis of variance table" (ANOVA table) for regression analysis as follows:

Source of Variation   SS    df    MS    F Statistic   p-value
Regression            SSR   1     MSR   MSR/MSE
Error                 SSE   n-2   MSE
Total                 SST   n-1

• Recall: MSE, the “error mean square”, serves as an estimate of the constant variance σ2 as stipulated by the regression model.

$$E(MSE) = \sigma^2, \qquad E(MSR) = \sigma^2 + \beta_1^2 \sum (x_i - \bar{x})^2$$

Under the Null Hypothesis H0: β1 = 0, E(MSE) = E(MSR) so that F=MSR/MSE is expected to be near 1.0

Theorem 6: F is distributed, under H0, as F(1,n-2) following a theorem by Cochran.

THE F-TEST

The test statistic F for the above analysis of variance approach compares MSR and MSE; a value near 1 supports the null hypothesis of independence. In fact, we have F = t², where t is the test statistic for testing whether or not β1 = 0; the F-test is equivalent to the two-sided t-test when referred to the F-table in Appendix B (Table B.4) with (1, n-2) degrees of freedom.
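The F = t² equivalence is easy to see numerically; a minimal Python sketch (our own, assuming scipy, with made-up data):

```python
import numpy as np
from scipy import stats

# Hypothetical data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1, 8.7])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x

ssr = np.sum((yhat - y.mean()) ** 2)   # 1 df
sse = np.sum((y - yhat) ** 2)          # n-2 df
msr, mse = ssr, sse / (n - 2)

F = msr / mse
t = b1 / np.sqrt(mse / Sxx)            # t statistic for H0: beta1 = 0
assert np.isclose(F, t ** 2)           # F-test = two-sided t-test
print(F, stats.f.sf(F, 1, n - 2))      # F and its p-value from F(1, n-2)
```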

THE TEST FOR INDEPENDENCE

The Null Hypothesis: $H_0$: $\beta_1 = 0$. Two identical choices:

(1) "t" test at $(n-2)$ degrees of freedom:

$$t = \frac{b_1}{s(b_1)}$$

which is identical to the test using "r":

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

(2) "F" test at $(1, n-2)$ degrees of freedom:

$$F = \frac{MSR}{MSE}$$

COEFFICIENT OF DETERMINATION
• We can express the coefficient of determination (the square of the coefficient of correlation r) as:

$$r^2 = \frac{SSR}{SST}$$

• That is the portion of total variation attributable to regression; regression helps to "improve" the estimate of Y from $\bar{Y}$ (without any information) to $\hat{Y}$ (with information provided by knowing X), reducing the total variation by $(100)(r^2)\%$.

EXAMPLE #1: Birth Weight Data

SUMMARY OUTPUT

Regression Statistics
R Square       0.89546
Observations   12

ANOVA
             df    SS      MS      F       Significance F
Regression    1    6508    6508    85.66   3.21622E-06
Residual     10    759.8   75.98
Total        11    7268

x (oz)   y (%)
112       63
111       66
107       72
119       52
 92       75
 80      118
 81      120
 84      114
118       42
106       72
103       90
 94       91

EXAMPLE #2: AGE & SBP

Age (x)   SBP (y)
42        130
46        115
42        148
71        100
80        156
74        162
70        151
80        156
85        162
72        158
64        155
81        160
41        125
61        150
75        165

SUMMARY OUTPUT

Regression Statistics
R Square       0.3183
Observations   15

ANOVA
             df    SS     MS     F       Significance F
Regression    1    1691   1691   6.071   0.028453563
Residual     13    3621   278.6
Total        14    5312

EXAMPLE #3: Toluca Company Data

LotSize   WorkHours
 80       399
 30       121
 50       221
 90       376
 70       361
 60       224
120       546
 80       352
100       353
 50       157
 40       160
 70       252
 90       389
 20       113
110       435
100       420
 30       212
 50       268
 90       377
110       421
 30       273
 90       468
 40       244
 80       342
 70       323


Normal Error Regression Model:

$$Y = \beta_0 + \beta_1 x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$

The normal regression model assumes that the X values are known constants; we do not impose any distribution on the x-values.

In many cases, this is not true; for example, if we study the relationship between "height of a person" and "weight of a person", a sample of persons is taken but both measurements are random. Rather than a regression model, one should consider a "correlation model"; the most widely used is the "Bivariate Normal Distribution" with density:

$$f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \left(\frac{x-\mu_x}{\sigma_x}\right)^2 - 2\rho\left(\frac{x-\mu_x}{\sigma_x}\right)\left(\frac{y-\mu_y}{\sigma_y}\right) + \left(\frac{y-\mu_y}{\sigma_y}\right)^2 \right] \right\}$$

$$\sigma_{xy} = \operatorname{Cov}(X, Y) = E[(X-\mu_x)(Y-\mu_y)], \qquad \rho = \frac{\sigma_{xy}}{\sigma_x\sigma_y}$$

σxy is the Covariance and ρ is the Coefficient of Correlation between the two random variables X and Y; ρ is estimated by the (sample) Coefficient of Correlation r.

CORRELATION MODEL "Correlation Data" are often cross-sectional or observational. Instead of a regression model, one should consider a "correlation model"; the most widely used is the "Bivariate Normal Distribution" with the density shown above.

The Coefficient of Correlation ρ between the two random variables X and Y is estimated by the (sample) Coefficient of Correlation r, but the sampling distribution of r is far from normal. Confidence intervals for ρ are obtained by first applying "Fisher's z transformation"; the distribution of z is approximately normal if the sample size is not too small.
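A minimal Python sketch of this route to a confidence interval for ρ (our own, assuming numpy and scipy; z = arctanh(r) with approximate standard error 1/sqrt(n-3) is the usual form of Fisher's transformation):

```python
import numpy as np
from scipy import stats

def rho_ci(r, n, alpha=0.05):
    """Approximate (1-alpha)100% CI for rho via Fisher's z transformation."""
    z = np.arctanh(r)                  # z = 0.5 * ln((1 + r) / (1 - r))
    se = 1.0 / np.sqrt(n - 3)          # approximate standard error of z
    zc = stats.norm.ppf(1 - alpha / 2)
    return np.tanh(z - zc * se), np.tanh(z + zc * se)  # back to rho scale

# Example #2 above had r^2 = 0.3183 with n = 15, so r is about 0.564:
print(rho_ci(0.564, 15))
```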

CONDITIONAL DISTRIBUTION

Theorem: For the Bivariate Normal Distribution given above, the conditional distribution of Y for any given X = x is normal, with mean $\beta_0 + \beta_1 x$ and standard deviation $\sigma_{y|x}$, where:

$$\beta_1 = \rho\,\frac{\sigma_y}{\sigma_x}, \qquad \beta_0 = \mu_y - \beta_1\mu_x = \mu_y - \rho\,\frac{\sigma_y}{\sigma_x}\,\mu_x, \qquad \sigma^2_{y|x} = \sigma^2_y\,(1 - \rho^2)$$

Again, since Var(Y|X) = (1 - ρ²)Var(Y), ρ is both a measure of linear association and a measure of "variance reduction" (in Y, associated with knowledge of X); that is why we call r², an estimate of ρ², the "coefficient of determination".
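The variance-reduction reading can be checked by simulation; a Python sketch (our own, numpy only, with arbitrary parameter values):

```python
import numpy as np

rng = np.random.default_rng(7)
rho, sig_x, sig_y = 0.8, 2.0, 5.0   # arbitrary illustrative values
n = 200_000

# Draw (X, Y) from a bivariate normal with correlation rho.
x = rng.normal(0.0, sig_x, n)
noise = rng.normal(0.0, sig_y * np.sqrt(1 - rho ** 2), n)
y = rho * (sig_y / sig_x) * x + noise

# Var(Y | X) should equal (1 - rho^2) * Var(Y).
resid_var = np.var(y - rho * (sig_y / sig_x) * x)
print(resid_var, (1 - rho ** 2) * np.var(y))  # the two agree up to noise
```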

Readings & Exercises
• Readings: A thorough reading of the text's sections 2.4-2.5 (pp. 52-61), 2.7 (pp. 63-71), and 2.11 (pp. 78-82) is highly recommended.
• Exercises: The following exercises are good for practice, all from chapter 2 of the text: 2.13, 2.23, 2.24, 2.28, and 2.29.

Due As Homework
#9.1 Refer to dataset "Cigarettes", Y = Cotinine & X = CPD:
a) Obtain the 95% confidence interval for the mean Cotinine level for subjects who consumed X = 30 cigarettes per day and give your interpretation.
b) Obtain the 95% confidence interval for the Cotinine level of a subject who consumed 30 cigarettes per day; why is the result different from (a)?
c) Plot the residuals against X. What would be your conclusion about their possible linear relationship? What would be the average residual?
d) Set up the ANOVA table and test whether or not a linear association exists between Cotinine and CPD.

#9.2 Answer the four questions of Exercise 9.1 using dataset "Vital Capacity" with X = Age and Y = (100)(Vital Capacity); use X = 35 years for questions (a) and (b).
