PubH 7405: REGRESSION ANALYSIS SLR: INFERENCES, Part II

PubH 7405: BIOSTATISTICS: REGRESSIONchap/F09-InferencesPartII.pdf · PubH 7405: REGRESSION ANALYSIS SLR: INFERENCES, Part II . We cover the topic of inference in two sessions; the

Aug 14, 2020


dariahiddleston

PubH 7405: REGRESSION ANALYSIS

SLR: INFERENCES, Part II


We cover the topic of inference in two sessions. The first session focused on inferences concerning the slope and the intercept; this session continues with estimating the mean response, and more. Applications concerning the slope and the intercept are based on the following four (4) theorems:

SAMPLING DISTRIBUTION OF SLOPE

Under the "Normal Error Regression Model":

$$Y = \beta_0 + \beta_1 x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$

Theorem 1A: The sampling distribution of the estimated slope $b_1$ is Normal with Mean and Variance:

$$E(b_1) = \beta_1, \qquad \sigma^2(b_1) = \frac{\sigma^2}{\sum (x - \bar{x})^2}$$

IMPLICATION

$$\frac{b_1 - \beta_1}{s(b_1)} = \frac{b_1 - \beta_1}{\sigma(b_1)} \div \frac{s(b_1)}{\sigma(b_1)}$$

The first ratio, $(b_1 - \beta_1)/\sigma(b_1)$, is distributed as N(0,1); the second, $s(b_1)/\sigma(b_1)$, is distributed as $\sqrt{\chi^2/(n-2)}$ with df = n−2.

Theorem 1B: $\dfrac{b_1 - \beta_1}{s(b_1)}$ is distributed as "t" with (n−2) degrees of freedom.

CONFIDENCE INTERVALS

Theorem 1B: $\dfrac{b_1 - \beta_1}{s(b_1)}$ is distributed as "t" with (n−2) degrees of freedom.

The 100(1−α)% Confidence Interval for $\beta_1$ is:

$$b_1 \pm t(1-\alpha/2;\ n-2)\, s(b_1)$$

where $t(1-\alpha/2;\ n-2)$ is the 100(1−α/2) percentile of the "t" distribution with (n−2) degrees of freedom.
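The interval for the slope can be sketched numerically. The block below is only an illustration under stated assumptions: it uses the Age/SBP pairs listed in Example #2 of these slides, takes the table value t(0.975; 13) = 2.160 quoted there instead of computing a percentile, and estimates σ² by MSE inside the variance formula of Theorem 1A. The function name `slope_ci` is made up for this sketch.

```python
import math

# Age (x) and SBP (y) pairs as listed in Example #2 of these slides.
AGE = [42, 46, 42, 71, 80, 74, 70, 80, 85, 72, 64, 81, 41, 61, 75]
SBP = [130, 115, 148, 100, 156, 162, 151, 156, 162, 158, 155, 160, 125, 150, 165]

# t(0.975; 13): table value quoted in Example #2 (n = 15, so n - 2 = 13).
T_975_DF13 = 2.160


def slope_ci(x, y, t_crit):
    """Return (b1, s(b1), lower, upper) for the slope of a simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / ss_x
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)
    # Estimated standard error: MSE replaces sigma^2 in the variance of b1.
    s_b1 = math.sqrt(mse / ss_x)
    return b1, s_b1, b1 - t_crit * s_b1, b1 + t_crit * s_b1


b1, s_b1, lower, upper = slope_ci(AGE, SBP, T_975_DF13)
print(f"b1 = {b1:.3f}, s(b1) = {s_b1:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

The same pattern applies to the intercept, with s(b0) taken from the variance formula of Theorem 2A.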

SAMPLING DISTRIBUTION OF INTERCEPT

Under the "Normal Error Regression Model":

$$Y = \beta_0 + \beta_1 x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$

Theorem 2A: The sampling distribution of the estimated intercept $b_0$ is Normal with Mean and Variance:

$$E(b_0) = \beta_0, \qquad \sigma^2(b_0) = \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{\sum (x - \bar{x})^2}\right]$$

IMPLICATION

$$\frac{b_0 - \beta_0}{s(b_0)} = \frac{b_0 - \beta_0}{\sigma(b_0)} \div \frac{s(b_0)}{\sigma(b_0)}$$

The first ratio, $(b_0 - \beta_0)/\sigma(b_0)$, is distributed as N(0,1); the second, $s(b_0)/\sigma(b_0)$, is distributed as $\sqrt{\chi^2/(n-2)}$ with df = n−2.

Theorem 2B: $\dfrac{b_0 - \beta_0}{s(b_0)}$ is distributed as "t" with (n−2) degrees of freedom.

CONFIDENCE INTERVALS

Theorem 2B: $\dfrac{b_0 - \beta_0}{s(b_0)}$ is distributed as "t" with (n−2) degrees of freedom.

The 100(1−α)% Confidence Interval for $\beta_0$ is:

$$b_0 \pm t(1-\alpha/2;\ n-2)\, s(b_0)$$

where $t(1-\alpha/2;\ n-2)$ is the 100(1−α/2) percentile of the "t" distribution with (n−2) degrees of freedom.

The Mean Response:

$$E(Y \mid X = x) = \beta_0 + \beta_1 x$$

A common objective in regression analysis is to estimate the mean response. For example: (1) we may want to know the average blood pressure for women at a certain age, and how to estimate it using the relationship between SBP and Age; and (2) in a study of the relationship between level of pay (salary, X) and worker productivity (Y), the mean productivity at high, medium, and low levels of pay may be of particular interest to a company.

POINT ESTIMATE

The Mean Response:

$$E(Y \mid X = x) = \beta_0 + \beta_1 x$$

Let X = $x_h$ denote the level of X for which we wish to estimate the mean response, i.e. $E(Y \mid X = x_h)$; this $x_h$ may be a value which occurred in the sample, or it may be some other value of the predictor variable within the scope of the model. The point estimate of the mean response is:

Point Estimate:

$$\hat{E}(Y \mid X = x_h) = \hat{Y}_h = b_0 + b_1 x_h$$

SAMPLING DISTRIBUTION

Under the "Normal Error Regression Model":

$$Y = \beta_0 + \beta_1 x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$

Theorem #3A: The sampling distribution of the estimated Mean Response $\hat{Y}_h$ is Normal with Mean and Variance:

$$E(\hat{Y}_h) = E(Y \mid X = x_h) = \beta_0 + \beta_1 x_h$$

$$\sigma^2(\hat{Y}_h) = \sigma^2\left[\frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum (x - \bar{x})^2}\right]$$

The estimates are linear combinations of the observations $y_i$:

$$b_0 = \sum \left(\frac{1}{n} - k_i \bar{x}\right) y_i$$

$$b_1 = \sum k_i y_i$$

$$\hat{Y}_h = b_0 + b_1 x_h = \sum \left[\frac{1}{n} + (x_h - \bar{x}) k_i\right] y_i$$

where $k_i = (x_i - \bar{x}) / \sum (x - \bar{x})^2$.

The sampling distribution of Ŷh is “normal” because the estimated mean response, like the intercept and the slope, is a linear combination of the observations yi, and the distribution of each observation is normal under the “normal error regression model”.

The estimated mean response is unbiased because the estimated intercept and estimated slope are both unbiased:

$$\hat{Y}_h = b_0 + b_1 x_h$$

$$E(\hat{Y}_h) = E(b_0) + x_h E(b_1) = \beta_0 + \beta_1 x_h = E(Y \mid X = x_h)$$

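The variance follows from the linear-combination form, using $\sum k_i = 0$ and $\sum k_i^2 = 1/\sum (x - \bar{x})^2$:

$$\hat{Y}_h = \sum \left[\frac{1}{n} + (x_h - \bar{x}) k_i\right] y_i$$

$$Var(\hat{Y}_h) = \sigma^2 \sum \left[\frac{1}{n} + (x_h - \bar{x}) k_i\right]^2$$

$$= \sigma^2 \sum \left[\frac{1}{n^2} + \frac{2}{n}(x_h - \bar{x}) k_i + (x_h - \bar{x})^2 k_i^2\right]$$

$$= \sigma^2 \left[\frac{1}{n} + \frac{2}{n}(x_h - \bar{x}) \sum k_i + (x_h - \bar{x})^2 \sum k_i^2\right]$$

$$= \sigma^2 \left[\frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum (x - \bar{x})^2}\right]$$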

$$Var(\hat{Y}_h) = \sigma^2 \left[\frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum (x_i - \bar{x})^2}\right]$$

$$s^2(\hat{Y}_h) = MSE \left[\frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum (x_i - \bar{x})^2}\right]$$

Taking the square root gives the Standard Error.

$$s^2(\hat{Y}_h) = MSE \left[\frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum (x_i - \bar{x})^2}\right]$$

$$SE(\hat{Y}_h) = \sqrt{MSE \left[\frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum (x_i - \bar{x})^2}\right]}$$

Implication:

Our estimates are less precise toward the ends of the range of X, where $(x_h - \bar{x})^2$ is large.
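To see how the standard error grows as $x_h$ moves away from $\bar{x}$, the sketch below evaluates the formula on a small grid. It assumes the summary statistics quoted for Example #2 of these slides (n = 15, MSE = 278.554, mean of X = 65.6, SS of X = 3403.6); the grid of ages and the name `se_mean_response` are hypothetical choices made only for illustration.

```python
import math

# Summary statistics quoted in Example #2 (Age and SBP) of these slides.
N = 15
MSE = 278.554
X_BAR = 65.6
SS_X = 3403.6


def se_mean_response(x_h):
    """Standard error of the estimated mean response at X = x_h."""
    return math.sqrt(MSE * (1 / N + (x_h - X_BAR) ** 2 / SS_X))


# Hypothetical grid of ages, chosen only to show the pattern.
for x_h in (45, 55, 65.6, 75, 85):
    print(f"x_h = {x_h:>5}: SE = {se_mean_response(x_h):.2f}")
```

The smallest value occurs at $x_h = \bar{x}$, where the second term in the bracket vanishes and the standard error reduces to $\sqrt{MSE/n}$.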

MORE ON SAMPLING DISTRIBUTION

$$\frac{\hat{Y}_h - E(\hat{Y}_h)}{s(\hat{Y}_h)} = \frac{\hat{Y}_h - E(\hat{Y}_h)}{\sigma(\hat{Y}_h)} \div \frac{s(\hat{Y}_h)}{\sigma(\hat{Y}_h)}$$

The first ratio is distributed as N(0,1); the second, $s(\hat{Y}_h)/\sigma(\hat{Y}_h)$, is distributed as $\sqrt{\chi^2/(n-2)}$ with df = n−2.

Theorem #3B: $\dfrac{\hat{Y}_h - E(\hat{Y}_h)}{s(\hat{Y}_h)}$ is distributed as "t" with (n−2) degrees of freedom.

CONFIDENCE INTERVALS

Theorem #3B: $\dfrac{\hat{Y}_h - E(\hat{Y}_h)}{s(\hat{Y}_h)}$ is distributed as "t" with (n−2) degrees of freedom.

The 100(1−α)% Confidence Interval for the mean response $E(Y \mid X = x_h)$ is:

$$\hat{Y}_h \pm t(1-\alpha/2;\ n-2)\, s(\hat{Y}_h)$$

where $t(1-\alpha/2;\ n-2)$ is the 100(1−α/2) percentile of the "t" distribution with (n−2) degrees of freedom.

EXAMPLE #1: Birth weight data

x (oz)   y (%)
112      63
111      66
107      72
119      52
92       75
80       118
81       120
84       114
118      42
106      72
103      90
94       91

Intercept = 255.972; Slope = -1.737; MSE = 75.982; Mean of X = 100.58; SS of X = 2,156.913

For children with birth weight of $x_h$ = 95 ounces, the point estimate and 95% Confidence Interval for the mean growth between 70-100 days, as % of BW, are obtained with t(0.975; 10) = 2.228:

$$\hat{Y}_h = 255.972 + (-1.737)(95) = 90.96\%$$

$$s^2(\hat{Y}_h) = (75.982)\left[\frac{1}{12} + \frac{(95 - 100.58)^2}{2{,}156.913}\right] = 7.429$$

$$90.96 \pm (2.228)\sqrt{7.43} = (84.89\%,\ 97.03\%)$$

EXAMPLE #2: Age and SBP

Age (x)   SBP (y)
42        130
46        115
42        148
71        100
80        156
74        162
70        151
80        156
85        162
72        158
64        155
81        160
41        125
61        150
75        165

Intercept = 99.958; Slope = .705; MSE = 278.554; Mean of X = 65.6; SS of X = 3403.6

For women who are $x_h$ = 60 years old, the point estimate and 95% Confidence Interval for the mean SBP are obtained with t(0.975; 13) = 2.160:

$$\hat{Y}_h = 99.958 + (.705)(60) = 142.26$$

$$s^2(\hat{Y}_h) = (278.554)\left[\frac{1}{15} + \frac{(60 - 65.6)^2}{3403.6}\right] = 21.137$$

$$142.3 \pm (2.160)\sqrt{21.137} = (132.4,\ 152.2)$$

EXAMPLE #3: Toluca Company Data

LotSize   WorkHours
80        399
30        121
50        221
90        376
70        361
60        224
120       546
80        352
100       353
50        157
40        160
70        252
90        389
20        113
110       435
100       420
30        212
50        268
90        377
110       421
30        273
90        468
40        244
80        342
70        323

Intercept = 62.366; Slope = 3.570; MSE = 2,384; Mean of X = 70.0; SS of X = 19,800

For a lot size of $x_h$ = 65 units, the point estimate and 90% Confidence Interval for the mean Work Hours are obtained with t(0.95; 23) = 1.714:

$$\hat{Y}_h = 62.37 + (3.57)(65) = 294.4$$

$$s^2(\hat{Y}_h) = (2{,}384)\left[\frac{1}{25} + \frac{(65 - 70.0)^2}{19{,}800}\right] = 98.37$$

$$294.4 \pm (1.714)\sqrt{98.37} = (277.4,\ 311.4)$$

In regression analysis, besides estimating the mean response, one may sometimes want to estimate a new individual response. For example: (1) in addition to estimating the average blood pressure for women at a certain age using the relationship between SBP and Age, we may be interested in estimating the SBP of a particular woman/patient at that age; and (2) in a study of the relationship between pay (salary, X) and worker productivity (Y), the interest may focus on the productivity of a particular worker.

POINT ESTIMATE

Let X = $x_h$ denote the level of X under investigation, at which the mean response is $E(Y \mid X = x_h)$. Let $Y_{h(new)}$ be the value of the new individual response of interest. This new observation of Y to be predicted is often viewed as the result of a new trial, independent of the trials on which the regression line is based. The point estimate is still the same as that of the mean response:

$$E(Y \mid X = x_h) = \beta_0 + \beta_1 x_h$$

$$\hat{Y}_h = b_0 + b_1 x_h$$

$$\hat{Y}_{h(new)} = \hat{Y}_h \quad \text{(same as the mean)}$$

VARIANCE

The point estimates of the mean response and of an individual response are the same, but the variances are different. In estimating an individual response, there are two layers of variation: (a) variation in the “position of the distribution” (that is, of the mean response), and (b) the variation within that distribution (that is, from the individual response to the mean response).

Theorem #4A: Under the "Normal Error Regression Model", the sampling distribution of $\hat{Y}_{h(new)}$ is normal, with variance:

$$Var(\hat{Y}_{h(new)}) = Var(Y_{h(new)}) + Var(\hat{Y}_h)$$

$$= \sigma^2 + \sigma^2\left[\frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum (x_i - \bar{x})^2}\right]$$

$$= \sigma^2\left[1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum (x_i - \bar{x})^2}\right]$$

$$Var(\hat{Y}_{h(new)}) = \sigma^2\left[1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum (x_i - \bar{x})^2}\right]$$

$$s^2(\hat{Y}_{h(new)}) = MSE\left[1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum (x_i - \bar{x})^2}\right]$$

Taking the square root gives the Standard Error.

MORE ON SAMPLING DISTRIBUTION

Inferences on a new individual response are based on the following result:

Theorem #4B: $\dfrac{Y_{h(new)} - \hat{Y}_h}{s(\hat{Y}_{h(new)})}$ is distributed as "t" with (n−2) degrees of freedom.

$$s^2(\hat{Y}_{h(new)}) = MSE\left[1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum (x_i - \bar{x})^2}\right]$$

$$SE(\hat{Y}_{h(new)}) = \sqrt{MSE\left[1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum (x_i - \bar{x})^2}\right]}$$

Again:

Our estimates are less precise toward the ends of the range of X.
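The difference between the two standard errors can be made concrete with a small sketch. It assumes the summary statistics and the table value t(0.975; 13) = 2.160 quoted for Example #2 (Age and SBP) of these slides; the function name `intervals` and the variable names are hypothetical, introduced only for this illustration.

```python
import math

# Summary statistics quoted in Example #2 (Age and SBP) of these slides.
N, B0, B1 = 15, 99.958, 0.705
MSE, X_BAR, SS_X = 278.554, 65.6, 3403.6
T_975_DF13 = 2.160  # t(0.975; 13), as used in Example #2


def intervals(x_h, t_crit=T_975_DF13):
    """Return (y_hat, interval for the mean response, interval for a new response)."""
    y_hat = B0 + B1 * x_h
    shared_term = 1 / N + (x_h - X_BAR) ** 2 / SS_X
    se_mean = math.sqrt(MSE * shared_term)       # s(Y-hat_h)
    se_new = math.sqrt(MSE * (1 + shared_term))  # s(Y-hat_h(new)): one extra MSE
    ci_mean = (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)
    ci_new = (y_hat - t_crit * se_new, y_hat + t_crit * se_new)
    return y_hat, ci_mean, ci_new


y_hat, ci_mean, ci_new = intervals(60)
print(f"point estimate: {y_hat:.2f}")
print(f"95% interval, mean response: ({ci_mean[0]:.1f}, {ci_mean[1]:.1f})")
print(f"95% interval, new response:  ({ci_new[0]:.1f}, {ci_new[1]:.1f})")
```

Both intervals share the same center; the one for a new individual response is wider because of the additional "1" inside the bracket, which carries the variation of an individual around the mean.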

Normal Error Regression Model:

$$Y = \beta_0 + \beta_1 x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$

$\{e_i\}$ is a sample with mean zero:

$$\hat{\sigma}^2 = MSE$$

Theorem #5: $\dfrac{SSE}{\sigma^2}$ is distributed as $\chi^2$ with df = n−2, and

$$E(MSE) = \sigma^2$$

THE TEST FOR INDEPENDENCE

The Mean Response:

$$E(Y \mid X = x) = \beta_0 + \beta_1 x$$

$H_0: \beta_1 = 0$; "t" test at (n−2) degrees of freedom:

$$t = \frac{b_1}{s(b_1)}$$

which is identical to the test using "r":

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

The method we use most often is this “Test for Independence”, which we now approach in a different way: ANOVA.

COMPONENTS OF VARIATION

• The variation in Y is conventionally measured in terms of the deviations (Yi − Ȳ); the total variation, denoted by SST, is the sum of squared deviations: SST = Σ(Yi − Ȳ)². For example, SST = 0 when all observations are the same; SST is the numerator of the sample variance of Y, so the greater SST, the greater the variation among Y-values.

• In regression analysis, the variation in Y is decomposed into two components: (Yi − Ȳ) = (Yi − Ŷi) + (Ŷi − Ȳ)

DECOMPOSITION OF SST

• In the decomposition: (Yi − Ȳ) = (Yi − Ŷi) + (Ŷi − Ȳ)

• The first term (RHS) reflects the variation around the regression line, the part that cannot be explained by the regression itself, with the sum of squared errors SSE = Σ(Yi − Ŷi)².

• The difference between the above two sums of squares, SSR = SST − SSE = Σ(Ŷi − Ȳ)², is called the regression sum of squares; SSR may be considered as a measure of the variation in Y associated with or explained by the regression line.

Regression helps to “improve” the estimate of Y from Ȳ (without any information) to Ŷ (with information provided by knowing X).

$$SST = \sum (Y_i - \bar{Y})^2$$

$$= \sum \left[(Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{Y})\right]^2$$

$$= \sum (Y_i - \hat{Y}_i)^2 + \sum (\hat{Y}_i - \bar{Y})^2 + 2\sum (Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y})$$

$$= SSE + SSR + 2\sum e_i \hat{Y}_i - 2\bar{Y}\sum e_i$$

$$= SSE + SSR$$

since both cross-product sums, $\sum e_i \hat{Y}_i$ and $\sum e_i$, are equal to 0.
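A quick numerical check of the decomposition may help. The sketch below fits the least-squares line to the Age/SBP pairs listed in Example #2 of these slides and computes SST, SSE, SSR, and the cross-product term directly from their definitions; the function name `decompose` is hypothetical.

```python
# Age (x) and SBP (y) pairs as listed in Example #2 of these slides.
AGE = [42, 46, 42, 71, 80, 74, 70, 80, 85, 72, 64, 81, 41, 61, 75]
SBP = [130, 115, 148, 100, 156, 162, 151, 156, 162, 158, 155, 160, 125, 150, 165]


def decompose(x, y):
    """Return (SST, SSE, SSR, cross) for the least-squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_x
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ssr = sum((fi - y_bar) ** 2 for fi in fitted)
    # Cross-product term 2 * sum(e_i * (Y-hat_i - Y-bar)); it should vanish.
    cross = 2 * sum((yi - fi) * (fi - y_bar) for yi, fi in zip(y, fitted))
    return sst, sse, ssr, cross


sst, sse, ssr, cross = decompose(AGE, SBP)
print(f"SST = {sst:.1f}, SSE = {sse:.1f}, SSR = {ssr:.1f}, cross term = {cross:.2e}")
```

Up to floating-point rounding, the cross term is zero and SST equals SSE + SSR.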

ANALYSIS OF VARIANCE

• SST measures the “total variation” in the sample (of values of the dependent variable) with (n−1) degrees of freedom, where n is the sample size. It is decomposed into: SST = SSE + SSR

• (1) SSE measures the variation that cannot be explained by the regression, with (n−2) degrees of freedom, and

• (2) SSR measures the variation in Y associated with or explained by the regression line, with 1 degree of freedom (representing the slope).

$$SSR = \sum (b_0 + b_1 x - \bar{y})^2 = \sum \left[(\bar{y} - b_1 \bar{x}) + b_1 x - \bar{y}\right]^2 = b_1^2 \sum (x - \bar{x})^2$$

$$E(MSR) = E(SSR) = \left[\sigma^2(b_1) + \beta_1^2\right] \sum (x - \bar{x})^2 = \left[\frac{\sigma^2}{\sum (x - \bar{x})^2} + \beta_1^2\right] \sum (x - \bar{x})^2 = \sigma^2 + \beta_1^2 \sum (x - \bar{x})^2$$

The second step uses the identity, applied to $b_1$:

Var(X) = E(X²) − {E(X)}²

E(X²) = Var(X) + {E(X)}²

“ANOVA” TABLE

• The breakdowns of the total sum of squares and its associated degrees of freedom are displayed in the form of an “analysis of variance table” (ANOVA table) for regression analysis as follows:

Source of Variation   SS    df    MS    F Statistic   p-value
Regression            SSR   1     MSR   MSR/MSE
Error                 SSE   n-2   MSE
Total                 SST   n-1

• Recall: MSE, the “error mean square”, serves as an estimate of the constant variance σ² as stipulated by the regression model.

$$E(MSE) = \sigma^2$$

$$E(MSR) = \sigma^2 + \beta_1^2 \sum (x - \bar{x})^2$$

Under the Null Hypothesis H0: β1 = 0, E(MSE) = E(MSR), so that F = MSR/MSE is expected to be near 1.0.

Theorem 6: F is distributed, under H0, as F(1, n−2), following a theorem by Cochran.

THE F-TEST

The test statistic F for the above analysis of variance approach compares MSR and MSE; a value near 1 supports the null hypothesis of independence, while a large value points to β1 ≠ 0. In fact, we have F = t², where t is the test statistic for testing whether or not β1 = 0; the F-test is equivalent to the two-sided t-test when referred to the F-table in Appendix B (Table B.4) with (1, n−2) degrees of freedom.
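The identity F = t² can be checked numerically. The sketch below builds the ANOVA quantities for the Age/SBP pairs listed in Example #2 of these slides and compares MSR/MSE with the square of b1/s(b1); the function name `anova` and the dictionary keys are hypothetical choices made for this illustration.

```python
import math

# Age (x) and SBP (y) pairs as listed in Example #2 of these slides.
AGE = [42, 46, 42, 71, 80, 74, 70, 80, 85, 72, 64, 81, 41, 61, 75]
SBP = [130, 115, 148, 100, 156, 162, 151, 156, 162, 158, 155, 160, 125, 150, 165]


def anova(x, y):
    """Return the ANOVA quantities and the slope t statistic as a dict."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_x
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    ssr = sum((fi - y_bar) ** 2 for fi in fitted)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    msr = ssr / 1          # regression: 1 degree of freedom
    mse = sse / (n - 2)    # error: n - 2 degrees of freedom
    t = b1 / math.sqrt(mse / ss_x)
    return {"SSR": ssr, "SSE": sse, "SST": ssr + sse, "df_error": n - 2,
            "MSR": msr, "MSE": mse, "F": msr / mse, "t": t}


table = anova(AGE, SBP)
print(f"F = {table['F']:.3f}, t = {table['t']:.3f}, t^2 = {table['t'] ** 2:.3f}")
```

The p-value is left out on purpose: it requires the F(1, n−2) distribution, which the Python standard library does not provide, so in practice one would read it from a table or a statistics package.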

THE TEST FOR INDEPENDENCE

The Null Hypothesis: $H_0: \beta_1 = 0$

Two identical choices:

(1) "t" test at (n−2) degrees of freedom:

$$t = \frac{b_1}{s(b_1)}$$

which is identical to the test using "r":

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

(2) "F" test at (1, n−2) degrees of freedom:

$$F = \frac{MSR}{MSE}$$

COEFFICIENT OF DETERMINATION

• We can express the coefficient of determination (the square of the coefficient of correlation r) as:

$$r^2 = \frac{SSR}{SST}$$

• That is the portion of total variation attributable to regression; regression helps to “improve” the estimate of Y from Ȳ (without any information) to Ŷ (with information provided by knowing X), reducing the total variation by (100)(r²)%.
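As an illustration, the sketch below computes the sample correlation r for the Age/SBP pairs listed in Example #2 of these slides, checks that r² agrees with SSR/SST, and forms the t statistic from r. The helper name `correlation_summary` is hypothetical.

```python
import math

# Age (x) and SBP (y) pairs as listed in Example #2 of these slides.
AGE = [42, 46, 42, 71, 80, 74, 70, 80, 85, 72, 64, 81, 41, 61, 75]
SBP = [130, 115, 148, 100, 156, 162, 151, 156, 162, 158, 155, 160, 125, 150, 165]


def correlation_summary(x, y):
    """Return (r, SSR/SST) computed from the raw pairs."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    sst = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    r = s_xy / math.sqrt(ss_x * sst)
    b1 = s_xy / ss_x
    b0 = y_bar - b1 * x_bar
    ssr = sum((b0 + b1 * xi - y_bar) ** 2 for xi in x)
    return r, ssr / sst


r, ssr_over_sst = correlation_summary(AGE, SBP)
n = len(AGE)
# t statistic for H0: beta1 = 0, written in terms of r.
t_from_r = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(f"r = {r:.4f}, r^2 = {r ** 2:.4f}, SSR/SST = {ssr_over_sst:.4f}, t = {t_from_r:.3f}")
```

The two routes to r² agree up to rounding, and squaring the t statistic recovers the F statistic of the ANOVA table.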

EXAMPLE #1: Birth Weight Data

SUMMARY OUTPUT

Regression Statistics
R Square       0.89546
Observations   12

ANOVA
             df   SS      MS      F       Significance F
Regression   1    6508    6508    85.66   3.21622E-06
Residual     10   759.8   75.98
Total        11   7268

EXAMPLE #2: AGE & SBP

SUMMARY OUTPUT

Regression Statistics
R Square       0.3183
Observations   15

ANOVA
             df   SS     MS      F       Significance F
Regression   1    1691   1691    6.071   0.028453563
Residual     13   3621   278.6
Total        14   5312

EXAMPLE #3: Toluca Company Data

SUMMARY OUTPUT

Regression Statistics
R Square       0.8215
Observations   25

ANOVA
             df   SS        MS        F       Significance F
Regression   1    252378    252378    105.9   4.45E-10
Residual     23   54825     2384
Total        24   307203

Normal Error Regression Model:

$$Y = \beta_0 + \beta_1 x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$

The normal regression model assumes that the X values are known constants. We do not impose any kind of distribution on the x-values.

In many cases, the X values are not known constants; for example, if we study the relationship between “height of a person” and “weight of a person”, a sample of persons is taken but both measurements are random. Rather than a regression model, one should consider a “correlation model”; the most widely used is the “Bivariate Normal Distribution”.

CORRELATION MODEL

“Correlation Data” are often cross-sectional or observational. Instead of a regression model, one should consider a “correlation model”; the most widely used is the “Bivariate Normal Distribution” with density:

$$f(X, Y) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1 - \rho^2}} \exp\left\{ -\frac{1}{2(1 - \rho^2)} \left[ \left(\frac{X - \mu_x}{\sigma_x}\right)^2 - 2\rho \left(\frac{X - \mu_x}{\sigma_x}\right)\left(\frac{Y - \mu_y}{\sigma_y}\right) + \left(\frac{Y - \mu_y}{\sigma_y}\right)^2 \right] \right\}$$

$$\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}, \qquad \sigma_{xy} = Cov(X, Y) = E[(X - \mu_x)(Y - \mu_y)]$$

σxy is the Covariance and ρ is the Coefficient of Correlation between the two random variables X and Y; ρ is estimated by the (sample) Coefficient of Correlation r.

The Coefficient of Correlation ρ between the two random variables X and Y is estimated by the (sample) Coefficient of Correlation r, but the sampling distribution of r is far from being normal. Confidence intervals for ρ are obtained by first making “Fisher’s z transformation”; the distribution of z is approximately normal if the sample size is not too small.
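The slides name the transformation without writing it out. The sketch below assumes its usual textbook form, z = ½ ln[(1 + r)/(1 − r)] with approximate standard error 1/√(n − 3), builds the interval on the z scale, and transforms the limits back. The function name `fisher_ci` is hypothetical, and the inputs (R Square = 0.3183, n = 15, positive slope) are taken from the Example #2 output in these slides; with n = 15 the normal approximation is only rough.

```python
import math
from statistics import NormalDist


def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for rho via Fisher's z transformation."""
    z = math.atanh(r)          # same as 0.5 * log((1 + r) / (1 - r))
    se = 1 / math.sqrt(n - 3)  # usual large-sample standard error of z
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    # Back-transform the limits from the z scale to the correlation scale.
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# r for Example #2: positive square root of R Square, since the slope is positive.
R = math.sqrt(0.3183)
lower, upper = fisher_ci(R, 15)
print(f"r = {R:.3f}, approximate 95% CI for rho = ({lower:.3f}, {upper:.3f})")
```

Because the back-transformation is nonlinear, the resulting interval is not symmetric around r, which reflects the skewness of the sampling distribution of r.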

CONDITIONAL DISTRIBUTION

$$f(X, Y) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1 - \rho^2}} \exp\left\{ -\frac{1}{2(1 - \rho^2)} \left[ \left(\frac{X - \mu_x}{\sigma_x}\right)^2 - 2\rho \left(\frac{X - \mu_x}{\sigma_x}\right)\left(\frac{Y - \mu_y}{\sigma_y}\right) + \left(\frac{Y - \mu_y}{\sigma_y}\right)^2 \right] \right\}$$

Theorem: The conditional distribution of Y for any given X = x is normal with mean $\beta_0 + \beta_1 x$ and standard deviation $\sigma_{y|x}$:

$$\beta_1 = \rho \frac{\sigma_y}{\sigma_x}, \qquad \beta_0 = \mu_y - \mu_x \rho \frac{\sigma_y}{\sigma_x}, \qquad \sigma^2_{y|x} = (1 - \rho^2)\sigma_y^2$$

Again, since Var(Y|X) = (1 − ρ²)Var(Y), ρ is both a measure of linear association and a measure of “variance reduction” (in Y associated with knowledge of X); that is why we call r², an estimate of ρ², the “coefficient of determination”.
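The link between the bivariate normal parameters and the regression quantities can be sketched in a few lines. The population values used below are hypothetical, chosen only so the arithmetic is easy to follow, and the function name `conditional_parameters` is made up for this illustration.

```python
import math


def conditional_parameters(mu_x, mu_y, sd_x, sd_y, rho):
    """Return (beta0, beta1, sd of Y given X) for a bivariate normal distribution."""
    beta1 = rho * sd_y / sd_x
    beta0 = mu_y - beta1 * mu_x
    sd_y_given_x = sd_y * math.sqrt(1 - rho ** 2)
    return beta0, beta1, sd_y_given_x


# Hypothetical population values, not taken from these slides.
SD_Y = 12.0
beta0, beta1, sd_cond = conditional_parameters(
    mu_x=170.0, mu_y=70.0, sd_x=10.0, sd_y=SD_Y, rho=0.6
)
# Share of Var(Y) that remains once X is known: 1 - rho^2.
variance_ratio = sd_cond ** 2 / SD_Y ** 2
print(f"beta0 = {beta0:.2f}, beta1 = {beta1:.2f}, sd(Y|X) = {sd_cond:.2f}")
print(f"Var(Y|X) / Var(Y) = {variance_ratio:.2f}")
```

With ρ = 0.6 in this made-up setting, knowing X removes ρ² = 36% of the variance of Y and leaves 64%.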

Readings & Exercises

• Readings: A thorough reading of the text’s sections 2.4-2.5 (pp. 52-61), 2.7 (pp. 63-71), and 2.11 (pp. 78-82) is highly recommended.

• Exercises: The following exercises are good for practice, all from chapter 2 of the text: 2.13, 2.23, 2.24, 2.28, and 2.29.

Due As Homework

#9.1 Refer to dataset “Cigarettes”, Y = Cotinine & X = CPD:

a) Obtain the 95% confidence interval for the mean Cotinine level for subjects who consumed X = 30 cigarettes per day and give your interpretation.

b) Obtain the 95% confidence interval for the Cotinine level of a subject who consumed 30 cigarettes per day; why is the result different from (a)?

c) Plot the residuals against X. What would be your conclusion about their possible linear relationship? What would be the average residual?

d) Set up the ANOVA table and test whether or not a linear association exists between Cotinine and CPD.

#9.2 Answer the 4 questions of Exercise 9.1 using dataset “Vital Capacity” with X = Age and Y = (100)(Vital Capacity); use X = 35 years for questions (a) and (b).