Biostatistics in Practice Peter D. Christenson Biostatistician http://gcrc.LABioMed.org/ Biostat Session 5: Methods for Assessing Associations

Dec 27, 2015

Page 1:

Biostatistics in Practice

Peter D. Christenson, Biostatistician

http://gcrc.LABioMed.org/Biostat

Session 5: Methods for Assessing Associations

Page 2:

Readings for Session 5 from StatisticalPractice.com

• Simple Linear Regression
  • Introduction to Simple Linear Regression
  • Transformations in Linear Regression

• Multiple Regression
  • Introduction to Multiple Regression
  • What Does Multiple Regression Look Like?
  • Which Predictors are More Important?

Also, without any reading: Correlation

Page 3:

Purpose of Session 5

Earlier: Compare means for a single measure among groups.

Use t-test, ANOVA.

Session 5: Relate two or more measures.

Use correlation, regression.

Qu et al. (2005), JCEM 90:1563-1569.

[Figure from Qu et al. (2005), annotated with Δ and ΔY/ΔX.]

Page 4:

Correlation

• Visualize with a scatter plot of Y (vertical) by X (horizontal).

• Pearson correlation, r, is used to measure the linear association between two measures X and Y.

• Ranges from -1 (perfect inverse association) to 1 (perfect direct association).

• Value of r does not depend on:
  • the scales (units) of X and Y
  • which role X and Y assume, as in an X-Y plot

• Value of r does depend on:
  • the ranges of X and Y
  • the values chosen for X, if X is fixed and Y is measured
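The two invariance properties listed on this slide can be checked numerically. The following is a minimal sketch using only the Python standard library; the data values and the helper name `pearson_r` are made up for illustration and do not come from the session.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.7]

r_xy = pearson_r(x, y)                    # X horizontal, Y vertical
r_yx = pearson_r(y, x)                    # roles of X and Y swapped
x_rescaled = [100.0 * a + 7.0 for a in x] # new units and origin for X
r_rescaled = pearson_r(x_rescaled, y)

print(r_xy, r_yx, r_rescaled)  # all three agree up to rounding error
```

Swapping the roles or changing the units leaves r unchanged, which is what distinguishes correlation from the regression slope discussed later.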

Page 5:

Graphs and Values of Correlation

Page 6:

Correlation Depends on Ranges of X and Y

Graph B contains only the graph A points in the ellipse.

Correlation is reduced in graph B.

Thus: correlations for the same quantities X and Y may be quite different in different study populations.

[Figure: scatter plots A and B.]
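The dependence of r on the range of the data can be illustrated with a small simulation. This is a sketch under stated assumptions: the data are simulated (not the points in graphs A and B), and for simplicity the subset is chosen by restricting X to a narrow interval rather than by an ellipse.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(0)
n = 5000
# Simulated population in which the true correlation is 0.8.
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.8 * a + 0.6 * random.gauss(0, 1) for a in x]

r_full = pearson_r(x, y)

# Keep only subjects whose X lies in a narrow central range.
subset = [(a, b) for a, b in zip(x, y) if abs(a) < 0.5]
r_restricted = pearson_r([a for a, _ in subset], [b for _, b in subset])

print(f"full range: r = {r_full:.2f}; restricted range: r = {r_restricted:.2f}")
```

The X-Y relation is identical in both sets of points; only the spread of X differs, and that alone lowers r.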

Page 7:

Regression

• Again: Y (vertical) by X (horizontal) scatterplot, as with correlation. See next slide.

• X and Y now assume unique roles:
  • Y is an outcome, response, output, dependent variable
  • X is an input, predictor, independent variable

• Regression analysis is used to:
  • Measure X-Y association, as with correlation.
  • Fit a straight line through the scatter plot, for:
    • Prediction of Y from X.
    • Estimation of Δ in Y for a unit change in X (slope = “effect” of X on Y).

Page 8:

Regression Example

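[Figure: scatter plot with the fitted line; ei is the vertical distance of point i from the line. The fitted line minimizes Σei².]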
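The least-squares criterion on this slide can be sketched in a few lines. The data below are hypothetical (they are not the data plotted in the example), and `fit_line` and `sse` are illustrative helper names.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line for y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def sse(x, y, intercept, slope):
    """Sum of squared residuals e_i = y_i - (intercept + slope * x_i)."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = fit_line(x, y)
best = sse(x, y, b0, b1)

# Any other line has a larger sum of squared residuals.
worse_slope = sse(x, y, b0, b1 + 0.1)
worse_intercept = sse(x, y, b0 + 0.5, b1)
print(b0, b1, best, worse_slope, worse_intercept)
```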

Page 9:

X-Y Association

If slope=0 then X and Y are not (linearly) associated.

But the slope measured from a sample will almost never be exactly 0. How different from 0 does a measured slope need to be in order to claim X and Y are associated?

Test H0: slope=0 vs. HA: slope≠0, with the rule:

Claim association (HA) if tc=|slope/SE(slope)| > t ≈ 2.

With this rule, there is a 5% chance of claiming an X-Y association when none really exists.

Note similarity to t-test for means: tc=|mean/ SE(mean)|.

Formula for SE(slope) is in statistics books.
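For readers who want to see the textbook formula in action, here is a minimal sketch. It uses the standard result SE(slope) = s / sqrt(Σ(xi - mean of x)²), where s² = Σei²/(n-2); the data are hypothetical and the function name `slope_test` is made up for illustration.

```python
import math

def slope_test(x, y):
    """Return (slope, SE(slope), t) for the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual variance uses n - 2 because two parameters were estimated.
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    s = math.sqrt(sse / (n - 2))
    se_slope = s / math.sqrt(sxx)
    return slope, se_slope, slope / se_slope

# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

slope, se_slope, t = slope_test(x, y)
print(f"slope = {slope:.3f}, SE = {se_slope:.3f}, t = {t:.2f}")
```

Note that the cutoff of about 2 is a large-sample approximation. With only five points, as in this toy example, the exact two-sided 5% critical value (3 degrees of freedom) is about 3.18, so a t of roughly 2.1 would not be declared significant.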

Page 10:

X-Y Association, Continued

Refer to the graph of the example, 2 slides back.

We are 95% sure that the true line for the X-Y association is within the inner (….) band about the estimated line from our limited sample data.

If our test of H0: slope=0 vs. HA: slope≠0 results in claiming HA, then X and Y are significantly associated and no horizontal line fits inside the inner (….) band, and vice versa.

We can also test H0: ρ=0 vs. HA: ρ ≠0 , where ρ is the true correlation estimated by r. The result is identical to that for the slope.

Thus, correlation and regression are equivalent methods for measuring whether two variables are linearly associated.
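The equivalence of the two tests can be verified directly: the t statistic for the slope equals r·sqrt(n-2)/sqrt(1-r²), the usual t statistic for H0: ρ=0. The sketch below uses hypothetical data and computes both.

```python
import math

# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))

# t statistic from the regression slope.
slope = sxy / sxx
intercept = my - slope * mx
sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
t_slope = slope / math.sqrt(sse / (n - 2) / sxx)

# t statistic from the Pearson correlation.
r = sxy / math.sqrt(sxx * syy)
t_corr = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(t_slope, t_corr)  # the two values agree up to rounding error
```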

Page 11:

Estimation of Effects using Regression

Again, refer to the graph of the example, 3 slides back.

Regression Equation: y = 81.6 + 2.16x

If the study was designed to infer causation:

Slope estimates the effect of X on Y.

Best estimate is: Y increases 2.16 for a 1-unit increase in X.

Approximate 95% confidence interval for this effect is:

slope ± 2SE(slope)

2.16 ± 2(0.11)

1.94 to 2.38
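The interval arithmetic above, written out as a short sketch. The slope and SE are the rounded values from the slide; the multiplier 2 is the approximation used throughout this session.

```python
slope = 2.16      # estimated slope from the regression equation
se_slope = 0.11   # SE(slope), as rounded on the slide

half_width = 2 * se_slope          # approximate 95% margin
ci_low = slope - half_width
ci_high = slope + half_width

print(f"{ci_low:.2f} to {ci_high:.2f}")  # prints "1.94 to 2.38"
```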

Page 12:

Prediction from Regression

Again, refer to the graph of the example, 4 slides back.

The regression line (e.g., y=81.6 + 2.16x) is used for:

1. Predicting y for an individual with a known value of x. We are 95% sure that the individual’s true y is between the outer (---) band endpoints vertically above x. This interval is analogous to mean±2SD.

2. Predicting the mean y for “all” subjects with a known value of x. We are 95% sure that this mean is between the inner (….) band endpoints vertically above x. This interval is analogous to mean±2SE.
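The two kinds of interval can be sketched with the standard textbook formulas: SE for the mean at x0 is s·sqrt(1/n + (x0 - mean of x)²/Σ(xi - mean of x)²), and the SE for an individual adds 1 inside the square root. The data and the function name `intervals_at` are hypothetical, and the multiplier 2 stands in for the exact t value.

```python
import math

def intervals_at(x, y, x0, mult=2.0):
    """Return (fit, interval for the mean, interval for an individual) at x0."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    s = math.sqrt(sse / (n - 2))
    fit = intercept + slope * x0
    # Uncertainty in the line itself (inner band).
    se_mean = s * math.sqrt(1 / n + (x0 - mx) ** 2 / sxx)
    # Adds subject-to-subject scatter about the line (outer band).
    se_indiv = s * math.sqrt(1 + 1 / n + (x0 - mx) ** 2 / sxx)
    ci = (fit - mult * se_mean, fit + mult * se_mean)
    pi = (fit - mult * se_indiv, fit + mult * se_indiv)
    return fit, ci, pi

# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [3.1, 4.8, 7.2, 8.9, 11.3, 12.6, 15.2, 16.8]

fit, ci, pi = intervals_at(x, y, x0=4.5)
print(fit, ci, pi)
```

The interval for an individual is always wider than the interval for the mean, just as mean±2SD is wider than mean±2SE.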

Page 13:

Example Software Output

The regression equation is: Y = 81.6 + 2.16 X

Predictor   Coeff    StdErr   T       P
Constant    81.64    11.47    7.12    <0.0001
X           2.1557   0.1122   19.21   <0.0001

S = 21.72   R-Sq = 79.0%

Predicted Values:

X:        100
Fit:      297.21
SE(Fit):  2.17
95% CI:   292.89 - 301.52
95% PI:   253.89 - 340.52

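Annotations on the output:

Fit: predicted y = 81.6 + 2.16(100).

95% CI: range of Ys with 95% assurance for the mean of all subjects with x=100.

95% PI: range of Ys with 95% assurance for an individual with x=100.

T for X: 19.21 = 2.16/0.112; it should be between ~ -2 and 2 if the “true” slope=0.

Constant: refers to the intercept.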
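As a check on how the pieces of the output fit together, the sketch below recomputes the T value, the fit, and approximate interval half-widths from the printed numbers. It assumes the multiplier 2 in place of the exact t value the software used, so the half-widths match the printed intervals only approximately.

```python
import math

# Values copied from the software output above.
intercept, slope = 81.64, 2.1557
se_slope = 0.1122
s = 21.72        # residual SD ("S" in the output)
se_fit = 2.17    # SE(Fit) at x = 100
x0 = 100

t_slope = slope / se_slope                   # T column for X
fit = intercept + slope * x0                 # "Fit"
ci_half = 2 * se_fit                         # approx. half-width of 95% CI
pi_half = 2 * math.sqrt(s ** 2 + se_fit ** 2)  # approx. half-width of 95% PI

print(f"T = {t_slope:.2f}, fit = {fit:.2f}")
print(f"CI about {fit - ci_half:.1f} to {fit + ci_half:.1f}")
print(f"PI about {fit - pi_half:.1f} to {fit + pi_half:.1f}")
```

The prediction interval combines two sources of uncertainty: the scatter of individuals about the line (S) and the uncertainty in the fitted line itself (SE(Fit)).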

Page 14:

Regression Issues

1. We are assuming that the relation is linear.

2. We can generalize to more complicated non-linear associations.

3. Transformations, e.g., logarithmic, can be made to achieve linearity on other scales.

4. The difference, actual Y minus predicted Y, is called the “residual” or prediction error. Its magnitude (absolute value) should not depend on the value of x (e.g., should not tend to be larger for larger x), and it should be symmetrically distributed about 0. If not, transformations can often achieve this.
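Point 3 can be illustrated with a sketch in which Y grows exponentially with X, so that a straight line fits poorly on the original scale but exactly on the log scale. The data are generated from an assumed formula purely for illustration, and `r_squared` is a made-up helper name.

```python
import math

def r_squared(x, y):
    """Proportion of variation in y explained by a straight line in x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy ** 2 / (sxx * syy)

# Hypothetical noise-free data following y = 3 * exp(0.5 * x).
x = [float(i) for i in range(10)]
y = [3.0 * math.exp(0.5 * a) for a in x]

r2_raw = r_squared(x, y)                          # line fitted to y
r2_log = r_squared(x, [math.log(b) for b in y])   # line fitted to log(y)

print(f"R-Sq on original scale: {r2_raw:.3f}; on log scale: {r2_log:.3f}")
```

With real data the log-scale fit would not be perfect, but the same comparison, together with a plot of residuals against x, indicates whether the transformation helped.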

Page 15:

Multiple Regression: Geometric View

“Multiple” refers to using more than one X (say X1 and X2) simultaneously to predict Y. Geometrically, this is fitting a slanted plane to a cloud of points:

Graph from the readings.

LHCY is the Y (homocysteine) to be predicted from the two X’s: LCLC (folate) and LB12 (B12).

LHCY = b0 + b1LCLC + b2LB12 is the equation of the plane

Page 16:

How Are Coefficients Interpreted?

LHCY = b0 + b1LCLC + b2LB12

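Outcome: LHCY. Predictors: LCLC, LB12.

[Diagram: LCLC and LB12 each linked to LHCY, with coefficients b1 ? and b2 ? to be interpreted; LCLC and LB12 are linked to each other by their correlation.]

LB12 may have both an independent and an indirect (via LCLC) association with LHCY.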

Page 17:

Coefficients: Meaning of their Values

LHCY = b0 + b1LCLC + b2LB12

(Outcome: LHCY; predictors: LCLC and LB12)

LHCY increases by b2 for a 1-unit increase in LB12 …

… if other factors (LCLC) remain constant, or

… adjusting for other factors in the model (LCLC)

May be physiologically impossible to maintain one predictor constant while changing the other by 1 unit.
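The "other factors held constant" reading of a coefficient can be made concrete with a sketch that fits a plane by solving the least-squares normal equations. Everything here is hypothetical: the data are noise-free values generated from y = 1 + 2·x1 - 3·x2, and `solve` and `fit_plane` are illustrative helpers, not part of any statistics package.

```python
def solve(a, b):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_plane(x1, x2, y):
    """Least-squares coefficients (b0, b1, b2) for y = b0 + b1*x1 + b2*x2."""
    rows = [[1.0, a, b] for a, b in zip(x1, x2)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(3)]
    return solve(xtx, xty)

# Hypothetical noise-free data, invented for illustration only.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [1.0 + 2.0 * a - 3.0 * b for a, b in zip(x1, x2)]

b0, b1, b2 = fit_plane(x1, x2, y)

def predict(v1, v2):
    return b0 + b1 * v1 + b2 * v2

# Raise x2 by 1 unit while holding x1 fixed: the prediction changes by b2.
change = predict(3.0, 5.0) - predict(3.0, 4.0)
print(b0, b1, b2, change)
```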

Page 18:

Coefficients: Significance of Predictors

LHCY = b0 + b1LCLC + b2LB12     (1)

(Outcome: LHCY; predictors: LCLC and LB12)

As is typical for many estimators, the significance of LB12’s association with LHCY is measured with b2/SE(b2).

SE(b2) is found by first fitting LHCY with all other predictors (e.g., here, LHCY = b3 + b4LCLC), then also including LB12 to get eqn (1) with b2, and basing SE(b2) on the degree of fit improvement.

Thus, the significance of LB12 (via the p-value for b2) is for its independent effect, after removing the effect through its correlation with LCLC (“adjusted” for LCLC).
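The link between "degree of fit improvement" and b2/SE(b2) can be sketched numerically. The approach below is one standard way to obtain b2 and SE(b2), not necessarily how any particular package computes them: remove LCLC from both LHCY and LB12 by simple regression, then regress the leftover parts on each other. The data values are hypothetical.

```python
import math

def residuals(x, y):
    """Residuals of y after a least-squares straight-line fit on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

# Hypothetical data, invented for illustration only.
lclc = [1.2, 1.5, 1.7, 2.0, 2.2, 2.5, 2.9, 3.1]
lb12 = [5.1, 5.6, 5.2, 6.0, 6.3, 5.9, 6.8, 6.6]
lhcy = [2.9, 2.7, 2.8, 2.4, 2.3, 2.5, 2.0, 2.1]
n = len(lhcy)

res_y = residuals(lclc, lhcy)    # LHCY not explained by LCLC (reduced model)
res_b12 = residuals(lclc, lb12)  # part of LB12 not explained by LCLC

s_rr = sum(u * u for u in res_b12)
b2 = sum(u * v for u, v in zip(res_b12, res_y)) / s_rr

sse_reduced = sum(v * v for v in res_y)                            # without LB12
sse_full = sum((v - b2 * u) ** 2 for u, v in zip(res_b12, res_y))  # eqn (1)

s2 = sse_full / (n - 3)          # three coefficients estimated in eqn (1)
se_b2 = math.sqrt(s2 / s_rr)
t = b2 / se_b2
f_improvement = (sse_reduced - sse_full) / s2

print(t ** 2, f_improvement)  # equal up to rounding error
```

The squared t statistic for b2 equals the F statistic for the improvement in fit from adding LB12, which is why the p-value for b2 reflects only what LB12 adds beyond LCLC.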

Page 19:

Multiple Regression: More General

• More than 2 predictors can be used. The equation is for a “hyperplane”: y = b0 + b1x1 + b2x2 + … + bkxk.

• A more realistic functional form, more complex than a plane, can be used. For example, to fit curvature for x2, use y = b0 + b1x1 + b2x2 + b3x2².

• If predictors are highly correlated with each other, then the fitted equation is imprecise.

This is because the x1 and x2 data then lie in almost a line in the x1-x2 plane, so the fitted plane is like an unstable tabletop with the table legs not well-spaced.

Page 20:

Multiple Regression: Variable Selection

• Which factors should be included in the equation? More predictors → less bias, but less precision also.
  • Only those that are significant (p<0.05)?
  • Any that are biologically relevant?
  • Those that alter other predictor effects (the bi’s) by at least some minimal magnitude, e.g., 10%?

• Can depend on which of several different goals:
  • Best prediction of the outcome, e.g., life expectancy with terminal illness, regardless of which predictors are used?
  • The effects of particular factors on outcome?

Page 21:

Reading Example: HDL Cholesterol

            Parameter   Std                          Standardized
            Estimate    Error      T      Pr > |t|   Estimate

Intercept    1.16448    0.28804    4.04   <.0001      0
AGE         -0.00092    0.00125   -0.74   0.4602     -0.05735
BMI         -0.01205    0.00295   -4.08   <.0001     -0.35719
BLC          0.05055    0.02215    2.28   0.0239      0.17063
PRSSY       -0.00041    0.00044   -0.95   0.3436     -0.09384
DIAST        0.00255    0.00103    2.47   0.0147      0.23779
GLUM        -0.00046    0.00018   -2.50   0.0135     -0.18691
SKINF        0.00147    0.00183    0.81   0.4221      0.07108
LCHOL        0.31109    0.10936    2.84   0.0051      0.20611

The predictors of log(HDL) are age, body mass index, blood vitamin C, systolic and diastolic blood pressures, skinfold thickness, and the log of total cholesterol. The equation is:

LHDL = 1.16 - 0.00092(Age) +…+ 0.311(LCHOL)

Page 22:

Reading Example: Coefficients

Interpretation of coefficients on previous slide:

1. Need to use entire equation for making predictions.

2. Each coefficient measures the difference in expected LHDL between 2 subjects if the factor differs by 1 unit between the two subjects, and if all other factors are the same. E.g., expected LHDL is 0.012 lower in a subject whose BMI is 1 unit greater but who is the same as the other subject on all other factors.

3. P-values measure independent association. SKINF by itself probably is associated with LHDL, even though p=0.42; the p-value says only that it is not associated after accounting for other factors such as BMI.
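Points 1 and 2 can be sketched with the coefficients from the table on the previous slide. The coefficient values are copied from that table; the two subjects' predictor values are hypothetical placeholders (the slides do not state the units), chosen only so that the subjects differ by 1 unit of BMI and nothing else.

```python
# Parameter estimates copied from the table on the previous slide.
coef = {
    "Intercept": 1.16448, "AGE": -0.00092, "BMI": -0.01205,
    "BLC": 0.05055, "PRSSY": -0.00041, "DIAST": 0.00255,
    "GLUM": -0.00046, "SKINF": 0.00147, "LCHOL": 0.31109,
}

def predict_lhdl(subject):
    """Predicted LHDL from the entire equation (point 1)."""
    return coef["Intercept"] + sum(coef[name] * value
                                   for name, value in subject.items())

# Hypothetical subject; all values are made up for illustration.
subject_a = {"AGE": 50, "BMI": 25, "BLC": 1.0, "PRSSY": 120,
             "DIAST": 80, "GLUM": 90, "SKINF": 20, "LCHOL": 2.3}
# Same subject except BMI is 1 unit greater (point 2).
subject_b = dict(subject_a, BMI=subject_a["BMI"] + 1)

difference = predict_lhdl(subject_b) - predict_lhdl(subject_a)
print(f"{difference:.5f}")  # equals the BMI coefficient, about -0.012
```

Whatever values are chosen for the other factors, the difference is the same, because those terms cancel when the two subjects match on them.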

Page 23:

Self-Quiz: Extend Earlier Example

Earlier Example:

Strong effect of X on Y:
Slope = 2.16±0.11
95% CI = 1.94 to 2.38
p<0.0001

Now use another variable Z to predict the same Y:

Strong effect of Z on Y:
Slope = 0.93±0.05
95% CI = 0.83 to 1.03
p<0.0001

Page 24:

Self-Quiz: Continued

Now use both X and Z together to predict Y. Output:

The regression equation is
Y = 81.7 + 0.04 Z + 2.07 X

Predictor   Coeff    SE       T      p
Constant    81.66    11.55    7.07   0.000
Z           0.039    1.041    0.04   0.970
X           2.066    2.408    0.86   0.393

S = 21.83   R-Sq = 79.0%

Why do both X and Z now seem to be unassociated with Y (p-values > 0.39), whereas individually p<0.0001?

Page 25:

Self-Quiz: Answer

Graphing X and Z shows that they essentially have the same information. Their Pearson correlation is >0.99.

The p=0.39 for X in the multiple regression says that X has no additional info on Y beyond that in Z. It says nothing about X’s association with Y, ignoring Z.
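The quiz result can be reproduced with a simulation. This is a sketch under stated assumptions: the data are simulated rather than the quiz data, with Z constructed to be almost a multiple of X, and the adjusted t statistics are obtained by first removing the other predictor from both variables (a standard equivalent of fitting both predictors together).

```python
import math
import random

def fit(x, y):
    """Return (intercept, slope, sxx) for the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope, sxx

def residuals(x, y):
    b0, b1, _ = fit(x, y)
    return [b - (b0 + b1 * a) for a, b in zip(x, y)]

def t_alone(x, y):
    """t statistic for the slope when x is the only predictor of y."""
    _, slope, sxx = fit(x, y)
    sse = sum(e * e for e in residuals(x, y))
    return slope / math.sqrt(sse / (len(x) - 2) / sxx)

def t_adjusted(x, other, y):
    """t statistic for x's coefficient when 'other' is also in the model."""
    rx = residuals(other, x)   # part of x not explained by 'other'
    ry = residuals(other, y)   # part of y not explained by 'other'
    srr = sum(u * u for u in rx)
    b = sum(u * v for u, v in zip(rx, ry)) / srr
    sse = sum((v - b * u) ** 2 for u, v in zip(rx, ry))
    return b / math.sqrt(sse / (len(x) - 3) / srr)

random.seed(1)
n = 100
# Simulated data: Z carries essentially the same information as X.
x = [random.gauss(100, 10) for _ in range(n)]
z = [2.3 * a + random.gauss(0, 1) for a in x]
y = [80 + 2.2 * a + random.gauss(0, 20) for a in x]

tx_alone, tz_alone = t_alone(x, y), t_alone(z, y)
tx_adj, tz_adj = t_adjusted(x, z, y), t_adjusted(z, x, y)

print(f"alone:    t(X) = {tx_alone:.1f}, t(Z) = {tz_alone:.1f}")
print(f"together: t(X) = {tx_adj:.1f}, t(Z) = {tz_adj:.1f}")
```

Each predictor is strongly associated with Y on its own, yet neither adds much once the other is in the model, because almost nothing of X is left after removing Z (and vice versa).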