BIOS 312: MODERN REGRESSION ANALYSIS
James C (Chris) Slaughter
Department of Biostatistics, Vanderbilt University School of Medicine
[email protected]
biostat.mc.vanderbilt.edu/CourseBios312
Copyright 2009-2012 JC Slaughter, All Rights Reserved
Updated January 5, 2012
– Most often scientific questions are translated into comparing the distribution of some response variable across groups of interest
– Groups are defined by the predictor of interest (POI)
∗ Categorical predictors of interest: Treatment or control, knockout or wild type, ethnic group
∗ Continuous predictors of interest: Age, BMI, cholesterol, blood pressure
· If we only considered the response and POI, this is referred to as a simple (linear, logistic, PH, etc.) regression model
· Often we need to consider additional variables other than POI because...
– We want to make comparisons in different strata
CHAPTER 7. ADJUSTMENT FOR COVARIATES 6
∗ e.g. if we stratify by gender, we may get different answers to our scientific question in men and women
– Groups being compared differ in other ways
∗ Confounding: A variable that is related to both the outcome and predictor of interest
– Less variability in the response if we control for other variables
∗ Precision: If we restrict to looking within certain strata, may get smaller σ2
· Statistics: Covariates other than the Predictor of Interest are included in the model as...
– Effect modifiers
– Confounders
– Precision variables
· Two main statistical methods to adjust for covariates
– Stratified analyses
∗ Combines information about associations between response and POI across strata
∗ Will not borrow information about (or even estimate) associations between response and adjustment variables
– Adjustment in multiple regression
∗ Can (but does not have to) borrow information about associations between response and all modeled variables
∗ Could conduct a stratified analysis using regression
∗ In practice, when researchers say they are using regression, they are almost certainly doing so to borrow information
· Example: Is smoking associated with FEV in teenagers?
– Stratified Analysis
∗ Separately estimate mean FEV in 19 year olds, 18 year olds, 17 year olds, etc. by smoking status
∗ Average means (using weights) to come up with overall effect of smoking on FEV
∗ Key: Not trying to estimate a common effect of age across strata (not borrowing information across age)
∗ No estimate of the age effect in this analysis
– Multiple regression
∗ Fit a regression model with FEV as the outcome, smoking as the POI, and age as an adjustment variable
∗ Will provide you an estimate of the association between FEV and age (but do you care?)
∗ Can borrow information across ages to estimate the age effect
· Linear/spline function for age would borrow information
· Separate indicator variable for each age would borrow less information (would still assume that all 19.1 and 19.2 year olds are the same)
· Adjustment for two factors: Age and Sex
– Stratified analyses
∗ Calculate separate means by age and sex, combine using weighted averages as before
∗ This method adjusts for the interaction of age and sex (in addition to age and sex main effects)
– Multiple regression
∗ “We adjusted for age and sex...” or “Holding age and sex constant, we found ...”
∗ Almost certainly the researchers adjusted for age and sex, but not the interaction of the two variables (but they could have)
7.2 Stratified Analysis
7.2.1 Methods
· General approach to conducting a stratified analysis
– Divide the data into strata based on all combinations of the “adjustment” covariates
∗ e.g. every combination of age, gender, race, SES, etc.
– Within each stratum, perform an analysis comparing responses across POI groups
– Use a (weighted) average of the estimated associations across strata
· Combining responses: Easy if estimates are independent and approximately Normally distributed
– For independent strata k, k = 1, . . . , K
∗ Estimate in stratum k: $\hat\theta_k \sim N(\theta_k, se_k^2)$
∗ Weight in stratum k: wk
∗ Stratified estimate is
$$\hat\theta = \frac{\sum_{k=1}^K w_k \hat\theta_k}{\sum_{k=1}^K w_k} \sim N\left( \frac{\sum_{k=1}^K w_k \theta_k}{\sum_{k=1}^K w_k},\; \frac{\sum_{k=1}^K w_k^2\, se_k^2}{\left(\sum_{k=1}^K w_k\right)^2} \right) \qquad (7.1)$$
· How to choose the weights?
– Scientific role of the stratified estimate
∗ Just because I have more women in my sample than men, does that mean I should weight my estimate towards women? Maybe, maybe not.
– Statistical precision of the stratified estimate
∗ Just because the data are more variable in women than men, does that mean I should down-weight women? Maybe, maybe not.
∗ Weight usually chosen on statistical criteria
· Weights should be chosen based on the statistical role of the adjustment variable
– Effect modifiers
– Confounding
– Precision
7.2.2 Weights for Effect Modification
· Scientific criteria
– Sometimes we anticipate effect modification by some variables, but
∗ We do not choose to report estimates of the association between the response and POI in each stratum separately
· e.g. political polls, age adjusted incidence rates
∗ We are interested in estimating the “average association” for a population
· Choosing weights according to scientific importance
– Want to estimate the average effect in some population of interest
∗ The real population, or,
∗ Some standard population used for comparisons
– Example: Ecologic studies comparing incidence of hip fractures across countries
∗ Hip fracture rates increase with age
∗ Industrialized countries and developing world have very different age distributions
∗ Choose a standard age distribution to remove confounding by age
· Comment on oversampling
– In political polls or epidemiologic studies we sometimes oversample some strata in order to gain precision
∗ For fixed maximal sample size, we gain most precision if stratum sample size is proportional to weight times standard deviation of measurements in stratum
∗ Example: Oversample swing-voters relative to individuals whose voting preferences we can be more certain about
– For independent strata k, k = 1, . . . , K
∗ Sample size in stratum k: nk
∗ Estimate in stratum k: $\hat\theta_k \sim N\left(\theta_k,\; se_k^2 = \frac{V_k}{n_k}\right)$
∗ Importance weight for stratum k: wk
∗ Optimal sample size when $N = \sum_{k=1}^K n_k$ is:
$$\frac{w_1\sqrt{V_1}}{n_1} = \frac{w_2\sqrt{V_2}}{n_2} = \dots = \frac{w_K\sqrt{V_K}}{n_K} \qquad (7.2)$$
7.2.3 Weights for Confounders and Precision Variables
· If the true association is the same in each stratum, we are free to consider statistical criteria
– It is very unlikely that there is no effect modification in truth, but is it small enough to ignore?
· Statistical criteria
– Maximize precision of stratified estimates by minimizing the standard error
· Optimal statistical weights
– For independent strata k, k = 1, . . . , K
∗ Sample size in stratum k: nk
∗ Estimate in stratum k: $\hat\theta_k \sim N\left(\theta_k,\; se_k^2 = \frac{V_k}{n_k}\right)$
∗ Importance weight for stratum k: wk
∗ Optimal sample size when $N = \sum_{k=1}^K n_k$ is:
$$\frac{w_1\sqrt{V_1}}{n_1} = \frac{w_2\sqrt{V_2}}{n_2} = \dots = \frac{w_K\sqrt{V_K}}{n_K} \qquad (7.3)$$
· We often ignore the possibility that variability may differ across strata
– This simplifies matters so that we choose weights by sample size for each stratum
· Example: Mantel-Haenszel Statistic
– Popular method used to create a common odds ratio estimate across strata
∗ One way to combine a binary response variable and binary predictor across various strata
– Hypothesis test comparing odds (proportions) across two groups
∗ Adjust for confounding in a stratified analysis
∗ Weights chosen for statistical precision
– Approximate weighting of difference in proportions based on harmonic means of sample sizes in each stratum
∗ Usually viewed as a weighted odds ratio
∗ (Why not weight by log odds or probabilities?)
– For independent strata k, k = 1, . . . , K
∗ Sample sizes in stratum k: $n_{1k}$, $n_{0k}$
∗ Estimates in stratum k: $\hat p_{1k}$, $\hat p_{0k}$
∗ Precision weight for stratum k:
$$w_k = \frac{n_{1k}\, n_{0k}}{n_{1k} + n_{0k}} \div \sum_{k=1}^K \frac{n_{1k}\, n_{0k}}{n_{1k} + n_{0k}} \qquad (7.4)$$
∗ These weights work well in practice, and are not as complicated as some other weighting systems
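As an illustration, the M-H common odds ratio and the harmonic-mean weights of (7.4) can be computed directly from stratified 2×2 tables. A sketch with made-up counts (hypothetical data, not the course example):

```python
# Hypothetical stratified 2x2 tables: (a, b, c, d) =
# (exposed cases, exposed non-cases, unexposed cases, unexposed non-cases)
tables = [
    (10, 40, 5, 45),
    (20, 30, 12, 38),
    (8, 12, 6, 14),
]

# Mantel-Haenszel common odds ratio: sum_k(a*d/N) / sum_k(b*c/N)
num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
or_mh = num / den

# Precision weights from (7.4) applied to differences in proportions
diffs, wts = [], []
for a, b, c, d in tables:
    n1, n0 = a + b, c + d          # group sample sizes in this stratum
    diffs.append(a / n1 - c / n0)  # p1k - p0k
    wts.append(n1 * n0 / (n1 + n0))
wtd_diff = sum(wk * dk for wk, dk in zip(wts, diffs)) / sum(wts)
print(or_mh, wtd_diff)
```

In R, `mantelhaen.test()` performs the corresponding hypothesis test on such a stratified table.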
· The M-H OR (0.303) differs slightly from the multiple logistic regression OR (0.301)
· Conceptually, they are attempting to do the same thing, but are using different weights
· Multivariable logistic regression would allow you to control for continuous covariates without complete stratification
7.3 Multivariable Regression
7.3.1 General Regression Setting
· Types of variables
– Binary data: e.g. sex, death
– Nominal (unordered categorical) data: e.g. race, marital status
– Ordinal (ordered categorical) data: e.g. cancer stage, asthma severity
– Quantitative data: e.g. age, blood pressure
– Right censored data: e.g. time to death
· Which regression model you choose to use is based on the parameter being compared across groups
Means → Linear regression
Geometric means → Linear regression on log scale
Odds → Logistic regression
Rates → Poisson regression
Hazards → Proportional Hazards (Cox) regression
· General notation for variables and parameters
Yi — Response measured on the ith subject
Xi — Value of the predictor of interest measured on the ith subject
W1i, W2i, . . . — Values of the adjustment variables for the ith subject
θi — Parameter summarizing distribution of Yi
– The parameter (θi) might be the mean, geometric mean, odds, rate, instantaneous risk of an event (hazard), etc.
– In multiple linear regression on means, θi = E[Yi | Xi, W1i, W2i, . . .]
– Choice of correct θi should be based on scientific understanding of the problem
· General notation for multiple regression model
– g(θi) = β0 + β1 ×Xi + β2 ×W1i + β3 ×W2i + . . .
g() — Link function used for modeling
β0 — Intercept
β1 — Slope for predictor of interest X
βj — Slope for covariate Wj−1
– The link function is often either none (for modeling means) or log (for modeling geometric means, odds, hazards)
7.3.2 General Uses of Multivariable Regression
· Borrowing information
– Use other groups to make estimates in groups with sparse data
∗ Intuitively, 67 and 69 year olds would provide some relevant information about 68 year olds
∗ Assuming a straight line relationship tells us about other, even more distant, individuals
∗ If we do not want to assume a straight line, we may only want to borrow information from nearby groups
· Defining “Contrasts”
– Define a comparison across groups to use when answering scientific questions
– If the straight line relationship holds, the slope for the POI is the difference in parameter between groups differing by 1 unit in X when all other covariates are held constant
– If there is a non-linear relationship in the parameter, the slope is still the average difference in parameter between groups differing by 1 unit in X when all other covariates are held constant
∗ Slope is a (first order or linear) test for trend in the parameter
∗ Statistical jargon: “a contrast” across groups
· The major difference between different regression models is the interpretation of the parameters
– How do I want to summarize the outcome?
∗ Mean, geometric mean, odds, hazard
– How do I want to compare groups?
∗ Difference, ratio
· Issues related to the inclusion of covariates remain the same
– Address the scientific question: Predictor of interest, effect modification
– Address confounding
– Increase precision
· Interpretation of parameters
– Intercept
∗ Corresponds to a population with all modeled covariates equal to zero
· Quite often, this will be outside the range of the data so that the intercept has no meaningful interpretation by itself
– Slope
∗ A comparison between groups differing by 1 unit in the corresponding covariate, but agreeing on all other modeled covariates
· Sometimes impossible to use this interpretation when modeling interactions or complex dose-response curves (e.g. a model with age and age-squared)
· Stratification versus regression
– Generally, any stratified analysis could be performed as a regression model
∗ Stratification adjusts for covariates and all interactions among those covariates
∗ Our habit in regression is to just adjust for covariates as main effects, and consider interactions less often
7.3.3 Software
· In Stata or R, we use the same commands as were used for simple regression models
– We just list more variable names
– Interpretations of the CIs and p-values for coefficient estimates now relate to new scientific interpretations of the intercept and slopes
– A test of the entire regression model is also provided (a test that all slopes are equal to zero)
7.3.4 Example: FEV and Smoking
· Scientific question: Is the maximum forced expiratory volume (FEV) related to smoking status in children?
· Age ranges from 3 to 19, but no child under 9 smokes in the sample
· Models we will compare
– Unadjusted (simple) model: FEV and smoking
– Adjusted for age: FEV and smoking with age (confounder)
– Adjusted for age, height: FEV and smoking with age (confounder) and height (precision)
· Adjusting for covariates changes the scientific question
– Unadjusted models: Slope compares parameters across groups differing by 1 unit in the modeled predictor
∗ Groups may also differ with respect to other variables
– Adjusted models: Slope compares parameters across groups differing by 1 unit in the modeled predictor but similar with respect to other model covariates
· Interpretation of Slopes
– Unadjusted model: g(θ|Xi) = β0 + β1 ×Xi
∗ β1: Compares θ for groups differing by 1 unit in X
· (The distribution of W might differ across groups being compared)
– Adjusted model: g(θ|Xi, Wi) = γ0 + γ1 ×Xi + γ2 ×Wi
∗ γ1: Compares θ for groups differing by 1 unit in X, but agreeing on their values of W
· Comparing unadjusted and adjusted models
– Science questions
∗ When does $\gamma_1 = \beta_1$?
∗ When does $\hat\gamma_1 = \hat\beta_1$?
– Statistics questions
∗ When does $se(\hat\gamma_1) = se(\hat\beta_1)$?
∗ When does $\widehat{se}(\hat\gamma_1) = \widehat{se}(\hat\beta_1)$?
– In the above, note the placement of the ˆ (“hat”), which signifies estimates of population parameters
– Lack of a hat indicates “the truth” in the population
– When $\hat\gamma_1 = \hat\beta_1$ (the formulas are the same), then it must be the case that $se(\hat\gamma_1) = se(\hat\beta_1)$
∗ But our estimates of the standard errors may not be the same
– Example of when $\hat\gamma_1 = \hat\beta_1$
∗ Want to compare smokers to non-smokers (POI) with respect to their FEV (response) and have conducted a randomized experiment in which boys and girls (W) are equally represented as smokers and non-smokers
∗ When we compare a random smoker to a random non-smoker, that average difference will be the same whether or not we adjust for gender
∗ Numbers in unadjusted and adjusted analyses are the same, but interpretation is different
· These four questions cannot be answered in the general case
– However, in linear regression we can derive exact results
– These exact results can serve as a basis for examination of other regression models
∗ Logistic regression
∗ Poisson regression
∗ Proportional hazards regression
7.4.2 Comparison of Adjusted and Unadjusted Models in Linear Regression
Interpretation of Slopes
· Unadjusted model: E[Yi|Xi] = β0 + β1 ×Xi
– β1: Difference in mean Y for groups differing by 1 unit in X
∗ (The distribution of W might differ across groups being compared)
· When X is associated with W, and W is associated with Y after we control for X, that is what we call confounding
· When X is associated with W, and W is not associated with Y after we control for X, this inflates the variance of the association between X and Y (more on this follows)
· When X is not associated with W, and W is associated with Y after we control for X, this increases the precision of our estimate of the association between X and Y
· When X is not associated with W, and W is not associated with Y after we control for X, there is no reason to be concerned with modeling W
7.4.3 Precision variables
Precision in Linear Regression
· Adjusting for a true precision variable should not impact the point estimate of the association between the POI and the response, but will decrease the standard error
· X, W independent in the population (or a completely randomized experiment) AND W is associated with Y independent of X
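This claim can be demonstrated by simulation. The sketch below (pure Python, simulated data, not the course's FEV data set) generates X and W independently, makes W strongly predictive of Y, and fits least squares by hand; adjusting for W leaves the X slope essentially unchanged but shrinks its standard error:

```python
import math
import random

random.seed(1)
n = 200
X = [random.gauss(0, 1) for _ in range(n)]   # predictor of interest
W = [random.gauss(0, 1) for _ in range(n)]   # precision variable, independent of X
Y = [1.0 + 0.5 * x + 2.0 * w + random.gauss(0, 1) for x, w in zip(X, W)]

def ols(y, cols):
    """Least squares via normal equations; returns (estimates, standard errors)."""
    rows = [[1.0] + [c[i] for c in cols] for i in range(len(y))]
    p = len(rows[0])
    # Augmented system [X'X | X'y | I] reduced by Gauss-Jordan elimination,
    # yielding the solution and the inverse of X'X
    aug = [[sum(r[a] * r[b] for r in rows) for b in range(p)]
           + [sum(r[a] * yi for r, yi in zip(rows, y))]
           + [1.0 if b == a else 0.0 for b in range(p)] for a in range(p)]
    for i in range(p):
        piv = aug[i][i]
        aug[i] = [v / piv for v in aug[i]]
        for j in range(p):
            if j != i:
                f = aug[j][i]
                aug[j] = [vj - f * vi for vj, vi in zip(aug[j], aug[i])]
    beta = [aug[a][p] for a in range(p)]
    resid = [yi - sum(b * x for b, x in zip(beta, r)) for yi, r in zip(y, rows)]
    s2 = sum(e * e for e in resid) / (len(y) - p)   # residual variance
    se = [math.sqrt(s2 * aug[a][p + 1 + a]) for a in range(p)]
    return beta, se

b_unadj, se_unadj = ols(Y, [X])    # Y ~ X
b_adj, se_adj = ols(Y, [X, W])     # Y ~ X + W
print(b_unadj[1], se_unadj[1], b_adj[1], se_adj[1])
```

In Stata or R this is just `regress y x` versus `regress y x w` (or `lm(y ~ x)` versus `lm(y ~ x + w)`); the hand-rolled solver here only makes the computation explicit.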
– Geometric mean of FEV in nonsmokers is 2.88 l/sec
∗ The scientific relevance is questionable here because we do not really know the population our sample represents
∗ Calculations: $e^{1.05817} = 2.88$
∗ The p-value is completely unimportant here as it is testing that the log geometric mean is 0, or that the geometric mean is 1. Why would we care?
– Because smoker is a binary variable, the estimate corresponds to a geometric mean. In many regression models, the intercept will have no interpretation
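Since these models are linear regressions of log(FEV), coefficients back-transform through exp() to the geometric-mean scale. A quick numerical check of this arithmetic (the slope below is derived from the reported "10.8% higher", not taken from model output):

```python
import math

intercept = 1.05817             # log geometric mean of FEV in nonsmokers (from the notes)
geo_mean = math.exp(intercept)  # back-transform to the geometric mean, in l/sec

# A slope b on the log scale corresponds to a (exp(b) - 1) * 100 percent
# difference in geometric means between groups differing by 1 unit
b = math.log(1.108)             # slope implied by a 10.8% difference (derived, illustrative)
pct_diff = (math.exp(b) - 1) * 100
print(round(geo_mean, 2), round(pct_diff, 1))
```

The same conversion explains every percent-difference interpretation in this section: exponentiate the coefficient (or CI endpoint), subtract 1, and multiply by 100.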
· Smoking effect
– Geometric mean of FEV is 10.8% higher in smokers than in nonsmokers (95% CI: 4.1% to 17.9% higher)
∗ These results are atypical of what we might expect with no true difference between smoking groups
– Geometric mean of FEV in newborn nonsmokers is 1.42 l/sec
∗ Intercept corresponds to the log geometric mean in a group having all predictors equal to 0
∗ There is no scientific relevance here as we are extrapolating beyond the range of our data
∗ Calculations: $e^{0.352} = 1.42$
· Age effect
– Geometric mean of FEV is 6.6% higher for each year difference in age between two groups with the same smoking status (95% CI: 5.5% to 7.6% higher)
∗ These results are highly atypical of what we might expect with no true difference in geometric means between age groups having similar smoking status (p < 0.001)
· Smoking effect (age adjusted interpretation)
– Geometric mean of FEV is 5.0% lower in smokers than in nonsmokers of the same age (95% CI: 12.2% lower to 1.6% higher)
∗ These results are not atypical of what we might expect with no true difference between groups of the same age (p = 0.136)
∗ Lack of statistical significance can also be noted by the fact that the CI for the ratio contains 1 or the CI for the percent difference contains 0
– Geometric mean of FEV in newborn nonsmokers who are 1 inch in height is 0.000015 l/sec
∗ Intercept corresponds to the log geometric mean in a group having all predictors equal to 0 (nonsmokers, age 0, log height 0)
∗ There is no scientific relevance because there are no such people in either our sample or the population
· Age effect
– Geometric mean of FEV is 2.2% higher for each year difference in age between two groups with the same height and smoking status (95% CI: 1.5% to 2.9% higher for each year difference in age)
∗ These results are highly atypical of what we might expect with no true difference in geometric means between age groups having similar smoking status and height (p < 0.001)
– Note that there is clear evidence that height confounded the age effect estimated in the analysis that only considered age and smoking, but there is still a clearly independent effect of age on FEV
· Height effect
– Geometric mean of FEV is 31.5% higher for each 10% difference in height between two groups with the same age and smoking status (95% CI: 28.3% to 34.6% higher for each 10% difference in height)
∗ These results are highly atypical of what we might expect with no true difference in geometric means between height groups having similar smoking status and age (p < 0.001)
∗ Calculations: $1.1^{2.867} = 1.315$
– Note that the CI for the regression coefficient is consistent with the scientifically-hypothesized value of 3
– Geometric mean of FEV is 5.2% lower in smokers than in nonsmokers of the same age and height (95% CI: 9.6% to 0.6% lower)
∗ These results are atypical of what we might expect with no true difference between groups of the same age and height (p = 0.027)