Regression
Topics for today
Readings
• Jewell Chapters 12, 13, 14 & 15
Context
So far in the course, we have learned how to:
• Quantify disease occurrence (prevalence, incidence, etc.)
• Quantify association with an exposure (relative risks, odds ratios) and assess its significance (standard errors, confidence intervals)
• Stratify for a confounding variable (Mantel-Haenszel test, Woolf or Mantel-Haenszel adjusted estimates)
• Test to see if a factor influences the exposure/response association (interaction)
All of this can be done with fairly simple procedures and tests. However, we’ve also been exploring how to do equivalent analyses with Poisson and Logistic regression.
Today we see some additional advantages of the regression approach. Let's motivate with some examples.
Example - arsenic
We've looked at the relative risk for the highest-exposure village (934 ppb) compared to the control group. But we really need to characterise the whole dose-response relationship. Regression allows us to do that.
Example – Anti-epileptic drugs
We have looked at the effect of drug exposure, adjusted for whether or not the mother smokes. But there are additional variables we would like to adjust for:
– alc2: Alcohol use during pregnancy (1=yes, 0=no)
– cig2: Cigarette smoking during pregnancy
– sub2: Substance abuse during pregnancy
– seiz: Severity of seizures (1=no seizures, 2=seizures with convulsions, 3=loss of consciousness)
– mohcx2: whether the mother has a small head circumference
– mohtx2: whether the mother has small height
There is also information on the type of drug exposure
– monopht: Phenytoin monotherapy – monocbz: Carbamazepine monotherapy – monopb: Penobarbital monotherapy – monooth: Other monotherapy
As well as whether the mother took one drug or a combination of drugs:
– monostat2: Monotherapy/Polytherapy status (1=Polytherapy, 2=Monotherapy, 3=Seizure History, 4=Controls)
Regression allows us to explore some of these effects simultaneously
Types of regression models
Basic concept: outcome = predicted mean + error
• Linear regression – most natural when the outcome is continuous (e.g. blood pressure)
• Logistic – most natural for a 0/1 outcome
• Poisson – most natural when the outcome is a count among person-years at risk, or a rare disease count in a population
Linear:
  Mean(Y) = β0 + β1X (simple)
  Mean(Y) = β0 + β1X1 + β2X2 + ... + βkXk (multiple)
Logistic:
  logit(Mean(Y)) = β0 + β1X (simple)
  logit(Mean(Y)) = β0 + β1X1 + β2X2 + ... + βkXk (multiple)
Poisson:
  log(disease rate) = β0 + β1X (simple)
  log(disease rate) = β0 + β1X1 + β2X2 + ... + βkXk (multiple)
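The three simple model forms can be sketched in a few lines of Python. This is a minimal illustration only; the coefficients b0 and b1 are made up, not fitted values from any of our datasets.

```python
from math import exp

# Made-up coefficients, for illustration only
b0, b1 = -2.0, 0.5

def linear_mean(x):
    # Linear: Mean(Y) = b0 + b1*X
    return b0 + b1 * x

def logistic_mean(x):
    # Logistic: logit(Mean(Y)) = b0 + b1*X, so invert the logit to get the mean
    eta = b0 + b1 * x
    return exp(eta) / (1 + exp(eta))

def poisson_rate(x):
    # Poisson: log(disease rate) = b0 + b1*X, so exponentiate to get the rate
    return exp(b0 + b1 * x)
```

Note that the logistic and Poisson means are non-linear functions of X even though the models are linear on the logit and log scales.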
Notes and comments
• We've looked at some simple models where the predictors in the model are categorical. But predictors can also be continuous, e.g.
  BP = β0 + β1·Age + error
  The slope (β1) tells us how much BP is predicted to increase for each 1-unit increase in age.
• Other models (e.g. probit – Jewell 12.3) are available, but less common
• Logistic and Poisson models are linear on the logit and log scales, respectively. But this induces a non-linear model for the mean of Y
R commands: x=1:100; p=exp(-4+.1*x)/(1+exp(-4+.1*x)); y=rbinom(100,1,p); plot(x,y); lines(x,p)
Logit(p)=-4+.1*x
Example – epilepsy
proc genmod descending;
  class drug;
  model one3=drug cig2 sub2 / dist=binomial;
run;
How do we interpret this?

Drug  Cig  Subs  Logit                  Prob
1     0    0     -3.42+1.06             .09
1     1    0     -3.42+1.06+.87         .18
1     0    1     -3.42+1.06+.93         .19
1     1    1     -3.42+1.06+.87+.93     .36
3     0    0     -3.42                  .03
3     1    0     -3.42+.87              .07
3     0    1     -3.42+.93              .08
3     1    1     -3.42+.87+.93          .17
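The Prob column is just the inverse logit of the Logit column. A quick check in Python, using the coefficient values from the fitted model above (intercept for drug group 3 with no smoking or substance abuse, plus increments for drug group 1, cig2 and sub2):

```python
from math import exp

def inv_logit(logit):
    """Convert a log-odds value to a probability."""
    return exp(logit) / (1 + exp(logit))

# Coefficients from the fitted logistic model above
b0, b_drug1, b_cig, b_sub = -3.42, 1.06, 0.87, 0.93

# Reproduce the table: drug1 = 1 for drug group 1, 0 for drug group 3
for drug1, cig, sub in [(1, 0, 0), (1, 1, 0), (1, 0, 1), (1, 1, 1),
                        (0, 0, 0), (0, 1, 0), (0, 0, 1), (0, 1, 1)]:
    logit = b0 + b_drug1 * drug1 + b_cig * cig + b_sub * sub
    print(drug1, cig, sub, round(inv_logit(logit), 2))
```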
How to decide what goes into a model?
Hard problem – no single right answer (J15.2). The Hosmer/Lemeshow approach:
• Start by exploring the relationship between each individual variable and the outcome
• Select all the individually important variables and put them into one model
• Remove variables one at a time if not "significant" (look at p-values as well as likelihood ratio tests)
• Check if variables originally left out should go in
• Consider interactions (a few limited ones)
• Assess if the model fits well
Example – Epilepsy
Variables significant (p<.10) on their own: drug, cig2, sub2, mohcx2, seiz
Example continued
• Dropping the least significant variable, one at a time, leads to a model with drug, cig2 and sub2
• Note that the coefficient of mohcx2 was large, though the variable was not significant. Only 11 mothers had a small head circumference. It is possible (likely?) that this variable is important, but we didn't have enough power to detect the effect
Stepwise Regression
An automatic variable selection procedure that sorts through a dataset to find the "best" model
• Forward – start with null model and add variables one at a time
• Backward – start with saturated model and remove variables one at a time
• Combined – Do forward regression, but check at each step whether any variables need to be removed
Can be useful, though caution is needed:
• Don’t overinterpret
• Consider clinical/scientific relevance as well
• Sometimes a useful way to start
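The combined forward/backward idea can be sketched as a simple loop. The score function below is a toy stand-in for a real fit criterion (such as AIC, where lower is better), and the assumption that only drug and cig2 "matter" is purely for demonstration:

```python
def stepwise(candidates, score):
    """Combined stepwise selection: lower score is better (e.g. AIC-like)."""
    selected = []
    improved = True
    while improved:
        improved = False
        # Forward step: add the best remaining variable if it improves the score
        remaining = [v for v in candidates if v not in selected]
        if remaining:
            best = min(remaining, key=lambda v: score(selected + [v]))
            if score(selected + [best]) < score(selected):
                selected = selected + [best]
                improved = True
        # Backward step: remove any variable whose removal improves the score
        for v in list(selected):
            reduced = [u for u in selected if u != v]
            if score(reduced) < score(selected):
                selected = reduced
                improved = True
    return selected

# Toy score: pretend only drug and cig2 truly matter (a made-up assumption)
def toy_score(model):
    return 2 * len(model) - 10 * len(set(model) & {"drug", "cig2"})

print(stepwise(["drug", "cig2", "sub2", "alc2"], toy_score))  # → ['drug', 'cig2']
```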
Stepwise logistic regression in SAS
proc logistic descending;
  class drug;
  model one3=drug cig2 sub2 alc2 mohtx2 mohcx2 seiz / selection=stepwise;
run;
This identified only drug and cig2. Note, however, that 240 cases were omitted because they were missing one or more of the variables.
Additional considerations for model selection
• Some variables important to include, even if not “significant”
• May need to decide whether to add variable as categorical or continuous or how to scale a variable (more in a moment)
• Need to make sure variable is entered in a sensible way
• Looking at p-values in a regression model is a "quick and dirty" method. Better to look at likelihood ratio tests (see Jewell pp. 248-249)
Dealing with continuous predictors
Consider our arsenic dataset. We have two variables, concentration and age group. Let's run a model with both variables treated as categorical (via the class statement). The result is a HUGE regression output, with each value of conc having its own relative risk
etc
Conc as a continuous variable
Entering conc as a linear term (removing the class statement) implies a model where the log of the disease rate increases linearly with exposure, after adjusting for age.
etc
Can we model age as linear too?
Nice, clean-looking model!
Is it appropriate?
• Do a model comparison via LRT
• Add quadratic terms to assess non-linearities
Comparing nested models via LRT

Model                                 # param   loglik   LRT (df, p)
1) Conc and age as categorical          50      15290
2) Age categorical, conc linear         13      15237    2 vs 1: 104 (33, p<.001)
3) Conc and age both linear              3      15023    3 vs 2: 428 (10, p<.001)
4) Age categorical, conc quadratic      14      15243    4 vs 2: 120 (1, p<.001)

Suppose model A includes model B as a special case (B is nested in A). The likelihood ratio test for model B vs model A is 2*(loglikA - loglikB). Compare the test statistic to a chi-squared distribution with df = #param(A) - #param(B).
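The LRT rule can be sketched directly, with the log-likelihood and parameter counts taken from the table as reported (model A is the larger model):

```python
def lrt(loglik_a, loglik_b, nparam_a, nparam_b):
    """LRT statistic and df for model B nested in model A (A is the larger model)."""
    stat = 2 * (loglik_a - loglik_b)
    df = nparam_a - nparam_b
    return stat, df

# Model 3 (both linear) is nested in model 2 (age categorical, conc linear)
stat, df = lrt(15237, 15023, 13, 3)
print(stat, df)  # 428 on 10 df; compare to a chi-squared(10) distribution
```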
What about non-nested models?

Model                                 # param (K)   loglik (ll)   AIC = -ll/N + 2K
2) Age categorical, conc linear           13           15237         -28.51
4) Age categorical, conc quadratic        14           15243         -26.54
5) Age categorical, conc logged           13           15240         -28.53
6) Age categorical, conc cubic            15           15244         -24.54

Choose the model with the minimum Akaike Information Criterion:
AIC = -loglikelihood/N + 2*#param
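The AIC rule can be sketched as follows. The sample size N is not shown on the slide, so the value below is a made-up placeholder; for the log-likelihoods in this table the logged-concentration model has the lowest AIC for any realistic N.

```python
def aic(loglik, nparam, n):
    # AIC as defined on the slide: -loglik/N + 2*#param
    return -loglik / n + 2 * nparam

# (loglik, # params) for each model, from the table above
models = {
    "2) conc linear":    (15237, 13),
    "4) conc quadratic": (15243, 14),
    "5) conc logged":    (15240, 13),
    "6) conc cubic":     (15244, 15),
}

N = 100  # placeholder sample size (an assumption; N is not given on the slide)
best = min(models, key=lambda m: aic(*models[m], n=N))
print(best)  # → 5) conc logged
```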
Model selection can be very challenging!
Especially critical for environmental risk assessment.
A recent Harvard Biostatistics student worked with me to apply Bayesian model averaging techniques to the arsenic data.
[Figure: Excess lifetime risk vs. concentration – US equivalent (micrograms/L), 0 to 2000; Male Lung (Carlin & Chib method)]
This is the end of my lectures!
I have enjoyed being your teacher.
Thank you for your kind attention and respect!