Logistic Regression
Dr Mike Blyth
February 2006
Jan 24, 2015
Logistic Regression
A way to look at the effect of
– a "numeric" (interval or ratio) independent variable
on
– a binary (yes-no) dependent variable
Review: Linear Regression
Dependent variable is continuous: interval or ratio (numeric)
Independent variables are also interval or ratio
Examples
– Effect of weight on blood pressure
– Effect of drug dose on reticulocyte count
Linear Regression
[Figure: independent variable vs. dependent variable]
Logistic Regression
[Figure: independent variable vs. dependent variable]
Logistic Regression
Dependent variable is a binary (yes/no) outcome.
Independent variables are continuous (interval or ratio).
Examples:
– Relation of weight and BP to 10-year risk of death
– Relation of CD4 count to 1-year risk of AIDS diagnosis
Why do we need it?
Could use a categorical analysis such as a frequency table:

                    AIDS    No AIDS
CD4 > 350           80      20
150 < CD4 < 350     50      50
CD4 < 150           20      80
• Problems
a) some information is lost when we collapse the numeric data into categories. This leads to loss of power.
b) no estimate of magnitude of relation
Odds
Probability:
– p = probability of event
– 1 - p = probability of not the event (also called q)
– p varies from 0 to 1
Odds
– Ratio of probability of event to probability of not having the event: Odds = p/(1 - p)
– When p = 0.5, odds = 1 (or "1:1 odds")
– When p = 0.1, odds = 0.1/0.9 = 0.11
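The conversion from probability to odds can be checked with a few lines of Python. This is only an illustrative sketch; the function name `odds` is chosen here and is not part of the slides.

```python
def odds(p):
    """Odds of an event with probability p (valid for 0 <= p < 1)."""
    return p / (1.0 - p)

# Probabilities from the slide, plus one more for comparison
for p in (0.1, 0.5, 0.8):
    print(f"p = {p:.1f}  odds = {odds(p):.2f}")
```

Note that odds are unbounded above: as p approaches 1, the odds grow without limit.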
Log Odds
The log odds (also called "logit") is simply the natural logarithm of the odds:
logit = ln(odds)
      = ln(p/(1-p))
      = ln(p) - ln(1-p)
ln(1) = 0, so the logit is 0 when the odds are 1:1, or probability = 50%.
The logit for an event of probability p is the negative of the logit for the probability of not having the event.
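A short Python sketch of the logit, assuming only the definition ln(p/(1-p)) given above, makes the two properties easy to verify: it is zero at p = 0.5, and the logits of p and 1 - p are equal in size and opposite in sign.

```python
import math

def logit(p):
    """Natural log of the odds for probability p (valid for 0 < p < 1)."""
    return math.log(p / (1.0 - p))

print(logit(0.5))               # zero at even odds
print(logit(0.8), logit(0.2))   # same magnitude, opposite sign
```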
[Figure: Relation between probability p and logit. Vertical axis: p, from 0 to 1. Horizontal axis: logit = ln[p/(1-p)], from -8 to 8.]
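The S-shaped relation between logit and probability can be tabulated by inverting the logit. The sketch below assumes the standard inverse, p = 1/(1 + e^(-logit)), which follows from solving logit = ln(p/(1-p)) for p; the function name is illustrative.

```python
import math

def inverse_logit(z):
    """Probability corresponding to a given logit z."""
    return 1.0 / (1.0 + math.exp(-z))

# Tabulate p over the logit range -8 to 8
for z in range(-8, 9, 2):
    print(f"logit = {z:+d}  p = {inverse_logit(z):.4f}")
```

The probability stays strictly between 0 and 1 no matter how extreme the logit becomes, which is why the logit is a convenient scale for a linear model.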
Logistic regression model
The linear regression model with one variable is
y = a + bx + e
The logistic regression model with one variable is
logit = a + bx
where logit = ln(p/(1-p))
In other words, the model says the log odds of the event happening are
– A constant (a), plus
– Some other constant (b)
– times a numeric risk factor (x) (for example, SBP)
Logistic regression model
Given values of the independent variables, the regression equation predicts the log odds (logit) of the event.
Logistic regression model
The statistics program calculates the coefficient b.
The coefficient b shows how much the odds change with a change in the independent variable.
Positive b → higher risk with higher values
Negative b → lower risk with higher values
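The slides leave the fitting to the statistics program. As a rough sketch of what such a program might do, the code below estimates a and b by maximum likelihood using Newton-Raphson iteration. This is an assumption for illustration: the slides do not say which algorithm Epi Info uses, and the data set here is invented.

```python
import math

def fit_logistic(xs, ys, iterations=25):
    """Fit logit(p) = a + b*x by Newton-Raphson (maximum likelihood)."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        # Gradient of the log-likelihood and entries of the information matrix
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            g_a += y - p
            g_b += (y - p) * x
            h_aa += w
            h_ab += w * x
            h_bb += w * x * x
        # Newton step: solve the 2x2 system (information matrix) * step = gradient
        det = h_aa * h_bb - h_ab * h_ab
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

# Hypothetical data: x is a numeric risk factor, y is the binary outcome
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
a, b = fit_logistic(xs, ys)
print(f"a = {a:.3f}, b = {b:.3f}")
```

In this made-up data the event becomes more common at higher x, so the fitted b comes out positive. The sketch does not handle perfectly separated data, where the maximum likelihood estimate does not exist.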
Logistic regression model
Hypothetical example given above examining relation of BP to risk of stroke/death. The model predicts:
ln(odds) = constant (c) + b ∙ SBP
e^(ln odds) = e^(c + b ∙ SBP)
odds = e^(c + b ∙ SBP)
     = e^c ∙ e^(b ∙ SBP)
Logistic regression model
The coefficient b shows how much the odds change with a change in the independent variable
Odds = e^c ∙ e^(bx)
In other words,
Odds = something ∙ (e^b)^x
Logistic regression model
Odds = constant ∙ (e^b)^x
So e^b is the factor indicating the effect of x on the event.
Each one-unit change in x will multiply the odds by a factor of e^b. In other words, e^b is the odds ratio for a one-unit increase in x.
Logistic regression model
Odds = constant ∙ (e^b)^x
– Suppose b = 0.693, so e^b = 2
– A one-unit change in x will double the odds
– Suppose b = -0.693, so e^b = 0.5
– A one-unit change in x will halve the odds.
– If b = 0, e^b = 1, and x has no effect on the odds (OR = 1)
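The three cases above can be reproduced directly. The helper below is an illustrative name, not something from the slides; it also shows that a change of several units compounds multiplicatively.

```python
import math

def odds_multiplier(b, delta_x=1.0):
    """Factor by which the odds are multiplied when x increases by delta_x."""
    return math.exp(b * delta_x)

for b in (0.693, -0.693, 0.0):
    print(f"b = {b:+.3f}  e^b = {odds_multiplier(b):.3f}")

# Three units of change with b = 0.693: roughly 2 * 2 * 2
print(f"{odds_multiplier(0.693, 3):.2f}")
```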
Logistic regression model
For the hypothetical example above, the report is given by Epi Info as

Term     Odds Ratio   95% CI          Coeff     S. E.    Z       P
BP       1.0597       1.022, 1.098    0.0579    0.0185   3.131   0.0017
Const    *            *, *            -7.201    2.2994   3.131   0.0017
Logistic regression model

Term       Odds Ratio   95% CI          Coefficient   S. E.   Z       P-value
BP         1.0597       1.022, 1.098    0.0579        0.018   3.131   0.0017
Constant   *            *, *            -7.2014       2.299   3.131   0.0017
Coefficient, or beta, or b, is the slope or magnitude of the effect.
Logistic regression model

Term       Odds Ratio   95% CI            Coefficient   S. E.    Z        P-value
BP         1.0597       1.0220, 1.0987    0.0579        0.0185   3.1319   0.0017
Constant   *            *, *              -7.2014       2.2994   3.1319   0.0017

Odds ratio for a one-unit change in the independent variable (e.g. BP). This is the calculated e^b.
A one-unit change in BP multiplies the odds by 1.0597.
Logistic regression model

Term       Odds Ratio   95% CI          Coeff     S. E.    Z        P-value
BP         1.0597       1.022, 1.098    0.0579    0.0185   3.1319   0.0017
Constant   *            *, *            -7.2014   2.2994   3.1319   0.0017

95% confidence interval for that odds ratio.
The confidence interval does not include 1, so the effect is statistically significant.
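The other columns of the BP row can be approximately recovered from the coefficient and its standard error alone. The sketch below assumes a Wald-type interval, e^(b ± 1.96·SE), and a two-sided normal p value. That Epi Info computes them exactly this way is an assumption, although the results agree with the report to within rounding.

```python
import math

coeff, se = 0.0579, 0.0185   # BP row of the report above

odds_ratio = math.exp(coeff)
ci_low = math.exp(coeff - 1.96 * se)
ci_high = math.exp(coeff + 1.96 * se)
z = coeff / se
p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided p value from the normal distribution

print(f"OR = {odds_ratio:.4f}, 95% CI = {ci_low:.4f} to {ci_high:.4f}")
print(f"Z = {z:.3f}, p = {p_value:.4f}")
```

Small differences from the reported values (for example in Z) arise because the inputs used here are already rounded to four decimal places.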
Using more than one independent variable
Single variable:
logit = c + bx
Odds = c' ∙ (e^b)^x, where c' = e^c
Multiple variables:
logit = c + b_1 x_1 + b_2 x_2 + … + b_n x_n
Odds = c' ∙ (e^b_1)^x_1 ∙ (e^b_2)^x_2 ∙ … ∙ (e^b_n)^x_n
Note that the terms multiply their effect on the odds.
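That adding terms on the logit scale is the same as multiplying factors on the odds scale can be confirmed numerically. The coefficients and x values below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math

def log_odds(c, coeffs, xs):
    """Additive form: logit = c + b_1*x_1 + ... + b_n*x_n."""
    return c + sum(b * x for b, x in zip(coeffs, xs))

def odds_as_product(c, coeffs, xs):
    """Multiplicative form: odds = e^c * (e^b_1)^x_1 * ... * (e^b_n)^x_n."""
    result = math.exp(c)
    for b, x in zip(coeffs, xs):
        result *= math.exp(b) ** x
    return result

# Hypothetical model with two independent variables
c, coeffs, xs = -2.0, [0.5, -0.25], [2.0, 4.0]
print(math.exp(log_odds(c, coeffs, xs)), odds_as_product(c, coeffs, xs))
```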
Using more than one independent variable
The analysis reports a b coefficient for each independent variable.
That coefficient is the effect of the given independent variable, separated from the effects of all the other independent variables.
Real Life Example
Prospective cohort study of causes of cardiac disease: Evans County Study 1965
Independent variables = age, gender, race, social index, SBP, diabetes, smoking, cholesterol, and an obesity index
Dependent variable = risk of dying during 10-year period
Variable        Range        b coeff    SE        p
Constant                     -6.376     1.634     <0.001
Age             40-69 y      0.086      0.115     <0.001
Gender          0=m, 1=f     1.500      0.967     0.121
Age x gender                 -0.043     0.017     0.011
Social index    20-84        -0.056     0.040     0.160
(Soc ind)^2     400-7056     0.0006     0.0003    0.082
SBP             88-310       0.019      0.002     <0.001
Diabetes        0=n, 1=y     1.123      0.261     <0.001
Smoking         0=n, 1=y     0.317      0.157     0.043
Cholesterol     94-546       0.0031     0.0015    0.041
Quetelet        2.11-8.76    -1.064     0.432     0.014
(Quetelet)^2    4.44-76.8    0.112      0.049     0.022
Cited in Kelsey et al., Methods in Observational Epidemiology, 1986
Statistical Significance
The p value indicates statistical significance
Age is positively correlated with risk of death
Gender has a positive b coefficient, but the p value is 0.12, indicating that we cannot say that there is a significant relationship.

Variable   Range       b coeff   SE      p
Age        40-69 y     0.086     0.115   <0.001
Gender     0=m, 1=f    1.500     0.967   0.121
Dichotomous (yes-no) variables
Gender is coded as 0 for male, 1 for female
e^b [e^1.5 = 4.48] is the change in odds for a 1-unit change in gender, i.e. the OR for females relative to males
e^b for any dummy variable (coded 0-1) is the adjusted OR for that risk factor, since "1 unit of change" = presence vs. absence of the risk factor

Variable   Range       b coeff   SE      p
Constant               -6.376    1.634   <0.001
Age        40-69 y     0.086     0.115   <0.001
Gender     0=m, 1=f    1.500     0.967   0.121
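For the 0-1 variables in the Evans County table, the adjusted odds ratio is simply e^b. The sketch below applies this to the gender, diabetes and smoking coefficients reported in the table; the dictionary names are illustrative.

```python
import math

# b coefficients for the dummy (0-1) variables in the Evans County table
coefficients = {"Gender": 1.500, "Diabetes": 1.123, "Smoking": 0.317}

odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}
for name, value in odds_ratios.items():
    print(f"{name}: OR = {value:.2f}")
```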
Squared terms
Social index squared is included as well as social index itself.
Squared terms allow for curvilinear relationships, just as in ordinary regression

Variable        Range       b coeff   SE       p
Age x gender                -0.043    0.017    0.011
Social index    20-84       -0.056    0.040    0.160
(Soc ind)^2     400-7056    0.0006    0.0003   0.082
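To see the curvilinear shape, the sketch below evaluates the contribution of social index to the logit using the two coefficients from the table. It is only an illustration of how a squared term bends the relationship; neither coefficient reaches significance at the 0.05 level in the table (p = 0.160 and 0.082), so the curve should not be over-interpreted.

```python
b_linear, b_squared = -0.056, 0.0006   # social index and (social index)^2

def social_index_logit(s):
    """Contribution of social index s to the log odds."""
    return b_linear * s + b_squared * s * s

# Value of s where the contribution stops falling and starts rising
turning_point = -b_linear / (2 * b_squared)

for s in (20, 40, 60, 84):
    print(f"social index = {s}  contribution to logit = {social_index_logit(s):+.3f}")
print(f"turning point near {turning_point:.1f}")
```

With a negative linear term and a positive squared term, the contribution falls at first and then rises again, which a linear term alone could not represent.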
Interaction terms
Age and gender are entered into the model as separate terms
Age x gender is included to see whether age has a different effect in males than in females.

Variable       Range                b coeff   SE      p
Age            40-69 y              0.086     0.115   <0.001
Gender         0=m, 1=f             1.500     0.967   0.121
Age x gender   M: 0-0, F: 40-69     -0.043    0.017   0.011
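One way to read the interaction term is to work out the age slope separately for each gender, and the female-to-male odds ratio at different ages. The sketch below does this with the three coefficients in the table; the function names are illustrative, and the other variables in the model are assumed to be held fixed.

```python
import math

b_age, b_gender, b_interaction = 0.086, 1.500, -0.043

def age_slope(gender):
    """Change in log odds per year of age (gender: 0 = male, 1 = female)."""
    return b_age + b_interaction * gender

def female_vs_male_or(age):
    """Odds ratio for females relative to males at a given age."""
    return math.exp(b_gender + b_interaction * age)

print(f"age slope, males:   {age_slope(0):.3f}")
print(f"age slope, females: {age_slope(1):.3f}")
for age in (40, 69):
    print(f"age {age}: female vs. male OR = {female_vs_male_or(age):.2f}")
```

Because the product term is zero for males, the interaction coefficient only changes the age slope for females, and the gender effect is no longer a single number but depends on age.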
Interpretation
With binary, dummy variables, e^b is the odds ratio. You can compare the strength (slope) of the effect by comparing b.
With numeric variables, b is not a direct measure of strength of effect.
– Example: b is quite small for the effect of BP on mortality, because it is the effect of only a one-mmHg change in BP. BP is still an important factor in mortality because there is a wide range in BP.
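The point about range can be made concrete with the SBP coefficient (0.019) and range (88-310) from the Evans County table, compared with diabetes (b = 1.123). This is an illustrative calculation that assumes the model's linear effect holds across the whole observed range.

```python
import math

b_sbp = 0.019
sbp_low, sbp_high = 88, 310
b_diabetes = 1.123

or_per_mmhg = math.exp(b_sbp)                               # one mmHg
or_per_10_mmhg = math.exp(b_sbp * 10)                       # ten mmHg
or_full_range = math.exp(b_sbp * (sbp_high - sbp_low))      # lowest vs. highest SBP
or_diabetes = math.exp(b_diabetes)                          # presence vs. absence

print(f"SBP, per 1 mmHg:  OR = {or_per_mmhg:.3f}")
print(f"SBP, per 10 mmHg: OR = {or_per_10_mmhg:.2f}")
print(f"SBP, full range:  OR = {or_full_range:.1f}")
print(f"Diabetes:         OR = {or_diabetes:.2f}")
```

The per-mmHg odds ratio is barely above 1, yet across the full range of observed pressures the implied odds ratio is far larger than that for diabetes.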
Interpretation
In a prospective cohort study we can use the logistic regression model to predict the probability of the event given the independent variables. We can also derive the relative risk.
In a cross-sectional study we only have the odds ratio.
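As a sketch of how predicted probabilities and a relative risk could be derived, the code below uses the constant (-7.2014) and BP coefficient (0.0579) from the hypothetical Epi Info report. The two BP values compared, 120 and 160, are chosen here purely for illustration.

```python
import math

const, b_bp = -7.2014, 0.0579   # from the hypothetical Epi Info report

def predicted_probability(bp):
    """Probability of the event at a given BP: p = 1 / (1 + e^-(c + b*BP))."""
    return 1.0 / (1.0 + math.exp(-(const + b_bp * bp)))

p_120 = predicted_probability(120)
p_160 = predicted_probability(160)
relative_risk = p_160 / p_120
odds_ratio = math.exp(b_bp * (160 - 120))

print(f"p(BP=120) = {p_120:.3f}, p(BP=160) = {p_160:.3f}")
print(f"relative risk = {relative_risk:.2f}, odds ratio = {odds_ratio:.2f}")
```

The relative risk and the odds ratio for the same comparison differ considerably here because the event is common at these BP values; the two are close only when the event is rare.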
Selection of variables
Same principle as with ordinary regression
Forward selection: add one variable at a time until there are no more that make a significant difference
Backward selection: start with all variables, then remove one at a time to see whether each made a significant contribution
Epi Info has suggestions on how to do this