Binary Logistic Regression
“To be or not to be, that is the question.” (William Shakespeare, “Hamlet”)
Jan 18, 2016
Binary Logistic Regression
Also known as “logistic” or sometimes “logit” regression
Foundation from which more complex models are derived, e.g., multinomial regression and ordinal logistic regression
Dichotomous Variables
Two categories indicating whether an event has occurred or some characteristic is present
Sometimes called “binary” or “binomial” variables
Dichotomous DVs
Placed in foster care or not
Diagnosed with a disease or not
Abused or not
Pregnant or not
Service provided or not
Single (Dichotomous) IV Example
DV = continue fostering, 0 = no, 1 = yes
Customary to code category of interest 1 and the other category 0
IV = married, 0 = not married, 1 = married
N = 131 foster families
Are two-parent families more likely to continue fostering than one-parent families?
Crosstabulation
Table 2.1
Relationship between marital status and continuation is statistically significant [χ²(1, N = 131) = 5.65, p = .017]
A higher percentage of two-parent families (62.20%) than single-parent families (40.82%) planned to continue fostering
Strength & Direction of Relationships
Different ways to quantify the relationship between IV(s) and DV
Probabilities
Odds
Odds Ratio (OR)
• Also abbreviated as e^B, Exp(B) (on SPSS output), or exp(B)
% change
Roadmap to Computations
Probabilities
Odds: p / (1 - p)
Odds Ratios: Odds(1) / Odds(0)
% change: 100(OR - 1)
Probabilities
Percentages in Table 2.1 as probabilities (e.g., 62.20% as .6220)
p
• Probability that event will occur (continue)
• e.g., probability that one-parent families plan to continue is .4082
1 – p
• Probability that event will not occur (not continue)
• e.g., probability that one-parent families do not plan to continue is .5918 (1 - .4082)
Odds
Ratio of probability that event will occur to probability that it will not
odds = p / (1 - p)
e.g., odds of continuation for one-parent families are .69 (.4082 / .5918)
Can range from 0 to positive infinity
Probabilities and Odds
Table 2.2
Odds = 1
• Both outcomes equally likely
Odds > 1
• Probability that event will occur greater than probability that it will not
Odds < 1
• Probability that event will occur less than probability that it will not
Odds Ratio (OR)
Odds of the event for one value of the IV (two-parent families) divided by the odds for a different value of the IV, usually a value one unit lower (one-parent families)
e.g., odds of continuing for two-parent families more than double the odds for one-parent families
OR = 1.6455 / .6898 = 2.39
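The arithmetic behind these figures can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `odds` and the variable names are assumptions chosen for the example, and the two probabilities are the Table 2.1 percentages quoted above.

```python
def odds(p):
    """Convert a probability to odds: p / (1 - p)."""
    return p / (1.0 - p)

# Probabilities of planning to continue, from the crosstabulation
p_two_parent = 0.6220
p_one_parent = 0.4082

odds_two = odds(p_two_parent)      # odds for two-parent families
odds_one = odds(p_one_parent)      # odds for one-parent families
odds_ratio = odds_two / odds_one   # Odds(1) / Odds(0)

print(round(odds_two, 4), round(odds_one, 4), round(odds_ratio, 2))
```

Rounded to the precision used on the slides, this reproduces 1.6455, .6898, and an OR of 2.39.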
OR (cont’d)
Plays a central role in quantifying the strength and direction of relationships between IVs and DVs in binary, multinomial, and ordinal logistic regression
OR < 1 indicates a negative relationship
OR > 1 indicates a positive relationship
OR = 1 indicates no linear relationship
ORs > 1
e.g., OR of 2.39
A one-unit increase in the independent variable increases the odds of continuing by a factor of 2.39
The odds of continuing are 2.39 times higher for two-parent compared to one-parent families
ORs < 1
e.g., OR = .50
A one-unit increase in the independent variable decreases the odds of continuing by a factor of .50
The odds that two-parent families will continue are .50 (or one-half) of the odds that one-parent families will continue
ORs < 1 (cont’d)
Compute reciprocal (i.e., 1 / .50 = 2.00)
Express relationship as opposite event of interest (e.g., discontinuing)
A one-unit increase in the independent variable increases the odds of discontinuing by a factor of 2.00
The odds that two-parent families will discontinue are 2.00 times (or twice) the odds of one-parent families
OR to Percentage Change
% change = 100(OR – 1)
Alternative way to express OR
e.g., A one-unit increase in the independent variable increases the odds of continuing by 139.00%
• 100(2.39 – 1) = 139.00
e.g., A one-unit increase in the independent variable decreases the odds of continuing by 50.00%
• 100(.50 – 1) = -50.00
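A short sketch of the two conversions discussed so far, the percentage change and the reciprocal used to re-express an OR below 1. The function names are illustrative assumptions, not taken from the slides.

```python
def percent_change(odds_ratio):
    """Percentage change in the odds for a one-unit increase in the IV."""
    return 100.0 * (odds_ratio - 1.0)

def reciprocal(odds_ratio):
    """Re-express an OR in terms of the opposite event (e.g., discontinuing)."""
    return 1.0 / odds_ratio

print(round(percent_change(2.39), 2))  # increase in the odds of continuing
print(round(percent_change(0.50), 2))  # negative value means a decrease
print(reciprocal(0.50))                # OR for discontinuing
```

A negative percentage change corresponds to an OR below 1, and taking the reciprocal puts such an OR on the same scale as ORs above 1.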
Comparing OR > 1 and OR < 1
Compute reciprocal of one of the ORs
e.g., OR of 2.00 and an OR of .50
Reciprocal of .50 is 2.00 (1 / .50 = 2.00)
ORs are equal in size (but not in direction of the relationship)
Qualitative Descriptors for OR
Table 2.3
Use cautiously with IVs that aren’t dichotomous
Question & Answer
Are two-parent families more likely to continue fostering than one-parent families?
Yes. The odds of continuing are 2.39 times (139%) higher for two-parent compared to one-parent families. The probability of continuing is .41 for one-parent families and .62 for two-parent families.
Binary Logistic Regression Example
DV = continue fostering, 0 = no, 1 = yes
Customary to code category of interest 1 and the other category 0
IV = married, 0 = not married, 1 = married
N = 131 foster families
Are two-parent families more likely to continue fostering than one-parent families?
Statistical Significance
Table 2.4
Relationship between marital status and continuation is statistically significant (Wald χ² = 5.544, p = .019)
Direction of Relationship
B = slope
Positive slope, positive relationship
• OR > 1
Negative slope, negative relationship
• OR < 1
0 slope, no linear relationship
• OR = 1
Direction/Strength of Relationship
Positive relationship between marital status and continuation
Two-parent families more likely to continue
B = .869
Exp(B) = OR = 2.385
• % change = 100(2.385 - 1) = 139%
The odds of continuing are 2.39 times (139%) higher for two-parent compared to one-parent families
Roadmap to Computations
Logits: ln(p / (1 – p))
• L is short for ln(p / (1 – p))
Odds: e^L
Probabilities: e^L / (1 + e^L)
Odds Ratios: Odds(1) / Odds(0)
% change: 100(OR - 1)
Binary Logistic Regression Model
ln(π / (1 - π)) = α + β1X1 + β2X2 + … + βkXk, or
ln(π / (1 - π)) = η
π is the probability of the event
η (eta) is the abbreviation for the linear predictor (right hand side of this equation)
k = number of independent variables
Logit Link
ln(π / (1 - π))
Log of the odds that the DV equals 1 (event occurs)
Connects (i.e., links) DV to linear combination of IVs
Estimated Logits (L)
ln(p / (1 - p)) = a + B1X1 + B2X2 + … + BkXk
ln(p / (1 – p))
• Log of the odds that the DV equals 1 (event occurs)
Estimated logit, L
• Does not have intuitive or substantive meaning
• Useful for examining curvilinear relationships and interaction effects
• Primarily useful for estimating probabilities, odds, and ORs
Estimated Logits (L)
L(Continue) = a + (B_Married)(X_Married)
L(Continue) = -.372 + (.869)(X_Married)
a = intercept
B = slope
Logit to Odds
If L = 0: Odds = e^L = e^0 = 1.00
If L = .50: Odds = e^L = e^.50 = 1.65
If L = 1.00: Odds = e^L = e^1.00 = 2.72
Logits to Odds (cont’d)
Table 2.4
One-parent families
• L(Continue) = -.372 = -.372 + (.869)(0)
• Odds of continuing = e^-.372 = .69
Two-parent families
• L(Continue) = .497 = -.372 + (.869)(1)
• Odds of continuing = e^.497 = 1.65
Odds to OR
OR = 1.65 / .69 = 2.39, or
e^.869 = 2.39, labeled Exp(B)
Table 2.4
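The chain from logits to odds to the odds ratio can be checked with Python's standard `math` module. This is a sketch under one assumption worth flagging: it uses the rounded coefficients shown on the slides (a = -.372, B = .869), so the last digit can differ slightly from the slide values, which were presumably computed from unrounded estimates. The variable names are illustrative.

```python
import math

a = -0.372   # intercept, as reported on the slide
b = 0.869    # slope for married (0 = not married, 1 = married)

logit_one_parent = a + b * 0   # estimated logit when X = 0
logit_two_parent = a + b * 1   # estimated logit when X = 1

odds_one_parent = math.exp(logit_one_parent)  # odds = e^L
odds_two_parent = math.exp(logit_two_parent)

# Two equivalent routes to the odds ratio
or_from_odds = odds_two_parent / odds_one_parent
or_from_slope = math.exp(b)

print(round(odds_one_parent, 3), round(odds_two_parent, 3))
print(round(or_from_odds, 3), round(or_from_slope, 3))
```

Both routes give the same OR because e^(a + B) / e^a = e^B, which is why the exponentiated slope can be read directly as the odds ratio.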
OR to Percentage Change
% change = 100(OR – 1)
e.g., A one-unit increase in the independent variable increases the odds of continuing by 139.00%
• 100(2.39 – 1) = 139.00
e.g., A one-unit increase in the independent variable decreases the odds of continuing by 50.00%
• 100(.50 – 1) = -50.00
Logits to Probabilities
One-parent families, L(Continue) = -.372
Two-parent families, L(Continue) = .497
p(Continue) = e^L / (1 + e^L)
One-parent families: p(Continue) = e^-.372 / (1 + e^-.372) = .41
Two-parent families: p(Continue) = e^.497 / (1 + e^.497) = .62
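The conversion from an estimated logit to a probability, p = e^L / (1 + e^L), can be sketched as follows. The function name `logit_to_probability` is an assumption for this example; the two logits are the values given for one-parent and two-parent families.

```python
import math

def logit_to_probability(logit):
    """Inverse of the logit link: p = e^L / (1 + e^L)."""
    return math.exp(logit) / (1.0 + math.exp(logit))

p_one_parent = logit_to_probability(-0.372)
p_two_parent = logit_to_probability(0.497)

print(round(p_one_parent, 2), round(p_two_parent, 2))
```

A logit of 0 maps to a probability of .50, negative logits map below .50, and positive logits map above it.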
Question & Answer
Are two-parent families more likely to continue fostering than one-parent families?
Yes. The odds of continuing are 2.39 times (139%) higher for two-parent compared to one-parent families. The probability of continuing is .41 for one-parent families and .62 for two-parent families.
Single (Quantitative) IV Example
DV = continue fostering, 0 = no, 1 = yes
Customary to code category of interest 1 and other category 0
IV = number of resources
N = 131 foster families
Are foster families with more resources more likely to continue fostering?
Statistical Significance
Table 2.5
Relationship between resources and continuation is statistically significant (Wald χ² = 4.924, p = .026)
H0: β = 0, β ≥ 0, or β ≤ 0, same as
H0: OR = 1, OR ≥ 1, or OR ≤ 1
Likelihood ratio χ² better than Wald
Direction/Strength of Relationship
Positive relationship between resources and continuation
Families with more resources are more likely to continue
B = .212
Exp(B) = OR = 1.237
• % change = 100(1.237 – 1) = 24%
The odds of continuing are 1.24 times (24%) higher for each additional resource
Estimated Logits
L(Continue) = -1.227 + (.212)(X)
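To see how this equation plays out across the observed range of the IV, the sketch below computes the estimated logit, odds, and probability for 1 through 11 resources. It uses the rounded coefficients from the slide, so a value can differ in the last digit from figures computed with unrounded estimates. The function name `predict` is an assumption for this example.

```python
import math

A = -1.227   # intercept, as reported on the slide
B = 0.212    # slope for number of resources

def predict(resources):
    """Return (logit, odds, probability) for a given number of resources."""
    logit = A + B * resources
    odds = math.exp(logit)
    probability = odds / (1.0 + odds)
    return logit, odds, probability

for x in range(1, 12):
    logit, odds, p = predict(x)
    print(f"{x:2d}  logit={logit:6.2f}  odds={odds:5.2f}  p={p:4.2f}")
```

The logits rise by a constant .212 per resource, the odds are multiplied by a constant e^.212 per resource, and the probabilities change by varying amounts.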
Figures
Resources.xls
Effect of Resources on Continuation (Logits, Odds, and Probabilities), values plotted in the three charts:

Resources   Logits   Odds   Probabilities
1           -1.01    0.36   0.27
2           -0.80    0.45   0.31
3           -0.59    0.55   0.36
4           -0.38    0.69   0.41
5           -0.16    0.85   0.46
6            0.05    1.05   0.51
7            0.26    1.30   0.56
8            0.47    1.60   0.62
9            0.68    1.98   0.66
10           0.90    2.45   0.71
11           1.11    3.03   0.75
Question & Answer
Are foster families with more resources more likely to continue fostering?
Yes. The odds of continuing are 1.24 times (24%) higher for each additional resource. The probability of continuing is .31 for families with 2 resources, .51 for families with 6 resources, and .71 for families with 10 resources.
Relationship of Linear Predictor to Logits, Odds & p
Relationship between linear predictor and logits is linear
Relationship between linear predictor and odds is non-linear
Relationship between linear predictor and p is non-linear
Challenge is to summarize changes in odds and probabilities associated with changes in IVs in the most meaningful and parsimonious way
[Figure: Logit as Function of Linear Predictor. X axis: Linear Predictor, -3.00 to 3.00. Y axis: Logit, -3.00 to 3.00.]
[Figure: Odds as Function of Linear Predictor. X axis: Linear Predictor, -3.00 to 3.00. Y axis: Odds, .00 to 21.00.]
[Figure: Probabilities as Function of Linear Predictor. X axis: Linear Predictor, -3.00 to 3.00. Y axis: Probability, .00 to 1.00.]
IVs to z-scores
z-scores (standard scores)
Only the IV (not DV): semi-standardized slopes
One-unit increase in the IV refers to a one-standard-deviation increase
OR interpreted as expected change in the odds associated with a one standard deviation increase in the IV
Conversion to z-scores changes intercept, slope, and OR, but not associated test statistics
Table 2.6 (compare to Table 2.5)
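The effect of standardizing the IV can be sketched by rescaling the unstandardized coefficients directly. This relies on the general identity that replacing X with z = (X - mean) / SD gives a new slope of B × SD and a new intercept of a + B × mean. The mean (6.60) and standard deviation (1.93) are the values quoted in this deck's Question & Answer slide for resources; the coefficients are the rounded slide values, and the variable and function names are illustrative assumptions.

```python
import math

a, b = -1.227, 0.212     # intercept and slope for raw resources
mean, sd = 6.60, 1.93    # mean and SD of resources, as quoted in the deck

b_z = b * sd             # slope per one-SD increase
a_z = a + b * mean       # logit at the mean (z = 0)
or_z = math.exp(b_z)     # OR per one-SD increase

def probability_at(z):
    """Estimated probability of continuing at a given z-score."""
    logit = a_z + b_z * z
    return 1.0 / (1.0 + math.exp(-logit))

print(round(or_z, 2))
print([round(probability_at(z), 2) for z in (-2, 0, 2)])
```

The slope and OR change because the unit of the IV changed, while the fitted probabilities for any given family stay the same.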
Figures
zResources.xls
Effect of zResources on Continuation (Probabilities), values plotted in the chart:

Standardized Resources   Probabilities
-3                       0.26
-2                       0.34
-1                       0.44
0                        0.54
1                        0.64
2                        0.73
3                        0.80
Question & Answer
Are foster families with more resources more likely to continue fostering?
Yes. The odds of continuing are 1.51 times (51%) higher for each one standard deviation (1.93) increase in resources. The probability of continuing is .34 for families with resources two standard deviations below the mean, .54 for families with the mean number of resources (6.60), and .73 for families with resources two standard deviations above the mean.
IVs Centered
Centering
Typically center on mean
Useful when testing interactions, curvilinear relationships, or when no meaningful 0 point (e.g., no family with 0 resources)
Centering doesn’t change slope, OR, or associated test statistics, but does change the intercept
Table 2.7 (compare to Table 2.5)
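A brief sketch of why centering moves only the intercept. Substituting X = c + mean into a + B × X gives (a + B × mean) + B × c, so the slope is untouched. The coefficients are the rounded values from the uncentered resources model, the mean of 6.60 is the value quoted in this deck, and the names are illustrative assumptions.

```python
import math

a, b = -1.227, 0.212   # uncentered intercept and slope
mean = 6.60            # mean number of resources, as quoted in the deck

a_centered = a + b * mean   # new intercept: logit at the mean
b_centered = b              # slope (and therefore the OR) is unchanged

def probability_at(centered_resources):
    """Estimated probability at a given distance from the mean."""
    logit = a_centered + b_centered * centered_resources
    return 1.0 / (1.0 + math.exp(-logit))

print(round(a_centered, 3), round(math.exp(b_centered), 3))
print(round(probability_at(-4), 2), round(probability_at(0), 2))
```

After centering, the intercept describes a family with the mean number of resources rather than a family with zero resources.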
Figures
cResources.xls
Effect of cResources on Continuation (Probabilities), values plotted in the chart:

Centered Resources   Probabilities
-5                   0.29
-4                   0.34
-3                   0.39
-2                   0.44
-1                   0.49
0                    0.54
1                    0.60
2                    0.65
3                    0.69
4                    0.74
5                    0.77
Question & Answer
Are foster families with more resources more likely to continue fostering?
Yes. The odds of continuing are 1.24 times (24%) higher for each additional resource. The probability of continuing is .34 for families with 4 resources below the mean, .54 for families with the mean number of resources (6.60), and .74 for families with 4 resources above the mean.
Multiple IV Example
DV = continue fostering, 0 = no, 1 = yes
Customary to code the category of interest as 1 and the other category as 0
IV = married, 0 = not married, 1 = married
IV = number of resources (z-scores)
N = 131 foster families
Are foster families with more resources more likely to continue fostering, controlling for marital status?
Statistical Significance
Table 2.12
Relationship between set of IVs and continuation is statistically significant (χ² = 6.58, p = .037)
H0: β1 = β2 = … = βk = 0, same as
H0: ψ1 = ψ2 = … = ψk = 1
ψ (psi) is symbol for population value of OR
Statistical Significance (cont’d)
Table 2.13
Relationship between resources and continuation is not statistically significant, controlling for marital status (χ² = .92, p = .338)
Relationship between marital status and continuation is not statistically significant, controlling for resources (χ² = 1.42, p = .234)
H0: β = 0, β ≥ 0, or β ≤ 0, same as
H0: ψ = 1, ψ ≥ 1, or ψ ≤ 1
ψ (psi) is symbol for population value of OR
Likelihood ratio χ² better than Wald
Statistical Significance (cont’d)
Table 2.9
Relationship between resources and continuation is not statistically significant, controlling for marital status (χ² = .91, p = .340)
Relationship between marital status and continuation is not statistically significant, controlling for resources (χ² = 1.41, p = .235)
H0: β = 0, β ≥ 0, or β ≤ 0, same as
H0: ψ = 1, ψ ≥ 1, or ψ ≤ 1
ψ (psi) is symbol for population value of OR
Wald χ², but likelihood ratio χ² better
Estimated Logits
L(Continue) = -.183 + (.228)(X_zResources) + (.570)(X_Married)
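With two IVs, the same conversions apply once the logit is computed from both terms. The sketch below evaluates the equation for one-parent and two-parent families at selected z-scores of resources, using the rounded coefficients from the slide. The function name `predict` and its argument names are assumptions for this example.

```python
import math

A = -0.183          # intercept
B_ZRESOURCES = 0.228  # slope for standardized resources
B_MARRIED = 0.570     # slope for married (0 = not married, 1 = married)

def predict(z_resources, married):
    """Return (odds, probability) of continuing for the given IV values."""
    logit = A + B_ZRESOURCES * z_resources + B_MARRIED * married
    odds = math.exp(logit)
    return odds, odds / (1.0 + odds)

for married, label in ((0, "One-parent"), (1, "Two-parent")):
    for z in (-2, 0, 2):
        odds, p = predict(z, married)
        print(f"{label:10s} z={z:2d}  odds={odds:4.2f}  p={p:4.2f}")
```

At any fixed level of resources, the ratio of two-parent to one-parent odds equals e^.570, which is the OR for marital status controlling for resources.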
ORs & Percentage Change
OR_zResources = 1.256 (ns)
The odds of continuing are 1.26 times (26%) higher for each one standard deviation (1.93) increase in resources, controlling for marital status
OR_Married = 1.769 (ns)
The odds of continuing are 1.77 times (77%) higher for two-parent compared to one-parent families, controlling for resources
Figures
Married & zResources.xls
Effect of Resources and Marital Status on Plans to Continue Fostering (Odds and Probabilities), values plotted in the two charts:

Standardized Resources   Odds, One-Parent   Odds, Two-Parent   Probability, One-Parent   Probability, Two-Parent
-3                       0.42               0.74               0.30                      0.43
-2                       0.53               0.93               0.35                      0.48
-1                       0.66               1.17               0.40                      0.54
0                        0.83               1.47               0.45                      0.60
1                        1.05               1.85               0.51                      0.65
2                        1.31               2.32               0.57                      0.70
3                        1.65               2.92               0.62                      0.74
Presenting Odds and Probabilities in Tables
Tables 2.10 and 2.11
Question & Answer
Are foster families with more resources more likely to continue fostering, controlling for marital status?
No (ns). The odds of continuing are 1.26 times (26%) higher for each one standard deviation (1.93) increase in resources, controlling for marital status.
Question & Answer (cont’d)
For one-parent families the probability of continuing is .35 for families with resources two standard deviations below the mean, .45 for families with the mean number of resources, and .57 for families with resources two standard deviations above the mean. For two-parent families the probability of continuing is .48 for families with resources two standard deviations below the mean, .60 for families with the mean number of resources, and .70 for families with resources two standard deviations above the mean.
Comparing the Relative Strength of IVs
Size of slope and OR depend on how the IV is measured
When IVs measured the same way (e.g., two dichotomous IVs or two continuous IVs transformed to z-scores) relative strength can be compared
Nothing comparable to standardized slope (Beta)
Nested Models
Three IVs: IV1, IV2, IV3
Two IVs: IV1, IV2; IV2, IV3; IV1, IV3
One IV: IV1; IV2; IV3
Nested Models (cont’d)
One regression model is nested within another if it contains a subset of variables included in the model within which it’s nested, and same cases are analyzed in both models
The more complex model is called the “full model”
The nested model is called the “reduced model”
Comparison of full and reduced models allows you to examine whether one or more variable(s) in the full model contribute to explanation of the DV
Sequential Entry of IVs
Used to compare full and reduced models
e.g., family resources entered first, and then marital status
F change used in linear regression
Sequential Entry of IVs (cont’d)
SPSS GZLM doesn’t allow sequential entry of IVs
Estimate models separately and compare omnibus likelihood ratio χ² values
Reduced model χ²(1) = 5.168
Full model χ²(2) = 6.585
χ² difference = 6.585 – 5.168 = 1.417
df difference = 2 – 1 = 1
p = .234
Chi-square Difference.xls
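The chi-square difference test can be reproduced with the standard library alone. The sketch below leans on one assumption that matters: the closed form `erfc(sqrt(x / 2))` for the upper-tail probability holds only when the difference has 1 degree of freedom, as it does here. For other degrees of freedom a statistics library would be needed (for example `scipy.stats.chi2.sf`). The function name is illustrative.

```python
import math

def chi_square_p_value_df1(chi_square):
    """Upper-tail p value for a chi-square statistic with exactly 1 df."""
    return math.erfc(math.sqrt(chi_square / 2.0))

reduced_chi_square, reduced_df = 5.168, 1   # resources only
full_chi_square, full_df = 6.585, 2         # resources plus marital status

difference = full_chi_square - reduced_chi_square
df_difference = full_df - reduced_df

p_value = chi_square_p_value_df1(difference)
print(round(difference, 3), df_difference, round(p_value, 3))
```

The resulting p value matches the likelihood ratio test for marital status controlling for resources, since that is the one variable added between the reduced and full models.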
Assumptions Necessary for Testing Hypotheses
No assumptions unique to binary logistic regression other than ones discussed in GZLM lecture
Model Evaluation
Evaluate your model before you test hypotheses or interpret substantive results
Outliers
Analogs of R²
Outliers
Atypical cases
Can lead to flawed conclusions
Can provide theoretical insights
Common causes
• Data entry errors
• Model misspecification
• Rare events
Outliers (cont’d)
Leverage
Residuals
• Standardized or unstandardized deviance residuals
Influence
• Cook’s D
Leverage
Think of a seesaw
Leverage value for each case
Cases with greater leverage can exert a disproportionately large influence
No clear benchmarks
Identify cases with substantially different leverage values than those of other cases
Residuals
Difference between actual and estimated values of the DV for a case
Residual for each case
Large residual indicates a case for which model fits poorly
Residuals (cont’d)
Standardized or unstandardized deviance residuals
Not normally distributed
Values less than -2 or greater than +2 warrant some concern
Values less than -3 or greater than +3 merit close inspection
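As a rough illustration of these benchmarks, the sketch below computes an unstandardized deviance residual for a single case using the usual definition for a binary outcome: the sign of (y - p) times the square root of -2 times the case's log-likelihood. The slides do not give this formula, so treat it as an assumption added here, and note that SPSS's standardized version additionally adjusts for leverage. The example cases and function names are hypothetical.

```python
import math

def deviance_residual(y, p):
    """Unstandardized deviance residual (y is 0 or 1, p is the fitted probability)."""
    log_likelihood = y * math.log(p) + (1 - y) * math.log(1 - p)
    sign = 1.0 if y == 1 else -1.0
    return sign * math.sqrt(-2.0 * log_likelihood)

def flag(residual):
    """Apply the +/-2 and +/-3 benchmarks from the slide."""
    if abs(residual) > 3:
        return "inspect closely"
    if abs(residual) > 2:
        return "some concern"
    return "ok"

# Hypothetical cases: (observed outcome, fitted probability)
cases = [(1, 0.62), (0, 0.62), (1, 0.05), (0, 0.995)]
for y, p in cases:
    d = deviance_residual(y, p)
    print(y, p, round(d, 2), flag(d))
```

Large residuals arise when the observed outcome is one the model considered very unlikely, such as an event occurring for a case with a fitted probability near 0.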
Influence
Cases whose deletion results in substantial changes to regression coefficients
Cook’s D for each case
• Approximate aggregate change in regression parameters resulting from deletion of a case
Values of 1.0 or more indicate a problematic degree of influence for an individual case
Index Plot
Scatterplot
Horizontal axis (X)
• Case id
Vertical axis (Y)
• Leverage values, or
• Residuals, or
• Cook’s D
Index Plot: Leverage Values
Index Plot: Standardized Deviance Residuals
Index Plot: Cook’s D
Analogs of R²
None in standard use and each may give different results
Typically much smaller than R² values in linear regression
Difficult to interpret
Multicollinearity
SPSS GZLM doesn’t compute multicollinearity statistics
Use SPSS linear regression
Problematic levels
• Tolerance < .10 or VIF > 10
Additional Topics
Polychotomous IVs
Curvilinear relationships
Interactions
Overview of the Process
Select IVs and decide whether to test curvilinear relationships or interactions
Carefully screen and clean data
Transform and code variables as needed
Estimate regression model
Examine assumptions necessary to estimate binary regression model, examine model fit, and revise model as needed
Overview of the Process (cont’d)
Test hypotheses about the overall model and specific model parameters, such as ORs
Create tables and graphs to present results in the most meaningful and parsimonious way
Interpret results of the estimated model in terms of logits, probabilities, odds, and odds ratios, as appropriate
Additional Regression Models for Dichotomous DVs
Binary probit regression
Substantive results essentially indistinguishable from binary logistic regression
Choice between this and binary logistic regression largely one of convenience and discipline-specific convention
Many researchers prefer binary logistic regression because it provides odds ratios whereas probit regression does not, and binary logistic regression comes with a wider variety of fit statistics
Additional Regression Models for Dichotomous DVs (cont’d)
Complementary log-log (clog-log) and log-log models
• Probability of the event is very small or large
Loglinear regression
• Limited to categorical IVs
Discriminant analysis
• Limited to continuous IVs