Copyright ©2010 by Pearson Education, Inc., Upper Saddle River, New Jersey 07458
All rights reserved.
Statistics and Data Analysis for Nursing Research, Second Edition, Denise F. Polit
Chapter 12: Logistic Regression
Logistic Regression
• Logistic regression (logit analysis) analyzes the relationship between one or more predictor variables and a categorical dependent variable:
– Binary logistic regression, used when the outcome is dichotomous (e.g., sepsis, absence of sepsis)
– Multinomial logistic regression, used when the outcome has three or more categories (e.g., live birth, miscarriage, abortion)
• This chapter focuses on binary logistic regression
Maximum Likelihood Estimation
• In logistic regression, parameter estimation is based on maximum likelihood estimation (MLE)
• Maximum likelihood estimators are those that estimate the parameters most likely to have generated the observed sample data
The Logit
• Logistic regression predicts the odds that an outcome will occur
– The odds of an outcome is the ratio of the probability that an event will occur to the probability that it will not
• In logistic regression, the dependent variable is transformed to be the natural log of the odds of the outcome, which is called a logit
– A logit ranges from minus infinity to plus infinity
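The relationship between probability, odds, and the logit can be sketched in a few lines. This is only an illustration; the function names `odds` and `logit` and the example probabilities are assumptions, not part of the chapter.

```python
import math

def odds(p):
    """Odds = probability of the event divided by probability of no event."""
    return p / (1.0 - p)

def logit(p):
    """Logit = natural log of the odds."""
    return math.log(odds(p))

# Illustrative probabilities (hypothetical values)
for p in (0.10, 0.50, 0.90):
    print(f"p = {p:.2f}  odds = {odds(p):.3f}  logit = {logit(p):+.3f}")
```

Probabilities below .50 give negative logits, .50 gives a logit of zero, and probabilities above .50 give positive logits, which is why the logit is unbounded in both directions.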
Logistic Regression Equation
• The predicted value or left side of the equation is the logit
• The right side of the equation is the constant plus predictor variables weighted by b regression coefficients:
• The equation: $\ln\left[\dfrac{P(\text{event})}{P(\text{no event})}\right] = b_0 + b_1X_1 + b_2X_2 + \dots + b_kX_k$
Logistic Regression Equation (cont’d)
• Log odds are hard to interpret, so the equation is modified so that the left-hand expression is the odds, not the log odds
• The right-hand expression involves raising e (the base of natural logarithms) to the power of the right side of the equation:
$\dfrac{P(\text{event})}{P(\text{no event})} = e^{\,b_0 + b_1X_1 + b_2X_2 + \dots + b_kX_k}$
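As a rough sketch of how the exponentiated equation is used, the code below turns a set of coefficients into predicted odds and a predicted probability. The coefficient values, predictor values, and function names are hypothetical and chosen only for illustration.

```python
import math

def predicted_odds(b, x):
    """b = [b0, b1, ..., bk]; x = [X1, ..., Xk]. Returns e raised to the log odds."""
    log_odds = b[0] + sum(bj * xj for bj, xj in zip(b[1:], x))
    return math.exp(log_odds)

def predicted_probability(b, x):
    """Convert odds back to a probability: odds / (1 + odds)."""
    o = predicted_odds(b, x)
    return o / (1.0 + o)

# Hypothetical coefficients: constant, a continuous predictor, a 0/1 predictor
b = [-2.0, 0.05, 0.7]
x = [30, 1]
print(predicted_odds(b, x), predicted_probability(b, x))
```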
Odds Ratio
• The factor by which the odds change for a given predictor is the odds ratio
• The odds ratio (OR) compares the odds of the event occurring given one condition with the odds of it occurring given another condition, when other predictors in the equation are held constant
Odds Ratio Example
• The odds ratio for having a tubal ligation, predicted on the basis of whether or not a woman in this population has a child with a disability = 1.41
• The odds of having a tubal ligation were 41% higher for women who had a disabled child than for those who did not, holding constant other factors like age and number of prior births
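The step from an odds ratio of 1.41 to "41% higher odds" is simple arithmetic, sketched below. The function names are illustrative assumptions; the only value taken from the slide is the OR of 1.41.

```python
import math

def odds_ratio(b):
    """An odds ratio is e raised to the power of the b coefficient."""
    return math.exp(b)

def percent_change_in_odds(or_value):
    """Percent change in the odds associated with the predictor."""
    return (or_value - 1.0) * 100.0

b = math.log(1.41)  # the b coefficient implied by an OR of 1.41
print(f"OR = {odds_ratio(b):.2f}, change in odds = {percent_change_in_odds(1.41):.0f}%")
```

An OR below 1.0 would yield a negative percentage, that is, lower odds in the group coded 1.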
Classification
• The logistic regression equation produces predicted probabilities of the outcome for each case
– Probabilities range from .00 to 1.00
• Predicted probabilities can be used to classify cases
• The default for the cut value is .50:
– Those whose predicted probability > .50 are predicted to have the outcome; others are classified as not having the outcome
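The classification rule can be sketched as follows. The function name `classify` and the example probabilities are hypothetical; the .50 default mirrors the cut value described above.

```python
def classify(probabilities, cut=0.50):
    """Assign 1 (has the outcome) when the predicted probability exceeds the cut value."""
    return [1 if p > cut else 0 for p in probabilities]

# Hypothetical predicted probabilities for four cases
print(classify([0.20, 0.50, 0.51, 0.90]))
print(classify([0.20, 0.50, 0.51, 0.90], cut=0.30))
```

Lowering the cut value classifies more cases as having the outcome, which trades false negatives for false positives.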
Classification Table Example
Observed: Tubal Ligation?   Predicted: No (0)   Predicted: Yes (1)   Percent Correct
No (0)                      2017                366                  84.6%
Yes (1)                     946                 435                  31.5%
Overall Percentage                                                   65.1%
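The percentages in the classification table can be recomputed from its four cell counts. The sketch below uses the counts shown in the table; the function name and dictionary keys are illustrative assumptions. It also reports what a constant-only model would achieve by predicting the more prevalent outcome for everyone.

```python
def classification_summary(true_neg, false_pos, false_neg, true_pos):
    """Percent correctly classified by observed group, overall, and for the null model."""
    observed_no = true_neg + false_pos
    observed_yes = false_neg + true_pos
    total = observed_no + observed_yes
    return {
        "percent_correct_no": 100.0 * true_neg / observed_no,
        "percent_correct_yes": 100.0 * true_pos / observed_yes,
        "percent_correct_overall": 100.0 * (true_neg + true_pos) / total,
        # The null model classifies every case into the larger observed group
        "percent_correct_null": 100.0 * max(observed_no, observed_yes) / total,
    }

summary = classification_summary(2017, 366, 946, 435)
for name, value in summary.items():
    print(f"{name}: {value:.1f}%")
```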
Dependent Variable Coding
• Dependent variable is usually coded 1 (to represent an event or characteristic) or 0 (to represent the absence of the event or characteristic)
• SPSS by default predicts to the category coded 1
Predictor Variables
• In logistic regression, predictors can be:
– Continuous variables (e.g., age)
– Dichotomous variables (e.g., sex)
– Categorical variables (e.g., employment status: not working (1), working part time (2), working full time (3))
– Interaction terms
Categorical Predictor Variables
• In SPSS, categorical predictors can be defined in terms of the type of contrast desired and which category to use as the reference category
• SPSS will create C - 1 new variables, where C = number of categories
Indicator Coding
• Indicator coding
– C - 1 new “dummy” variables are created, all with codes of 1 or 0
– Reference group is coded 0 on all new variables
– Coefficients for new variables represent the effect of each category compared to the reference category
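Indicator coding can be sketched with the employment-status example used earlier in these slides. The function name is hypothetical, and choosing "not working" as the reference category is an assumption made only for this illustration.

```python
def indicator_code(value, categories, reference):
    """Return C - 1 dummy codes (1 or 0); the reference category is 0 on all of them."""
    return [1 if value == c else 0 for c in categories if c != reference]

categories = ["not working", "working part time", "working full time"]
for status in categories:
    print(status, indicator_code(status, categories, reference="not working"))
```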
Deviation Coding
• Deviation coding
– C - 1 new variables are created
– Presence of attribute coded 1, absence coded 0, BUT
– Reference group is coded -1 on all new variables
– Coefficients for new variables represent the effect of each category compared to average effects for all categories
Deviation Coding Example
                                 Parameter Coding
Marital Status       Frequency   Category 1   Category 2
Married              300         1            0
Divorced, Widowed    100         0            1
Never Married        100         -1           -1
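The parameter codes in the marital status example can be reproduced with a short sketch. The function name `deviation_code` is a hypothetical helper; the categories and the choice of "Never Married" as the reference group come from the table.

```python
def deviation_code(value, categories, reference):
    """Return C - 1 deviation codes; the reference category is -1 on all of them."""
    non_reference = [c for c in categories if c != reference]
    if value == reference:
        return [-1] * len(non_reference)
    return [1 if value == c else 0 for c in non_reference]

categories = ["Married", "Divorced, Widowed", "Never Married"]
for status in categories:
    print(status, deviation_code(status, categories, reference="Never Married"))
```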
Entering Predictors in Logistic Regression
• Same entry options as in least-squares multiple regression:
– Simultaneous (direct), all predictors entered at once
– Hierarchical (sequential), researcher controls order of entry in blocks
– Stepwise, forward selection or backward elimination of variables using statistical criteria (LR criterion preferred)
Testing the Overall Model
• Several approaches to testing the overall goodness of fit of the data to the hypothesized model of predictors
• Standard approach involves calculating the likelihood index, which is the probability of the observed results
– MLE seeks to maximize likelihood; iterations stop when likelihood does not increase significantly
• The likelihood is most often shown transformed, as -2 times its natural log (-2LL)
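To make the iterative nature of MLE and the -2LL index concrete, here is a minimal sketch that fits a one-predictor logistic model by Newton-Raphson and records -2LL at each iteration. It is not the algorithm of any particular package; the function names, the stopping tolerance, and the toy data are all assumptions made for illustration.

```python
import math

def neg2_log_likelihood(b0, b1, x, y):
    """-2LL for a one-predictor logistic model with constant b0 and slope b1."""
    ll = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        ll += math.log(p) if yi == 1 else math.log(1.0 - p)
    return -2.0 * ll

def fit_logistic(x, y, tol=1e-8, max_iter=50):
    """Newton-Raphson iterations; stop when -2LL no longer decreases by more than tol."""
    b0 = b1 = 0.0
    history = [neg2_log_likelihood(b0, b1, x, y)]
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # gradient with respect to b0
            g1 += (yi - p) * xi     # gradient with respect to b1
            h00 += w                # entries of the 2 x 2 information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
        history.append(neg2_log_likelihood(b0, b1, x, y))
        if history[-2] - history[-1] < tol:
            break
    return b0, b1, history

# Hypothetical toy data: 10 cases with x = 0 (2 events), 10 cases with x = 1 (5 events)
x = [0] * 10 + [1] * 10
y = [1] * 2 + [0] * 8 + [1] * 5 + [0] * 5
b0, b1, history = fit_logistic(x, y)
print(b0, b1)
print([round(v, 3) for v in history])
```

The first entry of `history` is -2LL with both coefficients at zero; later entries shrink as the likelihood increases, and the loop ends once the improvement becomes negligible.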
Likelihood Ratio Test
• The likelihood ratio test (chi-square goodness-of-fit test) involves subtracting -2LL for the model with predictors from -2LL for the null model (constant-only model)
• The null model estimates the outcome without any predictors:
– In the absence of other information, the null model predicts that everyone has the outcome that is most prevalent
Likelihood Ratio Test (cont’d)
• Basic likelihood ratio statistic:
$\chi^2 = (-2LL\,[\text{reduced model}]) - (-2LL\,[\text{larger model}])$
• The null hypothesis is rejected if the chi-square value is statistically significant
• The likelihood ratio test can also be used to evaluate the significance of improvements to the model when new predictors are added (e.g., in hierarchical regression)
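The likelihood ratio statistic and its p value can be sketched with the standard library alone. The -2LL values passed in below are hypothetical; the helper `chi2_sf` is an illustrative implementation of the chi-square upper-tail probability for integer degrees of freedom, where df equals the number of parameters added to the reduced model.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Closed form for even df: exp(-x/2) * sum of (x/2)^i / i!
        term = total = 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc term plus a finite series
    total = 0.0
    term = math.sqrt(half) / math.gamma(1.5)
    for i in range(1, (df - 1) // 2 + 1):
        total += term
        term *= half / (i + 0.5)
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total

def likelihood_ratio_test(neg2ll_reduced, neg2ll_larger, df):
    """Chi-square = -2LL(reduced) minus -2LL(larger); df = number of added parameters."""
    chi_square = neg2ll_reduced - neg2ll_larger
    return chi_square, chi2_sf(chi_square, df)

# Hypothetical -2LL values for a null model and a model with 3 predictors
chi_square, p = likelihood_ratio_test(500.0, 488.0, df=3)
print(f"chi-square = {chi_square:.1f}, p = {p:.4f}")
```

As a check against the SPSS omnibus panel shown in these slides, `chi2_sf(450.567, 5)` is far below .001, consistent with the displayed significance of .000.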
SPSS Omnibus Model Test
• SPSS presents the likelihood ratio test results in a panel called Omnibus Test of Model Coefficients
• There is a statistic for the overall model, and for individual steps and blocks (not all of which are always relevant)
        Chi-Square   df   Sig.
Step    450.567      5    .000
Block   450.567      5    .000
Model   450.567      5    .000
Hosmer-Lemeshow Test
• An alternative approach to testing the overall model: Comparing the prediction model to a hypothetically “perfect” model
• One such test is the Hosmer-Lemeshow test, which involves dividing people in the two outcome categories into deciles of risk, based on deciles of the predicted probability value
Hosmer-Lemeshow Test (cont’d)
• The 10 deciles for the two outcome categories result in a 2 × 10 matrix, for which observed and expected frequencies are compared; then, a chi-square statistic is computed
• The desirable outcome is nonsignificance, which supports the inference that the model being tested is not reliably different from the perfect model
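A simplified sketch of the Hosmer-Lemeshow computation follows. It assumes cases are sorted by predicted probability and split into groups of nearly equal size, which may differ in detail from how a given package forms its deciles; the function name and the tiny example data are hypothetical.

```python
def hosmer_lemeshow(probabilities, outcomes, groups=10):
    """Chi-square comparing observed and expected counts across risk groups.

    Returns (statistic, df), where df = groups - 2.
    Assumes no group has an expected count of exactly zero.
    """
    pairs = sorted(zip(probabilities, outcomes))
    n = len(pairs)
    statistic = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        size = len(chunk)
        expected_events = sum(p for p, _ in chunk)
        observed_events = sum(y for _, y in chunk)
        expected_nonevents = size - expected_events
        observed_nonevents = size - observed_events
        statistic += (observed_events - expected_events) ** 2 / expected_events
        statistic += (observed_nonevents - expected_nonevents) ** 2 / expected_nonevents
    return statistic, groups - 2

# Tiny hypothetical example with 2 groups, only to show the arithmetic
stat, df = hosmer_lemeshow([0.2, 0.2, 0.8, 0.8], [0, 0, 1, 1], groups=2)
print(stat, df)
```

With real data and 10 groups, the statistic would be referred to a chi-square distribution with 8 degrees of freedom.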
Hosmer-Lemeshow Test (cont’d)
• The Hosmer-Lemeshow test is preferred by some to the likelihood ratio goodness-of-fit test BUT
• The test is not recommended for use with small samples (< 400 cases) or when small cell frequencies are expected
• Also, it can result in significance with large samples even when the model fits well
Wald Statistic
• Individual predictors are most often evaluated using the Wald statistic
• When predictors have 1 degree of freedom, the squared Wald statistic, which is distributed as chi-square, is:
$(b / SE_b)^2$
– Where $b$ = regression coefficient and $SE_b$ = its standard error
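The Wald computation for a 1-df predictor can be sketched as below. The coefficient and standard error are hypothetical values; the p value uses the fact that a chi-square variate with 1 degree of freedom is a squared standard normal variate.

```python
import math

def wald_test(b, se):
    """Squared Wald statistic (b / SE)^2 and its chi-square p value for 1 df."""
    wald = (b / se) ** 2
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return wald, p_value

# Hypothetical coefficient and standard error
wald, p_value = wald_test(b=0.5, se=0.2)
print(f"Wald = {wald:.2f}, p = {p_value:.4f}")
```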
Wald Statistic (cont’d)
• A Wald statistic is computed for each predictor or interaction term, indicating whether the b coefficient is statistically different from zero
• When there are categorical variables, a Wald statistic is computed for the overall variable, and for each new variable representing the desired contrast
Wald Statistic Problems
• When the absolute value of a b coefficient is large, its standard error tends to be inflated, which shrinks the Wald statistic and results in a heightened risk of a Type II error
Wald Statistic Problems (cont’d)
• Thus, some prefer the likelihood ratio improvement test to evaluate individual predictors:
– This requires each predictor to be added in successive blocks, to evaluate whether reductions to -2LL are significant
– This approach should always be used if the absolute value of a b coefficient is large
Wald Statistic Output
• SPSS presents b, SE, Wald, and odds ratios [Exp(B)] in the panel called “Variables in the Equation”
            b       SE     Wald    df   Sig.   Exp(B)
Age         .07     .006   180.2   1    .000   1.077
Births      .27     .025   118.5   1    .000   1.32
Married     .13     .101   1.65    1    .241   .98
Constant    -4.05   .212   370.4   1    .000
Classification Success
• Another way to consider the success of the model: Its ability to correctly classify sample members for whom the outcome is known
• Compare the overall percent correctly classified with predictors to the percent correctly classified with the null model
• Also, compare improvement in classification for those in an important outcome group
Effect Size
• There is no ideal measure of overall effect size in logistic regression
• Several pseudo $R^2$ statistics have been proposed, including:
– Cox and Snell $R^2$: Not ideal, can never achieve the value of 1.0
– Nagelkerke $R^2$: Can range from .00 to 1.00, and is a preferred index
• These indexes are approximations to $R^2$ but should not be interpreted as a proportion of the variance explained
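Both pseudo $R^2$ indexes can be computed from the -2LL of the null model, the -2LL of the fitted model, and the sample size. The sketch below uses the usual textbook formulas; the function names and the input values are hypothetical.

```python
import math

def cox_snell_r2(neg2ll_null, neg2ll_model, n):
    """Cox and Snell R^2 = 1 - exp(-(chi-square improvement) / n)."""
    return 1.0 - math.exp(-(neg2ll_null - neg2ll_model) / n)

def nagelkerke_r2(neg2ll_null, neg2ll_model, n):
    """Cox and Snell R^2 divided by its maximum attainable value."""
    maximum = 1.0 - math.exp(-neg2ll_null / n)
    return cox_snell_r2(neg2ll_null, neg2ll_model, n) / maximum

# Hypothetical values: -2LL of 100 for the null model, 80 with predictors, n = 100
print(round(cox_snell_r2(100.0, 80.0, 100), 3))
print(round(nagelkerke_r2(100.0, 80.0, 100), 3))
```

The rescaling is what allows the Nagelkerke index to reach 1.00 for a model that predicts perfectly, whereas the Cox and Snell index stops at its maximum below 1.0.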
Sample Size
• For stability of the parameter estimates, there should be at least 15 (but preferably 20+) cases for each predictor in the model
• Power analysis in logistic regression can be done, but is complex:
– A crude approximation to achieve adequate power: Base sample size estimation on the expected relationship between the outcome and a single important predictor, preferably the one with the most modest relationship to the outcome
Assumptions in Logistic Regression
• Assumptions in logistic regression are much less restrictive than in least-squares regression
• Does not assume:
– Linear relationship between dependent variable and predictors
– Normal distribution (multivariate normality)
– Homoscedasticity (homogeneity of variances)
Assumptions in Logistic Regression (cont’d)
• BUT, there are two important assumptions:
– Independence of observations (i.e., data are not from a repeated measures or pair-matched design)
– Linearity between the continuous predictors and the logit
• Violation increases risk of a Type II error
Linearity Assumption
• Linearity assumption is not easy to test
• One approach is a logit step test:
– Divide a continuous variable into equal-interval categories
– Enter this new variable as a categorical variable, with indicator coding
– Examine resulting b coefficients to see if the increase or decrease in magnitude of the coefficients is approximately linear
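The logit step test can be sketched for the simplest case, in which the binned variable is the only predictor. In that case the indicator-coded b coefficients equal each category's observed log odds minus the log odds of the reference category, so no model fitting is needed. The function name, the choice of the lowest interval as reference, and the data are assumptions; with additional predictors in the model, the coefficients would have to come from an actual logistic regression run.

```python
import math

def logit_step_coefficients(x, y, n_bins):
    """Bin x into equal-width intervals and return each bin's log odds
    relative to the lowest bin (the reference category).

    Assumes every bin contains both events and non-events.
    """
    low, high = min(x), max(x)
    width = (high - low) / n_bins
    events = [0] * n_bins
    totals = [0] * n_bins
    for xi, yi in zip(x, y):
        k = min(int((xi - low) / width), n_bins - 1)  # top edge goes in the last bin
        events[k] += yi
        totals[k] += 1
    logits = [math.log(e / (t - e)) for e, t in zip(events, totals)]
    return [value - logits[0] for value in logits]

# Hypothetical data: event rates of 20%, 50%, and 80% at three levels of x
x = [20] * 10 + [30] * 10 + [40] * 10
y = [1] * 2 + [0] * 8 + [1] * 5 + [0] * 5 + [1] * 8 + [0] * 2
print([round(c, 3) for c in logit_step_coefficients(x, y, n_bins=3)])
```

Roughly equal steps between successive coefficients, as in this constructed example, are what an approximately linear relationship with the logit would look like.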
Other Potential Problems
• Use predictors with minimal level of measurement error
• Avoid multicollinearity (high correlations among predictors)
• Avoid outliers:
– An outlier in logistic regression is one for which the absolute value of the standardized residual value is large (greater than 2.58 or, less conservatively, 3.0)
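Flagging outliers by standardized residual can be sketched as follows. This assumes the standardized residual is the Pearson form, (observed - predicted) / sqrt(p(1 - p)); the function names and example values are hypothetical, and the 2.58 threshold is the one given above.

```python
import math

def standardized_residual(y, p):
    """Observed outcome minus predicted probability, divided by its estimated SD."""
    return (y - p) / math.sqrt(p * (1.0 - p))

def flag_outliers(outcomes, probabilities, threshold=2.58):
    """Return the positions of cases whose |standardized residual| exceeds the threshold."""
    return [
        i for i, (y, p) in enumerate(zip(outcomes, probabilities))
        if abs(standardized_residual(y, p)) > threshold
    ]

# Hypothetical cases: the first had the outcome despite a predicted probability of .05
print(flag_outliers([1, 1, 0], [0.05, 0.60, 0.30]))
```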
SPSS and Logistic Regression
• Analyze → Regression → Binary Logistic
• Move outcome into Dependent slot
• Move predictors into slot for Covariates
SPSS and Logistic Regression (cont’d)
• Hierarchical entry: define new blocks using “Next” button
• To define categorical variable contrasts, push Categorical button
• To select statistical and display options, push Options button
SPSS Logistic Regression: Categorical Dialog Box
• Move categorical variables into slot for Categorical variables
• Select type of contrast
• Select reference category
• Remember to click Change
SPSS Logistic Regression: Options Dialog Box
• Useful options:
– Hosmer-Lemeshow test
– Iteration history (to see -2LL for null model)
– Listing of outliers via residuals
– Confidence intervals around odds ratios
SPSS Logistic Regression: Save Dialog Box
• Use the Save dialog box to add new variables to the original data file for each case
• Especially useful:
– Predicted probabilities
– Predicted classification on the outcome (group membership)