Page 1: logit

An Introduction to Logistic Regression

John Whitehead

Department of Economics

Appalachian State University

Page 2: logit

Outline

Introduction and Description

Some Potential Problems and Solutions

Writing Up the Results

Page 3: logit

Introduction and Description

Why use logistic regression?
Estimation by maximum likelihood
Interpreting coefficients
Hypothesis testing
Evaluating the performance of the model

Page 4: logit

Why use logistic regression?

There are many important research topics for which the dependent variable is "limited."

For example: voting, morbidity or mortality, and participation data are not continuous or normally distributed.

Binary logistic regression is a type of regression analysis where the dependent variable is a dummy variable: coded 0 (did not vote) or 1 (did vote).

Page 5: logit

The Linear Probability Model

In the OLS regression:

$Y = \alpha + \beta X + e$, where $Y \in \{0, 1\}$

The error terms are heteroskedastic.

e is not normally distributed because Y takes on only two values.

The predicted probabilities can be greater than 1 or less than 0.
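The out-of-range problem can be reproduced on a toy sample. The sketch below is only an illustration: the eight observations are hypothetical (they are not the hurricane data), and `ols_fit` is a helper written for this example that applies the textbook one-regressor OLS formulas.

```python
def ols_fit(xs, ys):
    """One-regressor OLS: returns (alpha, beta) for y = alpha + beta * x + e."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta = sxy / sxx
    alpha = y_bar - beta * x_bar
    return alpha, beta

# Hypothetical binary outcome (assumption, for illustration only).
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

alpha, beta = ols_fit(xs, ys)
fitted = [alpha + beta * x for x in xs]

# The smallest fitted "probability" is below 0 and the largest is above 1.
print(min(fitted), max(fitted))
```

Even with a perfectly ordered outcome, the straight line has to leave the unit interval at the extremes of X.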

Page 6: logit

Q: EVAC

Did you evacuate your home to go someplace safer before Hurricane Dennis (Floyd) hit?  1 YES 2 NO 3 DON'T KNOW 4 REFUSED

An Example: Hurricane Evacuations

Page 7: logit

The Data

EVAC  PETS  MOBLHOME  TENURE  EDUC
0     1     0         16      16
0     1     0         26      12
0     1     1         11      13
1     1     1         1       10
1     0     0         5       12
0     0     0         34      12
0     0     0         3       14
0     1     0         3       16
0     1     0         10      12
0     0     0         2       18
0     0     0         2       12
0     1     0         25      16
1     1     1         20      12

Page 8: logit

OLS Results

Dependent Variable: EVAC

Variable     B        t-value
(Constant)    0.190    2.121
PETS         -0.137   -5.296
MOBLHOME      0.337    8.963
TENURE       -0.003   -2.973
EDUC          0.003    0.424
FLOYD         0.198    8.147

R2            0.145
F-stat       36.010

Page 9: logit

Problems:

Predicted values outside the 0,1 range

Descriptive Statistics

                                N     Minimum   Maximum   Mean       Std. Deviation
Unstandardized Predicted Value  1070  -.08498   .76027    .2429907   .1632534
Valid N (listwise)              1070

Page 10: logit

Heteroskedasticity

[Figure: scatterplot of the unstandardized residual (vertical axis) against TENURE (horizontal axis)]

Park Test

Dependent Variable: LNESQ

Variable     B       t-stat
(Constant)   -2.34   -15.99
LNTNSQ       -0.20   -6.19

Page 11: logit

The Logistic Regression Model

The "logit" model solves these problems:

$\ln[p/(1-p)] = \alpha + \beta X + e$

p is the probability that the event Y occurs, p(Y=1)

p/(1-p) is the "odds ratio" (strictly, the odds of the event)

ln[p/(1-p)] is the log odds ratio, or "logit"

Page 12: logit

More:

The logistic distribution constrains the estimated probabilities to lie between 0 and 1.

The estimated probability is:

$p = 1/[1 + \exp(-\alpha - \beta X)]$

if you let $\alpha + \beta X = 0$, then p = .50
as $\alpha + \beta X$ gets really big, p approaches 1
as $\alpha + \beta X$ gets really small (large and negative), p approaches 0
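These three properties are easy to check numerically. The following is a minimal sketch; the function name `logistic` and the test values 20 and -20 are choices made for this illustration, not part of the slides.

```python
import math

def logistic(z):
    """Estimated probability p = 1 / [1 + exp(-z)], where z = alpha + beta * X."""
    return 1.0 / (1.0 + math.exp(-z))

p_at_zero = logistic(0.0)    # index equal to 0
p_large = logistic(20.0)     # index "really big"
p_small = logistic(-20.0)    # index large and negative

print(p_at_zero, p_large, p_small)
```

Whatever the value of the index, the result stays strictly between 0 and 1, which is what the linear probability model cannot guarantee.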

Page 13: logit
Page 14: logit

Comparing LP and Logit Models

[Figure: predicted probability, with the vertical axis running from 0 to 1, for the LP Model and the Logit Model]

Page 15: logit

Maximum Likelihood Estimation (MLE)

MLE is a statistical method for estimating the coefficients of a model.

The likelihood function (L) measures the probability of observing the particular set of dependent variable values ($p_1, p_2, \ldots, p_n$) that occur in the sample:

$L = \mathrm{Prob}(p_1 \cdot p_2 \cdots p_n)$

The higher the L, the higher the probability of observing the ps in the sample.

Page 16: logit

MLE involves finding the coefficients ($\alpha$, $\beta$) that make the log of the likelihood function (LL < 0) as large as possible.

Or, it finds the coefficients that make -2 times the log of the likelihood function (-2LL) as small as possible.

The maximum likelihood estimates solve the following condition:

$\sum_{i=1}^{n} \{Y_i - p(Y_i = 1)\} X_i = 0$

summed over all observations, i = 1, …, n
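One way to see the first-order condition at work is to solve it numerically. The sketch below uses Newton-Raphson, which is one common choice and not necessarily the routine SPSS uses; the data are hypothetical and `fit_logit` and `score` are helper names introduced for this illustration. It covers the one-regressor case with a constant.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(xs, ys, alpha, beta):
    """First-order conditions: sum of (y - p) and sum of (y - p) * x."""
    ps = [logistic(alpha + beta * x) for x in xs]
    g0 = sum(y - p for y, p in zip(ys, ps))
    g1 = sum((y - p) * x for x, y, p in zip(xs, ys, ps))
    return g0, g1

def fit_logit(xs, ys, n_iter=25):
    """Newton-Raphson for ln[p/(1-p)] = alpha + beta * x (single regressor)."""
    alpha, beta = 0.0, 0.0
    for _ in range(n_iter):
        g0, g1 = score(xs, ys, alpha, beta)
        # Information matrix entries, using weights w = p * (1 - p).
        ws = [p * (1.0 - p) for p in (logistic(alpha + beta * x) for x in xs)]
        h00 = sum(ws)
        h01 = sum(w * x for w, x in zip(ws, xs))
        h11 = sum(w * x * x for w, x in zip(ws, xs))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = g by hand.
        alpha += (h11 * g0 - h01 * g1) / det
        beta += (h00 * g1 - h01 * g0) / det
    return alpha, beta

def log_likelihood(xs, ys, alpha, beta):
    ps = [logistic(alpha + beta * x) for x in xs]
    return sum(math.log(p) if y == 1 else math.log(1.0 - p) for y, p in zip(ys, ps))

# Hypothetical data (assumption): outcomes overlap, so the MLE is finite.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

alpha_hat, beta_hat = fit_logit(xs, ys)
g0, g1 = score(xs, ys, alpha_hat, beta_hat)
ll = log_likelihood(xs, ys, alpha_hat, beta_hat)
print(alpha_hat, beta_hat, ll, g0, g1)
```

At the estimates, both sums in `score` are zero up to rounding, and the log likelihood is negative, as the slide notes (LL < 0).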

Page 17: logit

Interpreting Coefficients

Since:

$\ln[p/(1-p)] = \alpha + \beta X + e$

The slope coefficient ($\beta$) is interpreted as the rate of change in the "log odds" as X changes … not very useful.

Since:

$p = 1/[1 + \exp(-\alpha - \beta X)]$

The marginal effect of a change in X on the probability is: $\partial p / \partial X = f(\beta X)\,\beta$, where f is the density function of the logistic distribution.
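For the logistic model the density can be written as p(1 - p), so the marginal effect is β·p·(1 - p). The sketch below assumes that form and evaluates it at the full index α + βX; the coefficient values are hypothetical, and `marginal_effect` is a name introduced for this illustration.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def marginal_effect(alpha, beta, x):
    """dp/dX for the logit model: beta * p * (1 - p), with p evaluated at x."""
    p = logistic(alpha + beta * x)
    return beta * p * (1.0 - p)

# Hypothetical coefficients (assumption, for illustration only).
alpha, beta, x = -1.0, 0.5, 3.0

analytic = marginal_effect(alpha, beta, x)

# Cross-check against a central finite difference of the probability.
h = 1e-6
numeric = (logistic(alpha + beta * (x + h)) - logistic(alpha + beta * (x - h))) / (2 * h)
print(analytic, numeric)
```

The effect is largest where p = .50 (it equals β/4 there) and shrinks toward zero in the tails, so a single marginal effect always refers to a particular value of X.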

Page 18: logit

An interpretation of the logit coefficient which is usually more intuitive is the "odds ratio"

Since:

$[p/(1-p)] = \exp(\alpha + \beta X)$

$\exp(\beta)$ is the effect of the independent variable on the "odds ratio"

Page 19: logit

From SPSS Output:

Variable    B         Exp(B)   1/Exp(B)

PETS        -0.6593   0.5172   1.933
MOBLHOME     1.5583   4.7508
TENURE      -0.0198   0.9804   1.020
EDUC         0.0501   1.0514
Constant    -0.916

“Households without pets are 1.933 times more likely to evacuate than those with pets.”
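The Exp(B) and 1/Exp(B) columns can be recomputed from the B column. This is a small sketch using the coefficients in the table above; the dictionary names are introduced for illustration.

```python
import math

# Coefficients copied from the SPSS table above.
coefficients = {
    "PETS": -0.6593,
    "MOBLHOME": 1.5583,
    "TENURE": -0.0198,
    "EDUC": 0.0501,
}

# Exp(B): multiplicative change in the odds for a one-unit change in the variable.
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}

# 1/Exp(B): for negative coefficients, the same effect stated for the other group.
inverse_ratios = {name: 1.0 / ratio for name, ratio in odds_ratios.items() if ratio < 1.0}

for name, ratio in odds_ratios.items():
    print(name, round(ratio, 4), round(inverse_ratios.get(name, float("nan")), 3))
```

Because the PETS coefficient is negative, Exp(B) is below 1: having pets lowers the odds of evacuating, and the inverse (about 1.933) describes households without pets relative to those with pets.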

Page 20: logit

Hypothesis Testing

The Wald statistic for the $\beta$ coefficient is:

$\mathrm{Wald} = [\beta / \mathrm{s.e.}_{\beta}]^2$

which is distributed chi-square with 1 degree of freedom.

The "Partial R" (in SPSS output) is

$R = \{[(\mathrm{Wald} - 2)/(-2LL(\alpha))]\}^{1/2}$

Page 21: logit

An Example:

Variable    B         S.E.     Wald     R         Sig      t-value

PETS        -0.6593   0.2012   10.732   -0.1127   0.0011   -3.28
MOBLHOME     1.5583   0.2874   29.39     0.1996   0         5.42
TENURE      -0.0198   0.008    6.1238   -0.0775   0.0133   -2.48
EDUC         0.0501   0.0468   1.1483    0.0000   0.2839    1.07
Constant    -0.916    0.69     1.7624    1        0.1843   -1.33
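The PETS row can be approximately reproduced from B and S.E. alone. In this sketch the function names are introduced for illustration, and two points are assumptions read off the table rather than stated in the slides: R carries the sign of the coefficient, and R is set to 0 when Wald is below 2 (as in the EDUC row). The value 687.35714 is the beginning -2LL reported in the SPSS output of the Floyd model.

```python
import math

def wald_statistic(b, se):
    """Wald = (B / s.e.)^2, chi-square with 1 degree of freedom."""
    return (b / se) ** 2

def chi2_1df_pvalue(wald):
    """Upper-tail probability of a chi-square(1) variable: erfc(sqrt(w / 2))."""
    return math.erfc(math.sqrt(wald / 2.0))

def partial_r(b, wald, neg2ll_constant_only):
    """Partial R = sqrt((Wald - 2) / -2LL(alpha)), signed like B (assumption)."""
    if wald <= 2.0:
        return 0.0  # assumption: reported as 0 when Wald is below 2
    return math.copysign(math.sqrt((wald - 2.0) / neg2ll_constant_only), b)

b_pets, se_pets = -0.6593, 0.2012
wald_pets = wald_statistic(b_pets, se_pets)   # close to the tabulated 10.732
t_pets = b_pets / se_pets                     # the t-value column is B / S.E.
p_pets = chi2_1df_pvalue(10.732)
r_pets = partial_r(b_pets, 10.732, 687.35714)
r_educ = partial_r(0.0501, 1.1483, 687.35714)
print(wald_pets, t_pets, p_pets, r_pets, r_educ)
```

The recomputed Wald differs slightly from the table because B and S.E. are rounded in the printout.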

Page 22: logit

Evaluating the Performance of the Model

There are several statistics which can be used for comparing alternative models or evaluating the performance of a single model:

Model Chi-Square
Percent Correct Predictions
Pseudo-R2

Page 23: logit

Model Chi-Square

The model likelihood ratio (LR) statistic is

$LR[i] = -2[LL(\alpha) - LL(\alpha, \beta)]$

{Or, as you are reading SPSS printout:

LR[i] = [-2LL (of beginning model)] - [-2LL (of ending model)]}

The LR statistic is distributed chi-square with i degrees of freedom, where i is the number of independent variables.

Use the “Model Chi-Square” statistic to determine if the overall model is statistically significant.

Page 24: logit

An Example:

Beginning Block Number 1. Method: Enter

-2 Log Likelihood 687.35714

Variable(s) Entered on Step Number 1.. PETS MOBLHOME TENURE EDUC

Estimation terminated at iteration number 3 because Log Likelihood decreased by less than .01 percent.

-2 Log Likelihood 641.842

Chi-Square df Sign.

Model 45.515 4 0.0000
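The model chi-square in this printout is just the difference between the two -2 Log Likelihood values. The sketch below recomputes it and its p-value; `chi2_sf_even_df` is a helper written for this illustration that uses the closed-form chi-square tail, which is valid only for an even number of degrees of freedom (4 here).

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability for even df:
    exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!"""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form used here requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

neg2ll_beginning = 687.35714   # constant-only model
neg2ll_ending = 641.842        # model with PETS, MOBLHOME, TENURE, EDUC

model_chi2 = neg2ll_beginning - neg2ll_ending
p_value = chi2_sf_even_df(model_chi2, df=4)
print(round(model_chi2, 3), p_value)
```

The p-value is far below .0001, which is why SPSS prints the significance as 0.0000.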

Page 25: logit

Percent Correct Predictions

The "Percent Correct Predictions" statistic assumes that if the estimated p is greater than or equal to .5 then the event is expected to occur and not occur otherwise.

By assigning these probabilities 0s and 1s and comparing these to the actual 0s and 1s, the % correct Yes, % correct No, and overall % correct scores are calculated.

Page 26: logit

An Example:

              Predicted
Observed      0      1      % Correct

0             328    24     93.18%
1             139    44     24.04%

Overall                     69.53%
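The three percentages follow directly from the four cell counts. A short sketch, with variable names introduced for illustration:

```python
# Cell counts from the classification table: (observed, predicted) -> count.
counts = {
    (0, 0): 328, (0, 1): 24,
    (1, 0): 139, (1, 1): 44,
}

observed_no = counts[(0, 0)] + counts[(0, 1)]
observed_yes = counts[(1, 0)] + counts[(1, 1)]

pct_correct_no = 100.0 * counts[(0, 0)] / observed_no
pct_correct_yes = 100.0 * counts[(1, 1)] / observed_yes
pct_correct_overall = 100.0 * (counts[(0, 0)] + counts[(1, 1)]) / (observed_no + observed_yes)

print(round(pct_correct_no, 2), round(pct_correct_yes, 2), round(pct_correct_overall, 2))
```

The overall figure hides an imbalance: the model classifies most non-evacuees correctly but fewer than a quarter of the evacuees.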

Page 27: logit

Pseudo-R2

One pseudo-R2 statistic is the McFadden's-R2 statistic:

$\text{McFadden's-}R^2 = 1 - [LL(\alpha, \beta)/LL(\alpha)]$

{$= 1 - [-2LL(\alpha, \beta)/-2LL(\alpha)]$ (from SPSS printout)}

where the R2 is a scalar measure which varies between 0 and (somewhat close to) 1 much like the R2 in a LP model.

Page 28: logit

An Example:

Beginning -2 LL          687.36
Ending -2 LL             641.84
Ending/Beginning         0.9338

McF. R2 = 1 - E./B.      0.0662
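The same arithmetic as a short sketch; `mcfadden_r2` is a name introduced for illustration.

```python
def mcfadden_r2(neg2ll_beginning, neg2ll_ending):
    """McFadden's R2 = 1 - [-2LL(ending model) / -2LL(beginning model)]."""
    return 1.0 - neg2ll_ending / neg2ll_beginning

ratio = 641.84 / 687.36
r2 = mcfadden_r2(687.36, 641.84)
print(round(ratio, 4), round(r2, 4))
```

The factor -2 cancels in the ratio, so the result is the same whether log likelihoods or -2LL values are used.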

Page 29: logit

Some potential problems and solutions

Omitted Variable Bias
Irrelevant Variable Bias
Functional Form
Multicollinearity
Structural Breaks

Page 30: logit

Omitted Variable Bias

Omitted variable(s) can result in bias in the coefficient estimates. To test for omitted variables you can conduct a likelihood ratio test:

LR[q] = {[-2LL(constrained model, i=k-q)] - [-2LL(unconstrained model, i=k)]}

where LR is distributed chi-square with q degrees of freedom, with q = 1 or more omitted variables

{This test is conducted automatically by SPSS if you specify "blocks" of independent variables}

Page 31: logit

An Example:

Variable    B        Wald     Sig

PETS        -0.699   10.968   0.001
MOBLHOME     1.570   29.412   0.000
TENURE      -0.020   5.993    0.014
EDUC         0.049   1.079    0.299
CHILD        0.009   0.011    0.917
WHITE        0.186   0.422    0.516
FEMALE       0.018   0.008    0.928
Constant    -1.049   2.073    0.150

Beginning -2 LL   687.36
Ending -2 LL      641.41

Page 32: logit

Constructing the LR Test

“Since the chi-squared value is less than the critical value, the set of coefficients is not statistically significant. The full model is not an improvement over the partial model.”

Ending -2 LL Partial Model   641.84
Ending -2 LL Full Model      641.41
Block Chi-Square             0.43
DF                           3
Critical Value               11.345
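The block test reduces to one subtraction and one comparison. In this sketch the function name is introduced for illustration, and the critical value 11.345 is taken from the slide (chi-square, 3 degrees of freedom, .01 level) rather than computed.

```python
def lr_statistic(neg2ll_constrained, neg2ll_unconstrained):
    """LR[q] = [-2LL(constrained model)] - [-2LL(unconstrained model)]."""
    return neg2ll_constrained - neg2ll_unconstrained

neg2ll_partial = 641.84   # PETS, MOBLHOME, TENURE, EDUC
neg2ll_full = 641.41      # adds CHILD, WHITE, FEMALE (q = 3)
critical_value = 11.345   # from the slide

block_chi2 = lr_statistic(neg2ll_partial, neg2ll_full)
full_model_is_improvement = block_chi2 > critical_value
print(round(block_chi2, 2), full_model_is_improvement)
```

Adding the three variables lowers -2LL by only 0.43, so the constrained (partial) model is retained.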

Page 33: logit

Irrelevant Variable Bias

The inclusion of irrelevant variable(s) can result in poor model fit.

You can consult your Wald statistics or conduct a likelihood ratio test.

Page 34: logit

Functional Form

Errors in functional form can result in biased coefficient estimates and poor model fit.

You should try different functional forms by logging the independent variables, adding squared terms, etc.

Then consult the Wald statistics and model chi-square statistics to determine which model performs best.

Page 35: logit

Multicollinearity

The presence of multicollinearity will not lead to biased coefficients.

But the standard errors of the coefficients will be inflated.

If a variable which you think should be statistically significant is not, consult the correlation coefficients.

If two variables are correlated at a rate greater than .6, .7, .8, etc. then try dropping the least theoretically important of the two.

Page 36: logit

Structural Breaks

You may have structural breaks in your data. Pooling the data imposes the restriction that an independent variable has the same effect on the dependent variable for different groups of data when the opposite may be true.

You can conduct a likelihood ratio test:

LR[i+1] = -2LL(pooled model) - [-2LL(sample 1) + -2LL(sample 2)]

where samples 1 and 2 are pooled, and i is the number of independent variables.

Page 37: logit

An Example

Is the evacuation behavior from Hurricanes Dennis and Floyd statistically equivalent?

                   Floyd     Dennis    Pooled
Variable           B         B         B
PETS               -0.66     -1.20     -0.79
MOBLHOME            1.56      2.00      1.62
TENURE             -0.02     -0.02     -0.02
EDUC                0.05     -0.04      0.02
Constant           -0.92     -0.78     -0.97
Beginning -2 LL    687.36    440.87    1186.64
Ending -2 LL       641.84    382.84    1095.26
Model Chi-Square   45.52     58.02     91.37

Page 38: logit

Constructing the LR Test

                 Floyd     Dennis    Pooled
Ending -2 LL     641.84    382.84    1095.26

Chi-Square       70.58     [Pooled - (Floyd + Dennis)]
DF               5
Critical Value   15.086    (p = .01)

Since the chi-squared value is greater than the critical value, the sets of coefficients are statistically different. The pooled model is inappropriate.
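The pooling test can be checked with the ending -2LL values from the three models. In the sketch below the function name is introduced for illustration, and the critical values are standard chi-square table entries at the .01 level: 15.086 for 5 degrees of freedom (i + 1, with i = 4 independent variables) and 13.277 for 4 degrees of freedom. The statistic exceeds both, so the conclusion does not depend on which one is used.

```python
def pooling_lr_statistic(neg2ll_pooled, neg2ll_subsamples):
    """LR[i+1] = -2LL(pooled model) - sum of -2LL over the separate samples."""
    return neg2ll_pooled - sum(neg2ll_subsamples)

neg2ll_floyd = 641.84
neg2ll_dennis = 382.84
neg2ll_pooled = 1095.26

n_independent_vars = 4          # PETS, MOBLHOME, TENURE, EDUC
df = n_independent_vars + 1     # slopes plus the constant
critical_value_df5 = 15.086     # chi-square table, .01 level, 5 df
critical_value_df4 = 13.277     # chi-square table, .01 level, 4 df

chi2 = pooling_lr_statistic(neg2ll_pooled, [neg2ll_floyd, neg2ll_dennis])
reject_pooling = chi2 > critical_value_df5
print(round(chi2, 2), df, reject_pooling)
```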

Page 39: logit

What should you do?

Try adding a dummy variable:

FLOYD = 1 if Floyd, 0 if Dennis

Variable    B       Wald    Sig
PETS        -0.85   27.20   0.000
MOBLHOME     1.75   65.67   0.000
TENURE      -0.02   8.34    0.004
EDUC         0.02   0.27    0.606
FLOYD        1.26   59.08   0.000
Constant    -1.68   8.71    0.003

Page 40: logit

Writing Up Results

Present descriptive statistics in a table.

Make it clear that the dependent variable is discrete (0, 1) and not continuous and that you will use logistic regression.

Logistic regression is a standard statistical procedure so you don't (necessarily) need to write out the formula for it. You also (usually) don't need to justify that you are using Logit instead of the LP model or Probit (similar to logit but based on the normal distribution [the tails are less fat]).

Page 41: logit

An Example:

"The dependent variable which measures the willingness to evacuate is EVAC. EVAC is equal to 1 if the respondent evacuated their home during Hurricanes Floyd and Dennis and 0 otherwise. The logistic regression model is used to estimate the factors which influence evacuation behavior."

Page 42: logit

Organize your regression results in a table:

In the heading, state your dependent variable (dependent variable = EVAC) and that these are "logistic regression results."

Present coefficient estimates, t-statistics (or Wald, whichever you prefer), and (at least) the model chi-square statistic for overall model fit.

If you are comparing several model specifications you should also present the % correct predictions and/or Pseudo-R2 statistics to evaluate model performance.

If you are comparing models with hypotheses about different blocks of coefficients or testing for structural breaks in the data, you could present the ending log-likelihood values.

Page 43: logit

An Example:

Table 2. Logistic Regression Results
Dependent Variable = EVAC

Variable             B         B/S.E.

PETS                 -0.6593   -3.28
MOBLHOME              1.5583    5.42
TENURE               -0.0198   -2.48
EDUC                  0.0501    1.07
Constant             -0.916    -1.33

Model Chi-Squared    45.515

Page 44: logit

"The results from Model 1 indicate that coastal residents behave according to risk theory. The coefficient on the MOBLHOME variable is positive and statistically significant at the p < .01 level (t-value = 5.42). Mobile home residents are 4.75 times more likely to evacuate."

When describing the statistics in the tables, point out the highlights for the reader. What are the statistically significant variables?

Page 45: logit

“The overall model is significant at the .01 level according to the Model chi-square statistic. The model predicts 69.5% of the responses correctly. The McFadden's R2 is .066.”

Is the overall model statistically significant?

Page 46: logit

Which model is preferred?

"Model 2 includes three additional independent variables. According to the likelihood ratio test statistic, the partial model is superior to the full model in terms of overall model fit. The block chi-square statistic is not statistically significant at the .01 level (critical value = 11.35 [df=3]). The coefficients on the children, gender, and race variables are not statistically significant at standard levels."

Page 47: logit

Also

You usually don't need to discuss the magnitude of the coefficients--just the sign (+ or -) and statistical significance.

If your audience is unfamiliar with the extensions (beyond SPSS or SAS printouts) to logistic regression, discuss the calculation of the statistics in an appendix or footnote or provide a citation.

Always state the degrees of freedom for your likelihood-ratio (chi-square) test.

Page 48: logit

References

http://personal.ecu.edu/whiteheadj/data/logit/

http://personal.ecu.edu/whiteheadj/data/logit/logitpap.htm

E-mail: [email protected]