AN INTRODUCTION TO LOGISTIC REGRESSION
ENI SUMARMININGSIH, SSI, MM
PROGRAM STUDI STATISTIKA, JURUSAN MATEMATIKA, UNIVERSITAS BRAWIJAYA

Jan 03, 2016

Transcript
Page 1:

AN INTRODUCTION TO LOGISTIC REGRESSION

ENI SUMARMININGSIH, SSI, MM

PROGRAM STUDI STATISTIKA

JURUSAN MATEMATIKA

UNIVERSITAS BRAWIJAYA

Page 2:

OUTLINE

Introduction and Description

Some Potential Problems and Solutions

Page 3:

INTRODUCTION AND DESCRIPTION

Why use logistic regression?
Estimation by maximum likelihood
Interpreting coefficients
Hypothesis testing
Evaluating the performance of the model

Page 4:

WHY USE LOGISTIC REGRESSION?

There are many important research topics for which the dependent variable is "limited." For example, voting, morbidity or mortality, and participation data are not continuous or normally distributed. Binary logistic regression is a type of regression analysis where the dependent variable is a dummy variable, coded 0 (did not vote) or 1 (did vote).

Page 5:

THE LINEAR PROBABILITY MODEL

In the OLS regression: Y = α + βX + e, where Y ∈ {0, 1}

The error terms are heteroskedastic
e is not normally distributed because Y takes on only two values
The predicted probabilities can be greater than 1 or less than 0
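The last of these problems is easy to demonstrate numerically. The sketch below uses made-up data (not the pulse data introduced later in these slides) and fits a simple OLS line to a 0/1 outcome with the closed-form formulas; the fitted "probabilities" escape the [0, 1] range at the extremes of X:

```python
# Illustrative synthetic data: a binary outcome regressed on one predictor.
x = [1, 2, 3, 4, 5]
y = [0, 0, 0, 1, 1]

n = len(x)
mx = sum(x) / n
my = sum(y) / n

# Closed-form simple-regression estimates: beta = Sxy / Sxx
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
beta = sxy / sxx          # slope
alpha = my - beta * mx    # intercept

def predict(xi):
    """OLS fitted value, which the LP model treats as a probability."""
    return alpha + beta * xi

print(round(predict(1), 2))  # negative: a "probability" below 0
print(round(predict(6), 2))  # above 1: not a valid probability either
```

With these numbers the fit is Y = -0.5 + 0.3X, so the linear probability model predicts -0.2 at X = 1 and 1.3 at X = 6.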

Page 6:

AN EXAMPLE

You are a researcher interested in understanding the effect of smoking and weight on resting pulse rate. Because you have categorized the response (pulse rate) into low and high, a binary logistic regression analysis is appropriate for investigating the effects of smoking and weight on pulse rate.

Page 7:

THE DATA

RestingPulse  Smokes  Weight
Low           No      140
Low           No      145
Low           Yes     160
Low           Yes     190
Low           No      155
Low           No      165
High          No      150
Low           No      190
Low           No      195
⁞             ⁞       ⁞
Low           No      110
High          No      150
Low           No      108

Page 8:

OLS RESULTS

Regression Analysis: Tekanan Darah versus Weight, Merokok
(Tekanan Darah is Indonesian for "blood pressure"; Merokok for "smokes")

The regression equation is
Tekanan Darah = 0.745 - 0.00392 Weight + 0.210 Merokok

Predictor     Coef       SE Coef    T      P
Constant      0.7449     0.2715     2.74   0.007
Weight       -0.003925   0.001876  -2.09   0.039
Merokok       0.20989    0.09626    2.18   0.032

S = 0.416246   R-Sq = 7.9%   R-Sq(adj) = 5.8%

Page 9:

PROBLEMS:

Predicted values outside the [0, 1] range

Descriptive Statistics: FITS1

Variable  N   N*  Mean    StDev   Minimum  Q1      Median  Q3      Maximum
FITS1     92  0   0.2391  0.1204  -0.0989  0.1562  0.2347  0.3132  0.5309

Page 10:

HETEROSKEDASTICITY

[Figure: Scatterplot of RESI1 vs Weight]

Page 11:

THE LOGISTIC REGRESSION MODEL

The "logit" model solves these problems:

ln[p/(1-p)] = α + βX + e

p is the probability that the event Y occurs, p(Y = 1)
p/(1-p) is the "odds"
ln[p/(1-p)] is the log odds, or "logit"

Page 12:

More: The logistic distribution constrains the estimated probabilities to lie between 0 and 1.

The estimated probability is:

p = 1/[1 + exp(-α - βX)]

If α + βX = 0, then p = .50
As α + βX gets really big, p approaches 1
As α + βX gets really small, p approaches 0
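The three limiting cases above are easy to verify numerically. This sketch evaluates the logistic function at arbitrary illustrative values of the linear predictor α + βX (the values are not estimates from any data in these slides):

```python
import math

def logistic(z):
    """The logistic function: maps a linear predictor z = alpha + beta*x
    to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

print(logistic(0))    # exactly 0.5 when alpha + beta*x = 0
print(logistic(10))   # close to 1 for a large positive linear predictor
print(logistic(-10))  # close to 0 for a large negative linear predictor
```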

Page 13:

Page 14:

COMPARING LP AND LOGIT MODELS

[Figure: predicted probabilities for the LP model and the logit model, with the vertical axis running from 0 to 1]

Page 15:

MAXIMUM LIKELIHOOD ESTIMATION (MLE)

MLE is a statistical method for estimating the coefficients of a model: it chooses the coefficient values that maximize the likelihood, i.e., the probability of observing the sample data under the model.
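As a toy illustration of the idea, the sketch below maximizes the Bernoulli log-likelihood of a one-predictor logistic model by plain gradient ascent. The data are made up, and real software (such as the Minitab output in these slides) uses Newton-type iterations rather than this simple loop:

```python
import math

# Invented data, chosen so the classes are not perfectly separable
# (with perfect separation the MLE would not be finite).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [0, 0, 1, 0, 1]

def p_of(a, b, xi):
    """Model probability p(Y=1 | x) = 1 / (1 + exp(-(a + b*x)))."""
    return 1.0 / (1.0 + math.exp(-(a + b * xi)))

a, b = 0.0, 0.0
lr = 0.05
for _ in range(50000):
    # Gradient of the Bernoulli log-likelihood:
    # d/da = sum(y_i - p_i),  d/db = sum((y_i - p_i) * x_i)
    ga = sum(yi - p_of(a, b, xi) for xi, yi in zip(x, y))
    gb = sum((yi - p_of(a, b, xi)) * xi for xi, yi in zip(x, y))
    a += lr * ga
    b += lr * gb

print(a, b)  # b > 0 here: larger x raises the estimated probability
```

At convergence the gradient is (numerically) zero, which is exactly the first-order condition that defines the maximum likelihood estimates.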

Page 16:

INTERPRETING COEFFICIENTS

Since:

ln[p/(1-p)] = α + βX + e

The slope coefficient (β) is interpreted as the rate of change in the "log odds" as X changes, which is not very useful.

Page 17:

An interpretation of the logit coefficient that is usually more intuitive is the "odds ratio"

Since:

[p/(1-p)] = exp(α + βX)

exp(β) is the multiplicative effect of a one-unit increase in the independent variable on the odds
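This multiplicative interpretation can be checked directly: the ratio of the odds at X+1 to the odds at X equals exp(β) no matter where you start. The α and β below are arbitrary illustrative values, not estimates from the slides' data:

```python
import math

alpha, beta = -2.0, 0.5  # illustrative coefficients

def odds(xi):
    """Odds p/(1-p) at a given value of x under the logistic model."""
    p = 1.0 / (1.0 + math.exp(-(alpha + beta * xi)))
    return p / (1.0 - p)

print(odds(3.0) / odds(2.0))  # ratio of odds for a one-unit step in x
print(math.exp(beta))         # the same number: the "odds ratio"
```

Algebraically this is immediate from p/(1-p) = exp(α + βX), since exp(α + β(X+1)) / exp(α + βX) = exp(β).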

Page 18:

FROM MINITAB OUTPUT:

Although there is evidence that the estimated coefficient for Weight is not zero, the odds ratio is very close to one (1.03), indicating that a one-pound increase in weight only minimally affects a person's resting pulse rate.

Given that subjects have the same weight, the odds ratio can be interpreted as follows: the odds of smokers in the sample having a low pulse are 30% of the odds of non-smokers having a low pulse.

Logistic Regression Table

                                                  Odds    95% CI
Predictor    Coef       SE Coef    Z      P      Ratio   Lower  Upper
Constant    -1.98717    1.67930   -1.18   0.237
Smokes
  Yes       -1.19297    0.552980  -2.16   0.031   0.30    0.10   0.90
Weight       0.0250226  0.0122551  2.04   0.041   1.03    1.00   1.05

Page 19:

HYPOTHESIS TESTING

The Wald statistic for a coefficient is:

Wald = [β̂ / s.e.(β̂)]² = Z²

which is distributed chi-square with 1 degree of freedom.

The last log-likelihood from the maximum likelihood iterations is displayed along with the statistic G. This statistic tests the null hypothesis that all the coefficients associated with predictors equal zero versus the alternative that these coefficients are not all zero. In this example, G = 7.574, with a p-value of 0.023, indicating that there is sufficient evidence that at least one of the coefficients is different from zero, given that your accepted α-level is greater than 0.023.
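The Z statistics and p-values in the Minitab table can be reproduced from the reported coefficients and standard errors. This sketch computes the two-sided normal p-value for each Wald Z (using math.erfc for the normal tail), plus the p-value for G on 2 degrees of freedom, for which the chi-square survival function has the closed form exp(-G/2):

```python
import math

def wald(coef, se):
    """Wald Z and its two-sided p-value from the standard normal."""
    z = coef / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # = 2 * (1 - Phi(|z|))
    return z, p

# Coef and SE Coef copied from the slides' logistic regression table.
z_w, p_w = wald(0.0250226, 0.0122551)   # Weight
z_s, p_s = wald(-1.19297, 0.552980)     # Smokes (Yes)

# G = 7.574 on 2 df (two predictors); chi-square(2) survival = exp(-G/2).
p_g = math.exp(-7.574 / 2.0)

print(round(z_w, 2), round(p_w, 3))  # matches the table's 2.04, 0.041
print(round(z_s, 2), round(p_s, 3))  # matches the table's -2.16, 0.031
print(round(p_g, 3))                 # matches the quoted 0.023
```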

Page 20:

EVALUATING THE PERFORMANCE OF THE MODEL

The Goodness-of-Fit Tests section displays Pearson, deviance, and Hosmer-Lemeshow goodness-of-fit tests. If the p-value is less than your accepted α-level, the test rejects the null hypothesis of an adequate fit.

The goodness-of-fit tests, with p-values ranging from 0.312 to 0.724, indicate that there is insufficient evidence to claim that the model does not fit the data adequately.

Page 21:

MULTICOLLINEARITY

The presence of multicollinearity will not lead to biased coefficients.

But the standard errors of the coefficients will be inflated.

If a variable that you think should be statistically significant is not, consult the correlation coefficients.

If two variables are correlated at a rate greater than roughly .6, .7, or .8, try dropping the less theoretically important of the two.
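The correlation check suggested above can be sketched as a small helper. The data here are invented, with the second predictor constructed to be nearly a copy of the first so that the check fires:

```python
import math

def pearson_r(u, v):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 2.0, 2.9, 4.2, 5.1]  # nearly a copy of x1: high collinearity

r = pearson_r(x1, x2)
if abs(r) > 0.8:
    print(f"r = {r:.3f}: consider dropping one of the two predictors")
```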