Top Banner
1 Problems with infinite solutions in logistic regression Ian White MRC Biostatistics Unit, Cambridge UK Stata Users’ Group London, 12th September 2006
21

1 Problems with infinite solutions in logistic regression Ian White MRC Biostatistics Unit, Cambridge UK Stata Users’ Group London, 12th September 2006.

Apr 01, 2015

Download

Documents

Hamza Stickler
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: 1 Problems with infinite solutions in logistic regression Ian White MRC Biostatistics Unit, Cambridge UK Stata Users’ Group London, 12th September 2006.

1

Problems with infinite solutions in logistic regression

Ian WhiteMRC Biostatistics Unit, Cambridge

UK Stata Users’ GroupLondon, 12th September 2006

Page 2: 1 Problems with infinite solutions in logistic regression Ian White MRC Biostatistics Unit, Cambridge UK Stata Users’ Group London, 12th September 2006.

2

Introduction

• I teach logistic regression for the analysis of case-control studies to Epidemiology Master’s students, using Stata

• I stress how to work out degrees of freedom– e.g. if E has 2 levels and M has 4 levels then you get

3 d.f. for testing the E*M interaction

• Our practical uses data on 244 cases of leprosy and 1027 controls– previous BCG vaccination is the exposure of interest– level of schooling is a possible effect modifier– in what follows I’m ignoring other confounders

Page 3: 1 Problems with infinite solutions in logistic regression Ian White MRC Biostatistics Unit, Cambridge UK Stata Users’ Group London, 12th September 2006.

3

Leprosy data-> tabulation of d outcome

0=control, | 1=case | Freq. Percent Cum.------------+----------------------------------- 0 | 1,027 80.80 80.80 1 | 244 19.20 100.00------------+----------------------------------- Total | 1,271 100.00

-> tabulation of bcg exposure

BCG scar | Freq. Percent Cum.------------+----------------------------------- Absent | 743 58.46 58.46 Present | 528 41.54 100.00------------+----------------------------------- Total | 1,271 100.00

-> tabulation of school possible effect modifier

Schooling | Freq. Percent Cum.------------+----------------------------------- 0 | 282 22.19 22.19 1 | 606 47.68 69.87 2 | 350 27.54 97.40 3 | 33 2.60 100.00------------+----------------------------------- Total | 1,271 100.00

Page 4: 1 Problems with infinite solutions in logistic regression Ian White MRC Biostatistics Unit, Cambridge UK Stata Users’ Group London, 12th September 2006.

4

Main effects model. xi: logistic d i.bcg i.schooli.bcg _Ibcg_0-1 (naturally coded; _Ibcg_0 omitted)i.school _Ischool_0-3 (naturally coded; _Ischool_0 omitted)

Logistic regression Number of obs = 1271 LR chi2(4) = 97.50 Prob > chi2 = 0.0000Log likelihood = -572.86093 Pseudo R2 = 0.0784

------------------------------------------------------------------------------ d | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]-------------+---------------------------------------------------------------- _Ibcg_1 | .2908624 .0523636 -6.86 0.000 .204384 .4139314 _Ischool_1 | .7035071 .1197049 -2.07 0.039 .5040026 .9819836 _Ischool_2 | .4029998 .0888644 -4.12 0.000 .2615825 .6208704 _Ischool_3 | .09077 .0933769 -2.33 0.020 .0120863 .6816944------------------------------------------------------------------------------

. estimates store main

Page 5: 1 Problems with infinite solutions in logistic regression Ian White MRC Biostatistics Unit, Cambridge UK Stata Users’ Group London, 12th September 2006.

5

Interaction model. xi: logistic d i.bcg*i.schooli.bcg _Ibcg_0-1 (naturally coded; _Ibcg_0 omitted)i.school _Ischool_0-3 (naturally coded; _Ischool_0 omitted)i.bcg*i.school _IbcgXsch_#_# (coded as above)

Logistic regression Number of obs = 1271 LR chi2(7) = 101.43 Prob > chi2 = 0.0000Log likelihood = -570.90012 Pseudo R2 = 0.0816

------------------------------------------------------------------------------ d | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]-------------+---------------------------------------------------------------- _Ibcg_1 | .2248804 .0955358 -3.51 0.000 .0977993 .5170913 _Ischool_1 | .6626409 .1234771 -2.21 0.027 .4599012 .9547549 _Ischool_2 | .4116581 .1027612 -3.56 0.000 .2523791 .6714598 _Ischool_3 | 1.28e-08 1.42e-08 -16.41 0.000 1.46e-09 1.12e-07_IbcgXsch_~1 | 1.448862 .7046411 0.76 0.446 .5585377 3.758385_IbcgXsch_~2 | 1.086848 .6226504 0.15 0.884 .3536056 3.340553_IbcgXsch_~3 | 4.25e+07 . . . . .------------------------------------------------------------------------------Note: 17 failures and 0 successes completely determined.

. estimates store inter

Page 6: 1 Problems with infinite solutions in logistic regression Ian White MRC Biostatistics Unit, Cambridge UK Stata Users’ Group London, 12th September 2006.

6

The problem. table bcg school, by(d)

----------------------------------0=control |, 1=case |and BCG | Schoolingscar | 0 1 2 3----------+-----------------------0 | Absent | 141 257 129 17 Present | 57 229 182 15----------+-----------------------1 | Absent | 77 93 29 Present | 7 27 10 1----------------------------------

Page 7: 1 Problems with infinite solutions in logistic regression Ian White MRC Biostatistics Unit, Cambridge UK Stata Users’ Group London, 12th September 2006.

7

LR test

. xi: logistic d i.bcg i.school

LR chi2(4) = 97.50

Log likelihood = -572.86093

. estimates store main

. xi: logistic d i.bcg*i.school

LR chi2(7) = 101.43

Log likelihood = -570.90012

. estimates store inter

. lrtest main inter

Likelihood-ratio test LR chi2(2) = 3.92

(Assumption: main nested in inter) Prob > chi2 = 0.1407

Page 8: 1 Problems with infinite solutions in logistic regression Ian White MRC Biostatistics Unit, Cambridge UK Stata Users’ Group London, 12th September 2006.

8

What is Stata doing? (guess)

• Recognises the information matrix is singular• Hence reduces model df by 1• In other situations Stata drops observations

– if a single variable perfectly predicts success/failure– this happens if the problematic cell doesn’t occur in a

reference category– then Stata refuses to perform lrtest, but we can force it to do so

– Stata still gets df=2; can use df(3) option

Page 9: 1 Problems with infinite solutions in logistic regression Ian White MRC Biostatistics Unit, Cambridge UK Stata Users’ Group London, 12th September 2006.

9

. gen bcgrev=1-bcg

. xi: logistic d i.bcgrev*i.schooli.bcgrev _Ibcgrev_0-1 (naturally coded; _Ibcgrev_0 omitted)i.school _Ischool_0-3 (naturally coded; _Ischool_0 omitted)i.bcg~v*i.sch~l _IbcgXsch_#_# (coded as above)

note: _IbcgXsch_1_3 != 0 predicts failure perfectly _IbcgXsch_1_3 dropped and 17 obs not used

Logistic regression Number of obs = 1254 LR chi2(6) = 94.12 Prob > chi2 = 0.0000Log likelihood = -570.90012 Pseudo R2 = 0.0762

------------------------------------------------------------------------------ d | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]-------------+---------------------------------------------------------------- _Ibcgrev_1 | 4.446809 1.889136 3.51 0.000 1.933895 10.22502 _Ischool_1 | .9600749 .4312915 -0.09 0.928 .3980361 2.315729 _Ischool_2 | .4474097 .2307071 -1.56 0.119 .1628482 1.229215 _Ischool_3 | .5428571 .6013396 -0.55 0.581 .0619132 4.75979_IbcgXsch_~1 | .6901971 .3356713 -0.76 0.446 .2660717 1.79039_IbcgXsch_~2 | .920092 .5271167 -0.15 0.884 .2993516 2.82801------------------------------------------------------------------------------

. est store interrev

. lrtest interrev mainobservations differ: 1254 vs. 1271r(498);

. lrtest interrev main, force

Likelihood-ratio test LR chi2(2) = 3.92(Assumption: main nested in interrev) Prob > chi2 = 0.1407

Page 10: 1 Problems with infinite solutions in logistic regression Ian White MRC Biostatistics Unit, Cambridge UK Stata Users’ Group London, 12th September 2006.

10

What’s right?

• Zero cell suggests small sample so asymptotic 2

distribution may be inappropriate for LRT– true in this case: have a bcg*school category with only 1

observation– but I’m going to demonstrate the same problem in hypothetical

example with expected cell counts > 3 but a zero observed cell count

• Could combine or drop cells to get rid of zeroes– but the cell with zeroes may carry information

• Problems with testing boundary values are well known – e.g. LRT for testing zero variance component isn’t 2

1

– but here the point estimate, not the null value, is on the boundary

Page 11: 1 Problems with infinite solutions in logistic regression Ian White MRC Biostatistics Unit, Cambridge UK Stata Users’ Group London, 12th September 2006.

11

Example to explain why LRT makes some sense

. tab x y, chi2 exact

| y x | 0 1 | Total-----------+----------------------+---------- 0 | 10 20 | 30 1 | 0 10 | 10-----------+----------------------+---------- Total | 10 30 | 40

Pearson chi2(1) = 4.4444 Pr = 0.035 Fisher's exact = 0.043 1-sided Fisher's exact = 0.035

Page 12: 1 Problems with infinite solutions in logistic regression Ian White MRC Biostatistics Unit, Cambridge UK Stata Users’ Group London, 12th September 2006.

12

Model: logit P(y=1|x) = + x-2

6-2

4-2

2-2

0-1

8Lo

g lik

elih

ood

0 5 10beta

Difference in log lik = 3.4

LRT = 6.8 on 0 df?

Page 13: 1 Problems with infinite solutions in logistic regression Ian White MRC Biostatistics Unit, Cambridge UK Stata Users’ Group London, 12th September 2006.

13

Example to explore correct df using Pearson / Fisher as gold standard

. tab x y, chi2 exact

| y x | 0 1 | Total-----------+----------------------+---------- 1 | 6 0 | 6 2 | 3 6 | 9 3 | 3 6 | 9 -----------+----------------------+---------- Total | 12 12 | 24

Pearson chi2(2) = 8.0000 Pr = 0.018 Fisher's exact = 0.029

• All expected counts ≥3

• Don’t want to drop or merge category 1 - contains the evidence for association!

Page 14: 1 Problems with infinite solutions in logistic regression Ian White MRC Biostatistics Unit, Cambridge UK Stata Users’ Group London, 12th September 2006.

14

. xi: logistic y i.xi.x _Ix_1-3 (naturally coded; _Ix_1 omitted)

Logistic regression Number of obs = 24 LR chi2(2) = 10.36 Prob > chi2 = 0.0056Log likelihood = -11.457255 Pseudo R2 = 0.3113

------------------------------------------------------------------------------ y | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]-------------+---------------------------------------------------------------- _Ix_2 | 1.61e+08 1.61e+08 18.90 0.000 2.27e+07 1.14e+09 _Ix_3 | 1.61e+08 . . . . .------------------------------------------------------------------------------Note: 6 failures and 0 successes completely determined.

. est store x

. xi: logistic y

Logistic regression Number of obs = 24 LR chi2(0) = 0.00 Prob > chi2 = .Log likelihood = -16.635532 Pseudo R2 = 0.0000

------------------------------------------------------------------------------ y | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]-------------+----------------------------------------------------------------------------------------------------------------------------------------------

. est store null

Page 15: 1 Problems with infinite solutions in logistic regression Ian White MRC Biostatistics Unit, Cambridge UK Stata Users’ Group London, 12th September 2006.

15

LRT

. xi: logistic y i.x

Log likelihood = -11.457255

. est store x

. xi: logistic y

Log likelihood = -16.635532

. est store null

. lrtest x null

Likelihood-ratio test LR chi2(1) = 10.36(Assumption: null nested in x) Prob > chi2 = 0.0013

Page 16: 1 Problems with infinite solutions in logistic regression Ian White MRC Biostatistics Unit, Cambridge UK Stata Users’ Group London, 12th September 2006.

16

Comparison of tests | y x | 0 1 | Total-----------+----------------------+---------- 1 | 6 0 | 6 2 | 3 6 | 9 3 | 3 6 | 9 -----------+----------------------+---------- Total | 12 12 | 24

Pearson chi2(2) = 8.0000 P = 0.018 Fisher's exact = P = 0.029 LR chi2(1) = 10.36 P = 0.0013 (using 2df: P = 0.0056)

Clearly LRT isn’t great. But 1df is even worse than 2df

Page 17: 1 Problems with infinite solutions in logistic regression Ian White MRC Biostatistics Unit, Cambridge UK Stata Users’ Group London, 12th September 2006.

17

Note

• In this example, we could use Pearson / Fisher as gold standard.

• Can’t do this in more complex examples (e.g. adjust for several covariates).

Page 18: 1 Problems with infinite solutions in logistic regression Ian White MRC Biostatistics Unit, Cambridge UK Stata Users’ Group London, 12th September 2006.

18

My proposal for Stata

• lrtest appears to adjust df for infinite parameter estimates: it should not

• Model df should be incremented to allow for any variables dropped because they perfectly predict success/failure– Don’t need to increment log lik as it is 0 for

the cases dropped

• Can the ad hoc handling of zeroes by xi:logistic be improved?

Page 19: 1 Problems with infinite solutions in logistic regression Ian White MRC Biostatistics Unit, Cambridge UK Stata Users’ Group London, 12th September 2006.

19

Conclusions for statisticians

• Must remember the 2 approximation is still poor for these LRTs– typically anti-conservative? (Kuss, 2002)

• Performance of LRT can be improved by using penalised likelihood (Firth, 1993; Bull, 2006) - like a mildly informative prior– worth using routinely?

• Gold standard: Bayes or exact logistic regression (logXact)?

Page 20: 1 Problems with infinite solutions in logistic regression Ian White MRC Biostatistics Unit, Cambridge UK Stata Users’ Group London, 12th September 2006.

20

The end

Page 21: 1 Problems with infinite solutions in logistic regression Ian White MRC Biostatistics Unit, Cambridge UK Stata Users’ Group London, 12th September 2006.

21

Output for example with 2-level x. logit y x

Log likelihood = -19.095425

------------------------------------------------------------------------------ y | Coef. Std. Err. z P>|z| [95% Conf. Interval]-------------+---------------------------------------------------------------- _cons | .6931472 .3872983 1.79 0.074 -.0659436 1.452238------------------------------------------------------------------------------

. estimates store x

. logit y

Log likelihood = -22.493406

------------------------------------------------------------------------------ y | Coef. Std. Err. z P>|z| [95% Conf. Interval]-------------+---------------------------------------------------------------- _cons | 1.098612 .3651484 3.01 0.003 .3829346 1.81429------------------------------------------------------------------------------

. estimates store null

. lrtest x nulldf(unrestricted) = df(restricted) = 1r(498);

. lrtest x null, force df(1)

Likelihood-ratio test LR chi2(1) = 6.80(Assumption: null nested in x) Prob > chi2 = 0.0091