© Department of Statistics 2012 STATS 330 Lecture 25: Slide 1 Stats 330: Lecture 25.

Stats 330: Lecture 25


Plan of the day

In today’s lecture we discuss prediction and present a logistic regression case study. Topics covered will be

• Prediction in logistic regression

• In-sample and out-of-sample error rates

• Cross-validation and bootstrap estimates of error rates

• Sensitivity and specificity

• ROC curves

Then, a case study


Housekeeping

• Error in slide 34 in lecture 23:

Function is now influenceplots

• Bug in ROC.curve – download replacement from web page


Prediction

• Suppose we have fitted a logistic model and we want to use the model to predict new cases. If a new case presents with explanatory variables x, how do we predict the y-value, 0 or 1?

• Work out the estimated log-odds for the case:

$$\widehat{\text{log-odds}} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_k x_k$$

• Work out the probability: Prob = exp(log-odds)/(1 + exp(log-odds))

• Predict

– Y=1 if prob >= 0.5 (equivalently log-odds >= 0)

– Y=0 if prob < 0.5 (equivalently log-odds < 0)
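The prediction steps above can be sketched in a few lines of R. The coefficients and the new case here are made-up values chosen for illustration; they are not estimates from any model in this lecture.

```r
# Hypothetical estimated coefficients (intercept, x1, x2): illustrative values only
beta.hat <- c(-1, 0.5, 0.25)
# Hypothetical new case: leading 1 for the intercept, then x1 = 2, x2 = 4
x.new <- c(1, 2, 4)

# Step 1: estimated log-odds
log.odds <- sum(beta.hat * x.new)
# Step 2: convert the log-odds to a probability
prob <- exp(log.odds) / (1 + exp(log.odds))
# Step 3: predict 1 if the probability is at least 0.5, otherwise 0
y.pred <- ifelse(prob >= 0.5, 1, 0)
```

The built-in plogis(log.odds) returns the same probability as the explicit formula in step 2.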


Estimating the prediction error

• Prediction error is the probability of a wrong classification (0’s predicted as 1’s, 1’s predicted as 0’s)

• As in linear regression, using the training data to estimate this probability tends to give an optimistic estimate

• We can use cross-validation or the bootstrap to improve the estimate: see the case study


Sensitivity and specificity

• Sensitivity: probability of predicting a 1 when the case is truly a 1: the “true positive rate”

• Specificity: probability of predicting a 0 when the case is truly a 0: the “true negative rate” (1-specificity is called the “false positive rate”)

• Ideally, we want both to be close to 1

• We would like to know what these would be for new data: use cross-validation and the bootstrap, as for normal regression


Calculating sensitivity and specificity

                        Model predicts
                    Failure (0)   Success (1)
Actual  Failure (0)     100           200
        Success (1)     250           600

Specificity = 100/(100+200) = 33%

Sensitivity = 600/(600+250) = 71%

In-sample error rate = (200+250)/1150 = 39%
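The same arithmetic can be done in R from the table of counts. This is a small sketch; the object names (conf, specificity and so on) are chosen here for illustration.

```r
# Counts from the table above: rows are actual classes, columns are predicted classes
conf <- matrix(c(100, 200,
                 250, 600),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

specificity <- conf["0", "0"] / sum(conf["0", ])   # true negatives / actual failures
sensitivity <- conf["1", "1"] / sum(conf["1", ])   # true positives / actual successes
error.rate  <- (conf["0", "1"] + conf["1", "0"]) / sum(conf)   # wrong / all cases

round(c(specificity = specificity, sensitivity = sensitivity, error.rate = error.rate), 3)
```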


ROC curves

• We have predicted a “success” (Y=1) if the log-odds are positive.

• We can generalize this to predict a success if log-odds >= c for some constant c

• If c is large and negative, almost every case will be predicted as a success (1)
– Sensitivity close to 1, specificity close to 0

• If c is large and positive, almost every case will be predicted as a failure (0)
– Sensitivity close to 0, specificity close to 1

• Allows a trade-off between sensitivity and specificity

• As c varies, the sensitivity and specificity change.

• ROC curve is a plot of the points (1-specificity, sensitivity) as c changes.
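To see how the curve is built, here is a hand-rolled sketch in base R. It uses simulated data and a hypothetical helper roc.point; it is not the ROC.curve function from the R330 package, whose internals are not shown in these slides.

```r
set.seed(330)
# Simulated data (an assumption for illustration): one covariate with a real effect
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))
fit <- glm(y ~ x, family = binomial)
log.odds <- predict(fit)   # fitted values on the log-odds scale

# Predict a success when log-odds >= cutoff; return one point of the ROC curve
roc.point <- function(cutoff) {
  pred <- as.numeric(log.odds >= cutoff)
  sens <- sum(pred == 1 & y == 1) / sum(y == 1)
  spec <- sum(pred == 0 & y == 0) / sum(y == 0)
  c(fpr = 1 - spec, tpr = sens)
}

# Sweep the cutoff from +Inf (everything predicted 0) to -Inf (everything predicted 1)
cuts <- c(Inf, sort(unique(log.odds), decreasing = TRUE), -Inf)
roc <- t(sapply(cuts, roc.point))

# Area under the curve by the trapezoidal rule
auc <- sum(diff(roc[, "fpr"]) *
           (head(roc[, "tpr"], -1) + tail(roc[, "tpr"], -1)) / 2)

plot(roc[, "fpr"], roc[, "tpr"], type = "l",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)   # reference line for a predictor that is no help
```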


[Figure: ROC curve, true positive rate against false positive rate (both axes from 0 to 1). AUC = 0.6567]


ROC curves - cont

[Figure: three ROC curves, each plotting true positive rate against false positive rate. Panel titles: Perfect prediction; Worst case prediction; Predictor no help]


Area under the curve

• For a perfect predictor, the area under the ROC curve (AUC) is 1.

• If the predictor is independent of the response, the sensitivity equals 1 - specificity for every c, so the ROC curve follows the diagonal and the AUC is 0.5.

• The AUC serves as a measure of how good the model is at predicting.


Case study

The data come from the University of Massachusetts AIDS Research Unit IMPACT study, a medical study performed in the US in the early 1990s. The study aimed to evaluate two different treatments for drug addiction.

Reference: Hosmer and Lemeshow, Applied Logistic Regression (2nd Ed), p28


List of variables

Variable description               Codes/Values                          Name
Identification Code                1-575                                 ID
Age at Enrollment                  Years                                 AGE
Beck Depression Score              0-54                                  BECK
IV Drug Use History at Admission   1 = Never, 2 = Previous, 3 = Recent   IVHX
No of prior Treatments             0-40                                  NDRUGTX
Subject's Race                     0 = White, 1 = Other                  RACE
Treatment Duration                 0 = Short, 1 = Long                   TREAT
Treatment Site                     0 = A, 1 = B                          SITE
Remained Drug Free                 1 = Yes, 0 = No                       DFREE


The variables

• The response DFREE is binary: records if subject is drug-free after conclusion of treatment.

• There is a mix of categorical and continuous explanatory variables

• Categorical: IVHX, RACE, TREAT, SITE

• Continuous: AGE, BECK, NDRUGTX.


Questions

• Is the longer treatment more effective?

• Did Site A deliver the program more effectively than site B?

• What other variables have an effect on successful rehabilitation of addicts?

• Can we predict who is likely to be drug-free in 12 months?


Analysis strategy

• Preliminary plots, tables

• Variable selection

• Model fitting

• Interpretation of coefficients

• Evaluation as a predictor of recovery from addiction.


Preliminary plots

[Figure: two scatterplots against Age (20 to 55), with points labelled by number and coloured red for drug free and blue for relapse. First panel: Beck score (0 to 50) against Age. Second panel: Number of Prior Treatments (0 to 40) against Age.]


Preliminary plots (2)

[Figure: estimated probability of being drug free, plotted by level of three factors. IVHX: Never, Previous, Recent. TREAT: Short, Long. SITE: A, B.]


Preliminary plots (3)

• Seems like the number of previous drug treatments has an effect

• Seems like factors IVHX (Recent IV drug use), SITE (Site A or Site B) and TREAT (short or long treatment) have an effect


Preliminary fits (1)

Call:

glm(formula = DFREE ~ . - IVHX - ID + factor(IVHX), family = binomial,

data = drug.df)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.3465 -0.8091 -0.6326 1.1834 2.4231

Don’t include ID!


Preliminary fits (2)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)   -2.4111283  0.5983427  -4.030 5.59e-05 ***
AGE            0.0504143  0.0174057   2.896  0.00377 **
BECK           0.0002759  0.0107982   0.026  0.97961
NDRUGTX       -0.0615329  0.0256441  -2.399  0.01642 *
RACE           0.2260262  0.2233685   1.012  0.31159
TREAT          0.4424802  0.1992922   2.220  0.02640 *
SITE           0.1489209  0.2176062   0.684  0.49375
factor(IVHX)2 -0.6036962  0.2875974  -2.099  0.03581 *
factor(IVHX)3 -0.7336591  0.2549893  -2.877  0.00401 **

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 653.73 on 574 degrees of freedom
Residual deviance: 619.25 on 566 degrees of freedom
AIC: 637.25

Number of Fisher Scoring iterations: 4


Preliminary conclusions

• Important variables seem to be AGE, NDRUGTX, TREAT, IVHX

• Data are ungrouped, can’t assess goodness of fit with residual deviance

• No extremely large residuals


Hosmer-Lemeshow test

> HLstat(drug.glm)

Value of HL statistic = 5.05

P-value = 0.752

No evidence of a bad fit using this test
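For readers who want to see what a statistic of this kind computes, here is a generic sketch of a Hosmer-Lemeshow calculation on simulated data. It assumes the common version with 10 groups formed from deciles of the fitted probabilities and a chi-squared reference distribution on 10 - 2 degrees of freedom; the HLstat function used above may differ in its details.

```r
set.seed(1)
# Simulated data (an assumption for illustration), same size as the case study
n <- 575
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 0.5 * x))
fit <- glm(y ~ x, family = binomial)
p.hat <- fitted(fit)

# Group cases by deciles of the fitted probabilities
g <- 10
grp <- cut(p.hat,
           breaks = quantile(p.hat, probs = seq(0, 1, length.out = g + 1)),
           include.lowest = TRUE)

obs  <- tapply(y, grp, sum)        # observed number of successes per group
expd <- tapply(p.hat, grp, sum)    # expected number of successes per group
n.g  <- tapply(y, grp, length)     # group sizes

# Compare observed and expected counts, scaled by the binomial variance in each group
HL <- sum((obs - expd)^2 / (expd * (1 - expd / n.g)))
p.value <- pchisq(HL, df = g - 2, lower.tail = FALSE)
```

A large p-value, as in the output above, gives no evidence that observed and expected counts disagree.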


Variable selection (1)

> anova(drug.glm, test="Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

Response: DFREE

Terms added sequentially (first to last)

             Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL                           574     653.73
AGE           1     1.40       573     652.33 0.24
BECK          1     0.57       572     651.76 0.45
NDRUGTX       1    14.07       571     637.69 0.0001760
RACE          1     3.06       570     634.63 0.08
TREAT         1     4.96       569     629.67 0.03
SITE          1     1.07       568     628.60 0.30
factor(IVHX)  2     9.35       566     619.25 0.01


Variable selection (2)

Step: AIC= 632.59
DFREE ~ NDRUGTX + IVHX + AGE + TREAT

Call: glm(formula = DFREE ~ NDRUGTX + IVHX + AGE + TREAT,

family = binomial, data = drug.df)

Degrees of Freedom: 574 Total (i.e. Null); 569 Residual

Null Deviance: 653.7 Residual Deviance: 620.6 AIC: 632.6


Sub-model

> sub.glm<-glm(DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT, family=binomial, data=drug.df)
> summary(sub.glm)

Call:
glm(formula = DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT, family = binomial, data = drug.df)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.2598 -0.8051 -0.6284  1.1401  2.4574


Sub-model (ii)

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)    -2.33276    0.54838  -4.254  2.1e-05 ***
NDRUGTX        -0.06376    0.02563  -2.488 0.012844 *
factor(IVHX)2  -0.62366    0.28470  -2.191 0.028484 *
factor(IVHX)3  -0.80561    0.24453  -3.294 0.000986 ***
AGE             0.05259    0.01721   3.056 0.002244 **
TREAT           0.45134    0.19860   2.273 0.023048 *

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 653.73 on 574 degrees of freedom
Residual deviance: 620.59 on 569 degrees of freedom
AIC: 632.59

Number of Fisher Scoring iterations: 4

All variables significant, but use caution


Do we need interaction terms?

> sub.glm<-glm(DFREE ~ NDRUGTX + IVHX + AGE + TREAT, family=binomial, data=drug.df)

> sub2.glm<-glm(DFREE ~ NDRUGTX*IVHX + AGE*IVHX + AGE*TREAT + NDRUGTX*TREAT, family=binomial, data=drug.df)   # model with interactions

> anova(sub.glm, sub2.glm, test="Chisq")
Analysis of Deviance Table

Model 1: DFREE ~ NDRUGTX + IVHX + AGE + TREAT
Model 2: DFREE ~ NDRUGTX * IVHX + AGE * IVHX + AGE * TREAT + NDRUGTX * TREAT
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1       569     620.59
2       563     616.16  6     4.42      0.62

Big p-value, so interactions are not required
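The p-value in the analysis of deviance table comes from comparing the drop in residual deviance with a chi-squared distribution on the difference in degrees of freedom. A minimal sketch using only the rounded deviances printed in the table (so the last digit can differ slightly from R's own output):

```r
dev1 <- 620.59; df1 <- 569   # Model 1: main effects only
dev2 <- 616.16; df2 <- 563   # Model 2: with interactions

dev.drop <- dev1 - dev2      # drop in deviance from adding the interactions
df.drop  <- df1 - df2        # number of extra parameters in Model 2
p.value  <- pchisq(dev.drop, df = df.drop, lower.tail = FALSE)
round(p.value, 2)
```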


Do we need to transform?

par(mfrow=c(1,2))
sub.gam<-gam(DFREE ~ s(NDRUGTX) + factor(IVHX) + s(AGE) + TREAT, family=binomial, data=drug.df)
plot(sub.gam)

[Figure: estimated smooth terms from the gam fit. Left panel: s(NDRUGTX,1.52) against NDRUGTX (0 to 40). Right panel: s(AGE,1) against AGE (20 to 55).]


Transforming

• Suggests a possible quadratic in NDRUGTX:

> subq.glm<-glm(DFREE ~ poly(NDRUGTX,2) + IVHX + AGE + TREAT, family=binomial, data=drug.df)
> summary(subq.glm)

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)        -2.70763    0.56715  -4.774 1.80e-06 ***
poly(NDRUGTX, 2)1  -7.24501    2.93274  -2.470  0.01350 *
poly(NDRUGTX, 2)2   4.21319    2.69624   1.563  0.11814
IVHX2              -0.59018    0.28608  -2.063  0.03912 *
IVHX3              -0.76015    0.24725  -3.074  0.00211 **
AGE                 0.05458    0.01730   3.154  0.00161 **
TREAT               0.44379    0.19904   2.230  0.02577 *

But the quadratic term is not significant, so we stick with no transformation


Diagnostics

[Figure: diagnostic plots. First set of panels: Residuals vs Fitted, Normal Q-Q plot, Scale-Location plot, Cook's distance plot; observations 7, 350 and 471 are labelled. Second set of panels: Index plot of deviance residuals, Leverage plot, Cook's Distance Plot, Deviance Changes Plot. Annotations on the slide pick out point 85 and points 7 and 471.]


Influence of 7, 471, 85

Effect on coefficients of removing cases:

              None       7      85     471     All
(Intercept) -2.333  -2.295  -2.222  -2.447  -2.293
NDRUGTX     -0.064  -0.084  -0.065  -0.075  -0.100
IVHX2       -0.624  -0.595  -0.635  -0.680  -0.662
IVHX3       -0.806  -0.783  -0.785  -0.795  -0.747
AGE          0.053   0.053   0.049   0.057   0.054
TREAT        0.451   0.434   0.441   0.479   0.450

None seem particularly influential! We will not delete them


Over-dispersion

> qsub.glm<-glm(DFREE ~ NDRUGTX + IVHX + AGE + TREAT, family=quasibinomial, data=drug.df)
> summary(qsub.glm)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.33276    0.55435  -4.208 2.99e-05 ***
NDRUGTX     -0.06376    0.02591  -2.461  0.01414 *
IVHX2       -0.62366    0.28780  -2.167  0.03065 *
IVHX3       -0.80561    0.24720  -3.259  0.00118 **
AGE          0.05259    0.01740   3.023  0.00262 **
TREAT        0.45134    0.20076   2.248  0.02495 *
---
(Dispersion parameter for quasibinomial family taken to be 1.021892)

Very close to 1, so no overdispersion
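The dispersion estimate reported by the quasibinomial fit is the Pearson chi-squared statistic divided by the residual degrees of freedom. The sketch below checks this on simulated data (an assumption for illustration; the variable names are not from the case study).

```r
set.seed(7)
# Simulated binary data with no over-dispersion built in
n <- 400
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 0.8 * x))

# Dispersion estimate by hand from the ordinary binomial fit
fit <- glm(y ~ x, family = binomial)
phi.hat <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)

# The same quantity as reported by the quasibinomial summary
qfit <- glm(y ~ x, family = quasibinomial)
phi.quasi <- summary(qfit)$dispersion

c(by.hand = phi.hat, quasibinomial = phi.quasi)
```

The two values should agree closely; both fits give identical coefficient estimates, and only the standard errors are rescaled by the square root of the dispersion.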


Interpretation

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.33276    0.54838  -4.254  2.1e-05 ***
NDRUGTX     -0.06376    0.02563  -2.488 0.012844 *
IVHX2       -0.62366    0.28470  -2.191 0.028484 *
IVHX3       -0.80561    0.24453  -3.294 0.000986 ***
AGE          0.05259    0.01721   3.056 0.002244 **
TREAT        0.45134    0.19860   2.273 0.023048 *

• As the number of prior treatments goes up, the probability of a drug-free recovery goes down

• The probability of a drug-free recovery for persons with no IV drug use is more than for persons with previous IV drug use

• The probability of a drug-free recovery for persons with previous IV drug use is more than for persons with recent IV drug use.

• The probability of a drug-free recovery goes up with age

• The probability of a drug-free recovery is higher for the long treatment
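The coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to interpret. A short sketch using the estimates and standard errors from the coefficient table; the approximate 95% Wald intervals are an addition here, not part of the lecture output, and they do not allow for the model selection discussed next.

```r
# Estimates and standard errors copied from the coefficient table above
est <- c(NDRUGTX = -0.06376, IVHX2 = -0.62366, IVHX3 = -0.80561,
         AGE = 0.05259, TREAT = 0.45134)
se  <- c(NDRUGTX = 0.02563, IVHX2 = 0.28470, IVHX3 = 0.24453,
         AGE = 0.01721, TREAT = 0.19860)

odds.ratio <- exp(est)                       # multiplicative effect on the odds
ci.lower   <- exp(est - qnorm(0.975) * se)   # approximate 95% Wald interval
ci.upper   <- exp(est + qnorm(0.975) * se)
round(cbind(odds.ratio, ci.lower, ci.upper), 3)
```

For example, exp(0.45134) is about 1.57, so the odds of remaining drug free are estimated to be about 57% higher on the long treatment, holding the other variables fixed.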


Interpreting p-values after model selection

• We have seen that taking p-values at face value after model selection is not valid, as model selection changes the distribution of the estimated coefficients.

• We can use the bootstrap to examine the revised distribution

• Leave TREAT in the model, use forward selection to select the other variables.


Procedure

• Draw a bootstrap sample

• Do forward selection, record value of regression coef for TREAT (forced to be in every model)

• Repeat 200 times, draw histogram of the results


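R code

# bootstrap the coefficient of TREAT under forward selection
n = dim(drug.df)[1]
B = 200
beta.boot = numeric(B)
full.formula = DFREE ~ NDRUGTX + factor(IVHX) + AGE + BECK + TREAT + RACE + SITE
for(b in 1:B){
  # bootstrap sample: ni[i] is the number of times case i is drawn
  ni = rmultinom(1, n, prob=rep(1/n,n))
  newdata = drug.df[rep(1:n,ni),]
  # forward selection must start from the smallest model (TREAT only);
  # starting from the full model leaves nothing to add
  start.glm = glm(DFREE ~ TREAT, family=binomial, data=newdata)
  chosen = step(start.glm, scope=list(lower = DFREE ~ TREAT, upper = full.formula),
                direction = "forward", trace=0)
  # TREAT is forced into every model, so its coefficient is always present
  beta.boot[b] = coef(chosen)["TREAT"]
}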


Histogram

[Figure: histogram of beta.boot; horizontal axis from -0.2 to 1.0, frequency from 0 to 20]

> mean(beta.boot)
[1] 0.4540209
> sd(beta.boot)
[1] 0.2019222
> z.val = mean(beta.boot)/ sd(beta.boot)
> 2*(1-pnorm(z.val))
[1] 0.02454468

Compare

Beta = 0.45134
SE = 0.19860
P-value = 0.023048


Prediction

• Sensitivity: chance the model predicts a successful recovery (drug-free at end of program), when one will actually occur

• Specificity: chance the model predicts a failure (return to drug use before end of program), when one actually will occur


R code

> sub.glm<-glm(DFREE ~ NDRUGTX + IVHX + AGE + TREAT, family=binomial, data=drug.df)
> pred = predict(sub.glm, type="response")
> predcode = ifelse(pred<0.5, 0,1)
> table(drug.df$DFREE,predcode)

           predicted
Actual       0     1
     0     426     2
     1     144     3

Sensitivity = 3/147 = 0.02040816
Specificity = 426/428 = 0.9953271
Error rate = 146/575 = 0.2539130
Proportion correctly classified = 429/575 = 0.746087


ROC curve

ROC.curve(DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT, data= drug.df) # in the R330 package


Prediction (2)

• Use 10-fold cross-validation
– Split data into 10 parts
– Calculate sensitivity and specificity for each part, using model fitted to the remaining parts
– Average results
– Repeat for different splits, average repeats

• E.g. for one part
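The cross-validation steps listed above can be sketched in base R as follows, for a single random split into 10 parts. The data are simulated and the names (sim.df, fold and so on) are assumptions for illustration; this is not the cross.val function from the R330 package.

```r
set.seed(25)
# Simulated data standing in for a real data set
n <- 500
sim.df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim.df$y <- rbinom(n, 1, plogis(0.3 + 1.2 * sim.df$x1 - 0.8 * sim.df$x2))

# Split the data into 10 parts of equal size, at random
k <- 10
fold <- sample(rep(1:k, length.out = n))

sens <- spec <- numeric(k)
for (j in 1:k) {
  train <- sim.df[fold != j, ]   # fit to the remaining parts
  test  <- sim.df[fold == j, ]   # evaluate on the part left out
  fit <- glm(y ~ x1 + x2, family = binomial, data = train)
  pred <- as.numeric(predict(fit, newdata = test, type = "response") >= 0.5)
  sens[j] <- sum(pred == 1 & test$y == 1) / sum(test$y == 1)
  spec[j] <- sum(pred == 0 & test$y == 0) / sum(test$y == 0)
}

# Average the results over the 10 parts
cv.sens <- mean(sens)
cv.spec <- mean(spec)
c(sensitivity = cv.sens, specificity = cv.spec)
```

Repeating the whole calculation with different random splits and averaging the repeats reduces the dependence on any one split.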


CV and bootstrap Results

> cross.val(DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT, drug.df)

Mean Specificity = 0.9908424

Mean Sensitivity = 0.02854005

Mean Correctly classified = 0.7446491

> err.boot(DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT, data= drug.df)

$err
[1] 0.2539130   (training set estimate)

$Err
[1] 0.2552974   (bootstrap estimate)

A poor classifier, but this doesn’t mean that the model fits poorly: there are very few cases with fitted probabilities over 0.5, and many with fitted probabilities between 0.2 and 0.5. We expect a moderate number of these to be misclassified, as some events (being drug free) with probabilities 0.2 to 0.5 have occurred.


Overall conclusions

• Model seems to fit well

• Strong evidence that longer treatments are better

• No apparent difference between sites

• Age and prior IV drug use affect recovery

• Model predicts poorly for the covariates in the data set – effectively always predicts patients will not be drug free
