© Department of Statistics 2012 STATS 330 Lecture 25: Slide 1
Stats 330: Lecture 25
Plan of the day
In today's lecture we discuss prediction and present a logistic regression case study. Topics covered will be
• Prediction in logistic regression
• In-sample and out-of-sample error rates
• Cross-validation and bootstrap estimates of error rates
• Sensitivity and specificity
• ROC curves
Then, a case study
Housekeeping
• Error in slide 34 in lecture 23:
Function is now influenceplots
• Bug in ROC.curve – download replacement from web page
Prediction
• Suppose we have fitted a logistic model and we want to use the model to predict new cases. If a new case presents with explanatory variables x, how do we predict the y-value, 0 or 1?
• Work out the estimated log-odds for the case:
  $\text{log-odds} = \hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_k x_k$
• Work out the probability: Prob = exp(log-odds)/(1 + exp(log-odds))
• Predict
  – Y=1 if prob >= 0.5 (equivalently log-odds >= 0)
  – Y=0 if prob < 0.5 (equivalently log-odds < 0)
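The prediction steps above can be sketched in R as follows. The coefficient values and the new case are made up for illustration; with a fitted glm object one would normally call predict() rather than compute the log-odds by hand.

```r
# Hypothetical fitted coefficients (intercept, x1, x2); illustration only
beta.hat <- c(-1.5, 0.8, 0.3)
# Explanatory variables for a new case (values are assumptions)
x.new <- c(x1 = 2.0, x2 = -1.0)

# Estimated log-odds: beta0 + beta1*x1 + beta2*x2
log.odds <- beta.hat[1] + sum(beta.hat[-1] * x.new)
# Convert to a probability; plogis(z) is exp(z)/(1 + exp(z))
prob <- plogis(log.odds)
# Predict 1 if prob >= 0.5, equivalently if log.odds >= 0
y.pred <- ifelse(prob >= 0.5, 1, 0)
c(log.odds = log.odds, prob = prob, y.pred = y.pred)
```

Here the log-odds are negative, so the probability is below 0.5 and the case is predicted to be a 0.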
Estimating the prediction error
• Prediction error is the probability of a wrong classification (0's predicted as 1's, or 1's predicted as 0's)
• As in linear regression, using the training data to estimate this probability tends to give an optimistic estimate
• We can use cross-validation or the bootstrap to improve the estimate (see the case study)
Sensitivity and specificity
• Sensitivity: probability of predicting a 1 when the case is truly a 1: the “true positive rate”
• Specificity: probability of predicting a 0 when the case is truly a 0: the “true negative rate” (1-specificity is called the “false positive rate”)
• Ideally, want both to be close to 1
• We would like to know what these would be for new data – use cross-validation and the bootstrap as for normal regression
Calculating sensitivity and specificity
                        Model predicts
                        Failure (0)   Success (1)
Actual   Failure (0)        100           200
         Success (1)        250           600
Specificity = 100/(100+200) = 33%
Sensitivity = 600/(600+250) = 71%
In-sample error rate = (200+250)/1150 = 39%
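The calculation from the table of counts above can be reproduced in R. This is a small sketch using only those counts; the object names are illustrative.

```r
# Counts from the table above: rows = actual, columns = predicted
tab <- matrix(c(100, 200,
                250, 600),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

specificity <- tab["0", "0"] / sum(tab["0", ])   # true negatives / actual 0's
sensitivity <- tab["1", "1"] / sum(tab["1", ])   # true positives / actual 1's
error.rate  <- (tab["0", "1"] + tab["1", "0"]) / sum(tab)
round(c(specificity = specificity, sensitivity = sensitivity,
        error.rate = error.rate), 3)
```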
ROC curves
• We have predicted a "success" (Y=1) if the log-odds are positive.
• We can generalize this to predict a success if log-odds >= c for some constant c
• If c is large and negative, almost every case will be predicted as a success (1)
  – Sensitivity close to 1, specificity close to 0
• If c is large and positive, almost every case will be predicted as a failure (0)
  – Sensitivity close to 0, specificity close to 1
• Allows a trade-off between sensitivity and specificity
• As c varies, the sensitivity and specificity change.
• ROC curve is a plot of the points (1-specificity, sensitivity) as c changes.
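A minimal sketch of how the ROC points arise as the cutoff c varies. The data are simulated (an assumption for illustration, not the case-study data), and the function name roc.point is hypothetical; this is not the ROC.curve function from the R330 package.

```r
set.seed(330)
# Simulated data (an assumption for illustration)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))
fit <- glm(y ~ x, family = binomial)
log.odds <- predict(fit)              # default type = "link" gives log-odds

# False and true positive rates when predicting 1 if log-odds >= cutoff
roc.point <- function(cutoff) {
  pred <- as.numeric(log.odds >= cutoff)
  c(fpr = mean(pred[y == 0] == 1),    # 1 - specificity
    tpr = mean(pred[y == 1] == 1))    # sensitivity
}

cutoffs <- seq(-4, 4, by = 0.1)
roc <- t(sapply(cutoffs, roc.point))  # one row per cutoff
plot(roc[, "fpr"], roc[, "tpr"], type = "l",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)                 # reference line: predictor no help
```

Raising the cutoff can only shrink the set of cases predicted as 1, so both rates fall together as c increases.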
[Figure: ROC curve, true positive rate against false positive rate. AUC = 0.6567]
ROC curves - cont
[Figure: three ROC curves (true positive rate against false positive rate), labelled "Perfect prediction", "Worst case prediction" and "Predictor no help"]
Area under the curve
• For a perfect predictor, the area under the ROC curve (AUC) is 1.
• If the predictor is independent of the response, the ROC curve lies along the diagonal (sensitivity = 1 - specificity) and the AUC is 0.5.
• The AUC serves as a measure of how good the model is at predicting.
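One way to compute the AUC without drawing the curve uses the fact that it equals the proportion of (1, 0) pairs in which the case that is truly a 1 receives the higher score. The sketch below uses made-up scores; the function name auc is illustrative and is not part of the R330 package.

```r
# AUC as the proportion of (1, 0) pairs in which the case that is truly 1
# has the higher score; ties count one half
auc <- function(score, y) {
  s1 <- score[y == 1]
  s0 <- score[y == 0]
  mean(outer(s1, s0, ">") + 0.5 * outer(s1, s0, "=="))
}

y <- c(0, 0, 0, 1, 1, 1)
auc(c(0.1, 0.2, 0.3, 0.7, 0.8, 0.9), y)   # perfect prediction: 1
auc(c(0.9, 0.8, 0.7, 0.3, 0.2, 0.1), y)   # worst case, ranking reversed: 0
auc(rep(0.5, 6), y)                       # predictor no help: 0.5
```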
Case study
The data come from the University of Massachusetts AIDS Research Unit IMPACT study, a medical study performed in the US in the early 1990s. The study aimed to evaluate two different treatments for drug addiction.
Reference: Hosmer and Lemeshow, Applied Logistic Regression (2nd Ed), p28
List of variables
Description                        Codes/Values                          Name
Identification Code                1-575                                 ID
Age at Enrollment                  Years                                 AGE
Beck Depression Score              0-54                                  BECK
IV Drug Use History at Admission   1 = Never, 2 = Previous, 3 = Recent   IVHX
No of prior Treatments             0-40                                  NDRUGTX
Subject's Race                     0 = White, 1 = Other                  RACE
Treatment Duration                 0 = Short, 1 = Long                   TREAT
Treatment Site                     0 = A, 1 = B                          SITE
Remained Drug Free                 1 = Yes, 0 = No                       DFREE
The variables
• The response DFREE is binary: records if subject is drug-free after conclusion of treatment.
• There is a mix of categorical and continuous explanatory variables
• Categorical: IVHX, RACE, TREAT, SITE
• Continuous: AGE, BECK, NDRUGTX.
Questions
• Is the longer treatment more effective?
• Did Site A deliver the program more effectively than site B?
• What other variables have an effect on successful rehabilitation of addicts?
• Can we predict who is likely to be drug-free in 12 months?
Analysis strategy
• Preliminary plots, tables
• Variable selection
• Model fitting
• Interpretation of coefficients
• Evaluation as a predictor of recovery from addiction.
Preliminary plots
[Figure: scatter plots of Beck score against age and of number of prior treatments against age, with points labelled by number. Red: drug free; blue: relapse]
Preliminary plots (2)
[Figure: estimated probability of being drug free plotted against IVHX (Never, Previous, Recent), TREAT (Short, Long) and SITE (A, B)]
Preliminary plots (3)
• It seems that the number of previous drug treatments has an effect
• It seems that the factors IVHX (IV drug use history), SITE (Site A or Site B) and TREAT (short or long treatment) have an effect
Preliminary fits (1)
Call:
glm(formula = DFREE ~ . - IVHX - ID + factor(IVHX), family = binomial,
data = drug.df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.3465 -0.8091 -0.6326 1.1834 2.4231
Don't include ID as an explanatory variable! (The "- ID" term in the formula drops it.)
Preliminary fits (2)
Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)   -2.4111283  0.5983427  -4.030 5.59e-05 ***
AGE            0.0504143  0.0174057   2.896  0.00377 **
BECK           0.0002759  0.0107982   0.026  0.97961
NDRUGTX       -0.0615329  0.0256441  -2.399  0.01642 *
RACE           0.2260262  0.2233685   1.012  0.31159
TREAT          0.4424802  0.1992922   2.220  0.02640 *
SITE           0.1489209  0.2176062   0.684  0.49375
factor(IVHX)2 -0.6036962  0.2875974  -2.099  0.03581 *
factor(IVHX)3 -0.7336591  0.2549893  -2.877  0.00401 **
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 653.73 on 574 degrees of freedom
Residual deviance: 619.25 on 566 degrees of freedom
AIC: 637.25
Number of Fisher Scoring iterations: 4
Preliminary conclusions
• Important variables seem to be AGE, NDRUGTX, TREAT, IVHX
• Data are ungrouped, can’t assess goodness of fit with residual deviance
• No extremely large residuals
Hosmer-Lemeshow test
> HLstat(drug.glm)
Value of HL statistic = 5.05
P-value = 0.752
No evidence of a bad fit using this test
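For reference, a sketch of how a Hosmer-Lemeshow statistic is usually computed: cases are grouped by quantiles of the fitted probability and observed counts of 1's are compared with expected counts. The function hl.sketch and the simulated data are assumptions for illustration; the HLstat function in the R330 package may differ in its details (for example in how groups are formed).

```r
set.seed(330)
# Simulated ungrouped data (assumption; stands in for drug.df)
n <- 575
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 0.5 * x))
fit <- glm(y ~ x, family = binomial)

hl.sketch <- function(fit, g = 10) {
  p <- fitted(fit)
  y <- fit$y
  # Split cases into g groups by quantiles of the fitted probability
  grp <- cut(p, breaks = quantile(p, probs = seq(0, 1, length.out = g + 1)),
             include.lowest = TRUE)
  obs  <- tapply(y, grp, sum)       # observed number of 1's per group
  expd <- tapply(p, grp, sum)       # expected number of 1's per group
  n.g  <- tapply(p, grp, length)    # group sizes
  pbar <- expd / n.g
  stat <- sum((obs - expd)^2 / (n.g * pbar * (1 - pbar)))
  # Compare with a chi-squared distribution on g - 2 degrees of freedom
  c(statistic = stat, p.value = 1 - pchisq(stat, df = g - 2))
}
res <- hl.sketch(fit)
res
```

A large p-value, as on the slide, gives no evidence of a bad fit.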
Variable selection (1)
> anova(drug.glm, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: DFREE
Terms added sequentially (first to last)
             Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL                           574     653.73
AGE           1     1.40       573     652.33      0.24
BECK          1     0.57       572     651.76      0.45
NDRUGTX       1    14.07       571     637.69 0.0001760
RACE          1     3.06       570     634.63      0.08
TREAT         1     4.96       569     629.67      0.03
SITE          1     1.07       568     628.60      0.30
factor(IVHX)  2     9.35       566     619.25      0.01
Variable selection (2)
Step: AIC= 632.59
DFREE ~ NDRUGTX + IVHX + AGE + TREAT
Call: glm(formula = DFREE ~ NDRUGTX + IVHX + AGE + TREAT, family = binomial, data = drug.df)
Degrees of Freedom: 574 Total (i.e. Null); 569 Residual
Null Deviance: 653.7    Residual Deviance: 620.6    AIC: 632.6
Sub-model
> sub.glm<-glm(DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT, family=binomial, data=drug.df)
> summary(sub.glm)
Call:
glm(formula = DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT, family = binomial, data = drug.df)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.2598  -0.8051  -0.6284   1.1401   2.4574
Sub-model (ii)
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)   -2.33276    0.54838  -4.254  2.1e-05 ***
NDRUGTX       -0.06376    0.02563  -2.488 0.012844 *
factor(IVHX)2 -0.62366    0.28470  -2.191 0.028484 *
factor(IVHX)3 -0.80561    0.24453  -3.294 0.000986 ***
AGE            0.05259    0.01721   3.056 0.002244 **
TREAT          0.45134    0.19860   2.273 0.023048 *
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 653.73 on 574 degrees of freedom
Residual deviance: 620.59 on 569 degrees of freedom
AIC: 632.59
Number of Fisher Scoring iterations: 4
All variables significant, but use caution
Do we need interaction terms?
> sub.glm<-glm(DFREE ~ NDRUGTX + IVHX + AGE + TREAT, family=binomial, data=drug.df)
Model with interactions:
> sub2.glm<-glm(DFREE ~ NDRUGTX*IVHX + AGE*IVHX + AGE*TREAT + NDRUGTX*TREAT, family=binomial, data=drug.df)
> anova(sub.glm, sub2.glm, test="Chisq")
Analysis of Deviance Table
Model 1: DFREE ~ NDRUGTX + IVHX + AGE + TREAT
Model 2: DFREE ~ NDRUGTX * IVHX + AGE * IVHX + AGE * TREAT + NDRUGTX * TREAT
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1       569     620.59
2       563     616.16  6     4.42      0.62
Big p-value so interactions not required
Do we need to transform?
library(mgcv)   # provides gam() and s()
par(mfrow=c(1,2))
sub.gam<-gam(DFREE ~ s(NDRUGTX) + factor(IVHX) + s(AGE) + TREAT, family=binomial, data=drug.df)
plot(sub.gam)
[Figure: fitted smooth terms s(NDRUGTX, 1.52) against NDRUGTX and s(AGE, 1) against AGE]
Transforming
• Suggests a possible quadratic in NDRUGTX:
> subq.glm<-glm(DFREE ~ poly(NDRUGTX,2) + IVHX + AGE + TREAT, family=binomial, data=drug.df)
> summary(subq.glm)
Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)       -2.70763    0.56715  -4.774 1.80e-06 ***
poly(NDRUGTX, 2)1 -7.24501    2.93274  -2.470  0.01350 *
poly(NDRUGTX, 2)2  4.21319    2.69624   1.563  0.11814
IVHX2             -0.59018    0.28608  -2.063  0.03912 *
IVHX3             -0.76015    0.24725  -3.074  0.00211 **
AGE                0.05458    0.01730   3.154  0.00161 **
TREAT              0.44379    0.19904   2.230  0.02577 *
But the quadratic term is not significant, so we stick with no transformation
Diagnostics
[Figure: diagnostic plots. Panels: Residuals vs Fitted, Normal Q-Q plot, Scale-Location plot, Cook's distance plot, Index plot of deviance residuals, Leverage plot, Cook's Distance Plot, Deviance Changes Plot. Callouts on the slide highlight points 85, 7 and 471]
Influence of 7, 471, 85
Effect on coefficients of removing cases:
              None      7     85    471    All
(Intercept) -2.333 -2.295 -2.222 -2.447 -2.293
NDRUGTX     -0.064 -0.084 -0.065 -0.075 -0.100
IVHX2       -0.624 -0.595 -0.635 -0.680 -0.662
IVHX3       -0.806 -0.783 -0.785 -0.795 -0.747
AGE          0.053  0.053  0.049  0.057  0.054
TREAT        0.451  0.434  0.441  0.479  0.450
None seem particularly influential! We will not delete them
Over-dispersion
> qsub.glm<-glm(DFREE ~ NDRUGTX + IVHX + AGE + TREAT, family=quasibinomial, data=drug.df)
> summary(qsub.glm)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.33276    0.55435  -4.208 2.99e-05 ***
NDRUGTX     -0.06376    0.02591  -2.461  0.01414 *
IVHX2       -0.62366    0.28780  -2.167  0.03065 *
IVHX3       -0.80561    0.24720  -3.259  0.00118 **
AGE          0.05259    0.01740   3.023  0.00262 **
TREAT        0.45134    0.20076   2.248  0.02495 *
---
(Dispersion parameter for quasibinomial family taken to be 1.021892)
Very close to 1 so no overdispersion
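The dispersion parameter reported for a quasibinomial fit is the sum of squared Pearson residuals divided by the residual degrees of freedom. The sketch below checks this on simulated data (an assumption; the data stand in for drug.df).

```r
set.seed(330)
# Simulated binary data (assumption for illustration)
n <- 575
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 0.5 * x))
fit  <- glm(y ~ x, family = binomial)
qfit <- glm(y ~ x, family = quasibinomial)

# Dispersion estimate: sum of squared Pearson residuals / residual df
phi.hat <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
phi.hat
summary(qfit)$dispersion   # the same quantity, as reported by summary()
```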
Interpretation
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.33276    0.54838  -4.254  2.1e-05 ***
NDRUGTX     -0.06376    0.02563  -2.488 0.012844 *
IVHX2       -0.62366    0.28470  -2.191 0.028484 *
IVHX3       -0.80561    0.24453  -3.294 0.000986 ***
AGE          0.05259    0.01721   3.056 0.002244 **
TREAT        0.45134    0.19860   2.273 0.023048 *
• As the number of prior treatments goes up, the probability of a drug-free recovery goes down
• The probability of a drug-free recovery for persons with no IV drug use is more than for persons with previous IV drug use
• The probability of a drug-free recovery for persons with previous IV drug use is more than for persons with recent IV drug use.
• The probability of a drug-free recovery goes up with age
• The probability of a drug-free recovery is higher for the long treatment
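The size of these effects can be read off the odds scale by exponentiating the coefficients. The sketch below uses the estimates copied from the table above; the object names are illustrative.

```r
# Estimates copied from the coefficient table above
est <- c(NDRUGTX = -0.06376, IVHX2 = -0.62366, IVHX3 = -0.80561,
         AGE = 0.05259, TREAT = 0.45134)
# exp(coefficient) is the multiplicative change in the odds of being
# drug free for a one-unit increase (or relative to the baseline level)
odds.ratio <- exp(est)
round(odds.ratio, 2)
```

For example, the estimated odds of being drug free under the long treatment are about 1.57 times those under the short treatment, with the other variables held fixed.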
Interpreting p-values after model selection
• We have seen that interpreting p-values in the usual way after model selection is not valid, as model selection changes the distribution of the estimated coefficients.
• We can use the bootstrap to examine the revised distribution
• Leave TREAT in the model, use forward selection to select the other variables.
Procedure
• Draw a bootstrap sample
• Do forward selection, record value of regression coef for TREAT (forced to be in every model)
• Repeat 200 times, draw histogram of the results
R code
# Bootstrap the TREAT coefficient under forward selection
n = dim(drug.df)[1]
B = 200
beta.boot = numeric(B)
for(b in 1:B){
  # bootstrap sample: ni[i] is the number of times case i is drawn
  ni = rmultinom(1, n, prob=rep(1/n,n))
  newdata = drug.df[rep(1:n,ni),]
  # start from the smallest model (TREAT only) so that forward selection
  # has terms to add; starting from the full model leaves nothing to add
  start.glm = glm(DFREE ~ TREAT, family=binomial, data=newdata)
  chosen = step(start.glm,
                scope = list(lower = DFREE ~ TREAT,
                             upper = DFREE ~ NDRUGTX + factor(IVHX) + AGE + BECK + TREAT + RACE + SITE),
                direction = "forward", trace=0)
  # record the TREAT coefficient of the selected model
  beta.boot[b] = coef(chosen)["TREAT"]
}
Histogram
[Figure: histogram of beta.boot, frequency against beta.boot over roughly -0.2 to 1.0]
> mean(beta.boot)
[1] 0.4540209
> sd(beta.boot)
[1] 0.2019222
> z.val = mean(beta.boot)/ sd(beta.boot)
> 2*(1-pnorm(z.val))
[1] 0.02454468
Compare with the sub-model fit:
Beta = 0.45134
SE = 0.19860
P-value = 0.023048
Prediction
• Sensitivity: chance the model predicts a successful recovery (drug-free at end of program), when one will actually occur
• Specificity: chance the model predicts a failure (return to drug use before end of program), when one actually will occur
R code
> sub.glm<-glm(DFREE ~ NDRUGTX + IVHX + AGE + TREAT, family=binomial, data=drug.df)
> pred = predict(sub.glm, type="response")
> predcode = ifelse(pred<0.5, 0,1)
> table(drug.df$DFREE,predcode)
          predicted
Actual      0    1
     0    426    2
     1    144    3
Sensitivity = 3/147 = 0.02040816
Specificity = 426/428 = 0.9953271
Error rate = 146/575 = 0.2539130
Proportion correctly classified = 429/575 = 0.746087
ROC curve
ROC.curve(DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT, data=drug.df)   # in the R330 package
Prediction (2)
• Use 10-fold cross-validation
  – Split data into 10 parts
  – Calculate sensitivity and specificity for each part, using model fitted to the remaining parts
  – Average results
  – Repeat for different splits, average repeats
• E.g. for one part
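The cross-validation steps above can be sketched for a single random split as follows. The data are simulated (an assumption; sim.df stands in for drug.df), and this is not the cross.val function from the R330 package, which also repeats over different splits.

```r
set.seed(330)
# Simulated data (assumption for illustration)
n <- 500
sim.df <- data.frame(x = rnorm(n))
sim.df$y <- rbinom(n, 1, plogis(sim.df$x))

K <- 10
fold <- sample(rep(1:K, length.out = n))   # random split into 10 parts
sens <- spec <- numeric(K)
for (k in 1:K) {
  test <- fold == k
  # Fit to the other nine parts, then predict the held-out part
  fit  <- glm(y ~ x, family = binomial, data = sim.df[!test, ])
  prob <- predict(fit, newdata = sim.df[test, ], type = "response")
  pred <- ifelse(prob >= 0.5, 1, 0)
  actual <- sim.df$y[test]
  sens[k] <- mean(pred[actual == 1] == 1)  # true positive rate in this part
  spec[k] <- mean(pred[actual == 0] == 0)  # true negative rate in this part
}
c(sensitivity = mean(sens), specificity = mean(spec))
```

Each case is predicted by a model that was fitted without it, which is what removes the optimism of the in-sample estimates.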
CV and bootstrap Results
> cross.val(DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT, drug.df)
Mean Specificity = 0.9908424
Mean Sensitivity = 0.02854005
Mean Correctly classified = 0.7446491
> err.boot(DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT, data= drug.df)
$err
[1] 0.2539130     # training set estimate
$Err
[1] 0.2552974     # bootstrap estimate
A poor classifier, but this doesn't mean that the model fits poorly – there are very few cases with fitted probs over 0.5, and many with fitted probabilities between 0.2 and 0.5. We expect a moderate number of these to be misclassified, as some events (being drug free) with probs 0.2 to 0.5 have occurred.
Overall conclusions
• Model seems to fit well
• Strong evidence that longer treatments are better
• No apparent difference between sites
• Age and prior IV drug use affect recovery
• Model predicts poorly for the covariates in the data set – effectively always predicts patients will not be drug free