Evaluating Predictive Models. Niels Peek, Department of Medical Informatics, Academic Medical Center, University of Amsterdam
Transcript
Page 1:

Evaluating Predictive Models

Niels Peek

Department of Medical Informatics

Academic Medical Center

University of Amsterdam

Page 2:

Outline

1. Model evaluation basics

2. Performance measures

3. Evaluation tasks
   • Model selection
   • Performance assessment
   • Model comparison

4. Summary

Page 3:

Basic evaluation procedure

1. Choose performance measure

2. Choose evaluation design

3. Build model

4. Estimate performance

5. Quantify uncertainty

Page 4:

Basic evaluation procedure

1. Choose performance measure (e.g. error rate)

2. Choose evaluation design (e.g. split sample)

3. Build model (e.g. decision tree)

4. Estimate performance (e.g. compute test sample error rate)

5. Quantify uncertainty (e.g. estimate confidence interval)
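For concreteness, here is a minimal Python sketch of these five steps; it assumes scikit-learn is available and uses load_breast_cancer purely as a stand-in dataset, neither of which is prescribed by the slides.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)            # placeholder data

# Step 2: evaluation design = split sample (2/3 training, 1/3 test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

# Step 3: build the model on the training sample only
h = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Steps 1 and 4: performance measure = error rate, estimated on the test sample
err = np.mean(h.predict(X_test) != y_test)

# Step 5: quantify uncertainty with an approximate 95% confidence interval
half_width = 1.96 * np.sqrt(err * (1 - err) / len(y_test))
print(f"test error = {err:.3f}, 95% CI = [{err - half_width:.3f}, {err + half_width:.3f}]")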

Page 5:

Notation and terminology

x ∈ R^m            feature vector (stat: covariate pattern)

y ∈ {0,1}          class (med: outcome)

p(x)               density (probability mass) of x

P(Y=1 | x)         class-conditional probability

h : R^m → {0,1}    classifier (stat: discriminant model; Mitchell: hypothesis)

f : R^m → [0,1]    probabilistic classifier (stat: binary regression model)

f(Y=1 | x)         estimated class-conditional probability

Page 6:

Error rate

The error rate (misclassification rate, inaccuracy) of a given classifier h is the probability that h will misclassify an arbitrary instance x:

error(h) = ∫_{R^m} P( h(x) ≠ Y | x ) p(x) dx

Here P( h(x) ≠ Y | x ) is the probability that a given x is misclassified by h.

Page 7:

Error rate

The error rate (misclassification rate, inaccuracy) of a given classifier h is the probability that h will misclassify an arbitrary instance x:

error(h) = ∫_{R^m} P( h(x) ≠ Y | x ) p(x) dx

The integral takes the expectation over instances x randomly drawn from R^m.

Page 8:

Sample error rate

Let S = { (x_i, y_i) | i = 1, ..., n } be a sample of independent and identically distributed (i.i.d.) instances, randomly drawn from R^m.

The sample error rate of classifier h in sample S is the proportion of instances in S misclassified by h:

error_S(h) = (1/n) · Σ_{i=1}^{n} I( y_i ≠ h(x_i) )
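A direct rendering of this formula as a small Python sketch; the thresholding classifier and the four data points are hypothetical, chosen only for illustration.

import numpy as np

def sample_error_rate(h, X, y):
    # error_S(h) = (1/n) * sum_i I(y_i != h(x_i))
    return np.mean(h(X) != y)

h = lambda X: (X[:, 0] > 0.5).astype(int)    # hypothetical classifier on a single feature
X = np.array([[0.2], [0.7], [0.9], [0.4]])
y = np.array([0, 0, 1, 1])
print(sample_error_rate(h, X, y))            # 2 of 4 instances misclassified -> 0.5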

Page 9:

The estimation problem

How well does error_S(h) estimate error(h)?

To answer this question, we must look at some basic concepts of statistical estimation theory.

Generally speaking, a statistic is a particular calculation made from a data sample. It describes a certain aspect of the distribution of the data in the sample.

Page 10:

Understanding randomness

Page 11:

What can go wrong?

Page 12:

Sources of bias

• Dependence: using data for both training/optimization and testing purposes

• Population drift: underlying densities have changed, e.g. ageing

• Concept drift: class-conditional distributions have changed, e.g. reduced mortality due to better treatments

Page 13:

Sources of variation

• Sampling of test data: a “bad day” (more probable with small samples)

• Sampling of training data: instability of the learning method, e.g. trees

• Internal randomness of the learning algorithm: stochastic optimization, e.g. neural networks

• Class inseparability: 0 ≪ P(Y=1 | x) ≪ 1 for many x ∈ R^m

Page 14:

Solutions

• Bias

  – is usually avoided through proper sampling, i.e. by taking an independent sample

  – can sometimes be estimated and then used to correct a biased error_S(h)

• Variance

  – can be reduced by increasing the sample size (if we have enough data ...)

  – is usually estimated and then used to quantify the uncertainty of error_S(h)

Page 15:

Uncertainty = spread


We investigate the spread of a distribution by looking at the average distance to the (estimated) mean.

Page 16:

Quantifying uncertainty (1)

Let e1, ..., en be a sequence of observations, with average ē = (1/n) · Σ_{i=1}^{n} e_i.

• The variance of e1, ..., en is defined as
  s² = 1/(n−1) · Σ_{i=1}^{n} (e_i − ē)²

• When e1, ..., en are binary, then
  s² = n/(n−1) · ē · (1 − ē)

Page 17:

Quantifying uncertainty (2)

• The standard deviation of e1, ..., en is defined as
  s = √( 1/(n−1) · Σ_{i=1}^{n} (e_i − ē)² )

• When the distribution of e1, ..., en is approximately Normal, a 95% confidence interval for ē is obtained as
  [ ē − 1.96·se , ē + 1.96·se ],
  where se = s/√n is the standard error of ē.

• Under the same assumption, we can also compute a p-value for the hypothesis that the true mean equals a particular value (e.g., 0).
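A small Python check of these formulas on a made-up vector of binary observations; the binary shortcut s² = n/(n−1)·ē(1−ē) and the standard error se = s/√n follow the reconstruction above.

import numpy as np

e = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])     # hypothetical binary observations (1 = misclassified)
n, e_bar = len(e), e.mean()

s2_general = np.sum((e - e_bar) ** 2) / (n - 1)  # s^2 = 1/(n-1) * sum (e_i - e_bar)^2
s2_binary = n / (n - 1) * e_bar * (1 - e_bar)    # shortcut for binary observations
assert np.isclose(s2_general, s2_binary)

se = np.sqrt(s2_general / n)                     # standard error of the mean
print(f"mean = {e_bar:.2f}, 95% CI = [{e_bar - 1.96 * se:.3f}, {e_bar + 1.96 * se:.3f}]")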

Page 18:

Example

training set: n_train = 80        test set: n_test = 40

• We split our dataset into a training sample and a test sample.

• The classifier h is induced from the training sample, and evaluated on the independent test sample.

• The estimated error rate is then unbiased.

Page 19:

Example (cont’d)

• Suppose that h misclassifies 12 of the 40 examples in the test sample.

• So error_S(h) = 12/40 = .30

• Now, with approximately 95% probability, error(h) lies in the interval
  error_S(h) ± 1.96 · √( error_S(h) · (1 − error_S(h)) / n_test )

• In this case, the interval ranges from .16 to .44
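The arithmetic can be verified in a couple of lines of Python (values taken from this example):

import numpy as np

err, n_test = 12 / 40, 40
half_width = 1.96 * np.sqrt(err * (1 - err) / n_test)
print(err - half_width, err + half_width)    # approximately 0.16 and 0.44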

Page 20:

Basic evaluation procedure

1. Choose performance measure (e.g. error rate)

2. Choose evaluation design (e.g. split sample)

3. Build model (e.g. decision tree)

4. Estimate performance (e.g. compute test sample error rate)

5. Quantify uncertainty (e.g. estimate confidence interval)

Page 21:

Outline

1. Model evaluation basics

2. Performance measures

3. Evaluation tasks
   • Model selection
   • Performance assessment
   • Model comparison

4. Summary

Page 22:

Confusion matrix

A common way to refine the notion of prediction error is to construct a confusion matrix (rows: prediction h(x); columns: outcome Y):

             Y=1                Y=0
h(x)=1   true positives     false positives
h(x)=0   false negatives    true negatives

Page 23:

Example

[Table: outcomes Y and predictions h(x) for the six instances in a small test sample]

         Y=1   Y=0
h(x)=1    1     0
h(x)=0    2     3

Page 24:

Sensitivity

• “Hit rate”: correctness among positive instances
• TP / (TP + FN) = 1 / (1 + 2) = 1/3
• Terminology: sensitivity (medical diagnostics), recall (information retrieval)

         Y=1   Y=0
h(x)=1    1     0
h(x)=0    2     3

Page 25:

Specificity

• Correctness among negative instances
• TN / (TN + FP) = 3 / (0 + 3) = 1
• Terminology: specificity (medical diagnostics); note that precision in information retrieval is a different quantity, TP / (TP + FP)

         Y=1   Y=0
h(x)=1    1     0
h(x)=0    2     3
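As a sketch, both measures can be computed from labelled data as follows; the outcome and prediction vectors below are hypothetical, chosen only to reproduce the counts in the matrix above.

import numpy as np

y = np.array([0, 1, 1, 1, 0, 0])      # outcomes
pred = np.array([0, 0, 1, 0, 0, 0])   # predictions h(x)

TP = np.sum((pred == 1) & (y == 1))   # 1
FP = np.sum((pred == 1) & (y == 0))   # 0
FN = np.sum((pred == 0) & (y == 1))   # 2
TN = np.sum((pred == 0) & (y == 0))   # 3

sensitivity = TP / (TP + FN)          # 1/3
specificity = TN / (TN + FP)          # 1.0
print(sensitivity, specificity)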

Page 26:

ROC analysis

• When a model yields probabilistic predictions, e.g. f(Y=1 | x) = 0.55, then we can evaluate its performance for different classification thresholds t ∈ [0,1]

• This corresponds to assigning different (relative) weights to the two types of classification error

• The ROC curve is a plot of sensitivity versus 1 − specificity for all 0 ≤ t ≤ 1
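A minimal Python sketch of this threshold sweep; the outcomes and predicted probabilities are invented, and the last lines compute the area under the resulting curve (introduced on the following slides) with the trapezoidal rule.

import numpy as np

def roc_points(y, scores):
    # Sweep thresholds t over the predictions and collect (1 - specificity, sensitivity) pairs
    thresholds = np.r_[1.0, np.sort(np.unique(scores))[::-1], 0.0]
    pts = []
    for t in thresholds:
        pred = (scores >= t).astype(int)
        tp = np.sum((pred == 1) & (y == 1)); fn = np.sum((pred == 0) & (y == 1))
        fp = np.sum((pred == 1) & (y == 0)); tn = np.sum((pred == 0) & (y == 0))
        pts.append((fp / (fp + tn), tp / (tp + fn)))
    return np.array(pts)

y = np.array([0, 0, 1, 0, 1, 1])                    # hypothetical outcomes
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9])  # hypothetical f(Y=1|x) values
pts = roc_points(y, scores)
auc = np.sum(np.diff(pts[:, 0]) * (pts[1:, 1] + pts[:-1, 1]) / 2)   # trapezoidal rule
print(auc)                                          # 0.667 for this toy example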

Page 27:

ROC curve

sens

itivi

ty

1- specificity

each pointcorresponds to

a threshold value

=1

=0

(0,1): perfect model

Page 28:

Area under ROC curve (AUC)

[ROC plot: sensitivity against 1 − specificity]

The area under the ROC curve is a good measure of discrimination.

Page 29:

Area under ROC curve (AUC)

When AUC = 0.5, the model does not predict better than chance.

Page 30:

Area under ROC curve (AUC)

When AUC = 1.0, the model discriminates perfectly between Y=0 and Y=1.

Page 31:

Discrimination vs. accuracy

• The AUC value only depends on the ordering of instances by the model

• The AUC value is insensitive to order-preserving transformations of the predictions f(Y=1|x), e.g. f′(Y=1|x) = f(Y=1|x) · 10^(−4711)

In addition to discrimination, we must therefore investigate the accuracy of probabilistic predictions.

Page 32:

Probabilistic accuracy

  x     Y    P(Y=1|x)   f(Y=1|x)
  10    0    0.10       0.15
  17    0    0.25       0.20
  32    1    0.30       0.25
  …     …    …          …
  100   1    0.90       0.75

Page 33:

Quantifying probabilistic error

Let (xi , yi) be an observation, and let f (Y | xi) be the estimated class-conditional distribution.

• Option 1: ε_i = | y_i − f(Y=1 | x_i) |
  Not good: does not lead to the correct mean

• Option 2: ε_i = ( y_i − f(Y=1 | x_i) )²  (variance-based)
  Correct, but mild on severe errors

• Option 3: ε_i = −ln f(Y=y_i | x_i)  (entropy-based)
  Better from a probabilistic viewpoint
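A small sketch computing the three options for a few made-up predictions; Option 3 is read here as the negative log-likelihood, i.e. the minus sign in front of the logarithm is part of this reconstruction.

import numpy as np

y = np.array([0, 1, 1])             # hypothetical outcomes
p = np.array([0.15, 0.20, 0.75])    # hypothetical predictions f(Y=1|x)

abs_err = np.abs(y - p)                    # option 1: absolute error
sq_err = (y - p) ** 2                      # option 2: squared (variance-based) error
f_true = np.where(y == 1, p, 1 - p)        # f(Y = y_i | x_i)
log_err = -np.log(f_true)                  # option 3: entropy-based error
print(abs_err.mean(), sq_err.mean(), log_err.mean())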

Page 34:

Outline

1. Model evaluation basics

2. Performance measures

3. Evaluation tasks
   • Model selection
   • Performance assessment
   • Model comparison

4. Summary

Page 35:

Evaluation tasks

• Model selection: select the appropriate size (complexity) of a model

• Performance assessment: quantify the performance of a given model for documentation purposes

• Method comparison: compare the performance of different learning methods

Page 36:

How far should we grow a tree?

[Figure: classification tree in which every node shows an estimated probability and sample size (root: 0.024, n = 4843); splits include creatinine level (cut-off 169), elective vs. emergency surgery, age (cut-offs 67 and 81), mitral valve surgery, LVEF, first vs. prior cardiac surgery, COPD, and BMI (cut-off 25)]

Page 37:

The model selection problem

• When we build a model, we must decide upon its size (complexity)

• Simple models are robust but not flexible: they may neglect important features of the problem

• Complex models are flexible but not robust: they tend to overfit the data set

Model induction is a statistical estimation problem!

Page 38:

[Figure: training sample error rate vs. true error rate; the difference between them is the optimistic bias]

How can we minimize the true error rate?

Page 39:

The split-sample procedure

1. The data set is randomly split into a training set and a test set (usually 2/3 vs. 1/3)

2. Models are built on the training set; error rates are measured on the test set

• Drawbacks:
  – data loss
  – results are sensitive to the split

Page 40:

Cross-validation

1. Split data set randomly into k subsets ("folds")

2. Build model on k-1 folds

3. Compute error on remaining fold

4. Repeat k times

fold 1 fold 2 fold k…

• Average error on k test folds approximates true error on independent data

• Requires automated model building procedure
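A compact Python sketch of k-fold cross-validation, again assuming scikit-learn and a stand-in dataset; the automated model building step here is simply refitting a depth-limited tree on each group of k−1 folds.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)          # placeholder data
fold_errors = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    h = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[train_idx], y[train_idx])
    fold_errors.append(np.mean(h.predict(X[test_idx]) != y[test_idx]))

print(np.mean(fold_errors))   # average error over the k test folds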

Page 41:

Estimating the optimistic bias

• We can also estimate the error on the training set and subtract an estimated bias afterwards.

• Roughly, there exist two methods to estimate an optimistic bias:

a) Look at the model’s complexity, e.g. the number of parameters in a generalized linear model (AIC, BIC)

b) Take bootstrap samples to simulate the sampling distribution (computationally intensive)
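Option (b) can be sketched with an Efron-style optimism bootstrap; the model, the dataset and the number of replicates below are illustrative choices, not part of the slides.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

def fit_and_error(X_fit, y_fit, X_eval, y_eval):
    h = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_fit, y_fit)
    return np.mean(h.predict(X_eval) != y_eval)

apparent = fit_and_error(X, y, X, y)          # training-set ("apparent") error

optimisms = []
for _ in range(200):                          # bootstrap replicates
    idx = rng.integers(0, len(y), size=len(y))                  # resample with replacement
    boot_err = fit_and_error(X[idx], y[idx], X[idx], y[idx])    # error on its own bootstrap sample
    orig_err = fit_and_error(X[idx], y[idx], X, y)              # error of the same model on the original data
    optimisms.append(orig_err - boot_err)

corrected = apparent + np.mean(optimisms)     # bias-corrected estimate of the true error
print(apparent, corrected)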

Page 42:

Summary: model selection

• In model selection, we trade off flexibility in the representation for statistical robustness

• The problem is to minimize the true error without losing data

• We are not interested in the true error (or its uncertainty) itself – we just want to minimize it

• Methods:
  – use independent observations
  – estimate the optimistic bias

Page 43:

Performance assessment

In a performance assessment, we estimate how well a given model would perform on new data.

The estimated performance should be unbiased and its uncertainty must be quantified.

Preferably, the performance measure used should be easy to interpret (e.g. AUC).

Page 44:

Types of performance

• Internal performance: performance on patients from the same population and in the same setting

• Prospective performance: performance for future patients from the same population and in the same setting

• External performance: performance for patients from another population or another setting

Page 45:

Internal performance

Both the split-sample and cross-validation procedures can be used to assess a model's internal performance, but not with the same data that was used in model selection.

A commonly applied procedure looks as follows:

[Figure: the data are divided into folds 1, 2, …, k, which are used for model selection, plus a separate validation set used only for the performance assessment]
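A sketch of this pattern: model selection (here, choosing a tree depth by cross-validation) is confined to the development folds, and the held-out validation sample is used exactly once; library, dataset and candidate depths are assumptions for illustration.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hold out a validation sample that plays no role in model selection
X_dev, X_val, y_dev, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Model selection by cross-validation on the development data only
depths = [1, 2, 3, 4, 5, 6]
cv_err = [1 - cross_val_score(DecisionTreeClassifier(max_depth=d, random_state=0),
                              X_dev, y_dev, cv=5).mean() for d in depths]
best_depth = depths[int(np.argmin(cv_err))]

# Internal performance assessed once, on the untouched validation sample
h = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_dev, y_dev)
print(best_depth, np.mean(h.predict(X_val) != y_val))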

Page 46:

Mistakes are frequently made

• Schwarzer et al. (2000) reviewed 43 applications of artificial neural networks in oncology

• Most applications used a split-sample or cross-validation procedure to estimate performance

• In 19 articles, an incorrect (optimistic) performance estimate was presented, e.g. model selection and validation on a single data set

• In 6 articles, the test set contained fewer than 20 observations

Schwarzer G, et al. Stat Med 2000; 19:541–61.

Page 47:

OutlineOutline

1. Model evaluation basics

2. Performance measures

3. Evaluation tasks
   • Model selection
   • Performance assessment
   • Model comparison

4. Summary

Page 48:

Summary

• Both model induction and evaluation are statistical estimation problems

• In model induction we increase bias to reduce variation (and avoid overfitting)

• In model evaluation we must avoid bias or correct for it

• In model selection, we trade off flexibility for robustness by optimizing the true performance

• A common pitfall is to use data twice without correcting for the resulting bias