Learning Algorithm Evaluation
Transcript
Page 1:

Learning Algorithm Evaluation

Page 2:

Algorithm evaluation: Outline

Why? Overfitting

How? Train/Test vs Cross-validation

What? Evaluation measures

Who wins? Statistical significance

Page 3:

Introduction

Page 4:

Introduction

A model should perform well on unseen data drawn from the same distribution

Page 5:

Classification accuracy

Performance measures:
• Success: the instance’s class is predicted correctly
• Error: the instance’s class is predicted incorrectly
• Error rate: #errors / #instances
• Accuracy: #successes / #instances

Quiz: 50 examples, 10 classified incorrectly

• Accuracy? Error rate?
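A worked answer to the quiz, as plain arithmetic (a minimal Python sketch; nothing beyond the numbers on the slide is assumed):

```python
# Quiz: 50 examples, 10 classified incorrectly.
n_instances = 50
n_errors = 10

error_rate = n_errors / n_instances                  # 10/50 = 0.20 -> 20%
accuracy = (n_instances - n_errors) / n_instances    # 40/50 = 0.80 -> 80%

print(f"accuracy = {accuracy:.0%}, error rate = {error_rate:.0%}")
```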

Page 6:

Evaluation

Rule #1

Never evaluate on training data!

Page 7:

Train and Test

Step 1: Randomly split data into training and test set (e.g. 2/3-1/3)

a.k.a. holdout set

Page 8:

Train and Test

Step 2: Train model on training data

Page 9:

Train and Test

Step 3: Evaluate model on test data
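To make steps 1–3 concrete, here is a minimal sketch assuming Python with scikit-learn; the synthetic dataset and the decision-tree classifier are illustrative stand-ins, not part of the slides:

```python
# Train/test evaluation sketch (assumes scikit-learn; data and model are illustrative).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=150, random_state=0)  # stand-in dataset

# Step 1: randomly split into training and holdout (test) set, e.g. 2/3 - 1/3
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=42)

# Step 2: train the model on the training data only
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 3: evaluate on the held-out test data
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```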

Page 10:

Train and Test

Quiz: Can I retry with other parameter settings?

Page 11:

Evaluation

Rule #1

Never evaluate on training data!

Rule #2

Never train on test data! (that includes parameter setting or feature selection)

Page 12:

Train and Test

Step 4: Optimize parameters on a separate validation set

(diagram: data split into training, validation, and test portions)

Page 13:

Test data leakage

Never use test data to create the classifier. This can be tricky in practice, e.g. with social network data.

Proper procedure uses three sets:
• training set: train models
• validation set: optimize algorithm parameters
• test set: evaluate the final model
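A minimal sketch of this three-set procedure, assuming scikit-learn; the dataset, the decision-tree classifier, and the max_depth grid are illustrative choices only:

```python
# Three-set procedure: train / validation / test (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=300, random_state=0)  # stand-in dataset

# Carve off the test set first, then split the rest into training and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Optimize a parameter on the validation set (never on the test set).
best_depth, best_acc = None, -1.0
for depth in (1, 2, 4, 8, None):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))
    if acc > best_acc:
        best_depth, best_acc = depth, acc

# Touch the test set exactly once, with the chosen parameter setting.
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, final.predict(X_test)))
```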

Page 14:

Making the most of the data

Once evaluation is complete, all the data can be used to build the final classifier

Trade-off: performance vs. evaluation accuracy
• More training data: better model (but returns diminish)
• More test data: more accurate error estimate

Page 15:

Train and Test

Step 5: Build final model on ALL data (more data, better model)

Page 16:

Cross-Validation

Page 17:

k-fold Cross-validation

• Split data (stratified) into k folds
• Use (k-1) folds for training, 1 for testing
• Repeat k times
• Average the results

(diagram: original data split into Fold 1, Fold 2, Fold 3; each fold serves once as the test set, the rest as training)
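A sketch of this procedure with k = 3, assuming scikit-learn's StratifiedKFold; the dataset and decision-tree classifier are again illustrative stand-ins:

```python
# (Stratified) k-fold cross-validation sketch (assumes scikit-learn and numpy).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=150, random_state=0)  # stand-in dataset
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in skf.split(X, y):
    # Train on k-1 folds, test on the held-out fold.
    model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print("fold accuracies:", np.round(scores, 3), "mean:", np.mean(scores).round(3))
```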

Page 18:

Cross-validation

Standard method: Stratified ten-fold cross-validation

Why 10 folds? Experimentally determined to be enough to reduce sampling bias.

Page 19:

Leave-One-Out Cross-validation

A particular form of cross-validation:
• #folds = #instances
• With n instances, build the classifier n times

• Makes the best use of the data, no sampling bias
• Computationally expensive

(diagram: 100 instances, Fold 1 through Fold 100, each holding out a single instance for testing)
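A short leave-one-out sketch, again assuming scikit-learn; with 100 instances it builds the classifier 100 times:

```python
# Leave-one-out cross-validation: #folds = #instances (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, random_state=0)  # stand-in dataset

# One model per instance: 100 fits, each tested on the single held-out instance.
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=LeaveOneOut())
print("LOO accuracy estimate:", scores.mean())
```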

Page 20:

ROC Analysis

Page 21:

ROC Analysis

Stands for “Receiver Operating Characteristic”. From signal processing: the trade-off between hit rate and false alarm rate over a noisy channel.

Compute FPR and TPR and plot them in ROC space. Every classifier is a point in ROC space.

For probabilistic algorithms: collect many points by varying the prediction threshold, or make the classifier cost-sensitive and vary the costs (see below).

Page 22:

Confusion Matrix

TP rate (sensitivity) = TP / (TP + FN)
FP rate (fall-out) = FP / (FP + TN)

          actual +               actual -
pred +    TP (true positive)     FP (false positive)
pred -    FN (false negative)    TN (true negative)
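In code, the two rates follow directly from the four cells of the matrix; the counts below are made up for illustration (plain Python):

```python
# TP rate and FP rate from a confusion matrix (hypothetical counts).
TP, FN, FP, TN = 40, 10, 5, 45

tp_rate = TP / (TP + FN)   # sensitivity: fraction of actual positives predicted +
fp_rate = FP / (FP + TN)   # fall-out: fraction of actual negatives predicted +

print(f"TP rate = {tp_rate:.2f}, FP rate = {fp_rate:.2f}")
```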

Page 23:

ROC space

(plot: individual classifiers as points in ROC space: J48, OneR, and J48 with fitted parameters)

Page 24:

ROC curves

Change the prediction threshold t: predict + if P(+) > t

(plot: example ROC curve with Area Under Curve (AUC) = 0.75)

Page 25:

ROC curves

Jagged curve: one set of test data. Smooth curve: use cross-validation.

Alternative method (easier, but less intuitive): rank the instances by predicted probability, start the curve at (0,0), and move down the probability list: if the instance is positive, move up; if negative, move right.
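A sketch of that ranking method in plain Python; the class labels and predicted probabilities are made up for illustration:

```python
# Build a ROC curve by ranking instances on predicted P(+):
# step up for each actual positive, right for each actual negative.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]                          # actual classes
p_pos  = [0.9, 0.8, 0.75, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]    # predicted P(+)

order = sorted(range(len(y_true)), key=lambda i: p_pos[i], reverse=True)
P = sum(y_true)            # number of actual positives
N = len(y_true) - P        # number of actual negatives

fpr, tpr = 0.0, 0.0
curve = [(fpr, tpr)]       # start at (0, 0)
for i in order:
    if y_true[i] == 1:
        tpr += 1 / P       # positive: move up
    else:
        fpr += 1 / N       # negative: move right
    curve.append((fpr, tpr))

print(curve)               # ends at (1.0, 1.0)
```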

Page 26:

ROC curves: Method selection

Overall: use method with largest Area Under ROC curve (AUROC)

If you aim to cover just 40% of true positives in a sample: use method A

For a large sample: use method B. In between: choose between A and B with appropriate probabilities.

Page 27:

ROC Space and Costs

(plots: ROC space under equal costs and under skewed costs)

Page 28:

Different Costs

In practice, FP and FN errors incur different costs. Examples:

• Medical diagnostic tests: does X have leukemia?
• Loan decisions: approve mortgage for X?
• Promotional mailing: will X buy the product?

Add a cost matrix to the evaluation that weighs TP, FP, ...

          pred +    pred -
actual +  cTP = 0   cFN = 1
actual -  cFP = 1   cTN = 0
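Applying the cost matrix to a confusion matrix is a simple weighted sum. The sketch below uses made-up counts and the costs from the table above; skewed costs can be plugged in the same way:

```python
# Total and average cost from confusion-matrix counts and a cost matrix (plain Python).
counts = {"TP": 40, "FN": 10, "FP": 5, "TN": 45}   # hypothetical confusion counts
costs  = {"TP": 0,  "FN": 1,  "FP": 1,  "TN": 0}   # cost matrix from the slide

total_cost = sum(counts[k] * costs[k] for k in counts)
avg_cost = total_cost / sum(counts.values())
print("total cost:", total_cost, "average cost per instance:", avg_cost)
```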

Page 29:

Statistical Significance

Page 30:

Comparing data mining schemes

Which of two learning algorithms performs better? Note: this is domain dependent!

Obvious way: compare 10-fold CV estimates. Problem: variance in the estimates.

Variance can be reduced using repeated CV. However, we still don’t know whether the results are reliable.

Page 31:

Significance tests

Significance tests tell us how confident we can be that there really is a difference.

• Null hypothesis: there is no “real” difference
• Alternative hypothesis: there is a difference

A significance test measures how much evidence there is in favor of rejecting the null hypothesis

E.g. 10 cross-validation scores: B better than A?

(plot: distributions P(perf) of cross-validation scores for Algorithm A and Algorithm B, with mean A and mean B marked)

Page 32:

Paired t-test

Student’s t-test tells whether the means of two samples (e.g., 10 cross-validation scores) are significantly different

Use a paired t-test when individual samples are paired

i.e., they use the same randomization: the same CV folds are used for both algorithms.


William Gosset

Born: 1876 in Canterbury; Died: 1937 in Beaconsfield, England

Worked as a chemist at the Guinness brewery in Dublin from 1899. Invented the t-test to handle small samples for quality control in brewing. Wrote under the name "Student".


Page 33:

Performing the test

1. Fix a significance level α
   A significant difference at the α% level implies a (100-α)% chance that there really is a difference
   Scientific work: 5% or smaller (>95% certainty)

2. Divide α by two (two-tailed test)

3. Look up the z-value corresponding to α/2:

4. If t ≤ -z or t ≥ z: the difference is significant and the null hypothesis can be rejected


α       z
0.1%    4.3
0.5%    3.25
1%      2.82
5%      1.83
10%     1.38
20%     0.88
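A sketch of the whole procedure on 10 paired cross-validation scores; the scores are made up, and numpy and scipy are assumed to be available:

```python
# Paired t-test on 10 paired CV scores (made-up numbers; assumes numpy and scipy).
import numpy as np
from scipy import stats

scores_A = np.array([0.81, 0.79, 0.84, 0.80, 0.78, 0.82, 0.83, 0.80, 0.79, 0.81])
scores_B = np.array([0.84, 0.83, 0.85, 0.82, 0.81, 0.85, 0.86, 0.83, 0.82, 0.84])

# t statistic of the paired differences (n - 1 = 9 degrees of freedom)
d = scores_B - scores_A
t = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

# The same test via scipy (two-tailed p-value)
t_scipy, p_value = stats.ttest_rel(scores_B, scores_A)

print(f"t = {t:.2f} (scipy: {t_scipy:.2f}), two-tailed p = {p_value:.4f}")
# Step 4: the difference is significant if t <= -z or t >= z for the chosen level.
```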