Transcript
Page 1:

On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach

Published by Steven L. Salzberg

Presented by Prakash Tilwani

MACS 598, April 25th 2001

Page 2:

Agenda

• Introduction
• Classification basics
• Definitions
• Statistical validity
  • Bonferroni Adjustment
  • Statistical Accidents
  • Repeated Tuning
• A Recommended Approach
• Conclusion

Page 3:

Introduction

• Comparative studies: do they follow proper methodology?
• Public databases: have they been relied on too heavily?
• Comparison results: are they really correct, or just statistical accidents?

Page 4:

Definitions

• t-test
• F-test
• p-value
• Null hypothesis

Page 5:

t-test

• The t-test assesses whether the means of two groups are statistically different from each other.
• It is the ratio of the difference in means to the variability of the groups.
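
As a concrete illustration, here is a minimal sketch of a two-sample t-test in Python, assuming SciPy is available; the accuracy figures are hypothetical.

```python
# A two-sample t-test with SciPy; the accuracy figures are hypothetical.
from scipy import stats

acc_a = [0.81, 0.79, 0.84, 0.80, 0.82]  # accuracies of algorithm A
acc_b = [0.76, 0.78, 0.75, 0.77, 0.74]  # accuracies of algorithm B

# ttest_ind returns the t statistic (difference in means scaled by the
# pooled variability of the two groups) and the two-sided p-value.
t_stat, p_value = stats.ttest_ind(acc_a, acc_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```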

Page 6:

F-test

• The F-test determines whether the variances of two samples are significantly different.
• It is the ratio of the variances of the two samples.
• It is the basis for "Analysis of Variance" (ANOVA).
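
A minimal sketch of the variance-ratio F-test, assuming NumPy and SciPy; the samples are hypothetical, and the two-sided p-value construction shown is one common convention.

```python
# Variance-ratio F-test with NumPy/SciPy; the samples are hypothetical.
import numpy as np
from scipy import stats

a = np.array([0.81, 0.79, 0.84, 0.80, 0.82])
b = np.array([0.76, 0.88, 0.65, 0.77, 0.94])

# F is the ratio of the two sample variances (ddof=1 for unbiased estimates).
f_stat = np.var(a, ddof=1) / np.var(b, ddof=1)
dfn, dfd = len(a) - 1, len(b) - 1

# Two-sided p-value: twice the smaller tail of the F distribution.
p = 2 * min(stats.f.sf(f_stat, dfn, dfd), stats.f.cdf(f_stat, dfn, dfd))
print(f"F = {f_stat:.3f}, p = {p:.4f}")
```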

Page 7:

p-value

• The probability of concluding (incorrectly) that there is a difference between samples when no true difference exists.
• It depends on the statistical test being performed.
• p = 0.05 means there is a 5% chance of being wrong if you conclude that the populations are different.

Page 8:

Null hypothesis

• The assumption that there is no difference between two or more populations.
• Under it, any observed difference between samples is due to chance or sampling error.

Page 9:

Statistical Validity Tests

• Statistics offers many tests designed to measure the significance of any difference.
• Adapting them to classifier comparison must be done carefully.

Page 10:

Bonferroni Adjustment: An Example

• A comparison of classifier algorithms across 154 datasets.
• The null hypothesis was rejected if the p-value was < 0.05 (not a very stringent criterion).
• Differences were reported as significant whenever a t-test produced a p-value < 0.05.

Page 11:

Example (cont.)

• This is not a correct use of the p-value significance test.
• There were 154 experiments, and therefore 154 chances to reach significance.
• Naively scaling the threshold gives an effective significance level of 154 * 0.05 = 7.7, which exceeds 1 and is clearly not a valid probability.

Page 12:

Example (cont.)

• Let the significance level for each test be α.
• The chance of drawing the right conclusion in one experiment is 1 - α.
• Assuming the experiments are independent of one another, the chance of getting all n experiments correct is (1 - α)^n.
• The chance of drawing at least one incorrect conclusion is therefore 1 - (1 - α)^n.

Page 13:

Example (cont.)

• Substituting α = 0.05, the chance of drawing at least one incorrect conclusion is 1 - (0.95)^154 ≈ 0.9996.
• To obtain results significant at the 0.05 level across 154 tests, we need 1 - (1 - α)^154 < 0.05, i.e. α < 0.0003.
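
A quick check of the slides' arithmetic in plain Python; it uses no assumptions beyond the numbers already given above.

```python
# Reproducing the slides' arithmetic for n = 154 tests at alpha = 0.05.
alpha, n = 0.05, 154

# Chance of at least one spurious "significant" result.
print(1 - (1 - alpha) ** n)    # ~0.9996

# Per-test threshold for an overall 0.05 level (Bonferroni approximation).
print(0.05 / n)                # ~0.00032

# Exact solution of 1 - (1 - alpha)^154 < 0.05.
print(1 - 0.95 ** (1 / n))     # ~0.00033
```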

Page 14:

Example: Conclusion

• These are rough calculations, but they provide insight into the problem.
• Using the wrong p-value leads to incorrect conclusions.
• Moreover, the t-test is the wrong test here altogether, since the training and test sets are not independent.

Page 15:

Simple Recommended Statistical Test

When a common test set is used to compare two algorithms A and B, the comparison must consider four numbers:

• A > B: examples that A gets right and B gets wrong
• A < B: examples that B gets right and A gets wrong
• A = B: examples that both get right
• ~A = ~B: examples that both get wrong

Page 16:

Simple Recommended Statistical Test (cont.)

• If only two algorithms are compared: throw out the ties and compare the A > B count with the A < B count (a sign test; see the sketch below).
• If more than two algorithms are compared: use "Analysis of Variance" (ANOVA), and apply a Bonferroni adjustment for the multiple tests.
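
A minimal sketch of the two-algorithm sign test just described, assuming SciPy ≥ 1.7 (for `binomtest`); the win counts are hypothetical.

```python
# Sign test for two algorithms on a shared test set; counts are hypothetical.
from scipy import stats

wins_a = 30  # examples A gets right and B gets wrong (A > B)
wins_b = 14  # examples B gets right and A gets wrong (A < B)
# Ties (both right, or both wrong) have already been thrown out.

# Under the null hypothesis, wins split 50/50 between A and B.
result = stats.binomtest(wins_a, wins_a + wins_b, p=0.5)
print(f"p = {result.pvalue:.4f}")
```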

Page 17:

Statistical Accidents

• Suppose 100 people are studying the effect of algorithms A and B.
• On average, 5 of them will get results statistically significant at p ≤ 0.05 (assuming independent experiments).
• Those results are due to nothing but chance.

Page 18:

Repeated Tuning

• Algorithms are "tuned" repeatedly on the same datasets.
• Every tuning attempt should be counted as a separate experiment.
• For example, if 10 tuning experiments were attempted, the significance threshold should be 0.05 / 10 = 0.005 instead of 0.05.

Page 19:

Repeated Tuning (cont.)

• The datasets are not independent, so even the Bonferroni adjustment is not very accurate.
• A greater problem arises when using an algorithm that has been used before: you may not know how it was tuned previously (one disadvantage of using public databases).

Page 20:

Repeated Tuning: Recommended Approach

• Break the dataset into k disjoint subsets of approximately equal size.
• Perform k experiments; in each one, a different subset is held out.
• The system is trained on the remaining subsets and tested on the held-out subset.

Page 21:

Repeated Tuning: Recommended Approach (cont.)

• At the end of the k-fold experiment, every sample has been used in a test set exactly once (sketched below).
• Advantage: the test sets are independent.
• Disadvantage: the training sets are clearly not independent.
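
A minimal sketch of the k-fold procedure just described, assuming scikit-learn; the dataset (iris) and the classifier (a decision tree) are illustrative placeholders, not choices from the paper.

```python
# k-fold cross-validation with scikit-learn; dataset/classifier are placeholders.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=10, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kf.split(X):
    # Train on the other k-1 subsets, test on the held-out subset.
    clf = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))

# Every sample has appeared in a test set exactly once.
print(f"mean accuracy over {len(scores)} folds: {np.mean(scores):.3f}")
```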

Page 22:

A Recommended Approach

• Choose the other algorithms to include in the comparison; try to include those most similar to the new algorithm.
• Choose the datasets.
• Divide each dataset into k subsets for cross-validation; typically k = 10.
• For a small dataset, choose a larger k, since this leaves more examples in the training set.

Page 23:

A Recommended Approach (cont.)

Run a cross-validation (a sketch follows this list):

• For each of the k subsets D_k of the dataset D, create a training set T = D - D_k.
• Divide T into T1 (training) and T2 (tuning) subsets.
• Once tuning is done, rerun training on all of T.
• Finally, measure accuracy on the held-out subset D_k.
• Overall accuracy is averaged across all k partitions.
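
A minimal sketch of the tuning-inside-cross-validation loop above, assuming scikit-learn; the dataset, classifier, and tuned parameter (max_depth) are illustrative placeholders.

```python
# Tuning inside cross-validation with scikit-learn; all names are placeholders.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
outer = KFold(n_splits=10, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in outer.split(X):
    X_T, y_T = X[train_idx], y[train_idx]  # T = D minus the held-out fold D_k

    # Divide T into T1 (training) and T2 (tuning).
    X_t1, X_t2, y_t1, y_t2 = train_test_split(
        X_T, y_T, test_size=0.25, random_state=0)

    # Tune on T2: keep the parameter value that scores best there.
    best_depth = max(
        range(1, 8),
        key=lambda d: DecisionTreeClassifier(max_depth=d, random_state=0)
                      .fit(X_t1, y_t1).score(X_t2, y_t2))

    # Once tuning is done, rerun training on all of T, then score on D_k.
    clf = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
    scores.append(clf.fit(X_T, y_T).score(X[test_idx], y[test_idx]))

print(f"overall accuracy averaged across folds: {np.mean(scores):.3f}")
```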

Page 24:

A Recommended Approach (cont.)

• Finally, compare the algorithms.
• When multiple datasets are used, a Bonferroni adjustment should be applied.

Page 25:

Conclusion

• We don't mean to discourage empirical comparisons, but to provide suggestions for avoiding the pitfalls.
• Statistical tools should be used carefully.
• Every detail of the experiment should be reported.