On Comparing Classifiers: Pitfalls to
Avoid and Recommended
Approach
Published by Steven L. Salzberg
Presented by Prakash Tilwani
MACS 598, April 25th 2001
Agenda
• Introduction
• Classification basics
• Definitions
• Statistical validity
  • Bonferroni Adjustment
  • Statistical Accidents
  • Repeated Tuning
• A Recommended Approach
• Conclusion
Introduction
• Comparative studies – do they follow proper methodology?
• Public databases – have we relied on them too heavily?
• Comparison results – are they really correct, or just statistical accidents?
Definitions
• t-test
• F-test
• p-value
• Null hypothesis
t-test
• Assesses whether the means of two groups are statistically different from each other.
• Computed as the ratio of the difference in means to the variability of the groups.
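As an illustrative sketch (not from the original slides), the "ratio of difference in means to variability" can be computed directly. This is Welch's form of the two-sample t statistic; the accuracy scores below are invented:

```python
import math

def welch_t(xs, ys):
    """Two-sample t statistic: difference in means over pooled variability."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)  # sample variance of xs
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)  # sample variance of ys
    return (mx - my) / math.sqrt(vx / nx + vy / ny)

# Two groups of accuracy scores (illustrative numbers only)
a = [0.81, 0.79, 0.83, 0.80, 0.82]
b = [0.75, 0.77, 0.74, 0.76, 0.78]
print(round(welch_t(a, b), 2))  # -> 5.0
```

A large t value (relative to the t distribution for the given degrees of freedom) is what lets one reject the null hypothesis that the two means are equal.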
F-test
• Determines whether the variances of two samples are significantly different.
• Computed as the ratio of the variances of two datasets.
• Basis for “Analysis of Variance” (ANOVA).
p-value
• The probability of concluding (incorrectly) that there is a difference between samples when no true difference exists.
• Depends on the statistical test being performed.
• p = 0.05 means there is a 5% chance of being wrong when concluding that the populations are different.
Null hypothesis
• The assumption that there is no difference between two or more populations.
• Under it, any observed difference between samples is due to chance or sampling error.
Statistical Validity Tests
• Statistics offers many tests designed to measure the significance of any difference.
• Adapting them to classifier comparison must be done carefully.
Bonferroni Adjustment – an example
• A comparison of classifier algorithms on 154 datasets.
• The null hypothesis was rejected whenever the p-value was < 0.05 (not a very stringent criterion).
• Differences were reported as significant if a t-test produced a p-value < 0.05.
Example (cont.)
• This is not a correct use of the p-value significance test.
• There were 154 experiments, and therefore 154 chances to be significant.
• The effective significance level actually used is 154 × 0.05 = 7.7.
Example (cont.)
• Let the significance level for each test be α.
• The chance of reaching the right conclusion in one experiment is 1 − α.
• Assuming the experiments are independent of one another, the chance of getting all n experiments correct is (1 − α)^n.
• The chance of at least one incorrect conclusion is therefore 1 − (1 − α)^n.
Example (cont.)
• Substituting α = 0.05 and n = 154: the chance of making at least one incorrect conclusion is 1 − (0.95)^154 ≈ 0.9996.
• To obtain results significant at the 0.05 level with 154 tests, we need 1 − (1 − α)^154 < 0.05, i.e. α < 0.0003.
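The arithmetic on this slide is easy to check with a short sketch (plain Python, purely illustrative):

```python
def familywise_error(alpha, n):
    """Probability of at least one false positive across n independent
    tests, each run at per-test significance level alpha."""
    return 1 - (1 - alpha) ** n

# With alpha = 0.05 and 154 tests, an "accidental" significant
# result is almost guaranteed:
print(round(familywise_error(0.05, 154), 4))  # -> 0.9996

# Per-test level needed to keep the family-wise error below 0.05:
print(round(1 - 0.95 ** (1 / 154), 6))        # -> 0.000333
```

Note that the simpler Bonferroni rule α/n = 0.05/154 ≈ 0.000325 gives nearly the same threshold.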
Example – conclusion
• These are rough calculations, but they provide insight into the problem.
• Using the wrong p-value leads to incorrect conclusions.
• The t-test is also the wrong test overall here, since the training and test sets are not independent.
Simple Recommended Statistical Test
• When a common test set is used to compare two algorithms (A and B), the comparison must consider four numbers:
  • examples A got right and B got wrong
  • examples A got wrong and B got right
  • examples both got right
  • examples both got wrong
Simple Recommended Statistical Test (cont.)
• If only two algorithms are compared: throw out the ties, then compare the count of A-only wins against the count of B-only wins.
• If more than two algorithms are compared: use “Analysis of Variance” (ANOVA), and apply the Bonferroni adjustment for multiple tests.
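The two-algorithm comparison above amounts to a two-sided binomial sign test: after ties are discarded, each remaining example is a coin flip under the null hypothesis. A minimal sketch (the win counts below are invented for illustration):

```python
from math import comb

def sign_test_p(wins_a, wins_b):
    """Two-sided binomial sign test. Under the null hypothesis, A beats B
    on any non-tied example with probability 0.5."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)  # two-sided, capped at 1

# Hypothetical counts from a common test set, ties already thrown out:
# A beat B on 15 examples, B beat A on 4.
print(round(sign_test_p(15, 4), 4))  # -> 0.0192
```

With these (made-up) counts the difference would be significant at the 0.05 level; remember that the Bonferroni adjustment still applies if many such comparisons are run.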
Statistical Accidents
• Suppose 100 people are studying the effect of algorithms A and B.
• Even if there is no real difference, about 5 of them will get results statistically significant at p ≤ 0.05 (assuming independent experiments).
• Those results are nothing but chance.
Repeated Tuning
• Algorithms are “tuned” repeatedly on the same datasets.
• Every tuning attempt should be counted as a separate experiment.
• For example, if 10 tuning experiments were attempted, the significance threshold should be 0.005 instead of 0.05.
Repeated Tuning (cont.)
• The experiments are not independent, so even the Bonferroni adjustment is not very accurate here.
• A greater problem arises when using an algorithm that has been used before: you may not know how it was tuned (one disadvantage of public databases).
Repeated Tuning – Recommended approach
• Break the dataset into k disjoint subsets of approximately equal size.
• Perform k experiments; in each one, remove one subset.
• Train on the remaining data and test the trained system on the held-out subset.
Repeated Tuning – Recommended approach (cont.)
• At the end of the k-fold experiment, every sample has been used in a test set exactly once.
• Advantage: the test sets are independent.
• Disadvantage: the training sets are clearly not independent.
A Recommended Approach
• Choose other algorithms to include in the comparison; try to include those most similar to the new algorithm.
• Choose datasets.
• Divide each dataset into k subsets for cross-validation; typically k = 10.
• For a small dataset, choose a larger k, since this leaves more examples in the training set.
A Recommended Approach (cont.)
• Run a cross-validation:
  • For each of the k subsets k_i of the dataset D, create a training set T = D − k_i.
  • Divide T into T1 (training) and T2 (tuning) subsets.
  • Once tuning is done, rerun training on all of T.
  • Finally, measure accuracy on the held-out subset k_i.
• Overall accuracy is averaged across all k partitions.
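The cross-validation loop above can be sketched as follows. `fit` and `accuracy` are stand-ins for a real learner (the toy "model" here is just the majority label), and the T1/T2 tuning split is omitted for brevity:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Split sample indices 0..n-1 into k disjoint folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(data, k, fit, accuracy):
    """For each fold: hold it out, train on the rest, test on the held-out
    fold.  A T1/T2 tuning split of the training rows would go inside fit."""
    folds = k_fold_indices(len(data), k)
    scores = []
    for test_idx in folds:
        test_set = set(test_idx)
        train_rows = [data[i] for i in range(len(data)) if i not in test_set]
        model = fit(train_rows)                      # train on T = D - k_i
        test_rows = [data[i] for i in test_idx]
        scores.append(accuracy(model, test_rows))    # test on held-out k_i
    return sum(scores) / k  # overall accuracy averaged across k partitions

# Toy stand-in learner: the "model" is the majority label of the training set.
data = [(x, x > 5) for x in range(20)]  # (feature, label) pairs, invented
fit = lambda rows: max(set(r[1] for r in rows), key=[r[1] for r in rows].count)
accuracy = lambda m, rows: sum(r[1] == m for r in rows) / len(rows)
print(cross_validate(data, 10, fit, accuracy))
```

Each sample lands in exactly one test fold, which is what makes the k test sets independent of one another.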
A Recommended Approach (cont.)
• Finally, compare the algorithms.
• With multiple datasets, the Bonferroni adjustment should be applied.
Conclusion
• We don’t mean to discourage empirical comparisons, but to offer suggestions for avoiding pitfalls.
• Statistical tools should be used carefully.
• Every detail of the experiment should be reported.