Jan 20, 2016
Bell Laboratories
Intrinsic complexity of classification problems
Tin Kam Ho
With contributions from
Mitra Basu, Ester Bernado-Mansilla, Richard Baumgartner, Martin Law,Erinija Pranckeviciene, Albert Orriols-Puig, Nuria Macia
2 All Rights Reserved © Alcatel-Lucent 2008
Supervised Learning: Many Methods, Data Dependent Performances
Bayesian classifiers, logistic regression, linear & polynomial discriminators, nearest-neighbors, decision trees & forests, neural networks, support vector machines, ensemble methods, …
Data ZeroR NN1 NNK NB C4.5 PART SMO XCS
aud 25.3 76.0 68.4 69.6 79.0 81.2 - 57.7
aus 55.5 81.9 85.4 77.5 85.2 83.3 84.9 85.7
bal 45.0 76.2 87.2 90.4 78.5 81.9 - 79.8
bpa 58.0 63.5 60.6 54.3 65.8 65.8 58.0 68.2
bps 51.6 83.2 82.8 78.6 80.1 79.0 86.4 83.3
bre 65.5 96.0 96.7 96.0 95.4 95.3 96.7 96.0
cmc 42.7 44.4 46.8 50.6 52.1 49.8 - 52.3
gls 34.6 66.3 66.4 47.6 65.8 69.0 - 72.6
h-c 54.5 77.4 83.2 83.6 73.6 77.9 - 79.9
hep 79.3 79.9 80.8 83.2 78.9 80.0 83.9 83.2
irs 33.3 95.3 95.3 94.7 95.3 95.3 - 94.7
krk 52.2 89.4 94.9 87.0 98.3 98.4 96.1 98.6
lab 65.4 81.1 92.1 95.2 73.3 73.9 93.2 75.4
led 10.5 62.4 75.0 74.9 74.9 75.1 - 74.8
lym 55.0 83.3 83.6 85.6 77.0 71.5 - 79.0
mmg 56.0 63.0 65.3 64.7 64.8 61.9 67.0 63.4
mus 51.8 100.0 100.0 96.4 100.0 100.0 100.0 99.8
mux 49.9 78.6 99.8 61.9 99.9 100.0 61.6 100.0
pmi 65.1 70.3 73.9 75.4 73.1 72.6 76.7 76.0
prt 24.9 34.5 42.5 50.8 41.6 39.8 - 43.7
seg 14.3 97.4 96.1 80.1 97.2 96.8 - 96.1
sick 93.8 96.1 96.3 93.3 98.4 97.0 93.8 96.7
soyb 13.5 89.5 90.3 92.8 91.4 90.3 - 76.2
tao 49.8 96.1 96.0 80.8 95.1 93.6 83.6 88.4
thy 19.5 68.1 65.1 80.6 92.1 92.1 - 86.3
veh 25.1 69.4 69.7 46.2 73.6 72.6 - 72.2
vote 61.4 92.4 92.6 90.1 96.3 96.5 95.6 95.4
vow 9.1 99.1 96.6 65.3 80.7 78.3 - 87.6
wne 39.8 95.6 96.8 97.8 94.6 92.9 - 96.3
zoo 41.7 94.6 92.5 95.4 91.6 92.5 - 92.6
Avg 44.8 80.0 82.4 78.0 82.1 81.8 84.1 81.7
• No single method is a clear winner across all problems
• On a practical problem, accuracy often reaches a limit even with the best known method
Accuracy Depends on the Goodness of Match between Classifiers and Problems
[Figure: two example problems. Problem A: NN error = 0.06% vs. XCS error = 1.9% (NN is better). Problem B: XCS error = 0.6% vs. NN error = 0.7% (XCS is better).]
Measuring Geometrical Complexity of Classification Problems
Our goal: tools and languages for studying
Characteristics of geometry & topology of high-dim data sets
How they change with feature transformations and sampling
How they interact with classifier geometry
We want to know:
What are real-world problems like? What is my problem like? What can be expected of a method on a specific problem?
Parameterization of Data Complexity
Some Useful Measures of Geometric Complexity
Fisher's Discriminant Ratio
f = (μ₁ − μ₂)² / (σ₁² + σ₂²)
Classical measure of class separability
Maximize over all features to find the most discriminating
Degree of Linear Separability
Find separating hyper-plane by linear programming
Error counts and distances to plane measure separability
Length of Class Boundary
Compute minimum spanning tree
Count class-crossing edges
Shapes of Class Manifolds
Cover same-class pts with maximal balls
Ball counts describe shape of class manifold
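Two of these measures are simple enough to sketch directly. The following is a minimal illustration in Python (assuming NumPy and SciPy), not the exact implementation used in the study: the per-feature Fisher discriminant ratio, and the boundary-length measure as the fraction of minimum-spanning-tree edges that join points of opposite classes.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def fisher_ratio(X, y):
    """Per-feature Fisher discriminant ratio f = (mu1 - mu2)^2 / (s1^2 + s2^2)."""
    X0, X1 = X[y == 0], X[y == 1]
    return (X0.mean(0) - X1.mean(0)) ** 2 / (X0.var(0) + X1.var(0))

def boundary_fraction(X, y):
    """Length of class boundary: fraction of minimum-spanning-tree
    edges that connect points of opposite classes."""
    mst = minimum_spanning_tree(squareform(pdist(X))).tocoo()
    cross = sum(int(y[i] != y[j]) for i, j in zip(mst.row, mst.col))
    return cross / len(mst.row)
```

On two well-separated Gaussian clouds, the discriminating feature yields a large Fisher ratio and the boundary fraction stays small; under random labeling, roughly half the MST edges cross classes.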
Real-World Data Sets:
Benchmarking data from UC-Irvine archive
844 two-class problems: 452 are linearly separable, 392 non-separable
Synthetic Data Sets:
Random labeling of randomly located points: 100 problems in 1-100 dimensions
Using Complexity Measures to Study Problem Distributions
Random labeling
Linearly separable real-world data
Linearly non-separable real-world data
[Axes: Complexity Metric 1 vs. Complexity Metric 2]
Measures of Geometrical Complexity
Distribution of Problems in Complexity Space (legend: lin.sep, lin.nonsep, random)
The First 6 Principal Components
Interpretation of the First 4 PCs
PC 1: 50% of variance: Linearity of boundary and proximity of opposite class neighbor
PC 2: 12% of variance: Balance between within-class scatter and between-class distance
PC 3: 11% of variance: Concentration & orientation of intrusion into opposite class
PC 4: 9% of variance: Within-class scatter
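The variance percentages above come from a principal-component analysis of the standardized (problems × measures) matrix. A minimal sketch of that computation, with hypothetical synthetic data standing in for the real measurements (NumPy only):

```python
import numpy as np

def pca_variance_fractions(M):
    """Fraction of total variance captured by each principal component
    of a (problems x measures) matrix, after standardizing the columns."""
    Z = (M - M.mean(0)) / M.std(0)          # z-score each complexity measure
    s = np.linalg.svd(Z, compute_uv=False)  # singular values, descending
    return s ** 2 / (s ** 2).sum()
```

When one latent factor (such as boundary linearity) drives several measures at once, the first component absorbs a large share of the variance, just as PC 1 does here.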
• Continuous distribution
• Known easy & difficult problems occupy opposite ends
• Few outliers
• Empty regions
[Figure annotations: Random labels, Linearly separable]
Problem Distribution in 1st & 2nd Principal Components
Relating Classifier Behavior to Data Complexity
Class Boundaries Inferred by Different Classifiers
XCS: a genetic-algorithm based classifier
Nearest neighbor classifier
Linear classifier
Domains of Competence of Classifiers
•Which classifier works the best for a given classification problem?
•Can data complexity give us a hint?
[Schematic: complexity metric 1 vs. complexity metric 2, with regions marked NN, LC, XCS, Decision Forest, and unknown regions marked ?]
Domain of Competence Experiment
Use a set of 9 complexity measures: Boundary, Pretop, IntraInter, NonLinNN, NonLinLP, Fisher, MaxEff, VolumeOverlap, Npts/Ndim
Characterize 392 two-class problems from UCI data, all shown to be linearly non-separable
Evaluate 6 classifiers:
NN (1-nearest neighbor)
LP (linear classifier by linear programming)
Odt (oblique decision tree)
Pdfc (random-subspace decision forest; ensemble method)
Bdfc (bagging-based decision forest; ensemble method)
XCS (a genetic-algorithm based classifier; ensemble method)
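To make one of the nine measures concrete: NonLinNN follows the Hoekstra-Duin nonlinearity construction, testing the 1NN classifier on points interpolated between random same-class training pairs. This is a simplified sketch of that idea (assuming NumPy), not the exact implementation from the study:

```python
import numpy as np

def nonlin_1nn(X, y, n_interp=500, seed=0):
    """Nonlinearity of 1NN: error rate on points drawn by linear
    interpolation between random same-class training pairs."""
    rng = np.random.default_rng(seed)
    errors = 0
    for _ in range(n_interp):
        c = rng.choice(np.unique(y))            # pick a class
        idx = np.flatnonzero(y == c)
        i, j = rng.choice(idx, size=2, replace=False)
        t = rng.random()
        p = (1 - t) * X[i] + t * X[j]           # point on a same-class segment
        nearest = ((X - p) ** 2).sum(1).argmin()
        errors += int(y[nearest] != c)
    return errors / n_interp
```

Convex, well-separated classes score near zero; an XOR-style layout, where same-class segments cross the opposite class, scores much higher.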
Identifiable Domains of Competence by NN and LP
Best Classifier for Benchmarking Data
Regions in complexity space where the best classifier is NN, LP, or Odt vs. an ensemble technique
[Panels: Boundary vs. NonLinNN; IntraInter vs. Pretop; MaxEff vs. VolumeOverlap. Legend: ensemble vs. NN/LP/Odt]
Less Identifiable Domains of Competence
Difficulties in Estimating Data Complexity
Apparent vs. True Complexity: Uncertainty in Measures due to Sampling Density
2 points 10 points
100 points 500 points 1000 points
Problem may appear deceptively simple or complex with small samples
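This sampling effect is easy to demonstrate numerically: draw repeated samples of a fixed size from the same underlying two-class distribution and watch how much a boundary proxy fluctuates. The sketch below (an assumed Gaussian problem, NumPy only) uses a simple measure, the fraction of points whose nearest neighbour belongs to the opposite class:

```python
import numpy as np

def opp_nn_fraction(X, y):
    """Fraction of points whose nearest neighbour carries the opposite label."""
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(D, np.inf)                 # exclude self-matches
    return (y[D.argmin(1)] != y).mean()

def measure_spread(n_per_class, trials=200, seed=0):
    """Std. deviation of the measure across repeated samples of one problem."""
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(trials):
        X = np.vstack([rng.normal(0.0, 1, (n_per_class, 2)),
                       rng.normal(2.5, 1, (n_per_class, 2))])
        y = np.r_[np.zeros(n_per_class, int), np.ones(n_per_class, int)]
        vals.append(opp_nn_fraction(X, y))
    return float(np.std(vals))
```

The spread shrinks as the sample grows, which is why complexity estimates from small samples must be treated as uncertain.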
Uncertainty of Estimates at Two Levels
Sparse training data in each problem & complex geometry cause ill-posedness of class boundaries
(uncertainty in feature space)
Sparse sample of problems causes difficulty in identifying regions of dominant competence
(uncertainty in complexity space)
Complexity Estimates and Dimensionality Reduction
Feature selection/transformation may change the difficulty of a classification problem:
• Widening the gap between classes• Compressing the discriminatory information• Removing irrelevant dimensions
It is often unclear to what extent these happen. We seek a quantitative description of such changes.
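A toy version of this effect: greedy forward feature selection guided by leave-one-out 1NN error, applied to a problem with two informative dimensions buried among noise. This is an illustrative sketch (NumPy only), not the selection procedure used for the spectra data below:

```python
import numpy as np

def loo_1nn_error(X, y):
    """Leave-one-out error of the 1-nearest-neighbor classifier."""
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(D, np.inf)
    return (y[D.argmin(1)] != y).mean()

def forward_select(X, y, k):
    """Greedily add the feature that most reduces LOO 1NN error."""
    chosen = []
    for _ in range(k):
        rest = [f for f in range(X.shape[1]) if f not in chosen]
        best = min(rest, key=lambda f: loo_1nn_error(X[:, chosen + [f]], y))
        chosen.append(best)
    return chosen
```

Dropping the irrelevant dimensions typically lowers the 1NN error, i.e. the selected subspace presents an apparently simpler problem than the raw feature space.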
[Scatter plot: 1NN error (0-0.7) vs. Boundary measure (10-90) for forward-feature-selection (FFS) subsets of all datasets; legend: spectra1, colon, spectra2, eogat, ovarian, spectra3]
Spread of classification accuracy and geometrical complexity due to forward feature selection
Conclusions
Summary: Early Discoveries
•Problems distribute in a continuum in complexity space
•Several key measures provide independent characterization
•There exist identifiable domains of classifiers' dominant competence
•Sparse sampling, feature selection, and feature transformation induce variability in complexity estimates
For the Future
Further progress in statistical learning will need systematic, scientific evaluation of the algorithms with problems that are difficult for different reasons.
A “problem synthesizer” will be useful to provide a complete evaluation platform, and reveal the “blind spots” of current learning algorithms.
Rigorous statistical characterization of complexity estimates from limited training data will help gauge the uncertainty, and determine applicability of data complexity methods.
Ongoing:
DCol: Data Complexity Library
ICPR 2010 Contest on Domain of Dominant Competence