On the Role of Dataset Complexity in Case-Based Reasoning Derek Bridge UCC Ireland (based on work done with Lisa Cummins)

Dec 13, 2015


Transcript
Page 1

On the Role of Dataset Complexity in Case-Based Reasoning

Derek Bridge
UCC, Ireland

(based on work done with Lisa Cummins)

Page 2

Dataset

Page 3

classifier

Page 4

[Diagram: CB, case base maintenance algorithm, CB]

Page 5

Overview

• Dataset complexity measures
• Classification experiment
• Case base maintenance experiment
• Going forward

Page 6

Overview

• Dataset complexity measures
• Classification experiment
• Case base maintenance experiment
• Going forward

Page 7

Dataset Complexity Measures

• Measures of classification difficulty
  • apparent difficulty, since we measure a dataset which samples the problem space
• Little impact on CBR
  • Fornells et al., ICCBR 2009
  • Cummins & Bridge, ICCBR 2009
• (Little impact on ML in general!)

Page 8

Dataset Complexity Measures

• Survey of 12 geometrical measures
  • Ho & Basu, 2002
• DCoL: open source C++ library of 13 measures
  • Orriols-Puig et al., 2009
• We have found 4 candidate measures in the CBR literature

Page 9

Overlap of attribute values

F1 Maximum Fisher’s Discriminant Ratio

F2' Volume of Overlap Region

F3' Maximum Attribute Efficiency

F4' Collective Attribute Efficiency

Page 10

Separability of classes

N1' Fraction of Instances on a Boundary

N2 Ratio of Average Intra/Inter Class Distance

N3 Error Rate of a 1NN classifier

L1 Minimized Sum of Error Distance of a Linear Classifier

L2 Training Error of a Linear Classifier

C1 Complexity Profile

C2 Similarity-Weighted Complexity Profile

N5 Separability Emphasis Measure

Page 11

Manifold Topology & Density

L3 Nonlinearity of a Linear Classifier

N4 Nonlinearity of a 1NN Classifier

T1 Fraction of Maximum Covering Spheres

T2 Number of Instances per Attribute

T3 Dataset Competence

Page 12

Dataset Complexity Measures

• Desiderata
  • Predictive
  • Independent of what is being analyzed
  • Widely applicable across datasets
  • Cheap-to-compute
  • Incremental
  • Transparent/explainable

Page 13

Overview

• Dataset complexity measures
• Classification experiment
• Case base maintenance experiment
• Going forward

Page 14

Classification experiment

• 25 datasets
  • 14 Boolean classification; 11 multi-class
  • 21 numeric-valued attributes only (12 Boolean classification; 9 multi-class)
• 4 Weka classifiers trained on 60% of dataset
  • Neural Net with 1 hidden layer
  • SVM with SMO
  • J48
  • IBk with k = 3
• Error measured on 20% of dataset
• Repeated 10 times
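The 60% / 20% protocol above can be sketched as follows. This is only an illustration: the function name `split_60_20_20`, the use of a seeded shuffle, and the treatment of the last 20% as a held-back portion are assumptions, not details given on the slide.

```python
import random

def split_60_20_20(n_instances, seed):
    """Shuffle instance indices and cut them into 60% / 20% / 20% parts.

    Hypothetical helper: the slides only state that classifiers are trained
    on 60% of a dataset and that error is measured on 20% of it.
    """
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    cut_train = n_instances * 6 // 10   # integer arithmetic avoids float rounding
    cut_test = n_instances * 8 // 10
    return indices[:cut_train], indices[cut_train:cut_test], indices[cut_test:]

# Ten repetitions, each with a different shuffle (seed values are arbitrary).
runs = [split_60_20_20(150, seed) for seed in range(10)]
```

Each repetition would train on the first part and measure error on the second; the per-dataset error is then the average over the ten runs.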

Page 15

An example of the results

Dataset       NN     SVM    J48    IBk    Mean   N1'
Iris          2.67   4.00   5.00   2.67   3.58   0.13
Lung Cancer   58.00  50.00  46.00  56.00  52.50  0.75

Page 16

An example of the results

[Scatter plot: N1' (axis 0.00 to 1.00) against Error % (axis 0.00 to 60.00)]

Correlation coefficient: 0.96
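The slide does not say which correlation coefficient is reported. Assuming it is Pearson's r between a complexity measure and the mean error over the datasets, a minimal sketch looks like this (the function name `pearson` and the sample numbers are illustrative only, not the experiment's data):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Made-up values: one complexity score and one mean error (%) per dataset.
complexity = [0.10, 0.25, 0.40, 0.70]
mean_error = [4.0, 15.0, 22.0, 50.0]
r = pearson(complexity, mean_error)
```

A value of r close to 1 means the measure ranks and scales datasets much as the classifiers' error does.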

Page 17

N1' Fraction of instances on a boundary

• Build a minimum spanning tree

• Compute fraction of instances directly connected to instances of a different class

• Shuffle dataset, repeat, & average

[Figure: example minimum spanning tree over a small dataset]
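The three steps above can be sketched in plain Python. This is a sketch under assumptions: Euclidean distance on numeric attributes, Prim's algorithm for the spanning tree, and the guess that shuffling is there because ties between equal distances can produce different trees. All function names are illustrative.

```python
import math
import random

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (i, j) edges."""
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n
    best_from = [-1] * n
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the cheapest instance not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v], best_from[v] = d, u
    return edges

def boundary_fraction(points, labels):
    """Fraction of instances joined by an MST edge to a different class."""
    on_boundary = set()
    for i, j in mst_edges(points):
        if labels[i] != labels[j]:
            on_boundary.update((i, j))
    return len(on_boundary) / len(points)

def n1_prime(points, labels, repeats=10, seed=0):
    """Average the boundary fraction over several shuffles of the dataset."""
    rng = random.Random(seed)
    order = list(range(len(points)))
    total = 0.0
    for _ in range(repeats):
        rng.shuffle(order)
        total += boundary_fraction([points[i] for i in order],
                                   [labels[i] for i in order])
    return total / repeats
```

For two well-separated clusters of three instances each, only the single edge that bridges the clusters crosses a class boundary, so two of the six instances lie on a boundary.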

Page 18

Other competitive measures

• N3 Error Rate of a 1NN Classifier
  • leave-one-out error rate of 1NN on the dataset
• N2 Ratio of Average Intra/Inter Class Distance
  • sum distances to nearest neighbour of same class
  • divide by sum of distances to nearest neighbour of different class
• L2 Training Error of a Linear Classifier
  • build, e.g., SVM on dataset
  • compute error on original dataset
  • problems with multi-class; problems with symbolic values
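N2 and N3 as described above are short enough to sketch directly. The sketch assumes Euclidean distance on numeric attributes, at least two classes, and at least two instances per class; the function names are illustrative.

```python
import math

def n2_ratio(points, labels):
    """Sum of nearest same-class distances over sum of nearest other-class distances."""
    intra = 0.0
    inter = 0.0
    for i, (p, y) in enumerate(zip(points, labels)):
        same = [math.dist(p, q)
                for j, (q, z) in enumerate(zip(points, labels))
                if j != i and z == y]
        other = [math.dist(p, q) for q, z in zip(points, labels) if z != y]
        intra += min(same)
        inter += min(other)
    return intra / inter

def n3_loo_1nn_error(points, labels):
    """Leave-one-out error rate of a 1NN classifier on the dataset."""
    errors = 0
    for i, p in enumerate(points):
        nearest = min((j for j in range(len(points)) if j != i),
                      key=lambda j: math.dist(p, points[j]))
        errors += labels[nearest] != labels[i]
    return errors / len(points)
```

Low values of both indicate well-separated classes; a ratio near or above 1 for N2 means instances are about as close to other classes as to their own.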

Page 19

C1 Complexity Profile

• Computed for each instance, with parameter K [Massie et al. 2006]

• For a dataset measure, compute average complexity

[Bar chart for K = 3: Pk against k, with P1 = 1, P2 = 0.5, P3 = 0.67]
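The slide does not spell out how Pk is defined. The sketch below assumes, following the usual reading of Massie et al., that Pk is the proportion of an instance's k nearest neighbours that belong to a different class, and that an instance's complexity is the mean of P1..PK. Euclidean distance and the function names are further assumptions.

```python
import math

def complexity_profile(points, labels, i, K):
    """Return [P1, ..., PK] for instance i.

    Pk is taken to be the fraction of the k nearest neighbours of instance i
    whose class differs from that of instance i (assumed definition).
    """
    neighbours = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
    profile = []
    different = 0
    for k in range(1, K + 1):
        different += labels[neighbours[k - 1]] != labels[i]
        profile.append(different / k)
    return profile

def c1(points, labels, K):
    """Dataset-level measure: average over instances of the mean of P1..PK."""
    per_instance = [sum(complexity_profile(points, labels, i, K)) / K
                    for i in range(len(points))]
    return sum(per_instance) / len(per_instance)

# An instance whose three nearest neighbours are: other class, same class, other class.
example = complexity_profile([(0.0,), (1.0,), (2.0,), (3.0,)],
                             ["a", "b", "a", "b"], 0, 3)
```

Under these assumptions the example instance has the profile 1, 0.5, 0.67 (to two decimals), and its complexity is the mean of those three values.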

Page 20

Other measures from CBR

• C2 Similarity-Weighted Complexity Profile
  • use similarity values when computing Pk
• N5 Separability Emphasis Measure [Fornells et al. ’09]
  • N5 = N1' × N2
• T3 Dataset Competence [Smyth & McKenna ’98]
  • competence groups based on overlapping coverage sets
  • group coverage based on size and similarity
  • dataset competence as sum of group coverages

Page 21

Their predictivity

• C1 Complexity Profile
  • Correlation coefficient: 0.98
• C2 Similarity-Weighted Complexity Profile
  • Correlation coefficient: 0.97
• N5 Separability Emphasis Measure
  • Between N1' and N2
• T3 Dataset Competence
  • Correlation coefficient: near zero

Page 22

Summary of experiment

• Very predictive
  • C1 Complexity Profile
  • N3 Error Rate of 1NN Classifier
  • N1' Fraction of Instances on a Boundary
• Predictive but problems with applicability
  • L2 Training Error of a Linear Classifier
• Moderately predictive
  • N2 Ratio of Average Intra/Inter Class Distance
• All are measures of separability of classes

Page 23

Overview

• Dataset complexity measures
• Classification experiment
• Case base maintenance experiment
• Going forward

Page 24

Meta-CBR for Maintenance

• Case base maintenance algorithms seek to:
  • delete noisy cases
  • delete redundant cases

• Different case bases require different maintenance algorithms

• The same case base may require different maintenance algorithms at different times in its life cycle

• We have been building classifiers to select maintenance algorithms

Page 25

[Diagram: a case base (CB) described by complexity measures (F1: 0.2, F2: 0.4, …) is passed to a kNN classifier over the meta-case base. Each meta-case pairs complexity measure values (e.g. F1: 0.9, F2: 0.4, …) with a maintenance algorithm (e.g. RENN, CRR).]

Page 26

Case Base Maintenance Experiment

• Training (building the meta-case base)
  • From 60% of each dataset, create a case base
  • Create a meta-case to describe this case base
    • attributes are complexity measures
    • problem solution
      • run a small set of maintenance algorithms on each case base
      • record % deleted
      • record accuracy on the next 20% of each dataset
      • maintenance algorithm with highest harmonic mean of % deleted and accuracy becomes this meta-case’s solution
  • But, we use feature selection to choose a subset of the complexity measures
    • wrapper method, best-first search
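Choosing a meta-case's solution by harmonic mean can be sketched as below. The function names and the numbers are hypothetical; only the rule itself (highest harmonic mean of % deleted and accuracy wins) comes from the slide.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative numbers (0 if both are 0)."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)

def label_meta_case(results):
    """Pick the maintenance algorithm with the highest harmonic mean.

    `results` maps an algorithm name to (percent_deleted, accuracy_percent).
    """
    return max(results, key=lambda name: harmonic_mean(*results[name]))

# Hypothetical outcomes of three algorithms on one case base.
outcomes = {
    "RENN": (10.0, 80.0),   # deletes little, accurate
    "ICF": (60.0, 70.0),    # deletes a lot, slightly less accurate
    "CBE": (50.0, 75.0),
}
solution = label_meta_case(outcomes)
```

The harmonic mean rewards algorithms that do well on both criteria at once: an algorithm that deletes almost nothing scores poorly even if its accuracy is high.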

Page 27

Case Base Maintenance Experiment

• Testing
  • Target problem is a case base built from remaining 20% of each dataset
    • attributes again are complexity measures
  • Ask the classifier to predict a maintenance algorithm
  • Run the algorithm, record % deleted, accuracy and their harmonic mean
• Compare meta-CBR with perfect classifier and ones that choose same algorithm each time
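The prediction step can be pictured as a kNN vote over meta-cases. This is a sketch only: the value of k, Euclidean distance over the complexity measures, majority voting, and the function name `predict_algorithm` are assumptions, and the meta-cases are made up.

```python
import math
from collections import Counter

def predict_algorithm(meta_cases, query, k=3):
    """Majority vote among the k meta-cases nearest to the query.

    Each meta-case is (complexity_measure_vector, maintenance_algorithm_name).
    """
    nearest = sorted(meta_cases, key=lambda mc: math.dist(mc[0], query))[:k]
    votes = Counter(algorithm for _, algorithm in nearest)
    return votes.most_common(1)[0][0]

# Made-up meta-case base over two complexity measures.
meta_case_base = [
    ((0.9, 0.4), "RENN"),
    ((0.3, 0.4), "RENN"),
    ((0.7, 0.4), "RENN"),
    ((0.2, 0.3), "CRR"),
]
predicted = predict_algorithm(meta_case_base, (0.2, 0.4), k=3)
```

The predicted algorithm would then be run on the target case base and scored in the same way as during training.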

Page 28

Example results

Classifier    Cases deleted (%)   Accuracy (%)   Harmonic mean
Choose-best   72.37               71.86          69.56
Meta-CBR      66.32               70.76          63.98
Choose ICF    64.54               69.63          62.29
Choose CBE    57.11               72.64          60.41

Page 29

Which measures get selected?

Page 30

F2' Volume of Overlap Region

• For a given attribute, a measure of how many values for that attribute appear in instances labelled with different classes

[Figure omitted; axis: attribute a]

Page 31

Quick computation of F2

[Figure: along attribute a, the min and max of each class's values, with the overlap region between them]
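For reference, Ho & Basu's F2 is usually computed per attribute from the class minima and maxima alone: the overlap is min(max) minus max(min), divided by the overall range, and the per-attribute ratios are multiplied together. The sketch below follows that form; clipping a negative overlap to zero, the requirement that every attribute takes at least two distinct values, and the function name are assumptions.

```python
def f2_volume_of_overlap(rows, labels):
    """Product over attributes of (overlap of class ranges) / (overall range)."""
    classes = sorted(set(labels))
    volume = 1.0
    for a in range(len(rows[0])):
        mins = [min(r[a] for r, y in zip(rows, labels) if y == c) for c in classes]
        maxs = [max(r[a] for r, y in zip(rows, labels) if y == c) for c in classes]
        overlap = max(0.0, min(maxs) - max(mins))  # assumed: clip disjoint ranges to 0
        spread = max(maxs) - min(mins)             # assumed non-zero
        volume *= overlap / spread
    return volume

# Class "a" spans [1, 4] and class "b" spans [3, 6] on a single attribute.
f2_example = f2_volume_of_overlap([(1.0,), (4.0,), (3.0,), (6.0,)],
                                  ["a", "a", "b", "b"])
```

Only four numbers per attribute and class pair are needed, which is what makes the computation quick; it also means the result depends entirely on the extreme values.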

Page 32

A problem for F2

[Figure: per-class min and max along attribute a]

Page 33

F2' Our version

• o'(a) = count how many values are in the overlap
• r'(a) = count the number of values of a

$$F2' = \prod_{i=1}^{n} \frac{o'(a_i)}{r'(a_i)}$$
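A count-based version along these lines can be sketched as follows. Several points are assumptions rather than details from the slide: the overlap region is taken to be the interval from the largest class minimum to the smallest class maximum, each instance contributes one value to the counts, and the per-attribute ratios are combined as a product, as in Ho & Basu's F2. The function name is illustrative.

```python
def f2_prime(rows, labels):
    """Product over attributes of o'(a) / r'(a), using counts of values."""
    result = 1.0
    for a in range(len(rows[0])):
        by_class = {}
        for row, y in zip(rows, labels):
            by_class.setdefault(y, []).append(row[a])
        low = max(min(values) for values in by_class.values())
        high = min(max(values) for values in by_class.values())
        in_overlap = sum(1 for row in rows if low <= row[a] <= high)  # o'(a)
        total = len(rows)                                             # r'(a)
        result *= in_overlap / total
    return result

# Class "a" spans [1, 4], class "b" spans [3.5, 6]; only 4 and 3.5 fall in the overlap.
rows = [(1.0,), (2.0,), (3.0,), (4.0,), (3.5,), (5.0,), (6.0,)]
labels = ["a", "a", "a", "a", "b", "b", "b"]
f2_prime_example = f2_prime(rows, labels)
```

Because it counts values rather than measuring interval lengths, a single extreme value changes the result far less than it changes the range-based F2.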

Page 34

Summary of experiment

• Feature selection
  • chose between 2 and 18 attributes, average 9.2
  • chose range of measures, across Ho & Basu’s categories
  • always at least one measure of overlap of attribute values, e.g. F2'
  • but measures of class separability only about 50% of the time

• But this is just one experiment

Page 35

Overview

• Dataset complexity measures
• Classification experiment
• Case base maintenance experiment
• Going forward

Page 36

Going forward

• Use of complexity measures in CBR (and ML)
• More research into complexity measures:
  • experiments with more datasets, different datasets, more classifiers, …
  • new measures, e.g. Information Gain
  • applicability of measures
    • missing values
    • loss functions
    • dimensionality reduction, e.g. PCA
  • the CBR similarity assumption and measures of case alignment [Lamontagne 2006, Hüllermeier 2007, Raghunandan et al. 2008]