Index Driven Selective Sampling for CBR
Nirmalie Wiratunga, Susan Craw, Stewart Massie
The Robert Gordon University, Aberdeen
School of Computing
Mar 28, 2015
Overview
Selective sampling
Cluster creation using an index
Cluster and case utility scores
Evaluation
Selective Sampling
[Figure: selective sampling diagram. Labels: unlabelled cases (pool), select interesting cases, selected cases, labelled cases, index, case-base]
• Relevance feedback
• Distance learning
• Patient monitoring
Uncertainty and Representativeness
[Figure: labelled cases (+, -) and unlabelled cases (?)]
Sampling Procedure
L = set of labelled cases
U = set of unlabelled cases
LOOP
    model <= create-domain-model(L)
    clusters <= create-clusters(model, L, U)
    k-clusters <= select-clusters(k, clusters, L, U)
    FOR 1 to Max-Batch-Size
        case <= select-case(k-clusters, L, U)
        L <= L ∪ get-label(case, oracle)
        U <= U \ case
UNTIL stopping-criterion
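The procedure above can be sketched in Python roughly as follows. This is a minimal sketch, not the authors' implementation: the four steps are passed in as callables with the names used on the slide, the stopping criterion is assumed to be a fixed number of rounds (`max_rounds`, hypothetical), and the toy stand-ins at the bottom exist only to make the loop runnable.

```python
def selective_sampling(L, U, oracle, create_model, create_clusters,
                       select_clusters, select_case, k,
                       max_batch_size, max_rounds):
    """L: dict mapping case -> label; U: set of unlabelled cases."""
    L, U = dict(L), set(U)
    for _ in range(max_rounds):            # assumed stopping criterion
        if not U:
            break
        model = create_model(L)
        clusters = create_clusters(model, L, U)
        k_clusters = select_clusters(k, clusters, L, U)
        for _ in range(max_batch_size):
            if not U:
                break
            case = select_case(k_clusters, L, U)
            L[case] = oracle(case)         # L <= L ∪ get-label(case, oracle)
            U.remove(case)                 # U <= U \ case
    return L, U


# Toy stand-ins (assumptions): cases are integers, the oracle labels parity,
# and all unlabelled cases fall into a single cluster.
def toy_oracle(case):
    return "even" if case % 2 == 0 else "odd"

def toy_select_case(k_clusters, L, U):
    # First case in the selected clusters that is still unlabelled.
    return next(c for cluster in k_clusters for c in cluster if c in U)

L_final, U_final = selective_sampling(
    L={0: "even"}, U={1, 2, 3, 4, 5}, oracle=toy_oracle,
    create_model=lambda L: None,
    create_clusters=lambda model, L, U: [sorted(U)],
    select_clusters=lambda k, clusters, L, U: clusters[:k],
    select_case=toy_select_case,
    k=1, max_batch_size=2, max_rounds=2)
```

With two rounds of batch size two, four cases move from the pool to the labelled set. The model and clusters are rebuilt once per round, so cases labelled within a batch only influence the next round.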
Forming Clusters
[Figure: index tree over features f1, f2 and f3 (branch labels a, b, c, d, e, < N, >= N). Each leaf forms a cluster:]
5 labelled (4X, 1Y), 6 unlabelled
0 labelled, 6 unlabelled
5 labelled (2X, 2Z, 1Y), 0 unlabelled
5 labelled (2X, 2Y, 1Z), 6 unlabelled
5 labelled (4Y, 1Z), 0 unlabelled
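One way to picture cluster formation is to group labelled and unlabelled cases by the index leaf they reach. The sketch below assumes the index is available as a function from a case to a leaf identifier; `toy_leaf_of`, its features and its threshold are hypothetical and stand in for whatever index is actually built.

```python
from collections import defaultdict

def form_clusters(leaf_of, labelled, unlabelled):
    """Group cases by index leaf.

    labelled: list of (case, label) pairs; unlabelled: list of cases.
    Returns {leaf_id: {"labelled": [...], "unlabelled": [...]}}.
    """
    clusters = defaultdict(lambda: {"labelled": [], "unlabelled": []})
    for case, label in labelled:
        clusters[leaf_of(case)]["labelled"].append((case, label))
    for case in unlabelled:
        clusters[leaf_of(case)]["unlabelled"].append(case)
    return dict(clusters)


# Hypothetical two-level index: split on f1, then on a numeric threshold for f2.
def toy_leaf_of(case):
    if case["f1"] == "a":
        return "f1=a"
    return "f1=b,f2<5" if case["f2"] < 5 else "f1=b,f2>=5"

labelled = [({"f1": "a", "f2": 1}, "X"), ({"f1": "b", "f2": 7}, "Y")]
unlabelled = [{"f1": "a", "f2": 9}, {"f1": "b", "f2": 2}]
clusters = form_clusters(toy_leaf_of, labelled, unlabelled)
```

A leaf can end up with unlabelled cases only, as in the "0 labelled, 6 unlabelled" cluster on the slide.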
Analysing Clusters
[Figure: clusters shown with the class labels (X, Y, Z) of their labelled cases]
Ranking Clusters - Cluster Utility Score
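The conclusions describe ClUS as capturing uncertainty within a cluster, weighted by entropy. As a hedged illustration of the entropy ingredient only (this is not the ClUS formula, which is not given here), the sketch below computes the Shannon entropy of the class labels among a cluster's labelled cases. The function name `class_entropy` is an assumption; the example counts are the ones shown on the Forming Clusters slide.

```python
import math
from collections import Counter

def class_entropy(labels):
    """Shannon entropy (in bits) of the class labels in one cluster."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no labelled cases: entropy is undefined, 0.0 is an assumption
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


skewed = class_entropy("XXXXY")   # cluster with 4X, 1Y
mixed = class_entropy("XXZZY")    # cluster with 2X, 2Z, 1Y
```

Here `skewed` is about 0.72 bits and `mixed` about 1.52 bits, so the cluster whose labelled cases disagree more gets the larger value.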
Ranking Cases - Case Utility Score
Evaluation
Selection Heuristics
• Rnd: randomly selected cluster and cases
• Rnd-Cluster: random cluster with highest ranked cases
• Rnd-Case: highest ranked cluster with random cases
• Informed-S: highest ranked cluster and cases
• Informed-M: highest ranked clusters and cases
UCI ML (6 datasets)
• smaller data sets (Zoo, Iris, Lymph, Hep)
• medium data sets (House Votes, Breast Cancer)
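The five heuristics differ only in whether the cluster and the case are picked at random or by rank. The sketch below makes that explicit. It assumes utility scores are already available as plain dictionaries, and it reads Informed-M as "take the top `k` clusters and the best case in each"; both the data layout and that reading are assumptions.

```python
import random

HEURISTICS = ("Rnd", "Rnd-Cluster", "Rnd-Case", "Informed-S", "Informed-M")

def pick(heuristic, cluster_scores, case_scores, rng, k=2):
    """Return a list of (cluster_id, case_id) selections.

    cluster_scores: {cluster_id: score}
    case_scores: {cluster_id: {case_id: score}}
    """
    if heuristic not in HEURISTICS:
        raise ValueError(f"unknown heuristic: {heuristic}")
    ranked = sorted(cluster_scores, key=cluster_scores.get, reverse=True)
    random_case = heuristic in ("Rnd", "Rnd-Case")
    if heuristic == "Informed-M":
        chosen = ranked[:k]                  # several top clusters (assumed)
    elif heuristic in ("Rnd", "Rnd-Cluster"):
        chosen = [rng.choice(ranked)]        # one random cluster
    else:
        chosen = ranked[:1]                  # single best cluster
    picks = []
    for cluster in chosen:
        cases = case_scores[cluster]
        if random_case:
            case = rng.choice(sorted(cases))
        else:
            case = max(cases, key=cases.get)
        picks.append((cluster, case))
    return picks


cluster_scores = {"c1": 0.9, "c2": 0.4, "c3": 0.1}
case_scores = {"c1": {"a": 0.2, "b": 0.8},
               "c2": {"c": 0.5, "d": 0.1},
               "c3": {"e": 0.3}}
rng = random.Random(0)
informed_s = pick("Informed-S", cluster_scores, case_scores, rng)
informed_m = pick("Informed-M", cluster_scores, case_scores, rng, k=2)
rnd_case = pick("Rnd-Case", cluster_scores, case_scores, rng)
rnd_cluster = pick("Rnd-Cluster", cluster_scores, case_scores, rng)
```

Passing the random generator in keeps the random baselines reproducible across runs.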
Experimental Design
[Figure: experimental set-up. Labels: index, case-base, sampling pool (Inc, 2Inc, 3Inc, 4Inc, 5Inc), test set, kNN accuracy]
case-base size = |L| + selected cases
selected cases = sampling iterations * Max-Batch-Size
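The two size formulas and the accuracy measure can be illustrated with a small pure-Python sketch. It assumes numeric attributes with squared Euclidean distance and a majority vote; the value of k, the distance measure and the toy data are all assumptions made for illustration.

```python
from collections import Counter

def case_base_size(initial_labelled, sampling_iterations, max_batch_size):
    """case-base size = |L| + sampling iterations * Max-Batch-Size."""
    return initial_labelled + sampling_iterations * max_batch_size

def knn_predict(case_base, query, k=1):
    """Majority label among the k nearest cases (squared Euclidean distance)."""
    def distance(entry):
        attributes, _ = entry
        return sum((a - b) ** 2 for a, b in zip(attributes, query))
    neighbours = sorted(case_base, key=distance)[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

def knn_accuracy(case_base, test_set, k=1):
    """Fraction of test cases whose predicted label matches the true label."""
    correct = sum(knn_predict(case_base, x, k) == y for x, y in test_set)
    return correct / len(test_set)


case_base = [((0.0, 0.0), "X"), ((1.0, 1.0), "Y")]
test_set = [((0.1, 0.2), "X"), ((0.9, 0.8), "Y"), ((0.6, 0.6), "X")]
accuracy = knn_accuracy(case_base, test_set, k=1)
size = case_base_size(initial_labelled=10, sampling_iterations=3, max_batch_size=5)
```

In this toy run two of the three test cases are classified correctly, and ten initial cases plus three iterations of batch size five give a case-base of 25.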
Results I
[Figure: accuracy on test set against sampling pool size (50 to 150) for Rnd, Rnd-Cluster, Rnd-Case, Informed-M and Informed-S. Panels: Zoo (accuracy axis 70 to 90) and Iris (accuracy axis 80 to 95)]
Zoo (7C, 18F, A, P9)
Iris (3C, 4F, #+A, P3)
Results II
[Figure: accuracy on test set against sampling pool size (50 to 150) for Rnd, Rnd-Cluster, Rnd-Case, Informed-M and Informed-S. Panels: Lymphography (accuracy axis 65 to 80) and Hepatitis (accuracy axis 80 to 84)]
Lymphography (4C, 19F, #+A, P9)
Hepatitis (2C, 20F, A+?, P7)
Results III
[Figure: accuracy on test set against sampling pool size (150 to 350) for Rnd, Rnd-Cluster, Rnd-Case, Informed-M and Informed-S. Panels: House Votes (accuracy axis 80 to 92) and Breast Cancer (accuracy axis 62 to 69)]
House (2C, 16F, A+?, P3)
Breast (2C, 9F, A+?, P7)
Conclusions
Developed a case selection mechanism exploiting case base partitions
Utility Scores to rank clusters and cases
• ClUS captures uncertainty within clusters and uses entropy to further weight this score
• CaUS captures the impact on other cases
Significant improvement with informed selection on 6 data sets
The influence of votes, partitions and entropy needs further investigation