Index Driven Selective Sampling for CBR
Nirmalie Wiratunga, Susan Craw, Stewart Massie
School of Computing, The Robert Gordon University, Aberdeen

Overview

Selective sampling

Cluster creation using an index

Cluster and case utility scores

Evaluation

Selective Sampling

[Diagram labels: unlabelled cases (pool), select interesting cases, selected cases, labelled cases, index, case-base]

• Relevance feedback
• Distance learning
• Patient monitoring

Uncertainty and Representativeness

[Figure: labelled cases (+, -) and unlabelled cases (?)]

Sampling Procedure

L = set of labelled cases
U = set of unlabelled cases
LOOP
    model <= create-domain-model(L)
    clusters <= create-clusters(model, L, U)
    k-clusters <= select-clusters(k, clusters, L, U)
    FOR 1 to Max-Batch-Size
        case <= select-case(k-clusters, L, U)
        L <= L ∪ get-label(case, oracle)
        U <= U \ case
UNTIL stopping-criterion
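
The procedure can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the model, clustering and selection steps are passed in as functions, the stopping criterion is assumed to be a fixed number of iterations, and every helper below the main function (no_model, one_cluster, first_k, first_unlabelled, toy_oracle) is a hypothetical placeholder that only makes the loop executable.

```python
def selective_sampling(labelled, pool, oracle, create_model, create_clusters,
                       select_clusters, select_case,
                       k=1, max_batch_size=3, max_iterations=2):
    """Run the sampling loop; return the grown labelled set and the remaining pool."""
    L = dict(labelled)   # case -> label
    U = set(pool)        # unlabelled cases
    for _ in range(max_iterations):      # assumed stopping criterion: fixed budget
        if not U:
            break
        model = create_model(L)
        clusters = create_clusters(model, L, U)
        k_clusters = select_clusters(k, clusters, L, U)
        for _ in range(max_batch_size):
            if not U:
                break
            case = select_case(k_clusters, L, U)
            L[case] = oracle(case)       # L <= L ∪ get-label(case, oracle)
            U.discard(case)              # U <= U \ case
    return L, U


# Hypothetical placeholder components, only here to make the loop runnable.
def no_model(L):
    return None

def one_cluster(model, L, U):
    return [sorted(U)]                   # a single cluster holding the whole pool

def first_k(k, clusters, L, U):
    return clusters[:k]

def first_unlabelled(k_clusters, L, U):
    for cluster in k_clusters:
        for case in cluster:
            if case in U:                # skip cases labelled earlier in this batch
                return case
    return min(U)

def toy_oracle(case):
    return "X" if case < 5 else "Y"


L_final, U_final = selective_sampling(
    {0: "X", 9: "Y"}, range(1, 9), toy_oracle,
    no_model, one_cluster, first_k, first_unlabelled)
```

Note that a selected case leaves the pool U, not L, so that the same case is never offered to the oracle twice.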

Forming Clusters

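[Figure: index tree with test nodes f1, f2 and f3 and branches a, b, c, d, e, < N and >= N. Leaf clusters shown:]

• 5 labelled (4X, 1Y), 6 unlabelled
• 0 labelled, 6 unlabelled
• 5 labelled (2X, 2Z, 1Y), 0 unlabelled
• 5 labelled (2X, 2Y, 1Z), 6 unlabelled
• 5 labelled (4Y, 1Z), 0 unlabelled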
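
One way to picture cluster formation is to drop both labelled and unlabelled cases down the index and group them by the leaf they reach. The sketch below assumes exactly that; toy_index is a hypothetical hand-written tree (its tests and thresholds are invented for illustration and are not the tree in the figure), and form_clusters is an illustrative name.

```python
from collections import defaultdict

def form_clusters(leaf_of, labelled, unlabelled):
    """Group cases by the index leaf they fall into.

    leaf_of:    function mapping a case to a leaf identifier
    labelled:   list of (case, label) pairs
    unlabelled: list of cases
    """
    clusters = defaultdict(lambda: {"labelled": [], "unlabelled": []})
    for case, label in labelled:
        clusters[leaf_of(case)]["labelled"].append((case, label))
    for case in unlabelled:
        clusters[leaf_of(case)]["unlabelled"].append(case)
    return dict(clusters)


# Hypothetical index: a nominal test on f1, then a threshold test on f2.
def toy_index(case):
    if case["f1"] == "a":
        return "f1=a"
    return "f1=b,f2<5" if case["f2"] < 5 else "f1=b,f2>=5"


labelled = [({"f1": "a", "f2": 1}, "X"),
            ({"f1": "b", "f2": 2}, "Y"),
            ({"f1": "b", "f2": 7}, "Z")]
unlabelled = [{"f1": "a", "f2": 9}, {"f1": "b", "f2": 8}]

clusters = form_clusters(toy_index, labelled, unlabelled)
for leaf, members in clusters.items():
    print(leaf, len(members["labelled"]), "labelled,",
          len(members["unlabelled"]), "unlabelled")
```

A leaf with labelled cases but no unlabelled ones offers nothing to sample, while a leaf with only unlabelled cases has no label evidence at all; both situations appear among the leaf clusters on the slide.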

Analysing Clusters

[Figure: class labels (X, Y, Z) of the labelled cases within each cluster]

Ranking Clusters - Cluster Utility Score
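
The Conclusions note that ClUS uses entropy to weight the uncertainty within a cluster. As a hedged illustration of that one ingredient only (this is not the authors' ClUS formula, which is not given here), the sketch below computes the Shannon entropy of the class labels in a cluster, using class mixes from the Forming Clusters slide. The function name class_entropy is an assumption.

```python
import math
from collections import Counter

def class_entropy(labels):
    """Shannon entropy (in bits) of the class labels of a cluster's labelled cases."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0   # no labelled cases: no class evidence to measure
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


mostly_x = class_entropy("XXXXY")   # cluster with 4X, 1Y
mixed = class_entropy("XXZZY")      # cluster with 2X, 2Z, 1Y
pure = class_entropy("YYYY")        # hypothetical single-class cluster
print(round(mostly_x, 3), round(mixed, 3), pure)
```

A cluster whose labelled cases are spread over several classes has higher entropy than one dominated by a single class, which is the sense in which entropy can signal uncertainty.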

Ranking Cases - Case Utility Score

Evaluation

Selection Heuristics
• Rnd: randomly selected cluster and cases
• Rnd-Cluster: random cluster with highest ranked cases
• Rnd-Case: highest ranked cluster with random cases
• Informed-S: highest ranked cluster and cases
• Informed-M: highest ranked clusters and cases

UCI ML (6 data sets)
• smaller data sets (Zoo, Iris, Lymph, Hep)
• medium data sets (House Votes, Breast Cancer)
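
The five heuristics differ only in whether the cluster and the cases are chosen by rank or at random. A minimal sketch, assuming cluster and case utility scores are already available as plain dictionaries: the function name pick, the parameter names, the score values, and the round-robin way Informed-M spreads a batch over the top k clusters are all assumptions made for illustration.

```python
import random

def pick(clusters, cluster_score, case_score, batch, heuristic, k=2, rng=None):
    """Select up to `batch` cases from `clusters` (name -> list of cases)."""
    rng = rng or random.Random(0)
    ranked = sorted(clusters, key=cluster_score.get, reverse=True)

    def top_cases(name, n):
        return sorted(clusters[name], key=case_score.get, reverse=True)[:n]

    def random_cases(name, n):
        cases = clusters[name]
        return rng.sample(cases, min(n, len(cases)))

    if heuristic == "Rnd":
        return random_cases(rng.choice(ranked), batch)
    if heuristic == "Rnd-Cluster":
        return top_cases(rng.choice(ranked), batch)
    if heuristic == "Rnd-Case":
        return random_cases(ranked[0], batch)
    if heuristic == "Informed-S":
        return top_cases(ranked[0], batch)
    if heuristic == "Informed-M":
        # Assumed policy: take the best remaining case from each of the
        # top-k clusters in turn until the batch is full.
        queues = [top_cases(name, batch) for name in ranked[:k]]
        chosen, i = [], 0
        while len(chosen) < batch and any(queues):
            queue = queues[i % len(queues)]
            if queue:
                chosen.append(queue.pop(0))
            i += 1
        return chosen
    raise ValueError(f"unknown heuristic: {heuristic}")


# Hypothetical clusters and scores.
clusters = {"c1": ["a", "b", "c"], "c2": ["d", "e"], "c3": ["f"]}
cluster_score = {"c1": 0.9, "c2": 0.5, "c3": 0.1}
case_score = {"a": 3, "b": 1, "c": 2, "d": 5, "e": 4, "f": 0}

informed_s = pick(clusters, cluster_score, case_score, 2, "Informed-S")
informed_m = pick(clusters, cluster_score, case_score, 3, "Informed-M", k=2)
rnd_case = pick(clusters, cluster_score, case_score, 2, "Rnd-Case")
```

Rnd-Cluster and Rnd-Case each keep one informed choice and randomise the other, which is what lets an evaluation separate the contribution of cluster ranking from that of case ranking.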

Experimental Design

[Diagram labels: index, case-base, sampling pool, Inc / 2Inc / 3Inc / 4Inc / 5Inc, test set, kNN accuracy]

case base size = L + selected cases

selected cases = sampling iterations * Max-Batch-Size
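
The two size relations and the accuracy measure can be illustrated with a small sketch. It assumes numeric feature vectors, squared Euclidean distance and majority voting with k = 3; none of these settings are stated on the slide, and the data and numbers are invented for illustration.

```python
from collections import Counter

def case_base_size(initial_labelled, sampling_iterations, max_batch_size):
    """case base size = L + sampling iterations * Max-Batch-Size."""
    return initial_labelled + sampling_iterations * max_batch_size

def knn_predict(case_base, query, k=3):
    """Majority label among the k nearest cases (squared Euclidean distance)."""
    nearest = sorted(
        case_base,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(c[0], query)))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def knn_accuracy(case_base, test_set, k=3):
    """Percentage of test cases whose label kNN predicts correctly."""
    correct = sum(knn_predict(case_base, x, k) == y for x, y in test_set)
    return 100.0 * correct / len(test_set)


# Hypothetical case base and test set.
case_base = [((0, 0), "X"), ((0, 1), "X"), ((1, 0), "X"),
             ((5, 5), "Y"), ((5, 6), "Y"), ((6, 5), "Y")]
test_set = [((1, 1), "X"), ((6, 6), "Y"), ((4, 4), "X")]

size = case_base_size(10, 4, 5)
accuracy = knn_accuracy(case_base, test_set)
print(size, round(accuracy, 1))
```

Here the third test case sits closer to the Y cases, so kNN misclassifies it and accuracy is two out of three.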

Results I

[Figure: accuracy on test set against sampling pool size (50, 75, 100, 125, 150). Panels: Zoo (accuracy axis 70 to 90) and Iris (accuracy axis 80 to 95). Curves: Rnd, Rnd-cluster, Rnd-case, Informed-M, Informed-S.]

Zoo (7C, 18F, A, P9)    Iris (3C, 4F, #+A, P3)

Results II

[Figure: accuracy on test set against sampling pool size (50, 75, 100, 125, 150). Panels: Lymphography (accuracy axis 65 to 80) and Hepatitis (accuracy axis 80 to 84).]

Lymphography (4C, 19F, #+A, P9)    Hepatitis (2C, 20F, A+?, P7)

Results III

[Figure: accuracy on test set against sampling pool size (150, 200, 250, 300, 350). Panels: House Votes (accuracy axis 80 to 92) and Breast Cancer (accuracy axis 62 to 69).]

House (2C, 16F, A+?, P3)    Breast (2C, 9F, A+?, P7)

Conclusions

Developed a case selection mechanism exploiting case base partitions

Utility Scores to rank clusters and cases
• ClUS captures uncertainty within clusters and uses entropy to further weight this score
• CaUS captures the impact on other cases

Significant improvement with informed selection on 6 data sets

The influence of votes, partitions and entropy needs further investigation
