Efficiently Clustering Transactional data with Weighted
Coverage Density
M. Hua Yan, Keke Chen, and Ling Liu
Proceedings of the 15th International Conference on Information and Knowledge Management, ACM CIKM, 2006
Presenter: 吳建良
Outline
  Motivation
  SCALE Framework
  BKPlot Method
  WCD Clustering Algorithm
  Cluster Validity Evaluation
  Experimental Results
Motivation
  Transactional data is a special kind of categorical data
    e.g. t1={milk, bread, beer}, t2={milk, bread}
    It can be transformed into a row-by-column table with Boolean values
    Large volume and high dimensionality make the existing algorithms inefficient at processing the transformed data
  Transactional data clustering algorithms: LargeItem, CLOPE, CCCD
    They require users to manually tune at least one or two parameters
    The appropriate parameter settings differ from dataset to dataset
SCALE Framework
  ACE & BkPlot (SSDBM'05)
    ACE: Agglomerative Categorical clustering with Entropy criterion
    BkPlot:
      Examines the entropy difference between the clustering structures with varying K
      Reports the Ks where the clustering structure changes dramatically
  Evaluation Metrics
    LISR: Large Item Size Ratio
    AMI: Average pair-clusters Merging Index
ACE Algorithm
  Bottom-up process
    Initially, each record is a cluster
    Iteratively, find the most similar pair of clusters $C_p$ and $C_q$, and then merge them
  Incremental entropy
    $$I_m(C_p, C_q) = (n_p + n_q)\,\hat{H}(C_p \cup C_q) - \bigl(n_p \hat{H}(C_p) + n_q \hat{H}(C_q)\bigr)$$
  The most similar pair of clusters
    $I_m(C_p, C_q)$ is minimum among all possible pairs
  $I_m^{(K)}$ denotes the $I_m$ value in forming the K-cluster partition from the (K+1)-cluster partition
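A minimal Python sketch of the incremental entropy above. The slides do not spell out how $\hat{H}$ is estimated; this sketch assumes the cluster entropy is the sum of the empirical Shannon entropies of the columns. The function names and the toy records are illustrative only.

```python
from collections import Counter
from math import log2


def column_entropy(values):
    """Empirical Shannon entropy (in bits) of one categorical column."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def cluster_entropy(records):
    """Assumed estimator of H(C): sum of the per-column entropies."""
    return sum(column_entropy(col) for col in zip(*records))


def incremental_entropy(cp, cq):
    """I_m(C_p, C_q) = (n_p + n_q) H(C_p u C_q) - (n_p H(C_p) + n_q H(C_q))."""
    merged = cp + cq
    return (len(merged) * cluster_entropy(merged)
            - (len(cp) * cluster_entropy(cp) + len(cq) * cluster_entropy(cq)))


# Toy clusters: each record is a tuple of categorical values.
a = [("x", "1"), ("x", "1")]
b = [("x", "1")]
c = [("y", "2")]

same = incremental_entropy(a, b)       # merging identical records
different = incremental_entropy(a, c)  # merging dissimilar records
print(same, different)
```

Merging identical records adds no entropy, while merging dissimilar records gives a positive $I_m$, which is why ACE merges the pair with the smallest $I_m$ first.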
BkPlot
  Increasing rate of entropy:
    $$I(K) = \frac{1}{Nd}\, I_m^{(K)}$$
    N: total number of records, d: number of columns
  Small increasing rate
    Merging does not introduce any impurity into the clusters
    Clustering structure is not significantly changed
  Large increasing rate
    Merging introduces considerable impurity into the partitions
    Clustering structure can be changed significantly
BkPlot (contd.)
  Relative changes
    Use relative changes to determine whether a globally significant clustering structure emerges
    Condition at a candidate K: $I(K) \approx I(K+1)$, but $I(K-1) > I(K)$
BkPlot (contd.)
  Entropy Characteristic Graph (ECG)
  Second-order differential of ECG: $\delta^2 I(K)$
    $$\delta I(K) = I(K) - I(K+1)$$
    $$\delta^2 I(K) = \delta I(K-1) - \delta I(K)$$
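The BkPlot computation can be sketched in a few lines of Python, assuming the definitions $I(K) = I_m^{(K)}/(Nd)$, $\delta I(K) = I(K) - I(K+1)$ and $\delta^2 I(K) = \delta I(K-1) - \delta I(K)$. The function name and the $I_m^{(K)}$ values below are made up for illustration; they are not taken from the experiments.

```python
def bkplot(im_values, n_records, n_columns):
    """Map K -> delta^2 I(K), given a dict K -> I_m^(K)."""
    # Increasing rate of entropy I(K).
    rate = {k: v / (n_records * n_columns) for k, v in im_values.items()}
    # First-order difference, defined where I(K+1) is available.
    d1 = {k: rate[k] - rate[k + 1] for k in rate if k + 1 in rate}
    # Second-order difference, defined where delta I(K-1) is available.
    return {k: d1[k - 1] - d1[k] for k in d1 if k - 1 in d1}


# Hypothetical merge costs: going from 3 clusters to 2 is suddenly expensive.
im_values = {1: 40, 2: 30, 3: 5, 4: 4, 5: 3, 6: 2}
d2 = bkplot(im_values, n_records=10, n_columns=1)
best_k = max(d2, key=d2.get)
print(best_k)
```

In this toy input $I(3) \approx I(4)$ while $I(2)$ is much larger than $I(3)$, so the peak of $\delta^2 I(K)$ falls at K = 3.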
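WCD Clustering Algorithm
  Notations
    D: transactional dataset
    N: size of the dataset
    $I = \{I_1, I_2, \dots, I_m\}$: a set of items
    $t_j = \{I_{j1}, I_{j2}, \dots, I_{jl}\}$: a transaction
    A transaction clustering result $C^K = \{C_1, C_2, \dots, C_K\}$ is a partition of D, where
      $$C_1 \cup \dots \cup C_K = D, \quad C_i \neq \emptyset, \quad C_i \cap C_j = \emptyset \ (i \neq j)$$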
Intra-cluster Similarity Measure
  Coverage Density (CD)
    Given a cluster $C_k$
      $M_k$: number of distinct items
      $I_k = \{I_{k1}, I_{k2}, \dots, I_{kM_k}\}$: item set of $C_k$
      $N_k$: number of transactions in $C_k$
      $S_k$: sum of the occurrences of all items in $C_k$
    $$CD(C_k) = \frac{\text{filled cells}}{\text{rectangle area}} = \frac{S_k}{N_k \times M_k} = \frac{\sum_{j=1}^{M_k} occur(I_{kj})}{N_k \times M_k}$$
    CD ↑, compactness ↑
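As a small illustration of the CD formula, the Python sketch below represents a cluster as a list of item sets (an assumed representation; the function name and the toy cluster are hypothetical) and uses exact fractions.

```python
from fractions import Fraction


def coverage_density(cluster):
    """CD(C_k) = S_k / (N_k * M_k) for a cluster given as a list of item sets."""
    n_k = len(cluster)                    # N_k: number of transactions
    m_k = len(set().union(*cluster))      # M_k: number of distinct items
    s_k = sum(len(t) for t in cluster)    # S_k: filled cells
    return Fraction(s_k, n_k * m_k)


# Three transactions over items a, b, c: 5 filled cells in a 3 x 3 rectangle.
cluster = [{"a", "b", "c"}, {"a"}, {"a"}]
cd = coverage_density(cluster)
print(cd)
```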
Intra-cluster Similarity Measure (contd.)
  Drawback of CD
    Insufficient to measure the density of frequent itemsets
    Each item has an equal contribution in a cluster: $W_j = \frac{1}{M_k}$
    Two clusters may have the same CD but different filled-cell distributions
    [Figure: two clusters over items a, b, c with different filled-cell distributions, CD = 5/9]
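Intra-cluster Similarity Measure (contd.)
  Weighted Coverage Density (WCD)
    Focus on high-frequency items
    Define $W_j$ as
      $$W_j = \frac{occur(I_{kj})}{S_k}, \quad \text{s.t.} \sum_{j=1}^{M_k} W_j = 1$$
    $$WCD(C_k) = \frac{1}{N_k}\sum_{j=1}^{M_k} occur(I_{kj}) \times W_j = \frac{1}{N_k}\sum_{j=1}^{M_k} \frac{occur(I_{kj}) \times occur(I_{kj})}{S_k} = \frac{\sum_{j=1}^{M_k} occur(I_{kj})^2}{S_k \times N_k}$$
    [Figure: two example clusters over items a, b, c, compared by their CD and WCD values]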
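To see how WCD separates clusters that CD cannot, here is a hedged Python sketch. The two toy clusters are invented for illustration (they are not the ones in the slide figure); both have CD = 5/9, but their WCD values differ.

```python
from collections import Counter
from fractions import Fraction


def weighted_coverage_density(cluster):
    """WCD(C_k) = sum_j occur(I_kj)^2 / (S_k * N_k), cluster = list of item sets."""
    occur = Counter(item for t in cluster for item in t)
    s_k = sum(occur.values())
    return Fraction(sum(c * c for c in occur.values()), s_k * len(cluster))


# Same number of filled cells (5 of 9), different distributions.
skewed = [{"a", "b", "c"}, {"a"}, {"a"}]   # occurrences: a=3, b=1, c=1
even = [{"a", "b"}, {"b", "c"}, {"c"}]     # occurrences: a=1, b=2, c=2

wcd_skewed = weighted_coverage_density(skewed)
wcd_even = weighted_coverage_density(even)
print(wcd_skewed, wcd_even)
```

The cluster whose occurrences concentrate on one high-frequency item gets the larger WCD, which is the behaviour the weights $W_j$ are meant to produce.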
Clustering Criterion
  Expected Weighted Coverage Density (EWCD)
    $$EWCD(C^K) = \sum_{k=1}^{K} \frac{N_k}{N}\, WCD(C_k) = \frac{1}{N}\sum_{k=1}^{K} \frac{\sum_{j=1}^{M_k} occur(I_{kj})^2}{S_k}$$
  The clustering algorithm tries to maximize the EWCD
  When every individual transaction is treated as a cluster, EWCD reaches its maximum value of 1
  Use the BKPlot method to generate a set of candidate "best Ks"
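The following Python sketch evaluates EWCD for three partitions of the same toy dataset, to illustrate why EWCD alone cannot pick K: the all-singletons partition always scores 1. The data, function name and partitions are assumptions made for the example.

```python
from collections import Counter
from fractions import Fraction


def ewcd(clusters):
    """EWCD(C^K) = (1/N) * sum_k ( sum_j occur(I_kj)^2 / S_k )."""
    n = sum(len(c) for c in clusters)
    total = Fraction(0)
    for c in clusters:
        occur = Counter(item for t in c for item in t)
        total += Fraction(sum(v * v for v in occur.values()), sum(occur.values()))
    return total / n


t = [{"a", "b"}, {"a", "b", "c"}, {"d", "e"}, {"d"}]

singletons = ewcd([[x] for x in t])        # K = 4, one transaction per cluster
two = ewcd([[t[0], t[1]], [t[2], t[3]]])   # K = 2, the natural grouping
one = ewcd([t])                            # K = 1, everything merged
print(singletons, two, one)
```

Because the trivial K = N partition already attains the maximum, K has to be fixed from outside, which is the role of the BKPlot candidates.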
WCD Clustering Algorithm
  Input: Dataset D, number of clusters K, initial K seeds
  Output: K clusters

  /* Phase 1 - Initialization */
  K seeds form the initial K clusters;
  while not end of D do
    read one transaction t from D;
    add t into Ci that maximizes EWCD;
    write <t, i> back to D;

  /* Phase 2 - Iteration */
  while moveMark = true do
    moveMark = false;
    randomly generate the access sequence R;
    while has not checked all transactions do
      read <t, i>;
      if moving t to cluster Cj increases EWCD and i ≠ j
        moveMark = true;
        write <t, j> back to D;
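The two phases above can be sketched in Python as follows. This is an in-memory approximation under stated assumptions, not the authors' implementation: transactions are sets held in a list rather than read from and written back to D, each cluster keeps running item counts, and because N is fixed the code maximizes the sum of the per-cluster terms $\sum_j occur(I_{kj})^2 / S_k$ instead of EWCD itself. The class, function and parameter names (`Cluster`, `wcd_cluster`, `seed_ids`, `max_rounds`) are hypothetical.

```python
import random
from collections import Counter


class Cluster:
    """Running item counts for one cluster."""

    def __init__(self):
        self.occur = Counter()  # occur(I_kj) for each item
        self.s_k = 0            # S_k: total item occurrences

    def term(self):
        # sum_j occur(I_kj)^2 / S_k: this cluster's share of N * EWCD.
        if self.s_k == 0:
            return 0.0
        return sum(c * c for c in self.occur.values()) / self.s_k

    def add(self, t):
        self.occur.update(t)
        self.s_k += len(t)

    def remove(self, t):
        self.occur.subtract(t)
        self.s_k -= len(t)

    def gain(self, t):
        # Change in term() if t were added to this cluster.
        before = self.term()
        self.add(t)
        after = self.term()
        self.remove(t)
        return after - before


def wcd_cluster(transactions, seed_ids, rng=None, max_rounds=100):
    rng = rng or random.Random(0)
    clusters = [Cluster() for _ in seed_ids]
    label = [None] * len(transactions)
    for k, idx in enumerate(seed_ids):
        clusters[k].add(transactions[idx])
        label[idx] = k

    # Phase 1: put each remaining transaction where it raises EWCD the most.
    for idx, t in enumerate(transactions):
        if label[idx] is None:
            k = max(range(len(clusters)), key=lambda c: clusters[c].gain(t))
            clusters[k].add(t)
            label[idx] = k

    # Phase 2: revisit transactions in random order while some move helps.
    for _ in range(max_rounds):
        moved = False
        order = list(range(len(transactions)))
        rng.shuffle(order)
        for idx in order:
            t, i = transactions[idx], label[idx]
            clusters[i].remove(t)
            gains = [c.gain(t) for c in clusters]
            j = max(range(len(clusters)), key=gains.__getitem__)
            if gains[j] <= gains[i] + 1e-12:
                j = i  # no strict improvement: keep t where it was
            else:
                moved = True
            clusters[j].add(t)
            label[idx] = j
        if not moved:
            break
    return label


transactions = [{"a", "b"}, {"a", "b", "c"}, {"a", "b"},
                {"d", "e"}, {"d", "e", "f"}, {"d", "e"}]
labels = wcd_cluster(transactions, seed_ids=[0, 3])
print(labels)
```

The `max_rounds` cap is a safety guard added for the sketch; each accepted move strictly increases the objective, so the loop would also stop on its own.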
Cluster Validity Evaluation
  LISR (Large Item Size Ratio)
    Measures the preservation of frequent itemsets
      $$LISR = \sum_{k=1}^{K} \frac{N_k}{N} \times \frac{LS_k}{S_k}$$
      where $LS_k$ is the number of large items in $C_k$
    LISR ↑: high co-occurrence of items, high possibility of finding more frequent itemsets at the user-specified minimum support
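A rough Python sketch of LISR follows. Two points are assumptions rather than statements from the slides: an item is treated as "large" in $C_k$ when it appears in at least `min_support` × $N_k$ transactions of that cluster, and $LS_k$ is read as the total number of occurrences of large items (so that $LS_k / S_k$ lies between 0 and 1). The function name and toy clusters are illustrative.

```python
from collections import Counter
from fractions import Fraction


def lisr(clusters, min_support):
    """LISR = sum_k (N_k / N) * (LS_k / S_k), clusters = lists of item sets."""
    n = sum(len(c) for c in clusters)
    total = Fraction(0)
    for c in clusters:
        occur = Counter(item for t in c for item in t)
        s_k = sum(occur.values())
        # Assumed reading of LS_k: occurrences of items meeting the support threshold.
        ls_k = sum(v for v in occur.values() if v >= min_support * len(c))
        total += Fraction(len(c), n) * Fraction(ls_k, s_k)
    return total


clusters = [[{"a", "b"}, {"a", "b", "c"}], [{"d", "e"}, {"d"}]]
strict = lisr(clusters, min_support=1)     # item must occur in every transaction
loose = lisr(clusters, min_support=0.5)    # item must occur in half of them
print(strict, loose)
```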
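Cluster Validity Evaluation (contd.)
  Inter-cluster dissimilarity between $C_i$ and $C_j$
    $$d(C_i, C_j) = \frac{N_i}{N_i + N_j}\, CD(C_i) + \frac{N_j}{N_i + N_j}\, CD(C_j) - CD(C_i \cup C_j)$$
  Simplify:
    $$d(C_i, C_j) = \frac{N_i}{N_i + N_j} \cdot \frac{S_i}{N_i M_i} + \frac{N_j}{N_i + N_j} \cdot \frac{S_j}{N_j M_j} - \frac{S_i + S_j}{(N_i + N_j)\, M_{ij}}$$
    $$= \frac{1}{N_i + N_j} \left( \frac{S_i}{M_i} + \frac{S_j}{M_j} - \frac{S_i + S_j}{M_{ij}} \right)$$
    $$= \frac{1}{N_i + N_j} \left( S_i \left( \frac{1}{M_i} - \frac{1}{M_{ij}} \right) + S_j \left( \frac{1}{M_j} - \frac{1}{M_{ij}} \right) \right)$$
    where $M_{ij}$ is the number of distinct items after merging the two clusters, thus $M_{ij} \geq \max\{M_i, M_j\}$
  Because $\frac{1}{M_i} \geq \frac{1}{M_{ij}}$ and $\frac{1}{M_j} \geq \frac{1}{M_{ij}}$, $d(C_i, C_j)$ is a real number between 0 and 1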
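Cluster Validity Evaluation (contd.)
  Example
    If $M_i = M_j = M_{ij}$, then $d(C_i, C_j) = 0$
      [Figure: $C_i$ and $C_j$ both over items a, b, c]
    $M_i = M_j = 3$, $M_{ij} = 5$
      [Figure: $C_i$ over items a, b, c; $C_j$ over items c, d, e]
      $$d(C_i, C_j) = \frac{1}{6} \left( 5 \left( \frac{1}{3} - \frac{1}{5} \right) + 4 \left( \frac{1}{3} - \frac{1}{5} \right) \right) = \frac{1}{5}$$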
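The simplified dissimilarity can be checked numerically with a short Python sketch. The two clusters below are hypothetical stand-ins chosen to match the quantities in the example ($N_i + N_j = 6$, $S_i = 5$, $S_j = 4$, $M_i = M_j = 3$, $M_{ij} = 5$); their exact cell layout is an assumption, as are the function names.

```python
from fractions import Fraction


def stats(cluster):
    """Return (N, S, item set) for a cluster given as a list of item sets."""
    return len(cluster), sum(len(t) for t in cluster), set().union(*cluster)


def dissimilarity(ci, cj):
    """d(C_i, C_j) = (S_i (1/M_i - 1/M_ij) + S_j (1/M_j - 1/M_ij)) / (N_i + N_j)."""
    n_i, s_i, items_i = stats(ci)
    n_j, s_j, items_j = stats(cj)
    m_i, m_j, m_ij = len(items_i), len(items_j), len(items_i | items_j)
    return Fraction(1, n_i + n_j) * (
        s_i * (Fraction(1, m_i) - Fraction(1, m_ij))
        + s_j * (Fraction(1, m_j) - Fraction(1, m_ij))
    )


c_i = [{"a", "b", "c"}, {"a"}, {"b"}]   # items a, b, c; S_i = 5
c_j = [{"c", "d"}, {"d"}, {"e"}]        # items c, d, e; S_j = 4

d_ij = dissimilarity(c_i, c_j)
d_ii = dissimilarity(c_i, c_i)          # same item set: M_i = M_j = M_ij
print(d_ij, d_ii)
```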
Cluster Validity Evaluation (contd.)
  AMI (Average pair-clusters Merging Index)
    Evaluates the overall inter-cluster dissimilarity of a clustering result having K clusters
      $$AMI = \frac{1}{K} \sum_{i=1}^{K} D_i, \quad D_i = \min\{\, d(C_i, C_j) : j = 1, \dots, K,\ j \neq i \,\}$$
    AMI ↑, the better the clustering quality
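A minimal Python sketch of AMI, assuming the simplified form of $d(C_i, C_j)$ given earlier and clusters represented as lists of item sets. The toy clusters and function names are invented for illustration.

```python
from fractions import Fraction


def dissimilarity(ci, cj):
    """d(C_i, C_j) in its simplified form; clusters are lists of item sets."""
    items_i, items_j = set().union(*ci), set().union(*cj)
    s_i, s_j = sum(len(t) for t in ci), sum(len(t) for t in cj)
    m_ij = len(items_i | items_j)
    return Fraction(1, len(ci) + len(cj)) * (
        s_i * (Fraction(1, len(items_i)) - Fraction(1, m_ij))
        + s_j * (Fraction(1, len(items_j)) - Fraction(1, m_ij))
    )


def ami(clusters):
    """AMI = (1/K) * sum_i min_{j != i} d(C_i, C_j)."""
    total = sum(
        min(dissimilarity(ci, cj) for j, cj in enumerate(clusters) if j != i)
        for i, ci in enumerate(clusters)
    )
    return total / len(clusters)


a = [{"a", "b"}, {"a", "b"}]
b = [{"a", "b"}, {"a"}]       # shares its item set with a
c = [{"x", "y"}, {"x", "y"}]  # disjoint from a and b

overlapping = ami([a, b, c])  # a and b could be merged at no cost
separated = ami([a, c])       # every cluster is far from its nearest neighbour
print(overlapping, separated)
```

The partition that splits the overlapping clusters `a` and `b` scores lower, in line with reading a larger AMI as better separation.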
Experiments
  Datasets
    Tc30a6r1000: 1000 records, 30 columns, 6 possible attribute values
    Zoo: 101 records, 18 attributes
    Mushroom: 8124 instances, 22 attributes
    Mushroom100k: the mushroom data sampled with duplicates, 100,000 instances
    TxI4Dx: IBM Data Generator
Experimental Results
  Tc30a6r1000
    The repulsion parameter r of CLOPE controls the number of clusters
    [Figure panels: 5 clusters, 9 clusters]
Experimental Results (contd.)
  Zoo: K=7 is the best
    [Figure panels: 2 clusters, 4 clusters, 7 clusters]
Experimental Results (contd.)
  Mushroom: K=19 is the best
Experimental Results (contd.)
  Performance evaluation on mushroom100k
    [Figure panels: r=0.5~4.0, r=2.0]
Experimental Results (contd.)
  Performance evaluation on TxI4Dx
    [Figure panels: T10I4Dx, TxI4D100k]