What’s in a Label? Business value of “soft” vs “hard” cluster ensembles solutions-2 Nicole Huyghe & Anita Prinzie
Dec 05, 2014
Stability
Similarity Index (Lange et al., 2004): the percentage of pairs of observations that belong to the same cluster in both clustering C and clustering C’.
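A minimal sketch of the pair-counting idea behind this index, taking the slide's wording literally. The function name `similarity_index` and the choice of denominator (all pairs of observations) are assumptions made for illustration; the published index may be normalised differently.

```python
from itertools import combinations


def similarity_index(labels_a, labels_b):
    """Share of observation pairs that sit in the same cluster in BOTH clusterings.

    Assumption: the denominator is the number of all pairs of observations.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same observations")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two observations")
    together_in_both = sum(
        1
        for i, j in pairs
        if labels_a[i] == labels_a[j] and labels_b[i] == labels_b[j]
    )
    return together_in_both / len(pairs)


# Hypothetical labels for six observations under clustering C and C'.
c = [0, 0, 0, 1, 1, 1]
c_prime = [1, 1, 0, 0, 2, 2]
print(similarity_index(c, c_prime))
```

Cluster labels only need to be comparable within one clustering, so relabelling the clusters of C’ does not change the result.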
Cluster Integrity – Heterogeneity
Total separation of clusters: based on the distances between cluster centers.
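The slide does not give a formula. The sketch below assumes the total-separation term used by Halkidi et al. (2000) in their validity index: the ratio of the largest to the smallest distance between centers, multiplied by the sum over clusters of the inverse of the summed distances to all other centers. Under that assumption, lower values indicate better separated clusters. The function name is hypothetical.

```python
from math import dist


def total_separation(centers):
    """Total separation of clusters, computed from the cluster centers only.

    Assumed form: (d_max / d_min) * sum_k 1 / (sum_{z != k} ||v_k - v_z||).
    """
    k = len(centers)
    if k < 2:
        raise ValueError("need at least two cluster centers")
    pairwise = [
        dist(centers[i], centers[j]) for i in range(k) for j in range(i + 1, k)
    ]
    d_max, d_min = max(pairwise), min(pairwise)
    if d_min == 0:
        raise ValueError("two cluster centers coincide")
    inverse_sum = sum(
        1.0 / sum(dist(centers[i], centers[j]) for j in range(k) if j != i)
        for i in range(k)
    )
    return (d_max / d_min) * inverse_sum


# Hypothetical two-dimensional centers: the second set is further apart.
print(total_separation([(0.0, 0.0), (1.0, 0.0)]))
print(total_separation([(0.0, 0.0), (10.0, 0.0)]))
```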
Cluster Integrity – Homogeneity
Scatter (compactness): average ratio of the cluster variance to the variance of the dataset.
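One way to compute such a ratio, as a sketch: take the per-feature variance vector of each cluster, compare its Euclidean norm with that of the whole dataset, and average over clusters. The use of population variance and of the Euclidean norm are assumptions, and the function names are hypothetical. Lower values mean more compact clusters.

```python
from math import sqrt
from statistics import pvariance


def variance_norm(points):
    """Euclidean norm of the per-feature (population) variance vector."""
    per_feature = [pvariance(column) for column in zip(*points)]
    return sqrt(sum(v * v for v in per_feature))


def scatter(points, labels):
    """Average ratio of cluster variance to the variance of the dataset."""
    dataset_norm = variance_norm(points)
    if dataset_norm == 0:
        raise ValueError("the dataset has zero variance")
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    ratios = [variance_norm(members) / dataset_norm for members in clusters.values()]
    return sum(ratios) / len(ratios)


# Hypothetical one-dimensional data with two tight, well separated clusters.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
print(scatter(points, [0, 0, 1, 1]))  # small: compact clusters
print(scatter(points, [0, 0, 0, 0]))  # a single cluster reproduces the dataset
```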
Accuracy
Adjusted Rand Index (Hubert and Arabie, 1985): level of agreement between the predicted segment and the real segment, corrected for the level of agreement expected by chance.
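The chance correction can be sketched with the usual contingency-table form of the index: count pairs that agree within cells, rows and columns of the real-versus-predicted table, then subtract the agreement expected under random labelling. The function name and the example labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(real, predicted):
    """Adjusted Rand Index between real and predicted segment labels."""
    n = len(real)
    if n != len(predicted):
        raise ValueError("both label lists must cover the same observations")
    if n < 2:
        raise ValueError("need at least two observations")
    # Pairs that share a cell, a row (real segment) or a column (predicted segment).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(real, predicted)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(real).values())
    sum_cols = sum(comb(c, 2) for c in Counter(predicted).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case, e.g. every observation in one segment in both labelings.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, relabelled
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # worse than chance
```

A value of 1 means the predicted segments reproduce the real ones exactly, values near 0 mean agreement no better than chance, and negative values are possible.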
[Figure: numbered observations grouped into segments, shown for Reality and for Prediction]
Size
Uniformity deviation: average deviation of each segment's size from the uniform segment size (1/number of segments).
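A minimal sketch of this measure, assuming that segment size is expressed as a share of all observations and that "deviation" means absolute deviation. The function name and the optional `n_segments` parameter (to count segments that received no observations) are hypothetical.

```python
from collections import Counter


def uniformity_deviation(labels, n_segments=None):
    """Mean absolute deviation of segment shares from the uniform share 1/k."""
    counts = Counter(labels)
    k = n_segments if n_segments is not None else len(counts)
    if k < len(counts) or k == 0:
        raise ValueError("n_segments must cover every observed segment")
    n = len(labels)
    uniform_share = 1 / k
    deviations = [abs(count / n - uniform_share) for count in counts.values()]
    # Segments with no observations deviate by the full uniform share.
    deviations += [uniform_share] * (k - len(counts))
    return sum(deviations) / k


print(uniformity_deviation([0, 0, 1, 1]))  # equally sized segments
print(uniformity_deviation([0, 0, 0, 1]))  # one dominant segment
```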
LC gives smaller segments
[Figure: panels for Rheumatism, Osteoporosis and Software journey; series Soft CCEA, Soft LC, Hard LC and Hard CCEA]
Stability: Soft is better
Similarity Index soft > hard
Similarity Index hard > soft
[Figure: 2×2 grid, Strong vs. Weak similarity by High vs. Low confidence]
Homogeneity: Soft is better
Scatter hard > soft
[Figure: 2×2 grid, Strong vs. Weak similarity by High vs. Low confidence]
Heterogeneity: Hard is better
Total separation soft > hard
[Figure: 2×2 grid, Strong vs. Weak similarity by High vs. Low confidence]
Size: Hard is better
Uniformity deviation soft > hard
[Figure: 2×2 grid, Strong vs. Weak similarity by High vs. Low confidence]
References
• Fred, A. and Jain, A.K. (2005), Combining Multiple Clusterings Using Evidence Accumulation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6), 835-850.
• Lange, T., Roth, V., Braun, M.L. and Buhmann, J.M. (2004), Stability-Based Validation of Clustering Solutions, Neural Computation, 16, 1299-1323.
• Halkidi, M., Vazirgiannis, M. and Batistakis, Y. (2000), Quality Scheme Assessment in the Clustering Process, Proc. of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, 265-276.
• Hubert, L. and Arabie, P. (1985), Comparing Partitions, Journal of Classification, 2(1), 193-218.
• Nieweglowski, L. (2007), CLV package, R software.
• Martin, A., Quinn, K.M. and Park, J.H. (2003-2012), Markov Chain Monte Carlo Package (MCMCpack), R software.