What’s in a Label? Business value of “soft” vs “hard” cluster ensembles
Nicole Huyghe & Anita Prinzie, solutions-2

Sawtooth 2012 what's in a label

Dec 05, 2014

Transcript
Page 1: Sawtooth 2012   what's in a label

What’s in a Label? Business value of “soft” vs “hard” cluster ensembles

solutions-2

Nicole Huyghe & Anita Prinzie

Page 2: Sawtooth 2012   what's in a label

Answers the who and the why

Page 3: Sawtooth 2012   what's in a label

Theme 1, Theme 2, Theme 3, ..., Theme 9, Theme 10

Cluster Ensemble

Page 4: Sawtooth 2012   what's in a label

HARD OR SOFT CLUSTER ENSEMBLE

Page 5: Sawtooth 2012   what's in a label

Stability Integrity Accuracy Size

Page 6: Sawtooth 2012   what's in a label

Stability

Similarity Index (Lange et al., 2004): the percentage of pairs of observations that belong to the same cluster in both clustering C and clustering C’.
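
As a rough illustration of this definition, the sketch below counts the pairs of observations that share a cluster in both clusterings and divides by the total number of pairs. It follows the slide's wording literally; the exact normalization in Lange et al. may differ. The function name `similarity_index` and the example labels are hypothetical.

```python
from itertools import combinations


def similarity_index(labels_c, labels_c_prime):
    """Share of observation pairs that sit in the same cluster in BOTH clusterings.

    Literal reading of the slide's definition; the normalization used in
    the cited paper is not verified here.
    """
    if len(labels_c) != len(labels_c_prime):
        raise ValueError("both clusterings must label the same observations")
    pairs = list(combinations(range(len(labels_c)), 2))
    if not pairs:
        raise ValueError("need at least two observations")
    together_in_both = sum(
        1
        for i, j in pairs
        if labels_c[i] == labels_c[j] and labels_c_prime[i] == labels_c_prime[j]
    )
    return together_in_both / len(pairs)


# Hypothetical labels for four observations under clusterings C and C'.
c = [0, 0, 1, 1]
c_prime = [1, 1, 0, 2]
print(similarity_index(c, c_prime))  # 1 of 6 pairs is together in both
```

Cluster names do not need to match between C and C’, because only pair membership is compared.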

Page 7: Sawtooth 2012   what's in a label

Cluster Integrity – Heterogeneity

Total separation of clusters: based on the distance between cluster centers
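
The slide does not give a formula. One common definition, assumed here, is the total separation term of the SD validity index in Halkidi et al. (2000): the ratio of the largest to the smallest distance between cluster centers, multiplied by the sum over clusters of the inverse of each center's summed distance to all centers. The sketch below implements that assumed definition; `total_separation` and the example centers are hypothetical names and data.

```python
import math


def total_separation(centers):
    """Total separation of clusters, computed from cluster centers only.

    Assumed formula (SD validity index, Halkidi et al. 2000):
        Dis = (D_max / D_min) * sum_k ( sum_z ||v_k - v_z|| )^-1
    """
    k = len(centers)
    if k < 2:
        raise ValueError("need at least two cluster centers")
    # Full matrix of Euclidean distances between centers.
    dist = [[math.dist(a, b) for b in centers] for a in centers]
    pairwise = [dist[i][j] for i in range(k) for j in range(i + 1, k)]
    d_max, d_min = max(pairwise), min(pairwise)
    if d_min == 0:
        raise ValueError("two cluster centers coincide")
    return (d_max / d_min) * sum(1.0 / sum(row) for row in dist)


# Hypothetical centers of two clusters in two dimensions.
print(total_separation([(0.0, 0.0), (3.0, 4.0)]))
```

Under this assumed formulation, smaller values are usually read as better separated clusters.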

Page 8: Sawtooth 2012   what's in a label

Cluster Integrity – Homogeneity

Scatter (compactness): average ratio of the cluster variance to the variance of the dataset.
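
A minimal sketch of this ratio, assuming the scatter term of the SD validity index (Halkidi et al., 2000): each cluster's variance is summarized as the Euclidean norm of its per-dimension variance vector, divided by the same norm for the whole dataset, and averaged over clusters. The function names and the example data are hypothetical.

```python
import math
from statistics import pvariance


def variance_norm(points):
    """Euclidean norm of the per-dimension (population) variance vector."""
    dimensions = list(zip(*points))
    return math.sqrt(sum(pvariance(d) ** 2 for d in dimensions))


def scatter(points, labels):
    """Average ratio of within-cluster variance to whole-dataset variance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = variance_norm(points)
    if total == 0:
        raise ValueError("dataset has zero variance")
    return sum(variance_norm(m) for m in clusters.values()) / (len(clusters) * total)


# Hypothetical one-dimensional data with two tight, well separated clusters.
points = [(0.0,), (2.0,), (10.0,), (12.0,)]
labels = [0, 0, 1, 1]
print(scatter(points, labels))  # small value: clusters are compact
```

Lower values indicate more compact (more homogeneous) clusters relative to the spread of the whole dataset.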

Page 9: Sawtooth 2012   what's in a label

Accuracy

Adjusted Rand Index (Hubert and Arabie, 1985): level of agreement between the predicted segment and the real segment correcting for the expected level of agreement.

[Figure: two panels, Reality and Prediction, each showing observations 1 to 9 grouped into segments]
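
To make the correction for expected agreement concrete, the sketch below computes the Adjusted Rand Index from the contingency counts of real versus predicted segments, using the standard Hubert and Arabie formula. The function name and the example labels are hypothetical; the segment assignments in the slide's figure cannot be recovered from the transcript.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(real, predicted):
    """Adjusted Rand Index between two labelings of the same observations."""
    n = len(real)
    if n != len(predicted) or n < 2:
        raise ValueError("need two labelings of the same (at least 2) observations")
    # Pairs that are together in both labelings, in 'real', and in 'predicted'.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(real, predicted)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(real).values())
    sum_cols = sum(comb(c, 2) for c in Counter(predicted).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivially identical
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical real and predicted segments for four respondents.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, renamed
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # worse than chance
```

A value of 1 means the predicted segments reproduce the real ones exactly (up to renaming), values near 0 mean agreement no better than chance, and negative values mean agreement below chance.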

Page 10: Sawtooth 2012   what's in a label

Size

Uniformity deviation: average deviation of each segment's share from the uniform segment size (1/number of segments).
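
A small sketch of this measure follows. The slide does not say how the deviation is measured; the code assumes the mean absolute deviation of each segment's share from 1/number of segments, and it counts only segments that actually appear in the labels. The function name and the example labels are hypothetical.

```python
from collections import Counter


def uniformity_deviation(labels):
    """Mean absolute deviation of segment shares from the uniform share 1/k.

    Assumption: absolute deviations are averaged; k is the number of
    distinct segments observed in `labels`.
    """
    if not labels:
        raise ValueError("need at least one observation")
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    uniform_share = 1 / k
    return sum(abs(c / n - uniform_share) for c in counts.values()) / k


# Hypothetical segment assignments: one large and one small segment.
print(uniformity_deviation([0, 0, 0, 1]))  # shares 0.75 and 0.25 vs 0.5 -> 0.25
print(uniformity_deviation([0, 0, 1, 1]))  # perfectly balanced -> 0.0
```

A value of 0 means all segments are equally large; larger values mean a more uneven segment size distribution.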

Page 11: Sawtooth 2012   what's in a label

Rheumatism

Osteoporosis

Software journey

Page 12: Sawtooth 2012   what's in a label

Stability Heterogeneity

Accuracy Homogeneity

H>S H>S

H>S H>S S>H

S>H S>H

Page 13: Sawtooth 2012   what's in a label

LC gives smaller segments

[Charts for Rheumatism, Osteoporosis and Software journey, each comparing Soft CCEA, Soft LC, Hard LC and Hard CCEA]

Page 14: Sawtooth 2012   what's in a label

MIXED EVIDENCE

Page 15: Sawtooth 2012   what's in a label

Fixed Factors

x 10

100 100 100 100

Page 16: Sawtooth 2012   what's in a label

High confidence

Low confidence

High confidence

Low confidence

Page 17: Sawtooth 2012   what's in a label

Stability: SOFT is better

[Chart labels: Sim. Index soft > hard; Sim. Index hard > soft; Strong similarity; Weak similarity; High confidence; Low confidence]

Page 18: Sawtooth 2012   what's in a label

Homogeneity: SOFT is better

[Chart labels: Scatter hard > soft; Strong similarity; Weak similarity; High confidence; Low confidence]

Page 19: Sawtooth 2012   what's in a label

Heterogeneity: Hard is better

[Chart labels: Tot. Sep. soft > hard; Strong similarity; Weak similarity; High confidence; Low confidence]

Page 20: Sawtooth 2012   what's in a label

Size: Hard is better

[Chart labels: Uni. dev. soft > hard; Strong similarity; Weak similarity; High confidence; Low confidence]

Page 21: Sawtooth 2012   what's in a label

HARD ENSEMBLES GIVE BETTER BUSINESS SEGMENTS

Page 22: Sawtooth 2012   what's in a label

Do we cause rising questions?

Anita Prinzie, Nicole Huyghe

[email protected]

www.solutions2.be

Page 23: Sawtooth 2012   what's in a label

References

• Fred, A. L. N. and Jain, A. K. (2005), Combining Multiple Clusterings Using Evidence Accumulation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6), 835-850.

• Lange, T., Roth, V., Braun, M. L. and Buhmann, J. M. (2004), Stability-Based Validation of Clustering Solutions, Neural Computation, 16(6), 1299-1323.

• Halkidi, M., Vazirgiannis, M. and Batistakis, Y. (2000), Quality Scheme Assessment in the Clustering Process, Proc. of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, 265-276.

• Hubert, L. and Arabie, P. (1985), Comparing Partitions, Journal of Classification, 2(1), 193-218.

• Nieweglowski, L. (2007), clv package, R software.

• Martin, A., Quinn, K. M. and Park, J. H. (2003-2012), Markov Chain Monte Carlo Package (MCMCpack), R software.