Ensembles of Partitions via Data Resampling
Behrouz Minaei, Alexander Topchy and William Punch
Department of Computer Science and Engineering
ITCC 2004, Las Vegas, April 7th, 2004
Outline
- Overview of Data Mining Tasks
- Cluster analysis and its difficulty
- Clustering Ensembles
  - How to generate different partitions?
  - How to combine multiple partitions?
- Resampling Methods: Bootstrap vs. Subsampling
- Experimental study: Methods, Results
- Conclusion
Overview of Data Mining Tasks
- Classification: the goal is to predict the class variable from the feature values of samples (while avoiding overfitting)
- Clustering: unsupervised learning
- Association Analysis
  - Dependence Modeling: a generalization of the classification task; any feature variable can occur both in the antecedent and in the consequent of a rule
  - Association Rules: find binary relationships among data items
Clustering vs. Classification
- Identification of a pattern as a member of a category (pattern class) we already know, or are familiar with
- Supervised Classification (known categories)
- Unsupervised Classification, or "Clustering" (creation of new categories)
[Figure: patterns assigned to the known categories "A" and "B" (Classification) vs. groups discovered from the data (Clustering)]
Classification vs. Clustering
- Classification: given some training patterns from each class, the goal is to construct decision boundaries or to partition the feature space
- Clustering: given some patterns, the goal is to discover the underlying structure (categories) in the data based on inter-pattern similarities
Taxonomy of Clustering Approaches
A. Jain, M. N. Murty, and P. Flynn. Data clustering: A review. ACM Computing Surveys, 31(3):264–323, September 1999.
k-Means Algorithm
[Figure: four snapshots of k-means (k = 3) iterating on a 2-D data set]
- Minimize the sum of within-cluster squared errors
- Start with k cluster centers and iterate between:
  1. Assign data points to the closest cluster centers
  2. Adjust the cluster centers to be the means of the assigned data points
- User-specified parameters: k and the initialization of the cluster centers
- Fast: O(kNI) for N points and I iterations; proven to converge to a local optimum, and in practice converges quickly
- Tends to produce spherical, equal-sized clusters (example above: k-means, k = 3)
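The iteration above can be sketched in a few lines of NumPy. This is a minimal illustrative implementation (the random initialization from data points and the convergence test are assumptions), not the exact code used in the experiments:

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        """Lloyd's k-means: alternate assignment and mean-update steps."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]   # initial centers
        for _ in range(n_iter):
            # 1. Assign each point to the closest center
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # 2. Move each center to the mean of its assigned points
            new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                    else centers[j] for j in range(k)])
            if np.allclose(new_centers, centers):
                break   # reached a local optimum
            centers = new_centers
        return labels, centers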
Single-Link Algorithm
- Form a hierarchy of the data points (a dendrogram), which can be used to partition the data
- The "closest" data points are joined to form a cluster at each step (a SciPy sketch follows the figure below)
- Closely related to minimum-spanning-tree-based clustering
[Figure: example data set, its single-link dendrogram, and the resulting partition]
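A brief SciPy sketch of single-link clustering; the data array and the number of clusters are placeholders:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.rand(50, 2)                        # placeholder 2-D data
    Z = linkage(X, method='single')                  # merge history = dendrogram
    labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree into 3 clusters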
User's Dilemma!
- Which similarity measure and which features to use?
- How many clusters?
- Which is the "best" clustering method?
- Are the individual clusters and the partitions valid?
- How to choose algorithmic parameters?
How Many Clusters?
[Figure: k-means partitions of the same 2-D data set with k = 2, 3, 4, and 5]
Any "Best" Clustering Algorithm?
- Clustering is an "ill-posed" problem; no clustering algorithm is uniformly best
- In practice, we need to determine which clustering algorithm(s) are appropriate for the given data (a comparison sketch follows the figure below)
[Figure: partitions of the same data set by k-means (3 clusters), single-link (30 clusters), spectral clustering (3 clusters), and EM (3 clusters)]
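For illustration, a scikit-learn sketch that produces the four kinds of partitions shown in the figure; the data set and parameters here are placeholders, not the data used on the slide:

    import numpy as np
    from sklearn.cluster import KMeans, AgglomerativeClustering, SpectralClustering
    from sklearn.mixture import GaussianMixture

    X = np.random.rand(300, 2)   # placeholder data set
    partitions = {
        "k-means, 3 clusters": KMeans(n_clusters=3, n_init=10).fit_predict(X),
        "single-link, 30 clusters": AgglomerativeClustering(n_clusters=30,
                                                            linkage="single").fit_predict(X),
        "spectral, 3 clusters": SpectralClustering(n_clusters=3).fit_predict(X),
        "EM (GMM), 3 clusters": GaussianMixture(n_components=3).fit(X).predict(X),
    }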
Ensemble Benefits
- Combinations of classifiers have proved very effective in the supervised learning framework, e.g. bagging and boosting algorithms
- Distributed data mining requires efficient algorithms capable of integrating the solutions obtained from multiple sources of data and features
- Ensembles of clusterings can provide novel, robust, and stable solutions
Is Meaningful Clustering Combination Possible?
[Figure: four different partitions of the same 2-D data set (two with 2 clusters, two with 3 clusters) and the partition obtained by combining them]
"Combination" of 4 different partitions can lead to true clusters!
Pattern Matrix, Distance Matrix

Pattern matrix (N patterns x d features):
        f1   f2   ...  fj   ...  fd
  X1    x11  x12  ...  x1j  ...  x1d
  X2    x21  x22  ...  x2j  ...  x2d
  ...   ...  ...  ...  ...  ...  ...
  Xi    xi1  xi2  ...  xij  ...  xid
  ...   ...  ...  ...  ...  ...  ...
  XN    xN1  xN2  ...  xNj  ...  xNd

Distance matrix (N x N):
        X1   X2   ...  Xj   ...  XN
  X1    d11  d12  ...  d1j  ...  d1N
  X2    d21  d22  ...  d2j  ...  d2N
  ...   ...  ...  ...  ...  ...  ...
  Xi    di1  di2  ...  dij  ...  diN
  ...   ...  ...  ...  ...  ...  ...
  XN    dN1  dN2  ...  dNj  ...  dNN
Representation of Multiple Partitions
- The combination of partitions can be viewed as another clustering problem, where each partition Pi represents a new feature with categorical values
- The cluster memberships of a pattern in the different partitions form a new feature vector
- Combining the partitions is equivalent to clustering these tuples
objects   P1   P2   P3   P4
  x1       1    A         Z
  x2       1    A         Y
  x3       3    D         ?
  x4       2    D         Y
  x5       2    B         Z
  x6       3    C    ?    Z
  x7       3    C         ?

7 objects clustered by 4 algorithms ("?" denotes a missing cluster label)
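In code, such an ensemble is simply an n x H matrix of categorical labels, one column per partition. A sketch using integer-coded labels (the values follow the re-labeled table shown under "Re-labeling and Voting" below; -1 is an assumed encoding for a missing label):

    import numpy as np

    # Rows = objects x1..x7, columns = partitions P1..P4 (integer-coded labels,
    # -1 marks a missing cluster assignment)
    ensemble = np.array([
        [1, 1, 1,  2],
        [1, 1, 2,  1],
        [3, 3, 2, -1],
        [2, 2, 1,  1],
        [2, 3, 2,  2],
        [3, 2, -1, 2],
        [3, 3, 2, -1],
    ])
    # Combining the partitions is equivalent to clustering these 4-dimensional
    # categorical feature vectors.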
Re-labeling and Voting

Original labels:
        C-1  C-2  C-3  C-4
  X1     1    A         Z
  X2     1    A         Y
  X3     3    B         ?
  X4     2    C         Y
  X5     2    B         Z
  X6     3    C    ?    Z
  X7     3    B         ?

After re-labeling:
        C-1  C-2  C-3  C-4
  X1     1    1    1    2
  X2     1    1    2    1
  X3     3    3    2    ?
  X4     2    2    1    1
  X5     2    3    2    2
  X6     3    2    ?    2
  X7     3    3    2    ?

Final consensus (FC) by plurality vote (a tie yields "?"):
  X1: 1,  X2: 1,  X3: 3,  X4: ?,  X5: 2,  X6: 2,  X7: 3
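One common way to do the re-labeling is to match each partition's clusters to a reference partition by maximum overlap (Hungarian method) and then take a plurality vote. The sketch below is an illustrative version of that idea; it assumes complete, non-negative integer labels, so missing values such as the "?" entries above would need extra handling:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def relabel_to_reference(labels, reference):
        """Rename the cluster ids in `labels` to best match `reference`."""
        ids_a, ids_b = np.unique(labels), np.unique(reference)
        # Contingency table: overlap of each cluster in `labels` with each in `reference`
        overlap = np.array([[np.sum((labels == a) & (reference == b)) for b in ids_b]
                            for a in ids_a])
        rows, cols = linear_sum_assignment(-overlap)          # maximize total overlap
        mapping = {ids_a[r]: ids_b[c] for r, c in zip(rows, cols)}
        return np.array([mapping.get(a, a) for a in labels])  # unmatched ids kept as-is

    def plurality_vote(ensemble):
        """Re-label every partition against the first one, then vote per object."""
        ref = ensemble[:, 0]
        aligned = np.column_stack([ref] + [relabel_to_reference(ensemble[:, j], ref)
                                           for j in range(1, ensemble.shape[1])])
        return np.array([np.bincount(row).argmax() for row in aligned])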
Co-association As Consensus Function
- Similarity between two objects can be estimated by the number of clusters shared by the two objects across all partitions of the ensemble
- This similarity definition expresses the strength of co-association of n objects by an n x n matrix
- S(xi, xj) = (1/N) * sum over k of I(pi_k(xi) = pi_k(xj)), where xi is the i-th pattern, pi_k(xi) is the cluster label of xi in the k-th partition, I(.) is the indicator function, and N is the number of different partitions
- This consensus function eliminates the need for solving the label correspondence problem
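A minimal sketch of the co-association matrix for an n x H ensemble matrix (same layout as in the earlier example); how the matrix is then clustered, e.g. by single or average link, belongs to the consensus step:

    import numpy as np

    def co_association(ensemble):
        """S[i, j] = fraction of partitions in which objects i and j share a cluster."""
        n, H = ensemble.shape
        S = np.zeros((n, n))
        for k in range(H):
            labels = ensemble[:, k]
            S += (labels[:, None] == labels[None, :]).astype(float)
        return S / H
    # 1 - S can then serve as a distance matrix for a hierarchical (e.g. single-link) algorithm.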
Taxonomy of Clustering Combination Approaches
- Generative mechanism
  - Different initializations for one algorithm
  - Different subsets of objects
    - Deterministic
    - Resampling
  - Different subsets of features
  - Projection to subspaces
    - Project to 1D
    - Random cuts/planes
  - Different algorithms
- Consensus function
  - Co-association-based
    - Single link
    - Complete link
    - Average link
  - Voting approach
  - Hypergraph methods
    - CSPA
    - HGPA
    - MCLA
  - Mixture Model (EM)
  - Information Theory approach
  - Others …
Resampling Methods
- Bootstrapping (sampling with replacement): create an artificial data set by randomly drawing N elements from the original list of N elements; some elements are picked more than once, and on average about 37% of the original elements are left out of a given bootstrap sample
- Subsampling (sampling without replacement): gives explicit control over the size of the subsample
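A NumPy sketch of the two sampling schemes; N and the subsample fraction are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    N = 400                                  # size of the original data set

    # Bootstrap: N draws *with* replacement; some objects appear several times,
    # and on average about 37% of the original objects are left out of a sample.
    boot_idx = rng.choice(N, size=N, replace=True)

    # Subsampling: draws *without* replacement, with explicit control of the sample size.
    sub_idx = rng.choice(N, size=int(0.7 * N), replace=False)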
Experiment: Data Sets

Data set      Classes  Features  Total patterns  Patterns per class
Halfrings     2        2         400             100-300
2-spirals     2        2         200             100-100
Star/Galaxy   2        14        4192            2082-2110
Wine          3        13        178             59-71-48
LON           2        6         227             64-163
Iris          3        4         150             50-50-50
Half Rings Data Set
- k-means with k = 2 does not identify the true clusters
[Figure: the original data set and the k-means (k = 2) partition]
Half Rings Data Set
- Dendrograms produced by the single-link algorithm using (a) the Euclidean distance over the original data set and (b) the co-association matrix (k = 15, N = 200)
- Both single-link and k-means fail on this data, but the clustering combination detects the true clusters
[Figure: the two dendrograms; the 2-cluster lifetime is much longer in the co-association-based dendrogram]
Bootstrap results on Iris
[Figure: number of misassigned patterns vs. number of partitions B (5-250) for k = 2, 3, 4, 5, 10; Iris data set, MCLA consensus function]
Bootstrap results on Galaxy/Star
[Figure: number of misassigned patterns vs. number of partitions B (5-100) for k = 2, 3, 4, 5; Star/Galaxy data set, average-link consensus]
Bootstrap results on Galaxy/Star, k = 5, different consensus functions
[Figure: number of misassigned patterns vs. number of partitions B (5-100) for the Mutual Information, HGPA, MCLA, and average-link consensus functions; Star/Galaxy data set, k = 5]
Error Rate for Individual Clustering Algorithms

Data set      k-means  Single Link  Complete Link  Average Link
Halfrings     25%      24.3%        14%            5.3%
2-Spiral      43.5%    0%           48%            48%
Iris          15.1%    32%          16%            9.3%
Wine          30.2%    56.7%        32.6%          42%
LON           27%      27.3%        25.6%          27.3%
Star/Galaxy   21%      49.7%        44.1%          49.7%
Summary of the Best Bootstrapping Results

Data set      Best consensus function(s)   Lowest error rate   Parameters
Halfrings     Co-association, SL           0%                  k ≥ 10, B ≥ 100
              Co-association, AL           0%                  k ≥ 15, B ≥ 100
2-Spiral      Co-association, SL           0%                  k ≥ 10, B ≥ 100
Iris          Hypergraph, HGPA             2.7%                k ≥ 10, B ≥ 20
Wine          Hypergraph, CSPA             26.8%               k ≥ 10, B ≥ 20
LON           Co-association, CL           21.1%               k ≥ 4,  B ≥ 100
Star/Galaxy   Hypergraph, MCLA             9.5%                k ≥ 20, B ≥ 10
              Co-association, AL           10%                 k ≥ 10, B ≥ 100
              Mutual Information           11%                 k ≥ 3,  B ≥ 20
Discussion
- What is the trade-off between the accuracy of the overall clustering combination and the computational cost of generating the component partitions?
- What are the optimal size and granularity of the component partitions?
- What is the best consensus function for combining bootstrap partitions?
References
- B. Minaei-Bidgoli, A. Topchy and W.F. Punch, "Effect of the Resampling Methods on Clustering Ensemble Efficacy", prepared for submission to Intl. Conf. on Machine Learning; Models, Technologies and Applications, 2004.
- A. Topchy, B. Minaei-Bidgoli, A.K. Jain and W.F. Punch, "Adaptive Clustering Ensembles", Intl. Conf. on Pattern Recognition (ICPR 2004), in press.
- A. Topchy, A.K. Jain and W.F. Punch, "A Mixture Model of Clustering Ensembles", in Proc. SIAM Conf. on Data Mining, April 2004, in press.