Ensembles of Partitions via Data Resampling
Behrouz Minaei, Alexander Topchy and William Punch
Department of Computer Science and Engineering
ITCC 2004, Las Vegas, April 7th, 2004
Outline
- Overview of Data Mining Tasks
- Cluster analysis and its difficulty
- Clustering Ensembles
  - How to generate different partitions?
  - How to combine multiple partitions?
- Resampling Methods: Bootstrap vs. Subsampling
- Experimental study: Methods, Results
- Conclusion
Overview of Data Mining Tasks
- Classification: the goal is to predict the class variable from the feature values of samples (while avoiding overfitting)
- Clustering: unsupervised learning
- Association Analysis
  - Dependence Modeling: a generalization of the classification task; any feature variable can occur both in the antecedent and in the consequent of a rule
  - Association Rules: find binary relationships among data items
Clustering vs. Classification
- Identification of a pattern as a member of a category (pattern class) we already know, or are familiar with
- Supervised Classification (known categories)
- Unsupervised Classification, or "Clustering" (creation of new categories)
[Figure: patterns assigned to the known categories "A" and "B" (Classification) vs. groups discovered from the data (Clustering)]
Classification vs. Clustering
- Classification: given some training patterns from each class, the goal is to construct decision boundaries or to partition the feature space
- Clustering: given some patterns, the goal is to discover the underlying structure (categories) in the data based on inter-pattern similarities
Taxonomy of Clustering Approaches
A. Jain, M. N. Murty, and P. Flynn. Data clustering: A review. ACM Computing Surveys, 31(3):264–323, September 1999.
k-Means Algorithm
[Figure: four snapshots of k-means (k = 3) iterating on a 2-D data set]
- Minimize the sum of within-cluster squared errors
- Start with k cluster centers and iterate between:
  1. Assign data points to the closest cluster centers
  2. Adjust the cluster centers to be the means of the assigned data points
- User-specified parameters: k and the initialization of the cluster centers
- Fast: O(kNI) for N points and I iterations; proven to converge to a local optimum, and in practice converges quickly
- Tends to produce spherical, equal-sized clusters (example above: k-means, k = 3)
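The iteration above can be sketched in a few lines of NumPy. This is a minimal illustrative implementation (the random initialization from data points and the convergence test are assumptions), not the exact code used in the experiments:

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        """Lloyd's k-means: alternate assignment and mean-update steps."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]   # initial centers
        for _ in range(n_iter):
            # 1. Assign each point to the closest center
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # 2. Move each center to the mean of its assigned points
            new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                    else centers[j] for j in range(k)])
            if np.allclose(new_centers, centers):
                break   # reached a local optimum
            centers = new_centers
        return labels, centers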
Single-Link Algorithm
- Form a hierarchy of the data points (a dendrogram), which can be used to partition the data
- The "closest" data points are joined to form a cluster at each step (a SciPy sketch follows the figure below)
- Closely related to minimum-spanning-tree-based clustering
[Figure: example data set, its single-link dendrogram, and the resulting partition]
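A brief SciPy sketch of single-link clustering; the data array and the number of clusters are placeholders:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.rand(50, 2)                        # placeholder 2-D data
    Z = linkage(X, method='single')                  # merge history = dendrogram
    labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree into 3 clusters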
User's Dilemma!
- Which similarity measure and which features to use?
- How many clusters?
- Which is the "best" clustering method?
- Are the individual clusters and the partitions valid?
- How to choose algorithmic parameters?
How Many Clusters?
[Figure: k-means partitions of the same 2-D data set with k = 2, 3, 4, and 5]
Any "Best" Clustering Algorithm?
- Clustering is an "ill-posed" problem; no clustering algorithm is uniformly best
- In practice, we need to determine which clustering algorithm(s) are appropriate for the given data (a comparison sketch follows the figure below)
[Figure: partitions of the same data set by k-means (3 clusters), single-link (30 clusters), spectral clustering (3 clusters), and EM (3 clusters)]
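For illustration, a scikit-learn sketch that produces the four kinds of partitions shown in the figure; the data set and parameters here are placeholders, not the data used on the slide:

    import numpy as np
    from sklearn.cluster import KMeans, AgglomerativeClustering, SpectralClustering
    from sklearn.mixture import GaussianMixture

    X = np.random.rand(300, 2)   # placeholder data set
    partitions = {
        "k-means, 3 clusters": KMeans(n_clusters=3, n_init=10).fit_predict(X),
        "single-link, 30 clusters": AgglomerativeClustering(n_clusters=30,
                                                            linkage="single").fit_predict(X),
        "spectral, 3 clusters": SpectralClustering(n_clusters=3).fit_predict(X),
        "EM (GMM), 3 clusters": GaussianMixture(n_components=3).fit(X).predict(X),
    }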
Ensemble Benefits
- Combinations of classifiers have proved very effective in the supervised learning framework, e.g. bagging and boosting algorithms
- Distributed data mining requires efficient algorithms capable of integrating the solutions obtained from multiple sources of data and features
- Ensembles of clusterings can provide novel, robust, and stable solutions
Is Meaningful Clustering Combination Possible?
[Figure: four different partitions of the same 2-D data set (two with 2 clusters, two with 3 clusters) and the partition obtained by combining them]
"Combination" of 4 different partitions can lead to true clusters!
Pattern Matrix, Distance Matrix

Pattern matrix (N patterns x d features):
        f1   f2   ...  fj   ...  fd
  X1    x11  x12  ...  x1j  ...  x1d
  X2    x21  x22  ...  x2j  ...  x2d
  ...   ...  ...  ...  ...  ...  ...
  Xi    xi1  xi2  ...  xij  ...  xid
  ...   ...  ...  ...  ...  ...  ...
  XN    xN1  xN2  ...  xNj  ...  xNd

Distance matrix (N x N):
        X1   X2   ...  Xj   ...  XN
  X1    d11  d12  ...  d1j  ...  d1N
  X2    d21  d22  ...  d2j  ...  d2N
  ...   ...  ...  ...  ...  ...  ...
  Xi    di1  di2  ...  dij  ...  diN
  ...   ...  ...  ...  ...  ...  ...
  XN    dN1  dN2  ...  dNj  ...  dNN
Representation of Multiple Partitions
- The combination of partitions can be viewed as another clustering problem, where each partition Pi represents a new feature with categorical values
- The cluster memberships of a pattern in the different partitions form a new feature vector
- Combining the partitions is equivalent to clustering these tuples
objects   P1   P2   P3   P4
  x1       1    A         Z
  x2       1    A         Y
  x3       3    D         ?
  x4       2    D         Y
  x5       2    B         Z
  x6       3    C    ?    Z
  x7       3    C         ?

7 objects clustered by 4 algorithms ("?" denotes a missing cluster label)
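In code, such an ensemble is simply an n x H matrix of categorical labels, one column per partition. A sketch using integer-coded labels (the values follow the re-labeled table shown under "Re-labeling and Voting" below; -1 is an assumed encoding for a missing label):

    import numpy as np

    # Rows = objects x1..x7, columns = partitions P1..P4 (integer-coded labels,
    # -1 marks a missing cluster assignment)
    ensemble = np.array([
        [1, 1, 1,  2],
        [1, 1, 2,  1],
        [3, 3, 2, -1],
        [2, 2, 1,  1],
        [2, 3, 2,  2],
        [3, 2, -1, 2],
        [3, 3, 2, -1],
    ])
    # Combining the partitions is equivalent to clustering these 4-dimensional
    # categorical feature vectors.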
Re-labeling and Voting

Original labels:
        C-1  C-2  C-3  C-4
  X1     1    A         Z
  X2     1    A         Y
  X3     3    B         ?
  X4     2    C         Y
  X5     2    B         Z
  X6     3    C    ?    Z
  X7     3    B         ?

After re-labeling:
        C-1  C-2  C-3  C-4
  X1     1    1    1    2
  X2     1    1    2    1
  X3     3    3    2    ?
  X4     2    2    1    1
  X5     2    3    2    2
  X6     3    2    ?    2
  X7     3    3    2    ?

Final consensus (FC) by plurality vote (a tie yields "?"):
  X1: 1,  X2: 1,  X3: 3,  X4: ?,  X5: 2,  X6: 2,  X7: 3
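One common way to do the re-labeling is to match each partition's clusters to a reference partition by maximum overlap (Hungarian method) and then take a plurality vote. The sketch below is an illustrative version of that idea; it assumes complete, non-negative integer labels, so missing values such as the "?" entries above would need extra handling:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def relabel_to_reference(labels, reference):
        """Rename the cluster ids in `labels` to best match `reference`."""
        ids_a, ids_b = np.unique(labels), np.unique(reference)
        # Contingency table: overlap of each cluster in `labels` with each in `reference`
        overlap = np.array([[np.sum((labels == a) & (reference == b)) for b in ids_b]
                            for a in ids_a])
        rows, cols = linear_sum_assignment(-overlap)          # maximize total overlap
        mapping = {ids_a[r]: ids_b[c] for r, c in zip(rows, cols)}
        return np.array([mapping.get(a, a) for a in labels])  # unmatched ids kept as-is

    def plurality_vote(ensemble):
        """Re-label every partition against the first one, then vote per object."""
        ref = ensemble[:, 0]
        aligned = np.column_stack([ref] + [relabel_to_reference(ensemble[:, j], ref)
                                           for j in range(1, ensemble.shape[1])])
        return np.array([np.bincount(row).argmax() for row in aligned])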
Co-association As Consensus Function
- Similarity between two objects can be estimated by the number of clusters shared by the two objects across all partitions of the ensemble
- This similarity definition expresses the strength of co-association of n objects by an n x n matrix
- S(xi, xj) = (1/N) * sum over k of I(pi_k(xi) = pi_k(xj)), where xi is the i-th pattern, pi_k(xi) is the cluster label of xi in the k-th partition, I(.) is the indicator function, and N is the number of different partitions
- This consensus function eliminates the need for solving the label correspondence problem
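A minimal sketch of the co-association matrix for an n x H ensemble matrix (same layout as in the earlier example); how the matrix is then clustered, e.g. by single or average link, belongs to the consensus step:

    import numpy as np

    def co_association(ensemble):
        """S[i, j] = fraction of partitions in which objects i and j share a cluster."""
        n, H = ensemble.shape
        S = np.zeros((n, n))
        for k in range(H):
            labels = ensemble[:, k]
            S += (labels[:, None] == labels[None, :]).astype(float)
        return S / H
    # 1 - S can then serve as a distance matrix for a hierarchical (e.g. single-link) algorithm.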
Taxonomy of Clustering Combination Approaches
- Generative mechanism
  - Different initializations for one algorithm
  - Different subsets of objects
    - Deterministic
    - Resampling
  - Different subsets of features
  - Projection to subspaces
    - Project to 1D
    - Random cuts/planes
  - Different algorithms
- Consensus function
  - Co-association-based
    - Single link
    - Complete link
    - Average link
  - Voting approach
  - Hypergraph methods
    - CSPA
    - HGPA
    - MCLA
  - Mixture Model (EM)
  - Information Theory approach
  - Others …
Resampling Methods
- Bootstrapping (sampling with replacement): create an artificial data set by randomly drawing N elements from the original list of N elements; some elements are picked more than once, and on average about 37% of the original elements are left out of a given bootstrap sample
- Subsampling (sampling without replacement): gives explicit control over the size of the subsample
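A NumPy sketch of the two sampling schemes; N and the subsample fraction are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    N = 400                                  # size of the original data set

    # Bootstrap: N draws *with* replacement; some objects appear several times,
    # and on average about 37% of the original objects are left out of a sample.
    boot_idx = rng.choice(N, size=N, replace=True)

    # Subsampling: draws *without* replacement, with explicit control of the sample size.
    sub_idx = rng.choice(N, size=int(0.7 * N), replace=False)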
Experiment: Data Sets

Data set      Classes  Features  Total patterns  Patterns per class
Halfrings     2        2         400             100-300
2-spirals     2        2         200             100-100
Star/Galaxy   2        14        4192            2082-2110
Wine          3        13        178             59-71-48
LON           2        6         227             64-163
Iris          3        4         150             50-50-50
Half Rings Data Set
- k-means with k = 2 does not identify the true clusters
[Figure: the original data set and the k-means (k = 2) partition]
Half Rings Data Set
- Dendrograms produced by the single-link algorithm using (a) the Euclidean distance over the original data set and (b) the co-association matrix (k = 15, N = 200)
- Both single-link and k-means fail on this data, but the clustering combination detects the true clusters
[Figure: the two dendrograms; the 2-cluster lifetime is much longer in the co-association-based dendrogram]
Bootstrap results on Iris
[Figure: number of misassigned patterns vs. number of partitions B (5-250) for k = 2, 3, 4, 5, 10; Iris data set, MCLA consensus function]
Bootstrap results on Galaxy/Star
[Figure: number of misassigned patterns vs. number of partitions B (5-100) for k = 2, 3, 4, 5; Star/Galaxy data set, average-link consensus]
Bootstrap results on Galaxy/Star, k = 5, different consensus functions
[Figure: number of misassigned patterns vs. number of partitions B (5-100) for the Mutual Information, HGPA, MCLA, and average-link consensus functions; Star/Galaxy data set, k = 5]
Error Rate for Individual Clustering Algorithms

Data set      k-means  Single Link  Complete Link  Average Link
Halfrings     25%      24.3%        14%            5.3%
2-Spiral      43.5%    0%           48%            48%
Iris          15.1%    32%          16%            9.3%
Wine          30.2%    56.7%        32.6%          42%
LON           27%      27.3%        25.6%          27.3%
Star/Galaxy   21%      49.7%        44.1%          49.7%
Summary of the Best Bootstrapping Results

Data set      Best consensus function(s)   Lowest error rate   Parameters
Halfrings     Co-association, SL           0%                  k ≥ 10, B ≥ 100
              Co-association, AL           0%                  k ≥ 15, B ≥ 100
2-Spiral      Co-association, SL           0%                  k ≥ 10, B ≥ 100
Iris          Hypergraph, HGPA             2.7%                k ≥ 10, B ≥ 20
Wine          Hypergraph, CSPA             26.8%               k ≥ 10, B ≥ 20
LON           Co-association, CL           21.1%               k ≥ 4,  B ≥ 100
Star/Galaxy   Hypergraph, MCLA             9.5%                k ≥ 20, B ≥ 10
              Co-association, AL           10%                 k ≥ 10, B ≥ 100
              Mutual Information           11%                 k ≥ 3,  B ≥ 20
Discussion
- What is the trade-off between the accuracy of the overall clustering combination and the computational cost of generating the component partitions?
- What are the optimal size and granularity of the component partitions?
- What is the best consensus function for combining bootstrap partitions?
References
- B. Minaei-Bidgoli, A. Topchy and W.F. Punch, "Effect of the Resampling Methods on Clustering Ensemble Efficacy", prepared for submission to Intl. Conf. on Machine Learning; Models, Technologies and Applications, 2004.
- A. Topchy, B. Minaei-Bidgoli, A.K. Jain and W.F. Punch, "Adaptive Clustering Ensembles", Intl. Conf. on Pattern Recognition (ICPR 2004), in press.
- A. Topchy, A.K. Jain and W.F. Punch, "A Mixture Model of Clustering Ensembles", in Proc. SIAM Conf. on Data Mining, April 2004, in press.