Ensembles of Partitions via Data Resampling. Behrouz Minaei, Alexander Topchy and William Punch. Department of Computer Science and Engineering. ITCC 2004, Las Vegas, April 7th 2004

Jan 02, 2016

Transcript
Page 1:

Ensembles of Partitions via Data Resampling

Behrouz Minaei, Alexander Topchy and William Punch

Department of Computer Science and Engineering

ITCC 2004, Las Vegas, April 7th 2004

Page 2:

Outline

Overview of Data Mining Tasks
Cluster analysis and its difficulty
Clustering Ensemble
  How to generate different partitions?
  How to combine multiple partitions?
Resampling Methods
  Bootstrap vs. Subsampling
Experimental study
  Methods
  Results
Conclusion

Page 3:

Overview of Data Mining Tasks

Classification:
  The goal is to predict the class variable based on the feature values of samples … Avoid Overfitting

Clustering: (unsupervised learning)

Association Analysis:
  Dependence Modeling: A generalization of the classification task. Any feature variable can occur both in the antecedent and in the consequent of a rule.
  Association Rules: Find binary relationships among data items

Page 4:

Clustering vs. Classification

Identification of a pattern as a member of a category (pattern class) we already know, or are familiar with
  Supervised Classification (known categories)
  Unsupervised Classification, or “Clustering” (creation of new categories)

[Figure: panels labeled “Classification” and “Clustering”, with Category “A” and Category “B” marked]

Page 5:

Classification vs. Clustering

Classification: Given some training patterns from each class, the goal is to construct decision boundaries or to partition the feature space

Clustering: Given some patterns, the goal is to discover the underlying structure (categories) in the data based on inter-pattern similarities

Page 6:

Taxonomy of Clustering Approaches

A. Jain, M. N. Murty, and P. Flynn. Data clustering: A review. ACM Computing Surveys, 31(3):264–323, September 1999.

Page 7:

k-Means Algorithm

[Figure: four scatter plots on [0, 1] x [0, 1] axes; caption “k-means, k=3”]

Minimize the sum of within-cluster square errors

Start with k cluster centers

Iterate between
  1. Assign data points to the closest cluster centers
  2. Adjust the cluster centers to be the means of the data points

User specified parameters: k, initialization of cluster centers

Fast: O(kNI)
Proven to converge to a local optimum
In practice, converges quickly
Tends to produce spherical, equal-sized clusters
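
The two-step iteration on this slide can be sketched in a few lines of plain Python. This is only a minimal illustration, not the implementation used in the experiments: the function names (`kmeans`, `within_cluster_sse`), the choice of k random data points as initial centers, the fixed seed, and the toy data are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means; initial centers are k random data points (an assumption)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 1: assign each data point to the closest cluster center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: local optimum
            break
        labels = new_labels
        # Step 2: move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


def within_cluster_sse(points, labels, centers):
    """The objective from the slide: sum of within-cluster squared errors."""
    return sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))


# Hypothetical toy data: two well-separated groups of three points.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels, centers = kmeans(points, k=2)
print(labels, within_cluster_sse(points, labels, centers))
```

Each pass costs on the order of k times N distance evaluations, which is consistent with the O(kNI) figure if N is read as the number of points and I as the number of iterations.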

Page 8:

Single-Link algorithm

Form a hierarchy for the data points (dendrogram), which can be used to partition the data

The “closest” data points are joined to form a cluster at each step

Closely related to the minimum spanning tree-based clustering
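
A naive version of the merging procedure described here might look like the sketch below. It assumes Euclidean distance and cuts the hierarchy when k clusters remain; the function name `single_link` and the one-dimensional toy data are hypothetical, and the cubic-time search is chosen for readability rather than speed.

```python
import math


def single_link(points, k):
    """Agglomerative single-link clustering, stopped when k clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start: every point alone
    merge_heights = []  # distance at each merge, i.e. the dendrogram heights

    def link(a, b):
        # Single-link distance: the closest pair across the two clusters.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        # Find the two closest clusters and join them.
        d, ia, ib = min(
            (link(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        merge_heights.append(d)
        clusters[ia] = clusters[ia] + clusters[ib]
        del clusters[ib]

    labels = [0] * len(points)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels, merge_heights


# Hypothetical toy data on a line: a group near 0 and a group near 10.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)]
labels, merge_heights = single_link(points, k=2)
print(labels, merge_heights)
```

The recorded merge heights are the information a dendrogram displays; cutting at a different number of clusters only changes where the loop stops.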

[Figure: three panels titled “Data”, “Dendrogram”, and “Single-link, k=3”]

Page 9:

User’s Dilemma!

Which similarity measure and which features to use?
How many clusters?
Which is the “best” clustering method?
Are the individual clusters and the partitions valid?
How to choose algorithmic parameters?

Page 10:

How Many Clusters?

[Figure: four scatter plots on [0, 1] x [0, 1] axes, titled “k-means, k=2”, “k-means, k=3”, “k-means, k=4”, and “k-means, k=5”]

Page 11:

Any “Best” Clustering Algorithm?

Clustering is an “ill-posed” problem; there does not exist a uniformly best clustering algorithm

In practice, we need to determine which clustering algorithm(s) is appropriate for the given data

[Figure: four scatter plots on identical axes, titled “k-means, 3 clusters”, “Single-link, 30 clusters”, “EM, 3 clusters”, and “Spectral, 3 clusters”]

Page 12:

Ensemble Benefits

Combinations of classifiers have proved very effective in the supervised learning framework, e.g. bagging and boosting algorithms

Distributed data mining requires efficient algorithms capable of integrating the solutions obtained from multiple sources of data and features

Ensembles of clusterings can provide novel, robust, and stable solutions

Page 13:

Is Meaningful Clustering Combination Possible?

[Figure: four scatter plots of individual partitions, titled “2 clusters”, “3 clusters”, “2 clusters”, and “3 clusters”, and a fifth scatter plot]

“Combination” of 4 different partitions can lead to true clusters!

Page 14:

Pattern Matrix, Distance matrix

Pattern matrix (rows: patterns, columns: features)

        Features
X1      x11   x12   …   x1j   …   x1d
X2      x21   x22   …   x2j   …   x2d
…       …     …     …   …     …   …
Xi      xi1   xi2   …   xij   …   xid
…       …     …     …   …     …   …
XN      xN1   xN2   …   xNj   …   xNd

Distance matrix (rows and columns: patterns)

        X1    X2    …   Xj    …   XN
X1      d11   d12   …   d1j   …   d1N
X2      d21   d22   …   d2j   …   d2N
…       …     …     …   …     …   …
Xi      di1   di2   …   dij   …   diN
…       …     …     …   …     …   …
XN      dN1   dN2   …   dNj   …   dNN
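
The relation between the two matrices can be made concrete with a short sketch that derives an N x N distance matrix from an N x d pattern matrix. The slide does not name the distance measure, so Euclidean distance is an assumption here, as are the function name `distance_matrix` and the three toy patterns.

```python
import math


def distance_matrix(patterns):
    """Return D with D[i][j] = Euclidean distance between patterns i and j."""
    n = len(patterns)
    return [[math.dist(patterns[i], patterns[j]) for j in range(n)] for i in range(n)]


# Hypothetical pattern matrix: N = 3 patterns, d = 2 features.
X = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = distance_matrix(X)
for row in D:
    print(row)
```

The result is symmetric with a zero diagonal, so only the N(N-1)/2 entries above the diagonal carry information.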

Page 15:

Representation of Multiple Partitions

Combination of partitions can be viewed as another clustering problem, where each Pi represents a new feature with categorical values

Cluster membership of a pattern in different partitions is regarded as a new feature vector

Combining the partitions is equivalent to clustering these tuples

7 objects clustered by 4 algorithms

objects   P1   P2   P3   P4
X1        1    A         Z
X2        1    A         Y
X3        3    D         ?
X4        2    D         Y
X5        2    B         Z
X6        3    C    ?    Z
X7        3    C         ?

Page 16:

Re-labeling and Voting

Original cluster labels

      C-1   C-2   C-3   C-4
X1    1     A           Z
X2    1     A           Y
X3    3     B           ?
X4    2     C           Y
X5    2     B           Z
X6    3     C     ?     Z
X7    3     B           ?

After re-labeling, with the final consensus (FC)

      C-1   C-2   C-3   C-4   FC
X1    1     1     1     2     1
X2    1     1     2     1     1
X3    3     3     2     ?     3
X4    2     2     1     1     ?
X5    2     3     2     2     2
X6    3     2     ?     2     2
X7    3     3     2     ?     3
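
Once the labels have been brought into correspondence, the voting step is a per-object majority count. The sketch below replays it on the re-labeled values from this slide. Two points are assumptions rather than statements from the slide: that a “?” entry is an unknown label which casts no vote, and that a tie for the top label yields “?” in the FC column. The re-labeling step itself (solving the label correspondence) is not shown.

```python
from collections import Counter

# Re-labeled table from the slide; None stands for a "?" entry.
relabeled = {
    "X1": [1, 1, 1, 2],
    "X2": [1, 1, 2, 1],
    "X3": [3, 3, 2, None],
    "X4": [2, 2, 1, 1],
    "X5": [2, 3, 2, 2],
    "X6": [3, 2, None, 2],
    "X7": [3, 3, 2, None],
}


def vote(labels):
    """Majority vote over known labels; None on a tie or if nothing is known."""
    ranked = Counter(lab for lab in labels if lab is not None).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # assumed tie rule: no consensus label
    return ranked[0][0]


consensus = {name: vote(labels) for name, labels in relabeled.items()}
print(consensus)
```

Under these assumptions the votes give 1, 1, 3, ?, 2, 2, 3 for X1 to X7, which agrees with the FC column.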

Page 17:

Co-association As Consensus Function

Similarity between objects can be estimated by the number of clusters shared by two objects in all the partitions of an ensemble

This similarity definition expresses the strength of co-association of n objects by an n x n matrix

S(x_i, x_j) = \frac{1}{N} \sum_{k=1}^{N} \delta(\pi_k(x_i), \pi_k(x_j))

x_i: the i-th pattern; \pi_k(x_i): cluster label of x_i in the k-th partition; \delta(): indicator function; N = no. of different partitions

This consensus function eliminates the need for solving the label correspondence problem
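
A minimal sketch of the co-association computation follows. Each entry is taken to be the fraction of partitions in which two objects receive the same cluster label; dividing by the number of partitions is an assumption based on the definition of N on this slide, and the function name `co_association` and the two toy partitions are hypothetical.

```python
def co_association(partitions):
    """n x n matrix: fraction of partitions in which objects i and j share a label."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                # Indicator: 1 if both objects fall in the same cluster.
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(partitions) for c in row] for row in counts]


# Two hypothetical partitions of four objects, with unrelated label names.
partitions = [
    [1, 1, 2, 2],
    ["a", "a", "a", "b"],
]
S = co_association(partitions)
for row in S:
    print(row)
```

Labels are only ever compared within a single partition, which is why the label names of different partitions never have to be matched. The resulting matrix can then be handed to a similarity-based method such as single-link.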

Page 18:

Taxonomy of Clustering Combination Approaches

Clustering Combination Approaches
  Generative mechanism
    Different initialization for one algorithm
    Different subsets of objects
      Deterministic
      Resampling
    Different subsets of features
    Projection to subspaces
      Project to 1D
      Rand. cuts/plane
    Different algorithms
  Consensus function
    Co-association-based
      Single link
      Comp. link
      Avg. link
    Voting approach
    Hypergraph methods
      CSPA
      HGPA
      MCLA
    Mixture Model (EM)
    Information Theory approach
    Others …

Page 19:

Resampling Methods

Bootstrapping (Sampling with replacement)
  Create an artificial list by randomly drawing N elements, with replacement, from the original list of N elements. Some elements will be picked more than once.
  On average, about 37% (approximately 1/e) of the original elements are left out of a bootstrap sample, so about 37% of the N draws are repeats.

Subsampling (Sampling without replacement)
  Control over the size of subsample
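
The difference between the two schemes, and the origin of the 37% figure, can be checked with a small simulation. The probability that a given element is never drawn in N draws with replacement is (1 - 1/N)^N, which tends to 1/e, about 0.368, as N grows. The function names, the data size of 1000, the 50 trials, and the seed below are arbitrary choices for illustration.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) elements with replacement."""
    return [rng.choice(data) for _ in range(len(data))]


def subsample(data, size, rng):
    """Draw `size` distinct elements without replacement."""
    return rng.sample(data, size)


rng = random.Random(42)
data = list(range(1000))

# Average fraction of original elements missing from a bootstrap sample.
trials = 50
left_out = sum(
    1 - len(set(bootstrap_sample(data, rng))) / len(data) for _ in range(trials)
) / trials
print(f"left out on average: {left_out:.3f}")

# A subsample never repeats an element, and its size is chosen freely.
sub = subsample(data, 200, rng)
print(len(sub), len(set(sub)))
```

In an ensemble, each resampled set would be clustered separately, so an object missing from a sample simply has no label in that partition.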

Page 20:

Experiment: Data sets

Data set      Number of Classes   Number of Features   Total no. of patterns   Patterns per class
Halfrings     2                   2                    400                     100-300
2-spirals     2                   2                    200                     100-100
Star/Galaxy   2                   14                   4192                    2082-2110
Wine          3                   13                   178                     59-71-48
LON           2                   6                    227                     64-163
Iris          3                   4                    150                     50-50-50

Page 21:

Half Rings Data Set

[Figure: two scatter plots, titled “Original data set” and “k-Means, k=2”]

k-means with k = 2 does not identify the true clusters

Page 22:

Half Rings Data Set

Dendrograms produced by the single-link algorithm using:
  Euclidean distance over the original data set
  Co-association matrix, k=15, N=200

[Figure: scatter plot of the data and two dendrograms; annotations “l2”, “l3”, and “2-cluster lifetime”]

Both SL and k-means algorithms fail on this data, but clustering combination detects true clusters

Page 23:

Bootstrap results on Iris

[Figure: “Iris, MCLA”. y-axis: # of misassigned patterns (0 to 80); x-axis: Number of Partitions, B (5, 10, 20, 50, 100, 250); series: k = 2, k = 3, k = 4, k = 5, k = 10]

Page 24:

Bootstrap results on Galaxy/Star

[Figure: “Galaxy, Av. link”. y-axis: # of misassigned patterns (0 to 2500); x-axis: Number of Partitions, B (5, 10, 20, 50, 100); series: k=2, k=3, k=4, k=5]

Page 25:

Bootstrap results on Galaxy/Star, k=5, different consensus functions

[Figure: “Galaxy, k=5”. y-axis: # of misassigned patterns (0 to 2500); x-axis: Number of Partitions, B (5, 10, 20, 50, 100); series: Mutual Inf, HGPA, MCLA, Avg. link]

Page 26:

Error Rate for Individual Clustering

Data set      k-means   Single Link   Complete Link   Average Link
Halfrings     25%       24.3%         14%             5.3%
2 Spiral      43.5%     0%            48%             48%
Iris          15.1%     32%           16%             9.3%
Wine          30.2%     56.7%         32.6%           42%
LON           27%       27.3%         25.6%           27.3%
Star/Galaxy   21%       49.7%         44.1%           49.7%

Page 27:

Summary of the best results of Bootstrapping

Data set      Best Consensus function(s)   Lowest Error rate obtained   Parameters
Halfrings     Co-association, SL           0%                           k ≥ 10, B ≥ 100
              Co-association, AL           0%                           k ≥ 15, B ≥ 100
2 Spiral      Co-association, SL           0%                           k ≥ 10, B ≥ 100
Iris          Hypergraph-HGPA              2.7%                         k ≥ 10, B ≥ 20
Wine          Hypergraph-CSPA              26.8%                        k ≥ 10, B ≥ 20
LON           Co-association, CL           21.1%                        k ≥ 4, B ≥ 100
Star/Galaxy   Hypergraph-MCLA              9.5%                         k ≥ 20, B ≥ 10
              Co-association, AL           10%                          k ≥ 10, B ≥ 100
              Mutual Information           11%                          k ≥ 3, B ≥ 20

Page 28:

Discussion

What is the trade-off between the accuracy of the overall clustering combination and computational cost of generating component partitions?

What is the optimal size and granularity of the component partitions?

What is the best consensus function to combine bootstrap partitions?

Page 29:

References

B. Minaei-Bidgoli, A. Topchy and W.F. Punch, “Effect of the Resampling Methods on Clustering Ensemble Efficacy”, prepared to submit to Intl. Conf. on Machine Learning; Models, Technologies and Applications, 2004

A. Topchy, B. Minaei-Bidgoli, A.K. Jain and W.F. Punch, “Adaptive Clustering Ensembles”, Intl. Conf. on Pattern Recognition, ICPR 2004, in press

A. Topchy, A.K. Jain and W. Punch, “A Mixture Model of Clustering Ensembles”, in Proceedings SIAM Conf. on Data Mining, April 2004, in press

Page 30:

Clusters of Galaxies