Data Mining Cluster Analysis: Advanced Concepts and Algorithms
ref. Chapter 9, Introduction to Data Mining by Tan, Steinbach, Kumar
Outline
Prototype-based
 – Fuzzy c-means
 – Mixture Model Clustering
Density-based
 – Grid-based clustering
 – Subspace clustering
Graph-based
 – Chameleon
Scalable Clustering Algorithms
 – CURE and BIRCH
Characteristics of Clustering Algorithms
Hard (Crisp) vs Soft (Fuzzy) Clustering
Hard (Crisp) vs. Soft (Fuzzy) clustering
 – Generalize the K-means objective function (over all N points):

   SSE = \sum_{j=1}^{k} \sum_{i=1}^{N} w_{ij} (x_i - c_j)^2, \qquad \text{subject to } \sum_{j=1}^{k} w_{ij} = 1 \text{ for each point } x_i

 – w_{ij}: weight with which object x_i belongs to cluster C_j
 – To minimize the SSE, repeat the following steps:
   Fix c_j and determine w_{ij} (cluster assignment)
   Fix w_{ij} and recompute c_j
 – Hard clustering: w_{ij} ∈ {0, 1}
Hard (Crisp) vs Soft (Fuzzy) Clustering
SSE(x) is minimized when wx1 = 1, wx2 = 0
(Figure: number line with c1 = 1, x = 2, c2 = 5)
Fuzzy C-means
Objective function:

  SSE = \sum_{j=1}^{k} \sum_{i=1}^{N} w_{ij}^{p} (x_i - c_j)^2, \qquad \text{subject to } \sum_{j=1}^{k} w_{ij} = 1, \quad p \text{: fuzzifier } (p > 1)

 – w_{ij}: weight with which object x_i belongs to cluster C_j
 – To minimize the objective function, repeat the following:
   Fix c_j and determine w_{ij}
   Fix w_{ij} and recompute c_j
 – Fuzzy clustering: w_{ij} ∈ [0, 1]
Fuzzy C-means
SSE(x) is minimized when wx1 = 0.9, wx2 = 0.1
(Figure: number line with c1 = 1, x = 2, c2 = 5)
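As a worked check (not on the slides), assume the standard fuzzy c-means weights with fuzzifier p = 2 and the positions shown in the figure, so that d(x, c1) = 1 and d(x, c2) = 3:

  w_{x1} = \frac{1/1^2}{1/1^2 + 1/3^2} = \frac{1}{1 + 1/9} = 0.9, \qquad w_{x2} = 1 - w_{x1} = 0.1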
Fuzzy C-means
Objective function:

  SSE = \sum_{j=1}^{k} \sum_{i=1}^{N} w_{ij}^{p} (x_i - c_j)^2, \qquad \text{subject to } \sum_{j=1}^{k} w_{ij} = 1, \quad p \text{: fuzzifier } (p > 1)

Initialization: choose the weights w_{ij} randomly (subject to the constraint above)
Repeat:
 – Update centroids:  c_j = \sum_{i=1}^{N} w_{ij}^{p} x_i \;/\; \sum_{i=1}^{N} w_{ij}^{p}
 – Update weights:  w_{ij} = \left(1 / (x_i - c_j)^2\right)^{1/(p-1)} \;/\; \sum_{q=1}^{k} \left(1 / (x_i - c_q)^2\right)^{1/(p-1)}
(a code sketch of these steps follows)
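Below is a minimal sketch of these update steps in Python/NumPy. The function name, defaults, and structure are illustrative and not from the slides; it assumes Euclidean distance and fuzzifier p = 2 by default.

import numpy as np

def fuzzy_c_means(X, k, p=2.0, n_iter=100, seed=0):
    """Minimal fuzzy c-means sketch: X is (N, d); returns (centroids, weights)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    # Random initial weights, normalized so each row sums to 1
    W = rng.random((N, k))
    W /= W.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Update centroids: weighted mean of the points with weights w_ij^p
        Wp = W ** p
        C = (Wp.T @ X) / Wp.sum(axis=0)[:, None]
        # Squared distances from every point to every centroid, shape (N, k)
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)          # avoid division by zero
        # Update weights: w_ij proportional to (1 / d_ij^2)^(1/(p-1))
        inv = (1.0 / d2) ** (1.0 / (p - 1.0))
        W = inv / inv.sum(axis=1, keepdims=True)
    return C, W

For example, C, W = fuzzy_c_means(X, k=3) returns the three centroids and the N×3 membership weights.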
Fuzzy c-means Applied to Sample Data
Hard (Crisp) vs Soft (Probabilistic) Clustering
The idea is to model the set of data points as arising from a mixture of distributions
 – Typically, the normal (Gaussian) distribution is used
 – But other distributions have also been used profitably
Clusters are found by estimating the parameters of the statistical distributions
 – A K-means-like algorithm, the EM algorithm, can be used to estimate these parameters
   (in fact, K-means is a special case of this approach)
 – Provides a compact representation of clusters
 – The probabilities with which a point belongs to each cluster provide functionality similar to fuzzy clustering
Probabilistic Clustering: Example
Informal example: consider modeling the points that generate the following histogram
It looks like a combination of two normal distributions
Suppose we can estimate the mean and standard deviation of each normal distribution
 – This completely describes the two clusters
 – We can compute the probability with which each point belongs to each cluster
 – We can then assign each point to the cluster (distribution) under which it is most probable
Probabilistic Clustering: EM Algorithm
Initialize the parameters
Repeat
 – For each point, compute its probability under each distribution
 – Using these probabilities, update the parameters of each distribution
Until there is no change
Very similar to K-means: consists of assignment and update steps
Can use random initialization
 – Problem of local minima
For normal distributions, K-means is typically used to initialize
If normal distributions are used, elliptical as well as spherical cluster shapes can be found
Probabilistic Clustering: EM Algorithm
Choose K seeds as the initial means of the Gaussian distributions
Estimation: calculate the probability that each point belongs to each cluster, based on distance
Maximization: move the mean of each Gaussian to the centroid of the data set, weighted by the contribution of each point
Repeat until the means do not move (a minimal EM sketch follows)
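Below is a minimal EM sketch for a one-dimensional mixture of k Gaussians. The names, the initialization by sampling k seeds from the data, and the fixed iteration count instead of a convergence test are illustrative choices, not from the slides.

import numpy as np

def em_gmm_1d(x, k, n_iter=100, seed=0):
    """Minimal EM sketch for a 1-D mixture of k Gaussians; x is a 1-D array."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)   # K seeds: initial means
    var = np.full(k, x.var())                   # initial variances
    pi = np.full(k, 1.0 / k)                    # initial mixing weights
    for _ in range(n_iter):
        # Estimation step: probability of each point under each Gaussian
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)   # responsibilities, shape (N, k)
        # Maximization step: update parameters, weighted by each point's contribution
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / len(x)
    return pi, mu, var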
Probabilistic Clustering Applied to Sample Data
Grid-based Clustering
A type of density-based clustering
Grid-based Clustering
Issues
 – How to discretize the dimensions: equal-width vs. equal-frequency discretization
 – The density of cells containing points close to the border of a cluster can be very low, so these cells are discarded
   A possible solution is to reduce the size of the cells, but this may introduce additional problems
(a code sketch of this approach follows)
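Below is a minimal sketch of this kind of grid-based clustering in two dimensions, assuming equal-width discretization, a point-count density threshold, and 4-neighborhood contiguity. All names and parameters are illustrative.

import numpy as np
from collections import deque

def grid_cluster_2d(X, n_bins=10, min_pts=5):
    """Bin 2-D points into an equal-width grid, keep dense cells, and merge contiguous dense cells."""
    # Equal-width discretization of each dimension
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    cells = np.minimum(((X - lo) / span * n_bins).astype(int), n_bins - 1)
    # Count points per cell and keep the dense ones
    counts = {}
    for cx, cy in map(tuple, cells):
        counts[(cx, cy)] = counts.get((cx, cy), 0) + 1
    dense = {c for c, n in counts.items() if n >= min_pts}
    # A cluster is a connected group of contiguous dense cells (4-neighborhood)
    labels, cur = {}, 0
    for start in dense:
        if start in labels:
            continue
        labels[start] = cur
        queue = deque([start])
        while queue:
            cx, cy = queue.popleft()
            for nb in [(cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)]:
                if nb in dense and nb not in labels:
                    labels[nb] = cur
                    queue.append(nb)
        cur += 1
    # Assign each point the label of its cell (-1 = discarded, low-density cell)
    return np.array([labels.get(tuple(c), -1) for c in cells])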
Subspace Clustering
Until now, we found clusters by considering all of the attributes
Some clusters may involve only a subset of attributes, i.e., subspaces of the data
 – Example:
   In a document collection, documents can be represented as vectors, where the dimensions correspond to terms
   When K-means is used to find document clusters, the resulting clusters can typically be characterized by 10 or so terms
Example
Three clear clusters
 – The points drawn as circles are not a cluster in three dimensions
 – If the dimensions are discretized (equal width), these points fall into low-density cells
Histograms to determine density
 – Equi-width discretized space
 – Density threshold = 6%
 – Contiguous intervals to be clustered
Example (figures)
Example: remarks
The circled points do not form a cluster in the full three dimensions, but they may form a cluster in some subspaces
A cluster in the full three dimensions is part of a (possibly larger) cluster in its subspaces
Clique Algorithm - Overview
A grid-based clustering algorithm that methodically finds subspace clusters
 – Partitions the data space into rectangular units of equal volume
 – Measures the density of each unit by the fraction of points it contains
 – A unit is dense if the fraction of all points it contains is above a user-specified threshold τ
 – A cluster is a group of contiguous (touching) dense units
Clique Algorithm
It is impractical to check each subspace to see if it is dense, due to the exponential number of subspaces
 – 2^n subspaces, if n is the number of dimensions
Monotone property of density-based clusters:
 – If a set of points forms a density-based cluster in k dimensions, then the same set of points is also part of a density-based cluster in all possible subsets of those dimensions
 – This enables a bottom-up search, very similar to the Apriori algorithm for frequent itemset mining (a code sketch follows)
Can find overlapping clusters
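Below is a minimal sketch of the Apriori-style bottom-up search for dense units, under illustrative simplifications (equal-width bins, a relative density threshold tau, a brute-force support count). It is not the full CLIQUE algorithm, which also builds clusters and minimal descriptions from the dense units.

import numpy as np
from itertools import combinations

def dense_units(X, n_bins=10, tau=0.05):
    """CLIQUE-like bottom-up search for dense units in all subspaces.
    A unit is a set of (dimension, bin) pairs; it is dense if it holds > tau of all points."""
    N, d = X.shape
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    bins = np.minimum(((X - lo) / span * n_bins).astype(int), n_bins - 1)

    def support(unit):
        mask = np.ones(N, dtype=bool)
        for dim, b in unit:
            mask &= bins[:, dim] == b
        return mask.sum() / N

    # Level 1: dense one-dimensional units
    level = {frozenset([(dim, b)]) for dim in range(d) for b in range(n_bins)
             if support([(dim, b)]) > tau}
    all_dense = set(level)
    k = 1
    while level:
        # Join two dense k-dimensional units that share k-1 pairs and cover k+1 distinct dimensions
        candidates = set()
        for u, v in combinations(level, 2):
            w = u | v
            if len(w) == k + 1 and len({dim for dim, _ in w}) == k + 1:
                candidates.add(w)
        # Monotone property: only candidates built from dense units can be dense; verify by counting
        level = {c for c in candidates if support(c) > tau}
        all_dense |= level
        k += 1
    return all_dense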
Clique Algorithm (figure)
Limitations of Clique
Time complexity is exponential in the number of dimensions
 – Especially if “too many” dense units are generated at the lower stages
May fail if clusters have widely differing densities, since the threshold is fixed
 – Determining the appropriate threshold and unit interval length can be challenging
Graph-Based Clustering: General Concepts
Graph-based clustering uses the proximity graph
 – Start with the proximity matrix
 – Consider each point as a node in a graph
 – Each edge between two nodes has a weight, which is the proximity between the two points
 – Initially the proximity graph is fully connected
 – MIN (single-link) and MAX (complete-link) can be viewed as starting with this graph
In the simplest case, clusters are connected components in the graph (a minimal code sketch follows)
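Below is a minimal sketch of that simplest case: sparsify the fully connected proximity graph to a k-nearest-neighbor graph and return connected components as clusters. This is purely illustrative; Chameleon itself goes further, as described next.

import numpy as np
from collections import deque

def knn_graph_clusters(X, k=5):
    """Sparsify the (fully connected) proximity graph to a k-nearest-neighbor graph,
    then return the connected components as cluster labels."""
    N = X.shape[0]
    # Full proximity matrix (here: squared Euclidean distances)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)
    # Keep only the edges to each point's k nearest neighbors (symmetrized)
    adj = [set() for _ in range(N)]
    for i in range(N):
        for j in np.argsort(d2[i])[:k]:
            adj[i].add(int(j))
            adj[int(j)].add(i)
    # Connected components via breadth-first search
    labels = np.full(N, -1)
    cur = 0
    for s in range(N):
        if labels[s] != -1:
            continue
        labels[s] = cur
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if labels[v] == -1:
                    labels[v] = cur
                    queue.append(v)
        cur += 1
    return labels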
Graph-Based Clustering: Chameleon
Based on several key ideas
– Sparsification of the proximity graph
– Partitioning the data into clusters that are relatively pure subclusters of the “true” clusters
– Merging based on preserving characteristics of clusters
Graph-Based Clustering: Sparsification
The amount of data that needs to be processed is drastically reduced, making the algorithm more scalable
 – Sparsification can eliminate more than 99% of the entries in a proximity matrix
 – The amount of time required to cluster the data is drastically reduced
 – The size of the problems that can be handled is increased
Graph-Based Clustering: Sparsification …
Clustering may work better
 – Sparsification techniques keep the connections to the most similar (nearest) neighbors of a point while breaking the connections to less similar points
 – The nearest neighbors of a point tend to belong to the same class as the point itself
 – This reduces the impact of noise and outliers and sharpens the distinction between clusters
Sparsification facilitates the use of graph partitioning algorithms (or algorithms based on graph partitioning)
 – Chameleon and hypergraph-based clustering
Sparsification in the Clustering Process
Limitations of Current Merging Schemes
Existing merging schemes in hierarchical clustering algorithms are static in nature
 – MIN or CURE: merge two clusters based on their closeness (or minimum distance)
 – GROUP-AVERAGE: merge two clusters based on their average connectivity
Limitations of Current Merging Schemes
Closeness schemes will merge (a) and (b)
(Figure: four example cluster configurations, (a)–(d))
Average connectivity schemes will merge (c) and (d)
Chameleon: Clustering Using Dynamic Modeling
Adapt to the characteristics of the data set to find the natural clusters
Use a dynamic model to measure the similarity between clusters
 – The main properties are the relative closeness and relative interconnectivity of the clusters (standard definitions sketched below)
 – Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters
 – The merging scheme preserves self-similarity
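For reference, a sketch of the standard definitions used by Chameleon (notation taken from the Chameleon paper by Karypis, Han, and Kumar, not from these slides): EC(C_i, C_j) is the sum of the weights of the edges connecting the two clusters, EC(C_i) the weight of the min-cut bisector of C_i, and \bar{S} denotes the average weight of the corresponding edges.

  RI(C_i, C_j) = \frac{|EC(C_i, C_j)|}{\tfrac{1}{2}\left(|EC(C_i)| + |EC(C_j)|\right)}

  RC(C_i, C_j) = \frac{\bar{S}_{EC(C_i, C_j)}}{\frac{|C_i|}{|C_i| + |C_j|}\,\bar{S}_{EC(C_i)} + \frac{|C_j|}{|C_i| + |C_j|}\,\bar{S}_{EC(C_j)}}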
Experimental Results: CHAMELEON (figures)
CURE: a Scalable Algorithm
Agglomerative hierarchical clustering algorithms vary in terms of how the proximity of two clusters is computed
 – MIN (single link): susceptible to noise and outliers
 – MAX (complete link) / GROUP AVERAGE / centroid: may not work well with non-globular clusters
The CURE (Clustering Using REpresentatives) algorithm tries to handle both problems
 – It is a graph-based algorithm
 – Starts with a proximity matrix / proximity graph
CURE Algorithm
Represents a cluster using multiple representative points
 – Goals: scalability, and choosing points that capture the geometry and shape of clusters
 – Representative points are found by selecting a constant number of points from a cluster
   The first representative point is chosen to be the point farthest from the center of the cluster
   Remaining representative points are chosen so that they are farthest from all previously chosen points
CURE Algorithm
“Shrink” representative points toward the center of the cluster by a factor α
 – Shrinking the representative points toward the center helps avoid problems with noise and outliers
Cluster similarity is the similarity of the closest pair of representative points (MIN) from different clusters
CURE Algorithm
Uses an agglomerative hierarchical scheme to perform the clustering
 – α = 0: similar to centroid-based clustering
 – α = 1: somewhat similar to single-link (MIN)
CURE is better able to handle clusters of arbitrary shapes and sizes (a code sketch of representative-point selection and shrinking follows)
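Below is a minimal sketch of how one cluster's representative points could be selected and shrunk, following the description above. The function name, the number of representatives, and the default alpha are illustrative, and the shrinking convention matches these slides (α = 0 collapses the representatives to the center, α = 1 leaves them unchanged).

import numpy as np

def cure_representatives(points, n_rep=10, alpha=0.2):
    """Pick n_rep well-scattered representative points for one cluster, then
    shrink them toward the cluster center."""
    center = points.mean(axis=0)
    # First representative: the point farthest from the cluster center
    reps = [points[np.argmax(((points - center) ** 2).sum(axis=1))]]
    # Remaining representatives: each farthest from all previously chosen ones (max-min distance)
    while len(reps) < min(n_rep, len(points)):
        d = np.min([((points - r) ** 2).sum(axis=1) for r in reps], axis=0)
        reps.append(points[np.argmax(d)])
    reps = np.array(reps)
    # Shrink toward the center, keeping a fraction alpha of the offset
    return center + alpha * (reps - center)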
Experimental Results: CURE (figures: 10 clusters; 9 clusters; comparison of CURE with centroid-based and single-link clustering)
Pictures from CURE, Guha, Rastogi, Shim
CURE Cannot Handle Differing Densities
(Figure: original points vs. the CURE result)
BIRCH: a Scalable Algorithm
Balanced Iterative Reducing and Clustering using Hierarchies
Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
Weaknesses: handles only numeric data (Euclidean space) and is sensitive to the order of the data records
BIRCH
Clustering Feature (CF): CF = (N, \vec{LS}, \vec{SS})
 – N: number of points; \vec{LS}: linear sum of the points; \vec{SS}: sum of squares of the points
 – The CF is incrementally updated and is used to compute centroids and the variance (used for measuring the diameter of the cluster)
 – Also used for computing distances between clusters
BIRCH
The CF is a compact summary of the data on the points in a cluster
It has enough information to calculate the intra-cluster distances
The additivity theorem allows us to merge sub-clusters:
 – if C3 = C1 ∪ C2, then CF_{C3} = (N_{C1} + N_{C2}, \vec{LS}_{C1} + \vec{LS}_{C2}, \vec{SS}_{C1} + \vec{SS}_{C2})
(a minimal CF sketch in code follows)
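Below is a minimal sketch of a clustering feature with the additivity-based merge and the centroid/radius computations it supports. The class and method names are illustrative, and SS is kept per dimension, matching the notation above.

import numpy as np

class CF:
    """Clustering Feature: (N, linear sum LS, per-dimension sum of squares SS)."""
    def __init__(self, point):
        self.n = 1
        self.ls = np.array(point, dtype=float)
        self.ss = self.ls ** 2

    def add(self, other):
        """Additivity theorem: the CF of the union is the component-wise sum."""
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        """Root-mean-square distance of the cluster's points from its centroid."""
        c = self.centroid()
        return float(np.sqrt(max(self.ss.sum() / self.n - (c ** 2).sum(), 0.0)))

    def centroid_distance(self, other):
        """Distance between two sub-clusters, computed from their CFs alone."""
        return float(np.linalg.norm(self.centroid() - other.centroid()))

Building a CF from each point and calling add() to absorb a sub-cluster is how a CF-tree entry could be maintained incrementally.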
BIRCH
Basic steps of BIRCH
– Load the data into memory by creating a CF tree that “summarizes” the data (see the following slide)
– Perform global clustering. Produces a better clustering than the initial step. An agglomerative, hierarchical technique was selected.
– Redistribute the data points using the centroids of clusters discovered in the global clustering phase, and thus, discover a new (and hopefully better) set of clusters.
BIRCH
BIRCH maintains a balanced CF tree
 – Branching factor B: the maximum number of entries in a non-leaf node
 – Maximum leaf size L: the maximum number of entries in a leaf node
 – Threshold T: the diameter of each leaf entry must be < T
(Figure: a CF tree with a root, non-leaf nodes, and leaf nodes; each entry is a CF with a pointer to a child node, and leaves are chained by prev/next pointers; in the example, B = 7 and L = 6)
Characteristics of Data
 – High dimensionality
 – Size of the data set
 – Sparsity of attribute values
 – Noise and outliers
 – Types of attributes and types of data sets
 – Differences in attribute scale
 – Properties of the data space
   Can you define a meaningful centroid?
Characteristics of Clusters
 – Data distribution
 – Shape
 – Differing sizes
 – Differing densities
 – Poor separation
 – Relationships among clusters
 – Subspace clusters
Characteristics of Clustering Algorithms
 – Order dependence
 – Non-determinism
 – Parameter selection
 – Scalability
 – Underlying model
 – Optimization-based approach