Data Mining Cluster Analysis: Basic Concepts and Algorithms Part 2
Introduction to Data Mining by Tan, Steinbach, Kumar
© Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004
Clustering Algorithms
Hierarchical clustering
K-means
Density-based clustering
The K-Means Clustering Method
Given k, the k-means algorithm is implemented in 4
steps:
– Partition objects into k nonempty subsets
– Compute seed points as the centroids of the
clusters of the current partition. The centroid is the
center (mean point) of the cluster.
– Assign each object to the cluster with the nearest
seed point.
– Go back to Step 2 (recompute the centroids); stop when no
object changes cluster.
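The four steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names, the `seed` and `max_iter` parameters, the round-robin way of building the initial partition, and the rule of keeping the old centroid when a cluster becomes empty are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Mean point of a nonempty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: partition the objects into k nonempty subsets
    # (shuffle the indices, then deal them out round-robin).
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for position, index in enumerate(order):
        labels[index] = position % k

    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: compute the centroid of each cluster of the current partition.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # assumption: an emptied cluster keeps its old centroid
                centroids[c] = centroid(members)
        # Step 3: assign each object to the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop when no object changes cluster, otherwise repeat.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

On this toy data the two groups are far apart, so the first three points end up in one cluster and the last three in the other; which cluster gets label 0 depends on the random partition.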
http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
The K-Means Clustering Method
Example
[Figure: three scatter-plot panels, both axes running from 0 to 10]
K-means Clustering – Details
Initial centroids are often chosen randomly.
– Clusters produced vary from one run to another.
The centroid is (typically) the mean of the points in the cluster.
‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc.
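The three closeness measures named above can be written out for two small vectors. This is a sketch using only the standard library; the function names and the sample vectors are chosen for illustration, and correlation is taken to mean the Pearson correlation coefficient.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance; smaller means closer."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_similarity(x, y):
    """Cosine of the angle between x and y; larger means closer."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def correlation(x, y):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine_similarity(
        [a - mean_x for a in x],
        [b - mean_y for b in y],
    )


x = (1.0, 2.0, 3.0)
y = (2.0, 4.0, 6.0)
print(euclidean_distance(x, y))
print(cosine_similarity(x, y))
print(correlation(x, y))
```

Here `y` is `x` scaled by 2, so the Euclidean distance is nonzero while cosine similarity and correlation both come out at (approximately) 1. The choice of measure changes which points count as close.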
Importance of Choosing Initial Centroids
[Figure: six scatter-plot panels, Iteration 1 through Iteration 6; x axis from -2 to 2, y axis from 0 to 3]
Importance of Choosing Initial Centroids
[Figure: six scatter-plot panels, Iteration 1 through Iteration 6; x axis from -2 to 2, y axis from 0 to 3]
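The effect of the starting centroids can be reproduced on a tiny data set. The sketch below runs the assign/recompute loop from two different sets of initial centroids on the four corners of a wide rectangle. The data, the function names, and the use of the sum of squared distances to the assigned centroid (SSE) as a quality score are assumptions made for this illustration.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def run_kmeans(points, initial_centroids, max_iter=100):
    """Iterate assignment and centroid update from the given starting centroids."""
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    sse = sum(squared_distance(p, centroids[label])
              for p, label in zip(points, labels))
    return labels, centroids, sse


# Four corners of a 4-by-1 rectangle.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Start with one centroid on each short side: left/right split.
good_labels, _, good_sse = run_kmeans(points, [(0, 0.5), (4, 0.5)])
# Start with one centroid on each long side: bottom/top split.
bad_labels, _, bad_sse = run_kmeans(points, [(2, 0), (2, 1)])

print(good_labels, good_sse)  # [0, 0, 1, 1] 1.0
print(bad_labels, bad_sse)    # [0, 1, 0, 1] 16.0
```

Both runs converge immediately, but the second one settles on a much worse clustering. This is why results are often compared across several random initializations.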
Limitations of K-means
K-means has problems when clusters have
– Differing sizes
– Differing densities
– Non-globular shapes
K-means has problems when the data contains
outliers.
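The outlier problem follows directly from the centroid being a mean. A small sketch, with made-up points, shows how far a single distant point pulls the centroid of an otherwise tight cluster:

```python
def centroid(points):
    """Mean point of a nonempty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
outlier = (50.0, 50.0)

clean_centroid = centroid(cluster)
shifted_centroid = centroid(cluster + [outlier])

print(clean_centroid)    # centroid of the four tight points
print(shifted_centroid)  # centroid once the outlier is included
```

Without the outlier the centroid sits at (1.5, 1.5), inside the cluster. With it, the centroid moves to roughly (11.2, 11.2), far from every point it is supposed to represent.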
Limitations of K-means: Differing Sizes
[Figure: two panels, "Original Points" and "K-means (3 Clusters)"]
Limitations of K-means: Differing Density
[Figure: two panels, "Original Points" and "K-means (3 Clusters)"]
Limitations of K-means: Non-globular Shapes
[Figure: two panels, "Original Points" and "K-means (2 Clusters)"]
Overcoming K-means Limitations
[Figure: two panels, "Original Points" and "K-means Clusters"]
One solution is to use many clusters.
K-means then finds parts of the natural clusters, which must be put together afterwards.
Overcoming K-means Limitations
[Figure: two panels, "Original Points" and "K-means Clusters"]
Overcoming K-means Limitations
[Figure: two panels, "Original Points" and "K-means Clusters"]
DBSCAN
DBSCAN is a density-based algorithm.
– Density = number of points within a specified radius (Eps)
– A point is a core point if it has more than a specified number
of points (MinPts) within Eps
These are points in the interior of a cluster
– A border point has fewer than MinPts points within Eps, but is in the neighborhood of a core point
– A noise point is any point that is neither a core point nor a border point.
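The three point types can be computed directly from these definitions. The sketch below follows the slide's wording: a point is core if it has more than MinPts points within Eps. Counting the point itself among its neighbors is an assumption of this example; many implementations instead use "at least MinPts". The function name and sample data are illustrative.

```python
import math


def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise'."""
    # Eps-neighborhood of every point (indices; the point itself is included).
    neighborhoods = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    # Core: more than min_pts points within eps (slide's definition).
    is_core = [len(nb) > min_pts for nb in neighborhoods]

    kinds = []
    for i, nb in enumerate(neighborhoods):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in nb):
            kinds.append("border")  # not core, but within eps of a core point
        else:
            kinds.append("noise")   # neither core nor border
    return kinds


points = [
    (0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),  # dense group
    (2, 1),                                      # on the edge of the group
    (10, 10),                                    # isolated
]
kinds = classify_points(points, eps=1.1, min_pts=3)
print(kinds)
```

With these values the five points of the dense group are core points, `(2, 1)` is a border point because it lies within Eps of the core point `(1, 1)`, and `(10, 10)` is noise.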
DBSCAN: Density Based Spatial Clustering of Applications with Noise
Relies on a density-based notion of cluster: A cluster is
defined as a maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases
with noise
[Figure: core, border, and outlier points, with Eps = 1cm and MinPts = 5]
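One way to turn "maximal set of density-connected points" into code is to grow each cluster outward from an unvisited core point, expanding only through core points. The sketch below is a simplified, quadratic-time illustration and not the published algorithm verbatim. The names, the `NOISE = -1` convention, the "more than MinPts, counting the point itself" core test, and the sample data are assumptions.

```python
import math
from collections import deque

NOISE = -1  # label for points that belong to no cluster


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks outliers."""
    n = len(points)
    neighborhoods = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) > min_pts for nb in neighborhoods]

    labels = [NOISE] * n
    cluster_id = 0
    for start in range(n):
        # Only an unlabeled core point can seed a new cluster.
        if not is_core[start] or labels[start] != NOISE:
            continue
        labels[start] = cluster_id
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in neighborhoods[i]:
                if labels[j] == NOISE:
                    labels[j] = cluster_id
                    # Border points join the cluster but do not extend it.
                    if is_core[j]:
                        queue.append(j)
        cluster_id += 1
    return labels


points = [
    (0, 0), (0, 1), (1, 0), (1, 1),          # first dense group
    (10, 10), (10, 11), (11, 10), (11, 11),  # second dense group
    (5, 5),                                  # isolated point
]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The number of clusters is not given in advance: it falls out of Eps and MinPts, and the isolated point is reported as noise rather than forced into a cluster.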
DBSCAN: Core, Border and Noise Points
[Figure: two panels, "Original Points" and "Point types: core, border and noise"; Eps = 10, MinPts = 4]
When DBSCAN Works Well
[Figure: two panels, "Original Points" and "Clusters"]
• Resistant to Noise
• Can handle clusters of different shapes and sizes