
Data Mining Cluster Analysis: Basic Concepts and Algorithms, Part 2

Introduction to Data Mining

by Tan, Steinbach, Kumar

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1


Clustering Algorithms

Hierarchical clustering

K-means

Density-based clustering


The K-Means Clustering Method

Given k, the k-means algorithm is implemented in 4 steps:

– Partition the objects into k nonempty subsets.

– Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.

– Assign each object to the cluster with the nearest seed point.

– Go back to Step 2; stop when no assignments change.
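The four steps can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, the shuffled round-robin initial partition, and the rule that an emptied cluster keeps its previous centroid are all assumptions made for this example.

```python
import numpy as np

def kmeans(points, k, seed=0, max_iter=100):
    """Basic k-means following the four steps (illustrative sketch)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    n, d = points.shape
    # Step 1: partition the objects into k nonempty subsets.
    # A shuffled round-robin labelling is nonempty for every cluster when n >= k.
    labels = rng.permutation(n) % k
    centroids = np.zeros((k, d))
    for _ in range(max_iter):
        # Step 2: compute the centroid (mean point) of each current cluster.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # assumption: an emptied cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
        # Step 3: assign each object to the cluster with the nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop when no assignment changes, otherwise repeat from Step 2.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids

# Two well-separated groups of three points each.
pts = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centroids = kmeans(pts, k=2)
```

On this small data set the first three points end up in one cluster and the last three in the other, whichever initial partition is drawn.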

http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html



The K-Means Clustering Method

Example

[Figure: three scatter-plot panels, each with both axes running from 0 to 10.]


K-means Clustering – Details

Initial centroids are often chosen randomly.

– Clusters produced vary from one run to another.

The centroid is (typically) the mean of the points in the cluster.

‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc.
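To make the choice of closeness measure concrete, here is a small sketch of the three measures named above. The helper names are hypothetical. Note that Euclidean distance is a distance (smaller means closer), while cosine similarity and correlation are similarities (larger means closer).

```python
import numpy as np

def euclidean_distance(a, b):
    # Length of the difference vector.
    return float(np.linalg.norm(a - b))

def cosine_similarity(a, b):
    # Cosine of the angle between the two vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def correlation(a, b):
    # Pearson correlation between the two vectors' components.
    return float(np.corrcoef(a, b)[0, 1])

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the length

print(euclidean_distance(a, b))  # sqrt(14), about 3.74
print(cosine_similarity(a, b))   # 1 up to rounding: same direction
print(correlation(a, b))         # 1 up to rounding: perfectly linearly related
```

The two vectors are far apart by Euclidean distance yet maximally similar by cosine and correlation, so the measure chosen can change which centroid counts as "nearest".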


Importance of Choosing Initial Centroids

[Figure: six scatter-plot panels, Iteration 1 through Iteration 6; x axis from -2 to 2, y axis from 0 to 3.]


Importance of Choosing Initial Centroids

[Figure: six scatter-plot panels, Iteration 1 through Iteration 6; x axis from -2 to 2, y axis from 0 to 3.]
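A tiny numeric illustration of why the initial centroids matter. The four-point data set, the function name `kmeans_from`, and the use of the sum of squared errors (SSE, the squared distance from each point to its assigned centroid) as a quality measure are assumptions made for this sketch; they are not taken from the figures.

```python
import numpy as np

def kmeans_from(points, centroids, max_iter=100):
    """Run k-means from given initial centroids; return labels, centroids, SSE."""
    centroids = np.array(centroids, dtype=float)
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no assignment changed
        labels = new_labels
        # Recompute each centroid as the mean of its members.
        for j in range(len(centroids)):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    sse = float(((points - centroids[labels]) ** 2).sum())
    return labels, centroids, sse

# Corners of a wide rectangle: the natural clusters are the left and right pairs.
points = np.array([[0, 0], [0, 1], [4, 0], [4, 1]], dtype=float)

_, _, sse_good = kmeans_from(points, [[0, 0.5], [4, 0.5]])  # one seed per side
_, _, sse_bad = kmeans_from(points, [[2, 0], [2, 1]])       # seeds stacked in the middle
print(sse_good, sse_bad)  # 1.0 16.0
```

Both runs converge, but the second one stops at a top/bottom split with an SSE sixteen times larger: k-means only finds a local optimum that depends on where it starts.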


Limitations of K-means

K-means has problems when clusters have

– Differing sizes

– Differing densities

– Non-globular shapes

K-means also has problems when the data contains outliers.
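The outlier problem follows directly from the centroid being a mean. A minimal sketch with made-up numbers:

```python
import numpy as np

# A compact cluster of four points (illustrative values).
cluster = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [2.0, 2.0]])
# A single far-away outlier that ends up assigned to the same cluster.
outlier = np.array([[50.0, 50.0]])

centroid_clean = cluster.mean(axis=0)
centroid_with_outlier = np.vstack([cluster, outlier]).mean(axis=0)
shift = float(np.linalg.norm(centroid_with_outlier - centroid_clean))

print(centroid_clean)         # [1.5 1.5]
print(centroid_with_outlier)  # [11.2 11.2]
```

One outlier moves the centroid well outside the cluster it is supposed to represent, which in turn distorts the assignments in the next iteration.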


Limitations of K-means: Differing Sizes

[Figure: two panels, "Original Points" and "K-means (3 Clusters)".]


Limitations of K-means: Differing Density



Limitations of K-means: Non-globular Shapes



Overcoming K-means Limitations

[Figure: two panels, "Original Points" and "K-means Clusters".]

One solution is to use many clusters.

K-means then finds parts of the natural clusters, which need to be put together afterwards.








DBSCAN

DBSCAN is a density-based algorithm.

– Density = number of points within a specified radius (Eps)

– A point is a core point if it has more than a specified number of points (MinPts) within Eps. These are points in the interior of a cluster.

– A border point has fewer than MinPts points within Eps, but is in the neighborhood of a core point.

– A noise point is any point that is neither a core point nor a border point.
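The three point types can be computed directly from these definitions. The sketch below is illustrative: the function name `classify_points` is hypothetical, and it assumes that a point counts itself among the points within Eps and that "core" means strictly more than MinPts such points, as worded above. Other formulations use "at least MinPts".

```python
import numpy as np

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' (illustrative sketch)."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    within_eps = dists <= eps  # assumption: a point is inside its own neighbourhood
    # Core: more than min_pts points within eps.
    is_core = within_eps.sum(axis=1) > min_pts
    # Border: not core, but within eps of at least one core point.
    near_core = (within_eps & is_core[None, :]).any(axis=1)
    is_border = ~is_core & near_core
    # Noise: everything else.
    return np.where(is_core, "core", np.where(is_border, "border", "noise"))

# Points along a line: a dense run, a straggler at x=6, and an isolated point at x=20.
points = np.array([[0, 0], [1, 0], [2, 0], [3, 0], [4, 0], [6, 0], [20, 0]], dtype=float)
kinds = classify_points(points, eps=2.5, min_pts=3)
print(kinds.tolist())
# ['border', 'core', 'core', 'core', 'core', 'border', 'noise']
```

The end points of the dense run have too few neighbours to be core but sit next to core points, so they are border points; the isolated point is noise.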


DBSCAN: Density Based Spatial Clustering of Applications with Noise

Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points.

Discovers clusters of arbitrary shape in spatial databases with noise.
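One way to read "maximal set of density-connected points" operationally: start from an unassigned core point and keep absorbing everything within Eps of any core point already in the cluster. The sketch below is a simplified illustration, not the published algorithm verbatim. The function name `dbscan`, the use of -1 for noise, and the "strictly more than MinPts, counting the point itself" core rule are assumptions of this example.

```python
import numpy as np
from collections import deque

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    n = len(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neighbours = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) > min_pts for nb in neighbours])
    labels = np.full(n, -1)
    cluster_id = 0
    for start in range(n):
        if not is_core[start] or labels[start] != -1:
            continue
        # Grow a new cluster outward from an unassigned core point.
        labels[start] = cluster_id
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in neighbours[i]:
                if labels[j] == -1:
                    labels[j] = cluster_id
                    if is_core[j]:
                        queue.append(j)  # only core points extend the cluster
        cluster_id += 1
    return labels

points = np.array(
    [[0, 0], [0, 1], [1, 0], [1, 1],                # a compact square
     [10, 0], [11, 0], [12, 0], [13, 0], [14, 0],   # an elongated chain
     [50, 50]],                                     # an isolated point
    dtype=float,
)
labels = dbscan(points, eps=1.5, min_pts=2)
print(labels.tolist())  # [0, 0, 0, 0, 1, 1, 1, 1, 1, -1]
```

The chain is recovered as a single cluster even though its end points are far apart, because each point is linked to the next through a core point; the isolated point is left as noise.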

[Figure: points labeled Core, Border, and Outlier; Eps = 1 cm, MinPts = 5.]


DBSCAN: Core, Border and Noise Points

[Figure: two panels, "Original Points" and "Point types: core, border and noise"; Eps = 10, MinPts = 4.]


When DBSCAN Works Well

[Figure: two panels, "Original Points" and "Clusters".]

• Resistant to Noise

• Can handle clusters of different shapes and sizes
