Data Mining

Cluster Analysis: Basic Concepts and Algorithms, Part 2

Introduction to Data Mining

by Tan, Steinbach, Kumar

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

Clustering Algorithms

Hierarchical clustering

K-means

Density-based clustering

The K-Means Clustering Method

Given k, the k-means algorithm proceeds in four steps:

– Partition the objects into k nonempty subsets.

– Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.

– Assign each object to the cluster with the nearest seed point.

– Go back to Step 2; stop when no assignment changes.
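
The four steps can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the function name `kmeans`, the round-robin initial partition, the `max_iter` cap, and the rule that an emptied cluster keeps its previous centroid are all assumptions made for this sketch. It also assumes at least k points and uses Euclidean distance.

```python
import math

def kmeans(points, k, max_iter=100):
    """Sketch of the four-step k-means procedure for points given as tuples."""
    # Step 1: partition the objects into k nonempty subsets.
    # Round-robin assignment is an arbitrary choice made for this sketch.
    labels = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: compute each centroid as the mean point of its cluster.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # assumption: an emptied cluster keeps its old centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
        # Step 3: assign each object to the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop when no assignment changed; otherwise repeat from Step 2.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
labels, centroids = kmeans(points, 2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

On this toy data the arbitrary starting partition mixes the two groups, and one reassignment pass is enough to separate them.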

http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html

The K-Means Clustering Method

Example

[Figure: three scatter-plot panels, each with both axes running from 0 to 10]

K-means Clustering – Details

Initial centroids are often chosen randomly.

– Clusters produced vary from one run to another.

The centroid is (typically) the mean of the points in the cluster.

‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc.
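
The three closeness measures named above can be written out as follows. This is a small sketch for vectors given as equal-length sequences of numbers; the function names are illustrative, and "correlation" is assumed here to mean Pearson correlation.

```python
import math

def euclidean(a, b):
    """Euclidean distance: smaller means closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: larger means closer.
    Undefined (division by zero) if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def correlation(a, b):
    """Pearson correlation, computed as the cosine similarity
    of the mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])

print(euclidean((0, 0), (3, 4)))  # 5.0
```

Note that Euclidean distance is a dissimilarity while the other two are similarities, so "nearest" means the minimum of the first and the maximum of the others.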

Importance of Choosing Initial Centroids

[Figure: six scatter-plot panels, Iteration 1 to Iteration 6; x axis from -2 to 2, y axis from 0 to 3]

Importance of Choosing Initial Centroids

[Figure: six scatter-plot panels, Iteration 1 to Iteration 6; x axis from -2 to 2, y axis from 0 to 3]
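
The effect of the initial centroids can be reproduced on a toy data set. The sketch below is an illustration under stated assumptions, not the data behind the slides: the four points, the two hand-picked initializations, and the helper names `kmeans_from` and `sse` are hypothetical. The sum of squared distances to the assigned centroid (SSE) is used here as an assumed measure of clustering quality.

```python
import math

def kmeans_from(points, centroids, max_iter=100):
    """Run k-means from the given initial centroids; return (labels, centroids)."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid (Euclidean distance).
        new_labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # no assignment changed: converged
            break
        labels = new_labels
        # Recompute each centroid as the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

def sse(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))

# Four corners of a wide rectangle; the natural clusters are left and right.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]
good = kmeans_from(points, [(0, 0.5), (4, 0.5)])  # starts near the natural clusters
poor = kmeans_from(points, [(2, 0), (2, 1)])      # starts between them
print(good[0], poor[0])                           # [0, 0, 1, 1] [0, 1, 0, 1]
print(sse(points, *good), sse(points, *poor))     # 1.0 16.0
```

Both runs converge, but the second stops at a top/bottom split with a much larger SSE, which is why initial centroids matter.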

Limitations of K-means

K-means has problems when clusters have

– Differing sizes

– Differing densities

– Non-globular shapes

K-means also has problems when the data contains outliers.
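
One way to see the outlier problem: the centroid is a mean, so a single distant point drags it away from the bulk of the cluster. The numbers below are made up for illustration, and `centroid` is a hypothetical helper.

```python
def centroid(points):
    """Mean point of a list of equal-length coordinate tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

cluster = [(1, 1), (2, 1), (1, 2), (2, 2)]
print(centroid(cluster))               # (1.5, 1.5)
print(centroid(cluster + [(20, 20)]))  # (5.2, 5.2)
```

With the outlier included, the centroid lies outside the square spanned by the four original points.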

Limitations of K-means: Differing Sizes

[Figure: two panels, Original Points and K-means (3 Clusters)]

Limitations of K-means: Differing Density

[Figure: two panels, Original Points and K-means (3 Clusters)]

Limitations of K-means: Non-globular Shapes

[Figure: two panels, Original Points and K-means (2 Clusters)]

Overcoming K-means Limitations

[Figure: two panels, Original Points and K-means Clusters]

One solution is to use many clusters.

K-means then finds parts of the natural clusters, which must be put together afterwards.

Overcoming K-means Limitations

[Figure: two panels, Original Points and K-means Clusters]

Overcoming K-means Limitations

[Figure: two panels, Original Points and K-means Clusters]

DBSCAN

DBSCAN is a density-based algorithm.

– Density = number of points within a specified radius (Eps)

– A point is a core point if it has at least a specified number of points (MinPts) within Eps. These are points in the interior of a cluster.

– A border point has fewer than MinPts points within Eps, but is in the neighborhood of a core point.

– A noise point is any point that is neither a core point nor a border point.
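
The three point types can be computed directly from these definitions. The sketch below makes two assumptions that the slide does not state: a point counts itself among the points within Eps, and a point with exactly MinPts points within Eps is treated as core. The function name `classify_points` and the sample data are illustrative.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise'."""
    # Indices of all points within eps of each point (assumption: includes itself).
    neighbors = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    # Assumption: exactly min_pts points within eps is enough to be core.
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighbors):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            labels.append("border")  # not dense itself, but near a core point
        else:
            labels.append("noise")
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
print(classify_points(points, eps=1.1, min_pts=3))
# ['core', 'core', 'core', 'core', 'border', 'noise']
```

The brute-force neighbor search is quadratic in the number of points, which is fine for a small illustration.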

DBSCAN: Density Based Spatial Clustering of Applications with Noise

Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points.

Discovers clusters of arbitrary shape in spatial databases with noise.

[Figure: core, border, and outlier points, with Eps = 1 cm and MinPts = 5]
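
A cluster as a maximal set of density-connected points can be sketched by growing each cluster outward from an unassigned core point. This is a minimal brute-force illustration, not the original implementation: the function name `dbscan`, the label -1 for noise, counting a point in its own neighborhood, and assigning a border point to the first cluster that reaches it are all assumptions of this sketch.

```python
import math
from collections import deque

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    # Eps-neighborhoods by brute force (assumption: a point counts itself).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Grow a new cluster from this unassigned core point.
        labels[i] = cluster
        queue = deque([i])
        while queue:
            j = queue.popleft()
            for m in neighbors[j]:
                if labels[m] == -1:
                    labels[m] = cluster
                    if is_core[m]:
                        queue.append(m)  # only core points extend the cluster
        cluster += 1
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (5, 5), (5, 6), (6, 5), (6, 6),
          (20, 20)]
print(dbscan(points, eps=1.1, min_pts=3))
# [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Because clusters spread only through core points, a border point joins a cluster but never pulls in further points, and anything left unreached stays labeled as noise.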

DBSCAN: Core, Border and Noise Points

[Figure: two panels, Original Points and Point types (core, border, and noise); Eps = 10, MinPts = 4]

When DBSCAN Works Well

[Figure: two panels, Original Points and Clusters]

• Resistant to Noise

• Can handle clusters of different shapes and sizes