Transcript
Page 1:

Clustering

COSC 526 Class 12

Arvind Ramanathan, Computational Science & Engineering Division, Oak Ridge National Laboratory, Oak Ridge. Ph: 865-576-7266. E-mail: [email protected]

Slides inspired by: Andrew Moore (CMU), Andrew Ng (Stanford), Tan, Steinbach and Kumar

Page 2:

Assignment 1: Your first hand at random walks

• Write-up is here…

• Pair up and do the assignment
– it helps to work in small teams
– maximize your productivity

• Most of the assignment and its notes are in the handouts (class web-page)

Page 3:

Clustering: Basics…

Page 4:

Clustering

• Finding groups of items (or objects) such that items within a group are related to one another and different from items in other groups

Inter-cluster distances are maximized

Intra-cluster distances are minimized

Page 5:

Applications

• Grouping regions together based on precipitation

• Grouping genes together based on expression patterns in cells

• Finding ensembles of folded/unfolded protein structures

Page 6:

What is not clustering?

• Supervised classification:
– uses class label information

• Simple segmentation:
– dividing students into different registration groups (alphabetically, by major, etc.)

• Results of a query:
– grouping is the result of an external specification

• Graph partitioning:
– some mutual relevance, but the areas are not identical…

Take Home Message:
• Clustering of data is essentially driven by the data at hand!!
• Meaning or interpretation of the clusters should be driven by the data!!

Page 7:

Constitution of a cluster can be ambiguous

• How to decide between 8 clusters and 2 clusters?

Page 8:

Types of Clustering

• Partitional Clustering:
– a division of the data into non-overlapping subsets (clusters) such that each data point is in exactly one subset

• Hierarchical Clustering:
– a set of nested clusters organized as a hierarchical tree

[Figure: a partitional clustering of points p1–p6, and the corresponding hierarchical tree over p1–p6]

Page 9:

Other types of distinctions…

• Exclusive vs. Non-exclusive:
– in non-exclusive clustering, points may belong to multiple clusters

• Fuzzy vs. Non-fuzzy:
– in fuzzy clustering, a point belongs to every cluster with a weight between 0 and 1
– similar to probabilistic clustering

• Partial vs. Complete:
– we may want to cluster only some of the data

• Heterogeneous vs. Homogeneous:
– clusters of widely different sizes…

Page 10:

Types of Clusters

• Well-separated clusters

• Center-based clusters

• Contiguous clusters

• Density based clusters

• Property/conceptual

• Described by an objective function

Well-separated: a set of points such that any point in the cluster is closer to every other point in the cluster than to any point not in the cluster

Page 11:

Types of Clusters

• Well-separated clusters

• Center-based clusters

• Contiguous clusters

• Density based clusters

• Property/conceptual

• Described by an objective function

Center-based: a cluster is a set of objects such that each object is closer to the center of its own cluster (called the centroid) than to the center of any other cluster…

Page 12:

Types of Clusters

• Well-separated clusters

• Center-based clusters

• Contiguous clusters

• Density based clusters

• Property/conceptual

• Described by an objective function

Contiguity-based: nearest neighbor or transitive…

Page 13:

Types of Clusters

• Well-separated clusters

• Center-based clusters

• Contiguous clusters

• Density based clusters

• Property/conceptual

• Described by an objective function

Density-based: a cluster is a dense region of points separated by low-density regions. Used when clusters are irregular and when noise/outliers are present.

Page 14:

Types of Clusters

• Well-separated clusters

• Center-based clusters

• Contiguous clusters

• Density based clusters

• Property/conceptual

• Described by an objective function

Property/conceptual: find clusters that share a common property or representation, e.g., taste, smell, …

Page 15:

Types of Clusters

• Well-separated clusters

• Center-based clusters

• Contiguous clusters

• Density based clusters

• Property/conceptual

• Described by an objective function

• Find clusters that minimize or maximize an objective function

• Enumerate all possible ways of dividing the points into clusters and evaluate the goodness of each potential set of clusters with the objective function
– this is an NP-hard problem

• Global vs. local objectives:
– hierarchical clustering algorithms typically have local objectives
– partitional algorithms typically have global objectives

Page 16:

More on objective functions… (1)

• Objective functions tend to map the clustering problem to a different domain and solve a related problem:
– e.g., define a proximity matrix as a weighted graph
– clustering is then equivalent to breaking the graph into connected components
– minimize the edge weight between clusters and maximize the edge weight within clusters

Page 17:

More on objective functions… (2)

• Best clustering usually minimizes/maximizes an objective function

• Mixture models assume that the data is a mixture of a number of parametric statistical distributions (e.g., Gaussians)

Page 18:

Characteristics of input data

• Type of proximity or density measure:
– a derived measure, central to clustering

• Sparseness:
– dictates the type of similarity
– adds to efficiency

• Attribute type:
– dictates the type of similarity

• Type of data:
– dictates the type of similarity

• Dimensionality

• Noise and outliers

• Type of distribution

Page 19:

Clustering Algorithms:
• K-means Clustering
• Hierarchical Clustering
• Density-based Clustering

Page 20:

K-means Clustering

• Partitional clustering:
– each cluster is associated with a centroid
– each point is assigned to the cluster with the closest centroid
– we need to specify the total number of clusters, K, as one of the inputs

• Simple algorithm (a NumPy sketch follows):

K-means Algorithm
1: Select K points as the initial centroids
2: repeat
3:   Form K clusters by assigning each point to its closest centroid
4:   Recompute the centroid of each cluster
5: until the centroids do not change
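The loop above maps directly onto a few lines of NumPy. This is a minimal sketch of Lloyd's algorithm, not the course's reference implementation; the function name kmeans, the iteration cap, and the empty-cluster handling are my own choices:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1: select K points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # 3: form K clusters by assigning each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 4: recompute each centroid as the mean of its points
        #    (keep the old centroid if a cluster became empty)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        # 5: stop when the centroids no longer change
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels
```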

Page 21:

K-means Clustering

• Initial centroids are chosen randomly:
– the resulting clusters can vary depending on how you started

• The centroid is the mean of the points in the cluster

• "Closeness" is usually measured with Euclidean distance

• K-means will typically converge quickly
– points stop changing assignments
– another stopping criterion: only a few points change clusters

• Time complexity: O(nKId)
– n: number of points; K: number of clusters
– I: number of iterations; d: number of attributes

Page 22:

K-means example

Page 23:

How to initialize (seed) K-means?

• If there are K "real" clusters, then the chance of selecting one initial centroid from each cluster is small
– the chance shrinks quickly as K grows
– if the clusters all have the same size m, the chance is P = (K! · m^K)/(Km)^K = K!/K^K
– for K = 10, P = 10!/10^10 ≈ 0.00036 (really small!!)

• The choice of initial centroids can have a deep impact on how the clusters are determined…
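To check the slide's number, the probability K!/K^K can be evaluated directly; a quick sketch:

```python
from math import factorial

# P = K!/K^K: of the K^K equally likely ways K random seeds can fall into
# K equal-sized clusters, only the K! permutations hit each cluster once.
for k in (5, 10):
    print(k, factorial(k) / k ** k)
# k = 10 gives P ≈ 0.00036, matching the slide.
```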

Page 24:

Choosing K

Page 25:

What are the solutions for this problem?

• Multiple runs!!
– usually helps

• Sample the points so that you can guesstimate the number of clusters
– depends on how we have sampled
– we may have sampled outliers in the data

• Select more than k centroids and then pick k among them
– choose the k most widely separated centroids (a seeding sketch follows)

Page 26:

How to evaluate k-means clusters

• The most common measure is the sum of squared errors (SSE):

SSE = Σ_{i=1..K} Σ_{x ∈ C_i} dist(m_i, x)², where m_i is the centroid of cluster C_i

• Given two clustering outputs from k-means, we can choose the one with the least error

• Only compare clusterings with the same K

• Important side note: K-means is a heuristic for minimizing SSE
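Given the formula above, SSE takes only a few lines to compute; a sketch, assuming labels and centroids come from a k-means run like the earlier sketch:

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum over clusters of squared distances from each point to its centroid."""
    return sum(np.sum((X[labels == i] - m) ** 2)
               for i, m in enumerate(centroids))
```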

Page 27:

Pre-processing and Post-processing

• Pre-processing:
– normalize the data (e.g., scale the data to unit standard deviation)
– eliminate outliers

• Post-processing:
– eliminate small clusters that may represent outliers
– split clusters that have a high SSE
– merge clusters that have a low SSE

Page 28:

Limitations of using K-means

• K-means can have problems when the data has clusters of:
– different sizes
– different densities
– non-globular shapes
– or when outliers are present!

Page 29:

How does this scale… (for MapReduce)

In the map step:
• Read the cluster centers into memory from a SequenceFile
• Iterate over each cluster center for each input key/value pair
• Measure the distances and save the nearest center, i.e., the one with the lowest distance to the vector
• Write the cluster center with its vector to the filesystem

In the reduce step (we get the associated vectors for each center):
• Iterate over each value vector and calculate the average vector (sum the vectors and divide each component by the number of vectors received)
• This is the new center; save it into a SequenceFile
• Check convergence between the cluster center stored in the key object and the new center
• If they are not equal, increment an update counter
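The two steps above can be written as Hadoop-style map and reduce functions. Below is a hedged Python sketch that mirrors the bullets, with the SequenceFile I/O and counter plumbing omitted; mapper and reducer are illustrative names, not a specific framework's API:

```python
import numpy as np

def mapper(point, centers):
    """Map: emit (nearest center id, point) for one input vector."""
    dists = [np.linalg.norm(point - c) for c in centers]
    yield int(np.argmin(dists)), point

def reducer(center_id, points, old_centers, tol=1e-6):
    """Reduce: average the vectors assigned to one center, and flag
    whether the center moved (the convergence check / update counter)."""
    new_center = np.mean(points, axis=0)
    moved = not np.allclose(new_center, old_centers[center_id], atol=tol)
    return center_id, new_center, moved
```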

Page 30:

Making k-means streaming

• Two broad approaches:
– Solving k-means as the data arrives:
• Guha, Mishra, Motwani, O'Callaghan (2001)
• Charikar, O'Callaghan, and Panigrahy (2003)
• Braverman, Meyerson, Ostrovsky, Roytman, Shindler, and Tagiku (2011)
– Solving k-means using weighted coresets:
• select a small, weighted sample of points
• weights are chosen such that the k-means solution on the subset is similar to the solution on the original dataset

Page 31:

Fast Streaming K-means

Shindler, Wong, Meyerson, NIPS (2011)

Shindler, NIPS presentation (2011)

Page 32:

Fast Streaming K-means

• Intuition on why this works: the probability that point x opens a new cluster is proportional to its distance from the nearest existing "mean"
– referred to as a "facility" here

• Costliest step: measuring δ, the distance to the nearest facility:
– use approximate nearest-neighbor algorithms

• Space complexity: Ω(k log n)
– you are only storing neighborhood info
– use hashing and metric embedding (not discussed)

• Time complexity: o(nk)

Shindler, Wong, Meyerson, NIPS (2011)

Page 33:

Hierarchical Clustering

Page 34:

Hierarchical Clustering

• Produces a set of nested clusters organized as a hierarchical tree

• Can be conveniently visualized as a dendrogram:
– a tree-like representation that records the sequences of merges and splits

Page 35:

Types of Hierarchical Clustering

• Agglomerative Clustering:
– start with each point as an individual cluster (leaf)
– at each step, merge the closest pair of clusters until one cluster (or k clusters) remains

• Divisive Clustering:
– start with one, all-inclusive cluster
– at each step, split a cluster until each cluster contains a single point (or there are k clusters)

• Traditional hierarchical clustering:
– uses a similarity or distance matrix
– merges or splits one cluster at a time

Page 36:

Agglomerative Clustering

• One of the more popular algorithms

• The basic algorithm is straightforward (a SciPy sketch follows):

Agglomerative Clustering Algorithm
1: Compute the distance matrix
2: Let each data point be a cluster
3: repeat
4:   Merge the two closest clusters
5:   Update the distance matrix
6: until only a single cluster remains

The key operation is the computation of the proximity of two clusters → the different approaches to defining the distance between clusters distinguish the different algorithms.
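For reference, SciPy implements this algorithm; a minimal sketch on toy data, where method selects among the inter-cluster distance definitions covered on the next few slides:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)                        # toy data
Z = linkage(X, method='average')                 # also 'single', 'complete', 'ward', …
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree into 3 clusters
```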

Page 37:

Starting Situation

• Start with clusters of individual data points and a distance matrix

[Figure: the initial distance matrix with rows and columns p1, p2, p3, …]

Page 38:

Next step: Group points…

• After merging a few of these data points

[Figure: five intermediate clusters C1–C5 and the corresponding updated distance matrix]

Page 39:

Next step: Merge clusters…

• After merging a few of these data points

[Figure: clusters C1–C5 and the distance matrix, with two clusters about to be merged and their rows/columns combined]

Page 40:

How to merge and update the distance matrix?

• Measures of similarity:
– Min
– Max
– Group average
– Distance between centroids
– other methods driven by an objective function

• How do these look in the clustering process?

Page 41:

Defining inter-cluster similarity

• Min (single link)

• Max (complete link)

• Group Average (average link)

• Distance between centroids
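As a concrete reading of these four definitions, here is a small sketch for two clusters given as point arrays A and B, assuming Euclidean distance:

```python
import numpy as np
from scipy.spatial.distance import cdist

def single_link(A, B):
    return cdist(A, B).min()      # Min: closest pair across the clusters

def complete_link(A, B):
    return cdist(A, B).max()      # Max: farthest pair across the clusters

def average_link(A, B):
    return cdist(A, B).mean()     # Group average over all cross-cluster pairs

def centroid_distance(A, B):
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
```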

Page 42:

Single Link

• Can handle non-spherical/non-convex clusters

Page 43:

Complete Link Clustering

• Better suited for datasets with noise

• Tends to form smaller clusters

• Biased toward more globular clusters

Page 44:

Average link / Unweighted Pair Group Method with Arithmetic Mean (UPGMA)

• Compromise between single and complete linkage

• Works generally well in practice

Page 45:

How do we say when two clusters are similar?

• Ward's method:
– the similarity of two clusters is based on the increase in SSE when the two clusters are merged

• Advantages:
– less susceptible to errors/outliers in the data
– hierarchical analogue of K-means
– can be used to initialize K-means

• Disadvantage:
– biased toward more globular clusters

Page 46:

Space and Time Complexity

• Space complexity: O(N²)
– N is the number of data points
– there are N² entries in the distance matrix

• Time complexity: O(N³)
– in many cases: N steps for tree construction, and at each step a distance matrix with O(N²) entries must be searched and updated
– the complexity can be reduced to O(N² log N) in some cases

Page 47:

Let’s talk about Scaling!

• A specific type of hierarchical clustering algorithm:
– UPGMA (average linkage)
– most widely used in the bioinformatics literature

• However, it is impractical for scaling to an entire genome!
– we need the whole distance/dissimilarity matrix in memory (N² entries)!
– how can we exploit sparsity?

Page 48:

Problem of interest…

• We are given a large number of sequences and a way to determine how similar two or more sequences are

• We have a pairwise dissimilarity matrix

• Build a hierarchical clustering routine for understanding how proteins (or other bio-molecules) have evolved

Page 49:

The problem with UPGMA: Distance matrix computation is expensive

• We are computing the arithmetic mean between the sequences

• This is not defined when we have sparse inputs

• The triangle inequality is not satisfied, given how we have defined the way clusters are built…

Page 50:

Strategy to scale this up for Big Data

• Two aspects to handle:
– missing edges
– sparsity in the distance matrix

• Use a detection threshold ψ for the missing edge data: we are completing the "missing" values in D using ψ!

Page 51:

Sparse UPGMA: Speeding things up

• Space: O(E) (note E ≪ N²)

• Time: O(E log V)

• Still expensive, since E can be arbitrarily large!

• How do we deal with this?

Page 52:

Streaming for Sparsity: Multi-round Memory-Constrained UPGMA (MC-UPGMA)

• Two components needed:
– a memory-constrained clustering unit:
• holds only the subset of the edges E that needs to be processed in the current round
– a memory-constrained merging unit:
• ensures we get only valid edges

• Space is only O(N), depending on how many sequences we have to load at any given time…

• Time: O(E log V)

Page 53:

Limitations of Hierarchical Clustering

• Greedy: once we make a merge decision, it usually cannot be undone
– or it can be expensive to undo
– methods exist to alter this

• No global function is being minimized or maximized

• The different schemes of hierarchical clustering have limitations:
– sensitivity to noise and outliers
– difficulty in handling clusters of different shapes
– chaining, breaking of clusters…

Page 54:

Density-based Spatial Clustering of Applications with Noise (DBSCAN)

Page 55:

Preliminaries

• Density is defined as the number of points within a radius ε
– in this case, density = 9

• A core point has more than a specified number of points (minPts) within radius ε
– these points are in the interior of a cluster

• A border point has fewer than minPts within ε but is in the vicinity of a core point

• A noise point is any point that is neither a core point nor a border point

[Figure: core, border, and noise points for minPts = 4 and radius ε]

Page 56:

DBSCAN Algorithm
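The algorithm itself appeared on the original slide as a figure, which is not reproduced here. As a stand-in, a minimal sketch using scikit-learn's DBSCAN on toy data, with eps playing the role of ε and min_samples the role of minPts:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(200, 2)                # toy data
db = DBSCAN(eps=0.1, min_samples=4).fit(X)
labels = db.labels_                       # cluster ids; -1 marks noise points
core = db.core_sample_indices_            # indices of the core points
```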

Page 57:

Illustration of DBSCAN: Assignment of Core, Border and Noise Points

Page 58:

DBSCAN: Finding Clusters

Page 59:

Advantages and Limitations

• Resistant to noise

• Can handle clusters of different sizes and shapes

• Eps and MinPts are dependent on each other
– they can be difficult to specify

• Clusters of different densities within the same dataset can be difficult to find

Page 60:

Advantages and Limitations

• Varying density data

• High dimensional data

Page 61:

How to determine Eps and MinPts

• For points within a cluster, the kth nearest neighbors are at roughly the same distance

• Noise points are generally farther away

• So, sort every point's distance to its kth nearest neighbor and plot the result; the "knee" of the curve suggests a good Eps (a sketch follows)
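A sketch of that k-distance computation using scikit-learn's NearestNeighbors; the plotting itself is omitted, but the sorted array is what you would plot:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

k = 4                                            # the "k" of the k-th nearest neighbor
X = np.random.rand(200, 2)                       # toy data
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own 0th neighbor
dists, _ = nn.kneighbors(X)
kdist = np.sort(dists[:, k])                     # sorted k-th NN distance per point
# Plot kdist: the knee suggests Eps; noise points sit in the long right tail.
```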

Page 62:

How do we validate clusters?

Page 63:

Cluster validity

• For supervised learning:
– we had a class label,
– which meant we could measure how good our training and testing errors were
– metrics: Accuracy, Precision, Recall

• For clustering:
– how do we measure the "goodness" of the resulting clusters?

Page 64:

Clustering random data (overfitting)

If you ask a clustering algorithm to find clusters, it will find some

Page 65:

Different aspects of validating clusters

• Determining the clustering tendency of a set of data, i.e., whether non-random structure actually exists in the data (e.g., to avoid overfitting)

• External validation: compare the results of a cluster analysis to externally known class labels (ground truth)

• Internal validation: evaluate how well the results of a cluster analysis fit the data without reference to external information

• Comparing clusterings to determine which is better

• Determining the 'correct' number of clusters

Page 66:

Measures of cluster validity

• External index: measures the extent to which cluster labels match externally supplied class labels
– Entropy, Purity, Rand Index

• Internal index: measures the goodness of a clustering structure without respect to external information
– Sum of Squared Errors (SSE), Silhouette coefficient

• Relative index: compares two different clusterings or clusters
– often an external or internal index is used for this function, e.g., SSE or entropy
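As an example of an internal index, scikit-learn exposes the silhouette coefficient directly; a short sketch on toy data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(200, 2)                       # toy data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))               # closer to 1 means tighter,
                                                 # better-separated clusters
```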

Page 67:

Measuring Cluster Validity with Correlation

• Proximity matrix vs. incidence matrix:
– the incidence matrix K has K_ij = 1 if points i and j belong to the same cluster, and 0 otherwise

• Compute the correlation between the two matrices:
– only n(n-1)/2 values need to be computed, since the matrices are symmetric
– high correlation (in magnitude) indicates that points in the same cluster are close to each other

• Not suited for density-based clusterings
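A sketch of this correlation check; cluster_correlation is my own helper name, and note that with a distance-based proximity matrix a good clustering shows a strongly negative correlation (same-cluster pairs have small distances):

```python
import numpy as np
from scipy.spatial.distance import pdist

def cluster_correlation(X, labels):
    labels = np.asarray(labels)
    prox = pdist(X)                            # condensed pairwise distance matrix
    same = labels[:, None] == labels[None, :]  # incidence matrix: 1 if same cluster
    iu = np.triu_indices(len(labels), k=1)     # the n(n-1)/2 unique pairs
    return np.corrcoef(prox, same[iu].astype(float))[0, 1]
```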

Page 68:

Another approach: use similarity matrix for cluster validation

Page 69:

Internal Measures: SSE

• SSE is also a good measure of how good the clustering is
– lower SSE indicates a better clustering

• It can be used to estimate the number of clusters (see the elbow sketch below)
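A sketch of the usual "elbow" procedure: run K-means over a range of K and watch where the SSE (inertia_ is scikit-learn's name for it) stops dropping sharply:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)                       # toy data
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)                        # SSE; look for the elbow in k
```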

Page 70:

More on Clustering a little later…

• We will discuss other forms of clustering in the following classes

• Next class:
– please bring your brief write-up on the two papers