Transcript
Page 1:

Clustering

COSC 526 Class 12

Arvind Ramanathan, Computational Science & Engineering Division, Oak Ridge National Laboratory, Oak Ridge. Ph: 865-576-7266. E-mail: [email protected]

Slides inspired by: Andrew Moore (CMU), Andrew Ng (Stanford), Tan, Steinbach and Kumar

Page 2:

Assignment 1: Your first hand at random walks

• Write-up is here…

• Pair up and do the assignment
– it helps to work in small teams
– maximize your productivity

• Most of the assignment and its notes are in the handouts (class web-page)

Page 3:

Clustering: Basics…

Page 4:

Clustering

• Finding groups of items (or objects) such that items within a group are related to one another and different from items in other groups

Inter-cluster distances are maximized

Intra-cluster distances are minimized

Page 5:

Applications

• Grouping regions together based on precipitation

• Grouping genes together based on expression patterns in cells

• Finding ensembles of folded/unfolded protein structures

Page 6:

What is not clustering?

• Supervised classification:
– uses class label information

• Simple segmentation:
– dividing students into different registration groups (alphabetically, by major, etc.)

• Results of a query:
– grouping is the result of an external specification

• Graph partitioning:
– some mutual relevance, but the areas are not identical…

Take Home Message:
• Clustering of data is essentially driven by the data at hand!!
• Meaning or interpretation of the clusters should be driven by the data!!

Page 7:

Constitution of a cluster can be ambiguous

• How to decide between 8 clusters and 2 clusters?

Page 8:

Types of Clustering

• Partitional Clustering:
– a division of the data into non-overlapping subsets (clusters) such that each data point is in exactly one subset

• Hierarchical Clustering:
– a set of nested clusters organized as a hierarchical tree

[Figure: a partitional clustering of points p1–p6, and the corresponding hierarchical tree over p1–p6]

Page 9:

Other types of distinctions…

• Exclusive vs. Non-exclusive:
– in non-exclusive clustering, points may belong to multiple clusters

• Fuzzy vs. Non-fuzzy:
– in fuzzy clustering, a point belongs to every cluster with a weight between 0 and 1
– similar to probabilistic clustering

• Partial vs. Complete:
– we may want to cluster only some of the data

• Heterogeneous vs. Homogeneous:
– clusters of widely different sizes…

Page 10:

Types of Clusters

• Well-separated clusters

• Center-based clusters

• Contiguous clusters

• Density based clusters

• Property/conceptual

• Described by an objective function

Well-separated: a set of points such that any point in the cluster is closer to every other point in the cluster than to any point not in the cluster

Page 11:

Types of Clusters

• Well-separated clusters

• Center-based clusters

• Contiguous clusters

• Density based clusters

• Property/conceptual

• Described by an objective function

Center-based: a cluster is a set of objects such that each object is closer to the center of its own cluster (called the centroid) than to the center of any other cluster…

Page 12:

Types of Clusters

• Well-separated clusters

• Center-based clusters

• Contiguous clusters

• Density based clusters

• Property/conceptual

• Described by an objective function

Contiguity-based: nearest neighbor or transitive…

Page 13:

Types of Clusters

• Well-separated clusters

• Center-based clusters

• Contiguous clusters

• Density based clusters

• Property/conceptual

• Described by an objective function

Density-based: a cluster is a dense region of points separated by low-density regions. Used when clusters are irregular and when noise/outliers are present.

Page 14:

Types of Clusters

• Well-separated clusters

• Center-based clusters

• Contiguous clusters

• Density based clusters

• Property/conceptual

• Described by an objective function

Property/conceptual: find clusters that share a common property or representation, e.g., taste, smell, …

Page 15:

Types of Clusters

• Well-separated clusters

• Center-based clusters

• Contiguous clusters

• Density based clusters

• Property/conceptual

• Described by an objective function

• Find clusters that minimize or maximize an objective function

• Enumerate all possible ways of dividing the points into clusters and evaluate the goodness of each potential set of clusters with the objective function
– this is an NP-hard problem

• Global vs. local objectives:
– hierarchical clustering algorithms typically have local objectives
– partitional algorithms typically have global objectives

Page 16:

More on objective functions… (1)

• Objective functions tend to map the clustering problem to a different domain and solve a related problem:
– e.g., define a proximity matrix as a weighted graph
– clustering is then equivalent to breaking the graph into connected components
– minimize the edge weight between clusters and maximize the edge weight within clusters

Page 17:

More on objective functions… (2)

• Best clustering usually minimizes/maximizes an objective function

• Mixture models assume that the data is a mixture of a number of parametric statistical distributions (e.g., Gaussians)

Page 18:

Characteristics of input data

• Type of proximity or density measure:
– a derived measure, central to clustering

• Sparseness:
– dictates the type of similarity
– adds to efficiency

• Attribute type:
– dictates the type of similarity

• Type of data:
– dictates the type of similarity

• Dimensionality

• Noise and outliers

• Type of distribution

Page 19:

Clustering Algorithms:
• K-means Clustering
• Hierarchical Clustering
• Density-based Clustering

Page 20:

K-means Clustering

• Partitional clustering:
– each cluster is associated with a centroid
– each point is assigned to the cluster with the closest centroid
– we need to specify the total number of clusters, K, as one of the inputs

• Simple algorithm (a NumPy sketch follows):

K-means Algorithm
1: Select K points as the initial centroids
2: repeat
3:   Form K clusters by assigning each point to its closest centroid
4:   Recompute the centroid of each cluster
5: until the centroids do not change
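The loop above maps directly onto a few lines of NumPy. This is a minimal sketch of Lloyd's algorithm, not the course's reference implementation; the function name kmeans, the iteration cap, and the empty-cluster handling are my own choices:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1: select K points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # 3: form K clusters by assigning each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 4: recompute each centroid as the mean of its points
        #    (keep the old centroid if a cluster became empty)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        # 5: stop when the centroids no longer change
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels
```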

Page 21:

K-means Clustering

• Initial centroids are chosen randomly:
– the resulting clusters can vary depending on how you started

• The centroid is the mean of the points in the cluster

• "Closeness" is usually measured with Euclidean distance

• K-means will typically converge quickly
– points stop changing assignments
– another stopping criterion: only a few points change clusters

• Time complexity: O(nKId)
– n: number of points; K: number of clusters
– I: number of iterations; d: number of attributes

Page 22:

K-means example

Page 23:

How to initialize (seed) K-means?

• If there are K "real" clusters, then the chance of selecting one initial centroid from each cluster is small
– the chance shrinks quickly as K grows
– if the clusters all have the same size m, the chance is P = (K! · m^K)/(Km)^K = K!/K^K
– for K = 10, P = 10!/10^10 ≈ 0.00036 (really small!!)

• The choice of initial centroids can have a deep impact on how the clusters are determined…
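To check the slide's number, the probability K!/K^K can be evaluated directly; a quick sketch:

```python
from math import factorial

# P = K!/K^K: of the K^K equally likely ways K random seeds can fall into
# K equal-sized clusters, only the K! permutations hit each cluster once.
for k in (5, 10):
    print(k, factorial(k) / k ** k)
# k = 10 gives P ≈ 0.00036, matching the slide.
```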

Page 24:

Choosing K

Page 25:

What are the solutions for this problem?

• Multiple runs!!
– usually helps

• Sample the points so that you can guesstimate the number of clusters
– depends on how we have sampled
– we may have sampled outliers in the data

• Select more than k centroids and then pick k among them
– choose the k most widely separated centroids (a seeding sketch follows)

Page 26:

How to evaluate k-means clusters

• The most common measure is the sum of squared errors (SSE):

SSE = Σ_{i=1..K} Σ_{x ∈ C_i} dist(m_i, x)², where m_i is the centroid of cluster C_i

• Given two clustering outputs from k-means, we can choose the one with the least error

• Only compare clusterings with the same K

• Important side note: K-means is a heuristic for minimizing SSE
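Given the formula above, SSE takes only a few lines to compute; a sketch, assuming labels and centroids come from a k-means run like the earlier sketch:

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum over clusters of squared distances from each point to its centroid."""
    return sum(np.sum((X[labels == i] - m) ** 2)
               for i, m in enumerate(centroids))
```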

Page 27:

Pre-processing and Post-processing

• Pre-processing:
– normalize the data (e.g., scale the data to unit standard deviation)
– eliminate outliers

• Post-processing:
– eliminate small clusters that may represent outliers
– split clusters that have a high SSE
– merge clusters that have a low SSE

Page 28:

Limitations of using K-means

• K-means can have problems when the data has clusters of:
– different sizes
– different densities
– non-globular shapes
– or when outliers are present!

Page 29:

How does this scale… (for MapReduce)

In the map step:
• Read the cluster centers into memory from a SequenceFile
• Iterate over each cluster center for each input key/value pair
• Measure the distances and save the nearest center, i.e., the one with the lowest distance to the vector
• Write the cluster center with its vector to the filesystem

In the reduce step (we get the associated vectors for each center):
• Iterate over each value vector and calculate the average vector (sum the vectors and divide each component by the number of vectors received)
• This is the new center; save it into a SequenceFile
• Check convergence between the cluster center stored in the key object and the new center
• If they are not equal, increment an update counter
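The two steps above can be written as Hadoop-style map and reduce functions. Below is a hedged Python sketch that mirrors the bullets, with the SequenceFile I/O and counter plumbing omitted; mapper and reducer are illustrative names, not a specific framework's API:

```python
import numpy as np

def mapper(point, centers):
    """Map: emit (nearest center id, point) for one input vector."""
    dists = [np.linalg.norm(point - c) for c in centers]
    yield int(np.argmin(dists)), point

def reducer(center_id, points, old_centers, tol=1e-6):
    """Reduce: average the vectors assigned to one center, and flag
    whether the center moved (the convergence check / update counter)."""
    new_center = np.mean(points, axis=0)
    moved = not np.allclose(new_center, old_centers[center_id], atol=tol)
    return center_id, new_center, moved
```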

Page 30:

Making k-means streaming

• Two broad approaches:
– Solving k-means as the data arrives:
• Guha, Mishra, Motwani, O'Callaghan (2001)
• Charikar, O'Callaghan, and Panigrahy (2003)
• Braverman, Meyerson, Ostrovsky, Roytman, Shindler, and Tagiku (2011)
– Solving k-means using weighted coresets:
• select a small, weighted sample of points
• weights are chosen such that the k-means solution on the subset is similar to the solution on the original dataset

Page 31:

Fast Streaming K-means

Shindler, Wong, Meyerson, NIPS (2011)

Shindler, NIPS presentation (2011)

Page 32:

Fast Streaming K-means

• Intuition on why this works: the probability that point x opens a new cluster is proportional to its distance from the nearest existing "mean"
– referred to as a "facility" here

• Costliest step: measuring δ, the distance to the nearest facility:
– use approximate nearest-neighbor algorithms

• Space complexity: Ω(k log n)
– you are only storing neighborhood info
– use hashing and metric embedding (not discussed)

• Time complexity: o(nk)

Shindler, Wong, Meyerson, NIPS (2011)

Page 33:

Hierarchical Clustering

Page 34:

Hierarchical Clustering

• Produces a set of nested clusters organized as a hierarchical tree

• Can be conveniently visualized as a dendrogram:
– a tree-like representation that records the sequences of merges and splits

Page 35:

Types of Hierarchical Clustering

• Agglomerative Clustering:
– start with each point as an individual cluster (leaf)
– at each step, merge the closest pair of clusters until one cluster (or k clusters) remains

• Divisive Clustering:
– start with one, all-inclusive cluster
– at each step, split a cluster until each cluster contains a single point (or there are k clusters)

• Traditional hierarchical clustering:
– uses a similarity or distance matrix
– merges or splits one cluster at a time

Page 36:

Agglomerative Clustering

• One of the more popular algorithms

• The basic algorithm is straightforward (a SciPy sketch follows):

Agglomerative Clustering Algorithm
1: Compute the distance matrix
2: Let each data point be a cluster
3: repeat
4:   Merge the two closest clusters
5:   Update the distance matrix
6: until only a single cluster remains

The key operation is the computation of the proximity of two clusters → the different approaches to defining the distance between clusters distinguish the different algorithms.
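For reference, SciPy implements this algorithm; a minimal sketch on toy data, where method selects among the inter-cluster distance definitions covered on the next few slides:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)                        # toy data
Z = linkage(X, method='average')                 # also 'single', 'complete', 'ward', …
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree into 3 clusters
```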

Page 37:

Starting Situation

• Start with clusters of individual data points and a distance matrix

[Figure: the initial distance matrix with rows and columns p1, p2, p3, …]

Page 38:

Next step: Group points…

• After merging a few of these data points

[Figure: five intermediate clusters C1–C5 and the corresponding updated distance matrix]

Page 39:

Next step: Merge clusters…

• After merging a few of these data points

[Figure: clusters C1–C5 and the distance matrix, with two clusters about to be merged and their rows/columns combined]

Page 40:

How to merge and update the distance matrix?

• Measures of similarity:
– Min
– Max
– Group average
– Distance between centroids
– other methods driven by an objective function

• How do these look in the clustering process?

Page 41:

Defining inter-cluster similarity

• Min (single link)

• Max (complete link)

• Group Average (average link)

• Distance between centroids
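As a concrete reading of these four definitions, here is a small sketch for two clusters given as point arrays A and B, assuming Euclidean distance:

```python
import numpy as np
from scipy.spatial.distance import cdist

def single_link(A, B):
    return cdist(A, B).min()      # Min: closest pair across the clusters

def complete_link(A, B):
    return cdist(A, B).max()      # Max: farthest pair across the clusters

def average_link(A, B):
    return cdist(A, B).mean()     # Group average over all cross-cluster pairs

def centroid_distance(A, B):
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
```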

Page 42:

Single Link

• Can handle non-spherical/non-convex clusters

Page 43:

Complete Link Clustering

• Better suited for datasets with noise

• Tends to form smaller clusters

• Biased toward more globular clusters

Page 44:

Average link / Unweighted Pair Group Method with Arithmetic Mean (UPGMA)

• Compromise between single and complete linkage

• Works generally well in practice

Page 45:

How do we say when two clusters are similar?

• Ward's method:
– the similarity of two clusters is based on the increase in SSE when the two clusters are merged

• Advantages:
– less susceptible to errors/outliers in the data
– hierarchical analogue of K-means
– can be used to initialize K-means

• Disadvantage:
– biased toward more globular clusters

Page 46:

Space and Time Complexity

• Space complexity: O(N²)
– N is the number of data points
– there are N² entries in the distance matrix

• Time complexity: O(N³)
– in many cases: N steps for tree construction, and at each step a distance matrix with O(N²) entries must be searched and updated
– the complexity can be reduced to O(N² log N) in some cases

Page 47:

Let’s talk about Scaling!

• A specific type of hierarchical clustering algorithm:
– UPGMA (average linkage)
– most widely used in the bioinformatics literature

• However, it is impractical for scaling to an entire genome!
– we need the whole distance/dissimilarity matrix in memory (N² entries)!
– how can we exploit sparsity?

Page 48:

Problem of interest…

• We are given a large number of sequences and a way to determine how similar two or more sequences are

• We have a pairwise dissimilarity matrix

• Build a hierarchical clustering routine for understanding how proteins (or other bio-molecules) have evolved

Page 49:

The problem with UPGMA: Distance matrix computation is expensive

• We are computing the arithmetic mean between the sequences

• This is not defined when we have sparse inputs

• The triangle inequality is not satisfied, given how we have defined the way clusters are built…

Page 50:

Strategy to scale this up for Big Data

• Two aspects to handle:
– missing edges
– sparsity in the distance matrix

• Use a detection threshold ψ for the missing edge data: we are completing the "missing" values in D using ψ!

Page 51:

Sparse UPGMA: Speeding things up

• Space: O(E) (note E ≪ N²)

• Time: O(E log V)

• Still expensive, since E can be arbitrarily large!

• How do we deal with this?

Page 52:

Streaming for Sparsity: Multi-round Memory-Constrained UPGMA (MC-UPGMA)

• Two components needed:
– a memory-constrained clustering unit:
• holds only the subset of the edges E that needs to be processed in the current round
– a memory-constrained merging unit:
• ensures we get only valid edges

• Space is only O(N), depending on how many sequences we have to load at any given time…

• Time: O(E log V)

Page 53:

Limitations of Hierarchical Clustering

• Greedy: once we make a merge decision, it usually cannot be undone
– or it can be expensive to undo
– methods exist to alter this

• No global function is being minimized or maximized

• The different schemes of hierarchical clustering have limitations:
– sensitivity to noise and outliers
– difficulty in handling clusters of different shapes
– chaining, breaking of clusters…

Page 54:

Density-based Spatial Clustering of Applications with Noise (DBSCAN)

Page 55:

Preliminaries

• Density is defined as the number of points within a radius ε
– in this case, density = 9

• A core point has more than a specified number of points (minPts) within radius ε
– these points are in the interior of a cluster

• A border point has fewer than minPts within ε but is in the vicinity of a core point

• A noise point is any point that is neither a core point nor a border point

[Figure: core, border, and noise points for minPts = 4 and radius ε]

Page 56:

DBSCAN Algorithm
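The algorithm itself appeared on the original slide as a figure, which is not reproduced here. As a stand-in, a minimal sketch using scikit-learn's DBSCAN on toy data, with eps playing the role of ε and min_samples the role of minPts:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(200, 2)                # toy data
db = DBSCAN(eps=0.1, min_samples=4).fit(X)
labels = db.labels_                       # cluster ids; -1 marks noise points
core = db.core_sample_indices_            # indices of the core points
```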

Page 57:

Illustration of DBSCAN: Assignment of Core, Border and Noise Points

Page 58:

DBSCAN: Finding Clusters

Page 59:

Advantages and Limitations

• Resistant to noise

• Can handle clusters of different sizes and shapes

• Eps and MinPts are dependent on each other
– they can be difficult to specify

• Clusters of different densities within the same dataset can be difficult to find

Page 60:

Advantages and Limitations

• Varying density data

• High dimensional data

Page 61:

How to determine Eps and MinPts

• For points within a cluster, the kth nearest neighbors are at roughly the same distance

• Noise points are generally farther away

• So, sort every point's distance to its kth nearest neighbor and plot the result; the "knee" of the curve suggests a good Eps (a sketch follows)
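A sketch of that k-distance computation using scikit-learn's NearestNeighbors; the plotting itself is omitted, but the sorted array is what you would plot:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

k = 4                                            # the "k" of the k-th nearest neighbor
X = np.random.rand(200, 2)                       # toy data
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own 0th neighbor
dists, _ = nn.kneighbors(X)
kdist = np.sort(dists[:, k])                     # sorted k-th NN distance per point
# Plot kdist: the knee suggests Eps; noise points sit in the long right tail.
```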

Page 62:

How do we validate clusters?

Page 63:

Cluster validity

• For supervised learning:
– we had a class label,
– which meant we could measure how good our training and testing errors were
– metrics: Accuracy, Precision, Recall

• For clustering:
– how do we measure the "goodness" of the resulting clusters?

Page 64:

Clustering random data (overfitting)

If you ask a clustering algorithm to find clusters, it will find some

Page 65:

Different aspects of validating clusters

• Determining the clustering tendency of a set of data, i.e., whether non-random structure actually exists in the data (e.g., to avoid overfitting)

• External validation: compare the results of a cluster analysis to externally known class labels (ground truth)

• Internal validation: evaluate how well the results of a cluster analysis fit the data without reference to external information

• Comparing clusterings to determine which is better

• Determining the 'correct' number of clusters

Page 66:

Measures of cluster validity

• External index: measures the extent to which cluster labels match externally supplied class labels
– Entropy, Purity, Rand Index

• Internal index: measures the goodness of a clustering structure without respect to external information
– Sum of Squared Errors (SSE), Silhouette coefficient

• Relative index: compares two different clusterings or clusters
– often an external or internal index is used for this function, e.g., SSE or entropy
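As an example of an internal index, scikit-learn exposes the silhouette coefficient directly; a short sketch on toy data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(200, 2)                       # toy data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))               # closer to 1 means tighter,
                                                 # better-separated clusters
```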

Page 67:

Measuring Cluster Validity with Correlation

• Proximity matrix vs. incidence matrix:
– the incidence matrix K has K_ij = 1 if points i and j belong to the same cluster, and 0 otherwise

• Compute the correlation between the two matrices:
– only n(n-1)/2 values need to be computed, since the matrices are symmetric
– high correlation (in magnitude) indicates that points in the same cluster are close to each other

• Not suited for density-based clusterings
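A sketch of this correlation check; cluster_correlation is my own helper name, and note that with a distance-based proximity matrix a good clustering shows a strongly negative correlation (same-cluster pairs have small distances):

```python
import numpy as np
from scipy.spatial.distance import pdist

def cluster_correlation(X, labels):
    labels = np.asarray(labels)
    prox = pdist(X)                            # condensed pairwise distance matrix
    same = labels[:, None] == labels[None, :]  # incidence matrix: 1 if same cluster
    iu = np.triu_indices(len(labels), k=1)     # the n(n-1)/2 unique pairs
    return np.corrcoef(prox, same[iu].astype(float))[0, 1]
```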

Page 68:

Another approach: use similarity matrix for cluster validation

Page 69:

Internal Measures: SSE

• SSE is also a good measure of how good the clustering is
– lower SSE indicates a better clustering

• It can be used to estimate the number of clusters (see the elbow sketch below)
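A sketch of the usual "elbow" procedure: run K-means over a range of K and watch where the SSE (inertia_ is scikit-learn's name for it) stops dropping sharply:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)                       # toy data
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)                        # SSE; look for the elbow in k
```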

Page 70:

More on Clustering a little later…

• We will discuss other forms of clustering in the following classes

• Next class:
– please bring your brief write-up on the two papers