Clustering – Part III: Spectral Clustering
COSC 526 Class 14
Arvind Ramanathan
Computational Science & Engineering Division
Oak Ridge National Laboratory, Oak Ridge
Ph: 865-576-7266
E-mail: [email protected]
Slides inspired by: Andrew Moore (CMU), Jure Leskovec (http://www.mmds.org)
Recall: v is an eigenvector of W with eigenvalue λ, i.e., W v = λ v
• smallest eigenvectors of D − A are largest eigenvectors of A
• smallest eigenvectors of I − W are largest eigenvectors of W
Suppose each y(i) = +1 or −1:
• Then y is a cluster indicator that splits the nodes into two sets
• What is yT(D − A)y?
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
$$
y^\top (D - A)\,y \;=\; \sum_i d_i\,y_i^2 \;-\; \sum_{i,j} a_{ij}\,y_i y_j
\;=\; \tfrac{1}{2}\sum_{i,j} a_{ij}\,\bigl(y_i^2 + y_j^2\bigr) \;-\; \sum_{i,j} a_{ij}\,y_i y_j
\;=\; \tfrac{1}{2}\sum_{i,j} a_{ij}\,(y_i - y_j)^2
$$

With y(i) ∈ {+1, −1}, every edge cut by y contributes (y_i − y_j)² = 4 (counted once in each direction), so

$$y^\top (D - A)\,y \;=\; 4 \times \text{size of CUT}(y)$$

Similarly, yT(I − W)y measures the size of NCUT(y).
NCUT: roughly minimize ratio of transitions between classes vs transitions within classes
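As a quick sanity check of the cut identity, here is a small numpy example (my own toy graph, not from the slides):

```python
import numpy as np

# Undirected 4-node graph with edges (0,1), (0,2), (1,2), (2,3)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
D = np.diag(A.sum(axis=1))                 # degree matrix
y = np.array([1, 1, -1, -1])               # indicator: {0,1} vs {2,3}

cut_edges = A[y == 1][:, y == -1].sum()    # edges crossing the partition: 2
print(y @ (D - A) @ y)                     # 8
print(4 * cut_edges)                       # 8 = 4 * CUT(y)
```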
So far…
• How to define a “good” partition of a graph?
– Minimize a given graph cut criterion
• How to efficiently identify such a partition?
– Approximate using information provided by the eigenvalues and eigenvectors of a graph
• Spectral Clustering
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Spectral Clustering Algorithms
• Three basic stages:
– 1) Pre-processing
• Construct a matrix representation of the graph
– 2) Decomposition
• Compute eigenvalues and eigenvectors of the matrix
• Map each point to a lower-dimensional representation based on one or more eigenvectors
– 3) Grouping
• Assign points to two or more clusters, based on the new representation (a minimal sketch follows below)
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
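The three stages translate almost line-for-line into numpy/scipy. A minimal sketch for the two-cluster case (the function name, the unnormalized Laplacian, and the sign-based split are my choices, not prescribed by the slide):

```python
import numpy as np
from scipy.linalg import eigh

def spectral_partition(A):
    # 1) Pre-processing: matrix representation of the graph
    D = np.diag(A.sum(axis=1))
    L = D - A
    # 2) Decomposition: keep the eigenvector of the second-smallest
    #    eigenvalue (the Fiedler vector) as a 1-d embedding of the nodes
    _, vecs = eigh(L)
    x2 = vecs[:, 1]
    # 3) Grouping: split the nodes on the sign of their x2 component
    return x2 >= 0
```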
Spectral Partitioning Algorithm
• 1) Pre-processing:
– Build the Laplacian matrix L of the graph:

     1   2   3   4   5   6
1    3  -1  -1   0  -1   0
2   -1   2  -1   0   0   0
3   -1  -1   3  -1   0   0
4    0   0  -1   3  -1  -1
5   -1   0   0  -1   3  -1
6    0   0   0  -1  -1   2

• 2) Decomposition:
– Find eigenvalues λ and eigenvectors x of the matrix L
[Slide shows the full eigen-decomposition Λ, X; the eigenvalues are λ = 0.0, 1.0, 3.0, 3.0, 4.0, 5.0]
– Map vertices to the corresponding components of x2, the eigenvector of the second-smallest eigenvalue λ2 = 1.0:

node:  1     2     3     4     5     6
x2:    0.3   0.6   0.3  −0.3  −0.3  −0.6

How do we now find the clusters?
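To reproduce the decomposition step, here is a minimal numpy check (the Laplacian is the one from the slide; eigenvector signs may be flipped depending on the solver):

```python
import numpy as np

# Laplacian of the 6-node example graph from the slide
L = np.array([[ 3, -1, -1,  0, -1,  0],
              [-1,  2, -1,  0,  0,  0],
              [-1, -1,  3, -1,  0,  0],
              [ 0,  0, -1,  3, -1, -1],
              [-1,  0,  0, -1,  3, -1],
              [ 0,  0,  0, -1, -1,  2]])

vals, vecs = np.linalg.eigh(L)       # eigenvalues in ascending order
print(np.round(vals, 1))             # [0. 1. 3. 3. 4. 5.]
print(np.round(vecs[:, 1], 1))       # x2 ~ [0.3 0.6 0.3 -0.3 -0.3 -0.6] (up to sign)
```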
Spectral Partitioning
• 3) Grouping:
– Sort components of the reduced 1-dimensional vector
– Identify clusters by splitting the sorted vector in two
• How to choose a splitting point?
– Naïve approaches:
• Split at 0 or at the median value
– More expensive approaches:
• Attempt to minimize the normalized cut in 1 dimension (sweep over the ordering of nodes induced by the eigenvector; a sketch follows below)
[Slide figure: components of x2 with the split at 0]
Split at 0:
– Cluster A (positive components): nodes 1, 2, 3 with x2 = 0.3, 0.6, 0.3
– Cluster B (negative components): nodes 4, 5, 6 with x2 = −0.3, −0.3, −0.6
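The “more expensive” sweep approach can be sketched as follows (my own construction from the slide’s description; assumes a connected graph so cluster volumes are nonzero):

```python
import numpy as np

def sweep_cut(A, x2):
    """Try every split of the x2-ordering; return the one minimizing NCUT."""
    order = np.argsort(x2)
    best_ncut, best_split = np.inf, 1
    for i in range(1, len(order)):
        S, T = order[:i], order[i:]
        cut = A[np.ix_(S, T)].sum()                 # weight crossing the split
        ncut = cut / A[S].sum() + cut / A[T].sum()  # normalized cut of this split
        if ncut < best_ncut:
            best_ncut, best_split = ncut, i
    labels = np.zeros(len(x2), dtype=int)
    labels[order[best_split:]] = 1
    return labels, best_ncut
```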
Example: Spectral Partitioning
[Plot: value of x2 vs. rank in x2]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Example: Spectral Partitioning
[Plots: value of x2 vs. rank in x2; components of x2]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Example: Spectral Partitioning
[Plots: components of x1; components of x3]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
k-Way Spectral Clustering
• How do we partition a graph into k clusters?
• Two basic approaches:
– Recursive bi-partitioning [Hagen et al., ’92]
• Recursively apply bi-partitioning algorithm in a hierarchical divisive manner
• Disadvantages: Inefficient, unstable
– Cluster multiple eigenvectors [Shi-Malik, ’00]
• Build a reduced space from multiple eigenvectors
• Commonly used in recent papers
• A preferable approach (a sketch follows below)
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
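A sketch of the multiple-eigenvector approach in the spirit of Shi-Malik (the unnormalized Laplacian and sklearn’s KMeans are my simplifications here; Shi-Malik use the normalized Laplacian):

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def k_way_spectral(A, k):
    D = np.diag(A.sum(axis=1))
    L = D - A
    # Reduced space: eigenvectors of the k smallest eigenvalues of L
    _, vecs = eigh(L, subset_by_index=[0, k - 1])
    # Group the points with k-means on the k-dimensional embedding
    return KMeans(n_clusters=k, n_init=10).fit_predict(vecs)
```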
Why use multiple eigenvectors?
• Approximates the optimal cut [Shi-Malik, ’00]
– Can be used to approximate the optimal k-way normalized cut
• Emphasizes cohesive clusters
– Increases the unevenness in the distribution of the data
– Associations between similar points are amplified, associations between dissimilar points are attenuated
– The data begins to “approximate a clustering”
• Well-separated space
– Transforms data to a new “embedded space” consisting of k orthogonal basis vectors
• Multiple eigenvectors prevent instability due to information loss
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Recall: v is an eigenvector of W with eigenvalue λ, i.e., W v = λ v
• smallest eigenvectors of D − A are largest eigenvectors of A
• smallest eigenvectors of I − W are largest eigenvectors of W
Suppose each y(i) = +1 or −1:
• Then y is a cluster indicator that cuts the nodes into two sets
• What is yT(D − A)y? The cost of the graph cut defined by y
• What is yT(I − W)y? Also the cost of a graph cut defined by y
• How do we minimize it?
• It turns out that to minimize yT X y / (yT y) we take the eigenvector of X with the smallest eigenvalue
• But this eigenvector will not have ±1 entries, so it is a “relaxed” solution
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
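The relaxation step is the standard Rayleigh-quotient fact (not spelled out on the slide):

$$\min_{y \neq 0} \frac{y^\top X y}{y^\top y} \;=\; \lambda_{\min}(X), \qquad \text{attained at the corresponding eigenvector } v_{\min}$$

Dropping the constraint y(i) ∈ {+1, −1} thus turns the discrete cut problem into an eigenvector computation, and rounding the relaxed solution (e.g., by sign) recovers a clustering. For the Laplacian D − A the minimizer is the trivial constant vector with λ = 0, which is why the earlier slides use the second-smallest eigenvector x2.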
Cluster Validation
• For supervised classification we had ground-truth labels
– which meant we could measure how good our training and testing errors were
– Metrics: Accuracy, Precision, Recall
• For clustering:
– How do we measure the “goodness” of the resulting clusters?
Clustering random data (overfitting)
If you ask a clustering algorithm to find clusters, it will find some
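The point is easy to demonstrate; a quick sketch (my own example) in which k-means dutifully reports clusters in uniform noise:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 2))              # no real cluster structure
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)
print(np.bincount(labels))                  # three "clusters" reported anyway
```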
Different aspects of validating clusters
• Determine the clustering tendency of a set of data, i.e., whether non-random structure actually exists in the data (e.g., to avoid overfitting)
• External Validation: Compare the results of a cluster analysis to externally known class labels (ground truth).
• Internal Validation: Evaluating how well the results of a cluster analysis fit the data without reference to external information.
• Compare clusterings to determine which is better.
• Determining the ‘correct’ number of clusters.
Measures of cluster validity
• External Index: Used to measure the extent to which cluster labels match externally supplied class labels.
– Entropy, Purity, Rand Index
• Internal Index: Used to measure the goodness of a clustering structure without respect to external information.
– Sum of Squared Error (SSE), Silhouette coefficient
• Relative Index: Used to compare two different clusterings or clusters.
– Often an external or internal index is used for this function, e.g., SSE or entropy
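A sketch of computing some of the listed indices (the sklearn metric functions named here are standard; purity is computed by hand since sklearn has no built-in for it, and the variable names are mine):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, silhouette_score

def purity(true_labels, pred_labels):
    """Fraction of points assigned to the majority true class of their cluster.
    Assumes integer-coded labels."""
    total = 0
    for c in np.unique(pred_labels):
        members = true_labels[pred_labels == c]
        total += np.bincount(members).max()
    return total / len(true_labels)

# External indices need ground truth y_true; internal ones only the data X:
#   adjusted_rand_score(y_true, y_pred)   # external: (adjusted) Rand index
#   silhouette_score(X, y_pred)           # internal: silhouette coefficient
```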
Measuring Cluster Validity with Correlation
• Proximity matrix vs. incidence matrix:
– The incidence matrix K has K_ij = 1 if points i and j belong to the same cluster, and 0 otherwise
• Compute the correlation between the two matrices:
– Only n(n−1)/2 entries need to be computed, since both matrices are symmetric
– A high (in magnitude) correlation indicates that points in the same cluster are close to each other
• Not well suited for density-based clusterings
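A sketch of this check as I read the slide (my own construction; note that with a distance-based proximity matrix, a good clustering shows a strongly negative correlation, since same-cluster pairs have small distances):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def cluster_correlation(X, labels):
    prox = pdist(X)                                # n(n-1)/2 pairwise distances
    same = (labels[:, None] == labels[None, :]).astype(float)
    incidence = squareform(same, checks=False)     # same condensed pair order
    return np.corrcoef(prox, incidence)[0, 1]
```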
Another approach: use similarity matrix for cluster validation
Internal Measures: SSE
• SSE is also a good measure of how good the clustering is
– Lower SSE → better clustering
• Can be used to estimate the number of clusters (a sketch follows below)
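A minimal sketch of that estimate (my own; k-means `inertia_` is exactly the SSE):

```python
from sklearn.cluster import KMeans

def sse_curve(X, k_max=10):
    """SSE for k = 1..k_max; pick k at the 'elbow' where the drop levels off."""
    return [KMeans(n_clusters=k, n_init=10).fit(X).inertia_
            for k in range(1, k_max + 1)]
```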
More on Clustering a little later…
• We will discuss other forms of clustering in the following classes
• Next class:
– Please bring your brief write-up on the two papers
– We will discuss frequent itemset mining and a few other aspects of clustering
– Move on to Dimensionality Reduction
Summary
• We saw spectral clustering techniques:
– only a broad overview
– more details next class
• Speeding up spectral clustering can be challenging