Cluster Analysis Hal Whitehead BIOL4062/5062

Dec 18, 2015


Adele Morgan
Page 1: Cluster Analysis Hal Whitehead BIOL4062/5062. What is cluster analysis? Non-hierarchical cluster analysis –K-means Hierarchical divisive cluster analysis.

Cluster Analysis

Hal Whitehead

BIOL4062/5062

Page 2:

• What is cluster analysis?

• Non-hierarchical cluster analysis
  – K-means

• Hierarchical divisive cluster analysis

• Hierarchical agglomerative cluster analysis
  – Linkage: single, complete, average, …
  – Cophenetic correlation coefficient

• Additive trees

• Problems with cluster analyses

Page 3:

Cluster Analysis

“Classification”

Maximize within-cluster homogeneity

(similar individuals within cluster)

“The Search for Discontinuities”

Discontinuities: places to put divisions between clusters

[Figure: scatter plot (horizontal axis 4 to 8, vertical axis 1 to 5) annotated with “?”]

Page 4:

Discontinuities:

Discontinuities generally present:

taxonomy

social organization

community ecology??

Page 5:

Types of cluster analysis:

• Uses: data, dissimilarity, similarity matrix

• Non-hierarchical
  – K-means

• Hierarchical
  – Hierarchical divisive (repeated K-means, network methods)
  – Hierarchical agglomerative
    • single linkage, average linkage, ...

• Additive trees

Page 6:

Non-hierarchical Clustering Techniques: K-Means

• Uses data matrix with Euclidean distances

• Maximizes between-cluster variance for a given number of clusters
  – i.e. choose clusters to maximize the F-ratio in a 1-way MANOVA
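The criterion on this slide can be made concrete with a small sketch. The function below computes, for each variable, the between-cluster and within-cluster sums of squares and a univariate F-ratio for a given partition. It is an illustration only: the function name `ss_table` and the toy points are made up, and it reports per-variable F-ratios (as in the K-means output table later in the slides) rather than a full multivariate MANOVA statistic.

```python
def ss_table(points, labels):
    """Per-variable (between SS, within SS, F-ratio) for a given partition."""
    n, p = len(points), len(points[0])
    groups = sorted(set(labels))
    k = len(groups)
    rows = []
    for j in range(p):
        col = [pt[j] for pt in points]
        grand_mean = sum(col) / n
        between = within = 0.0
        for g in groups:
            vals = [v for v, lab in zip(col, labels) if lab == g]
            group_mean = sum(vals) / len(vals)
            between += len(vals) * (group_mean - grand_mean) ** 2
            within += sum((v - group_mean) ** 2 for v in vals)
        # F = (between SS / (k - 1)) / (within SS / (n - k))
        f_ratio = (between / (k - 1)) / (within / (n - k)) if within > 0 else float("inf")
        rows.append((between, within, f_ratio))
    return rows

# Hypothetical toy data: two obvious groups of three points each
points = [(0.1, 0.1), (0.2, 0.1), (0.1, 0.2), (0.8, 0.9), (0.9, 0.8), (0.9, 0.9)]
good_rows = ss_table(points, [0, 0, 0, 1, 1, 1])  # partition matching the groups
bad_rows = ss_table(points, [0, 1, 0, 1, 0, 1])   # partition cutting across them
```

The partition that matches the visible groups gives a much larger F-ratio on both variables, which is the sense in which K-means "chooses clusters to maximize the F-ratio".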

Page 7:

K-Means

Works iteratively:

1. Choose the number of clusters

2. Assign points to clusters, randomly or using some other clustering technique

3. Move each point to each of the other clusters in turn; keep the move if it increases the between-cluster variance

4. Repeat step 3 until no improvement is possible
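A minimal sketch of the four steps above, in plain Python. It assumes Euclidean distances and uses the fact that the total sum of squares is fixed, so increasing between-cluster variance is the same as decreasing the within-cluster sum of squares. The names `within_ss` and `kmeans_exchange`, the `seed` parameter, and the rule of never emptying a cluster are choices made for this illustration, not details taken from the slides.

```python
import random

def within_ss(points, labels, k):
    """Total within-cluster sum of squared distances to each cluster centroid."""
    total = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            continue
        centre = [sum(col) / len(members) for col in zip(*members)]
        total += sum((x - m) ** 2 for p in members for x, m in zip(p, centre))
    return total

def kmeans_exchange(points, k, seed=0):
    rng = random.Random(seed)
    # Step 2: random start, with every cluster given at least one point
    labels = [i % k for i in range(len(points))]
    rng.shuffle(labels)
    best = within_ss(points, labels, k)
    improved = True
    while improved:                      # Step 4: repeat until no move helps
        improved = False
        for i in range(len(points)):     # Step 3: try each point in every other cluster
            if labels.count(labels[i]) == 1:
                continue                 # assumption: never empty a cluster
            home = labels[i]
            for c in range(k):
                if c == home:
                    continue
                labels[i] = c
                trial = within_ss(points, labels, k)
                if trial < best - 1e-12:  # lower within SS = higher between SS
                    best, home, improved = trial, c, True
            labels[i] = home             # keep the best cluster found for this point
    return labels, best

# Hypothetical toy data (Step 1: we choose k = 2)
pts = [(0.1, 0.1), (0.15, 0.2), (0.2, 0.1), (0.8, 0.9), (0.9, 0.8), (0.85, 0.85)]
labels, wss = kmeans_exchange(pts, 2)
```

The loop stops at a partition that no single-point move can improve, which need not be the best possible partition.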

Page 8:

K-means with three clusters

[Figure: seven scatter-plot panels of Y against X, both axes 0.0 to 1.0]

Page 9:

K-means with three clusters

Variable      Between SS   df   Within SS   df   F-ratio
X             0.536        2    0.007       7    256.163
Y             0.541        2    0.050       7    37.566
** TOTAL **   1.078        4    0.058       14

[Figure: scatter plot of Y against X, both axes 0.0 to 1.0]

Page 10:

K-means with three clusters

Cluster 1 of 3 contains 4 cases

Members             | Statistics
Case      Distance  | Variable  Minimum  Mean  Maximum  St.Dev.
Case 1    0.02      | X         0.41     0.45  0.49     0.04
Case 2    0.11      | Y         0.03     0.19  0.27     0.11
Case 3    0.06      |
Case 4    0.05      |

Cluster 2 of 3 contains 4 cases

Members             | Statistics
Case      Distance  | Variable  Minimum  Mean  Maximum  St.Dev.
Case 7    0.06      | X         0.11     0.15  0.19     0.03
Case 8    0.03      | Y         0.61     0.70  0.77     0.07
Case 9    0.02      |
Case 10   0.06      |

Cluster 3 of 3 contains 2 cases

Members             | Statistics
Case      Distance  | Variable  Minimum  Mean  Maximum  St.Dev.
Case 5    0.01      | X         0.77     0.77  0.78     0.01
Case 6    0.01      | Y         0.33     0.35  0.36     0.02

[Figure: scatter plot of Y against X, both axes 0.0 to 1.0]

Page 11:

Disadvantages of K-means

• Reaches an optimum, but not necessarily the global one

• Must choose number of clusters before analysis
  – How many clusters?

Page 12:

Example: Sperm whale codas

Patterned series of clicks:

|    |    |    |    |
 ic1  ic2  ic3  ic4

For 5-click codas: 681 x 4 data set
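To show where a data set with four columns comes from, here is a small sketch that turns click times into intervals. It assumes that ic1 to ic4 denote the four intervals between the five successive clicks of a coda; the click times below are invented for illustration and are not the whale data.

```python
def inter_click_intervals(click_times):
    """Return the n-1 intervals between n successive click times."""
    return [later - earlier for earlier, later in zip(click_times, click_times[1:])]

# Hypothetical click times (seconds) for two 5-click codas; made-up values
codas = [
    [0.00, 0.25, 0.50, 0.75, 1.00],
    [0.00, 0.10, 0.20, 0.30, 0.80],
]

# One row per coda, one column per interval: an (n codas) x 4 data matrix
data = [inter_click_intervals(c) for c in codas]
```

With 681 such codas the same step would give the 681 x 4 matrix mentioned on the slide.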

Page 13:

5-click codas:

|    |    |    |    |
 ic1  ic2  ic3  ic4

[Figure: scatter plot of ic4 against ic1]

[Figure: scatter plot of 2nd Principal Component against 1st Principal Component]

93% of variance in 2 PCs

Page 14:

5-click codas: K-means with 10 clusters

[Figure: scatter plot of 2nd Principal Component against 1st Principal Component]

[Figure: scatter plot of ic4 against ic1]

Page 15:

Hierarchical Cluster Analysis

• Usually represented by:
  – Dendrogram or tree diagram

[Figure: “Cluster Tree” dendrogram of Case 1 to Case 10; axis: Distances, 0.0 to 0.3]

Page 16:

Hierarchical Cluster Analysis

• Hierarchical Divisive Cluster Analysis

• Hierarchical Agglomerative Cluster Analysis

Page 17:

Hierarchical Divisive Cluster Analysis

• Starts with all units in one cluster, then successively splits them
  – Successive use of K-means, or some other divisive technique, with two clusters (n=2)
  – Either: each time, split the cluster with the greatest sum of squared distances
  – Or: split every cluster each time

• Hierarchical divisive techniques are good, but rarely used outside network analysis

[Figure: dendrogram of units 1, 14, 15, 4, 5, 11, 9, 19, 2, 12, 20; axis: Association index, 0.7 to 0.1]
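The "Either" rule on this slide (repeatedly split the cluster with the greatest sum of squared distances into two) can be sketched as follows. This is an illustration under stated assumptions: the two-way split uses a simple centroid-reassignment 2-means started from the two points farthest apart, which is one of several possible divisive steps, and the function names and toy points are invented.

```python
def centroid(pts):
    return [sum(col) / len(pts) for col in zip(*pts)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sum_sq(pts):
    """Sum of squared distances of points to their centroid."""
    c = centroid(pts)
    return sum(sq_dist(p, c) for p in pts)

def two_means(pts, iterations=20):
    """Split pts in two by reassigning points to the nearer of two centres."""
    # Deterministic start (assumption): seed with the two points farthest apart
    a, b = max(((p, q) for p in pts for q in pts), key=lambda pq: sq_dist(*pq))
    left, right = [], []
    for _ in range(iterations):
        new_left = [p for p in pts if sq_dist(p, a) <= sq_dist(p, b)]
        new_right = [p for p in pts if sq_dist(p, a) > sq_dist(p, b)]
        if not new_left or not new_right or new_left == left:
            break                        # degenerate split, or no change
        left, right = new_left, new_right
        a, b = centroid(left), centroid(right)
    return left, right

def divisive(pts, n_clusters):
    clusters = [list(pts)]               # start with all units in one cluster
    while len(clusters) < n_clusters:
        splittable = [c for c in clusters if len(c) > 1]
        if not splittable:
            break
        worst = max(splittable, key=sum_sq)   # greatest sum of squared distances
        left, right = two_means(worst)
        if not left or not right:
            break
        clusters.remove(worst)
        clusters.extend([left, right])
    return clusters

# Hypothetical toy data: three well-separated groups
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (21, 0)]
clusters = divisive(pts, 3)
```

Recording the order of the splits, rather than only the final clusters, would give the tree that a divisive dendrogram displays.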

Page 18:

Hierarchical Agglomerative Cluster Analysis

• Starts with each individual unit occupying its own cluster

• The clusters are then gradually merged until just one is left

• The most common cluster analyses

[Figure: dendrogram of units 1, 14, 15, 4, 5, 11, 9, 19, 2, 12, 20; axis: Association index, 0.7 to 0.1]

Page 19:

Hierarchical Agglomerative Cluster Analysis

Works on a dissimilarity matrix or a negative similarity matrix; may be Euclidean, Penrose, … distances

At each step:

1. There is a symmetric matrix of dissimilarities between clusters
2. The two clusters with the least dissimilarity are merged
3. The dissimilarity between the new (merged) cluster and all others is calculated

Different techniques do step 3 in different ways:
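The three steps can be sketched as one loop in which only the step-3 rule changes. The sketch below uses the five-unit dissimilarity matrix (A to E) from the slides that follow; the function name `agglomerate`, the dictionary layout, and naming a merged cluster by joining the names of its parts are choices made for this illustration. Passing `min` gives single linkage and `max` gives complete linkage; a rule that needs cluster sizes (such as a size-weighted average) would need a slightly richer interface.

```python
def agglomerate(names, d, update):
    """Repeatedly merge the two least-dissimilar clusters.

    d maps frozenset({x, y}) to the dissimilarity between clusters x and y.
    update(d_ik, d_jk) gives the dissimilarity between merged cluster i+j and k.
    Returns a list of (cluster_i, cluster_j, merge_height).
    """
    d = dict(d)                          # step 1: current dissimilarity matrix
    active = list(names)
    merges = []
    while len(active) > 1:
        # Step 2: find the pair of clusters with the least dissimilarity
        i, j = min(
            ((a, b) for n, a in enumerate(active) for b in active[n + 1:]),
            key=lambda pair: d[frozenset(pair)],
        )
        height = d[frozenset((i, j))]
        merged = i + j
        # Step 3: dissimilarity between the merged cluster and all others
        for k in active:
            if k not in (i, j):
                d[frozenset((merged, k))] = update(d[frozenset((i, k))], d[frozenset((j, k))])
        active = [k for k in active if k not in (i, j)] + [merged]
        merges.append((i, j, height))
    return merges

pairs = {
    ("A", "B"): 0.35, ("A", "C"): 0.45, ("B", "C"): 0.67, ("A", "D"): 0.11,
    ("B", "D"): 0.45, ("C", "D"): 0.57, ("A", "E"): 0.22, ("B", "E"): 0.56,
    ("C", "E"): 0.78, ("D", "E"): 0.19,
}
d = {frozenset(k): v for k, v in pairs.items()}

single = agglomerate("ABCDE", d, min)    # single linkage
complete = agglomerate("ABCDE", d, max)  # complete linkage
```

Both runs merge A and D first (at 0.11); they differ in the heights of the later merges, which is what changes the shape of the dendrogram.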

Page 20:

Hierarchical Agglomerative Cluster Analysis

     A     B     C     D     E
A    0     .     .     .     .
B    0.35  0     .     .     .
C    0.45  0.67  0     .     .
D    0.11  0.45  0.57  0     .
E    0.22  0.56  0.78  0.19  0

     AD    B     C     E
AD   0     .     .     .
B    ?     0     .     .
C    ?     0.67  0     .
E    ?     0.56  0.78  0

First link A and D. How to calculate the new dissimilarities?

Page 21:

Hierarchical Agglomerative Cluster Analysis: Single Linkage

     A     B     C     D     E
A    0     .     .     .     .
B    0.35  0     .     .     .
C    0.45  0.67  0     .     .
D    0.11  0.45  0.57  0     .
E    0.22  0.56  0.78  0.19  0

     AD    B     C     E
AD   0     .     .     .
B    0.35  0     .     .
C    ?     0.67  0     .
E    ?     0.56  0.78  0

d(AD,B)=Min{d(A,B), d(D,B)}
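As a quick check of the rule, the same minimum can be applied to the remaining entries of the AD row. The dictionary layout and variable names are illustrative; the numbers are the ones in the matrix on this slide.

```python
# Dissimilarities from A and from D to the other units (from the slide's matrix)
d = {
    ("A", "B"): 0.35, ("A", "C"): 0.45, ("A", "E"): 0.22,
    ("D", "B"): 0.45, ("D", "C"): 0.57, ("D", "E"): 0.19,
}

# Single linkage: the merged cluster is as close to k as its closest member
single_row = {k: min(d[("A", k)], d[("D", k)]) for k in "BCE"}
# single_row == {"B": 0.35, "C": 0.45, "E": 0.19}
```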

Page 22:

Hierarchical Agglomerative Cluster Analysis: Complete Linkage

     A     B     C     D     E
A    0     .     .     .     .
B    0.35  0     .     .     .
C    0.45  0.67  0     .     .
D    0.11  0.45  0.57  0     .
E    0.22  0.56  0.78  0.19  0

     AD    B     C     E
AD   0     .     .     .
B    0.45  0     .     .
C    ?     0.67  0     .
E    ?     0.56  0.78  0

d(AD,B)=Max{d(A,B), d(D,B)}
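Applying the same maximum rule to every entry of the AD row, using the values in the matrix on this slide, gives the following worked example:

```latex
\begin{aligned}
d(AD,B) &= \max\{d(A,B),\, d(D,B)\} = \max\{0.35,\, 0.45\} = 0.45 \\
d(AD,C) &= \max\{d(A,C),\, d(D,C)\} = \max\{0.45,\, 0.57\} = 0.57 \\
d(AD,E) &= \max\{d(A,E),\, d(D,E)\} = \max\{0.22,\, 0.19\} = 0.22
\end{aligned}
```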

Page 23:

Hierarchical Agglomerative Cluster Analysis: Average Linkage

     A     B     C     D     E
A    0     .     .     .     .
B    0.35  0     .     .     .
C    0.45  0.67  0     .     .
D    0.11  0.45  0.57  0     .
E    0.22  0.56  0.78  0.19  0

     AD    B     C     E
AD   0     .     .     .
B    0.40  0     .     .
C    ?     0.67  0     .
E    ?     0.56  0.78  0

d(AD,B)=Mean{d(A,B), d(D,B)}
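A small sketch of the averaging rule for the AD row. For two single units the plain mean on the slide is all that is needed. For later merges of clusters of unequal size, one common variant weights each dissimilarity by cluster size so that the result is the mean over all unit pairs; whether the slides intend that variant is an assumption here, and the function name `average_update` is invented.

```python
def average_update(d_ik, d_jk, n_i=1, n_j=1):
    """Size-weighted mean of the dissimilarities from clusters i and j to cluster k."""
    return (n_i * d_ik + n_j * d_jk) / (n_i + n_j)

# First merge (A and D are single units, so this is the plain mean)
d_ad_b = average_update(0.35, 0.45)
d_ad_c = average_update(0.45, 0.57)
d_ad_e = average_update(0.22, 0.19)

# Assumed next step: AD (2 units) merges with E (1 unit); distance to B is
# then the mean of d(A,B), d(D,B) and d(E,B)
d_ade_b = average_update(d_ad_b, 0.56, n_i=2, n_j=1)
```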

Page 24:

Hierarchical Agglomerative Cluster Analysis: Centroid Clustering

(uses data matrix, or true distance matrix)

     V1    V2    V3
A    0.11  0.75  0.33
B    0.35  0.99  0.41
C    0.45  0.67  0.22
D    0.11  0.71  0.37
E    0.22  0.56  0.78
F    0.13  0.14  0.55
G    0.55  0.90  0.21

V1(AD)=Mean{V1(A),V1(D)}

     V1    V2    V3
AD   0.11  0.73  0.35
B    0.35  0.99  0.41
C    0.45  0.67  0.22
E    0.22  0.56  0.78
F    0.13  0.14  0.55
G    0.55  0.90  0.21
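The centroid step can be reproduced directly from the data matrix on this slide. The sketch assumes Euclidean distance between units (the slide only says a data matrix or true distance matrix is used); the variable and function names are illustrative.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+

data = {
    "A": (0.11, 0.75, 0.33), "B": (0.35, 0.99, 0.41), "C": (0.45, 0.67, 0.22),
    "D": (0.11, 0.71, 0.37), "E": (0.22, 0.56, 0.78), "F": (0.13, 0.14, 0.55),
    "G": (0.55, 0.90, 0.21),
}

def centroid(rows):
    """Mean of each variable over the units in a cluster."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))

# The closest pair of units is merged first
closest = min(combinations(data, 2), key=lambda pair: dist(data[pair[0]], data[pair[1]]))

# The merged cluster is represented by the mean of its members on each variable
ad = centroid([data["A"], data["D"]])

# Distances from the new centroid to the remaining units
to_others = {k: dist(ad, v) for k, v in data.items() if k not in ("A", "D")}
```

Here `closest` is the pair A and D, and `ad` matches the AD row in the second table (0.11, 0.73, 0.35) up to floating-point rounding.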

Page 25:

Hierarchical Agglomerative Cluster Analysis: Ward’s Method

• Minimizes within-cluster sum of squares

• Similar to centroid clustering
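One way to see the link to centroid clustering: the increase in within-cluster sum of squares caused by merging two clusters depends only on their sizes and the squared distance between their centroids. The sketch below checks that identity numerically on three units from the centroid-clustering data matrix; the function names are made up for the illustration.

```python
def centroid(pts):
    return [sum(col) / len(pts) for col in zip(*pts)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sum_sq(pts):
    """Within-cluster sum of squared distances to the centroid."""
    c = centroid(pts)
    return sum(sq_dist(p, c) for p in pts)

def ward_cost(a, b):
    """Increase in total within-cluster SS if clusters a and b are merged."""
    n_a, n_b = len(a), len(b)
    return n_a * n_b / (n_a + n_b) * sq_dist(centroid(a), centroid(b))

cluster_ad = [(0.11, 0.75, 0.33), (0.11, 0.71, 0.37)]  # units A and D
cluster_b = [(0.35, 0.99, 0.41)]                       # unit B

cost = ward_cost(cluster_ad, cluster_b)
direct = sum_sq(cluster_ad + cluster_b) - sum_sq(cluster_ad) - sum_sq(cluster_b)
```

At each step Ward's method merges the pair of clusters with the smallest such cost, so `cost` and `direct` agree up to rounding.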

Page 26:

[Figure: “Average Linkage” dendrogram of units 1, 14, 15, 4, 5, 11, 9, 19, 2, 12, 20; axis: Association index, 0.7 to 0.1]

      1     2     4     5     9     11    12    14    15    19    20
1     1.00
2     0.00  1.00
4     0.53  0.00  1.00
5     0.18  0.05  0.00  1.00
9     0.22  0.09  0.13  0.25  1.00
11    0.36  0.00  0.17  0.40  0.33  1.00
12    0.00  0.37  0.18  0.00  0.13  0.00  1.00
14    0.74  0.00  0.30  0.20  0.23  0.17  0.00  1.00
15    0.53  0.00  0.30  0.00  0.36  0.00  0.26  0.56  1.00
19    0.00  0.00  0.17  0.21  0.43  0.32  0.29  0.09  0.09  1.00
20    0.04  0.00  0.17  0.00  0.14  0.10  0.35  0.00  0.18  0.25  1.00

Page 27:

[Figure: four dendrograms of the same 11 units; axis: Association index]

Single Linkage (axis 0.75 to 0.3); leaf order: 1, 14, 15, 4, 5, 11, 9, 19, 2, 12, 20

Average Linkage (axis 0.7 to 0.1); leaf order: 1, 14, 15, 4, 5, 11, 9, 19, 2, 12, 20

Complete Linkage (axis 0.7 to 0); leaf order: 1, 14, 15, 4, 2, 12, 5, 11, 9, 19, 20

Ward's (axis 0.8 to 0); leaf order: 1, 14, 15, 4, 2, 12, 20, 5, 11, 9, 19

Page 28:

Hierarchical Agglomerative Clustering Techniques

• Single Linkage
  – Produces “straggly” clusters
  – Not recommended if much experimental error
  – Used in taxonomy
  – Invariant to monotonic transformations of the dissimilarity measure

• Complete Linkage
  – Produces “tight” clusters
  – Not recommended if much experimental error
  – Invariant to monotonic transformations of the dissimilarity measure

• Average Linkage, Centroid, Ward’s
  – Most likely to mimic input clusters
  – Not invariant to transformations in dissimilarity measure

Page 29:

Cophenetic Correlation Coefficient (CCC)

• Correlation between the original dissimilarity matrix and the dissimilarities inferred from the cluster analysis

• CCC >~ 0.8 indicates a good match

• CCC <~ 0.8: dendrogram not a good representation
  – probably should not be displayed

• Use CCC to choose the best linkage method (highest coefficient)

      1     2     4     5     9     11    12    14    15    19    20
1     1.00
2     0.00  1.00
4     0.53  0.00  1.00
5     0.18  0.05  0.00  1.00
9     0.22  0.09  0.13  0.25  1.00
11    0.36  0.00  0.17  0.40  0.33  1.00
12    0.00  0.37  0.18  0.00  0.13  0.00  1.00
14    0.74  0.00  0.30  0.20  0.23  0.17  0.00  1.00
15    0.53  0.00  0.30  0.00  0.36  0.00  0.26  0.56  1.00
19    0.00  0.00  0.17  0.21  0.43  0.32  0.29  0.09  0.09  1.00
20    0.04  0.00  0.17  0.00  0.14  0.10  0.35  0.00  0.18  0.25  1.00

[Figure: “Average Linkage” dendrogram of units 1, 14, 15, 4, 5, 11, 9, 19, 2, 12, 20; axis: Association index, 0.7 to 0.1]
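A sketch of how a cophenetic correlation coefficient can be computed. To keep it short it uses the five-unit dissimilarity matrix (A to E) from the earlier linkage slides rather than the association matrix shown here, and it assumes average linkage computed as the mean over all unit pairs. The dissimilarity "inferred from the cluster analysis" for two units is taken to be the height at which they first join the same cluster. Function names are illustrative.

```python
from itertools import combinations
from math import sqrt

def cophenetic_average(names, d):
    """Average-linkage clustering; returns the height at which each pair first joins."""
    clusters = [(n,) for n in names]     # each unit starts in its own cluster
    coph = {}

    def cluster_dist(a, b):              # mean of all unit-to-unit dissimilarities
        return sum(d[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda ab: cluster_dist(*ab))
        height = cluster_dist(a, b)
        for x in a:
            for y in b:
                coph[frozenset((x, y))] = height
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return coph

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

pairs = {
    ("A", "B"): 0.35, ("A", "C"): 0.45, ("B", "C"): 0.67, ("A", "D"): 0.11,
    ("B", "D"): 0.45, ("C", "D"): 0.57, ("A", "E"): 0.22, ("B", "E"): 0.56,
    ("C", "E"): 0.78, ("D", "E"): 0.19,
}
d = {frozenset(k): v for k, v in pairs.items()}

coph = cophenetic_average("ABCDE", d)
keys = list(d)
ccc = pearson([d[k] for k in keys], [coph[k] for k in keys])
```

Repeating the calculation with other linkage rules and comparing the resulting coefficients is the comparison the last bullet describes.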

Page 30:

[Figure: four dendrograms of the same 11 units; axis: Association index]

Single Linkage (axis 0.75 to 0.3); leaf order: 1, 14, 15, 4, 5, 11, 9, 19, 2, 12, 20

Average Linkage (axis 0.7 to 0.1); leaf order: 1, 14, 15, 4, 5, 11, 9, 19, 2, 12, 20

Complete Linkage (axis 0.7 to 0); leaf order: 1, 14, 15, 4, 2, 12, 5, 11, 9, 19, 20

Ward's (axis 0.8 to 0); leaf order: 1, 14, 15, 4, 2, 12, 20, 5, 11, 9, 19

CCC values shown on the slide, in transcript order (the panel each belongs to is not recoverable from the transcript): CCC=0.83, CCC=0.75, CCC=0.77, CCC=0.80

Page 31:

Additive trees

• Dendrogram in which path lengths represent dissimilarities

• Computation quite complex (cross between agglomerative techniques and multidimensional scaling)

• Good when data are measured as dissimilarities

• Often used in taxonomy and genetics

[Figure: “Additive Tree” with tips A, B, C, D, E]

     A    B    C    D    E
A    .    .    .    .    .
B    14   .    .    .    .
C    6    12   .    .    .
D    81   7    13   .    .
E    17   1    6    16   .

Page 32:

Problems with Cluster Analysis

• Are there really biologically-meaningful clusters in the data?

• Does the dendrogram represent biological reality (web-of-life versus tree-of-life)?

• How many clusters to use?
  – stopping rules are arbitrary

• Which method to use?
  – best technique is data-dependent

• Dendrograms become messy with many units

Page 33:

[Figure: dendrogram with many unit labels; axis: Association index, 1 to 0]

Social Structure of 160 northern bottlenose whales

Page 34:

Clustering Techniques

Type                         Technique                           Use
Non-hierarchical             K-Means                             Dividing data sets
Hierarchical divisive        Repeated K-means                    Good technique on small data sets
                             Network methods...
Hierarchical agglomerative   Single linkage                      Taxonomy
                             Complete linkage                    Tighter clusters
                             Average linkage, Centroid, Ward’s   Usually preferred
Hierarchical                 Additive trees                      Excellent for displaying dissimilarity; taxonomy, genetics
