Page 1:

Formal Foundations of Clustering

Margareta Ackerman

mackerma@caltech.edu

Work with Shai Ben-David, Simina Branzei, and David Loker

Page 2:

Clustering is one of the most widely used tools for exploratory data analysis.

Social Sciences, Biology, Astronomy, Computer Science, …

All apply clustering to gain a first understanding of the structure of large data sets.

The Theory-Practice Gap


Page 3:

“While the interest in and application of cluster analysis has been rising rapidly, the abstract nature of the tool is still poorly understood” (Wright, 1973)

“There has been relatively little work aimed at reasoning about clustering independently of any particular algorithm, objective function, or generative data model” (Kleinberg, 2002)

Both statements still apply today.

The Theory-Practice Gap

Page 4:

Clustering aims to assign data into groups of similar items

Beyond that, there is very little consensus on the definition of clustering


Inherent Obstacles: Clustering is ill-defined

Page 5:

• Clustering is inherently ambiguous
  – There may be multiple reasonable clusterings
  – There is usually no ground truth

• There are many clustering algorithms with different (often implicit) objective functions

• Different algorithms have radically different input-output behaviour


Inherent Obstacles

Page 6:

Differences in Input/Output Behavior of Clustering Algorithms

Page 7:

Differences in Input/Output Behavior of Clustering Algorithms

Page 8:

There are a wide variety of clustering algorithms, which can produce very different clusterings.


How should a user decide which algorithm to use for a given application?

Clustering Algorithm Selection

Page 9:

Users rely on cost-related considerations: running times, space usage, software purchasing costs, etc.

There is inadequate emphasis on input-output behaviour.


Clustering Algorithm Selection

Page 10:

We propose a framework that lets a user utilize prior knowledge to select an algorithm

• Identify properties that distinguish between different input-output behaviour of clustering paradigms

• The properties should be:
  1) Intuitive and “user-friendly”
  2) Useful for distinguishing clustering algorithms


Our Framework for Algorithm Selection

Page 11:

In essence, our goal is to understand fundamental differences between clustering methods, and convey them formally, clearly, and as simply as possible.


Our Framework for Algorithm Selection

Page 12:


Previous Work

• Axiomatic perspective
  • Impossibility Result: Kleinberg (NIPS, 2003)
  • Consistent axioms for quality measures: Ackerman & Ben-David (NIPS, 2009)
  • Axioms in the weighted setting: Wright (Pattern Recognition, 1973)

Page 13:


Previous Work

• Characterizations of Single-Linkage
  • Partitional Setting: Bosagh Zadeh and Ben-David (UAI, 2009)
  • Hierarchical Setting: Jarvis and Sibson (Mathematical Taxonomy, 1981) and Carlsson and Memoli (JMLR, 2010)

• Characterizations of Linkage-Based Clustering
  • Partitional Setting: Ackerman, Ben-David, and Loker (COLT, 2010)
  • Hierarchical Setting: Ackerman & Ben-David (IJCAI, 2011)

Page 14:


Previous Work

• Classifications of clustering methods
  • Fisher and Van Ness (Biometrika, 1971)
  • Ackerman, Ben-David, and Loker (NIPS, 2010)

Page 15:


What’s Left To Be Done?

Despite much work on clustering properties, some basic questions remain unanswered.

Consider some of the most popular clustering methods: k-means, single-linkage, average-linkage, etc…

• What are the advantages of k-means over other methods?

• Previous classifications are missing key properties.

Page 16:


Our Contributions (at a high level)

We identify 3 fundamental categories that clearly delineate some essential differences between common clustering methods.

The strength of these categories is in their simplicity.

We hope this gives insight into core differences between popular clustering methods.

To define these categories, we first present the weighted clustering setting.

Page 17:

Outline

• Formal framework
• Categories and classification
• A result from each category
• Conclusions and future work


Page 18:

Every element is associated with a real valued weight, representing its mass or importance.

Generalizes the notion of element duplication.

Algorithm design, particularly design of approximation algorithms, is often done in this framework.


Weighted Clustering

Page 19:

• Apply clustering to facility allocation, such as the placement of police stations in a new district.
• The distribution of stations should enable quick access to most areas in the district.


Other Reasons to Add Weight: An Example

• Accessibility of different institutions to a station may have varying importance.

• The weighted setting enables a convenient method for prioritizing certain landmarks.

Page 20:

Traditional clustering algorithms can be readily translated into the weighted setting by considering their behavior on data containing element duplicates.


Algorithms in the Weighted Clustering Setting
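As a concrete illustration (a minimal sketch of ours, not from the slides; it uses sklearn's KMeans purely as an example of a traditional unweighted algorithm, and the function name is hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_weighted_by_duplication(X, weights, k, seed=0):
    """Run an unweighted algorithm on weighted data by duplicating
    each element in proportion to its weight (integer weights >= 1)."""
    X = np.asarray(X, dtype=float)
    weights = np.asarray(weights, dtype=int)
    expanded = np.repeat(X, weights, axis=0)  # w(x) copies of each point
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(expanded)
    # The first copy of each original point carries its cluster label.
    first_copy = np.concatenate(([0], np.cumsum(weights)[:-1]))
    return labels[first_copy]
```

With unit weights this reduces to the usual algorithm; increasing a point's weight is exactly duplicating it.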

Page 21:

• For a finite domain set X, a weight function w: X → R+ defines the weight of every element.

• For a finite domain set X, a distance function d: X × X → R+ ∪ {0} defines the distance between pairs of domain points.

Formal Setting

Page 22:

(X,d) denotes unweighted data
(w[X],d) denotes weighted data

A Partitional Algorithm maps
Input: (w[X],d,k)
to
Output: a k-partition (k-clustering) of X

Formal Setting: Partitional Clustering Algorithm

Page 23:

A Hierarchical Algorithm maps
Input: (w[X],d)
to
Output: a dendrogram of X

A dendrogram of (X,d) is a strictly binary tree whose leaves correspond to elements of X.

A clustering C appears in A(w[X],d) if each of its clusters is the set of leaves of some node in the dendrogram.

Formal Setting: Hierarchical Clustering Algorithm
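Rendered as code, the two interfaces look roughly as follows (our own sketch; the type names are hypothetical):

```python
from typing import Callable, Dict, FrozenSet, Tuple, Union

Point = int                                  # an element of the domain X
Weights = Dict[Point, float]                 # w: X → R+
Distance = Callable[[Point, Point], float]   # d: X × X → R+ ∪ {0}

# Partitional: (w[X], d, k) → a k-clustering of X.
Clustering = FrozenSet[FrozenSet[Point]]
PartitionalAlgorithm = Callable[[Weights, Distance, int], Clustering]

# Hierarchical: (w[X], d) → a dendrogram of X, modeled here as a
# strictly binary tree whose leaves are points.
Dendrogram = Union[Point, Tuple["Dendrogram", "Dendrogram"]]
HierarchicalAlgorithm = Callable[[Weights, Distance], Dendrogram]
```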

Page 24:


Our Contributions

• We utilize the weighted framework to identify 3 fundamental categories, describing how algorithms respond to weight.

• Classify traditional algorithms according to these categories.

• Fully characterize when different algorithms react to weight.

Page 25:

PARTITIONAL:

Range(A(X, d, k)) = {C | ∃ w s.t. C = A(w[X], d, k)}

The set of clusterings that A outputs on (X, d) over all possible weight functions.

HIERARCHICAL:

Range(A(X, d)) = {D | ∃ w s.t. D = A(w[X], d)}

The set of dendrograms that A outputs on (X, d) over all possible weight functions.

Towards Basic Categories: Range(X,d)
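Range quantifies over all weight functions, so it cannot be enumerated exactly, but it can be probed empirically. A sketch under our own conventions: algo takes (weights, distance matrix, k) and returns a hashable clustering.

```python
import numpy as np

def sampled_range(algo, D, k, trials=200, seed=0):
    """Collect the distinct clusterings algo outputs on (X, d, k) over
    randomly sampled weight functions: a lower bound on |Range(A(X,d,k))|."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    outputs = set()
    for _ in range(trials):
        w = rng.uniform(0.1, 10.0, size=n)  # one random weight function
        outputs.add(algo(w, D, k))          # e.g. a frozenset of frozensets
    return outputs
```

Finding more than one output certifies |Range(X,d)| > 1 on that data set; finding only one is merely suggestive, since Range ranges over all weight functions.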

Page 26:

Outline

• Formal framework
• Categories and classification
• A result from each category
• Conclusions and future work


Page 27:


Categories: Weight Robust

A is weight-robust if for all (X, d), |Range(X,d)| = 1.

A never responds to weight.

Page 28:


Categories: Weight Sensitive

A is weight-sensitive if for all (X, d), |Range(X,d)| > 1.

A always responds to weight.

Page 29:


Categories: Weight Considering

An algorithm A is weight-considering if
1) There exists (X, d) where |Range(X,d)| = 1.
2) There exists (X, d) where |Range(X,d)| > 1.

A responds to weight on some data sets, but not others.

Page 30:

PARTITIONAL: Range(A(X, d, k)) = {C | ∃ w such that A(w[X], d, k) = C}
HIERARCHICAL: Range(A(X, d)) = {D | ∃ w such that A(w[X], d) = D}

Weight-robust: for all (X, d), |Range(X,d)| = 1.

Weight-sensitive: for all (X, d), |Range(X,d)| > 1.

Weight-considering:
1) ∃ (X, d) where |Range(X,d)| = 1.
2) ∃ (X, d) where |Range(X,d)| > 1.

Summary of Categories

Page 31:


In the facility allocation example above, a weight-sensitive algorithm may be preferred.

Connecting To Applications

In phylogeny, where sampling procedures can be highly biased, some degree of weight robustness may be desired.

The desired category depends on the application.

Page 32:

Weight Robust: Min Diameter, K-center (partitional); Single Linkage, Complete Linkage (hierarchical)

Weight Sensitive: K-means, k-medoids, k-median, min-sum (partitional); Ward’s Method, Bisecting K-means (hierarchical)

Weight Considering: Ratio Cut (partitional); Average Linkage (hierarchical)

Classification

For the weight-considering algorithms, we fully characterize when they are sensitive to weight.

Page 33:

Outline

• Formal framework
• Categories and classification
• A result from each category
• Classification of heuristics
• Conclusions and future work


Page 34:

Weight Robust: Min Diameter, K-center (partitional); Single Linkage, Complete Linkage (hierarchical)

Weight Sensitive: K-means, k-medoids, k-median, min-sum (partitional); Ward’s Method, Bisecting K-means (hierarchical)

Weight Considering: Ratio Cut (partitional); Average Linkage (hierarchical)

Classification

Page 35:


Zooming Into: Weight Sensitive Algorithms

We show that k-means is weight-sensitive.

A is weight-separable if for any data set (X, d) and subset S of X with at most k points, ∃ w so that A(w[X],d,k) separates all points of S.

Fact: Every algorithm that is weight-separable is also weight-sensitive.

Page 36:


K-means is Weight-Sensitive

Proof:
• Show that k-means is weight-separable.
• Consider any (X,d) and S ⊂ X with at most k points.
• Increase the weight of the points in S until each belongs to a distinct cluster.

Theorem: k-means is weight-sensitive.
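The proof idea is easy to watch in practice: sklearn's KMeans accepts a sample_weight argument, so boosting the weights of a two-point subset S forces its points into distinct clusters (an illustrative sketch; the data and weights are our own):

```python
import numpy as np
from sklearn.cluster import KMeans

# Three points on a line; with k = 2 and uniform weights, the two
# nearby points {x0, x1} share a cluster and x2 sits alone.
X = np.array([[0.0], [1.0], [10.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
print(km.fit_predict(X, sample_weight=[1, 1, 1]))        # e.g. [0, 0, 1]

# Boost S = {x0, x1}: any clustering that keeps them together now pays
# a within-cluster cost growing linearly in the boosted weight, so the
# optimal weighted k-means solution separates them.
print(km.fit_predict(X, sample_weight=[1000, 1000, 1]))  # e.g. [0, 1, 1]
```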

Page 37:

• We show that Average-Linkage is Weight Considering.

• Characterize the precise conditions under which it is sensitive to weight.


Zooming Into: Weight Considering Algorithms

Recall: An algorithm A is weight-considering if
1) There exists (X, d) where |Range(X,d)| = 1.
2) There exists (X, d) where |Range(X,d)| > 1.

Page 38:


Average Linkage

• Average-Linkage is a hierarchical algorithm.
• It starts by creating a leaf for every element.
• It then repeatedly merges the “closest” clusters using the following linkage function:

Average weighted distance between clusters
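One natural weighted reading of that linkage function (our own rendering of “average weighted distance between clusters”; the paper’s exact notation may differ):

```python
from itertools import product

def average_weighted_linkage(A, B, w, d):
    """The w-weighted mean of d(a, b) over all pairs a in A, b in B.
    With unit weights this is plain average linkage."""
    total = sum(w[a] * w[b] for a, b in product(A, B))
    return sum(w[a] * w[b] * d(a, b) for a, b in product(A, B)) / total
```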

Page 39:

Average Linkage is Weight Considering

(X,d) where |Range(X,d)| = 1:

The same dendrogram is output for every weight function.

(Figure: four points A, B, C, D and the dendrogram Average Linkage outputs on them.)

Page 40:

Average Linkage is Weight Considering

(X,d) where |Range(X,d)| > 1:

(Figure: five points A, B, C, D, E with pairwise distances 1, 1+ϵ, and 2+2ϵ; different weight functions yield different dendrograms over A, B, C, D, E.)

Page 41:


When is Average Linkage Sensitive to Weight?

We showed that Average-Linkage is weight-considering.

Can we show when it is sensitive to weight?

We provide a complete characterization of when Average-Linkage is sensitive to weight, and when it is not.

Page 42:


A clustering is nice if every point is closer to all points within its cluster than to all other points.

(Figure: a nice clustering.)

Nice Clustering
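This definition translates directly into a check (a sketch of ours; clustering is a list of point-index sets and D a symmetric distance matrix):

```python
def is_nice(clustering, D):
    """True iff every point is strictly closer to every point in its own
    cluster than to every point outside it."""
    for C in clustering:
        outside = [z for C2 in clustering if C2 is not C for z in C2]
        for x in C:
            within = max((D[x][y] for y in C if y != x), default=0.0)
            if outside and within >= min(D[x][z] for z in outside):
                return False
    return True
```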

Page 43:


A clustering is nice if every point is closer to all points within its cluster than to all other points.

(Figure: another nice clustering.)

Nice Clustering

Page 44:


A clustering is nice if every point is closer to all points within its cluster than to all other points.

(Figure: a clustering that is not nice.)

Nice Clustering

Page 45:


Theorem: |Range(AL(X,d))| = 1 if and only if (X,d) has a nice dendrogram.

A dendrogram is nice if all of its clusterings are nice.

Characterizing When Average Linkage is Sensitive to Weight

Page 46:


Characterizing When Average Linkage is Sensitive to Weight: Proof

Proof: Show that:
1) If there is a nice dendrogram for (X,d), then Average-Linkage outputs it.
2) If a clustering that is not nice appears in dendrogram AL(w[X],d) for some w, then |Range(AL(X,d))| > 1.

Theorem: |Range(AL(X,d))| = 1 if and only if (X,d) has a nice dendrogram.

Page 47:


Characterizing When Average Linkage is Sensitive to Weight: Proof (cont.)

Lemma: If there is a nice dendrogram for (X,d), then Average-Linkage outputs it.

Proof Sketch:
1) Assume that (w[X],d) has a nice dendrogram.
2) Main idea: Show that every nice clustering of the data appears in AL(w[X],d).
3) For that, we show that each cluster in a nice clustering is formed by the algorithm.

Page 48:


Given a nice clustering C, it can be shown that for any clusters Ci and Cj of C, any disjoint subsets Y and Z of Ci, and any subset W of Cj, Y and Z are closer than Y and W.

This implies that C appears in the dendrogram.

Characterizing When Average Linkage is Sensitive to Weight: Proof (cont.)

Page 49:


Proof:
• Since C is not nice, there exist points x, y, and z such that
  • x and y belong to the same cluster in C
  • x and z belong to different clusters
  • yet d(x,z) < d(x,y)
• If x, y, and z are made sufficiently heavier than all other points, then x and z will be merged before x and y, so C will not be formed.

Lemma: If a clustering C that is not nice appears in AL(w[X],d) for some w, then |Range(AL(X,d))| > 1.

Characterizing When Average Linkage Responds to Weight: Proof (cont.)

Page 50:


Characterizing When Average Linkage is Sensitive to Weight

Average Linkage is robust to weight whenever there is a dendrogram of (X,d) consisting of only nice clusterings, and it is sensitive to weight otherwise.

Theorem: |Range(AL(X,d))| = 1 if and only if (X,d) has a nice dendrogram.

Page 51:


Zooming Into: Weight Robust Algorithms

These algorithms are invariant to element duplication.

Ex. Min-Diameter returns a clustering that minimizes the length of the longest within-cluster edge.

As this quantity is not affected by the number of points (or weight) at any location, Min-Diameter is weight robust.
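The invariance shows up directly in the objective (a sketch, using the same conventions as the niceness check above): the weight function simply never appears.

```python
from itertools import combinations

def max_diameter(clustering, D):
    """Min-Diameter's objective: the longest within-cluster distance.
    No term depends on weights, so reweighting (or duplicating) points
    cannot change which clusterings minimize it."""
    return max(
        (D[x][y] for C in clustering for x, y in combinations(C, 2)),
        default=0.0,
    )
```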

Page 52:

Outline

• Introduce framework
• Present categories and classification
• Show several results from different categories
• Conclusions and future work


Page 53:

Conclusions

• We introduced three basic categories describing how algorithms respond to weights

• We characterize the precise conditions under which algorithms respond to weights

• The same results apply in the non-weighted setting for data duplicates

• This classification can be used to help select clustering algorithms for specific applications

Page 54:

• Capture differences between objective functions similar to k-means (ex. k-medians, k-medoids, min-sum)

• Show bounds on the size of the Range of weight considering and weight sensitive methods

• Analyze clustering algorithms for categorical data
• Analyze clustering algorithms with a noise bucket
• Identify properties that are significant for specific clustering applications (some previous work in this direction by Ackerman, Brown, and Loker (ICCABS, 2012))

Future Directions