Pruning Nearest Neighbor Cluster Trees

Samory Kpotufe, Max Planck Institute for Intelligent Systems, Tuebingen, Germany
Joint work with Ulrike von Luxburg

Sep 30, 2020
We’ll discuss:

• An interesting notion of “clusters” (Hartigan 1982): clusters are regions of high density of the data distribution µ.

• The richness of k-NN graphs Gn: subgraphs of Gn encode the underlying cluster structure of µ.

• How to identify false cluster structures: a simple pruning procedure with strong guarantees (a first).

General motivation

More understanding of clustering

• Density yields intuitive (and clean) notion of clusters.

• Clusters take any shape =⇒ reveals complexity of clustering?

• Popular approaches (e.g. DBSCAN, single linkage) are density-based methods.

More understanding of k-NN graphs

These appear everywhere in various forms!

Outline

• Density-based clustering

• Richness of k-NN graphs

• Guaranteed removal of false clusters

Density-based clustering

Given: data from some unknown distribution.
Goal: discover “true” high-density regions.

Resolution matters!

Clusters at level λ are G(λ) ≡ the connected components (CCs) of the level set Lλ := {x : f(x) ≥ λ}.

The cluster tree of f is the infinite hierarchy {G(λ)}, over all levels λ ≥ 0.
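To make the definitions above concrete, the sketch below computes G(λ) on a grid for a hypothetical two-mode Gaussian mixture density, then sweeps λ upward to trace the cluster-tree profile; the mixture parameters, the grid, and the sampled levels are my own illustrative assumptions, not from the slides.

```python
import numpy as np

def mixture_pdf(x):
    """Density of an assumed equal-weight two-mode Gaussian mixture (means ±2, unit variance)."""
    return 0.5 * (np.exp(-(x + 2) ** 2 / 2) + np.exp(-(x - 2) ** 2 / 2)) / np.sqrt(2 * np.pi)

def clusters_at(f_vals, lam):
    """Connected components of the level set {x : f(x) >= lam}, as runs of grid indices."""
    above = np.flatnonzero(f_vals >= lam)
    if above.size == 0:
        return []
    breaks = np.flatnonzero(np.diff(above) > 1)   # gaps in the index runs mark component boundaries
    starts = np.r_[above[0], above[breaks + 1]]
    ends = np.r_[above[breaks], above[-1]]
    return list(zip(starts, ends))

grid = np.linspace(-6, 6, 1201)
f_vals = mixture_pdf(grid)
# Sweeping lam upward traces the cluster tree: one root component at low levels,
# splitting into two leaves once lam passes the density value at the valley (~0.054).
tree_profile = {lam: len(clusters_at(f_vals, lam)) for lam in (0.01, 0.05, 0.1, 0.18)}
```

Below the valley density the level set is a single interval (the root of the tree); above it, it splits into the two leaf clusters, which is exactly the nesting structure {G(λ)} encodes.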

Formal estimation problem:

Given: n i.i.d. samples X = {xi}i∈[n] from dist. with density f .

Clustering outputs: A hierarchy {Gn(λ)}λ≥0 of subsets of X.

We at least want consistency, i.e. for any λ > 0

P(disjoint A, A′ ∈ G(λ) are in disjoint empirical clusters) → 1.

A good procedure should satisfy:

Consistency!

Every level should be recovered for sufficiently large n.

Finite sample behavior:

• Fast discovery of real clusters.

• “No false clusters !!!”

Earlier example is sampled from a bi-modal mixture of Gaussians!!!

My visual procedure yields false clusters at low resolution.

What we’ll show:

k-NN graph guarantees

• Finite sample: Salient clusters recovered as subgraphs.

• Consistency: All clusters eventually recovered.

Generic pruning guarantees:

• Finite sample: No false clusters + salient clusters remain.

• Consistency: Pruned tree remains a consistent estimator.

What was known:

People you might look up:

Wasserman, Tsybakov, Wishart, Rinaldo, Nugent, Stuetzle, Rigollet, Wong, Lane, Dasgupta, Chaudhuri, Maier, von Luxburg, Steinwart ...

Consistency

• (fn → f) =⇒ (cluster tree of fn → cluster tree of f). But no known practical estimators.

• Various practical estimators of a single level set. Can these be extended to all levels at once?

• Recent: first consistent practical estimator (Chaudhuri and Dasgupta), a generalization of single linkage (by Wishart).

Empirical tree contains good clusters ... but which?

We need pruning guarantees!

Pruning

Consisted of removing small clusters!
Problem: not all false clusters are “small”!!

Outline

• Ground-truth: density-based clustering

• Richness of k-NN graphs

• Guaranteed removal of false clusters

Richness of k-NN graphs

k-NN density estimate: fn(x) := k / (n · vol(Bk,n(x))), where Bk,n(x) is the smallest ball centered at x containing k sample points.

Procedure: Remove Xi from Gn in increasing order of fn(Xi).

Level λ of the tree: Gn(λ) ≡ subgraph with Xi s.t. fn(Xi) ≥ λ.
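A minimal sketch of this construction in pure NumPy (the made-up two-blob sample, the choice k = 10, and the directed-edge convention for the k-NN graph are my assumptions; the slides do not fix these details):

```python
import numpy as np
from math import gamma, pi

def knn_density_and_graph(X, k):
    """k-NN density estimate f_n(X_i) = k / (n * vol(B_{k,n}(X_i))) and k-NN neighbor lists."""
    n, d = X.shape
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    order = np.argsort(dists, axis=1)               # column 0 is the point itself (distance 0)
    r_k = dists[np.arange(n), order[:, k]]          # distance to the k-th nearest neighbor
    unit_ball = pi ** (d / 2) / gamma(d / 2 + 1)    # volume of the unit d-ball
    f_n = k / (n * unit_ball * r_k ** d)
    return f_n, order[:, 1:k + 1]

def level_subgraph(f_n, neighbors, lam):
    """G_n(lam): keep the X_i with f_n(X_i) >= lam, plus the k-NN edges among them."""
    keep = np.flatnonzero(f_n >= lam)
    keep_set = set(keep.tolist())
    edges = [(int(i), int(j)) for i in keep for j in neighbors[i] if int(j) in keep_set]
    return keep, edges

# Assumed toy sample: two well-separated 1-d Gaussian blobs.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1, (100, 1)), rng.normal(2, 1, (100, 1))])
f_n, nbrs = knn_density_and_graph(X, k=10)
verts, edges = level_subgraph(f_n, nbrs, lam=np.median(f_n))
```

Removing points in increasing order of f_n, as the procedure prescribes, is the same as sweeping lam upward through the sorted values of f_n and taking `level_subgraph` at each step.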

Figure: sample from a two-mode mixture of Gaussians.

Theorem I:

Let log n ≲ k ≲ n^(1/O(d)).

[Figure: clusters A and A′ at level λ, separated by a valley S whose density dips ≳ 1/√k below λ and whose width is ≳ (k/(nλ))^(1/d).]

All such A ∩ X and A′ ∩ X belong to disjoint CCs of Gn(λ − O(1/√k)).

Assumptions: f(x) ≤ F and ∀x, x′: |f(x) − f(x′)| ≤ L‖x − x′‖^α.

Note on key quantities:

• 1/√k ≳ (density estimation error on the samples Xi).

• (k/(nλ))^(1/d) ≳ (k-NN distances of the Xi in Lλ).

[Figure: clusters A and A′ at level λ, separated by the valley S, annotated with the two quantities.]

Consistency: both quantities → 0, so eventually An ∩ A′n = ∅.
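A quick numeric check that the two quantities indeed shrink together (the schedule k(n) = ⌈log² n⌉, the dimension d = 2, and λ = 1 are my own illustrative choices, not from the slides):

```python
import math

def key_quantities(n, d=2, lam=1.0):
    """The two error terms from the slides: 1/sqrt(k) and (k/(n*lam))**(1/d)."""
    k = max(2, math.ceil(math.log(n) ** 2))   # an assumed schedule with log n << k << n
    return 1 / math.sqrt(k), (k / (n * lam)) ** (1 / d)

for n in (10**2, 10**4, 10**6, 10**8):
    est_err, knn_dist = key_quantities(n)
    print(f"n={n:>9}: 1/sqrt(k) = {est_err:.3f},  (k/(n*lam))^(1/d) = {knn_dist:.4f}")
```

Both columns decrease as n grows, which is exactly the consistency argument: the density-estimation error and the in-cluster k-NN distances vanish together.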

Main technicality: showing that A ∩ X remains connected in Gn(λ − O(1/√k)).

Cover a high-density path with balls {Bt}:
• the Bt’s have to be large, so that they contain points;
• the Bt’s have to be small, so that their points are connected.

So let each Bt have mass about k/n!

Outline

• Ground-truth: density-based clustering

• Richness of k-NN graphs

• Guaranteed removal of false clusters

Guaranteed removal of false clusters

Figure: sample from a two-mode mixture of Gaussians.

What are false clusters?

Intuitively:

An and A′n in X should be in one (empirical) cluster if they are in the same (true) cluster at every level containing An ∪ A′n.

Pruning intuition: key connecting points are missing!

Figure: sample from a two-mode mixture of Gaussians.

Pruning: connect Gn(0). Re-connect An, A′n in Gn(λn) if they are connected in Gn(λn − ε̃).

How do we set ε̃?
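The re-connection rule can be sketched with a small union-find routine; the explicit edge-list representation and the toy chain example are my assumptions, for illustration only:

```python
def pruned_clusters(f_n, edges, lam, eps):
    """Clusters at level lam after pruning: points at level lam end up in one
    cluster iff they are already connected in the relaxed graph G_n(lam - eps)."""
    parent = list(range(len(f_n)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Union within G_n(lam - eps): the lookahead eps re-connects components
    # whose separating "valley" is shallower than eps.
    for i, j in edges:
        if f_n[i] >= lam - eps and f_n[j] >= lam - eps:
            parent[find(i)] = find(j)

    # Report only the points that survive at level lam itself.
    clusters = {}
    for i, fi in enumerate(f_n):
        if fi >= lam:
            clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

# Toy chain 0-1-2-3-4 with a shallow density dip at point 2 (an assumed example).
f = [1.0, 0.9, 0.4, 0.9, 1.0]
chain = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(len(pruned_clusters(f, chain, lam=0.9, eps=0.0)))   # 2: the dip splits the level set
print(len(pruned_clusters(f, chain, lam=0.9, eps=0.6)))   # 1: the split is pruned as spurious
```

With eps = 0 the shallow dip produces two clusters at level 0.9; a lookahead larger than the dip merges them back, which is the behavior Theorem II quantifies.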

Theorem II:

Suppose ε̃ ≳ 1/√k.

• An and A′n belong to disjoint A and A′ in some G(λ).

• A ∩ X and A′ ∩ X belong to disjoint An and A′n of Gn(λ − O(1/√k)).

• (ε̃, k, n)-salient modes map 1-1 to leaves of empirical tree.

Consistency even after pruning: we just require ε̃ → 0 as n → ∞.
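A concrete schedule reconciling the two requirements can be sketched as follows. The form ε̃ = C/√k is an illustrative choice (the constant C and the function name are hypothetical, not from the talk): it meets the condition ε̃ ≳ 1/√k of Theorem II, and ε̃ → 0 whenever k grows with n.

```python
import math

def pruning_threshold(k, C=1.0):
    """Illustrative schedule: eps_tilde = C / sqrt(k).
    Satisfies eps_tilde >~ 1/sqrt(k) for C >= 1, and tends to 0
    as k -> infinity, so pruning remains consistent."""
    return C / math.sqrt(k)
```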


Some last technical points:

[Ch. and Das. 2010] seem to be the first to allow any cluster shape, besides mild requirements on envelopes of clusters.

We allow any cluster shape up to smoothness of f, and can explicitly relate empirical clusters to true clusters!


We have thus discussed:

• Density-based clustering (Hartigan 1982).

• The richness of k-NN graphs Gn: subgraphs of Gn consistently recover the cluster tree of µ.

• Guaranteed pruning of false clusters, while discovering salient clusters and maintaining consistency!


Thank you! ☺