ECML PKDD September –, The Minimum Code Length for Clustering Using the Gray Code Mahito SUGIYAMA †,‡ Akihiro YAMAMOTO Kyoto University JSPS Research Fellow /

May 14, 2020


ECML PKDD September –,

The Minimum Code Length for Clustering Using the Gray Code

Mahito SUGIYAMA†,‡, Akihiro YAMAMOTO†

† Kyoto University, ‡ JSPS Research Fellow

Contributions

1. The MCL (Minimum Code Length)
  – A new measure to score clustering results
  – The code length needed to distinguish each cluster under some fixed encoding scheme for real-valued variables
2. COOL (COding-Oriented cLustering)
  – A general clustering approach
  – Always finds the best clusters (i.e., the globally optimal solution), which minimize the MCL, in O(nd)
  – Parameter tuning is not needed
3. G-COOL (COOL with the Gray code)
  – Achieves internal cohesion and external isolation
  – Finds arbitrarily shaped clusters

Demonstration (Synthetic Dataset)

[Figure: synthetic dataset plotted on the unit square; both axes range from 0 to 1]

[Figure: clustering result of G-COOL on the same dataset]

[Figure: clustering result of K-means on the same dataset]

Results (Real datasets)

[Figure: original images and their binary-filtered versions for Delta, Dragon, Europe, Norway, and Ganges]

[Figure: clustering results of G-COOL and K-means for Delta, Dragon, Europe, Norway, and Ganges]

Outline

1. Overview
2. Background and Our Strategy
3. MCL and Clustering
4. COOL Algorithm
5. G-COOL: COOL with the Gray Code
6. Experiments
7. Conclusion

Clustering Focusing on Compression

• The MDL approach [Kontkanen et al., ]
  – Data encoding has to be optimized
    ∘ All encoding schemes are (implicitly) considered
    ∘ The time complexity is ⩾ O(n²)
• The Kolmogorov complexity approach [Cilibrasi, ]
  – Measures the distance between data points based on compression of finite sequences
    ∘ Difficult to apply to multivariate data
  – The actual clustering process is traditional agglomerative hierarchical clustering
    ∘ The time complexity is ⩾ O(n²)
• Neither approach is suitable for massive data

Our Strategy

• Requirements:
  1. Fast, and linear in the data size
  2. Robust to changes in input parameters
  3. Can find arbitrarily shaped clusters
• Solutions:
  1. Fix an encoding scheme for continuous variables
    – Motivated by Computable Analysis [Weihrauch, ]
  2. Clustering = discretizing real-valued data
    – Always finds the best results w.r.t. the MCL
  3. Use the Gray code for real numbers [Tsuiki, ]
    – Discretized data points overlap, and adjacent clusters are merged


MCL (Minimum Code Length)

• The MCL is the code length of the maximally compressed clusters under a fixed encoding scheme
• The MCL is calculated in O(nd) by using radix sort
  – n and d are the number of data points and the dimension, resp.

Example: X = {0.1, 0.2, 0.8, 0.9},
  𝒞1 = {{0.1, 0.2}, {0.8, 0.9}}
  𝒞2 = {{0.1}, {0.2, 0.8}, {0.9}}
  – Use binary encoding
  – Which is preferred?

Binary Encoding

[Figure: binary encoding of points in [0, 1]; the horizontal axis is the value, the vertical axis is the bit position (0 to 4)]

value   binary code
0.1     00011...
0.2     00110...
0.8     11001...
0.9     11100...
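
As a rough illustration of the binary encoding shown above, the following Python sketch computes the first k bits of the binary expansion of a value in [0, 1]. The function name `binary_prefix` and the clamping of x = 1 into the last cell are assumptions made for this example; they do not come from the slides.

```python
def binary_prefix(x: float, k: int) -> str:
    """Return the first k bits (k >= 1) of the binary expansion of x in [0, 1]."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    # Scale x onto a grid of 2**k cells; clamp so that x = 1.0 falls into
    # the last cell (an assumption of this sketch).
    cell = min(int(x * (1 << k)), (1 << k) - 1)
    return format(cell, f"0{k}b")


if __name__ == "__main__":
    for value in (0.1, 0.2, 0.8, 0.9):
        print(value, binary_prefix(value, 5))
```

The four values are the ones used in the example data set X; their 5-bit prefixes can be compared with the codes listed on the slide.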

MCL with Binary Encoding

[Figure: four data points A, B, C, D on the interval [0, 1] with ticks at 0, 0.25, 0.5, 0.75, and 1, together with a table of ids and values; the interval is split level by level into binary prefixes (Lv. 1: 0, 1; Lv. 2: 00, 01, 10, 11; Lv. 3: 000, 001, 110, 111)]

• Clusters separated at level 1 (prefixes 0 and 1): MCL = 1 + 1 = 2
• Clusters separated at level 3 (prefixes 000, 001, 110, 111): MCL = 3 ⋅ 4 = 12

Definition of MCL

• Fix an embedding γ : ℝ^d → Σ^ω (usually Σ = {0, 1})
• For p ∈ range(γ) and P ⊂ range(γ), define

\[
\Phi(p \mid P) := \{\, w \in \Sigma^{*} \mid p \in {\uparrow}w \text{ and } P \cap {\uparrow}v = \emptyset \text{ for all } v \text{ such that } |v| = |w| \text{ and } p \in {\uparrow}v \,\}
\]

  – Each element in Φ(p ∣ P) is a prefix that discriminates p from P
• For a partition 𝒞 = {C1, …, CK} of a data set X,

\[
\mathrm{MCL}(\mathcal{C}) := \sum_{i \in \{1, \dots, K\}} L_i(\mathcal{C}), \quad \text{where} \quad
L_i(\mathcal{C}) := \min\Bigl\{\, |W| \Bigm| \gamma(C_i) \subseteq {\uparrow}W \text{ and } W \subseteq \bigcup_{x \in C_i} \Phi\bigl(\gamma(x) \mid \gamma(X \setminus C_i)\bigr) \Bigr\}
\]
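
To make the definition more concrete, here is a minimal Python sketch that evaluates the MCL of a partition of one-dimensional points under the plain binary encoding (no ⊥ symbols). It is an illustration only: the names `binary_prefix` and `mcl`, the `max_depth` cut-off, and the strategy of taking each point's shortest discriminating prefix are assumptions of this sketch, not the O(nd) radix-sort procedure mentioned in the slides.

```python
def binary_prefix(x, k):
    """First k bits of the binary expansion of x in [0, 1]; empty for k = 0."""
    if k == 0:
        return ""
    return format(min(int(x * (1 << k)), (1 << k) - 1), f"0{k}b")


def mcl(partition, max_depth=32):
    """MCL of a partition of 1-D points under binary encoding.

    For every point we take the shortest prefix whose cylinder contains no
    point of another cluster; a prefix shared by several points of the same
    cluster is counted once.
    """
    total = 0
    for i, cluster in enumerate(partition):
        others = [y for j, c in enumerate(partition) if j != i for y in c]
        prefixes = set()
        for x in cluster:
            for k in range(max_depth + 1):
                w = binary_prefix(x, k)
                if all(binary_prefix(y, k) != w for y in others):
                    prefixes.add(w)
                    break
            else:
                raise ValueError(f"{x} is not separable within {max_depth} bits")
        total += sum(len(w) for w in prefixes)
    return total


if __name__ == "__main__":
    c1 = [[0.1, 0.2], [0.8, 0.9]]
    c2 = [[0.1], [0.2, 0.8], [0.9]]
    print(mcl(c1), mcl(c2))
```

Applied to the example partitions 𝒞1 and 𝒞2 of X = {0.1, 0.2, 0.8, 0.9}, the sketch reproduces the values 2 and 12 shown on the "MCL with Binary Encoding" slides, so 𝒞1 is preferred.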

Minimizing MCL and Clustering

Clustering under the MCL criterion means finding the globally optimal solution that minimizes the MCL

  – Find 𝒞op such that

\[
\mathcal{C}_{\mathrm{op}} \in \operatorname*{argmin}_{\mathcal{C} \in \mathcal{C}(X)_{\geqslant K}} \mathrm{MCL}(\mathcal{C}),
\]

    where 𝒞(X)⩾K = {𝒞 is a partition of X ∣ #𝒞 ⩾ K}

• We give the lower bound K on the number of clusters as an input parameter
  – Without this constraint, 𝒞op becomes the trivial partition {X}
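
The optimization problem above can be illustrated by brute force on a tiny data set. The sketch below enumerates every partition with at least K clusters and keeps one with the smallest MCL under binary encoding. It is deliberately naive (exponential in the number of points) and is not COOL itself; `partitions`, `mcl`, and `brute_force_optimum` are hypothetical helper names introduced for this example.

```python
def binary_prefix(x, k):
    """First k bits of the binary expansion of x in [0, 1]; empty for k = 0."""
    if k == 0:
        return ""
    return format(min(int(x * (1 << k)), (1 << k) - 1), f"0{k}b")


def mcl(partition, max_depth=32):
    """MCL of a partition of 1-D points under binary encoding (naive sketch)."""
    total = 0
    for i, cluster in enumerate(partition):
        others = [y for j, c in enumerate(partition) if j != i for y in c]
        prefixes = set()
        for x in cluster:
            for k in range(max_depth + 1):
                w = binary_prefix(x, k)
                if all(binary_prefix(y, k) != w for y in others):
                    prefixes.add(w)  # shortest prefix free of other clusters
                    break
            else:
                raise ValueError(f"{x} is not separable within {max_depth} bits")
        total += sum(len(w) for w in prefixes)
    return total


def partitions(items):
    """Yield every partition of a list as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(smaller)):
            yield smaller[:i] + [[first] + smaller[i]] + smaller[i + 1:]
        # ... or into a new block of its own.
        yield [[first]] + smaller


def brute_force_optimum(points, K):
    """Return a partition with at least K clusters that minimizes the MCL."""
    candidates = (p for p in partitions(list(points)) if len(p) >= K)
    return min(candidates, key=mcl)


if __name__ == "__main__":
    best = brute_force_optimum([0.1, 0.2, 0.8, 0.9], K=2)
    print(sorted(sorted(block) for block in best), mcl(best))
```

For X = {0.1, 0.2, 0.8, 0.9} and K = 2, the search returns {{0.1, 0.2}, {0.8, 0.9}} with MCL 2, which matches the partition preferred in the earlier example.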


Optimization by COOL

• COOL solves the optimization problem in O(nd)
  – n and d are the number of data points and the dimension, resp.
    ∘ The naïve approach takes exponential time and space
  – The process of computing the MCL becomes the clustering process itself via discretization
• COOL is level-wise: it builds the level-k partition 𝒞k for k = 1, 2, …, which satisfies the following condition:
  – For all x, y ∈ X, x and y are in the same cluster ⟺ v = w for some v ⊏ γ(x) and w ⊏ γ(y) with |v| = |w| = k
    ∘ Level-k partitions form a hierarchy
    ∘ For C ∈ 𝒞k, there exists 𝒟 ⊆ 𝒞k+1 such that ⋃𝒟 = C
• For all C ∈ 𝒞op, there exists k such that C ∈ 𝒞k
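
The level-k partition can be sketched for one-dimensional data under binary encoding by grouping points on their common k-bit prefix. The coordinates assigned to A to E below are hypothetical, chosen only so that the prefixes agree with the codes shown on the "COOL with Binary Encoding" slides; `level_partition` is likewise an assumed helper name.

```python
from collections import defaultdict


def binary_prefix(x, k):
    """First k bits (k >= 1) of the binary expansion of x in [0, 1]."""
    cell = min(int(x * (1 << k)), (1 << k) - 1)
    return format(cell, f"0{k}b")


def level_partition(points, k):
    """Group labelled points by their common k-bit binary prefix."""
    groups = defaultdict(list)
    for name, x in points.items():
        groups[binary_prefix(x, k)].append(name)
    return dict(groups)


# Hypothetical coordinates, used for illustration only.
POINTS = {"A": 0.1, "B": 0.3, "C": 0.55, "D": 0.7, "E": 0.9}

if __name__ == "__main__":
    for level in (1, 2, 3):
        print(level, level_partition(POINTS, level))
```

Because every (k+1)-bit prefix extends exactly one k-bit prefix, each level-k cluster is a union of level-(k+1) clusters, which is the hierarchy property stated above.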

COOL with Binary Encoding

[Figure: five data points A, B, C, D, E on the interval [0, 1] with ticks at 0, 0.25, 0.5, 0.75, and 1; the interval is split level by level into binary prefixes]

id   Lv. 1   Lv. 2   Lv. 3
A    0       00      00
B    0       01      01
C    1       10      100
D    1       10      101
E    1       11      11

• Level 1: MCL = 1 + 1 = 2
• Level 2: MCL = 2 ⋅ 4 = 8
• Level 3: MCL = 6 + 6 = 12

Noise Filtering by COOL

• Noise filtering is easily implemented in COOL
• Define 𝒞⩾N ≔ {C ∈ 𝒞 ∣ #C ⩾ N} for a partition 𝒞
  – A cluster C is regarded as noise if #C < N
• Example: Given 𝒞 = {{0.1}, {0.4, 0.5, 0.6}, {0.9}}
  – 𝒞⩾2 = {{0.4, 0.5, 0.6}}, and 0.1 and 0.9 are noise
• The lower bound N on the cluster size is given as an input parameter

[Figure: data points on the interval [0, 1] with ticks at 0, 0.25, 0.5, 0.75, and 1, filtered with N = 2]
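
The filtering rule 𝒞⩾N can be written in a few lines. The following Python sketch is one possible reading of it; the function name `filter_noise` and the representation of clusters as sets are assumptions of this example.

```python
def filter_noise(partition, N):
    """Split a partition into clusters with at least N points and noise points."""
    kept = [set(c) for c in partition if len(c) >= N]
    noise = {x for c in partition if len(c) < N for x in c}
    return kept, noise


if __name__ == "__main__":
    kept, noise = filter_noise([{0.1}, {0.4, 0.5, 0.6}, {0.9}], N=2)
    print(kept)   # clusters that survive the filter
    print(noise)  # points regarded as noise
```

With the partition from the slide and N = 2, only {0.4, 0.5, 0.6} is kept, and 0.1 and 0.9 are reported as noise.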

Algorithm of COOL

Input: A data set X, two lower bounds K and N
Output: The optimal partition 𝒞op and noises

function COOL(X, K, N)
  Find partitions 𝒞^1_{⩾N}, …, 𝒞^m_{⩾N} such that ‖𝒞^{m−1}_{⩾N}‖ < K ⩽ ‖𝒞^m_{⩾N}‖
  (𝒞op, MCL(𝒞op)) ← FINDCLUSTERS(X, K, {𝒞^1_{⩾N}, …, 𝒞^m_{⩾N}})
  return (𝒞op, X ⧵ ⋃𝒞op)

function FINDCLUSTERS(X, K, {𝒞^1, …, 𝒞^m})
  Find k such that ‖𝒞^{k−1}‖ < K and ‖𝒞^k‖ ⩾ K
  𝒞op ← 𝒞^k
  if K = 2 then return (𝒞op, MCL(𝒞op))
  for each C in 𝒞^1 ∪ … ∪ 𝒞^{k−1}
    (𝒞, L) ← FINDCLUSTERS(X ⧵ C, K − 1, {𝒞^1, …, 𝒞^k})
    if MCL(𝒞 ∪ C) < MCL(𝒞op) then 𝒞op ← C ∪ 𝒞
  return (𝒞op, MCL(𝒞op))


Gray Code

• Real numbers in [0, 1] are encoded with 0, 1, and ⊥
  – Binary: 0.1 → 00011…, 0.25 → 00111…
  – Gray: 0.1 → 00010…, 0.25 → 0⊥100…
• Originally, another binary encoding of natural numbers
  – Especially important in applications of conversion between analog and digital information [Knuth, ]

The Gray code embedding is an injection γG that maps x ∈ [0, 1] to an infinite sequence p0p1p2…, where

\[
p_i := \begin{cases}
1 & \text{if } 2^{-i}m - 2^{-(i+1)} < x < 2^{-i}m + 2^{-(i+1)} \text{ for an odd } m,\\
0 & \text{if the same holds for an even } m,\\
\bot & \text{if } x = 2^{-i}m - 2^{-(i+1)} \text{ for some integer } m.
\end{cases}
\]

  – For a vector x = (x1, …, xd), γG(x) = p^1_1 … p^d_1 p^1_2 … p^d_2 …
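
The definition of the Gray code embedding can be turned into a short Python sketch that returns the first k symbols p0 … p(k−1) of γG(x). Exact rational arithmetic is used so that the boundary case ⊥ is detected reliably; the function name `gray_prefix` and the use of `fractions.Fraction` are choices made for this illustration, not something taken from the slides.

```python
from fractions import Fraction
from math import floor

BOTTOM = "⊥"


def gray_prefix(x, k):
    """Return the first k symbols p0 p1 ... of the Gray code of x in [0, 1]."""
    x = Fraction(x)  # exact value, so boundary points are detected reliably
    symbols = []
    for i in range(k):
        t = x * 2**i + Fraction(1, 2)
        if t.denominator == 1:
            # x = 2^-i * m - 2^-(i+1) for the integer m = t: undefined symbol
            symbols.append(BOTTOM)
        else:
            # x lies strictly inside the interval centred at 2^-i * m
            m = floor(t)
            symbols.append("1" if m % 2 else "0")
    return "".join(symbols)


if __name__ == "__main__":
    for value in (Fraction(1, 10), Fraction(1, 4), Fraction(4, 5)):
        print(value, gray_prefix(value, 5))
```

For 0.1, 0.25, and 0.8 the sketch yields 00010, 0⊥100, and 10101, the same prefixes as on the Gray code slides.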

Gray Code Embedding

[Figure: Gray code embedding of points in [0, 1]; the horizontal axis is the value, the vertical axis is the bit position]

value   Gray code
0.1     00010...
0.25    0⊥100...
0.8     10101...

COOL with Gray Code (G-COOL)

[Figure: five data points A, B, C, D, E on the interval [0, 1] with ticks at 0, 0.25, 0.5, 0.75, and 1; at each level the interval is covered by overlapping Gray code prefixes (Lv. 1: 0, 1, ⊥1; Lv. 2: 00, 01, 10, 11, 0⊥1, ⊥10, 1⊥1)]

id   Lv. 1    Lv. 2
A    0        00
B    0, ⊥1    01, ⊥10
C    1, ⊥1    10, ⊥10
D    1, ⊥1    10, 1⊥1
E    1        11, 1⊥1

• Level 1: MCL = 1 ⋅ 2 = 2
• Level 2: MCL = 2 ⋅ 3 = 6

For comparison, COOL with binary encoding at level 1 assigns A, B ↦ 0 and C, D, E ↦ 1, with MCL = 1 + 1 = 2.

Theoretical Analysis of G-COOL

• Use the Gray code as a fixed encoding in COOL
  – It achieves internal cohesion and external isolation
• Theorem: For the level-k partition 𝒞k, x, y ∈ X are in the same cluster if d∞(x, y) < 2^{−(k+1)}
  – Thus x, y are in different clusters only if d∞(x, y) ⩾ 2^{−(k+1)}
  – d∞(x, y) = max_{i∈{1,…,d}} |xi − yi| (L∞ metric)
    ∘ Two adjacent intervals overlap and they are agglomerated
• Corollary: In the optimal partition 𝒞op, for all x ∈ C (C ∈ 𝒞op), its nearest neighbor y is also in C
  – y is a nearest neighbor of x ⟺ y ∈ argmin_{z∈X⧵{x}} d∞(x, z)
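
The quantities used in the theorem and the corollary are easy to compute directly. The Python sketch below implements the L∞ metric, the nearest-neighbor set, and the sufficient condition d∞(x, y) < 2^{−(k+1)}. It does not implement G-COOL; the helper names are assumptions of this example.

```python
def d_inf(x, y):
    """L-infinity distance between two points given as equal-length sequences."""
    return max(abs(a - b) for a, b in zip(x, y))


def nearest_neighbors(x, data):
    """All points of `data` other than x at minimum d_inf distance from x."""
    others = [y for y in data if y != x]
    best = min(d_inf(x, y) for y in others)
    return [y for y in others if d_inf(x, y) == best]


def must_share_cluster(x, y, k):
    """Sufficient condition of the theorem: d_inf(x, y) < 2^-(k+1)."""
    return d_inf(x, y) < 2.0 ** -(k + 1)


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.1, 0.5), (0.3, 0.2)]
    print(nearest_neighbors(data[0], data))
    print(must_share_cluster((0.1, 0.1), (0.15, 0.12), k=2))
```

Note that the condition is only sufficient: two points farther apart than 2^{−(k+1)} may still end up in the same level-k cluster through a chain of overlapping intervals.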

Demonstration of G-COOL

[Figure: two panels on the unit square (both axes from 0 to 1): clustering by G-COOL, and clustering by COOL with the binary encoding]

Experimental Methods

• Analyze G-COOL empirically on synthetic and real datasets, compared to DBSCAN and K-means
  – Synthetic datasets were generated by the R package clusterGeneration [Qiu and Joe, ]
    ∘ n = 1,500 for each cluster and d = 3
  – Real datasets were geospatial images from Earth-as-Art
    ∘ reduced to × pixels, translated into binary images
  – All data were normalized by min-max normalization
• G-COOL was implemented in R (version ..)
• Internal and external measures were used
  – Internal: MCL, connectivity, Silhouette width
  – External: adjusted Rand index
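
Min-max normalization, mentioned above as the preprocessing step, rescales every coordinate to [0, 1], which is the domain of the encodings used by COOL. The sketch below is a generic Python version, not the authors' R code; mapping a constant column to 0 is an assumption of this example.

```python
def min_max_normalize(rows):
    """Rescale each coordinate of the data to [0, 1] independently."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple(
            (v - lo) / (hi - lo) if hi > lo else 0.0  # constant column -> 0 (assumption)
            for v, lo, hi in zip(row, lows, highs)
        )
        for row in rows
    ]


if __name__ == "__main__":
    print(min_max_normalize([(1, 10), (3, 20), (5, 40)]))
```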

Results (Synthetic datasets)

[Figure: MCL versus the number of clusters (2, 4, 6) for G-COOL, DBSCAN, and K-means. Data show mean ± s.e.m.; each experiment was performed 20 times.]

[Figure: connectivity and Silhouette width versus the number of clusters (2, 4, 6) for G-COOL, DBSCAN, and K-means]

[Figure: adjusted Rand index and runtime (s) versus the number of clusters (2, 4, 6) for G-COOL, DBSCAN, and K-means]

[Figure: MCL versus the noise parameter N (0 to 10) for G-COOL. Data show mean ± s.e.m.; each experiment was performed 20 times.]

[Figure: connectivity and Silhouette width versus the noise parameter N (0 to 10) for G-COOL]

[Figure: adjusted Rand index and runtime (s) versus the noise parameter N (0 to 10) for G-COOL]

Results (Real datasets)

[Figure: original images and their binary-filtered versions for Delta, Dragon, Europe, Norway, and Ganges]

[Figure: clustering results of G-COOL and K-means for Delta, Dragon, Europe, Norway, and Ganges]

[Table: n, K, running time (s), and MCL of G-COOL (GC) and K-means (KM) for Delta, Dragon, Europe, Norway, and Ganges]

Conclusion

• Integrate clustering and its evaluation in a coding-oriented manner
  – An effective solution for two essential problems: how to measure the goodness of results and how to find good clusters
    ∘ No distance calculation and no data distribution
• Key ideas:
  1. Fixing an encoding scheme for real-valued variables
    – Introduced the MCL, focusing on compression of clusters
    – Formulated clustering with the MCL, and constructed COOL, which finds the globally optimal solution in linear time
  2. The Gray code
    – We showed the efficiency and effectiveness of G-COOL both theoretically and experimentally

Appendix

Notation (1/2)

• A datum x ∈ ℝ^d, a data set X = {x1, …, xn}
  – #X is the number of elements in X
  – X ⧵ Y is the relative complement of Y in X
• Clustering is a partition of X into K subsets (clusters) C1, …, CK
  – Ci ≠ ∅ and Ci ∩ Cj = ∅ (i ≠ j)
  – We call 𝒞 = {C1, …, CK} a partition of X
  – 𝒞(X) = {𝒞 ∣ 𝒞 is a partition of X}
• The sets of finite and infinite sequences over an alphabet Σ are denoted by Σ^∗ and Σ^ω, resp.
  – The length |w| is the number of symbols other than ⊥
    ∘ If w = 11⊥100⊥⊥…, then |w| = 5
  – For a set of sequences W, |W| = ∑_{w∈W} |w|

Notation (2/2)

• An embedding of ℝ^d is an injective function γ from ℝ^d to Σ^ω
• For p, q ∈ Σ^ω, define p ⩽ q if pi = qi for all i with pi ≠ ⊥
  – Intuitively, q is more concrete than p
• For w ∈ Σ^∗, we write w ⊏ p if w⊥^ω ⩽ p
  – ↑w = {p ∈ range(γ) ∣ w ⊏ p} for w ∈ Σ^∗
  – ↑W = {p ∈ range(γ) ∣ w ⊏ p for some w ∈ W} for W ⊆ Σ^∗
• The following monotonicity holds
  – γ^{−1}(↑v) ⊆ γ^{−1}(↑w) iff v⊥^ω ⩾ w⊥^ω
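
The notation for lengths, the order ⩽, and the prefix relation ⊏ can be checked on finite strings with a small Python sketch. Sequences are represented as strings over "0", "1", and "⊥", and positions beyond the end of a string are treated as ⊥; both choices, as well as the function names, are assumptions of this illustration.

```python
BOTTOM = "⊥"


def length(w):
    """|w|: the number of symbols of w other than ⊥."""
    return sum(1 for s in w if s != BOTTOM)


def leq(p, q):
    """p ⩽ q on finite windows: p_i = q_i wherever p_i is not ⊥.

    Positions beyond the end of a string count as ⊥ (an assumption of this
    sketch, mirroring the padding w⊥^ω).
    """
    for i, s in enumerate(p):
        if s != BOTTOM and (i >= len(q) or q[i] != s):
            return False
    return True


def is_prefix(w, p):
    """w ⊏ p, i.e. w⊥^ω ⩽ p, for a finite word w and a finite window of p."""
    return leq(w, p)


if __name__ == "__main__":
    print(length("11⊥100⊥⊥"))
    print(is_prefix("0", "0⊥100"), is_prefix("⊥1", "0⊥100"))
```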

Optimization by COOL

• The optimal partition 𝒞op can be constructed from the level-k partitions
  – For all C ∈ 𝒞op, there exists k such that C ∈ 𝒞k
• The level-k partitions have a hierarchical structure
  – For each C ∈ 𝒞k, we have ⋃𝒟 = C for some 𝒟 ⊆ 𝒞k+1
  – COOL is similar to divisive hierarchical clustering
• COOL always outputs the globally optimal partition 𝒞op
• The time complexity is O(nd) (best) and O(nd + K!) (worst)
  – Usually K ≪ n holds, hence O(nd)

Clustering Process of COOL

[Figure: step-by-step illustration of COOL on a two-dimensional dataset in the unit square. For K = 2, the level-1 (Lv. 1) partition is shown with clusters numbered 1 to 3; for K = 5, the level-2 (Lv. 2) partition is shown with clusters numbered 1 to 6.]

The Multi-Dimensional Gray Code

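• Use the wrapping function $\varphi(p^1, \dots, p^d) := p^1_1 \dots p^d_1 \, p^1_2 \dots p^d_2 \dots$

  Define the d-dimensional Gray code embedding $\gamma^d_G : \mathcal{I} \to \Sigma^{\omega}_{\bot,d}$ by $\gamma^d_G(x^1, \dots, x^d) := \varphi(\gamma_G(x^1), \dots, \gamma_G(x^d))$

• We omit the superscript d of $\gamma^d_G$ when it is clear from the context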
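
The interleaving performed by the wrapping function can be illustrated on finite prefixes. The sketch below is not from the slides: it assumes the usual tent-map characterization of the one-dimensional Gray code embedding from [Tsuiki, ] (emit 0, 1, or ⊥ according to whether the current value is below, above, or equal to 1/2, then apply the tent map), and the names `gray_prefix`, `wrap`, and `gray_embedding_prefix` are hypothetical.

```python
from fractions import Fraction

BOTTOM = "⊥"  # the undefined symbol of the Gray code embedding
HALF = Fraction(1, 2)


def gray_prefix(x, length):
    """First `length` symbols of the Gray code of x in [0, 1].

    Assumption: symbols come from iterating the tent map, as in
    Tsuiki's Gray code embedding. Exact rationals avoid rounding.
    """
    x = Fraction(x)
    symbols = []
    for _ in range(length):
        if x < HALF:
            symbols.append("0")
        elif x > HALF:
            symbols.append("1")
        else:
            symbols.append(BOTTOM)
        # Tent map: 2x on [0, 1/2], 2(1 - x) on (1/2, 1].
        x = 2 * x if x <= HALF else 2 * (1 - x)
    return symbols


def wrap(sequences):
    """Wrapping function on prefixes: p^1_1 ... p^d_1 p^1_2 ... p^d_2 ..."""
    return [symbol for column in zip(*sequences) for symbol in column]


def gray_embedding_prefix(point, length):
    """Prefix of the d-dimensional embedding: `length` symbols per coordinate."""
    return wrap([gray_prefix(x, length) for x in point])
```

For example, `gray_embedding_prefix((Fraction(1, 4), Fraction(3, 4)), 2)` interleaves the prefixes `0⊥` and `1⊥` into `0 1 ⊥ ⊥`.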

Internal Measures

• Connectivity [Handl et al., ]
  – $\mathrm{Conn}(\mathcal{C}) = \sum_{x \in X} \sum_{i=1}^{M} f(x, \mathrm{nn}(x, i)) / i$
    ∘ $\mathrm{nn}(x, i)$ is the i-th neighbor of x; $f(x, y)$ is 0 if x and y belong to the same cluster, and 1 otherwise
    ∘ M is an input parameter (we set as )
  – Takes values from 0 to ∞, should be minimized
• Silhouette width
  – The average of the Silhouette value S(x) over all x: $S(x) = (b(x) - a(x)) / \max(b(x), a(x))$
    ∘ $a(x) = \|C\|^{-1} \sum_{y \in C} d(x, y)$ (x ∈ C)
    ∘ $b(x) = \min_{D \in \mathcal{C} \setminus \{C\}} \|D\|^{-1} \sum_{y \in D} d(x, y)$
  – Takes values from −1 to 1, should be maximized
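
Both internal measures can be computed directly from their definitions. The following is a minimal sketch, assuming Euclidean distance for d(x, y), points given as coordinate tuples, and at least two clusters; the function names are hypothetical, and a(x) includes x itself in the average, as in the definition above.

```python
import math


def connectivity(points, labels, m):
    """Conn(C): sum over x and i = 1..m of f(x, nn(x, i)) / i."""
    total = 0.0
    for a, x in enumerate(points):
        # Indices of all other points, nearest first.
        others = sorted(
            (b for b in range(len(points)) if b != a),
            key=lambda b: math.dist(x, points[b]),
        )
        for i, b in enumerate(others[:m], start=1):
            if labels[a] != labels[b]:  # f(x, y) = 1 across clusters
                total += 1.0 / i
    return total


def silhouette_width(points, labels):
    """Average of S(x) = (b(x) - a(x)) / max(b(x), a(x)) over all x."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    values = []
    for a, x in enumerate(points):
        def mean_dist(members):
            return sum(math.dist(x, points[b]) for b in members) / len(members)

        a_x = mean_dist(clusters[labels[a]])
        b_x = min(
            mean_dist(members)
            for label, members in clusters.items()
            if label != labels[a]
        )
        denom = max(a_x, b_x)
        # Assumption: S(x) = 0 when both averages are zero.
        values.append((b_x - a_x) / denom if denom > 0 else 0.0)
    return sum(values) / len(values)
```

On two well-separated pairs of points, a labeling that follows the pairs gives a connectivity of 0 for m = 1 and a Silhouette width close to 1, while a labeling that splits each pair scores worse on both.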

External Measures

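• Adjusted Rand index
  – Let the result be $\mathcal{C} = \{C_1, \dots, C_K\}$ and the correct partition be $\mathcal{D} = \{D_1, \dots, D_M\}$
  – Suppose $n_{ij} := \|\{x \in X \mid x \in C_i, x \in D_j\}\|$. Then the index is

$$
\frac{\sum_{i,j} \binom{n_{ij}}{2} - \left( \sum_i \binom{\|C_i\|}{2} \sum_j \binom{\|D_j\|}{2} \right) / \binom{n}{2}}
     {2^{-1} \left( \sum_i \binom{\|C_i\|}{2} + \sum_j \binom{\|D_j\|}{2} \right) - \left( \sum_i \binom{\|C_i\|}{2} \sum_j \binom{\|D_j\|}{2} \right) / \binom{n}{2}}
$$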
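
The formula above translates almost term by term into code. This is a minimal sketch, assuming both partitions are given as label sequences of equal length n ≥ 2; the function name is hypothetical, and returning 1.0 in the degenerate case where the denominator vanishes is a convention chosen here, not something stated on the slide.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(result, truth):
    """Adjusted Rand index of two labelings of the same n >= 2 points."""
    n = len(result)
    # n_ij: number of points in cluster C_i of the result and D_j of the truth.
    n_ij = Counter(zip(result, truth))
    sum_ij = sum(comb(c, 2) for c in n_ij.values())
    sum_c = sum(comb(c, 2) for c in Counter(result).values())
    sum_d = sum(comb(c, 2) for c in Counter(truth).values())

    expected = sum_c * sum_d / comb(n, 2)
    maximum = (sum_c + sum_d) / 2
    if maximum == expected:
        # Degenerate case (e.g. both partitions trivial): assumed to be 1.0.
        return 1.0
    return (sum_ij - expected) / (maximum - expected)
```

Identical partitions score 1 regardless of how the clusters are named, for example `adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])`.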

Discussion

• Results for synthetic datasets
  – Best performance under the internal measures
  – (Nearly) best performance under the external measures
  – G-COOL is efficient and effective
    ∘ DBSCAN is sensitive to input parameters
  – The MCL works well as an internal measure
• Results for real datasets
  – Neither good nor bad
    ∘ The datasets have no clear cluster structure to begin with
• G-COOL is a good clustering method

Related Work

• Partitional methods [Chaoji et al., ]
• Mass-based methods [Ting and Wells, ]
• Density-based methods (DBSCAN [Ester et al., ])
• Hierarchical clustering methods (CURE [Guha et al., ], CHAMELEON [Karypis et al., ])
• Grid-based methods (STING [Wang et al., ], WaveCluster [Sheikholeslami et al., ])

Future Works

• Speeding up by using tree structures such as BDDs
• Application to anomaly detection
• Theoretical analysis, in particular the relation to Computable Analysis
  – Admissibility is a key property

References

[Berkhin, ] P. Berkhin. A survey of clustering data mining techniques. Grouping Multidimensional Data, pages –, .

[Chaoji et al., ] V. Chaoji, M. A. Hasan, S. Salem, and M. J. Zaki. SPARCL: An effective and efficient algorithm for mining arbitrary shape-based clusters. Knowledge and Information Systems, ():–, .

[Cilibrasi and Vitányi, ] R. Cilibrasi and P. M. B. Vitányi. Clustering by compression. IEEE Transactions on Information Theory, ():–, .

[Ester et al., ] M. Ester, H. P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of KDD, , –, .

[Guha et al., ] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. Information Systems, ():–, .

[Handl et al., ] J. Handl, J. Knowles, and D. B. Kell. Computational cluster validation in post-genomic data analysis. Bioinformatics, ():, .

[Karypis et al., ] G. Karypis, H. Eui-Hong, and V. Kumar. CHAMELEON: Hierarchical clustering using dynamic modeling. Computer, ():–, .

[Kontkanen and Myllymäki, ] P. Kontkanen and P. Myllymäki. An empirical comparison of NML clustering algorithms. In Proceedings of Information Theory and Statistical Learning, .

[Kontkanen et al., ] P. Kontkanen, P. Myllymäki, W. Buntine, J. Rissanen, and H. Tirri. An MDL framework for data clustering. In P. Grünwald, I. J. Myung, and M. Pitt, editors, Advances in Minimum Description Length: Theory and Applications. MIT Press, .

[Knuth, ] D. E. Knuth. The Art of Computer Programming, Volume, Fascicle : Generating All Tuples and Permutations. Addison-Wesley Professional, .

[Qiu and Joe, ] W. Qiu and H. Joe. Generation of random clusters with specified degree of separation. Journal of Classification, :–, .

[Sheikholeslami et al., ] G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. In Proceedings of the th International Conference on Very Large Data Bases, pages –, .

[Ting and Wells, ] K. M. Ting and J. R. Wells. Multi-dimensional mass estimation and mass-based clustering. In Proceedings of th IEEE International Conference on Data Mining, pages –, .

[Tsuiki, ] Hideki Tsuiki. Real number computation through Gray code embedding. Theoretical Computer Science, ():–, .

[Wang et al., ] W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. In Proceedings of the rd International Conference on Very Large Data Bases, pages –, .

[Weihrauch, ] K. Weihrauch. Computable Analysis. Springer, .
