K-Clustering Coresets
Sagi Hed, 9th Dec 2008
Page 1: K-Clustering Coresets - Tel Aviv University (ronitt/COURSES/F08/lec6.pdf)

K-Clustering Coresets

Sagi Hed

9th Dec 2008


• “Bi-Criteria Linear-Time Approximations for Generalized k-Mean/Median/Center”(Danny Feldman, Amos Fiat, Micha Sharir, Danny Segev)

• “Smaller Coresets for k-Median and k-Means Clustering” (Sariel Har-Peled, Akash Kushal)

Page 2:

Coresets Approach

• Not exactly a sublinear-time algorithm!
Instead of working on the input, work on an alternative input that is very small but indistinguishable from the original.

• Create a small coreset with a streaming algorithm.
The coreset will then support arbitrarily many queries in sub-linear time.

Page 3:

Streaming

What is a Streaming Algorithm?
• Data is generated continuously and the total volume is extremely large
• The algorithm examines each arriving item once, then discards it
• It updates its internal state quickly per new item (O(1) to polylog(n))
• It uses small space (polylog(n))

In other words, it can run whenever new data arrives, even if n is too large for memory.

Page 4:

Coresets

What is a Coreset?
A subset of the input such that solving the problem directly on the coreset gives a good approximation for the original input.

Since the coreset is very small, typically poly-logarithmic in size, we can do this even with exhaustive search.

Page 5:

K-Clustering

P ≡ a set of n points in R^d
C ≡ a set of k points in R^d
dist(p, C) ≡ the distance from p to the nearest point in C

K-Median Clustering: CostP(C) ≡ ∑p∈P dist(p, C)
K-Means Clustering: CostP(C) ≡ ∑p∈P [dist(p, C)]²
K-Center Clustering: CostP(C) ≡ maxp∈P { dist(p, C) }

[Figure: CostP(C) in k-center clustering over a point set P]
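These three cost functions are easy to state in code; a minimal sketch (Euclidean distance, not part of the slides):

```python
import math

def dist(p, C):
    # Distance from point p to its nearest center in C
    return min(math.dist(p, c) for c in C)

def cost_median(P, C):   # k-median: sum of distances
    return sum(dist(p, C) for p in P)

def cost_means(P, C):    # k-means: sum of squared distances
    return sum(dist(p, C) ** 2 for p in P)

def cost_center(P, C):   # k-center: maximum distance
    return max(dist(p, C) for p in P)
```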

Page 6:

K-Clustering

P ≡ a set of n points in R^d

• K-Center/Median/Means Clustering: find C, |C| = k, s.t. CostP(C) is minimal…
• K-Center/Median/Means Queries: given C, what is CostP(C)?

Exact K-Clustering is NP-hard… (exponential in k)

Page 7:

(k, ε)-Coreset

(k, ε)-coreset ≡ a set S of points in R^d where, for every C with |C| = k:

(1-ε)·CostP(C) ≤ CostS(C) ≤ (1+ε)·CostP(C)

[Figure: coreset S, with Cost(C, S) ≈ Cost(C, P)]

CostS(OPT(S)) ≤ CostS(OPT(P)) ≤ (1+ε)·CostP(OPT(P))
But OPT(S) ≠ OPT(P)…
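The defining inequality can be checked directly for any single candidate C; a small sketch (a true coreset must pass this check for every C with |C| = k, which this function does not verify by itself):

```python
import math

def cost_median(P, C):
    # k-median cost: sum over p in P of the distance to the nearest center
    return sum(min(math.dist(p, c) for c in C) for p in P)

def satisfies_coreset_bound(P, S, C, eps, cost=cost_median):
    # The (k, eps)-coreset inequality for one candidate center set C
    cp, cs = cost(P, C), cost(S, C)
    return (1 - eps) * cp <= cs <= (1 + eps) * cp
```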

Page 8:

Our Goal

A streaming algorithm for a (k, ε)-coreset with O(k·ε^(-d)·log(n)) points.

With this small coreset we can answer k-queries or find a k-clustering in sub-linear time, without ever reading the original input again.

We assume k and d are constants.

Answer queries on the coreset with exhaustive search:
• Answer k-queries in O(k²·d·ε^(-d)·log(n)) time
• Find a k-clustering in roughly O([k·ε^(-d)·log(n)]^(kd+1)) time (the number of Voronoi partitions is bounded)

Page 9:

K-Clustering Coresets

Order of Business:
1. Coreset Streaming Reduction
2. Bi-Criteria K-Clustering Approximation
3. K-Center Coresets Using Bi-Criteria K-Clustering
4. K-Means/Median Coresets Using Hand Waving

Page 10:

Coreset Streaming Reduction

Input
A (k, ε)-coreset creation algorithm for k-clustering, creating coresets of size O(k·(1/ε)^d) in O(d·n·k) time.

Output
A streaming (k, ε)-coreset creation algorithm for k-clustering, creating coresets of size O(k·(1/ε)^d·log(n)), that works in O(polylog(k, 1/ε)) time per insertion.

Page 11:

Coreset Streaming

Lemma 1
C1 is a (k, ε)-coreset for P1
C2 is a (k, ε)-coreset for P2
=> C1 ∪ C2 is a (k, ε)-coreset for P1 ∪ P2

Lemma 2
C1 is a (k, ε1)-coreset for P
C2 is a (k, ε2)-coreset for C1
=> C2 is a (k, O(ε1+ε2))-coreset for P

Page 12:

Coreset Streaming

• Operate the coreset algorithm on existing coresets
• Maintain log(n) coresets at a time: Q1 = a (k, p1)-coreset, Q2 = a (k, p2)-coreset, Q3 = a (k, p3)-coreset, …, each of size O(k·(1/ε)^d)

pj = ε / [c·(j+1)²]

∏(1+pj) ≤ 1 + ε/2 for a large enough c
=> ∪ Qj is a (k, ε/2)-coreset

Page 13:

Coreset Streaming

Space Complexity
Space for the algorithm is
∑(1 ≤ j ≤ log(n)) O([c·(j+1)²]^d / ε^d) = O(polylog(n))

Coreset Size
Build Rj = a (k, ε/6)-coreset from every Qj
∪ Rj is still a (k, ε)-coreset for all points
|∪ Rj| = ∑(1 ≤ j ≤ log(n)) O((6/ε)^d) = O((1/ε)^d·log(n))
=> the actual coreset (used for exhaustive search) is smaller

Time Complexity
Requires an amortized analysis similar to that of a binary counter
=> O(polylog(k, 1/ε)) per insertion
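The bucket scheme above behaves exactly like a binary counter; a minimal merge-and-reduce sketch, where reduce_fn is a stand-in for the offline coreset algorithm (an assumption for illustration; merging buckets relies on Lemma 1, re-reducing on Lemma 2):

```python
def stream_coreset(stream, block_size, reduce_fn):
    # Merge-and-reduce: one bucket per level, updated like a binary counter.
    buckets = {}  # level j -> coreset covering 2^j blocks of input
    block = []
    for p in stream:
        block.append(p)
        if len(block) == block_size:
            q = reduce_fn(block)          # coreset of one full block
            level = 0
            while level in buckets:       # "carry", exactly as in binary addition
                q = reduce_fn(buckets.pop(level) + q)
                level += 1
            buckets[level] = q
            block = []
    tail = [reduce_fn(block)] if block else []
    # Union of all live buckets is the current coreset (Lemma 1)
    return [p for q in list(buckets.values()) + tail for p in q]
```

With the identity function as reduce_fn the output is just a permutation of the input, which makes the bucket bookkeeping easy to test.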

Page 14:

Bi-Criteria K-Clustering

Clustering Approximation
C is a β-approximation for clustering ≡ CostP(C) ≤ β·CostP(OPT)

Goal
With high probability, obtain a 2-approximation clustering C with |C| = O(k·log²(n)), in time O(d·n·k·log(n)).
With meticulous calculations this can be improved to |C| = O(k·log(n)·loglog(n)) and time O(d·n·k).

Page 15:

Bi-Criteria K-Clustering

Algorithm
1. F = Ø, t = 1
2. Until we cover all points:
   1. Ft = sample O(k·log(n)) points at random
   2. F = F ∪ Ft
   3. Pt = the |P|/2 points from P that are closest to Ft
   4. Remove Pt from P
   5. t++
3. Output F

[Figure: the sampled set F]

Page 16:

Bi-Criteria K-Clustering

Algorithm (compact form)
1. While P is not empty (t++):
   1. F = F ∪ Ft (≡ sample O(k·log(n)) points at random)
   2. Remove Pt = the |P|/2 points from P that are closest to Ft

Analysis
Running time: O(n·d·k·log(n))
|F| = O(k·log²(n))
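A sketch of this sampling loop in code (the sample size c·k·log(n) and the constant c = 3 are illustrative assumptions):

```python
import math
import random

def bicriteria(P, k, c=3):
    # Repeatedly sample O(k*log n) points, then discard the half of the
    # remaining points that is closest to the sample.
    P = list(P)
    n = len(P)
    m = c * k * max(1, math.ceil(math.log2(max(n, 2))))  # O(k*log n) sample size
    F = []
    while P:
        if len(P) <= m:          # the remaining points fit in one sample
            F.extend(P)
            break
        Ft = random.sample(P, m)
        F.extend(Ft)
        # Sort remaining points by distance to Ft and drop the closest half (Pt)
        P.sort(key=lambda p: min(math.dist(p, f) for f in Ft))
        P = P[len(P) // 2:]
    return F
```

Since the point set halves each round, the loop runs O(log n) times, giving |F| = O(k·log²(n)) as on the slide.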

Page 17:

Bi-Criteria K-Clustering

Correctness
F* ≡ the optimal centers for clustering, |F*| = k

p is "bad" for Ft ≡ dist(p, Ft) > 2·dist(p, F*)
p is "good" for Ft ≡ dist(p, Ft) ≤ 2·dist(p, F*)

[Figure: a good point and a bad point relative to F*]

Page 18:

Bi-Criteria K-Clustering

Lemma 1
With high probability, the number of bad points for Ft is ≤ |Pt| / 8.

[Figure: OPT centers and bad points]

Page 19:

Bi-Criteria K-Clustering

Lemma 2
With high probability, for every bad point for Ft which is discarded with Pt, we can match a distinct good point for Ft which is outside Pt. In fact, this point is discarded with Pt+1.

[Figure: bad point b in Pt matched to good point g in Pt+1]

Proof of Lemma 2
#{points in Pt+1 good for Ft} ≥ |Pt+1| - #{points bad for Ft}
Using Lemma 1,
≥ |Pt| / 2 - |Pt| / 8 ≥ (3/8)·|Pt| ≥ #{points bad for Ft}
Since each matched point is discarded with Pt+1, the matches are distinct.

Proof? (Hint: use the previous lemma…)

Page 20:

Bi-Criteria K-Clustering

Correctness
For any bad point b for some Ft, let g be its matching distinct good point:
dist(b, F) ≤ dist(b, Ft) ≤ dist(g, Ft) ≤ 2·dist(g, F*)

K-Median Clustering
CostP(F) = ∑p∈P dist(p, F)
≤ ∑g dist(g, F) + ∑b dist(b, F)
≤ ∑g 2·dist(g, F*) + ∑g 2·dist(g, F*)
= 4·∑g dist(g, F*) ≤ 4·CostP(F*) □

K-Means/Center Clustering
Same proof…

[Figure: bad point b in Pt, matched good point g]

Page 21:

Bi-Criteria K-Clustering

Proof of Lemma 1
During iteration t:
f*i ≡ the i'th center in F*
Qi ≡ the |Pt|/8k points closest to f*i

Probability that a sampled point is not in Qi = (1 - 1/8k)
=> Probability that none of the sampled points is in Qi = (1 - 1/8k)^(8k·log(n)·c) ≤ e^(-c·log(n)) = 1/n^c

Probability that at least one Qi has no sampled points in at least one of the log(n) iterations is, by the union bound, ≤ k·log(n)/n^c ≤ 1/n^(c-2)
=> Probability that all Qi have sampled points in all iterations ≥ 1 - 1/n^(c-2)

For a large enough n, this probability is arbitrarily close to 1.

[Figure: a sampled point inside Qi, the |Pt|/8k points closest to f*]
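The probability chain above can be sanity-checked numerically (the values of k, c, n are chosen arbitrarily):

```python
import math

k, c, n = 5, 3, 1000

# One set, one iteration: (1 - 1/8k)^(8k*c*log n) <= e^(-c*log n) = n^(-c),
# using the standard bound (1 - x)^m <= e^(-x*m)
failure_one_set = (1 - 1/(8*k)) ** (8*k*c*math.log(n))
assert failure_one_set <= n**(-c)

# Union bound over k sets and log(n) iterations
union_bound = k * math.log(n) * n**(-c)
assert union_bound <= n**(-(c - 2))
```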

Page 22:

Bi-Criteria K-Clustering

Proof of Lemma 1 (continued)
q ≡ a point outside the |Pt|/8k closest points of every f*i
f* ≡ q's closest center in F*
p ≡ a sampled point among the |Pt|/8k points closest to f*

dist(q, Ft) ≤ dist(q, p) ≤ dist(q, f*) + dist(f*, p) ≤ 2·dist(q, f*) = 2·dist(q, F*)
=> q is a "good" point!
=> at most |Pt|/8 points are "bad"!

[Figure: q, its closest center f*, and a sampled point p in Qi]

Page 23:

Bi-Criteria K-Clustering

Careful
We can amplify the success probability further by making several attempts and choosing the F with the lowest CostP(F).
We need to be careful with the probabilities and with the streaming reduction.

The algorithm generalizes to j-flats in d dimensions instead of points (it is an open problem to construct coresets for these).

Page 24:

K-Center Clustering Coreset

Goal
Obtain a (k, ε) center-clustering coreset S:
(1-ε)·CostP(C) ≤ CostS(C) ≤ (1+ε)·CostP(C)
CostA(B) ≡ maxp∈A { dist(p, B) }

|S| = O(k·(√d/ε)^d)

One can show that it is impossible to avoid a size exponential in d. (Though just for k-clustering it is possible to get a coreset of size O(k·ε^(-2)).)

Page 25:

K-Center Clustering Coreset

Construction
Use (k·log(n))-clustering on P to create C
|C| = O(k·log²(n))
CostP(C) ≤ 8·CostP(OPT)

Page 26:

K-Center Clustering Coreset

Construction
Build a 2·CostP(C) × 2·CostP(C) × … d-dimensional grid around the points in C.
Divide it into "squares" of side ε·CostP(C)/(8√d) × ε·CostP(C)/(8√d) × …

[Figure: a grid of side 2·CostP(C), cells of side ε/(8√d)·CostP(C)]

Page 27:

K-Center Clustering Coreset

Construction
CostP(C) = maxp∈P { dist(p, C) }
=> all points are contained in the grids
Running time is O(n)

Page 28:

K-Center Clustering Coreset

Construction
Coreset = pick one representative from every square
|Coreset| = O[(√d/ε)^d·k·log²(n)]
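A sketch of the pick-one-representative-per-square step (simplified to a single global grid rather than a grid around each center, and assuming CostP(C) > 0):

```python
import math

def grid_coreset(P, C, eps):
    # One representative per grid cell of side eps*Cost_P(C)/(8*sqrt(d)).
    # Simplification: a single global grid instead of grids around each center.
    d = len(P[0])
    cost = max(min(math.dist(p, c) for c in C) for p in P)  # k-center cost of C
    side = eps * cost / (8 * math.sqrt(d))                  # assumes cost > 0
    reps = {}
    for p in P:
        cell = tuple(math.floor(x / side) for x in p)
        reps.setdefault(cell, p)  # keep the first point seen in each cell
    return list(reps.values())
```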

Page 29:

K-Center Clustering Coreset

Correctness
S ≡ the coreset, p ≡ the point maximizing dist(p, A), q ≡ p's representative

CostP(A) = dist(p, A) ≤ dist(q, A) + (√d)·(ε/(8√d))·CostP(C)
≤ CostS(A) + ε·CostP(OPT) ≤ CostS(A) + ε·CostP(A)
=> (1-ε)·CostP(A) ≤ CostS(A)

Page 30:

K-Center Clustering Coreset

Correctness
CostP(A) = dist(p, A) ≥ dist(q, A) - (√d)·(ε/(8√d))·CostP(C)
≥ CostS(A) - ε·CostP(OPT) ≥ CostS(A) - ε·CostP(A)
=> (1+ε)·CostP(A) ≥ CostS(A)

Page 31:

K-Center Clustering Coreset

|Coreset| = O[(√d/ε)^d·k·log²(n)]

How do we get rid of the log²(n)? It was inherited from the number of cluster centers in the bi-criteria clustering approximation…

Find a (1+ε)-approximation k-center clustering using the current coreset, then feed that clustering (this time with exactly k centers) back into the coreset construction…

Page 32:

K-Means/Median Clustering Coreset

Can also create a (k, ε) median/means-clustering coreset S:
(1-ε)·CostP(C) ≤ CostS(C) ≤ (1+ε)·CostP(C)
CostA(B) ≡ ∑p∈A dist(p, B)  or  CostA(B) ≡ ∑p∈A [dist(p, B)]²

|S| = O(k²·ε^(-d))

This was later improved to |S| = O(poly(d, k, 1/ε))!! Again, just for k-clustering it is possible to get a coreset of size independent of n or d.

Page 33:

K-Means/Median Clustering Coreset

Construction
In K-Means/Median the coresets have to be weighted…

Similarly to k-center, start with a bi-criteria clustering and create squares around each center, with the square sizes growing by factors of 2.

Sample points from every square, but give each sampled point a weight according to the number of points in that square.
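A toy sketch of the weighting idea (the per-square sample size m and this exact weighting are illustrative assumptions, not the papers' scheme):

```python
import random

def weighted_square_sample(squares, m, seed=0):
    # squares: a list of point lists, one per (non-empty) grid square.
    # Sample up to m points per square; weight each sample so that the
    # total weight of a square equals the number of points it contains.
    rng = random.Random(seed)
    coreset = []
    for pts in squares:
        sample = rng.sample(pts, min(m, len(pts)))
        w = len(pts) / len(sample)
        coreset.extend((p, w) for p in sample)
    return coreset
```

The total weight is preserved, so weighted cost sums over the coreset can stand in for cost sums over the original points.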

Page 34:

Questions??

Page 35:

Exercises
1. Consider the case of points in R². Suppose you have an O(log(n))-size coreset. Figure out how to find its optimal clustering using exhaustive search in O(log^(4k)(n)) time.

2. Consider Bi-Criteria K-Clustering for the case of lines in R², i.e. we want to find a set C of O(k^c·log^c(n)) lines s.t. CostP(C) ≤ c'·CostP(OPT), where distances are now measured from points to their nearest line.
Hint: The algorithm stays almost the same. When sampling points, think how you should transform them into lines…