Transcript
Page 1:

Data reduction for weighted and outlier-resistant clustering

Leonard J. Schulman, Caltech

joint with

Dan Feldman, MIT

Page 2:

Talk outline

• Clustering-type problems:
  – k-median
  – weighted k-median
  – k-median with m outliers (small m)
  – k-median with penalty (clustering with many outliers)
  – k-line median

• Unifying framework: tame loss functions

• Core-sets, a.k.a. ε-approximations

• Common existence proof and algorithm

Page 3:
Page 4:
Page 5:

Voronoi regions have spherical boundaries: with weighted centers, the bisector {x : w1·dist(x,c1) = w2·dist(x,c2)} for w1 ≠ w2 is a sphere (an Apollonius circle in the plane), not a hyperplane.

Page 6:
Page 7:

k-Median with penalty

Page 8:

k-Median with penalty: good for outliers

2-median clustering of a data set:

Same data set plus an outlier:

Now cluster with h-robust loss function:
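To make the effect concrete, here is a minimal sketch (not from the slides; the data and the threshold h are invented for illustration). Under the plain 2-median loss a single far outlier hijacks one of the two centers, while the h-robust loss min{h, dist} caps the outlier's contribution, so both natural clusters keep their own center:

    from itertools import combinations

    def cost(points, centers, h=float("inf")):
        # Each point pays its (possibly capped) distance to the nearest center.
        return sum(min(h, min(abs(p - c) for c in centers)) for p in points)

    def best_2_centers(points, h=float("inf")):
        # Brute force over pairs of data points as candidate centers (fine in 1-D).
        return min(combinations(points, 2), key=lambda C: cost(points, C, h))

    P = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0] + [1000.0]   # two clusters + outlier

    print(best_2_centers(P))          # plain loss: one center sits on the outlier
    print(best_2_centers(P, h=5.0))   # robust loss: centers near 1 and 11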

Page 9:
Page 10:

Related work and our results

Page 11:

Why are all these problems in the same paper?

In each case the objective function is a suitably tame “loss function”.

The loss in representing a point p by a center c is:

k-median: D(p) = dist(p,c)

Weighted k-median: D(p) = w · dist(p,c)

Robust k-median: D(p) = min{h, dist(p,c)}
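In code, the three losses side by side (a minimal sketch, not from the slides; following the "arbitrary-weight centers" slide later in the deck, the weight w is assumed to be attached to the center):

    import math

    def dist(p, c):
        # Euclidean distance; any metric works.
        return math.dist(p, c)

    def loss_median(p, c):          # k-median
        return dist(p, c)

    def loss_weighted(p, c, w):     # weighted k-median
        return w * dist(p, c)

    def loss_robust(p, c, h):       # robust k-median
        return min(h, dist(p, c))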

What qualifies as a “tame” loss function?

Page 12:

Log-Log Lipschitz (LgLgLp) condition on the loss function
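The body of this slide did not survive extraction. As a hedged reconstruction from the accompanying paper: a nondecreasing loss f : [0,∞) → [0,∞) is log-log Lipschitz with constant ρ if log f is ρ-Lipschitz as a function of log x, i.e.

    f(Δ·x) ≤ Δ^ρ·f(x)   for all x > 0 and Δ ≥ 1.

For example, dist and w·dist satisfy this with ρ = 1, and so does the robust loss min{h, x}.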

Page 13:

Many examples of LgLgLp loss functions:

Robust M-estimators in Statistics

figure: Z. Zhang

Page 14:

Classic Data Reduction

Page 15:

Same notion for LgLgLp loss functions

Page 16:

k-clustering core-set for loss D
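The slide body did not survive extraction. The standard definition (a hedged reconstruction) is: a weighted subset S ⊆ P, with weights u, such that for every query C consisting of k centers

    (1-ε)·Σ_{p∈P} D(p,C) ≤ Σ_{s∈S} u(s)·D(s,C) ≤ (1+ε)·Σ_{p∈P} D(p,C).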

Page 17:

Weighted-k-clustering core-set for loss D

Handling arbitrary-weight centers is the “hard part”
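In the weighted variant a query C is a set of k center-weight pairs and each point pays the cheapest weighted distance; consistent with the D_W notation on the sensitivity slide below (hedged as a reconstruction):

    D_W(p,C) = min_{(c,w)∈C} w·dist(p,c)

The core-set guarantee must then hold simultaneously for every such query, i.e. for arbitrary center weights; this uniformity over weights is what makes the weighted case hard.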

Page 18:

Our main technical result

1. For every LgLgLp loss function D on a metric space, for every set P of n points, there is a weighted-(D,k)-core-set S of size

|S| = O(log² n)   (for fixed d, k, ε)

(In more detail: |S| = O(d·k^O(k)/ε²)·log² n in R^d. For finite metrics, d = log n.)

2. S can be computed in time O(n)

Page 19:

Sensitivity [Langberg and Schulman, SODA’11]

The sensitivity of a point p ∈ P determines how important it is to include p in a core-set:

    s(p) = max_C  D_W(p,C) / Σ_{q∈P} D_W(q,C)

Why this works: If s(p) is small, then p has many “surrogates” in the data, and we can take any one of them for the core-set. If s(p) is large, then there is some query C for which p alone contributes a significant fraction of the loss, so we need to include p in any core-set.
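As an illustrative sketch (not the paper's algorithm, which never enumerates queries), the definition can be evaluated by brute force over any finite family of candidate queries:

    import math
    from itertools import combinations

    def sensitivities(P, queries, D):
        # s(p) = max over C of D(p,C) / Σ_{q∈P} D(q,C).
        s = {p: 0.0 for p in P}
        for C in queries:
            total = sum(D(q, C) for q in P)
            for p in P:
                s[p] = max(s[p], D(p, C) / total)
        return s

    # Toy example: k = 2, unit-weight centers, candidate queries = point pairs.
    P = [(0.0,), (1.0,), (2.0,), (10.0,)]
    D = lambda p, C: min(math.dist(p, c) for c in C)
    s = sensitivities(P, list(combinations(P, 2)), D)
    print(s, "T(P) =", sum(s.values()))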

Page 20:

Total sensitivity

The total sensitivity T(P) is the sum of the sensitivities of all the points:

    T(P) = Σ_{p∈P} s(p)

The total sensitivity of the problem is the maximum of T(P) over all input sets P.

Total sensitivity ~ n: cannot have small core-sets.
Total sensitivity constant or polylog: there may exist small core-sets.

Page 21:

Small total sensitivity ⇒ small core-set

Page 22:

Small total sensitivity ⇒ small core-set
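Pages 21-22 are figure slides. The construction they allude to is, in the sensitivity framework, importance sampling: draw points with probability proportional to (upper bounds on) their sensitivities and reweight by the inverse probability, so the core-set size is governed by the total sensitivity. A minimal sketch, with the sample size m left as a parameter:

    import random

    def sensitivity_sample(P, s, m):
        # s[p]: upper bound on the sensitivity of p; T bounds T(P).
        T = sum(s[p] for p in P)
        sample = random.choices(P, weights=[s[p] for p in P], k=m)
        # The weight T/(m·s[p]) makes the loss estimate unbiased:
        # E[ Σ_{p in sample} u(p)·D(p,C) ] = Σ_{p∈P} D(p,C) for every query C.
        return [(p, T / (m * s[p])) for p in sample]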

Page 23:

The main thing we need to do in order to produce a small core-set for weighted-k-median:

For each p ∈ P, compute a good upper bound on s(p) in amortized O(1) time per point.

(The upper bounds should be good enough that their sum, i.e. the resulting bound on the total sensitivity T(P), is small.)

Page 24:

Algorithm for computing sensitivities

Recursive-Robust-Median(P,k)
• Input:
  – A set P of n points in a metric space
  – An integer k ≥ 1
• Output:
  – A subset Q ⊆ P of Ω(n/k^k) points

We prove that any two points in Q can serve as each other's surrogates w.r.t. any query. Hence each point p ∈ Q has sensitivity s(p) = O(1/|Q|).

Outer loop: Call Recursive-Robust-Median(P,k), then set P := P \ Q. Repeat until P is empty.

Total sensitivity bound: T ≤ (# calls to Recursive-Robust-Median) ≤ k^k·log n.
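A sketch of the outer loop in code (the inner routine is the paper's and is only assumed here as a callable; the O(1) constant in the sensitivity bound is suppressed to 1):

    def sensitivity_upper_bounds(P, k, recursive_robust_median):
        # recursive_robust_median(P, k) must return a set Q of Ω(|P|/k^k)
        # mutual surrogates w.r.t. every query, as proved in the paper;
        # each such point then has sensitivity O(1/|Q|).
        s = {}
        remaining = set(P)
        while remaining:
            Q = recursive_robust_median(remaining, k)
            for p in Q:
                s[p] = 1.0 / len(Q)   # up to the suppressed O(1) factor
            remaining -= Q            # P := P \ Q, then repeat
        # Total sensitivity bound: sum(s.values()) equals the number of
        # calls, which is at most about k^k·log n.
        return s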

Page 25:

The algorithm to find the Ω(n/k^k)-size set Q:

Page 26:

Recursive-Robust-Median: illustration


Page 27:

Recursive-Robust-Median: illustration


Page 28:

A detail

Actually it’s more complicated than described, because we can’t afford to look for a (1+ε)-approximation, or even a 2-approximation, to the best k-median of any b·n points (b constant).

Instead look for a bicriteria approximation: a 2-approximation of the best k-median of any b·n/2 points. Linear-time algorithm from [Feldman, Langberg STOC’11].

Page 29:

High-level intuition for the correctness of Recursive-Robust-Median

Consider any p in the “output” set Q.

If for all queries C, D(p,C) is small, then p has low sensitivity.

If there is a query C for which D(p,C) is large, then in that query all points of Q are assigned to the same center c ∈ C, and are closer to each other than to c; so they are surrogates.

Page 30:

Thank you

Page 31:

Appendices

Page 32:

Many examples of LgLgLp loss functions:

Robust M-estimators in Statistics

Page 33:

M-estimators shown in the figure: Huber, “fair”, Cauchy, Geman-McClure, Welsch, Tukey, Andrews.
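For reference, standard closed forms for a few of the listed M-estimators (well-known formulas, e.g. from Z. Zhang's survey; the tuning constants delta and c are free parameters), as a minimal sketch:

    import math

    def huber(x, delta):
        # Quadratic near zero, linear in the tails.
        ax = abs(x)
        return 0.5 * x * x if ax <= delta else delta * (ax - 0.5 * delta)

    def cauchy(x, c):
        # Grows only logarithmically, so far outliers have little influence.
        return (c * c / 2.0) * math.log1p((x / c) ** 2)

    def welsch(x, c):
        # Bounded loss: a point's contribution saturates at c^2/2.
        return (c * c / 2.0) * (1.0 - math.exp(-((x / c) ** 2)))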

