Data reduction for weighted and outlier-resistant clustering
Leonard J. Schulman (Caltech), joint with Dan Feldman (MIT)

Transcript
Page 1:

Data reduction for weighted and outlier-resistant clustering

Leonard J. Schulman, Caltech

joint with

Dan Feldman, MIT

Page 2:

Talk outline

• Clustering-type problems:
  – k-median
  – weighted k-median
  – k-median with m outliers (small m)
  – k-median with penalty (clustering with many outliers)
  – k-line median

• Unifying framework: tame loss functions

• Core-sets, a.k.a. ε-approximations

• Common existence proof and algorithm

Page 3:
Page 4:
Page 5:

Voronoi regions have spherical boundaries
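(A standard justification, not spelled out on the slide: with multiplicative weights, the boundary between centers c1, c2 with weights w1 ≠ w2 is {x : w1·dist(x,c1) = w2·dist(x,c2)}. Squaring both sides gives a quadratic equation in x whose |x|² coefficient is w1² − w2² ≠ 0, i.e. a sphere, the Apollonius circle in the plane. For w1 = w2 it degenerates to the familiar bisecting hyperplane of unweighted k-median.)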

Page 6:
Page 7:

k-Median with penalty

Page 8:

k-Median with penalty: good for outliers

2-median clustering of a data set:

Same data set plus an outlier:

Now cluster with h-robust loss function:

Page 9:
Page 10:

Related work and our results

Page 11:

Why are all these problems in the same paper?

In each case the objective function is a suitably tame “loss function”.

The loss in representing a point p by a center c is:

k-median: D(p) = dist(p,c)

Weighted k-median: D(p) = w · dist(p,c)

Robust k-median: D(p) = min{h, dist(p,c)}
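As a concrete illustration (not from the slides), here is a minimal Python sketch of the three loss variants, assuming Euclidean distance; the function names are ours:

```python
import math

def dist(p, c):
    # Euclidean distance between points p and c (tuples of floats)
    return math.sqrt(sum((pi - ci) ** 2 for pi, ci in zip(p, c)))

def median_loss(p, c):
    # k-median: the loss is the distance itself
    return dist(p, c)

def weighted_loss(p, c, w):
    # weighted k-median: distance scaled by the center's weight w
    return w * dist(p, c)

def robust_loss(p, c, h):
    # robust k-median: distance capped at h, so a far-away outlier
    # contributes at most h to the objective
    return min(h, dist(p, c))
```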

What qualifies as a “tame” loss function?

Page 12:

Log-Log Lipschitz (LgLgLp) condition on the loss function
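The slide's formal statement is not preserved in this transcript. Roughly, following the paper, a loss f is Log-Log Lipschitz when there is a constant ρ such that f(Δ·x) ≤ Δ^ρ·f(x) for every Δ ≥ 1; equivalently, log f(x) is ρ-Lipschitz as a function of log x. A small numeric spot-check in Python (is_lglglp is an illustrative helper, not from the paper):

```python
def is_lglglp(f, rho, xs, deltas):
    # Spot-check the Log-Log Lipschitz condition on sampled values:
    # f(delta * x) <= delta**rho * f(x) for all x and all delta >= 1.
    return all(f(d * x) <= d ** rho * f(x) + 1e-12
               for x in xs for d in deltas)

# Example: the h-robust loss min(h, x) satisfies the condition with rho = 1.
h = 2.0
print(is_lglglp(lambda x: min(h, x), rho=1.0,
                xs=[0.1, 1.0, 5.0], deltas=[1.0, 2.0, 10.0]))  # True
```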

Page 13:

Many examples of LgLgLp loss functions:

Robust M-estimators in Statistics

figure: Z. Zhang

Page 14:

Classic Data Reduction

Page 15:

Same notion for LgLgLp loss functions

Page 16:

k-clustering core-set for loss D

Page 17:

Weighted-k-clustering core-set for loss D

Handling arbitrary-weight centers is the “hard part”
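For reference, the guarantee being defined (standard core-set definition, paraphrased): for every query, the weighted loss over S should be within a (1 ± ε) factor of the loss over P. A small Python helper (names are ours) measuring the relative error of a candidate core-set against a single query:

```python
def coreset_error(P, S, weights, query, loss):
    # Relative error of the weighted core-set S on one query, i.e. one
    # candidate set of centers; for weighted k-median, each center's
    # weight is folded into `loss`. A valid core-set keeps this error
    # below epsilon for *every* query, not just one.
    full = sum(min(loss(p, c) for c in query) for p in P)
    reduced = sum(u * min(loss(s, c) for c in query)
                  for s, u in zip(S, weights))
    return abs(full - reduced) / full
```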

Page 18:

Our main technical result

1. For every LgLgLp loss function D on a metric space, and for every set P of n points, there is a weighted-(D,k)-core-set S of size

|S| = O(log² n)

(In more detail: |S| = (d·k^O(k)/ε²)·log² n in R^d. For finite metrics, d = log n.)

2. S can be computed in time O(n)

Page 19:

Sensitivity [Langberg and S, SODA’11]

The sensitivity of a point p ∈ P determines how important it is to include p in a core-set:

s(p) = max_C [ D_W(p,C) / Σ_{q∈P} D_W(q,C) ]

Why this works: if s(p) is small, then p has many "surrogates" in the data, and we can take any one of them for the core-set. If s(p) is large, then there is some C for which p alone contributes a significant fraction of the loss, so we need to include p in any core-set.
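A brute-force rendering of this definition in Python (the paper bounds sensitivities without enumerating queries; here queries is just a finite sample of candidate center sets, so the result is a lower estimate of the true maximum):

```python
def sensitivity_estimate(p, P, queries, loss):
    # s(p) = max over queries C of p's share of the total loss,
    # following the displayed formula. min(loss(p, c) for c in C)
    # is the loss of serving p by its best center in C.
    best = 0.0
    for C in queries:
        contribution = min(loss(p, c) for c in C)
        total = sum(min(loss(q, c) for c in C) for q in P)
        if total > 0:
            best = max(best, contribution / total)
    return best
```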

Page 20:

Total sensitivity

The total sensitivity T(P) is the sum of the sensitivities of all the points:

T(P) = Σ_{p∈P} s(p)

The total sensitivity of the problem is the maximum of T(P) over all input sets P.

Total sensitivity ~ n: cannot have small core-sets. Total sensitivity constant or polylogarithmic: there may exist small core-sets.

Page 21:

Small total sensitivity ⇒ small core-set

Page 22:

Small total sensitivity ⇒ small core-set
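The bodies of these two slides are not in the transcript. The standard mechanism behind the implication is sensitivity sampling in the spirit of [Langberg and S, SODA'11]: sample points with probability proportional to their sensitivity bounds and reweight inversely, so the needed sample size scales with the total sensitivity T rather than with n. A sketch under that assumption (names are ours):

```python
import random

def sensitivity_sample(P, bounds, m):
    # Draw m points i.i.d. with probability proportional to their
    # sensitivity upper bounds, giving each sampled point weight
    # 1/(m * prob) so the weighted loss is an unbiased estimator.
    points = list(P)
    T = sum(bounds[p] for p in points)
    probs = [bounds[p] / T for p in points]
    sample = random.choices(points, weights=probs, k=m)
    return [(p, T / (m * bounds[p])) for p in sample]
```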

Page 23:

The main thing we need to do in order to produce a small core-set for weighted-k-median:

For each p ∈ P, compute a good upper bound on s(p) in amortized O(1) time per point.

(The upper bounds should be tight enough that their sum, i.e. the resulting bound on T(P), is small.)

Page 24:

Algorithm for computing sensitivities

Recursive-Robust-Median(P,k)
• Input:
  – A set P of n points in a metric space
  – An integer k ≥ 1
• Output:
  – A subset Q ⊆ P of Ω(n/k^k) points

We prove that any two points in Q can serve as each other's surrogates w.r.t. any query. Hence each point p ∈ Q has sensitivity s(p) ≤ O(1/|Q|).

Outer loop: call Recursive-Robust-Median(P,k), then set P := P − Q. Repeat until P is empty (see the sketch below).

Total sensitivity bound: T = O(# of calls to Recursive-Robust-Median) = O(k^k · log n).
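A minimal Python sketch of this outer loop, assuming a recursive_robust_median subroutine with the input/output contract above (its internals are only illustrated on the following slides):

```python
def sensitivity_bounds(P, k, recursive_robust_median):
    # Repeatedly extract a large subset Q of mutual surrogates, charge
    # each of its points sensitivity O(1/|Q|), and remove Q. Since Q
    # always holds an Omega(1/k^k) fraction of the remaining points,
    # the loop ends after about k^k * log n rounds.
    remaining = set(P)            # points must be hashable, e.g. tuples
    bounds = {}
    while remaining:
        Q = set(recursive_robust_median(remaining, k))
        for p in Q:
            bounds[p] = 1.0 / len(Q)   # up to a constant factor
        remaining -= Q
    return bounds
```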

Page 25:

The algorithm to find the Ω(n/k^k)-size set Q:

Page 26:

Recursive-Robust-Median: illustration

[figure: center labeled c*]

Page 27:

Recursive-Robust-Median: illustration

[figure: center labeled c*]

Page 28:

A detail

Actually it's more complicated than described, because we can't afford to look for a (1+ε)-approximation, or even a 2-approximation, to the best k-median of any b·n points (b constant).

Instead we look for a bicriteria approximation: a 2-approximation of the best k-median of any b·n/2 points. A linear-time algorithm is given in [Feldman, Langberg STOC'11].

Page 29:

High-level intuition for the correctness of Recursive-Robust-Median

Consider any p in the “output” set Q.

If for all queries C, D(p,C) is small, then p has low sensitivity.

If there is a query C for which D(p,C) is large, then in that query all points of Q are assigned to the same center c ∈ C, and are closer to each other than to c; so they are surrogates.
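One way to make the surrogate step concrete (a standard triangle-inequality argument, not spelled out on the slide): if p, q ∈ Q are both served by c and dist(p,q) ≤ dist(p,c), then

dist(q,c) ≤ dist(q,p) + dist(p,c) ≤ 2·dist(p,c),

so exchanging p and q changes either point's contribution by at most a factor of 2; the LgLgLp condition extends this to a constant factor for a general loss D.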

Page 30:

Thank you

Page 31:

Appendices

Page 32:

Many examples of LgLgLp loss functions:

Robust M-estimators in Statistics

Page 33:

[table of robust M-estimators, after Z. Zhang: Huber, "fair", Cauchy, Geman–McClure, Welsch, Tukey, Andrews]