Sublinear Algorithms via Precision Sampling

Alexandr Andoni (Microsoft Research)

joint work with:

Robert Krauthgamer (Weizmann Inst.), Krzysztof Onak (CMU)

Goal

Compute the number of Dacians in the empire

Estimate $S = a_1 + a_2 + \dots + a_n$, where $a_i \in [0,1]$

sublinearly…

Sampling

Send accountants to a subset $J$ of provinces, $|J| = m$.

Estimator: $\tilde S = \frac{n}{m} \sum_{j \in J} a_j$

Chebyshev bound: with 90% success probability, $0.5\,S - O(n/m) < \tilde S < 2S + O(n/m)$.

For constant additive error, need $m \sim n$.
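The sampling estimator can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `sampling_estimate`, the seed, and the test instance (the whole sum concentrated in five provinces) are assumptions chosen to show why a small sample gives a coarse answer.

```python
import random

def sampling_estimate(a, m, rng):
    """Estimate sum(a) from m provinces sampled uniformly without replacement."""
    n = len(a)
    J = rng.sample(range(n), m)            # subset J of provinces, |J| = m
    return (n / m) * sum(a[j] for j in J)  # rescale the partial sum by n/|J|

rng = random.Random(0)
n = 10_000
# Hypothetical hard instance: the whole sum S = 5 sits in five provinces.
a = [0.0] * n
for i in rng.sample(range(n), 5):
    a[i] = 1.0

full = sampling_estimate(a, n, rng)       # m = n: every province is visited
sparse = sampling_estimate(a, 100, rng)   # m << n: a multiple of n/m = 100
```

With $m = 100$ the estimate can only take values in steps of $n/m = 100$, which is the $O(n/m)$ additive error in the bound above.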

Precision Sampling Framework

Send accountants to each province, but require only approximate counts.

Estimate each $a_i$ up to some pre-selected precision $u_i$: $|a_i - \tilde a_i| < u_i$.

Challenge: achieve a good trade-off between the quality of the approximation to $S$ and the total cost of estimating each $a_i$ to precision $u_i$.

Formalization

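Two players: the Sum Estimator and the Adversary.

1. The Sum Estimator fixes precisions $u_i$; the Adversary fixes $a_1, a_2, \dots, a_n$.

2. The Adversary fixes $\tilde a_1, \tilde a_2, \dots, \tilde a_n$ s.t. $|a_i - \tilde a_i| < u_i$.

3. The Sum Estimator, given $\tilde a_1, \tilde a_2, \dots, \tilde a_n$, outputs $\tilde S$ s.t. $|\sum_i a_i - \tilde S| < 1$.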

What is cost?

Here, average cost $= \frac{1}{n} \sum_i 1/u_i$: to achieve precision $u_i$, use $1/u_i$ “resources”.

E.g., if $a_i$ is itself a sum $a_i = \sum_j a_{ij}$ computed by subsampling, then one needs $\Theta(1/u_i)$ samples.

For example, can choose all $u_i = 1/n$: average cost $\approx n$.

This is best possible if the estimator is $\tilde S = \sum_i \tilde a_i$.
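The cost model is easy to check numerically. In the sketch below, the helper names `average_cost` and `worst_case_error` are hypothetical. The second helper uses the fact that, for the plain-sum estimator, the errors $|a_i - \tilde a_i| < u_i$ can add up to at most $\sum_i u_i$.

```python
def average_cost(u):
    """Average cost (1/n) * sum(1/u_i) of requesting precision u_i for each a_i."""
    return sum(1.0 / ui for ui in u) / len(u)

def worst_case_error(u):
    """Upper bound on |sum(a) - sum(a_tilde)| when |a_i - a_tilde_i| < u_i."""
    return sum(u)

n = 1000
u = [1.0 / n] * n           # uniform precisions u_i = 1/n
cost = average_cost(u)      # about n
err = worst_case_error(u)   # about 1, i.e. constant additive error
```

Uniform precisions $u_i = 1/n$ therefore buy constant additive error at average cost about $n$, matching the baseline stated above.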

Precision Sampling Lemma

Goal: estimate $\sum_i a_i$ from $\tilde a_i$ satisfying $|a_i - \tilde a_i| < u_i$.

Precision Sampling Lemma: can get, with 90% success, $O(1)$ additive error and $1.5$ multiplicative error:

$S - O(1) < \tilde S < 1.5\,S + O(1)$

with average cost equal to $O(\log n)$.

Example: distinguish $\sum_i a_i = 5$ vs $\sum_i a_i = 0$. Consider two extreme cases:

If five $a_i = 1$: sample all, but need only a crude approximation ($u_i = 1/10$).

If all $a_i = 5/n$: only a few with good approximation $u_i = 1/n$, and the rest with $u_i = 1$.

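More generally, with additive error $\varepsilon$ and multiplicative error $1+\varepsilon$:

$S - \varepsilon < \tilde S < (1+\varepsilon) S + \varepsilon$

with average cost $O(\varepsilon^{-3} \log n)$.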

Precision Sampling Algorithm

Precision Sampling Lemma: can get, with 90% success, $O(1)$ additive error and $1.5$ multiplicative error:

$S - O(1) < \tilde S < 1.5\,S + O(1)$

with average cost equal to $O(\log n)$.

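Algorithm:

Choose each $u_i \in [0,1]$ i.i.d.

Estimator: $\tilde S$ = count of the $i$'s s.t. $\tilde a_i / u_i > 6$ (modulo a normalization constant).

Proof of correctness:

We use only the $\tilde a_i$ which are $(1+\varepsilon)$-approximations to $a_i$: if $\tilde a_i / u_i > 6$ and $|a_i - \tilde a_i| < u_i$, then the error is below $\tilde a_i / 6$.

$E[\tilde S] \approx \sum_i \Pr[a_i / u_i > 6] = \sum_i a_i / 6$.

$E[1/u] = O(\log n)$ w.h.p.

For $1+\varepsilon$ approximation:

Concrete distribution: each $u_i$ = minimum of $O(\varepsilon^{-3})$ uniform random variables (u.r.v.).

Estimator: a function of $[\tilde a_i / u_i - 4/\varepsilon]_+$ and the $u_i$'s.

Guarantee: $S - \varepsilon < \tilde S < (1+\varepsilon) S + \varepsilon$, with average cost $O(\varepsilon^{-3} \log n)$.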
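A minimal Python sketch of the basic (constant-factor) estimator follows. Several details are assumptions made for illustration and are not taken from the slides: the $u_i$ are drawn uniformly from $(0,1]$, the normalization constant is taken to be the threshold 6 itself, and the adversary is replaced by random noise of magnitude below $u_i$. The function name `precision_sampling_estimate` is hypothetical.

```python
import random

def precision_sampling_estimate(a, rng, threshold=6.0):
    """Return (estimate of sum(a), average cost) for one run of the basic scheme."""
    n = len(a)
    # Assumed distribution: u_i i.i.d. uniform on (0, 1]; 1 - random() avoids u_i == 0.
    u = [1.0 - rng.random() for _ in range(n)]
    # Stand-in for the adversary: any a_tilde_i with |a_i - a_tilde_i| < u_i is allowed.
    a_tilde = [ai + 0.99 * ui * rng.uniform(-1.0, 1.0) for ai, ui in zip(a, u)]
    # Count the indices passing the threshold test, then undo the 1/threshold scaling.
    count = sum(1 for at, ui in zip(a_tilde, u) if at / ui > threshold)
    avg_cost = sum(1.0 / ui for ui in u) / n
    return threshold * count, avg_cost

rng = random.Random(1)
n = 2000
zero_est, _ = precision_sampling_estimate([0.0] * n, rng)    # true sum is 0
runs = [precision_sampling_estimate([5.0 / n] * n, rng)[0] for _ in range(200)]
mean_est = sum(runs) / len(runs)                             # true sum is 5
```

When all $a_i = 0$, every ratio $\tilde a_i / u_i$ stays below 1, so the count is always zero. When the sum is 5, a single run is noisy (it is a multiple of 6), and only the average over many runs concentrates near 5. This matches the constant-factor, constant-probability guarantee rather than a high-accuracy one.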

Why?

Save time:

Problem: computing edit distance between two strings.

New algorithm that obtains $(\log n)^{1/\varepsilon}$ approximation in $n^{1+O(\varepsilon)}$ time, via an efficient property-testing algorithm that uses Precision Sampling.

More details: see the talk by Robi on Friday!

Save space:

Problem: compute norms/frequency moments in streams.

Gives a simple and unified approach to compute all $\ell_p$, $F_k$ moments, and other goodies.

More details: now

Streaming frequencies

Setup: $1+\varepsilon$ estimate of frequencies in small space.

Let $x_i$ = frequency of ethnicity $i$.

$k$th moment: $\sum_i x_i^k$

$k \in [0,2]$: space $O(1/\varepsilon^2)$ [AMS’96, I’00, GC07, Li08, NW10, KNW10, KNPW11]

$k > 2$: space $O(n^{1-2/k})$ [AMS’96, SS’02, BYJKS’02, CKS’03, IW’05, BGKS’06, BO10]

Sometimes frequencies $x_i$ are negative, e.g. if measuring traffic difference (delay, etc.).

We want linear “dim reduction” $L: \mathbb{R}^n \to \mathbb{R}^m$, $m \ll n$.

Ethnicity     Frequency
Dacians       358
Galois        12
Barbarians    2988
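For reference, the moments of the example table can be computed exactly when the whole frequency vector is stored, which is the linear-space baseline that streaming algorithms try to avoid. The helper name `frequency_moment` is hypothetical, and absolute values are used so that negative frequencies are handled as well.

```python
def frequency_moment(x, k):
    """Exact k-th frequency moment sum_i |x_i|^k (k = 0 counts nonzero entries)."""
    if k == 0:
        return sum(1 for v in x if v != 0)
    return sum(abs(v) ** k for v in x)

# Frequencies from the example table.
freq = {"Dacians": 358, "Galois": 12, "Barbarians": 2988}
x = list(freq.values())

f0 = frequency_moment(x, 0)   # number of distinct ethnicities: 3
f1 = frequency_moment(x, 1)   # total count: 3358
f2 = frequency_moment(x, 2)   # second moment: 9056452
```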

Norm Estimation via Precision Sampling

Idea: use PSL to compute the sum $\|x\|_k^k = \sum_i |x_i|^k$.

General approach:

1. Pick $u_i$'s according to PSL and let $y_i = x_i / u_i^{1/k}$.

2. Compute all $|y_i|^k = |x_i|^k / u_i$ up to additive approximation $O(1)$. Can be done by computing the heavy hitters of the vector $y$.

3. Use PSL to compute the sum $\|x\|_k^k = \sum_i |x_i|^k$.

Space bound is controlled by the norm $\|y\|_2$, since heavy hitters under $\ell_2$ is the best we can do.

Note that $\|y\|_2 \le \|x\|_2 \cdot E[1/u_i]$.

Streaming $F_k$ moments

Theorem: linear sketch for $F_k$ with $O(1)$ approximation, $O(1)$ update, and $O(n^{1-2/k} \log n)$ space (in words).

Sketch:

Pick random $u_i \in [0,1]$, $s_i \in \{\pm 1\}$, and let $y_i = s_i \cdot x_i / u_i^{1/k}$.

Throw into one hash table $H$, size $m = O(n^{1-2/k} \log n)$ cells.

Update: on $(i, a)$, H[h(i)] += $s_i \cdot a / u_i^{1/k}$.

Estimator: $\max_{j \in [m]} |H[j]|^k$

Randomness: $O(1)$ independence suffices.

[Figure: the vector $x = (x_1, \dots, x_6)$ is hashed into the table $H$, whose cells hold $y_1 + y_3$, $y_4$, and $y_2 + y_5 + y_6$.]
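The sketch, update rule, and estimator can be illustrated with a small Python class. This is a simplified sketch under stated assumptions: the class name `FkSketch` is hypothetical, the table size `m` is passed in by the caller, and the random values $u_i$, $s_i$, $h(i)$ are stored explicitly with full independence. A real streaming implementation would generate them from $O(1)$-wise independent hash functions to stay within small space.

```python
import random

class FkSketch:
    """Illustrative one-table linear sketch for the k-th frequency moment."""

    def __init__(self, n, k, m, seed=0):
        rng = random.Random(seed)
        self.k = k
        # Assumption for clarity only: u_i, s_i, h(i) are stored for every i,
        # which costs O(n) space; limited-independence hashing would avoid this.
        self.u = [1.0 - rng.random() for _ in range(n)]   # u_i in (0, 1]
        self.s = [rng.choice((-1, 1)) for _ in range(n)]  # random signs
        self.h = [rng.randrange(m) for _ in range(n)]     # cell of coordinate i
        self.H = [0.0] * m                                # the hash table

    def update(self, i, a):
        """Process stream item (i, a), meaning x_i += a."""
        self.H[self.h[i]] += self.s[i] * a / self.u[i] ** (1.0 / self.k)

    def estimate(self):
        """Return max_j |H[j]|^k."""
        return max(abs(cell) for cell in self.H) ** self.k

sk = FkSketch(n=6, k=3, m=3, seed=0)
sk.update(3, 2.0)
single = sk.estimate()      # equals |x_3|^k / u_3, at least |x_3|^k = 8
sk.update(3, -2.0)
cancelled = sk.estimate()   # updates are linear, so the table is back to zero
```

Because every update is a linear operation on `H`, insertions and deletions (including negative frequencies) are handled uniformly, which is the linearity property asked for in the setup.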

More Streaming Algorithms

Other streaming algorithms:

Algorithm for all $k$-moments, including $k \le 2$. For $k > 2$, improves existing space bounds [AMS96, IW05, BGKS06, BO10]. For $k \le 2$, worse space bounds [AMS96, I00, GC07, Li08, NW10, KNW10, KNPW11].

Improved algorithm for mixed norms ($\ell_p$ of $\ell_k$) [CM05, GBD08, JW09]; space bounded by (Rademacher) $p$-type constant.

Algorithm for $\ell_p$-sampling problem [MW’10]. This work extended to give tight bounds by [JST’11].

Connections:

Inspired by the streaming algorithm of [IW05], but simpler.

Turns out to be a distant relative of Priority Sampling [DLT’07].

Finale

Other applications for the Precision Sampling framework?

Better algorithms for precision sampling?

Best bound for average cost (for $1+\varepsilon$ approximation): upper bound $O(\varepsilon^{-3} \log n)$ (tight for our algorithm); lower bound $\Omega(\varepsilon^{-2} \log n)$.

Bounds for other cost models? E.g., for 1/square root of precision, the bound is $O(1/\varepsilon^{3/2})$.

Other forms of “access” to $a_i$'s?
