The Data Stream Space Complexity of Cascaded Norms T.S. Jayram David Woodruff IBM Almaden

Mar 26, 2015


Colin Bolton
Page 1:

The Data Stream Space Complexity of Cascaded Norms

T.S. Jayram, David Woodruff

IBM Almaden

Page 2:

Data streams

Algorithms access data in a sequential fashion

One pass / small space

Need to be randomized and approximate [FM, MP, AMS]

[Figure: an algorithm with a small main memory reading the stream 2 3 4 16 0 100 5 4 501 200 401 2 3 6 0]

Page 3:

Frequency Moments and Norms

Stream defines updates to a set of items 1, 2, …, d

$f_i$ = weight of item i

Positive-only vs. turnstile model

k-th frequency moment: $F_k = \sum_i |f_i|^k$

p-th norm: $L_p = \|f\|_p = \left(\sum_i |f_i|^p\right)^{1/p}$

Maximum frequency: $p = \infty$

Distinct elements: $p = 0$

Heavy hitters

Assume the length of the stream and the magnitude of updates are $\le \mathrm{poly}(d)$
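The definitions on this slide can be made concrete with a small offline computation. The following is a minimal sketch, not a streaming algorithm: it stores the whole frequency vector, which is exactly what a small-space algorithm avoids. The function names and the example stream are illustrative assumptions.

```python
from collections import defaultdict

def frequencies(stream):
    """Accumulate turnstile updates (item, delta) into a frequency vector."""
    f = defaultdict(int)
    for item, delta in stream:
        f[item] += delta
    return f

def frequency_moment(f, k):
    """F_k = sum_i |f_i|^k; for k = 0 this counts the nonzero entries."""
    if k == 0:
        return sum(1 for v in f.values() if v != 0)
    return sum(abs(v) ** k for v in f.values())

def norm(f, p):
    """L_p = F_p^(1/p) for finite p > 0; p = inf gives the maximum frequency."""
    if p == float("inf"):
        return max((abs(v) for v in f.values()), default=0)
    return frequency_moment(f, p) ** (1.0 / p)

# Hypothetical turnstile stream: item 3 is inserted and then fully deleted.
stream = [(1, 3), (2, 5), (1, -1), (3, 4), (3, -4)]
f = frequencies(stream)
```

Here the final frequencies are 2, 5, and 0, so the deleted item does not count toward the distinct elements.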

Page 4:

Classical Results

Approximating $L_p$ and $F_p$ is the same problem

For $0 \le p \le 2$, $F_p$ is approximable in $\tilde{O}(1)$ space (AMS, FM, Indyk, …)

For $p > 2$, $F_p$ is approximable in $\tilde{O}(d^{1-2/p})$ space (IW); this is best possible (BJKS, CKS)

Page 5:

Cascaded Aggregates

Stream defines updates to pairs of items in {1, 2, …, n} × {1, 2, …, d}

$f_{ij}$ = weight of item (i, j)

Two aggregates P and Q

$$
\begin{pmatrix}
f_{11} & f_{12} & \cdots & f_{1d} \\
f_{21} & f_{22} & \cdots & f_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
f_{n1} & f_{n2} & \cdots & f_{nd}
\end{pmatrix}
\xrightarrow{\;Q\;}
\begin{pmatrix}
Q(\text{Row } 1) \\
Q(\text{Row } 2) \\
\vdots \\
Q(\text{Row } n)
\end{pmatrix}
\xrightarrow{\;P\;}
P \circ Q
$$

$P \circ Q$ = cascaded aggregate
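The two-stage definition can be written as a short offline function: apply Q to each row, then P to the resulting vector. This is a minimal sketch with the whole matrix in memory; the helper names `cascaded` and `F` and the example matrix are illustrative assumptions.

```python
def cascaded(P, Q, M):
    """Apply Q to every row of M, then apply P to the vector of row values."""
    return P([Q(row) for row in M])

def F(k):
    """Return the aggregate v -> F_k(v) = sum |v_i|^k (nonzero count for k = 0)."""
    if k == 0:
        return lambda v: sum(1 for x in v if x != 0)
    return lambda v: sum(abs(x) ** k for x in v)

# Hypothetical 3 x 3 matrix of weights f_ij.
M = [
    [1, 0, 2],
    [0, 0, 0],
    [3, -1, 1],
]

f2_of_f0 = cascaded(F(2), F(0), M)   # rows have 2, 0, 3 nonzeros
f2_of_f1 = cascaded(F(2), F(1), M)   # rows have L1 mass 3, 0, 5
max_of_f0 = cascaded(max, F(0), M)   # largest number of nonzeros in a row
```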

Page 6:

Motivation

Multigraph streams for analyzing IP traffic [Cormode-Muthukrishnan]

Corresponds to $P \circ F_0$ for different P's

$F_0$ returns #destinations accessed by each source

Also introduced the more general problem of estimating $P \circ Q$

Computing complex join estimates

Product metrics [Andoni-Indyk-Krauthgamer]

Stock volatility, computational geometry, operator norms

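Page 7:

The Picture

Estimating $L_k \circ L_p$

[Figure: space complexity of estimating $L_k \circ L_p$ over the (p, k) plane, with p on the horizontal axis, k on the vertical axis, and the diagonal k = p marked. Region labels: $\Theta(1)$, $n^{1-1/k}$, $d^{1-2/p}$, $d$, $n$, $n^{1-2/k} d$, $n^{1-2/k} d^{1-2/p}$, and an open region marked "?"]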

We give a 1-pass $\tilde{O}(n^{1-2/k} d^{1-2/p})$ space algorithm when $k \ge p$

We also provide a matching lower bound based on multiparty disjointness

We give the $\Omega(n^{1-1/k})$ bound for $L_k \circ L_0$ and $L_k \circ L_1$

$\tilde{O}(n^{1/2})$ for $L_2 \circ L_0$ without deletions [CM]

$\tilde{O}(n^{1-1/k})$ for $L_k \circ L_p$ for any p in {0, 1} in turnstile [MW]

[Ganguly] (without deletions)

Follows from techniques of [ADIW]

Our upper bound

Page 8:

Our Problem: $F_k \circ F_p$

$$
F_k \circ F_p(M) = \sum_i \Big( \sum_j |f_{ij}|^p \Big)^k = \sum_i F_p(\text{Row } i)^k
$$

$$
M =
\begin{pmatrix}
f_{11} & f_{12} & \cdots & f_{1d} \\
f_{21} & f_{22} & \cdots & f_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
f_{n1} & f_{n2} & \cdots & f_{nd}
\end{pmatrix}
$$

Page 9:

High Level Ideas: $F_k \circ F_p$

1. We want the $F_k$-value of the vector ($F_p$(Row 1), …, $F_p$(Row n))

2. We try to sample a row i with probability $\propto F_p(\text{Row } i)$

3. Spend an extra pass to compute $F_p(\text{Row } i)$

4. Could then output $F_p(M) \cdot F_p(\text{Row } i)^{k-1}$

(can be seen as a generalization of [AMS])

How do we do the sampling efficiently?
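The sample-then-reweight idea in steps 2 to 4 can be checked offline: if row i is drawn with probability $F_p(\text{Row } i)/F_p(M)$, the output $F_p(M) \cdot F_p(\text{Row } i)^{k-1}$ has expectation $\sum_i F_p(\text{Row } i)^k$. The sketch below assumes the full matrix is available, so it illustrates only the estimator, not the small-space sampling the slides go on to discuss. All names and the example matrix are illustrative.

```python
import random

def row_fp(row, p):
    """F_p of one row: sum_j |f_ij|^p."""
    return sum(abs(x) ** p for x in row)

def exact_cascaded(M, k, p):
    """F_k o F_p (M) = sum_i F_p(Row i)^k, computed exactly."""
    return sum(row_fp(row, p) ** k for row in M)

def one_sample_estimate(M, k, p, rng):
    """Sample row i with probability F_p(Row i) / F_p(M), then output
    F_p(M) * F_p(Row i)^(k-1)."""
    weights = [row_fp(row, p) for row in M]
    total = sum(weights)  # F_p(M)
    i = rng.choices(range(len(M)), weights=weights, k=1)[0]
    return total * weights[i] ** (k - 1)

def expected_estimate(M, k, p):
    """Expectation of one_sample_estimate, summed over all possible rows."""
    weights = [row_fp(row, p) for row in M]
    total = sum(weights)
    return sum((w / total) * total * w ** (k - 1) for w in weights)

# Hypothetical matrix; the row F_2 values are 5, 9, 0, 8.
M = [[1, 2], [3, 0], [0, 0], [2, 2]]
k, p = 3, 2
estimate = one_sample_estimate(M, k, p, random.Random(0))
```

A single sample is unbiased but can have high variance, which is why many samples are averaged in practice.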

Page 10:

Review – Estimating $F_p$ [IW]

Level sets: $S_t = \{\, i \mid (1+\epsilon)^t \le |f_i| \le (1+\epsilon)^{t+1} \,\}$

Level t is good if $|S_t| (1+\epsilon)^{2t} \ge F_2 / B$

Items from such level sets are also good
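The level-set definitions can be illustrated with an exact, offline computation. The sketch below assumes integer frequencies with $|f_i| \ge 1$ and treats the upper boundary of each level as exclusive so that the sets are disjoint; both are assumptions made for the example, as are the function names and the sample vector.

```python
from collections import defaultdict

def level_of(value, eps):
    """Largest t >= 0 with (1 + eps)^t <= |value|, assuming |value| >= 1."""
    t, bound = 0, 1.0 + eps
    while bound <= abs(value):
        t += 1
        bound *= 1.0 + eps
    return t

def level_sets(f, eps):
    """Group the indices of nonzero entries of f by level."""
    levels = defaultdict(list)
    for i, v in enumerate(f):
        if v != 0:
            levels[level_of(v, eps)].append(i)
    return dict(levels)

def good_levels(levels, f, eps, B):
    """Levels t with |S_t| * (1 + eps)^(2t) >= F_2 / B."""
    f2 = sum(v * v for v in f)
    return sorted(t for t, items in levels.items()
                  if len(items) * (1 + eps) ** (2 * t) >= f2 / B)

# Hypothetical frequency vector; eps = 1 makes the levels powers of two.
f = [1, 1, 1, 1, 2, 3, 8, 100]
levels = level_sets(f, 1.0)
```

With a larger B the threshold $F_2/B$ drops, so more levels qualify as good, at the cost of more space.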

Page 11:

ε-Histogram [IW]

Finds approximate sizes $s'_t$ of level sets

For all $S_t$, $s'_t \le (1+\epsilon)|S_t|$

For good $S_t$, $s'_t \ge (1-\epsilon)|S_t|$

Also provides $\tilde{O}(1)$ random samples from each good $S_t$

Space: $\tilde{O}(B)$

Page 12:

Sampling Rows According to $F_p$ Value

Treat the n × d matrix M as a vector:

Run ε-Histogram on M for a certain B

Obtain a (1 ± ε)-approximation $s'_t$ to $|S_t|$ for good t

$F_k \circ F_p(M') \ge (1-\epsilon)\, F_k \circ F_p(M)$, where M' is M restricted to good items (Hölder's inequality)

To sample:

Choose a good t with probability $s'_t (1+\epsilon)^{pt} / F'_p(M)$, where $F'_p(M) = \sum_{\text{good } t} s'_t (1+\epsilon)^{pt}$

Choose a random sample (i, j) from $S_t$

Let row i be the current sample

$$
\Pr[\text{row } i] = \sum_t \frac{s'_t (1+\epsilon)^{pt}}{F'_p(M)} \cdot \frac{|S_t \cap \text{row } i|}{|S_t|} \approx \frac{F_p(\text{row } i)}{F_p(M)}
$$

Problems

1. High level algorithm requires many samples (up to $n^{1-1/k}$) from the $S_t$, but [IW] just gives $\tilde{O}(1)$. Can't afford to repeat in low space

2. Algorithm may misclassify a pair (i, j) into $S_t$ when it is in $S_{t-1}$
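The approximation $\Pr[\text{row } i] \approx F_p(\text{row } i)/F_p(M)$ can be checked numerically in an idealized setting. The sketch below assumes exact level-set sizes in place of $s'_t$, treats every level as good, and uses integer entries with magnitude at least 1; none of this models the ε-Histogram itself or the two problems listed above. Under these assumptions each row's probability is within a factor $(1+\epsilon)^p$ of the target.

```python
from collections import defaultdict

def level_of(value, eps):
    """Largest t >= 0 with (1 + eps)^t <= |value|, assuming |value| >= 1."""
    t, bound = 0, 1.0 + eps
    while bound <= abs(value):
        t += 1
        bound *= 1.0 + eps
    return t

def row_distribution(M, p, eps):
    """Pr[row i] when a level t is picked with probability proportional to
    |S_t| * (1 + eps)^(p t) and then a uniform entry of S_t is drawn."""
    members = defaultdict(list)  # level t -> row index of each entry in S_t
    for i, row in enumerate(M):
        for v in row:
            if v != 0:
                members[level_of(v, eps)].append(i)
    mass = {t: len(rows) * (1 + eps) ** (p * t) for t, rows in members.items()}
    total = sum(mass.values())
    prob = [0.0] * len(M)
    for t, rows in members.items():
        for i in rows:
            prob[i] += (mass[t] / total) / len(rows)
    return prob

def target_distribution(M, p):
    """The ideal distribution F_p(row i) / F_p(M)."""
    w = [sum(abs(v) ** p for v in row) for row in M]
    s = sum(w)
    return [x / s for x in w]

# Hypothetical matrix and parameters.
M = [[5, 0, 7], [1, 1, 1], [0, 40, 3], [0, 0, 0]]
p, eps = 2, 0.1
prob = row_distribution(M, p, eps)
target = target_distribution(M, p)
```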

Page 13:

High Level Ideas: $F_k \circ F_p$

1. We want the $F_k$-value of the vector ($F_p$(Row 1), …, $F_p$(Row n))

2. We try to sample a row i with probability $\propto F_p(\text{Row } i)$

3. Spend an extra pass to compute $F_p(\text{Row } i)$

4. Could then output $F_p(M) \cdot F_p(\text{Row } i)^{k-1}$

(can be seen as a generalization of [AMS])

How do we avoid an extra pass?

Page 14:

Avoiding an Extra Pass

Now we can sample a row i with probability $\propto F_p(\text{Row } i)$

We design a new $F_k$-algorithm to run on ($F_p$(Row 1), …, $F_p$(Row n)) which only receives IDs i with probability $\propto F_p(\text{Row } i)$

For each $j \in [\log n]$, the algorithm does:

1. Choose a random subset of $n/2^j$ rows

2. Sample a row i from this set with $\Pr[\text{Row } i] \propto F_p(\text{Row } i)$

We show that $\tilde{O}(n^{1-1/k})$ oracle samples are enough to estimate $F_k$ up to 1 ± ε
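The two sampling steps per level j can be sketched as follows. This covers only the oracle access pattern; how the returned IDs are combined into an $F_k$ estimate is not spelled out on the slide and is not attempted here. The sketch assumes that $[\log n]$ means j = 1, …, ⌊log₂ n⌋ and that the row weights $F_p(\text{Row } i)$ are known exactly; the function name and weights are illustrative.

```python
import math
import random

def oracle_samples(weights, rng):
    """For j = 1..floor(log2 n): pick n // 2^j random rows, then draw one row
    from that subset with probability proportional to its weight.
    Returns (j, row) pairs; row is None if the subset has zero total weight."""
    n = len(weights)
    samples = []
    for j in range(1, int(math.log2(n)) + 1):
        subset = rng.sample(range(n), n // 2 ** j)
        sub_w = [weights[i] for i in subset]
        if sum(sub_w) == 0:
            samples.append((j, None))
            continue
        i = rng.choices(subset, weights=sub_w, k=1)[0]
        samples.append((j, i))
    return samples

# Hypothetical row weights F_p(Row i) for n = 16 rows.
weights = [0, 5, 0, 3, 9, 1, 0, 2, 7, 0, 4, 6, 0, 8, 1, 1]
samples = oracle_samples(weights, random.Random(1))
```

Subsampling at geometrically decreasing rates gives rows of small weight a chance to be returned once the heavy rows have been excluded from the subset.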

Page 15:

New Lower Bounds

Alice holds an n × d matrix A; Bob holds an n × d matrix B

NO instance: for all rows i, $\Delta(A_i, B_i) \le 1$

YES instance: there is a unique row j for which $\Delta(A_j, B_j) = d$, and for all $i \ne j$, $\Delta(A_i, B_i) \le 1$

We show distinguishing these cases requires $\Omega(n/d)$ randomized communication CC

Implies estimating $L_k(L_0)$ or $L_k(L_1)$ needs $\Omega(n^{1-1/k})$ space
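One plausible reading of how the communication bound yields the space bound is sketched below; the slide does not give the parameter choice, so treat it as an assumption. Take Δ to be Hamming distance on binary rows and consider $F_k \circ F_0$ of A − B. A NO instance has value at most n, while a YES instance has value at least $d^k$. Choosing d so that $d^k \ge 2n$ separates the two by a factor of 2, and then $n/d$ is on the order of $n^{1-1/k}$.

```python
def hamming(x, y):
    """Number of coordinates in which x and y differ."""
    return sum(1 for a, b in zip(x, y) if a != b)

def fk_f0_of_difference(A, B, k):
    """F_k o F_0 of A - B: sum over rows of (number of differing coordinates)^k."""
    return sum(hamming(a, b) ** k for a, b in zip(A, B))

def make_instance(n, d, special_row=None):
    """Binary matrices whose rows differ in exactly one coordinate, except that
    `special_row` (if given) differs in all d coordinates."""
    A = [[0] * d for _ in range(n)]
    B = [[0] * d for _ in range(n)]
    for i in range(n):
        if i == special_row:
            B[i] = [1] * d
        else:
            B[i][0] = 1
    return A, B

# Hypothetical parameters with d^k = 2n.
k, d = 2, 8
n = d ** k // 2
no_value = fk_f0_of_difference(*make_instance(n, d), k)
yes_value = fk_f0_of_difference(*make_instance(n, d, special_row=0), k)
```

The NO instance built here is the worst case for the gap, since every row contributes its maximum of 1.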

Page 16:

Information Complexity

Paradigm [CSWY, BJKS]: the information cost IC is the amount of information the transcript reveals about the inputs

For any function f, $CC(f) \ge IC(f)$

Using their direct sum theorem, it suffices to show an $\Omega(1/d)$ information cost of a protocol for deciding if $\Delta(x,y) = d$ or $\Delta(x,y) \le 1$

Caveat: distribution is only on instances where $\Delta(x,y) \le 1$

Page 17:

Working with Hellinger Distance

Given the prob. distribution vector $\pi(x,y)$ over transcripts of an input (x, y)

Let $\psi(x,y)_\tau = \pi(x,y)_\tau^{1/2}$ for all $\tau$

Information cost can be lower bounded by $\sum_{\Delta(u,v)=1} \|\psi(u,u) - \psi(u,v)\|^2$

Unlike previous work, we exploit the geometry of the squared Euclidean norm (useful in later work [AJP])

Short diagonals property:

$$
\sum_{\Delta(u,v)=1} \|\psi(u,u) - \psi(u,v)\|^2 \;\ge\; \Omega(1/d) \sum_{\Delta(u,v)=d} \|\psi(u,u) - \psi(u,v)\|^2
$$

[Figure: a quadrilateral with sides a, b, c, d and diagonals e, f, illustrating $a^2 + b^2 + c^2 + d^2 \ge e^2 + f^2$]
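The quadrilateral inequality $a^2 + b^2 + c^2 + d^2 \ge e^2 + f^2$ holds for any four points in Euclidean space, since the difference equals $\|p_0 - p_1 + p_2 - p_3\|^2$. The following minimal sketch checks it numerically on random points; the dimension, sample count, and names are illustrative assumptions, and it does not reproduce the Hellinger-distance argument itself.

```python
import random

def sq_dist(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def sides_minus_diagonals(p0, p1, p2, p3):
    """(a^2 + b^2 + c^2 + d^2) - (e^2 + f^2) for the quadrilateral p0 p1 p2 p3."""
    sides = sq_dist(p0, p1) + sq_dist(p1, p2) + sq_dist(p2, p3) + sq_dist(p3, p0)
    diagonals = sq_dist(p0, p2) + sq_dist(p1, p3)
    return sides - diagonals

rng = random.Random(0)
gaps = []
for _ in range(1000):
    # Four random points in 5 dimensions.
    pts = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(4)]
    gaps.append(sides_minus_diagonals(*pts))

# A unit square is a parallelogram, so the inequality is tight there.
square_gap = sides_minus_diagonals([0, 0], [1, 0], [1, 1], [0, 1])
```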

Page 18:

Open Problems

$L_k \circ L_p$ estimation for k < p

Other cascaded aggregates, e.g. entropy

Cascaded aggregates with 3 or more stages