An Improved Data Stream Summary: The Count-Min Sketch and its Applications Graham Cormode, S. Muthukrishnan 2003

Dec 14, 2015

Renee Feemster
Transcript
Page 1:

An Improved Data Stream Summary:

The Count-Min Sketch and its Applications

Graham Cormode, S. Muthukrishnan, 2003

Page 2:

Data Stream Model

We consider the vector $a(t) = (a_1(t), \ldots, a_i(t), \ldots, a_n(t))$, initially $a_i(0) = 0$ for all $i$.

The $t$-th update is $(i_t, c_t)$:

$a_{i_t}(t) = a_{i_t}(t-1) + c_t$

$a_{i'}(t) = a_{i'}(t-1)$ for all $i' \neq i_t$
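As a small illustration of this model, the sketch below applies a stream of $(i_t, c_t)$ updates to an explicit vector. The function name `apply_updates` and the sample stream are made up for this example; a real stream would be far too large to store this way, which is the motivation for the sketches on the following pages.

```python
def apply_updates(n, updates):
    """Return the vector a after applying every (i_t, c_t) update to a zero vector."""
    a = [0] * n                  # a_i(0) = 0 for all i
    for i, c in updates:         # the t-th update is (i_t, c_t)
        a[i - 1] += c            # items are numbered 1..n; only entry i_t changes
    return a


# Hypothetical stream over n = 5 items; a negative c_t models a deletion.
stream = [(1, 3), (4, 2), (1, -1), (2, 5)]
final = apply_updates(5, stream)   # [2, 5, 0, 2, 0]
```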

Page 3:

Count-Min Sketch

A Count-Min (CM) sketch with parameters $(\varepsilon, \delta)$ is represented by a two-dimensional array of counts with width $w$ and depth $d$: $count[1,1], \ldots, count[d,w]$.

Given parameters $(\varepsilon, \delta)$, set $w = \lceil e/\varepsilon \rceil$ and $d = \lceil \ln(1/\delta) \rceil$.

Each entry of the array is initially zero.

$d$ hash functions $h_1, \ldots, h_d : \{1, \ldots, n\} \to \{1, \ldots, w\}$ are chosen uniformly at random from a pairwise independent family.
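To make the sizing rule concrete, the following helper computes the width and depth from the two parameters. The name `cm_dimensions` and the chosen values of eps and delta are illustrative only.

```python
import math


def cm_dimensions(eps, delta):
    """Width and depth of a CM sketch for accuracy eps and failure probability delta."""
    w = math.ceil(math.e / eps)          # w = ceil(e / eps)
    d = math.ceil(math.log(1 / delta))   # d = ceil(ln(1 / delta))
    return w, d


w, d = cm_dimensions(eps=0.01, delta=0.01)   # 272 columns, 5 rows
```

With these example values the array holds 272 x 5 = 1360 counters, independent of the number of distinct items n.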

Page 4:

Update procedure:

When $(i_t, c_t)$ arrives, set, for all $1 \le j \le d$,

$count[j, h_j(i_t)] \leftarrow count[j, h_j(i_t)] + c_t$

[Figure: the item $i_t$ is mapped by $h_1, \ldots, h_d$ to one cell in each of the $d$ rows, and $+c_t$ is added to each of those cells.]
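A minimal Python sketch of the array and its update procedure follows. The hash family is an assumption of this example, not something the slides specify: each row uses $h_j(i) = ((a_j i + b_j) \bmod P) \bmod w$ with a fixed prime $P$, a common Carter-Wegman style choice. The class name, the 0-based cell indices, and the `seed` parameter are also illustrative.

```python
import math
import random

P = (1 << 61) - 1  # assumed prime modulus; item ids must be smaller than P


class CountMinSketch:
    def __init__(self, eps, delta, seed=0):
        self.w = math.ceil(math.e / eps)          # width  w = ceil(e / eps)
        self.d = math.ceil(math.log(1 / delta))   # depth  d = ceil(ln(1 / delta))
        self.count = [[0] * self.w for _ in range(self.d)]
        rng = random.Random(seed)
        # One (a, b) pair per row defines h_j(i) = ((a*i + b) mod P) mod w.
        self.ab = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(self.d)]

    def cell(self, j, i):
        """Column (0-based) that item i hashes to in row j."""
        a, b = self.ab[j]
        return ((a * i + b) % P) % self.w

    def update(self, i, c=1):
        # Add c_t to exactly one counter in each of the d rows.
        for j in range(self.d):
            self.count[j][self.cell(j, i)] += c


cm = CountMinSketch(eps=0.1, delta=0.05)
for item, c in [(7, 3), (42, 1), (7, 2)]:
    cm.update(item, c)
```

Because every update touches one counter per row, each row always sums to the total of all $c_t$ seen so far.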

Page 5:

Approximate Query Answering Using CM Sketches

point query $Q(i)$: approx. $a_i$

range queries $Q(l, r)$: approx. $\sum_{i=l}^{r} a_i$

inner product queries $Q(a, b)$: approx. $a \odot b = \sum_{i=1}^{n} a_i b_i$

Page 6:

Point Query

$Q(i)$: $\hat{a}_i = \min_j count[j, h_j(i)]$

Non-negative case ($a_i(t) \ge 0$)

Theorem 1: $a_i \le \hat{a}_i$, and

$P[\hat{a}_i \le a_i + \varepsilon \|a\|_1] \ge 1 - \delta$
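The point query can be sketched as follows. As before, the hash family ((a*i + b) mod P) mod w, the class name, and the synthetic stream are assumptions made for this example. The lower bound $a_i \le \hat{a}_i$ holds on every run when all counts are non-negative; the upper bound of Theorem 1 holds only with probability at least $1 - \delta$, so the code measures the worst overestimate rather than asserting it.

```python
import math
import random

P = (1 << 61) - 1  # assumed prime modulus for the hash family


class CountMinSketch:
    def __init__(self, eps, delta, seed=0):
        self.w = math.ceil(math.e / eps)
        self.d = math.ceil(math.log(1 / delta))
        self.count = [[0] * self.w for _ in range(self.d)]
        rng = random.Random(seed)
        self.ab = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(self.d)]

    def cell(self, j, i):
        a, b = self.ab[j]
        return ((a * i + b) % P) % self.w

    def update(self, i, c=1):
        for j in range(self.d):
            self.count[j][self.cell(j, i)] += c

    def point_query(self, i):
        # a_hat_i = min_j count[j, h_j(i)]; collisions can only add to a counter.
        return min(self.count[j][self.cell(j, i)] for j in range(self.d))


cm = CountMinSketch(eps=0.01, delta=0.01)
true = {}
rng = random.Random(1)
for _ in range(10_000):                 # synthetic insert-only stream
    item = rng.randrange(1, 1001)
    cm.update(item)
    true[item] = true.get(item, 0) + 1

# Largest overestimate over all items; Theorem 1 bounds it by eps * ||a||_1
# for each item with probability at least 1 - delta.
worst = max(cm.point_query(i) - true[i] for i in true)
```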

Page 7:

PROOF: We introduce indicator variables

$I_{i,j,k} = 1$ if $(i \neq k) \wedge (h_j(i) = h_j(k))$, and $0$ otherwise.

$E(I_{i,j,k}) = \Pr[h_j(i) = h_j(k)] \le \frac{1}{w} \le \frac{\varepsilon}{e}$

Define the variable $X_{i,j} = \sum_{k=1}^{n} I_{i,j,k}\, a_k$.

By construction, $count[j, h_j(i)] = a_i + X_{i,j}$, so $\min_j count[j, h_j(i)] \ge a_i$.

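Page 8:

For the other direction, observe that

$E(X_{i,j}) = E\left(\sum_{k=1}^{n} I_{i,j,k}\, a_k\right) = \sum_{k=1}^{n} a_k\, E(I_{i,j,k}) \le \frac{\varepsilon}{e} \|a\|_1$

Markov inequality: $\Pr[X \ge t] \le \frac{E(X)}{t}$ for all $t > 0$

$\Pr[\hat{a}_i > a_i + \varepsilon \|a\|_1] = \Pr[\forall j.\ count[j, h_j(i)] > a_i + \varepsilon \|a\|_1]$

$= \Pr[\forall j.\ a_i + X_{i,j} > a_i + \varepsilon \|a\|_1]$

$\le \Pr[\forall j.\ X_{i,j} > e\, E(X_{i,j})] \le e^{-d} \le \delta$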

Page 9:

Time to produce the estimate: $O(\ln \frac{1}{\delta})$

Time for updates: $O(\ln \frac{1}{\delta})$

Space used: $O(\frac{1}{\varepsilon} \ln \frac{1}{\delta})$

Remark: The constant $e$ is used here to minimize the space used.

Page 10:

General case

$Q(i)$: $\hat{a}_i = \mathrm{median}_j\, count[j, h_j(i)]$

Theorem 2: $\Pr[a_i - 3\varepsilon \|a\|_1 \le \hat{a}_i \le a_i + 3\varepsilon \|a\|_1] \ge 1 - \delta^{1/4}$

PROOF: $E(|count[j, h_j(i)] - a_i|) = E(|X_{i,j}|) \le \frac{\varepsilon}{e} \|a\|_1$

$\Pr(|count[j, h_j(i)] - a_i| \ge 3\varepsilon \|a\|_1) \le \frac{E(|X_{i,j}|)}{3\varepsilon \|a\|_1} \le \frac{1}{3e} < \frac{1}{8}$

Chernoff bounds: $\Pr(|\hat{a}_i - a_i| > 3\varepsilon \|a\|_1) < \delta^{1/4}$
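When updates may be negative, the minimum is no longer a safe estimator, so the median over the rows is taken instead. The functions below are a hedged sketch of that estimator; the dictionary layout, the function names, and the hash family ((a*i + b) mod P) mod w are assumptions of this example.

```python
import math
import random
import statistics

P = (1 << 61) - 1  # assumed prime modulus for the hash family


def make_sketch(eps, delta, seed=0):
    w = math.ceil(math.e / eps)
    d = math.ceil(math.log(1 / delta))
    rng = random.Random(seed)
    ab = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(d)]
    return {"w": w, "ab": ab, "count": [[0] * w for _ in range(d)]}


def update(sk, i, c):
    for row, (a, b) in zip(sk["count"], sk["ab"]):
        row[((a * i + b) % P) % sk["w"]] += c    # c may be negative here


def median_query(sk, i):
    # a_hat_i = median_j count[j, h_j(i)]
    return statistics.median(
        row[((a * i + b) % P) % sk["w"]]
        for row, (a, b) in zip(sk["count"], sk["ab"])
    )


sk = make_sketch(eps=0.05, delta=0.01)
for item, c in [(3, 10), (8, -4), (3, -2), (15, 6)]:
    update(sk, item, c)
estimate = median_query(sk, 3)   # the true value a_3 is 8
```

Note that `statistics.median` averages the two middle values when the depth d is even, so the estimate may then be a non-integer.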

Page 11:

Time to produce the estimate: $O(\ln \frac{1}{\delta})$

Time for updates: $O(\ln \frac{1}{\delta})$

Space used: $O(\frac{1}{\varepsilon} \ln \frac{1}{\delta})$

Page 12:

Inner Product Query

Set $(\widehat{a \odot b})_j = \sum_{k=1}^{w} count_a[j, k] \cdot count_b[j, k]$

$Q(a, b)$: $\widehat{a \odot b} = \min_j (\widehat{a \odot b})_j$
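The inner product estimate can be sketched as below. Both vectors must be sketched with the same hash functions, which the example enforces by passing the same `(w, ab)` pair to both calls. The function names, the sample vectors, and the hash family ((a*i + b) mod P) mod w are assumptions of this example.

```python
import math
import random

P = (1 << 61) - 1  # assumed prime modulus for the hash family


def make_hashes(eps, delta, seed=0):
    w = math.ceil(math.e / eps)
    d = math.ceil(math.log(1 / delta))
    rng = random.Random(seed)
    return w, [(rng.randrange(1, P), rng.randrange(P)) for _ in range(d)]


def sketch(vector, w, ab):
    """Build the count array for a dict {item: non-negative count}."""
    count = [[0] * w for _ in ab]
    for i, c in vector.items():
        for row, (a, b) in zip(count, ab):
            row[((a * i + b) % P) % w] += c
    return count


def inner_product_query(count_a, count_b):
    # min over rows j of sum_k count_a[j, k] * count_b[j, k]
    return min(
        sum(x * y for x, y in zip(row_a, row_b))
        for row_a, row_b in zip(count_a, count_b)
    )


a = {1: 4, 2: 1, 9: 3}
b = {1: 2, 9: 5, 20: 7}
w, ab = make_hashes(eps=0.05, delta=0.01)
estimate = inner_product_query(sketch(a, w, ab), sketch(b, w, ab))
exact = sum(c * b.get(i, 0) for i, c in a.items())   # 4*2 + 3*5 = 23
```

For non-negative vectors the estimate never falls below the exact value, because every row sum contains all the true terms $a_i b_i$ plus non-negative collision terms.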

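Page 13:

Theorem 3: $a \odot b \le \widehat{a \odot b}$, and

$\Pr[\widehat{a \odot b} \le a \odot b + \varepsilon \|a\|_1 \|b\|_1] \ge 1 - \delta$

PROOF:

$(\widehat{a \odot b})_j = \sum_{i=1}^{n} a_i b_i + \sum_{p \neq q,\ h_j(p) = h_j(q)} a_p b_q$

so $a \odot b \le (\widehat{a \odot b})_j$ for all $j$, and hence $a \odot b \le \widehat{a \odot b}$.

$E((\widehat{a \odot b})_j - a \odot b) = \sum_{p \neq q} \Pr[h_j(p) = h_j(q)]\, a_p b_q \le \sum_{p \neq q} \frac{\varepsilon\, a_p b_q}{e} \le \frac{\varepsilon \|a\|_1 \|b\|_1}{e}$

Markov inequality: $\Pr[X \ge t] \le \frac{E(X)}{t}$ for all $t > 0$

$\Pr[\widehat{a \odot b} - a \odot b > \varepsilon \|a\|_1 \|b\|_1] \le \delta$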

Page 14:

Time to produce the estimate: $O(\frac{1}{\varepsilon} \log \frac{1}{\delta})$

Time for updates: $O(\log \frac{1}{\delta})$

Space used: $O(\frac{1}{\varepsilon} \log \frac{1}{\delta})$

Page 15:

The application of inner-product computation to join size estimation (where the vectors generated have non-negative entries)

Join size of 2 database relations on a particular attribute:

$a \odot b$ = the number of items in the Cartesian product of the 2 relations which agree on the value of that attribute

$a_i, b_i$: the number of tuples in each of the two relations which have value $i$, for $i \in \{1, \ldots, n\}$
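The identity between the join size and the inner product of the two frequency vectors can be checked exactly on a small example. The relations `R` and `S` below are made-up lists of join-attribute values, one entry per tuple.

```python
from collections import Counter


def join_size(r_values, s_values):
    """Equi-join size computed as the inner product of the two frequency vectors."""
    a = Counter(r_values)   # a_i = number of tuples of the first relation with value i
    b = Counter(s_values)   # b_i = number of tuples of the second relation with value i
    return sum(a[i] * b[i] for i in a)


R = [1, 1, 2, 5, 5, 5]      # hypothetical attribute values, one per tuple
S = [1, 5, 5, 7]

# Direct count over the Cartesian product, for comparison.
pairs = sum(1 for x in R for y in S if x == y)
size = join_size(R, S)      # 2*1 + 3*2 = 8, the same as pairs
```

Replacing the two exact `Counter` objects by CM sketches built with shared hash functions gives the approximate version discussed on these slides.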

Page 16:

Corollary 1: The join size of two relations on a particular attribute can be approximated up to $\varepsilon \|a\|_1 \|b\|_1$ with probability $1 - \delta$ by keeping space $O(\frac{1}{\varepsilon} \log \frac{1}{\delta})$.

Page 17:

Range Query

Dyadic range: $[x 2^y + 1 \ldots (x+1) 2^y]$ for parameters $x, y$

A range query reduces to (at most) $2 \log_2 n$ dyadic range queries, and each dyadic range query is a single point query.

For each set of dyadic ranges of length $2^y$, $0 \le y \le \log_2 n - 1$, a sketch is kept: $\log_2 n$ CM sketches in total.

Page 18:

$Q(l, r)$:

Compute the dyadic ranges (at most $2 \log_2 n$) which canonically cover the range

Pose that many point queries to the sketches

Sum of queries: $\hat{a}[l, r]$
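The procedure above can be sketched as follows. To keep the example short and deterministic, each level uses an exact table as a stand-in for the CM sketch that would be kept for that level; this substitution, the names `dyadic_cover` and `DyadicRangeSums`, and the assumption that n is a power of two are all choices made for this illustration.

```python
from collections import defaultdict


def dyadic_cover(l, r, max_y):
    """Cover [l, r] (1-based, inclusive) with dyadic ranges.

    Each returned pair (x, y) stands for the range [x*2**y + 1, (x+1)*2**y].
    """
    out, lo, hi = [], l - 1, r           # work on the half-open range [lo, hi)
    while lo < hi:
        y = 0
        # Grow the block while it stays aligned at lo and inside [lo, hi).
        while y < max_y and lo % (2 << y) == 0 and lo + (2 << y) <= hi:
            y += 1
        out.append((lo >> y, y))
        lo += 1 << y
    return out


class DyadicRangeSums:
    def __init__(self, n):
        self.levels = n.bit_length() - 1             # log2(n); n assumed a power of two
        # One table per level y = 0 .. log2(n) - 1 (exact stand-in for a CM sketch).
        self.tables = [defaultdict(int) for _ in range(self.levels)]

    def update(self, i, c=1):
        for y in range(self.levels):
            self.tables[y][(i - 1) >> y] += c        # dyadic range of length 2**y holding i

    def range_sum(self, l, r):
        # One point query per dyadic range in the cover, then sum the answers.
        return sum(self.tables[y][x] for x, y in dyadic_cover(l, r, self.levels - 1))


rs = DyadicRangeSums(16)
for item, c in [(3, 2), (4, 1), (9, 5), (16, 1)]:
    rs.update(item, c)
cover = dyadic_cover(3, 10, 3)    # [(1, 1), (1, 2), (4, 1)] = [3,4], [5,8], [9,10]
total = rs.range_sum(3, 10)       # 2 + 1 + 5 = 8
```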

Page 19:

Theorem 4: $a[l, r] \le \hat{a}[l, r]$, and

$\Pr[\hat{a}[l, r] \le a[l, r] + 2\varepsilon \log n \|a\|_1] \ge 1 - \delta$

Proof: By Theorem 1, $a_i \le \hat{a}_i$, so $a[l, r] \le \hat{a}[l, r]$.

$E(\sum \text{error for each estimator}) \le 2 \log n \cdot E(\text{error for each estimator}) \le 2 \log n \cdot \frac{\varepsilon}{e} \|a\|_1$

$\Pr[\hat{a}[l, r] - a[l, r] > 2\varepsilon \log n \|a\|_1] \le e^{-d} \le \delta$

Page 20:

Time to produce the estimate: $O(\log(n) \log \frac{1}{\delta})$

Time for updates: $O(\log(n) \log \frac{1}{\delta})$

Space used: $O(\frac{\log(n)}{\varepsilon} \log \frac{1}{\delta})$

Remark: the guarantee will be more useful when stated without terms of $\log n$ in the approximation bound.

Page 21:

Applications of Count-Min Sketches

Quantiles

Heavy Hitters

Page 22:

Quantiles in the Turnstile Model

$\phi$-Quantiles: items with rank $k\phi \|a\|_1$, $k = 0, \ldots, 1/\phi$

($\varepsilon$-approx.: rank $\ge (k\phi - \varepsilon) \|a\|_1$ and rank $\le (k\phi + \varepsilon) \|a\|_1$)

Do binary searches for ranges $[1, r]$ whose range sum $a[1, r]$ is $k\phi \|a\|_1$, for $1 \le k \le 1/\phi - 1$
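The binary search can be sketched as below. The callable `prefix_sum(r)` stands in for the sketch-based range query $\hat{a}[1, r]$; here it is an exact prefix-sum table so that the result is easy to check by hand. That substitution, the function name `quantile_ranks`, and the sample vector are assumptions of this example, and the search relies on all current counts being non-negative so that prefix sums are non-decreasing in r.

```python
from itertools import accumulate


def quantile_ranks(prefix_sum, n, total, phi):
    """For k = 1 .. 1/phi - 1, binary-search the smallest r with a[1, r] >= k*phi*total."""
    found = []
    for k in range(1, round(1 / phi)):
        target = k * phi * total
        lo, hi = 1, n
        while lo < hi:
            mid = (lo + hi) // 2
            if prefix_sum(mid) >= target:
                hi = mid          # the answer is at mid or to its left
            else:
                lo = mid + 1      # not enough mass in [1, mid]
        found.append(lo)
    return found


a = [0, 4, 0, 0, 2, 0, 2, 0]            # a_1 .. a_8, all non-negative
prefix = [0] + list(accumulate(a))       # prefix[r] = a[1, r]
quartiles = quantile_ranks(lambda r: prefix[r], n=len(a), total=sum(a), phi=0.25)
# quartiles == [2, 2, 5]
```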

Page 23:

Theorem 5: $\varepsilon$-approximate $\phi$-quantiles can be found with probability at least $1 - \delta$ by keeping a data structure with space $O\left(\frac{1}{\varepsilon} \log^2(n) \log\left(\frac{\log n}{\phi\delta}\right)\right)$.

The time for each insert or delete operation is $O\left(\log(n) \log\left(\frac{\log n}{\phi\delta}\right)\right)$, and the time to find each quantile on demand is $O\left(\log(n) \log\left(\frac{\log n}{\phi\delta}\right)\right)$.

Page 24:

Heavy Hitters (cash register case)

$\phi$-Heavy Hitters: items whose multiplicity exceeds the fraction $\phi$: $a_i \ge \phi \|a\|_1$

($\varepsilon$-approx.: $a_i \ge (\phi - \varepsilon) \|a\|_1$)

When $(i_t, c_t)$ arrives, pose $Q(i_t)$; if $\hat{a}_{i_t} \ge \phi \|a(t)\|_1$, then $i_t$ is added to a heap.
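A hedged sketch of this procedure for an inserts-only stream follows. The eviction of candidates whose stored estimate has dropped below the current threshold, the lazy handling of stale heap entries through the `best` dictionary, the hash family ((a*i + b) mod P) mod w, and the sample stream are all choices made for this example rather than details given on the slide.

```python
import heapq
import math
import random

P = (1 << 61) - 1  # assumed prime modulus for the hash family


def heavy_hitters(stream, phi, eps, delta, seed=0):
    """Return {item: last estimate} for candidates still above phi * ||a||_1."""
    rng = random.Random(seed)
    w, d = math.ceil(math.e / eps), math.ceil(math.log(1 / delta))
    ab = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(d)]
    count = [[0] * w for _ in range(d)]
    total = 0          # ||a(t)||_1, valid because every c_t is positive
    best = {}          # item -> estimate recorded at its latest qualifying arrival
    heap = []          # (estimate, item) pairs; may contain stale entries
    for i, c in stream:
        total += c
        est = None
        for j, (a, b) in enumerate(ab):
            k = ((a * i + b) % P) % w
            count[j][k] += c
            est = count[j][k] if est is None else min(est, count[j][k])
        if est >= phi * total:
            best[i] = est
            heapq.heappush(heap, (est, i))
        # Evict candidates whose recorded estimate is now below the threshold.
        while heap and heap[0][0] < phi * total:
            e, item = heapq.heappop(heap)
            if best.get(item) == e:      # skip stale entries of re-inserted items
                del best[item]
    return best


# Item 1 arrives 50 times, followed by 100 distinct items arriving once each.
stream = [(1, 1)] * 50 + [(i, 1) for i in range(100, 200)]
hh = heavy_hitters(stream, phi=0.2, eps=0.01, delta=0.01)
```

Since estimates never undercount in the inserts-only case, an item whose true count reaches $\phi \|a\|_1$ is always retained; items reported in error are the ones that Theorem 6 bounds probabilistically.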

Page 25:

Theorem 6: The heavy hitters can be found from an inserts-only sequence of length $\|a\|_1$ by using CM sketches with space $O\left(\frac{1}{\varepsilon} \log \frac{\|a\|_1}{\delta}\right)$, and time $O\left(\log \frac{\|a\|_1}{\delta}\right)$ per item. Every item which occurs with count more than $\phi \|a\|_1$ is output, and with probability $1 - \delta$, no item whose count is less than $(\phi - \varepsilon) \|a\|_1$ is output.

Page 26:

Sketching techniques

Tug-of-war: Alon, Matias and Szegedy (1996)

Count sketch: Charikar, Chen and Farach-Colton (2002)

Random subset sums: Gilbert, Kotidis, Muthukrishnan and Strauss (2002)

Count-min sketch: Cormode and Muthukrishnan (2003)

Page 27:

- Linear projections of the vector $a$ with appropriately chosen random vectors

Computation:

Sketch: array of size $d \times w$

$h_1, \ldots, h_d : \{1, \ldots, n\} \to \{1, \ldots, w\}$: pairwise independent hash functions

$g_1, \ldots, g_d$: hash functions whose range and randomness vary

The $(j, k)$-th entry of the sketch: $\sum_{i:\ h_j(i) = k} a_i\, g_j(i)$
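The common formula can be illustrated with one row of such a sketch. The bucket function `h` and the sign function used below are fixed toy functions chosen so the result can be checked by hand; in an actual sketch they would be drawn at random from the families named on these slides. Buckets are 0-based in this example.

```python
def linear_sketch_row(a, h, g, w):
    """Row j of the sketch: entry k is the sum of a_i * g(i) over items i with h(i) = k."""
    row = [0] * w
    for i, a_i in enumerate(a, start=1):   # items are numbered 1..n
        row[h(i)] += a_i * g(i)
    return row


a = [5, 0, 2, 7]
w = 3
h = lambda i: i % w                        # toy bucket function, not random

# g(i) = 1 for every i gives a Count-Min style row of plain counters.
cm_row = linear_sketch_row(a, h, lambda i: 1, w)                      # [2, 12, 0]

# g(i) in {-1, +1} gives a signed row, as in tug-of-war and Count sketches.
sign_row = linear_sketch_row(a, h, lambda i: 1 if i % 2 else -1, w)   # [2, -2, 0]
```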

Page 28:

Tug-of-war: $w = 1$, $d = O\left(\frac{1}{\varepsilon^2} \log \frac{1}{\delta}\right)$, $g_j(i)$ is $\{-1, +1\}$ with 4-wise independence

Count sketch: $w = O\left(\frac{1}{\varepsilon^2}\right)$, $d = O\left(\log \frac{1}{\delta}\right)$, $g_j(i)$ is $\{-1, +1\}$ with 2-wise independence

Random subset sums: $w = 2$, $d = \frac{24}{\varepsilon^2} \log \frac{1}{\delta}$, $g_j(i)$ is $1$

Count-min sketch: $w = \frac{e}{\varepsilon}$, $d = \ln \frac{1}{\delta}$, $g_j(i)$ is $1$

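Page 29:

Method | Query | Space | Update Time | Query Time | Randomness Needed
--- | --- | --- | --- | --- | ---
Tug-of-war | Inner-product | $1/\varepsilon^2$ | $1/\varepsilon^2$ | $1/\varepsilon^2$ | 4-wise
Tug-of-war | Point | $\log(n)/\varepsilon^2$ | $\log(n)/\varepsilon^2$ | $\log(n)/\varepsilon^2$ | 4-wise
Tug-of-war | Range | $\log(n)/\varepsilon^2$ | $\log(n)/\varepsilon^2$ | $\log(n)/\varepsilon^2$ | 4-wise
Random subset-sums | Range | $\log^2(n)/\varepsilon^2$ | $\log^2(n)/\varepsilon^2$ | $\log^2(n)/\varepsilon^2$ | Pairwise
Count sketches | Point | $1/\varepsilon^2$ | $1$ | $1$ | Pairwise
Count-Min sketches | Point | $1/\varepsilon$ | $1$ | $1$ | Pairwise
Count-Min sketches | Inner-product | $1/\varepsilon$ | $1$ | $1/\varepsilon$ | Pairwise
Count-Min sketches | Range | $\log(n)/\varepsilon$ | $\log(n)$ | $\log(n)$ | Pairwise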