
Algorithms for massive data sets Lecture 2 (Mar 14, 2004) Yossi Matias & Ely Porat (partially based on various presentations & notes)

Transcript
Page 1:

Algorithms for massive data sets

Lecture 2 (Mar 14, 2004)

Yossi Matias & Ely Porat

(partially based on various presentations & notes)

Page 2:

Negative Result for Sampling [Charikar, Chaudhuri, Motwani, Narasayya 2000]

Theorem: Let E be an estimator for D(X) examining r < n values in X, possibly in an adaptive and randomized order. Then, for any $\gamma > e^{-r}$, E must have relative error at least

$$\sqrt{\frac{n-r}{2r}\,\ln\frac{1}{\gamma}}$$

with probability at least $\gamma$.

• Example – say, r = n/5

– Error 20% with probability 1/2 (plugging $\gamma = 1/2$ into the bound gives $\sqrt{2\ln 2} \approx 1.18$, an error factor of about 1.2)

Page 3:

Scenario Analysis

• Scenario A:
– all values in X are identical (say V)
– D(X) = 1

• Scenario B:
– distinct values in X are {V, W_1, …, W_k}
– V appears n−k times
– each W_i appears once
– the W_i's are randomly distributed in X
– D(X) = k+1

Page 4:

Proof

• Little Birdie – tells us it is one of Scenarios A or B only

• Suppose
– E examines elements X(1), X(2), …, X(r) in that order
– the choice of X(i) could be randomized and depend arbitrarily on the values of X(1), …, X(i−1)

• Lemma: $\Pr[X(i)=V \mid X(1)=X(2)=\dots=X(i-1)=V] \;\ge\; \frac{n-i+1-k}{n-i+1}$

• Why?
– No information on whether we are in Scenario A or B
– the $W_i$ values are randomly distributed

Page 5:

Proof (continued)

• Define $E_V$ – the event $\{X(1)=X(2)=\dots=X(r)=V\}$

• Then

$$\Pr[E_V] \;=\; \prod_{i=1}^{r} \Pr[X(i)=V \mid X(1)=\dots=X(i-1)=V] \;\ge\; \prod_{i=1}^{r} \frac{n-i+1-k}{n-i+1} \;\ge\; \left(1-\frac{k}{n-r}\right)^{r} \;\ge\; \exp\!\left(\frac{-2kr}{n-r}\right)$$

• Last inequality because $1 - Z \ge \exp(-2Z)$ for $0 \le Z \le 1/2$

Page 6:

Proof (conclusion)

• Choose $k = \frac{n-r}{2r}\ln\frac{1}{\gamma}$ to obtain $\Pr[E_V] \ge \gamma$

• Thus:
– Scenario A – $\Pr[E_V] = 1$
– Scenario B – $\Pr[E_V] \ge \gamma$

• Suppose
– E returns estimate Z when $E_V$ happens
– Scenario A → D(X) = 1
– Scenario B → D(X) = k+1
– Since the same Z must serve both scenarios, Z must have worst-case error $> \sqrt{k}$

Page 7:

Randomized Approximation (based on [Flajolet-Martin 1983, Alon-Matias-Szegedy 1996])

Theorem: For every c > 2 there exists an algorithm that, given a sequence A of n members of U = {1,2,…,u}, computes a number d’ using O(log u) memory bits, such that the probability that max(d’/d, d/d’) > c is at most 2/c.

• Let b be the smallest integer s.t. $2^b > u$; let F = GF($2^b$); let r, s be random members of F

• A bit vector BV will represent the set

• For each a in A, compute h(a) = r·a + s over F; if its binary representation is 101****10….0, ending in k zeros, set the k’th bit of BV (e.g., BV = 0000101010001001111)

• $\Pr[h(a)\text{ ends in exactly }k\text{ zeros}] = 2^{-(k+1)}$

• The estimate is $2^{\max \text{ bit set}}$
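To make the slide concrete, here is a minimal Python sketch of the bit-vector estimator. It is illustrative only: a random linear map modulo a fixed prime stands in for the GF($2^b$) arithmetic on the slide, and the class name FMSketch is ours.

```python
import random

class FMSketch:
    """Sketch of the Flajolet-Martin bit-vector estimator (illustrative)."""

    def __init__(self, p=2_147_483_647, seed=None):
        rng = random.Random(seed)
        self.p = p                        # prime modulus standing in for GF(2^b)
        self.r = rng.randrange(1, p)      # random multiplier
        self.s = rng.randrange(p)         # random offset
        self.bv = 0                       # the bit vector BV

    def add(self, a):
        h = (self.r * a + self.s) % self.p
        # k = number of trailing zeros of h(a); h = 0 counts as "all zeros"
        k = (h & -h).bit_length() - 1 if h else self.p.bit_length()
        self.bv |= 1 << k                 # set the k'th bit

    def estimate(self):
        if self.bv == 0:
            return 0
        return 2 ** (self.bv.bit_length() - 1)   # 2^(max bit set)

fm = FMSketch(seed=7)
for x in [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]:
    fm.add(x)
print(fm.estimate())                      # rough estimate of 7 distinct values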

Page 8:

Randomized Approximation (2) (based on [Indyk-Motwani 1998])

• Algorithm SM – for a fixed t, is D(X) >> t?
– Choose a hash function h: U → [1..t]
– Initialize answer to NO
– For each $x_i$, if $h(x_i) = t$, set answer to YES

• Theorem:
– If D(X) < t, P[SM outputs NO] > 0.25
– If D(X) > 2t, P[SM outputs NO] < $1/e^2 \approx 0.135$
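A one-instance sketch in Python (assuming, for illustration, a random linear map standing in for the ideal hash h: U → [1..t]):

```python
import random

def sm_instance(stream, t, seed=None):
    """One run of algorithm SM: YES iff some element hashes to t."""
    rng = random.Random(seed)
    p = 2_147_483_647                       # prime larger than the universe (assumption)
    a, b = rng.randrange(1, p), rng.randrange(p)
    answer = "NO"                           # a single bit of state
    for x in stream:
        if ((a * x + b) % p) % t + 1 == t:  # h(x) == t
            answer = "YES"
    return answer
```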

Page 9:

Analysis

• Let Y be the set of distinct elements of X

• SM(X) = NO ⟺ no element of Y hashes to t

• P[an element hashes to t] = 1/t

• Thus: $\Pr[\mathrm{SM}(X) = \mathrm{NO}] = (1 - 1/t)^{|Y|}$

• Since |Y| = D(X):
– If D(X) < t, $\Pr[\mathrm{SM}(X)=\mathrm{NO}] > (1-1/t)^{t} > 0.25$
– If D(X) > 2t, $\Pr[\mathrm{SM}(X)=\mathrm{NO}] < (1-1/t)^{2t} < 1/e^2$

• Observe – only 1 bit of memory is needed!

Page 10:

Boosting Accuracy

• With 1 bit we can probabilistically distinguish D(X) < t from D(X) > 2t

• Running O(log 1/δ) instances in parallel reduces the error probability to any δ > 0

• Running O(log n) instances in parallel for t = 1, 2, 4, 8, …, n can estimate D(X) within factor 2

• The choice of factor 2 is arbitrary – a factor (1+ε) reduces the error to ε

• EXERCISE – verify that we can estimate D(X) within factor (1±ε) with probability (1−δ) using space $O\!\left(\frac{1}{\varepsilon^2}\,\log\frac{1}{\delta}\,\log n\right)$
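A sketch combining the two ideas above – several independent SM copies per level t = 1, 2, 4, …, n with a majority vote. The number of copies and the voting rule are our illustrative choices, not the slide's:

```python
import math
import random

def estimate_distinct(stream, n, copies=15, seed=0):
    """Estimate D(X) within roughly a factor 2 by running SM in parallel
    for t = 1, 2, 4, ..., n with `copies` independent hashes per level."""
    rng = random.Random(seed)
    p = 2_147_483_647
    levels = [2 ** i for i in range(int(math.log2(n)) + 1)]
    hashes = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(copies)]
    yes = [[False] * copies for _ in levels]
    for x in stream:
        for li, t in enumerate(levels):
            for c, (a, b) in enumerate(hashes):
                if ((a * x + b) % p) % t + 1 == t:
                    yes[li][c] = True
    estimate = 1
    for t, votes in zip(levels, yes):
        if sum(votes) > copies / 2:       # majority of copies still says YES
            estimate = t
    return estimate
```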

Page 11:

Sampling: Basics

• Idea: a small random sample S of the data often well-represents all the data

– For a fast approximate answer, apply the query to S and “scale” the result

– E.g., R.a is {0,1}, S is a 20% sample

select count(*) from R where R.a = 0
→ select 5 * count(*) from S where S.a = 0

[Figure: the R.a column as a row of 0/1 values with the sampled positions shown in red; Est. count = 5·2 = 10, exact count = 10]

• Leverage the extensive literature on confidence intervals for sampling

– The actual answer is within the interval [a,b] with a given probability

– E.g., 54,000 ± 600 with probability 90%
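The scale-and-interval recipe is easy to mimic; a hedged sketch, where the Bernoulli sampling and the normal-approximation interval (z = 1.645 for roughly 90%) are our illustrative choices:

```python
import math
import random

def estimate_count(column, rate=0.2, z=1.645, seed=0):
    """Estimate count(R.a = 0) from a Bernoulli sample of R, scaled by 1/rate,
    with an approximate 90% normal-approximation confidence interval."""
    rng = random.Random(seed)
    hits = sum(1 for v in column if rng.random() < rate and v == 0)
    est = hits / rate                              # the 5x scaling for a 20% sample
    std = math.sqrt(hits * (1 - rate)) / rate      # std dev of the scaled count
    return est, (est - z * std, est + z * std)

print(estimate_count([1, 1, 0, 1, 0, 0, 1] * 1000))   # ~3000 with an interval
```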

Page 12:

Sampling versus Counting

• Observe
– A count is merely an abstraction – subsequent analytics are needed
– Data tuples – X is merely one of many attributes
– Databases – selection predicates, join results, …
– Networking – need to combine distributed streams

• Single-pass approaches
– Good accuracy
– But give only a count – cannot handle the extensions above

• Sampling-based approaches
– Keep actual data – can address the extensions
– But: strong negative result

Page 13:

Distinct Sampling for Streams [Gibbons 2001]

• Best of both worlds
– Good accuracy
– Maintains a “distinct sample” over the stream
– Handles the distributed setting

• Basic idea
– Hash – assigns a random “priority” to each domain value
– Tracks the highest-priority values seen
– Keeps a random sample of the tuples for each such value
– Relative error $\varepsilon$ with probability $1-\delta$ using $O\!\left(\frac{1}{\varepsilon^2}\log\frac{1}{\delta}\right)$ memory

Page 14:

Hash Function

• Domain U = [0..m−1]

• Hashing
– Random A, B from U, with A > 0
– g(x) = Ax + B (mod m)
– h(x) – # leading 0s in the binary representation of g(x)

• Clearly: $0 \le h(x) \le \log m$

• Fact: $\Pr[h(x) = l] = 2^{-(l+1)}$
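In code the level hash might look as follows (a sketch assuming a 32-bit domain):

```python
import random

M_BITS = 32                  # domain U = [0 .. 2^32 - 1] (assumption)
M = 1 << M_BITS

def make_level_hash(seed=None):
    """Returns h(x) = number of leading 0s of g(x) = Ax + B mod m."""
    rng = random.Random(seed)
    A = rng.randrange(1, M)  # random A > 0
    B = rng.randrange(M)
    def h(x):
        g = (A * x + B) % M
        return M_BITS - g.bit_length()   # leading zeros, so 0 <= h(x) <= 32
    return h
```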

Page 15:

Overall Idea

• Hash → a random “level” for each domain value

• Compute the level of each stream element

• Invariant
– Current level – cur_lev
– Sample S – all distinct values scanned so far of level at least cur_lev

• Observe
– Random hash → random sample of distinct values
– For each value we can keep a sample of its tuples

Page 16:

Algorithm DS (Distinct Sample)

• Parameters – memory size $M = O\!\left(\frac{1}{\varepsilon^2}\log\frac{1}{\delta}\right)$

• Initialize – cur_lev ← 0; S ← empty

• For each input x
– L ← h(x)
– If L ≥ cur_lev then add x to S
– If |S| > M
• delete from S all values of level cur_lev
• cur_lev ← cur_lev + 1

• Return $2^{cur\_lev} \cdot |S|$
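A direct transcription of DS into Python (a self-contained sketch; the 32-bit domain and the linear hash are assumptions, and the per-value tuple samples are omitted):

```python
import random

class DistinctSample:
    """Sketch of algorithm DS: keep every distinct value of level >= cur_lev."""

    M_BITS = 32                                   # assumed 32-bit domain

    def __init__(self, M, seed=None):
        rng = random.Random(seed)
        self.M = M                                # memory size
        self.A = rng.randrange(1, 1 << self.M_BITS)
        self.B = rng.randrange(1 << self.M_BITS)
        self.cur_lev = 0
        self.S = {}                               # value -> its level h(value)

    def level(self, x):                           # h(x): leading 0s of Ax+B mod m
        g = (self.A * x + self.B) % (1 << self.M_BITS)
        return self.M_BITS - g.bit_length()

    def process(self, x):
        L = self.level(x)
        if L >= self.cur_lev:
            self.S[x] = L
        while len(self.S) > self.M:               # overflow: drop the lowest level
            self.S = {v: l for v, l in self.S.items() if l > self.cur_lev}
            self.cur_lev += 1

    def estimate(self):
        return (2 ** self.cur_lev) * len(self.S)
```

Each overflow halves the expected sample size, which is exactly what makes the expectation bound on the next slide work out.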

Page 17:

Analysis

• Invariant – S contains all values x such that $h(x) \ge cur\_lev$

• By construction: $\Pr[h(x) \ge cur\_lev] = 2^{-cur\_lev}$

• Thus: $\mathrm{E}[|S|] = 2^{-cur\_lev} \cdot D(X)$

• EXERCISE – verify the deviation bound

Page 18:

Hot list queries

• Why is it interesting?
– Top ten – best-seller list
– Load balancing
– Caching policies

Page 19:

Hot list queries

• Let's use sampling

[Figure: a character stream edoejddkaklsadkjdkdkpryekfvcuszldfoasd; a random sample djkkdkvza drawn from it; the same sample stored compactly with counts, k3d2jvza (k×3, d×2, j, v, z, a)]

Page 20:

Hot list queries

• The question is:
– How do we sample if we don't know our sample size in advance?

Page 21:

Gibbons & Matias’ algorithm

• Maintain a hotlist of ⟨value, count⟩ pairs together with a sampling probability p (initially p = 1.0)

• A value already in the hotlist simply increments its counter as it arrives

[Figure: produced values c a a b d b a d d … arrive one at a time; the four hotlist counters start at 0 0 0 0, grow step by step (a: 1 → 2, b: 1, …), and reach 5 3 1 3; p = 1.0]

Page 22:

Gibbons & Matias’ algorithm

• A new value e arrives, but the hotlist is full – we need to replace one value

[Figure: same hotlist state as before (counts 5 3 1 3, p = 1.0), with e waiting for a slot]

Page 23:

Gibbons & Matias’ algorithm

• Multiply p by some factor f < 1 (here f = 0.75, so p becomes 0.75)

• For each hotlist counter, throw biased coins with heads probability f and replace the count by the number of heads seen

[Figure: the counts 5 3 1 3 shrink to 4 3 0 2 after the coin flips]

Page 24:

Gibbons & Matias’ algorithm

• Replace a value whose count has dropped to zero with the new value e (count 1)

[Figure: the counts become 4 3 1 2, with e occupying the freed slot; p = 0.75]

• Count/p is an estimate of the number of times a value has been seen. E.g., the value ‘a’ has been seen 4/p ≈ 5.33 times
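A compact sketch of the whole procedure in Python may help. The factor f = 0.75 follows the example above; the dictionary layout, the class name, and the rule that the entering value must survive each scale-down are our illustrative choices:

```python
import random

class HotList:
    """Sketch of the Gibbons-Matias counting-sample hotlist."""

    def __init__(self, capacity, f=0.75, seed=None):
        self.capacity = capacity
        self.f = f                  # scale-down factor for p
        self.p = 1.0                # current sampling probability
        self.counts = {}            # value -> count
        self.rng = random.Random(seed)

    def process(self, x):
        if x in self.counts:
            self.counts[x] += 1     # tracked values are counted exactly
            return
        if self.rng.random() >= self.p:
            return                  # new value not sampled this time
        while len(self.counts) >= self.capacity:
            # full: lower p, then thin every count to Binomial(count, f)
            self.p *= self.f
            for v in list(self.counts):
                heads = sum(self.rng.random() < self.f
                            for _ in range(self.counts[v]))
                if heads:
                    self.counts[v] = heads
                else:
                    del self.counts[v]          # a zero count frees a slot
            # the entering value must also survive the lowered p
            if self.rng.random() >= self.f:
                return
        self.counts[x] = 1

    def estimate(self, x):
        return self.counts.get(x, 0) / self.p   # count/p estimates frequency
```

Thinning every count by f keeps each stored count distributed approximately as if the whole stream had been sampled at the new, lower p.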

Page 25:

Counters

• How many bits do we need to count?
– Prefix codes
– Approximate counters
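The slide only names the techniques; as one standard instance (not from the slides), a Morris-style approximate counter stores just an exponent, so counting to n takes about log log n bits:

```python
import random

def morris_count(n_events, seed=None):
    """Approximate counter: store only the exponent c and increment it
    with probability 2^-c; then 2^c - 1 is an unbiased estimate of n."""
    rng = random.Random(seed)
    c = 0
    for _ in range(n_events):
        if rng.random() < 2.0 ** -c:
            c += 1
    return 2 ** c - 1

print(morris_count(100000, seed=3))   # typically within a small factor of 100000
```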

Page 26:

Rarity

• Paul goes fishing

• There are many different fish species U = {1,…,u}

• Paul catches one fish at a time, $a_t \in U$

• $C_t[j] = |\{i \mid a_i = j,\ i \le t\}|$ – the number of times species j has been caught up to time t

• Species j is rare at time t if it has been caught exactly once

• Rarity: $\rho[t] = |\{j \mid C_t[j] = 1\}| / u$

Page 27:

Rarity

• Why is it interesting?

Page 28:

Again, let's use sampling

• Choose a random subset of k species in advance, e.g. U’ = {4, 9, 13, 18, 24} ⊆ U = {1, 2, 3, …, u}

• For each sampled species, track $X_t[i] = |\{j \mid a_j = U'[i],\ j \le t\}|$

Page 29:

Again, let's use sampling

• $X_t[i] = |\{j \mid a_j = U'[i],\ j \le t\}|$

• Reminder: $\rho[t] = |\{j \mid C_t[j] = 1\}| / u$

• Estimator: $\rho'[t] = |\{i \mid X_t[i] = 1\}| / k$
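The estimator is short in code (a sketch; the parameter k and the names U_prime and estimate_rarity are ours):

```python
import random
from collections import Counter

def estimate_rarity(catches, u, k=50, seed=0):
    """rho'[t]: pick k species up front, count only their catches,
    and report the fraction of sampled species caught exactly once."""
    rng = random.Random(seed)
    U_prime = set(rng.sample(range(1, u + 1), k))
    X = Counter(a for a in catches if a in U_prime)
    return sum(1 for c in X.values() if c == 1) / k
```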

Page 30:

Rarity

• But $\rho[t]$ needs to be at least about 1/k for the sample to contain any rare species, i.e., to get a good estimator:

$$\rho[t] = \frac{|\{j \mid C_t[j] = 1\}|}{|\{j \mid C_t[j] \neq 0\}|}$$

Page 31:

Min-wise independent hash functions

• A family H of hash functions [n] → [n] is called min-wise independent

• if for any X ⊆ [n] and x ∈ X:

$$\Pr_{h \in H}\big[h(x) = \min h(X)\big] = \frac{1}{|X|}$$
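A quick empirical check of the definition (a sketch: random linear maps mod a prime are only approximately min-wise independent and merely stand in for a truly min-wise family H):

```python
import random

def prob_x_is_min(X, x, trials=20000, p=10007, seed=0):
    """Estimate Pr over random h(y) = (a*y + b) mod p that h(x) = min h(X)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        a, b = rng.randrange(1, p), rng.randrange(p)
        hx = (a * x + b) % p
        if hx == min((a * y + b) % p for y in X):
            wins += 1
    return wins / trials

X = [3, 7, 19, 42, 100]
print(prob_x_is_min(X, 42), "vs", 1 / len(X))   # both close to 0.2
```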