1 Efficient Computation of Frequent and Top-k Elements in Data Streams Ahmed Metwally Divyakant Agrawal Amr El Abbadi Department of Computer Science University of California, Santa Barbara

Mar 28, 2015


Jasmine Rose

Efficient Computation of Frequent and Top-k Elements in Data Streams

Ahmed Metwally

Divyakant Agrawal

Amr El Abbadi

Department of Computer Science

University of California, Santa Barbara


Motivation

Motivated by Internet advertising commissioners.

Before rendering an advertisement for a user, query the clicks stream for advertisements to display.

If the user's profile is not a frequent “clicker”, then s/he will probably not click any displayed advertisement.
– Show Pay-Per-Impression advertisements.

If the user's profile is a frequent “clicker”, then s/he may click a displayed advertisement.
– Show Pay-Per-Click advertisements.
– Retrieve top advertisements to choose what to display.


Problem Definition

Given alphabet A and stream S of size N, a frequent element, E, is an element whose frequency, F, exceeds a user-specified support, φN.

Top-k elements are the k elements with the highest frequency.

Both problems:
– Are closely related, though no integrated solution has been proposed
– Have exact solutions that need O(min(N, |A|)) space, which motivates approximate variations
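To make the two definitions concrete, here is a minimal exact baseline in Python. It is only a sketch for illustration: it keeps one counter per distinct element, which is the O(min(N, |A|)) cost that the approximate variations avoid. The function names `exact_frequent` and `exact_top_k` are made up for this example and do not come from the slides.

```python
from collections import Counter

def exact_frequent(stream, phi):
    """Return every element whose frequency F exceeds the support phi * N."""
    counts = Counter(stream)          # one counter per distinct element
    n = sum(counts.values())          # N, the stream size
    return {e for e, f in counts.items() if f > phi * n}

def exact_top_k(stream, k):
    """Return the k elements with the highest frequency, highest first."""
    return [e for e, _ in Counter(stream).most_common(k)]

stream = "ABBACABBDDBEC"              # N = 13
frequent = exact_frequent(stream, 0.2)   # support is 0.2 * 13 = 2.6 hits
top2 = exact_top_k(stream, 2)
```

On this 13-element stream the exact frequencies are B = 5, A = 3, C = 2, D = 2, E = 1, so only A and B exceed a support of 2.6 hits.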


Practical Frequent Elements

ε-Deficient Frequent Elements [Manku ‘02]:
– All frequent elements output should have F > (φ - ε)N, where ε is the user-defined error.

[Figure: a frequency axis marking the thresholds φN and (φ - ε)N]


Practical Top-k

FindApproxTop(S, k, ε) [Charikar ‘02]:
– Retrieve a list of k elements such that every element, E_i, in the list has F_i > (1 - ε) F_k, where E_k is the kth ranked element.

[Figure: a frequency axis marking F_4 and (1 - ε) F_4]


Related Work

Algorithms Classification
– Counter-Based techniques
  • Keep an individual counter for each element
  • If the observed ID is monitored, its counter is updated
  • If the observed ID is not monitored, algorithm dependent action
– Sketch-Based techniques
  • Estimate frequency for all elements using bit-maps of counters
  • Each element is hashed into the counters’ space using a family of hash functions.
  • Hashed-to counters are queried for the frequencies


Recent Work (Comparison)

| Algorithm | Nature | Space Bound | Handles |
|---|---|---|---|
| CountSketch [Charikar ‘02] | Sketch | O(k/ε^2 log(N/δ)), δ is the failure probability | FindApproxTop(S, k, ε) |
| GroupTest [Cormode ’03] | Sketch | O(φ^-1 log(φ^-1) log(|A|)) | Hot Items |
| Frequent [Demaine ’02] | Counter | O(1/ε), proved by [Bose ‘03] | FE |
| Probabilistic-Inplace [Demaine ’02] | Counter | O(m), m is the available memory | FindCandidateTop(S, k, m/2) |
| Lossy Counting [Manku ’02] | Counter | (1/ε) log(εN) | ε-Deficient FE |
| Sticky Sampling [Manku ’02] | Counter | (2/ε) log(φ^-1 δ^-1) | ε-Deficient FE |


Outline

– Problem Definition
– Space-Saving: Summarizing the Data Stream
– Answering Frequent Elements Queries
– Answering Top-k Queries
– Experimental Results
– Conclusion


The Space-Saving Algorithm

– Space-Saving is counter-based
– Monitor only m elements
– Only over-estimation errors
– Frequency estimation is more accurate for significant elements
– Keep track of max. possible errors


Space-Saving By Example

Space-Saving Algorithm
– For every element in the stream S
– If a monitored element is observed
  • Increment its Count
– If a non-monitored element is observed,
  • Replace the element with minimum hits, min
  • Increment the minimum Count to min + 1
  • Record min as error, the maximum possible over-estimation

Example with three counters on the stream A B B A C A B B D D B E C. Each row shows the Stream-Summary after the listed elements have been read, as Element (Count, error):

| Elements read | Stream-Summary |
|---|---|
| A B B A C | A (2, 0), B (2, 0), C (1, 0) |
| then A | A (3, 0), B (2, 0), C (1, 0) |
| then B B | B (4, 0), A (3, 0), C (1, 0) |
| then D | B (4, 0), A (3, 0), D (2, 1) |
| then D B | B (5, 0), A (3, 0), D (3, 1) |
| then E | B (5, 0), E (4, 3), A (3, 0) |
| then C | B (5, 0), E (4, 3), C (4, 3) |
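The update rule can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: it finds the minimum with a linear scan over a dictionary instead of using the Stream-Summary structure, and it breaks ties between equal minimum Counts arbitrarily, so an intermediate state may evict a different element than the slide does. The name `space_saving` and the (element, count, error) tuple layout are assumptions made for this example.

```python
def space_saving(stream, m):
    """Summarize `stream` with at most m counters.

    Returns (element, count, error) tuples, highest count first.
    """
    counts = {}   # monitored element -> Count
    errors = {}   # monitored element -> maximum possible over-estimation
    for e in stream:
        if e in counts:
            counts[e] += 1                       # monitored: increment its Count
        elif len(counts) < m:
            counts[e], errors[e] = 1, 0          # free counter: start exact
        else:
            victim = min(counts, key=counts.get) # element with minimum hits
            min_hits = counts.pop(victim)
            errors.pop(victim)
            counts[e] = min_hits + 1             # take over the counter at min + 1
            errors[e] = min_hits                 # e may have been over-counted by min
    return sorted(((e, counts[e], errors[e]) for e in counts),
                  key=lambda row: -row[1])

summary = space_saving("ABBACABBDDBEC", 3)
# summary == [("B", 5, 0), ("E", 4, 3), ("C", 4, 3)]
```

The final state is B (5, 0), E (4, 3), C (4, 3), and the Counts add up to N = 13.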


Space-Saving Observations

Observations:

S = ABBACABBDDBEC N = 13

| Element | B | E | C |
|---|---|---|---|
| Count | 5 | 4 | 4 |
| error (max possible) | 0 | 3 | 3 |

– The summation of the Counts is N
– Minimum number of hits, min ≤ N/m
– In this example, min = 4
– The minimum number of hits, min, is an upper bound on the error of any element
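The bound min ≤ N/m follows from the first observation. A short derivation, assuming all m counters are in use and writing Count_i for the i-th counter (notation introduced here for illustration): every arriving element raises exactly one Count by one, so the Counts sum to N, and each Count is at least min.

```latex
\[
N \;=\; \sum_{i=1}^{m} \mathrm{Count}_i \;\ge\; m \cdot \mathit{min}
\quad\Longrightarrow\quad
\mathit{min} \;\le\; \frac{N}{m}
\]
```

In the example, 13 = 5 + 4 + 4 ≥ 3 · 4, so min = 4 ≤ 13/3 ≈ 4.33. The third observation holds for a similar reason: an element's error is the value of min at the moment it took over a counter, and min never decreases as the stream is read.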


Space-Saving Proved Properties

1. If Element E has frequency F > min, then E must be in Stream-Summary. Example: F(B) = F_1 = 5, min = 4.

2. The Count at position i in Stream-Summary is no less than F_i, the frequency of the ith ranked element. Example: F(A) = F_2 = 3, Count_2 = 4.

S = ABBACABBDDBEC N = 13

| Element | B | E | C |
|---|---|---|---|
| Count | 5 | 4 | 4 |
| error (max possible) | 0 | 3 | 3 |


Space-Saving Data Structure

We need a data structure that
– Increments counters in constant time
– Keeps elements sorted by their counters

We propose the Stream-Summary structure, similar to the data structure in [Demaine ’02]
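The slides do not spell out the layout of Stream-Summary, so the following Python sketch shows only one way to meet both requirements, under an assumed design: a doubly linked list of buckets in increasing Count order, where each bucket holds the elements sharing that Count, plus a hash table from element to bucket. All class, method, and attribute names here are hypothetical.

```python
class Bucket:
    def __init__(self, count):
        self.count = count
        self.items = {}      # element -> error, for elements with this Count
        self.prev = None     # neighbour with the next smaller Count
        self.next = None     # neighbour with the next larger Count

class StreamSummary:
    def __init__(self, m):
        self.m = m
        self.bucket_of = {}  # element -> Bucket (the hash table)
        self.head = None     # bucket holding the minimum Count

    def _detach(self, b):
        """Unlink b if it is empty; return the bucket to insert after."""
        if b.items:
            return b
        if b.prev is not None:
            b.prev.next = b.next
        else:
            self.head = b.next
        if b.next is not None:
            b.next.prev = b.prev
        return b.prev

    def _place(self, elem, error, count, after):
        """Put elem in the bucket for `count`, directly after `after`."""
        nxt = after.next if after is not None else self.head
        if nxt is not None and nxt.count == count:
            target = nxt
        else:
            target = Bucket(count)
            target.prev, target.next = after, nxt
            if nxt is not None:
                nxt.prev = target
            if after is not None:
                after.next = target
            else:
                self.head = target
        target.items[elem] = error
        self.bucket_of[elem] = target

    def offer(self, elem):
        """Process one stream element with a constant number of pointer updates."""
        if elem in self.bucket_of:               # monitored: Count + 1
            b = self.bucket_of[elem]
            error, count = b.items.pop(elem), b.count + 1
            after = self._detach(b)
        elif len(self.bucket_of) < self.m:       # free counter
            error, count, after = 0, 1, None
        else:                                    # replace an element with min hits
            b = self.head
            victim = next(iter(b.items))
            del b.items[victim]
            del self.bucket_of[victim]
            error, count = b.count, b.count + 1
            after = self._detach(b)
        self._place(elem, error, count, after)

    def rows(self):
        """Return (element, count, error) tuples, highest Count first."""
        out, b = [], self.head
        while b is not None:
            out.extend((e, b.count, err) for e, err in b.items.items())
            b = b.next
        return out[::-1]
```

Because a Count only ever grows by one, an updated element always moves to the neighbouring bucket, so no search or re-sort is needed and the minimum is always at the head of the list.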


Frequent Elements Queries

Traverse Stream-Summary, and report all elements that satisfy the user support.

Any element whose guaranteed hits = (Count – error) > φN is guaranteed to be a frequent element.


Frequent Elements Example

For N = 73, m = 8, φ = 0.15:

| Element | B | D | G | A | Q | F | C | E |
|---|---|---|---|---|---|---|---|---|
| Count | 20 | 14 | 12 | 9 | 7 | 5 | 3 | 3 |
| error | 1 | 0 | 4 | 1 | 3 | 0 | 1 | 2 |
| Guaranteed Hits = Count - error | 19 | 14 | 8 | 8 | 4 | 5 | 2 | 1 |

– Frequent Elements should have support of at least 11 hits (φN = 10.95).
– Candidate Frequent Elements are B, D, and G.
– Guaranteed Frequent Elements are B and D, since their guaranteed hits exceed φN.
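The query over this example can be sketched as a single pass over the summary. The function name `frequent_elements` and the (element, count, error) tuple layout are assumptions made for illustration; the numbers are the ones in the table above.

```python
def frequent_elements(summary, phi, n):
    """Split a Space-Saving summary into candidate and guaranteed frequent elements."""
    support = phi * n
    # Candidates: Count exceeds the support, but the Count may be an over-estimate.
    candidates = [e for e, count, _ in summary if count > support]
    # Guaranteed: even after subtracting the maximum error, the support is exceeded.
    guaranteed = [e for e, count, error in summary if count - error > support]
    return candidates, guaranteed

SUMMARY = [("B", 20, 1), ("D", 14, 0), ("G", 12, 4), ("A", 9, 1),
           ("Q", 7, 3), ("F", 5, 0), ("C", 3, 1), ("E", 3, 2)]

candidates, guaranteed = frequent_elements(SUMMARY, phi=0.15, n=73)
# candidates == ["B", "D", "G"], guaranteed == ["B", "D"]
```

G is a candidate because its Count of 12 exceeds 10.95, but it is not guaranteed because its guaranteed hits, 12 - 4 = 8, fall below the support.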


Frequent Elements Space Bounds

| Space Bounds | General Distribution | Zipf(α) |
|---|---|---|
| Space-Saving | O(1/ε) | (1/ε)^(1/α) |
| GroupTest | O(φ^-1 log(φ^-1) log(|A|)) | |
| Frequent | O(1/ε), proved by [Bose ’03] | |
| Lossy Counting | (1/ε) log(εN) | |
| Sticky Sampling | (2/ε) log(φ^-1 δ^-1) | |


Top-k Elements Queries

Traverse the Stream-Summary, and report top-k elements.

From Property 2, we assert:
– Guaranteed top-k elements:
  • Any element whose guaranteed hits = (Count – error) ≥ Count_(k+1) is guaranteed to be in the top-k.
– Guaranteed top-k’ (where k’ ≈ k):
  • The top-k’ elements reported are guaranteed to be the correct top-k’ iff for every element in the top-k’, guaranteed hits = (Count – error) ≥ Count_(k’+1).


Top-k Elements Example

For k = 3, m = 8:

| Element | B | D | G | A | Q | F | C | E |
|---|---|---|---|---|---|---|---|---|
| Count | 20 | 14 | 12 | 9 | 7 | 5 | 3 | 3 |
| error | 1 | 0 | 4 | 1 | 3 | 0 | 1 | 2 |
| Guaranteed Hits = Count - error | 19 | 14 | 8 | 8 | 4 | 5 | 2 | 1 |

– B, D, and G are the top-3 candidates.
– B and D are guaranteed to be in the top-3.
– B, D, G, and A are guaranteed to be the top-4. Here k’ = 4.
– B and D are guaranteed to be the top-2. Another k’ = 2.
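The two guarantees can be checked mechanically. The sketch below is illustrative only: the function names and the (element, count, error) layout are assumptions, the check is restricted to the first k reported elements, and it requires k to be smaller than the number of monitored elements so that Count_(k+1) exists.

```python
def guaranteed_members(summary, k):
    """Elements among the first k whose guaranteed hits reach Count_(k+1).

    `summary` is sorted by Count, highest first, and must hold more than k rows.
    """
    if not 0 < k < len(summary):
        raise ValueError("k must be between 1 and the number of counters minus 1")
    next_count = summary[k][1]                  # Count_(k+1), zero-based index k
    return [e for e, count, error in summary[:k] if count - error >= next_count]

def is_guaranteed_top_k(summary, k):
    """True when every one of the first k elements passes the guarantee."""
    return len(guaranteed_members(summary, k)) == k

SUMMARY = [("B", 20, 1), ("D", 14, 0), ("G", 12, 4), ("A", 9, 1),
           ("Q", 7, 3), ("F", 5, 0), ("C", 3, 1), ("E", 3, 2)]

in_top3 = guaranteed_members(SUMMARY, 3)        # ["B", "D"]
exact_sizes = [k for k in range(1, 8) if is_guaranteed_top_k(SUMMARY, k)]
```

For k = 3 the threshold is Count_4 = 9, which G misses with 8 guaranteed hits. For k’ = 4 the threshold drops to Count_5 = 7 and all four elements pass, and for k’ = 2 the threshold is Count_3 = 12, which both B and D pass.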


Top-k Elements Space Bounds

| Space Bounds | General Distribution | Zipf(α) |
|---|---|---|
| Space-Saving | FindApproxTop(S, k, ε): O(k/ε * log(N)) | Exact Top-k Problem: α = 1: O(k^2 log(A)); α > 1: O((k/α)^(1/α) k) |
| CountSketch | FindApproxTop(S, k, ε): O(k/ε^2 * log(N/δ)) | FindApproxTop(S, k, ε): α ≥ 1: O(k * log(N/δ)) |


Outline

– Problem Definition
– Space-Saving: Summarizing the Data Stream
– Answering Frequent Elements Queries
– Answering Top-k Queries
– Experimental Results
– Conclusion


Experimental Results - Setup

Synthetic data:
– Zipf(α), α varied: 0.0, 0.5, 1.0, …, 2.5, 3.0
– N = 10^7 hits.

Real Data (ValueClick, Inc.): Similar results

Precision:
– number of correct elements found / entire output

Recall:
– number of correct elements found / number of actual correct

Run time:
– Processing Stream + Query Time

Space used:
– Including hash table
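The two quality metrics can be written as a small helper. This is a sketch under the definitions above; the function name and the sample sets are made up for illustration and are not taken from the experiments.

```python
def precision_recall(reported, actual):
    """Precision and recall of a reported answer against the true answer."""
    reported, actual = set(reported), set(actual)
    correct = len(reported & actual)            # correct elements found
    # An empty output or an empty truth is treated as perfect by convention here.
    precision = correct / len(reported) if reported else 1.0
    recall = correct / len(actual) if actual else 1.0
    return precision, recall

# Illustrative sets only: three elements reported, two of them truly frequent.
precision, recall = precision_recall({"B", "D", "G"}, {"B", "D"})
```

In this made-up case recall is 1 because every correct element was reported, while precision is 2/3 because one reported element is not correct.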


Frequent Elements Results

Query: φ = 10^-2, ε = 10^-4, and δ = 10^-2

We compared with
– GroupTest and Frequent

All algorithms had a recall of 1.
– That is, they all output the correct elements among their output.

Space-Saving was able to guarantee all its output to be correct


Frequent Elements Precision

Precision for Frequent Elements (>100,000 Hits) on Synthetic Data

[Chart: Precision (y-axis, 0 to 1) versus Zipf Alpha (x-axis, 0 to 3); series: Space-Saving, GroupTest, Frequent]


Frequent Elements Run Time

Run Time for Frequent Elements (>100,000 Hits) on Synthetic Data

[Chart: Run Time in ms (y-axis, 0 to 60000) versus Zipf Alpha (x-axis, 0 to 3); series: Space-Saving, GroupTest, Frequent]


Frequent Elements Space Used

Space Used for Frequent Elements (>100,000 Hits) on Synthetic Data

[Chart: Space Used in Bytes (y-axis, 0 to 180000) versus Zipf Alpha (x-axis, 0 to 3); series: Space-Saving, GroupTest, Frequent]


Top-k Elements Results

Query: k = 100, ε = 10^-4, and δ = 10^-2

We compared with
– CountSketch: CountSketch was re-run several times. The hidden constant was estimated to be 16, in order to have output of competitive quality.
– Probabilistic-InPlace: was allowed the same number of counters as Space-Saving

Space-Saving was able to guarantee all its output to be correct


Top-k Elements Precision

Precision for Top-100 on Synthetic Data

[Chart: Precision (y-axis, 0 to 1) versus Zipf Alpha (x-axis, 0 to 3); series: Space-Saving, CountSketch, Probabilistic InPlace]


Top-k Elements Recall

Recall for Top-100 on Synthetic Data

[Chart: Recall (y-axis, 0 to 1) versus Zipf Alpha (x-axis, 0 to 3); series: Space-Saving, CountSketch, Probabilistic InPlace]


Top-k Elements Run Time

Run Time for Top-100 on Synthetic Data

[Chart: Run Time in ms (y-axis, 0 to 2000000) versus Zipf Alpha (x-axis, 0 to 3); series: Space-Saving, CountSketch, Probabilistic InPlace]


Top-k Elements Space Used

Space Used for Top-100 on Synthetic Data

[Chart: Space Used in Bytes (y-axis, 0 to 450000) versus Zipf Alpha (x-axis, 0 to 3); series: Space-Saving, CountSketch, Probabilistic InPlace]


Conclusion

Contributions:
– An integrated approach to solve an interesting family of problems
– Strict error bounds using little space
– Guarantees on results
– Special attention was given to Zipfian data
– Experimental validation

Future Work:
– Incremental frequent and top-k elements reporting