Aggregation Computation Over Distributed Data Streams (partial content) Yueshen Xu [email protected] Middleware, CCNT Zhejiang Univ Middleware, CCNT, ZJU 06/26/22
Aggregation computation over distributed data streams(the final version)

Jan 27, 2015

Yueshen Xu

I have fixed a few mistakes in the original version and added some new content. As before, I hope it is helpful to you.
Transcript
Page 1: Aggregation computation over distributed data streams(the final version)

Aggregation Computation Over Distributed Data Streams

(partial content)

Yueshen Xu

Middleware, CCNT

Zhejiang Univ

Middleware, CCNT, ZJU, 04/10/23

Page 2: Aggregation computation over distributed data streams(the final version)

Paper reference

What's Different: Distributed, Continuous Monitoring of Duplicate-Resilient Aggregates on Data Streams

Published in ICDE 2006. Cited 61 times. By Graham Cormode, S. Muthukrishnan, et al.

I think it is good reading for newcomers to distributed data streams.

[Figure: author affiliations. Bell Labs (Expert/27), Rutgers (Expert/45)]

Page 3: Aggregation computation over distributed data streams(the final version)

Background

Distributed data streams: where and why?
  Large-scale monitoring applications
  Many sensors distributed over a wide area (just one example)

Distributed streaming model

What do we research?
  Query paradigm: centralized vs. decentralized

Page 4: Aggregation computation over distributed data streams(the final version)

Constraints and Features

Constraints (all resources are restricted)
  Space: embedded devices do not have enough memory
  Processing power: limited for the same reason
  Communication capability: unreliable, spotty and sporadic

Features
  Unlike ad hoc queries in a DBMS, queries here are continuous
  Continuity is one of the core characteristics of queries over data streams
  What's different?

Page 5: Aggregation computation over distributed data streams(the final version)

Trouble

Duplication (the root of all evil)
  Why? Wide-scale monitoring invariably observes the same events at different points

Instances
  The same flow is observed at different routers
  The same individual is observed by several mobile sensors

Requirement
  Duplicate-resilient aggregates

Two vital questions
  What is the amount of duplication in the network?
  What are the versions of classical aggregates in the presence of duplicates?

Page 6: Aggregation computation over distributed data streams(the final version)

Topic

What kinds of topics are researchers interested in?
  Aggregation computation
  Routing algorithms
  …

What is aggregation?
  Summarization, namely a statistic computed from the original data sets
  Examples: min, max, quantiles, heavy hitters, distinct counts, average, sum, …
  Not unfamiliar to anyone who has worked with data streams

Why only aggregation, and not, for example, transactions?
  The topics mirror those in databases

Page 7: Aggregation computation over distributed data streams(the final version)

Problems and Concerns

Distinct count
  Obtain the number of distinct elements (items, records, etc.) in a multiset, namely its cardinality

Distinct sample
  Important, but I am sorry that I have not finished this part

What is this paper concerned with?
  Priority: correctness and communication cost, then computational cost and space cost
  These concerns are typical of algorithms applied in distributed environments

What do we care about when dealing with aggregation computation?
  Space complexity may deserve more attention than time complexity
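To make the distinct count and its communication cost concrete, here is a minimal Python sketch. The three router lists and the flow IDs are invented for illustration and do not come from the paper. It contrasts a naive sum of per-site counts, which counts a flow once per site that saw it, with the exact count over the union, which is correct but requires every site to ship all of its IDs to one place.

```python
# Hypothetical observations: each router reports the flow IDs it saw.
router_a = ["f1", "f2", "f3"]
router_b = ["f2", "f3", "f4"]
router_c = ["f3", "f5"]


def naive_total(*sites):
    """Sum the per-site distinct counts (cheap to send, but overcounts)."""
    return sum(len(set(site)) for site in sites)


def distinct_total(*sites):
    """Exact distinct count: union all IDs first, then count."""
    seen = set()
    for site in sites:
        seen.update(site)
    return len(seen)


naive = naive_total(router_a, router_b, router_c)     # 3 + 3 + 2
exact = distinct_total(router_a, router_b, router_c)  # size of {f1..f5}
print("naive:", naive, "exact:", exact, "duplicates:", naive - exact)
```

The gap between the two numbers is the amount of duplication. The exact version needs space and communication linear in the number of items, which is what a sketch is meant to avoid.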

Page 8: Aggregation computation over distributed data streams(the final version)

Distinct Counting: Flajolet-Martin Sketch

Flajolet-Martin sketch
  P. Flajolet, G. N. Martin. Probabilistic Counting Algorithms for Data Base Applications. Journal of Computer and System Sciences, 1985 (cited 628 times)

Goal: estimate the cardinality of a multiset using relatively small space and a single pass over the data
  Inherently well suited to data streams

A sketch is a kind of data structure from which the aggregation result is obtained.

I think this method can be regarded as a classic application of probability that is not complicated.

A question: how would you deal with this problem?
  The computing paradigm of sketching
  Space complexity
  Just think about one scenario at Taobao

Page 9: Aggregation computation over distributed data streams(the final version)

Flajolet-Martin Sketch(Cont.)

Preliminaries: what do we need?
  The multiset M, containing all items/records, with |M| = n
  The upper bound U on the number of distinct items/records, which is larger than n
  One bitmap consisting of L elements, with 2^L = U
  The hash function h(x), which maps each item/record x to a binary string b_1 b_2 … b_L distributed uniformly over a range of 2^L values, where b_1 is the lowest digit and b_L is the highest
  The function p(x), which returns the position of the leftmost '1' (since b_1 is written first, this is the lowest set bit)

Counting, not computing

PPT vs. whiteboard?

[Figure: a record x is hashed to h(x), a bit string such as 1 1 … 0 0 with positions 1 to L]

Page 10: Aggregation computation over distributed data streams(the final version)

Flajolet-Martin Sketch(Cont.)

The algorithm itself
  The core task: for every record, mark in bitmap B the position p(x) of the leftmost '1' of its hash value

  for i := 1 to L do bitmap[i] := 0;
  for all x in M do
  begin
    index := p(hash(x));
    if bitmap[index] = 0 then bitmap[index] := 1;
  end

Why?

[Figure: the bitmap, e.g. 1 0 … 1 0, with positions 1 to L]
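The pseudocode on this slide can be turned into a short runnable Python sketch. Several choices here are assumptions and do not come from the slides: L = 32, a truncated SHA-256 standing in for the uniform hash function h, and the convention that p returns L when the hash value is 0.

```python
import hashlib

L = 32  # assumed bitmap length, i.e. U = 2**32


def h(x: str) -> int:
    """Hash x to an L-bit integer (truncated SHA-256, an assumed stand-in)."""
    digest = hashlib.sha256(x.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") & ((1 << L) - 1)


def p(y: int) -> int:
    """1-based position of the lowest set bit of y (L if y == 0, by assumption)."""
    if y == 0:
        return L
    return (y & -y).bit_length()  # y & -y isolates the lowest set bit


def fm_bitmap(items):
    """Build the FM bitmap in one pass; index 0 is unused so positions run 1..L."""
    bitmap = [0] * (L + 1)
    for x in items:
        bitmap[p(h(x))] = 1
    return bitmap


with_duplicates = ["a", "b", "a", "c", "b", "a"]
without_duplicates = ["c", "b", "a"]
print(fm_bitmap(with_duplicates) == fm_bitmap(without_duplicates))
```

Setting a bit that is already 1 changes nothing, so repeated items and a different arrival order leave the bitmap unchanged. This is the property the later slides call duplicate-insensitive and order-insensitive.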

Page 11: Aggregation computation over distributed data streams(the final version)

Flajolet-Martin Sketch(Cont.)

The explanation
  The fact: after execution, bitmap[k] equals 1 iff a pattern of the form 0^{k-1}1 has appeared among the hashed values of the records in M
  The probability: the pattern 0^{k-1}1 occurs with probability 1/2^k
  Occurrence times: so if |M| = n, then bitmap[1] is accessed approximately n/2 times, bitmap[2] approximately n/4 times, and so on
  Extension: bitmap[k] will almost certainly be 0 if k >> log_2(n) and 1 if k << log_2(n), with a fringe of 0s and 1s for k ≈ log_2(n)
  Selection: which position should serve as the estimate? The leftmost 0, the rightmost 1, or something else?

The most practical part is over. What is left is the rather complicated proof and error analysis, which is all mathematics.
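The halving argument can be checked with a small simulation. The sketch below is only illustrative: it uses seeded random 32-bit integers in place of hashed records, and n, L and the seed are arbitrary choices. The estimate uses the leftmost 0 together with the correction constant 0.77351 from the Flajolet-Martin paper, a constant that is not stated on the slides. A single bitmap gives only a rough estimate, typically within a small power of two of n.

```python
import math
import random

PHI = 0.77351  # correction constant from the Flajolet-Martin paper
L = 32         # assumed bitmap length
n = 100_000    # number of simulated distinct records


def p(y: int) -> int:
    """1-based position of the lowest set bit of y (L if y == 0, by assumption)."""
    return (y & -y).bit_length() if y else L


rng = random.Random(2015)  # fixed seed; random bits stand in for a uniform hash
hits = [0] * (L + 1)       # hits[k]: how many times bitmap[k] is accessed
for _ in range(n):
    hits[p(rng.getrandbits(L))] += 1

# Leftmost 0 of the bitmap: the first position that was never accessed.
r = next((k for k in range(1, L + 1) if hits[k] == 0), L + 1)
estimate = 2 ** (r - 1) / PHI

for k in (1, 2, 3):
    print(f"bitmap[{k}] accessed {hits[k]} times, expected about {n / 2 ** k:.0f}")
print("leftmost 0 at position", r, "while log2(n) is about", round(math.log2(n), 1))
print("estimate of n from this single bitmap:", round(estimate))
```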


Page 12: Aggregation computation over distributed data streams(the final version)

Flajolet-Martin Sketch(Cont.)

Questions
  How should the value of U be chosen?
  What is the relationship between U and n?
  How is the error analyzed? log(αn) …

Conclusion and analysis (nice qualities for distributed aggregation)
  Bit-based: reduces the space cost by a constant factor
  Space complexity: O(log(n)), O(log(log(n)))
  Duplicate-insensitive: duplicate-resilient and flexible
  Order-insensitive: stable and robust
  Additivity: two FM sketches can be merged, and the merger is simply the bitwise OR of each pair of corresponding bitmaps
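The additivity property can be illustrated with a small Python sketch. The site names, the flow IDs, the truncated SHA-256 hash and the choice to pack the bitmap into a single integer are all assumptions made for this example. Each site builds its own bitmap locally, and the coordinator combines them with a bitwise OR.

```python
import hashlib

L = 32  # assumed bitmap length


def position(x: str) -> int:
    """p(h(x)): 1-based position of the lowest set bit of the hashed item."""
    digest = hashlib.sha256(x.encode("utf-8")).digest()
    y = int.from_bytes(digest[:8], "big") & ((1 << L) - 1)
    return (y & -y).bit_length() if y else L


def sketch(items) -> int:
    """FM bitmap packed into an int: bit (k - 1) is set iff bitmap[k] = 1."""
    bits = 0
    for x in items:
        bits |= 1 << (position(x) - 1)
    return bits


def merge(a: int, b: int) -> int:
    """Merge two FM sketches with a bitwise OR."""
    return a | b


# Two hypothetical sites whose observations overlap on flow-400 .. flow-599.
site_1 = [f"flow-{i}" for i in range(0, 600)]
site_2 = [f"flow-{i}" for i in range(400, 1000)]

merged = merge(sketch(site_1), sketch(site_2))
central = sketch(site_1 + site_2)  # what one central observer would have built
print(merged == central)
```

Each site only has to send L bits instead of its raw items, and the flows seen at both sites do not distort the merged sketch.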

Page 13: Aggregation computation over distributed data streams(the final version)

Question

Is the CFV in my last report a kind of sketch?
  Yes, I think so

What is the relationship between a sketch and a skyline? Are they the same? (What is a skyline?)
  No, just trust me
  (Callout: I hold the opposite opinion)

Does aggregation computation belong to the research field of data mining?
  No, I suppose it doesn't, and I don't care

Page 14: Aggregation computation over distributed data streams(the final version)

Q&A