Adaptivity in continuous query systems

Adaptivity in continuous query systems

Luis A. Sotomayor & Zhiguo Xu

Professor Carlo ZanioloCS240B - Spring 2003

Sotomayor - Xu 2

Outline Introduction Adapting to the “burstiness” of data

streams by using a smart operator scheduling strategy

Adapting to high volumes of data streamed by multiple data sources through the use of “adaptive filters”

Conclusion

Sotomayor - Xu 3

Introduction Two distinguishing characteristics of

data streams: Volume of data is extremely high Decisions are made in close to real time

Traditional solutions are impractical Data cannot be stored in static databases

for offline querying Importance of data streams is due to

variety of applications

Sotomayor - Xu 4

Applications of data streams Network monitoring Intrusion detection systems Fraud detection Financial monitoring E-commerce Sensor networks

Sotomayor - Xu 5

Research efforts Large number of applications has led to

many efforts seeking to construct full-fledged DSMS

Efforts have concentrated on issues of System architectures Query languages Algorithm efficiency

Issues such as efficient resource allocation, and communication overhead have received less attention

Sotomayor - Xu 6

Importance of adaptivity DSMS deal with multiple long-running continuous

queries Data streams do not usually arrive at a regular rate

Considerable “burstiness” and variation over time Environment conditions in which queries are

executed are frequently different from the conditions for which the query plans were generated

DSMS may face an increasing number of data sources and therefore an increased volume of traffic

The “Chain” operator scheduling strategy

Sotomayor - Xu 8

The classic solution Buffer the backlog of unprocessed

tuples Work through them during periods of

light load Problem:

Heavy load could exceed physical memory (causing page switches)

The memory used for these backlogs has to be minimized

Sotomayor - Xu 9

Finding a better solution Claim: the operator scheduling

strategy can have a significant impact on run-time resource consumption

Use an operator scheduling strategy that will minimize the amount of memory used during query execution I.e. reduce the size of the backlogs

Sotomayor - Xu 10

Chain scheduling A near optimal operator scheduling

strategy Outperforms competing operator

scheduling strategies Strategy concentrates on

Single stream queries involving Selection Projection Foreign-key joins with stored relations

Sliding window queries over multiple streams

Sotomayor - Xu 11

The model Query execution is conceptualized as a data

flow diagram (a directed acyclic graph) Nodes correspond to pipelined operators Edges represent compositions of operators

An edge from A to B indicates the output of operator A is the input to operator B

Another interpretation: an edge represents an input queue that buffers the output from A before it is input to B

Sotomayor - Xu 12

An example Suppose the query is

SELECT Name FROM EmployeeStream WHERE ID = ‘12345’;

Operators are Projection (SELECT …) Selection (WHERE …)

Input stream

Select ProjectOutput stream

Operator path

Sotomayor - Xu 13

Main ideas Operators are thought of as filters

Operate on a set of tuples Produce s tuples in return

s selectivity of an operator If s = 0.2 we can interpret the value in

two ways Out of every 10 tuples, the operator outputs

2 tuples If the input requires 1 unit of memory, the

output will require 0.2 units of memory

Sotomayor - Xu 14

Example Consider an operator path with two

operators O1 and O2 Assume that O1 takes one unit of time

to process a tuple and that its selectivity is 0.2

Assume that O2 takes one unit of time to process 0.2 tuples and that its selectivity is 0

I.e. O2 outputs tuples out of the system

Sotomayor - Xu 15

Example (cont) Now consider two strategies

FIFO A tuple is passed through both operators in

two consecutive time units No other tuples are processed during that

time Greedy strategy

If there is a tuple buffered before O1 then it is operated on using one time unit

Otherwise if there are tuples buffered before O2, 0.2 tuples are processed using 1 time unit

Sotomayor - Xu 16

Example (cont)

Time Greedy scheduling FIFO scheduling

0 1 11 1.2 1.22 1.4 2.03 1.6 2.24 1.8 3.05 2.0 3.26 2.2 4.0

Memory usage

Need to consider the growth or reduction of data as it travels along the operator path

Sotomayor - Xu 17

Progress charts Behavior of data

is captured by progress charts Points represent

an operator The ith operator

takes (ti – ti-1) units of time to process a tuple of size si-1

Result is a tuple of size si

Sotomayor - Xu 18

Progress charts (cont) We can define

selectivity as the drop in tuple size from operator i to operator i+1. In other words

selectivity is equal to si/si-1

selectivity

Sotomayor - Xu 19

The lower envelope Consider some point (s,

t) on the progress chart Imagine there is a line

from this point to every operator point (ti, si) to its right

The operator that corresponds to the line with the steepest slope is called the “steepest descent operator point”

Sotomayor - Xu 20

The lower envelope (cont) By starting at the first

point (t0, s0) and repeatedly calculating the steepest descent operator point we find the lower envelope P’ for a progress chart P

Notice that the slopes of the segments are non-increasing

Sotomayor - Xu 21

The lower envelope (cont) So what is it?

A way to find which segments of the operator path yield the biggest drops in tuple size

It allows us to consider changes in selectivity across groups of operators We call these groups “chains”

Sotomayor - Xu 22

“Chain” scheduling Chain assigns priorities to operators

equaling the slope of the lower envelope segment to which the operator belongs

At any time Out of all the operators with tuples in their

input queues the one with the highest priority is chosen

When there are “ties,” the operator with the oldest tuples is chosen (based on arrival time)

Sotomayor - Xu 23

The Chain strategy along the progress chart Tuples don’t actually

move along lower envelope

They instead move along the operator path

When the Chain strategy moves along the actual progress chart P, the memory requirements are not that much greater than before

Sotomayor - Xu 24

Multiple stream queries Queries that have at least one

tuple-based sliding window join between two streams

Sotomayor - Xu 25

Multiple stream query execution

Query is first broken up into parallel operator paths

R

S

R

S

Shared

Sotomayor - Xu 26

Experimental results Compared the performance of

Chain, FIFO, Greedy, and Round-Robin

2 data sets (network data) Synthetic data set Real data set

Queries used IP addresses and packet sizes in selection and projection predicates

Sotomayor - Xu 27

Experiment: single stream queries (4 operators)

Query: 4 operators Third operator is

very selective In between two

less selective operators

Sotomayor - Xu 28

Experiment results

Sotomayor - Xu 29

Multiple stream experiment Three simultaneous queries

A sliding window join Two single stream queries with

selectivities less than one Results show Chain outperforms other

strategies by a large margin

Sotomayor - Xu 30

Multiple stream experiment results

Sotomayor - Xu 31

Summary Proved that the choice of operator

scheduling strategy has a significant impact on resource consumption

Proved that the Chain scheduling strategy outperforms competing strategies

Future work Latency and starvation issues Consider query plans that change over time Consider the sharing of computation and

memory in query plans

Sotomayor - Xu 32

“Adaptive filters” for continuous queries over distributed data streams

Sotomayor - Xu 33

What’s the problem? Distributed data sources continuously stream

updates to a centralized processor where continuous queries are evaluated

Because of the high volume of data updates, the communication overhead jeopardizes system performance E.g. path latency computed by monitoring

queuing latency at routers: the volume of monitoring traffic from routers may exceed that of normal traffic

Can we reduce the communication overhead to make continuous queries based on multiple data streams feasible and efficient?

Sotomayor - Xu 34

Important observations Exact precision for continuous queries is not

always needed E.g. path latency application: <= 5 ms of accuracy

Approximate answers of sufficient precision can usually be computed from a small fraction of the input stream. E.g. average network traffic volume received by all

hosts within the organization The precision constraint for queries may

change over time. E.g. more precise traffic volume needed in face of

attack

Sotomayor - Xu 35

Overview of Approach Reduce communication overhead at

the cost of query precision. Quantitative precision constraints specified

with the continuous queries Bounded approximate answer [L, H] Precision constraint δ. 0 ≤ H – L ≤ δ

Filters installed at the remote data sources by the stream processor

Filter at data object O’s source: [Lo, Ho] of width Wo centered around most recent numeric update V.

Sotomayor - Xu 36

Naive filtering policy Uniform allocation

E.g a single CQ: AVG(O1, O2, …, On) Precision constraint δ Filters with a bound of width δ

The wider a bound, the more restrictive a filter and consequently the more imprecise the query answers.

Cons Multiple CQs are issued on one object. If the

smallest bound width is chosen for the filter, the higher update stream rate may be wasted on a few CQs.

Data updates rate and magnitudes not counted.

Sotomayor - Xu 37

System structure Data source Filters Stream coordinator Precision manager Bound cache CQ evaluator

Sotomayor - Xu 38

System structure

Sotomayor - Xu 39

Adaptive filter setting algorithm Goal: set bound widths for steam filters adaptively to

reduce communication costs while guaranteeing the precision constraints of CQs AVG queries analyzed only

Q1, Q2, …, Qm with sets S1, S2, …, Sm. Sj is a subset of a set of n data objects O1, O2, …, On

Query result Qj :

Precision constraint: Basic idea:

Implicit bound width shrinking Explicit bound width growing

ji SOnii

j

VS ,1

1

jjSOnii SWji

,1

Sotomayor - Xu 40

Bound shrinking Filtering bound width Wi for object

Oi Maintained both at the central stream

coordinator and at the source filter Wi Wi · (1 – S) for every Γ time units

Γ: adjustment period S: shrink percentage

Sotomayor - Xu 41

Bound growing

Burden score: the degree to which an object is contributing to the overall communication cost due to streamed updates

where Ci is communication cost for Oi, Wi is the current bound width, and

ii

ii WP

CB

ii NP

Burden target: the lowest overall burden required of the objects in the query in order to meet the precision constraint at all times.

Where Ni is the number of updates of Oi received by the stream coordinator in the last Γ time units

Sotomayor - Xu 42

Bound growing (Cont) Burden deviation: the

degree to which an object is “over-burdened” with respect to the burden targets of the queries that access it.

Queried objects are considered in order of decreasing deviation, and it is assigned the maximum possible bound growth when it is considered.

ji SOmjjii TBD

,1

0,max

jkji SOnk

kjjSOmj

i WSW,1

,1min

Sotomayor - Xu 43

Bound growing (Summary) Each object is assigned a burden score Each query is assigned a burden target by either

averaging burden scores or invoking an iterative linear solver

Each object is assigned a deviation value based on the difference between its burden score and the burden targets of the queries that access it

The objects are considered in order of decreasing deviation, and each object is assigned the maximum possible bound growth when it is considered

Sotomayor - Xu 44

Burden Target Computation Single AVG query Qk over every object O1, …, On.

B1 = B2 = … = Bn = Tk

Or

Intuitive explanation behind this formula Objects having higher than average burden scores will

be given a higher priority for bound width growth to lower their burden scores;

Objects having lower than average burden scores will shrink by default, thereby raising their burden scores.

ki SOnii

kk B

ST

,1

1

Sotomayor - Xu 45

Burden Target Computation (Cont) Multiple queries over different set of objects

θi,j : the portion of object Oi’s burden score corresponding to query Qj and

Goal for adjusting burden scores in presence of overlapping queries is to have the burden score Bi of each object Oi equal the sum of the burden targets of the queries over Oi.

Burden target:

iSOmi ji Bji

,1 ,

ji kiSOni SOjkmkki

j

j TBS

T,1 ,,1

1

Sotomayor - Xu 46

Validation against optimized strategy The adaptive bound width setting algorithm converges on

bounds that are on par with those selected by an optimizer.

Sotomayor - Xu 47

Implementation and experimental validation Single query

Sotomayor - Xu 48

Implementation and experimental validation Multiple queries

Sotomayor - Xu 49

Summary Trade the precision of query results for

lower communication costs. The specification of precision for continuous

queries Adaptive filters

Future work How imprecision propagates through more

complex query plans Develop appropriate optimization

techniques for adapting remote filter predicates in more complex environments

Sotomayor - Xu 50

Conclusion The problem

DSMS must consider the high volume as well as the “burstiness” of data streams

Effectiveness of systems depends on being able to gracefully adapt to environmental conditions (I.e. resource availability)

Two different approaches for adaptivity Minimizing the amount of memory at all

times Controlling the amount of data sent from

multiple data sources

Sotomayor - Xu 51

Conclusion (cont) Chain operator scheduling minimizes the

amount of memory used during execution making the system more adaptable to variation in arrival rates

Adaptive filters reduce the volume of data so that a system can perform efficiently while providing a certain level of precision

Overall, the need for adaptivity in DSMS is necessary due to the unpredictability of data streams

Sotomayor - Xu 52

References J. M. Hellerstein et al. Adaptive Query

Processing: Technology in Evolution. IEEE 2000 B. Babcock, S. Babu, M. Datar, R. Motwani, and

J. Widom. Models and Issues in Data Stream Systems. ACM SIGMOD/PODS 2002 Conference.

B. Babcock, S. Babu, M. Datar, R. Motwani. Chain: Operator Scheduling for Memory Minimization in Data Stream Systems. SIGMOD 2003

Chris Olston, Jing Jiang, Jennifer Widom. Adaptive Filters for Continuous Queries Over Distributed Data Streams. SIGMOD 2003.

Adaptivity in continuous query systems

Documents