DAX: Dynamically Adaptive Distributed System for Processing CompleX Continuous Queries

Bin Liu, Yali Zhu, Mariana Jbantova, Brad Momberger, and Elke A. Rundensteiner

Department of Computer Science, Worcester Polytechnic Institute

100 Institute Road, Worcester, MA 01609

Tel: 1-508-831-5857, Fax: 1-508-831-5776

{binliu, yaliz, jbantova, bmombe, rundenst}@cs.wpi.edu

VLDB’05 Demonstration

http://davis.wpi.edu/dsrg/CAPE/index.html

Uncertainties in Stream Query Processing

[Figure: users register continuous queries with a distributed stream query engine and receive answers; the engine consumes streaming data and produces streaming results.]

• Streaming data: may have time-varying rates and high volumes
• Streaming results: real-time and accurate responses required
• High workload of queries
• Memory and CPU resource limitations
• Available resources for executing each operator may vary over time.
• Distribution and adaptations are required.

Adaptation in Distributed Stream Processing

• Adaptation Techniques:
– Spilling data to disk
– Relocating work to other machines
– Reoptimizing and migrating query plan

• Granularity of Adaptation:
– Operator-level distribution and adaptation
– Partition-level distribution and adaptation

• Integrated Methodologies:
– Consider trade-offs between spill vs redistribute
– Consider trade-offs between migrate vs redistribute

System Overview [LZ+05, TLJ+05]

[Figure: system architecture. Components shown: Distribution Manager, Runtime Monitor, Global Adaptation Controller, Global Plan Migrator, Query Plan Manager, Connection Manager, Repository; CAPE Continuous Query Processing Engine with Data Receiver, Query Processor, Data Distributor, Local Statistics Gatherer, Local Adaptation Controller, Local Plan Migrator, Repository; Stream Generator, Application Server, End User; streaming data travels over the network.]

Motivating Example

[Figure: streams such as stock prices and volumes, reviews, external reports, and news feed a Real-Time Data Integration Server, which serves a Decision Support System and decision-making applications.]

• Scalable Real-Time Data Processing Systems
– To produce as many results as possible at run time (i.e., 9:00am-4:00pm)
   • Main-memory-based processing
– To require complete query results (i.e., for offline analysis after 4:00pm or whenever possible)
   • Load shedding not acceptable; must temporarily spill to disk

Complex queries such as multi-joins are common!

Analyze the relationship among stock prices, reports, and news?

An equi-join of stock prices, reports, and news on stock symbols

[Figure: example operator placements on machines M1, M2, and M3 (legend) under Random Distribution and Balanced Network Aware Distribution.]

Goal: To minimize network connectivity.

Algorithm: Takes each query plan and creates sub-plans where neighbouring operators are grouped together.

Goal: To equalize workload per machine.

Algorithm: Iteratively takes each query operator and places it on the query processor with the least number of operators.
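The workload-equalizing algorithm above can be sketched in a few lines of Python. This is a minimal illustration, not the CAPE implementation: the function name `balanced_distribution` and the tie-breaking rule (first machine in the list wins) are assumptions made for the example.

```python
def balanced_distribution(operators, machines):
    """Place each operator on the machine that currently holds the fewest operators."""
    load = {m: 0 for m in machines}   # number of operators per machine
    table = {}                        # distribution table: operator -> machine
    for op in operators:
        # Assumption: ties go to the machine listed first.
        target = min(machines, key=lambda m: load[m])
        table[op] = target
        load[target] += 1
    return table


operators = [f"Operator {i}" for i in range(1, 9)]
table = balanced_distribution(operators, ["M1", "M2"])
per_machine = {m: sum(1 for t in table.values() if t == m) for m in ("M1", "M2")}
```

With eight operators and two machines, each machine ends up with four operators, although the concrete assignment depends on the tie-breaking rule.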

Initial Distribution Policies


Distribution Table

Operator     Machine
Operator 1   M1
Operator 2   M1
Operator 3   M2
Operator 4   M2
Operator 5   M1
Operator 6   M1
Operator 7   M2
Operator 8   M2

Initial Distribution Process

[Figure: Step 1 and Step 2. A query plan with operators 1-8 between the stream source and the application is assigned to machines M1 and M2.]

Step 1: Create distribution table using initial distribution algorithm.

Step 2: Send distribution information to processing machines (nodes).

Operator-level Adaptation - Redistribution

Statistics Table (machine capacity M: 4500 tuples)

Machine   Tuples in memory
M1        2000
M2        4100

Distribution Table

Operator (OP)   Machine (M)
Operator 1      M1
Operator 2      M1
Operator 3      M2
Operator 4      M2
Operator 5      M1
Operator 6      M1
Operator 7      M2
Operator 8      M2

Cost Table (current)

Machine   Cost   Operator Cost
M1        .44    Op 1: .25, Op 2: .25, Op 5: .25, Op 6: .25
M2        .91    Op 3: .3, Op 4: .2, Op 7: .3, Op 8: .2

Cost Table (desired, Balance policy)

Machine   Cost   Operator Cost
M1        .71    Op 1: .15, Op 2: .15, Op 5: .15, Op 6: .15, Op 7: .4
M2        .64    Op 3: .4, Op 4: .3, Op 8: .3

CAPE’s cost models: number of tuples in memory and network output rate. Operators are redistributed based on the redistribution policy. Redistribution policies of CAPE: Balance and Degradation.

Cost per machine is determined as the percentage of memory filled with tuples.
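The cost figures in this example can be reproduced with a short Python sketch. It assumes the per-operator tuple counts implied by the tables (for instance Op 3 holds 0.3 × 4100 = 1230 tuples) and models the Balance policy as one greedy step that moves the single operator that most reduces the cost gap between the most and least loaded machines. That greedy rule and the names `machine_costs` and `balance_step` are illustrative assumptions, not CAPE's actual policy code.

```python
CAPACITY = 4500  # tuples per machine, from the statistics table


def machine_costs(op_tuples, placement, capacity=CAPACITY):
    """Cost per machine = fraction of memory capacity filled with tuples."""
    costs = {}
    for op, machine in placement.items():
        costs[machine] = costs.get(machine, 0.0) + op_tuples[op] / capacity
    return costs


def balance_step(op_tuples, placement, capacity=CAPACITY):
    """Move one operator from the most to the least loaded machine (assumed greedy rule)."""
    costs = machine_costs(op_tuples, placement, capacity)
    src = max(costs, key=costs.get)
    dst = min(costs, key=costs.get)
    best_op, best_gap = None, costs[src] - costs[dst]
    for op, machine in placement.items():
        if machine != src:
            continue
        delta = op_tuples[op] / capacity
        gap = abs((costs[src] - delta) - (costs[dst] + delta))
        if gap < best_gap:  # ties keep the first candidate found
            best_op, best_gap = op, gap
    new_placement = dict(placement)
    if best_op is not None:
        new_placement[best_op] = dst
    return new_placement


op_tuples = {1: 500, 2: 500, 5: 500, 6: 500, 3: 1230, 4: 820, 7: 1230, 8: 820}
placement = {1: "M1", 2: "M1", 5: "M1", 6: "M1", 3: "M2", 4: "M2", 7: "M2", 8: "M2"}

before = machine_costs(op_tuples, placement)
after = machine_costs(op_tuples, balance_step(op_tuples, placement))
```

Operators 3 and 7 hold the same number of tuples, so either move gives the same machine costs. The sketch happens to pick operator 3, while the slide moves operator 7. The resulting costs are 3230/4500 and 2870/4500, close to the .71 and .64 of the desired cost table.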

Redistribution Protocol: Moving Operators Across Machines

Experimental Results of Distribution and Redistribution Algorithms

Query Plan Performance with Query Plan of 40 Operators.

Observations: Initial distribution is important for query plan performance. Redistribution improves query plan performance at run time.

[Figure: throughput over time (minutes); series: Random Distribution, Balanced Network Aware Distribution, RD + Redist, BNA + Redist.]

Operator-level Adaptation: Dynamic Plan Migration

• The last step of plan re-optimization: after the optimizer generates a new query plan, how do we replace the currently running plan with the new plan on the fly?
• A new challenge in streaming systems because of stateful operators.
• A unique feature of the DAX system.
• But can we just take out the old plan and plug in the new plan?

• Steps
(1) Pause execution of old plan
(2) Drain out all tuples inside old plan
(3) Replace old plan by new plan
(4) Resume execution of new plan

[Figure: joins AB and BC over inputs A, B, C, annotated with (2) all tuples drained, (3) old replaced by new, (4) processing resumed.]

Deadlock Waiting Problem:

Key Observation: Purge of tuples in states relies on processing of new tuples. With execution paused in step (1), no new tuples are processed, so the states of the old plan are never purged and step (2) cannot complete.

Migration Strategy - Moving State

Migration Requirements: No missing results and no duplicates.

Two migration boxes: one contains the old sub-plan, one contains the new sub-plan.

The two sub-plans are semantically equivalent, with the same input and output queues.

Migration is abstracted as replacing the old box by the new box.

• Basic idea - Share common states between two migration boxes

• Key Steps
– Drain tuples in old box
– State matching: each state in the old box has a unique ID. During rewriting, a new ID is given to each newly generated state in the new box. When rewriting is done, match states based on IDs.
– State moving between matched states

• What’s left?
– Unmatched states in new box
– Unmatched states in old box

[Figure: old box and new box for query QABCD over input queues QA, QB, QC, QD. Old box: join AB (states SA, SB), join BC (states SAB, SC), join CD (states SABC, SD). New box: join BC (states SB, SC), join CD (states SBC, SD), join AB (states SA, SBCD).]
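State matching by ID can be illustrated with a small Python sketch. It uses the state names of the QABCD example (old box: SA, SB, SAB, SC, SABC, SD; new box: SB, SC, SBC, SD, SA, SBCD). Representing a box as a dictionary from state ID to contents, and the function name `match_states`, are assumptions made for illustration only.

```python
def match_states(old_states, new_states):
    """Match states of the old and new box by ID.

    Returns (moved, unmatched_new, unmatched_old):
      moved         - contents of matched states, taken over from the old box
      unmatched_new - IDs in the new box that still have to be filled
      unmatched_old - IDs in the old box with no counterpart in the new box
    """
    old_ids, new_ids = set(old_states), set(new_states)
    matched = old_ids & new_ids
    moved = {sid: old_states[sid] for sid in matched}
    return moved, new_ids - matched, old_ids - matched


# Hypothetical state contents; only the IDs matter for matching.
old_box = {"SA": ["a1"], "SB": ["b1"], "SAB": ["a1b1"],
           "SC": ["c1"], "SABC": ["a1b1c1"], "SD": ["d1"]}
new_box = {sid: [] for sid in ("SB", "SC", "SBC", "SD", "SA", "SBCD")}

moved, unmatched_new, unmatched_old = match_states(old_box, new_box)
```

In this example SA, SB, SC, and SD are moved, SBC and SBCD remain unmatched in the new box, and SAB and SABC remain unmatched in the old box.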

[Figure: worked example with window W = 2 for joins AB and BC over queues QA, QB, QC and states SA, SB, SAB, SC, using tuples a1-a2, b1-b3, c1-c3 on timeline t; the new box with joins BC, CD, AB; and a table of old/new sub-tuple combinations for A, B, C, ranging from (Old, Old, Old) to (New, New, New).]

Moving State: Unmatched States

Unmatched New States (Recomputation): Recursively recompute unmatched states from the bottom up.

Unmatched Old States (Execution Synchronization): First clean the accumulated tuples in the box input queues; it is then safe to discard these unmatched states.
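Bottom-up recomputation of unmatched new states can be sketched as a recursive function. The sketch assumes a simplified model: every state is a list of `(key, payload)` pairs, all joins are equi-joins on one shared key, and window constraints are ignored. The names `join`, `recompute`, and the `inputs` mapping are hypothetical.

```python
def join(left, right):
    """Equi-join two states on the shared key (simplifying assumption)."""
    return [(k, lp + rp) for k, lp in left for k2, rp in right if k == k2]


def recompute(state_id, states, inputs):
    """Recursively fill an unmatched state from the states below it."""
    if state_id in states:            # matched (moved) or already recomputed
        return states[state_id]
    left_id, right_id = inputs[state_id]
    states[state_id] = join(recompute(left_id, states, inputs),
                            recompute(right_id, states, inputs))
    return states[state_id]


# States moved from the old box (hypothetical contents).
states = {
    "SB": [(1, ("b1",)), (2, ("b2",))],
    "SC": [(1, ("c1",))],
    "SD": [(1, ("d1",))],
}
# Which two states feed each unmatched state in the new box.
inputs = {"SBC": ("SB", "SC"), "SBCD": ("SBC", "SD")}

sbcd = recompute("SBCD", states, inputs)
```

Asking for SBCD first forces SBC to be recomputed from SB and SC, which mirrors the bottom-up order described above.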

Distributed Dynamic Migration Protocols (I)

Migration Stage: Execution Synchronization

Migration Start
(1) Request SyncTime
(2) Local SyncTime
(3) Global SyncTime
(4) Execution Synced

Distribution Table

Operator   Machine
OP 1       M1
OP 2       M2
OP 3       M1
OP 4       M2

[Figure: the Distribution Manager exchanges the messages above with machines M1 and M2, each running its share of the plan op1-op4.]

Distributed Dynamic Migration Protocols (II)

Migration Stage: Change Plan Shape

(5) Send New SubQueryPlan
(6) PlanChanged

[Figure: the Distribution Manager sends the new sub-query plans to M1 and M2, which restructure their parts of the plan op1-op4.]

Distributed Dynamic Migration Protocols (III)

Migration Stage: Fill States and Reactivate Operators

(7) Fill States [2, 4]; Fill States [3, 5]
(7.1) Request state [4]
(7.2) Move state [4]
(7.3) Request state [2]
(7.4) Move state [2]
(8) States Filled
(9) Reconnect Operators
(10) Operator Reconnected
(11) Activate [op1]; Activate [op2]

[Figure: the Distribution Manager coordinates M1 and M2 while states are requested and moved between the machines, operators are reconnected, and op1 and op2 are reactivated.]
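The message order of the three protocol stages can be summarized as a trace. The sketch below only encodes the numbered steps from the slides. The direction of each message (odd steps from the Distribution Manager to the machines, even steps back) is an assumption, and the machine-to-machine state requests of steps (7.1) to (7.4) are left out because the slides do not say which machine issues them.

```python
def migration_trace(machines, manager="DistributionManager"):
    """Return the assumed message sequence as (step, sender, receiver, message)."""
    trace = []

    def to_all(step, message):
        trace.extend((step, manager, m, message) for m in machines)

    def from_all(step, message):
        trace.extend((step, m, manager, message) for m in machines)

    # Stage 1: execution synchronization
    to_all(1, "Request SyncTime")
    from_all(2, "Local SyncTime")
    to_all(3, "Global SyncTime")
    from_all(4, "Execution Synced")
    # Stage 2: change plan shape
    to_all(5, "Send New SubQueryPlan")
    from_all(6, "PlanChanged")
    # Stage 3: fill states and reactivate operators
    to_all(7, "Fill States")
    from_all(8, "States Filled")
    to_all(9, "Reconnect Operators")
    from_all(10, "Operator Reconnected")
    to_all(11, "Activate")
    return trace


trace = migration_trace(["M1", "M2"])
steps = [step for step, _, _, _ in trace]
```

Each stage finishes only after every machine has acknowledged, which is what keeps the machines in step while the plan shape changes.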

From Operator-level to Partition-level

• Problem of operator-level adaptation:
– Operators have large states.
– Moving them across machines can be expensive.

• Solution as partition-level adaptation:
– Partition state-intensive operators [Gra90,SH03,LR05]
– Distribute the partitioned plan onto multiple machines

[Figure: inputs A, B, C pass through split operators SplitA, SplitB, SplitC; each of the machines m1 to m4 runs Union operators feeding a Join.]

Partitioned Symmetric M-way Join

• Example Query: A.A1 = B.B1 = C.C1
– Join is processed on two machines (m1 and m2), each running a 3-way join
– A ⋈ B ⋈ C = (PA1 ⋈ PB1 ⋈ PC1) ∪ (PA2 ⋈ PB2 ⋈ PC2)

Split functions:

A1%2=0 -> m1    A1%2=1 -> m2
B1%2=0 -> m1    B1%2=1 -> m2
C1%2=0 -> m1    C1%2=1 -> m2

Join column values of the example inputs (the second columns A2, B2, C2 are elided as "..." on the slide):

Input   Join column   Values
A       A1            3, 4, 1, 2, 3, 1
B       B1            1, 2, 2, 3, 3, 4
C       C1            3, 2, 1, 4, 1, 1

Partitions of m1

Partition   Join column   Values
PA1         A1            4, 2
PB1         B1            2, 2, 4
PC1         C1            4, 2

Partitions of m2

Partition   Join column   Values
PA2         A1            1, 3, 3, 1
PB2         B1            1, 3, 3
PC2         C1            1, 1, 3, 1
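The example can be checked with a short Python sketch that applies the modulo split functions and compares the union of the per-machine joins with a centralized 3-way join. The tuple layout `(join_key, label)` and the helper names are assumptions made for illustration.

```python
def split(tuples, machines=("m1", "m2")):
    """Route each tuple by join key: key % 2 == 0 -> m1, key % 2 == 1 -> m2."""
    parts = {m: [] for m in machines}
    for t in tuples:
        parts[machines[t[0] % len(machines)]].append(t)
    return parts


def three_way_join(a, b, c):
    """Equi-join on the first field of each tuple."""
    return [(x, y, z) for x in a for y in b if x[0] == y[0]
            for z in c if z[0] == x[0]]


def make(prefix, keys):
    """Build (key, label) tuples; labels keep equal keys distinguishable."""
    return [(k, f"{prefix}{i}") for i, k in enumerate(keys)]


A = make("a", [3, 4, 1, 2, 3, 1])
B = make("b", [1, 2, 2, 3, 3, 4])
C = make("c", [3, 2, 1, 4, 1, 1])

parts_a, parts_b, parts_c = split(A), split(B), split(C)
partitioned = [row for m in ("m1", "m2")
               for row in three_way_join(parts_a[m], parts_b[m], parts_c[m])]
central = three_way_join(A, B, C)
```

Because all three inputs are split with the same function on the join column, matching tuples always land on the same machine, so the partitioned result equals the centralized one.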

Partition-level Adaptations

• 1: State Relocation: Uneven workload among machines!
– States relocated are active on another machine
– Overheads in monitoring and moving states across machines

[Figure: inputs A, B, C with SplitA, SplitB, SplitC distributing partitions to machines m1 and m2.]

• 2: State Spill: Memory overflow problem still exists!
– Push operator states temporarily to disk
– Spilled operator states are temporarily inactive
– New incoming tuples probe only against partial states

[Figure: states of the join over A, B, C are pushed to secondary storage.]

Approaches: Lazy- vs. Active-Disk

• Lazy-Disk Approach
– Independent spill and relocation decisions
   • Distribution Manager: trigger state relocation if Mr < r and t > r
   • Query Processor: start state spill if Memu / Memall > s

[Figure: Query Processors 1 to n, each with a Local Adaptation Controller and a disk, report memory usage to the Distribution Manager; state relocation moves states between query processors, state spill moves states from a query processor to its disk.]

• Active-Disk Approach
– Partitions on different machines may have different productivity
   • e.g., the most productive partitions on machine 1 may be less productive than the least productive ones on other machines
– Proposed technique: perform state spill globally

[Figure: as above, but the query processors report memory usage and average productivity, and the Distribution Manager can force a state spill in addition to state relocation.]

Performance Results of Lazy-Disk & Active-Disk Approaches

• Lazy-Disk vs. No-Relocation in a memory-constrained environment

[Figure: throughput over 60 minutes; series: No-Relocation, Lazy-Disk.]

Settings: three machines, M1 (50%), M2 (25%), M3 (25%); input rate: 30ms; tuple range: 30K; inc. join ratio: 2; state spill memory threshold: 100M; state relocation: > 30M, mem. threshold 80%, minspan 45s.

• Lazy-Disk vs. Active-Disk

[Figure: throughput over 60 minutes; series: Lazy-Disk, Active-Disk.]

Settings: three machines; input rate: 30ms; tuple range: 15K, 45K; state spill memory threshold: 80M; avg. inc. join ratio: M1 (4), M2 (1), M3 (1); maximal Force-Disk memory: 100M, ratio > 2; state relocation: > 30M, mem. threshold 80%, minspan 45s.

Plan-Wide State Spill: Local Methods

• Local Output
– Direct extension of the single-operator solution
– Update operator productivity values (Poutput, Psize) individually
– Spill partitions with smaller Poutput/Psize values among all operators

[Figure: plan with inputs A, B, C, D, E and joins Join1, Join2, Join3; each operator keeps Poutput and Psize per partition over time points t1, t2, t3.]

• Bottom-Up Pushing
– Push states from bottom operators first
– Randomly or using local productivity value for partition selection
– Fewer intermediate results (states) stored -> reduced number of state spills
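The Local Output selection rule can be sketched as follows. The sketch ranks partitions from all operators by Poutput/Psize and spills the least productive ones until a target fraction of the total state size is reached. The 30% default mirrors the experiment setting reported for plan-wide spill; the data layout and the function name `select_partitions_to_spill` are illustrative assumptions.

```python
def select_partitions_to_spill(partitions, fraction=0.3):
    """Pick the least productive partitions until `fraction` of all state is covered.

    Each partition is a dict with keys: op, pid, p_output, p_size (p_size > 0).
    """
    total_size = sum(p["p_size"] for p in partitions)
    target = fraction * total_size
    ranked = sorted(partitions, key=lambda p: p["p_output"] / p["p_size"])
    chosen, spilled = [], 0
    for p in ranked:
        if spilled >= target:
            break
        chosen.append(p)
        spilled += p["p_size"]
    return chosen


# Hypothetical statistics for four partitions across three join operators.
partitions = [
    {"op": "Join1", "pid": 0, "p_output": 20, "p_size": 10},
    {"op": "Join1", "pid": 1, "p_output": 5, "p_size": 10},
    {"op": "Join2", "pid": 0, "p_output": 30, "p_size": 10},
    {"op": "Join3", "pid": 0, "p_output": 1, "p_size": 10},
]
chosen = select_partitions_to_spill(partitions)
chosen_ids = [(p["op"], p["pid"]) for p in chosen]
```

Because the ranking only looks at each operator's own output, a partition can be spilled even when its results feed a highly productive partition further up the plan.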

Plan-Wide State Spill: Global Outputs

• Poutput: Contribution to Final Query Output
– Update Poutput values of partitions in Join3
– Apply Split2 to each tuple to find the corresponding partition in Join2, and update its Poutput value
– And so on …

[Figure: plan with inputs A, B, C, D, E, joins Join1, Join2, Join3, and split operators Split1, Split2, SplitD, SplitE.]

A lineage tracing algorithm to update Poutput statistics

[Figure: operators OP1 to OP4 with partitions p11, p12, ..., p2j, p3j, p4j; output counts are traced back through the plan.]

• Consider Intermediate Result Size
– p11: Psize = 10, Poutput = 20
– p12: Psize = 10, Poutput = 20
– Intermediate Result Factor Pinter
– Poutput/(Psize + Pinter)

• Apply the same lineage tracing algorithm for intermediate results
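A rough Python sketch of the lineage tracing idea follows. For every final output tuple, each join's split function is applied to find the partition that contributed to it, and that partition's Poutput counter is incremented. The split functions (modulo 2 on hypothetical attributes `a`, `d`, `e`), the Pinter values, and the function names are assumptions. Only the ratio Poutput/(Psize + Pinter) comes from the slide.

```python
def trace_output(output_tuple, operators, p_output):
    """Credit one final output tuple to the contributing partition of each join."""
    for name, split in operators:
        pid = split(output_tuple)
        p_output[name][pid] = p_output[name].get(pid, 0) + 1


def productivity(p_out, p_size, p_inter=0):
    """Global productivity with the intermediate result factor."""
    return p_out / (p_size + p_inter)


# Hypothetical split functions: each join partitions on one attribute, modulo 2.
operators = [
    ("Join3", lambda t: t["e"] % 2),
    ("Join2", lambda t: t["d"] % 2),
    ("Join1", lambda t: t["a"] % 2),
]
p_output = {name: {} for name, _ in operators}

for out in [{"a": 1, "d": 2, "e": 3}, {"a": 1, "d": 4, "e": 6}]:
    trace_output(out, operators, p_output)

# Two partitions with equal Psize and Poutput, as on the slide; the
# intermediate result sizes (0 and 30) are made-up values for illustration.
score_p11 = productivity(20, 10, p_inter=0)
score_p12 = productivity(20, 10, p_inter=30)
```

With identical Psize and Poutput, the partition that carries more intermediate results gets the lower score and becomes the better spill candidate.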

Experiment Results for Plan-Wide Spill

[Figure: two plots of throughput over time (minutes 1 to 49), each comparing Global Output with Penalty, Global Output, Local Output, and Bottom-Up.]

Query with average join rate: Join1: 1, Join2: 3, Join3: 3

Query with average join rate: Join1: 3, Join2: 2, Join3: 3

Settings: 300 partitions; memory threshold: 60MB; push 30% of states in each state spill; average tuple inter-arrival time 50ms from each input.

Backup Slides

Plan Shape Restructuring and Distributed Stream Processing

• New slides for Yali’s migration + distribution ideas

Migration Strategies – Parallel Track

Basic Idea: Execute both plans in parallel until the old box is “expired”, after which the old box is disconnected and the migration is over.

Potential Duplicate: Both boxes generate all-new tuples.

Pros: Migrates in a gradual fashion. Still produces output even during migration.

Cons: Still relies on executing the old box to process tuples during the migration stage.

[Figure: old box (joins AB, BC, CD with states SA, SB, SAB, SC, SABC, SD) and new box (joins BC, CD, AB with states SB, SC, SBC, SD, SA, SBCD) running in parallel on input queues QA, QB, QC, QD and output queue QABCD.]

At root op in old box:

If both to-be-joined tuples have all-new sub-tuples,

don’t join.

Other op in old box:

Proceed as normal
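The duplicate-avoidance rule at the root operator of the old box can be sketched in Python. The sketch assumes that a sub-tuple counts as "new" when it arrived at or after the migration start time, and that each tuple carries the arrival times of its base sub-tuples. Both points are modeling assumptions, as are the names `StreamTuple` and `old_root_should_join`.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class StreamTuple:
    # Arrival time of every base sub-tuple that makes up this (joined) tuple.
    arrival_times: Tuple[float, ...]


def is_all_new(t, migration_start):
    """True if every sub-tuple arrived at or after the migration start."""
    return all(ts >= migration_start for ts in t.arrival_times)


def old_root_should_join(left, right, migration_start):
    """Root of the old box skips pairs made only of new sub-tuples;
    the new box produces those results itself."""
    return not (is_all_new(left, migration_start)
                and is_all_new(right, migration_start))


T_M_START = 100.0
abc_mixed = StreamTuple((90.0, 95.0, 101.0))   # contains old sub-tuples
abc_new = StreamTuple((101.0, 102.0, 103.0))   # all sub-tuples are new
d_new = StreamTuple((105.0,))

keep_mixed = old_root_should_join(abc_mixed, d_new, T_M_START)
keep_new = old_root_should_join(abc_new, d_new, T_M_START)
```

Only the root operator applies the check. The other operators in the old box proceed as normal, so intermediate results containing old sub-tuples are still produced there.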

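Cost Estimations for PT:

$T_{PT} \approx 2W$ given enough system resources

Cost Estimations for MS:

$$
\begin{aligned}
T_{MS} &= T_{match} + T_{move} + T_{recompute} \\
       &\approx T_{recompute}(S_{BC}) + T_{recompute}(S_{BCD}) \\
       &= \lambda_B \lambda_C W^2 (T_j + T_s \sigma_{BC}) + 2 \lambda_B \lambda_C \lambda_D W^3 (T_j \sigma_{BC} + T_s \sigma_{BC} \sigma_{BCD})
\end{aligned}
$$

[Figure: timeline T from T_M-start to T_M-end covering a 1st W and a 2nd W, with old/new tuple combinations; old box with joins AB, BC, CD over queues QA, QB, QC, QD and states SA, SB, SAB, SC, SABC, SD; new box with joins BC, CD, AB and states SB, SC, SBC, SD, SBCD.]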

Experimental Results for Plan Migration

[Figure: migration duration (ms) vs. global window size W (ms); series: Measured T_PT, Estimated T_PT.]

[Figure: migration duration (ms) vs. global window size W (ms); series: Measured T_MS, Poly. (Measured T_MS).]

[Figure: migration duration vs. window size (ms); series: T_MS, T_PT.]

Observations:

• Results confirm the prior cost analysis.

• Duration of moving state is affected by window size and arrival rates.

• Duration of parallel track is 2W given enough system resources; otherwise it is affected by system parameters such as window size and arrival rates.

Related Work on Distributed Continuous Query Processing

[1] Medusa: M. Balazinska, H. Balakrishnan, and M. Stonebraker. Contract-based load management in federated distributed systems. In 1st NSDI, March 2004.

[2] Aurora*: M. Cherniack, H. Balakrishnan, M. Balazinska, et al. Scalable distributed stream processing. In CIDR, 2003.

[3] Borealis: T. B. Team. The design of the Borealis Stream Processing Engine. Technical Report, Brown University, CS Department, August 2004

[4] Flux: M. Shah, J. Hellerstein, S. Chandrasekaran, and M. Franklin. Flux: An adaptive partitioning operator for continuous query systems. In ICDE, pages 25-36, 2003

[5] Distributed Eddies: F. Tian, and D. DeWitt. Tuple routing strategies for distributed Eddies. In VLDB Proceedings, Berlin, Germany, 2003

Related Work on Partitioned Processing

• Non state-intensive queries [BB+02,AC+03,GT03]
– State-intensive operators (run-time memory shortage)

• Operator-level adaptation [CB+03,SLJ+05,XZH05]
– Fine-grained state-level adaptation (adapt partial states)

• Load shedding [TUZC03]
– Require complete query result (no load shedding)
– Drop input tuples to handle resource shortage

• XJoin [UF00] and Hash-Merge Join [MLA04]
– Integrate both spill and relocation in distributed environments
– Investigate dependency problem for multiple operators

• Flux [SH03]
– Multi-input operators
– Integrate both state spill and state relocation
– Adapt states of one single input operator across machines

• Hash-Merge Join [MLA04], XJoin [UF00]
– Only spill states for one single operator in central environments

CAPE Publications and Reports

[RDZ04] E. A. Rundensteiner, L. Ding, Y. Zhu, T. Sutherland and B. Pielech, "CAPE: A Constraint-Aware Adaptive Stream Processing Engine". Invited Book Chapter. http://www.cs.uno.edu/~nauman/streamBook/. July 2004.

[ZRH04] Y. Zhu, E. A. Rundensteiner and G. T. Heineman, "Dynamic Plan Migration for Continuous Queries Over Data Streams". SIGMOD 2004, pages 431-442.

[DMR+04] L. Ding, N. Mehta, E. A. Rundensteiner and G. T. Heineman, "Joining Punctuated Streams". EDBT 2004, pages 587-604.

[DR04] L. Ding and E. A. Rundensteiner, "Evaluating Window Joins over Punctuated Streams". CIKM 2004, to appear.

[DRH03] L. Ding, E. A. Rundensteiner and G. T. Heineman, "MJoin: A Metadata-Aware Stream Join Operator". DEBS 2003.

[RDSZBM04] E. A. Rundensteiner, L. Ding, T. Sutherland, Y. Zhu, B. Pielech and N. Mehta, "CAPE: Continuous Query Engine with Heterogeneous-Grained Adaptivity". Demonstration Paper. VLDB 2004.

[SR04] T. Sutherland and E. A. Rundensteiner, "D-CAPE: A Self-Tuning Continuous Query Plan Distribution Architecture". Tech Report, WPI-CS-TR-04-18, 2004.

[SPR04] T. Sutherland, B. Pielech, Yali Zhu, Luping Ding, and E. A. Rundensteiner, "Adaptive Multi-Objective Scheduling Selection Framework for Continuous Query Processing". IDEAS 2005.

[SLJR05] T. Sutherland, B. Liu, M. Jbantova, and E. A. Rundensteiner, "D-CAPE: Distributed and Self-Tuned Continuous Query Processing". CIKM, Bremen, Germany, Nov. 2005.

[LR05] Bin Liu and E. A. Rundensteiner, "Revisiting Pipelined Parallelism in Multi-Join Query Processing". VLDB 2005.

[B05] Bin Liu and E. A. Rundensteiner, "Partition-based Adaptation Strategies Integrating Spill and Relocation". Tech Report, WPI-CS-TR-05, 2005. (in submission)

CAPE Project: http://davis.wpi.edu/dsrg/CAPE/index.html



CAPE Engine

Constraint-aware Adaptive Continuous Query Processing Engine

• Exploit semantic constraints such as sliding windows and punctuations to reduce resource usage and improve response time.

• Incorporate heterogeneous-grained adaptivity at all query processing levels.
– Adaptive query operator execution
– Adaptive query plan re-optimization
– Adaptive operator scheduling
– Adaptive query plan distribution

• Process queries in a real-time manner by employing well-coordinated heterogeneous-grained adaptations.

Analyzing Adaptation Performance

• Questions Addressed:

– Partitioned Parallel Processing
   • Resolves memory shortage
   • Should we partition non-memory-intensive queries?
   • How effective is partitioning memory-intensive queries?

– State Spill
   • Known problem: slows down run-time throughput
   • How many states to push?
   • Which states to push?
   • How to combine memory/disk states to produce complete results?

– State Relocation
   • Known asset: low overhead
   • When (how often) to trigger state relocation?
   • Is state relocation an expensive process?
   • How to coordinate state moving without losing data & states?

• Analyzing State Adaptation Performance & Policies
– Given sufficient main memory, state relocation helps run-time throughput
– With insufficient main memory, Active-Disk improves run-time throughput

• Adapting Multi-Operator Plan
– Dependency among operators
– Global throughput-oriented spill solutions improve throughput

Percentage Spilled per Adaptation

• Amount of State Pushed Each Adaptation
– Percentage: # of tuples pushed / total # of tuples

(Input rate: 30ms/input, tuple range: 30K, join ratio: 3, adaptation threshold: 200MB)

[Figure: Run-Time Main Memory Usage. Memory usage (MB) sampled every 60 seconds; series: All-Mem, 10%-Push, 30%-Push, 50%-Push, 100%-Push.]

[Figure: Run-Time Query Throughput. Throughput over time (minutes); series: All-Mem, 10%-Push, 30%-Push, 50%-Push, 100%-Push.]
