
Adaptive Processing in Data Stream Systems

Shivnath Babu
Stanford STREAM Data Manager
Stanford University

Data Streams

• New applications: data as continuous, rapid, time-varying data streams
  – Sensor networks, RFID tags
  – Network monitoring and traffic engineering
  – Financial applications
  – Telecom call records
  – Web logs and click-streams
  – Manufacturing processes
• Traditional databases: data stored in finite, persistent data sets

Using a Traditional Database

(Diagram: a loader loads data into stored tables R and S; the user/application issues queries over these tables and receives results.)

New Approach for Data Streams

(Diagram: the user/application registers a continuous query with a stream query processor, which evaluates it over the input streams and delivers a continuously updated result.)

Example Continuous Queries

• Web
  – Amazon's best sellers over the last hour
• Network intrusion detection
  – Track HTTP packets with destination address matching a prefix in a given table and content matching "*\.ida"
• Finance
  – Monitor NASDAQ stocks between $20 and $200 that have moved down more than 2% in the last 20 minutes

Data Stream Management System (DSMS)

(Diagram: input streams flow into the DSMS; the user registers a continuous query; the DSMS produces a streamed result and a stored result, and can archive data and maintain stored tables.)

Primer on Database Query Processing

(Diagram: a declarative query is preprocessed into a canonical form; query optimization chooses the best query execution plan; query execution runs that plan over the data in the database system and returns results.)

Traditional Query Optimization

(Diagram: a statistics manager periodically collects statistics, e.g., data sizes and histograms; the optimizer asks for the statistics it requires and uses the estimates to find the "best" query plan; the executor runs the chosen plan to completion over the data, auxiliary structures, and statistics.)

Optimizing Continuous Queries is Challenging

• Continuous queries are long-running
• Stream properties can change while a query runs
  – Data properties: value distributions
  – Arrival properties: bursts, delays
• System conditions can change
• Performance of a fixed plan can change significantly over time

Adaptive processing: use the plan that is best for current conditions

Roadmap

• StreaMon: our adaptive query processing engine
• Adaptive ordering of commutative filters
• Adaptive caching for multiway joins
• Current and future work
  – Similar techniques apply to conventional databases

Traditional Optimization vs. StreaMon

(Diagram: the traditional loop of statistics manager, optimizer finding the "best" plan, and executor running the chosen plan to completion is replaced in StreaMon by a profiler that monitors current stream and system characteristics, a re-optimizer that ensures the plan stays efficient for those characteristics and issues decisions to adapt, and an executor that executes the current plan on incoming stream tuples; the profiler and re-optimizer are combined in part for efficiency.)

Pipelined Filters

• Commutative filters over a stream
• Example: track HTTP packets with destination address matching a prefix in a given table and content matching "*\.ida"
• Simple to complex filters
  – Boolean predicates
  – Table lookups
  – Pattern matching
  – User-defined functions

(Diagram: packets flow through Filter1, Filter2, and Filter3 in a pipeline; packets that pass all filters emerge as the bad packets.)

Pipelined Filters: Problem Definition

• Continuous query: F1 ∧ F2 ∧ … ∧ Fn
• Plan: tuples flow through the filters in some order F(1), F(2), …, F(n)
• Goal: minimize the expected cost to process a tuple
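To make the objective concrete, here is a minimal sketch of the expected per-tuple cost of one filter ordering, assuming per-filter costs and drop-rates are known. The helper names are hypothetical, and for simplicity the drop-rates are treated as unconditional (the greedy algorithm discussed later uses conditional drop-rates instead):

```python
def expected_cost(order, cost, drop_rate):
    """Expected per-tuple cost of evaluating filters in the given order.

    order     : list of filter indices, e.g. [2, 0, 1]
    cost      : cost[i] = cost of evaluating filter i on one tuple
    drop_rate : drop_rate[i] = probability that filter i drops a tuple
                (independence assumed for this sketch)
    """
    survive = 1.0          # probability the tuple is still alive
    total = 0.0
    for i in order:
        total += survive * cost[i]       # pay filter i's cost only if reached
        survive *= (1.0 - drop_rate[i])  # tuple survives filter i
    return total

# Example: a cheap, highly selective filter should go first.
costs = [1.0, 5.0, 2.0]
drops = [0.8, 0.5, 0.1]
print(expected_cost([0, 2, 1], costs, drops))  # 2.3
print(expected_cost([1, 2, 0], costs, drops))  # 6.45
```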

Pipelined Filters: Example

(Diagram: a batch of input tuples flows through filters F1, F2, F3, F4; most tuples are dropped along the way and only a few reach the output.)

Informal goal: if a tuple will be dropped, then drop it as cheaply as possible

Why is Our Problem Hard?

• Filter drop-rates and costs can change over time
• Filters can be correlated
  – E.g., Protocol = HTTP and DestPort = 80

Metrics for an Adaptive Algorithm

• Speed of adaptivity
  – Detecting changes and finding a new plan
• Run-time overhead
  – Re-optimization, collecting statistics, plan switching
• Convergence properties
  – Plan properties under stable statistics

(Diagram: StreaMon's profiler, re-optimizer, and executor.)

Pipelined Filters: Stable Statistics

• Assume statistics are not changing
  – Order filters by decreasing drop-rate/cost ratio [MS79,IK84,KBZ86,H94]
  – With correlated filters, finding the optimal ordering is NP-hard
• Greedy algorithm: use conditional statistics (see the sketch below)
  – F(1) has the maximum drop-rate/cost ratio
  – F(2) has the maximum drop-rate/cost ratio over tuples not dropped by F(1)
  – And so on
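A minimal sketch of this greedy ordering, assuming the conditional statistics are estimated from a sample of tuples on which every filter has been evaluated (as the profile window described later provides). Function and variable names are illustrative, not StreaMon's API:

```python
def greedy_order(sample, cost):
    """Greedy filter ordering under stable statistics.

    sample : list of rows; row[i] == 1 if filter i would drop that tuple
    cost   : cost[i] = per-tuple cost of filter i

    At each step, pick the remaining filter with the highest conditional
    drop-rate/cost ratio over the tuples that survive all previously
    chosen filters.
    """
    n = len(cost)
    remaining = set(range(n))
    alive = list(sample)                  # sampled tuples not yet dropped
    order = []
    while remaining:
        def ratio(f):
            if not alive:
                return 0.0
            rate = sum(row[f] for row in alive) / len(alive)
            return rate / cost[f]
        best = max(remaining, key=ratio)
        order.append(best)
        remaining.remove(best)
        alive = [row for row in alive if row[best] == 0]
    return order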

Adaptive Version of Greedy

• Greedy gives strong guarantees
  – 4-approximation, the best polynomial-time approximation possible assuming P ≠ NP [MBM+05]
  – Holds for arbitrary (correlated) filter characteristics
  – Usually optimal in experiments
• Challenge:
  – Online algorithm
  – Fast adaptivity to the Greedy ordering
  – Low run-time overhead

A-Greedy: Adaptive Greedy

A-Greedy

(Diagram: the profiler maintains conditional filter drop-rates and costs over recent tuples; the re-optimizer asks for the statistics it requires, receives the estimates, and ensures that the filter ordering is Greedy for current statistics, issuing changes in the filter ordering; the executor processes tuples with the current Greedy ordering. The profiler and re-optimizer are combined in part for efficiency.)

A-Greedy's Profiler

• Responsible for maintaining current statistics
  – Filter costs
  – Conditional filter drop-rates: exponentially many combinations!
• Profile window: sampled statistics from which the required conditional drop-rates can be estimated
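One way to picture the profiler, as a hedged sketch rather than StreaMon's implementation: sample a small fraction of tuples, evaluate all filters on each sampled tuple regardless of the current ordering, and keep the resulting drop/pass bit vectors in a sliding window; any conditional drop-rate can then be estimated from the window on demand. Class and parameter names here are hypothetical:

```python
import random
from collections import deque

class ProfileWindow:
    """Sliding window of sampled drop/pass bit vectors, one row per sampled tuple."""

    def __init__(self, filters, size=1000, sample_prob=0.01):
        self.filters = filters            # list of predicate functions
        self.window = deque(maxlen=size)  # oldest rows fall off automatically
        self.sample_prob = sample_prob

    def maybe_profile(self, tup):
        """With small probability, evaluate *all* filters on the tuple and
        record which ones would drop it, regardless of the current plan."""
        if random.random() < self.sample_prob:
            row = [0 if f(tup) else 1 for f in self.filters]  # 1 = drop
            self.window.append(row)

    def cond_drop_rate(self, f, given_passed):
        """Drop-rate of filter f over sampled tuples that passed every
        filter in given_passed, the conditional statistic Greedy needs."""
        rows = [r for r in self.window
                if all(r[g] == 0 for g in given_passed)]
        if not rows:
            return 0.0
        return sum(r[f] for r in rows) / len(rows)
```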

Profile Window

(Diagram: sampled tuples flow through filters F1 through F4; for each sampled tuple the profiler records a row of drop/pass bits, one per filter, and these rows form the profile window.)

Greedy Ordering Using Profile Window

(Diagram: the profile window is summarized as a matrix view of per-filter drop counts; the filter with the highest count is placed first, the rows it drops are removed, the counts are recomputed over the surviving rows for the next position, and so on, yielding the Greedy ordering, e.g., F3, F2, F4, F1.)

A-Greedy's Re-optimizer

• Maintains the matrix view over the profile window
  – Easy to incorporate filter costs
  – Efficient incremental update
  – Fast detection/correction of changes in the Greedy order

Details in [BMM+04]: "Adaptive Processing of Pipelined Stream Filters", SIGMOD 2004
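A hedged sketch of the detection step: given the profile window and the current ordering, check the Greedy invariant position by position and report where it first fails. The real re-optimizer works incrementally on the matrix view and uses thresholds to avoid thrashing; names below are illustrative:

```python
def greedy_violation(order, window, cost):
    """Return the first position where the current ordering stops being
    greedy for the statistics in the profile window, or None if the
    ordering is still greedy.

    order  : current filter ordering, e.g. [2, 0, 3, 1]
    window : list of drop/pass rows (1 = drop), as kept by the profiler
    cost   : per-tuple filter costs
    """
    alive = list(window)
    for pos in range(len(order) - 1):
        def ratio(f):
            if not alive:
                return 0.0
            return sum(r[f] for r in alive) / len(alive) / cost[f]
        here = order[pos]
        # Greedy invariant: the filter at this position must have the best
        # conditional drop-rate/cost ratio among the filters not yet applied.
        if any(ratio(f) > ratio(here) for f in order[pos + 1:]):
            return pos
        alive = [r for r in alive if r[here] == 0]
    return None
```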

Next

• Tradeoffs and variations of A-Greedy
• Experimental results for A-Greedy

Tradeoffs

• Suppose:
  – Changes are infrequent
  – Slower adaptivity is okay
  – We want the best plans at very low run-time overhead
• Three-way tradeoff among speed of adaptivity, run-time overhead, and convergence properties
• Spectrum of A-Greedy variants

Variants of A-Greedy

Algorithm | Convergence Properties | Run-time Overhead         | Adaptivity
A-Greedy  | 4-approx.              | High (relative to others) | Fast

(Diagram: the profile window of drop/pass bits and the matrix view of drop counts derived from it.)

Variants of A-Greedy

Algorithm   | Convergence Properties          | Run-time Overhead           | Adaptivity
A-Greedy    | 4-approx.                       | High (relative to others)   | Fast
Sweep       | 4-approx.                       | Less work per sampling step | Slow
Local-Swaps | May get caught in local optima  | Less work per sampling step | Slow
Independent | Misses correlations             | Lower sampling rate         | Fast

Experimental Setup

• Implemented A-Greedy, Sweep, Local-Swaps, and Independent in StreaMon
• Studied convergence properties, run-time overhead, and adaptivity
• Synthetic testbed
  – Can control stream data and arrival properties
• DSMS server running on a 700 MHz Linux machine with 1 MB L2 cache and 2 GB memory

Converged Processing Rate

(Chart: average processing rate in tuples/sec versus number of filters, 3 to 10, for Optimal-Fixed, Sweep, A-Greedy, Local-Swaps, and Independent.)

Effect of Filter Drop-Rate

(Chart: average processing rate in tuples/sec versus the drop-rate of each of 3 filters, 0.1 to 0.9, for Optimal-Fixed, Sweep, A-Greedy, Local-Swaps, and Independent.)

Effect of Correlation

(Chart: average processing rate in tuples/sec versus the correlation factor, 1 to 4, for Optimal-Fixed, Sweep, A-Greedy, Local-Swaps, and Independent.)

Run-time Overhead

(Chart: average time per tuple in microseconds for Optimal, A-Greedy, Sweep, Local-Swaps, and Independent, split into tuple processing and profiling plus re-optimization overhead.)

Adaptivity

(Chart: number of filters evaluated per 2000 tuples over time, measured in thousands of tuples processed, for Sweep, Local-Swaps, A-Greedy, and Independent; filter selectivities are permuted at a marked point to test how quickly each algorithm adapts.)

Roadmap

• StreaMon: our adaptive processing engine
• Adaptive ordering of commutative filters
• Adaptive caching for multiway joins
• Current and future work

Stream Joins

(Diagram: sensors R, S, and T stream observations into the DSMS; a continuous query joins the observations in the last minute and streams out the join results.)
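For intuition about the building block, here is a minimal sketch of a time-based sliding-window join between two of these streams, using symmetric hashing on the join key and expiring tuples older than the window. It is an illustrative assumption, not the STREAM system's actual operator:

```python
from collections import defaultdict, deque

class WindowJoin:
    """Symmetric join of two streams over a time-based sliding window,
    e.g., observations in the last 60 seconds."""

    def __init__(self, window_secs=60):
        self.window = window_secs
        self.store = {"R": deque(), "S": deque()}   # (ts, key, tuple) per stream
        self.index = {"R": defaultdict(list), "S": defaultdict(list)}

    def _expire(self, now):
        # Drop tuples that have fallen out of the time window.
        for name, q in self.store.items():
            while q and q[0][0] <= now - self.window:
                ts, key, tup = q.popleft()
                self.index[name][key].remove(tup)

    def insert(self, name, ts, key, tup):
        """Insert a tuple from stream `name`, expire old tuples, and emit
        join results against the other stream's current window."""
        self._expire(ts)
        self.store[name].append((ts, key, tup))
        self.index[name][key].append(tup)
        other = "S" if name == "R" else "R"
        return [(tup, match) for match in self.index[other].get(key, [])]
```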

MJoins [VNB04]

(Diagram: each input stream has its own probing pipeline over the windows on R, S, and T; for example, a new S tuple is joined with the window on R and then the window on T, with symmetric pipelines for new R and T tuples.)

Excessive Recomputation in MJoins

(Diagram: because MJoins keep no intermediate state, the same join subresults over the windows on R, S, and T are recomputed again and again as new tuples arrive.)

Materializing Join Subexpressions

(Diagram: a join subexpression over the windows is fully materialized so that it need not be recomputed for every new tuple.)

Tree Joins: Trees of Binary Joins

(Diagram: the windows on R, S, and T are joined by a tree of binary join operators, with the intermediate join subexpression fully materialized.)

Hard State Hinders Adaptivity

(Diagram: switching plans, e.g., from a tree that materializes WR ⋈ WT to one that materializes WS ⋈ WT, requires discarding and rebuilding the materialized hard state, which makes plan switches expensive.)

Can We Get the Best of Both Worlds?

• MJoin drawback: recomputation
• Tree Join drawbacks: less adaptive, higher memory use

(Diagram: an MJoin pipeline over R, S, and T shown side by side with a tree join that materializes WR ⋈ WT.)

MJoins + Caches

(Diagram: an MJoin pipeline over the windows on R, S, and T, with a cache of WR ⋈ WT results that an S tuple can probe; on a cache hit, the corresponding pipeline segment is bypassed.)

MJoins + Caches (contd.)

• Caches are soft state
  – Adaptive
  – Flexible with respect to memory usage
• Capture the whole spectrum from MJoins to Tree Joins and plans in between
• Challenge: an adaptive algorithm to choose join operator orders and caches in pipelines
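A hedged sketch of the probe/bypass idea inside an MJoin pipeline, with cache staleness and eviction policies omitted; class, function, and attribute names (e.g., `probe_wr`, `s.key`) are hypothetical:

```python
class JoinCache:
    """Soft-state cache of a join subexpression used inside an MJoin
    pipeline: on a hit, the pipeline segment that would recompute the
    subresult is bypassed; the cache can be dropped or resized at any time."""

    def __init__(self, max_entries=10_000):
        self.max_entries = max_entries
        self.entries = {}                 # probe key -> cached subresult

    def probe(self, key):
        return self.entries.get(key)      # None means a miss

    def insert(self, key, subresult):
        if len(self.entries) >= self.max_entries:
            self.entries.pop(next(iter(self.entries)))   # crude eviction
        self.entries[key] = subresult


def process_s_tuple(s, cache, probe_wr, probe_wt):
    """Process one S tuple in the pipeline S -> (W_R join W_T);
    probe_wr / probe_wt are hypothetical window-probe functions."""
    hit = cache.probe(s.key)
    if hit is not None:                   # bypass the pipeline segment
        return [(s, r, t) for (r, t) in hit]
    subresult = [(r, t) for r in probe_wr(s.key) for t in probe_wt(s.key)]
    cache.insert(s.key, subresult)        # soft state: safe to drop any time
    return [(s, r, t) for (r, t) in subresult]
```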

Adaptive Caching (A-Caching)

• Adaptive join ordering with A-Greedy or a variant
  – The join operator orders determine the candidate caches
• Adaptive selection from the candidate caches
• Adaptive memory allocation to the chosen caches
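As a rough illustration of the selection step, here is a greedy benefit-per-memory heuristic over the candidate caches under a memory budget. This knapsack-style heuristic is an assumption made for illustration; the actual selection and allocation algorithms are in [BMW+05]:

```python
def choose_caches(candidates, memory_budget):
    """Pick a subset of candidate caches to instantiate, greedily by
    estimated benefit per unit of memory, within a memory budget.

    candidates : list of (name, est_benefit, est_memory) triples,
                 as might be produced by the profiler (illustrative format)
    """
    chosen, used = [], 0
    ranked = sorted(candidates, key=lambda c: c[1] / max(c[2], 1), reverse=True)
    for name, benefit, mem in ranked:
        if benefit > 0 and used + mem <= memory_budget:
            chosen.append(name)
            used += mem
    return chosen
```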

A-Caching (caching part only)

(Diagram: the profiler estimates the costs and benefits of the candidate caches; the re-optimizer takes the list of candidate caches and the estimated statistics, ensures that the maximum-benefit subset of candidate caches is used, and adds or removes caches; the executor runs MJoins with caches. The profiler and re-optimizer are combined in part for efficiency.)

Performance of Stream-Join Plans (1)

(Chart: average processing rate in tuples/sec for MJoin, TreeJoin, and A-Caching on a join of streams R, S, T, and U; stream arrival rates are in the ratio 1:1:1:10, with other input details given in [BMW+05].)

Performance of Stream-Join Plans (2)

(Chart: average processing rate in tuples/sec for MJoin, TreeJoin, and A-Caching; stream arrival rates are in the ratio 15:10:5:1, with other input details given in [BMW+05].)

A-Caching: Results at a Glance

• Adaptively captures the whole spectrum from fully-pipelined MJoins to tree-based joins
• Scalable approximation algorithms
• Different types of caches
• Up to 7x improvement over MJoin and 2x improvement over TreeJoin
• Details in [BMW+05]: "Adaptive Caching for Continuous Queries", ICDE 2005 (to appear)

Current and Future Work

• Broadening StreaMon's scope, e.g.,
  – Shared computation among multiple queries
  – Parallelism
• Rio: adaptive query processing in conventional database systems
• Plan logging: a new overall approach to address certain "meta issues" in adaptive processing

Related Work

• Adaptive processing of continuous queries
  – E.g., Eddies [AH00], NiagaraCQ [CDT+00]
• Adaptive processing in conventional databases
  – Inter-query adaptivity, e.g., Leo [SLM+01], [BC03]
  – Intra-query adaptivity, e.g., Re-Opt [KD98], POP [MRS+04]
• New approaches to query optimization
  – E.g., parametric [GW89,INS+92,HS03], expected-cost based [CHS99,CHG02], error-aware [VN03]

Summary

• New trends demand adaptive query processing
  – New applications, e.g., continuous queries, data streams
  – Increasing data size and query complexity
• CS-wide push towards autonomic computing
• Our goal: an adaptive data management system
  – StreaMon: adaptive data stream engine
  – Rio: adaptive processing in conventional DBMS
• Google keywords: shivnath, stanford stream

Performance of Stream-Join Plans

(Chart: average processing rate in tuples/sec for MJoin, TreeJoin, and A-Caching at sample points D1 through D8 from a spectrum of input properties.)

Adaptivity to Memory Availability

(Chart: average processing rate in tuples/sec versus memory available for storing join subresults, 0 to 70 KB, for TreeJoin, A-Caching, and MJoin.)

Plan Logging

• Log the profiling and re-optimization history
  – The query is long-running
  – Example view over the log for R ⋈ S ⋈ T:

    Rate(R) | ..... | σ(R,S) | Plan | Cost
    1024    | ..... | 0.75   | P1   | 12762
    5642    | ..... | 0.72   | P2   | 72332
    934     | ..... | 0.76   | P1   | 12003

(Diagram: the logged plans P1 and P2 lie in a high-dimensional space of statistics, e.g., Rate(R) and σ(R,S).)

Uses of Plan Logging

• Reducing re-optimization overhead
  – Create a cache of plans
• Reducing profiling overhead
  – Track how changes in a statistic contribute to changes in the best plan

(Diagram: regions of the statistics space, e.g., Rate(R) versus σ(R,S), mapped to plans P1 and P2.)
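As a rough illustration of the plan-cache idea, here is a sketch that discretizes the logged statistics into coarse regions and reuses the plan last chosen for the region the current statistics fall into. The keying scheme is an assumption made for illustration, not the actual design:

```python
import math

class PlanLog:
    """Log of (statistics, plan, cost) observations, doubling as a plan
    cache: if current statistics fall in a region where a plan was chosen
    before, reuse it instead of re-optimizing."""

    def __init__(self, rel_bucket=0.25):
        self.rel = rel_bucket            # bucket statistics on a log scale
        self.log = []                    # full history: (key, plan, cost)
        self.cache = {}                  # region key -> last plan chosen there

    def _key(self, stats):
        # e.g. stats = {"rate_R": 1024, "sel_RS": 0.75}
        def bucket(v):
            return 0 if v <= 0 else int(math.log(v) / math.log(1 + self.rel))
        return tuple((name, bucket(v)) for name, v in sorted(stats.items()))

    def record(self, stats, plan, cost):
        key = self._key(stats)
        self.log.append((key, plan, cost))
        self.cache[key] = plan

    def lookup(self, stats):
        """Return a previously logged plan for similar statistics, or None."""
        return self.cache.get(self._key(stats))
```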

Uses of Plan Logging (contd.)

• Tracking the return on investment of adaptive processing
  – Track the cost versus the benefit of adaptivity
  – Is there a single plan that would have good overall performance?
• Avoiding thrashing
  – Which statistics have transient changes?

Adaptive Processing in Traditional DBMS

(Diagram: the traditional optimize-then-execute loop, in which the statistics manager supplies estimated statistics, the optimizer finds the "best" plan, and the executor runs the chosen plan to completion; errors in the estimated statistics propagate into the chosen plan.)

Proactive Re-optimization with Rio

(Diagram: the statistics manager becomes a combined statistics manager + profiler that collects statistics at run time based on random samples; the (re-)optimizer considers pairs of (estimate, uncertainty) during optimization and produces "robust" plans; the executor executes the current plan. The components are combined in part for efficiency.)
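To illustrate the (estimate, uncertainty) idea, here is a hedged sketch that prefers a plan that stays close to the best cost across the whole uncertainty interval rather than the plan that is best only at the point estimate. It mirrors the spirit of robust plans, not Rio's exact algorithm; `cost_at` is a hypothetical helper:

```python
def pick_robust_plan(plans, cost_at, points=("lo", "est", "hi"), slack=1.2):
    """Pick a plan using (estimate, uncertainty) pairs rather than point
    estimates.  cost_at(plan, point) returns the plan's estimated cost at
    the low / expected / high end of the uncertainty interval.

    Prefer plans within `slack` of the best cost at *every* point; among
    those, take the one that is best at the estimate."""
    best = {p: min(cost_at(plan, p) for plan in plans) for p in points}
    robust = [plan for plan in plans
              if all(cost_at(plan, p) <= slack * best[p] for p in points)]
    pool = robust or plans                # fall back if nothing qualifies
    return min(pool, key=lambda plan: cost_at(plan, "est"))
```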
