Data Stream load shedding by Sampling (CS2650) Taecheol Oh
Jan 24, 2016
Introduction
Many data stream sources are prone to dramatic spikes in volume
An overloaded system will be unable to process all of its input data
Discarding some fraction of the unprocessed data therefore becomes necessary for the system to keep providing up-to-date query responses
Sampling
Degrade gracefully by providing approximate answers during load spikes
With basic statistics on the distribution of values, the accuracy of query answers can be guaranteed for a given sampling rate
Semantics of SAMPLE
SAMPLE(R, f): produce a uniform random sample of R that contains a fraction f of the tuples in R (n is the number of tuples in R)
Sampling with Replacement (WR): sample fn tuples, uniformly and independently, from R. Specific tuples could be sampled multiple times
Sampling without Replacement (WoR): sample fn distinct tuples from R
Independent Coin Flips (CF): for each tuple in R, choose it for the sample with probability f, independent of other tuples. The sample size then follows the binomial distribution B(n, f)
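The three semantics above can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function names, the use of round(f * n) for the sample size, and the explicit rng argument are all assumptions made for the example.

```python
import random

def sample_wr(r, f, rng):
    """WR: draw round(f * n) tuples uniformly and independently; repeats are possible."""
    k = round(f * len(r))
    return [rng.choice(r) for _ in range(k)]

def sample_wor(r, f, rng):
    """WoR: draw round(f * n) tuples from distinct positions of r."""
    k = round(f * len(r))
    return rng.sample(r, k)

def sample_cf(r, f, rng):
    """CF: keep each tuple independently with probability f; the size is B(n, f)."""
    return [t for t in r if rng.random() < f]

if __name__ == "__main__":
    rng = random.Random(2016)
    r = list(range(20))
    print(sample_wr(r, 0.25, rng))
    print(sample_wor(r, 0.25, rng))
    print(sample_cf(r, 0.25, rng))
```

WR and WoR return a fixed number of tuples, while CF returns a random number of tuples whose expectation is fn. CF needs no knowledge of n in advance, which is convenient on a stream.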
Density Preserving Sampling
Suppose that we have N values x1, x2, …, xN
Partitioned into groups that have sizes n1, n2, …, ng
The expected sum of the weights of the sampled points for each group is proportional to the group's size
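One way to obtain this property, assumed here for illustration and not stated on the slide, is to sample every point of group j with a per-group probability p_j and attach the weight 1/p_j to each sampled point. The expected weight sum of group j is then n_j * p_j * (1/p_j) = n_j. The sketch below checks this empirically; the function names and the dictionary layout are hypothetical.

```python
import random

def density_preserving_sample(groups, probs, rng):
    """groups: name -> list of values; probs: name -> sampling probability p_j.

    Returns (group, value, weight) triples, with weight 1/p_j (an assumed choice).
    """
    sample = []
    for name, values in groups.items():
        p = probs[name]
        for x in values:
            if rng.random() < p:
                sample.append((name, x, 1.0 / p))
    return sample

def mean_group_weights(groups, probs, trials, rng):
    """Average, over several trials, the per-group sum of sample weights."""
    totals = {name: 0.0 for name in groups}
    for _ in range(trials):
        for name, _x, w in density_preserving_sample(groups, probs, rng):
            totals[name] += w
    return {name: total / trials for name, total in totals.items()}

if __name__ == "__main__":
    groups = {"big": list(range(1000)), "mid": list(range(100)), "small": list(range(10))}
    probs = {"big": 0.1, "mid": 0.5, "small": 1.0}
    print(mean_group_weights(groups, probs, 200, random.Random(0)))
```

The averaged weight sums should land close to the group sizes even though the dense group is sampled much more sparsely than the small one.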
Experiments
STREAM (Stanford stREam datA Manager)
A general-purpose data stream management system
A traditional DBMS is for running one-time queries over finite stored data sets
In many applications, data takes the form of continuous data streams rather than finite data sets
The STREAM project considers data management and query processing in the presence of multiple continuous, rapid, time-varying data streams
Abstract Semantics
The abstract semantics is based on two data types: Stream and Relation
Stream: an unbounded bag of pairs <s, t>, where s is a tuple and t is a timestamp (the logical arrival time)
Relation: a time-varying bag of tuples. The bag at a given time t is an instantaneous relation
[Figure: three operator classes connect Streams and Relations: Stream-to-Relation, Relation-to-Relation, and Relation-to-Stream]
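The three operator classes can be illustrated with a small sketch. The concrete operators chosen here (a time-based sliding window, a selection, and an insert-stream operator) and all function names are assumptions for the example. A bag is modeled with collections.Counter.

```python
from collections import Counter
from typing import Callable, List, Tuple

Element = Tuple[str, int]  # (tuple s, timestamp t)

def window(stream: List[Element], now: int, width: int) -> Counter:
    """Stream-to-Relation: the bag of tuples with now - width < t <= now."""
    return Counter(s for s, t in stream if now - width < t <= now)

def select(relation: Counter, pred: Callable[[str], bool]) -> Counter:
    """Relation-to-Relation: keep tuples satisfying pred, preserving multiplicity."""
    return Counter({s: c for s, c in relation.items() if pred(s)})

def istream(prev: Counter, curr: Counter, now: int) -> List[Element]:
    """Relation-to-Stream: emit tuples present in curr but not in prev, stamped now."""
    added = curr - prev  # Counter subtraction keeps only positive counts
    return [(s, now) for s in sorted(added.elements())]

if __name__ == "__main__":
    stream = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
    r3 = window(stream, now=3, width=2)
    r4 = window(stream, now=4, width=2)
    print(r3, r4)
    print(select(r4, lambda s: s != "a"))
    print(istream(r3, r4, now=4))
```

Each call to window yields one instantaneous relation, so a continuous query amounts to re-evaluating the relational part as time advances and converting the changes back into a stream.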
Query Execution
When a continuous query is registered with the system, a query execution plan is generated
Plans are composed of three main components:
Operators
Queues (input and inter-operator)
State (windows, history)
A global scheduler drives plan execution
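The plan components listed above can be mimicked with a toy model. This is only a sketch under stated assumptions: the Operator class, its step method, and the round-robin policy are hypothetical and do not reproduce STREAM's actual interfaces or scheduling strategy.

```python
from collections import deque

class Operator:
    """A plan operator with input queues, one output queue, and private state."""

    def __init__(self, name, fn, inputs, output):
        self.name, self.fn = name, fn
        self.inputs, self.output = inputs, output
        self.state = []  # e.g. window contents or history

    def step(self):
        """Consume at most one tuple per input queue; return True if work was done."""
        worked = False
        for q in self.inputs:
            if q:
                result = self.fn(q.popleft(), self.state)
                if result is not None:
                    self.output.append(result)
                worked = True
        return worked

def run_round_robin(operators):
    """Hypothetical global scheduler: cycle over operators until no queue has input."""
    progress = True
    while progress:
        progress = any([op.step() for op in operators])

def running_sum(x, state):
    state.append(x)
    return sum(state)

if __name__ == "__main__":
    src, mid, out = deque([1, 2, 3, 4]), deque(), deque()
    keep_even = Operator("filter", lambda x, st: x if x % 2 == 0 else None, [src], mid)
    agg = Operator("sum", running_sum, [mid], out)
    run_round_robin([keep_even, agg])
    print(list(out))
```

Load shedding by sampling fits this picture as an extra step that drops tuples from a queue before the operator consumes them.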
Simple Query Plan
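[Figure: simple query plan. Queries Q1 and Q2 run over Stream1, Stream2, and Stream3 through two join (⋈) operators with associated State1 to State4, coordinated by the Scheduler]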
Overview of Approach
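Unweighted sampling vs Weighted sampling
Unweighted sampling
Each element is sampled uniformly at random
Algorithm (M is the number of samples still to be drawn, n the total number of tuples)
  i ← 0
  while tuples are streaming by and M > 0 do
    get tuple t_i
    generate random variable X from B(M, 1/(n − i))
    add X copies of t_i to the sample
    M ← M − X
    i ← i + 1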
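A runnable sketch of the unweighted algorithm follows. It assumes that the first parameter of the binomial is the remaining sample count M, that X copies of the current tuple are emitted, and that n is known in advance. The helper binomial and the function name sample_wr_stream are hypothetical.

```python
import random

def binomial(rng, trials, p):
    """Draw from B(trials, p) by summing Bernoulli trials (simple, not fast)."""
    return sum(1 for _ in range(trials) if rng.random() < p)

def sample_wr_stream(stream, n, m, rng):
    """One-pass uniform sample of m tuples, with replacement, from a stream of n tuples."""
    sample = []
    for i, t in enumerate(stream):
        if m <= 0:
            break
        # Each remaining sample slot picks tuple i with probability 1/(n - i).
        x = binomial(rng, m, 1.0 / (n - i))
        sample.extend([t] * x)
        m -= x
    return sample

if __name__ == "__main__":
    rng = random.Random(7)
    print(sample_wr_stream(iter("abcdefghij"), n=10, m=5, rng=rng))
```

For the last tuple the probability is 1/(n − (n − 1)) = 1, so every slot that is still open is filled and the sample always contains exactly the requested number of tuples.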
Overview of Approach
Weighted sampling
Each element is sampled with a probability proportional to its weight
Algorithm (w_i is the weight of tuple t_i, W the sum of all weights, D the weight seen so far)
  i ← 0, W ← sum of weights, D ← 0
  while tuples are streaming by and M > 0 do
    get tuple t_i with weight w_i
    generate random variable X from B(M, w_i/(W − D))
    add X copies of t_i to the sample
    M ← M − X
    D ← D + w_i
    i ← i + 1
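The weighted variant can be sketched the same way. It assumes the total weight W is known before the stream is read, that the binomial is drawn over the remaining sample count M, and that X copies of the tuple are emitted. Integer weights are used so that W − D is exact. The names binomial and weighted_sample_wr_stream are hypothetical.

```python
import random

def binomial(rng, trials, p):
    """Draw from B(trials, p) by summing Bernoulli trials (simple, not fast)."""
    return sum(1 for _ in range(trials) if rng.random() < p)

def weighted_sample_wr_stream(stream, total_weight, m, rng):
    """One-pass sample of m tuples, with replacement, proportional to weight.

    stream yields (tuple, weight) pairs; total_weight is W, the sum of all weights.
    """
    sample = []
    consumed = 0  # D: total weight of the tuples already seen
    for t, w in stream:
        if m <= 0:
            break
        # Each remaining slot picks this tuple with probability w / (W - D).
        x = binomial(rng, m, w / (total_weight - consumed))
        sample.extend([t] * x)
        m -= x
        consumed += w
    return sample

if __name__ == "__main__":
    data = [("a", 0), ("b", 3), ("c", 0), ("d", 1)]
    print(weighted_sample_wr_stream(iter(data), total_weight=4, m=6, rng=random.Random(3)))
```

With all weights equal to 1 and W = n, the probability w/(W − D) reduces to 1/(n − i), so the unweighted algorithm is the special case of uniform weights.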
Overview of Approach
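Weighted sampling considering the density
[Figure: tuples (W, Z, Z, X, Y, X) wait in the queue in front of the operator. A mapping function assigns each tuple to a bit map / counter entry, which is incremented (+1) to maintain the density measure used by the weighted sampling controller]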
Thanks