Stream Data Introduction
Jan 02, 2016
Outline
  Streaming Data description
  Uses/Applications
  Problems/Challenges
  Main Concepts
    variance & k-medians
    aging & sliding windows
    algorithms
  References
Sizing the challenge
  WalMart records 20 million transactions
  Google handles 100 million searches
  AT&T produces 275 million call records
  Earth sensing satellite produces GBs of data
And this is in just one day!
Characteristics/Description
Stream data sets are…
  Continuous
  Massive
  Unbounded
    Possibly infinite
  Fast changing, requiring fast, real-time response
Example: Network Management Application
Network management involves monitoring and configuring network hardware and software to ensure smooth operation
  Monitor link bandwidth usage, estimate traffic demands
  Quickly detect faults and congestion, and isolate the root cause
  Balance load, improve utilization of network resources
AT&T collects 100 GBs of NetFlow data each day!
Network Management Application (cont.)
[Figure: the Network and the Network Operations Center, with Measurements and Alarms passing between them]
Uses/Applications
Banking/Stocks/Financials
  credit card fraud detection
  stock trends monitoring
Sensors
  power grid balancing
  engine controls
  collision avoidance
  driver sleep monitor
Problems/Challenges
Zillions of data items
  Continuous/Unbounded
  Examples arrive faster than they can be mined
Application may require fast, real-time response
  Examples:
    life threatening: collision avoidance
    lost revenue/transactions: hung-up networks
Problems/Challenges
Time/Space constrained
  Not enough memory
  Can’t afford storing/revisiting the data
    Single pass computation
  External memory algorithms for handling data sets larger than main memory cannot be used
    They do not support continuous queries
    Their real-time response is too slow
Problems/Challenges
In summary…
Can’t stop to smell the roses…
Only one chance/single pass/look at the data
Problems/Challenges
Other Considerations
  Classical algorithms (e.g. CART, C4.5) do not scale up to data streams [DH00]
    Most need the entire data set for analysis
    Most need random access (or multiple passes) to the data
  Difficult to compute answers accurately with limited memory
    With probability at least 1 - δ, algorithms compute an approximate answer within a factor ε of the actual answer
  Noise (bad sensors, outliers)
  Aging/Old/Stale data
Computation Model
[Figure: Data Streams feed the Stream Processing Engine, which keeps a Synopsis in Memory and produces an (Approximate) Answer for Decision Making]
Model Components
Synopsis
  Summary of the data
  Samples, Histograms
Processing Engine
  Implementation/Management System
    STREAM (Stanford): general-purpose
    Aurora (Brown/MIT): sensor monitoring, dataflow
    Telegraph (Berkeley): adaptive engine for sensors
Decision Making
  Apply Data Mining techniques
    Decision Trees, Clusters, Association Rules
Synopsis: Dealing with Time/Space Constraints
Since the data can’t be stored or revisited, the best alternative is to summarize what has been seen.
Basic stream synopsis computation
  Random Sampling: generate statistics using a representative sample of the data
  Histograms: distribution/grouping representation of the data
  Wavelets: mathematical tool for hierarchical decomposition of functions/signals
Reservoir Sampling
  Sample the first m items
  Choose to sample the i’th item (i > m) with probability m/i
  If sampled, randomly replace a previously sampled item
Optimization: when i gets large, compute which item will be sampled next and skip over the intervening items
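The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not code from the presentation: the function name reservoir_sample and the optional rng argument are assumptions made for the example, and the skip-ahead optimization is left out.

```python
import random

def reservoir_sample(stream, m, rng=None):
    """Return a uniform random sample of up to m items from an iterable of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= m:
            reservoir.append(item)                # sample the first m items
        elif rng.random() < m / i:                # sample the i'th item with probability m/i
            reservoir[rng.randrange(m)] = item    # replace a randomly chosen earlier sample
    return reservoir

# One pass over the stream, memory proportional to m only.
sample = reservoir_sample(range(1_000_000), m=5, rng=random.Random(42))
print(sample)
```

Only the reservoir is kept in memory, so the stream never needs to be stored or revisited.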
Reservoir Sampling - Analysis
Analyze the simple case: sample size m = 1
Probability that the i’th item is the sample from a stream of length n:
  (prob. i is sampled on arrival) × (prob. i survives to the end)
  $\frac{1}{i} \times \frac{i}{i+1} \times \frac{i+1}{i+2} \times \cdots \times \frac{n-2}{n-1} \times \frac{n-1}{n} = \frac{1}{n}$
The case m > 1 is similar; it is easy to show uniform probability
Drawback of reservoir sampling: hard to parallelize
Min-wise Sampling
  For each item, pick a random fraction between 0 and 1
  Store the item(s) with the smallest random tag [Nath et al.’04]
  Example tags: 0.391 0.908 0.291 0.555 0.619 0.273
Each item has the same chance of receiving the least tag, so the sample is uniform
Can run on multiple streams separately, then merge
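A rough sketch of min-wise sampling and of the merge step follows. The function names minwise_sample and merge_samples are hypothetical, and keeping the m smallest tags (rather than a single one) is one reading of "item(s)" above.

```python
import heapq
import random

def minwise_sample(stream, m, rng=None):
    """Tag every item with a random number in [0, 1) and keep the m smallest tags."""
    rng = rng or random.Random()
    tagged = ((rng.random(), item) for item in stream)
    return heapq.nsmallest(m, tagged, key=lambda pair: pair[0])

def merge_samples(samples, m):
    """Merge samples taken on separate streams by keeping the m smallest tags overall."""
    pooled = [pair for sample in samples for pair in sample]
    return heapq.nsmallest(m, pooled, key=lambda pair: pair[0])

rng = random.Random(7)
a = minwise_sample(range(0, 1000), m=3, rng=rng)
b = minwise_sample(range(1000, 2000), m=3, rng=rng)
merged = merge_samples([a, b], m=3)
print([item for _, item in merged])
```

Because the tags travel with the items, the merged result is what a single pass over both streams would have kept, which is why this scheme parallelizes more easily than reservoir sampling.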
Histograms
Histograms approximate the frequency distribution of element values in a stream
A histogram (typically) consists of
  A partitioning of element domain values into buckets
  A count BC per bucket B (of the number of elements in B)
Long history of use for selectivity estimation within a query optimizer
Histogram Sampling
Equi-Depth
  Element counts per bucket are kept constant
V-Optimal
  Minimize frequency variance within buckets
Exponential Histograms (EH)
  Bucket sizes are non-decreasing powers of 2
  Size: total number of 1’s in the bucket
  For every bucket size other than the size of the last bucket, there are at least k/2 and at most k/2+1 buckets of that size
  Example: k=4: (1,1,2,2,2,4,4,4,8,8,..)
  Essential component of the “sliding windows” technique for addressing “aging” data
[Figure: example Equi-Depth, V-Optimal, and Exponential Histograms]
Exponential Histogram
Assume k/2 = 2
  Bucket sizes, oldest first: 32,16,8,8,4,4,2,1,1
  A new 1 arrives: 32,16,8,8,4,4,2,1,1,1
  Merge the two oldest buckets of size 1: 32,16,8,8,4,4,2,2,1
  Another 1 arrives: 32,16,8,8,4,4,2,2,1,1
  Another 1 arrives and the merges cascade through sizes 1, 2, 4 and 8: 32,16,16,8,4,2,1
Answering Queries using Histograms
(Implicitly) map the histogram back to an approximate relation, and apply the query to the approximate relation
Example: select count(*) from R where 4<=R.e<=15
[Figure: histogram over the values 1 to 20 with the range 4 <= R.e <= 15 marked; the count in each bucket is spread evenly among the bucket's values]
answer: 3.5 * BC
For equi-depth histograms, maximum error: 2 * BC
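To make the "spread the count evenly" rule concrete, here is a small sketch of a range-count estimate over an equi-depth histogram. The bucket boundaries are hypothetical (the figure's boundaries did not survive), chosen so that the estimate for 4 <= R.e <= 15 comes out as 3.5 * BC; the function name is also an assumption.

```python
def estimate_range_count(buckets, bucket_count, lo, hi):
    """Estimate count(*) for lo <= e <= hi from an equi-depth histogram.

    buckets: list of (first_value, last_value) integer ranges, inclusive.
    bucket_count: the count BC shared by every bucket.
    """
    total = 0.0
    for first, last in buckets:
        overlap = min(hi, last) - max(lo, first) + 1    # bucket values that fall inside the range
        if overlap > 0:
            width = last - first + 1
            total += bucket_count * overlap / width     # count spread evenly among bucket values
    return total

# Hypothetical bucket boundaries over the domain 1..20 (not taken from the slide).
buckets = [(1, 2), (3, 4), (5, 8), (9, 11), (12, 15), (16, 20)]
print(estimate_range_count(buckets, bucket_count=100, lo=4, hi=15))  # 3.5 * BC with BC = 100
```

Only buckets that the range cuts through can contribute error, since fully covered and fully excluded buckets are counted exactly.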
Sliding Windows Technique
Background:
  Some applications rely on ALL historical data
  But for most applications, OLD data is considered less relevant and could skew results away from NEW trends or conditions
    new processes/procedures
    new hardware/sensors
    new fashion trends
Sliding Windows (cont.)
Common approaches to addressing old data:
Aging Model
  Elements are associated with “weights” that decrease over time
  May use an exponential decay formula
Sliding Windows Model
  Only the last “N” elements are considered
  Incorporate examples as they arrive
  A record that arrives at time t “expires” at time t+N (N is the window length)
  Example: count only the “1’s” in bit-stream data
Sliding Window (SW) Model
….1 0 1 0 0 0 1 0 1 1 1 1 1 1 0 0 0 1 0 1 0 0 1 1…
[Figure: the bit stream above on a time axis (Time Increases), with a window of size N = 7 ending at the Current Time]
Sliding Windows Plus Exponential Histograms
Sliding Windows Approach (pseudo-pseudo code)
  Consider only the last N elements.
  Define k = 1/ε, and round k/2 to the nearest integer.
  Time stamp each “1” that arrives in the stream and insert it into a new first bucket, shifting the existing buckets.
    The first bucket has size 1, since it holds only one “1”.
  If the number of buckets of the same size exceeds k/2 + 1, merge the two oldest buckets of that size, keeping at least k/2 buckets of that size.
    Merging creates a new bucket whose size is the sum of the two merged sizes.
  Eliminate the last bucket if the time stamp of its most recent 1 falls outside the last N elements.
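The steps above can be sketched as a small Python class. This is a simplified illustration under stated assumptions, not the published algorithm verbatim: the class and its max_per_size parameter are hypothetical names, max_per_size is the number of buckets allowed per size before a merge (set to 2 here to match the worked example with bucket sizes 32,16,8,8,4,4,2,1,1), and the estimate counts the oldest bucket at half its size.

```python
class ExponentialHistogram:
    """Approximate count of 1s among the last `window` bits of a stream."""

    def __init__(self, window, max_per_size=2):
        self.window = window
        self.max_per_size = max_per_size   # assumed knob: buckets allowed per size
        self.time = 0
        self.buckets = []                  # newest first: [time stamp of newest 1, size]

    def add(self, bit):
        self.time += 1
        # Drop the oldest bucket once its most recent 1 has left the window.
        if self.buckets and self.buckets[-1][0] <= self.time - self.window:
            self.buckets.pop()
        if bit == 1:
            self.buckets.insert(0, [self.time, 1])
            self._merge()

    def _merge(self):
        size = 1
        while True:
            same = [i for i, b in enumerate(self.buckets) if b[1] == size]
            if len(same) <= self.max_per_size:
                return
            newer, oldest = same[-2], same[-1]
            self.buckets[newer][1] = 2 * size   # merged bucket keeps the newer time stamp
            del self.buckets[oldest]
            size *= 2                           # the merge may overflow the next size up

    def sizes(self):
        """Bucket sizes, oldest first (the order used in the worked example)."""
        return [b[1] for b in reversed(self.buckets)]

    def estimate(self):
        """Sum of all bucket sizes minus half the oldest bucket's size (rounded down)."""
        if not self.buckets:
            return 0
        return sum(b[1] for b in self.buckets) - self.buckets[-1][1] // 2

eh = ExponentialHistogram(window=1000, max_per_size=2)
for _ in range(15):
    eh.add(1)
print(eh.sizes())   # -> [8, 4, 2, 1]
```

Only the oldest bucket can straddle the window boundary, which is why the estimate discounts that bucket and no other.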
Benefits of Sliding Windows
Incorporates new elements as they appear.
Easy to calculate statistics over data streams with respect to the last N elements based on the histogram.
Can estimate the number of 1’s within a factor of (1 + ε) using only $\Theta((1/\epsilon)\log^2 N)$ bits of memory.
Expansion of Sliding Windows
The original sliding window method could not maintain two important statistics when buckets are merged: variance and k-median.
A solution was devised by Babcock, Datar, Motwani and O’Callaghan [BDMO03].
Their work derived a method for maintaining variance that was also applied to k-medians.
Variance and k-Medians
Variance: $\sum_i (x_i - \mu)^2$, where $\mu = \sum_i x_i / N$
k-median clustering:
  Given: N points $(x_1 \ldots x_N)$ in a metric space
  Find k points $C = \{c_1, c_2, \ldots, c_k\}$ that minimize $\sum_i d(x_i, C)$ (the assignment distance)
Clustering to be covered in detail in a future presentation
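As a small illustration of the k-median objective, the sketch below evaluates the assignment distance for a given set of centers, taking d(x, C) to be the distance from x to its nearest center. Euclidean distance and the function name assignment_cost are assumptions for the example; any metric could be substituted.

```python
import math

def assignment_cost(points, centers):
    """Sum over all points of the distance to the nearest center."""
    return sum(min(math.dist(p, c) for c in centers) for p in points)

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(assignment_cost(points, [(0, 0), (10, 10)]))   # two centers, one per group
print(assignment_cost(points, [(0, 0)]))             # a single center costs far more
```

This only scores a candidate set of centers; finding the k centers that minimize the score is the clustering problem itself.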
Notation
Vi = Variance of the ith bucketni = number of elements in ith bucketμi = mean of the ith bucket
B1 Bm B2………………
Current window, size = N
Bm-1
Variance – composition
$B_{i,j}$ = concatenation of buckets i and j
$\mu_{i,j} = \frac{n_i \mu_i + n_j \mu_j}{n_i + n_j}$
$n_{i,j} = n_i + n_j$
$V_{i,j} = V_i + V_j + \frac{n_i n_j}{n_i + n_j} (\mu_i - \mu_j)^2$
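The composition rule can be checked numerically. In the sketch below, V is the sum of squared deviations from the mean (the definition of variance used in this presentation), and the combined statistics of two buckets are computed as n = n_i + n_j, μ = (n_i μ_i + n_j μ_j)/(n_i + n_j), V = V_i + V_j + (n_i n_j/(n_i + n_j))(μ_i - μ_j)². The function names and the sample values are made up for the example.

```python
def bucket_stats(values):
    """Return (n, mean, V), where V is the sum of squared deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    v = sum((x - mean) ** 2 for x in values)
    return n, mean, v

def combine(a, b):
    """Statistics of the concatenation of two buckets, without revisiting their elements."""
    n_i, mu_i, v_i = a
    n_j, mu_j, v_j = b
    n = n_i + n_j
    mu = (n_i * mu_i + n_j * mu_j) / n
    v = v_i + v_j + (n_i * n_j / n) * (mu_i - mu_j) ** 2
    return n, mu, v

left, right = [1.0, 2.0, 3.0], [10.0, 12.0]
merged = combine(bucket_stats(left), bucket_stats(right))
direct = bucket_stats(left + right)
print(merged)
print(direct)
```

The merged and directly computed statistics agree up to floating-point rounding, which is what allows buckets to be merged while keeping only three numbers per bucket.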
Decision Making
The problem of handling time-changing data also had a significant influence on decision algorithms.
Pedro Domingos, who had originally developed a successful decision tree algorithm (VFDT), also recognized the need to work with recent data, resulting in a new algorithm known as CVFDT.
  VFDT - Very Fast Decision Tree
  CVFDT - Concept-adapting Very Fast Decision Tree
  CVFDT implements a window approach
Decision Making
Both VFDT and CVFDT make use of a statistical result known as the Hoeffding* bound
  Used to estimate the minimum number of examples needed to make a decision for a node in a decision tree.
This is the key concept that makes these algorithms work.
* W. Hoeffding, “Probability Inequalities for Sums of Bounded Random Variables”, Journal of the American Statistical Association, 1963
Hoeffding Bound
Consider a random variable a whose range is R, and n independent observations of a with mean ā
The Hoeffding bound states:
  With probability 1 - δ, the true mean of a is at least ā - ε, where
  $\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$
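A short numeric sketch of the bound ε = sqrt(R² ln(1/δ) / (2n)) shows how ε shrinks as more examples are seen, and how many examples are needed to reach a target ε. The function names and the parameter values are illustrative assumptions, not values from the presentation.

```python
import math

def hoeffding_epsilon(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n)) for range R, confidence 1 - delta, n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def examples_needed(value_range, delta, epsilon):
    """Smallest n for which the bound is no larger than epsilon."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))

for n in (100, 1000, 10000):
    print(n, hoeffding_epsilon(value_range=1.0, delta=1e-7, n=n))
print(examples_needed(value_range=1.0, delta=1e-7, epsilon=0.05))
```

Because ε depends on n only through 1/sqrt(n), halving ε requires roughly four times as many examples.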
Hoeffding Bound
Significance…
  This estimate/bound is incorporated into an ID3 type decision tree, hence VFDT/CVFDT
  The information gain is evaluated against
  $\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$
VFDT Algorithm
[Figure: VFDT algorithm listing, which uses $\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$]
VFDT Algorithm Results
CVFDT vs. VFDT
CVFDT is an extension to VFDT that incorporates “windowing”
CVFDT concept:
  Generate the tree as usual, but using a window of “w” elements.
  Monitor changes in gain for attributes.
  If the gain changes, generate an alternate subtree with the new “best” attribute, but keep it in the background.
  Replace the old subtree if the new subtree becomes more accurate.
References
[BDMO03] B. Babcock, M. Datar, R. Motwani, and J. L. O’Callaghan. “Maintaining Variance and k-Medians over Data Stream Windows”. ACM PODS, 2003. http://citeseer.nj.nec.com/591910.html http://www.stanford.edu/~babcock/papers/pods03.ppt
[DH00] P. Domingos and G. Hulten. “Mining High-Speed Data Streams”. ACM KDD, 2000. http://citeseer.nj.nec.com/domingos00mining.html
[HSD01] G. Hulten, L. Spencer, and P. Domingos. “Mining Time-Changing Data Streams”. ACM KDD, 2001. http://citeseer.nj.nec.com/hulten01mining.html
[DGIM02] M. Datar, A. Gionis, P. Indyk, and R. Motwani. “Maintaining Stream Statistics over Sliding Windows”. ACM-SIAM SODA, 2002. http://www.stanford.edu/~babcock/papers/pods03.ppt
[GGR02] M. Garofalakis, J. Gehrke, and R. Rastogi. “Querying and Mining Data Streams: You Only Get One Look”. SIGMOD 2002 (tutorial). http://www.bell-labs.com/user/minos/Talks/streams-tutorial02.ppt