Stream Data Introduction. Outline Streaming Data description Uses/Applications Problems/Challenges Main Concepts variance & k-means aging & sliding windows.

Stream Data Introduction

Outline

Streaming Data description
Uses/Applications
Problems/Challenges
Main Concepts
  variance & k-means
  aging & sliding windows
  algorithms
References

Sizing the challenge

WalMart records 20 million transactions
Google handles 100 million searches
AT&T produces 275 million call records
Earth sensing satellite produces GBs of data

All of this in just one day!

Characteristics/Description

Stream data sets are…
Continuous
Massive
Unbounded, possibly infinite
Fast changing, requiring fast, real-time response

Example: Network Management Application

Network management involves monitoring and configuring network hardware and software to ensure smooth operation
  Monitor link bandwidth usage, estimate traffic demands
  Quickly detect faults and congestion, and isolate the root cause
  Load balancing, improve utilization of network resources

AT&T collects 100 GBs of NetFlow data each day!

Network Management Application (cont.)

[Figure: the Network Operations Center receives measurements and alarms from the network.]

Uses/Applications

Banking/Stocks/Financials
  credit card fraud detection
  stock trends monitoring
Sensors
  power grid balancing
  engine controls
  collision avoidance
  driver sleep monitor

Problems/Challenges

“Zillions” of data
Continuous/Unbounded
Examples arrive faster than they can be mined
Application may require fast, real-time response
Examples:
  life threatening: collision avoidance
  lost revenue/transactions: hung-up networks

Problems/Challenges

Time/Space constrained
  Not enough memory
  Can’t afford storing/revisiting the data
  Single pass computation
External memory algorithms for handling data sets larger than main memory cannot be used.
  Do not support continuous queries
  Too slow for real-time response

Problems/Challenges

In summary…

Can’t stop to smell the roses…

Only one chance/single pass/look at the data

Problems/Challenges

Other Considerations
Classical algorithms (e.g., CART, C4.5) do not scale up to data streams [DH00]
  Most need the entire data set for analysis
  Random access (or multiple passes) to the data
Difficult to compute answers accurately with limited memory
  With probability at least 1 - δ, algorithms compute an approximate answer within a factor ε of the actual answer
Noise (bad sensors, outliers)
Aging/Old/Stale data

Computation Model

[Figure: data streams enter the Stream Processing Engine, which keeps a synopsis in memory and produces an (approximate) answer used for decision making.]

Model Components

Synopsis
  Summary of the data
  Samples, Histograms
Processing Engine
  Implementation/Management System
  STREAM (Stanford): general-purpose
  Aurora (Brown/MIT): sensor monitoring, dataflow
  Telegraph (Berkeley): adaptive engine for sensors
Decision Making
  Apply Data Mining techniques
  Decision Trees, Clusters, Association Rules

Synopsis: Dealing with Time/Space Constraints

Since data can’t be contained, or revisited, the best alternative is to summarize what has been seen.

Basic stream synopsis computation
  Random Sampling: generate statistics using a representative sample of the data
  Histograms: distribution/grouping data representation
  Wavelets: mathematical tool for hierarchical decomposition of functions/signals


Reservoir Sampling

Sample the first m items
Choose to sample the i’th item (i > m) with probability m/i
If sampled, randomly replace a previously sampled item

Optimization: when i gets large, compute which item will be sampled next and skip over the intervening items
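The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation: the function name `reservoir_sample` and the optional `rng` argument are assumptions made for the example, and the skip-ahead optimization is left out.

```python
import random


def reservoir_sample(stream, m, rng=None):
    """Return a uniform random sample of up to m items from an iterable.

    The stream is read once and only m items are ever held in memory.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= m:
            # Step 1: keep the first m items.
            reservoir.append(item)
        elif rng.random() < m / i:
            # Step 2: sample the i-th item with probability m/i.
            # Step 3: it replaces a uniformly chosen earlier sample.
            reservoir[rng.randrange(m)] = item
    return reservoir
```

Calling `reservoir_sample(range(1000), 5)` returns five of the thousand values; a stream shorter than m is returned whole.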


Reservoir Sampling - Analysis

Analyze simple case: sample size m = 1
Probability that the i’th item is the sample from a stream of length n:
(prob. i is sampled on arrival) × (prob. i survives to end)

$$\frac{1}{i} \times \frac{i}{i+1} \times \frac{i+1}{i+2} \times \cdots \times \frac{n-2}{n-1} \times \frac{n-1}{n} = \frac{1}{n}$$

Case for m > 1 is similar, easy to show uniform probability

Drawbacks of reservoir sampling: hard to parallelize


Min-wise Sampling

For each item, pick a random fraction between 0 and 1
Store item(s) with the smallest random tag [Nath et al.’04]

Example tags: 0.391 0.908 0.291 0.555 0.619 0.273

Each item has the same chance of having the least tag, so the sample is uniform

Can run on multiple streams separately, then merge
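A small Python sketch of min-wise sampling and of the merge step follows. The names `minwise_sample` and `merge_samples`, the (tag, item) return format, and the sequence counter used as a tie-breaker are all assumptions of this example rather than anything taken from [Nath et al.’04].

```python
import heapq
import itertools
import random

_seq = itertools.count()  # tie-breaker so the items themselves are never compared


def minwise_sample(stream, m, rng=None):
    """Keep the m items with the smallest random tags.

    Returns a list of (tag, item) pairs sorted by tag.
    """
    rng = rng or random.Random()
    heap = []  # max-heap on the tag (stored negated), at most m entries
    for item in stream:
        tag = rng.random()  # random fraction between 0 and 1
        entry = (-tag, next(_seq), item)
        if len(heap) < m:
            heapq.heappush(heap, entry)
        elif tag < -heap[0][0]:
            # The new tag beats the largest tag kept so far.
            heapq.heapreplace(heap, entry)
    return sorted(((-neg, item) for neg, _, item in heap), key=lambda p: p[0])


def merge_samples(a, b, m):
    """Merge two tagged samples taken on separate streams."""
    return sorted(a + b, key=lambda p: p[0])[:m]
```

Because the tags travel with the items, two sites can sample independently and the merged result is the same as sampling the combined stream with those tags.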


Histograms

Histograms approximate the frequency distribution of element values in a stream

A histogram (typically) consists of
  A partitioning of element domain values into buckets
  A count BC per bucket B (of the number of elements in B)
Long history of use for selectivity estimation within a query optimizer

Histogram Sampling

Equi-Depth
  Element counts per bucket are kept constant
V-Optimal
  Minimize frequency variance within buckets
Exponential Histograms (EH)
  Bucket sizes are non-decreasing powers of 2
  Size: total number of 1’s in the bucket
  For every bucket size other than that of the last bucket, there are at least k/2 and at most k/2+1 buckets of that size
  Example: k=4: (1,1,2,2,2,4,4,4,8,8,..)
  Essential component of the “sliding windows” technique addressing “aging” data.

[Figures: Equi-Depth, V-Optimal, Exponential Histograms]

Exponential Histogram

Assume k/2 = 2 (in this example a merge is triggered as soon as three buckets share a size)

Start: 32,16,8,8,4,4,2,1,1
A new 1 arrives: 32,16,8,8,4,4,2,1,1,1
Merge the two oldest buckets of size 1: 32,16,8,8,4,4,2,2,1
Another 1 arrives: 32,16,8,8,4,4,2,2,1,1
Another 1 arrives and the merges cascade (1+1 → 2, 2+2 → 4, 4+4 → 8, 8+8 → 16): 32,16,16,8,4,2,1


Answering Queries using Histograms

(Implicitly) map the histogram back to an approximate relation, and apply the query to the approximate relation

Example: select count(*) from R where 4<=R.e<=15

[Figure: histogram over element values 1 to 20 with the range 4 ≤ R.e ≤ 15 marked; the count is spread evenly among bucket values]

answer: 3.5 * BC

For equi-depth histograms, maximum error: BC*2
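The range-count estimate can be reproduced with a short Python sketch. The bucket boundaries below are hypothetical, since the original figure did not survive; they were chosen only so that the range 4..15 covers 3.5 buckets, as on the slide. The function name and the value of 10 for BC are likewise assumptions.

```python
def estimate_range_count(buckets, count_per_bucket, lo, hi):
    """Estimate count(*) where lo <= value <= hi from an equi-depth histogram.

    buckets: list of (first, last) inclusive integer value ranges.
    Each bucket's count is spread evenly among the values it covers.
    """
    total = 0.0
    for first, last in buckets:
        overlap = min(hi, last) - max(lo, first) + 1
        if overlap > 0:
            total += count_per_bucket * overlap / (last - first + 1)
    return total


# Hypothetical equi-depth buckets over the values 1..20.
buckets = [(1, 3), (4, 5), (6, 9), (10, 13), (14, 17), (18, 20)]
estimate = estimate_range_count(buckets, 10, 4, 15)  # 3.5 buckets' worth
```

Only the buckets that the range cuts through can contribute error; fully covered buckets are counted exactly.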

Sliding Windows Technique

Background:
Some applications rely on ALL historical data
But for most applications, OLD data is considered less relevant and could skew results from NEW trends or conditions
  new processes/procedures
  new hardware/sensors
  new fashion trends

Sliding Windows (cont.)

Common approaches addressing old data:
Aging Model
  Elements are associated with “weights” that decrease over time
  May use an exponential decay formula
Sliding Windows Model
  Only the last “N” elements are considered
  Incorporate examples as they arrive
  A record that arrives at time t “expires” at time t+N (N is the window length)
  Count only the “1’s” in bit-stream data
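As one possible reading of the aging model, the sketch below keeps an exponentially decayed mean in constant memory. The class name `DecayedMean` and the `decay` parameter are assumptions of this example; the slides do not fix a particular decay formula.

```python
class DecayedMean:
    """Mean in which an element seen j steps ago has weight decay**j."""

    def __init__(self, decay):
        if not 0.0 < decay < 1.0:
            raise ValueError("decay must be strictly between 0 and 1")
        self.decay = decay
        self.weighted_sum = 0.0
        self.weight_total = 0.0

    def add(self, x):
        # Age everything seen so far, then add the new element with weight 1.
        self.weighted_sum = self.decay * self.weighted_sum + x
        self.weight_total = self.decay * self.weight_total + 1.0

    def mean(self):
        return self.weighted_sum / self.weight_total


dm = DecayedMean(decay=0.5)
for x in (0.0, 0.0, 8.0):
    dm.add(x)
```

After the three updates the decayed mean is 8/1.75, well above the plain mean of 8/3, because the most recent value dominates.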

Sliding Window (SW) Model

….1 0 1 0 0 0 1 0 1 1 1 1 1 1 0 0 0 1 0 1 0 0 1 1…

Time Increases

Current Time

Window Size N = 7

Sliding Windows Plus Exponential Histograms

Sliding Windows Approach (pseudo-pseudo code)
Consider only the last N elements.
Define k = 1/ε, and round k/2 to the nearest integer.
Time stamp each “1” that arrives in the stream and insert it into a new first bucket, shifting the existing ones. The first bucket has size 1 since it holds only one “1”.
If the number of buckets with the same size exceeds k/2 + 1, merge the two oldest buckets of that size, keeping at least k/2 buckets of that size.
Merging creates a new bucket with size equal to the sum of the two merged sizes.
Eliminate the last bucket if the time stamp of its most recent 1 falls outside the last N elements.
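The procedure above can be sketched in Python as follows. This is an illustrative sketch in the spirit of [DGIM02], not the authors' code: the class and method names are made up, and the hypothetical parameter `max_per_size` stands for the largest number of same-size buckets tolerated before a merge (k/2 + 1 in the text; the worked slide example behaves like `max_per_size=2`).

```python
from collections import deque


class ExponentialHistogram:
    """Approximate count of 1s among the last `window` bits of a stream."""

    def __init__(self, window, max_per_size):
        self.window = window
        self.max_per_size = max_per_size
        self.buckets = deque()  # newest first: [time stamp of latest 1, size]
        self.time = 0

    def add(self, bit):
        self.time += 1
        # Eliminate old buckets whose most recent 1 has left the window.
        while self.buckets and self.buckets[-1][0] <= self.time - self.window:
            self.buckets.pop()
        if not bit:
            return
        self.buckets.appendleft([self.time, 1])
        size = 1
        while True:
            same = [i for i, b in enumerate(self.buckets) if b[1] == size]
            if len(same) <= self.max_per_size:
                break
            second_oldest, oldest = same[-2], same[-1]
            # Merge the two oldest buckets of this size; the merged bucket
            # keeps the more recent time stamp and has the summed size.
            self.buckets[second_oldest][1] = 2 * size
            del self.buckets[oldest]
            size *= 2  # the merge may overflow the next size up

    def sizes(self):
        """Bucket sizes listed from oldest to newest."""
        return [size for _, size in reversed(self.buckets)]

    def estimate(self):
        """All buckets in full, except only half of the oldest one."""
        if not self.buckets:
            return 0
        total = sum(size for _, size in self.buckets)
        return total - self.buckets[-1][1] // 2
```

Only the oldest bucket can straddle the window boundary, which is why the estimate discounts half of its size and nothing else.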

Benefits of Sliding Windows

Incorporates new elements as they appear.

Easy to calculate statistics over data streams with respect to the last N elements based on the histogram.

Can estimate the number of 1’s within a factor of (1 + ε) using only Θ((1/ε) log² N) bits of memory.

Expansion of Sliding Windows

The original Sliding Window method was not fully applicable to two important statistics, because they cannot simply be added up during the “merging” of the buckets: k-median and variance

A solution was devised by Babcock, Datar, Motwani and O’Callaghan [BDMO03]

Their work derived a method for variance that was also applied to k-medians.

Variance and k-Medians

Variance: Σ(xi – μ)², μ = Σ xi/N
k-median clustering:
  Given: N points (x1… xN) in a metric space
  Find k points C = {c1, c2, …, ck} that minimize Σ d(xi, C) (the assignment distance)

Clustering to be covered in detail in a future presentation
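To make the k-median objective concrete, the sketch below evaluates the assignment distance Σ d(xi, C), reading d(xi, C) as the distance from xi to its nearest centre. Euclidean distance is an assumption made for the example (any metric would do), and the function name is hypothetical.

```python
import math


def assignment_cost(points, centers):
    """Sum over all points of the distance to the nearest centre."""
    return sum(min(math.dist(p, c) for c in centers) for p in points)


points = [(0, 0), (0, 1), (10, 0)]
centers = [(0, 0), (10, 0)]
cost = assignment_cost(points, centers)  # 0 + 1 + 0
```

A k-median algorithm searches for the k centres that make this cost as small as possible; the function only scores a given choice.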

Notation

Vi = variance of the ith bucket
ni = number of elements in the ith bucket
μi = mean of the ith bucket

[Figure: the current window, size = N, divided into buckets B1, B2, …, Bm-1, Bm]

Variance – composition

Bi,j = concatenation of buckets i and j

$$n_{i,j} = n_i + n_j$$

$$\mu_{i,j} = \frac{n_i \mu_i + n_j \mu_j}{n_i + n_j}$$

$$V_{i,j} = V_i + V_j + \frac{n_i n_j}{n_i + n_j} (\mu_i - \mu_j)^2$$
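The composition rule can be checked numerically with a short Python sketch. Each bucket is summarized as a triple (n, μ, V), with V the sum of squared deviations Σ(xi – μ)² as defined earlier; the function names and sample values are assumptions of this example.

```python
def bucket_stats(values):
    """Return (n, mean, V) for a bucket, with V = sum of (x - mean)^2."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values)
    return n, mean, variance


def combine(bucket_i, bucket_j):
    """Statistics of the concatenation B_{i,j} from the two summaries alone."""
    n_i, mu_i, v_i = bucket_i
    n_j, mu_j, v_j = bucket_j
    n = n_i + n_j
    mu = (n_i * mu_i + n_j * mu_j) / n
    v = v_i + v_j + (n_i * n_j / n) * (mu_i - mu_j) ** 2
    return n, mu, v


left, right = [1.0, 2.0, 3.0], [10.0, 20.0]
merged = combine(bucket_stats(left), bucket_stats(right))
direct = bucket_stats(left + right)
```

`merged` and `direct` agree up to floating-point rounding, which is what lets buckets be merged without revisiting their elements.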

Decision Making

The problem of handling time-changing data has also had a significant influence on decision algorithms.

Pedro Domingos, who had originally developed a successful decision tree algorithm (VFDT) [DH00], also recognized the need to work with recent data, resulting in a new algorithm known as CVFDT [HSD01].
  VFDT - Very Fast Decision Tree
  CVFDT - Concept-adapting Very Fast Decision Tree

Implemented a window approach

Decision Making

Both VFDT and CVFDT make use of a statistical result known as the Hoeffding* bound
  Used to estimate the minimum number of examples needed to make a decision for a node in a decision tree.

This is the key concept for these algorithms to work.

* W. Hoeffding, “Probability Inequalities for Sums of Bounded Random Variables”, Journal of the American Statistical Association, 1963

Hoeffding Bound

Random variable a whose range is R
n independent observations of a; mean: ā
Hoeffding bound states:
With probability 1 - δ, the true mean of a is at least ā - ε, where

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$$
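A brief Python sketch of the bound ε = sqrt(R² ln(1/δ) / (2n)), together with its inversion to the number of observations needed for a target ε. The function names and the sample values of δ and ε are assumptions chosen for illustration.

```python
import math


def hoeffding_epsilon(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))


def examples_needed(value_range, delta, epsilon):
    """Smallest n for which the bound is at most epsilon."""
    return math.ceil(value_range ** 2 * math.log(1 / delta) / (2 * epsilon ** 2))


n_min = examples_needed(1.0, 1e-7, 0.05)
```

The bound shrinks as 1/√n, so halving ε costs four times as many observations.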

Hoeffding Bound

Significance…
This estimate/bound is incorporated into an ID3-type decision tree, hence VFDT/CVFDT

The information gain is evaluated against

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$$
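One way to read "evaluated against" is the split test described in [DH00]: a node is split once the gain of the best attribute exceeds that of the runner-up by more than ε. The sketch below assumes information gain with range R = log2(number of classes); the function name and sample numbers are illustrative only.

```python
import math


def should_split(best_gain, second_gain, num_classes, delta, n):
    """True when the observed gain difference exceeds the Hoeffding epsilon.

    Assumes the gain measure is information gain, whose range is
    log2(num_classes).
    """
    value_range = math.log2(num_classes)
    epsilon = math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))
    return (best_gain - second_gain) > epsilon


early = should_split(0.30, 0.10, num_classes=2, delta=1e-7, n=100)
later = should_split(0.30, 0.10, num_classes=2, delta=1e-7, n=1000)
```

With 100 examples the same gain difference is not yet convincing; with 1000 it is, so the node can be split without seeing the rest of the stream.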

VFDT Algorithm

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$$

VFDT Algorithm Results

CVFDT vs. VFDT

CVFDT is an extension to VFDT that incorporates “windowing”

CVFDT concept:
  Generate the tree as usual, but using a window of “w” elements.
  Monitor changes in gain for attributes.
  If the gain changes, generate an alternate subtree with the new “best” attribute, but keep it in the background.
  Replace the old subtree if the new subtree becomes more accurate.

References

[BDMO03] B. Babcock, M. Datar, R. Motwani, and J. L. O’Callaghan. “Maintaining Variance and k-Medians over Data Stream Windows”. ACM PODS, 2003.
  http://citeseer.nj.nec.com/591910.html
  http://www.stanford.edu/~babcock/papers/pods03.ppt

[DH00] P. Domingos and G. Hulten. “Mining High-Speed Data Streams”. ACM KDD, 2000.
  http://citeseer.nj.nec.com/domingos00mining.html

[HSD01] G. Hulten, L. Spencer and P. Domingos. “Mining Time-Changing Data Streams”. ACM KDD, 2001.
  http://citeseer.nj.nec.com/hulten01mining.html

[DGIM02] M. Datar, A. Gionis, P. Indyk and R. Motwani. “Maintaining Stream Statistics over Sliding Windows”. ACM-SIAM SODA, 2002.
  http://www.stanford.edu/~babcock/papers/pods03.ppt

[GGR02] M. Garofalakis, J. Gehrke and R. Rastogi. “Querying and Mining Data Streams: You Only Get One Look”. SIGMOD 2002 (tutorial).
  http://www.bell-labs.com/user/minos/Talks/streams-tutorial02.ppt
