Stream Data Introduction

Stream Data Introduction. Outline Streaming Data description Uses/Applications Problems/Challenges Main Concepts variance & k-means aging & sliding windows.

Jan 02, 2016

Stanley Watson
Transcript
Page 1:

Stream Data Introduction

Page 2:

Outline
  Streaming Data description
  Uses/Applications
  Problems/Challenges
  Main Concepts
    variance & k-medians
    aging & sliding windows
    algorithms
  References

Page 3:

Sizing the challenge

  WalMart records 20 million transactions
  Google handles 100 million searches
  AT&T produces 275 million call records
  Earth-sensing satellites produce GBs of data

All of this in just one day!

Page 4:

Characteristics/Description

Stream data sets are…
  Continuous
  Massive
  Unbounded
  Possibly infinite
  Fast changing, requiring fast, real-time response

Page 5:

Example: Network Management Application

Network management involves monitoring and configuring network hardware and software to ensure smooth operation
  Monitor link bandwidth usage, estimate traffic demands
  Quickly detect faults and congestion, and isolate the root cause
  Balance load and improve utilization of network resources

AT&T collects 100 GBs of NetFlow data each day!

Page 6:

Network Management Application (cont.)

[Figure: Network and Network Operations Center, connected by flows labeled Measurements and Alarms]

Page 7:

Uses/Applications

Banking/Stocks/Financials
  credit card fraud detection
  stock trends monitoring
Sensors
  power grid balancing
  engine controls
  collision avoidance
  driver sleep monitor

Page 8:

Problems/Challenges

“Zillions” of data
  Continuous/Unbounded
  Examples arrive faster than they can be mined
Application may require fast, real-time response
  Examples:
    life threatening: collision avoidance
    lost revenue/transactions: hung-up networks

Page 9:

Problems/Challenges

Time/Space constrained
  Not enough memory
  Can’t afford storing/revisiting the data
    Single pass computation
  External memory algorithms for handling data sets larger than main memory cannot be used.
    They do not support continuous queries
    They are too slow for real-time response

Page 10:

Problems/Challenges

In summary…

Can’t stop to smell the roses…

Only one chance/single pass/look at the data

Page 11:

Problems/Challenges

Other Considerations
  Classical algorithms (e.g., CART, C4.5) do not scale up to data streams [DH00]
    Most need the entire data set for analysis
    Most need random access to (or multiple passes over) the data
  Difficult to compute answers accurately with limited memory
    With probability at least 1 - δ, algorithms compute an approximate answer within a factor ε of the actual answer
  Noise (bad sensors, outliers)
  Aging/Old/Stale data

Page 12:

Computation Model

[Figure: Data Streams enter a Stream Processing Engine, which keeps a Synopsis in Memory and produces an (Approximate) Answer used for Decision Making]

Page 13:

Model Components

Synopsis
  Summary of the data
  Samples, Histograms
Processing Engine
  Implementation/Management System
    STREAM (Stanford): general-purpose
    Aurora (Brown/MIT): sensor monitoring, dataflow
    Telegraph (Berkeley): adaptive engine for sensors
Decision Making
  Apply Data Mining techniques
    Decision Trees, Clusters, Association Rules

Page 14:

Synopsis: Dealing with Time/Space Constraints

Since the data can’t be stored or revisited, the best alternative is to summarize what has been seen.

Basic stream synopsis computation
  Random Sampling: generate statistics using a representative sample of the data
  Histograms: distribution/grouping representation of the data
  Wavelets: mathematical tool for hierarchical decomposition of functions/signals

Page 15:

Reservoir Sampling

  Sample the first m items
  Choose to sample the i’th item (i > m) with probability m/i
  If sampled, randomly replace a previously sampled item

Optimization: when i gets large, compute which item will be sampled next and skip over the intervening items
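The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation: the function name `reservoir_sample` and the optional `rng` argument are assumptions made for the example, and the skip-ahead optimization is left out.

```python
import random

def reservoir_sample(stream, m, rng=None):
    """Return a uniform random sample of m items from an iterable of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= m:
            # Keep the first m items unconditionally.
            reservoir.append(item)
        elif rng.random() < m / i:
            # Sample the i-th item with probability m/i and
            # overwrite a uniformly chosen earlier sample.
            reservoir[rng.randrange(m)] = item
    return reservoir

# Hypothetical usage: 5 samples from a stream of 1000 integers.
sample = reservoir_sample(range(1000), m=5, rng=random.Random(0))
```

Only the m sampled items are held in memory, however long the stream is. If the stream has fewer than m items, the sketch simply returns all of them.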

Page 16:

Reservoir Sampling - Analysis

Analyze the simple case: sample size m = 1

Probability that the i’th item is the sample from a stream of length n:
(prob. i is sampled on arrival) × (prob. i survives to the end)

$$\frac{1}{i} \times \frac{i}{i+1} \times \frac{i+1}{i+2} \times \cdots \times \frac{n-2}{n-1} \times \frac{n-1}{n} = \frac{1}{n}$$

The case for m > 1 is similar; it is easy to show uniform probability

Drawback of reservoir sampling: hard to parallelize

Page 17:

Min-wise Sampling

For each item, pick a random fraction between 0 and 1
Store the item(s) with the smallest random tag [Nath et al.’04]

Example tags: 0.391 0.908 0.291 0.555 0.619 0.273

Each item has the same chance of getting the least tag, so the sample is uniform

Can run on multiple streams separately, then merge
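A small Python sketch of the idea, for a sample of size one. The function names `minwise_sample` and `merge` and the stream contents are assumptions made for illustration.

```python
import random

def minwise_sample(stream, rng):
    """Return (tag, item) for the item that drew the smallest random tag."""
    best = None
    for item in stream:
        tag = rng.random()  # random fraction in [0, 1)
        if best is None or tag < best[0]:
            best = (tag, item)
    return best

def merge(*partials):
    """Combine per-stream results: the overall sample has the smallest tag."""
    candidates = [p for p in partials if p is not None]
    return min(candidates, key=lambda p: p[0]) if candidates else None

# Hypothetical usage: two streams sampled separately, then merged.
rng = random.Random(42)
a = minwise_sample(["a1", "a2", "a3"], rng)
b = minwise_sample(["b1", "b2"], rng)
overall = merge(a, b)
```

Because each partial result carries its tag, merging only needs a comparison of tags. This is what makes the scheme easy to run on several streams in parallel.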

Page 18:

Histograms

Histograms approximate the frequency distribution of element values in a stream

A histogram (typically) consists of
  A partitioning of element domain values into buckets
  A count BC per bucket B (of the number of elements in B)

Long history of use for selectivity estimation within a query optimizer

Page 19:

Histogram Sampling

Equi-Depth
  Element counts per bucket are kept constant
V-Optimal
  Minimize frequency variance within buckets
Exponential Histograms (EH)
  Bucket sizes are non-decreasing powers of 2
  Size: total number of 1’s in the bucket
  For every bucket size other than the size of the last bucket, there are at least k/2 and at most k/2+1 buckets of that size
  Example: k=4: (1,1,2,2,2,4,4,4,8,8,..)

Exponential histograms are an essential component of the “sliding windows” technique for addressing “aging” data.

Page 20:

[Figure: example histograms. Panels: Equi-Depth, V-Optimal, Exponential Histograms]

Page 21:

Exponential Histogram

Assume k/2 = 2

32,16,8,8,4,4,2,1,1

Page 22:

Exponential Histogram

Assume k/2 = 2

32,16,8,8,4,4,2,1,1 + 1 (a new 1 arrives)

Page 23:

Exponential Histogram

Assume k/2 = 2

32,16,8,8,4,4,2,1,1,1  (Merge!)
32,16,8,8,4,4,2,2,1    (Merged!)

Page 24:

Exponential Histogram

Assume k/2 = 2

32,16,8,8,4,4,2,1,1
32,16,8,8,4,4,2,2,1
32,16,8,8,4,4,2,2,1,1
32,16,16,8,4,2,1

Page 26:

Answering Queries using Histograms

(Implicitly) map the histogram back to an approximate relation, and apply the query to the approximate relation

Example: select count(*) from R where 4<=R.e<=15

[Figure: histogram over the values 1 to 20, with each bucket count spread evenly among the bucket's values and the range 4 ≤ R.e ≤ 15 marked]

answer: 3.5 * BC

For equi-depth histograms, maximum error: 2 * BC
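The "spread the count evenly" estimate can be sketched as follows. The bucket boundaries and the per-bucket count `BC = 10` are hypothetical; they are chosen only so that the query range 4 to 15 covers 3.5 buckets, as in the slide's answer. The function name `estimate_range_count` is likewise an assumption.

```python
def estimate_range_count(buckets, lo, hi):
    """Estimate count(*) for lo <= value <= hi over integer-valued buckets.

    buckets: list of (low, high, count) with inclusive integer bounds.
    Each bucket's count is assumed to be spread evenly over its values.
    """
    total = 0.0
    for b_lo, b_hi, count in buckets:
        overlap = min(hi, b_hi) - max(lo, b_lo) + 1  # bucket values inside the range
        if overlap > 0:
            total += count * overlap / (b_hi - b_lo + 1)
    return total

BC = 10  # hypothetical per-bucket count of an equi-depth histogram
# Hypothetical bucket boundaries over the domain 1..20.
buckets = [(1, 1, BC), (2, 5, BC), (6, 9, BC), (10, 11, BC), (12, 15, BC), (16, 20, BC)]
estimate = estimate_range_count(buckets, 4, 15)  # half of (2..5) plus three full buckets
```

Only the partially covered buckets at the two ends of the range contribute error; fully covered buckets are counted exactly.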

Page 27:

Sliding Windows Technique

Background:
  Some applications rely on ALL historical data
  But for most applications, OLD data is considered less relevant and could skew results away from NEW trends or conditions
    new processes/procedures
    new hardware/sensors
    new fashion trends

Page 28:

Sliding Windows (cont.)

Common approaches to addressing old data:
  Aging Model
    Elements are associated with “weights” that decrease over time
    May use exponential decay formulas
  Sliding Windows Model
    Only the last “N” elements are considered
    Incorporate examples as they arrive
    A record arriving at time t “expires” at time t+N (N is the window length)
    Count only the “1’s” in bit-stream data
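To make the two models concrete, here is a minimal Python sketch of each for a bit stream. The class names, the decay factor 0.5, and the sample bits are assumptions for illustration. The window counter here is exact and stores all N bits, which is the memory cost that approximate synopses try to avoid.

```python
from collections import deque

class DecayedCount:
    """Aging model: every earlier element's weight shrinks by `decay` per step."""
    def __init__(self, decay):
        self.decay = decay  # hypothetical decay factor, 0 < decay < 1
        self.value = 0.0

    def add(self, bit):
        self.value = self.value * self.decay + bit

class WindowCount:
    """Sliding window model: exact count of 1's among the last n elements."""
    def __init__(self, n):
        self.window = deque(maxlen=n)  # stores all n bits
        self.ones = 0

    def add(self, bit):
        if len(self.window) == self.window.maxlen:
            self.ones -= self.window[0]  # the oldest bit is about to expire
        self.window.append(bit)
        self.ones += bit

# Hypothetical bit stream and window length N = 7.
bits = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1]
aged, windowed = DecayedCount(decay=0.5), WindowCount(n=7)
for b in bits:
    aged.add(b)
    windowed.add(b)
```

After the loop, `windowed.ones` is the number of 1's among the last seven bits, while `aged.value` still reflects the whole history with geometrically shrinking weights.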

Page 29:

Sliding Window (SW) Model

….1 0 1 0 0 0 1 0 1 1 1 1 1 1 0 0 0 1 0 1 0 0 1 1…

[Figure: the bit stream above on a time axis ("Time Increases"), with "Current Time" marked and a window of size N = 7]

Page 30:

Sliding Windows Plus Exponential Histograms

Sliding Windows Approach (pseudo-pseudo code)
  Consider only the last N elements.
  Define k = 1/ε, and round k/2 to the nearest integer.
  Timestamp each “1” that arrives in the stream and insert it as a new first bucket, shifting the existing buckets.
    The first bucket has size 1, since it holds only one “1”.
  If the number of buckets of the same size exceeds k/2 + 1, merge the two oldest buckets of that size, so that at least k/2 buckets of that size remain.
    Merging creates a new bucket whose size is the sum of the two.
  Eliminate the last (oldest) bucket when the timestamp of its most recent “1” falls outside the last N elements.
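The procedure above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the class name and the rounding of k and k/2 upward are choices made here, and the estimate (sum of all bucket sizes minus half of the oldest bucket) is taken from [DGIM02] rather than from the slide.

```python
import math

class ExponentialHistogram:
    """Approximate count of 1's among the last `window` bits of a stream."""

    def __init__(self, window, eps):
        self.window = window
        k = math.ceil(1 / eps)
        self.max_same = math.ceil(k / 2) + 1  # at most k/2 + 1 buckets per size
        self.buckets = []  # [timestamp of newest 1, size], newest bucket first
        self.time = 0

    def add(self, bit):
        self.time += 1
        # Drop the oldest bucket once its newest 1 has left the window.
        if self.buckets and self.buckets[-1][0] <= self.time - self.window:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, [self.time, 1])
        size = 1
        while True:
            same = [i for i, b in enumerate(self.buckets) if b[1] == size]
            if len(same) <= self.max_same:
                break
            older, oldest = same[-2], same[-1]
            # Merge the two oldest buckets of this size; keep the newer timestamp.
            self.buckets[older][1] = 2 * size
            del self.buckets[oldest]
            size *= 2  # the merge may overflow the next size, so cascade

    def estimate(self):
        if not self.buckets:
            return 0
        total = sum(size for _, size in self.buckets)
        # Only the oldest bucket can straddle the window edge; count half of it.
        return total - self.buckets[-1][1] / 2
```

Every bucket except the oldest lies entirely inside the window, so the absolute error of `estimate()` is at most half the size of the oldest bucket.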

Page 31:

Benefits of Sliding Windows

Incorporates new elements as they appear.

Easy to calculate statistics over data streams with respect to the last N elements based on the histogram.

Can estimate the number of 1’s within a factor of (1 + ε) using only Θ((1/ε) log² N) bits of memory.

Page 32:

Expansion of Sliding Windows

The original Sliding Window method could not fully handle two important statistics when buckets are “merged”: k-median and variance

A solution was devised by Babcock, Datar, Motwani and O’Callaghan [BDMO03]

Their work derived a methodology for variance, which was also applied to k-medians.

Page 33:

Variance and k-Medians

Variance: $V = \sum_{i=1}^{N} (x_i - \mu)^2$, where $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$

k-median clustering:
  Given: N points (x_1 … x_N) in a metric space
  Find k points C = {c_1, c_2, …, c_k} that minimize $\sum_{i=1}^{N} d(x_i, C)$ (the assignment distance)

Clustering will be covered in detail in a future presentation
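Both quantities are easy to compute on a small in-memory example, which helps fix the definitions before looking at the streaming versions. The sketch below assumes one-dimensional points with absolute difference as the metric; the function names and data are hypothetical. Note that the variance here is the sum of squared deviations, as defined above, not divided by N.

```python
def variance(xs):
    """Sum of squared deviations from the mean (not divided by N)."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs)

def kmedian_cost(points, centers, dist):
    """Assignment distance: each point pays the distance to its nearest center."""
    return sum(min(dist(x, c) for c in centers) for x in points)

# Hypothetical data: four points on a line, two candidate centers.
points = [1, 2, 10, 11]
cost = kmedian_cost(points, centers=[1.5, 10.5], dist=lambda a, b: abs(a - b))
v = variance([1, 2, 3, 4])
```

The function only evaluates the cost of a given set of centers; finding the k centers that minimize it is the actual clustering problem.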

Page 34:

Notation

V_i = variance of the i-th bucket
n_i = number of elements in the i-th bucket
μ_i = mean of the i-th bucket

[Figure: the current window, of size N, divided into buckets B_1, B_2, …, B_{m-1}, B_m]

Page 35:

Variance – composition

B_{i,j} = concatenation of buckets i and j

$$n_{i,j} = n_i + n_j$$

$$\mu_{i,j} = \frac{n_i \mu_i + n_j \mu_j}{n_i + n_j}$$

$$V_{i,j} = V_i + V_j + \frac{n_i n_j}{n_i + n_j} (\mu_i - \mu_j)^2$$
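The composition rule for two buckets (combined count n_i + n_j, count-weighted mean, and V_i + V_j plus a correction term n_i n_j / (n_i + n_j) times the squared difference of the means) can be checked numerically. In this sketch a bucket is summarized as a tuple (n, μ, V), with V the sum of squared deviations from the mean; the function names and data are assumptions for illustration.

```python
def summarize(xs):
    """Return (n, mean, V), where V is the sum of squared deviations from the mean."""
    n = len(xs)
    mu = sum(xs) / n
    return n, mu, sum((x - mu) ** 2 for x in xs)

def combine(a, b):
    """Summary of the concatenation of two buckets, from their summaries alone."""
    n_i, mu_i, v_i = a
    n_j, mu_j, v_j = b
    n = n_i + n_j
    mu = (n_i * mu_i + n_j * mu_j) / n
    v = v_i + v_j + (n_i * n_j / n) * (mu_i - mu_j) ** 2
    return n, mu, v

# Hypothetical buckets.
xs, ys = [1.0, 2.0, 3.0, 4.0], [10.0, 20.0]
merged = combine(summarize(xs), summarize(ys))
direct = summarize(xs + ys)
```

`merged` and `direct` agree up to floating-point rounding, which is why buckets can be merged without revisiting the underlying elements.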

Page 36:

Decision Making

The problem of handling time-changing data has also had a significant influence on decision algorithms.

Pedro Domingos, who had originally developed a successful decision tree algorithm (VFDT) [DH00], also recognized the need to work with recent data, resulting in a new algorithm known as CVFDT [HSD01].
  VFDT - Very Fast Decision Tree
  CVFDT - Concept-adapting Very Fast Decision Tree
    Implements a window approach

Page 37:

Decision Making

Both VFDT and CVFDT make use of a statistical result known as the Hoeffding* bound
  Used to estimate the minimum number of examples needed to make a decision for a node in a decision tree.
  This is the key concept for these algorithms to work.

* W. Hoeffding, “Probability Inequalities for Sums of Bounded Random Variables”, Journal of the American Statistical Association, 1963

Page 38:

Hoeffding Bound

Consider a random variable a whose range is R
  n independent observations of a
  Observed mean: ā
The Hoeffding bound states:
  With probability 1 - δ, the true mean of a is at least ā - ε, where

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$$
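A short numeric sketch of the bound, ε = sqrt(R² ln(1/δ) / (2n)), shows how quickly ε shrinks as n grows, and how many observations are needed for a target ε. The function names and the values R = 1 and δ = 10⁻⁷ are assumptions chosen for illustration.

```python
import math

def hoeffding_epsilon(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))

def examples_needed(value_range, delta, epsilon):
    """Smallest n for which the bound is at most the target epsilon."""
    return math.ceil(value_range ** 2 * math.log(1 / delta) / (2 * epsilon ** 2))

# Hypothetical settings: R = 1, delta = 1e-7.
eps_100 = hoeffding_epsilon(1.0, 1e-7, n=100)
eps_10000 = hoeffding_epsilon(1.0, 1e-7, n=10_000)
n_for_005 = examples_needed(1.0, 1e-7, epsilon=0.05)
```

Since ε falls with the square root of n, a hundred times more observations tighten the bound by a factor of ten.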

Page 39:

Hoeffding Bound

Significance…
  This estimate/bound is incorporated into an ID3-type decision tree, hence VFDT/CVFDT
  The information gain is evaluated against

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$$
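How the gain is "evaluated against" the bound can be sketched as follows: in VFDT [DH00], a leaf is split once the observed gain of the best attribute exceeds that of the second-best by more than ε. The function name, attribute names, gain values, and the settings R = 1 and δ = 10⁻⁶ are hypothetical, and the sketch needs at least two candidate attributes.

```python
import math

def choose_split(gains, value_range, delta, n):
    """Return (attribute to split on or None, epsilon) after n examples at a leaf.

    gains: dict mapping attribute name -> observed information gain.
    """
    eps = math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))
    ranked = sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
    (best, g1), (_, g2) = ranked[0], ranked[1]
    # Split only when the best attribute leads by more than the Hoeffding bound.
    return (best if g1 - g2 > eps else None), eps

# Hypothetical observed gains at one leaf.
gains = {"outlook": 0.30, "humidity": 0.10, "wind": 0.05}
early, eps_early = choose_split(gains, value_range=1.0, delta=1e-6, n=100)
later, eps_later = choose_split(gains, value_range=1.0, delta=1e-6, n=1000)
```

With 100 examples the lead of 0.20 is still inside the bound, so the leaf waits; with 1000 examples the bound has shrunk below the lead and the split goes ahead.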

Page 40:

VFDT Algorithm

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$$

Page 41:

VFDT Algorithm Results

Page 42:

CVFDT vs. VFDT

CVFDT is an extension of VFDT that incorporates “windowing”

CVFDT concept:
  Generate the tree as usual, but using a window of “w” elements.
  Monitor changes in gain for attributes.
  If the gains change, generate an alternate subtree with the new “best” attribute, but keep it in the background.
  Replace the old subtree if the new subtree becomes more accurate.

Page 43:

References

[BDMO03] B. Babcock, M. Datar, R. Motwani, and J. L. O’Callaghan. “Maintaining Variance and k-Medians over Data Stream Windows”. ACM PODS, 2003.
  http://citeseer.nj.nec.com/591910.html
  http://www.stanford.edu/~babcock/papers/pods03.ppt

[DH00] P. Domingos and G. Hulten. “Mining High-Speed Data Streams”. ACM KDD, 2000.
  http://citeseer.nj.nec.com/domingos00mining.html

[HSD01] G. Hulten, L. Spencer, and P. Domingos. “Mining Time-Changing Data Streams”. ACM KDD, 2001.
  http://citeseer.nj.nec.com/hulten01mining.html

[DGIM02] M. Datar, A. Gionis, P. Indyk, and R. Motwani. “Maintaining Stream Statistics over Sliding Windows”. ACM-SIAM SODA, 2002.
  http://www.stanford.edu/~babcock/papers/pods03.ppt

[GGR02] M. Garofalakis, J. Gehrke, and R. Rastogi. “Querying and Mining Data Streams: You Only Get One Look”. SIGMOD 2002 (tutorial).
  http://www.bell-labs.com/user/minos/Talks/streams-tutorial02.ppt