Adaptive Load Shedding for Mining Frequent Patterns from Data Streams. Xuan Hong Dang, Wee-Keong Ng, and Kok-Leong Ong (DaWaK 2006). 2008/3/19, Yi-Chun Chen.

Adaptive Load Shedding for Mining Frequent Patterns from Data Streams

Xuan Hong Dang, Wee-Keong Ng, and Kok-Leong Ong

(DaWaK 2006)

2008/3/19 Yi-Chun Chen

Outline

• Motivation

• Objective

• Definition

• Adaptive Load Shedding in Data Stream

• Performance Results

• Conclusion


Motivation

• Finding frequent itemsets plays an important role in analyzing data streams

• Most existing approaches simply assume that the machine is fast enough to handle all incoming transactions without incurring any unwanted latency

(Cont.)

• The arrival rate of data streams usually exceeds the system capacity

• Algorithms mining from data streams must cope with system overload situations


Objective

• Given a processing capacity C of a mining system and a data stream DS with high arrival rates

• Load(DS) : the workload of the system

• If $Load(DS) > C$, load shedding is invoked

• Guarantee $Load(DS) \le C$

• Discover a set of patterns that closely approximates the set of actual frequent itemsets

(Cont.)

• How to determine overload situations?

• How much load to shed?

• How to approximate frequent patterns once load shedding is applied?


Definition

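• $I = \{a_1, a_2, \ldots, a_m\}$ : the set of items

• $DS = \{t_1, t_2, \ldots, t_N, \ldots\}$ : the data stream of transactions

• $freq(X)$ : the occurrence count of X in DS up to the $N$-th transaction

• $sup(X) = \frac{freq(X)}{N}$ : the support of X

• MFIs: maximal frequent itemsets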
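
As a small illustration of the two definitions above, the following Python sketch counts freq(X) and sup(X) over the first N transactions of a stream. The function names and the toy transactions are hypothetical and chosen only for this example; they do not come from the paper.

```python
from typing import Iterable, List, Set


def freq(itemset: Iterable[str], transactions: List[Set[str]]) -> int:
    """Occurrence count of `itemset` in the transactions seen so far."""
    target = frozenset(itemset)
    # A transaction supports X when X is a subset of it.
    return sum(1 for t in transactions if target <= frozenset(t))


def sup(itemset: Iterable[str], transactions: List[Set[str]]) -> float:
    """Support of `itemset`: freq(X) / N, with N transactions seen so far."""
    n = len(transactions)
    # Assumption for this sketch: support is 0 when nothing has arrived yet.
    return freq(itemset, transactions) / n if n else 0.0


# Toy prefix of a stream with N = 4 transactions (made-up data).
stream_prefix = [
    {"a", "b", "c"},
    {"a", "c"},
    {"b", "d"},
    {"a", "b", "c", "d"},
]
count_ac = freq({"a", "c"}, stream_prefix)
support_ac = sup({"a", "c"}, stream_prefix)
```

Here {a, c} occurs in three of the four transactions, so its support is 3/4.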

Adaptive Load Shedding in Data Streams

• Overload Detection

• Load Shedding by Sampling Transactions


Overload Detection

• To quickly estimate the system workload, we propose an approximate method based on MFIs

– The MFIs also cover all frequent itemsets (every frequent itemset is a subset of some MFI)

– The # of MFIs is smaller than the # of frequent itemsets

– The support of an MFI is always closest to the minimum support threshold


(Cont.)

• Load coefficient: $L = \sum_{i=1}^{k} 2^{|X_i|} - \sum_{i,j=1}^{k} 2^{|X_i \cap X_j|}$

– k be the # of MFIs in a transaction

– $X_i$ be a MFI, where $1 \le i \le k$

• Suppose we measure the above statistics for n transactions over one time unit

– r be the current rate of the data stream

– The system is overloaded when $r \times \frac{\sum_{i=1}^{n} L_i}{n} > C$
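
The load coefficient and the overload test can be sketched in Python as follows. This is only an illustrative reading of the slide: it assumes the second sum runs over distinct pairs i < j (so that subsets shared by two MFIs are subtracted once), and it assumes the MFIs contained in each transaction are already known. The function names are hypothetical.

```python
from itertools import combinations
from typing import Iterable, List, Set


def load_coefficient(mfis_in_transaction: Iterable[Set[str]]) -> int:
    """Estimate the number of itemsets a transaction makes the miner touch.

    Each MFI X_i contributes 2^|X_i| subsets; subsets shared by a pair of
    MFIs (2^|X_i ∩ X_j|) are subtracted once per pair i < j (assumption).
    """
    sets = [frozenset(x) for x in mfis_in_transaction]
    total = sum(2 ** len(x) for x in sets)
    overlap = sum(2 ** len(a & b) for a, b in combinations(sets, 2))
    return total - overlap


def is_overloaded(load_coeffs: List[int], rate: float, capacity: float) -> bool:
    """Overload test: r * (sum of L_i) / n > C over the measured n transactions."""
    if not load_coeffs:
        return False  # nothing measured yet, so no evidence of overload
    average_load = sum(load_coeffs) / len(load_coeffs)
    return rate * average_load > capacity


# Two MFIs sharing {b, c}: 2^3 + 2^3 - 2^2 = 12.
example_l = load_coefficient([{"a", "b", "c"}, {"b", "c", "d"}])
```

With measured coefficients [12, 8] and a rate of 100 transactions per time unit, the estimated load is 1000, which exceeds a capacity of 900 but not a capacity of 1000.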

Load Shedding by Sampling Transactions

• In order to estimate how much load to shed

– P be a parameter expressing the fraction of transactions that should be kept, chosen so that $P \times r \times \frac{\sum_{i=1}^{n} L_i}{n} \le C$

– If P < 1, we use the Hoeffding bound to decide how many transactions to sample (the rest are discarded) and to approximate frequent patterns
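
One way to pick P from this constraint is to take the largest value that still satisfies it, capped at 1. The sketch below does exactly that; choosing the largest feasible P is an assumption made here for illustration, and the function name is hypothetical.

```python
from typing import List


def sampling_fraction(load_coeffs: List[int], rate: float, capacity: float) -> float:
    """Largest P in (0, 1] with P * r * (sum of L_i) / n <= C."""
    if not load_coeffs or rate <= 0:
        return 1.0  # no measurable load: keep every transaction
    average_load = sum(load_coeffs) / len(load_coeffs)
    if average_load <= 0:
        return 1.0
    # P = C / (r * average load), never more than 1 (no shedding needed).
    return min(1.0, capacity / (rate * average_load))


p_shed = sampling_fraction([12, 8], rate=100, capacity=500)
p_full = sampling_fraction([12, 8], rate=100, capacity=2000)
```

In the first call the estimated load is 1000 against a capacity of 500, so half of the transactions are kept; in the second call the capacity is sufficient and P stays at 1.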

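(Cont.)

• Hoeffding bound: $\Pr\{|r - n_0 p| \ge n_0 \epsilon\} \le 2e^{-2 n_0 \epsilon^2}$

– r be the number of times that X occurs in these $n_0$ transactions

– sup(X) = p : the true support of X

– $sup_E(X) = r / n_0$ : the estimated support of X

– We want $|sup_E(X) - p| \le \epsilon$ to hold with probability at least $1 - \delta$ (i.e. $2e^{-2 n_0 \epsilon^2} \le \delta$), so the required number of sampling transactions is at least $n_0 \ge \frac{1}{2\epsilon^2} \ln \frac{2}{\delta}$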
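
The sample-size bound can be evaluated directly. The following sketch rounds the bound up to the next integer; the function name is hypothetical, and epsilon and delta stand for the error bound and the failure probability in the inequality above.

```python
import math


def hoeffding_sample_size(epsilon: float, delta: float) -> int:
    """Smallest integer n0 with n0 >= ln(2 / delta) / (2 * epsilon^2)."""
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))


n0_coarse = hoeffding_sample_size(epsilon=0.01, delta=0.01)
n0_fine = hoeffding_sample_size(epsilon=0.001, delta=0.01)
```

Because the bound grows with 1/epsilon squared, tightening epsilon by a factor of 10 raises the required sample size by a factor of 100.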

(Cont.)

• Sample batch: each incoming transaction is chosen with probability P until we have sampled $n_0$ transactions

• Local patterns: the frequent itemsets found in one sample batch, which reflect only part of the stream

• Global patterns: the frequent itemsets of the entire stream
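
Building one sample batch can be sketched as a Bernoulli filter over the incoming transactions. This is a minimal illustration, not the authors' implementation; the function name and the optional random generator argument are assumptions made for the example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def sample_batch(stream: Iterable[T], p: float, n0: int,
                 rng: Optional[random.Random] = None) -> List[T]:
    """Keep each incoming transaction with probability p until n0 are kept."""
    rng = rng or random.Random()
    batch: List[T] = []
    if n0 <= 0:
        return batch
    for transaction in stream:
        # rng.random() is uniform on [0, 1), so p = 1 keeps everything.
        if rng.random() < p:
            batch.append(transaction)
            if len(batch) >= n0:
                break  # batch is full; mining of local patterns can start
    return batch


# Usage with a seeded generator so that a run can be repeated.
batch = sample_batch(range(1000), p=0.5, n0=50, rng=random.Random(7))
```

The transactions that are not chosen are simply dropped, which is where the load is shed; the frequent itemsets mined from the returned batch are the local patterns.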

(Cont.)

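• Due to the non-uniform distribution of the stream

– False global patterns

– Significant support $\epsilon_0$ ($< \sigma$, the minimum support threshold): the max. support error of each pattern

• $sup(X) \ge \sigma$ : frequent

• $\sigma > sup(X) \ge \epsilon_0$ : sub-frequent

• $sup(X) < \epsilon_0$ : infrequent

Significant patterns: the frequent and sub-frequent itemsets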
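
The three-way split can be written as a small classifier. The sketch assumes the thresholds are a minimum support sigma and a smaller significant support epsilon0, which is how the garbled slide is read here; both parameter names and the example values are hypothetical.

```python
def classify(support: float, sigma: float, epsilon0: float) -> str:
    """Label an itemset by its support relative to sigma and epsilon0."""
    if not 0 < epsilon0 < sigma:
        raise ValueError("expected 0 < epsilon0 < sigma")
    if support >= sigma:
        return "frequent"
    if support >= epsilon0:
        return "sub-frequent"  # still significant, kept as a candidate
    return "infrequent"


def is_significant(support: float, sigma: float, epsilon0: float) -> bool:
    """Significant patterns are the frequent and sub-frequent ones."""
    return classify(support, sigma, epsilon0) != "infrequent"


labels = [classify(s, sigma=0.01, epsilon0=0.001) for s in (0.05, 0.004, 0.0002)]
```

Keeping sub-frequent itemsets around is what lets a pattern that is only locally weak still be recognised later as globally frequent.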

(Cont.)

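• The required number of sampling transactions is at least $n_0 \ge \frac{1}{2\epsilon^2} \ln \frac{2}{\delta}$

• If $\epsilon = 0.001$ and $\delta = 0.01$, then $n_0 \approx 2600000$ is too huge

• We assume that each itemset of interest appears in more than 0.01% of the transactions; then if $n_0 \ge 10000$, every such itemset is expected to be chosen at least once

• $n_0 = Max\left\{\frac{1}{\epsilon_0} ; \frac{1}{2\epsilon^2} \ln \frac{2}{\delta}\right\}$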
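
The arithmetic on this slide can be checked with a few lines of Python. The sketch takes the batch size as the larger of 1/epsilon0 and the Hoeffding bound, which is how the Max expression is read here; the function name is hypothetical.

```python
import math


def batch_size(epsilon: float, delta: float, epsilon0: float) -> int:
    """n0 = max(1 / epsilon0, ln(2 / delta) / (2 * epsilon^2)), rounded up."""
    hoeffding = math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))
    # 1 / epsilon0 transactions are needed before an itemset with support
    # epsilon0 is expected to show up once.
    coverage = math.ceil(1 / epsilon0)
    return max(coverage, hoeffding)


# epsilon = 0.001 and delta = 0.01 give roughly 2.6 million transactions.
n0_strict = batch_size(epsilon=0.001, delta=0.01, epsilon0=0.0001)
# Relaxing epsilon to 0.01 brings the batch down to tens of thousands.
n0_relaxed = batch_size(epsilon=0.01, delta=0.01, epsilon0=0.0001)
# With a very loose epsilon the 1 / epsilon0 term (about 10000) dominates.
n0_floor = batch_size(epsilon=0.1, delta=0.01, epsilon0=0.0001)
```

The comparison shows why a very small epsilon is impractical for a stream: the batch would have to hold millions of transactions before any pattern could be reported.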

Performance Results

• Accuracy Measurements

• Adaptability

• Recall: # of true freq. patterns found / # of actual true freq. patterns

• Precision: # of true freq. patterns found / total # of freq. patterns found

• Synthetic: T5I3D1000K, T8I4D1000K with 10000 unique items

• Real-life: “BMS-POS” T6.5 D515597 with 1657 distinct items

• Fix $\epsilon = 0.01$, $\delta = 0.01$; select $n_0 = 25K$

[The remaining parameter settings on this slide are not recoverable from the source.]
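
The two accuracy measures can be computed from the set of patterns an algorithm reports and the set of patterns that are truly frequent. The sketch below is a generic illustration with made-up itemsets; it is not tied to the datasets listed above, and the function name is hypothetical.

```python
from typing import Iterable, Set, Tuple


def recall_precision(found: Iterable[Set[str]],
                     actual: Iterable[Set[str]]) -> Tuple[float, float]:
    """Return (recall, precision) of reported patterns against true ones."""
    found_set = {frozenset(x) for x in found}
    actual_set = {frozenset(x) for x in actual}
    true_found = len(found_set & actual_set)
    # Assumption for this sketch: an empty denominator counts as a perfect score.
    recall = true_found / len(actual_set) if actual_set else 1.0
    precision = true_found / len(found_set) if found_set else 1.0
    return recall, precision


reported = [{"a"}, {"b"}, {"a", "b"}, {"c"}]
truth = [{"a"}, {"b"}, {"a", "b"}, {"d"}, {"e"}]
recall, precision = recall_precision(reported, truth)
```

Three of the five true patterns are reported (recall 3/5), and three of the four reported patterns are true (precision 3/4).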



Conclusion

• The paper addresses the problem of finding frequent patterns from data streams where the mining system may not keep up with the arrival rate of the stream

