Adaptive Load Shedding for Mining Frequent Patterns from Data Streams
Xuan Hong Dang, Wee-Keong Ng, and Kok-Leong Ong (DaWaK 2006)
2008/3/19, Yi-Chun Chen
Dec 18, 2015
Outline
• Motivation
• Objective
• Definition
• Adaptive Load Shedding in Data Streams
• Performance Results
• Conclusion
Motivation
• Finding frequent itemsets plays an important role in analyzing data streams
• Existing algorithms simply assume that the machinery itself is fast enough to handle all incoming transactions without incurring any unwanted latency
(Cont.)
• The arrival rate of data streams usually exceeds the system capacity
• Algorithms mining from data streams must cope with system overload situations
Objective
• Given a mining system with processing capacity $C$ and a data stream $DS$ with a high arrival rate
• $Load(DS)$: the workload of the system
• If $Load(DS) > C$, load shedding is invoked
• Guarantee $Load(DS) \le C$
• Discover a set of patterns that closely approximates the set of actual frequent itemsets
(Cont.)
• How to determine overload situations?
• How much load to shed?
• How to approximate frequent patterns under the introduction of load shedding?
Definition
• $I = \{a_1, a_2, \ldots, a_m\}$: the set of items
• $DS = \{t_1, t_2, \ldots, t_N, \ldots\}$: the data stream of transactions
• $freq(X)$: the occurrence count of $X$ in $DS$ up to the $N$-th transaction
• $sup(X) = \frac{freq(X)}{N}$
• MFIs: maximal frequent itemsets
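The definitions above can be illustrated on a toy stream. The following is a minimal sketch, not the mining algorithm from the paper: it enumerates itemsets by brute force, and the helper names (`support`, `frequent_itemsets`, `maximal`) and the example transactions are assumptions made for illustration only.

```python
from itertools import combinations

def support(itemset, transactions):
    """sup(X) = freq(X) / N over the N transactions seen so far."""
    freq = sum(1 for t in transactions if itemset <= t)
    return freq / len(transactions)

def frequent_itemsets(transactions, min_sup):
    """Brute-force enumeration; only sensible for tiny examples."""
    items = sorted(set().union(*transactions))
    found = []
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            x = frozenset(combo)
            if support(x, transactions) >= min_sup:
                found.append(x)
    return found

def maximal(itemsets):
    """Keep only itemsets that have no proper superset in the collection."""
    return [x for x in itemsets if not any(x < y for y in itemsets)]

# Hypothetical four-transaction stream over the items a, b, c, d.
stream = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("bd")]
freq = frequent_itemsets(stream, min_sup=0.5)
mfis = maximal(freq)
```

With a minimum support of 0.5, the frequent itemsets are {a}, {b}, {c}, {a,b} and {a,c}, and only {a,b} and {a,c} are maximal.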
Adaptive Load Shedding in Data Streams
• Overload Detection
• Load Shedding by Sampling Transactions
Overload Detection
• To estimate the system workload quickly, we propose an approximate method based on MFIs
– MFIs also contain all frequent itemsets (every frequent itemset is a subset of some MFI)
– The # of MFIs is smaller than the # of frequent itemsets
– The support of MFIs is always closest to the minimum support threshold $\sigma$
(Cont.)
• Load coefficient of a transaction: $L = \sum_{i=1}^{k} 2^{|X_i|} - \sum_{i,j=1}^{k} 2^{|X_i \cap X_j|}$
– $k$ is the # of MFIs in the transaction
– $X_i$ is an MFI, where $1 \le i \le k$
• Suppose we measure the above statistics for $n$ transactions over one time unit
– $r$ is the current rate of the data stream
– The system is overloaded when $r \times \frac{\sum_{i=1}^{n} L_i}{n} > C$
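The load coefficient and the overload test can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the second sum is taken over unordered pairs $i < j$ (the slide does not spell out the index range), the capacity is assumed to be expressed in the same units as rate times load coefficient, and the function names are hypothetical.

```python
from itertools import combinations

def load_coefficient(transaction, mfis):
    """Approximate per-transaction load from the MFIs it contains.

    Assumption: the overlap sum runs over unordered pairs i < j.
    """
    contained = [m for m in mfis if m <= transaction]
    singles = sum(2 ** len(x) for x in contained)
    overlaps = sum(2 ** len(x & y) for x, y in combinations(contained, 2))
    return singles - overlaps

def is_overloaded(load_coefficients, rate, capacity):
    """Overload test: rate * (average load coefficient) > capacity."""
    mean_load = sum(load_coefficients) / len(load_coefficients)
    return rate * mean_load > capacity

# Hypothetical MFIs and a batch of n = 3 transactions.
mfis = [frozenset("ab"), frozenset("ac"), frozenset("de")]
batch = [frozenset("abc"), frozenset("abde"), frozenset("c")]
loads = [load_coefficient(t, mfis) for t in batch]
```

For the first transaction the two contained MFIs {a,b} and {a,c} share one item, so its coefficient is 4 + 4 - 2 = 6, which equals the number of distinct subsets of those two MFIs.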
Load Shedding by Sampling Transactions
• In order to estimate how much load to shed
– $P$ is a parameter expressing the fraction of transactions that should be kept, chosen so that $P \times r \times \frac{\sum_{i=1}^{n} L_i}{n} \le C$
– If $P < 1$, transactions are discarded by sampling, and the Hoeffding bound is used to approximate the frequent patterns
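A minimal sketch of how the sampling fraction could be derived from the measured statistics, assuming $P$ is the largest fraction of transactions that keeps rate times average load coefficient within the capacity. The function name and the choice to cap $P$ at 1 are assumptions for illustration.

```python
def sampling_fraction(load_coefficients, rate, capacity):
    """Largest P in (0, 1] with P * rate * mean_load <= capacity."""
    mean_load = sum(load_coefficients) / len(load_coefficients)
    demand = rate * mean_load
    if demand <= capacity:
        return 1.0  # no overload: keep every transaction
    return capacity / demand

# Hypothetical measurements: three load coefficients in one time unit.
p_overloaded = sampling_fraction([6, 7, 0], rate=100, capacity=400)
p_normal = sampling_fraction([6, 7, 0], rate=50, capacity=400)
```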
(Cont.)
• Hoeffding bound: $\Pr(|r - n_0 p| \ge n_0 \epsilon) \le 2e^{-2 n_0 \epsilon^2}$
– $r$ is the number of times that $X$ occurs in these $n_0$ transactions
– $sup(X) = p$: the true support of $X$
– $sup_E(X) = r / n_0$: the estimated support of $X$
– We want this probability to be at most $\delta$, so the number of sampled transactions must satisfy $n_0 \ge \frac{1}{2\epsilon^2} \ln \frac{2}{\delta}$
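The sample size follows from requiring the right-hand side of the Hoeffding bound to be at most $\delta$, where $\epsilon$ is the tolerated support error and $\delta$ the allowed failure probability. Written out step by step:

```latex
\begin{align*}
2e^{-2 n_0 \epsilon^2} &\le \delta \\
-2 n_0 \epsilon^2 &\le \ln\frac{\delta}{2} \\
n_0 &\ge \frac{1}{2\epsilon^2} \ln\frac{2}{\delta}
\end{align*}
```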
(Cont.)
• Sample batch: each incoming transaction is chosen with probability $P$ until we have sampled enough ($n_0$) transactions
• Local patterns: the frequent itemsets of a sample batch, which are found only within part of the stream
• Global patterns: the frequent itemsets in the entire stream
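Building one sample batch can be sketched as Bernoulli sampling over the incoming transactions. This is an illustrative sketch only; `sample_batch` and its parameters are hypothetical names, and the stream is modeled as any Python iterable.

```python
import random

def sample_batch(stream, p, n0, rng=None):
    """Keep each transaction with probability p until n0 are collected."""
    if rng is None:
        rng = random.Random()
    batch = []
    for transaction in stream:
        if rng.random() < p:  # random() is in [0, 1), so p = 1.0 keeps all
            batch.append(transaction)
            if len(batch) == n0:
                break
    return batch

# With p = 1.0 nothing is shed, so the batch is the first n0 transactions.
first_five = sample_batch(range(100), p=1.0, n0=5)
```

Passing a seeded `random.Random` instance makes a run reproducible, which helps when comparing local patterns across experiments.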
(Cont.)
• Due to the non-uniform distribution of the stream
– False global patterns
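– Significant support $\epsilon_0$ ($\epsilon_0 < \sigma$, the minimum support threshold): the max. support error of each pattern
• $sup(X) \ge \sigma$: frequent
• $\epsilon_0 \le sup(X) < \sigma$: sub-frequent
• $sup(X) < \epsilon_0$: infrequent
– Significant patterns: the frequent and sub-frequent patterns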
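The three classes can be expressed as a small function. This sketch assumes the classes are separated by the minimum support threshold and the smaller significant support threshold; the function name, parameter names, and example threshold values are hypothetical.

```python
def classify(sup, min_sup, sig_sup):
    """Label a pattern by its support relative to the two thresholds."""
    if sup >= min_sup:
        return "frequent"
    if sup >= sig_sup:
        return "sub-frequent"  # kept because it may become frequent globally
    return "infrequent"

# Hypothetical thresholds: minimum support 1%, significant support 0.1%.
labels = [classify(s, min_sup=0.01, sig_sup=0.001) for s in (0.02, 0.005, 0.0005)]
```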
(Cont.)
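• The required number of sampled transactions is at least $n_0 \ge \frac{1}{2\epsilon^2} \ln \frac{2}{\delta}$
• If $\epsilon = 0.001$ and $\delta = 0.01$, then $n_0 \approx 2600000$, which is too large
• We assume that each itemset appears in more than 0.01% of transactions; then if $n_0 \ge 10000$ (since $1 / 0.0001 = 10000$), every itemset will be chosen
• $n_0 = Max\left\{ \frac{1}{\epsilon_0};\ \frac{1}{2\epsilon^2} \ln \frac{2}{\delta} \right\}$, where $\epsilon_0$ is the significant support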
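The two lower bounds on the batch size can be checked numerically. In this sketch `sig_sup` stands for the minimum fraction of transactions in which an itemset is assumed to appear (0.01% on the slide); the function names are hypothetical, and rounding up to an integer is a choice made here.

```python
import math

def hoeffding_sample_size(eps, delta):
    """Smallest integer n0 with n0 >= ln(2 / delta) / (2 * eps^2)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

def batch_size(sig_sup, eps, delta):
    """Larger of 1 / sig_sup and the Hoeffding sample size."""
    return max(math.ceil(1.0 / sig_sup), hoeffding_sample_size(eps, delta))

strict = hoeffding_sample_size(eps=0.001, delta=0.01)        # about 2.6 million
relaxed = batch_size(sig_sup=0.0001, eps=0.05, delta=0.01)   # floor of 10000 applies
```

With $\epsilon = 0.001$ and $\delta = 0.01$ the Hoeffding term alone gives about 2.65 million transactions, which matches the figure quoted as too large.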
Performance Results
• Accuracy Measurements
• Adaptability
• Recall: # of true freq. patterns found / # of actual true freq. patterns
• Precision: # of true freq. patterns found / total # of freq. patterns found
• Synthetic: T5I3D1000K, T8I4D1000K with 10000 unique items
• Real-life: “BMS-POS” T6.5 D515597 with 1657 distinct items
• Fix two parameters at 0.01 and 0.01, select $n_0 = 25K$
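The two accuracy measures can be computed from the set of patterns the miner reports and the set of truly frequent patterns. A minimal sketch; the function name and the convention of returning 1.0 for an empty denominator are assumptions.

```python
def recall_precision(found, actual):
    """Recall = hits / |actual|, precision = hits / |found|."""
    found, actual = set(found), set(actual)
    hits = len(found & actual)
    recall = hits / len(actual) if actual else 1.0
    precision = hits / len(found) if found else 1.0
    return recall, precision

# Hypothetical result: three reported patterns, two of which are truly frequent.
found = {frozenset("ab"), frozenset("ac"), frozenset("bc")}
actual = {frozenset("ab"), frozenset("ac"), frozenset("a"), frozenset("b")}
recall, precision = recall_precision(found, actual)
```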
Conclusion
• The work addresses the problem of finding frequent patterns from data streams when the mining system may not keep up with the arrival rate of the stream