Mining Frequent Patterns from Data Streams
Acknowledgement: Slides modified by Dr. Lei Chen based on the slides provided by Charu Aggarwal and Jiawei Han, Jure Leskovec
OUTLINE
l Data Streams
l Characteristics of Data Streams
l Key Challenges in Stream Data
l Frequent Pattern Mining over Data Streams
l Counting Items
l Lossy Counting
l Extensions
Data Stream
l What is a data stream?
l A data stream is an ordered sequence of instances that, in many applications of stream mining, can be read only once or a small number of times, using limited computing and storage capabilities.
Data Streams
l Traditional DBMS
l Data stored in finite, persistent data sets
l Data Streams
l Continuous, ordered, changing, fast, huge amount
l Managed by a Data Stream Management System (DSMS)
DBMS versus DSMS
l DBMS
l Persistent relations
l One-time queries
l Random access
l "Unbounded" disk store
l Only current state matters
l No real-time services
l Relatively low update rate
l Data at any granularity
l Assume precise data
l Access plan determined by query processor and physical DB design
l DSMS
l Transient streams
l Continuous queries
l Sequential access
l Bounded main memory
l Historical data is important
l Real-time requirements
l Possibly multi-GB arrival rate
l Data at fine granularity
l Data stale/imprecise
l Unpredictable/variable data arrival and characteristics
Characteristics of Data Streams
l Data streams: continuous, ordered, changing, fast, huge amount
l Traditional DBMS: data stored in finite, persistent data sets
l Characteristics
l Fast changing and requires fast, real-time response
l Huge volumes of continuous data, possibly infinite
l Data streams capture nicely our data processing needs of today
APPLICATIONS
Example: Freeboard.io - Dashboards For the Internet Of Things - https://freeboard.io/
Sensors, weather, stock exchange, self-driving cars, trends, tweets, logs, articles/news, Wikipedia edits, ATM transactions, chats, television, earthquakes, music similarities, CO2 levels, car tracking, ...
Stream Data Applications
l Telecommunication calling records
l Business: credit card transaction flows
l Network monitoring and traffic engineering
l Financial market: stock exchange
l Engineering & industrial processes: power supply & manufacturing
l Sensor, monitoring & surveillance: video streams, RFIDs
l Security monitoring
l Web logs and Web page click streams
l Massive data sets (even if saved, random access is too expensive)
Characteristics of Data Streams
l Data streams: continuous, ordered, changing, fast, huge amount
l Traditional DBMS: data stored in finite, persistent data sets
l Characteristics
l Fast changing and requires fast, real-time response
l Huge volumes of continuous data, possibly infinite
l Data streams capture nicely our data processing needs of today
l Random access is expensive: single-scan algorithms (can only have one look)
l Store only a summary of the data seen thus far
l Most stream data are at a pretty low level or multi-dimensional in nature, and need multi-level and multi-dimensional processing
Key Challenges in Stream Data
l Mining precise freq. patterns in stream data: unrealistic
l Infinite length
l Concept-drift
l Concept-evolution
l Feature evolution
Key Challenges: Infinite Length
l Infinite length
l In many data mining situations, we do not know the entire data set in advance. Stream management is important when the input rate is controlled externally
l Examples: Google queries, Twitter or Facebook status updates
l Infinite length: Impractical to store and use all historical data
l Requires infinite storage
l And running time
Key Challenges: Infinite Length
(Figure: an unbounded stream of 0s and 1s arriving over time.)
Key Challenges: Concept-Drift
(Figure: a data chunk of positive and negative instances; the separating hyperplane shifts from its previous to its current position, and instances caught between the two become victims of concept-drift.)
Key Challenges: Concept-Evolution
l Concept-evolution occurs when a new class arrives in the stream.
l In this example, we again see a data chunk of two-dimensional data points.
l There are two classes here, + and -. Suppose we train a rule-based classifier using this chunk.
l Suppose a new class x arrives in the stream in the next chunk.
l If we use the same classification rules, all novel-class instances will be misclassified as either + or -.
(Figure: a two-dimensional data chunk is partitioned by x = x1, y = y1, and y = y2 into regions A, B, C, and D containing + and - instances; in the next chunk, a cluster of novel-class instances x appears.)
Classification rules:
R1. if (x > x1 and y < y2) or (x < x1 and y < y1) then class = +
R2. if (x > x1 and y > y2) or (x < x1 and y > y1) then class = -
Key Challenges: Dynamic Features
l Why do new features evolve?
l Infinite data stream
l Normally, the global feature set is unknown
l New features may appear
l Concept drift
l As concepts drift, new features may appear
l Concept evolution
l A new type of class normally has its own new set of features
Feature Extraction & Selection
(Figure: the ith chunk is used to train a new model, while the current model, after feature space conversion, performs classification and novel class detection on the (i+1)st chunk; example feature sets include {runway, climb}, {runway, clear, ramp}, {runway, ground, ramp}.)
Existing classification models need a complete, fixed feature set that applies to all chunks. Global features are difficult to predict. One solution is to use all English words and generate a vector, but the dimension of such a vector would be too high.
The ith chunk, the (i+1)st chunk, and the models have different feature sets.
Frequent Pattern Mining
over Data Stream
l Items Counting
l Lossy Counting
l Extensions
Items Counting
Counting Bits – (1)
l Problem: given a stream of 0s and 1s, be prepared to answer queries of the form "how many 1s are in the last k bits?" where k ≤ N.
l Obvious solution: store the most recent N bits.
l When a new bit comes in, discard the (N+1)st (oldest) bit.
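The obvious solution above can be sketched in a few lines (a minimal illustration; the class name is ours, not part of the slides):

```python
from collections import deque

class ExactBitCounter:
    """Obvious exact solution: store the most recent N bits."""

    def __init__(self, n):
        self.bits = deque(maxlen=n)   # appending the (N+1)st bit drops the oldest

    def add(self, bit):
        self.bits.append(bit)

    def count(self, k):
        """How many 1s are in the last k bits (k <= N)?"""
        return sum(list(self.bits)[-k:])
```

This uses O(N) memory per stream, which is exactly what the rest of the section tries to avoid.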
Counting Bits – (2)
l You can't get an exact answer without storing the entire window.
l Real problem: what if we cannot afford to store N bits?
l E.g., we're processing 1 billion streams and N = 1 billion
l But we're happy with an approximate answer.
DGIM* Method
l Store O(log² N) bits per stream.
l Gives an approximate answer, never off by more than 50%.
l The error factor can be reduced to any fraction > 0, with a more complicated algorithm and proportionally more stored bits.
*Datar, Gionis, Indyk, and Motwani
Something That Doesn't (Quite) Work
l Summarize exponentially increasing regions of the stream, looking backward.
l Drop small regions if they begin at the same point as a larger region.
Key Idea
l Summarize blocks of the stream with specific numbers of 1s.
l Block sizes (number of 1s) increase exponentially as we go back in time.
Example: Bucketized Stream
1001010110001011010101010101011010101010101110101010111010100010110010
Window of length N. Reading from the right (most recent): 2 buckets of size 1, 1 of size 2, 2 of size 4, 2 of size 8, and at least 1 of size 16, partially beyond the window.
Timestamps
l Each bit in the stream has a timestamp, starting 1, 2, ...
l Record timestamps modulo N (the window size), so we can represent any relevant timestamp in O(log₂ N) bits.
Buckets
l A bucket in the DGIM method is a record consisting of:
1. The timestamp of its end [O(log N) bits].
2. The number of 1s between its beginning and end [O(log log N) bits].
l Constraint on buckets: the number of 1s must be a power of 2.
l That explains the O(log log N) in (2).
Representing a Stream by Buckets
l Either one or two buckets with the same power-of-2 number of 1s.
l Buckets do not overlap in timestamps.
l Buckets are sorted by size.
l Earlier buckets are not smaller than later buckets.
l Buckets disappear when their end-time is > N time units in the past.
Updating Buckets – (1)
l When a new bit comes in, drop the last (oldest) bucket if its end-time is prior to N time units before the current time.
l If the current bit is 0, no other changes are needed.
Updating Buckets – (2)
l If the current bit is 1:
1. Create a new bucket of size 1, for just this bit.
◆ End timestamp = current time.
2. If there are now three buckets of size 1, combine the oldest two into a bucket of size 2.
3. If there are now three buckets of size 2, combine the oldest two into a bucket of size 4.
4. And so on ...
Example
1001010110001011010101010101011010101010101110101010111010100010110010
(Animation: as each new bit arrives, the window slides one position and buckets are created and merged by the update rules.)
Querying
l To estimate the number of 1s in the most recent N bits:
1. Sum the sizes of all buckets but the last.
2. Add half the size of the last bucket.
l Remember: we don't know how many 1s of the last bucket are still within the window.
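The update and query rules above can be sketched as follows (a simplified illustration: timestamps are stored in full rather than modulo N, and all names are ours):

```python
class DGIM:
    """Sketch of DGIM: buckets are (end_timestamp, size) pairs, newest first."""

    def __init__(self, window):
        self.window = window   # N, the window length
        self.time = 0          # current timestamp
        self.buckets = []      # sorted newest-first; sizes are powers of 2

    def add(self, bit):
        self.time += 1
        # Drop the oldest bucket once its end-time is > N units in the past.
        while self.buckets and self.buckets[-1][0] <= self.time - self.window:
            self.buckets.pop()
        if bit == 0:
            return             # a 0 causes no other changes
        self.buckets.insert(0, (self.time, 1))   # new size-1 bucket for this bit
        i = 0
        # Whenever three buckets share a size, merge the oldest two of them.
        while i + 2 < len(self.buckets) and self.buckets[i][1] == self.buckets[i + 2][1]:
            end, size = self.buckets[i + 1]      # merged bucket keeps the newer end
            self.buckets[i + 1:i + 3] = [(end, size * 2)]
            i += 1

    def estimate(self):
        """Estimate the number of 1s in the last `window` bits."""
        if not self.buckets:
            return 0
        # Sum all buckets but the last, plus half of the last (partial) bucket.
        return sum(s for _, s in self.buckets[:-1]) + self.buckets[-1][1] // 2
```

The merge loop relies on buckets being sorted by size: if the buckets at positions i and i+2 have equal sizes, the one between them must match too.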
Example: Bucketized Stream
1001010110001011010101010101011010101010101110101010111010100010110010
Same stream as before: 2 buckets of size 1, 1 of size 2, 2 of size 4, 2 of size 8, and at least 1 of size 16, partially beyond the window.
Estimate: (1 + 1 + 2 + 4 + 4 + 8 + 8) + 16/2 = 28 + 8 = 36.
Error Bound
l Suppose the last bucket has size 2^k.
l Then by assuming 2^(k-1) of its 1s are still within the window, we make an error of at most 2^(k-1).
l Since there is at least one bucket of each of the sizes less than 2^k, and the last bucket contributes at least one 1 inside the window (its end-timestamp is within the window), the true sum is at least (1 + 2 + ... + 2^(k-1)) + 1 = 2^k.
l Thus, the error is at most 50%.
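A quick numeric check of the bound above (illustrative only, not part of the slides):

```python
# Check the 50% bound when the last (oldest) bucket has size 2**k.
for k in range(1, 16):
    max_error = 2 ** (k - 1)       # at most half of the last bucket miscounted
    min_true = (2 ** k - 1) + 1    # 1 + 2 + ... + 2**(k-1) from smaller buckets,
                                   # plus at least one 1 from the partial bucket
    assert max_error / min_true <= 0.5
```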
Frequent Pattern Mining
over Data Stream
l Items Counting
l Lossy Counting
l Extensions
LOSSY COUNTING
Mining Approximate Frequent Patterns
l Mining precise frequent patterns in stream data is unrealistic
l Even storing them in a compressed form, such as an FP-tree
l Approximate answers are often sufficient (e.g., trend/pattern analysis)
l Example: a router is interested in all flows
l whose frequency is at least 1% (σ) of the entire traffic stream seen so far
l and feels that an error of 1/10 of σ (ε = 0.1%) is comfortable
l How to mine frequent patterns with good approximation?
l Lossy Counting Algorithm (Manku & Motwani, VLDB'02)
l Major idea: do not trace items until they become frequent
l Advantage: guaranteed error bound
l Disadvantage: keeps a large set of traces
Lossy Counting for Frequent Single Items
Divide the stream into buckets (bucket size is 1/ε = 1000).
First Bucket of Stream
(Figure: the empty summary is filled with the item counts from the first bucket.)
At the bucket boundary, decrease all counters by 1.
Next Bucket of Stream
(Figure: the summary is updated with the item counts from the next bucket.)
At the bucket boundary, again decrease all counters by 1.
Approximation Guarantee
l Given: (1) support threshold σ, (2) error threshold ε, and (3) stream length N
l Output: items with frequency counts exceeding (σ − ε)N
l How much do we undercount?
l If the stream length seen so far is N and the bucket size is 1/ε, then the frequency count error ≤ #buckets = N / bucket-size = N/(1/ε) = εN
l Approximation guarantee
l No false negatives
l False positives have true frequency count at least (σ − ε)N
l Frequency counts are underestimated by at most εN
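The simplified single-item procedure described above (add counts, then decrement every counter by 1 at each bucket boundary) can be sketched as follows; the function name is ours:

```python
def lossy_count(stream, epsilon):
    """Simplified Lossy Counting for single items.

    Bucket width is 1/epsilon.  At every bucket boundary each counter is
    decreased by 1 and counters that reach 0 are dropped, so any stored
    count undercounts the true frequency by at most epsilon * N.
    """
    width = int(1 / epsilon)          # bucket size, e.g. 1000 for epsilon = 0.1%
    counts = {}
    for n, item in enumerate(stream, start=1):
        counts[item] = counts.get(item, 0) + 1
        if n % width == 0:            # bucket boundary
            for it in list(counts):
                counts[it] -= 1
                if counts[it] == 0:
                    del counts[it]
    return counts
```

With σ = 1% and ε = 0.1% as in the router example, an item is reported when its stored count exceeds (σ − ε)N.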
Lossy Counting For Frequent Itemsets
Divide the stream into buckets as for frequent items, but fill as many buckets as possible into main memory at one time.
If we put 3 buckets of data into main memory at one time, then decrease each frequency count by 3.
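A rough sketch of the batched variant described above (illustrative only: enumerated itemset size is capped at `max_len`, which is our simplification for brevity, not part of the algorithm):

```python
from itertools import combinations

def lossy_count_itemsets(transactions, epsilon, batch_buckets=3, max_len=2):
    """Batched Lossy Counting for itemsets: load several buckets of
    transactions into memory, count their itemsets, then decrease every
    counter by the number of buckets loaded and drop non-positive ones."""
    width = int(1 / epsilon)                      # transactions per bucket
    batch_size = width * batch_buckets
    counts = {}
    for start in range(0, len(transactions), batch_size):
        batch = transactions[start:start + batch_size]
        for t in batch:
            for k in range(1, max_len + 1):
                for itemset in combinations(sorted(t), k):
                    counts[itemset] = counts.get(itemset, 0) + 1
        loaded = max(1, len(batch) // width)      # buckets in this batch
        for itemset in list(counts):              # decrement by #buckets loaded
            counts[itemset] -= loaded
            if counts[itemset] <= 0:
                del counts[itemset]
    return counts
```

Processing several buckets per pass lets the boundary decrement be larger, which prunes more itemsets per pass.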
Update of Summary Data Structure
(Figure: the summary data structure is merged with itemset counts from the 3 buckets of data in memory; each count is decreased by 3, and itemsets whose count drops to 0 are deleted.)
That's why we choose a large number of buckets: more itemsets get deleted.
Pruning Itemsets – Apriori Rule
If we find that an itemset is not frequent, then we need not consider its supersets (the Apriori rule).
(Figure: an itemset found infrequent in the 3 in-memory buckets is pruned from the summary, and its supersets are never counted.)
Summary of Lossy Counting
l Strength
l A simple idea
l Can be extended to frequent itemsets
l Weakness
l Space bound is not good
l For frequent itemsets, it does scan each record many times
l The output is based on all previous data, but sometimes we are only interested in recent data
l A space-saving method for stream frequent item mining
l Metwally, Agrawal, and El Abbadi, ICDT'05
Extensions
Extensions
l Lossy Counting Algorithm (Manku & Motwani, VLDB'02)
l Mines approximate frequent patterns.
l Keeps only current frequent patterns; no changes can be detected.
l FP-Stream (C. Giannella, J. Han, X. Yan, P. S. Yu, 2003)
l Uses a tilted time window frame.
l Mines evolution and dramatic changes of frequent patterns.
l Moment (Y. Chi, ICDM '04)
l Very similar to FP-tree, except that it keeps a dynamic set of items.
l Maintains closed frequent itemsets over a stream sliding window.
Lossy Counting versus FP-Stream
l Lossy Counting (Manku & Motwani VLDB’02)
l Keep only current frequent patterns—No changes can be detected
l FP-Stream: mining evolution and dramatic changes of frequent patterns (Giannella, Han, Yan, Yu, 2003)
l Uses a tilted time window frame
l Uses a compressed form to store significant (approximate) frequent patterns and their time-dependent traces
Summary of FP-Stream
l Mines frequent itemsets at multiple time granularities, based on FP-Growth
l Maintains
l Pattern tree
l Tilted-time window
l Advantages
l Allows answering time-sensitive queries
l Gives greater weight to recent data
l Drawback
l Time and memory complexity
Moment
l Regenerate frequent itemsets from the entire window whenever a new transaction enters, or an old transaction leaves, the window
l Or: store every itemset, frequent or not, in a traditional data structure such as a prefix tree, and update its support whenever a new transaction enters, or an old transaction leaves, the window
l Drawbacks
l Mining each window from scratch: too expensive
l Subsequent windows have many frequent patterns in common
l Updating frequent patterns on every new tuple: also too expensive
Summary of Moment
l Computes closed frequent itemsets in a sliding window
l Uses a Closed Enumeration Tree
l Uses 4 types of nodes:
l Closed nodes
l Intermediate nodes
l Unpromising gateway nodes
l Infrequent gateway nodes
l Adding transactions
l Closed itemsets remain closed
l Removing transactions
l Infrequent itemsets remain infrequent
References
[Agrawal'94] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB, pages 487-499, 1994.
[Cheung'03] W. Cheung and O. R. Zaiane. Incremental mining of frequent patterns without candidate generation or support. In DEAS, 2003.
[Chi'04] Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz. Moment: Maintaining closed frequent itemsets over a stream sliding window. In ICDM, November 2004.
[Evfimievski'03] A. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In PODS, 2003.
[Han'00] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In SIGMOD, 2000.
[Koh'04] J. Koh and S. Shieh. An efficient approach for maintaining association rules based on adjusting FP-tree structures. In DASFAA, 2004.
[Leung'05] C.-S. Leung, Q. Khan, and T. Hoque. CanTree: A tree structure for efficient incremental mining of frequent patterns. In ICDM, 2005.
[Toivonen'96] H. Toivonen. Sampling large databases for association rules. In VLDB, 1996, pages 134-145.
[Mozafari'08] B. Mozafari, H. Thakkar, and C. Zaniolo. Verifying and mining frequent patterns from large windows over data streams. In ICDE, 2008, pages 179-188.
[Thakkar'09] H. Thakkar, B. Mozafari, and C. Zaniolo. Continuous post-mining of association rules in a data stream management system. Chapter VII in Post-Mining of Association Rules: Techniques for Effective Knowledge Extraction, Y. Zhao, C. Zhang, and L. Cao (eds.), ISBN 978-1-60566-404-0.