Mining Sequential Patterns
Rakesh Agrawal, Ramakrishnan Srikant
Proc. of the Int’l Conference on Data Engineering (ICDE), March 1995
Presenter: Phil Schlosser
Topics
► Introduction
► Problem Statement
► Finding Sequential Patterns (The
► Bar code data allows us to store massive amounts of sales data (basket data).
► We would like to extract sequential buying patterns.
► E.g. Video rental: Star Wars; Empire Strikes Back; Return of the Jedi.
► Can include multiple elements.
► E.g. Sheet and pillow case; comforter; drapes.
Problem Statement
► Given a database of transactions:
► Each transaction:
- Customer ID
- Transaction Time
- Set of items purchased
* No two transactions have the same customer ID and transaction time.
* Don’t consider quantity of items purchased. Item was either purchased or not purchased.
Problem Statement (terms)
► Itemset: non-empty set of items. Each itemset is mapped to an integer.
► Sequence: Ordered list of itemsets.
► Customer-Sequence: List of customer transactions ordered by increasing transaction time.
► (A customer supports a sequence if the sequence is contained in the customer-sequence.)
► Support for a sequence: Fraction of total customers that support a sequence.
► Maximal Sequence: A sequence that is not contained in any other sequence.
► Large Sequence: Sequence that meets minisup (the minimum support threshold).
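The terms above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names (`customer_sequences`, `contains`, `support`) and the toy transactions are assumptions made for this example.

```python
from collections import defaultdict

def customer_sequences(transactions):
    """Group (customer_id, time, items) rows into one time-ordered sequence per customer."""
    by_customer = defaultdict(list)
    for customer_id, time, items in transactions:
        by_customer[customer_id].append((time, frozenset(items)))
    return {
        cid: [items for _, items in sorted(rows, key=lambda row: row[0])]
        for cid, rows in by_customer.items()
    }

def contains(customer_seq, seq):
    """True if every itemset of seq is a subset of a distinct transaction, in order."""
    pos = 0
    for itemset in seq:
        # Greedily advance to the earliest transaction that holds this itemset.
        while pos < len(customer_seq) and not set(itemset) <= customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1  # the next itemset must match a strictly later transaction
    return True

def support(sequences, seq):
    """Fraction of all customers whose customer-sequence contains seq."""
    return sum(contains(cs, seq) for cs in sequences.values()) / len(sequences)

# Toy data (hypothetical): (customer_id, transaction_time, items)
transactions = [
    (1, 1, {30}), (1, 2, {90}),
    (2, 1, {10, 20}), (2, 2, {30}), (2, 3, {40, 60, 70}),
    (3, 1, {30, 50, 70}),
    (4, 1, {30}), (4, 2, {40, 70}), (4, 3, {90}),
    (5, 1, {90}),
]
sequences = customer_sequences(transactions)
print(support(sequences, [{30}, {90}]))      # 0.4 (customers 1 and 4)
print(support(sequences, [{30}, {40, 70}]))  # 0.4 (customers 2 and 4)
```

With a minimum support of 25%, both sequences in the usage lines would count as large, since each is supported by 2 of the 5 customers.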
► Replace each transaction with all litemsets (large itemsets) contained in the transaction.
► Transactions with no litemsets are dropped. (A customer whose transactions are all dropped still counts toward the total number of customers used for support.)
Note: (10 20) is dropped because of lack of support. (40 60 70) is replaced with the set of litemsets {(40), (70), (40 70)} (60 does not have minisup).
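The replacement step described above can be sketched as follows. This is an illustrative sketch only: the name `transform` is hypothetical, the set of litemsets is assumed to have been found already, and the mapping of litemsets to integers is left out.

```python
def transform(customer_seq, litemsets):
    """Replace each transaction by the set of litemsets it contains."""
    transformed = []
    for transaction in customer_seq:
        found = {lit for lit in litemsets if lit <= transaction}
        if found:  # transactions that contain no litemset are dropped
            transformed.append(found)
    return transformed

# Assumed litemsets, consistent with the note above (60 is not large).
litemsets = [frozenset(s) for s in ({30}, {40}, {70}, {40, 70}, {90})]

# One hypothetical customer-sequence containing the transactions from the note.
customer = [frozenset({10, 20}), frozenset({30}), frozenset({40, 60, 70})]
result = transform(customer, litemsets)
# (10 20) disappears; (40 60 70) becomes {(40), (70), (40 70)}.
```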
Sequence Phase
► Finds Large Sequences:
► AprioriAll
► AprioriSome
► DynamicSome
► Stay Tuned….
Maximal Phase
► Find maximal sequences among large sequences.
► k-sequence: sequence of length k
► S: set of all large sequences
► n: length of the longest large sequence
► for (k=n; k>1; k--) do
      foreach k-sequence sk do
          Delete from S all subsequences of sk
Authors claim data-structures and an algorithm exist to do this efficiently. (hash trees)
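The loop above can be written out naively in Python. This sketch is an assumption-laden illustration: it uses a plain quadratic scan rather than the hash trees the authors mention, and `is_subsequence` and `maximal_phase` are names invented here. Sequences are tuples of frozensets.

```python
def is_subsequence(small, big):
    """True if each itemset of small is a subset of a distinct itemset of big, in order."""
    pos = 0
    for itemset in small:
        while pos < len(big) and not itemset <= big[pos]:
            pos += 1
        if pos == len(big):
            return False
        pos += 1
    return True

def maximal_phase(large_sequences):
    """Follow the slide's loop: for k = n down to 2, delete subsequences of each k-sequence."""
    S = set(large_sequences)
    n = max((len(s) for s in S), default=0)
    for k in range(n, 1, -1):
        for s_k in [s for s in S if len(s) == k]:
            S -= {t for t in S if t != s_k and is_subsequence(t, s_k)}
    return S

def seq(*itemsets):
    """Helper for the example: build a sequence from plain sets."""
    return tuple(frozenset(i) for i in itemsets)

# Hypothetical set of large sequences.
large = {
    seq({30}), seq({40}), seq({70}), seq({40, 70}), seq({90}),
    seq({30}, {40}), seq({30}, {70}), seq({30}, {40, 70}), seq({30}, {90}),
}
maximal = maximal_phase(large)
# Only <(30) (40 70)> and <(30) (90)> remain.
```

Because the loop stops at k = 2, as on the slide, a 1-sequence is only removed here when it is contained in some longer sequence.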
Sequence Phase
► Sequence phase finds the large sequences.
► Two families of algorithms:
- CountSome
- CountAll
Final Exam Question #2
► There were two types of algorithms presented to find sequential patterns, CountSome and CountAll. What was the main difference between the two algorithms?
► CountAll (AprioriAll) is careful with respect to minimum support, careless with respect to maximality.
► CountSome (AprioriSome) is careful with respect to maximality, careless with respect to minimum support.
AprioriAll
L1 = {large 1-sequences}
for (k = 2; Lk-1 ≠ {}; k++) do begin
    Ck = New candidates generated from Lk-1
    foreach customer-sequence c in the database do
        Increment the count of all candidates in Ck that are contained in c.
    Lk = Candidates in Ck with minimum support.
end
Answer = Maximal Sequences in ∪k Lk
Notation:
Lk: Set of all large k-sequences
Ck: Set of candidate k-sequences
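A runnable sketch of the AprioriAll loop above, under several assumptions: the database is already transformed, so each transaction is a set of litemset ids and a sequence is a tuple of ids; candidate generation, which the slide does not spell out, is done here with an Apriori-style join and prune; and `min_count` is the minimum support expressed as a number of customers. All function names are hypothetical.

```python
def contains(customer_seq, candidate):
    """True if the ids of candidate occur in strictly later and later transactions."""
    pos = 0
    for lit in candidate:
        while pos < len(customer_seq) and lit not in customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1
    return True

def generate_candidates(prev_large):
    """Join sequences sharing all but the last id, then prune by (k-1)-subsequences."""
    prev = set(prev_large)
    joined = {p + (q[-1],) for p in prev for q in prev if p[:-1] == q[:-1]}
    return {c for c in joined
            if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))}

def count_large(candidates, database, min_count):
    """Keep the candidates supported by at least min_count customers."""
    return {c for c in candidates
            if sum(contains(cs, c) for cs in database) >= min_count}

def apriori_all(database, min_count):
    """Return all large sequences (the union of every Lk), before the maximal phase."""
    ids = {lit for cs in database for t in cs for lit in t}
    L = {1: count_large({(lit,) for lit in ids}, database, min_count)}
    k = 2
    while L[k - 1]:
        L[k] = count_large(generate_candidates(L[k - 1]), database, min_count)
        k += 1
    return set().union(*L.values())

# Hypothetical transformed database: one list of id-sets per customer.
database = [
    [{1}, {5}],
    [{1}, {2, 3, 4}],
    [{1, 3}],
    [{1}, {2, 3, 4}, {5}],
    [{5}],
]
large = apriori_all(database, min_count=2)
```

On this toy database the result holds the five 1-sequences plus (1, 2), (1, 3), (1, 4) and (1, 5); no 3-sequence survives pruning. Picking the maximal sequences out of this set is a separate step.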
AprioriSome (Forward Phase)
last = 1
for (k = 2; Ck-1 ≠ {} and Llast ≠ {}; k++) do begin
    if (Lk-1 known) then
        Ck = New candidates generated from Lk-1
    else
        Ck = New candidates generated from Ck-1
    if (k == next(last)) then begin // (next k to count?)
        foreach customer-sequence c in the database do
            Increment the count of all candidates in Ck that are contained in c.
        Lk = Candidates in Ck with minimum support.
        last = k;
    end
end
AprioriSome (Backward Phase)
for (k--; k>=1; k--) do
    if (Lk not found in forward phase) then begin
        Delete all sequences in Ck contained in some Li, i>k;
        foreach customer-sequence c in DT do
            Increment the count of all candidates in Ck that are contained in c
        Lk = Candidates in Ck with minimum support
    end
    else // Lk already known
        Delete all sequences in Lk contained in some Li, i>k;
Answer = ∪k Lk // (Maximal Phase not needed)
Notation: DT: Transformed database
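The forward and backward phases can be combined into one hedged Python sketch. Assumptions, all made for illustration: the database is already transformed (transactions are sets of litemset ids, sequences are tuples of ids); `next_k` is a stand-in for next(last) that simply skips one length, not the heuristic used in the paper; candidate generation is an Apriori-style join and prune; and containment between sequences is checked on ids only. In the else branch, sequences are deleted from Lk, since the answer is the union of the Lk.

```python
def contains(customer_seq, candidate):
    """True if the ids of candidate occur in strictly later and later transactions."""
    pos = 0
    for lit in candidate:
        while pos < len(customer_seq) and lit not in customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1
    return True

def is_subsequence(small, big):
    """True if the ids of small appear in big in the same order."""
    remaining = iter(big)
    return all(x in remaining for x in small)

def generate(prev):
    """Join sequences sharing all but the last id, then prune by (k-1)-subsequences."""
    joined = {p + (q[-1],) for p in prev for q in prev if p[:-1] == q[:-1]}
    return {c for c in joined
            if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))}

def count_large(candidates, database, min_count):
    return {c for c in candidates
            if sum(contains(cs, c) for cs in database) >= min_count}

def apriori_some(database, min_count, next_k=lambda last: last + 2):
    ids = {lit for cs in database for t in cs for lit in t}
    C = {1: {(lit,) for lit in ids}}
    L = {1: count_large(C[1], database, min_count)}
    last, k = 1, 2
    # Forward phase: generate every Ck, but count only the lengths picked by next_k.
    while C[k - 1] and L[last]:
        C[k] = generate(L[k - 1] if k - 1 in L else C[k - 1])
        if k == next_k(last):
            L[k] = count_large(C[k], database, min_count)
            last = k
        k += 1
    # Backward phase: handle skipped lengths, dropping sequences that are
    # contained in a longer large sequence.
    for j in range(k - 1, 0, -1):
        longer = [s for i in L if i > j for s in L[i]]
        def covered(c):
            return any(is_subsequence(c, s) for s in longer)
        if j not in L:
            C[j] = {c for c in C[j] if not covered(c)}
            L[j] = count_large(C[j], database, min_count)
        else:
            L[j] = {c for c in L[j] if not covered(c)}
    return set().union(*L.values())

# Hypothetical transformed database: one list of id-sets per customer.
database = [
    [{1}, {5}],
    [{1}, {2, 3, 4}],
    [{1, 3}],
    [{1}, {2, 3, 4}, {5}],
    [{5}],
]
answer = apriori_some(database, min_count=2)
```

On this toy database the forward phase counts only length 3 (nothing is large), and the backward phase then counts length 2 and discards every 1-sequence, leaving (1, 2), (1, 3), (1, 4) and (1, 5).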
DynamicSome (Just the basics)
► Similar to AprioriSome
► AprioriSome generates Ck from Ck-1
► DynamicSome generates Ck “on the fly”
Final Exam Question #1
► What was the greatest hardware concern regarding the algorithms contained in the paper?
► Main memory capacity. When there is little main memory, or many potentially large sequences, the benefits of AprioriSome vanish.
Final Exam Question #3
► How did the two best sequence mining algorithms (AprioriAll and AprioriSome) perform compared with each other? Take into consideration memory, speed, and usefulness of the data.
Final Exam Question #3
► Memory:
In terms of main memory usage, AprioriAll is better.
In terms of secondary storage access, AprioriSome is better.
Final Exam Question #3
► Speed:
With sufficient memory, the difference between AprioriAll and AprioriSome increases as minimum support decreases. (AprioriSome is better.)
At lower support, more large sequences that are not maximal are generated.
Final Exam Question #3
► Usefulness of the data:
For the problem of finding maximal large sequences, the answer is “Precisely the same.”
However, AprioriAll finds all large sequences, while AprioriSome discards some large sequences that aren’t maximal. AprioriAll, then, generates more “useful” data.
“The user may want to know the ratio of the number of people who bought the first k + 1 items in a sequence to the number of people who bought the first k items.”
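The ratio mentioned in the quote needs support counts for non-maximal prefixes, which is why keeping all large sequences can be useful. A small sketch, where `prefix_ratios` and the counts are hypothetical and sequences are tuples of litemset ids:

```python
def prefix_ratios(sequence, support_count):
    """Ratio of customers supporting the first k + 1 elements to those supporting the first k."""
    return [support_count[sequence[:k + 1]] / support_count[sequence[:k]]
            for k in range(1, len(sequence))]

# Hypothetical counts: 4 customers support <1>, 2 of them also support <1 5>.
support_count = {(1,): 4, (1, 5): 2}
print(prefix_ratios((1, 5), support_count))  # [0.5]
```

If the count for the non-maximal prefix (1,) had been discarded, this ratio could not be computed without another pass over the data.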