Database Management Systems
Association Rules Fundamentals
Elena Baralis, Silvia Chiusano
Politecnico di Torino
Data Base and Data Mining Group of Politecnico di Torino (DBMG)
Association rules
Objective
  extraction of frequent correlations or patterns from a transactional database
Tickets at a supermarket counter
Association rule
  diapers ⇒ beer
  2% of transactions contain both items
  30% of transactions containing diapers also contain beer
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diapers, Milk
4 Beer, Bread, Diapers, Milk
5 Coke, Diapers, Milk
… …
Association rule mining
A collection of transactions is given
  a transaction is a set of items
  items in a transaction are not ordered
Association rule
  A, B ⇒ C
  A, B = items in the rule body
  C = item in the rule head
The ⇒ means co-occurrence, not causality
Examples
  cereals, cookies ⇒ milk
  age < 40, life-insurance = yes ⇒ children = yes
  customer, relationship ⇒ data, mining
Definitions
Itemset is a set including one or more items
  Example: {Beer, Diapers}
k-itemset is an itemset that contains k items
Support count (#) is the frequency of occurrence of an itemset
  Example: #{Beer, Diapers} = 2
Support is the fraction of transactions that contain an itemset
  Example: sup({Beer, Diapers}) = 2/5
Frequent itemset is an itemset whose support is greater than or equal to a minsup threshold
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diapers, Milk
4 Beer, Bread, Diapers, Milk
5 Coke, Diapers, Milk
… …
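The definitions above can be checked with a minimal Python sketch. It assumes the five rows shown in the table are the whole database (as the slide's 2/5 does); the function names `support_count`, `support` and `is_frequent` are illustrative and not part of the slides.

```python
# Transactions with TID 1-5 from the table (assumed to be the whole database).
TRANSACTIONS = [
    frozenset({"Bread", "Coke", "Milk"}),
    frozenset({"Beer", "Bread"}),
    frozenset({"Beer", "Coke", "Diapers", "Milk"}),
    frozenset({"Beer", "Bread", "Diapers", "Milk"}),
    frozenset({"Coke", "Diapers", "Milk"}),
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)


def is_frequent(itemset, transactions, minsup):
    """True when the support reaches the minsup threshold."""
    return support(itemset, transactions) >= minsup


print(support_count({"Beer", "Diapers"}, TRANSACTIONS))  # 2
print(support({"Beer", "Diapers"}, TRANSACTIONS))        # 0.4
```

Representing each transaction as a `frozenset` mirrors the statement that items in a transaction are not ordered.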
Rule quality metrics
Given the association rule A ⇒ B
  A, B are itemsets
Support is the fraction of transactions containing both A and B
  sup = #{A,B} / |T|
  |T| is the cardinality of the transactional database
  a priori probability of itemset AB
  rule frequency in the database
Confidence is the frequency of B in transactions containing A
  conf = sup(A,B) / sup(A)
  conditional probability of finding B having found A
  "strength" of the "⇒"
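Rule quality metrics: example
From itemset {Milk, Diapers} the following rules may be derived
Rule: Milk ⇒ Diapers
  support sup = #{Milk,Diapers} / #trans. = 3/5 = 60%
  confidence conf = #{Milk,Diapers} / #{Milk} = 3/4 = 75%
Rule: Diapers ⇒ Milk
  same support sup = 60%
  confidence conf = #{Milk,Diapers} / #{Diapers} = 3/3 = 100%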
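As a hedged illustration of the two metrics, the sketch below computes support and confidence for both rules derivable from {Milk, Diapers}. It assumes the five-row transaction table is the whole database; `rule_metrics` is a hypothetical helper name, and exact fractions are used so the results can be compared with the slide's ratios.

```python
from fractions import Fraction

# Transactions with TID 1-5 from the table (assumed to be the whole database).
TRANSACTIONS = [
    frozenset({"Bread", "Coke", "Milk"}),
    frozenset({"Beer", "Bread"}),
    frozenset({"Beer", "Coke", "Diapers", "Milk"}),
    frozenset({"Beer", "Bread", "Diapers", "Milk"}),
    frozenset({"Coke", "Diapers", "Milk"}),
]


def count(itemset, transactions):
    """Support count: transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def rule_metrics(body, head, transactions):
    """Return (support, confidence) of the rule body => head."""
    body, head = frozenset(body), frozenset(head)
    both = count(body | head, transactions)
    sup = Fraction(both, len(transactions))          # #{A,B} / |T|
    conf = Fraction(both, count(body, transactions))  # #{A,B} / #{A}
    return sup, conf


print(rule_metrics({"Milk"}, {"Diapers"}, TRANSACTIONS))  # (Fraction(3, 5), Fraction(3, 4))
print(rule_metrics({"Diapers"}, {"Milk"}, TRANSACTIONS))  # (Fraction(3, 5), Fraction(1, 1))
```

Both rules share the same support because support depends only on the union of body and head; only the denominator of the confidence changes. The helper raises `ZeroDivisionError` when the body never occurs, which a real implementation would need to handle.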
Association rule extraction
Given a set of transactions T, association rule mining is the extraction of the rules satisfying the constraints
  support ≥ minsup threshold
  confidence ≥ minconf threshold
The result is
  complete (all rules satisfying both constraints)
  correct (only the rules satisfying both constraints)
May add other more complex constraints
Association rule extraction
Brute-force approach
  enumerate all possible permutations (i.e., association rules)
  compute support and confidence for each rule
  prune the rules that do not satisfy the minsup and minconf constraints
Computationally unfeasible
Given an itemset, the extraction process may be split
  first generate frequent itemsets
  next generate rules from each frequent itemset
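The two-step split can be sketched as follows. This is a deliberately naive illustration, not the Apriori algorithm: step 1 enumerates every possible itemset, which is only workable on a toy database. The transactions are the five rows of the supermarket table from the earlier slides; the function names and the use of `Fraction` thresholds (to avoid floating-point rounding at the boundary) are assumptions of this sketch.

```python
from fractions import Fraction
from itertools import combinations

TRANSACTIONS = [
    frozenset({"Bread", "Coke", "Milk"}),
    frozenset({"Beer", "Bread"}),
    frozenset({"Beer", "Coke", "Diapers", "Milk"}),
    frozenset({"Beer", "Bread", "Diapers", "Milk"}),
    frozenset({"Coke", "Diapers", "Milk"}),
]


def frequent_itemsets(transactions, minsup):
    """Step 1: map each itemset with support >= minsup to its support count."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            itemset = frozenset(candidate)
            count = sum(1 for t in transactions if itemset <= t)
            if Fraction(count, n) >= minsup:
                frequent[itemset] = count
    return frequent


def rules_from(frequent, n, minconf):
    """Step 2: split every frequent itemset into body => head."""
    rules = []
    for itemset, count in frequent.items():
        for k in range(1, len(itemset)):
            for body in combinations(sorted(itemset), k):
                body = frozenset(body)
                # Every subset of a frequent itemset is frequent, so the
                # body's count is already available from step 1.
                conf = Fraction(count, frequent[body])
                if conf >= minconf:
                    rules.append((body, itemset - body, Fraction(count, n), conf))
    return rules


frequent = frequent_itemsets(TRANSACTIONS, Fraction(3, 5))
for body, head, sup, conf in rules_from(frequent, len(TRANSACTIONS), Fraction(3, 4)):
    print(sorted(body), "=>", sorted(head), "sup", sup, "conf", conf)
```

Separating the steps means confidence is computed only from counts collected in step 1, so no further database scan is needed to generate rules.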
Counting Support of Candidates
Scan transaction database to count support of each itemset
  total number of candidates may be large
  one transaction may contain many candidates
Approach [Agr94]
  candidate itemsets are stored in a hash-tree
    leaf node of hash-tree contains a list of itemsets and counts
    interior node contains a hash table
  subset function finds all candidates contained in a transaction
    match transaction subsets to candidates in hash tree
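The counting scan can be illustrated with a simplified sketch. Note the substitution: instead of the hash-tree of [Agr94], it stores the candidates in a flat Python dictionary and matches every k-subset of each transaction against it. The matching idea is the same, but the hash-tree's pruning of subsets that cannot match is not reproduced. The function name `count_candidates` and the sample data (the supermarket table from the earlier slides) are choices of this sketch.

```python
from itertools import combinations

TRANSACTIONS = [
    frozenset({"Bread", "Coke", "Milk"}),
    frozenset({"Beer", "Bread"}),
    frozenset({"Beer", "Coke", "Diapers", "Milk"}),
    frozenset({"Beer", "Bread", "Diapers", "Milk"}),
    frozenset({"Coke", "Diapers", "Milk"}),
]


def count_candidates(transactions, candidates, k):
    """One database scan that counts the support of all k-itemset candidates."""
    counts = {frozenset(c): 0 for c in candidates}
    for t in transactions:
        # Subset function: enumerate the k-subsets of the transaction and
        # look each one up among the candidates (dict instead of hash-tree).
        for subset in combinations(sorted(t), k):
            key = frozenset(subset)
            if key in counts:
                counts[key] += 1
    return counts


candidates = [{"Beer", "Diapers"}, {"Milk", "Diapers"}, {"Bread", "Coke"}]
for itemset, n in count_candidates(TRANSACTIONS, candidates, 2).items():
    print(sorted(itemset), n)
```

The inner loop shows why wide transactions are expensive: a transaction with w items produces C(w, k) subsets to match.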
Performance Issues in Apriori
Candidate generation
  Candidate sets may be huge
  2-itemset candidate generation is the most critical step
  extracting long frequent itemsets requires generating all frequent subsets
Multiple database scans
  n + 1 scans when longest frequent pattern length is n
Factors Affecting Performance
Minimum support threshold
  lower support threshold increases number of frequent itemsets
    larger number of candidates
    larger (max) length of frequent itemsets
Dimensionality (number of items) of the data set
  more space is needed to store support count of each item
  if number of frequent items also increases, both computation and I/O costs may also increase
Size of database
  since Apriori makes multiple passes, run time of algorithm may increase with number of transactions
Average transaction width
  transaction width increases in dense data sets
  may increase max length of frequent itemsets and traversals of hash tree
    number of subsets in a transaction increases with its width
Improving Apriori Efficiency
Hash-based itemset counting [Yu95]
  A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
Transaction reduction [Yu95]
  A transaction that does not contain any frequent k-itemset is useless in subsequent scans
Partitioning [Sav96]
  Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
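The hash-based idea can be sketched for 2-itemsets as follows. This is an illustrative simplification rather than the procedure of [Yu95]: the bucket function (CRC32 of the sorted pair), the number of buckets and the function names are all assumptions, and the sample data is the supermarket table from the earlier slides. While items are counted in the first scan, every pair in each transaction increments one bucket; a pair whose bucket total stays below the threshold cannot be frequent and is never generated as a candidate.

```python
import zlib
from itertools import combinations

TRANSACTIONS = [
    frozenset({"Bread", "Coke", "Milk"}),
    frozenset({"Beer", "Bread"}),
    frozenset({"Beer", "Coke", "Diapers", "Milk"}),
    frozenset({"Beer", "Bread", "Diapers", "Milk"}),
    frozenset({"Coke", "Diapers", "Milk"}),
]


def bucket_of(pair, n_buckets):
    """Deterministic bucket index for a pair (hypothetical hash function)."""
    key = "|".join(sorted(pair)).encode("utf-8")
    return zlib.crc32(key) % n_buckets


def hash_filtered_pairs(transactions, min_count, n_buckets=7):
    """Candidate 2-itemsets that survive the bucket-count filter."""
    buckets = [0] * n_buckets
    item_counts = {}
    for t in transactions:  # single scan: count items and hash pairs
        for item in t:
            item_counts[item] = item_counts.get(item, 0) + 1
        for pair in combinations(sorted(t), 2):
            buckets[bucket_of(pair, n_buckets)] += 1
    frequent_items = sorted(i for i, c in item_counts.items() if c >= min_count)
    return [
        frozenset(pair)
        for pair in combinations(frequent_items, 2)
        if buckets[bucket_of(pair, n_buckets)] >= min_count
    ]


print(len(hash_filtered_pairs(TRANSACTIONS, min_count=3)))
```

The filter is safe because a bucket count is an upper bound on the support count of every pair hashed to it; collisions can only let extra candidates through, never remove a frequent pair.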
Sampling [Toi96]
  mining on a subset of given data, lower support threshold + a method to determine the completeness
Dynamic Itemset Counting [Motw98]
  add new candidate itemsets only when all of their subsets are estimated to be frequent
FP-growth Algorithm [Han00]
Exploits a main memory compressed representation of the database, the FP-tree
  high compression for dense data distributions
    less so for sparse data distributions
  complete representation for frequent pattern mining
    enforces support constraint
Frequent pattern mining by means of FP-growth
  recursive visit of FP-tree
  applies divide-and-conquer approach
    decomposes mining task into smaller subtasks
Only two database scans
  count item supports + build FP-tree
(1) Count item support and prune items below minsup threshold
(2) Build Header Table by sorting items in decreasing support order
(3) Create FP-tree
  For each transaction t in DB
    order transaction t items in decreasing support order
      same order as Header Table
    insert transaction t in FP-tree
      use existing path for common prefix
      create new branch when path becomes different
Example DB (minsup > 1)
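The three construction steps can be sketched in Python as below. Two things are assumptions of this sketch: the ten transactions in `EXAMPLE_DB` are hypothetical (the Example DB table is not reproduced in this transcript; they were chosen so that the first two transactions are {A,B} and {B,C,D} and the item supports are B:8, A:7, C:7, D:5, E:3, as in the slides' Header Table), and ties in support are broken alphabetically. The `Node` class and function name are illustrative.

```python
from collections import Counter

# Hypothetical database, consistent with the Header Table supports.
EXAMPLE_DB = [
    {"A", "B"}, {"B", "C", "D"}, {"A", "C", "D", "E"}, {"A", "D", "E"},
    {"A", "B", "C"}, {"A", "B", "C", "D"}, {"B", "C"}, {"A", "B", "C"},
    {"A", "B", "D"}, {"B", "C", "E"},
]


class Node:
    """FP-tree node: an item, its count, and links to parent and children."""

    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_count):
    # (1) First scan: count item supports, prune items below the threshold.
    support = Counter(item for t in transactions for item in set(t))
    # (2) Header Table: decreasing support (ties alphabetical, an assumption).
    order = sorted((i for i, c in support.items() if c >= min_count),
                   key=lambda i: (-support[i], i))
    rank = {item: r for r, item in enumerate(order)}
    header = {item: [] for item in order}  # item pointers into the tree
    root = Node(None, None)
    # (3) Second scan: insert each sorted transaction.
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:  # path diverges: create a new branch
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1   # shared prefix: reuse the existing path
            node = child
    return root, header


root, header = build_fp_tree(EXAMPLE_DB, min_count=2)
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})
```

Keeping a parent link in every node is what later allows prefix-paths to be read off by walking upward from the nodes listed in the Header Table.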
FP-tree construction
Transaction (TID: Items)
  1: {A,B}
Sorted transaction (TID: Items)
  1: {B,A}
Header Table (Item: sup): {B}: 8, {A}: 7, {C}: 7, {D}: 5, {E}: 3
FP-tree
  { }
    B:1
      A:1
FP-tree construction
Transaction (TID: Items)
  2: {B,C,D}
Sorted transaction (TID: Items)
  2: {B,C,D}
Header Table (Item: sup): {B}: 8, {A}: 7, {C}: 7, {D}: 5, {E}: 3
FP-tree
  { }
    B:2
      A:1
      C:1
        D:1
Item pointers are used to assist frequent itemset generation
FP-growth Algorithm
Scan Header Table from lowest support item up
For each item i in Header Table extract frequent itemsets including item i and items preceding it in Header Table
  (1) build Conditional Pattern Base for item i (i-CPB)
    Select prefix-paths of item i from FP-tree
  (2) recursive invocation of FP-growth on i-CPB
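The recursion can be sketched without the compressed tree. In the simplified version below each pattern base is a plain list of (path, count) pairs, where every path is ordered as in the Header Table; this keeps the divide-and-conquer structure of FP-growth but gives up the memory savings of the FP-tree. The database is hypothetical (chosen to match the supports B:8, A:7, C:7, D:5, E:3 of the slides' Header Table, since the Example DB is not reproduced in this transcript), and the function names are illustrative.

```python
from collections import Counter

# Hypothetical database, consistent with the Header Table supports.
EXAMPLE_DB = [
    {"A", "B"}, {"B", "C", "D"}, {"A", "C", "D", "E"}, {"A", "D", "E"},
    {"A", "B", "C"}, {"A", "B", "C", "D"}, {"B", "C"}, {"A", "B", "C"},
    {"A", "B", "D"}, {"B", "C", "E"},
]


def fp_growth(pattern_base, min_count, suffix=()):
    """Frequent itemsets (as frozensets) mapped to their support counts."""
    support = Counter()
    for path, count in pattern_base:
        for item in path:
            support[item] += count
    frequent = {}
    for item, sup in support.items():
        if sup < min_count:
            continue
        itemset = (item,) + suffix
        frequent[frozenset(itemset)] = sup
        # (1) Conditional pattern base: the prefix that precedes the item
        #     in every path containing it.
        cpb = []
        for path, count in pattern_base:
            if item in path:
                prefix = path[:path.index(item)]
                if prefix:
                    cpb.append((prefix, count))
        # (2) Recursive invocation on the conditional pattern base.
        frequent.update(fp_growth(cpb, min_count, itemset))
    return frequent


def mine(transactions, min_count):
    """Sort each transaction in Header Table order, then run the recursion."""
    support = Counter(item for t in transactions for item in set(t))
    base = [(tuple(sorted(set(t), key=lambda i: (-support[i], i))), 1)
            for t in transactions]
    return fp_growth(base, min_count)


result = mine(EXAMPLE_DB, min_count=2)
for itemset, sup in sorted(result.items(), key=lambda kv: sorted(kv[0])):
    if "D" in itemset and "E" not in itemset:
        print(sorted(itemset), sup)
```

Because every path uses one global item order, each itemset is produced exactly once: by the branch of the recursion that starts from its last item in that order. Infrequent items are left inside the prefixes and simply fail the support test at the next level.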
Example
Consider item D and extract frequent itemsets including D and supported combinations of items A, B, C
Header Table (Item: sup): {B}: 8, {A}: 7, {C}: 7, {D}: 5, {E}: 3
[Figure: FP-tree rooted at { } with nodes B:8, A:5, A:2, C:3, C:3, C:1, five D:1 nodes and three E:1 nodes; the Header Table entry {D} 5 is marked]
Conditional Pattern Base of D
(1) Build D-CPB
  Select prefix-paths of item D from FP-tree
Frequent itemset: D, sup(D) = 5
D-CPB, filled in one prefix-path at a time (Items: sup)
  {B,A,C}: 1
  {B,A}: 1
  {B,C}: 1
  {A,C}: 1
  {A}: 1
Conditional Pattern Base of D
(1) Build D-CPB
  Select prefix-paths of item D from FP-tree
(2) Recursive invocation of FP-growth on D-CPB
D-CPB (Items: sup)
  {B,A,C}: 1
  {B,A}: 1
  {B,C}: 1
  {A,C}: 1
  {A}: 1
D-conditional Header Table (Item: sup)
  {A}: 4
  {B}: 3
  {C}: 3
D-conditional FP-tree
  { }
    A:4
      B:2
        C:1
      C:1
    B:1
      C:1
Conditional Pattern Base of DC
(1) Build DC-CPB
  Select prefix-paths of item C from D-conditional FP-tree
Frequent itemset: DC, sup(DC) = 3
DC-CPB (Items: sup)
  {A,B}: 1
  {A}: 1
  {B}: 1
(2) Recursive invocation of FP-growth on DC-CPB
DC-conditional Header Table (Item: sup)
  {A}: 2
  {B}: 2
DC-conditional FP-tree
  { }
    A:2
      B:1
    B:1
Conditional Pattern Base of DCB
(1) Build DCB-CPB
  Select prefix-paths of item B from DC-conditional FP-tree
Frequent itemset: DCB, sup(DCB) = 2
DCB-CPB (Items: sup)
  {A}: 1
Item A is infrequent in DCB-CPB
  A is removed from DCB-CPB
  DCB-CPB is empty
(2) The search backtracks to DC-CPB
Conditional Pattern Base of DCA
(1) Build DCA-CPB
  Select prefix-paths of item A from DC-conditional FP-tree
Frequent itemset: DCA, sup(DCA) = 2
DCA-CPB is empty (no transactions)
(2) The search backtracks to D-CPB
Conditional Pattern Base of DB
(1) Build DB-CPB
  Select prefix-paths of item B from D-conditional FP-tree
Frequent itemset: DB, sup(DB) = 3
DB-CPB (Items: sup)
  {A}: 2
(2) Recursive invocation of FP-growth on DB-CPB
DB-conditional Header Table (Item: sup)
  {A}: 2
DB-conditional FP-tree
  { }
    A:2
Conditional Pattern Base of DBA
(1) Build DBA-CPB
  Select prefix-paths of item A from DB-conditional FP-tree
Frequent itemset: DBA, sup(DBA) = 2
DBA-CPB is empty (no transactions)
(2) The search backtracks to D-CPB
Conditional Pattern Base of DA
(1) Build DA-CPB
  Select prefix-paths of item A from D-conditional FP-tree
Frequent itemset: DA, sup(DA) = 4
DA-CPB is empty (no transactions)
The search ends
Frequent itemsets with prefix D
Frequent itemsets including D and supported combinations of items B, A, C
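The outcome of the walk-through can be cross-checked directly from the D-CPB given in the slides. The sketch below is a brute-force check, not FP-growth itself: it enumerates every combination of A, B and C, sums the counts of the D-CPB rows that contain it, and keeps those reaching a minimum count of 2 (from the slides' minsup > 1). The function name is hypothetical.

```python
from itertools import combinations

# D-CPB as listed in the slides: (prefix-path items, count).
D_CPB = [
    ({"B", "A", "C"}, 1),
    ({"B", "A"}, 1),
    ({"B", "C"}, 1),
    ({"A", "C"}, 1),
    ({"A"}, 1),
]
SUP_D = 5       # sup(D), from the Header Table
MIN_COUNT = 2   # minsup > 1


def frequent_with_d(cpb, sup_d, min_count):
    """Frequent itemsets that include D, mapped to their support counts."""
    result = {frozenset({"D"}): sup_d}
    items = sorted(set().union(*(path for path, _ in cpb)))
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            itemset = frozenset(combo)
            sup = sum(count for path, count in cpb if itemset <= path)
            if sup >= min_count:
                result[itemset | {"D"}] = sup
    return result


for itemset, sup in sorted(frequent_with_d(D_CPB, SUP_D, MIN_COUNT).items(),
                           key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

The check yields D:5, DA:4, DB:3, DC:3, DBA:2, DCA:2 and DCB:2, the same itemsets and supports reported step by step in the walk-through; {A,B,C,D} is excluded because only one D-CPB row contains A, B and C together.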