© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1
Association Rule Mining
Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
Market-Basket transactions
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Example of Association Rules
{Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}
Implication means co-occurrence, not causality!
Definition: Frequent Itemset
Itemset
– A collection of one or more items
– Example: {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items
Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2
Support (s)
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5
Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Definition: Association Rule
Example: {Milk, Diaper} → {Beer}

s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}
Rule Evaluation Metrics
– Support (s): fraction of transactions that contain both X and Y
– Confidence (c): measures how often items in Y appear in transactions that contain X
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
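These definitions can be checked directly on the table above; a minimal Python sketch (the helper names are ours, not from the slides):

```python
# Support and confidence computed from the market-basket example above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset):
    """s(X): fraction of transactions that contain X."""
    return support_count(itemset) / len(transactions)

def confidence(lhs, rhs):
    """c(X -> Y) = sigma(X union Y) / sigma(X)."""
    return support_count(lhs | rhs) / support_count(lhs)

s = support({"Milk", "Diaper", "Beer"})       # 2/5 = 0.4
c = confidence({"Milk", "Diaper"}, {"Beer"})  # 2/3, about 0.67
```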
Association Rule Mining Task
Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
Mining Association Rules
Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
Frequent itemset generation is still computationally expensive
Frequent Itemset Generation
null
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE
Given d items, there are 2^d possible candidate itemsets
Frequent Itemset Generation
Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database
– Match each transaction against every candidate
– Complexity ~ O(NMw): expensive, since M = 2^d
(Figure: the N transactions of the database matched against the list of M candidates; w is the maximum transaction width.)
Frequent Itemset Generation Strategies
Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M
Reduce the number of transactions (N)
– Reduce the size of N as the size of the itemset increases
– Used by DHP and vertical-based mining algorithms
Reduce the number of comparisons (NM)
– Use efficient data structures to store the candidates or transactions
– No need to match every candidate against every transaction
Reducing Number of Candidates
Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent
The Apriori principle holds due to the following property of the support measure:
– Support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support
∀X, Y : (X ⊆ Y) ⟹ s(X) ≥ s(Y)
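The anti-monotone property can be verified exhaustively on the small example database; a quick Python check (structure and names are ours):

```python
from itertools import combinations

# The five market-basket transactions used throughout the slides.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# For every itemset Y up to size 3 and every proper subset X of Y,
# check s(X) >= s(Y).
for k in (1, 2, 3):
    for y in combinations(items, k):
        for j in range(1, k):
            for x in combinations(y, j):
                assert support(set(x)) >= support(set(y))
```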
Illustrating Apriori Principle

null
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE

Once an itemset in this lattice is found to be infrequent, all of its supersets can be pruned from the search.
Illustrating Apriori Principle
Minimum Support = 3

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets; no need to generate candidates involving Coke or Eggs):
Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Triplets (3-itemsets):
Itemset                Count
{Bread, Milk, Diaper}  3

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates
With support-based pruning: 6 + 6 + 1 = 13 candidates
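The candidate counts above follow from simple binomial arithmetic; a short check in Python:

```python
import math

# Without pruning, every 1-, 2- and 3-itemset over the 6 items is a candidate:
total = sum(math.comb(6, k) for k in (1, 2, 3))   # 6 + 15 + 20 = 41

# With support-based pruning (minsup = 3): all 6 singletons are counted,
# Coke and Eggs fall below minsup, so only the C(4,2) = 6 pairs over the
# remaining 4 items are generated, and a single triplet survives:
pruned = 6 + math.comb(4, 2) + 1                  # 6 + 6 + 1 = 13
```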
Frequent Itemset Mining
Two strategies:
– Breadth-first: Apriori
  Exploits monotonicity to the maximum
– Depth-first: Eclat
  Prunes the database
  Does not fully exploit monotonicity
Apriori

minsup = 2; transactions: 1: B,C · 2: B,C · 3: A,C,D · 4: A,B,C,D · 5: B,D
Candidate 1-itemset counts before any transaction is scanned:
A: 0  B: 0  C: 0  D: 0
Apriori

minsup = 2; transactions: 1: B,C · 2: B,C · 3: A,C,D · 4: A,B,C,D · 5: B,D
Candidate 1-itemset counts after scanning transaction 1 (B, C):
A: 0  B: 1  C: 1  D: 0
Apriori

minsup = 2; transactions: 1: B,C · 2: B,C · 3: A,C,D · 4: A,B,C,D · 5: B,D
Candidate 1-itemset counts after scanning transaction 2 (B, C):
A: 0  B: 2  C: 2  D: 0
Apriori

minsup = 2; transactions: 1: B,C · 2: B,C · 3: A,C,D · 4: A,B,C,D · 5: B,D
Candidate 1-itemset counts after scanning transaction 3 (A, C, D):
A: 1  B: 2  C: 3  D: 1
Apriori

minsup = 2; transactions: 1: B,C · 2: B,C · 3: A,C,D · 4: A,B,C,D · 5: B,D
Candidate 1-itemset counts after scanning transaction 4 (A, B, C, D):
A: 2  B: 3  C: 4  D: 2
Apriori

minsup = 2; transactions: 1: B,C · 2: B,C · 3: A,C,D · 4: A,B,C,D · 5: B,D
Candidate 1-itemset counts after scanning transaction 5 (B, D):
A: 2  B: 4  C: 4  D: 3
Apriori

minsup = 2; transactions: 1: B,C · 2: B,C · 3: A,C,D · 4: A,B,C,D · 5: B,D
Frequent 1-itemsets: A: 2  B: 4  C: 4  D: 3
Candidate 2-itemsets: AB, AC, AD, BC, BD, CD
Apriori

minsup = 2; transactions: 1: B,C · 2: B,C · 3: A,C,D · 4: A,B,C,D · 5: B,D
Frequent 1-itemsets: A: 2  B: 4  C: 4  D: 3
Candidate 2-itemset counts after one database scan:
AB: 1  AC: 2  AD: 2  BC: 3  BD: 2  CD: 2
Apriori

minsup = 2; transactions: 1: B,C · 2: B,C · 3: A,C,D · 4: A,B,C,D · 5: B,D
Frequent 2-itemsets: AC: 2  AD: 2  BC: 3  BD: 2  CD: 2 (AB: 1 is infrequent)
Candidate 3-itemsets: ACD, BCD
Apriori

minsup = 2; transactions: 1: B,C · 2: B,C · 3: A,C,D · 4: A,B,C,D · 5: B,D
Frequent 2-itemsets: AC: 2  AD: 2  BC: 3  BD: 2  CD: 2
Candidate 3-itemset counts: ACD: 2  BCD: 1, so only ACD is frequent
Apriori Algorithm
Method:
– Let k = 1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified:
  · Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  · Prune candidate itemsets containing subsets of length k that are infrequent
  · Count the support of each candidate by scanning the DB
  · Eliminate candidates that are infrequent, leaving only those that are frequent
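The steps above can be sketched in Python (a minimal, unoptimized version; the function and variable names are ours, not from the slides):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Minimal Apriori sketch: returns {frozenset: support_count}."""
    # k = 1: count the individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup}
    result = dict(frequent)

    k = 1
    while frequent:
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent length-k subset.
        prev = list(frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k + 1 and all(
                    frozenset(sub) in frequent for sub in combinations(union, k)
                ):
                    candidates.add(union)
        # One database scan counts all surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= minsup}
        result.update(frequent)
        k += 1
    return result

# The example database from the walkthrough slides.
transactions = [{"B", "C"}, {"B", "C"}, {"A", "C", "D"},
                {"A", "B", "C", "D"}, {"B", "D"}]
freq = apriori(transactions, minsup=2)
```

On this database the result contains the singletons A, B, C, D, the pairs AC, AD, BC, BD, CD, and the triplet ACD, matching the walkthrough.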
Frequent Itemset Mining
Two strategies:
– Breadth-first: Apriori
  Exploits monotonicity to the maximum
– Depth-first: Eclat
  Prunes the database
  Does not fully exploit monotonicity
Depth-First Algorithms

Find all frequent itemsets (minsup = 2)
Transactions: 1: B,C · 2: B,C · 3: A,C,D · 4: A,B,C,D · 5: B,D
Depth-First Algorithms

Transactions: 1: B,C · 2: B,C · 3: A,C,D · 4: A,B,C,D · 5: B,D (minsup = 2)
Split the task in two subproblems, each with minsup = 2:
– find all frequent itemsets with D
– find all frequent itemsets without D
Depth-First Algorithms

Transactions: 1: B,C · 2: B,C · 3: A,C,D · 4: A,B,C,D · 5: B,D (minsup = 2)
With D (solve on the transactions containing D, with D removed): frequent itemsets A, B, C, AC
Without D: frequent itemsets A, B, C, AC, BC
Depth-First Algorithms

Transactions: 1: B,C · 2: B,C · 3: A,C,D · 4: A,B,C,D · 5: B,D (minsup = 2)
Without D: A, B, C, AC, BC
With D: A, B, C, AC on the projected database; adding D back again gives AD, BD, CD, ACD
Depth-First Algorithms

Transactions: 1: B,C · 2: B,C · 3: A,C,D · 4: A,B,C,D · 5: B,D (minsup = 2)
Combining the two branches: AD, BD, CD, ACD + A, B, C, AC, BC
Final result: A, B, C, AC, BC, AD, BD, CD, ACD (plus D itself, which occurs 3 times)
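The with-D / without-D recursion above can be sketched as a search over projected databases (a simplified depth-first sketch in the spirit of these slides; the names are ours):

```python
def depth_first(transactions, items, minsup, prefix=frozenset()):
    """For each item in order: report prefix+item if frequent, then recurse
    on the projected database (transactions containing the item, with the
    item removed) over the remaining items."""
    result = {}
    for i, item in enumerate(items):
        count = sum(1 for t in transactions if item in t)
        if count >= minsup:
            itemset = prefix | {item}
            result[itemset] = count
            # Projected database: "with item" branch; the remaining loop
            # iterations form the "without item" branch.
            projected = [t - {item} for t in transactions if item in t]
            result.update(depth_first(projected, items[i + 1:], minsup, itemset))
    return result

transactions = [{"B", "C"}, {"B", "C"}, {"A", "C", "D"},
                {"A", "B", "C", "D"}, {"B", "D"}]
freq = depth_first(transactions, ["A", "B", "C", "D"], minsup=2)
```

On this database the sketch finds exactly the ten frequent itemsets of the walkthrough: A, B, C, D, AC, AD, BC, BD, CD, ACD.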
Depth-First Algorithm

DB (minsup = 2):
TID  Items
1    B, C
2    B, C
3    A, C, D
4    A, B, C, D
5    B, D

Item counts in DB: A: 2  B: 4  C: 4  D: 3
Depth-First Algorithm

DB: counts A: 2  B: 4  C: 4  D: 3
DB[D] (transactions containing D, with D removed): 3: A,C · 4: A,B,C · 5: B
Item counts in DB[D]: A: 2  B: 2  C: 2
Depth-First Algorithm

DB: counts A: 2  B: 4  C: 4  D: 3
DB[D]: 3: A,C · 4: A,B,C · 5: B (counts A: 2  B: 2  C: 2)
DB[CD] (transactions of DB[D] containing C, with C removed): 3: A · 4: A, B
Item count in DB[CD]: A: 2
Depth-First Algorithm

DB: counts A: 2  B: 4  C: 4  D: 3
DB[D]: 3: A,C · 4: A,B,C · 5: B (counts A: 2  B: 2  C: 2)
DB[CD]: 3: A · 4: A, B
A is frequent in DB[CD], so AC: 2 is frequent in DB[D]
Depth-First Algorithm

DB: counts A: 2  B: 4  C: 4  D: 3
DB[D]: 3: A,C · 4: A,B,C · 5: B (counts A: 2  B: 2  C: 2; found so far: AC: 2)
Depth-First Algorithm

DB: counts A: 2  B: 4  C: 4  D: 3
DB[D]: 3: A,C · 4: A,B,C · 5: B (counts A: 2  B: 2  C: 2; found so far: AC: 2)
DB[BD] (transactions of DB[D] containing B, with B removed): 4: A
Item count in DB[BD]: A: 1, infrequent, so nothing is added
Depth-First Algorithm

DB: counts A: 2  B: 4  C: 4  D: 3
DB[D]: 3: A,C · 4: A,B,C · 5: B (counts A: 2  B: 2  C: 2; found so far: AC: 2)
Depth-First Algorithm

DB: counts A: 2  B: 4  C: 4  D: 3
DB[D]: 3: A,C · 4: A,B,C · 5: B (counts A: 2  B: 2  C: 2; plus AC: 2)
Adding D back to everything frequent in DB[D]: AD: 2  BD: 2  CD: 2  ACD: 2
Depth-First Algorithm

DB: 1: B,C · 2: B,C · 3: A,C,D · 4: A,B,C,D · 5: B,D (counts A: 2  B: 4  C: 4  D: 3)
Found so far: AD: 2  BD: 2  CD: 2  ACD: 2
Depth-First Algorithm

DB: counts A: 2  B: 4  C: 4  D: 3; found so far: AD: 2  BD: 2  CD: 2  ACD: 2
DB[C] (transactions containing C, with C removed; D is already processed and dropped): 1: B · 2: B · 3: A · 4: A, B
Item counts in DB[C]: A: 2  B: 3
Depth-First Algorithm

DB: counts A: 2  B: 4  C: 4  D: 3; found so far: AD: 2  BD: 2  CD: 2  ACD: 2
DB[C]: 1: B · 2: B · 3: A · 4: A, B (counts A: 2  B: 3)
DB[BC] (transactions of DB[C] containing B, with B removed): 1: (empty) · 2: (empty) · 4: A
Item count in DB[BC]: A: 1, infrequent
Depth-First Algorithm

DB: counts A: 2  B: 4  C: 4  D: 3; found so far: AD: 2  BD: 2  CD: 2  ACD: 2
DB[C]: 1: B · 2: B · 3: A · 4: A, B (counts A: 2  B: 3)
Depth-First Algorithm

DB: counts A: 2  B: 4  C: 4  D: 3
DB[C]: 1: B · 2: B · 3: A · 4: A, B (counts A: 2  B: 3)
Adding C back: found so far AD: 2  BD: 2  CD: 2  ACD: 2  AC: 2  BC: 3
Depth-First Algorithm

DB: 1: B,C · 2: B,C · 3: A,C,D · 4: A,B,C,D · 5: B,D (counts A: 2  B: 4  C: 4  D: 3)
Found so far: AD: 2  BD: 2  CD: 2  ACD: 2  AC: 2  BC: 3
Depth-First Algorithm

DB: counts A: 2  B: 4  C: 4  D: 3; found so far: AD: 2  BD: 2  CD: 2  ACD: 2  AC: 2  BC: 3
DB[B] (transactions containing B, with B removed; C and D already dropped): 1: (empty) · 2: (empty) · 4: A · 5: (empty)
Item count in DB[B]: A: 1, infrequent
Depth-First Algorithm

DB: 1: B,C · 2: B,C · 3: A,C,D · 4: A,B,C,D · 5: B,D (counts A: 2  B: 4  C: 4  D: 3)
Found so far: AD: 2  BD: 2  CD: 2  ACD: 2  AC: 2  BC: 3
Depth-First Algorithm

DB: 1: B,C · 2: B,C · 3: A,C,D · 4: A,B,C,D · 5: B,D
Final set of frequent itemsets: A: 2  B: 4  C: 4  D: 3  AC: 2  BC: 3  AD: 2  BD: 2  CD: 2  ACD: 2
ECLAT
For each item, store a list of transaction ids (tids)
Horizontal data layout:
TID  Items
1    A, B, E
2    B, C, D
3    C, E
4    A, C, D
5    A, B, C, D
6    A, E
7    A, B
8    A, B, C
9    A, C, D
10   B

Vertical data layout (one TID-list per item):
A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
C: 2, 3, 4, 5, 8, 9
D: 2, 4, 5, 9
E: 1, 3, 6
ECLAT
Determine the support of any k-itemset by intersecting the TID-lists of two of its (k−1)-subsets.
– Depth-first traversal of the search lattice
– Advantage: very fast support counting
– Disadvantage: intermediate TID-lists may become too large for memory
A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
A ∩ B gives the TID-list of AB: 1, 5, 7, 8
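TID-list intersection maps directly onto set operations; a small Python sketch over the 10-transaction example above:

```python
# Eclat-style support counting: each item carries its TID-list, and the
# support of a k-itemset is the size of the intersection of the TID-lists
# of two of its (k-1)-subsets.
tidlist = {
    "A": {1, 4, 5, 6, 7, 8, 9},
    "B": {1, 2, 5, 7, 8, 10},
    "C": {2, 3, 4, 5, 8, 9},
    "D": {2, 4, 5, 9},
    "E": {1, 3, 6},
}

ab = tidlist["A"] & tidlist["B"]   # TID-list of {A, B}
support_ab = len(ab)               # transactions 1, 5, 7, 8
```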
Rule Generation
Given a frequent itemset L, find all non-empty proper subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement.
– If {A,B,C,D} is a frequent itemset, the candidate rules are:
ABC → D, ABD → C, ACD → B, BCD → A, A → BCD, B → ACD, C → ABD, D → ABC,
AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB
If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L).
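Enumerating the 2^k − 2 candidate rules is a straightforward subset walk; a Python sketch (the function name is ours):

```python
from itertools import combinations

def candidate_rules(L):
    """All 2^k - 2 candidate rules f -> L - f for non-empty proper subsets f."""
    L = frozenset(L)
    rules = []
    for r in range(1, len(L)):          # skip the empty set and L itself
        for f in combinations(sorted(L), r):
            f = frozenset(f)
            rules.append((f, L - f))
    return rules

rules = candidate_rules({"A", "B", "C", "D"})   # 2**4 - 2 = 14 rules
```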
Rule Generation
How to efficiently generate rules from frequent itemsets?
– In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D)
– But the confidence of rules generated from the same itemset does have an anti-monotone property
– E.g., for L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
– Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
Rule Generation for Apriori Algorithm
Lattice of rules for L = {A,B,C,D}:

ABCD ⇒ { }
BCD ⇒ A   ACD ⇒ B   ABD ⇒ C   ABC ⇒ D
CD ⇒ AB   BD ⇒ AC   BC ⇒ AD   AD ⇒ BC   AC ⇒ BD   AB ⇒ CD
D ⇒ ABC   C ⇒ ABD   B ⇒ ACD   A ⇒ BCD

If one rule in this lattice has low confidence, every rule below it whose consequent is a superset of that rule's consequent can be pruned.