1
Mining Association Rules
2
Mining Association Rules
What is Association rule mining
Apriori Algorithm
FP-tree Algorithm
Additional Measures of rule interestingness
Advanced Techniques
3
What Is Association Rule Mining?
Association rule mining
Finding frequent patterns, associations, correlations, or causal structures among sets of items in transaction databases
Understand customer buying habits by finding associations and correlations between the different items that customers place in their “shopping basket”
Applications
Basket data analysis, cross-marketing, catalog design, loss-leader analysis, web log analysis, fraud detection (supervisor->examiner)
4
What Is Association Rule Mining?
Rule form
Antecedent → Consequent [support, confidence]
(support and confidence are user-defined measures of interestingness)
Examples
buys(x, “computer”) → buys(x, “financial management software”) [0.5%, 60%]
age(x, “30..39”) ^ income(x, “42..48K”) → buys(x, “car”) [1%, 75%]
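The [support, confidence] annotation in the rule form above can be illustrated with a minimal sketch. The toy baskets below are invented for this illustration and are not data from the slides; the function names are likewise hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(items <= basket for basket in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent): sup(A and C) / sup(A)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical toy baskets, invented for this illustration.
baskets = [
    {"computer", "financial management software"},
    {"computer"},
    {"computer", "printer"},
    {"printer"},
]

rule_support = support(baskets, {"computer", "financial management software"})
rule_confidence = confidence(
    baskets, {"computer"}, {"financial management software"}
)
print(f"[{rule_support:.0%}, {rule_confidence:.0%}]")
```

Here the rule computer → financial management software holds in 1 of 4 baskets (support) and in 1 of the 3 baskets that contain a computer (confidence).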
Apriori Algorithm Mining Association Rules - paginas.fe.up.pt/~ec/files_0405/slides/04 AssociationRules.pdf
Dynamic itemset counting (DIC): partitions the DB into several blocks, each marked by a start point.
At each start point, DIC estimates the support of all itemsets currently being counted and adds a new itemset to the set of candidate itemsets if all of its subsets are estimated to be frequent.
If DIC adds all frequent itemsets to the set of candidate itemsets during the first scan, it will have counted each itemset’s exact support at some point during the second scan.
Is Apriori Fast Enough?
The bottleneck of Apriori: candidate generation
Huge candidate sets: 10^4 frequent 1-itemsets will generate more than 10^7 candidate 2-itemsets
Multiple scans of the database: needs (n + 1) scans, where n is the length of the longest pattern
Can we design a method that mines the complete set of frequent itemsets without candidate generation?
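The size of the candidate set quoted above can be checked with a short calculation. This sketch assumes that every pair of frequent 1-itemsets becomes a candidate 2-itemset, which is how Apriori's join step behaves at level 2; the pattern length used for the scan count is a made-up value.

```python
from math import comb

frequent_items = 10**4

# Every unordered pair of frequent items is a candidate 2-itemset.
candidate_pairs = comb(frequent_items, 2)
print(candidate_pairs)

# Hypothetical longest frequent pattern, only to illustrate the (n + 1) rule.
longest_pattern = 5
scans_needed = longest_pattern + 1
print(scans_needed)
```

The pair count is about 5 x 10^7, which is consistent with the "more than 10^7" order of magnitude.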
33
Traditional methods, such as database queries, support verifying a hypothesis about a relationship, such as the co-occurrence of diapers and beer.
Data mining methods automatically discover significant association rules from data.
Find whatever patterns exist in the database, without the user having to specify in advance what to look for (data driven).
They therefore allow unexpected correlations to be found.
34
Mining Association Rules
What is Association rule mining
Apriori Algorithm
FP-tree Algorithm
Additional Measures of rule interestingness
Advanced Techniques
35
Mining Frequent Patterns Without Candidate Generation
Compress a large database into a compact, Frequent-Pattern tree (FP-tree) structure
highly condensed, but complete for frequent pattern mining
avoid costly database scans
Develop an efficient, FP-tree-based frequent pattern mining method
A divide-and-conquer methodology: decompose mining tasks into smaller ones
Avoid candidate generation: sub-database test only!
36
The FP-tree construction
min_support = 0.5
Steps:
1. Scan DB once, find the frequent 1-itemsets (single-item patterns)
2. Order frequent items in frequency descending order
3. Scan DB again, construct FP-tree
Note that d, g, i, l, h, j, o, k, s, e, n are not frequent items.
TID   Items bought                (Ordered) frequent items
100   {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200   {a, b, c, f, l, m, o}       {f, c, a, b, m}
300   {b, f, h, j, o}             {f, b}
400   {b, c, k, s, p}             {c, b, p}
500   {a, f, c, e, l, p, m, n}    {f, c, a, m, p}
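Steps 1 and 2 can be sketched on the transactions listed above. The slide does not say how ties between equally frequent items are broken, so this sketch assumes a fixed tie order f, c, a, b, m, p chosen to match the ordered column; the variable names are illustrative.

```python
from collections import Counter

db = {
    100: ["f", "a", "c", "d", "g", "i", "m", "p"],
    200: ["a", "b", "c", "f", "l", "m", "o"],
    300: ["b", "f", "h", "j", "o"],
    400: ["b", "c", "k", "s", "p"],
    500: ["a", "f", "c", "e", "l", "p", "m", "n"],
}
min_support = 0.5

# Step 1: one scan to count every item, then keep the frequent ones.
counts = Counter(item for items in db.values() for item in items)
min_count = min_support * len(db)
frequent = {item for item, n in counts.items() if n >= min_count}

# Step 2: frequency-descending order. Ties are broken with an assumed
# fixed order that reproduces the ordering shown on the slide.
TIE_ORDER = "fcabmp"
by_frequency = sorted(frequent, key=lambda x: (-counts[x], TIE_ORDER.index(x)))
rank = {item: position for position, item in enumerate(by_frequency)}

# Drop infrequent items and reorder each transaction.
ordered = {
    tid: sorted((item for item in items if item in frequent), key=rank.get)
    for tid, items in db.items()
}
for tid, items in ordered.items():
    print(tid, items)
```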
37
The FP-tree construction
3. Scan DB again, construct FP-tree
Header table (each entry also holds the head of that item's node-links):
Item   Count
f      4
c      4
a      3
b      3
m      3
p      3
FP-tree (children are indented under their parent):
{}
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
TID   (Ordered) frequent items
100   {f, c, a, m, p}
200   {f, c, a, b, m}
300   {f, b}
400   {c, b, p}
500   {f, c, a, m, p}
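The second scan can be sketched as repeated insertion of the ordered transactions into a prefix tree. This is a simplified illustration that uses nested dictionaries and omits the header table and node-links; the names `insert` and `dump` are hypothetical.

```python
ordered_transactions = [
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "b", "m"],
    ["f", "b"],
    ["c", "b", "p"],
    ["f", "c", "a", "m", "p"],
]


def insert(tree, items):
    """Walk down from the root, sharing existing prefixes and bumping counts.

    `tree` maps an item to a two-element list [count, subtree].
    """
    for item in items:
        entry = tree.setdefault(item, [0, {}])
        entry[0] += 1
        tree = entry[1]


def dump(tree, depth=0):
    """Return the tree as indented 'item:count' lines."""
    lines = []
    for item, (count, subtree) in tree.items():
        lines.append("  " * depth + f"{item}:{count}")
        lines.extend(dump(subtree, depth + 1))
    return lines


root = {}
for transaction in ordered_transactions:
    insert(root, transaction)

tree_lines = dump(root)
print("\n".join(tree_lines))
```

Five transactions with 18 item occurrences collapse into 11 nodes because the common prefix f, c, a is stored once.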
38
Benefits of the FP-tree Structure
Completeness: never breaks a long pattern of any transaction
preserves complete information for frequent pattern mining
(do not need to scan DB again)
Compactness
reduces irrelevant information: infrequent items are gone
frequency-descending ordering: more frequent items are more likely to be shared
never larger than the original database (not counting node-links and counts)
39
Major Steps to Mine FP-tree
1. Construct conditional pattern base for each node in the FP-tree (the sub-pattern base under the condition of an item's existence)
2. Construct conditional FP-tree from each conditional pattern-base
3. Recursively mine conditional FP-trees and grow frequent patterns obtained so far
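The three steps can be sketched as a recursive pattern-growth routine. To stay short, this sketch keeps each conditional pattern base as a plain list of (prefix, count) pairs instead of building a conditional FP-tree at every level, so it illustrates the recursion rather than the compact data structure. It assumes the five ordered transactions of the FP-tree example and a minimum count of 3; the function name is hypothetical.

```python
from collections import Counter


def pattern_growth(base, min_count, suffix=()):
    """Yield (itemset, count) for every frequent itemset that ends in `suffix`.

    `base` is a list of (ordered item tuple, count) pairs, that is, a
    conditional pattern base (the whole database when `suffix` is empty).
    """
    counts = Counter()
    for items, n in base:
        for item in items:
            counts[item] += n
    for item, total in counts.items():
        if total < min_count:
            continue
        pattern = (item,) + suffix
        yield frozenset(pattern), total
        # Conditional pattern base of `item`: the prefix before it in each path.
        conditional = [
            (items[: items.index(item)], n) for items, n in base if item in items
        ]
        yield from pattern_growth(conditional, min_count, pattern)


ordered_transactions = [
    ("f", "c", "a", "m", "p"),
    ("f", "c", "a", "b", "m"),
    ("f", "b"),
    ("c", "b", "p"),
    ("f", "c", "a", "m", "p"),
]
database = [(transaction, 1) for transaction in ordered_transactions]

frequent_itemsets = dict(pattern_growth(database, min_count=3))
for itemset, count in sorted(
    frequent_itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))
):
    print(sorted(itemset), count)
```

Because every transaction uses the same item order, each frequent itemset is reached exactly once, through its last item in that order.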
40
1. Construct the Conditional Pattern Base
Start at the frequent-item header table of the FP-tree
Traverse the FP-tree by following the link of each frequent item
Accumulate all the transformed prefix paths of that item to form a conditional pattern base
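The traversal above can be sketched with an explicit tree that keeps parent pointers and node-links. The class and function names are illustrative, and the input is assumed to be the five ordered transactions of the FP-tree example.

```python
class Node:
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}
        self.link = None  # next node in the tree that carries the same item


def build_fp_tree(transactions):
    """Return (root, header), where header maps item -> head of its node-links."""
    root, header = Node(None, None), {}
    for items in transactions:
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                # Prepend the new node to the item's node-link chain.
                child.link = header.get(item)
                header[item] = child
            child.count += 1
            node = child
    return root, header


def conditional_pattern_base(header, item):
    """Follow the node-links of `item` and collect (prefix path, count) pairs."""
    base = []
    node = header.get(item)
    while node is not None:
        path = []
        ancestor = node.parent
        while ancestor is not None and ancestor.item is not None:
            path.append(ancestor.item)
            ancestor = ancestor.parent
        if path:
            # The prefix path inherits the count of the item's own node.
            base.append((tuple(reversed(path)), node.count))
        node = node.link
    return base


ordered_transactions = [
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "b", "m"],
    ["f", "b"],
    ["c", "b", "p"],
    ["f", "c", "a", "m", "p"],
]
root, header = build_fp_tree(ordered_transactions)
base_p = dict(conditional_pattern_base(header, "p"))
base_m = dict(conditional_pattern_base(header, "m"))
print(base_p)
print(base_m)
```

Each prefix path takes the count of the item's node rather than the counts of the ancestors, which is what "transformed prefix path" refers to.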
Rules that hold 100% of the time may not have the highest possible lift. For example, if 5% of people are Vietnam veterans and 90% of people are more than 5 years old, every veteran is more than 5 years old, so the joint support is 0.05 and the lift is 0.05/(0.05*0.9) = 1.11, which is only slightly above 1, for the rule
Vietnam veterans -> more than 5 years old.
Lift is also symmetric:
not eat cereal ⇒ play basketball [20%, 80%]
$$\mathrm{LIFT} = \frac{1000/5000}{(1250/5000) \times (3000/5000)} = 1.33$$
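Both lift values quoted above can be reproduced with a few lines. The function name and argument order are choices made for this sketch; the counts are the ones that appear in the two examples.

```python
def lift(sup_ab, sup_a, sup_b):
    """Joint support divided by the support expected under independence."""
    return sup_ab / (sup_a * sup_b)


# Vietnam veterans -> more than 5 years old (every veteran is older than 5).
veterans = lift(0.05, 0.05, 0.9)

# not eat cereal => play basketball, using the counts out of 5000.
cereal = lift(1000 / 5000, 1250 / 5000, 3000 / 5000)

# Swapping antecedent and consequent leaves the value unchanged.
swapped = lift(1000 / 5000, 3000 / 5000, 1250 / 5000)

print(round(veterans, 2), round(cereal, 2), cereal == swapped)
```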
52
Conviction of a Rule
Note that A -> B can be rewritten as ¬(A,¬B)
Conviction is a measure of the implication and has value 1 if items are unrelated.
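PS (the Piatetsky-Shapiro measure, also called leverage) is the proportion of additional elements covered by both the premise and the consequence above what would be expected if they were independent.
$$PS(A \rightarrow B) = \mathrm{sup}(A, B) - \mathrm{sup}(A) \cdot \mathrm{sup}(B)$$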
54
Coverage of a Rule
$$\mathrm{coverage}(A \rightarrow B) = \mathrm{sup}(A)$$
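The PS and coverage formulas can be evaluated on the cereal and basketball counts used in the lift example (1000 of 5000 transactions contain both sides, 1250 the antecedent, 3000 the consequent). The conviction formula does not survive in the slide text, so the sketch assumes the commonly used definition (1 - sup(B)) / (1 - conf(A → B)), which equals 1 when A and B are independent; treat that line as an assumption.

```python
def leverage(sup_a, sup_b, sup_ab):
    """PS(A -> B) = sup(A, B) - sup(A) * sup(B)."""
    return sup_ab - sup_a * sup_b


def coverage(sup_a):
    """coverage(A -> B) = sup(A)."""
    return sup_a


def conviction(sup_a, sup_b, sup_ab):
    """Assumed standard definition: (1 - sup(B)) / (1 - conf(A -> B))."""
    conf = sup_ab / sup_a
    if conf == 1.0:
        return float("inf")  # the rule never fails
    return (1 - sup_b) / (1 - conf)


total = 5000
sup_a = 1250 / total   # not eat cereal
sup_b = 3000 / total   # play basketball
sup_ab = 1000 / total  # both

ps_value = leverage(sup_a, sup_b, sup_ab)
coverage_value = coverage(sup_a)
conviction_value = conviction(sup_a, sup_b, sup_ab)
print(round(ps_value, 3), coverage_value, round(conviction_value, 3))
```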
55
Association Rules 3D Visualisation
56
Size of ball equates to total support
Height equates to confidence
Association Rules 3D Visualisation
57
In this graph, the support values for the Body and Head portions are indicated by the sizes and colors of each. The thickness of each line indicates the confidence of the rule. The sizes and colors of the circles in the center, above the Implies label, indicate the joint support of the Body and Head components of an association rule.
58
Association Rules Visualization - Ball graph
59
The Ball graph Explained
A ball graph consists of a set of nodes and arrows. All the nodes are yellow, green or blue. The blue nodes are active nodes, representing the items in the rule in which the user is interested. The yellow nodes are passive, representing items related to the active nodes in some way. The green nodes merely assist in visualizing two or more items in either the head or the body of the rule. The conventions of a ball graph in DBMiner are as follows.
A circular node represents a frequent (large) data item. The volume of the ball represents the support of the item. Only those items that occur sufficiently frequently are shown.
An arrow between two nodes represents the rule implication between the two items. An arrow is drawn only when the support of a rule is no less than the minimum support.
60
Mining Association Rules
What is Association rule mining
Apriori Algorithm
FP-tree Algorithm
Additional Measures of rule interestingness
Advanced Techniques
61
Multiple-Level Association Rules
Fresh ⇒ Bakery [20%, 60%]
Dairy ⇒ Bread [6%, 50%]
Fruit ⇒ Bread [1%, 50%] is not valid
FoodStuff
Frozen Refrigerated Fresh Bakery Etc...
Vegetable Fruit Dairy Etc....
Banana Apple Orange Etc...
62
Multi-Dimensional Association Rules
10% of customers bought “Foundation” and “Ringworld” in one transaction, followed by “Ringworld Engineers” in another transaction.
64
Given
A database of customer transactions ordered by increasing transaction time
Each transaction is a set of items
A sequence is an ordered list of itemsets
Example:
10% of customers bought “Foundation” and “Ringworld” in one transaction, followed by “Ringworld Engineers” in another transaction.
10% is called the support of the pattern
(a transaction may contain more books than those in the pattern)
Problem
Find all sequential patterns supported by more than a user-specified percentage of data sequences
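The notion of a data sequence supporting a pattern can be made concrete with a small sketch. The customer histories below are invented for this illustration (only the three book titles come from the example), and the function names are hypothetical.

```python
def contains(sequence, pattern):
    """True if the itemsets of `pattern` occur, in order, in `sequence`.

    Each pattern itemset must be a subset of a distinct transaction, and the
    matched transactions must appear in the same order as the pattern.
    """
    position = 0
    for itemset in pattern:
        # Greedily take the earliest transaction that covers this itemset.
        while position < len(sequence) and not itemset <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1
    return True


def sequence_support(data_sequences, pattern):
    """Fraction of data sequences (customers) that contain the pattern."""
    hits = sum(contains(sequence, pattern) for sequence in data_sequences)
    return hits / len(data_sequences)


pattern = [{"Foundation", "Ringworld"}, {"Ringworld Engineers"}]

# Hypothetical customers; each list is ordered by transaction time.
customers = [
    [{"Foundation", "Ringworld"}, {"Ringworld Engineers"}],
    [{"Foundation", "Ringworld", "Dune"}, {"Neuromancer"}, {"Ringworld Engineers"}],
    [{"Ringworld Engineers"}, {"Foundation", "Ringworld"}],
    [{"Foundation"}, {"Ringworld"}, {"Ringworld Engineers"}],
]

pattern_support = sequence_support(customers, pattern)
print(pattern_support)
```

The third customer bought the books in the wrong order and the fourth never bought the first two in a single transaction, so neither supports the pattern.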