Mining Association Rules

What is Association rule mining
Apriori Algorithm
Measures of rule interestingness
Advanced Techniques

What Is Association Rule Mining?
Association rule mining
Finding frequent patterns, associations, correlations, or causal structures among sets of items in transaction databases
Understand customer buying habits by finding associations and correlations between the different items that customers place in their “shopping basket”
Applications
Basket data analysis, cross-marketing, catalog design, loss-leader analysis, web log analysis, fraud detection (supervisor->examiner)

What Is Association Rule Mining?
Rule form
Antecedent → Consequent [support, confidence]
(support and confidence are user defined measures of interestingness)
Examples
buys(x, “computer”) → buys(x, “financial management software”) [0.5%, 60%]
age(x, “30..39”) ^ income(x, “42..48K”) → buys(x, “car”) [1%, 75%]

Probably mom was calling dad at work to buy diapers on the way home, and he decided to buy a six-pack as well.
The retailer could move diapers and beers to separate places and position high-profit items of interest to young fathers along the path.

How can Association Rules be used?
Let the rule discovered be
{Bagels,...} → {Potato Chips}
Potato chips as consequent => Can be used to determine what should be done to boost its sales
Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels
Bagels in antecedent and Potato chips in the consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato Chips

Basic Concepts
Given:
(1) database of transactions,
(2) each transaction is a list of items purchased by a customer in a visit
Find:
all rules that correlate the presence of one set of items (itemset) with that of another set of items
E.g., 35% of people who buy salmon also buy cheese

TX1 Shoes, Socks, Tie, Belt
TX2 Shoes, Socks, Tie, Belt, Shirt, Hat
TX3 Shoes, Tie
TX4 Shoes, Socks, Belt

Transaction  Shoes  Socks  Tie  Belt  Shirt  Scarf  Hat
1            1      1      1    1     0      0      0
2            1      1      1    1     1      0      1
3            1      0      1    0     0      0      0
4            1      1      0    1     0      0      0
...

Tie → Socks
Support is 50% (2/4)
Confidence is 66.67% (2/3)
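
The support and confidence figures for the Tie/Socks rule can be checked with a few lines of Python. This is a minimal sketch: the helper names `support` and `confidence` are illustrative and not from the slides, and only the four transactions listed above are used.

```python
# The four example transactions (TX1..TX4) as sets of items.
transactions = [
    {"Shoes", "Socks", "Tie", "Belt"},
    {"Shoes", "Socks", "Tie", "Belt", "Shirt", "Hat"},
    {"Shoes", "Tie"},
    {"Shoes", "Socks", "Belt"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) = sup(A, B) / sup(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

s = support({"Tie", "Socks"}, transactions)
c = confidence({"Tie"}, {"Socks"}, transactions)
print(f"support = {s:.2%}, confidence = {c:.2%}")
```

In this particular data set Tie and Socks each occur in three transactions, so the confidence is the same in both directions of the rule.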

Rule Basic Measures
A → B [ s, c ]
Support: denotes the frequency of the rule within transactions. A high value means that the rule involves a large part of the database.
$$\mathrm{support}(A \rightarrow B) = P(A \cup B)$$
Here P(A ∪ B) is the probability that a transaction contains all the items of A and of B.
Confidence: denotes the percentage of transactions containing A which also contain B. It is an estimate of the conditional probability.
$$\mathrm{confidence}(A \rightarrow B) = P(B \mid A) = \frac{\mathrm{sup}(A, B)}{\mathrm{sup}(A)}$$

[Figure: set of transactions used to compute the support and confidence of the rule C => D]

Applications
“Baskets” = documents;
“items” = words in those documents.
Lets us find words that appear together unusually frequently, i.e., linked concepts.

        Word 1  Word 2  Word 3  Word 4
Doc 1   1       0       1       1
Doc 2   0       0       1       1
Doc 3   1       1       1       0

Word 4 => Word 3
When word 4 occurs in a document, there is a high probability of word 3 occurring.

Applications
“Baskets” = sentences;
“items” = documents containing those sentences.
Items that appear together too often could represent plagiarism.

         Doc 1  Doc 2  Doc 3  Doc 4
Sent 1   1      0      1      1
Sent 2   0      0      1      1
Sent 3   1      1      1      0

Doc 4 => Doc 3
When a sentence occurs in document 4, there is a high probability of it occurring in document 3.

Applications
“Baskets” = Web pages;
“items” = linked pages.
Pairs of pages with many common references may be about the same topic.
“Baskets” = Web pages p_i;
“items” = pages that link to p_i
Pages with many of the same links may be mirrors or about the same topic.
[Table: Web pages wp1, wp2, ... (rows) against wp a, wp b, wp c, wp d (columns)]

Example
Trans. Id   Purchased Items
1           A,D
2           A,C
3           A,B,C
4           B,E,F

Definitions:
Itemset: A,B or B,E,F
Support of an itemset: Sup(A,B)=1, Sup(A,C)=2
Frequent pattern: Given min. sup=2, {A,C} is a frequent pattern

For minimum support = 50% and minimum confidence = 50%, we have the following rules
A => C with 50% support and 66.7% confidence
C => A with 50% support and 100% confidence
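
As a rough illustration of how the two rules follow from the frequent pattern {A,C}, the sketch below splits an itemset into every antecedent/consequent pair and keeps the splits that reach the minimum confidence. The function names `sup_count` and `rules_from` are hypothetical helpers introduced here, not part of the slides.

```python
from itertools import combinations

# The four example transactions, keyed by transaction id.
transactions = {1: {"A", "D"}, 2: {"A", "C"}, 3: {"A", "B", "C"}, 4: {"B", "E", "F"}}

def sup_count(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(set(itemset) <= t for t in transactions.values())

def rules_from(itemset, min_conf):
    """Yield (antecedent, consequent, support, confidence) for each split of `itemset`."""
    n = len(transactions)
    for r in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), r):
            consequent = tuple(sorted(set(itemset) - set(antecedent)))
            conf = sup_count(itemset) / sup_count(antecedent)
            if conf >= min_conf:
                yield antecedent, consequent, sup_count(itemset) / n, conf

rules = list(rules_from({"A", "C"}, min_conf=0.5))
for a, c, s, conf in rules:
    print(f"{a} => {c}: support {s:.0%}, confidence {conf:.1%}")
```

The same loop works for larger frequent itemsets, where it enumerates all non-empty proper subsets as antecedents.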

[Figures from Randall Matignon 2007, Data Mining Using SAS Enterprise Miner, Wiley (book).]
The support probability of each rule is identified by the color of the symbols and the confidence probability of each rule is identified by the shape of the symbols. The items that have the highest confidence are Heineken beer, crackers, chicken, and peppers.

Mining Association Rules
What is Association rule mining
Apriori Algorithm
Measures of rule interestingness
Advanced Techniques

Boolean association rules
Each transaction is converted to a Boolean vector
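
The Boolean encoding can be sketched as follows, reusing the shoes/socks transactions from the earlier example. The helper `to_boolean_vector` and the fixed item order are assumptions made for illustration.

```python
# Column order of the Boolean vector (one position per known item).
items = ["Shoes", "Socks", "Tie", "Belt", "Shirt", "Scarf", "Hat"]

transactions = [
    ["Shoes", "Socks", "Tie", "Belt"],
    ["Shoes", "Socks", "Tie", "Belt", "Shirt", "Hat"],
    ["Shoes", "Tie"],
    ["Shoes", "Socks", "Belt"],
]

def to_boolean_vector(transaction, items):
    """Return a 0/1 list with a 1 wherever the item occurs in the transaction."""
    present = set(transaction)
    return [int(item in present) for item in items]

vectors = [to_boolean_vector(t, items) for t in transactions]
for tid, vector in enumerate(vectors, start=1):
    print(tid, vector)
```

With this representation the support count of an itemset is the number of rows whose selected columns are all 1.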

An Example
Min. support 50%
Transaction ID   Items Bought
2000             A,B,C
1000             A,C
...
Frequent Itemset   Support
...

For any itemset c contained in transaction t, the first item of c must be in t. At root, by hashing on every item in t, we ensure that we only ignore itemsets that start with an item not in t.

Generating AR from frequent itemsets

The rule play basketball → eat cereal [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
play basketball → not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence

Lift of a Rule
Lift (Correlation, Interest)
$$\mathrm{LIFT}(A \rightarrow B) = \frac{\mathrm{sup}(A, B)}{\mathrm{sup}(A) \cdot \mathrm{sup}(B)} = \frac{P(B \mid A)}{P(B)}$$
A and B are negatively correlated if the value is less than 1, independent if it equals 1, and positively correlated if it is greater than 1

X   1 1 1 1 0 0 0 0
Y   1 1 0 0 0 0 0 0
Z   0 1 1 1 1 1 1 1

rule    Support   Lift
X → Y   25%       2.00
X → Z   37.50%    0.86
Y → Z   12.50%    0.57
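
The support and lift values in the X/Y/Z table can be reproduced with the sketch below. The function names `sup` and `lift` are illustrative; the three 0/1 columns are copied from the table.

```python
# One 0/1 entry per transaction for each item, as in the table.
X = [1, 1, 1, 1, 0, 0, 0, 0]
Y = [1, 1, 0, 0, 0, 0, 0, 0]
Z = [0, 1, 1, 1, 1, 1, 1, 1]

def sup(*columns):
    """Fraction of transactions in which every given item is present."""
    rows = list(zip(*columns))
    return sum(all(row) for row in rows) / len(rows)

def lift(a, b):
    """lift(A -> B) = sup(A, B) / (sup(A) * sup(B))."""
    return sup(a, b) / (sup(a) * sup(b))

for name, a, b in [("X -> Y", X, Y), ("X -> Z", X, Z), ("Y -> Z", Y, Z)]:
    print(f"{name}: support {sup(a, b):.2%}, lift {lift(a, b):.2f}")
```

Note that X → Z has the highest support of the three rules but a lift below 1, which is the point the table makes.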

Lift of a Rule
Example 1 (cont)
play basketball → eat cereal [40%, 66.7%]
$$\mathrm{LIFT} = \frac{\frac{2000}{5000}}{\frac{3000}{5000} \times \frac{3750}{5000}} = 0.89$$
play basketball → not eat cereal [20%, 33.3%]
$$\mathrm{LIFT} = \frac{\frac{1000}{5000}}{\frac{3000}{5000} \times \frac{1250}{5000}} = 1.33$$

             basketball   not basketball   sum(row)
cereal       2000         1750             3750
not cereal   1000         250              1250
sum(col.)    3000         2000             5000

Problems With Lift
Rules that hold 100% of the time may not have the highest possible lift. For example, if 5% of people are Vietnam veterans and 90% of the people are more than 5 years old, we get a lift of 0.05/(0.05*0.9)=1.11 which is only slightly above 1 for the rule Vietnam veterans -> more than 5 years old.
And, lift is symmetric:
not eat cereal → play basketball [20%, 80%]
$$\mathrm{LIFT} = \frac{\frac{1000}{5000}}{\frac{1250}{5000} \times \frac{3000}{5000}} = 1.33$$
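
The lift figures for the basketball/cereal table, the symmetry of lift, and the veterans example can be checked numerically. This is a small sketch working from the raw counts in the contingency table; the function signature is an assumption chosen for convenience.

```python
def lift(count_a, count_b, count_ab, n):
    """lift(A -> B) computed from raw counts over n transactions."""
    return (count_ab / n) / ((count_a / n) * (count_b / n))

n = 5000
basketball, cereal, both = 3000, 3750, 2000
not_cereal = n - cereal                     # 1250 students
basketball_not_cereal = basketball - both   # 1000 students

lift_cereal = lift(basketball, cereal, both, n)
lift_not_cereal = lift(basketball, not_cereal, basketball_not_cereal, n)
# Swapping antecedent and consequent leaves lift unchanged.
lift_reversed = lift(not_cereal, basketball, basketball_not_cereal, n)
# A rule that always holds: every veteran is more than 5 years old.
lift_veterans = 0.05 / (0.05 * 0.9)

print(round(lift_cereal, 2), round(lift_not_cereal, 2),
      round(lift_reversed, 2), round(lift_veterans, 2))
```

Because the formula only multiplies sup(A) and sup(B), lift cannot distinguish A → B from B → A, unlike confidence.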

Conviction of a Rule
$$\mathrm{Conv}(A \rightarrow B) = \frac{\mathrm{sup}(A) \cdot \mathrm{sup}(\overline{B})}{\mathrm{sup}(A, \overline{B})} = \frac{P(A) \cdot P(\overline{B})}{P(A, \overline{B})} = \frac{P(A)\,(1 - P(B))}{P(A) - P(A, B)}$$
Conviction is a measure of the implication and has value 1 if items are unrelated.
play basketball → eat cereal [40%, 66.7%]
$$\mathrm{Conv} = \frac{\frac{3000}{5000}\left(1 - \frac{3750}{5000}\right)}{\frac{3000}{5000} - \frac{2000}{5000}} = 0.75$$
eat cereal → play basketball conv: 0.86
play basketball → not eat cereal [20%, 33.3%]
$$\mathrm{Conv} = \frac{\frac{3000}{5000}\left(1 - \frac{1250}{5000}\right)}{\frac{3000}{5000} - \frac{1000}{5000}} = 1.125$$
not eat cereal → play basketball conv: 2.00
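
Conviction for all four rule directions can be computed from the same contingency table. The sketch below assumes the formula Conv(A → B) = P(A)(1 − P(B)) / (P(A) − P(A, B)) and takes raw counts as input; the function signature and the dictionary of rule labels are illustrative choices.

```python
def conviction(count_a, count_b, count_ab, n):
    """Conv(A -> B) = P(A) * (1 - P(B)) / (P(A) - P(A, B)), from raw counts.

    Raises ZeroDivisionError when the rule never fails (A never occurs without B).
    """
    p_a, p_b, p_ab = count_a / n, count_b / n, count_ab / n
    return p_a * (1 - p_b) / (p_a - p_ab)

n = 5000
basketball, cereal, not_cereal = 3000, 3750, 1250
both, basketball_not_cereal = 2000, 1000

conv = {
    "play basketball -> eat cereal": conviction(basketball, cereal, both, n),
    "eat cereal -> play basketball": conviction(cereal, basketball, both, n),
    "play basketball -> not eat cereal": conviction(basketball, not_cereal, basketball_not_cereal, n),
    "not eat cereal -> play basketball": conviction(not_cereal, basketball, basketball_not_cereal, n),
}
for rule, value in conv.items():
    print(f"{rule}: {value:.3f}")
```

Unlike lift, conviction is not symmetric, so each direction of a rule gets its own value.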

Conviction
conviction of X=>Y can be interpreted as the ratio of the expected frequency that X occurs without Y (that is to say, the frequency that the rule makes an incorrect prediction) if X and Y were independent divided by the observed frequency of incorrect predictions.
A conviction value of 1.2 shows that the rule would be incorrect 20% more often (1.2 times as often) if the association between X and Y was purely random chance.

Leverage of a Rule
Leverage or Piatetsky-Shapiro
$$PS(A \rightarrow B) = \mathrm{sup}(A, B) - \mathrm{sup}(A) \cdot \mathrm{sup}(B)$$
PS (or Leverage):
is the proportion of additional elements covered by both the premise and consequence above the expected if independent.

Coverage of a Rule
$$\mathrm{coverage}(A \rightarrow B) = \mathrm{sup}(A)$$
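
As a worked illustration of leverage and coverage, the two measures can be applied to the rule play basketball → eat cereal from the earlier basketball/cereal table (2000 of 5000 students do both, 3000 play basketball, 3750 eat cereal). The choice of this rule as the example is ours, not the slides'.

```latex
\begin{align*}
PS(\text{play basketball} \rightarrow \text{eat cereal})
  &= \frac{2000}{5000} - \frac{3000}{5000} \cdot \frac{3750}{5000}
   = 0.40 - 0.45 = -0.05 \\
\mathrm{coverage}(\text{play basketball} \rightarrow \text{eat cereal})
  &= \frac{3000}{5000} = 0.60
\end{align*}
```

The negative leverage agrees with the lift of 0.89 for the same rule: the two items occur together less often than independence would predict.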

Comment
Traditional methods such as database queries:
support hypothesis verification about a relationship such as the co-occurrence of diapers & beer.
Data Mining methods automatically discover significant association rules from data.
Find whatever patterns exist in the database, without the user having to specify in advance what to look for (data driven).
Therefore allow finding unexpected correlations
APRIORI EXTENSIONS

Challenges of Frequent Pattern Mining
Challenges
Multiple scans of transaction database
Huge number of candidates
Tedious workload of support counting for candidates
Improving Apriori: general ideas
Reduce number of transaction database scans
Shrink number of candidates
Facilitate support counting of candidates

Improving Apriori’s Efficiency
Problem with Apriori: every pass goes over whole data.
AprioriTID: Generates candidates as Apriori but DB is used for counting support only on the first pass.
Needs much more memory than Apriori
Builds a storage set C^k that stores in memory the frequent sets per transaction
AprioriHybrid: Use Apriori in initial passes; Estimate the size of C^k; Switch to AprioriTid when C^k is expected to fit in memory
The switch takes time, but it is still better in most cases

Database
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

C^1
TID   Set-of-itemsets
100   { {1},{3},{4} }
200   { {2},{3},{5} }
300   { {1},{2},{3},{5} }
400   { {2},{5} }

L1
Itemset   Support
{1}       2
{2}       3
{3}       3
{5}       3

C2
itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}

C^2
TID   Set-of-itemsets
100   { {1 3} }
200   { {2 3},{2 5},{3 5} }
300   { {1 2},{1 3},{1 5},{2 3},{2 5},{3 5} }
400   { {2 5} }

L2
Itemset   Support
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

C3
itemset
{2 3 5}

C^3
TID   Set-of-itemsets
200   { {2 3 5} }
300   { {2 3 5} }

L3
Itemset   Support
{2 3 5}   2
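
The tables above can be reproduced with a compact sketch in the spirit of AprioriTid: after the first pass, support is counted from the per-transaction candidate sets (C^k) rather than from the raw database. This is an illustrative simplification, not the authors' implementation; the function names and the minimum support count of 2 (implied by {4} being dropped from L1) are assumptions.

```python
from itertools import combinations

def apriori_gen(prev_frequent):
    """Join + prune: build k-candidates from frequent (k-1)-itemsets (sorted tuples)."""
    prev = set(prev_frequent)
    candidates = set()
    for a in prev:
        for b in prev:
            # Join two itemsets that share their first k-2 items.
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                c = a + (b[-1],)
                # Prune: every (k-1)-subset must itself be frequent.
                if all(s in prev for s in combinations(c, len(c) - 1)):
                    candidates.add(c)
    return candidates

def count(c_bar):
    """Support counts of all itemsets stored in the per-transaction sets."""
    counts = {}
    for entry in c_bar.values():
        for c in entry:
            counts[c] = counts.get(c, 0) + 1
    return counts

def apriori_tid(db, min_count):
    # C^1: the only pass over the raw database.
    c_bar = {tid: {(i,) for i in items} for tid, items in db.items()}
    frequent = {c: n for c, n in count(c_bar).items() if n >= min_count}
    levels = []
    while frequent:
        levels.append(frequent)
        candidates = apriori_gen(frequent)
        new_c_bar = {}
        for tid, entry in c_bar.items():
            # c is in the transaction if its two generating (k-1)-subsets were.
            present = {c for c in candidates
                       if c[:-1] in entry and c[:-2] + c[-1:] in entry}
            if present:
                new_c_bar[tid] = present
        c_bar = new_c_bar
        frequent = {c: n for c, n in count(c_bar).items() if n >= min_count}
    return levels

db = {100: [1, 3, 4], 200: [2, 3, 5], 300: [1, 2, 3, 5], 400: [2, 5]}
levels = apriori_tid(db, min_count=2)
for k, level in enumerate(levels, start=1):
    print(f"L{k}:", dict(sorted(level.items())))
```

Transactions that contain no candidate drop out of `c_bar`, which is why C^3 only holds entries for TIDs 200 and 300.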

Improving Apriori’s Efficiency
Transaction reduction: A transaction that does not contain any frequent k-itemset is useless in subsequent scans
Sampling: mining on a subset of given data.
The sample should fit in memory
Use lower support threshold to reduce the probability of missing some itemsets.
The rest of the DB is used to determine the actual itemset count.

Improving Apriori’s Efficiency
Partitioning: Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB (2 DB scans)
(support in a partition is lowered to be proportional to the number of elements in the partition)
Phase I
Trans. in D → Divide D into n non-overlapping partitions → Find frequent itemsets local to each partition (parallel alg.)
Phase II
Combine results to form a global set of candidate itemsets → Find global frequent itemsets among candidates → Freq. itemsets in D
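
The two phases of partitioning can be sketched as below. To keep the example short, the local mining step is a brute-force enumeration (an assumption for illustration; a real implementation would run Apriori or similar per partition), and the function names and the A..F example data reused from the earlier definitions slide are our own choices.

```python
from itertools import combinations
from math import ceil

def local_frequent(transactions, min_frac):
    """Brute-force frequent itemsets of one partition (fine for tiny examples)."""
    min_count = ceil(min_frac * len(transactions))  # threshold scaled to partition size
    counts = {}
    for t in transactions:
        for k in range(1, len(t) + 1):
            for itemset in combinations(sorted(t), k):
                counts[itemset] = counts.get(itemset, 0) + 1
    return {s for s, n in counts.items() if n >= min_count}

def partition_mine(transactions, n_parts, min_frac):
    size = ceil(len(transactions) / n_parts)
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]
    # Phase I (first scan): mine each partition independently.
    candidates = set().union(*(local_frequent(p, min_frac) for p in parts))
    # Phase II (second scan): count the merged candidates over the whole database.
    min_count = ceil(min_frac * len(transactions))
    result = {}
    for c in candidates:
        n = sum(1 for t in transactions if set(c) <= t)
        if n >= min_count:
            result[c] = n
    return result

transactions = [{"A", "D"}, {"A", "C"}, {"A", "B", "C"}, {"B", "E", "F"}]
result = partition_mine(transactions, n_parts=2, min_frac=0.5)
print(dict(sorted(result.items())))
```

Every globally frequent itemset is guaranteed to appear among the candidates, because an itemset that is infrequent in every partition cannot reach the global threshold.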

Improving Apriori’s Efficiency
Dynamic itemset counting: partitions the DB into several blocks each marked by a start point.
At each start point, DIC estimates the support of all itemsets that are currently counted and adds new itemsets to the set of candidate itemsets if all its subsets are estimated to be frequent.
If DIC adds all frequent itemsets to the set of candidate itemsets during the first scan, it will have counted each itemset’s exact support at some point during the second scan; thus DIC can complete in two scans.

Association Rules Visualization
[Figure] The coloured column indicates the association rule B→C. Different icon colours are used to show different metadata values of the association rule.

Association Rules Visualization
[Figure] Size of ball equates to total support. Height equates to confidence.

Association Rules Visualization - Ball graph
[Figure]

The Ball graph Explained
A ball graph consists of a set of nodes and arrows. All the nodes are yellow, green or blue. The blue nodes are active nodes representing the items in the rule in which the user is interested. The yellow nodes are passive representing items related to the active nodes in some way. The green nodes merely assist in visualizing two or more items in either the head or the body of the rule.
A circular node represents a frequent (large) data item. The volume of the ball represents the support of the item. Only those items which occur sufficiently frequently are shown.
An arrow between two nodes represents the rule implication between the two items. An arrow will be drawn only when the support of a rule is no less than the minimum support.

Mining Association Rules
What is Association rule mining
Apriori Algorithm
FP-tree Algorithm
Additional Measures of rule interestingness
Advanced Techniques

Multiple-Level Association Rules
FoodStuff
Frozen, Refrigerated, Fresh, Bakery, Etc...
Vegetable, Fruit, Dairy, Etc...
Banana, Apple, Orange, Etc...
Fresh → Bakery [20%, 60%]
Dairy → Bread [6%, 50%]
Fruit → Bread [1%, 50%] is not valid
Items often form hierarchy.
Flexible support settings: Items at the lower level are expected to have lower support.
Transaction database can be encoded based on dimensions and levels
explore shared multi-level mining

Multi-Dimensional Association Rules
Single-dimensional rules:
buys(X, “milk”) → buys(X, “bread”)
Multi-dimensional rules: ≥ 2 dimensions or predicates
Inter-dimension association rules (no repeated predicates)
age(X,”19-25”) ∧ occupation(X,“student”) → buys(X,“coke”)
hybrid-dimension association rules (repeated predicates)

Wal-Mart knows that customers who buy Barbie dolls (it sells one every 20 seconds) have a 60% likelihood of buying one of three types of candy bars. What does Wal-Mart do with information like that?
'I don't have a clue,' says Wal-Mart's chief of merchandising, Lee Scott.
See - KDnuggets 98:01 for many ideas www.kdnuggets.com/news/98/n01.html

Some Suggestions
By increasing the price of the Barbie doll and giving the type of candy bar free, Wal-Mart can reinforce the buying habits of that particular type of buyer.
Highest margin candy to be placed near dolls.
Special promotions for Barbie dolls with candy at a slightly higher margin.
Take a poorly selling product X and incorporate an offer on this which is based on buying Barbie and Candy. If the customer is likely to buy these two products anyway then why not try to increase sales on X?
Probably they can not only bundle candy of type A with Barbie dolls, but can also introduce new candy of Type N in this bundle while offering a discount on the whole bundle. As the bundle is going to sell because of Barbie dolls & candy of type A, candy of type N can get a free ride to customers' houses. And with the fact that you like something if you see it often, Candy of type N can become popular.

References
Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”, 2000
Vipin Kumar and Mahesh Joshi, “Tutorial on High Performance Data Mining”, 1999
Rakesh Agrawal, Ramakrishnan Srikant, “Fast Algorithms for Mining Association Rules”, Proc VLDB, 1994 (http://www.cs.tau.ac.il/~fiat/dmsem03/Fast%20Algorithms%20for%20Mining%20Association%20Rules.ppt)
Alípio Jorge, “selecção de regras: medidas de interesse e meta queries”,