Association Rule Mining & Apriori Algorithm
Binu Jasim
Data Mining (Monsoon Sem-2014)
NIT Calicut
Association Analysis
To find associations between items/objects
Market Basket Analysis
- One application of association analysis
Other applications: bioinformatics, medical diagnosis, etc.
Two key Challenges
Efficiently mine enormous amounts of data for association patterns
Associations occurring due to chance should be avoided
What is in a Super Market?
Items/Products - lots of them!
- Denoted as I
Transactions (T) - people buying different items
A single transaction/bill contains the list of items customer i bought
- Denoted as tr_i
Transaction Data
Items (I): the set of all items
I = {i_1, i_2, ..., i_d}
Transactions:
T = {tr_1, tr_2, ..., tr_N}
Each transaction is a subset of the items:
tr_j ⊆ I
i1 i2 i3 i4 i5
tr1 0 0 1 0 0
tr2 1 1 0 0 0
tr3 0 0 0 1 0
tr4 0 0 0 1 1
tr5 1 1 1 1 0
Item Sets
An item set is a collection of zero or more items
A k-item set contains k items
tr_j contains an item set X
if X ⊆ tr_j
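The containment test X ⊆ tr_j is just a subset check. A minimal Python sketch, writing the five transactions from the binary table above as sets:

```python
# The five transactions from the table, written as item sets.
transactions = [
    {"i3"},                    # tr1
    {"i1", "i2"},              # tr2
    {"i4"},                    # tr3
    {"i4", "i5"},              # tr4
    {"i1", "i2", "i3", "i4"},  # tr5
]

# tr_j contains item set X iff X is a subset of tr_j.
X = {"i1", "i2"}
containing = [j for j, tr in enumerate(transactions, start=1) if X <= tr]
print(containing)  # [2, 5] -> tr2 and tr5 contain {i1, i2}
```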
Document Collection as Transaction Data
Treat words as Items & Each document as a transaction
     Word_1 Word_2 ...        Word_r
D1   0      1      0  0  0  1
D2   1      1      0  0  0  0
D3   0      0      0  1  0  0
:    :      :
Images as Transaction Data
Each Pixel as an Item and an Image as a Transaction
So an image of the letter "A" may be represented as the bit string 0111101010100111 - for a 4x4 image resolution
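Flattening the pixel grid row by row yields that bit string. A small sketch (the 4x4 letter-"A" pixel pattern below is reconstructed from the string above):

```python
# 4x4 binary image (1 = foreground pixel); rows spell 0111 1010 1010 0111.
image = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
]

# Flatten row by row into the transaction's bit representation.
bits = "".join(str(px) for row in image for px in row)
print(bits)  # 0111101010100111

# Equivalently, the transaction is the item set of 1-pixel positions.
itemset = {pos for pos, b in enumerate(bits) if b == "1"}
```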
Associations
{A} -> {B} indicates that item B is also bought if item A is bought
{A, B} -> {C} indicates that C is bought if items A & B are bought together
So we have associations of the form X -> Y
where X ∩ Y = ∅
Total # Associations
If we have d items,
then total # associations = 3^d - 2^(d+1) + 1
E.g.: item set {A, B}
Possible associations: {A} -> {B}
& {B} -> {A}
d = 2, so 3^2 - 2^3 + 1 = 2
Total # Associations
I = {A, B, C} d = 3
Total # Associations = 3^3 - 2^4 + 1 = 12
{A} -> {B, C} {B, C} -> {A}
{B} -> {A, C} {A, C} -> {B}
{C} -> {A, B} {A, B}-> {C}
{A}->{B} {B}-> {A}
{A}->{C} {C}-> {A}
{B}->{C} {C}->{B}
Proof !
Each of the d items can go into one of 3 boxes: antecedent, consequent, or neither -> 3^d assignments
Antecedent and consequent can't be empty:
subtract the 2^d assignments with an empty antecedent and the 2^d with an empty consequent,
then add back 1 for the case where both are empty (subtracted twice)
Which gives 3^d - 2^d - 2^d + 1 = 3^d - 2^(d+1) + 1
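The counting argument can be checked by brute-force enumeration. A small sketch (item labels are arbitrary):

```python
from itertools import chain, combinations

def num_rules(d):
    """Count rules X -> Y over d items with X, Y non-empty and disjoint."""
    items = list(range(d))

    def nonempty_subsets(s):
        return chain.from_iterable(combinations(s, r) for r in range(1, len(s) + 1))

    count = 0
    for X in nonempty_subsets(items):
        # Consequent Y is any non-empty subset of the remaining items.
        rest = [i for i in items if i not in X]
        count += sum(1 for _ in nonempty_subsets(rest))
    return count

# Matches 3^d - 2^(d+1) + 1 for several d.
for d in (2, 3, 4, 5):
    assert num_rules(d) == 3**d - 2**(d + 1) + 1
print(num_rules(3))  # 12, matching the {A, B, C} example
```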
Support Count (σ)
The support count of an item set X is given as
σ(X) = |{tr_j : X ⊆ tr_j}|
E.g.:
σ({A, B}) = 2,  σ({C, D}) = 0
A B C D
tr1 1 1 0 1
tr2 1 1 1 0
Support & Confidence
We have the association X -> Y s.t. X ∩ Y = ∅
Support(X -> Y) = σ(X ∪ Y) / N
Confidence(X -> Y) = σ(X ∪ Y) / σ(X)
Support & Confidence Thresholds
R: {C, D} -> {E} (an association rule)
Support(R) = σ({C, D, E}) / N = 3/6 = 0.5 > minsup
Confidence(R) = σ({C, D, E}) / σ({C, D}) = 3/4 = 0.75 > minconf
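As a sketch, both measures can be computed directly from the definitions. The six transactions below are hypothetical, chosen only to reproduce the counts in this example (σ({C,D,E}) = 3, σ({C,D}) = 4); the slide's original table is not shown here:

```python
def sigma(X, transactions):
    """Support count: number of transactions containing item set X."""
    return sum(1 for tr in transactions if X <= tr)

def support(X, Y, transactions):
    return sigma(X | Y, transactions) / len(transactions)

def confidence(X, Y, transactions):
    return sigma(X | Y, transactions) / sigma(X, transactions)

# Hypothetical data matching the stated counts, not the original table.
T = [{"C", "D", "E"}, {"C", "D", "E"}, {"C", "D", "E"},
     {"C", "D"}, {"A"}, {"B"}]

print(support({"C", "D"}, {"E"}, T))     # 0.5
print(confidence({"C", "D"}, {"E"}, T))  # 0.75
```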
Why large support?
Items people seldom buy can still have large confidence. E.g.: {1GB usb} -> {headset}
- may give large confidence since the support count of {1GB usb} is small
- but the item set {1GB usb, headset} occurs together rarely, so support is small
Why large confidence ?
Confidence is a better measure than support to indicate how often items are bought together.
Confidence(X -> Y) also gives an estimate of the conditional probability of Y given X, P(Y | X)
Caution! Correlation doesn't imply causation
Just because a rule X->Y has large support and large confidence,
X need not be the cause of Y.
It only implies correlation
Association Rule Mining Problem
Given a set of transactions T, find all the rules having
support > minsup & confidence > minconf
E.g.: minsup = 20%, minconf = 50%
Obvious choice: minconf > minsup (confidence ≥ support always, since σ(X) ≤ N)
Association Rule Mining
Brute force: list all the rules and check each against minsup & minconf
- Computationally expensive: O(3^d) rules
Better strategy: check the support threshold over all item subsets first
- Still exponential: O(2^d) subsets
The Idea
If {Milk, Bread, Butter} is frequent,
then all of its subsets are also frequent
i.e. if support = σ({Milk, Bread, Butter})/N > minsup
then σ({Milk, Bread})/N > minsup
(since σ({Milk, Bread}) ≥ σ({Milk, Bread, Butter}))
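This anti-monotonicity can be checked directly: every transaction containing an item set also contains all of its subsets, so no subset can be rarer. A sketch with illustrative transactions (not from the slides):

```python
from itertools import combinations

def sigma(X, transactions):
    """Support count of item set X."""
    return sum(1 for tr in transactions if X <= tr)

# Illustrative transactions.
T = [{"Milk", "Bread", "Butter"}, {"Milk", "Bread"},
     {"Milk", "Bread", "Butter", "Eggs"}, {"Bread"}, {"Milk", "Butter"}]

S = {"Milk", "Bread", "Butter"}
for r in range(1, len(S)):
    for sub in combinations(S, r):
        # Any transaction containing S contains the subset too.
        assert sigma(set(sub), T) >= sigma(S, T)
print(sigma(S, T))  # 2 (tr1 and tr3 contain all three items)
```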
Decomposed into 2 Sub Tasks
Frequent Item Set Generation: find all the item sets that satisfy the minsup threshold
Rule Generation: extract all the high-confidence rules from the frequent item sets found in step 1
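Sub-task 1 is what the Apriori algorithm does level by level: build candidate k-item sets from the frequent (k-1)-item sets, prune any candidate with an infrequent subset (using the idea above), then count support. A minimal sketch (the transactions and the minsup count are illustrative):

```python
from itertools import combinations

def apriori_frequent(transactions, minsup_count):
    """Level-wise frequent item set generation (sub-task 1 only)."""
    items = sorted({i for tr in transactions for i in tr})

    def count(X):
        return sum(1 for tr in transactions if X <= tr)

    frequent = []
    # Level 1: frequent 1-item sets.
    level = [frozenset([i]) for i in items
             if count(frozenset([i])) >= minsup_count]
    k = 1
    while level:
        frequent.extend(level)
        k += 1
        prev = set(level)
        # Candidate k-item sets: unions of frequent (k-1)-item sets...
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # ...kept only if all (k-1)-subsets are frequent (Apriori pruning).
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev
                             for s in combinations(c, k - 1))}
        level = [c for c in candidates if count(c) >= minsup_count]
    return frequent

T = [{"Milk", "Bread"}, {"Milk", "Bread", "Butter"},
     {"Bread", "Butter"}, {"Milk", "Bread", "Butter"}]
F = apriori_frequent(T, 2)
print(len(F))  # 7 frequent item sets, including {Milk, Bread, Butter}
```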