 Association Rule Mining & Apriori Algorithm Binu Jasim Data Mining (Monsoon Sem-2014) NIT Calicut

Association Rule Mining

Oct 08, 2015

Transcript
  • Association Rule Mining &

    Apriori Algorithm

    Binu Jasim Data Mining (Monsoon Sem-2014)

    NIT Calicut

  • Association Analysis

    To find associations between items/objects

  • Market Basket Analysis

    One Application of Association Analysis

    Others: bioinformatics, medical diagnosis, etc.

  • Two key Challenges

    Efficiently mine enormous amounts of data for association patterns

    Associations occurring due to chance should be avoided

  • What is in a Supermarket?

    Items/Products: lots of them!

    - Denoted as I

    Transactions (T) - people buying different items

    A single transaction (a bill) contains the list of items customer i bought

    - Denoted as tri

  • Transaction Data

    Items (I): The Set of all items

    I = { i1, i2, ..., id }

    Transactions

    T = { tr1, tr2, ..., trN }

  • Transactions

    tr ⊆ I

    i1 i2 i3 i4 i5

    tr1 0 0 1 0 0

    tr2 1 1 0 0 0

    tr3 0 0 0 1 0

    tr4 0 0 0 1 1

    tr5 1 1 1 1 0

  • Item Sets

    An item set is a collection of zero or more items

    A k-item set contains k items

    A transaction trj contains an item set X

    if X ⊆ trj
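As a sketch, the containment test X ⊆ trj maps directly onto Python's set subset operator. The transaction contents below follow the 5-item table two slides up; the dictionary layout is otherwise illustrative:

```python
# Transactions from the 5x5 table above, encoded as Python sets of items.
transactions = {
    "tr1": {"i3"},
    "tr2": {"i1", "i2"},
    "tr3": {"i4"},
    "tr4": {"i4", "i5"},
    "tr5": {"i1", "i2", "i3", "i4"},
}

def contains(tr, X):
    """trj contains item set X iff X is a subset of trj."""
    return X <= tr  # Python's set subset test

print(contains(transactions["tr5"], {"i1", "i3"}))  # -> True
print(contains(transactions["tr1"], {"i1"}))        # -> False
```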

  • Document Collection as Transaction Data

    Treat words as Items & Each document as a transaction

         Word1 Word2 ... Wordr

    D1   0 1 0 0 0 1

    D2   1 1 0 0 0 0

    D3   0 0 0 1 0 0

    :    : : :

  • Images as Transaction Data

    Each Pixel as an Item and an Image as a Transaction

    So the letter "A" may be represented as 0111101010100111 - for a 4x4 image resolution
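A minimal sketch of this encoding: the slide's bit string becomes a transaction whose items are the indices of the "on" pixels (the row-major index scheme is an assumption):

```python
# The slide's 4x4 bit string for "A"; items are indices of pixels set to 1.
bits = "0111101010100111"
transaction = {i for i, b in enumerate(bits) if b == "1"}
print(sorted(transaction))  # -> [1, 2, 3, 4, 6, 8, 10, 13, 14, 15]
```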

  • Associations

    {A} -> {B} indicates Item B is also bought if Item A is bought

    {A,B} -> {C} indicates C is bought if Items A & B are bought together

    So we have associations of the form X -> Y

    where X ∩ Y = ∅

  • Total # Associations

    If we have d items

    Then total # Associations = 3^d - 2^(d+1) + 1

    Eg:- Item set {A,B}

    Possible Associations: {A} -> {B}

    & {B} -> {A}

    d = 2, so 3^2 - 2^3 + 1 = 2

  • Total # Associations

    I = {A, B, C} d = 3

    Total # Associations = 3^3 - 2^4 + 1 = 12

    {A} -> {B, C} {B, C} -> {A}

    {B} -> {A, C} {A, C} -> {B}

    {C} -> {A, B} {A, B}-> {C}

    {A}->{B} {B}-> {A}

    {A}->{C} {C}-> {A}

    {B}->{C} {C}->{B}

  • Proof !

    Each item can go into one of 3 boxes: antecedent, consequent, or neither

    But the antecedent and consequent can't be empty

    By inclusion-exclusion: 3^d total assignments, minus 2^d with an empty antecedent, minus 2^d with an empty consequent, plus 1 for the both-empty case (subtracted twice)

    Which gives 3^d - 2^(d+1) + 1
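The closed form can be checked against a brute-force enumeration of disjoint, non-empty antecedent/consequent pairs. A sketch (function names are illustrative):

```python
from itertools import chain, combinations

def rule_count(d):
    # Closed form from the slide: 3^d - 2^(d+1) + 1
    return 3 ** d - 2 ** (d + 1) + 1

def brute_force_count(d):
    # Count ordered pairs (X, Y) of disjoint, non-empty subsets of d items.
    items = range(d)
    subsets = list(chain.from_iterable(combinations(items, k)
                                       for k in range(1, d + 1)))
    return sum(1 for X in subsets for Y in subsets
               if not set(X) & set(Y))

for d in (2, 3, 4):
    assert rule_count(d) == brute_force_count(d)
print(rule_count(3))  # -> 12, matching the enumeration on the earlier slide
```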

  • Support Count (σ)

    Support count of an item set X is given as

    σ(X) = |{ trj : X ⊆ trj }|

    Eg:-

    σ({A,B}) = 2,  σ({C,D}) = 0

         A B C D

    tr1  1 1 0 1

    tr2  1 1 1 0
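A sketch of the support count σ on this slide's two-transaction example:

```python
# The two transactions from the table (items A..D), encoded as sets.
transactions = [
    {"A", "B", "D"},  # tr1: 1 1 0 1
    {"A", "B", "C"},  # tr2: 1 1 1 0
]

def sigma(X):
    """Support count: number of transactions containing item set X."""
    return sum(1 for tr in transactions if X <= tr)

print(sigma({"A", "B"}))  # -> 2
print(sigma({"C", "D"}))  # -> 0
```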

  • Support & Confidence

    We have the association X -> Y s.t. X ∩ Y = ∅

    Support(X->Y) = σ(X ∪ Y) / N

    Confidence(X->Y) = σ(X ∪ Y) / σ(X)
  • Support & Confidence Thresholds

    R: {C,D} -> {E} (Association Rule)

    Support(R) = 3/6 = 0.5 > minsup

    Confidence(R) = 3/4 = 0.75 > minconf
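The transaction table behind this example is not reproduced in the transcript; the sketch below uses six hypothetical transactions chosen only so that σ({C,D,E}) = 3 and σ({C,D}) = 4, which reproduce the slide's numbers:

```python
# Hypothetical data: 6 transactions consistent with the slide's figures.
transactions = [
    {"C", "D", "E"},
    {"C", "D", "E"},
    {"C", "D", "E"},
    {"C", "D"},
    {"A"},
    {"B"},
]
N = len(transactions)

def sigma(X):
    return sum(1 for tr in transactions if X <= tr)

def support(X, Y):
    return sigma(X | Y) / N         # sigma(X union Y) / N

def confidence(X, Y):
    return sigma(X | Y) / sigma(X)  # sigma(X union Y) / sigma(X)

X, Y = {"C", "D"}, {"E"}
print(support(X, Y))     # -> 0.5
print(confidence(X, Y))  # -> 0.75
```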

  • Why large support?

    Items people seldom buy can still have large confidence. Eg:- {1GB USB} -> {headset}

    - may give large confidence, because the support count of {1GB USB} is small (a few co-purchases are enough)

    - But the transaction {1GB USB, headset} is rare, so the rule has small support

  • Why large confidence ?

    Confidence is a better measure than support to indicate how often items are bought together.

    Confidence(X->Y) also gives an estimate of the conditional probability of Y given X

  • Caution!: Correlation doesn't imply Causation

    Just because a rule X->Y has large support and large confidence,

    X need not be the cause of Y.

    It only implies correlation

  • Association Rule Mining Problem

    Given a set of transactions T, find all the rules having

    support > minsup & confidence > minconf

    Eg:- minsup = 20%, minconf = 50%

    Obvious: minconf > minsup (since Confidence(R) ≥ Support(R) for every rule R)

  • Association Rule Mining

    Brute Force: list all the rules and compute support & confidence for each

    - Computationally expensive: there are O(3^d) rules

    Better Strategy: check the support of all subsets of I first

    - Still exponential: there are O(2^d) subsets

  • The Idea

    If {Milk, Bread, Butter} is frequent

    then all of its subsets are also frequent

    i.e. if support = σ({Milk,Bread,Butter})/N > minsup

    then σ({Milk,Bread})/N > minsup

  • Decomposed into 2 Sub Tasks

    Frequent Item Set Generation: Find all the item sets which satisfy minsup threshold

    Rule Generation: Extract all the high-confidence rules from the frequent item sets found in step 1
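A minimal level-wise (Apriori-style) sketch of the first sub-task, using the anti-monotone idea from the previous slide: a k-item set can only be frequent if all of its (k-1)-subsets are frequent. The transactions and minsup value are illustrative, not from the slides:

```python
from itertools import combinations

# Illustrative transaction data and threshold (not from the slides).
transactions = [
    {"Milk", "Bread", "Butter"},
    {"Milk", "Bread"},
    {"Milk", "Butter"},
    {"Bread"},
]
N = len(transactions)
minsup = 0.5

def sigma(X):
    # Support count: number of transactions containing item set X.
    return sum(1 for tr in transactions if X <= tr)

# Level 1: frequent single items.
items = sorted(set().union(*transactions))
frequent = {}  # frozenset -> support
level = [frozenset([i]) for i in items if sigma({i}) / N >= minsup]
frequent.update({X: sigma(X) / N for X in level})

k = 2
while level:
    # Join frequent (k-1)-sets into k-set candidates, then prune any
    # candidate with an infrequent (k-1)-subset (anti-monotone property).
    candidates = {a | b for a in level for b in level if len(a | b) == k}
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent
                         for s in combinations(c, k - 1))}
    level = [c for c in candidates if sigma(c) / N >= minsup]
    frequent.update({X: sigma(X) / N for X in level})
    k += 1

print(sorted(tuple(sorted(X)) for X in frequent))
```

Step 2 (rule generation) would then, for each frequent item set, enumerate the splits X -> Y and keep those whose confidence σ(X ∪ Y)/σ(X) exceeds minconf.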