Introduction to Association Analysis
Zhangxi Lin, ISQS 3358, Texas Tech University

Transcript
Page 1

Introduction to Association Analysis

Zhangxi Lin

ISQS 3358

Texas Tech University

Page 2

Outline

• Basic concepts
• Itemset generation - Apriori principle
• Association rule discovery and generation
• Evaluation of association patterns
• Sequential pattern analysis

Page 3

Basic Concepts

Page 4

Association Rule Mining

Given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-Basket transactions

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules

{Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!

Page 5

Definition: Frequent Itemset

Itemset
  A collection of one or more items. Example: {Milk, Bread, Diaper}

k-itemset
  An itemset that contains k items

Support count (σ)
  Frequency of occurrence of an itemset. E.g. σ({Milk, Bread, Diaper}) = 2

Support (s)
  Fraction of transactions that contain an itemset. E.g. s({Milk, Bread, Diaper}) = 2/5

Frequent itemset
  An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
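To make these definitions concrete, here is a minimal Python sketch (illustrative, not part of the original slides) that computes the support count σ(X) and support s(X) over the five sample transactions:

# Minimal sketch: support count and support over the sample transactions
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): the number of transactions that contain every item in X."""
    return sum(1 for t in transactions if itemset <= t)

X = {"Milk", "Bread", "Diaper"}
print(support_count(X, transactions))                      # 2
print(support_count(X, transactions) / len(transactions))  # 0.4, i.e. s(X) = 2/5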

Page 6

Definition: Association Rule

Association Rule
  An implication expression of the form X → Y, where X and Y are itemsets.
  Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics
  Support (s): the fraction of transactions that contain both X and Y
  Confidence (c): how often items in Y appear in transactions that contain X

Example: for {Milk, Diaper} → {Beer},

  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4

  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Page 7

Example (CKG = checking account, SVG = savings account):

Count(CKG, SVG) = 1, so Support = 1/5 = 20%
Count(CKG) = 3, so Confidence(CKG → SVG) = 1/3 = 0.33
Count(~CKG) = 2 and Count(~CKG, SVG) = 2, so Confidence(~CKG → SVG) = 2/2 = 100%

Page 8

Formal Definitions

Support:     s(X → Y) = σ(X ∪ Y) / N

Confidence:  c(X → Y) = σ(X ∪ Y) / σ(X)

where N is the total number of transactions.
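Continuing the earlier sketch (reusing support_count() and transactions from above), these formal definitions applied to {Milk, Diaper} → {Beer} reproduce the numbers from the previous slides:

def rule_metrics(X, Y, transactions):
    """s(X -> Y) = sigma(X ∪ Y) / N and c(X -> Y) = sigma(X ∪ Y) / sigma(X)."""
    N = len(transactions)
    s = support_count(X | Y, transactions) / N
    c = support_count(X | Y, transactions) / support_count(X, transactions)
    return s, c

s, c = rule_metrics({"Milk", "Diaper"}, {"Beer"}, transactions)
print(f"s = {s:.2f}, c = {c:.2f}")  # s = 0.40, c = 0.67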

Page 9

Itemset generation - Apriori principle

Page 10

Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
  • support ≥ minsup threshold
  • confidence ≥ minconf threshold

Brute-force approach:
  • List all possible association rules
  • Compute the support and confidence for each rule
  • Prune rules that fail the minsup and minconf thresholds

Computationally prohibitive!
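To see why, here is a small illustrative brute-force enumerator. For d items the number of possible rules grows as 3^d - 2^(d+1) + 1, so listing and scoring every rule quickly becomes infeasible:

from itertools import combinations

def all_rules(items):
    """Every rule X -> Y with X and Y disjoint, non-empty itemsets."""
    items = list(items)
    rules = []
    for r in range(1, len(items) + 1):
        for antecedent in combinations(items, r):
            rest = [i for i in items if i not in antecedent]
            for k in range(1, len(rest) + 1):
                for consequent in combinations(rest, k):
                    rules.append((set(antecedent), set(consequent)))
    return rules

for d in range(2, 7):
    print(d, len(all_rules(range(d))))  # 2, 12, 50, 180, 602 candidate rules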

Page 11

Mining Association Rules

Example of rules:

{Milk, Diaper} → {Beer}   (s = 0.4, c = 0.67)
{Milk, Beer} → {Diaper}   (s = 0.4, c = 1.0)
{Diaper, Beer} → {Milk}   (s = 0.4, c = 0.67)
{Beer} → {Milk, Diaper}   (s = 0.4, c = 0.67)
{Diaper} → {Milk, Beer}   (s = 0.4, c = 0.5)
{Milk} → {Diaper, Beer}   (s = 0.4, c = 0.5)

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Observations:

• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}

• Rules originating from the same itemset have identical support but can have different confidence

• Thus, we may decouple the support and confidence requirements

Page 12

Mining Association Rules

Two-step approach:

1. Frequent Itemset Generation
   - Generate all itemsets whose support ≥ minsup

2. Rule Generation
   - Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is still computationally expensive.

Page 13

Frequent Itemset Generation

[Figure: the itemset lattice over items A-E, from the null itemset at the top down to ABCDE.]

Given d items, there are 2^d possible candidate itemsets.

Page 14

Frequent Itemset Generation

Brute-force approach:
  • Each itemset in the lattice is a candidate frequent itemset
  • Count the support of each candidate by scanning the database, matching each transaction against every candidate
  • Complexity ~ O(NMw): expensive, since M = 2^d

[Figure: the N market-basket transactions (average width w) matched against the list of M candidates.]

Page 15

Frequent Itemset Generation Strategies

Reduce the number of candidates (M)
  • Complete search: M = 2^d
  • Use pruning techniques to reduce M

Reduce the number of transactions (N)
  • Reduce the size of N as the size of the itemset increases
  • Used by DHP and vertical-based mining algorithms

Reduce the number of comparisons (NM)
  • Use efficient data structures to store the candidates or transactions
  • No need to match every candidate against every transaction

Page 16

Reducing Number of Candidates

Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.

The Apriori principle holds due to the following property of the support measure:

  ∀ X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

The support of an itemset never exceeds the support of its subsets. This is known as the anti-monotone property of support.
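As a quick check (reusing support_count() and transactions from the earlier sketch), the following asserts the anti-monotone property on the sample data: every itemset has support at least as large as each of its supersets.

from itertools import combinations

items = {"Bread", "Milk", "Diaper", "Beer", "Coke", "Eggs"}
for r in range(1, len(items)):
    for Y in map(set, combinations(items, r + 1)):
        for X in map(set, combinations(Y, r)):
            # X ⊆ Y, so sigma(X) can never be smaller than sigma(Y)
            assert support_count(X, transactions) >= support_count(Y, transactions)
print("s(X) >= s(Y) holds for every X ⊆ Y")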

Page 17

Illustrating Apriori Principle

[Figure: the itemset lattice over A-E, with the lower-level itemsets marked "found to be frequent" and one itemset marked "found to be infrequent".]

Page 18

Illustrating Apriori Principle (continued)

[Figure: the same lattice with every superset of the infrequent itemset pruned.]

Page 19

Apriori Algorithm

Method:
  1. Let k = 1. Generate frequent itemsets of length 1.
  2. Repeat until no new frequent itemsets are identified:
     a. Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
     b. Prune candidate itemsets containing subsets of length k that are infrequent
     c. Count the support of each candidate by scanning the DB
     d. Eliminate candidates that are infrequent, leaving only those that are frequent
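Below is a compact, illustrative Python sketch of the method above (a simplified reading of the slide, not the course's reference implementation); it reuses support_count() and transactions from the earlier snippet:

from itertools import combinations

def apriori(transactions, minsup):
    N = len(transactions)
    # k = 1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items
                if support_count({i}, transactions) / N >= minsup}
    all_frequent = set(frequent)
    k = 1
    while frequent:
        # Generate (k+1)-item candidates by merging frequent k-itemsets
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k + 1}
        # Prune candidates that have an infrequent k-subset (Apriori principle)
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # Count support with a database scan; keep only the frequent candidates
        frequent = {c for c in candidates
                    if support_count(c, transactions) / N >= minsup}
        all_frequent |= frequent
        k += 1
    return all_frequent

print(sorted(map(sorted, apriori(transactions, minsup=0.6))))

With minsup = 0.6 this returns the four frequent items plus the four frequent pairs {Bread, Milk}, {Bread, Diaper}, {Milk, Diaper}, and {Diaper, Beer}; the only surviving 3-item candidate, {Bread, Milk, Diaper}, fails the support test, so the loop stops.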

Page 20

Association rule discovery and generation

Page 21

Reducing Number of Comparisons

Candidate counting: scan the database of transactions to determine the support of each candidate itemset.

To reduce the number of comparisons, store the candidates in a hash structure: instead of matching each transaction against every candidate, match it only against the candidates contained in the hashed buckets.

[Figure: the N market-basket transactions matched against a hash structure whose buckets hold the candidate itemsets.]

Page 22

Factors Affecting Complexity

Choice of minimum support threshold
  • Lowering the support threshold results in more frequent itemsets
  • This may increase the number of candidates and the max length of frequent itemsets

Dimensionality (number of items) of the data set
  • More space is needed to store the support count of each item
  • If the number of frequent items also increases, both computation and I/O costs may increase

Size of database
  • Since Apriori makes multiple passes, the run time of the algorithm may increase with the number of transactions

Average transaction width
  • Transaction width increases with denser data sets
  • This may increase the max length of frequent itemsets and the number of hash-tree traversals (the number of subsets in a transaction increases with its width)

Page 23

Rule Generation

Given a frequent itemset L, find all non-empty subsets f ⊂ L such that the rule f → L - f satisfies the minimum confidence requirement.

If {A, B, C, D} is a frequent itemset, the candidate rules are:

  ABC → D, ABD → C, ACD → B, BCD → A,
  A → BCD, B → ACD, C → ABD, D → ABC,
  AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB

If |L| = k, then there are 2^k - 2 candidate association rules (ignoring L → ∅ and ∅ → L).
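A short sketch of this enumeration: every binary partition of a frequent itemset L into a non-empty antecedent and consequent yields one candidate rule, 2^k - 2 in total:

from itertools import combinations

def candidate_rules(L):
    """All rules f -> L - f with f a non-empty proper subset of L."""
    L = set(L)
    for r in range(1, len(L)):
        for antecedent in map(set, combinations(L, r)):
            yield antecedent, L - antecedent

rules = list(candidate_rules({"A", "B", "C", "D"}))
print(len(rules))  # 2**4 - 2 = 14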

Page 24

Rule Generation

How to efficiently generate rules from frequent itemsets?

In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D).

But the confidence of rules generated from the same itemset does have an anti-monotone property. For example, for L = {A, B, C, D}:

  c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)

Confidence is anti-monotone with respect to the number of items on the RHS of the rule.

Page 25

Rule Generation for Apriori Algorithm

[Figure: the lattice of rules generated from ABCD, from ABCD => { } at the top down to A => BCD, B => ACD, C => ABD, D => ABC at the bottom. A low-confidence rule is marked, and all rules below it, obtained by moving items from its antecedent into its consequent, are pruned.]

Page 26

Rule Generation for Apriori Algorithm

A candidate rule is generated by merging two rules that share the same prefix in the rule consequent:

  join(CD => AB, BD => AC) produces the candidate rule D => ABC

Prune rule D => ABC if its subset rule AD => BC does not have high confidence.
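A minimal sketch of this merge step (the helper name is illustrative): consequents are combined apriori-gen style, so the consequents {A, B} and {A, C} merge into the 3-item consequent {A, B, C}, i.e. the candidate rule D => ABC:

def merge_consequents(consequents):
    """Merge m-item rule consequents that overlap in m-1 items into (m+1)-item ones."""
    return {a | b for a in consequents for b in consequents
            if len(a | b) == len(a) + 1}

print(merge_consequents({frozenset("AB"), frozenset("AC")}))
# {frozenset({'A', 'B', 'C'})}: the consequent of the candidate rule D => ABC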

Page 27

Demonstration

A bank wants to examine its customer base and understand which of its products individual customers own in combination with one another. It has chosen to conduct a market-basket analysis of a sample of its customer base. The bank has a data set that lists the banking products/services used by 7,991 customers.

Data set: BANK

Variables:
  • ACCT (ID, Nominal): account number
  • SERVICE (Target, Nominal): type of service
  • VISIT (Sequence, Ordinal): order of product purchase

Page 28

Evaluation of association patterns

Page 29

Contingency Table

                      Checking Account
                      No       Yes      Total
Saving     No         500      3,500    4,000
Account    Yes        1,000    5,000    6,000
           Total      1,500    8,500    10,000

Support(SVG → CK) = 5,000 / 10,000 = 50%

Confidence(SVG → CK) = 5,000 / 6,000 = 83%

Lift(SVG → CK) = 0.83 / 0.85 < 1
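Reading the cells off the table above in a small sketch (the variable names are my own labels):

svg_ck, svg_only = 5000, 1000      # savings yes: checking yes / checking no
ck_only, neither = 3500, 500       # savings no:  checking yes / checking no
N = svg_ck + svg_only + ck_only + neither  # 10,000 customers

support = svg_ck / N                          # 0.50
confidence = svg_ck / (svg_ck + svg_only)     # 5,000 / 6,000 ≈ 0.83
lift = confidence / ((svg_ck + ck_only) / N)  # 0.83 / 0.85 ≈ 0.98 < 1
print(round(support, 2), round(confidence, 2), round(lift, 2))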

Page 30

Statistical Independence

Population of 1,000 students:
  • 600 students know how to swim (S)
  • 700 students know how to bike (B)
  • 420 students know how to swim and bike (S, B)

P(S,B) = 420/1,000 = 0.42
P(S) × P(B) = 0.6 × 0.7 = 0.42

P(S,B) = P(S) × P(B)  =>  statistical independence
P(S,B) > P(S) × P(B)  =>  positively correlated
P(S,B) < P(S) × P(B)  =>  negatively correlated

Page 31

Statistical-based Measures

Measures that take statistical dependence into account:

Lift = P(Y|X) / P(Y)

Interest = P(X,Y) / (P(X) P(Y))

PS = P(X,Y) - P(X) P(Y)

φ-coefficient = [P(X,Y) - P(X) P(Y)] / √(P(X)[1 - P(X)] · P(Y)[1 - P(Y)])
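A small sketch implementing the four measures from the joint and marginal probabilities (the function name is illustrative). Applied to the Tea → Coffee example on the next slide, it reproduces a lift of about 0.83:

import math

def dependence_measures(pxy, px, py):
    lift = (pxy / px) / py                 # P(Y|X) / P(Y)
    interest = pxy / (px * py)             # P(X,Y) / (P(X) P(Y))
    ps = pxy - px * py                     # P(X,Y) - P(X) P(Y)
    phi = ps / math.sqrt(px * (1 - px) * py * (1 - py))
    return lift, interest, ps, phi

# Tea -> Coffee: P(Tea, Coffee) = 0.15, P(Tea) = 0.20, P(Coffee) = 0.90
lift, interest, ps, phi = dependence_measures(0.15, 0.20, 0.90)
print(round(lift, 2), round(interest, 2), round(ps, 2), round(phi, 2))
# 0.83 0.83 -0.03 -0.25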

Page 32

Example: Lift/Interest

Contingency table:

         Coffee   ~Coffee
Tea      15       5          20
~Tea     75       5          80
         90       10         100

Association rule: Tea → Coffee

Confidence = P(Coffee|Tea) = 15/20 = 0.75, but P(Coffee) = 0.9

Lift = 0.75/0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)

Page 33

Compared to Confusion Matrix

             Computed Yes   Computed No   Total
Actual Yes   15             5             20
Actual No    75             5             80
Total        90             10            100

In classification, we are interested in P(Actual Yes | Computed Yes), i.e. P(Row | Column).

In association analysis, we are interested in P(Column | Row).

Page 34

Sequential pattern analysis

Page 35

Examples of Sequence Data

Sequence database | Sequence | Element (transaction) | Event (item)
Customer | Purchase history of a given customer | A set of items bought by a customer at time t | Books, dairy products, CDs, etc.
Web data | Browsing activity of a particular Web visitor | A collection of files viewed by a Web visitor after a single mouse click | Home page, index page, contact info, etc.
Event data | History of events generated by a given sensor | Events triggered by a sensor at time t | Types of alarms generated by sensors
Genome sequences | DNA sequence of a particular species | An element of the DNA sequence | Bases A, T, G, C

[Figure: a sequence drawn as an ordered list of elements (transactions), e.g. < {E1, E2} {E1, E3} {E2, E3, E4} {E2} >, where each element is a set of events (items).]

Page 36

Examples of Sequences

Web sequence:
< {Homepage} {Electronics} {Digital Cameras} {Canon Digital Camera} {Shopping Cart} {Order Confirmation} {Return to Shopping} >

Sequence of initiating events causing the nuclear accident at Three Mile Island (http://stellar-one.com/nuclear/staff_reports/summary_SOE_the_initiating_event.htm):
< {clogged resin} {outlet valve closure} {loss of feedwater} {condenser polisher outlet valve shut} {booster pumps trip} {main waterpump trips} {main turbine trips} {reactor pressure increases} >

Sequence of books checked out at a library:
< {Fellowship of the Ring} {The Two Towers} {Return of the King} >

Page 37

Sequential Pattern Mining: Definition

Given:
  • a database of sequences
  • a user-specified minimum support threshold, minsup

Task:
  • Find all subsequences with support ≥ minsup
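A minimal sketch of this task (helper names are my own): a subsequence <a1 ... an> is contained in a data sequence <b1 ... bm> if each element a_j is a subset of some later element b_i, in order; its support is the fraction of data sequences that contain it. The database below is the A-E example from the next slide:

def contains(seq, sub):
    """True if sub is a subsequence of seq: each element of sub fits inside
    a distinct element of seq, in the same order (greedy matching suffices)."""
    i = 0
    for element in seq:
        if i < len(sub) and sub[i] <= element:
            i += 1
    return i == len(sub)

def support(db, sub):
    return sum(contains(seq, sub) for seq in db) / len(db)

db = [
    [{1, 2, 4}, {2, 3}, {5}],        # A
    [{1, 2}, {2, 3, 4}],             # B
    [{1, 2}, {2, 3, 4}, {2, 4, 5}],  # C
    [{2}, {3, 4}, {4, 5}],           # D
    [{1, 3}, {2, 4, 5}],             # E
]
print(support(db, [{1, 2}]))          # 0.6
print(support(db, [{1}, {2}]))        # 0.8
print(support(db, [{1, 2}, {2, 3}]))  # 0.6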

Page 38

Sequential Pattern Mining: Example

Sequence database:

Object  Timestamp  Events
A       1          1, 2, 4
A       2          2, 3
A       3          5
B       1          1, 2
B       2          2, 3, 4
C       1          1, 2
C       2          2, 3, 4
C       3          2, 4, 5
D       1          2
D       2          3, 4
D       3          4, 5
E       1          1, 3
E       2          2, 4, 5

Minsup = 50%

Examples of frequent subsequences:

< {1,2} >        s = 60%
< {2,3} >        s = 60%
< {2,4} >        s = 80%
< {3} {5} >      s = 80%
< {1} {2} >      s = 80%
< {2} {2} >      s = 60%
< {1} {2,3} >    s = 60%
< {2} {2,3} >    s = 60%
< {1,2} {2,3} >  s = 60%