Market Basket Analysis

Table of Contents

Market basket analysis in a multiple store environment

• Introduction
• Need for a new Algorithm
• Problem Definition
• Algorithm Description

Graph Based Structure of Market Basket Analysis

• Market Basket
• Defining the problem
• Apriori Algorithm
• Limitation
• Similis Algorithm
• Problem Transformation
• Searching for the Maximum-weighted Clique
• Comparison
• Example
• Conclusion

Introduction

• Market basket analysis is a method of discovering customer purchasing patterns

• Discovering such purchasing patterns can help managers in designing store layout, web sites, product mix and bundling, and other marketing strategies

• For a company with multiple stores, discovering purchasing patterns that may vary over time and exist in all, or in subsets of, stores can be useful in forming marketing, sales, service, and operation strategies at the company, local, and store levels

Need for a new Algorithm

Two main problems in using the existing methods in a multi-store environment:

1. Temporal Association Rules:

– Static association rules either find patterns at a point in time or implicitly assume the patterns stay the same over time and across stores

– In temporal association rules, selling periods are considered in computing the support value

2. Spatial Association Rules:

– Some products may not be sold in some stores, for example because of geographical, environmental, or political reasons

– The problem is to find common association patterns in subsets of stores with location

Cont..

• Solution: Apriori-like algorithm

– Covers rules that are applicable to the entire chain without time restriction or to a subset of stores in specific time intervals

– The format of the rules is similar to that of the traditional rules

– Rules also contain information on store (location) and time

Cont..

• Examples:

– In the second week of August, customers purchase computers, printers, Internet and wireless phone services jointly in electronics stores near campus

– In January, customers purchase cold medicine, humidifiers, coffee, and sunglasses together in supermarkets near skiing resorts

Problem Definition

– Let {T1, T2, ..., Tm} be a set of mutually disjoint time intervals (periods) that form a complete partition of T

– Let P = {P1, P2, ..., Pq} be the set of stores, where Pj (1 ≤ j ≤ q) denotes the jth store in the store chain

– Each transaction s in D carries a timestamp t and a store identifier p, indicating the time and the store at which the transaction occurs

– Let Sk ⊆ P and Rk ⊆ T be the sets of stores and times in which item Ik is sold, respectively
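The notation above can be pictured as a small data model. This is a minimal Python sketch, not the authors' implementation: the `Transaction` class, the `selling_context` helper and the sample records are hypothetical, and it assumes Sk and Rk can be read off the observed transactions, which the slides do not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    """One transaction s in D with its store identifier p and period t."""
    items: frozenset
    store: str
    period: str


def selling_context(transactions, item):
    """Return (S_k, R_k): the stores and periods in which `item` was sold.

    Assumption: an item counts as "sold" in a store or period when at
    least one transaction containing it was recorded there.
    """
    stores = {tx.store for tx in transactions if item in tx.items}
    periods = {tx.period for tx in transactions if item in tx.items}
    return stores, periods


# Hypothetical sample database D with two stores and two periods.
D = [
    Transaction(frozenset({"cold medicine", "humidifier"}), store="P2", period="T1"),
    Transaction(frozenset({"coffee", "sunglasses"}), store="P1", period="T1"),
    Transaction(frozenset({"coffee"}), store="P2", period="T2"),
]

S_coffee, R_coffee = selling_context(D, "coffee")
```

Here coffee is sold in both stores and both periods, while the humidifier only appears in store P2 during period T1.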

“Apriori-like” Algorithm

• Some essential points:

– RFk denotes the set of all relative-frequent k-itemsets; Fk the set of all frequent k-itemsets; Ck the set of candidate k-itemsets

– Candidate k-itemsets are generated by combining frequent (k-1)-itemsets, following the anti-monotone property
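Candidate generation of this kind is usually written as a join step followed by a prune step. The sketch below shows the classic Apriori form under the assumption that itemsets are stored as sorted tuples; the function name `generate_candidates` is illustrative and not taken from the slides.

```python
from itertools import combinations


def generate_candidates(frequent_prev, k):
    """Build C_k from F_(k-1), given as a set of sorted tuples of length k-1."""
    prev = sorted(frequent_prev)
    candidates = set()
    for a, b in combinations(prev, 2):
        # Join step: merge two itemsets that share their first k-2 items.
        if a[:-1] != b[:-1]:
            continue
        candidate = a + (b[-1],)
        # Prune step (anti-monotone property): every (k-1)-subset of a
        # frequent k-itemset must itself be frequent.
        if all(sub in frequent_prev for sub in combinations(candidate, k - 1)):
            candidates.add(candidate)
    return candidates


F2 = {("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")}
C3 = generate_candidates(F2, 3)
```

In this toy input, ("B", "C", "D") is joined but then pruned because ("C", "D") is not frequent, so only ("A", "B", "C") remains as a candidate.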

Cont..

• Algorithm in brief:

– The first step of the algorithm is to build the PT table for each item in I

– Different phases of the algorithm:

  • In the first phase, we scan the database for the first time, build a two-dimensional table called the TS table, and find the frequent 1-itemsets

  • In the kth phase of the algorithm, Ck is derived, and Fk is generated by evaluating the supports of the candidates

  • Since an RFk itemset must be a frequent itemset, we generate RFk from Fk by evaluating the relative supports of the itemsets X in Fk

Cont..

• PT table: Associates context (stores, time intervals) with each item in I

– PT tables for individual items can be used to determine the PT table for a given itemset X
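One plausible way to combine item-level PT tables is a cell-by-cell logical AND: an itemset can only be sold in a context where all of its items are sold. The sketch below assumes a layout the slides do not spell out (one row per store, one boolean per period), and the names `itemset_pt` and `ITEM_PT` and the sample values are hypothetical.

```python
def itemset_pt(item_pt, itemset):
    """Combine per-item PT tables into the PT table of an itemset.

    item_pt maps an item to a list of rows (one per store, an assumed
    layout); each row holds one boolean per period. A cell of the result
    is True only when every item of the itemset is sold in that context.
    """
    tables = [item_pt[item] for item in itemset]
    return [
        [all(cells) for cells in zip(*rows)]  # AND across items, per period
        for rows in zip(*tables)              # walk the tables store by store
    ]


# Hypothetical PT tables for two stores (rows) and two periods (columns).
ITEM_PT = {
    "humidifier": [[True, False], [True, True]],
    "coffee": [[True, True], [False, True]],
}

PT_X = itemset_pt(ITEM_PT, ["humidifier", "coffee"])
```

With these values the pair is jointly available only in the first store during the first period and in the second store during the second period.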

Cont..

• PT table: The method to compute the jth row of PT table for itemset X

Cont..

• Candidate itemsets: we generate the candidate itemsets from the frequent itemsets found in the previous phase

• Relative-frequent itemset:

– Because an RF itemset must be a frequent itemset, we can generate RFk from Fk by computing the relative supports of those itemsets X in Fk

– |DVX| can be obtained from the TS and PT tables of X
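A rough sketch of how |DVX| and a relative support could be derived from the two tables follows. It rests on assumptions that the slides do not confirm: the TS table is taken to hold the number of transactions per store (row) and period (column), the PT table of X marks the contexts where X is sold, and relative support is taken as the count of transactions containing X divided by |DVX|. All names and numbers are illustrative.

```python
def context_size(ts, pt):
    """Sum the TS counts over the cells that the PT table marks as True."""
    return sum(
        count
        for ts_row, pt_row in zip(ts, pt)
        for count, sold in zip(ts_row, pt_row)
        if sold
    )


def relative_support(count_x, ts, pt):
    """Share of transactions containing X among those in X's selling context."""
    size = context_size(ts, pt)
    return count_x / size if size else 0.0


# Hypothetical tables: two stores (rows) and two periods (columns).
TS = [[100, 120], [80, 60]]
PT_X = [[True, False], [False, True]]

size_dvx = context_size(TS, PT_X)        # 100 + 60
rel_sup = relative_support(40, TS, PT_X)
```

Only the transactions from contexts where X is actually on sale enter the denominator, which is what separates a relative support from the usual chain-wide support.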

Conclusion

• Store-chain association rules are proposed specifically for a multi-store environment, where stores may have different product-mix strategies that can be adjusted over time.

• These rules have a distinct advantage over the traditional ones because they contain store (location) and time information so that they can be used not only for general or local marketing strategies (depending on the results), but also for product procurement, inventory, and distribution strategies for the entire store chain

Graph Based Structure of Market Basket Analysis

Market Basket

• Market basket analysis is a powerful tool for implementing cross-selling strategies.

• Problem Definition:

– The file with multiple transactions can be shown in a relational database table T(customer, item).

– The customer = {1, 2, 3, ..., n} and the item = {a, b, c, ..., z}.

– The table T(customer, item) can be seen as a set of all customer transactions Trans = {t1, t2, ..., tk}, where each transaction contains a subset of items tk = {ia, ib, ..., iz}.

– The relational table thus formed can be seen as the relationship between item and customer, called item-clientele.

Apriori Algorithm

• The Apriori algorithm has two important characteristics:

– Level-wise algorithm: it traverses the item lattice one level at a time, from frequent 1-itemsets to the maximum size of frequent itemsets.

– Generate-and-test strategy for finding the frequent itemsets: candidates are generated at each level, and the support for each candidate is then counted and tested against the minsup threshold.

• Limitation: the Apriori algorithm has exponential time complexity, and several passes over the input table are needed. To overcome these handicaps, the Similis algorithm is proposed.
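The level-wise, generate-and-test loop can be sketched in a few lines of Python. This is a simplified illustration rather than an optimized Apriori: the five transactions are reconstructed from the item-clientele lists of the example in this deck, and the function name `apriori` and the use of an absolute support count are choices made for the sketch.

```python
from itertools import combinations

TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def apriori(transactions, minsup):
    """Return {itemset: support count} for itemsets with count >= minsup."""
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([item]) for item in items]  # candidate 1-itemsets
    frequent = {}
    k = 1
    while level:
        # Test: one pass over the data counts every candidate of this level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= minsup}
        frequent.update(survivors)
        k += 1
        # Generate: unions of two frequent itemsets that have exactly k items.
        unions = {a | b for a in survivors for b in survivors if len(a | b) == k}
        # Prune: keep a candidate only if all its (k-1)-subsets are frequent.
        level = [
            c for c in unions
            if all(frozenset(s) in survivors for s in combinations(c, k - 1))
        ]
    return frequent


FREQUENT = apriori(TRANSACTIONS, minsup=3)
```

On this data, with a minimum count of 3, the sketch reports four frequent single items and four frequent pairs; no 3-itemset reaches that count. Each level costs one more pass over the transactions, which is the limitation the slide points out.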

Similis Algorithm

• Similis is a Latin word which means “similar”.

• Similis consists of two steps:

– Problem Transformation: generation of the graph structure

– Search: finding the maximum-weight clique

• Algorithm Description:

STEP 1 - Data Transformation
input: table T(customer, item)
Generate graph G(V,E) using the similarities between items
output: weighted graph G(V,E)

STEP 2 - Finding the maximum-weighted clique
input: weighted graph G(V,E) and size k
Find in G(V,E) the clique S with k vertices with the maximum weight, using the Primal-Tabu meta-heuristic
output: weighted clique S of size k that corresponds to the most frequent market basket with k items

Problem Transformation

• As stated earlier, the table T(customer, item) is transformed into condensed data by using a graph structure.

• A graph is a pair G = (V,E), where V is the set of vertices and E is the set of edges of the graph.

• In the market basket case, each vertex corresponds to an item and each edge has a weight which represents the distance between the adjacent vertices.

• The distance between two items is given by the frequency with which the two items are bought together.

Cont…

• To find the values for the weighted graph G(V,E) some similarity measures can be used.

• The similarity value of two items will be high if both are frequently included in the same transactions.

• This means that if two items are frequently bought in the same transactions, then they belong to a frequent market basket.

• In order to create sets of items, an association measure must be found; either similarity or distance measures can be used.

Cont…

• For each pair of items (A,B) a similarity measure SIM(A,B) can be found. If the items are bought together many times they have a strong similarity; if they are not usually bought together they have a weak similarity.

• For all items, an item similarity matrix is generated, which can be represented by the adjacency matrix of the weighted graph G(V,E).

Similarity Measures

• The authors describe the following similarity measures.

• These measures use binary matrices and return normalized values between 0 and 1.

• The Dice (sim1), Jaccard (sim2) and Cosine (sim3) coefficients are widely used given their simplicity.
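The three coefficients can be sketched directly on item-clientele sets, which is equivalent to working on the binary customer columns. The definitions below are the standard set forms of Dice, Jaccard and Cosine; the function names and the zero guard for empty sets are choices made for this sketch, and the two sample sets are the Bread and Milk clienteles from the example in this deck.

```python
from math import sqrt


def dice(a, b):
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def jaccard(a, b):
    """Jaccard coefficient: |A ∩ B| / |A ∪ B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def cosine(a, b):
    """Cosine coefficient for binary data: |A ∩ B| / sqrt(|A| * |B|)."""
    product = len(a) * len(b)
    return len(a & b) / sqrt(product) if product else 0.0


bread = {1, 2, 4, 5}  # customers who bought Bread
milk = {1, 3, 4, 5}   # customers who bought Milk

sims = (dice(bread, milk), jaccard(bread, milk), cosine(bread, milk))
```

For Bread and Milk, three customers are shared, giving 0.75 for Dice, 0.6 for Jaccard and 0.75 for Cosine.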

Weight Calculation

• A multiplicative model will be used to express the weight of an edge (A,B). The weight of the edge (A,B) takes into account the similarity and frequency of the items:

weight(A,B) = sim(A,B) · frequency(A,B)

• The similarity value of two items will be high if they are both included in the same transactions.

• The frequency of the items must be considered to guarantee a correspondence between high-weighted edges and items that appear in many transactions.

• There are several ways to define item frequency. In this work the authors opt for the average of the relative frequencies of the two items, given by:

frequency(A,B) = (frequency(A) + frequency(B)) / 2

Example

In table T the customer-item relationship is given. There are 5 customers and 6 items, so we have to find the item-clientele relationship:

Bread = (1,2,4,5)
Milk = (1,3,4,5)
Diaper = (2,3,4,5)
Beer = (2,3,4)
Eggs = (2)
Coke = (3,5)

Matrix of Graph G(V,E)

          Bread   Milk    Diaper  Beer    Eggs    Coke
Bread     -       0.48    0.60    0.28    0.125   0.12
Milk              -       0.48    0.28    0       0.30
Diaper                    -       0.525   0.125   0.30
Beer                              -       0.133   0.125
Eggs                                      -       0
Coke                                              -

Adjacency matrix of the weighted graph G = (V,E)
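The matrix entries can be recomputed from the item-clientele lists of the example. The sketch below assumes the Jaccard coefficient as the similarity and the average relative frequency of the two items as the frequency term; the slides do not say which similarity was used, and the names `CLIENTELE`, `weight` and `WEIGHTS` are illustrative.

```python
from itertools import combinations

N_CUSTOMERS = 5
CLIENTELE = {
    "Bread": {1, 2, 4, 5},
    "Milk": {1, 3, 4, 5},
    "Diaper": {2, 3, 4, 5},
    "Beer": {2, 3, 4},
    "Eggs": {2},
    "Coke": {3, 5},
}


def weight(a, b):
    """weight(A,B) = sim(A,B) * frequency(A,B), with Jaccard as sim (assumed)."""
    ca, cb = CLIENTELE[a], CLIENTELE[b]
    similarity = len(ca & cb) / len(ca | cb)
    # Average of the two relative item frequencies.
    frequency = (len(ca) + len(cb)) / (2 * N_CUSTOMERS)
    return similarity * frequency


# One weight per unordered pair of items (upper triangle of the matrix).
WEIGHTS = {frozenset(pair): weight(*pair) for pair in combinations(CLIENTELE, 2)}
```

Under these assumptions the sketch reproduces the printed entries, for example 0.48 for Bread-Milk and 0.525 for Diaper-Beer, with one exception: Bread-Diaper comes out as 0.6 × 0.8 = 0.48 rather than the 0.60 shown in the matrix.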

Searching for the Maximum-weighted Clique

• A clique can represent a common interest group.

• Given a graph representing the communication among a group of individuals in an organization, each vertex represents an individual, while edge (i, j) shows that individual i regularly communicates with individual j.

• Our aim is to find the maximum-weighted clique in the graph.

• If a graph with weights on the edges is used, the maximum-weighted clique corresponds to the common-interest group whose elements communicate the most among themselves. This structure allows the representation of sets of strongly connected elements.

Maximum Clique Problem

• The Maximum Clique Problem is an important problem in combinatorial optimization.

• In market basket analysis it is used to find interesting combinations of items.

• To solve this problem, the Primal-Tabu algorithm is used for finding the maximum-weighted clique.

• Conceptually, Primal-Tabu works on the neighbourhood structures N+, N-, and N0, for the addition, removal and swap of a vertex of the graph.

• At each step one new solution S' is chosen from the neighbourhood N(S) of the current solution S.

• At each iteration the best solution found, S*, is updated whenever the clique value is increased.
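The search can be illustrated with a much simplified tabu search. This is not the authors' Primal-Tabu meta-heuristic: it keeps the size fixed at k and uses only swap moves (the N0 neighbourhood), and it treats the graph as complete, with missing edges counted as weight 0, so any k-subset is accepted as a clique. The function names, the tabu tenure and the iteration limit are assumptions; the edge weights are copied from the adjacency matrix of the example as printed.

```python
from itertools import combinations

EDGES = {
    ("Bread", "Milk"): 0.48, ("Bread", "Diaper"): 0.60, ("Bread", "Beer"): 0.28,
    ("Bread", "Eggs"): 0.125, ("Bread", "Coke"): 0.12, ("Milk", "Diaper"): 0.48,
    ("Milk", "Beer"): 0.28, ("Milk", "Eggs"): 0.0, ("Milk", "Coke"): 0.30,
    ("Diaper", "Beer"): 0.525, ("Diaper", "Eggs"): 0.125, ("Diaper", "Coke"): 0.30,
    ("Beer", "Eggs"): 0.133, ("Beer", "Coke"): 0.125, ("Eggs", "Coke"): 0.0,
}
W = {frozenset(pair): w for pair, w in EDGES.items()}
VERTICES = sorted({v for pair in EDGES for v in pair})


def subset_weight(vertices, w):
    """Total weight of all edges inside the vertex set."""
    return sum(w.get(frozenset(p), 0.0) for p in combinations(vertices, 2))


def tabu_search(vertices, w, k, iterations=50, tenure=2):
    """Swap-based tabu search for a heavy k-vertex subset (simplified)."""
    current = set(vertices[:k])                       # arbitrary start S
    best, best_weight = set(current), subset_weight(current, w)
    tabu_until = {}                                   # vertex -> iteration it is free again
    for it in range(iterations):
        moves = []
        for out in sorted(current):
            for inn in vertices:
                if inn in current or tabu_until.get(inn, 0) > it:
                    continue                          # already in S, or tabu
                candidate = (current - {out}) | {inn}
                moves.append((subset_weight(candidate, w), out, inn))
        if not moves:
            break
        value, out, inn = max(moves)                  # best neighbour S', even if worse
        current = (current - {out}) | {inn}
        tabu_until[out] = it + 1 + tenure             # block immediate re-entry
        if value > best_weight:                       # update S* on improvement
            best, best_weight = set(current), value
    return best, best_weight


BEST, BEST_WEIGHT = tabu_search(VERTICES, W, k=3)
```

On these weights and k = 3 the search ends with Bread, Milk and Diaper, whose three edges sum to 1.56, which matches an exhaustive check over all 3-subsets. Accepting the best non-tabu move even when it is worse is what lets a tabu search leave local optima.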

Comparison

• Apriori with a support count of 3 finds the pairs (Bread, Milk), (Bread, Diaper) and (Milk, Diaper) frequent, which suggests the item choice (Bread, Milk, Diaper); the triple itself occurs in only 2 of the 5 transactions.

• Here I check whether the graph gives the same result by using the maximum-weight clique method.

• Before this, since the support is 3 and our values lie between 0 and 1 (binary matrices), the support of 3 needs to be normalized to the range 0 to 1.

Result

[Figure: graph with vertices Milk, Bread, Diaper, Beer, Coke and Eggs; the edges shown carry weights 0.48, 0.48, 0.525 and 0.60.]

Only those edges whose weights are more than 0.40 are considered.

Conclusion

• The main disadvantages of the Apriori algorithm are its exponential time complexity and the many passes it performs over the data.

• With few items or sparse data the algorithm is efficient, while with correlated data the performance degrades significantly.

• The Similis algorithm has lower computational complexity, thus allowing the resolution of a greater number of real problems.

• In this innovative approach, the condensed data is obtained by transforming the market basket problem into a maximum-weighted clique problem.
