MARKET BASKET ANALYSIS USING R TOOL Gaurav Mittal DOMS-NITT
Nov 18, 2014
What is Market Basket Analysis?
Understanding the behavior of shoppers: which items are bought together?
What’s in each shopping cart/basket?
Basket data consist of a collection of transactions, each recording the transaction date and the items bought in that transaction (an itemset).
Retail organizations are interested in making well-founded decisions and strategy based on analysis of transaction data:
  what to put on sale,
  how to place merchandise on shelves to maximize profit,
  customer segmentation based on buying patterns.
Market Basket Analysis
MBA uses this information to:
Identify who customers are (not by name)
Understand why they make certain purchases
Gain insight about its merchandise (products):
  Fast and slow movers
  Products which are purchased together
  Products which might benefit from promotion
Take action:
  Store layouts
  Which products to put on special, promote, offer coupons for…
Combined with customer loyalty card data, all of this becomes even more valuable.
Examples
Rule form: LHS → RHS
IF a customer buys diapers, THEN they also buy beer:
  diapers → beer
“Transactions that purchase bread and butter also purchase milk”:
  {bread, butter} → milk
Customers who purchase maintenance agreements are very likely to purchase large appliances.
When a new hardware store opens, one of the most commonly sold items is toilet bowl cleaner.
Def: Market Basket Analysis (Association Analysis) is a mathematical modeling technique based upon the theory that if you buy a certain group of items, you are likely to buy another group of items.
It is used to analyze customer purchasing behavior and helps increase sales and maintain inventory by focusing on point-of-sale transaction data.
Definitions and Terminology
Transaction: a set of items (an itemset).
Confidence: the measure of uncertainty or trustworthiness associated with each discovered pattern.
Support: the measure of how often the collection of items in an association occurs together, as a percentage of all transactions.
Frequent itemset: an itemset that satisfies the minimum support threshold.
Strong association rules: rules that satisfy both a minimum support threshold and a minimum confidence threshold.
In association rule mining, we first find all frequent itemsets and then generate strong association rules from the frequent itemsets.
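The two-step procedure can be sketched in base R as follows. This is a brute-force illustration restricted to pairs of items, not the Apriori algorithm itself; the baskets and both thresholds are made-up values chosen only for the example.

```r
# Toy baskets (hypothetical data, echoing the bread/butter/milk example)
baskets <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("bread", "milk"),
  c("milk"),
  c("bread", "butter", "milk")
)
min_support <- 0.5     # assumed threshold
min_confidence <- 0.8  # assumed threshold

# Fraction of baskets that contain every item of the itemset
support <- function(itemset) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# Step 1: keep the item pairs that meet the minimum support
items <- sort(unique(unlist(baskets)))
pairs <- combn(items, 2, simplify = FALSE)
frequent_pairs <- Filter(function(p) support(p) >= min_support, pairs)

# Step 2: from each frequent pair, keep the rules that meet the minimum confidence
rules <- list()
for (p in frequent_pairs) {
  for (lhs in p) {
    rhs <- setdiff(p, lhs)
    conf <- support(p) / support(lhs)
    if (conf >= min_confidence) {
      rules[[length(rules) + 1]] <- data.frame(
        lhs = lhs, rhs = rhs, support = support(p), confidence = conf
      )
    }
  }
}
strong_rules <- do.call(rbind, rules)
print(strong_rules)
```

With these toy values only the rule butter → bread passes both thresholds; lowering either threshold admits more rules.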
Market Basket Analysis General Concept: methods
Method:
Transaction 1: Frozen pizza, cola, milk
Transaction 2: Milk, potato chips
Transaction 3: Cola, frozen pizza
Transaction 4: Milk, pretzels
Transaction 5: Cola, pretzels

Co-occurrence table (number of transactions containing both the row item and the column item):

               Frozen Pizza   Milk   Cola   Potato Chips   Pretzels
Frozen Pizza        2          1      2          0            0
Milk                1          3      1          1            1
Cola                2          1      3          0            1
Potato Chips        0          1      0          1            0
Pretzels            0          1      1          0            2

Results:
From this table we could derive the association rules:
If a customer purchases Frozen Pizza, then they will probably purchase Cola.
If a customer purchases Cola, then they will probably purchase Frozen Pizza.
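The co-occurrence table can be reproduced with a few lines of base R. This is a minimal sketch: the five transactions are the ones listed above, while the variable names are illustrative.

```r
# The five example transactions
transactions <- list(
  c("Frozen Pizza", "Cola", "Milk"),
  c("Milk", "Potato Chips"),
  c("Cola", "Frozen Pizza"),
  c("Milk", "Pretzels"),
  c("Cola", "Pretzels")
)
items <- c("Frozen Pizza", "Milk", "Cola", "Potato Chips", "Pretzels")

# Incidence matrix: one row per transaction, one column per item (1 = item present)
incidence <- t(sapply(transactions, function(tr) as.integer(items %in% tr)))
colnames(incidence) <- items

# t(incidence) %*% incidence counts, for every pair of items,
# the transactions that contain both; the diagonal holds single-item counts
cooccurrence <- crossprod(incidence)
print(cooccurrence)
```

The largest off-diagonal entry is the Frozen Pizza / Cola cell, which is what motivates the two rules above.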
Market Basket Analysis General Concept: Measures
Support: a measure of how often the collection of items in an association occurs together, as a percentage of all transactions.
  support = (number of transactions containing the item combination) / (total number of transactions)
  For the rule "If a customer purchases Cola, then they will purchase Frozen Pizza":
  support = 2 (transactions that include both Cola and Frozen Pizza) / 5 (total transactions) = 40%
Confidence: the confidence of rule “B given A” is a measure of how likely it is that B occurs when A has occurred, with 100% meaning that B always occurs if A has occurred.
  confidence of a rule = (support for the combination) / (support for the condition)
  For the rule "If a customer purchases Milk, then they will purchase Potato Chips":
  confidence = 20% (support for the combination Potato Chips + Milk) / 60% (support for the condition Milk) = 33%
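The two worked calculations can be checked in base R. This is a sketch assuming the five example transactions from the Method slide; the helper names support() and confidence() are illustrative, not functions from any package.

```r
# The five example transactions from the Method slide
transactions <- list(
  c("Frozen Pizza", "Cola", "Milk"),
  c("Milk", "Potato Chips"),
  c("Cola", "Frozen Pizza"),
  c("Milk", "Pretzels"),
  c("Cola", "Pretzels")
)

# Fraction of transactions that contain every item of the itemset
support <- function(itemset) {
  mean(vapply(transactions, function(tr) all(itemset %in% tr), logical(1)))
}

# Support of the combination divided by support of the condition
confidence <- function(lhs, rhs) {
  support(c(lhs, rhs)) / support(lhs)
}

support_cola_pizza <- support(c("Cola", "Frozen Pizza"))
confidence_milk_chips <- confidence("Milk", "Potato Chips")

cat(sprintf("support(Cola, Frozen Pizza)     = %.0f%%\n", 100 * support_cola_pizza))
cat(sprintf("confidence(Milk -> Potato Chips) = %.0f%%\n", 100 * confidence_milk_chips))
```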
Association Rules Apply Elsewhere
Retail – supermarkets, etc…
Purchases made using credit/debit cards.
Optional telco service purchases.
Banking services.
Unusual combinations of insurance claims can be a warning of fraud.
Medical patient histories.
Restaurants and fast-food centres.
Preparing Data for MBA
Determining the scope of the dataset (one or many stores, what period, etc.)
Converting transaction data to itemsets
Generalizing items to an appropriate level
  Depends on the objective of the model
Rolling up rare items to get adequate support
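Generalizing and rolling up can be sketched in base R. The detailed item names, the lookup table general_item, and the baskets below are all hypothetical; they only illustrate how replacing rare, detailed items by a more general item raises support.

```r
# Hypothetical baskets recorded at the most detailed item level
baskets <- list(
  c("Diet Cola", "Frozen Pizza"),
  c("Cherry Cola", "Frozen Pizza"),
  c("Cola", "Pretzels"),
  c("Skim Milk", "Pretzels")
)

# Hypothetical lookup table: detailed item -> more general item
general_item <- c(
  "Diet Cola" = "Cola", "Cherry Cola" = "Cola", "Cola" = "Cola",
  "Skim Milk" = "Milk",
  "Frozen Pizza" = "Frozen Pizza", "Pretzels" = "Pretzels"
)

# Replace each item by its general item and drop duplicates within a basket
roll_up <- function(basket) unique(unname(general_item[basket]))
rolled_up <- lapply(baskets, roll_up)

# Fraction of baskets in `data` that contain every item of the itemset
support <- function(itemset, data) {
  mean(vapply(data, function(b) all(itemset %in% b), logical(1)))
}

support_before <- support("Diet Cola", baskets)  # rare at the detailed level
support_after <- support("Cola", rolled_up)      # adequate after rolling up
print(c(before = support_before, after = support_after))
```

How far to roll up is a modeling choice: too coarse a level hides the specific products that a promotion or bundle would need to name.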
INTRODUCTION TO R
R is a programming language and software environment for statistical computing and graphics.
R is part of the GNU project. Its source code is freely available under the GNU General Public License, and pre-compiled binary versions are provided for various operating systems.
R uses a command line interface, though several graphical user interfaces are available.
The Comprehensive R Archive Network (CRAN) makes it easy to benefit from others’ work, to share your own work, and to get feedback on potential improvements.
For computationally-intensive tasks, C, C++, and Fortran code can be linked and called at run time.
R provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and others) and graphical techniques.
Another of R's strengths is its graphical facilities, which produce publication-quality graphs that can include mathematical symbols.
Although R is mostly used by statisticians and other practitioners requiring an environment for statistical computation and software development, it can also be used as a general matrix calculation toolbox, with benchmark results comparable to GNU Octave and its proprietary counterpart, MATLAB.
THE R ENVIRONMENT
R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes:
  an effective data handling and storage facility,
  a suite of operators for calculations on arrays, in particular matrices,
  a large, coherent, integrated collection of intermediate tools for data analysis, and
  graphical facilities for data analysis and display either on-screen or on hardcopy.
Packages
The capabilities of R are extended through user-submitted packages, which provide specialized statistical techniques, graphical devices, and import/export capabilities for many external data formats.
A statistical package is a suite of computer programs that are specialised for statistical analysis. It enables people to obtain the results of standard statistical procedures and statistical significance tests, without requiring low-level numerical programming.
Process Methodology
The data is obtained from the Excel sheet provided by the customer.
Each row contains:
  BUS_DT – Business Date
  REST_NO – Restaurant Number
  RTL_TRAN_NO – Transaction Number
  MENU_ITEM_KEY – Product Key Number
  MENU_ITEM_PLU – Menu Product Number
  MENU_ITEM_NAME – Product Name
  RCPT_DT_TMSTP – Date Of Transaction
  HALF_HOUR_KEY – The half hour in which the transaction occurred
  COMBO_IND – Is the product offered with something else
  SERVICE_MODE_CODE – Eat-in / Take-away
  CGY – Category
  RGLR_PRC – Regular Price
  DRV_PRC – Derived Price
  ITEM_QTY – Number of Products Ordered
Products offered at the store
WHOPPER
TENDERCRISP Chicken Sandwich
Crown-shaped CHICKEN TENDERS
French Fries
Hamburger
Cheeseburger
DOUBLE CROISSAN'WICH
BK BURGER SHOTS
KRAFT Macaroni and Cheese
Drinks
The given data is converted to a new format in which each row contains all items purchased in a single transaction.
This is done using the VLOOKUP function in Excel.
The resulting data is restructured to remove the multiple lines belonging to the same transaction, using the IF…THEN method in Excel.
The data is then ready to be fed to the statistical application.
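The same restructuring can also be done directly in R instead of Excel. The sketch below is an alternative to the VLOOKUP approach, not the procedure used in this project; the column names RTL_TRAN_NO and MENU_ITEM_NAME come from the data description above, while the rows are made up for illustration.

```r
# Hypothetical rows in the original layout: one line per item sold
sales <- data.frame(
  RTL_TRAN_NO = c(101, 101, 101, 102, 102, 103),
  MENU_ITEM_NAME = c("WHOPPER", "French Fries", "Drinks",
                     "Cheeseburger", "French Fries", "Hamburger"),
  stringsAsFactors = FALSE
)

# Group the item names by transaction number: one basket per transaction
baskets <- split(sales$MENU_ITEM_NAME, sales$RTL_TRAN_NO)

# Drop repeated lines for the same item within a transaction
baskets <- lapply(baskets, unique)

# One comma-separated line per transaction, ready to be written to a file
basket_lines <- vapply(baskets, paste, character(1), collapse = ",")
print(basket_lines)
```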
Working in R
Download Rcmdr (R Commander), which is a GUI, and the arules package (Apriori / association rules) from CRAN.
The Rcmdr GUI can be used to load the data into the software, or the data can be loaded directly using command functions.
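> Dataset <- read.table("C:/Users/mittal/Documents/mittal.csv", header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)
Loading the arules package
> library("arules")
To inspect the transactions. inspect() expects an arules transactions object, so the data frame is coerced first (this assumes every column is a factor or logical).
> trans <- as(Dataset, "transactions")
> inspect(trans[1:5])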
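Next, we call the function apriori() to find all rules (the default association type for apriori()) with a minimum support of 1% and a minimum confidence of 0.6. The calls below follow the arules documentation example on its bundled Adult data set, so the item labels "income=small" and "income=large" come from that data; for the restaurant data they would be replaced by menu item labels.
> library("arules")
> data("Adult")
> rules <- apriori(Adult, parameter = list(support = 0.01,
+ confidence = 0.6))
Asking for the rules
> rules
Selecting the rules with a given right-hand side and a lift above 1.2
> rules_whopper <- subset(rules, subset = rhs %in% "income=small" &
+ lift > 1.2)
> rules_hamburger <- subset(rules, subset = rhs %in% "income=large" &
+ lift > 1.2)
Getting the summary of the rules
> summary(rules_whopper)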
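For a self-contained run that does not depend on the restaurant file, the same calls can be tried on the five example transactions from the Method slide. This is a sketch that assumes the arules package is installed; the thresholds (support 0.3, confidence 0.6) are chosen for this tiny data set and are not the ones used in the project.

```r
library(arules)

# The five example transactions from the Method slide
transaction_list <- list(
  c("Frozen Pizza", "Cola", "Milk"),
  c("Milk", "Potato Chips"),
  c("Cola", "Frozen Pizza"),
  c("Milk", "Pretzels"),
  c("Cola", "Pretzels")
)

# Coerce the list of baskets to an arules transactions object
trans <- as(transaction_list, "transactions")

# Mine rules; minlen = 2 excludes rules with an empty left-hand side
rules <- apriori(
  trans,
  parameter = list(support = 0.3, confidence = 0.6, minlen = 2),
  control = list(verbose = FALSE)
)

# Show the rules with their support, confidence and lift
inspect(sort(rules, by = "confidence"))

# Keep only the rules whose right-hand side is Cola
rules_cola <- subset(rules, subset = rhs %in% "Cola")
```

On this data the mined rules are the two Frozen Pizza / Cola rules derived by hand earlier.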
The recommendations
Whopper can be bundled with Coke, Minute Maid orange juice, or French toast sticks.
Cheeseburger can be bundled with French fries or onion rings.
French fries with HERSHEY®'S Fat Free Milk.
Dutch Apple Pie with Bacon, Egg & Cheese Biscuit Sandwich.
Challenges…!!!
We could not load more than 799 rows of data.
In our experience, R was usable for learning purposes but difficult to use for industrial purposes, where large amounts of data have to be analyzed.
Limited guidance is available for developing analyses in R.
New code has to be developed to extend the database.
Thank You