2009:097
MASTER'S THESIS
Mining Changes in Customer Purchasing Behavior
- a Data Mining Approach
Samira Madani
Luleå University of Technology
Master Thesis, Continuation Courses Marketing and e-commerce
Department of Business Administration and Social Sciences, Division of Industrial Marketing and E-commerce
2009:097 - ISSN: 1653-0187 - ISRN: LTU-PB-EX--09/097--SE
Abstract: The world around us is changing all the time. For businesses, knowing what is changing and how it has changed is crucial. One of the most important aspects of surviving in a dynamic market is to know, and adapt to, the changes happening in customer behavior. For a fast-moving consumer goods (FMCG) distribution company this issue is even more important: because of the variety of FMCG products, distribution companies, and their differing strategies, the purchasing behavior of customers may change many times within a period, and the competition becomes tougher. The purpose of this study is to help Kalleh Company, a manufacturer and distributor of food products in the Iranian market, to mine the changes happening in its customers' behavior. Mining changes involves several steps: data collection, data preprocessing, customer segmentation, mining customer behavior patterns, and change mining. For customer segmentation we use the Customer Value Matrix; for mining behavior patterns we use the Apriori algorithm and maximal frequent itemsets. Following the literature, we distinguish several kinds of changes: added/perished rules, emerging patterns, and unexpected changes. Two measures, similarity and unexpectedness, quantify the changes. In this study we first calculate changes using these measures as defined in the literature; we then modify the measures to calculate differences between ordinal attributes, bringing their information into the calculation of changes. Our contribution is this modification of the change measures, which captures more information and yields higher accuracy in change mining. The results are presented in Chapter 4. Marketing managers can apply the detected changes to respond accurately and in a timely manner to changes in the market. They can also use them to evaluate different marketing campaigns, to build stronger relationships with their customers, and to know the market better. Mining changes has many implications for the macro and micro aspects of businesses, as well as for marketing campaigns and manufacturing.
Table of contents
Abstract
Chapter 1: Introduction
1.1 Background of the study
1.2 Problem definition
1.3 Purpose of this study
1.4 Research question
1.5 Research motivation
1.6 Demarcation
1.7 Research outline
Chapter 2: Literature Review
2.1 Mining Customer Behavior
2.2 Review of Data Mining
2.2.1 Data mining in brief
2.2.2 Data mining functions
2.2.3 Classification in brief
2.2.4 Clustering in brief
2.2.5 Association rules in brief
2.3 Association Rule Mining Review
2.3.1 Association rule mining problem
2.3.2 Apriori algorithm
2.3.3 Association rule mining approaches: Apriori approach
2.4 Mining Changes Literature Review
2.5 Customer Segmentation Review
2.5.1 Clustering analysis
2.5.2 Customer segmentation model
2.5.3 RFM model
2.5.4 RFM scoring
2.5.5 Customer Value Matrix model
Chapter 3: Research Methodology
3.1 Research Methodology
3.2 Research Design
3.3 Research Purpose
3.4 Research Approaches
3.5 Research Strategy
3.6 Research Process
3.7 Data Collection and Description
3.8 Data Pre-Processing
3.9 Customer Segmentation
3.9.1 Customer Value Matrix
3.9.2 An effective analytical tool
3.9.3 Customer Value Matrix methodology
3.10 Mining Customer Behavior
3.10.1 Association rule mining
3.10.2 Apriori algorithm
3.11 Change Mining
3.11.1 Change mining
Chapter 4: Results & Analysis
4.1 Data preprocessing result
4.1.1 Data cleaning
4.1.2 Data transformation result
4.2 Customer segmentation (in SQL Server 2000)
4.2.1 Customer Value Matrix result
4.3 Customer Behavior Mining
4.3.1 Discretization result
4.3.2 Association rule mining results
4.4 Change Mining
4.4.1 Some examples of change patterns
4.4.2 Association rules and changes based on (Chen et al., 2005)
4.4.3 Rules with discrete variables in RHS
4.4.4 Change mining with Manhattan distance
Chapter 5: Conclusion and Further Research
5.1 Conclusion
5.2 Our contribution
5.3 Limitations
5.4 Managerial implications
5.5 Future work
References
List of tables
Table 2.1: Factors for classification of ARM
Table 2.2: Mining in a changing environment timetable
Table 3.1: Data collected from Kalleh Company
Table 3.2: Calculating variables for the customer value matrix
Table 4.1: RFM table fields
Table 4.2: Calculating variables for the customer value matrix
Table 4.3: Calculating variables for the customer value matrix
Table 4.4: Segment information for period 1
Table 4.5: Segment information for period 2
Table 4.6: R quantile
Table 4.7: M quantile
Table 4.8: F quantile
Table 4.9: Area quantile
Table 4.10: Generated rule summary
Table 4.11: Generated rules for period 1, Cluster 1
Table 4.12: Generated rules for period 2, Cluster 1
Table 4.13: Generated rules for period 1, Cluster 2
Table 4.14: Generated rules for period 2, Cluster 2
Table 4.15: Generated rules for period 1, Cluster 3
Table 4.16: Generated rules for period 2, Cluster 3
Table 4.17: Generated rules for period 1, Cluster 4
Table 4.18: Generated rules for period 2, Cluster 4
Table 4.19: Cat1 quantile
Table 4.20: Cat2 quantile
Table 4.21: Cat3 quantile
Table 4.22: Cat5 quantile
Table 4.23: Cat11 quantile
Table 4.24: Cat13 quantile
Table 4.25: Generated rules for period 1, Cluster 1, change mining by (Chen et al., 2005) measures & Manhattan distance
Table 4.26: Generated rules for period 2, Cluster 1, change mining by (Chen et al., 2005) measures & Manhattan distance
Table 4.27: Generated rules for period 1, Cluster 2, change mining by (Chen et al., 2005) measures & Manhattan distance
Table 4.28: Generated rules for period 2, Cluster 2, change mining by (Chen et al., 2005) measures & Manhattan distance
Table 4.29: Generated rules for period 1, Cluster 3, change mining by (Chen et al., 2005) measures & Manhattan distance
Table 4.30: Generated rules for period 2, Cluster 3, change mining by (Chen et al., 2005) measures & Manhattan distance
Table 4.31: Generated rules for period 1, Cluster 4, change mining by (Chen et al., 2005) measures & Manhattan distance
Table 4.32: Generated rules for period 2, Cluster 4, change mining by (Chen et al., 2005) measures & Manhattan distance
List of figures
Figure 2.1: Knowledge Discovery in Database processes
Figure 2.2: The major steps in the data mining process
Figure 2.3: Classification of DM techniques
Figure 2.4: Classic problem of association rule mining
Figure 2.5: Mining in a changing environment review
Figure 2.6: Customer Value Matrix
Figure 3.1: Research design of this study
Figure 3.2: Change mining process perspective
Figure 3.3: Change mining process
Figure 3.4: Change mining process in detail
Figure 3.5: Product categories of Kalleh Company
Figure 3.6: Customer value matrix
Figure 4.1: Generalized product category
Figure 4.2: The Customer Value Matrix
Figure 4.3: R histogram
Figure 4.4: M histogram
Figure 4.5: F histogram
Figure 4.6: Area histogram
Figure 4.7: Cat1 histogram
Figure 4.8: Cat2 histogram
Figure 4.9: Cat3 histogram
Figure 4.10: Cat5 histogram
Figure 4.11: Cat11 histogram
Figure 4.12: Cat13 histogram
Chapter 1: Introduction
Background of the study
Problem definition
Purpose of this study
Research question
Research motivation
Research demarcation
Research outline
1.1 Background of the study: The world around us changes continuously, and knowing about and adapting to change is an important aspect of our lives. For businesses, knowing what is changing and how it has changed is essential (Liu et al., 2000). One of the most important aspects of surviving in a dynamic market is to know and adapt to the changes happening in customer behavior. Moreover, recent years have seen explosive growth in the amount of information available (Min & Han, 2005). Fast-moving consumer goods (FMCG) distribution companies, in particular, collect huge amounts of data about their customers and their purchasing transactions, and in this data we can find interesting hidden information about the customers and their behavior.

The traditional approach to marketing decision making for promotions, campaigns, and market research in FMCG distribution companies is to rely mostly on internal expert opinion. These experts include the marketing managers and the sales managers, who are in constant touch with the salespeople and merchandisers who bring them market information.

However, this kind of decision-making process ignores customer data and customer behavior. Furthermore, in today's highly competitive and product-saturated world, customers are faced with various products and various providers pursuing different marketing strategies (Hossein Javaheri, 2008). In such a dynamic market, customer behavior changes all the time (Chen et al., 2005). When a marketing manager becomes aware, through the sales team, of some changes in the market, he or she may have no idea how or where to start understanding these changes and their reasons. The result is a broad, time-consuming, and costly market research project whose findings may not reach the marketing department in time to react to the changes. Also, in such a market there are many promotion campaigns run by the company itself and by its competitors, and it is difficult to analyze their effectiveness. So, in a competitive environment there is a need to mine customer data and transactions to find the changes in customer purchasing behavior, which is an effective and efficient way to respond to customer needs in a timely and accurate manner. As a result, many FMCG distribution companies in Iran are trying to move away from the traditional approach
to planning their marketing campaigns, promotions, and market research, by seeking to understand the changes happening in their customers' purchasing behavior. Change mining helps managers make better marketing strategies.
1.2 Problem definition: Kalleh Company is a private manufacturer and distributor of food products in Iran. It produces different categories of food, from dairy products to ice cream, meats, and sauces, with more than 10 categories and about 800 products. The company now faces the challenge of increasing competition, for several reasons. First, because of its wide variety of products, it must compete in several food markets (dairy, ice cream, meat, and so on), which means competing against many rivals with different product categories and different marketing strategies. There are also some powerful governmental companies that make the competition tougher for Kalleh. In such a market, customer behavior may change with the trend of companies' strategies as well as with the customers' own changing needs.

To respond to changes in customer purchasing behavior in a timely manner, and to avoid falling behind customer needs and the competition, Kalleh Company needs to mine the changes in customer purchasing behavior. Its goal is to mine changes in the purchasing behavior of customers in different segments and to respond to these changes accurately and on time, so as to increase its return on investment (ROI).
1.3 Purpose of this study: The purpose of this study is to mine changes in customer purchasing behavior. To reach this goal, we need to build customer purchasing patterns based on the customer, product, and transaction data collected in databases.

Data mining techniques can help us reach this goal. According to Song et al. (2001), data mining is the process of exploring and analyzing large quantities of data in order to discover meaningful patterns and rules. Many data mining studies have focused on developing techniques to build precise models that predict customer behavior and to set up marketing strategies and customization. According to Nemati and Barko (2001; cited by Nemati & Barko, 2003), most data mining applications (72%) are centered on predicting customer
behavior. Comparatively little attention has been paid to discovering changes in databases collected over time (Liu et al., 2000). What is obvious from the literature is that too much time is spent worrying about "absolute" numbers, such as lifetime value, when what businesses should really be observing is "relative" numbers: change over time. From a marketing viewpoint, the highest-potential-ROI customers are those in the process of changing their behavior, either accelerating their relationship with you or ending their relationship with the company (Novo, 2008). In many applications, mining changes can be more crucial than producing the precise prediction models that are at the center of existing data mining research. Regardless of how accurate a model is, it is inactive by itself, because it can only predict based on patterns mined from old data. Acting on the built model should not lead to actions that change the environment, because otherwise the model will cease to be correct (Liu et al., 2000). Prediction model building is more appropriate in areas where the environment is comparatively stable. However, in many business conditions constant human intervention in the environment is a fact. Businesses simply cannot let nature take its course; they constantly need to act in order to provide better services and products, by finding the interesting changes and the stable patterns in customer behavior. Even in a comparatively stable environment, changes are unavoidable due to internal and external issues (Liu et al., 2000).

From this viewpoint, the question 'Which patterns exist?', as answered by state-of-the-art data mining technology, is replaced by the question 'How do patterns change?' (Böttcher et al., 2006). Indeed, the discovery of interesting and previously unknown changes in customer, product, and transaction data not only lets the user monitor the influence of past business decisions but also prepares today's business for tomorrow's needs (Böttcher et al., 2006).

Major changes often require immediate attention and action to modify existing practices and/or the domain conditions (Liu et al., 2000). By using a change mining methodology, Kalleh Company can detect the different kinds of changes happening in customer purchasing behavior and build a stronger relationship with its customers. Understanding changes in customer behavior can also help managers set up effective and efficient promotion campaigns.

Liu et al. (2000) mention that there are two main goals for mining changes in a business environment:
"To follow the trends": The main feature of this kind of applications is the
word "follow". Companies like to know where the trend is going not to be left
behind. They need to investigate customers' changing behaviors so as to provide
products and services that suit the changing needs of the customers.
"To stop or to delay undesirable changes": In this kind of applications, the
keyword is "stop". Companies like to know undesirable changes as soon as possible
and to plan corrective measures to stop or to delay the pace of such changes.
The overall procedure consists of several steps. The literature contains some methods for change mining in dynamic situations. According to Song et al. (2001), the majority of data mining techniques, such as association rules and neural networks, cannot be used alone because they cannot manage dynamic situations well. Song et al. (2001) and Chen et al. (2005) developed a methodology for mining changes. They used association rules, introduced by Agrawal et al. (1993), to detect interesting association relationships among a large set of data items; the methodology detects all kinds of changes. According to Chen et al. (2005), change mining has several steps: data preprocessing, customer segmentation, association rule mining, and change mining. First, customers are segmented based on their behavioral variables: recency, frequency, and monetary value (RFM). Then, by building association rules over the customer behavioral variables (RFM), customer data, and transaction data, we describe the customer purchasing behavior in two different time snapshots. In the end, we compare the generated rules for each segment to mine the changes in customer purchasing behavior. Mining these changes calls for various algorithms and techniques, and implementing them requires extensive programming; finally, we combined all of the algorithms to build a change mining package.
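To make the segmentation step concrete, here is a minimal Python sketch of computing the RFM variables from raw transactions. The record layout (customer ID, purchase date, amount) and the reference date are illustrative assumptions, not the actual Kalleh schema.

```python
from datetime import date
from collections import defaultdict

# Hypothetical transactions: (customer_id, purchase_date, amount).
transactions = [
    ("c1", date(2008, 1, 5), 120.0),
    ("c1", date(2008, 2, 20), 80.0),
    ("c2", date(2008, 1, 30), 300.0),
]
reference_date = date(2008, 3, 1)  # end of the analysis period

rfm = defaultdict(lambda: {"recency": None, "frequency": 0, "monetary": 0.0})
for cust, day, amount in transactions:
    days_ago = (reference_date - day).days
    entry = rfm[cust]
    # Recency: days since the most recent purchase (smaller is more recent).
    if entry["recency"] is None or days_ago < entry["recency"]:
        entry["recency"] = days_ago
    entry["frequency"] += 1       # number of purchases in the period
    entry["monetary"] += amount   # total spending in the period

for cust, vals in rfm.items():
    print(cust, vals)
```

Segments can then be formed by scoring these three variables, for instance against their averages, as in the Customer Value Matrix used later in this thesis.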
1.4 Research question: Based on the problem discussion above, the purpose of this study is to mine changes in customer purchasing behavior. To achieve this purpose, the research questions are as follows: How can businesses be responsive to changes in customer behavior in a dynamic market? And how can businesses detect and access the changes that have happened in customer behavior patterns, so as to respond accurately and on time?
1.5 Research motivation: In recent years we have witnessed an explosion of data produced and collected by individuals and organizations. This rapid growth in data and databases has created the problem of data overload (Li, 2005). More recently, increased computing power has allowed greater flexibility in the models one can use and in the amount of data that can be stored and processed (Bolton, 2004); as a result, data mining techniques have emerged and flourished over the past several years to meet this demand (Li, 2005). Organizations are starting to understand the importance of data mining for their marketing strategies.

Meanwhile, businesses face the challenge of a constantly evolving market in which customer needs are changing all the time (Chen et al., 2005). In such an environment, recognizing the changes and responding to them rapidly and correctly is highly important. As customer needs change over time, businesses that cannot meet those needs will lose the customers who are their source of ROI. Some work on change mining has been done in retailing. One business that change mining can help improve is FMCG distribution, which faces a dynamic market with a huge variety of products and competitors. The purpose of change mining is to follow the trends in customer purchasing patterns, detect the changes, and respond to them in time to better satisfy customers and meet their needs.
1.6 Demarcation: This study focuses on mining changes in customer purchasing behavior based on the customer and purchasing transaction data stored in a database. Change mining has been performed on data gathered from the database of an FMCG distribution company in Iran. Most of the literature reviewed concerns mining changes in customer purchasing behavior. Our work focuses on building customer behavior patterns through association rule mining and on comparing the rules so built; these patterns are based solely on customers' previous transactions.
1.7 Research outline: This thesis consists of five chapters. The first chapter is an introduction that gives brief background on the subject, followed by the research question, objectives, and motivation. Chapter 2 is a literature review covering data mining, association rules, change mining, and customer segmentation. Chapter 3 describes our research methodology, including data preprocessing, market segmentation, mining customer behavior, and change mining. Chapter 4 presents the results and analysis. Chapter 5 is the last chapter, containing the conclusion, limitations, and suggestions for further research.
Chapter 2: Literature Review
Review of Mining Customer Behavior
Review of Data Mining
Review of Association Rule Mining
Review of Change Mining
Review of Customer Segmentation
2.1 Mining Customer Behavior: Different methods for describing customer behavior exist in the literature. Among them are various types of conjunctive rules for building customer behavior patterns, including association rules and classification rules (Agrawal et al., 1996; Breiman et al., 1984; cited in Adomavicius & Tuzhilin, 2001).

Using rules to describe customer behavior has certain advantages. Besides being a descriptive way to portray behavior, the conjunctive rule is a well-studied concept, used widely in data mining, expert systems, and many other areas. In addition, researchers have proposed many rule discovery algorithms in the literature, especially for association rules (Adomavicius & Tuzhilin, 2001). To discover rules that describe the behavior of customers, we can use various data mining algorithms, such as Apriori for association rule mining.

Association rules were initially applied to market basket analysis, to find relationships between product items purchased by customers at retail stores (Agrawal, Imielinski, & Swami, 1993; Srikant, Vu, & Agrawal, 1997; cited by Chen et al., 2005). In a study of customer behavior, we can apply association rules to find correlations between customer demographic variables, purchased products, and product databases (Song et al., 2001).

In this chapter we first review data mining and then association rules. The next topic is the change mining of customer behavior in the literature, and finally we give a brief review of customer segmentation.
2.2 Review of Data Mining
2.2.1 Data mining in brief:
Today, databases can be very large, and hidden within their data lies strategic information. But with a huge amount of data, inducing meaningful conclusions is not easy. The answer is data mining, which is used both to increase revenues and to reduce costs. Many people use "data mining" as a synonym for another popular term, knowledge discovery in databases (KDD); others, in turn, define data mining as the core process of KDD.

The KDD processes are shown in Figure 2.1 (Han & Kamber, 2006). KDD usually has three processes. The first is preprocessing, executed before data mining techniques are applied to the data. Preprocessing includes data cleaning, integration, selection, and transformation. The main process of KDD is data mining itself, in which different algorithms are applied to produce hidden knowledge. The last process is post-processing, which evaluates the mining results according to the users' requirements and domain knowledge. If the results are satisfactory, the knowledge can be presented; otherwise we have to run some or all of the processes again until we obtain satisfactory results (Han & Kamber, 2006).
Figure 2.1: Knowledge Discovery in Database Processes
Song et al. (2001) define data mining as a process of exploring and analyzing large quantities of data to discover meaningful patterns and rules. Feelders et al. (2000) describe the data mining process as follows:
Source: (Feelders et al., 2000)
Figure 2.2: The major steps in the data mining process
The potential returns of data mining are immense. Innovative organizations worldwide are already using data mining to attract higher-value customers, to reconfigure their product offerings to increase sales, and to minimize losses due to mistakes or fraud.
2.2.2 Data mining functions: Dunham (2002) categorizes data mining techniques into two categories, descriptive and predictive (Figure 2.3).
Source: (Dunham, 2002)
Figure 2.3: Classification of DM techniques
The first and simplest analytical step in data mining is to describe the data: summarize its statistical attributes (such as means), review it visually with charts and graphs, and look at the correlations among variables. The most important steps are selecting the right data, gathering it, and exploring it. Sometimes data description alone cannot provide an action plan; you must then build a predictive model based on patterns determined from known results and test that model on new sample data. A good model will never be identical to reality, but it can be a useful guide to understanding your business, and in the end the model should be verified empirically (Twocrows.com, 2005). In the next section we briefly explain three important data mining techniques.
2.2.3 Classification in brief:
According to Han and Kamber (2006), classification is automatic model building that assigns objects to classes, so as to predict the class or a missing attribute value of future objects whose class may not be known. The process has two steps. In the first step, a model is built to describe the characteristics of a set of data classes or concepts, based on a collection of training data. Because the data classes or concepts are predefined, this step is also known as supervised learning. In the second step, the model is used to predict the classes of future data or objects. There are several techniques for classification (Han & Kamber, 2006). Much research has addressed classification by decision tree, and plenty of algorithms have been designed; Murthy conducted an extensive survey of decision tree induction (Murthy, 1998; cited by Han & Kamber, 2006). Bayesian classification is another technique, described in Duda and Hart (1973; cited by Han & Kamber, 2006). Nearest-neighbor methods are also discussed in many statistical texts on classification, such as Duda and Hart (1973) and James (1985), both cited by Han and Kamber (2006). Besides these, there are many other machine learning and neural network techniques used to build classification models.
2.2.4 Clustering in brief: As mentioned before, classification can be seen as a supervised learning process. Clustering is a similar mining technique, but it is an unsupervised learning process. "Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects" (Han & Kamber, 2006): objects within the same cluster must be similar to some extent, and dissimilar to objects in other clusters. In classification each record belongs to a predefined class, while in clustering there are no predefined classes; objects are grouped together based on their similarities (Han & Kamber, 2006). Similarities between objects are expressed by similarity functions; usually they are quantitatively defined as distances or other measures by the corresponding domain experts (Han & Kamber, 2006). Many clustering applications are found in market segmentation: by clustering customers into different groups, business organizations can provide different personalized services to different groups of the market (Han & Kamber, 2006). An extensive survey of current clustering techniques and algorithms is available in Berkhin (2002; cited by Han & Kamber, 2006).
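As a minimal illustration of distance-based grouping, the sketch below assigns customers to the nearer of two fixed centroids using Euclidean distance. The customer vectors and centroids are invented for the example, and a full clustering algorithm such as k-means would also re-estimate the centroids iteratively.

```python
def euclidean(p, q):
    # Straight-line distance between two numeric feature vectors.
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

# Hypothetical customers described by (frequency, monetary) and two centroids.
customers = {"c1": (2, 50.0), "c2": (9, 400.0), "c3": (8, 350.0)}
centroids = {"low-value": (2, 60.0), "high-value": (10, 380.0)}

for name, vec in customers.items():
    cluster = min(centroids, key=lambda c: euclidean(vec, centroids[c]))
    print(name, "->", cluster)
```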
2.2.5 Association rules in brief:
Association rule mining is one of the most important data mining techniques, first introduced by Agrawal et al. (1993). Its goal is to extract interesting correlations, frequent patterns, and associations among sets of items in transaction databases or other data repositories (Agrawal et al., 1993). Association rules are used extensively in various areas. In this study we will use association rules to mine customer behavior patterns and find behavioral changes. The next section reviews association rule mining.
2.3 Association Rule Mining Review:
2.3.1 Association rule mining problem: In this section we introduce the association rule mining problem in detail. A typical association rule is an implication of the form A → B, where A is an itemset and B is an itemset containing only a single atomic condition (Berry & Linoff, 2004). Two measures evaluate each association rule. The support of an association rule is the percentage of records containing both A and B, and the confidence of a rule is the percentage of records containing itemset A that also contain itemset B. Support indicates the usefulness of a discovered rule, and confidence indicates the certainty of the found association rules (Berry & Linoff, 2004). We can also calculate another measure, lift, which compares the confidence of a rule with its expected value. Berry and Linoff (2004) define lift (also called improvement) as a measure telling us how much better a rule is at forecasting the result than just assuming the result in the first place: "Lift is the ratio of the density of the target after application of the left-hand side to the density of the target in the population". "Another way of saying this is that lift is the ratio of the records that support the entire rule to the number that would be expected, assuming that there is no relationship between the products" (the exact formula is given later in the chapter) (Berry & Linoff, 2004).
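As a worked illustration of the three measures, the following sketch computes support, confidence, and lift for a rule A → B over a toy transaction list; the transactions and items are invented for the example.

```python
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

A, B = {"bread"}, {"butter"}
sup_rule = support(A | B, transactions)      # support of A -> B: 2/5 = 0.40
conf = sup_rule / support(A, transactions)   # confidence: 0.40 / 0.80 = 0.50
lift = conf / support(B, transactions)       # lift: 0.50 / 0.40 = 1.25

print(f"support={sup_rule:.2f} confidence={conf:.2f} lift={lift:.2f}")
```

A lift above 1 means the rule predicts B better than assuming B's base rate, which matches the verbal definition quoted above.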
2.3.2 Apriori algorithm: Association rule mining is the discovery of association rules that satisfy predefined minimum support and confidence thresholds in a database (Agrawal & Srikant, 1994). According to Agrawal and Srikant (1994), this problem is usually decomposed into two subproblems:

The first is to find the itemsets whose occurrence exceeds a predefined threshold in the database; these itemsets are called frequent, or large, itemsets. This problem can be further divided into two subproblems: candidate large itemset generation and frequent itemset generation. Large or frequent itemsets are those whose support exceeds the support threshold, and candidate itemsets are those that are expected, or hoped, to be large or frequent.

The second subproblem is producing association rules from those large itemsets, subject to the minimal confidence constraint. The whole process of the standard association rule mining problem is shown in Figure 2.4.
Source: (Agrawal et al, 1993)
Figure 2.4: Classic Problem of association rule mining
The overall performance of mining association rules is determined mainly by the first step (Agrawal & Srikant, 1994). After the large itemsets are found, the corresponding association rules can be derived in a straightforward manner. The focus of most mining algorithms is therefore on counting large itemsets efficiently, and many efficient solutions have been designed toward this end (Kantardzic, 2003).
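The sketch below shows both subproblems end to end in Python: level-wise frequent itemset generation with downward closure pruning, followed by rule generation against a confidence threshold. It is a didactic rendering of the classic algorithm, not the implementation used in this thesis.

```python
from itertools import combinations

def apriori(transactions, min_support):
    # Level-wise search: frequent k-itemsets generate candidate (k+1)-itemsets,
    # which are pruned with the downward closure property before counting.
    n = len(transactions)
    def sup(itemset):
        return sum(itemset <= t for t in transactions) / n
    level = {frozenset([i]) for t in transactions for i in t}
    level = {c for c in level if sup(c) >= min_support}
    freq, k = {}, 1
    while level:
        freq.update({c: sup(c) for c in level})
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k - 1))}
        level = {c for c in candidates if sup(c) >= min_support}
    return freq

def rules_from(freq, min_conf):
    # Second subproblem: split each frequent itemset into antecedent/consequent.
    out = []
    for itemset in (s for s in freq if len(s) > 1):
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = freq[itemset] / freq[lhs]
                if conf >= min_conf:
                    out.append((set(lhs), set(itemset - lhs), freq[itemset], conf))
    return out

db = [frozenset(t) for t in ({"a", "b", "c"}, {"a", "b"}, {"a", "c"},
                             {"b", "c"}, {"a", "b", "c"})]
for lhs, rhs, s, c in rules_from(apriori(db, 0.4), 0.6):
    print(lhs, "->", rhs, f"(support={s:.2f}, confidence={c:.2f})")
```

Note how the candidate set for level k is built before the counting pass and stays fixed during it, which is exactly the "Apriori" itemset strategy discussed below.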
Different kinds of produced association rules:
One attraction of association rules is the clarity and utility of the results, which take the form of rules about groups of products. An association rule has an intuitive appeal because it shows how tangible products and services group together (Berry & Linoff, 2004). But while association rules are easily understandable, they are not always useful (Berry & Linoff, 2004). Generated association rules come in three types: actionable rules, trivial rules, and inexplicable rules.

Actionable rules are the useful ones: they hold high-quality, actionable information. Once such a pattern is found, it is often not hard to justify, and thinking about the rule in the real environment can lead to insights and actions. Because the rule is easily understood, it suggests plausible causes and possible interventions (Berry & Linoff, 2004).

The second type is trivial rules, whose results are already known to many people in the business. Although such a rule is valid and well supported by the data, it is still not practical; a simple example is that customers purchasing hamburgers buy hamburger buns. A subtler problem falls into the same category: an apparently interesting result may simply reflect past marketing programs and product bundles. Although other data mining techniques share this problem, market basket analysis is especially vulnerable to reproducing the success of prior marketing campaigns, because of its dependence on un-summarized point-of-sale data, exactly the same data that records the success of the campaign. Trivial rules have one advantage: when a rule should hold 100 percent of the time, the few cases where it does not hold supply a lot of information about data quality. Exceptions to trivial rules point to areas where business operations, data collection, and processing may need to be refined (Berry & Linoff, 2004).

Inexplicable results seem to have no interpretation and do not suggest a course of action. One caution when applying market basket analysis is that many of the results are often either trivial or inexplicable: trivial rules reproduce common knowledge about the business, wasting the effort spent on complex analysis techniques, while inexplicable rules are flukes in the data and are not actionable (Berry & Linoff, 2004).
ARM Approaches Classification:
Association rule mining is a well-studied research area; in this section we review only some basic and classic approaches. As mentioned before, the second subproblem of ARM is straightforward, so most approaches focus on the first subproblem, which can be further divided into two subproblems: candidate large itemset generation and frequent itemset generation. Most of the association rule mining algorithms surveyed are quite similar; the difference lies in the extent to which specific improvements have been made. According to Zhao and Bhowmick (2003), there are three milestones in the classic ARM problem: the Apriori approach, tree-structure approaches, and special issues in ARM. Besides these approaches, there is another approach from Zaki et al. (1999): class-based algorithms. The literature contains several features by which ARM algorithms can be classified along different aspects; we examine some of them in the following subsections.

The following features can be used to classify the algorithms; they try to best differentiate the various algorithms found in the literature (summarized in Table 2.1):
Target: Basic association rule algorithms find all rules satisfying the given support and confidence thresholds. However, more efficient algorithms can be obtained, for example by adding constraints on the rules to be produced. Algorithms can thus be categorized as complete (all association rules satisfying the support and confidence are found), constrained (some subset of all the rules is found, based on a technique that limits them), or qualitative (a subset of the rules is produced that must satisfy additional measures beyond support and confidence) (Dunham et al., 2001).
Type: This is the type of association rules produced, for example regular (Boolean), spatial, temporal, generalized, or quantitative (Dunham et al., 2001).
Data type: Besides data stored in a database, the type of data is also important. Association rules over plain text may uncover very important information; for example, "data", "mining", and "decision" may be highly dependent in a paper on knowledge discovery (Dunham et al., 2001).
Data source: In addition to market basket data, association rules over data beyond the database may play an important role in a company's decision making (Dunham et al., 2001).
Technique: All approaches to date are based on first finding the large itemsets. There could, of course, be other techniques that do not require the large itemsets to be found first; although to date we are not aware of any such technique, the possibility certainly exists and offers potential for improved performance. Relatedly, Agrawal et al. (1998; cited in Dunham et al., 2001) proposed "strongly collective itemsets" for evaluating and finding itemsets, with notions of "support" and "confidence" completely different from the large itemset approach: an itemset I is said to be strongly collective at level K if the collective strength C(K) of I, as well as of any subset of I, is at least K (Dunham et al., 2001).
Itemset strategy: Different algorithms generate itemsets differently. This feature captures how the algorithm considers transactions and when the itemsets are produced. One technique, complete, would produce and count all potential itemsets. The most common approach is the one introduced by Apriori: a set of itemsets to count is produced before scanning the transactions, and this set remains constant during the process. A dynamic strategy produces itemsets during the scan of the database itself. A hybrid technique generates some itemsets before the database scan but also adds new itemsets to the counting set during the scan (Dunham et al., 2001).
Transaction strategy: Different algorithms treat the set of transactions in different ways. This feature captures how the algorithm scans the transactions. The complete strategy checks all transactions in the database. With the sample approach, some subset of the database (a sample) is examined before the complete database is processed. Partition techniques divide the database into partitions, which the database scan then checks individually and in order (Dunham et al., 2001).
Itemset data structure: As itemsets are produced, different data structures can be used to keep track of them. The most usual approach seems to be a hash tree; alternatively, a trie or a lattice may be used. At least one technique suggests a virtual trie, in which only a portion of the complete trie is actually materialized (Dunham et al., 2001).
Transaction Data Structure: "Each algorithm assumes that the transactions are
stored in some basic structure, usually a flat file or a TID list" (Dunham M.H., et al,
2001).
Optimization: Many algorithms improve on earlier ones by applying an optimization strategy. Various strategies have considered optimization based on available main memory, on whether or not the data is skewed, and on pruning of the itemsets to be counted (Dunham et al., 2001).
Architecture: Some algorithms are designed to run sequentially on a centralized single-processor architecture; others are designed to work in parallel, suited to a multiprocessor or distributed architecture (Dunham et al., 2001).
Parallelism strategy: Parallel algorithms can be further described as using task or data parallelism (Dunham et al., 2001).
The literature contains some further features by which association rule mining methods can be categorized; we consider them below:
Counting strategy: This refers to the method used to count occurrences of candidate itemsets. There are two main approaches: horizontal counting and vertical intersection. Horizontal counting determines the support of a candidate itemset by scanning transactions one by one and incrementing the itemset's counter whenever the itemset is a subset of the transaction. This approach works well for rarely occurring candidates, because only the transactions containing the itemset need to be examined; the candidate lookup operation, however, is very expensive for candidates of large size (Su & Lin, 2004). Vertical intersection, on the other hand, is applied when the database is in a vertical format, in which every record associates an item with the identifiers of the transactions containing it, called a tidlist. Although the vertical intersection scheme avoids the I/O cost of database scans, it has the following shortcoming: when a candidate itemset has a support count much smaller than the number of transactions, a large number of unnecessary intersections are performed (Su & Lin, 2004).
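A minimal sketch of the vertical format: each item maps to its tidlist, and the support count of a candidate itemset is the size of the intersection of its members' tidlists. The transactions are invented for the illustration.

```python
# Horizontal transactions, then converted to the vertical (tidlist) format.
transactions = {1: {"a", "b"}, 2: {"a", "c"}, 3: {"a", "b", "c"}, 4: {"b"}}

tidlists = {}
for tid, items in transactions.items():
    for item in items:
        tidlists.setdefault(item, set()).add(tid)

def vertical_support(itemset):
    # Intersect the tidlists of all items; the result's size is the support count.
    tids = set.intersection(*(tidlists[i] for i in itemset))
    return len(tids)

print(vertical_support({"a", "b"}))  # 2 (transactions 1 and 3)
```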
Search direction: According to Su and Lin (2004), there are two main search directions: bottom-up traversal and top-down traversal. Today, most Apriori-like approaches apply bottom-up traversal of the search space, starting from the frequent 1-itemsets and moving upward to the longest frequent itemsets. The most important advantage of this model is that it can effectively prune the search space by exploiting the downward closure property: once an itemset is recognized as infrequent, all of its supersets are also infrequent. However, this benefit fades when most of the maximal frequent itemsets are located near the largest itemset of the search lattice, owing to a comparatively small support threshold; in that situation there are very few itemsets to prune (Su & Lin, 2004). The other traversal method, top-down traversal, works in the opposite direction, starting from the longest itemsets and moving down to the frequent 1-itemsets (Su & Lin, 2004). This strategy is traditionally applied for discovering maximal frequent itemsets (Tseng & Lin, 2001; cited by Su & Lin, 2004). We should note, however, that although all frequent itemsets can be derived from the maximal ones, additional counting is needed to obtain their exact supports for computing the confidences of association rules. Moreover, if there are huge numbers of items and/or the support threshold is very low, many infrequent itemsets have to be visited before the maximal frequent itemsets are identified. This is why most work on frequent itemset mining adopts the bottom-up paradigm instead (Su & Lin, 2004).
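Since this thesis also works with maximal frequent itemsets (frequent itemsets with no frequent proper superset), a small sketch shows one simple way to obtain them: post-filtering a complete collection of frequent itemsets such as the one Apriori returns. This is distinct from the top-down discovery discussed above, but yields the same set.

```python
def maximal_frequent(freq_itemsets):
    # Keep only itemsets that have no proper superset in the collection.
    itemsets = list(freq_itemsets)
    return [s for s in itemsets if not any(s < other for other in itemsets)]

freq = [frozenset(x) for x in
        ({"a"}, {"b"}, {"c"}, {"a", "b"}, {"a", "c"}, {"a", "b", "c"})]
for m in maximal_frequent(freq):
    print(set(m))  # here only {'a', 'b', 'c'} survives
```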
Search strategy: While the search direction dictates how the search space is explored, the search strategy determines the order in which itemsets are visited (Su & Lin, 2004). One strategy is breadth-first search (BFS); most Apriori-like algorithms use it because it facilitates pruning candidates with the downward closure property. This strategy, however, needs more memory to keep the frequent subsets of the pruned candidates (Su & Lin, 2004). Another strategy is depth-first search (DFS), which recursively visits the descendants of an itemset. In the literature, DFS is usually combined with the vertical-intersection counting strategy, because it then suffices to keep in memory the tidlists corresponding to the itemsets on the path from the root down to the itemset currently being inspected (Su & Lin, 2004).
Table 2.1: Factors for classification of ARM

Dimension                  | Values
---------------------------|----------------------------------------------------
Target                     | Complete, Constrained, Qualitative
Type                       | Regular (Boolean), Generalized, Quantitative, etc.
Data type                  | Database data, Text
Data source                | Market basket, Beyond basket
Technique                  | Large itemset, Strongly collective itemset
Itemset strategy           | Complete, Apriori, Dynamic, Hybrid
Transaction strategy       | Complete, Sample, Partitioned
Itemset data structure     | Hash tree, Trie, Virtual trie, Lattice
Transaction data structure | Flat file, TID list
Optimization               | Memory, Skewed, Pruning
Architecture               | Sequential, Parallel
Parallelism strategy       | None, Data, Task
Pattern kind               | Sequential pattern, Frequent itemset, Structured pattern
Rule kind                  | Association rule, Strong gradient relationship, Correlation (Han & Kamber, 2006)
Counting strategy          | Horizontal, Vertical
Search direction           | Bottom-up traversal, Top-down traversal, Hybrid
Search strategy            | BFS, DFS
Candidate generation       | Complete, Heuristic
2.3.3 Association Rule Mining Approaches: The Apriori Approach
AIS algorithm:
The AIS (Agrawal, Imielinski, Swami) algorithm was the first algorithm proposed for mining association rules (Agrawal et al., 1993). It concentrates on improving the quality of databases together with the functionality needed to process decision support queries. According to Zhao and Bhowmick (2003), this algorithm produces only single-item-consequent association rules; that is, the consequent of each rule contains only one item, so it produces rules like ABC→D but not rules like AB→CD.
Disadvantage: The main disadvantage of the AIS algorithm is that it produces too many candidate itemsets that ultimately turn out to be small, which requires more space and wastes much effort that turns out to be useless. At the same time, the algorithm requires too many passes over the whole database (Zhao & Bhowmick, 2003).
Apriori algorithm:
Apriori marked a great improvement in the history of association rule mining; it was first introduced by Agrawal and Srikant (1994). AIS is a straightforward approach that needs many passes over the database and produces many candidate itemsets, saving counters for each candidate even though most of them turn out not to be frequent. Apriori is more efficient during the candidate generation process for two reasons: it applies a different candidate generation method and a new pruning technique (Zhao & Bhowmick, 2003).
a) Problem & limitation of Apriori:
One is the complex candidate generation process that spends most of the time,
space, and memory. Another bottleneck is the several scan of the database. Many
new algorithms were designed with some modifications or improvements based on
Apriori algorithm. Commonly, there were two approaches:
First approach tries to reduce the number of passes over the whole database or
replace the whole database with only part of it based on the current frequent
itemsets. The other approach tries exploring different types of pruning techniques to
make the number of candidate itemsets much lesser. Apriori-TID and Apriori-
Hybrid (Agrawal, R., & Srikant, R., 1994), DHP (Park et al, 1995; cited by Zhao,
Q., Bhowmick, S.S., 2003), SON (Savesere et al, 1995) are modifications of the
Apriori algorithm (Zhao, Q., Bhowmick, S.S., 2003).
Optimized Apriori algorithms:
Given the problems of Apriori mentioned in the previous section, several new approaches have been introduced. In the following sections we present them: item pruning and reduction of the passes over the database.
a) Transaction and item pruning:
This is one of the main optimizations of the Apriori algorithm. There is no need to inspect the whole database every time the occurrences of candidate itemsets must be counted. This optimization drastically reduces the time needed to count the support for the candidate sets and enhances performance. Transaction pruning is present in three algorithms: AprioriTid, Apriori Hybrid, and DHP.

AprioriTID and Apriori Hybrid:
AprioriTID was introduced in the same paper as Apriori. Although the paper does not state it explicitly, AprioriTID uses transaction pruning to improve on Apriori's performance: instead of using the whole database to count support for candidate sets, it uses an alternative representation (Ayad, 2000). The main disadvantage of this algorithm is that the size of the alternative set representing the database may exceed the size of the actual database in the early stages, thus losing its edge over Apriori. Because of this disadvantage another algorithm, Apriori Hybrid, was introduced: it uses Apriori in the first stages and then switches to AprioriTID when transaction pruning becomes more effective (Ayad, 2000).
DHP:
DHP (Dynamic Hashing and Pruning) is another algorithm, introduced by Park et al. (1995). It uses probabilistic counting to reduce the number of candidate itemsets counted during each round of Apriori execution. The reduction is accomplished by subjecting each candidate k-itemset to a hash-based filtering step in addition to the pruning step (Ayad, 2000).

During candidate counting in round k-1, the algorithm builds a hash table. Each entry in the hash table is a counter that holds the sum of the supports of the k-itemsets that hash to that particular entry. The algorithm uses this information in round k to prune the set of candidate k-itemsets: after subset pruning as in Apriori, the algorithm can remove a candidate itemset if the count in its hash table entry is smaller than the minimum support threshold. DHP has two advantages. First, like Apriori it is based on the monotone Apriori property, but the hash table reduces the candidate space by pre-computing approximate supports for the (k+1)-itemsets while counting the k-itemsets. Second, DHP applies transaction trimming, removing the transactions that do not contain any frequent items. However, the trimming and pruning properties cause some problems that make DHP impractical in many cases (Ayad, 2000).
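A simplified sketch of DHP's hash-based filtering for candidate 2-itemsets. The toy hash table and data are assumptions, and the transaction trimming step of the real algorithm is omitted for brevity.

```python
from itertools import combinations

transactions = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c"}]
min_count = 2
n_buckets = 7  # deliberately small so that bucket collisions are possible

# While counting 1-itemsets, hash every 2-itemset of each transaction into
# a bucket and accumulate bucket hits (the DHP hash table for round 2).
buckets = [0] * n_buckets
for t in transactions:
    for pair in combinations(sorted(t), 2):
        buckets[hash(pair) % n_buckets] += 1

# A bucket count is an upper bound on the support of every pair hashing to it,
# so a candidate pair whose bucket count is below min_count can be discarded
# before the counting pass of round 2.
frequent_items = {"a", "b", "c"}  # assume these survived round 1
candidates = [pair for pair in combinations(sorted(frequent_items), 2)
              if buckets[hash(pair) % n_buckets] >= min_count]
print(candidates)
```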
b) Reducing the number of database passes:
As mentioned before, the main disadvantage of classical Apriori is the several passes it must make over the database, equal in number to the length of the longest frequent itemset (pattern) present in the database (Zhao & Bhowmick, 2003). Many optimization efforts have focused on reducing the number of database passes; they differ in how this reduction is achieved. That is the focus of this section.
Database Partitioning:
(Savasere et al, 1995) developed Partition, an algorithm that requires only two scans of the transaction database. The database is divided into disjoint partitions, each small enough to fit in memory. In the first scan, the algorithm reads each partition and computes the locally frequent itemsets of each partition using Apriori. In the second scan, the algorithm counts the support of all locally frequent itemsets with respect to the complete database. If an itemset is frequent in the complete database, it must be frequent in at least one partition; therefore, the second scan counts a superset of all potentially frequent itemsets.
The main achievement of Partition is the reduction of database activity, and it was shown that this reduction is not obtained at the expense of more CPU utilization. It was shown, however, that the number of partitions greatly affects the performance of the algorithm by affecting the number of locally frequent itemsets that turn out to be globally infrequent. The algorithm was also shown to be vulnerable to data skew (Ayad, A. M., 2000).
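A two-scan sketch of the Partition idea is given below; `mine_partition` stands in for any in-memory frequent-itemset miner such as Apriori, and slicing an in-memory transaction list into equal pieces is an assumption made for brevity:

```python
def partition_mine(transactions, num_partitions, min_frac, mine_partition):
    """Two-scan Partition sketch; `mine_partition` returns the itemsets
    frequent at fraction `min_frac` within one in-memory partition."""
    size = len(transactions)
    step = -(-size // num_partitions)  # ceiling division
    # Scan 1: union of locally frequent itemsets over all partitions.
    candidates = set()
    for i in range(0, size, step):
        candidates |= set(mine_partition(transactions[i:i + step], min_frac))
    # Scan 2: count each candidate against the complete database.
    counts = {c: 0 for c in candidates}
    for t in transactions:
        items = set(t)
        for c in candidates:
            if set(c) <= items:
                counts[c] += 1
    return {c: n for c, n in counts.items() if n >= min_frac * size}
```

The correctness argument is the one stated above: a globally frequent itemset must be locally frequent in at least one partition, so scan 1 cannot miss it.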
Dynamic itemset counting:
(Brin et al, 1997) proposed the Dynamic Itemset Counting (DIC) algorithm. DIC partitions the database into several blocks marked by start points and repeatedly scans the database. In contrast to Apriori, DIC can add new candidate itemsets at any start point, instead of only at the beginning of a new database scan. At each start point, DIC estimates the support of all itemsets that are currently being counted and adds new itemsets to the set of candidates if all their subsets are estimated to be frequent (Brin et al, 1997). If DIC adds all frequent itemsets and their negative border to the set of candidate itemsets during the first scan, it will have counted each itemset's exact support at some point during the second scan; thus DIC will complete in two scans (Ayad, A. M., 2000). DIC reduces the number of I/O passes by counting candidates of multiple lengths in the same pass. It performs well on homogeneous data, while in other cases it might scan the database more often than the Apriori algorithm.
c) Sampling:
(Toivonen, H., 1996) proposed a sampling-based algorithm that typically requires two scans of the database. The algorithm first takes a sample from the database and generates a set of candidate itemsets that are highly likely to be frequent in the complete database. In a subsequent scan over the database, the algorithm counts these itemsets' exact supports and the support of their negative border. If no itemset in the negative border is frequent, then the algorithm has discovered all frequent itemsets. Otherwise, some superset of an itemset in the negative border could be frequent, but its support has not yet been counted; the sampling algorithm generates and counts all such potentially frequent itemsets in a subsequent database scan (Toivonen, H., 1996). The algorithm was shown to perform well compared to other level-wise algorithms and to the Partition algorithm, and it effectively reduces database activity to one pass. Its only drawback is that it has to test many spurious candidates, due to the reduced support threshold needed to guarantee a superset of the actual frequent itemsets (Ayad, A. M., 2000).
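A rough sketch of the sampling flow is shown below, under stated assumptions (in-memory transactions, a `mine` callable standing in for Apriori, and an illustrative slack factor of 0.8); the negative-border verification that makes Toivonen's algorithm complete is deliberately omitted for brevity:

```python
import random

def sample_mine(transactions, sample_frac, min_frac, mine, slack=0.8):
    """Mine a random sample at a lowered threshold (slack < 1), so a
    truly frequent itemset is unlikely to be missed, then verify all
    candidates in one full scan over the database."""
    n = max(1, int(sample_frac * len(transactions)))
    sample = random.sample(transactions, n)
    candidates = mine(sample, slack * min_frac)
    counts = {c: 0 for c in candidates}
    for t in transactions:
        items = set(t)
        for c in candidates:
            if set(c) <= items:
                counts[c] += 1
    cutoff = min_frac * len(transactions)
    return {c: cnt for c, cnt in counts.items() if cnt >= cutoff}
```

The lowered sample threshold is exactly the source of the spurious candidates mentioned above: the smaller the slack factor, the safer the result but the more candidates must be verified.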
Conclusion of the Apriori Approach:
Most of the algorithms introduced above are based on the Apriori algorithm and try to improve its efficiency through modifications such as reducing the number of passes over the database, reducing the size of the database to be scanned in each pass, pruning the candidates by different techniques, and using sampling (Zhao, Q. & Bhowmick, S.S., 2003). However, the Apriori algorithm has two bottlenecks: the first is the complex candidate generation process, which consumes most of the time, space, and memory; the second is the multiple scans of the database. Despite these bottlenecks, Apriori is used in many applications for building patterns in large databases.
2.4 Mining Changes Literature Review: As mentioned in the chapter on customer behavior analysis, mining changes plays a very important role in business strategies and marketing. In this chapter, we briefly review the literature related to mining changes in customer buying behavior patterns and determine the position of my work within this research. In the past, researchers generally applied statistical surveys to study customer behavior. Recently, however, data mining techniques have been adopted to describe and predict customer behavior (Giudici & Passerone, 2002; Song, Kim, & Kim, 2001 cited by Song et al, 2001). There are some works related to mining in a dynamic environment, although not as many as have been done on customer behavior modeling or prediction.
Liu et al. (2000) devised a method of change mining in the context of decision trees for predicting changes in customer behavior. Since the decision tree is a classification-based approach, it cannot detect complete sets of changes (Song et al, 2001). Association rule extraction has been widely used for analyzing the correlation between product items purchased by customers, and to support sales promotion and market segmentation (Changchien & Lu, 2001; Changchien, Lee, & Hsu, 2004; cited by Song et al, 2001). (Song et al, 2001) employed an approach based on association rules to identify changes in customer behavior, and (Chen et al, 2005) employed another approach, recognizing changes in customer behavior through association rule mining methods.
Existing work has been carried out on learning in a changing environment (Fruend and Mansour, 1997; Helmbold & Long, 1994; Widmer, 1996; cited by Song et al, 2001) and on mining in a changing environment (Bay and Pazzani, 1999; Ganti, Gehrke, Ramakrishnan, 1999; Han, Kamber, 2001; Liu et al, 2000; Nakhaeizadeh, Taylor, Lanquillon, 1998; cited by Song et al, 2001). For example, (Fruend and Mansour, 1997 cited in Song et al, 2001) presents a model of learning under a changing distribution. All of the following related works focus on dynamic aspects or on comparisons between two different datasets or rule sets; they are clustered into six categories in this chapter. According to (Song et al, 2001), these six groups of work in the area of data mining in a changing environment are as follows.
The first field of study concerned with mining in a changing environment is rule maintenance (Cheung, Han, Ng & Wong, 1996a; Cheung, Ng, Tam, 1996b; Feldman, Aumann, Amir & Mannila, 1997; cited by Song et al, 2001; Thomas, Bodagal, Alsabti & Ranka, 1997). The purpose of these studies is to improve accuracy in a changing environment. For example, (Thomas et al, 1997) proposed an incremental updating technique based on negative borders for the maintenance of association rules when new transaction data is added to or removed from the transaction database. An important aspect of this technique is that it requires a full scan of the database only if the database update causes the negative border of the set of large itemsets to expand. However, these techniques do not present any changes to the user; they merely maintain existing knowledge.
The second research trend related to our work is discovering emerging patterns (Agrawal, R. & Psaila, G., 1995; Dong, G., & Li, J., 1999; Li et al, 2000). These studies try to find emerging patterns (EPs), which are defined as itemsets whose supports increase significantly from one dataset to another. (Agrawal, R. & Psaila, G., 1995) established the active data mining paradigm, in which data is continuously mined at a desired frequency. As rules are discovered, they are added to a rulebase, and if they already exist, the history of the statistical parameters associated with the rules is updated. When the history begins exhibiting certain trends, specified as shape queries in user-specified triggers, the triggers are fired and appropriate actions are initiated. (Dong, G., & Li, J., 1999) introduced the data-mining problem of emerging patterns (EPs). (Li et al, 2000) proposed the use of jumping emerging patterns (JEPs) as the basis for a new classifier called the JEP-Classifier. Each JEP can capture some crucial difference between a pair of datasets; aggregating all JEPs of large support can then produce more potent classification power. They use two algorithms for constructing the JEP-Classifier, both scalable and efficient, which make use of the border representation to efficiently store and manipulate JEPs. EPs can capture emerging trends in time-stamped databases, or useful contrasts between data classes, but they do not consider structural changes in the rules (Song et al, 2001).
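Stated as code, the EP definition amounts to a growth-rate test between the supports measured in two datasets. The following minimal sketch assumes the supports have already been computed; the growth threshold of 2.0 is illustrative:

```python
def emerging_patterns(supp_old, supp_new, min_growth=2.0):
    """Itemsets whose support grew by at least `min_growth` from the
    old dataset to the new one; supports are dicts mapping an itemset
    (e.g. a frozenset of items) to its support."""
    eps = {}
    for itemset, s_new in supp_new.items():
        s_old = supp_old.get(itemset, 0.0)
        growth = float('inf') if s_old == 0 else s_new / s_old
        if growth >= min_growth:
            eps[itemset] = growth
    return eps
```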
Another connected line of research is subjective interestingness in data mining (Liu & Hsu, 1996; Liu et al, 1997; Liu et al, 1999; Padmanabhan & Tuzhilin, 1999; Silberchatz & Tuzhilin, 1996; Suzuki, 1997 cited by Song et al, 2001). These studies provide a number of techniques for finding rules that are unexpected with respect to the user's existing knowledge. For example, (Liu & Hsu, 1996) tries to bridge the gap between the user and the rules created by an induction system. A fuzzy matching technique is recommended for rule comparison in the context of classification rules. It permits the user to compare the produced rules with his or her hypotheses or existing knowledge in order to find out what is right and what is wrong about that knowledge, and to tell what has changed since the last learning. This technique is also helpful in data mining for solving the interestingness problem. (Liu et al, 1997 cited in Song et al, 2001) studies the problem of analyzing discovered rules against a particular form of existing concepts, namely general impressions (GIs); a specification scheme for representing GIs is proposed and two matching algorithms for analyzing discovered rules are presented. This technique is useful for solving the interestingness problem. (Padmanabhan & Tuzhilin, 1999 cited in Song et al, 2001) proposed a new definition of the unexpectedness of a rule with respect to a belief and presented an algorithm that finds unexpected association rules from data using this measure. (Silberchatz & Tuzhilin, 1996 cited in Song et al, 2001) noted that measures of the interestingness of patterns in data mining applications can be categorized into objective and subjective; they classified subjective measures into unexpected and actionable and argued, at the intuitive level, that these two measures of interestingness are independent of each other. Although actionability appears to be the major concept, it is a difficult notion to capture formally; since most unexpected patterns are actionable and most actionable patterns are unexpected, they proposed to capture actionability via unexpectedness. Consequently, they studied unexpectedness as a measure of interestingness and described the interestingness of a pattern in terms of how strongly it "shakes" the existing system of beliefs. By this definition they also make unexpected patterns more interesting than expected ones. All of the above works study subjective measures of interestingness, but these techniques cannot be applied to detecting changes, as their analysis only compares each newly generated rule with each existing rule to discover degrees of difference; it does not find which aspects have changed, what kinds of changes have taken place, or how much change has happened.
The fourth research stream is mining from time-series data. There is considerable interest in finding regularities in time-series data (Das et al, 1997; Das et al, 1998; Han, Dong & Yin, 1999 cited in Song et al, 2001). (Das et al, 1998 cited in Song et al, 2001) address the problem of finding rules relating patterns in a time series to other patterns in that series, or patterns in one series to patterns in another series; their emphasis is on the discovery of local patterns in multivariate time series, in contrast to traditional time-series analysis, which mainly focuses on global models. (Han et al, 1999 cited in Song et al, 2001) present several algorithms for efficient mining of partial periodic patterns, exploring some interesting properties connected to partial periodicity. (Das et al, 1997 cited in Song et al, 2001) also present an intuitive model for measuring the similarity between two time series; this model takes into account outliers, different scaling functions and variable sampling rates. These studies, however, are rather different from my research, which centers on the detection of irregularity rather than regularity in data.
The fifth research field is mining class comparisons to differentiate between different classes (Bay, D.S. & Pazzani, M.J., 1999; Ganti et al, 1999 cited in Song et al, 2001; Han, J., & Kamber, M., 2006). (Ganti et al, 1999 cited in Song et al, 2001) present a general framework for measuring changes between two models. They develop the FOCUS framework for calculating an interpretable deviation measure between two datasets, computing the differences between the "interesting" characteristics of each dataset. Fundamentally, the difference between the two models is quantified as the amount of work needed to change one model into the other. Their framework covers a wide variety of models, including frequent itemsets, decision tree classifiers, and clusters, and captures standard measures of deviation such as the misclassification rate and the chi-square metric as special cases. It offers deviation measures between two mining models and focused regions, but it cannot be directly used to detect customer behavior changes because it does not indicate which aspects have changed or what kinds of changes have occurred. (Bay, D.S. & Pazzani, M.J., 1999) and (Han, J., & Kamber, M., 2006) also provide techniques for understanding the differences between several contrasting groups, but these techniques can only identify changes between rules of the same structure.
Finally, (Liu et al, 2000) present a technique for change mining that overlays two decision trees produced from different time snapshots, but this change mining technique using decision trees cannot identify complete sets of changes. Since decision tree techniques run within a specified objective class, only changes concerning designated consequent attributes can be detected, so this approach can be applied only in cases with a precise research question. Also, this technique does not offer any information on the degree of change. (Song et al, 2001) conducted research on understanding and adapting to changes in customer behavior for an Internet-based company. The aim of that research was to develop a methodology that automatically discovers changes in customer behavior from customer profiles and sales data at different time snapshots. They defined three types of changes (emerging patterns, unexpected changes, and added/perished rules) and then used similarity and difference measures for rule matching to detect all types of changes. Finally, the degree of change is assessed to detect significantly changed rules and rank them. Their proposed methodology can evaluate the degree of change as well as find all kinds of changes automatically from data at different time snapshots. Another related piece of research, by (Chen et al, 2005), integrates customer behavioral variables, demographic variables, and a transaction database to establish a method for mining changes in customer behavior. The behavioral variables, RFM, coupled with a customer value growth matrix, are used to estimate the value individual customers give to the business. Association rules are used to identify the association between customer profiles and product items purchased. For mining change patterns, two extended measures of similarity and unexpectedness are designed to study the degree of similarity between patterns at different time periods. Finally, an online query system provides marketing managers with a tool for fast information search and valuable information based on timely feedback. (Cho et al, 2005) conducted research on finding changes in customer buying behavior for recommendation systems, declaring that customer needs change over time, so changes in customer preferences should be taken into account to improve the accuracy of the recommendations made. They suggest a new methodology for improving the quality of Collaborative Filtering (CF) recommendation that uses customer purchase sequences. The proposed methodology was applied to a large department store in Korea and compared to existing CF techniques. Various experiments using real-world data show that the proposed methodology provides higher-quality recommendations than classic CF techniques, with better performance, particularly with regard to heavy users. (Au & Chan, 2005 cited in Song et al) present another technique to find changes in association rules. They define the problem of mining changes in association rules over time. The proposed approach permits different fuzzy data-mining techniques to be used for tackling this problem. Given a set of database partitions, each of which contains a set of transactions gathered in a specific time period, a set of association rules is found in each database partition. They suggest performing data mining on the discovered association rules so as to expose the regularities governing how the rules change over different time periods. They propose using linguistic variables and linguistic terms to represent the changes in the discovered association rules. In particular, fuzzy decision trees are built to discover the changes in the discovered association rules. The fuzzy decision trees are then converted into a set of fuzzy rules, called fuzzy meta-rules because they are rules about rules. By doing so, the changes hidden in the data can be exposed and presented to human users in a comprehensible form. In addition, the discovered changes can also be used to forecast changes in the future.
Figure 2.5: Mining in a changing environment review
[Diagram: six research streams under "Mining in a Changing Environment": Learning in a Changing Environment (Fruend and Mansour, 1997; Helmbold & Long, 1994, cited by Song et al, 2001); Rule Maintenance (Cheung, Han, Ng & Wong, 1996a; Cheung, Ng, Tam, 1996b; Cheung et al, 1997); Emerging Patterns (Agrawal & Psaila, 1995; Dong & Li, 1999; Li, Dong & Ramamohanarao, 2000); Subjective Interestingness (Liu & Hsu, 1996; Liu et al, 1997; Liu, Hsu, Ma & Chen, 1999; Padmanabhan & Tuzhilin, 1999, cited by Song et al, 2001; Silberchatz & Tuzhilin, 1996; Suzuki, 1997); Mining from Time Series Data (Das, Gunopulos & Mannila, 1997; Das, Lin, Mannila, Renganathan & Smyth, 1998; Han, Dong & Yin, 1999); Mining Class Comparisons (Bay & Pazzani, 1999; Ganti et al, 1999; Han & Kamber, 2001); Mining Changes (Song et al, 2001; Liu et al, 2000; Chen et al, 2005)]
Table 2.2: Mining in a changing environment timetable
[Matrix plotting the six research streams (rule maintenance, emerging patterns, subjective interestingness, mining from time series, class comparison, and mining changes) against publication years from 1995 to 2005]
2.5 Customer Segmentation Review: The mass marketing approach cannot satisfy the needs of today's varied customers. This variety should be addressed through segmentation, which splits markets into clusters of customers with similar needs and/or features who are likely to show similar purchasing behaviors (Dibb & Simkin, 1996 cited by Tsai, C., Y., Chiu, C., C., 2004). Segmentation theory suggests that groups of customers with similar needs and purchasing behaviors are likely to show a more homogeneous response to marketing programs that target specific consumer groups (Tsai, C., Y., Chiu, C., C., 2004). Market segmentation has accordingly been regarded as one of the most vital elements in achieving successful modern marketing and customer relationship management (CRM) (Berson, Smith, & Thearling, 2000 cited by Tsai, C., Y., Chiu, C., C., 2004).
Segmentation variable selection is a critical concern for successful market segmentation. "Segmentation variables can be classified into general variables and product specific variables" (Wedel & Kamakura, 1997 cited by Tsai, C., Y., Chiu, C., C., 2004). The general variables consist of customer demographics and lifestyles; the product-specific variables involve customer purchasing behaviors and intentions. Many studies have used general variables to segment customers because these variables are intuitive and easy to work with (Beane & Ennis, 1987; Hammond et al., 1996 cited by Tsai, C., Y., Chiu, C., C., 2004). Market segmentation based on general variables is more intuitive and easier to conduct than segmentation based on product-specific variables, but the assumption that customers with similar demographics and lifestyles will show similar purchasing behavior is uncertain (Tsai, C., Y., Chiu, C., C., 2004). Here, we briefly review some segmentation methods from the literature.
2.5.1 Clustering Analysis: Data mining is a type of analytic method for summarizing useful knowledge and extracting useful patterns from huge amounts of data (Wu, J., Lin, Z., 2005). In market research, clustering is an effective and commonly used method for market segmentation, identifying target markets and segments of customers. Clustering can be used as an independent tool to show the data distribution, monitor the characteristics of each cluster, and perform additional analysis of specific clusters if required (Wu, J., Lin, Z., 2005).
2.5.2 Customer Segmentation Model
The customer segmentation concept was established by the American marketing expert Wendell R. Smith in the mid-1950s. "Customer segmentation refers to classifying customers by their value, demands, preference and other factors in the circumstances of clear organization strategies, business model and targeted market". Customers in one group have definite similarities, whereas different segments of customers have clearly distinct characteristics. A customer segmentation model is built by classifying customers according to defined standards on selected segmentation variables. There are two types of consumption-based customer segmentation models (Wu, J., Lin, Z., 2005).
2.5.3 RFM Model
The RFM segmentation model distinguishes important customers by three variables: the customer's consumption interval, frequency, and money spent. R stands for recency, the interval between the time of the most recent consuming behavior and the present; the shorter the interval, the bigger R is. F stands for frequency, the number of consuming behaviors within a period of time. M stands for monetary, the amount of money spent within a period of time. Research shows that the bigger the R and F values are, the more likely the related customers are to make a new deal with the business; furthermore, the bigger M is, the more likely the related customers are to respond to the business's products and services again (Wu, J., Lin, Z., 2005). The RFM method is very successful for customer segmentation. We can sort customers by their consumption date, putting the most recent customers first; customers can thus be classified into groups. Then F and M are standardized and sorted in the same way. At this point, each customer is placed in a three-dimensional space with coordinates (R, F, M), and by calculating R*F*M, an RFM value for each customer can be obtained (Wu, J., Lin, Z., 2005). With these RFM values sorted, groups of customers can be classified according to a chosen proportion; for example, a commercial enterprise may consider customers whose RFM values are in the top 20 percent as its most valuable customers (Wu, J., Lin, Z., 2005). It is essential to quantify customer behavior so that we can analyze the short- and long-term outcomes of our segmentation formulae. The purpose of RFM is to give a simple framework for customer behavior analysis. Once customers are assigned RFM behavior scores, they can be grouped into segments and their subsequent effectiveness analyzed. This effectiveness analysis then forms the basis for future customer contact frequency decisions (Miglautsch, J.R., 2001). Several methods for RFM scoring exist in the literature; they are described below.
2.5.4 RFM Scoring
The purpose of RFM scoring is to plan future behavior (driving better segmentation decisions). To allow planning, it is critical to translate customer behavior into numbers that can be used over time (Miglautsch, J.R, 2001).
Too often, direct marketers use static customer selections: when initially building their segmentation system, they define some factors with fixed thresholds. If these thresholds stay fixed, the results grow poorer and poorer over time; this is called the bracket creep problem (Miglautsch, J.R, 2001). Some common scoring methods are used to avoid this problem.
a. Customer Quintiles
The most common scoring method is to sort customers in descending order (best to worst). Customers are then divided into five equal groups, or quintiles; the best group receives a score of 5, the worst a 1. For Recency, customers are sorted by days since last purchase: the lower the number of days, the better the score. For Frequency, customers are sorted by number of purchases: the higher the number of purchases, the better the score. And for Monetary, customers are sorted by the amount of money spent: the higher the amount, the higher the score. Each time customers are scored, a new comparative segmentation scheme is built. This has the benefit of quantifying customer behavior in a way that can be projected into the future (Miglautsch, J.R, 2001). The comparatively best customers always fall into the 5, 5, 5 category. It is essential to recognize where the cutoff points fall, since they automatically change with each customer scoring. The customer quintile method has the benefit of yielding equal numbers of customers in each segment: there are five equal groups for each of R, F and M, generating 125 equal-size segments in total. An initial analysis would be to contact all customers, look at the performance of each individual cell, and understand how different segments of customers perform (Miglautsch, J.R, 2001).
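A sketch of customer-quintile scoring with pandas follows, assuming a customer table with the illustrative columns days_since_purchase, n_purchases and total_spent:

```python
import pandas as pd

def quintile_scores(customers):
    """Rank customers on each RFM variable and cut the ranking into
    five equal groups scored 5 (best) to 1 (worst)."""
    scores = pd.DataFrame(index=customers.index)
    # Recency: fewer days since the last purchase is better.
    r_rank = customers['days_since_purchase'].rank(method='first')
    scores['R'] = pd.qcut(r_rank, 5, labels=[5, 4, 3, 2, 1]).astype(int)
    # Frequency and Monetary: higher values are better.
    for col, name in [('n_purchases', 'F'), ('total_spent', 'M')]:
        rank = customers[col].rank(method='first')
        scores[name] = pd.qcut(rank, 5, labels=[1, 2, 3, 4, 5]).astype(int)
    return scores
```

Ranking with method='first' breaks ties arbitrarily, which reproduces exactly the overflow problem among one-purchase customers discussed next.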
The customer quintile method does face some scoring challenges in the area of Frequency. In most direct marketing customer files, a high percentage of the customers have ordered only once; this percentage is frequently as high as 30%-60%. If more than 20% of the customers have only one purchase, then the lowest Frequency group will have a purchase count of 1. Since that group cannot hold all the customers with only one purchase, some of them will be sorted into the 2-score group. Their behavior is identical to those in the 1-score group; they simply overflowed. If 40% of the customers had only one purchase, then both the 1- and 2-score groups would have identical behavior, and if the percentage ran as high as 60% (which is not that unusual) then three of the five quintiles would have identical behavior. Recalling the purpose of RFM, this would be a less than satisfying result. A second concern with the quintile method is its relative sensitivity. At the high end of our Frequency model, customers average 7.4 purchases. That is significantly more than the 1.0 purchases at the bottom and approximately twice the 3.4 purchases in the 4-score group. However, the Pareto Principle (commonly called the 80/20 rule) still applies within the 5-score group: a small number of very large customers and a larger number of relatively smaller customers make up that 7.4 average (Miglautsch, J.R, 2001).
As long as our segmentation method is built primarily for mailing purposes, this difference is arguably unimportant; certainly the 5 and 4 groups would be mailed. However, if our RFM model is being used to enable telemarketing or field sales contact, extra sub-segments would be vital to identify the super customers. The customer quintile scoring method thus produces some unsatisfying results at both the top and bottom of the scale: it tends to group together customers who have hugely different buying behavior (at the top) and arbitrarily break apart customers who have the same behavior (at the bottom) (Miglautsch, J.R, 2001).
b. Behavior Quintile Scoring
An alternative scoring method was developed by John Wirth. It also sorts customers by behavior but, instead of placing arbitrary cutoffs at a fixed percentage of the customers, it places cutoffs on percentages of behavior. This method appears to overcome the sensitivity problems mentioned above. Five groups are still produced, but the monetary score, for example, would produce equal amounts of sales in each quintile. Behavior scoring has the benefit of grouping customers by similar behavior; since segmentation decisions are based on past customer behavior, this permits better segmentation (Miglautsch, J.R, 2001).
i. Frequency
The behavior method does suffer from similar problems when computing the Frequency score. If we start at the top of the Frequency sort and subtract each customer's frequency from the total Frequency, the customers who purchased only once may not account for 20% of the total Frequency; in such a situation, some of the customers who bought twice will be included in the 1-score group. It is also troublesome to sort customers from top to bottom in a computer-generated scoring system: a special sort file must be created and each scoring process must be run separately. The mean scoring method, a further improvement of the John Wirth method, was developed by Ted Miglautsch (Miglautsch, J.R, 2001). When scoring Frequency, the single purchasers are given a score of 1. The system then averages the remaining frequencies to find the mean; customers whose totals fall below the mean receive a score of 2. This process is repeated two more times, giving quintiles of behavior that are sensitive at both ends of the scale and allow many variables to be scored at the same time (Miglautsch, J.R, 2001).
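A minimal sketch of this mean scoring method is shown below, assuming a simple mapping from customer to purchase count; the treatment of values exactly equal to the mean is an assumption, since the source does not specify it:

```python
def mean_frequency_scores(freqs):
    """Mean-scoring sketch for Frequency: single purchasers score 1;
    the remaining customers are split at the mean three times,
    yielding scores 2-5. `freqs` maps customer id -> purchase count."""
    scores = {c: 1 for c, f in freqs.items() if f <= 1}
    remaining = {c: f for c, f in freqs.items() if f > 1}
    for score in (2, 3, 4):
        if not remaining:
            break
        mean = sum(remaining.values()) / len(remaining)
        for c in [c for c, f in remaining.items() if f < mean]:
            scores[c] = score
            del remaining[c]
    # Whoever is left sits above the last mean: the top group.
    scores.update({c: 5 for c in remaining})
    return scores
```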
ii. Recency
Because previous behavior is the best predictor of future behavior, Recency is normally considered the most influential of the three variables, and it plays an important role in direct marketing decision making. Recent customers are considered viable for a certain length of time. Unlike Frequency and Monetary, customers reset themselves: a new purchase resets Recency. At the heart of Recency is the fact that most customers fall into two groups: hot and dead. Although Recency can be scored by sorting customers by days since last purchase, industry list conventions suggest a more calendar-based method. "Hotline names" normally represent purchasers within three months or 90 days. Business-to-business direct marketers often lengthen these time frames, since their customers can stay viable even though the individuals involved change (Miglautsch, J.R, 2001).
c. Weighting
With relational, database-driven marketing databases becoming more common, most marketers can select on R, F and M scores separately. However, others are not as lucky and need a single field to do the work of all three variables. The benefit of a single variable is that customers can simply be segmented by a single query on one field (Miglautsch, J.R, 2001). Donald R. Libey, in his book "Libey on RFM", proposes that Monetary, Frequency and Recency values can be added together (Miglautsch, J.R, 2001). Scoring is not explicitly discussed, but he presents a formula for creating a single RFM value; his method involves adding average order value and Frequency per year. To improve on this composite formula, marketers can weight the scores, multiplying R by 3, F by 2 and M by 1. This would give the best customers a composite score of 30: (5x3)+(5x2)+(5x1). This not only gives more influence to the most recent names, it also gives a bit of a boost to Frequency. The logic is that if two customers have the same Recency and spent the same amount, but one purchased several times and the other only once, the more frequent buyer is much more likely to respond. One further enhancement is often employed in generating a composite score: instead of multiplying by 3, 2 and 1, use 9.9, 6.6 and 3.3. This produces composite scores between 19.8 and 99; it preserves the approximately 3x weighting of R while producing more of a 100-point scale (Miglautsch, J.R, 2001).
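As a sketch, the weighted composite described above reduces to a few lines; the function name is illustrative:

```python
def composite_rfm(r, f, m, weights=(9.9, 6.6, 3.3)):
    """Single-field RFM value from quintile scores r, f, m in 1..5.
    With weights (3, 2, 1) the best customer scores 30; with the
    (9.9, 6.6, 3.3) variant the scale runs from 19.8 up to 99."""
    wr, wf, wm = weights
    return wr * r + wf * f + wm * m

# For example: composite_rfm(5, 5, 5) == 99.0 and
# composite_rfm(5, 5, 5, weights=(3, 2, 1)) == 30
```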
d. Life-to-Date
Generally, RFM scoring is based on life-to-date totals. It is frequently asked whether shortening the time frame would improve RFM scoring. The idea is that if Recency is so influential, maybe we should consider only the recent behavior of the preceding few years; an appealing proposal, but one filled with danger. The basic idea, again, is quantifying behavior for the purpose of customer segmentation. High-RFM customers are easily recognized; the real challenge is to recognize viable customers beyond the 12-month window in areas like direct marketing. Should any of them be mailed and marketed? Certainly some should. Gaining this wider viewpoint requires that all available customer history be examined (Miglautsch, J.R, 2001).
2.5.5 Customer Value Matrix Model
The Customer Value Matrix grew out of a desire to apply RFM to the small-business retail environment. After some experiments with applying RFM in small businesses, it became clear that RFM was too difficult and time-consuming for them. The problem was that, while RFM was comparatively easy conceptually, the resulting segmentation was often difficult to understand and even more difficult to use. With three values per RFM variable, RFM analysis produces 27 customer segments; for the analysis to be useful, the marketer must know which groups can be combined for a particular strategy or tactic (Marcus, C., 1998).
Closer examination of the RFM analysis highlighted the collinearity between the Purchase Frequency and total Monetary Value variables: an extra purchase by a customer results in an increase in that customer's total monetary value. Given this, Charles Edmundson recommended using Average Purchase Amount instead of the total Monetary Value of a customer, which eliminates the collinearity between the two variables. In addition, for clarity, the Purchase Frequency variable was changed to Number of Purchases. These changes were refinements over the usual RFM analysis; however, they did not solve the problem of ending up with too many segments to interpret and work with (Marcus, C., 1998). Solving this issue required a simplified, more actionable version of RFM. The first step was to center on the two variables that best express the value of a customer: Number of Purchases and Average Purchase Amount. The third variable, Recency, provides motivating information that can be combined with the two key variables; other important variables, such as Type of Purchase or Length of Relationship, can also be used. Using just Purchase Frequency and Average Purchase Amount was part of the answer; in addition, the segmentation needed to be simplified to a 2 x 2 matrix. Matrices have been used effectively to aid the understanding of information for decision-making purposes. Perhaps the best-known matrix is the Boston Consulting Group's (BCG) Growth-Share Matrix, which centers on the allocation of resources given the market share position and growth potential of a given set of business opportunities (Henderson, 1967; Porter, 1980, cited by Marcus, C., 1998). The BCG Growth-Share Matrix can be applied to market segments, products or even countries. It segments business opportunities into clearly defined groups (Cash Cows, Stars, Dogs and Question Marks). The use of a comparatively straightforward method and easy-to-understand quadrant identifiers
has made the BCG Matrix an effective analytical tool. The BCG Matrix adds further value by indicating which managerial strategies and tactics are to be pursued with each business segment: businesses that have a high relative market share in low-growth markets (Cash Cows) can be used to support other developing businesses, while low-relative-share businesses in low-growth markets are likely to be cash traps (Dogs). Simplifying the RFM analysis to center on the customer-value-based variables, Number of Purchases and Average Purchase Amount, and applying a 2 x 2 matrix to represent the resulting segmentation proved effective in arriving at a practical yet meaningful approach to customer segmentation (Marcus, C., 1998). The customer value matrix model is thus an advanced model based on the traditional RFM model. In this model, the customer value matrix consists of the number of purchases (denoted F) and the average purchase amount (denoted A) (Wu, J., Lin, Z., 2005). The average purchase amount replaces the two RFM variables between which there is multicollinearity, removing their linear effect on the model. In the customer value matrix, the reference value of F and A is their respective average. Once the division of the axes is determined, each customer is located in one of the quadrants of the customer value matrix. By the values of A and F, customers are categorized into four groups in the matrix: customers who like to spend (quadrant I), customers who are important to the business (quadrant II), customers who consume frequently (quadrant III), and customers whose behavior is uncertain to the business (quadrant IV). The result is presented in Figure 1 (Wu, J., Lin, Z., 2005).
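A sketch of the resulting segmentation is shown below, assuming a pandas table with illustrative column names; splitting at the means of F and A follows the text, while the mapping of quadrants I-IV onto the two axes is an assumption about Figure 1:

```python
import pandas as pd

def value_matrix_segments(customers):
    """Customer Value Matrix sketch: customers are split at the mean
    number of purchases (F) and mean average purchase amount (A)."""
    f_mean = customers['n_purchases'].mean()
    a_mean = customers['avg_amount'].mean()

    def quadrant(row):
        high_f = row['n_purchases'] >= f_mean
        high_a = row['avg_amount'] >= a_mean
        if high_a and not high_f:
            return 'I'    # spends a lot per purchase, buys rarely
        if high_a and high_f:
            return 'II'   # important to the business
        if high_f:
            return 'III'  # frequent but low-amount buyer
        return 'IV'       # uncertain behavior

    return customers.apply(quadrant, axis=1)
```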
Chapter3: Research Methodology
3.1 Research Methodology:
3.2 Research Design: A research design is a roadmap for performing the marketing research project; it details each step of the project. Carrying out the research design should provide all the information needed to structure or solve the management-decision problem (Malhotra, K.N., 1996). Many designs may be suitable for a given marketing research problem. A good research design ensures that the information gathered will be relevant and useful to management and that all of the necessary information will be obtained. A good design should also help ensure that the marketing research project is performed effectively and efficiently (Malhotra, K.N., 1996). The research design of this study is illustrated in figure 3.1. Detailed descriptions are given below.
Figure 3.1: Research design of this study
[Diagram: background of the study (problem definition), research question, research objectives, research motivation, and research outline]
3.3 Research Purpose: According to (Malhotra, K.N., 1996), basic research designs can be categorized in terms of their research objectives. There are two broad types of research: exploratory and conclusive. These types are explained below.
Exploratory research is performed to explore the problem situation and gain ideas and insight into the problem facing the management or the researcher (Malhotra, K.N., 1996).
Conclusive research is designed to help the decision maker determine, evaluate, and choose the best course of action in a given situation (Malhotra, K.N., 1996). Conclusive research is of two types: causal and descriptive. Causal research is a kind of conclusive research whose major goal is to obtain evidence concerning cause-and-effect (causal) relationships (Malhotra, K.N., 1996). Descriptive research has as its main goal the description of something, usually market features or functions. Descriptive research assumes that the researcher has prior knowledge about the problem situation; this is one of the main differences between descriptive and exploratory research (Malhotra, K.N., 1996). Among the main kinds of descriptive studies are internally or externally centered sales studies, consumer perception and behavior studies, and market characteristics studies. Additionally, descriptive research uses a variety of data collection methods, such as secondary data analyzed quantitatively, or surveys (Malhotra, K.N., 1996).
The approach of our study is data mining. According to the definition by (Han, J., & Kamber, M., 2006), data mining refers to mining "knowledge from large amounts of data". When approaching a data-mining problem, an analyst may already have some a priori hypotheses that he or she would like to test concerning the relationships between the variables (Larose D.T., 2005). However, analysts do not always have a priori notions of the expected relationships among the variables. Particularly when faced with large unknown databases, analysts often prefer to use exploratory data analysis (EDA), or graphical data analysis (Larose D.T., 2005). EDA lets the analyst explore the data set, check the interrelationships among the attributes, identify interesting subsets of the observations, and develop an initial idea of possible relations between the attributes and the target variable, if any.
Data mining approaches are of two kinds: descriptive and predictive (Han, J., & Kamber, M., 2006). Predictive mining tasks perform inference on the current data in order to make predictions (Han, J., & Kamber, M., 2006), while descriptive mining tasks characterize the general properties of the data in the database.
The focus of this study is data mining, an approach that combines exploration and confirmatory analysis, so the purpose of this research is exploratory: we try to understand customer behavior by building patterns with data mining tools. According to the definition of data mining approaches above, the focus of our data-mining task is descriptive.
3.4 Research Approaches: There are two kinds of approaches to research design: quantitative and qualitative (Malhotra, K.N., 1996). "Qualitative research is an unstructured, exploratory research methodology based on small samples that provides insights and understanding of the problem setting" (Malhotra, K.N., 1996).
In contrast, quantitative research is a methodology that seeks to quantify the data and usually applies some form of statistical analysis. The findings of this kind of research can be treated as conclusive and used to recommend a final course of action (Malhotra, K.N., 1996). Descriptive research is frequently quantitative (Malhotra, K.N., 1996). The concept of data mining allows decision makers to be supported by quantitative descriptive research. The focus of this study is data mining, so the research approach of this study is quantitative.
3.5 Research Strategy: A research strategy is a general plan of how you are going to answer the research questions; it determines the particular way data is gathered (Saunders et al, 2000).
Based on the research question, a researcher should choose among survey, secondary data, case study, experiment, or history (Yin, R.K, 1994). There are two kinds of data generally used in research: primary data and secondary data. Primary data is produced by the researcher specifically to address the research problem (Malhotra, K.N., 1996). Secondary data is data collected for some reason other than the problem at hand (Malhotra, K.N., 1996); it consists of information made available by business and government sources and computerized databases. Secondary data are an economical and fast source of background information (Malhotra, K.N., 1996). Two major categories are defined for secondary data: internal and external (Malhotra, K.N., 1996). External secondary data originate outside the organization (Malhotra, K.N., 1996); internal secondary data is data available within the organization for which the research is being performed (Malhotra, K.N., 1996). While internal secondary data may be accessible in a usable form, it is more usual that considerable processing effort is required before such data can be used (Malhotra, K.N., 1996).
The focus of this study is data mining, and the data have been collected from the database of Kalleh Company, so the suitable strategy for this research is internal secondary data. In summary, the purpose of this research is exploratory, the approach is quantitative, the strategy is internal secondary data, and the data mining approach is descriptive.
3.6 Research Process: The purpose of this research is to understand changes happening in customer buying behavior over time. Figure 3.2 shows a general overview of the change mining flowchart.
Figure 3.2: Change mining process perspective
[Diagram: input data (RFM, demographic and product data) feeds the Change Miner, which outputs change patterns]
As shown, the input of this flowchart is RFM data, which captures customer purchasing behavior, together with some demographic variables and product data. This data is fed into the Change Miner. The change mining procedure consists of several steps, each implemented with different data mining techniques and algorithms. In chapter 2, different studies related to change mining were reviewed. Based on the literature, change mining has several steps, including describing customer behavior by mining association rules and mining change patterns. Most of this work has been done for retail marketing; the focus of this research is applying change mining to the customer behavior of an FMCG manufacturer and distributor company. Customer behavior is analyzed in two time snapshots, and the output is the change patterns that occurred between the time periods.
The research process followed in this thesis is based on a previous change mining methodology (Chen et al, 2005): according to previous works, the different steps of change mining were studied and, with some changes, integrated into a single process.
The whole process of change mining is shown in figure 3.3. The process consists of several steps: Data Collection, Data Pre-Processing, Customer Segmentation, Mining Customer Behavior, and Change Mining.
Figure 3.3: Change mining process
Each step by itself consists of several tasks. Figure 3.4 illustrates the detail of each step.
Figure 3.4: Change mining process in detail
[Diagram: Data Collection; Data Pre-Processing (data cleaning, data transformation, RFM variables); Customer Segmentation (building the customer value matrix, segmenting customers); Mining Customer Behavior (mining association rules in each time snapshot); Change Mining (rule matching by computing similarity and difference measures, and mining change patterns: emerging patterns, added/perished rules and unexpected patterns)]
As can be seen in figure 3.4, there are various steps in mining changes in customer behavior, and implementing this method requires programming. In this study, we use SQL Server 2000 for data preprocessing tasks such as building the RFM variables and the market segmentation. For building customer behavior patterns, we use the open-source R programming language (R Software, 2007). In the next section, each step and the methods used are explained in detail.
3.7 Data Collection and Description: After defining the research question and determining the appropriate research strategy, we should determine the data needed to address the research question (Yin, R.K, 1994). As mentioned in the research strategy section, empirical data are usually of two types: primary data and secondary data. Primary data is produced by the researcher specifically to deal with the research problem (Malhotra, K.N., 1996).
Secondary data is data gathered for some reason other than the current problem (Malhotra, K.N., 1996). It consists of information prepared by business and government sources and computerized databases. Secondary data can be classified into two types, internal and external (Malhotra, K.N., 1996).
For data mining purposes, secondary data is mostly used, and change mining research mostly works with secondary data collected in business databases. Hence, this study is based on secondary data gathered from Kalleh Company, a manufacturer and distributor of food products in the Iranian market.
The data consists of purchasing transactions of Kalleh customers such as fast-food outlets, restaurants and coffee shops, which buy different categories of products from the company. We stored the data in SQL Server 2000 (SQL Server, 2000). According to (Chen et al, 2005), data for change mining falls into three categories:
Customer data: for market segmentation and mining customer behavior, one kind of variable needed is demographic data. In this study, based on the collected data, we have one demographic variable, which indicates the geographic area of each customer. There was another variable, customer type, but because of missing values it could not provide any value for us, so we removed it.
Product data: describing the different products provided to customers. In our data we have about 800 products in 13 different product categories, which are shown in figure 3.5.
Purchasing transaction data: usually, some valuable variables are hidden in the large quantity of raw data and can be obtained by data integration and transformation. Customer behavioral variables like recency, frequency and monetary are not explicit in the customer and transaction databases; they can be extracted from these data (Chen et al, 2005). In this study, RFM variables are used to analyze customer purchasing behavior over two time snapshots.
The data covers two years of purchasing transactions, and the number of Kalleh customers in the Tehran restaurant and fast-food market during this period is about 2,457. Table 3.1 shows the data gathered from the Kalleh company database, selected based on the literature and expert opinions.
Table 3.1: Data collected from Kalleh Company
[Customer data: geographic area of each customer. Product data: product code, product category. Transaction data: date of purchase, purchase amount (price), product purchased, number of orders for each product]
Figure 3.5: Product categories of Kalleh Company
[Pizza cheese; cooking cheese; processed cheese; Tehran meat products (not frozen); Amol meat products (not frozen); frozen meat products; milk and drinking yogurt; other dairy products; ice creams; sauces; dishes; other complementary products]
For this study, three types of data were needed: purchasing transaction data for extracting RFM variables, customer data, and product data. The RFM variables, which are the input of the change mining process, are extracted from the purchasing transaction data to analyze customer behavior. In addition, customer geographic data and product data were extracted from the Kalleh database.
3.8 Data Pre-Processing: Much of the raw data contained in databases is unpreprocessed, incomplete, and noisy. For instance, databases may include obsolete or redundant fields, missing values, outliers, data in a form not appropriate for data mining models, or values inconsistent with policy or common sense. Such databases must undergo preprocessing, in the form of data cleaning and data transformation, to be useful for data mining purposes (Larose, D.T., 2005).
Data cleaning and Integration:
Prior to analysis, data accuracy and consistency must be ensured to obtain correct results (Chen et al, 2005). Real-world data are mostly incomplete, noisy, and inconsistent. Data cleaning routines attempt to fill in missing values, reduce noise while identifying outliers, and correct inconsistencies in the data (Han, J., & Kamber, M., 2006).
Noisy Data:
The data stored in a database may contain noise, exceptional cases, or incomplete data objects. When mining for data regularities, these objects may confuse the process, and consequently the correctness of the discovered patterns can be poor; care must therefore be taken in handling such noise and exceptional cases (Han, J., & Kamber, M., 2006). In our customer base and purchasing transactions, we have some customer records that belong to Kalleh itself. These are the noise in our database, and we removed all of them from the database by their IDs.
Missing values:
When a record has attributes with no value, these are called missing values. There are several methods for handling missing values: ignoring the record; filling in the missing value manually, which is time-consuming and may not be feasible for a large data set with many missing values; using a global constant to fill in the missing value, replacing all missing attribute values with the same constant, such as a label like "Unknown"; using the attribute mean to fill in the missing value; or other approaches from the literature (Han, J., & Kamber, M., 2006).
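A small pandas sketch of these strategies follows; the column names are hypothetical:

```python
import pandas as pd

def fill_missing(df):
    """Illustrative versions of the strategies listed above."""
    df = df.copy()
    # Global constant for a categorical attribute:
    df['customer_type'] = df['customer_type'].fillna('Unknown')
    # Attribute mean for a numeric attribute:
    df['purchase_amount'] = df['purchase_amount'].fillna(
        df['purchase_amount'].mean())
    # Ignoring records that are still incomplete:
    return df.dropna()
```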
It is important to note that, in some cases, a missing value may not indicate an error in the data. Software routines may also be applied to expose other null values. Therefore, although we can try our best to clean the data after it is gathered, good database design and accurate data entry procedures should help reduce the number of missing values or errors in the first place (Han, J., & Kamber, M., 2006).
In this study, except for one variable, the design of the sales transaction database rejects null values at data entry, which helps minimize missing values. The one variable with missing values is customer type, which is not checked for null values at data entry, and for which we faced a huge number of nulls. Since this variable could not provide any value for us, we removed it from our work.
Data Transformation:
When data are transformed or consolidated into forms appropriate for mining, this is called data transformation. Data transformation can involve the following tasks (Han, J., & Kamber, M., 2006).
Aggregation applies summary or aggregation operations to the data; for instance, daily sales data may be aggregated to calculate monthly and yearly total amounts. This step is normally used in building a data cube for analysis of the data at multiple granularities (Han, J., & Kamber, M., 2006). In this study, invoice sales data were aggregated to compute the average sales per period and the average purchase frequency. In addition, we use aggregation to calculate the average purchase amount and the total number of purchases for each customer.
Another data transformation task is data generalization, where low-level or primitive (raw) data are replaced by higher-level concepts through the use of concept hierarchies. In this study, we have about 800 products in 13 categories; for mining purposes, we replace products with their categories, and then with super-categories, according to expert opinion. A further task is attribute construction (or feature construction), where new attributes are constructed from the given set of attributes and added to assist the mining process (Han, J., & Kamber, M., 2006).
Generally, some useful variables can be hidden in a large quantity of raw
data, and therefore can be obtained through data integration and transformation (Chen
et al, 2005). In this study, we use the customer behavioral variables (Recency,
Frequency and Monetary) for customer segmentation. These variables are hidden in the
customer and transaction databases, and can be extracted by data integration and
transformation (Chen et al, 2005). (Stone, 1995 cited by Chen et al, 2005)
mentioned that "recency is the interval between the most recent transaction time of
individual customers and evaluation time". In this study, we consider the evaluation
time to be the day after the end of each period. Monetary shows the average
expenditure of a customer during a period (Chen et al, 2005). Finally, frequency
shows the number of purchases in each period for each customer. In this study, for
each customer and period we calculate the number of purchases as frequency, the
average sales amount as monetary, and, as recency, the interval between the last
purchase and the day after the last date of the period.
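A compact sketch of deriving these three variables from transaction data (pandas; the dates, evaluation time and column names are illustrative assumptions):

import pandas as pd

# Hypothetical transactions for one period; values are assumptions.
tx = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "date": pd.to_datetime(["2005-03-01", "2005-06-10", "2005-05-20"]),
    "amount": [120.0, 80.0, 50.0],
})
evaluation_date = pd.Timestamp("2005-07-01")  # day after the period's end

rfm = tx.groupby("customer_id").agg(
    last_purchase=("date", "max"),
    frequency=("date", "count"),   # number of purchases in the period
    monetary=("amount", "mean"),   # average purchase amount
)
# Recency: interval between the evaluation time and the last purchase.
rfm["recency_days"] = (evaluation_date - rfm["last_purchase"]).dt.days
print(rfm[["recency_days", "frequency", "monetary"]])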
Data Discretization:
Since the data needed for analyzing association rules must be discrete,
continuous variables should be converted to a discrete type. Discrete values play
significant roles in data mining and knowledge discovery. They are intervals
of numbers, which are more concise to represent and specify, and easier to use and
comprehend, as they are closer to a knowledge-level representation than continuous
values. Many studies show that induction tasks can benefit from discretization:
rules with discrete values are usually shorter and more understandable, and
discretization can lead to higher predictive accuracy. As well, mining a
reduced data set needs fewer input/output operations and is more efficient than mining a
larger, un-generalized data set. Because of these benefits, discretization
techniques are applied before data mining as a preprocessing task. Discretization
techniques can be classified based on how the discretization is done, such as whether
the technique uses class information or in which direction it proceeds. If the discretization process
employs class information, it is called supervised discretization; otherwise, it is
unsupervised. If the process begins by first finding one or a few points (called cut
points) to split the whole attribute range, and then repeats this recursively on the
resulting intervals, it is called top-down discretization or splitting. This contrasts
with bottom-up discretization or merging, which starts by considering all continuous
values as potential split-points, removes some by merging neighboring values to
form intervals, and then applies this process recursively to the resulting intervals
(Han, J., & Kamber, M., 2006).
Discretization can be done recursively on an attribute to give a hierarchical or
multi-resolution partitioning of the attribute values, known as a concept hierarchy.
Concept hierarchies are helpful for mining at multiple levels of abstraction. Though
detail is lost when generalization replaces
low-level concepts with high-level concepts, the generalized data may be more
meaningful and easier to interpret (Han, J., & Kamber, M., 2006). There are
numerous discretization methods available in the literature based on the different
definitions mentioned above.
In this study, the discretization method used is binning. Binning is the
simplest method: it discretizes a continuous-valued attribute by producing a particular
number of bins. The bins can be produced by equal-width or equal-frequency
partitioning (Liu et al, 2002). In equal-width binning, the continuous range of a feature is evenly
divided into intervals of equal width, and each interval represents a bin. In equal-frequency
binning, an equal number of continuous values is placed in each bin (Liu et al,
2002). Both methods are sensitive to the given number of bins. In this study, based
on domain experts' opinions, we discretized each variable by the equal-frequency
method.
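For instance, equal-frequency binning into four quantiles, as applied here to R, F and M, can be sketched with pandas (the values are illustrative; the thesis performed this step in R):

import pandas as pd

# Hypothetical frequency values; four equal-frequency bins (quartiles).
frequency = pd.Series([1, 2, 2, 3, 5, 8, 13, 20, 35, 60, 120, 248])
bins = pd.qcut(frequency, q=4, labels=["25%", "50%", "75%", "100%"])
print(pd.concat([frequency, bins], axis=1, keys=["F", "F_bin"]))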
3.9 Customer Segmentation: There are many analytic methods
applied for market segmentation. One of the most traditional approaches to market
segmentation is demographic segmentation. Other methods also use buyer
attitudes, motivations and patterns of usage. Companies that capture
customer and purchase information apply such information to analyze customer
behavior for their marketing efforts (Marcus, C., 1998).
While the availability of customer purchase information has permitted
marketers to develop richer, more complicated customer segmentation schemes,
simplicity has also proven its place. For many years, RFM (recency, frequency and
monetary value) has been applied to segment customers and to assist marketers in
optimizing their marketing efforts. Many times, RFM has been challenged by
innovative conceptual approaches made possible by new technologies such as
neural networks. Yet many marketing tasks, particularly direct marketing, continue to rely on RFM variables,
because the lift experienced using alternative methods
does not normally justify the costs of implementing those methods. There are
costs linked with increased technical complexity, particularly that of taking the
analysis away from marketers and putting it into the hands of programmers and
statisticians. Besides these, the costs of explanation and communication are important,
as marketers need to develop actionable strategic and tactical decisions from the research
findings. The Customer Value Matrix is a customer segmentation
technique that is a simple yet powerful approach overcoming the above limitations.
Its effectiveness lies not only in that it recognizes key customer segments, but also
in that it suggests appropriate marketing strategies and tactics in a manner that
can be readily communicated and easily executed (Marcus, C., 1998).
3.9.1 Customer Value Matrix: The Customer Value Matrix was developed
from a desire to apply RFM to the small-business retail environment, but it became
clear that RFM was too complex and time-consuming for marketers. There were
some problems, as follows: while RFM was comparatively simple
conceptually, because it produced too many segments, the resulting
segmentation was often difficult to understand and even more difficult to apply.
Additionally, closer examination of the RFM analysis highlighted the co-linearity of the
Frequency of Purchase and the total Monetary Value variables. (Charles Edmundson
cited by Marcus, C., 1998) recommended using Average Purchase Amount instead
of the total Monetary Value of a customer to eliminate the co-linearity
between the two variables. In addition, for greater precision, the variable Frequency
of Purchase was transformed into Number of Purchases. These changes represented
refinements over the usual RFM analysis; however, they did not solve the problem of
ending up with too many segments to understand and to work with (Marcus, C.,
1998).
What was required was a simplified, more practical version of RFM. The first
step was to center on the two variables that best explain the value of a customer, Number of
Purchases and Average Purchase Amount; the second was to simplify the
segmentation to a 2*2 matrix (Marcus, C., 1998).
3.9.2 An effective analytical tool
Matrices have been applied effectively to help make information understandable
for decision-making purposes. Perhaps the best-known matrix is the
Boston Consulting Group's (BCG) Growth-Share Matrix, which centers on the
allocation of resources given the market share position and growth potential of a
given set of business opportunities (Henderson, 1967; Porter, 1980 cited by Marcus,
C., 1998). The BCG Growth-Share Matrix can be used for segmenting markets and
products. It segments business opportunities into four
clearly defined groups (Cash cows, Stars, Dogs and Question marks). The BCG
Matrix adds additional value by indicating what managerial strategies and tactics are
needed for every business segment. The use of a comparatively simple
scheme and easy-to-understand quadrant identifiers has made the BCG Matrix an
effective analytical tool (Marcus, C., 1998).
Simplifying the RFM analysis to center on the customer-value-based variables,
Number of Purchases and Average Purchase Amount, and using a 2*2 matrix to
represent the resulting segmentation proved to be effective in arriving at a realistic
and meaningful approach to customer segmentation (Marcus, C., 1998). In this
study, according to the literature and based on expert opinions, we have chosen the
Customer Value Matrix to segment our customers.
3.9.3 Customer Value Matrix Methodology: Building the Customer Value Matrix
takes some steps. In the first step, we require some basic customer and purchase
information to feed into a relatively simple methodology. In the second step, the
segmentation process is executed to allocate each customer to a quadrant of the Customer Value
Matrix. Finally, we obtain four segments with key differences among the
resulting customer segments (Marcus, C., 1998).
Data:
The data requested to develop the Customer Value Matrix are the customer
identification (ID) number, the purchase date and the total purchase amount. The
customer ID number is used to identify the purchases of each customer. The
total Number of Purchases is basically a count of the unique dates on a given
customer's invoices. The total amount of each purchase is used to calculate the
Average Purchase Amount (Marcus, C., 1998). In this study, the data that we have
to build the Customer Value Matrix are the customer identification number, the date of each
purchase and the total amount of each purchase.
Segmentation:
The segmentation process using the Customer Value Matrix needs the
computation of the average values for the Number of Purchases and the Average
Amount Spent. The average value for the x-axis, or Average Number of Purchases,
is calculated by taking the total number of purchases for the customer base and
dividing it by the total number of customers in the database. The average value for the
y-axis, or Average Purchase Amount, is obtained by taking the total revenue and
dividing it by the total number of purchases. The axes' averages then serve to
separate the high and low values on each scale. In this study, according to the gathered
data, we could not calculate the revenue, so we used total sales instead of revenue.
Table 3.2 shows these variables and their calculation for this study. The results
can be seen in Chapter 4.
Table 3.2: Calculating variables for the customer value matrix
Then, we compare each customer's Number of Purchases and
Average Purchase Amount to the obtained average values for the whole customer
base. Thus, each customer is allocated exclusively to one of the four segments based
on whether they are above or below the axis averages. The output of this step is a
matrix as illustrated in Figure 3.6.
Figure 3.6: Customer value matrix
[The figure shows a 2*2 matrix: Frequency on the x-axis, split at Avg. Frequency, and Monetary on the y-axis, split at Avg. Monetary; the quadrants are Uncertain, Frequent, Spender and Best.]
You can see the result of this step in the next chapter.
The Customer Value Matrix centers on the Number of Purchases and the Average
Purchase Amount as the primary variables, as the best representation of customer
value. Using the Customer Value Matrix as the foundation, any number of variables
(such as geographic or demographic variables, purchase recency, or the length of the
customer relationship) may be overlaid on the segmentation to get more detail from the
customer data and their transactions.
The methodology for the development of the Customer Value Matrix shows
that relatively simple yet effective customer segmentation is indeed possible. In this
study, according to the literature review and considering expert opinion, we
segmented our customers based on the Customer Value Matrix by (Marcus, C., 1998).
By this method we obtain four segments which can be differentiated. The results can
be found in Chapter 4.
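As an illustration, the quadrant assignment described above can be sketched as follows (pandas; the customer data are made up, and the axis averages follow the formulas of the methodology):

import pandas as pd

# Hypothetical per-customer summaries for one period.
customers = pd.DataFrame({
    "n_purchases":  [3, 25, 4, 30],
    "avg_purchase": [200.0, 150.0, 900.0, 1100.0],
})

# Axis averages: total purchases / total customers, and
# total sales / total purchases, as in the methodology.
avg_frequency = customers["n_purchases"].sum() / len(customers)
total_sales = (customers["n_purchases"] * customers["avg_purchase"]).sum()
avg_amount = total_sales / customers["n_purchases"].sum()

def segment(row):
    """Assign each customer to one of the four quadrants."""
    high_f = row["n_purchases"] > avg_frequency
    high_m = row["avg_purchase"] > avg_amount
    if high_f and high_m:
        return "Best"
    if high_f:
        return "Frequent"
    if high_m:
        return "Spender"
    return "Uncertain"

customers["segment"] = customers.apply(segment, axis=1)
print(customers)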
3.10 Mining Customer Behavior: Different methods to describe
customer behavior exist in the literature. Among them, there are various types of
conjunctive rules for building customer behavior patterns, including association rules and
classification rules (Agrawal et al. cited by Adomavicius, G. Tuzhilin, A., 2001).
Using rules to describe customer behavior has certain advantages. Besides being an
intuitive and descriptive way to represent behaviors, a conjunctive rule is a well-
studied concept used extensively in data mining, expert systems, logic
programming, and many other areas. In addition, researchers have proposed many
rule discovery algorithms in the literature, especially for association rules
(Adomavicius, G. Tuzhilin, A., 2001). To discover rules that describe the behavior
of customers, we can use various data mining algorithms, like Apriori for
association rule mining.
Association rules were originally used to analyze the relationships of product
items bought by customers at retail stores (Agrawal, Imielinski, & Swami, 1993;
Srikant, Vu, & Agrawal, 1997 cited by Chen et al, 2005). In customer behavior
research, association rules can be used to find the correlations between customer
profiles, described by demographic variables, and purchased products by exploring
customer and product databases (Song et al, 2001). In this research, based on the
literature, we mine customer purchasing behavior by association rules.
3.10.1 Association Rule Mining: A classic association rule is
an implication of the form A -> B, where A is an itemset and B is an itemset that
includes only a single atomic condition (Song et al, 2001). A and B are statements
regarding the values of attributes of an example in a database (Song et al, 2001). "A
is termed the left-hand side (LHS), and is the conditional part of an association rule.
Meanwhile, B is called the right-hand side (RHS), and is the consequent part". A and
B are frequent itemsets if their relative supports satisfy a pre-specified
minimum support threshold (Chen et al, 2005).
The support of an association rule is the percentage of records containing
itemsets A and B at the same time. The confidence of an association rule is the
percentage of records including itemset A that also include itemset B. The support
shows the usefulness of the revealed rule, and the confidence signifies the certainty of
the found association rule (Song et al, 2001). The most common use of association rules
is market basket analysis, in which the market basket contains the set of items
(namely, the itemset) purchased by a customer during a single store visit (Chen et al,
2005). Association rule mining discovers all collections of items in a database
whose confidence and support meet or exceed pre-specified threshold values
(Song et al, 2001).
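As a small worked example with made-up counts: if a database holds 1,000 transactions, 150 of them contain itemset A, and 60 contain both A and B, then

$$\mathrm{support}(A \Rightarrow B) = \frac{60}{1000} = 6\%, \qquad \mathrm{confidence}(A \Rightarrow B) = \frac{60}{150} = 40\%.$$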
In this research we use the Apriori algorithm, introduced by (Agrawal et al,
1993), to build profile association rules. In the next section, we explain the Apriori
algorithm and the way it works.
3.10.2 Apriori algorithm: The Apriori algorithm is one of the common techniques
used to find association rules (Agrawal et al, 1993). The name of the algorithm
reflects the fact that it uses prior knowledge of frequent itemset
properties. Apriori uses an iterative approach known as a level-wise search,
where k-itemsets are used to explore (k + 1)-itemsets.
First, the set of frequent 1-itemsets is found by scanning the database to
accumulate the count for each item and collecting those items that satisfy minimum
support. The resulting set is denoted L1. Next, L1 is used to find L2, the set of
frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-
itemsets can be found. Finding each Lk requires one full scan of the database. To
improve the efficiency of the level-wise generation of frequent itemsets, an
important property called the Apriori property is used to reduce the search space. It
states that all nonempty subsets of a frequent itemset must also be frequent. By
definition, if an itemset I does not satisfy the minimum support threshold, then I is
not frequent, that is, P(I) < min_sup. If an item A is added to the itemset I, then the
resulting itemset (i.e., I ∪ A) cannot occur more frequently than I. Therefore, I ∪
A is not frequent either, that is, P(I ∪ A) < min_sup. The Apriori property is used in the
algorithm in two steps, consisting of join and prune actions, to make it more
efficient. A major challenge in mining frequent itemsets from a large data set is the
fact that such mining often produces a huge number of itemsets satisfying the
minimum support (min_sup) threshold, especially when min_sup is set low (Han, J.,
& Kamber, M., 2006). To overcome this difficulty, the two concepts of closed frequent
itemsets and maximal frequent itemsets have been introduced. An itemset X is closed
in a data set S if there exists no proper super-itemset Y such that Y has the same
support count as X in S. An itemset X is a closed frequent itemset in set S if X is
both closed and frequent in S. An itemset X is a maximal frequent itemset (or
max-itemset) in set S if X is frequent and there exists no super-itemset Y such that
X ⊂ Y and Y is frequent in S (Han, J., & Kamber, M., 2006).
Once the frequent itemsets from transactions in a database D have been found,
it is simple to produce strong association rules from them (where strong association
rules satisfy both minimum support and minimum confidence). This can be done
using Equation (1-4) for confidence, which we show again here for completeness:

$$\mathrm{confidence}(A \Rightarrow B) = P(B \mid A) = \frac{\mathrm{support\_count}(A \cup B)}{\mathrm{support\_count}(A)} \qquad (1\text{-}4)$$

The conditional probability is expressed in terms of itemset support count,
where support_count(A ∪ B) is the number of transactions containing the itemset
A ∪ B, and support_count(A) is the number of transactions containing the itemset A.
So, association rules can be generated as follows:
• For each frequent itemset l, generate all nonempty subsets of l.
• For every nonempty subset s of l, output the rule s ⇒ (l − s) if
support_count(l)/support_count(s) ≥ min_conf, where min_conf is the minimum
confidence threshold (Han & Kamber, 2006).
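To make the level-wise search and the rule-generation step concrete, the following is a minimal, self-contained Python sketch (the toy transactions and thresholds are illustrative assumptions, not the thesis data; for simplicity it keeps all frequent itemsets rather than only the maximal ones used in this study, and the thesis itself performed this step in the R package):

from itertools import combinations

# Toy transaction database: each transaction is a set of product categories.
transactions = [
    {"cat1", "cat11"}, {"cat1", "cat5"}, {"cat1", "cat5", "cat11"},
    {"cat3"}, {"cat1", "cat11"},
]
min_sup, min_conf = 0.4, 0.6  # minimum support and confidence thresholds

def support(itemset):
    """Relative support: fraction of transactions containing `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# L1: frequent 1-itemsets, found with one full scan of the database.
items = {i for t in transactions for i in t}
L = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_sup}]

# Level-wise search: join Lk with itself to form (k+1)-candidates, prune by
# the Apriori property (every k-subset must be frequent), then scan again.
while L[-1]:
    cands = {a | b for a in L[-1] for b in L[-1] if len(a | b) == len(a) + 1}
    cands = {c for c in cands
             if all(frozenset(s) in L[-1] for s in combinations(c, len(c) - 1))}
    L.append({c for c in cands if support(c) >= min_sup})

# Rule generation: for each frequent itemset l and nonempty proper subset s,
# output s -> (l - s) when support(l) / support(s) >= min_conf.
for level in L[1:]:
    for l in level:
        for k in range(1, len(l)):
            for s in map(frozenset, combinations(l, k)):
                conf = support(l) / support(s)
                if conf >= min_conf:
                    print(sorted(s), "->", sorted(l - s),
                          f"sup={support(l):.2f} conf={conf:.2f}")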
In this research, the customer behavioral variables (RFM) and a geographic
variable are associated with purchased products to build customer purchasing
behavior patterns. The association rules are discovered at two periods of time and then
used in change mining to identify customer behaviors that change over time.
We have applied the Apriori algorithm with maximal frequent itemsets to build
associations between customer attributes and their purchased products. These rules
can include any number of attributes on either side of the rule. In the left-hand side
(LHS), or conditional part of the rule, we have RFM and customer data, and in the
right-hand side (RHS), or consequent part, we have purchased product items. Not
all association rules are interesting to decision makers. Rule support and confidence
are two measures of rule interestingness, and an interesting rule must satisfy the
minimum support and confidence determined by domain experts.
3.11 Change Mining: After building the customer behavior patterns, we want to mine the changes that happened
in customer purchasing behavior. In this study, two measures of similarity and
unexpectedness from (Chen et al, 2005) are applied to investigate changes in
customer behavior. We have also applied the information of ordered variables in
calculating these two measures, which results in more knowledge of the changes and
was not done in previous works. First, we explain the change patterns and then
mathematically define the measures.
Change Patterns:
Based on previous studies, four patterns are identified to measure changes in
customer behavior (Dong, G., & Li, J., 1999; Liu & Hsu, 1996; Padmanabhan &
Tuzhilin, 1999; Song et al, 2001 cited in Chen et al, 2005). These patterns are the
emerging pattern, added pattern, perished pattern, and unexpected change. The
four change patterns are explained below.
Emerging patterns:
Emerging patterns are a kind of pattern for knowledge discovery from databases.
They are described as rules or itemsets whose supports increase significantly between
time-stamped datasets (Dong, G., & Li, J., 1999). Emerging patterns can capture
emerging trends in time-stamped databases or useful contrasts between data classes
(Dong, G., & Li, J., 1999). In marketing management, emerging patterns are
instances of the same consumer behavior existing in different periods of time but with a trend. A
positive pattern growth rate (i.e. the support of a rule increases over time) indicates
that the customer behavior becomes stronger over time. Meanwhile, a pattern
growth rate below zero indicates that the customer behavior is getting weaker. For
emerging patterns, the conditional and consequent parts are the same for the two rules,
but the support of the two rules changes significantly between the different time periods
(Chen et al, 2005). Emerging patterns have proven practical, as (Dong, G., & Li,
J., 1999) mentioned, and they believed that EPs with low to
medium support, such as 1% to 20%, can give useful new insights and assistance to
experts.
Added/Perished patterns:
An added pattern is defined as a rule at period $t_2$ whose conditional and consequent
parts both differ significantly from those of any rule at period $t_1$. A perished pattern is
a rule at period $t_1$ whose conditional and consequent parts both differ significantly from
those of any rule at period $t_2$: a perished pattern is a vanished pattern, found in the past
but not in the present. The rule matching threshold (RMT) is applied by (Chen et al, 2005) to
measure the degree of change.
Unexpected change:
There are some works on unexpected changes in the literature on mining
interestingness (Chen et al, 2005). Liu and Hsu (1996) classified unexpected changes
into unexpected conditional changes and unexpected consequent changes. If the
conditional parts of $r_i^{t_1}$ and $r_j^{t_2}$ are similar, but their consequent parts are different,
then $r_j^{t_2}$ is an unexpected consequent change with respect to $r_i^{t_1}$ (Liu & Hsu, 1996; Song
et al, 2001). Furthermore, if the consequent parts of $r_i^{t_1}$ and $r_j^{t_2}$ are similar, but their
conditional parts are different, then $r_j^{t_2}$ is an unexpected conditional change with
respect to $r_i^{t_1}$. In this research, the unexpected changes of customer behavior can
be identified in the form of unexpected purchasing (consequent) patterns and
customer shifting (conditional) patterns, based on (Chen et al, 2005). After
explaining the four customer changes, this study elaborates on the measures used to
detect these changes.
Change Measure:
For calculating changes, we have two measures from (Chen et al, 2005): similarity,
which calculates the percentage of similarity between two rules, and
unexpectedness, which is applied when two rules do not have any similarity, to
discover any unexpected event in the rules, either in the conditional part or in the consequent part.
Before calculating these two measures, we need some notation, defined
below.
$R^{t_1}$: the set of association rules for period $t_1$;
$R^{t_2}$: the set of association rules for period $t_2$;
$r_i^{t_1}$: an association rule in $R^{t_1}$, $r_i^{t_1} \in R^{t_1}$;
$r_j^{t_2}$: an association rule in $R^{t_2}$, $r_j^{t_2} \in R^{t_2}$;
$A_{ij}$: the set of attributes that simultaneously appear in the conditional parts (LHS) of both $r_i^{t_1}$ and $r_j^{t_2}$;
$|A_{ij}|$: the number of attributes in $A_{ij}$;
$B_{ij}$: the set of attributes that simultaneously appear in the consequent parts (RHS) of both $r_i^{t_1}$ and $r_j^{t_2}$;
$|B_{ij}|$: the number of attributes in $B_{ij}$;
$|L_i^{t_1}|$: the number of attributes in the LHS of $r_i^{t_1}$;
$|L_j^{t_2}|$: the number of attributes in the LHS of $r_j^{t_2}$;
$|M_i^{t_1}|$: the number of attributes in the RHS of $r_i^{t_1}$;
$|M_j^{t_2}|$: the number of attributes in the RHS of $r_j^{t_2}$;
$\ell_{ij}$: the similarity of the attributes in the LHS of $r_i^{t_1}$ and $r_j^{t_2}$;
$h_{ij}$: the similarity of the attributes in the RHS of $r_i^{t_1}$ and $r_j^{t_2}$;
$x_{ij}^p$: a binary variable, where $x_{ij}^p = 1$ if the $p$th attribute in $A_{ij}$ has the same value for $r_i^{t_1}$ and $r_j^{t_2}$, and otherwise $x_{ij}^p = 0$, $p = 1, 2, \ldots, |A_{ij}|$;
$y_{ij}^q$: a binary variable, where $y_{ij}^q = 1$ if the $q$th attribute in $B_{ij}$ has the same value for $r_i^{t_1}$ and $r_j^{t_2}$, and otherwise $y_{ij}^q = 0$, $q = 1, 2, \ldots, |B_{ij}|$;
$S_{ij}$: a measure of the similarity between $r_i^{t_1}$ and $r_j^{t_2}$;
$S_i^{t_1}$: the maximum similarity for $r_i^{t_1}$;
$S_j^{t_2}$: the maximum similarity for $r_j^{t_2}$;
$\delta_{ij}$: a measure of unexpectedness between $r_i^{t_1}$ and $r_j^{t_2}$;
$\delta'_{ij}$: an adjusted measure of unexpectedness between $r_i^{t_1}$ and $r_j^{t_2}$;
$u_{ij}$: a binary variable, where $u_{ij} = 1$ if $\max(S_i^{t_1}, S_j^{t_2}) = 1$; otherwise, $u_{ij} = 0$.

In this study, first, we applied the two measures of similarity and unexpectedness by
(Chen et al, 2005). The Similarity measure can be used to measure the degree of
likeness between two rules, and unexpectedness measure can be used to identify the
disparity between dissimilar rules. The two measures are shown below:

$$S_{ij} = \begin{cases} \dfrac{\ell_{ij}\sum_{p} x_{ij}^p}{|A_{ij}|} \times \dfrac{h_{ij}\sum_{q} y_{ij}^q}{|B_{ij}|}, & \text{if } |A_{ij}| \neq 0 \text{ and } |B_{ij}| \neq 0 \\ 0, & \text{if } |A_{ij}| = 0 \text{ or } |B_{ij}| = 0 \end{cases} \qquad (1)$$

where $\ell_{ij}$ and $h_{ij}$ are defined as follows:
$$\ell_{ij} = \frac{|A_{ij}|}{\max\left(|L_i^{t_1}|, |L_j^{t_2}|\right)} \qquad (2)$$

$$h_{ij} = \frac{|B_{ij}|}{\max\left(|M_i^{t_1}|, |M_j^{t_2}|\right)} \qquad (3)$$
In Eqs. (2) and (3), $\ell_{ij}$ and $h_{ij}$ represent the similarity of the conditional and
consequent parts, respectively. The degree of similarity, $S_{ij}$, is between 0 and 1,
where 0 indicates that the two patterns are completely dissimilar, and 1 indicates
that the two patterns are identical.
For mining changes, the steps are as follows:
First, we calculate the similarity measure between every rule of the first period and
all of the rules of the second period, and vice versa.
After calculating the similarity of patterns, the maximum similarity
degrees of rules $r_i^{t_1}$ and $r_j^{t_2}$ are determined to measure the change of patterns
during periods $t_1$ and $t_2$. The maximum degrees of similarity are represented using
Eqs. (4) and (5), as below.

$$S_i^{t_1} = \max\left(S_{i1}^{t_1}, S_{i2}^{t_1}, \ldots, S_{i|R^{t_2}|}^{t_1}\right) \qquad (4)$$

$$S_j^{t_2} = \max\left(S_{1j}^{t_2}, S_{2j}^{t_2}, \ldots, S_{|R^{t_1}|j}^{t_2}\right) \qquad (5)$$
According to (Chen et al, 2005), the maximum similarity provides the basis for
differentiating emerging patterns, added patterns, and perished patterns across
periods. If the maximum similarity of rule $r_i^{t_1}$, $S_i^{t_1}$, equals 1 (or $S_j^{t_2}$ equals
1), then the rule exists in both time periods $t_1$ and $t_2$, and thus shows an emerging
pattern. If a rule displays positive growth ($Sup_2 > Sup_1$), then the rule represents a
pattern of customer behavior that becomes stronger with time. Conversely, a growth
rate below zero indicates a negative trend in the customer behavior change.
If the maximum similarity of rule $r_i^{t_1}$, $S_i^{t_1}$, lies between 0 and 1, the two
rules share a partial resemblance. The decision maker determines a rule matching
threshold (RMT) to judge whether the similarity of a specific rule satisfies the
criteria set by the individual user. If the maximum similarity of rule $r_i^{t_1}$, $S_i^{t_1}$, is
smaller than the RMT ($S_i^{t_1} < RMT$), this rule gradually perishes in time period $t_2$ and is
therefore considered a perished pattern; otherwise, the rule is not perished.
Meanwhile, if the maximum similarity of rule $r_j^{t_2}$ is below the RMT ($S_j^{t_2} < RMT$), $r_j^{t_2}$
in period $t_2$ is quite different from the rules in period $t_1$, and thus it is considered an
added pattern; otherwise, it is not an added rule.
If the maximum value of the similarity measure for a rule is 0, then the
unexpectedness measure is used to judge whether the two rules constitute
unexpected changes. Unexpectedness was initially used as a subjective measure of
the interestingness of a pattern: patterns are interesting if they are 'surprising' to the user
(Silberschatz, A., & Tuzhilin, A., 1996).
In this study we have used the unexpectedness measure introduced by (Chen
et al, 2005). The measure is illustrated in Equation (6):
$$\delta_{ij} = \begin{cases} \dfrac{\ell_{ij}\sum_{p} x_{ij}^p}{|A_{ij}|} - \dfrac{h_{ij}\sum_{q} y_{ij}^q}{|B_{ij}|}, & \text{if } |A_{ij}| \neq 0 \text{ and } |B_{ij}| \neq 0 \\ 0, & \text{if } |A_{ij}| = 0 \text{ or } |B_{ij}| = 0 \end{cases} \qquad (6)$$
If $\delta_{ij} > 0$, then rule $r_j^{t_2}$ is an unexpected purchasing rule (i.e. an unexpected
consequent change) with respect to $r_i^{t_1}$. In this case, customers with the same
characteristics shift their purchasing behavior and buy different products. If $\delta_{ij} < 0$,
then rule $r_j^{t_2}$ is an unexpected shifting rule (i.e. an unexpected conditional change)
with respect to $r_i^{t_1}$. This change indicates that the customer group of specific products
has changed to another group. If the unexpectedness value equals 0, the two rules
are completely different. If the unexpectedness values of a rule from $t_1$ compared
with all rules of period $t_2$ equal 0, it is an unexpected perished rule; conversely, such a
rule from $t_2$ is an unexpected added rule.
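As an illustration, the following Python sketch computes the two measures for a pair of rules, following the reconstructed Eqs. (1)-(3) and (6) above (the rules and their encoding as attribute-value dictionaries are hypothetical examples, not mined results):

# Rules are encoded as (LHS, RHS) pairs of attribute -> value dictionaries.
def measures(rule_t1, rule_t2):
    """Return (S_ij, delta_ij) for a rule from t1 and a rule from t2."""
    (lhs1, rhs1), (lhs2, rhs2) = rule_t1, rule_t2
    A = set(lhs1) & set(lhs2)   # common conditional (LHS) attributes
    B = set(rhs1) & set(rhs2)   # common consequent (RHS) attributes
    if not A or not B:          # |A_ij| = 0 or |B_ij| = 0
        return 0.0, 0.0
    l = len(A) / max(len(lhs1), len(lhs2))   # Eq. (2)
    h = len(B) / max(len(rhs1), len(rhs2))   # Eq. (3)
    x = sum(lhs1[a] == lhs2[a] for a in A)   # attributes with equal values
    y = sum(rhs1[b] == rhs2[b] for b in B)
    lhs_term = l * x / len(A)
    rhs_term = h * y / len(B)
    return lhs_term * rhs_term, lhs_term - rhs_term  # Eqs. (1) and (6)

r_t1 = ({"F": "75%"}, {"cat1": 1, "cat5": 1})
r_t2 = ({"F": "75%"}, {"cat1": 1, "cat3": 1})
print(measures(r_t1, r_t2))  # identical LHS, half-matching RHS: (0.5, 0.5)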
Our contribution:
In the previous work, the two measures did not use the information that ordinal
data carry by themselves. In this study, after mining changes with the two measures
introduced by (Chen et al, 2005), we modified these measures to mine changes
using the information that ordinal values carry.
Here we want to compare ordinal values instead of binary matches for each
attribute. To do so, we calculate the distances between the values of each common
attribute of the LHS and RHS. According to (Han, J., & Kamber, M., 2006), the
dissimilarity (or similarity) between objects described by interval-scaled
variables is typically computed based on the distance between each pair of objects.
One of the most popular distance measures is the Manhattan (or city block) distance,
which is defined as:
$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{in} - x_{jn}|$$

The measures of similarity and unexpectedness are modified by using
Manhattan distances. The modified measures are as follows.
$\alpha_{ij}^p$: the distance between the values of the $p$th attribute of $r_i^{t_1}$ and $r_j^{t_2}$, where the
$p$th attribute is common in $A_{ij}$; this is based on the definition of the Manhattan distance.
$\beta_{ij}^q$: the distance between the values of the $q$th attribute of $r_i^{t_1}$ and $r_j^{t_2}$, where the
$q$th attribute is common in $B_{ij}$; this is based on the definition of the Manhattan distance.
$$S'_{ij} = \frac{|A_{ij}| - \sum_{p} \alpha_{ij}^p}{\max\left(|L_i^{t_1}|, |L_j^{t_2}|\right)} \times \frac{|B_{ij}| - \sum_{q} \beta_{ij}^q}{\max\left(|M_i^{t_1}|, |M_j^{t_2}|\right)}$$

$$\delta'_{ij} = \frac{|A_{ij}| - \sum_{p} \alpha_{ij}^p}{\max\left(|L_i^{t_1}|, |L_j^{t_2}|\right)} - \frac{|B_{ij}| - \sum_{q} \beta_{ij}^q}{\max\left(|M_i^{t_1}|, |M_j^{t_2}|\right)}$$

By defining these measures, we bring in the information carried by the ordinal data that we
have.
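A sketch of how the modified measures could be computed, assuming ordinal values are encoded as quantile indices and each Manhattan distance is normalized by the attribute's range (this normalization is our illustrative assumption; the thesis does not spell it out):

# Ordinal values are encoded as quantile indices 1..4; the per-attribute
# Manhattan distance is scaled by the range (n_levels - 1) so it lies in
# [0, 1] -- an assumption made for this illustration.
def modified_measures(rule_t1, rule_t2, n_levels=4):
    (lhs1, rhs1), (lhs2, rhs2) = rule_t1, rule_t2
    A = set(lhs1) & set(lhs2)
    B = set(rhs1) & set(rhs2)
    if not A or not B:
        return 0.0, 0.0
    alpha = sum(abs(lhs1[a] - lhs2[a]) / (n_levels - 1) for a in A)
    beta = sum(abs(rhs1[b] - rhs2[b]) / (n_levels - 1) for b in B)
    lhs_term = (len(A) - alpha) / max(len(lhs1), len(lhs2))
    rhs_term = (len(B) - beta) / max(len(rhs1), len(rhs2))
    return lhs_term * rhs_term, lhs_term - rhs_term

# F in the 3rd vs 4th quantile now counts as partially, not zero, similar.
r_t1 = ({"F": 3, "M": 2}, {"cat1": 1})
r_t2 = ({"F": 4, "M": 2}, {"cat1": 1})
print(modified_measures(r_t1, r_t2))  # approx (0.83, -0.17)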
Chapter 4: Results & Analysis
Data preprocessing result
Customer segmentation result
Mining customer behavior result
Change mining result
The data pre-processing phase of the analysis has been done in SQL Server 2000
(SQL Server, 2000), and the data-mining phase was performed with the R package (R
software, 2007). This chapter shows the analysis and the result of each step in the change
mining process.

4.1 Data preprocessing result: 4.1.1 Data Cleaning: According to Chapter 3, we have some noisy data:
the customers who belong to Kalleh Company itself. During the two periods that we analyze, there were 2499 customers,
but 42 of them belonged to Kalleh Company, so we removed them from the database.
The total number of customers after removing the noisy data became 2457.

4.1.2 Data Transformation result:
4.1.2.1 Generalization:
The result of the generalization explained in Chapter 3 is 6 categories of
products, which are shown in Figure 4.1.

Figure 4.1: Generalized product categories
Category1: Dairy Products
Category2: Ice-Cream
Category3: Meat Products (freezed or non-freezed)
Category5: Pitza Cheese
Category11: Sauces
Category13: Cooking Cheese & Processed Cheese

4.1.2.2 RFM Construction: As explained in Chapter 2, for building the customer
behavior patterns we need the customer behavioral variables. This part of the research
has been done in SQL Server 2000. For calculating RFM, first we divided our
dataset into two time snapshots, one between '1383/07/01' AND '1384/06/31' as period
one or t1, and the second one between '1384/07/01' AND '1385/06/31' as period two
or t2.
We defined recency by calculating the interval between the last date of
purchase and the last date of each period. It means that the
evaluation times for these two time snapshots are '1384/06/31' and '1385/06/31'.
For frequency and monetary, we aggregate the transaction data to calculate the
total number of purchases and the total amount spent during each period. According to
the market segmentation by (Marcus, C., 1998), we need the average purchase of
each customer, so we divide the total purchase amount by the total number of purchases to
calculate the average amount of each purchase. The final data, ready for the next
step of discretization, has the format illustrated in Table 4.1, which lists the fields
extracted from the database.

Table 4.1: RFM table fields
• Period
• Customer Code (ID)
• Recency (days)
• Frequency
• Monetary (average of purchase)

4.2 Customer segmentation (in SQL Server 2000): According to
(Marcus, C., 1998), we divided the customers into four clusters in each period:
uncertain, frequent, spender and best. According to the Customer Value Matrix,
we have two axes. The calculation steps of the Customer Value Matrix and its results are
given in the following section.

4.2.1 Customer Value Matrix Result: According to the definition of the Customer
Value Matrix in Chapter 3, we applied the customer value matrix introduced by (Marcus,
C., 1998). For each period, we define two variables for this matrix: one is the average
number of purchases and the other is the average amount of purchase. The results are
shown in Table 4.2 and Table 4.3.

Period 1:
Table 4.2: Calculating variables for the customer value matrix (period 1)
Average number of purchases = total number of purchases / total number of customers
Total number of purchases = 16,424
Total number of customers = 789
Average number of purchases = 16,424 / 789 = 20.82 purchases per customer
Average purchase amount = total sales / total number of customers
Total sales = 1,037,047,130.8 Rials
Average purchase amount = 1,037,047,130.8 Rials / 789 = 1,314,381.66 Rials per customer
Period 2:
Table 4.3: Calculating variables for the customer value matrix (period 2)
Average number of purchases = total number of purchases / total number of customers
Total number of purchases = 31,061
Total number of customers = 2199
Average number of purchases = 31,061 / 2199 = 14.1 purchases per customer
Average purchase amount = total sales / total number of customers
Total sales = 2,829,589,665.4 Rials
Average purchase amount = 2,829,589,665.4 Rials / 2199 = 1,286,762 Rials per customer
Based on the customer value matrix we have four clusters. For each customer,
we calculate the average amount of purchase and the number of purchases. By
comparing these values with the two average variables calculated above, we
determine which cluster each customer belongs to.
When a customer's average sale is less than the average of sales and the
number of purchases is less than the average frequency, the customer is placed in the uncertain segment.
When the average sale is less than the average of sales and the
number of purchases is greater than the average frequency, the customer is in the frequent
segment. When the average sale is greater than the average of sales
and the number of purchases is less than the average frequency, the customer is in the spender
segment. Finally, when the average sale is greater than the average of
sales and the number of purchases is greater than the average frequency, the customer is in the best
segment. Figure 4.2 shows the four segments.
Figure 4.2: The Customer Value Matrix
[The figure repeats the 2*2 matrix of Figure 3.6: Frequency on the x-axis, split at Avg. Frequency, and Monetary on the y-axis, split at Avg. Monetary, with the quadrants Uncertain, Frequent, Spender and Best.]

The segmentation results in periods one and two are shown in Table 4.4 and Table 4.5.

Table 4.4: Segment information for period 1
Segment      Number of customers   Percentage
Uncertain    441                   56%
Frequent     140                   18%
Spender      85                    11%
Best         123                   16%
Total        789
Table 4.5: Segment information for period 2
Segment      Number of customers   Percentage
Uncertain    1335                  61%
Frequent     389                   18%
Spender      227                   10%
Best         248                   11%
Total        2199
Based on the Customer Value Matrix, as mentioned, the four clusters are as follows:
uncertain, frequent, spender and best.
4.3 Customer Behavior Mining: In this phase, we applied association
rules to analyze the patterns of customer behavior in the different time snapshots for
each customer cluster. For mining changes in customer behavior over different
periods, we divided the data into two periods, and in each period we built the four clusters of
customers: uncertain, spender, frequent and best.
As mentioned, the purpose of this study is to mine customer behavior patterns
by building association rules in which customer profile data and behavioral variables
(RFM) form the conditional part and purchased products form the consequent part.
The first issue is that association rule mining works with discrete variables. Therefore, in
the first phase we need to do discretization.
4.3.1 Discretization Result: As explained in Chapter 3, there are various
types of discretization methods. In this study we did this step in the R package. For
the three customer behavioral variables, RFM, we have used equal-frequency binning:
we built four quantiles by equal-frequency binning in R for Recency, Frequency and Monetary. Here we
present the quantile data and histograms of R, F, M and Area separately.
Table 4.6: R quantile; Figure 4.3: R histogram
Table 4.8: F quantile; Figure 4.5: F histogram
Frequency variable:
Quantile       Interval
1st Quantile   1 to 2
2nd Quantile   2 to 5
3rd Quantile   5 to 20
4th Quantile   20 to 248
For discretizing the area, based on the market experts' opinion and their knowledge
about the areas, we have defined four groups.
4.3.2 Association Rule Mining Results:
We applied the Apriori algorithm to mine association rules in this research. The
minimum support and confidence are both 17%, and we find maximal frequent itemsets.
After the association rules were built, the association rules of each cluster for the two different
time periods were compared to understand the customer behavior patterns of the most
valuable customers. This means that we have 8 rulesets, for four clusters of customers
in two periods.
4.4 Change Mining: In this step, we compare the pair of rulesets related to each customer cluster across the two periods.
In Table 4.10 you can see the summary of the generated rules in each cluster and in each
period.
Table 4.10: Generated rule summary
Number of generated association rules per cluster
Cluster   Period 1   Period 2
1         20         13
2         76         65
3         22         29
4         127        86
All of the mined association rules and their change types are shown in Tables
4.11 to 4.16.
Here, in the tables below, the rules generated using the similarity
formula of (Chen et al, 2005) are shown. For each cluster, we have two rulesets, one for each period.
In each ruleset we can see different kinds of changes in customer purchasing
behavior. Cluster one is the cluster of customers who buy frequently but whose purchase amount is
below the overall average purchase amount. In the generated rules and the changes
in customer purchasing behavior, of the five kinds that we defined in the
methodology, we have found four in the four clusters. While there are a large
number of changes in the customer behavior patterns, a few examples of change patterns
are selected from each change type to provide an explanation.
4.4.1 Some examples of change patterns:
One example of an emerging pattern in cluster 1:
t1-r5: "Area=poor -> cat1=1", support = 0.191344
t2-r8: "Area=poor -> cat1=1", support = 0.260674 (cat1 is the dairy products group)
The growth rate is 36%.
This rule shows that customers in the poor area generally buy dairy products. The 36%
growth in the rule's support means the rule grows more robust over time.
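Concretely, the growth rate here is the relative change in support between the two periods:

$$\frac{0.260674 - 0.191344}{0.191344} \approx 0.36 = 36\%.$$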
One example of an unexpected purchasing pattern in cluster 1:
t1-r1: area=normal -> cat11=1, support = 0.170843 (cat11 is sauces)
t2-r4: area=normal -> cat1=1, support = 0.193258 (cat1 is dairy products)
The above rules show that the initial pattern of customer behavior is that
customers in the normal area purchase the sauces category. However, in the second period, this group
purchases dairy products instead. This unexpected consequent pattern can
lead marketing decision makers to direct their marketing effort toward learning why this
change happened, promoting dairy products to this group, and reducing the
promotion of sauces in the normal area, thus increasing customer value.
One example of an added rule:
t2-r24: R=25% -> cat2=1, support = 0.200514
Here R=25% means recency is between 0 and 5 days, and cat2 is ice-cream.
The above rule is a newly added pattern, which provides a reference for
developing promotion plans to stimulate customer needs.
One example of a perished rule:
t1-r11: F=100%, M=50% -> cat1=1, support = 0.178571, similarity = 0.333
Here F=100% means the frequency is between 20 and 248 times, M=50% means the monetary
value is between 269471.154 Rials and 538398.005 Rials, and cat1 means dairy products.
The above rule shows that during the first period, customers whose
frequency is between 20 and 248 times and whose monetary expenditure is
between 269471.154 Rials and 538398.005 Rials bought dairy products, but the
similarity of this rule to the rules generated in the next period is 0.33, which is
lower than the rule matching threshold (RMT). In marketing, when we face such a
situation, it means that the focus of the marketing strategies should be moved away from this
group. The unexpected purchasing (consequent) and unexpected shifting
(conditional) patterns can help to better determine where to focus.
4.4.2 Association rules and changes based on (Chen et al, 2005):

Table 4.11: Generated rules for period 1, Cluster 1
Rule-Index   rule1   Support   Change Type   Similarity   Sim-Rule-Index
1 area=normal , -> cat11=1 , 0.170843 Unexpected perished 0 1
2 M=25% , -> cat1=1 , 0.177677 Emerging trend 1 7
3 M=50% , -> cat11=1 , 0.18451 Unexpected perished 0 1
4 area=poor , -> cat11=1 , 0.220957 Emerging trend 1 9
5 area=poor , -> cat1=1 , 0.191344 Emerging trend 1 8
6 area=poor , -> cat3=1 , 0.170843 Emerging trend 1 10
7 R=75% , -> cat11=1 , 0.170843 Unexpected perished 0 1
8 R=75% , -> cat1=1 , 0.193622 Emerging trend 1 2
9 R=100% , -> cat11=1 , 0.198178 Unexpected perished 0 1
10 R=100% , -> cat5=1 , 0.175399 Unexpected perished 0 1
11 M=75% , -> cat11=1 , 0.198178 Unexpected perished 0 1
12 M=75% , -> cat1=1 , 0.1959 Emerging trend 1 3
13 M=75% , -> cat5=1 , 0.220957 Unexpected perished 0 1
14 M=75% , -> cat3=1 , 0.200456 Unexpected perished 0 1
15 F=25% , -> cat11=1 , 0.189066 Emerging trend 1 13
16 F=25% , -> cat1=1 , 0.173121 Emerging trend 1 12
17 F=75% , -> cat11=1 , 0.218679 Unexpected purchasing 0 1
18 F=75% , -> cat1=1 , 0.23918 Emerging trend 1 1
19 F=75% , -> cat5=1 , 0.230068 Unexpected purchasing 0 1
20 F=75% , -> cat3=1 , 0.173121 Unexpected purchasing 0 1
Table 4.12: Generated rules for period 2, Cluster 1
Rule-Index   rule2   Support   Change Type   Similarity   Sim-Rule-Index
1 F=75% , -> cat1=1 , 0.18427 1 18
2 R=75% , -> cat1=1 , 0.170037 1 8
3 M=75% , -> cat1=1 , 0.175281 1 12
4 area=normal , -> cat1=1 , 0.193258 Unexpected purchasing 0 1
5 R=100% , -> cat1=1 , 0.207491 Unexpected added 0 1
6 M=50% , -> cat1=1 , 0.229213 Unexpected added 0 1
7 M=25% , -> cat1=1 , 0.213483 1 2
8 area=poor , -> cat1=1 , 0.260674 1 5
9 area=poor , -> cat11=1 , 0.182772 1 4
10 area=poor , -> cat3=1 , 0.170787 1 6
11 F=25% , -> cat3=1 , 0.170787 Unexpected added 0 1
12 F=25% , -> cat1=1 , 0.277903 1 16
13 F=25% , -> cat11=1 , 0.183521 1 15
Generated rules for cluster2 are as follows:
Table 4.13: Generated rules for period 1, Cluster 2
Rule-Index   rule1   Support   Change Type   Similarity   Sim-Rule-Index
1 F=100% ,R=50% , -> cat5=1 , 0.178571 Not perished 0.5 12
2 F=100% ,area=good , -> cat1=1 , 0.178571 Perished 0.333333 25
3 F=100% ,area=good , -> cat5=1 , 0.178571 Not perished 0.5 18
4 area=good , -> cat11=1 , 0.178571 Unexpected perished 0 1
5 F=100% ,area=rich , -> cat3=1 , 0.178571 Perished 0.333333 26
6 F=100% ,area=rich , -> cat11=1 , 0.178571 Perished 0.333333 27
7 F=100% ,area=rich , -> cat1=1 ,cat5=1 , 0.178571 Perished 0.333333 21
8 F=100% ,M=50% , -> cat2=1 , 0.171429 Perished 0.166667 21
9 F=100% ,M=50% , -> cat3=1 , 0.178571 Not perished 0.5 5
10 F=100% ,M=50% , -> cat5=1 ,cat11=1 , 0.178571 Perished 0.333333 22
11 F=100% ,M=50% , -> cat1=1 , 0.178571 Perished 0.333333 25
12 M=50% , -> cat1=1 ,cat5=1 , 0.171429 Not perished 0.5 4
13 F=100% ,area=normal , -> cat3=1 ,cat13=1 , 0.171429 Not perished 0.5 14
14 F=100% ,area=normal , -> cat11=1 ,cat13=1 , 0.178571 Not perished 0.5 14
15 F=100% ,area=normal , -> cat5=1 ,cat13=1 , 0.185714 Not perished 0.5 14
16 area=normal , -> cat1=1 ,cat13=1 , 0.171429 Not perished 0.5 13
17 F=100% ,area=normal , -> cat3=1 ,cat11=1 , 0.192857 Not perished 0.5 19
18 F=100% ,area=normal , -> cat1=1 ,cat3=1 , 0.171429 Emerging trend 1 20
19 F=100% ,area=normal , -> cat3=1 ,cat5=1 , 0.178571 Not perished 0.5 18
20 F=100% ,area=normal , -> cat1=1 ,cat5=1 ,cat11=1 , 0.171429 Not perished 0.666667 19
21 area=normal , -> cat2=1 , 0.171429 Unexpected perished 0 1
22 F=100% ,R=25% ,M=75% , -> cat1=1 ,cat3=1 , 0.171429 Not perished 0.666667 30
23 F=100% ,R=25% ,M=75% , -> cat1=1 ,cat11=1 , 0.171429 Not perished 0.666667 28
24 F=100% ,R=25% ,M=75% , -> cat5=1 ,cat11=1 , 0.171429 Not perished 0.5 27
25 F=100% ,R=25% ,M=75% , -> cat1=1 ,cat5=1 , 0.185714 Not perished 0.5 25
26 F=100% ,M=75% , -> cat3=1 ,cat11=1 ,cat13=1 , 0.171429 Emerging trend 1 38
27 F=100% ,M=75% , -> cat1=1 ,cat3=1 ,cat13=1 , 0.171429 Emerging trend 1 39
28 F=100% ,M=75% , -> cat3=1 ,cat5=1 ,cat13=1 , 0.185714 Emerging trend 1 33
29 M=75% , -> cat5=1 ,cat11=1 ,cat13=1 , 0.171429 Not perished 0.75 34
30 M=75% , -> cat1=1 ,cat11=1 ,cat13=1 , 0.171429 Not perished 0.75 34
31 F=100% ,M=75% , -> cat1=1 ,cat5=1 ,cat13=1 , 0.178571 Not perished 0.666667 33
32 F=100% ,M=75% , -> cat1=1 ,cat3=1 ,cat11=1 , 0.2 Emerging trend 1 45
33 M=75% , -> cat1=1 ,cat3=1 ,cat5=1 ,cat11=1 , 0.192857 Emerging trend 1 44
34 F=100% ,M=75% , -> cat3=1 ,cat5=1 ,cat11=1 , 0.2 Emerging trend 1 42
35 F=100% ,M=75% , -> cat1=1 ,cat3=1 ,cat5=1 , 0.207143 Emerging trend 1 43
36 F=100% ,M=75% , -> cat1=1 ,cat5=1 ,cat11=1 , 0.214286 Emerging trend 1 41
37 M=75% , -> cat2=1 , 0.171429 Unexpected perished 0 1
38 F=100% ,R=25% , -> cat2=1 ,cat3=1 ,cat11=1 , 0.171429 Not perished 0.666667 53
39 R=25% , -> cat1=1 ,cat2=1 ,cat3=1 ,cat11=1 , 0.178571 Not perished 0.75 55
40 F=100% ,R=25% , -> cat1=1 ,cat2=1 ,cat3=1 , 0.192857 Not perished 0.666667 54
41 R=25% , -> cat1=1 ,cat2=1 ,cat3=1 ,cat5=1 , 0.178571 Not perished 0.75 51
42 F=100% ,R=25% , -> cat2=1 ,cat3=1 ,cat5=1 , 0.178571 Not perished 0.666667 47
43 F=100% ,R=25% , -> cat1=1 ,cat2=1 ,cat11=1 , 0.235714 Not perished 0.666667 52
44 R=25% , -> cat1=1 ,cat2=1 ,cat5=1 ,cat11=1 , 0.207143 Not perished 0.75 49
45 F=100% ,R=25% , -> cat2=1 ,cat5=1 ,cat11=1 , 0.214286 Not perished 0.666667 48
46 F=100% ,R=25% , -> cat1=1 ,cat2=1 ,cat5=1 , 0.235714 Not perished 0.666667 46
47 F=100% , -> cat2=1 ,cat3=1 ,cat11=1 ,cat13=1 , 0.178571 Not perished 0.75 23
48 F=100% , -> cat1=1 ,cat2=1 ,cat3=1 ,cat13=1 , 0.192857 Not perished 0.75 23
49 F=100% , -> cat2=1 ,cat3=1 ,cat5=1 ,cat13=1 , 0.185714 Not perished 0.75 62
50 F=100% , -> cat1=1 ,cat2=1 ,cat11=1 ,cat13=1 , 0.207143 Not perished 0.75 23
51 F=100% , -> cat2=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.192857 Not perished 0.75 22
52 F=100% , -> cat1=1 ,cat2=1 ,cat5=1 ,cat13=1 , 0.221429 Not perished 0.75 21
53 F=100% , -> cat1=1 ,cat2=1 ,cat3=1 ,cat11=1 , 0.257143 Emerging trend 1 23
54 F=100% , -> cat2=1 ,cat3=1 ,cat5=1 ,cat11=1 , 0.228571 Not perished 0.75 22
55 F=100% , -> cat1=1 ,cat2=1 ,cat3=1 ,cat5=1 , 0.257143 Not perished 0.75 21
56 F=100% , -> cat1=1 ,cat2=1 ,cat5=1 ,cat11=1 , 0.292857 Not perished 0.75 21
57 F=100% ,R=25% , -> cat3=1 ,cat11=1 ,cat13=1 , 0.221429 Emerging trend 1 53
58 R=25% , -> cat3=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.214286 Emerging trend 1 50
59 R=25% , -> cat1=1 ,cat3=1 ,cat11=1 ,cat13=1 , 0.207143 Emerging trend 1 55
60 F=100% ,R=25% , -> cat1=1 ,cat3=1 ,cat13=1 , 0.221429 Emerging trend 1 54
61 R=25% , -> cat1=1 ,cat3=1 ,cat5=1 ,cat13=1 , 0.214286 Emerging trend 1 51
62 F=100% ,R=25% , -> cat3=1 ,cat5=1 ,cat13=1 , 0.221429 Emerging trend 1 47
63 F=100% ,R=25% , -> cat1=1 ,cat11=1 ,cat13=1 , 0.228571 Emerging trend 1 52
64 R=25% , -> cat1=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.228571 Emerging trend 1 49
65 F=100% ,R=25% , -> cat5=1 ,cat11=1 ,cat13=1 , 0.228571 Emerging trend 1 48
66 F=100% ,R=25% , -> cat1=1 ,cat5=1 ,cat13=1 , 0.242857 Emerging trend 1 46
67 F=100% ,R=25% , -> cat1=1 ,cat3=1 ,cat11=1 , 0.285714 Emerging trend 1 60
68 R=25% , -> cat1=1 ,cat3=1 ,cat5=1 ,cat11=1 , 0.271429 Emerging trend 1 59
69 F=100% ,R=25% , -> cat3=1 ,cat5=1 ,cat11=1 , 0.271429 Emerging trend 1 57
70 F=100% ,R=25% , -> cat1=1 ,cat3=1 ,cat5=1 , 0.292857 Emerging trend 1 58
71 F=100% ,R=25% , -> cat1=1 ,cat5=1 ,cat11=1 , 0.328571 Emerging trend 1 56
72 F=100% , -> cat1=1 ,cat3=1 ,cat11=1 ,cat13=1 , 0.321429 Emerging trend 1 64
73 F=100% , -> cat3=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.35 Emerging trend 1 63
74 F=100% , -> cat1=1 ,cat3=1 ,cat5=1 ,cat13=1 , 0.342857 Emerging trend 1 63
75 F=100% , -> cat1=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.35 Emerging trend 1 61
76 F=100% , -> cat1=1 ,cat3=1 ,cat5=1 ,cat11=1 , 0.407143 Emerging trend 1 65
Table 4.14: Generated rules for period 2, Cluster 2
Rule-Index   rule2   Support   Change Type   Similarity   Sim-Rule-Index
1 area=rich , -> cat1=1 , 0.177378 Added 0.25 7
2 area=poor , -> cat1=1 ,cat3=1 , 0.187661 Unexpected added 0 1
3 area=poor , -> cat11=1 , 0.172237 Unexpected added 0 1
4 M=50% , -> cat1=1 ,cat11=1 , 0.172237 Not Added 0.5 12
5 M=50% , -> cat3=1 , 0.190231 Not Added 0.5 9
6 M=50% , -> cat5=1 , 0.179949 Not Added 0.5 12
7 F=75% , -> cat1=1 ,cat11=1 , 0.179949 Unexpected added 0 1
8 F=75% , -> cat1=1 ,cat3=1 , 0.197943 Unexpected added 0 1
9 R=50% , -> cat1=1 ,cat13=1 , 0.172237 Unexpected purchasing 0 1
10 R=50% , -> cat1=1 ,cat11=1 , 0.187661 Unexpected purchasing 0 1
11 R=50% , -> cat1=1 ,cat3=1 , 0.203085 Unexpected purchasing 0 1
12 R=50% , -> cat5=1 , 0.192802 Not Added 0.5 1
13 area=normal , -> cat1=1 ,cat3=1 ,cat11=1 ,cat13=1 , 0.172237 Not Added 0.5 16
14 F=100% ,area=normal , -> cat13=1 , 0.174807 Not Added 0.5 13
15 area=normal , -> cat5=1 ,cat13=1 , 0.18509 Not Added 0.5 15
16 area=normal , -> cat1=1 ,cat5=1 ,cat11=1 , 0.177378 Not Added 0.5 20
17 area=normal , -> cat1=1 ,cat3=1 ,cat5=1 , 0.182519 Added 0.333333 16
18 F=100% ,area=normal , -> cat5=1 , 0.172237 Not Added 0.5 1
19 F=100% ,area=normal , -> cat1=1 ,cat11=1 , 0.190231 Not Added 0.666667 20
20 F=100% ,area=normal , -> cat1=1 ,cat3=1 , 0.179949 1 18
21 F=100% , -> cat1=1 ,cat2=1 ,cat5=1 , 0.179949 Not Added 0.75 52
22 F=100% , -> cat2=1 ,cat5=1 ,cat11=1 , 0.177378 Not Added 0.75 51
23 F=100% , -> cat1=1 ,cat2=1 ,cat3=1 ,cat11=1 , 0.174807 1 53
24 R=25% , -> cat2=1 , 0.200514 Added 0.25 39
25 F=100% ,R=25% ,M=75% , -> cat1=1 , 0.195373 Not Added 0.5 22
26 F=100% ,R=25% ,M=75% , -> cat3=1 , 0.18509 Not Added 0.5 22
27 F=100% ,R=25% ,M=75% , -> cat11=1 , 0.174807 Not Added 0.5 23
28 R=25% ,M=75% , -> cat1=1 ,cat11=1 , 0.190231 Not Added 0.666667 23
29 R=25% ,M=75% , -> cat3=1 ,cat11=1 , 0.182519 Added 0.333333 22
30 R=25% ,M=75% , -> cat1=1 ,cat3=1 , 0.195373 Not Added 0.666667 22
31 R=25% ,M=75% , -> cat5=1 , 0.192802 Added 0.333333 24
32 R=25% ,M=75% , -> cat13=1 , 0.182519 Added 0.166667 26
33 F=100% ,M=75% , -> cat3=1 ,cat5=1 ,cat13=1 , 0.179949 1 28
34 M=75% , -> cat1=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.190231 Not Added 0.75 29
35 M=75% , -> cat3=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.187661 Not Added 0.75 29
36 M=75% , -> cat1=1 ,cat3=1 ,cat5=1 ,cat13=1 , 0.197943 Not Added 0.75 33
37 F=100% ,M=75% , -> cat1=1 ,cat11=1 ,cat13=1 , 0.187661 Not Added 0.666667 26
38 F=100% ,M=75% , -> cat3=1 ,cat11=1 ,cat13=1 , 0.18509 1 26
39 F=100% ,M=75% , -> cat1=1 ,cat3=1 ,cat13=1 , 0.192802 1 27
40 M=75% , -> cat1=1 ,cat3=1 ,cat11=1 ,cat13=1 , 0.226221 Not Added 0.75 30
41 F=100% ,M=75% , -> cat1=1 ,cat5=1 ,cat11=1 , 0.197943 1 36
42 F=100% ,M=75% , -> cat3=1 ,cat5=1 ,cat11=1 , 0.197943 1 34
43 F=100% ,M=75% , -> cat1=1 ,cat3=1 ,cat5=1 , 0.213368 1 35
44 M=75% , -> cat1=1 ,cat3=1 ,cat5=1 ,cat11=1 , 0.226221 1 33
45 F=100% ,M=75% , -> cat1=1 ,cat3=1 ,cat11=1 , 0.233933 1 32
46 F=100% ,R=25% , -> cat1=1 ,cat5=1 ,cat13=1 , 0.182519 1 66
47 F=100% ,R=25% , -> cat3=1 ,cat5=1 ,cat13=1 , 0.187661 1 62
48 F=100% ,R=25% , -> cat5=1 ,cat11=1 ,cat13=1 , 0.172237 1 65
49 R=25% , -> cat1=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.174807 1 64
50 R=25% , -> cat3=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.177378 1 58
51 R=25% , -> cat1=1 ,cat3=1 ,cat5=1 ,cat13=1 , 0.187661 1 61
52 F=100% ,R=25% , -> cat1=1 ,cat11=1 ,cat13=1 , 0.192802 1 63
53 F=100% ,R=25% , -> cat3=1 ,cat11=1 ,cat13=1 , 0.195373 1 57
54 F=100% ,R=25% , -> cat1=1 ,cat3=1 ,cat13=1 , 0.203085 1 60
55 R=25% , -> cat1=1 ,cat3=1 ,cat11=1 ,cat13=1 , 0.200514 1 59
56 F=100% ,R=25% , -> cat1=1 ,cat5=1 ,cat11=1 , 0.197943 1 71
57 F=100% ,R=25% , -> cat3=1 ,cat5=1 ,cat11=1 , 0.205656 1 69
58 F=100% ,R=25% , -> cat1=1 ,cat3=1 ,cat5=1 , 0.215938 1 70
59 R=25% , -> cat1=1 ,cat3=1 ,cat5=1 ,cat11=1 , 0.208226 1 68
60 F=100% ,R=25% , -> cat1=1 ,cat3=1 ,cat11=1 , 0.244216 1 67
61 F=100% , -> cat1=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.257069 1 75
62 F=100% , -> cat3=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.25964 1 73
63 F=100% , -> cat1=1 ,cat3=1 ,cat5=1 ,cat13=1 , 0.272494 1 74
64 F=100% , -> cat1=1 ,cat3=1 ,cat11=1 ,cat13=1 , 0.298201 1 72
65 F=100% , -> cat1=1 ,cat3=1 ,cat5=1 ,cat11=1 , 0.303342 1 76
Table4.15: Generated Rules for period 1, Cluster 3
Rule-Index rule1 Support Change-Type Similarity Sim-Rule-Index
1 F=50% ,M=100% , -> cat5=1 , 0.188235294 Emerging trend 1 15
2 M=100% ,area=good , -> cat5=1 , 0.188235294 Not perished 0.5 1
3 R=50% ,M=100% , -> cat5=1 , 0.211764706 Emerging trend 1 1
4 F=25% ,M=100% , -> cat5=1 , 0.211764706 Emerging trend 1 9
5 R=75% ,M=100% , -> cat5=1 , 0.188235294 Emerging trend 1 6
6 M=100% ,area=poor , -> cat11=1 , 0.176470588 Not perished 0.5 14
7 M=100% ,area=poor , -> cat5=1 , 0.258823529 Not perished 0.5 1
8 M=100% ,area=normal , -> cat5=1 , 0.235294118 Emerging trend 1 4
9 R=100% ,M=100% , -> cat5=1 , 0.282352941 Emerging trend 1 12
10 F=75% ,M=100% , -> cat1=1 ,cat13=1 , 0.2 Not perished 0.5 16
11 F=75% ,M=100% , -> cat3=1 ,cat13=1 , 0.188235294 Not perished 0.5 16
12 F=75% ,M=100% , -> cat11=1 ,cat13=1 , 0.176470588 Not perished 0.5 16
13 F=75% ,M=100% , -> cat5=1 ,cat13=1 , 0.2 Emerging trend 1 16
14 M=100% , -> cat1=1 ,cat3=1 ,cat11=1 ,cat13=1 , 0.211764706 Emerging trend 1 25
15 M=100% , -> cat1=1 ,cat3=1 ,cat5=1 ,cat13=1 , 0.211764706 Emerging trend 1 27
16 M=100% , -> cat1=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.223529412 Emerging trend 1 26
17 M=100% , -> cat3=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.247058824 Emerging trend 1 28
18 F=75% ,M=100% , -> cat1=1 ,cat3=1 ,cat11=1 , 0.176470588 Not perished 0.666666667 18
19 F=75% ,M=100% , -> cat1=1 ,cat3=1 ,cat5=1 , 0.2 Not perished 0.666666667 20
20 F=75% ,M=100% , -> cat1=1 ,cat5=1 ,cat11=1 , 0.188235294 Not perished 0.666666667 19
21 F=75% ,M=100% , -> cat3=1 ,cat5=1 ,cat11=1 , 0.188235294 Not perished 0.666666667 18
22 M=100% , -> cat1=1 ,cat3=1 ,cat5=1 ,cat11=1 , 0.247058824 Emerging trend 1 29
Table4.16: Generated Rules for period 2, Cluster 3
Rule-Index rule2 Support Change-Type Similarity Sim-Rule-Index
1 R=50% ,M=100% , -> cat5=1 , 0.171806 1 3
2 M=100% ,area=normal , -> cat1=1 , 0.185022 Added 0.25 10
3 M=100% ,area=normal , -> cat3=1 , 0.180617 Added 0.25 11
4 M=100% ,area=normal , -> cat5=1 , 0.189427 1 8
5 R=75% ,M=100% , -> cat3=1 , 0.171806 Added 0.25 11
6 R=75% ,M=100% , -> cat5=1 , 0.193833 1 5
7 F=25% ,M=100% , -> cat1=1 , 0.171806 Added 0.25 10
8 F=25% ,M=100% , -> cat3=1 , 0.171806 Added 0.25 11
9 F=25% ,M=100% , -> cat5=1 , 0.220264 1 4
10 R=100% ,M=100% , -> cat1=1 ,cat11=1 , 0.171806 Added 0.333333 18
11 R=100% ,M=100% , -> cat3=1 ,cat11=1 , 0.171806 Added 0.333333 18
12 R=100% ,M=100% , -> cat5=1 , 0.229075 1 9
13 F=50% ,M=100% , -> cat1=1 ,cat3=1 , 0.171806 Added 0.333333 18
14 F=50% ,M=100% , -> cat11=1 , 0.198238 Not Added 0.5 6
15 F=50% ,M=100% , -> cat5=1 , 0.22467 1 1
16 F=75% ,M=100% , -> cat5=1 ,cat13=1 , 0.180617 1 13
17 F=75% ,M=100% , -> cat1=1 , 0.189427 Not Added 0.5 10
18 F=75% ,M=100% , -> cat3=1 ,cat11=1 , 0.193833 Not Added 0.666667 18
19 F=75% ,M=100% , -> cat5=1 ,cat11=1 , 0.180617 Not Added 0.666667 20
20 F=75% ,M=100% , -> cat3=1 ,cat5=1 , 0.202643 Not Added 0.666667 19
21 M=100% ,area=poor , -> cat1=1 , 0.198238 Added 0.25 10
22 M=100% ,area=poor , -> cat3=1 ,cat11=1 , 0.185022 Not Added 0.5 6
23 M=100% ,area=poor , -> cat5=1 ,cat11=1 , 0.193833 Not Added 0.5 6
24 M=100% ,area=poor , -> cat3=1 ,cat5=1 , 0.202643 Not Added 0.5 7
25 M=100% , -> cat1=1 ,cat3=1 ,cat11=1 ,cat13=1 , 0.23348 1 14
26 M=100% , -> cat1=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.237885 1 16
27 M=100% , -> cat1=1 ,cat3=1 ,cat5=1 ,cat13=1 , 0.264317 1 15
28 M=100% , -> cat3=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.264317 1 17
29 M=100% , -> cat1=1 ,cat3=1 ,cat5=1 ,cat11=1 , 0.303965 1 22
Table4.17: Generated Rules for period 1, Cluster 4
Rule-Index rule1 Support Change-Type Similarity Sim-Rule-Index
1 F=100% ,M=100% ,area=rich , -> cat3=1 , 0.178862 Emerging trend 1 4
2 F=100% ,M=100% ,area=rich , -> cat5=1 , 0.195122 Not perished 0.666667 30
3 M=100% , -> cat2=1 ,cat11=1 , 0.170732 Not perished 0.666667 11
4 M=100% , -> cat2=1 ,cat3=1 ,cat5=1 , 0.170732 Not perished 0.75 17
5 F=100% ,M=100% , -> cat1=1 ,cat2=1 ,cat5=1 , 0.178862 Emerging trend 1 16
6 R=25% ,M=100% ,area=poor , -> cat3=1 ,cat5=1 , 0.170732 Not perished 0.666667 10
7 F=100% ,R=25% ,area=poor , -> cat3=1 ,cat5=1 , 0.170732 Not perished 0.666667 65
8 F=100% ,R=25% ,M=100% ,area=poor , -> cat3=1 , 0.170732 Not perished 0.5 4
9 R=25% ,M=100% ,area=poor , -> cat5=1 ,cat13=1 , 0.186992 Not perished 0.666667 57
10 F=100% ,R=25% ,area=poor , -> cat5=1 ,cat13=1 , 0.186992 Not perished 0.666667 57
11 F=100% ,R=25% ,M=100% ,area=poor , -> cat13=1 , 0.186992 Not perished 0.5 7
12 F=100% ,R=25% ,M=100% ,area=poor , -> cat5=1 , 0.195122 Not perished 0.5 30
13 M=100% ,area=poor , -> cat1=1 ,cat3=1 ,cat11=1 , 0.170732 Not perished 0.666667 8
14 area=poor , -> cat1=1 ,cat3=1 ,cat5=1 ,cat11=1 , 0.170732 Perished 0.25 8
15 area=poor , -> cat1=1 ,cat3=1 ,cat11=1 ,cat13=1 , 0.170732 Perished 0.25 8
16 M=100% ,area=poor , -> cat3=1 ,cat11=1 ,cat13=1 , 0.178862 Not perished 0.5 39
17 F=100% ,area=poor , -> cat3=1 ,cat11=1 ,cat13=1 , 0.170732 Not perished 0.5 34
18 area=poor , -> cat3=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.178862 Perished 0.25 10
19 M=100% ,area=poor , -> cat3=1 ,cat5=1 ,cat11=1 , 0.186992 Not perished 0.666667 10
20 F=100% ,area=poor , -> cat3=1 ,cat5=1 ,cat11=1 , 0.178862 Not perished 0.5 46
21 F=100% ,M=100% ,area=poor , -> cat3=1 ,cat11=1 , 0.178862 Not perished 0.666667 45
22 M=100% ,area=poor , -> cat1=1 ,cat11=1 ,cat13=1 , 0.170732 Not perished 0.5 36
23 area=poor , -> cat1=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.170732 Perished 0.25 9
24 M=100% ,area=poor , -> cat1=1 ,cat5=1 ,cat11=1 , 0.170732 Not perished 0.666667 9
25 M=100% ,area=poor , -> cat5=1 ,cat11=1 ,cat13=1 , 0.186992 Not perished 0.5 41
26 F=100% ,area=poor , -> cat5=1 ,cat11=1 ,cat13=1 , 0.178862 Not perished 0.5 33
27 F=100% ,M=100% ,area=poor , -> cat11=1 ,cat13=1 , 0.178862 Not perished 0.666667 32
28 F=100% ,M=100% ,area=poor , -> cat5=1 ,cat11=1 , 0.186992 Not perished 0.666667 47
29 M=100% ,area=poor , -> cat1=1 ,cat3=1 ,cat13=1 , 0.186992 Not perished 0.666667 8
30 F=100% ,area=poor , -> cat1=1 ,cat3=1 ,cat13=1 , 0.178862 Not perished 0.5 54
31 area=poor , -> cat1=1 ,cat3=1 ,cat5=1 ,cat13=1 , 0.186992 Perished 0.25 8
32 M=100% ,area=poor , -> cat1=1 ,cat3=1 ,cat5=1 , 0.186992 Not perished 0.666667 8
33 F=100% ,area=poor , -> cat1=1 ,cat3=1 ,cat5=1 , 0.178862 Not perished 0.5 63
34 F=100% ,M=100% ,area=poor , -> cat1=1 ,cat3=1 , 0.178862 Not perished 0.666667 8
35 M=100% ,area=poor , -> cat3=1 ,cat5=1 ,cat13=1 , 0.203252 Not perished 0.666667 10
36 F=100% ,area=poor , -> cat3=1 ,cat5=1 ,cat13=1 , 0.195122 Not perished 0.5 56
37 F=100% ,M=100% ,area=poor , -> cat3=1 ,cat13=1 , 0.195122 Not perished 0.666667 55
38 F=100% ,M=100% ,area=poor , -> cat3=1 ,cat5=1 , 0.203252 Not perished 0.666667 10
39 M=100% ,area=poor , -> cat1=1 ,cat5=1 ,cat13=1 , 0.195122 Not perished 0.666667 9
40 F=100% ,area=poor , -> cat1=1 ,cat5=1 ,cat13=1 , 0.186992 Not perished 0.5 53
41 F=100% ,M=100% ,area=poor , -> cat1=1 ,cat13=1 , 0.186992 Not perished 0.666667 52
42 F=100% ,M=100% ,area=poor , -> cat1=1 ,cat5=1 , 0.186992 Not perished 0.666667 9
43 F=100% ,M=100% ,area=poor , -> cat5=1 ,cat13=1 , 0.219512 Not perished 0.666667 57
44 M=100% ,area=normal , -> cat3=1 ,cat5=1 ,cat11=1 , 0.170732 Not perished 0.666667 25
45 M=100% ,area=normal , -> cat1=1 ,cat11=1 ,cat13=1 , 0.170732 Not perished 0.666667 24
46 area=normal , -> cat1=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.170732 Perished 0.25 24
47 M=100% ,area=normal , -> cat1=1 ,cat5=1 ,cat11=1 , 0.186992 Not perished 0.666667 24
48 F=100% ,area=normal , -> cat1=1 ,cat5=1 ,cat11=1 , 0.170732 Not perished 0.5 43
49 F=100% ,M=100% ,area=normal , -> cat1=1 ,cat11=1 , 0.170732 Not perished 0.666667 24
50 M=100% ,area=normal , -> cat5=1 ,cat11=1 ,cat13=1 , 0.178862 Not perished 0.666667 26
51 F=100% ,area=normal , -> cat5=1 ,cat11=1 ,cat13=1 , 0.170732 Not perished 0.5 33
52 F=100% ,M=100% ,area=normal , -> cat11=1 ,cat13=1 , 0.170732 Not perished 0.666667 32
53 F=100% ,M=100% ,area=normal , -> cat5=1 ,cat11=1 , 0.186992 Not perished 0.666667 26
54 M=100% ,area=normal , -> cat1=1 ,cat3=1 ,cat13=1 , 0.170732 Not perished 0.666667 27
55 area=normal , -> cat1=1 ,cat3=1 ,cat5=1 ,cat13=1 , 0.170732 Perished 0.375 28
56 M=100% ,area=normal , -> cat1=1 ,cat3=1 ,cat5=1 , 0.178862 Emerging trend 1 31
57 M=100% ,area=normal , -> cat3=1 ,cat5=1 ,cat13=1 , 0.178862 Emerging trend 1 28
58 F=100% ,area=normal , -> cat3=1 ,cat5=1 ,cat13=1 , 0.170732 Not perished 0.5 28
59 F=100% ,M=100% ,area=normal , -> cat3=1 ,cat13=1 , 0.178862 Not perished 0.666667 55
60 F=100% ,M=100% ,area=normal , -> cat3=1 ,cat5=1 , 0.170732 Not perished 0.666667 21
61 M=100% ,area=normal , -> cat1=1 ,cat5=1 ,cat13=1 , 0.186992 Not perished 0.666667 27
62 F=100% ,area=normal , -> cat1=1 ,cat5=1 ,cat13=1 , 0.178862 Not perished 0.5 53
63 F=100% ,M=100% ,area=normal , -> cat1=1 ,cat13=1 , 0.178862 Not perished 0.666667 27
64 F=100% ,M=100% ,area=normal , -> cat1=1 ,cat5=1 , 0.195122 Not perished 0.666667 64
65 F=100% ,M=100% ,area=normal , -> cat5=1 ,cat13=1 , 0.203252 Not perished 0.666667 57
66 F=100% ,R=25% ,M=100% ,area=good , -> cat1=1 , 0.186992 Not perished 0.5 3
67 F=100% ,R=25% ,M=100% ,area=good , -> cat13=1 , 0.186992 Not perished 0.75 19
68 F=100% ,M=100% ,area=good , -> cat1=1 ,cat11=1 , 0.178862 Not perished 0.666667 42
69 F=100% ,M=100% ,area=good , -> cat5=1 ,cat11=1 , 0.170732 Not perished 0.666667 47
70 F=100% ,M=100% ,area=good , -> cat3=1 , 0.186992 Not perished 0.666667 4
71 F=100% ,M=100% ,area=good , -> cat1=1 ,cat13=1 , 0.186992 Not perished 0.666667 52
72 F=100% ,M=100% ,area=good , -> cat1=1 ,cat5=1 , 0.186992 Not perished 0.666667 23
73 R=25% ,M=100% , -> cat1=1 ,cat3=1 ,cat11=1 , 0.439024 Emerging trend 1 48
74 F=100% ,R=25% , -> cat1=1 ,cat3=1 ,cat11=1 , 0.422764 Emerging trend 1 44
75 R=25% , -> cat1=1 ,cat3=1 ,cat5=1 ,cat11=1 , 0.406504 Emerging trend 1 49
76 R=25% , -> cat1=1 ,cat3=1 ,cat11=1 ,cat13=1 , 0.406504 Emerging trend 1 38
77 R=25% ,M=100% , -> cat3=1 ,cat11=1 ,cat13=1 , 0.439024 Emerging trend 1 39
78 F=100% ,R=25% , -> cat3=1 ,cat11=1 ,cat13=1 , 0.430894 Emerging trend 1 34
79 R=25% , -> cat3=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.414634 Emerging trend 1 40
80 R=25% ,M=100% , -> cat3=1 ,cat5=1 ,cat11=1 , 0.463415 Emerging trend 1 51
81 F=100% ,R=25% , -> cat3=1 ,cat5=1 ,cat11=1 , 0.447154 Emerging trend 1 46
82 F=100% ,R=25% ,M=100% , -> cat3=1 ,cat11=1 , 0.479675 Emerging trend 1 45
83 R=25% ,M=100% , -> cat1=1 ,cat11=1 ,cat13=1 , 0.447154 Emerging trend 1 36
84 F=100% ,R=25% , -> cat1=1 ,cat11=1 ,cat13=1 , 0.439024 Emerging trend 1 35
85 R=25% , -> cat1=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.422764 Emerging trend 1 37
86 R=25% ,M=100% , -> cat1=1 ,cat5=1 ,cat11=1 , 0.447154 Emerging trend 1 50
87 F=100% ,R=25% , -> cat1=1 ,cat5=1 ,cat11=1 , 0.430894 Emerging trend 1 43
88 F=100% ,R=25% ,M=100% , -> cat1=1 ,cat11=1 , 0.463415 Emerging trend 1 42
89 R=25% ,M=100% , -> cat5=1 ,cat11=1 ,cat13=1 , 0.463415 Emerging trend 1 41
90 F=100% ,R=25% , -> cat5=1 ,cat11=1 ,cat13=1 , 0.455285 Emerging trend 1 33
91 F=100% ,R=25% ,M=100% , -> cat11=1 ,cat13=1 , 0.479675 Emerging trend 1 32
92 F=100% ,R=25% ,M=100% , -> cat5=1 ,cat11=1 , 0.504065 Emerging trend 1 47
93 R=25% ,M=100% , -> cat1=1 ,cat3=1 ,cat13=1 , 0.455285 Emerging trend 1 58
94 F=100% ,R=25% , -> cat1=1 ,cat3=1 ,cat13=1 , 0.447154 Emerging trend 1 54
95 R=25% , -> cat1=1 ,cat3=1 ,cat5=1 ,cat13=1 , 0.422764 Emerging trend 1 59
96 R=25% ,M=100% , -> cat1=1 ,cat3=1 ,cat5=1 , 0.455285 Emerging trend 1 66
97 F=100% ,R=25% , -> cat1=1 ,cat3=1 ,cat5=1 , 0.439024 Emerging trend 1 63
98 F=100% ,R=25% ,M=100% , -> cat1=1 ,cat3=1 , 0.479675 Emerging trend 1 62
99 R=25% ,M=100% , -> cat3=1 ,cat5=1 ,cat13=1 , 0.471545 Emerging trend 1 61
100 F=100% ,R=25% , -> cat3=1 ,cat5=1 ,cat13=1 , 0.463415 Emerging trend 1 56
101 F=100% ,R=25% ,M=100% , -> cat3=1 ,cat13=1 , 0.495935 Emerging trend 1 55
102 F=100% ,R=25% ,M=100% , -> cat3=1 ,cat5=1 , 0.512195 Emerging trend 1 65
103 R=25% ,M=100% , -> cat1=1 ,cat5=1 ,cat13=1 , 0.479675 Emerging trend 1 60
104 F=100% ,R=25% , -> cat1=1 ,cat5=1 ,cat13=1 , 0.471545 Emerging trend 1 53
105 F=100% ,R=25% ,M=100% , -> cat1=1 ,cat13=1 , 0.520325 Emerging trend 1 52
106 F=100% ,R=25% ,M=100% , -> cat1=1 ,cat5=1 , 0.512195 Emerging trend 1 64
107 F=100% ,R=25% ,M=100% , -> cat5=1 ,cat13=1 , 0.544715 Emerging trend 1 57
108 M=100% , -> cat1=1 ,cat3=1 ,cat11=1 ,cat13=1 , 0.544715 Emerging trend 1 73
109 F=100% , -> cat1=1 ,cat3=1 ,cat11=1 ,cat13=1 , 0.528455 Emerging trend 1 69
110 M=100% , -> cat1=1 ,cat3=1 ,cat5=1 ,cat11=1 , 0.560976 Emerging trend 1 80
111 F=100% , -> cat1=1 ,cat3=1 ,cat5=1 ,cat11=1 , 0.536585 Emerging trend 1 77
112 F=100% ,M=100% , -> cat1=1 ,cat3=1 ,cat11=1 , 0.569106 Emerging trend 1 76
113 M=100% , -> cat3=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.560976 Emerging trend 1 75
114 F=100% , -> cat3=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.544715 Emerging trend 1 71
115 F=100% ,M=100% , -> cat3=1 ,cat11=1 ,cat13=1 , 0.569106 Emerging trend 1 70
116 F=100% ,M=100% , -> cat3=1 ,cat5=1 ,cat11=1 , 0.601626 Emerging trend 1 79
117 M=100% , -> cat1=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.569106 Emerging trend 1 74
118 F=100% , -> cat1=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.552846 Emerging trend 1 68
119 F=100% ,M=100% , -> cat1=1 ,cat11=1 ,cat13=1 , 0.577236 Emerging trend 1 67
120 F=100% ,M=100% , -> cat1=1 ,cat5=1 ,cat11=1 , 0.601626 Emerging trend 1 78
121 F=100% ,M=100% , -> cat5=1 ,cat11=1 ,cat13=1 , 0.601626 Emerging trend 1 72
122 M=100% , -> cat1=1 ,cat3=1 ,cat5=1 ,cat13=1 , 0.585366 Emerging trend 1 85
123 F=100% , -> cat1=1 ,cat3=1 ,cat5=1 ,cat13=1 , 0.569106 Emerging trend 1 82
124 F=100% ,M=100% , -> cat1=1 ,cat3=1 ,cat13=1 , 0.601626 Emerging trend 1 81
125 F=100% ,M=100% , -> cat1=1 ,cat3=1 ,cat5=1 , 0.609756 Emerging trend 1 86
126 F=100% ,M=100% , -> cat3=1 ,cat5=1 ,cat13=1 , 0.626016 Emerging trend 1 84
127 F=100% ,M=100% , -> cat1=1 ,cat5=1 ,cat13=1 , 0.642276 Emerging trend 1 83
Table4.18: Generated Rules for period 2, Cluster 4
Rule-Index rule2 Support Change-Type Similarity Sim-Rule-Index
1 R=50% ,M=100% , -> cat3=1 , 0.173387 Added 0.333333 1
2 M=100% ,area=rich , -> cat13=1 , 0.177419 Added 0.25 11
3 F=100% ,M=100% ,area=rich , -> cat1=1 , 0.173387 Not Added 0.5 66
4 F=100% ,M=100% ,area=rich , -> cat3=1 , 0.177419 1 1
5 M=100% ,area=rich , -> cat1=1 ,cat3=1 , 0.181452 Added 0.333333 1
6 M=100% ,area=rich , -> cat3=1 ,cat5=1 , 0.173387 Added 0.333333 1
7 M=100% ,area=poor , -> cat13=1 , 0.173387 Not Added 0.5 11
8 M=100% ,area=poor , -> cat1=1 ,cat3=1 , 0.177419 Not Added 0.666667 13
9 M=100% ,area=poor , -> cat1=1 ,cat5=1 , 0.173387 Not Added 0.666667 24
10 M=100% ,area=poor , -> cat3=1 ,cat5=1 , 0.197581 Not Added 0.666667 6
11 M=100% , -> cat1=1 ,cat2=1 ,cat11=1 , 0.181452 Not Added 0.666667 3
12 M=100% , -> cat2=1 ,cat3=1 ,cat11=1 , 0.177419 Not Added 0.666667 3
13 M=100% , -> cat2=1 ,cat5=1 ,cat11=1 , 0.177419 Not Added 0.666667 3
14 M=100% , -> cat2=1 ,cat13=1 , 0.173387 Not Added 0.5 3
15 F=100% ,M=100% , -> cat1=1 ,cat2=1 ,cat3=1 , 0.173387 Not Added 0.666667 5
16 F=100% ,M=100% , -> cat1=1 ,cat2=1 ,cat5=1 , 0.173387 1 5
17 M=100% , -> cat1=1 ,cat2=1 ,cat3=1 ,cat5=1 , 0.185484 Not Added 0.75 4
18 M=100% ,area=good , -> cat11=1 , 0.181452 Added 0.333333 68
19 F=100% ,M=100% ,area=good , -> cat13=1 , 0.189516 Not Added 0.75 67
20 M=100% ,area=good , -> cat5=1 ,cat13=1 , 0.173387 Added 0.333333 9
21 F=100% ,M=100% ,area=good , -> cat3=1 ,cat5=1 , 0.177419 Not Added 0.666667 38
22 M=100% ,area=good , -> cat1=1 ,cat3=1 , 0.173387 Added 0.333333 13
23 M=100% ,area=good , -> cat1=1 ,cat5=1 , 0.173387 Not Added 0.666667 72
24 M=100% ,area=normal , -> cat1=1 ,cat11=1 , 0.181452 Not Added 0.666667 45
25 M=100% ,area=normal , -> cat3=1 ,cat11=1 , 0.181452 Not Added 0.666667 44
26 M=100% ,area=normal , -> cat5=1 ,cat11=1 , 0.189516 Not Added 0.666667 44
27 M=100% ,area=normal , -> cat1=1 ,cat13=1 , 0.177419 Not Added 0.666667 45
28 M=100% ,area=normal , -> cat3=1 ,cat5=1 ,cat13=1 , 0.185484 1 57
29 F=100% ,M=100% ,area=normal , -> cat3=1 , 0.173387 Not Added 0.666667 1
30 F=100% ,M=100% ,area=normal , -> cat5=1 , 0.185484 Not Added 0.666667 2
31 M=100% ,area=normal , -> cat1=1 ,cat3=1 ,cat5=1 , 0.177419 1 56
32 F=100% ,R=25% ,M=100% , -> cat11=1 ,cat13=1 , 0.375 1 91
33 F=100% ,R=25% , -> cat5=1 ,cat11=1 ,cat13=1 , 0.358871 1 90
34 F=100% ,R=25% , -> cat3=1 ,cat11=1 ,cat13=1 , 0.350806 1 78
35 F=100% ,R=25% , -> cat1=1 ,cat11=1 ,cat13=1 , 0.326613 1 84
36 R=25% ,M=100% , -> cat1=1 ,cat11=1 ,cat13=1 , 0.346774 1 83
37 R=25% , -> cat1=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.330645 1 85
38 R=25% , -> cat1=1 ,cat3=1 ,cat11=1 ,cat13=1 , 0.326613 1 76
39 R=25% ,M=100% , -> cat3=1 ,cat11=1 ,cat13=1 , 0.375 1 77
40 R=25% , -> cat3=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.362903 1 79
41 R=25% ,M=100% , -> cat5=1 ,cat11=1 ,cat13=1 , 0.383065 1 89
42 F=100% ,R=25% ,M=100% , -> cat1=1 ,cat11=1 , 0.370968 1 88
43 F=100% ,R=25% , -> cat1=1 ,cat5=1 ,cat11=1 , 0.350806 1 87
44 F=100% ,R=25% , -> cat1=1 ,cat3=1 ,cat11=1 , 0.350806 1 74
45 F=100% ,R=25% ,M=100% , -> cat3=1 ,cat11=1 , 0.395161 1 82
46 F=100% ,R=25% , -> cat3=1 ,cat5=1 ,cat11=1 , 0.379032 1 81
47 F=100% ,R=25% ,M=100% , -> cat5=1 ,cat11=1 , 0.403226 1 92
48 R=25% ,M=100% , -> cat1=1 ,cat3=1 ,cat11=1 , 0.383065 1 73
49 R=25% , -> cat1=1 ,cat3=1 ,cat5=1 ,cat11=1 , 0.366935 1 75
50 R=25% ,M=100% , -> cat1=1 ,cat5=1 ,cat11=1 , 0.387097 1 86
51 R=25% ,M=100% , -> cat3=1 ,cat5=1 ,cat11=1 , 0.415323 1 80
52 F=100% ,R=25% ,M=100% , -> cat1=1 ,cat13=1 , 0.399194 1 105
53 F=100% ,R=25% , -> cat1=1 ,cat5=1 ,cat13=1 , 0.375 1 104
54 F=100% ,R=25% , -> cat1=1 ,cat3=1 ,cat13=1 , 0.370968 1 94
55 F=100% ,R=25% ,M=100% , -> cat3=1 ,cat13=1 , 0.439516 1 101
56 F=100% ,R=25% , -> cat3=1 ,cat5=1 ,cat13=1 , 0.419355 1 100
57 F=100% ,R=25% ,M=100% , -> cat5=1 ,cat13=1 , 0.455645 1 107
58 R=25% ,M=100% , -> cat1=1 ,cat3=1 ,cat13=1 , 0.395161 1 93
59 R=25% , -> cat1=1 ,cat3=1 ,cat5=1 ,cat13=1 , 0.379032 1 95
60 R=25% ,M=100% , -> cat1=1 ,cat5=1 ,cat13=1 , 0.399194 1 103
61 R=25% ,M=100% , -> cat3=1 ,cat5=1 ,cat13=1 , 0.451613 1 99
62 F=100% ,R=25% ,M=100% , -> cat1=1 ,cat3=1 , 0.423387 1 98
63 F=100% ,R=25% , -> cat1=1 ,cat3=1 ,cat5=1 , 0.403226 1 97
64 F=100% ,R=25% ,M=100% , -> cat1=1 ,cat5=1 , 0.431452 1 106
65 F=100% ,R=25% ,M=100% , -> cat3=1 ,cat5=1 , 0.471774 1 102
66 R=25% ,M=100% , -> cat1=1 ,cat3=1 ,cat5=1 , 0.443548 1 96
67 F=100% ,M=100% , -> cat1=1 ,cat11=1 ,cat13=1 , 0.447581 1 119
68 F=100% , -> cat1=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.419355 1 118
69 F=100% , -> cat1=1 ,cat3=1 ,cat11=1 ,cat13=1 , 0.427419 1 109
70 F=100% ,M=100% , -> cat3=1 ,cat11=1 ,cat13=1 , 0.479839 1 115
71 F=100% , -> cat3=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.455645 1 114
72 F=100% ,M=100% , -> cat5=1 ,cat11=1 ,cat13=1 , 0.475806 1 121
73 M=100% , -> cat1=1 ,cat3=1 ,cat11=1 ,cat13=1 , 0.491935 1 108
74 M=100% , -> cat1=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.487903 1 117
75 M=100% , -> cat3=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.524194 1 113
76 F=100% ,M=100% , -> cat1=1 ,cat3=1 ,cat11=1 , 0.491935 1 112
77 F=100% , -> cat1=1 ,cat3=1 ,cat5=1 ,cat11=1 , 0.455645 1 111
78 F=100% ,M=100% , -> cat1=1 ,cat5=1 ,cat11=1 , 0.471774 1 120
79 F=100% ,M=100% , -> cat3=1 ,cat5=1 ,cat11=1 , 0.508065 1 116
80 M=100% , -> cat1=1 ,cat3=1 ,cat5=1 ,cat11=1 , 0.560484 1 110
81 F=100% ,M=100% , -> cat1=1 ,cat3=1 ,cat13=1 , 0.512097 1 124
82 F=100% , -> cat1=1 ,cat3=1 ,cat5=1 ,cat13=1 , 0.479839 1 123
83 F=100% ,M=100% , -> cat1=1 ,cat5=1 ,cat13=1 , 0.508065 1 127
84 F=100% ,M=100% , -> cat3=1 ,cat5=1 ,cat13=1 , 0.560484 1 126
85 M=100% , -> cat1=1 ,cat3=1 ,cat5=1 ,cat13=1 , 0.552419 1 122
86 F=100% ,M=100% , -> cat1=1 ,cat3=1 ,cat5=1 , 0.544355 1 125
4.4.3 Rules with discrete variables in the RHS: According to (Chen et al., 2005), the RHS of a rule only states whether the customer bought the products or not. In this section, we build rules whose RHS shows how many times a product category was bought, and we compare these rules with Chen's similarity formula.
We then modify Chen's similarity formula with the Manhattan distance, which measures the difference between the values of each common attribute in two rules.
For each cluster we therefore have four rule sets: one rule set per period, compared once with Chen's similarity formula and a second time with the modified formula based on the Manhattan distance. There are two steps: first, discretizing the number of purchases for each product category; second, generating the association rules and comparing them.
1. Discretization of the number of purchases for each product category:
Before starting, we discretize the frequency of purchases of each product category for each customer over time. For example, cat1=20 means that the selected customer purchased category 1 twenty times in the selected period. The number of purchases for each product category was discretized into quantiles; the results and their histograms are as follows.
Table4.21: Cat3 quantile (Figure4.9: Cat3 histogram)
Category3 Quantile: Variable interval
1st Quantile: 0 to 0
2nd Quantile: 0 to 1
3rd Quantile: 1 to 9
4th Quantile: 9 to 1050
Table4.23: Cat11 quantile (Figure4.11: Cat11 histogram)
Category11 Quantile: Variable interval
1st Quantile: 0 to 0
2nd Quantile: 0 to 1
3rd Quantile: 1 to 5
4th Quantile: 5 to 238
Table4.24: Cat13 quantile (Figure4.12: Cat13 histogram)
Category13 Quantile: Variable interval
1st Quantile: 0 to 0
2nd Quantile: 0 to 0
3rd Quantile: 1 to 2
4th Quantile: 2 to 122
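A minimal sketch of this equal-frequency discretization with pandas is shown below, assuming the per-customer purchase counts sit in one (hypothetical) column per category; `duplicates="drop"` merges the degenerate bin edges that appear when many customers have zero purchases, which is why some quantiles above collapse to "0 to 0":

```python
import pandas as pd

# Hypothetical per-customer purchase counts for one period.
counts = pd.DataFrame({
    "cat3":  [0, 0, 0, 1, 1, 4, 9, 30, 1050],
    "cat13": [0, 0, 0, 0, 1, 1, 2, 10, 122],
})

# Equal-frequency (quartile) binning of the purchase counts.
cat3_bins = pd.qcut(counts["cat3"], q=4, duplicates="drop")
print(cat3_bins.cat.categories)   # the quantile intervals, cf. Table 4.21
print(cat3_bins.value_counts())   # roughly equal bin populations
```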
4.4.4 Change mining with Manhattan distance
The generated rules and detected changes are as follows.
Cluster 1: Change mining by (Chen et al., 2005) measures & by Manhattan distance
Table4.25: Generated Rules for period 1, Cluster 1, change mining by (Chen et al., 2005) measures & Manhattan distance
Rule-Index rule1 Support Change-Type Similarity Change-Type-M Similarity-M Sim-Rule-Index1 Sim-Rule-Index2
1 F=0.25 , -> cat11=0.25 , 0.12984 Emerging trend 1.000 Emerging trend 1.000 2 2
2 F=0.25 , -> cat3=0.25 , 0.10251 Emerging trend 1.000 Emerging trend 1.000 1 1
3 F=0.25 , -> cat1=0.25 , 0.14123 Not perished 0.500 Not perished 0.500 4 4
4 R=1 , -> cat1=0.25 , 0.10251 Not perished 0.500 Not perished 0.500 4 4
5 R=0.75 , -> cat1=0.25 , 0.11162 Unexpected perished 0.000 Perished 0.375 1 4
6 M=0.25 , -> cat1=0.25 , 0.12756 Not perished 0.500 Not perished 0.500 5 5
7 F=0.25 , -> cat5=0.25 , 0.13212 Emerging trend 1.000 Emerging trend 1.000 3 3
8 M=0.75 , -> cat5=0.25 , 0.12301 Unexpected perished 0.000 Unexpected perished 0.000 1 1
9 R=1 , -> cat5=0.25 , 0.10023 Unexpected perished 0.000 Unexpected perished 0.000 1 1
10 R=0.75 , -> cat5=0.25 , 0.10251 Unexpected perished 0.000 Unexpected perished 0.000 1 1
Table4.26: Generated Rules for period 2, Cluster 1, change mining by (Chen et al., 2005) measures & Manhattan distance
Rule-Index rule2 Support Change-Type Similarity Change-Type-M Similarity-M Sim-Rule-Index1 Sim-Rule-Index2
1 F=0.25 , -> cat3=0.25 , 0.12734 1.000 1.000 2 2
2 F=0.25 , -> cat11=0.25 , 0.13483 1.000 1.000 1 1
3 F=0.25 , -> cat5=0.25 , 0.12509 1.000 1.000 7 7
4 F=0.25 ,R=1 , -> cat1=0.25 , 0.10037 Not Added 0.500 Not Added 0.500 3 3
5 F=0.25 ,M=0.25 , -> cat1=0.25 , 0.10337 Not Added 0.500 Not Added 0.500 3 3
6 F=0.25 ,area=1 , -> cat1=0.25 , 0.10337 Not Added 0.500 Not Added 0.500 3 3
Cluster 2: Change mining by (Chen et al., 2005) measures & by Manhattan distance
Table4.27: Generated Rules for period 1, Cluster 2, change mining by (Chen et al., 2005) measures & Manhattan distance
Rule-Index rule1 Support Change-Type Similarity Change-Type-M Similarity-M Sim-Rule-Index1 Sim-Rule-Index2
1 F=1 , -> cat1=0.25 , 0.10714 Unexpected purchasing 0.000 Not perished 0.500 1 11
2 F=1 , -> cat11=0.25 , 0.10714 Unexpected purchasing 0.000 Not perished 0.500 1 1
3 F=1 , -> cat11=0.75 , 0.12857 Emerging trend 1.000 Emerging trend 1.000 1 1
4 F=1 ,R=0.25 , -> cat5=0.75 , 0.10714 Not perished 0.500 Not perished 0.500 6 6
5 F=1 , -> cat3=0.75 ,cat11=1 , 0.11429 Not perished 0.500 Not perished 0.583 8 27
6 F=1 ,R=0.25 , -> cat3=0.75 , 0.10000 Not perished 0.500 Not perished 0.500 8 8
7 M=0.75 , -> cat3=0.75 , 0.10000 Emerging trend 1.000 Emerging trend 1.000 9 9
8 F=1 ,M=0.75 , -> cat1=0.75 , 0.10000 Not perished 0.500 Not perished 0.656 11 22
9 F=1 , -> cat1=0.75 ,cat5=1 , 0.10000 Not perished 0.500 Not perished 0.500 11 11
10 F=1 ,R=0.25 , -> cat1=0.75 , 0.10000 Not perished 0.500 Not perished 0.656 11 24
11 area=0.5 , -> cat1=0.75 , 0.10000 Unexpected perished 0.000 Not perished 0.563 1 7
12 F=1 ,R=0.5 , -> cat5=1 , 0.12143 Not perished 0.500 Not perished 0.500 17 17
13 F=1 ,area=0.75 , -> cat13=0.75 , 0.10000 Not perished 0.500 Not perished 0.500 19 19
14 F=1 ,M=0.75 , -> cat13=0.75 , 0.10714 Not perished 0.500 Not perished 0.500 19 19
15 F=1 ,R=0.25 , -> cat13=0.75 , 0.11429 Not perished 0.500 Not perished 0.750 19 4
16 F=1 ,area=0.5 , -> cat5=1 , 0.11429 Not perished 0.500 Not perished 0.500 17 17
17 F=1 , -> cat1=1 ,cat13=1 , 0.13571 Perished 0.333 Perished 0.375 27 11
18 F=1 , -> cat5=1 ,cat13=1 , 0.10000 Not perished 0.500 Not perished 0.500 16 16
19 F=1 ,R=0.25 , -> cat13=1 , 0.14286 Emerging trend 1.000 Emerging trend 1.000 4 4
20 M=0.75 , -> cat13=1 , 0.10000 Emerging trend 1.000 Emerging trend 1.000 5 5
21 F=1 ,R=0.25 ,area=0.25 , -> cat1=1 , 0.10000 Not perished 0.667 Not perished 0.833 35 35
22 F=1 ,area=0.25 , -> cat5=1 , 0.13571 Not perished 0.500 Not perished 0.500 17 17
23 F=1 ,M=0.5 , -> cat2=1 , 0.12143 Not perished 0.500 Not perished 0.500 2 2
24 F=1 ,M=0.5 , -> cat1=1 , 0.10714 Emerging trend 1.000 Emerging trend 1.000 22 22
25 F=1 ,M=0.5 , -> cat5=1 , 0.15000 Not perished 0.500 Not perished 0.875 17 17
26 R=0.25 , -> cat2=1 ,cat3=1 , 0.10000 Not perished 0.500 Not perished 0.500 3 3
27 F=1 ,area=0.75 , -> cat3=1 , 0.10714 Emerging trend 1.000 Emerging trend 1.000 28 28
28 F=1 ,R=0.25 , -> cat3=1 ,cat11=1 , 0.10000 Not perished 0.500 Not perished 0.500 26 26
29 F=1 ,R=0.25 , -> cat1=1 ,cat3=1 , 0.10000 Emerging trend 1.000 Emerging trend 1.000 30 30
Table4.28: Generated Rules for period 2, Cluster 2, change mining by (Chen et al., 2005) measures & Manhattan distance
Rule-Index rule2 Support Change-Type Similarity Change-Type-M Similarity-M Sim-Rule-Index1 Sim-Rule-Index2
1 F=1 , -> cat11=0.75 , 0.10283 1.000 1.000 3 3
2 F=1 , -> cat2=1 , 0.12339 Not Added 0.500 Not Added 0.500 23 23
3 R=0.25 , -> cat2=1 , 0.11054 Not Added 0.500 Not Added 0.500 26 26
4 F=1 ,R=0.25 , -> cat13=1 , 0.10026 1.000 1.000 19 19
5 M=0.75 , -> cat13=1 , 0.10283 1.000 1.000 20 20
6 F=1 , -> cat5=0.75 , 0.11568 Not Added 0.500 Not Added 0.500 4 4
7 area=0.25 , -> cat1=1 , 0.10540 Added 0.333 Not Added 0.563 21 11
8 F=1 , -> cat3=0.75 , 0.11054 Not Added 0.500 Not Added 0.500 5 5
9 M=0.75 , -> cat3=0.75 , 0.11825 1.000 1.000 7 7
10 F=0.75 , -> cat3=0.75 , 0.10283 Unexpected added 0.000 Added 0.375 1 5
11 F=1 , -> cat1=0.75 , 0.11054 Not Added 0.500 Not Added 0.500 8 1
12 R=0.25 , -> cat1=0.75 , 0.11054 Not Added 0.500 Not Added 0.500 10 10
13 M=0.75 , -> cat1=0.75 , 0.10797 Not Added 0.500 Not Added 0.500 8 8
14 F=0.75 , -> cat1=0.75 , 0.10283 Unexpected added 0.000 Added 0.375 1 1
15 area=1 , -> cat1=1 , 0.13882 Unexpected added 0.000 Added 0.375 1 11
16 F=1 , -> cat5=1 ,cat11=1 , 0.11054 Not Added 0.500 Not Added 0.500 5 5
17 F=1 ,M=0.75 , -> cat5=1 , 0.14139 Not Added 0.667 Not Added 0.875 44 25
18 F=1 ,R=0.25 , -> cat1=1 ,cat5=1 , 0.10026 1.000 1.000 43 43
19 F=1 , -> cat13=0.75 , 0.14653 Not Added 0.500 Not Added 0.500 13 13
20 R=0.25 , -> cat13=0.75 , 0.11311 Not Added 0.500 Not Added 0.500 15 15
21 M=0.75 , -> cat13=0.75 , 0.11568 Not Added 0.500 Not Added 0.750 14 20
22 F=1 ,M=0.5 , -> cat1=1 , 0.11311 1.000 1.000 24 24
23 F=0.75 , -> cat1=1 , 0.10026 Unexpected added 0.000 Added 0.375 1 17
24 F=1 ,R=0.5 , -> cat1=1 , 0.12339 Not Added 0.500 Not Added 0.656 24 10
25 R=0.5 , -> cat11=1 , 0.10540 Unexpected added 0.000 Added 0.250 1 40
26 F=1 ,M=0.75 , -> cat3=1 ,cat =1 , 0.11054 Not Added 0.500 Not Added 0.500 28 28
27 F=1 , -> cat1=1 ,cat3=1 ,cat =1 , 0.10540 Added 0.333 Not Added 0.583 5 5
28 F=1 ,area=0.75 , -> cat3=1 , 0.10540 1.000 1.000 27 27
29 F=1 ,R=0.25 ,M=0.75 , -> cat3=1 , 0.11568 Not Added 0.667 Not Added 0.667 30 30
30 F=1 ,R=0.25 , -> cat1=1 ,cat3=1 , 0.11311 1.000 1.000 29 29
31 F=1 ,area=0.75 , -> cat =1 , 0.10797 1.000 1.000 36 36
32 F=1 ,M=0.75 , -> cat1=1 ,cat =1 , 0.11568 Not Added 0.500 Not Added 0.500 39 39
33 F=1 ,R=0.25 ,M=0.75 , -> cat11=1 , 0.10026 1.000 1.000 40 40
34 F=1 ,R=0.25 , -> cat1=1 ,cat =1 , 0.10283 1.000 1.000 39 39
35 F=1 ,R=0.25 ,area=0.75 , -> cat1=1 , 0.10540 1.000 1.000 37 37
36 F=1 ,R=0.25 ,M=0.75 , -> cat1=1 , 0.12339 1.000 1.000 42 42
Cluster 3: Change mining by (Chen et al., 2005) measures & by Manhattan distance
Table4.29: Generated Rules for period 1, Cluster 3, change mining by (Chen et al., 2005) measures & Manhattan distance
Rule-Index rule1 Support Change-Type Similarity Change-Type-M Similarity-M Sim-Rule-Index1 Sim-Rule-Index2
1 M=1 , -> cat11=0.25 , 0.10588 Emerging trend 1.000 Emerging trend 1.000 14 14
2 M=1 , -> cat1=0.75 , 0.10588 Emerging trend 1.000 Emerging trend 1.000 10 10
3 M=1 , -> cat3=0.25 , 0.11765 Emerging trend 1.000 Emerging trend 1.000 1 1
4 M=1 , -> cat3=0.5 , 0.12941 Emerging trend 1.000 Emerging trend 1.000 4 4
5 M=1 , -> cat11=0.5 , 0.14118 Emerging trend 1.000 Emerging trend 1.000 5 5
6 M=1 , -> cat13=0.5 , 0.15294 Emerging trend 1.000 Emerging trend 1.000 8 8
7 F=0.75 ,M=1 , -> cat3=0.75 , 0.12941 Emerging trend 1.000 Emerging trend 1.000 19 19
8 F=0.5 ,M=1 , -> cat5=0.5 , 0.12941 Not perished 0.500 Not perished 0.656 7 11
9 F=0.75 ,M=1 ,area=0.5 , -> cat5=0.75 , 0.11765 Not perished 0.667 Not perished 0.667 11 11
10 F=0.75 ,M=1 , -> cat =0.75 , 0.11765 Not perished 0.500 Not perished 0.500 6 6
11 R=1 ,M=1 , -> cat5=0.5 , 0.10588 Not perished 0.500 Not perished 0.750 7 21
12 R=0.5 ,M=1 , -> cat5=0.75 , 0.10588 Not perished 0.500 Not perished 0.500 11 11
13 M=1 , -> cat13=0.25 , 0.22353 Emerging trend 1.000 Emerging trend 1.000 9 9
14 F=0.25 ,R=1 ,M=1 , -> cat5=0.25 , 0.12941 Not perished 0.667 Not perished 0.667 15 15
15 F=0.75 ,M=1 , -> cat1=0.25 , 0.15294 Not perished 0.500 Not perished 0.750 13 13
16 F=0.75 ,R=0.75 ,M=1 , -> cat5=0.75 , 0.10588 Not perished 0.667 Not perished 0.667 11 11
17 M=1 ,area=1 , -> cat5=0.75 , 0.10588 Emerging trend 1.000 Emerging trend 1.000 12 12
Table4.30: Generated Rules for period 2, Cluster 3, change mining by (Chen et al., 2005) measures & Manhattan distance
Rule-Index rule2 Support Change-Type Similarity Change-Type-M Similarity-M Sim-Rule-Index1 Sim-Rule-Index2
1 M=1 , -> cat3=0.25 , 0.12775 1.000 1.000 3 3
2 M=1 , -> cat13=0.75 , 0.12775 Unexpected purchasing 0.000 Not Added 0.750 1 6
3 M=1 , -> cat1=0.5 , 0.14097 Unexpected purchasing 0.000 Not Added 0.750 1 2
4 M=1 , -> cat3=0.5 , 0.14537 1.000 1.000 4 4
5 M=1 , -> cat11=0.5 , 0.14978 1.000 1.000 5 5
6 M=1 , -> cat11=0.75 , 0.14097 Not Added 0.500 Not Added 0.750 10 5
7 M=1 , -> cat5=0.5 , 0.16300 Not Added 0.500 Not Added 0.500 8 8
8 M=1 , -> cat13=0.5 , 0.17181 1.000 1.000 6 6
9 M=1 , -> cat13=0.25 , 0.18062 1.000 1.000 13 13
10 M=1 , -> cat1=0.75 , 0.18943 1.000 1.000 2 2
11 F=0.75 ,M=1 , -> cat5=0.75 , 0.17621 Not Added 0.667 Not Added 0.667 9 9
12 M=1 ,area=1 , -> cat5=0.75 , 0.11454 1.000 1.000 17 17
13 F=0.25 ,M=1 , -> cat1=0.25 , 0.11454 Not Added 0.500 Not Added 0.750 15 15
14 M=1 , -> cat11=0.25 , 0.22467 1.000 1.000 1 1
15 F=0.25 ,M=1 ,area=1 , -> cat5=0.25 , 0.10132 Not Added 0.667 Not Added 0.667 14 14
16 R=1 ,M=1 , -> cat3=0.75 , 0.11013 Not Added 0.500 Not Added 0.500 7 7
17 M=1 , -> cat3=0.75 ,cat5=0.25 , 0.11013 Added 0.250 Added 0.375 7 4
18 F=0.5 ,M=1 , -> cat3=0.75 , 0.13656 Not Added 0.500 Not Added 0.875 7 7
19 F=0.75 ,M=1 , -> cat3=0.75 , 0.10132 1.000 1.000 7 7
20 M=1 ,area=1 , -> cat3=0.75 , 0.12335 Not Added 0.500 Not Added 0.500 7 7
21 R=1 ,M=1 , -> cat5=0.25 , 0.13656 Not Added 0.667 Not Added 0.750 14 11
Cluster 4: Change mining by (Chen et al., 2005) measures & by Manhattan distance
Table4.31: Generated Rules for period 1, Cluster 4, change mining by (Chen et al., 2005) measures & Manhattan distance
Rule-Index rule1 Support Change-Type Similarity Change-Type-M Similarity-M Sim-Rule-Index1 Sim-Rule-Index2
1 M=1 , -> cat =0.5 , 0.10569 Emerging trend 1.000 Emerging trend 1.000 1 1
2 F=1 ,M=1 , -> cat2=1 , 0.10569 Unexpected purchasing 0.000 Unexpected purchasing 0.000 1 1
3 F=1 ,M=1 , -> cat3=0.75 , 0.13008 Not perished 0.500 Not perished 0.500 4 4
4 F=1 ,R=0.25 ,M=1 , -> cat1=0.75 , 0.12195 Not perished 0.667 Not perished 0.667 12 12
5 F=1 ,M=1 , -> cat5=0.75 , 0.14634 Perished 0.250 Perished 0.375 6 4
6 F=1 ,M=1 , -> cat1=0.25 ,cat5=1 , 0.12195 Not perished 0.500 Not perished 0.750 4 11
7 R=0.25 ,M=1 , -> cat1=0.25 , 0.10569 Not perished 0.500 Not perished 0.500 2 2
8 F=1 ,R=0.25 ,M=1 ,area=0.25 , -> cat5=1 , 0.10569 Emerging trend 1.000 Emerging trend 1.000 18 18
9 F=1 ,M=1 , -> cat11=1 ,cat13=0.75 , 0.10569 Not perished 0.500 Not perished 0.583 8 42
10 F=1 ,M=1 , -> cat5=1 ,cat13=0.75 , 0.14634 Emerging trend 1.000 Emerging trend 1.000 8 8
11 F=1 ,R=0.25 ,M=1 , -> cat13=0.75 , 0.11382 Not perished 0.667 Not perished 0.667 9 9
12 M=1 ,area=1 , -> cat3=1 ,cat5=1 ,cat13=1 , 0.10569 Not perished 0.667 Not perished 0.667 22
13 F=1 ,area=1 , -> cat3=1 ,cat5=1 ,cat13=1 , 0.10569 Not perished 0.500 Not perished 0.500 79 79
14 R=0.25 ,area=1 , -> cat3=1 ,cat5=1 ,cat13=1 , 0.10569 Not perished 0.500 Not perished 0.500 78 78
15 R=0.25 ,M=1 ,area=1 , -> cat3=1 ,cat13=1 , 0.10569 Not perished 0.667 Not perished 0.667 81 81
16 F=1 ,R=0.25 ,area=1 , -> cat3=1 ,cat13=1 , 0.10569 Not perished 0.667 Not perished 0.667 81 81
17 F=1 ,M=1 ,area=1 , -> cat3=1 ,cat13=1 , 0.10569 Not perished 0.667 Not perished 0.667 81 81
18 R=0.25 ,M=1 ,area=1 , -> cat5=1 ,cat13=1 , 0.12195 Not perished 0.667 Not perished 0.667 82 82
19 F=1 ,R=0.25 ,area=1 , -> cat5=1 ,cat13=1 , 0.12195 Not perished 0.667 Not perished 0.667 82 82
20 F=1 ,M=1 ,area=1 , -> cat5=1 ,cat13=1 , 0.12195 Not perished 0.667 Not perished 0.667 82 82
21 F=1 ,R=0.25 ,M=1 ,area=1 , -> cat13=1 , 0.12195 Not perished 0.750 Not perished 0.875 26 26
22 R=0.25 ,M=1 ,area=1 , -> cat5=1 ,cat11=1 , 0.10569 Not perished 0.667 Not perished 0.667 87 87
23 F=1 ,R=0.25 ,area=1 , -> cat5=1 ,cat11=1 , 0.10569 Not perished 0.667 Not perished 0.667 87 87
24 F=1 ,M=1 ,area=1 , -> cat5=1 ,cat11=1 , 0.10569 Not perished 0.667 Not perished 0.667 87 87
25 F=1 ,R=0.25 ,M=1 ,area=1 , -> cat11=1 , 0.12195 Not perished 0.500 Not perished 0.563 16 16
26 R=0.25 ,M=1 ,area=1 , -> cat3=1 ,cat5=1 , 0.13008 Not perished 0.667 Not perished 0.667 22
27 F=1 ,R=0.25 ,area=1 , -> cat3=1 ,cat5=1 , 0.13008 Not perished 0.667 Not perished 0.667 88 88
28 F=1 ,M=1 ,area=1 , -> cat3=1 ,cat5=1 , 0.13821 Not perished 0.667 Not perished 0.667 22
29 F=1 ,R=0.25 ,M=1 ,area=1 , -> cat3=1 , 0.13821 Not perished 0.750 Not perished 0.875 23 28
30 F=1 ,R=0.25 ,M=1 ,area=1 , -> cat5=1 , 0.17886 Emerging trend 1.000 Emerging trend 1.000 24 24
31 F=1 ,M=1 ,area=0.75 , -> cat13=1 , 0.11382 Emerging trend 1.000 Emerging trend 1.000 31 31
32 F=1 ,M=1 ,area=0.75 , -> cat11=1 , 0.13008 Not perished 0.667 Not perished 0.833 16 16
33 F=1 ,M=1 ,area=0.75 , -> cat3=1 , 0.10569 Emerging trend 1.000 Emerging trend 1.000 32 32
34 R=0.25 ,M=1 ,area=0.75 , -> cat5=1 , 0.10569 Not perished 0.750 Not perished 0.750 33
35 F=1 ,M=1 ,area=0.75 , -> cat5=1 , 0.16260 Not perished 0.750 Not perished 0.750 33
36 F=1 ,R=0.25 ,M=1 ,area=0.5 , -> cat1=1 , 0.12195 Not perished 0.500 Not perished 0.688 13 13
37 F=1 ,R=0.25 ,M=1 ,area=0.5 , -> cat13=1 , 0.15447 Emerging trend 1.000 Emerging trend 1.000 26 26
38 F=1 ,M=1 ,area=0.5 , -> cat3=1 ,cat11=1 , 0.10569 Not perished 0.667 Not perished 0.667 86 86
39 F=1 ,R=0.25 ,M=1 ,area=0.5 , -> cat11=1 , 0.12195 Not perished 0.500 Not perished 0.688 16 16
40 F=1 ,R=0.25 ,M=1 ,area=0.5 , -> cat3=1 , 0.13008 Emerging trend 1.000 Emerging trend 1.000 28 28
41 F=1 ,R=0.25 ,M=1 ,area=0.5 , -> cat5=1 , 0.13821 Emerging trend 1.000 Emerging trend 1.000 29 29
42 M=1 , -> cat1=1 ,cat3=1 ,cat11=1 ,cat13=1 , 0.13821 Emerging trend 1.000 Emerging trend 1.000 34 34
43 F=1 , -> cat1=1 ,cat3=1 ,cat11=1 , 0.13821 Emerging trend 1.000 Emerging trend 1.000 35 35
44 R=0.25 , -> cat1=1 ,cat3=1 ,cat11=1 ,cat13=1 , 0.13008 Emerging trend 1.000 Emerging trend 1.000 36 36
45 M=1 , -> cat1=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.13821 Emerging trend 1.000 Emerging trend 1.000 37 37
46 F=1 , -> cat1=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.13821 Emerging trend 1.000 Emerging trend 1.000 38 38
47 R=0.25 , -> cat1=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.13821 Emerging trend 1.000 Emerging trend 1.000 39 39
48 R=0.25 ,M=1 , -> cat1=1 ,cat11=1 ,cat13=1 , 0.16260 Emerging trend 1.000 Emerging trend 1.000 40 40
49 F=1 ,R=0.25 , -> cat1=1 ,cat11=1 ,cat13=1 , 0.16260 Emerging trend 1.000 Emerging trend 1.000 41 41
50 F=1 ,M=1 , -> cat1=1 ,cat11=1 ,cat13=1 , 0.17073 Emerging trend 1.000 Emerging trend 1.000 42 42
51 M=1 , -> cat1=1 ,cat3=1 ,cat5=1 ,cat13=1 , 0.13821 Emerging trend 1.000 Emerging trend 1.000 43 43
52 F=1 , -> cat1=1 ,cat3=1 ,cat5=1 ,cat13=1 , 0.13821 Emerging trend 1.000 Emerging trend 1.000 44
53 R=0.25 , -> cat1=1 ,cat3=1 ,cat5=1 ,cat13=1 , 0.13008 Emerging trend 1.000 Emerging trend 1.000 45 45
54 R=0.25 ,M=1 , -> cat1=1 ,cat3=1 ,cat13=1 , 0.15447 Emerging trend 1.000 Emerging trend 1.000 46 46
55 F=1 ,R=0.25 , -> cat1=1 ,cat3=1 ,cat13=1 , 0.15447 Emerging trend 1.000 Emerging trend 1.000 47 47
56 F=1 ,M=1 , -> cat1=1 ,cat3=1 ,cat13=1 , 0.17073 Emerging trend 1.000 Emerging trend 1.000 48 48
57 R=0.25 ,M=1 , -> cat1=1 ,cat5=1 ,cat13=1 , 0.16260 Emerging trend 1.000 Emerging trend 1.000 49 49
58 F=1 ,R=0.25 , -> cat1=1 ,cat5=1 ,cat13=1 , 0.16260 Emerging trend 1.000 Emerging trend 1.000 50 50
59 F=1 ,M=1 , -> cat1=1 ,cat5=1 ,cat13=1 , 0.17886 Emerging trend 1.000 Emerging trend 1.000 51 51
60 F=1 ,R=0.25 ,M=1 , -> cat1=1 ,cat13=1 , 0.21138 Emerging trend 1.000 Emerging trend 1.000 52 52
61 M=1 , -> cat1=1 ,cat3=1 ,cat5=1 ,cat11=1 , 0.13821 Emerging trend 1.000 Emerging trend 1.000 53 53
62 F=1 , -> cat1=1 ,cat3=1 ,cat5=1 ,cat11=1 , 0.13821 Emerging trend 1.000 Emerging trend 1.000 54 54
63 R=0.25 , -> cat1=1 ,cat3=1 ,cat5=1 ,cat11=1 , 0.13821 Emerging trend 1.000 Emerging trend 1.000 55
64 R=0.25 ,M=1 , -> cat1=1 ,cat3=1 ,cat11=1 , 0.16260 Emerging trend 1.000 Emerging trend 1.000 56 56
65 F=1 ,R=0.25 , -> cat1=1 ,cat3=1 ,cat11=1 , 0.16260 Emerging trend 1.000 Emerging trend 1.000 57 57
66 F=1 ,M=1 , -> cat1=1 ,cat3=1 ,cat11=1 , 0.17886 Emerging trend 1.000 Emerging trend 1.000 58 58
67 R=0.25 ,M=1 , -> cat1=1 ,cat5=1 ,cat11=1 , 0.18699 Emerging trend 1.000 Emerging trend 1.000 59 59
68 F=1 ,R=0.25 , -> cat1=1 ,cat5=1 ,cat11=1 , 0.18699 Emerging trend 1.000 Emerging trend 1.000 60 60
69 F=1 ,M=1 , -> cat1=1 ,cat5=1 ,cat11=1 , 0.19512 Emerging trend 1.000 Emerging trend 1.000 61 61
70 F=1 ,R=0.25 ,M=1 , -> cat1=1 ,cat11=1 , 0.21951 Emerging trend 1.000 Emerging trend 1.000 62 62
71 R=0.25 ,M=1 , -> cat1=1 ,cat3=1 ,cat5=1 , 0.16260 Emerging trend 1.000 Emerging trend 1.000 63 63
72 F=1 ,R=0.25 , -> cat1=1 ,cat3=1 ,cat5=1 , 0.16260 Emerging trend 1.000 Emerging trend 1.000 64 64
73 F=1 ,M=1 , -> cat1=1 ,cat3=1 ,cat5=1 , 0.17886 Emerging trend 1.000 Emerging trend 1.000 65 65
74 F=1 ,R=0.25 ,M=1 , -> cat1=1 ,cat3=1 , 0.19512 Emerging trend 1.000 Emerging trend 1.000 66
75 F=1 ,R=0.25 ,M=1 , -> cat1=1 ,cat5=1 , 0.21951 Emerging trend 1.000 Emerging trend 1.000 67 67
76 M=1 , -> cat3=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.15447 Emerging trend 1.000 Emerging trend 1.000 68 68
77 F=1 , -> cat3=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.15447 Emerging trend 1.000 Emerging trend 1.000 69 69
78 R=0.25 , -> cat3=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.14634 Emerging trend 1.000 Emerging trend 1.000 70 70
79 R=0.25 ,M=1 , -> cat3=1 ,cat11=1 ,cat13=1 , 0.17073 Emerging trend 1.000 Emerging trend 1.000 71 71
80 F=1 ,R=0.25 , -> cat3=1 ,cat11=1 ,cat13=1 , 0.17073 Emerging trend 1.000 Emerging trend 1.000 72 72
81 F=1 ,M=1 , -> cat3=1 ,cat11=1 ,cat13=1 , 0.18699 Emerging trend 1.000 Emerging trend 1.000 73 73
82 R=0.25 ,M=1 , -> cat5=1 ,cat11=1 ,cat13=1 , 0.17886 Emerging trend 1.000 Emerging trend 1.000 74 74
83 F=1 ,R=0.25 , -> cat5=1 ,cat11=1 ,cat13=1 , 0.17886 Emerging trend 1.000 Emerging trend 1.000 75 75
84 F=1 ,M=1 , -> cat5=1 ,cat11=1 ,cat13=1 , 0.18699 Emerging trend 1.000 Emerging trend 1.000 76 76
85 F=1 ,R=0.25 ,M=1 , -> cat11=1 ,cat13=1 , 0.22764 Emerging trend 1.000 Emerging trend 1.000 77
86 R=0.25 ,M=1 , -> cat3=1 ,cat5=1 ,cat13=1 , 0.23577 Emerging trend 1.000 Emerging trend 1.000 78 78
87 F=1 ,R=0.25 , -> cat3=1 ,cat5=1 ,cat13=1 , 0.23577 Emerging trend 1.000 Emerging trend 1.000 79 79
88 F=1 ,M=1 , -> cat3=1 ,cat5=1 ,cat13=1 , 0.25203 Emerging trend 1.000 Emerging trend 1.000 80 80
89 F=1 ,R=0.25 ,M=1 , -> cat3=1 ,cat13=1 , 0.28455 Emerging trend 1.000 Emerging trend 1.000 81 81
90 F=1 ,R=0.25 ,M=1 , -> cat5=1 ,cat13=1 , 0.31707 Emerging trend 1.000 Emerging trend 1.000 82 82
91 R=0.25 ,M=1 , -> cat3=1 ,cat5=1 ,cat11=1 , 0.19512 Emerging trend 1.000 Emerging trend 1.000 83 83
92 F=1 ,R=0.25 , -> cat3=1 ,cat5=1 ,cat11=1 , 0.19512 Emerging trend 1.000 Emerging trend 1.000 84 84
93 F=1 ,M=1 , -> cat3=1 ,cat5=1 ,cat11=1 , 0.22764 Emerging trend 1.000 Emerging trend 1.000 85 85
94 F=1 ,R=0.25 ,M=1 , -> cat3=1 ,cat11=1 , 0.25203 Emerging trend 1.000 Emerging trend 1.000 86 86
95 F=1 ,R=0.25 ,M=1 , -> cat5=1 ,cat11=1 , 0.30081 Emerging trend 1.000 Emerging trend 1.000 87 87
96 F=1 ,R=0.25 ,M=1 , -> cat3=1 ,cat5=1 , 0.30081 Emerging trend 1.000 Emerging trend 1.000 88
Table4.32: Generated Rules for period 2, Cluster 4, change mining by (Chen et al., 2005) measures & Manhattan distance
Rule-Index rule2 Support Change-Type Similarity Change-Type-M Similarity-M Sim-Rule-Index1 Sim-Rule-Index2
1 M=1 , -> cat11=0.5 , 0.10081 1.000 1.000 1 1
2 M=1 , -> cat1=0.25 , 0.11290 Not Added 0.500 Not Added 0.500 7 7
3 M=1 , -> cat11=0.75 , 0.12097 Unexpected purchasing 0.000 Not Added 0.750 1 1
4 F=1 ,M=1 , -> cat3=0.75 ,cat5=1 , 0.10484 Not Added 0.500 Not Added 0.583 3 73
5 R=0.25 ,M=1 , -> cat3=0.75 , 0.10081 Not Added 0.500 Not Added 0.500 3 3
6 M=1 , -> cat3=1 ,cat5=0.75 , 0.10484 Added 0.250 Not Added 0.438 5 51
7 M=1 , -> cat3=1 ,cat13=0.75 , 0.11290 Added 0.250 Not Added 0.438 9 42
8 F=1 ,M=1 , -> cat5=1 ,cat13=0.75 , 0.11290 1.000 1.000 10 10
9 R=0.25 ,M=1 , -> cat13=0.75 , 0.10887 Not Added 0.667 Not Added 0.667 11
10 M=1 , -> cat1=0.75 ,cat3=1 , 0.11290 Added 0.250 Not Added 0.438 42 42
11 F=1 ,M=1 , -> cat1=0.75 ,cat5=1 , 0.10081 Not Added 0.500 Not Added 0.750 6 6
12 R=0.25 ,M=1 , -> cat1=0.75 , 0.10887 Not Added 0.667 Not Added 0.667 4 4
13 F=1 ,M=1 ,area=0.25 , -> cat1=1 , 0.10484 Not Added 0.500 Not Added 0.688 36 36
14 F=1 ,M=1 ,area=0.25 , -> cat13=1 , 0.10484 Not Added 0.667 Not Added 0.833 31 31
15 M=1 ,area=0.25 , -> cat3=1 ,cat11=1 , 0.10484 Added 0.333 Not Added 0.583 38 38
16 F=1 ,M=1 ,area=0.25 , -> cat =1 , 0.11694 Not Added 0.667 Not Added 0.833 32 32
17 F=1 ,M=1 ,area=0.25 , -> cat3=1 , 0.12903 Not Added 0.667 Not Added 0.833 33
18 F=1 ,R=0.25 ,M=1 ,area=0.25 , -> cat5=1 , 0.10081 1.000 1.000 8 8
19 M=1 ,area=1 , -> cat1=1 , 0.10081 Added 0.250 Added 0.375 36 36
20 M=1 ,area=1 , -> cat13=1 , 0.10081 Not Added 0.500 Not Added 0.583 21 31
21 M=1 ,area=1 , -> cat11=1 , 0.10081 Not Added 0.500 Not Added 0.583 25 32
22 M=1 ,area=1 , -> cat3=1 ,cat5=1 , 0.10081 Not Added 0.667 Not Added 0.667 12 12
23 F=1 ,M=1 ,area=1 , -> cat3=1 , 0.10081 Not Added 0.750 Not Added 0.917 29
24 F=1 ,R=0.25 ,M=1 ,area=1 , -> cat5=1 , 0.10081 1.000 1.000 30 30
25 M=1 ,area=0.5 , -> cat1=1 , 0.10484 Not Added 0.500 Not Added 0.500 36 36
26 F=1 ,R=0.25 ,M=1 ,area=0.5 , -> cat13=1 , 0.10887 1.000 1.000 37 37
27 M=1 ,area=0.5 , -> cat11=1 , 0.10081 Not Added 0.500 Not Added 0.583 39 32
28 F=1 ,R=0.25 ,M=1 ,area=0.5 , -> cat3=1 , 0.10887 1.000 1.000 40 40
29 F=1 ,R=0.25 ,M=1 ,area=0.5 , -> cat5=1 , 0.12097 1.000 1.000 41 41
30 M=1 ,area=0.75 , -> cat1=1 , 0.10081 Added 0.250 Not Added 0.438 36 36
31 F=1 ,M=1 ,area=0.75 , -> cat13=1 , 0.10081 1.000 1.000 31 31
32 F=1 ,M=1 ,area=0.75 , -> cat3=1 , 0.11694 1.000 1.000 33
33 F=1 ,R=0.25 ,M=1 ,area=0.75 , -> cat5=1 , 0.11694 Not Added 0.750 Not Added 0.938 8 30
34 M=1 , -> cat1=1 ,cat3=1 ,cat =1 ,cat13=1 , 0.12903 1.000 1.000 42 42
35 F=1 , -> cat1=1 ,cat3=1 ,cat11=1 ,cat13=1 , 0.12500 1.000 1.000 43 43
36 R=0.25 , -> cat1=1 ,cat3=1 ,cat11=1 ,cat13=1 , 0.10484 1.000 1.000 44
37 M=1 , -> cat1=1 ,cat5=1 ,cat =1 ,cat13=1 , 0.13710 1.000 1.000 45 45
38 F=1 , -> cat1=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.13306 1.000 1.000 46 46
39 R=0.25 , -> cat1=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.11694 1.000 1.000 47 47
40 R=0.25 ,M=1 , -> cat1=1 ,cat11=1 ,cat13=1 , 0.13306 1.000 1.000 48 48
41 F=1 ,R=0.25 , -> cat1=1 ,cat =1 ,cat13=1 , 0.13306 1.000 1.000 49 49
42 F=1 ,M=1 , -> cat1=1 ,cat11=1 ,cat13=1 , 0.16532 1.000 1.000 50 50
43 M=1 , -> cat1=1 ,cat3=1 ,cat5=1 ,cat13=1 , 0.12097 1.000 1.000 51 51
44 F=1 , -> cat1=1 ,cat3=1 ,cat5=1 ,cat13=1 , 0.11694 1.000 1.000 52 52
45 R=0.25 , -> cat1=1 ,cat3=1 ,cat5=1 ,cat13=1 , 0.10887 1.000 1.000 53 53
46 R=0.25 ,M=1 , -> cat1=1 ,cat3=1 ,cat13=1 , 0.13306 1.000 1.000 54 54
47 F=1 ,R=0.25 , -> cat1=1 ,cat3=1 ,cat13=1 , 0.13306 1.000 1.000 55
48 F=1 ,M=1 , -> cat1=1 ,cat3=1 ,cat13=1 , 0.15323 1.000 1.000 56 56
49 R=0.25 ,M=1 , -> cat1=1 ,cat5=1 ,cat13=1 , 0.14919 1.000 1.000 57 57
50 F=1 ,R=0.25 , -> cat1=1 ,cat5=1 ,cat13=1 , 0.14919 1.000 1.000 58 58
51 F=1 ,M=1 , -> cat1=1 ,cat5=1 ,cat13=1 , 0.16532 1.000 1.000 59 59
52 F=1 ,R=0.25 ,M=1 , -> cat1=1 ,cat13=1 , 0.18548 1.000 1.000 60 60
53 M=1 , -> cat1=1 ,cat3=1 ,cat5=1 ,cat =1 , 0.13306 1.000 1.000 61 61
54 F=1 , -> cat1=1 ,cat3=1 ,cat5=1 ,cat =1 , 0.12903 1.000 1.000 62 62
55 R=0.25 , -> cat1=1 ,cat3=1 ,cat5=1 ,cat =1 , 0.10484 1.000 1.000 63 63
56 R=0.25 ,M=1 , -> cat1=1 ,cat3=1 ,cat =1 , 0.13306 1.000 1.000 64 64
57 F=1 ,R=0.25 , -> cat1=1 ,cat3=1 ,cat =1 , 0.12903 1.000 1.000 65 65
58 F=1 ,M=1 , -> cat1=1 ,cat3=1 ,cat =1 , 0.18145 1.000 1.000 66
59 R=0.25 ,M=1 , -> cat1=1 ,cat5=1 ,cat =1 , 0.13710 1.000 1.000 67 67
60 F=1 ,R=0.25 , -> cat1=1 ,cat5=1 ,cat =1 , 0.13710 1.000 1.000 68 68
61 F=1 ,M=1 , -> cat1=1 ,cat5=1 ,cat =1 , 0.16935 1.000 1.000 69 69
62 F=1 ,R=0.25 ,M=1 , -> cat1=1 ,cat =1 , 0.16129 1.000 1.000 70 70
63 R=0.25 ,M=1 , -> cat1=1 ,cat3=1 ,cat5=1 , 0.14113 1.000 1.000 71 71
64 F=1 ,R=0.25 , -> cat1=1 ,cat3=1 ,cat5=1 , 0.14113 1.000 1.000 72 72
65 F=1 ,M=1 , -> cat1=1 ,cat3=1 ,cat5=1 , 0.16935 1.000 1.000 73 73
66 F=1 ,R=0.25 ,M=1 , -> cat1=1 ,cat3=1 , 0.18145 1.000 1.000 74 74
67 F=1 ,R=0.25 ,M=1 , -> cat1=1 ,cat5=1 , 0.19758 1.000 1.000 75 75
68 M=1 , -> cat3=1 ,cat5=1 ,cat =1 ,cat13=1 , 0.15323 1.000 1.000 76 76
69 F=1 , -> cat3=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.14113 1.000 1.000 77
70 R=0.25 , -> cat3=1 ,cat5=1 ,cat11=1 ,cat13=1 , 0.12903 1.000 1.000 78 78
71 R=0.25 ,M=1 , -> cat3=1 ,cat11=1 ,cat13=1 , 0.15323 1.000 1.000 79 79
72 F=1 ,R=0.25 , -> cat3=1 ,cat =1 ,cat13=1 , 0.14919 1.000 1.000 80 80
73 F=1 ,M=1 , -> cat3=1 ,cat11=1 ,cat13=1 , 0.18548 1.000 1.000 81 81
74 R=0.25 ,M=1 , -> cat5=1 ,cat11=1 ,cat13=1 , 0.16935 1.000 1.000 82 82
75 F=1 ,R=0.25 , -> cat5=1 ,cat =1 ,cat13=1 , 0.16532 1.000 1.000 83 83
76 F=1 ,M=1 , -> cat5=1 ,cat11=1 ,cat13=1 , 0.18952 1.000 1.000 84 84
77 F=1 ,R=0.25 ,M=1 , -> cat11=1 ,cat13=1 , 0.18952 1.000 1.000 85 85
78 R=0.25 ,M=1 , -> cat3=1 ,cat5=1 ,cat13=1 , 0.19355 1.000 1.000 86 86
79 F=1 ,R=0.25 , -> cat3=1 ,cat5=1 ,cat13=1 , 0.18952 1.000 1.000 87 87
80 F=1 ,M=1 , -> cat3=1 ,cat5=1 ,cat13=1 , 0.21371 1.000 1.000 88
81 F=1 ,R=0.25 ,M=1 , -> cat3=1 ,cat13=1 , 0.22984 1.000 1.000 89 89
82 F=1 ,R=0.25 ,M=1 , -> cat5=1 ,cat13=1 , 0.26613 1.000 1.000 90 90
83 R=0.25 ,M=1 , -> cat3=1 ,cat5=1 ,cat =1 , 0.16935 1.000 1.000 91 91
84 F=1 ,R=0.25 , -> cat3=1 ,cat5=1 ,cat =1 , 0.16532 1.000 1.000 92 92
85 F=1 ,M=1 , -> cat3=1 ,cat5=1 ,cat =1 , 0.20161 1.000 1.000 93 93
86 F=1 ,R=0.25 ,M=1 , -> cat3=1 ,cat =1 , 0.19758 1.000 1.000 94 94
87 F=1 ,R=0.25 ,M=1 , -> cat5=1 ,cat =1 , 0.22984 1.000 1.000 95 95
88 F=1 ,R=0.25 ,M=1 , -> cat3=1 ,cat5=1 , 0.29032 1.000 1.000 96 96
Here, for a better explanation, we compare the similarity of two rules as measured by (Chen et al., 2005) and by our modified measure based on the Manhattan distance.
For example, in Cluster 1 we have the following two rules:
t2-r5: R=0.75, -> cat1=0.25
(Chen et al., 2005)'s similarity = 0.000
Our similarity = 0.375 with t1-r4: F=0.25, R=1, -> cat1=0.25.
With the first method, because R takes different values in the two time snapshots, the similarity becomes zero, even though both rules have R in their LHS and only the values differ. By calculating the difference based on the distance between the two R values in the two rules, we gain more information.
Another example, in Cluster 4:
t2-r3: M=1, -> cat11=0.75
(Chen et al., 2005)'s similarity = 0.000
Our similarity = 0.750 with t1-r1: M=1, -> cat11=0.5.
Again, by the (Chen et al., 2005) measure the similarity becomes zero, although cat11 appears in both RHSs. We calculate the difference based on the distance between the two cat11 values in the two rules and gain more information, because the similarity of these rules is not zero. In this chapter we explained the steps we took to mine changes in customer behavior. Our contribution in this study is using the Manhattan distance to extract more information from the rules and increase the accuracy of the change measures. On average, we obtained a 6.65% improvement in the change mining measures by using the Manhattan distance.
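The difference between the two measures can be sketched as follows. This is an illustrative reading rather than the thesis's exact formulas: we assume a rule is stored as a dictionary of attribute-value pairs and let each attribute shared by both rules contribute 1 - |v1 - v2| (a Manhattan term on the normalized ordinal values) instead of the hard 0/1 match of the original measure, so the normalization and hence the absolute numbers differ from the tables above.

```python
def chen_style_similarity(r1: dict, r2: dict) -> float:
    """Hard matching: an attribute counts only if name AND value coincide."""
    matched = sum(1 for a in r1.keys() & r2.keys() if r1[a] == r2[a])
    return matched / len(r1.keys() | r2.keys())

def manhattan_similarity(r1: dict, r2: dict) -> float:
    """Graded matching: a shared attribute contributes 1 - |v1 - v2|."""
    score = sum(1 - abs(r1[a] - r2[a]) for a in r1.keys() & r2.keys())
    return score / len(r1.keys() | r2.keys())

# Cluster 1 example: t2-r5 (R=0.75 -> cat1=0.25) vs t1-r4 (F=0.25, R=1 -> cat1=0.25)
t2_r5 = {"R": 0.75, "cat1": 0.25}
t1_r4 = {"F": 0.25, "R": 1.0, "cat1": 0.25}
print(chen_style_similarity(t2_r5, t1_r4))  # R's values differ, so R contributes 0
print(manhattan_similarity(t2_r5, t1_r4))   # R still contributes 1 - 0.25 = 0.75
```

The point made by the two examples survives this simplification: wherever two rules share an attribute whose ordinal values merely differ, the graded measure returns a non-zero contribution where the hard match returns zero.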
Chapter5: Conclusion and Further Research
Conclusion
Our contribution
Limitation
Managerial implication
Future works
5.1 Conclusion: In this study, we mined the purchasing behavior of the customers of Kalleh Distribution Company. The world around us changes constantly, and knowing and adapting to change is an important aspect of our lives; for businesses, knowing what is changing and how it has changed is crucial (Liu et al, 2000). For a Fast Moving Consumer Goods (FMCG) distribution company like Kalleh, this issue is especially important. Kalleh faces increasing competition: given the variety of FMCG products, distribution companies and their different strategies, customer behavior in such a market may change both with the trends of companies' strategies and with the customers' own changing needs.
To deal with these problems, Kalleh Company wants to find the changes happening in the market by analyzing purchase transaction data. For mining changes, we compare customer purchasing behavior across two periods. The purpose of this study is to mine changes in customer purchasing behavior; to reach this goal, we need to build customer purchasing patterns based on the customer, product and transaction data collected in databases.
Data mining techniques can help us reach this goal. Change mining has several steps: data collection, data pre-processing, customer segmentation based on RFM and the Customer Value Matrix, building customer behavior patterns by association rule mining, and finally comparing the generated association rules by the two measures of similarity and unexpectedness. The research process of this study is shown in figure 3.3 and was constructed based on previous work in the literature. In the data collection phase, we gathered two years of transaction data from Kalleh Distribution Company. The data pre-processing phase has the following steps. Data cleaning removes noisy or inconsistent data; in this study, the noisy data are the customers who belong to Kalleh Company itself, so we removed them from the database. Over the two analyzed periods there were 2499 customers, of whom 42 belonged to Kalleh Company, leaving 2457 customers after removing the noisy data.
In the data transformation phase, we first performed generalization, building 6 product groups based on expert opinion, as shown in figure 4.1. The second task in data transformation was building the RFM variables. To calculate RFM, we first divided our dataset into two time snapshots: period one (t1), between '1383/07/01' and '1384/06/31', and period two (t2), between '1384/07/01' and '1385/06/31'. We defined recency as the interval between the last date of purchase and the last date of each period, so the evaluation dates for the two time snapshots are '1384/06/31' and '1385/06/31'. For frequency and monetary value, we aggregated the transaction data to calculate the total number of purchases and the total amount spent during each period. Following the market segmentation of (Marcus, C., 1998), we need the average purchase amount of each customer, so we divide the total purchase amount by the total number of purchases.
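A minimal pandas sketch of this RFM construction, assuming a transaction table with hypothetical columns customer_id, purchase_date and amount, and with the Jalali period boundaries already converted to comparable timestamps:

```python
import pandas as pd

def build_rfm(tx: pd.DataFrame, period_end: pd.Timestamp) -> pd.DataFrame:
    """One RFM row per customer for a single period's transactions."""
    rfm = tx.groupby("customer_id").agg(
        last_purchase=("purchase_date", "max"),
        frequency=("purchase_date", "count"),  # total number of purchases
        monetary=("amount", "sum"),            # total amount spent
    )
    # Recency: interval between the last purchase and the period's last date.
    rfm["recency"] = (period_end - rfm["last_purchase"]).dt.days
    # Average amount per purchase, needed for the Customer Value Matrix.
    rfm["avg_amount"] = rfm["monetary"] / rfm["frequency"]
    return rfm[["recency", "frequency", "monetary", "avg_amount"]]

# e.g. rfm_t1 = build_rfm(tx_t1, period_end_t1)  # evaluation date '1384/06/31'
```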
Customer segmentation is the next step in this study. Following (Marcus, C., 1998), we divided the customers of each period into four clusters: uncertain, frequent, spender and best. The Customer Value Matrix has two axes; its calculation steps and results are given in the corresponding section, and tables 4.4 and 4.5 show the results of the market segmentation into the four clusters of uncertain, spender, frequent and best customers.
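A sketch of the resulting four-way split, under the assumption (consistent with Marcus's Customer Value Matrix, though the thesis's exact cut-off values are given in its tables) that the two axes are purchase frequency and average purchase amount, each split at its mean over the customer base:

```python
def value_matrix_segment(freq: float, avg_amount: float,
                         freq_cut: float, amount_cut: float) -> str:
    # Quadrants of the Customer Value Matrix.
    if freq >= freq_cut:
        return "best" if avg_amount >= amount_cut else "frequent"
    return "spender" if avg_amount >= amount_cut else "uncertain"

# rfm["segment"] = [
#     value_matrix_segment(f, a, rfm["frequency"].mean(), rfm["avg_amount"].mean())
#     for f, a in zip(rfm["frequency"], rfm["avg_amount"])
# ]
```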
The next step is customer behavior mining. In this phase, we applied association rules to analyze the patterns of customer behavior in the different time snapshots for each customer cluster. There are other methods for change mining in the literature, such as decision trees, but we chose association rules because, according to (Song et al, 2001), they can detect complete sets of changes. Association rules work with discrete variables, so the first phase is discretization. We used equal-frequency binning to discretize the RFM data; for discretizing the area, we defined four groups based on the market experts' opinion and their knowledge of the areas; and we also discretized the number of purchases for each product category. We then built association rules whose left-hand side contains the customer profile data and RFM variables and whose right-hand side contains the purchased products. The minimum support and confidence are 17%, and in the Apriori algorithm we used maximal frequent itemsets. After building the association rules, we compared these rules to mine changes in customer behavior, using the two measures of similarity and unexpectedness from (Chen et al, 2005), which evaluate how similar or different two rules are. The changes calculated for each cluster over the two periods are shown in chapter 4.
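A minimal sketch of this rule-generation step using the mlxtend library (an assumption: the thesis does not name its implementation, and mlxtend's apriori returns all frequent itemsets, so restricting to maximal ones would need an extra filtering pass). The one-hot column names are hypothetical:

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Toy one-hot table: a row per customer, flags for discretized LHS
# attributes (RFM, area) and purchased product categories (RHS).
baskets = pd.DataFrame({
    "F=100%":    [1, 1, 0, 1, 1, 0],
    "area=rich": [0, 1, 1, 0, 1, 0],
    "cat1":      [1, 1, 0, 1, 1, 1],
    "cat3":      [1, 0, 1, 1, 1, 0],
}).astype(bool)

frequent = apriori(baskets, min_support=0.17, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.17)

# Keep rules with profile/RFM attributes on the LHS and categories on the RHS.
is_cats = lambda items: all(str(i).startswith("cat") for i in items)
rules = rules[rules["consequents"].apply(is_cats)
              & ~rules["antecedents"].apply(is_cats)]
print(rules[["antecedents", "consequents", "support", "confidence"]])
```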
5.2 Our contribution: The next step was to build customer behavior patterns whose RHS states not only which products were bought but also the number of purchases per product category. Using ordinal values instead of the binary bought/not-bought brings more information into the rules: when we compare the values of a common attribute in the LHS and RHS of two rules, we can detect their differences and similarities more accurately. This time the minimum support and confidence of the rules were 10%. Change mining on the generated rules was then done both with the measures of (Chen et al, 2005) and with the Manhattan distance formula of chapter 3, section 3.12. The results showed that the accuracy of change mining improves by 6.5% on average, which is our contribution in this study. One plausible form of the ordinal comparison is sketched below.
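The following sketch follows the idea of section 3.12 rather than its exact notation; representing rule sides as named integer vectors of bin indices is an assumption made for the illustration.

# Illustrative ordinal comparison: for an attribute shared by two
# rules, the normalised Manhattan distance between bin indices
# replaces the binary match/no-match, so bins one level apart count
# as more similar than bins three levels apart; k is the number of
# bins per attribute.
ordinal_similarity <- function(r1, r2, k = 4) {
  common <- intersect(names(r1), names(r2))
  if (length(common) == 0) return(0)
  d <- abs(r1[common] - r2[common]) / (k - 1)  # 0 = same bin, 1 = opposite ends
  mean(1 - d)                                  # average attribute similarity
}
# e.g. ordinal_similarity(c(R_bin = 2, F_bin = 4), c(R_bin = 2, F_bin = 1))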
5.3 Limitations: In doing this research we faced some limitations. One of them was finding a database that stores useful attributes: our study needed demographic variables of the customers, but the database contained only the geographic area of each customer.
5.4 Managerial implications: In this section, we summarize the various opportunities for using the change mining methodology. The findings of this study have great implications for many businesses, such as distribution companies, which operate in a dynamic environment where customers are influenced by different internal and external factors. In such an environment, knowing the changes and adapting to them is crucial. At the macro level, business managers can follow the trend of the market in order to provide suitable products and services for their customers (Liu et al, 2000); if marketing managers detect changes in the market, they can find the reasons for those changes in time and react appropriately. At the micro level, change mining can help managers better understand their customers' needs through their behavior and design additional niche marketing campaigns (Song et al, 2001). Change detection is most suitable in dynamic domains where human intervention is high. Another application of change mining is analyzing the effectiveness of marketing campaigns. Change mining can also be used in manufacturing to monitor changes and control quality factors, so that changes in various measures of product quality can be properly controlled (Song et al, 2001).
Change mining can play an especially important role in the FMCG market, where competition is high. Also, because of the huge amount of data recorded in
these companies' databases, data mining methods such as change mining can bring out hidden and useful information. We believe that the change detection problem will become more important as more data mining applications are implemented.
5.5 Future work: In this study, the rules were built from RFM and geographic variables only, because the Kalleh database stores just these variables. Further research may therefore use other demographic variables, such as the type of the customers.
In this research, change mining compares each rule in one time snapshot with all of the rules in the other time snapshot. Further research could therefore make this comparison more efficient.
References:
Adomavicius, G. & Tuzhilin, A., (2001), Using data mining methods to build customer profiles, IEEE Computer, Volume 34, Issue 2, pp. 74-82.
Agrawal, R., Imielinski, T. & Swami, A., (1993), Mining association rules between sets of items in large databases, Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 207-216.
Agrawal, R. & Psaila, G., (1995), Active Data Mining, First International Conference on Knowledge Discovery and Data Mining (KDD-95), pp. 3-8.
Agrawal, R. & Srikant, R., (1994), Fast algorithms for mining association rules, Proceedings of the International Conference on Very Large Databases, VLDB-94, pp. 487-499.
Ayad, A.M., (2000), Incremental mining of constrained association rules, Master's Thesis, Alexandria University, Faculty of Engineering.
Bay, S.D. & Pazzani, M.J., (1999), Detecting Change in Categorical Data: Mining Contrast Sets, Knowledge Discovery and Data Mining, pp. 302-306.
Berry, M.J.A. & Linoff, G.S., (2004), Data Mining Techniques for Marketing, Sales and Customer Relationship Management (2nd edn), Indianapolis, Indiana, Wiley Publishing, Inc.
Bolton, R.J., Hand, D.J. & Crowder, M., (2004), Significance tests for unsupervised pattern discovery in large continuous multivariate data sets, Computational Statistics & Data Analysis, Volume 46, Number 1, pp. 57-79.
Böttcher, M., Nauck, D., Borgelt, C. & Kruse, R., (2006), A framework for discovering interesting business changes from data, BT Technology Journal, Volume 24, Issue 2, pp. 219-228.
Brin, S., et al., (1997), Dynamic Itemset Counting and Implication Rules for Market Basket Data, Proc. ACM SIGMOD Int'l Conf. Management of Data, ACM Press, New York, pp. 255-264.
Chen, M.C., Chiu, A.L. & Chang, H.H., (2005), Mining changes in customer behavior in retail marketing, Expert Systems with Applications, Volume 28, Issue 4, pp. 773-781.
Cho, Y.B., Cho, Y.H. & Kim, S.H., (2005), Mining changes in customer buying behavior for collaborative recommendations, Expert Systems with Applications, Volume 28, pp. 359-369.
Dong, G. & Li, J., (1999), Efficient mining of emerging patterns: discovering trends and differences, Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 43-52.
Dunham, M.H., Xiao, Y., Gruenwald, L. & Hossain, Z., (2001), A survey of association rule mining, ACM Survey Journal (submitted), available at http://www2.cs.uh.edu/~ceick/6340/grue-assoc.pdf
Feelders, A.J., Daniels, H.A.M. & Holsheimer, M., (2000), Methodological and practical aspects of data mining, Information & Management, Volume 37, pp. 271-281.
Han, J. & Kamber, M., (2006), Data Mining: Concepts and Techniques, San Francisco, Morgan Kaufmann Publishers.
Hossein Javaheri, S., (2008), Response Modeling in Direct Marketing: a data mining based approach for target selection, Master's Thesis, epubl.luth.se/1653-0187/2008/014/LTU-PB-EX-08014-SE.pdf
Kantardzic, M., (2003), Data Mining: Concepts, Models, Methods, and Algorithms, John Wiley & Sons.
Larose, D.T., (2006), Data Mining Methods and Models, Hoboken, New Jersey, John Wiley & Sons, Inc.
Li, J., Dong, G. & Ramamohanarao, K., (2000), Instance-Based Classification by Emerging Patterns, Principles of Data Mining and Knowledge Discovery, pp. 191-200.
Li, X.B., (2005), A scalable decision tree system and its application in pattern recognition and intrusion detection, Decision Support Systems, Volume 41, Issue 1, pp. 112-130.
Liu, B. & Hsu, W., (1996), "Post-analysis of learned rules", AAAI-96.
Liu, B., Hsu, W., Han, H.S. & Xia, Y., (2000), Mining changes for real-life applications, Lecture Notes in Computer Science, Proceedings of the Second International Conference on Data Warehousing and Knowledge Discovery, Volume 1874, pp. 337-346.
Liu, H., Hussain, F., Tan, C.L. & Dash, M., (2002), Discretization: An Enabling Technique, Data Mining and Knowledge Discovery, Volume 6, Number 4, pp. 393-423.
Malhotra, N.K., (1996), Marketing Research: An Applied Orientation, India, Pearson Education.
Marcus, C., (1998), A practical yet meaningful approach to customer segmentation, Journal of Consumer Marketing, Volume 15, Issue 5, pp. 494-504.
Miglautsch, J.R., (2001), Thoughts on RFM Scoring, Journal of Database Marketing, Volume 8, Issue 1, pp. 67-72.
Min, S.H. & Han, I., (2005), Detection of the customer time-variant pattern for improving recommender systems, Expert Systems with Applications, Volume 28, Issue 2, pp. 189-199.
Nemati, H.R. & Barko, C.D., (2003), Key factors for achieving organizational data-mining success, Industrial Management & Data Systems, Volume 103, Issue 4, pp. 282-292.
Novo, J., (2008), Drilling Down: Turning Customer Data into Profits with a Spreadsheet, www.Jimnovo.com
Park, J.S., Chen, M.-S. & Yu, P.S., (1995), An Effective Hash Based Algorithm for Mining Association Rules, Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, pp. 175-186.
Savasere, A., Omiecinski, E. & Navathe, S., (1995), An Efficient Algorithm for Mining Association Rules in Large Databases, Proc. 21st Int'l Conf. Very Large Data Bases, Morgan Kaufmann, San Francisco, pp. 432-444.
Saunders, M., Lewis, P. & Thornhill, A., (2000), Research Methods for Business Students, England, Pearson Education Limited.
Silberschatz, A. & Tuzhilin, A., (1996), What makes patterns interesting in knowledge discovery systems? IEEE Transactions on Knowledge and Data Engineering, Volume 8, Issue 6, pp. 970-974.
Song, H.S., Kim, J.K. & Kim, S.H., (2001), Mining the change of customer behavior in an internet shopping mall, Expert Systems with Applications, Volume 21, Issue 3, pp. 157-168.
Su, J.H. & Lin, W.Y., (2004), CBW: an efficient algorithm for frequent itemset mining, Proceedings of the 37th Annual Hawaii International Conference on System Sciences, 5-8 Jan. 2004, 9 pp.
Thomas, S., Bodagala, S., Alsabti, K. & Ranka, S., (1997), An efficient algorithm for the incremental updation of association rules in large databases, Knowledge Discovery and Data Mining, pp. 263-266.
Toivonen, H., (1996), Sampling Large Databases for Association Rules, Proceedings of the 22nd International Conference on Very Large Data Bases, pp. 134-145.
Tsai, C.Y. & Chiu, C.-C., (2004), A purchase-based market segmentation methodology, Expert Systems with Applications, Volume 27, Issue 2, pp. 265-276.
Twocrows.com, (2005), www.twocrows.com/intro-dm.pdf
Yin, R.K., (1994), Case Study Research: Design and Methods (2nd edn), Thousand Oaks, California, Sage Publications, Inc.
Zaki, M.J., Parthasarathy, S., Ogihara, M. & Li, W., (1997), New Parallel Algorithms for Fast Discovery of Association Rules, Data Mining and Knowledge Discovery, Volume 1, Number 4, pp. 343-373.
Zhao, Q. & Bhowmick, S.S., (2003), Association Rule Mining: A Survey, Technical Report, CAIS, Nanyang Technological University, Singapore, No. 2003116.
Wu, J. & Lin, Z., (2005), Research on Customer Segmentation Model by Clustering, Proceedings of the 7th International Conference on Electronic Commerce, ACM International Conference Proceeding Series, Vol. 113, pp. 316-318.
Software:
R Software, 2007, version 2.7.0, www.r-project.org
SQL Server, 2000, www.microsoft.com