Page 1
CHURN PREDICTION MODELLING IN MOBILE
TELECOMMUNICATIONS INDUSTRY: A CASE STUDY
OF SAFARICOM LTD
BY
KAIRANGA JAMES MACHARIA
SCHOOL OF MATHEMATICS
COLLEGE OF BIOLOGICAL AND PHYSICAL SCIENCE
UNIVERSITY OF NAIROBI
A project submitted in partial fulfilment of the requirement for the degree of Master of
Science in Social Statistics
JULY 2012
Page 2
DECLARATION
Candidate:
This project report is my original work and has not been presented for a degree in any other
university.
li£ iLKairanga James Macharia 156/64577/2010
Supervisor:
This project report has been submitted for examination with my approval as university supervisor
Signature
Dr. Kipchirchir Isaac Chumba
n.„ W
I
Page 3
ACKNOWLEDGEMENT
I take the first opportunity to thank God for gift of life and good health. Secondly, offer my
sincerest gratitude to my supervisor, Dr. Kipchirchir Isaac Chumba, who has supported me
throughout my project with his patience and knowledge whilst allowing me the room to work with
freedom of thought. One simply could not wish for a better or friendlier supervisor. I acknowledge
my lectures: Prof. Manene Moses M., Prof. Otieno Joseph A. M., Mr. Ndiritu John. M. and Mrs.
Wang'ombe Anne W. for the knowledge they have impacted me throughout my course work. Last
but not least Consumer Planning and Pricing section within Safaricom for providing me with the
required data for analysis.
n
Page 4
DEDICATION
I dedicate the project to my lovely wife Winnie and daughter Tiffany.
iii
Page 5
ABSTRACT
The focus of telecommunication companies has shifted from building a large customer base into
keeping customers in house. For these reasons, it is valuable to know which customers are likely to
switch to a competitor through porting out or purchasing a competitor line.
Since acquiring new customers is more expensive than retaining existing customers, chum
prevention can be regarded as a popular way of reducing the company's costs. In this study, Cox
proportional hazard model and decision tree model are compared with conventional model.
The first model, the Cox model, is based on the theory of survival analysis, whereas the second
model, a decision tree, is commonly used in data mining. Both models are tested on a selection of
pre-paid customers from the database provided by Safaricom Limited.
Current conventional prediction used by Safaricom Limited was improved significantly by using
Cox proportional hazard and decision tree as they both performed better on the ROC curve.
However, for the duration under consideration decision tree performed better than Cox proportional
model.
Decision tree model selected gave probability of chum which is an improvement from conventional
model that only gives binary results of chum and not chum. Also, where the decision tree yields
approximately 50 percent probability of chum conventional model gave varying churn status.
IV
Page 6
LIST OF ABREVIATIONS
AON - Age On Network
ARPU - Average Revenue per User
DMTV - Direct Mail Television
CART - Classification and Regression Trees
CCK - Communications Commission of Kenya
c.d.f - cumulative distribution function
CDR - Call Detail Record
CHAID - Chi Square Automatic Interaction Detection
CLV - Customer Lifetime Value
CRM - Customer Relationship Management
CVM - Customer Value Management
EDA - Exploratory Data Analysis
EDF - Empirical Distribution Function
EDGE - Enhanced Data Rates for GSM Evolution
ETACS - Extended Total Access Communications System
FPR - False Positive Rate
GA - Genetic Algorithm
Global Systems for Mobile - GSM
Gok - Government of Kenya
K.PTC - Kenya Posts and Telecommunications Corporation
MS1SDN - Mobile Number
OTA - Over The Air
PABX - Private Automatic Branch Exchange
PDN - Packetstream Data Networks
v
Page 7
p.d.f - probability density function
Pic - Public limited company
RFM - Recency Frequency Monetary
ROC - Receiver Operating Characteristic
SIM - Subscriber Identity Module
SMS - Short Messaging Service
TKL - Telkom Kenya Limited
TPR - True Positive Rate
USSD - Unstructured Supplementary Service Data
vi
Page 8
TABLE OF CONTENTS
DECLARATION I
ACKNOWLEDGEMENT II
DEDICATION III
ABSTRACT IV
LIST OF ABREVIATIONS V
TABLE OF CONTENTS VII
LIST OF TABLES IX
LIST OF FIGURES X
CHAPTER 1: INTRODUCTION 1
1.1 Background............................................................................................................................... 1
1.2 Problem Statement................................................................................................................. 8
1.3 Research Question...................................................................................................................9
1.4 Objectives...................................................................................................................................9
1.5 Significance of the study .................................................................................................... 10
CHAPTER 2: LITERATURE REVIEW 12
CHAPTER 3: METHODOLOGY 20
3.1 Introduction........................................................................................................................... 20
3.2 Data M ining............................................................................................................................ 21
3.3 Cox Proportional Hazard Mo d el .....................................................................................25
3.4 Decision Tree Model............................................................................................................. 30
3.5 Population and Study Sample.............................................................................................32
vii
Page 9
3.6 Test statistics for Model Comparison.............................................................................33
3.6.1 ROC Curve................................................................................................................................33
3.6.2 Kolmogorov-S mirnov Test (K-S Test) .............................................................................35
3.6.3 Gini Coefficient...................................................................................................................... 37
3.7 Modelling Process.................................................................................................................39
CHAPTER 4: DATA ANALYSIS AND RESULTS 46
4.1 Exploratory Data Analysis..............................................................................................46
4.2 Variable Reduction...............................................................................................................50
4.3 Model Estimation................................................................................................................... 52
4.4.1 Decision Tree ......................................................................................................................... 54
4.4.2 COX PROPORTIONAL HAZARD MODEL....................................................................................... 55
4.4 Model Validation.................................................................................................................. 56
CHAPTER 5: CONCLUSIONS AND RECOMMENDATIONS 61
5.1 Conclusions............................................................................................................................ 61
5.2 Recommendations...................................................................................................................61
APPENDICES 63
Appendix 1: Fit Statistics Table .................................................................................................... 63
Appendix 2: Tree Leaf Report..........................................................................................................65
REFERENCES 65
viii
Page 10
LIST OF TABLES
Table 4.1 Sample statistics variables minimum, mean and maximum values 47
Table 4.2 Sample variables per subscriber 48
Table 4.3 Chum status 51
Table 4.4 Variables Summary 52
Table 4.5 Partition Summary 52
Table 4.6 Summary statistics for class targets 53
Table 4.7 Important variables picked by the decision tree 54
Table 4.8 Summary of Censored Events 55
Table 4.9 Analysis of Maximum Likelihood Estimates (MLE) 56
Table 4.10 Statistics Results from the fitted Models 58
Table 4.11 Comparison of Decision Tree Model and Conventional Model 59
IX
Page 11
LIST OF FIGURES
Figure 3.1 ROC Curve 33
Figure 4.1 Exploration of AON distribution 49
Figure 4.2 Exploration of age distribution 49
Figure 4.3 Chum Status 51
Figure 4.4 Comparing train and validate data set 57
Figure 4.5 Comparing train and validate data set using ROC 57
x
Page 12
CHAPTER 1: INTRODUCTION
1.1 Background
Chum is a measure of subscriber attrition from a given mobile operator network, and is defined as
the number of subscribers who discontinue using a particular network during a specified time period
divided by the average total number of customers or employees over that same time period.
In a dynamic business, chum rate indicate subscriber response to tariff, promotions, competitor
network activities etc. As such, chum rate is an important business metric. To estimate future chum
rates predictive chum modelling is applied.
Largest mobile operators such as Vodafone have long appreciated that the cost of acquiring a new
customer is incrementally greater than the cost of retaining an existing one. Thus, chum data,
alongside subscriber acquisition costs, has become a key measure used by industry analysts and
financial commentators to determine mobile operator performance.
Safaricom, which started as a department of Kenya Posts and Telecommunications Corporation
(KPTC), the former monopoly operator, launched operations in 1993 based on an analogue
Extended Total Access Communications System (ETACS) network which was upgraded to Global
Systems for Mobile (GSM) in 1996 (license awarded in 1999). Safaricom was incorporated on
April 1997 as a private limited liability company.
In accordance with the Government o f Kenya’s policy of divesting its ownership in public
enterprises, the Government of Kenya through the Treasury Department, on 28th March 2008 made
1
Page 13
available to the public 10 billion of the existing ordinary shares of par value ksh. 0.05 each, of the
Company. This represents 25 percent of the total issued share capital of Safaricom from the
Government of Kenya's shareholding in Safaricom Limited.
As at 31st March 2009, the company had 6.175 million registered users, a customer base of 13.36
million, 8,650 retail outlets countrywide, 51 paybill partners, 301 3G enabled base stations in
Nairobi, Mombasa, Naivasha and Eldoret.
In 2009, Safaricom won awards for Best Mobile Money Service in the GSMA. In Global Mobile
Awards it was the winner in the Best Broadcast Commercial Category for its entry of the M-PESA
Send Money Home', in the UN World Business and Development Award it was among the 10
private companies recognized globally for their contribution to the achievement of millennium
development goals through M-PESA, in the Kenyan Banking Awards it was the winner in the
product innovation category (M-PESA) and in the Stockholm Challenge, the winner in the
Economic Development Category ( M-PESA).
M-PESA is a Safaricom product that allows users to transfer money using a mobile phone. Kenya is
the first country in the world to use this service, which is offered in partnership between Safaricom
and Vodafone. M-PESA is available to all members of the public, even if they do not have a bank
account or a bankcard.
Safaricom offers mobile voice services using GSM-900 and GSM-1800 technologies. It launched
GPRS services in July 2004 and Enhanced Data Rates for GSM Evolution (EDGE) services in June
2006. In 2007 it was formally granted Kenya's first license to operate a 3G network.
2
Page 14
Safaricom business model focuses on pre-paid customers (pay-in-advance) without long-term
contract commitments. It requires most o f its customers to pay for services in advance to limit the
customer-related credit risk.
It focuses on:
I. Low-income clients to boost the customer base.
II. Expansion of its GSM coverage footprint in rural areas and capacity levels in key urban
areas.
III. Improve the performance and reliability of its services.
IV. Introducing new and innovative products.
Safaricom offers all and post-paid users a variety of value priced service plan options and products.
Safaricom has aligned itself with other business partners including distributors, suppliers and
technology partners. These arrangements help Safaricom maintain a low cost structure while
ensuring high quality customer products and services. The company bundles its products and
services with products of globally established companies with the goal of deploying reliable, high-
quality cellular products and services to the mass market and competing effectively with other
mobile providers.
In this regard, Safaricom has a working relationship with Vodafone Group Pic, an established leader
in global mobile telecommunications industry. The amount of investment Vodafone has made is
among the largest ever made by any foreign company in Kenya. Vodafone also provides Safaricom
with the opportunity to be a member of its global procurement group and to benefit from
Vodafone's experience in other countries strong marketing efforts, rapid product deployment and
maintaining and growing strong brand recognition.
3
Page 15
The company has focused on enhancing its image by involving itself in the community and
focusing on local themes, which may resonate with the targeted customer base.
During 2008, Safaricom formulated an aggressive growth campaign to increase its subscriber base-
by launching a series of promotions, investing heavily in subscriber acquisition and increased the
core network capacity by targeting rural areas.
By virtue o f the 60 percent shareholding held by the Government of Kenya (GoK), Safaricom was a
state corporation within the meaning of the State Corporations Act (Chapter 446) Laws of Kenya,
which defines a state corporation to include a company incorporated under the Companies Act
which is owned or controlled by the Government or a state corporation. Until 20 December 2007,
the GoK shares were held by Telkom Kenya Limited (TKL), which was a state corporation under
the Act.
Follow ing the offer and sale of 25 percent of the issued shares in Safaricom held by the GoK to the
public in March 2008, the GoK ceased to have a controlling interest in Safaricom under the State
Corporations Act and therefore the provisions of the State Corporations Act no longer apply to it.
To attract new investments into the ICT sector, the regulation capping foreign ownership of
telecoms companies at 80 percent was relaxed to allow foreigners to launch operations without a
local partner.
The introduction of new players and a changing regulatory landscape brought new challenges to
Safaricom and the industry as a whole. A more competitive industry landscape placed downward
pressure on Safaricom market share of gross additions in the medium term. As retail tariffs reduced,
4
Page 16
ARPU reduced for both prepay and post pay subscribers for the industry as a whole. Enhanced
competition created the need for the company to maintain higher levels of selling and limit general
and administrative expense levels. This was due to the potential requirement for higher advertising
costs to protect the subscriber base and increased payroll costs to retain key managerial talent. It led
to focus on product development. In 2009 Safaricom launched Kenya's first mobile internet portal
(www.safaricom.com) to provide free content for its over 1.6 million subscribers who access the
Internet using their phones. The portal enabled Safaricom subscribers to access both local and
international content direct from their mobile phones.
Safaricom then launched Africa's first fully solar-powered phone, branded Simu ya Solar. The new
solar-powered mobile phone went on sale in Kenya in August 2009 at ksh. 499. The solar-powered
phone was produced by the Chinese ZTE Corporation.
Safaricom announced in August 2009 that it has bought 100 percent of a second local WiMAX
operator, Packetstream Data Networks (PDN) and signed an agreement with Nokia and DMTV to
introduce mobile TV service.
In mid-May 2009, Safaricom joined the race to capture the data market in the telecoms industry and
launched the caller ring back tune service. The service branded Skiza, enabled subscribers to choose
a preferred song and set it as their ring back tune. Other products launched in subsequent years
included:
I. tXt-ten for ten (group SMS) - A mobile chat service that enabled subscribers to quickly send
the same message to several members of a group.
II. Advantage Contracts - Offers subscribers an opportunity to control their call costs.
5
Page 17
III. Advantage Plus - Enabled corporate customers to give their staff a limit on their monthly
expenditure.
IV. Safaricom Mail - Email service in conjunction with Google.
V. Toll Free Services - Where called party pays for calls to a toll-free number.
VI. Corporate Direct Connectivity - Direct connection between the customer's PABX and
Safaricom's network facilitates voice communication.
VII. Winback SMS for Roaming - Allows Safaricom to Win-Back visitors lost to a competitor's
network
VIII. Automatic Device Configuration - Subscribers could request for data and network settings
automatically via USSD and SMS and have these delivered directly to their handsets over
the air.
IX. OTA SIM Swap - Allows Safaricom pre-paid subscribers who have lost their SIM Cards to
do a SIM Swap on their handsets.
X. Express Auto bar - A quick one-stop service for all individual customers and guarantees
active post-paid lines.
XI. Kama Kawaida with Rwanda - Here, Safaricom teamed with MTN Rwanda to offer
subscribers seamless service availability at their home tariffs when travelling across the two
countries.
Price wars began in August 2010 when Zain Kenya slashed its on-net prices by 80 percent. Yu and
Orange network followed immediately by reducing calling rates even further. Safaricom countered
by introducing Masaa Tariff which reduced calling rates to ksh. 3. It soon became clear that
strategy will not be to acquire new customers since competitor networks are charging low rates but
managing current subscriber base.
6
Page 18
This lead to the launch of loyalty scheme “ Bonga" to manage subscribers by offering reward on the
number of points accumulated through calling, data and SMS. Subscribers were able to accumulate
points and redeem free minutes and sms. This was later improved to accommodate redemption of
handsets, modem and laptops.
.In a nutshell, below is the life cycle of a Safaricom subscriber:
I. Active - This is the duration of the validity of the recharged voucher topped up. In this state,
the subscriber can make or receive calls and sms browse the internet using a data enabled
handset and transact MPESA.
II. Expiry - this is the next state after subscriber enters after active state if they do not top-up
before expiry of the validity period of the card that they previously topped up. Here, the
subscriber can only receive but cannot make chargeable calls and sms. When a subscriber
tops up they go back to active state. Expiry state last 30 days after which subscriber enters
pooled state.
III. Pool - Here, the subscriber cannot make or receive calls or access any Safaricom service.
It's the initial process of churn and it last for 120 days. In this state, a subscriber cannot top-
up as was the case in expiry state to return to active state.
IV. Inactive - This is the final stage of chum where the line is recycled and resold in the market.
Here the subscriber loses the line together with all resources accumulated by the line such as
Bonga points, airtime balance, data bundles and MPESA monies not withdrawn.
The costs of acquisition of a subscriber are made up of:
I. SIM card cost
II- CCK licence cost
III. Network cost
7
Page 19
IV. Dealer costs
V. Administration costs.
VI. Set-up costs.
These costs are accrued before a subscriber becomes active on our network. Considering there are
also costs to maintain subscriber on the network, it takes on average more than 6 months to recoup
acquisition cost for a new subscriber.
From quarterly sector statistics report by CCK, 2nd Quarter October-December 2011/2012, total
net additions by all mobile operators in December 2011 were 1.5 million. Net addition is the
increase in the total subscriber count from the start of the period to the end of the period. This
implies that all mobile operators in Kenya cannot rely on increasing their subscribers’ base based on
new joiners.
1.2 Problem Statement
Safaricom operates in an industry where switching costs is very low. Subscribers require ksh. 200 to
port to any network. Cost of purchase of competitor line is ksh. 50. Considering acquisition costs
which on average takes more than 6 months to recoup and reducing numbers subscribers available
for new connection, customer churn is the focal concern.
To manage chum, Safaricom has adopted Customer Value Management (CVM), where efforts are
being put in place to ensure no chum for upper and middle segment of subscribers and only selected
chum is to be allowed for lower segment of the subscribers.
8
Page 20
However, due to the nature of pre-paid mobile telephony market which is not contract-based,
subscriber chum is not easily traceable or definable, thus the need to improve on the conventional
model.
1.3 Research Question
The research question is stated as follows:
Is it possible to improve on current conventional methods of predicting chum?
In order to address this question, the following two sub questions are formulated:
I. How well do survival and decision tree model perform in comparison to the conventional
models?
II. Do the two models have an added value compared to the conventional models?
1.4 Objectives
The broad objective is to find out the most accurate chum prediction model by comparing Cox
proportional hazard model and decision tree model against conventional model so as to accurately
determine of probability of each subscriber churning.
Specific objectives:
1. To formulate a chum model using Cox proportional hazard and decision tree models.
2. To compare results of models formulated to current conventional models and determine
the best model.
9
Page 21
3. To determine the probability o f churning for every subscriber based on the best model
selected.
1.5 Significance of the study
Current conventional method focus on reduction in ARPU, which is affected by many other
variables such as:
I. Competitor activities.
II. Demographic factors.
III. Usage factors.
IV. Economic factors.
V. Social factors.
< 4 *
Cox proportional model incorporates all this variables as well as time to chum thus can provide
more accurate prediction of chum for individual subscribers. Decision tree is simple to understand
and offers ability to do oversampling for chum which it considers as unlikely an event.
The model will be important indicator of the success of pricing and promotion strategies adopted by
the company. Currently, the focus of pricing of products and services is purely based on profits
which are not customer centric. Customers who have for many years made significant contribution
to revenue are moving to competitor network because the company seems not to value their loyalty.
Chum probability created will be the input of Customer Lifetime Value (CLV) models to be
developed that seek to provide a useful way to apportion value to a subscriber (or subscriber group)
based on cumulative cash flow from a subscriber relationship and the benefits of loyalty and
10
Page 22
advocacy that increase over time. CLV will provide strategic teams with a means to gauge the
effectiveness of their acquisition costs and retention strategies.
11
Page 23
CHAPTER 2: LITERATURE REVIEW
There is a significant relationship between customer loyalty, satisfaction, trust and switching costs
in mobile telephony market. In this fiercely competitive arena, subscribers demand tailored
products and better sendee at lower prices, while service providers focus on customer acquisition as
their primary focus.
Yankee (2001) indicated that mobile operators estimate the cost of acquiring new subscribers at
seven times more than the annual cost of retaining an existing subscriber on an average basis. The
emergence of the digital economy has intensified the problem of churn management Lejeune (2001)
stated that a company’s initiatives to handle churn and profitability issues have been directed to more
customer-oriented strategies. A customer relationship management (CRM) framework based on the
integration of the electronic channel would incorporate the electronic dimension and be enhanced
by the development of adequate tools for the collection, treatment and analysis of data which plays
a central role in chum management.
Chum amplitude is negatively correlated with the efficiency of data-mining tools, and the
relationship between chum and CRM tools is linear. An analytical framework based upon
sensitivity analysis could anticipate the possible impact induced by the ongoing data-mining
enhancements on chum management and the decision-making process
According to Olafsson et al. (2008), there are two different types of chum namely:
I. Voluntary chum - Which means that established customers choose to stop being customers.
II. Forced chum - Which refers to those established customers who no longer are good
customers and the company cancels the relationship.
12
Page 24
Burez et al. (2008) divided the voluntary chumers to two groups:
I. Commercial chumers - Subscribers who do not renew their fixed term contract at the end of
that contract.
II. Financial chumers - Subscribers who stop paying during their contract to which they are
legally bound.
Seo et al. (2008) investigated retention factors in telecommunications industry by examining other
features and variables. Aim was to examine:
I. How factors that affect switching costs and customer satisfaction, such as length of
association, service plan complexity, handset sophistication and the quality of connectivity,
drive customer retention behavior.
II. How customer demographics such as age and gender affect their choice of service plan
complexity and handset sophistication, leading to differences in customer retention behavior.
They used binary logistic regression model and a two-level hierarchical linear model. The factors
analysed consisted o f complexity of service plan, handsets sophistication, length of association and
connectivity. Customer demographics to be related to these factors are gender and age.
The results showed that:
I. The more complex service plan, more sophisticated handset, longer customer association,
higher connectivity quality of wireless is positively related to customer retention behaviour.
II. Different age and gender groups revealed differences in wireless connectivity quality and
service plan complexity affecting their customer retention behaviour
HI. They did not experience differences in terms of length of customer association and handset
sophistication.
13
Page 25
The results generated questions on why different age and gender groups would differ on the
connectivity quality of wireless service and not on handset sophistication.
Yan et al. (2005) constructed a predictive chum model for pre-paid customer segment. Due to the
limited availability of data, they exploited Call Detail Record (CDR). To construct their predictive
model, they extracted the calling links, that is, who called whom as inputs to neural network model
Using the CDR, they defined two categories of calling links as follows:
I. Direct calling neighbour - A person who calls the customer or whom the customer calls.
II. Indirect calling neighbour - A person who calls the same numbers as the customer does.
Utilizing these neighbours, they discovered the calling community of each customer and
hypothesized that people from a calling community behave in a similar way. So, they supposed that
if a customer most frequently called parties churned from the same service provider, the customer
may also eventually chum.
With the intention of building the chum predictive model they used the CDR data of July and
August to predict the chum in December. In addition, they were provided with chum labels that
showed w ho churned, in both November and December. Their research task was to develop a chum
prediction model, with chum in December as the dependent variable (Prediction Target) and with
independent variables being the CDR data in July and August and the chum information in
November. They analysed the data by using decision tree and neural networks. For the neural
network, if the customer service representatives contact the 10 percent of customers with the
highest scores from the model, they are able to correctly identify 20 percent of the chumers.
14
Page 26
They found that the neural networks outperform the decision tree, which performs even worse than
random sampling for a higher contact rate.
Jahromi (2009) developed a dual-step model building approach, which consisted of clustering phase
and classification phase. The customer base was divided into four clusters, based on their Recency
Frequency Monetary (RFM) related features, with the aim of extracting a logical definition of
chum, and secondly, based on the chum definitions that were extracted in the first step.
In the model building phase, the decision tree (CART algorithm) was utilized to build the predictive
model with the aim of comparing the performance of different algorithms. Neural networks
algorithm and different algorithms of decision tree were utilized to construct the predictive models
for chum in the developed clusters. Evaluating and comparing the performance of the employed
algorithms based on “gain measure".
Jahromi concluded that employing a multi-algorithm approach in which different algorithms are
used for different clusters, yields the maximum “gain" among the tested algorithms.
Furthermore, to deal with imbalanced dataset, a cost- sensitive test was carried out using learning
method as a remedy for handling the class imbalance. This revealed that both simple and cost-
sensitive predictive models have a considerable higher performance than random sampling in both
CART model and multi-algorithm model. Additionally, cost-sensitive learning was proved to
outperform the simple model only in CART model but not in the multi-algorithm.
According to Jahromi, the problem that telecommunication companies face is to recognize the
subscribers with high probability of chum in close future so as to target them with incentives in
15
Page 27
order to convince them to stay. However, due to the absence of an accurate model for monitoring
their clients* behaviour, telecommunication companies are unable to distinguish the chumers from
non-chumers. In such instances they have two options:
I. Send all customers the incentives, which was clearly a waste of money.
II. Quit the chum management program and focus on acquisition program which is
considerably more costly than the retention approach.
According to Jahromi, not only did the model helped in distinguishing the real chumers, but also, it
prevented the waste of money attributed to the mass marketing.
Owczarczuk (2009) studied chum models for customers in the cellular telecommunication industry
using large data marts and tested the usefulness of the popular data mining models to predict chum
of the clients of the Polish cellular telecommunication company. The study was conducted on
subscribers who are:
I. More likely to chum.
II. Less stable.
III. Little is known about them.
Owczarczuk utilised all subscriber usage variables and tested the stability of models across time for
all the percentiles of the lift curve. Test sample were collected six months after the estimation of the
model.
Logistic regression, linear regression, Fisher linear discriminant analysis and decision trees models
were used. The basis of choice of the model was the need to use interpretable models which gives
understanding of the reasons (or at least a symptom) of chum. Owczarczuk claimed that linear
models like regression or Fisher discriminant analysis have a simple interpretation. For example,
16
Page 28
positive coefficient by a variable suggests higher likelihood of chum. Decision trees too have a
clear interpretation which can be expressed in terms of what-if rules.
Data set consisted of the train, calibration and test datasets. Data in the train sample and the
calibration sample came from the dataset collected at the same time, which was then split randomly
into the train and validation part. The test sample was collected six months after the train and
calibration sample.
The models were tested using lift curves that measured the relation of chumers in the top quartiles
of the score generated by the models to the fraction of chumers in the whole population (lifts
expressed as factors not as percentage) since all the linear models had similar performance
regardless of the additional variable selection method (stepwise, backward, forward, none). The
logistic regression was slightly better than linear regression and Fisher discriminant analysis.
Applying preliminary variable selection to decision trees gave similar results to the full decision
tree, so they present only decision trees with the preliminary variable selection.
Main findings were that linear models are more stable than decision trees that get old quickly and
their performance weakens in time, especially in top quartiles of the score. Nevertheless, the study
showed that pre-paid chum can be effectively predicted using large data mart. It was suggested that
as far as future work is concerned, it would be interesting to model chum in the sector that is
somewhere between post-paid and pre-paid - the mix sector. Mix clients have to sign contract and
personal data is available for them, like for the post-paid customers. In addition, they make recharge
which makes them similar to prepaid.
17
Page 29
Ahn et al. (2006) conducted an exploratory research in which they aimed at finding the most
influential factors on customer chum. In their research, they considered a mediator factor named
"Customer's Status”, between churn determinants and customer churn in their model, and
mentioned that “Customer's Status” (from active use to non - use or suspended) change is an early
signal of total customer chum.
In the research, a mediator was taken into account between chum determinants and customer chum,
and it was hypothesized that a customer's status change is an early signal of total customer chum. In
conducting their empirical analysis, they draw a random sample of subscribers of a leading
telecommunications service provider. The account had to be active during the time period between
September 2001 and November 2001. For those customers, all accounts were tracked and examined
for eight month from September 2001 to April 2002, and ‘‘Churn" was defined as the event in which
a subscription was terminated by the end of April 2002. That is, chum happened during the period
from December 2001 to April 2002. For churners' 3-month, 2-month, and 1-month prior data was
collected before the actual termination. For the non-chumers, the most recent last 3 months of data
was collected (from February 2002 to April 2002).
From the collected data they extracted the subscriber's usage and billing data and also the
demographic data. The available data consisted of:
I. Billed amounts.
II. Accumulated loyalty points.
III. Call quality-related indicators.
IV. Handset-related information.
V. Calling plans.
VI. Gender.
18
Page 30
The results showed that dissatisfaction indicators, such as number of complaints and call drop rate
have a significant impact on the probability o f chum. Besides, it was revealed that loyalty points
such as membership card programs have a significant negative impact on the probability of
customer chum.
Moreover, surprisingly the findings showed that heavy users are more likely to chum and also
customer status was found to have significant impact on the probability of chum. In addition they
found out that customer status has a significant impact on the probability of chum. Change of
customer's status from active use to either non-use or suspended increases the chum probability.
19
Page 31
CHAPTER 3: METHODOLOGY
3.1 Introduction
Survival analysis is a collection of statistical methods which model time-to-event data. Central is
the occurrence of a well-defined ‘event’. The variable of interest is the time until this event occurs.
This is in contrast with approaches like regression methods and neural networks which model the
probability of an event. Depending on its application, the event of interest can be the failure of a
physical component or the time to death. In the context of data mining the event of interest is
typically the time until chum or the time until the next purchase.
There are many different types of survival models. Of concern will be survival model that
incorporate a regression component, since these regression models can be used to examine the
influence o f explanatory variables on the event time. In this context, such explanatory variables are
often called covariates. There are two commonly used classes of regression models, that is:
I. Accelerated failure time models.
II. Proportional hazard models.
Accelerated failure time models are based on a survival distribution. Common employed
distributions are Weibull, exponential and log-logistic. In accelerated failure time models, the
regression component affects survival time by rescaling the time axis. The Cox proportional hazard
model is the most popular survival regression model available. It does not make any assumptions on
the survival function as opposed to accelerated failure time models. The regression component
affects the hazard curve through multiplication. Many improvements and adjustments have been
made to the Cox model since the introduction of the model.
20
Page 32
Non-parametric approach covers techniques that do not rely on data belonging to any particular
distribution. These include, among others, distribution free methods, which do not rely on
assumptions that the data are drawn from a given probability distribution. As such it is the opposite
of parametric statistics. It includes non-parametric statistical models, inference and statistical tests
and non-parametric statistics (in the sense o f a statistic over data, which is defined to be a function
of a sample that has no dependency on a parameter), whose interpretation does not depend on the
population fitting any parameterized distributions. Statistics based on the ranks of observations are
one example of such statistics and these play a central role in many non-parametric approaches.
Decision tree model of commutation which is an algorithm or communication process is considered
to be basically decision tree, that is, a sequence of branching operations based on comparisons of
some quantities, the comparisons being assigned the unit computational cost. Data mining
techniques will be used to obtain data from the enterprise warehouse the modelling purpose.
3.2 Data Mining
Data mining, or knowledge discovery, is the computer-assisted process of digging through and
analysing enormous sets of data and then extracting the meaning of the data. Data mining tools
predict behaviours and future trends, allowing businesses to make proactive, knowledge-driven
decisions. Data mining tools can answer business questions that traditionally were too time-
consuming to resolve. They scour databases for hidden patterns, finding predictive information that
experts may miss because it lies outside their expectations.
Data mining derives its name from the similarities between searching for valuable information in a
large database and mining a mountain for a vein of valuable ore. Both processes require either
21
Page 33
sifting through an immense amount of material, or intelligently probing it to find where the value
resides.
Although data mining is still in its infancy, companies in a wide range of industries - including
retail, finance, health care, manufacturing transportation, and aerospace - are already using data
mining tools and techniques to take advantage of historical data. By using pattern recognition
technologies and statistical and mathematical techniques to sift through warehoused information,
data mining helps analysts recognize significant facts, relationships, trends, patterns, exceptions and
anomalies that might otherwise go unnoticed.
For businesses, data mining is used to discover patterns and relationships in the data in order to help
make better business decisions. Data mining can help spot sales trends, develop smarter marketing
campaigns, and accurately predict customer loyalty. Specific uses of data mining include:
I. Market segmentation - Identify the common characteristics of customers who buy the same
products from your company.
II. Customer chum - Predict which customers are likely to leave your company and go to a
competitor.
III. Fraud detection - Identify which transactions are most likely to be fraudulent.
IV. Direct marketing - Identify which prospects should be included in a mailing list to obtain the
highest response rate.
V. Interactive marketing - Predict what each individual accessing a Web site is most likely
interested in seeing.
VI. Market basket analysis - Understand what products or services are commonly purchased
together. For example, beer and diapers.
VII. Trend analysis - Reveal the difference between typical customers this month and last.
22
Page 34
Data mining technology can generate new business opportunities by:
I. Automated prediction of trends and behaviours - Data mining automates the process of finding
predictive information in a large database. Questions that traditionally required extensive hands-
on analysis can now be directly answered from the data. A typical example of a predictive
problem is targeted marketing. Data mining uses data on past promotional mailings to identify
the targets most likely to maximize return on investment in future mailings. Other predictive
problems include forecasting bankruptcy and other forms of default, and identifying segments
of a population likely to respond similarly to given events.
II. Automated discovery of previously unknown patterns - Data mining tools sweep through
databases and identify previously hidden patterns. An example of pattern discovery is the
analysis of retail sales data to identify seemingly unrelated products that are often purchased
together. Other pattern discovery problems include detecting fraudulent credit card transactions
and identifying anomalous data that could represent data entry keying errors.
Using massively parallel computers, companies dig through volumes of data to discover patterns
about their customers and products. For example, grocery chains have found that when men go to a
supermarket to buy diapers, they sometimes walk out with a six-pack of beer as well. Using that
information, it's possible to lay out a store so that these items are closer.
I. While large-scale information technology has been evolving separate transaction and analytical
systems, data mining provides the link between the two. Data mining software analyses
relationships and patterns in stored transaction data based on open-ended user queries.
II. Classes - Stored data is used to locate data in predetermined groups. For example, a restaurant
chain could mine customer purchase data to determine when customers visit and what they
typically order. This information could be used to increase traffic by having daily specials.
23
Page 35
Clusters - Data items are grouped according to logical relationships or consumer preferences.
For example, data can be mined to identify market segments or consumer affinities.
J. Associations - Data can be mined to identify associations. The beer-diaper example is an
example of associative mining.
\J. Sequential patterns - Data is mined to anticipate behaviour patterns and trends. For example, an
outdoor equipment retailer could predict the likelihood of a backpack being purchased based on
a consumer's purchase of sleeping bags and hiking shoes.
Data mining consists of five major elements: Extract, transform, and load transaction data onto the
data warehouse system, store and manage the data in a multi-dimensional database system, provide
data access to business analysts and information technology professionals., analyse the data by
application software and present the data in a useful format, such as a graph or table.
Different levels of analysis available include:
I Artificial neural networks - Non-linear predictive models that learn through training and
resemble biological neural networks in structure.
H Genetic algorithms - Optimization techniques that use process such as genetic combination,
mutation, and natural selection in a design based on the concepts ol natural evolution.
III. Decision trees - Tree-shaped structures that represent sets of decisions. These decisions generate
rules for the classification of a dataset. Specific decision tree methods include Classification and
Regression Trees (CART) and Chi-square Automatic Interaction Detection (CHAID). CART
and CHAID are decision tree techniques used for classification of a dataset. They provide a set
of rules that you can apply to a new (unclassified) dataset to predict which records will have a
given outcome. CART segments a dataset by creating 2-way splits while CHAID segments
24
Page 36
using chi square tests to create multi-way splits. CART typically requires less data preparation
than CHAID.
IV. Nearest neighbour method - A technique that classifies each record in a dataset based on a
combination of the classes of the k record(s) most similar to it in a historical dataset sometimes
called the k-nearest neighbour technique.
V. Rule induction - The extraction of useful if-then rules from data based on statistical
significance.
VI. Data visualization - The visual interpretation of complex relationships in multidimensional data.
Graphics tools are used to illustrate data relationships.
3.3 Cox Proportional Hazard Model
Survival model, models data which has three main characteristics:
I. The dependent variable or response is the waiting time until the occurrence of a well-defined
event.
II. Observations are censored, in the sense that for some units, the event of interest has not
occurred at the time the data are analyzed.
III. Predictors or explanatory variables whose effect on the waiting time we wish to assess or
control.
Let T be a non-negative random variable representing the waiting time until the occurrence of an
event. For simplicity we will adopt the terminology of survival analysis, referring to the event of
interest as 'chum' and to the waiting time as survival' time, but the techniques to be studied have
much wider applicability.
25
Page 37
We will assume for now that T is a continuous random variable with probability density function
{p.d.J) f ( t ) and cumulative distribution function (c.d.J) F ( t ) . More precisely,
F (t) = Prob{T < t ) = f ( x ) d x . (3.1)
Survival function S ( t ) which gives the probability of being alive at duration / is defined as
5 (t) = P { T > t ) = 1 - F ( t) = J " f t o d x . (3.2)
Hazard function which is instantaneous rate o f occurrence of the event is defined as
P{t< T< t+ St /t > t}h { t) = lim5t^ 0 s t
(3.3)
The conditional probability in the numerator may be written as the ratio of the joint probability that
T is in the interval ( t , t + S t ) and T > t (which is, of course, the same as the probability that t is
in the interval), to the probability of the condition T > t the former may be written as f { t ) S t for
small S t , while the latter is S ( t ) by definition. Dividing by S t and taking to the limit
8t -> 0 yields the result
m = £r t- 0 -4>v ' S (t)
The rate of occurrence of the event at duration t equals the density of events at t , divided by the
probability of surviving to that duration without experiencing the event. We note, from Equation
(3.1) that f ( t ) is the derivative of F ( t ) . This suggests rewriting Equation (3.3) as
h ( t ) = - j t l o g S ( t ) (3.5)
As mentioned, survival analysis typically examines the relationship of the survival distribution to
covariates. Most commonly, this examination entails the specification of a linear-like model for the
26
Page 38
log hazard. For example, a parametric model based on the exponential distribution may be written
as
log (h i( t)) = a + /?!**! + (32Xi2 + ••• + P k Xik (3-6)
or equivalently
h ,(t) = exp(a + + p 2Xi2 + •” + P k x ik)
which is a linear model for the log-hazard or multiplicative model for the hazard. Here, i is a
subscript for observation, and x r , x 2 , x 3, ..., x k are the covariates.
The constant a in this model represents a kind of log-baseline hazard, since
loghi(t) = o r w h e n all of the x 1 , x 2 , x 3 , . . . , x k are zero. (3.8)
The Cox model, in contrast, leaves the baseline hazard function c*(t) = l o g h 0 ( t ) unspecified
log (hj(£)) = log (/l0(f)) + P l x il + @2x i2 + b P k x ik (3-9)
or equivalently
h*(0 = Aio(t) exp (/?i*ii + P 2x i2 + "• + P k x ik) (3.10)
which is the Cox proportional hazard model
Assumptions of Cox proportional hazard model:
I. Non-informative censoring - To satisfy this assumption, the design of the underlying study
ensures that the mechanisms giving rise to censoring of individual subjects are not related to27
Page 39
the probability of an event occurring. Here censoring occurs when subscriber is on pooled
status and is not related to censoring where subscriber revenue decrease by more than 70
percent.
11. Proportional hazards - Here the survival curves for two strata (determined by the particular
choices of values for the x variables) have hazard functions that are proportional over time,
that is, constant relative hazard.
For partial likelihood estimates, instead of using probability density functions from a parametric
distribution, we use the probability of failure conditional on being in the risk set. Suppose we have
a data set with k observations and q distinct failure (event) times. Cox estimation first proceeds by
sorting the ordered failure times, such that t 1 < t 2 < . . . < t q, where t* denotes the failure time
for the i t h individual. For censored cases, we define i to be 0 if the case is right-censored, and 1 if
the case is uncensored. Finally, the ordered event times are modeled as a function of covariates X[.
The partial likelihood function is derived by taking the product ol the conditional probability ot a
failure at time tj, given the number of cases that are at risk of failing at time t[ . We define R ( t [ ) to
denote the number of cases that are at risk o f experiencing an event at time t,-. that is, the risk set,
then the probability that the j t h case will fail at time T; is given by
P(Tj = tj / K ( t i ) ) =>P'*i
Z je R d to e P X)
(3.11)
where the summation operator in the denominator is summing over all individuals in the risk set.
Faking the product of the conditional probabilities in Equation (3.11) yields the partial likelihood
function
28
Page 40
(3.12)
with a corresponding log-likelihood function
(3.13)
where 5, takes the values 1 if the ith individual is uncensored and 0 if right-censored.
The partial likelihood function depends only on ordered duration times, where numerator depends
on all cases with an observed failure and denominator the observation gets repeated as often as it
succeeds when others fail. By maximizing the log-likelihood in Equation (3.13), estimates ol the /?
may be obtained. The results are important in specifying:
I. The baseline hazard therefore /i0( t) is unnecessary.
II. The interval between events does not inform the partial likelihood function.
III. Censored cases contribute information only pertinent to the risk set (that is, the denominator,
not the numerator)
To hand le ties, the Breslow Method is used. It assumes that the risk set does not change among tied
failure times.
(3.14)
29
Page 41
Where, d( denote the multiplicity of failures at t,-, that is, d, is the size of the set D, of individuals
that fail at t, and s, being the sum of the vectors over the individuals who fail at t t .
3.4 Decision tree model
A decision tree depicts rules for dividing data into groups. The first rule splits the entire data set
into some number of pieces, and then another rule may be applied to a piece, different rules to
different pieces, forming a second generation of pieces. In general, a piece may be either split or
left alone to form a final group.
The tree depicts the first split into pieces as branches emanating from a root and subsequent splits as
branches emanating from nodes on older branches. The leaves of the tree are the final groups of the
un-split nodes. For a tree to be useful, the data in a leaf must be similar with respect to some target
measure, so that the tree represents the segregation of a mixture ol data into purified groups.
The decision tree is used to put the performance of the survival model in perspective. Decision
trees can be split into classification and regression trees. Classification trees are used to predict a
categorical outcome, whereas regression trees are used in case of a continuous outcome. Since we
are dealing with a binary outcome, that is, chum, a classification tree is used. In a decision tree each
interior node corresponds to a variable.
An arc to a child represents a possible value of that variable. A leaf represents the outcome given
the values of the variables represented by the path from the root. One of the advantages of decision
trees is that they can be very easily interpreted, since they produce a set of understandable rules.
30
Page 42
Neural networks, on the other hand, are so called black boxes. A trained neural network contains
several optimized parameters and weights which cannot be interpreted easily. It is therefore not
possible to understand why a neural network gives a particular outcome.
A decision tree is a supervised model and thus requires a labeled training set. The outcome ol an
observation, ‘churn' or ‘non-chum', is indicated by a 1 or 0 respectively.
The splitting criterion used in this study is the Gini-index. The Gini-index is a measure of impurity
of a split at a particular node. The Gini-index is defined as:
l - Z kPl (3.15)
where k indicate the different classes and p ^ denotes the relative frequency ol k classes. The
lowest value for the Gini-index is used for splitting the node's observations.
Optimal tree size will be got by over fitting to capture artifacts and noise present in the dataset.
However predictive power is lost. Therefore we will use pre-pruning and post-pruning.
Oversampling will be done by altering the proportion of the outcomes in the training set. This will
increase the proportion of the less frequent outcome (chum) since chum is considered less likely
event.
Advantages of decision tree include:
I. Decision trees implicitly perform variable screening or feature selection. When we fit a
decision tree to a training dataset, the top few nodes on which the tree is split are essentially
the most important variables within the dataset and feature selection is completed
automatically.
II. They require relatively little effort from users for data preparation. To overcome scale
differences between parameters - for example if we have a dataset which measures revenue
31
Page 43
in millions and loan age in years, say, this will require some form of normalization or
scaling before we can fit a regression model and interpret the coefficients. Such variable
transformations are not required with decision trees because the tree structure will remain
the same with or without the transformation. Also decision trees are also not sensitive to
outliers since the splitting are based on proportion of samples within the split ranges and not
on absolute values.
III. Nonlinear relationships between parameters do not affect tree performance. Highly
nonlinear relationships between variables will result in failing checks for simple regression
models and thus make such models invalid. However, decision trees do not require any
assumptions of linearity in the data. Thus, we can use them in scenarios where we know the
parameters are nonlinearly related.
IV. The best feature of using trees for analytics is that it is easy to interpret and explain.
However, without proper pruning or limiting tree growth, they tend to over fit the training data,
making them somewhat poor predictors.
3.5 Population and Study Sample
Population under consideration will be all Safaricom pre-paid subscribers who are on active status.
A sample will be selected randomly and data partitioned into training, validation and test. Initial
hypothesis will be set based on the conventional method criterion. Aim of mobile
telecommunication company is to detect and intervene on chum before it actually occurs. Initial
criterion will be set by denoting non- chumers by 0 and chumers by 1.
32
Page 44
Cox proportional hazard model and decision tree model will be applied to this data set in order to
improve on the initial hypothesis set. The aim will be to find out the most suitable model to predict
chum that improves on the initial hypothesis based on the computed Gini coefficient and
Kolmogorov-Smimov statistic.
3.6 Test statistics for Model Comparison
3.6.1 ROC Curve
Receiver Operating Characteristic (ROC), or simply ROC curve, is a graphical plot which illustrates
the performance of a binary classifier system as its discrimination threshold is varied. It is created
by plotting the fraction of true positives out of the positives (TPR = true positive rate) versus the
fraction of false positives out of the negatives (FPR = false positive rate), at various threshold
settings. TPR is also known as sensitivity, and FPR is one minus the specificity or true negative
rate.
Figure 3.1 ROC Curve
33
Page 45
ROC analysis provides tools to select possibly optimal models and to discard suboptimal ones
independently from (and prior to specifying) the cost context or the class distribution. ROC
analysis is related in a direct and natural way to cost/benefit analysis of diagnostic decision making.
The ROC curve was first developed by electrical engineers and radar engineers during World War
II for detecting enemy objects in battlefields and was soon introduced to psychology to account for
perceptual detection of stimuli. ROC analysis since then has been used in medicine, radiology,
biometrics, and other areas for many decades and is increasingly used in machine learning and data
mining research.
The ROC is also known as a relative operating characteristic curve, because it is a comparison of
two operating characteristics (TPR and FPR) as the criterion changes.
ROC curves, although constructed from sensitivity and specificity, do not depend on the decision
threshold. In an ROC curve, every possible decision threshold is considered. An ROC curve is a
plot of a test's false-positive rate (FPR), or 1 — specificity (plotted on the horizontal axis), versus its
sensitivity (plotted on the vertical axis). Each point on the curve represents the sensitivity and FPR
at a different decision threshold. The plotted (FPR, sensitivity) coordinates are connected with line
segments to construct an empirical ROC curve.
Further, the ROC curve of the test provides much more information about how the test performs
than just a single estimate of the test's sensitivity and specificity. Given a test's ROC curve, product
managers can examine the trade-offs in sensitivity versus specificity for various decision thresholds.
Based on the relative costs of false-positive and false-negative product managers can choose the
optimal decision threshold.
34
Page 46
Often, chum management is more complex than is allowed with a decision threshold that classifies
the test results into positive or negative.
ROC Curve is created based on no assumptions of normal distribution. The multiple predictors can
be evaluated simultaneously. It normally indicates interactions among predictors. Further, it
indicates cut-points on these predictors and yields relevant information. It used for non-hypothesis
testing and requires large samples.
3.6.2 Kolmogorov-Smirnov Test (K-S Test)
The Kolmogorov-Smimov (or K-S) tests were developed in the 1930s. The tests compare either one
observed distribution, with a completely specified distribution or two observed distributions. In the
first case, the procedure involves finding the size of the largest difference of the empirical
distribution function and the specified distribution while in the second case the procedure involves
finding the size of the largest difference between the empirical distribution functions.
Assumptions are that sample is random (or both samples are random) and independent if two
samples are involved. The scale of measurement should be at least ordinal and preferably
continuous.
Hypotheses are stated as
Hq : F (x) = G ( x ) for all x versus H x: F(x) =£ G ( x ) for at least one value of*.
35
Page 47
The test statistics is computed as
D m .n = S“P |F„00 - GmWI (316)
where sup means supremum, or largest value of a set, m is the number of subscribers who chum
while n is the number of subscribers who do not chum based on the initial criteria, Fn ( x ) is the
Empirical Distribution Function (EDF) corresponding to F ( x ) and Gm ( x ) is the EDF corresponding
toG(jc) so that at oc-level of significance
F(Pm,n ^ ^m,n,oc) — (3.17)
where d m n K is the critical value which is tabulated. Reject H0 if the value of d m n > d m n oc.
Asymptotic approximation, that is, for large m and n,
01 , 1
so that
d m,n,« (3.19)
for selected values of oc. For example, for oc— 0.05, d — 1.36. In this sense therefore, large values
of the K-S, D m „ lie in the rejection region, that is, discredits the null hypothesis which implies that
the distributions are different. Thus, K-S discriminates, for large values of the statistic.
36
Page 48
An attractive feature of this test is that the distribution of the K-S test statistic itself does not depend
on the underlying c . d . f being tested. Another advantage is that it is an exact test (the chi-square
goodness-of-fit test depends on an adequate sample size for the approximations to be valid).
Despite these advantages, the K-S test has several important limitations:
I. It only applies to continuous distributions.
II. It tends to be more sensitive near the centre of the distribution than at the tails.
III. Perhaps the most serious limitation is that the distribution must be fully specified. That is, il
location, scale, and shape parameters are estimated from the data, the critical region of the
K-S test is no longer valid. It typically must be determined by simulation.
3.6.3 Gini Coefficient
The Gini coefficient (or Gini ratio) is a summary statistic of the Lorenz curve and a measure of
inequality in a population. The Gini coefficient is most easily calculated from unordered size data as
the "relative mean difference," that is., the mean ol the difference between every possible pair ol
individuals, divided by the mean size p ,
where x is an observed value, n is the number of values observed.
When x values are first placed in ascending order, such that each x has rank /, then, some of the
comparisons above can be avoided by using
2 r? v (3.20)
(3.21)
37
Page 49
Equation (3.21) becomes
G =y " (2 ;-* -i)x ,
» Z i-T (3.22)
where .v is an observed value, n is the number of values observed and i is the rank of values in
ascending order. In this case only positive non-zero values are used.
The Gini coefficient ranges from a minimum value of zero, when all individuals are equal, to a
theoretical maximum of one in an infinite population in which every individual except one has a
size of zero. It has been shown that the sample Gini coefficients defined above need to be
multiplied by n(n — 1) in order to become unbiased estimators for the population coefficients.
The Gini coefficient's main advantage is that it is a measure of inequality by means of a ratio
analysis. This makes it easily interpretable, and avoids references to a statistical average or position
unrepresentative of most of the population, such as per capita income or gross domestic product.
The simplicity of Gini coefficient makes it easy to use for comparison across diverse countries and
also allows comparison of income distributions across different groups as well as countries.
Like any time-based measure, Gini coefficients can be used to compare income distribution over
time, thus it is possible to see if inequality is increasing or decreasing independent of absolute
incomes. The Gini coefficient satisfies four principles suggested to be important:
I. Anonymity - It does not matter who the high and low earners are.
II. Scale independence - The Gini coefficient does not consider the size of the economy, the
way it is measured, or whether it is a rich or poor country on average.
III. Population independence - It does not matter how large the population of the country is.
IV. Transfer principle - If income (less than the difference), is transferred from a rich person to a
poor person the resulting distribution is more equal.
The limitations of Gini coefficient largely lie in its relative nature. Considering general and specific,
it loses information about absolute general and specifics. For example, countries may have identical
38
Page 50
Gini coefficients, but differ greatly in wealth. Basic necessities may be available to all in a rich
country, while in the poor country, even basic necessities are unequally available.
3.7 Modelling Process
Model process includes the following four major steps.
3.7.1 Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs
a variety of techniques (mostly graphical) to:
I. Maximize insight into a data set.
II. Uncover underlying structure.
III. Extract important variables.
IV. Detect outliers and anomalies.
V. Test underlying assumptions.
VI. Develop parsimonious models.
VII. Determine optimal factor settings.
Focus of EDA approach is an attitude/philosophy about how a data analysis should be
carried out. EDA is not identical to statistical graphics although the two terms are used
almost interchangeably. Statistical graphics is a collection of techniques all graphically
based and all focusing on one data characterization aspect. EDA encompasses a larger
venue. EDA is an approach to data analysis that postpones the usual assumptions about what
kind of model the data follow with the more direct approach of allowing the data itself to
reveal its underlying structure and model. EDA is not a mere collection of techniques but a
39
Page 51
philosophy as to how we dissect a data set, what we look for, how we look and how we
interpret. It is true that EDA heavily uses the collection of techniques that we call "statistical
graphics", but it is not identical to statistical graphics per se.
Most EDA techniques are graphical in nature with a few quantitative techniques. The reason
for the heavy reliance on graphics is that by its very nature the main role of EDA is to open-
mindedly explore, and graphics gives the analysts unparalleled power to do so, enticing the
data to reveal its structural secrets, and being always ready to gain some new, often
unsuspected, insight into the data. In combination with the natural pattern-recognition
capabilities that we all possess, graphics provides, of course, unparalleled power to carry
this out.
The particular graphical techniques employed in EDA are often quite simple, consisting of
various techniques of:
I. Plotting the raw data (such as data traces, histograms, bi-histograms, probability
plots, lag plots, block plots, and Youden plots.
II. Plotting simple statistics such as mean plots, standard deviation plots, box plots, and
main effects plots of the raw data.
III. Positioning such plots so as to maximize our natural pattern-recognition abilities,
such as using multiple plots per page.
The key point is that regardless of how many factors there are, and regardless of how
complicated the function is, if a good model is selected, then the differences (residuals)
between the raw response data and the predicted values from the fitted model should
themselves behave like a univariate process. Furthermore, the residuals from this univariate
40
Page 52
process fit will behave like random drawings from a fixed distribution with fixed location
(namely, 0 in this case) and with fixed variation.
Thus, if the residuals from the fitted model do in fact behave like the ideal, then testing of
underlying assumptions becomes a tool for the validation and quality of fit of the chosen
model. On the other hand, if the residuals from the chosen fitted model violate one or more
of the above univariate assumptions, then the chosen fitted model is inadequate and an
opportunity exists for arriving at an improved model.
3.7.2 Variable Reduction
One of the first steps in data mining or business analytics problem solving is the process of
eliminating variables which are not significant. There are a couple of reasons for taking this
step. The most obvious reason is that going from a few hundred variables to a handful will
make the interpretation of the results easy. The second and probably more critical reason is
that many modeling techniques become useless as the number of parameters increases. This
is known as the curse of dimensionality.
Probably the simplest way of determining significant variables is to compute the correlation
coefficient y between all pairs of parameters and only select those that exceed a certain cut
off value (say 0.6). However, there are two problems with this method:
I. As the number of variables increases, the data storage requirement for saving these
coefficients increases as (nearly) the square of the number of variables.
41
Page 53
II. More importantly, lor relationships that are non-linear, y is not a very good indicator
of correlation.
To overcome these issues, the chi-square technique can be used. It is easy to see how the
chi-square technique would work in this case: assuming that a target variable is selected,
every parameter is checked in turn to see if the chi-square test detects the existence of a
relationship between the parameter and the target. If the target variable is continuous, it can
be converted into a categorical variable by a simple "binning" process.
If all the variables are continuous, the binning process can still be applied and then the chi-
square test be used. However, entropy based methods can be applied here much more easily.
The advantage of entropy based methods is that they will work even if there is no target
variable. The process is involves computing Shannon entropy for all variables. For every
pair of variables, for a total of p * (P _ l) /2 , mutual information is computed. Finally,
those variables which contribute to more than a given fraction ol the overall information
exchanged within the data set are selected as the key variables. This method is somewhat
similar to the more traditional F-value technique which ensures that the key variables
account for a significant amount of the total variance ol the target variable.
3 7.3 Model Estimation
This involves constructing the model based on the reduced number of variables. Here, the
models are developed based on the decision tree criteria and cox proportional hazard model.
A n branch decision tree is fitted and best tree branch identified. On the other hand. K-S
statistics is fitted where ties are corrected using Breslow method.
42
Page 54
The models are scored using the ROC curve with the conventional model as the baseline.
The improvement is measured using the Gini coefficient and the K.-S statistics. The higher
the two, is the better model.
3.7.4 Model Validation
Model verification and validation are essential parts ol the model development process il
models to be accepted and used to support decision making. Experience has shown that the
model is unlikely to be adopted or even tried out in a real-world setting. Olten the model is
“sent back to the drawing board".
Verification is done to ensure that:
I. The model is programmed correctly.
II. The algorithms have been implemented properly.
III. The model does not contain errors, oversights, or bugs.
Verification ensures that the specification is complete and that mistakes have not been made
in implementing the model. However, verification does not ensure the model:
I. Solves an important problem.
II. Meets a specified set of model requirements.
III. Correctly reflects the workings of a real world process.
No computational model will ever be fully verified, guaranteeing 100 percent error-free
implementation. A high degree of statistical certainty is all that can be realized for any
model as more cases are tested statistical certainty is increased as important cases are tested.
43
Page 55
In principle, a properly structured testing program increases the level of certainty for a
verified model to acceptable levels. Model verification proceeds as more tests are
performed, errors are identified, and corrections are made to the underlying model, often
resulting in retesting requirements to ensure code integrity.
Validation ensures that the model meets its intended requirements in terms of the methods
employed and the results obtained. The ultimate goal of model validation is to make the
model useful in the sense that the model addresses the right problem and provides accurate
information about the system being modeled.
Modeling and simulation are carried out because:
I. We are constrained by linear thinking - We cannot understand how all the various
parts of the system interact and add up to the whole.
II. We cannot imagine all the possibilities that the real system could exhibit.
III. We cannot foresee the full effects of cascading events with our limited mental
models.
IV. We cannot foresee novel events that our mental models cannot even imagine.
Validation exercises amount to a series of attempts to invalidate a model. Presumably, once
a model is shown to be invalid, the model is salvageable with further work and results in a
model having a higher degree of credibility and confidence. The end result of validation is
technically not a validated model, but rather a model that has passed all the validation tests.
Unlike physical systems, for which there are well established procedures for model
validation, no such guidelines exist for social modeling. In the case of models that contain
44
Page 56
elements of human decision making, validation becomes a matter of establishing credibility
in the model. Verification and validation work together by removing barriers and objections
to model use. The task is to establish an argument that the model produces sound insights
and sound data based on a wide range of tests and criteria that stand in tor comparing
model results to data from the real system. The process is akin to developing a legal case in
which a majority of evidence is compiled about why the model is a valid one toi its
purported use.
45
Page 57
CHAPTER 4: DATA ANALYSIS AND RESULTS
4.1 Exploratory Data Analysis
Exploratory Data Analysis was conducted prior to modelling. A univariate frequency analysis was
used to pinpoint value distributions, missing values and outliers. Variable transiormation was
conducted for some necessary numerical variables to reduce the level ol skewness, because
transformations are helpful to improve the fit of a model to the data. The demographic variables
with more than 50 percent of missing values were eliminated.
For observations with missing values, we had a choice to use incomplete observations which may
have made us ignore useful information from the variables that have non-missing values. Also, bias
the sample since observations that have missing values may have other things in common as well.
For interval variables, replacement values were calculated based on the random percentiles of the
variable's distribution, that is, values were assigned based on the probability distribution of the non
missing observations. Missing values for class variables were replaced with the most frequent
values (count or mode).
The figure below shows part of the Exploratory Data Analysis done on the 634 variables available.
It shows the minimum, maximum and mean of each variable under consideration.
46
Page 58
Table 4.1 Sample statistics variables minimum, mean and maximum values
0 Simple Statistics
0te« V a r ia b le N a m e T y p e P e r c e n t ... M in im um M ax im um M e a n Num ber .. M ode P e ... M ode
1 CHURN STATUS CLASS 0 2 73 2502 MR ORGN CLASS 0 .128+ 0.7751947000404803 AGE VAR 0 3 85 32 3884 AON VAR 0 94 4148 1420 8365BN0L DRTN 2G SITES M VAR 0 0 9660860 100305 8.6BNDL DRTN 3G SITES M VAR 0 0 9856623 70398.7.7BNDL DRTN M VAR 0 0 10743877 180078 9.8BNDL DRTN OTHER SITES M VAR 0 0 2524621 9301 337.9BNDL REV 2G SITES M VAR 0 0 2954 847 51.64544.10BNDL REV 3G SITES M VAR 0 0 1045437 66 799.11BNDL REV M VAR 0 0 10491 134 773.12BNDL REV OTHER SITES M VAR 0 0 4561 894 7 459061.13BNDL USAGE 2G SITES M VAR 0 0 1726734 21.3913514BNDL USAGE 3G SITES M VAR 0 0 21181 45 84 9411915BNDL USAGE M VAR 0 0 2127438 115.30416BNDL USAGE OTHER SITES M VAR 0 0 701507 8 97150617BUNDL QTY 2G SITES M VAR 0 0 537447 1304161.18 BUNDL QTY 3G SITES M VAR 0 0 15039 19 96.0369219BUNDL QTY M VAR 0 0 15125 239 8815.20BUNDL QTY OTHER SITES M VAR 0 0 4750 643 134284721 HANDSET ACCESS BNDL DRTN..VAR 0 0 10743877 138378.8.22 HANDSET ACCESS BNDL REV M VAR 0 0 6400 62 2812823 HANDSET ACCESS BNDL USAG..VAR 0 0 7898 803 34 6570124HANDSET ACCESS BUNDL QTY..VAR 0 0 13222 162 564125 HANDSET ACCESS FREQ M VAR 0 0 49668 551 064.26 HANDSET ACCESS UNBNDL D VAR 0 0 1875028 17478 8627 HANDSET ACCESS UNBNDL RE VAR 0 0 7654 933 147 5814.28HANDSET ACCESS UNBNDL US...VAR 0 0 1815.728 15.58705.29HANDSET ACCESS UNBUNDL_ VAR 0 0 6802997 98 82374.30 ID SBSC VAR 0 2305233 2.054E8 4172293531MAINACCOUNTBAL VAR 0.1 0 20014 66 38 65399.32 MODEM ACCESS BNDL DRTN M VAR 0 0 3150382 16990 09.33 MODEM ACCESS BNDL REV M VAR 0 0 10491 491831934MODEM ACCESS BNDL USAGE MVAR 0 0 183751 6347668. •
35 MODEM ACCESS BUNDL QTY M VAR 0 0 12317 47 03508.36MODEM ACCESS FREQ M VAR 0 0 12346 46.285. *37MODEM ACCESS UNBNDL DR1 VAR 0 0 1218759 282289838MODEM ACCESS UNBNDL REV VAR 0 0 5261 912 8 765741.39 MODEM ACCESS UNBNDL USA VAR 0 0 1269397 1 90331240 MOOEM_ACC£SS_UNBUNDL_Q1r_ VAR 0 0 191 9618 0 643692.
Row 3 shows the age of a subscriber. The minimum age shows 3 years while the maximum age is
85. However, it is not possible to have a 3 year old registered as a subscriber thus principles of
variable reduction are important to remove such observations. AON is given in days, the minimum
being 3 months. Minimum airtime balance for all subscribers at midnight tor the duration under
consideration was ksh. 0 and a maximum of ksh. 20,014. The table below shows the values of
selected variables for a group of subscribers.
47
Page 59
Table 4.2 Sample variables per subscriber
□ Sample TabeD.SBSC CHURN.STA
TUSTOTAL_USAGE_M
TOTAL.REV TOTAL.QTV_M
TOTAL_DRTN_M
BNDLJJSAGE_M
UNBNDLJJSAGE_M
BNDL_REV_M
18222140 0 37.72657 99 1798 239 287829 31.47128 6.255291 702349837 0 1.121726 11.661 38 2657 0 1.121726 049191983 0 0.00971 0.1248 10 122 0 0.00971 046686056 0 76.51754 206 8575 1047 374207 75.8476 0.669943 16047720135 0 182.554 316.137 420 154102 169.7288 1282521 2722875850 0 56.75049 947.0866 318 107723 0 5675049 02.031 E8 0 18.45136 36.8657 134 6812 14.99941 3.451951 138453585 c 326.4189 5743675 2859 1162622 318.2805 8.138385 53018175492 c 7.694746 63.96 23 5460 0 7.694746 048694435 0 77.9802 3991246 454 212164 49.26755 2871265 9520465E8 1 179.9791 420 150 18885 177.4735 2.505624 40011991309 0 8606532 4230465 424 346412 85.03944 1.025878 36520334E8 G 73.36142 215.4376 408 132222 61 26743 12094 582850990 G 5 67794 2036542 60 10809 0 5.67794 03675053 GI 1.688952 14.2272 21 2796 0 1.688952 0
14177092 CI 89 41966 2086317 508 389668 72.1074 17.31226 10014448625 c 176.2925 854 4229 601 577303 59 3347 116.9578 12521016130 1 1099941 11 2097 12 4620 10.99941 0 850246299 c 12.30861 223.9848 69 13473 0 12.30861 011776070 1I 255.1367 380.6545 660 226892 251.1375 3.999153 31312107891 0 1586147 609662 49 12429 1295741 2.904059 3547219497 0 0.610638 4.992 23 1232 0 0.610638 03476979 1 5.245648 620724 36 5816 0 5.245648 0
51761737 0 0.07637 1.2012 22 249 0 0.07637 020707904 0 74.3375 437.2189 1200 819109 37.54163 3679587 7851363131 1 0001416 0.0156 1 36 0 0001416 03269997 ) 11.67319 112.944 55 15508 0.015158 1165803 0
49996852 3 61.05821 275.6779 375 300826 57.10089 3.957321 2452.0074E8 1 G 56 0 0 0 0 5651865336ooioam 3 27 32361 74 8118 78 52527 2689256 0.431053 55
Highest usage for the selected subscribers was ksh 326, revenue derived from the subscriber was
ksh. 574, with number of calls made by the subscriber being 2,859. Bundled usage for the
subscriber was 318 megabytes and the out of bundle usage was 8 megabyte.
48
Page 60
Figure below shows distribution of demographic variables
Figure 4.1 Exploration of AON distribution
24 percent o f the subscribers sampled have been Safaricom subscribers for between 3 months and
16 months. 15 percent have been subscribers for between 16 and 30 months while 12 percent ol
them between 30 and 43 months and the rest above 43 months.
Figure 4.2 Exploration of age distribution
49
Page 61
statistically significant categorical variables to be included in the next modelling step. All the
categorical variables with a chi-square value 0.05 or less are retained. This step reduced the number
of variables including all the numerical variables and the kept categorical variables from the step
one. The next step was to use PROC PH REG to further reduce the number of variables. A stepwise
selection method will be used to create a final model with statistically significant effects of the
exploratory variables on customer chum over time.
Below is a summary o f the chum status.
Figure 4.3 Chum Status
Table below gives the actual values.
Table 4.3 Chum status
level Count Prior701810 0.2622
• 1974454 0.7378
Approximately, 700,000 subscribers would chum out of 2 million subscribers based on the
conventional model criteria.
51
Page 62
4$ percent of sampled Safaricom subscribers were between the age of 29 and 35 years. 23 percent
note between the age of 22 and 29 years. None was over 81 years and only 8 percent were above 55
.cars. 14 percent of the subscribers were between the age of 42 and 55 years.
Key
v? ,
i i
Chumers
Non-chumers
Figure 4.3 Comparing distributions o f chumers and non-chumers
The distribution of chumers and non-chumers follow the same distribution on age and AON. This
phenomenon is attributed to the random selection of the sample.
4.2 Variable Reduction
From the variables in the original data set, using PROC FREQ, an initial univariate analysis of all
categorical variables crossed with customer chum status was be carried out to determine the
50
Page 63
justically significant categorical variables to be included in the next modelling step. All the
:tecorical variables with a chi-square value 0.05 or less are retained. This step reduced the number
M'vanables including all the numerical variables and the kept categorical variables from the step
>ne. The next step was to use PROC PHREG to further reduce the number of variables. A stepwise
selection method will be used to create a final model with statistically significant effects of the
exploratory variables on customer chum over time.
Below is a summary of the chum status.
Figure 4.3 Chum Status
Table below gives the actual values.
Table 4.3 Chum status
Level Count Prior1 « ■ ■ ■ ■ ■ ■ ■ 701810 0.26220 1974454 0.7378
Approximately, 700,000 subscribers would chum out of 2 million subscribers based on the
conventional model criteria.
51
Page 64
Table 4.4 Variables Summary
Role Measurement Level Frequency Count
ID INTERVAL 1
INPUT INTERVAL 79
REJECTED NOMINAL 1
REJECTED UNARY 1
TARGET BINARY 1
ID identified each subscriber under consideration. Due to missing value criterion described
subscriber age was rejected as a nominal value and copy quantity of Skiza as unary value because
most subscribers had missing value and it had only 8 classes.
4.3 Model Estimation
With reduced exploratory variables, the final data set had reasonable number of variables to
perform analysis. Here, PROC LIFEREG was used to calculate customer survival probability. In
this step, the final data set was divided to training data set and validation data set at a ratio of 60:40
respectively. The model data set was used to fit the model and the validation data set is used to
score the survival probability for each customer. Below is the summary of subscribers based on the
partitions.
Table 4.5 Partition Summary
Type Number of Observations
DATA 2676264
TRAIN 1605757
VALIDATE 1070507
52
Page 65
Entire sample consists of 2.67 million subscribers with 1.6 million being in the training dataset and
1.07 million in the validation data set. The partitions are further classified as below indicated based
on initial criterion of chum and none chum.
Table 4.6 Summary statistics for class targets
Data=DATA
Numeric Formatted Frequency
Variable Value Value Count Percent
CHURN STATUS 0 0 1974454 73.7765
CHURN STATUS 1 1 701810 26.2235
Data=TRAIN
Numeric Formatted Frequency
Variable Value Value Count Percent
CHURN STATUS 0 0 1184671 73.7765
CHURN STATUS 1 1 421086 26.2235
Data=VALIDATE
Numeric Formatted Frequency
Variable Value Value Count Percent
CHURN STATUS 0 0 789783 73.7765
CHURN STATUS 1 1 280724 26.2235
Clearly, percentanges of chum status in the two data sets are equal to the entire dataset percentages.
53
Page 66
4.4.1 Decision Tree
A two branch decision tree was developed Gini index was used for ordinal criterion in searching for
and evaluating candidate splitting rules with 0.05 level of significance being applied. The table
below' shows the important variables picked by the decision tree.
Table 4.7 Important variables picked by the decision tree
OBS NAME NRULES
IMPORT
ANCE
VIMPORTAN
CE RATIO
1 TOTALREVM 2 1 1 1
2
USAGE_FREQ_2G_SI
TESM 5 0.73160 0.73510 1.00479
3 UNBNDLREVM 4 0.32398 0.32240 0.99513
4 TOTAL TOPUP QTY 7 0.27344 0.27740 1.01446
5
BUNDL_QTY_2G_SIT
ES_M 1 0.13962 0.13829 0.99049
6
HANDSETACCESSF
REQ M 1 0.13136 0.12322 0.93806
7 u n b n d l q t y m 1 0.11362 0.11415 1.00474
8 TOTAL TOPUP_AMT 3 0.10174 0.09831 0.96630
Here, total revenue is the most important factor that determines chum. Data usage on 2 G sites was
also important at 0.73. This was replicated when comparing individual variables contribution to
chum.
54
Page 67
The other important variables include:
I. Revenue derived from out of bundle.
II. Top-up quantity.
III. Number of times 2G sites were used to browse internet.
IV. Use of handset to access internet.
V. Number of times of out of bundle usage.
VI. Top-up amount.
4.4.2 Cox proportional hazard model
Breslow method used to handle failure time’s ties. Below is the summary of events censored and
analysis of maximum likelihood estimates.
Table 4.8 Summary of Censored Events
Summary of the Number of Event and Censored
Values
Total Event Censored
Percent
Censored
1185984 658904 527080 44.44
44 percent of all the events were censored. This is according to time to chum based on the criteria
that chum occurs when one actually leaves the network.
55
Page 68
Table 4.9 Analysis of Maximum Likelihood Estimates (MLE)
Analysis of Maximum Likelihood Estimates
r—--------------------------------------------------------------------------------- Parameter Standard Chi- Hazard
Parameter D F Estimate Error Square P > ChiSq Ratio
\JOTAL_YOICE_USAGE_ 1 3.48E-06 1.47E-07 561.942 <.0001 1
S_SMS_QTY_M 1 0.0000703 3.24E-06 469.2518 <.0001 1
D_TOTALJJSAGE_M 1 -0.0002727 8.59E-06 1007.7976 <.0001 1
PNORMALSKIZAQTY ~ n 0.01026 0.0004349 556.1185 <.0001 1.01
C_AON i -0.0004192 0.0001141 13.503 0.0002 1
According to Cox proportional hazard, chum probability is highly influenced by voice usage,
number of SMS sent, total data usage, Skiza tunes purchased and age on network. Since they are
positive it implies the hazard rate is increasing, therefore, the survival time is shortened.
4.4 Model Validation
Subscribers in the validation data set were scored for predicted chum probabilities. ROC curve was
used to compare comparison of the two models with conventional model criteria as the baseline.
But first, cumulative percent of captured response was drawn as below to show that results of
validation and training data set yield the same results.
56
Page 69
Score Rankings Overlay: CHURN.STATUS □ 0 [ S3
Cumulative % Captured Response ▼
Figure 4.4 Comparing train and validate data set
Curves reveal same performance in the two data set meaning our models are accurate.
From the ROC curve plotted,
Figure 4.5 Comparing train and validate data set using ROC
57
Page 70
11 is clear that decision tree and Cox proportional hazard model performed better that the
conventional model. Upto 0.5 level of specificity, decision tree outperformed Cox proportional
hazard model. This implies that low sensitivity, decision tree would be best, however by allowing
more error Cox proportional hazard model would be better. This means time to chum that was
incorporated in the Cox proportional hazard model would make a difference if the error margin is
increased.
Decision tree was selected as the best overall model. Below is the summary statistics showing
Kolmogorov Smirnov Statistics and Gini Coefficient
Table 4.10 Statistics Results from the fitted Models
Selected Valid: Average Valid: Kolmogorov- Valid: Gini
Model Model Description Squared Error Smirnov Statistic Coefficient
Y Decision Tree 0.13728 0.48 0.63
Cox Proportional Hazard 0.15631 0.45 0.58
Decision tree model K-S statistics dm#n was 0.48. For a level of significance oc= 0.05, dm n oc was
0.00086. Since dm n > dm „ « we reject null hypothesis and conclude that the two distributions are
different. Cox proportional hazard model yield dm n of 0.45 with a dm n oc of 0.00081 at oc= 0.05.
Since d m<n > dm n oc we reject null hypothesis and conclude that the two distributions are different.
This means the two models discriminates the two distribution based on the K-S statistics. On the
other hand, decision tree Gini coefficient was 0.63 compared to 0.58 for Cox proportional hazard
model. This implies, for the duration under consideration a two branch decision model performed
better than Cox proportional hazard model.
58
Page 71
Table below shows the probabilities of churning of selected subscribers by comparing results of
decision tree model selected and conventional model.
Table 4.11 Comparison of Decision Tree Model and Conventional Model
MSISDN Decision Tree Probability of Churn Conventional model churn criteria
7****0467 0.988672 1
7**** 1072 0.745638 1
7****1087 0.322151 1
7**** 1094 0.678493 1
7****1100 0.562877 1
7****1627 0.987649 1
7****1696 0.026731 0
7****2083 0.523435 1
7****2086 0.076549 0
7****2111 0.086542 0
7****2116 0.298768 0
7****2184 0.310123 1
7****2186 0.009182 0
7****2188 0.567317 0
7****2189 0.001231 0
7****2200 0.602344 1
7****2567 0.223672 0
7****3011 0.996783 1
7****3119 0.410098 1
7****3129 0.113231 0
7****3291 0.490876 0
1****3348 0.190231 0
59
Page 72
Decision tree model gives the probability of churning for the subscribers which is an improvement to initial
criteria which only shows if the subscriber will chum or not. There are some inconsistencies that decision
tree model selected improved on such as a subscriber had a probability of 0.56 of churning yet initial criteria
stated that subscriber will not chum.
60
Page 73
l>cciston tree model gives the probability of churning for the subscribers which is an improvement to initial
entena which only shows if the subscriber will chum or not. There are some inconsistencies that decision
tree model selected unproved on such as a subscriber had a probability o f 0.56 o f churning yet initial criteria
stated that subscriber will not chum.
60
Page 74
CHAPTER 5: CONCLUSIONS AND RECOMMENDATIONS
5,1 Conclusions
Current chum prediction methods used by Safaricom Limited are improved significantly by using
Cox proportional hazard and decision tree since there was a lift from the initial criteria on the ROC
curve. However, for the duration under consideration decision tree performed better than Cox
proportional model.
Decision tree gave probability o f chum which is an improvement from conventional model that
only gives binary results of chum and not churn. Also, where the decision tree yields approximately
50 percent probability of chum conventional model gave varying chum status.
5.2 Recommendations
To fully utilize the models, one has to run the models monthly. This would assist in continuously
tracking the behaviours of the subscribers as the behaviour patterns are affected by many
occurrences that cannot be controlled.
With monthly evaluation of propensity to chum, the impact of:
I. Executive management decision for example change of calling rates can be evaluated on the
impact of chum.
II. Competitor activities on propensity to chum can also be evaluated by running the models
monthly as we will be able to track propensity to chum per subscriber incorporating
competitor activities as our explanatory variable.
61
Page 75
Other related models such as neural networks can be applied and compared to the results ol the
decision tree to further improve chum prediction.
62
Page 76
APPENDICES
Appendix 1: Fit Statistics Table
It shows the entire fit statistics results from the ROC curve fitted for decision tree and the Cox
proportional hazard model.
Data Role=Train
Decision
Tree
Cox proportional
hazard model
Train: Bin-Based Two-Way Kolmogorov-Smimov Probability Cutoff 0.25 0.33
Train: Kolmogorov-Smimov Statistic 0.48 0.45
Train: Akaike's Information Criterion 1536933.7
Train: Average Profit for CHURN STATUS 0.81 0.76
Train: Average Squared Error 0.14 0.16
Train: Roc Index 0.82 0.79
Train: Average Error Function 0.48
Train: Cumulative Percent Captured Response 30.99 24.31
Train: Percent Captured Response 13.58 10.93
Selection Criterion 0.81 0.76
Train: Degrees of Freedom for Error 1605693
Train: Model Degrees of Freedom 64
Train: Total Degrees of Freedom 1605757 1605757
Train: Divisor for ASE 3211514 3211514
Train: Error Function 1536805.7
Train: Final Prediction Error 0.16
Train: Gain 209.93 143.1
Train: Gini Coefficient 0.63 0.58
Train: Bin-Based Two-Way Kolmogorov-Smimov Statistic 0.48 0.45
Train: Kolmogorov-Smimov Probability Cutoff 0.26 0.3
Train: Cumulative Lift 3.1 2.43
63
Page 77
Train: Lift 2.72 2.19
Train: Maximum Absolute Error 0.94 1
Train: Misclassification Rate 0.19 0.24
Train: Mean Square Error 0.16
Train: Sum of Frequencies 1605757 1605757
Train: Number of Estimate Weights 64
Train: Total Profit for CHURN STATUS 1297238 1224718
Train: Root Average Sum of Squares 0.37 0.39
Train: Cumulative Percent Response 81.27 63.75
Train: Percent Response 71.21 57.34
Train: Root Final Prediction Error 0.39
Train: Root Mean Squared Error 0.39
Train: Schwarz's Bayesian Criterion 1537720.2
Train: Sum of Squared Errors 440068.96 501028.59
Train: Sum of Case Weights Times Freq 3211514 3211514
64
Page 78
Appendix 2: Tree Leaf Report
Shows (he results of the decision tree giving different nodes strength.
Output
Tree Leaf Report
Rode DepthT ra in in g
Observations
T ra in in gPercent
1V a lid a t io n
ObservationsV alid at io n
Percent 1IS 3 637382 0.06 424964 0. 0627 4 136954 0.15 91177 0.1525 4 117371 0.29 77485 0.2991 6 110262 0.21 73227 0.2144 5 108270 0.30 72788 0.3087 6 104188 0.37 69726 0.388 3 83012 0.91 55254 0. 9185 6 68823 0.46 45985 0.4618 4 65253 0.71 43282 0.7038 5 38535 0.56 25857 0.5753 5 33582 0.26 22195 0.2547 5 17227 0 .44 11779 0.4329 4 15023 0.22 10221 0.2320 4 13450 0.69 8969 0. 6957 5 8765 0.39 5844 0.3956 5 7931 0.65 5271 0.6649 5 7529 0.41 5027 0.4179 6 7038 0.32 4712 0.3384 6 6103 0.61 3897 0.6186 6 5753 0.55 3978 0. 5348 5 4091 0.59 2751 0.5946 5 3656 0.68 2400 0.69105 6 3410 0.40 2241 0.40104 6 990 0.57 702 0. 5878 6 603 0.59 422 0.5590 6 556 0.66 353 0.61
65
Page 79
Owczarczuk, M. (2009) Chum models for customers in the cellular telecommunication industry
using large data marts. Expert Systems w ith Applications, 37, 4710-4712
Seo, D., Ranganathan, C., and Badad, Y. (2008). Two-Level model of customer retention in US
mobile Telecommunication Service Market. Telecommunications Policy, 32, 182-196
Wei, C., and Chiu, I. (2002). Turning Telecommunications call details to chum prediction. Expert
Systems with Applications, 23, 103-112
Yan, L., Fassiono, M., and Baldasare, P. (2005). Predicting Customer Behaviour via calling links.
Proceeding o f International Joint Conference on Neural Networks, 4, 2555 - 2560
Yankee, G. (2001). Chum management in the mobile market. A Brazilian case study, 3, 202-16
67