The CRISP Data Mining Process
August 28, 2004 Data Mining 2
The Data Mining Process
Business understanding
Data evaluation
Data preparation
Modeling
Evaluation
Deployment
[Diagram: the six phases above arranged in a cycle around the data]
Business Understanding
Project objectives
Project requirements
DM problem formulation
Preliminary plan
Case Study
Data mining project done for a large insurance company
Consider the use of data mining to improve understanding of customer databases
Led by the data warehousing team, which also wanted to improve its expertise
Business Objectives
Understand what coverage packages are of interest to a customer group
  Targeting of new customers
  Cross-selling opportunities to existing customers
Understand why a customer group terminates coverage
  Know in advance what groups are likely to terminate
  Understand what factors influence termination
What are the Goals?
The business goals
  Improve customer retention
  Increase cross-selling
Success criteria
  Customer turnover rate
  Amount of cross-selling
Data Mining Problems
Classify new and existing customers as either interested or not interested in a particular coverage
Classify existing customers as either likely or unlikely to terminate coverage
Data Evaluation
Initial data collection
Data quality
Initial insights
Interesting subsets
Data warehousing team
Case Study: Data Evaluation
Data was extracted from select customer databases by company personnel
Coverage programs with few customers selected for pilot project
Five separate files extracted for five coverage programs
Data Preparation
Raw data to finished data set
Technical tasks:
  Data selection
  Attribute selection
  Data cleaning
Case Study: Data Preparation
Some initial formatting of data in MS Excel
  Cleaning of data file
  Combine headers/instances
  Add a new attribute: interest (yes/no)
    Must create the "no interest" cases
End up with a CSV formatted file
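
The preparation steps above could also be scripted instead of done by hand. The following is only a minimal sketch under stated assumptions: the column names (`group_id`, `state`) are hypothetical, and so is the rule used to create the "no interest" cases (a customer group that appears in another program's extract but not in this one). The case study does not say how the negative cases were actually built.

```python
import csv
import io


def label_interest(enrolled_rows, other_rows, key="group_id"):
    """Add an 'interest' attribute to the rows of one coverage program.

    Assumption of this sketch: rows extracted for the program are the
    'yes' cases, and groups seen only in other extracts are the 'no' cases.
    """
    enrolled_keys = {row[key] for row in enrolled_rows}
    labeled = [dict(row, interest="yes") for row in enrolled_rows]
    seen = set()
    for row in other_rows:
        k = row[key]
        if k not in enrolled_keys and k not in seen:
            seen.add(k)
            labeled.append(dict(row, interest="no"))
    return labeled


def to_csv(rows):
    """Serialize a non-empty list of dicts as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical extracts for two coverage programs.
coverage_a = [{"group_id": "G1", "state": "IN"},
              {"group_id": "G2", "state": "OH"}]
coverage_b = [{"group_id": "G2", "state": "OH"},
              {"group_id": "G3", "state": "IN"}]

labeled = label_interest(coverage_a, coverage_b)
csv_text = to_csv(labeled)
```

The resulting text has one header line followed by one line per instance, which is the shape Weka's CSV loader expects.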
Weka Data Mining Software
Data in CSV format loaded into Weka:
  Data preprocessing
  Attribute selection
  Modeling
    Classification
    Clustering
    Association rule mining
  Visualization
Data Preprocessing in Weka
Initial data inspection
  Missing values
  Useless attributes
  Numeric attributes as nominal
Some helpful Weka filters
  RemoveUseless
  ReplaceMissingValues
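
To make the effect of these two filters concrete, here is a plain-Python analogue. It is not Weka's API; it is a rough sketch of the same ideas, assuming missing values are stored as `None`, that missing values are filled with the mean (numeric) or the most frequent value (nominal), and that an attribute is "useless" when it never varies or when a nominal attribute is almost unique per instance. The 0.99 threshold and the sample columns are assumptions of the sketch.

```python
from collections import Counter
from statistics import mean


def _is_numeric(values):
    return all(isinstance(v, (int, float)) for v in values)


def replace_missing(column):
    """Fill None entries with the mean (numeric) or the mode (nominal)."""
    present = [v for v in column if v is not None]
    if not present:
        return list(column)  # nothing to estimate a fill value from
    if _is_numeric(present):
        fill = mean(present)
    else:
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in column]


def is_useless(column, max_distinct_fraction=0.99):
    """True for constant attributes and for near-unique nominal attributes."""
    present = [v for v in column if v is not None]
    distinct = len(set(present))
    if distinct <= 1:
        return True
    if _is_numeric(present):
        return False
    return distinct / len(present) > max_distinct_fraction


# Hypothetical columns, one list of values per attribute.
columns = {
    "state": ["IN", None, "IN", "OH"],
    "size": [10, 20, None, 30],
    "country": ["US", "US", "US", "US"],   # never varies
    "record_id": ["a", "b", "c", "d"],     # unique per instance
}

cleaned = {name: replace_missing(col)
           for name, col in columns.items() if not is_useless(col)}
kept = list(cleaned)
```

In this toy data, `country` and `record_id` are dropped, and the gaps in `state` and `size` are filled from the remaining values.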
Data Preprocessing in Weka
Data reduction:
  Instance dimension
    RemovePercentage and Resample filters
  Attribute dimension
    Remove redundant attributes
    Remove irrelevant attributes
    Identify most important attributes
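
Reduction along the instance dimension can be illustrated with two small functions that loosely mimic the named filters. This is a sketch, not Weka's implementation: the function names, the choice to drop the leading rows, and the default seed are assumptions made here for illustration.

```python
import random


def remove_percentage(rows, percentage):
    """Drop the first `percentage` percent of the rows and keep the rest."""
    cut = round(len(rows) * percentage / 100)
    return rows[cut:]


def resample(rows, sample_percent, seed=1, replacement=True):
    """Draw a random subsample whose size is `sample_percent` of the input."""
    rng = random.Random(seed)  # fixed seed keeps the subsample repeatable
    n = round(len(rows) * sample_percent / 100)
    if replacement:
        return [rng.choice(rows) for _ in range(n)]
    return rng.sample(rows, n)


rows = list(range(10))                      # stand-in for ten instances
trimmed = remove_percentage(rows, 30)       # keeps the last seven
half = resample(rows, 50)                   # five rows, repeats possible
half_unique = resample(rows, 50, replacement=False)
```

Sampling without replacement is the safer choice when the reduced set is later used to estimate accuracy, since repeated instances can otherwise appear in both training and test data.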
Attribute Selection Methods
Three main methods used:
  InfoGain
  ChiSquared
  Relief
Combined results from complementary methods
Final pruning of attribute list to twenty attributes
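
As an illustration of the first method, information gain for a nominal attribute can be computed directly from value and class counts. The sketch below also shows one possible way to merge several rankings, by mean rank position. That merging rule is an assumption: the slides do not say how the three methods' results were combined. The attribute values in the example are invented.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())


def info_gain(values, labels):
    """Reduction in class entropy obtained by splitting on a nominal attribute."""
    total = len(labels)
    by_value = {}
    for v, y in zip(values, labels):
        by_value.setdefault(v, []).append(y)
    remainder = sum(len(subset) / total * entropy(subset)
                    for subset in by_value.values())
    return entropy(labels) - remainder


def combine_rankings(rankings):
    """Order attributes by mean position across rankings (assumed rule)."""
    names = rankings[0]
    mean_rank = {name: sum(r.index(name) for r in rankings) / len(rankings)
                 for name in names}
    return sorted(names, key=mean_rank.get)


# Invented toy data: interest labels and two candidate attributes.
interest = ["yes", "yes", "no", "no"]
small_group_flag = ["Y", "Y", "N", "N"]   # separates the classes perfectly
state = ["IN", "OH", "IN", "OH"]          # carries no class information

gain_flag = info_gain(small_group_flag, interest)
gain_state = info_gain(state, interest)
merged = combine_rankings([["a", "b", "c"], ["b", "a", "c"], ["a", "c", "b"]])
```

An attribute that splits the classes cleanly gets a gain equal to the class entropy, while an attribute unrelated to the class gets a gain of zero.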
Selected Attributes
Location
  Tax State
  Contract State
  State Code
  Zip Code
Selected Attributes
Size
  Case Size Range
Industry
  Industry Classification
  Industry Classification Name
  SIC Code
Selected Attributes
Timing
  New Sale Flag
  Decision Maker Effective Month
  Decision Maker Effective Year
  Next Renewal Month
  Next Renewal Year
Selected Attributes
Internal
  Agency Number
  Office Name
  Pricing Category Code
  Product Line Name
  Small Group Flag
Relevance of Attribute Selection
Improved modeling
  Faster model induction
  Higher accuracy
  Easier to interpret models
Structural knowledge gained from the selection of attributes
Most Important Attributes
What attributes affect the purchasing decision of a customer group?
E.g., the five most important factors that determine whether a customer group purchases a particular insurance coverage:
  Agency Number
  Small Group Flag
  Zip Code
  Decision Maker Effective Year
  Next Renewal Month
Customer Segmentation
Unique groups of customers
  Similar characteristics
  Similar behavior in terms of interest in coverage
For example, separate predictive models for customer segments for a particular type of insurance
Customer Segments Used for Modeling
Results
  Three segments for one database
  Two segments for two databases
  One segment for two databases
Continue modeling for each segment independently
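
Modeling each segment independently amounts to partitioning the instances by segment and fitting one model per partition. The sketch below shows only that structure. The segment key (`small_group`), the records, and the placeholder "model" (predict the segment's most common label) are all hypothetical; the case study does not state how segments were defined, and it used real classifiers rather than a majority rule.

```python
from collections import Counter, defaultdict


def train_majority(labels):
    """Placeholder model: remember the most common label in the segment."""
    return Counter(labels).most_common(1)[0][0]


def train_per_segment(records, segment_of, label_of, train=train_majority):
    """Group records by segment and train one model on each group."""
    by_segment = defaultdict(list)
    for record in records:
        by_segment[segment_of(record)].append(label_of(record))
    return {segment: train(labels) for segment, labels in by_segment.items()}


# Hypothetical records; the segment is read from an invented attribute.
records = [
    {"small_group": "Y", "interest": "yes"},
    {"small_group": "Y", "interest": "yes"},
    {"small_group": "Y", "interest": "no"},
    {"small_group": "N", "interest": "no"},
    {"small_group": "N", "interest": "no"},
]

models = train_per_segment(
    records,
    segment_of=lambda r: r["small_group"],
    label_of=lambda r: r["interest"],
)
```

Passing a different `train` function would swap in a real learner without changing the per-segment bookkeeping.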
Modeling
Select modeling technique(s)
Calibrate modeling techniques
Make adjustments to data
Modeling
Mathematical models for predicting if a customer is interested in a coverage
Understand why a customer is interested
For example:
  If a customer’s state is Indiana and the office is Indianapolis_Office1, then the customer is interested in Coverage_3
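
The example rule translates directly into a predicate over a customer record, which is what makes rule and tree models easy to inspect. In this sketch the dictionary keys `state` and `office` are assumed field names, not names taken from the insurer's databases.

```python
def interested_in_coverage_3(customer):
    """Apply the example rule: Indiana customers of Indianapolis_Office1."""
    return (customer.get("state") == "Indiana"
            and customer.get("office") == "Indianapolis_Office1")


# Hypothetical customer records.
matching = {"state": "Indiana", "office": "Indianapolis_Office1"}
other_office = {"state": "Indiana", "office": "Some_Other_Office"}

prediction_matching = interested_in_coverage_3(matching)
prediction_other = interested_in_coverage_3(other_office)
```

Each condition in the rule is a readable test on one attribute, so a business user can see why a given customer was predicted to be interested.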
Modeling Techniques
Three modeling techniques tried for predicting customer interest:
  Decision trees
  Artificial neural networks (ANN)
  Support vector machines (SVM)
Decision trees have the advantage of transparency
ANN and SVM did not have significantly better prediction accuracy
Insurance Coverage Interest (Type 6)
[Figure: decision tree for interest in Type 6 coverage. Splits on Small Group Flag (Y / N) and Product Line Name (Group_1 / Group_2); leaves predict Yes or No.]
Insurance Coverage Interest (Type 7)
[Figure: decision tree for interest in Type 7 coverage. Splits on Pricing Category Code (A2, A4, others), Industry Classification Name (Legal_Services, Transportation_and_Public_Utilities), Agency Number (<= 430 / > 430), and Next Renewal Year (<= 2000 / > 2000 and <= 2002 / > 2002); Group_1 and Group_2 also appear as branch labels; leaves predict Yes or No. Other branches omitted.]
Accuracy of Predicting Customer Interest
Coverage Accuracy
Type 1 84.0%
Type 2 97.2%
Type 3 98.3%
Type 4 99.5%
Type 5 88.4%
Type 6 100%
Type 7 76.3%
Type 8 85.0%
Type 9 94.8%
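
Accuracy figures like these are the fraction of instances whose predicted class matches the actual class. A minimal sketch of that calculation follows; the labels are invented, and the slides do not state which evaluation protocol (for example a held-out test set or cross-validation) produced the reported numbers.

```python
def accuracy(predicted, actual):
    """Fraction of instances where the predicted label equals the actual one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Invented labels for four customer groups.
predicted = ["yes", "no", "no", "yes"]
actual = ["yes", "no", "yes", "yes"]

score = accuracy(predicted, actual)  # three of four predictions match
```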
Modeling
Mathematical models for predicting if a customer will terminate coverage
Why do customers terminate a specific type of coverage?
What are the important factors in a customer’s decision to terminate coverage?
Who Terminates Type 3 Coverage?
[Figure: decision tree for termination of Type 3 coverage. Splits on Customer Effective Year, Next Renewal Month, and Coverage Effective Year (branch values include the years 1999 to 2002 and renewal month 7); leaves predict Active or Terminated.]
Correct for 95% of customers
Who Terminates Type 1 Coverage?
Decision tree based on:
  Distribution number
  Underwriting department number
  Price category
  Rate type
  Rate Plan Year
Predicts 96.3% of terminations correctly
Accuracy of Predicting Termination
Model Accuracy
Type 1 96.3%
Type 2 96.5%
Type 3 95.3%
Type 4 88.9%
Type 5 88.3%
Evaluation
Data analysis results in a good model
Are business objectives being achieved?
Is there an important business issue that has not been considered?
Should the results be used?
Deployment
Incorporate the results in the organization’s decision-making process
  Report
  Decision support system
  Personalization of web pages
  Repeatable data mining process