The CRISP Data Mining Process

August 28, 2004 Data Mining 2

The Data Mining Process

Business understanding

Data evaluation

Data preparation

Modeling

Evaluation

Deployment

[Diagram: the phases above are arranged around the data.]

Business Understanding

Project objectives

Project requirements

DM problem formulation

Preliminary plan

Case Study

Data mining project done for a large insurance company

Consider the use of data mining to improve understanding of customer databases

Led by the data warehousing team, which wanted to also improve their expertise

Business Objectives

Understand what coverage packages are of interest to a customer group
  Targeting of new customers
  Cross-selling opportunities to existing customers

Understand why a customer group terminates coverage
  Know in advance what groups are likely to terminate
  Understand what factors influence termination

What are the Goals?

The business goals
  Improve customer retention
  Increase cross-selling

Success criteria
  Customer turnover rate
  Amount of cross-selling

Data Mining Problems

Classify new and existing customers as either interested or not interested in a particular coverage

Classify existing customers as either likely or unlikely to terminate coverage

Data Evaluation

Initial data collections

Data quality

Initial insights

Interesting subsets

Data warehousing team

Case Study: Data Evaluation

Data was extracted from select customer databases by company personnel

Coverage programs with few customers selected for pilot project

Five separate files extracted for five coverage programs

Data Preparation

Raw data to finished data set

Technical tasks:
  Data selection
  Attribute selection
  Data cleaning

Case Study: Data Preparation

Some initial formatting of data in MS Excel
  Cleaning of data file
  Combine headers/instances
  Add a new attribute: interest (yes/no)
  Must create the no interest cases

End up with a CSV formatted file
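
The labeling step above can be sketched in a few lines of Python. This is only an illustration, not the procedure used in the project: the slides do not say how the no interest cases were created, so the sketch assumes that every customer absent from a coverage program's extract counts as a "no" case. The field names (customer_id, state) and both helper functions are hypothetical.

```python
import csv
import io


def build_labeled_rows(all_customers, coverage_customers, id_field="customer_id"):
    """Add an 'interest' attribute: 'yes' if the customer holds the coverage, else 'no'.

    Assumption: customers missing from the coverage extract are the 'no' cases.
    """
    interested_ids = {row[id_field] for row in coverage_customers}
    labeled = []
    for row in all_customers:
        out = dict(row)  # copy so the input rows are left untouched
        out["interest"] = "yes" if row[id_field] in interested_ids else "no"
        labeled.append(out)
    return labeled


def to_csv(rows):
    """Serialize rows to CSV text with one header line followed by the instances."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical toy data standing in for the extracted customer files.
all_customers = [
    {"customer_id": "c1", "state": "IN"},
    {"customer_id": "c2", "state": "OH"},
    {"customer_id": "c3", "state": "IN"},
]
coverage_customers = [{"customer_id": "c1", "state": "IN"}]

labeled = build_labeled_rows(all_customers, coverage_customers)
csv_text = to_csv(labeled)
print(csv_text)
```

The resulting text has a single header row and one line per customer, which is the shape a CSV loader expects.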

Weka Data Mining Software

Data in CSV format loaded into Weka:
  Data preprocessing
  Attribute selection
  Modeling
    Classification
    Clustering
    Association rule mining
  Visualization

Data Preprocessing in Weka

Initial data inspection
  Missing values
  Useless attributes
  Numeric attributes as nominal

Some helpful Weka filters
  RemoveUseless
  ReplaceMissingValues
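
To make the effect of these two filters concrete, here is a plain-Python analogue. It is not Weka's API and not Weka's code. It assumes the usual behavior of the filters: missing values are replaced by the column mean (numeric) or mode (nominal), and attributes that never vary are dropped. Weka's RemoveUseless can also drop nominal attributes that vary too much; that part is left out here. The function names and toy columns are hypothetical.

```python
from collections import Counter

MISSING = None  # marker for a missing value in this sketch


def replace_missing(rows, numeric_cols):
    """Fill missing numeric values with the column mean, nominal ones with the mode."""
    columns = list(rows[0].keys())
    fill = {}
    for col in columns:
        present = [r[col] for r in rows if r[col] is not MISSING]
        if not present:
            fill[col] = MISSING  # nothing to estimate from
        elif col in numeric_cols:
            fill[col] = sum(present) / len(present)
        else:
            fill[col] = Counter(present).most_common(1)[0][0]
    return [{c: (fill[c] if r[c] is MISSING else r[c]) for c in columns} for r in rows]


def remove_constant(rows):
    """Drop attributes that take a single value across all instances."""
    keep = [c for c in rows[0].keys() if len({r[c] for r in rows}) > 1]
    return [{c: r[c] for c in keep} for r in rows]


# Hypothetical toy instances; 'country' never varies and should be removed.
rows = [
    {"case_size": 10, "state": "IN", "country": "US"},
    {"case_size": None, "state": "IN", "country": "US"},
    {"case_size": 30, "state": None, "country": "US"},
    {"case_size": 20, "state": "OH", "country": "US"},
]
filled = replace_missing(rows, numeric_cols={"case_size"})
cleaned = remove_constant(filled)
print(cleaned)
```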

Data Preprocessing in Weka

Data reduction:
  Instance dimension
    RemovePercentage and Resample filters
  Attribute dimension
    Remove redundant attributes
    Remove irrelevant attributes
    Identify most important attributes
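
Reduction along the instance dimension can be illustrated with a short Python sketch. It mimics the idea behind the two filters named above rather than their implementation: one function drops a fixed percentage of instances, the other draws a random subsample. Which instances are dropped, the seed handling, and the function signatures are all assumptions made for this example.

```python
import random


def remove_percentage(rows, percentage):
    """Drop the first `percentage` percent of instances, keeping the rest in order."""
    cut = round(len(rows) * percentage / 100)
    return rows[cut:]


def resample(rows, sample_percent, seed=1, replacement=True):
    """Draw a random subsample whose size is `sample_percent` percent of the input."""
    rng = random.Random(seed)  # fixed seed keeps the subsample reproducible
    n = round(len(rows) * sample_percent / 100)
    if replacement:
        return [rng.choice(rows) for _ in range(n)]
    return rng.sample(rows, n)


instances = list(range(10))  # stand-in for ten data instances
kept = remove_percentage(instances, 30)
subsample = resample(instances, 50, replacement=False)
print(kept, subsample)
```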

Attribute Selection Methods

Three main methods used:
  InfoGain
  ChiSquared
  Relief

Combined results from complementary methods

Final pruning of attribute list to twenty attributes
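
As an illustration of the first method, information gain measures how much knowing an attribute reduces the entropy of the class label. The sketch below computes it for nominal attributes and then merges several rankings by average rank position. The merging rule is an assumption: the slides say the results were combined but not how. ChiSquared and Relief are not implemented here, and the attribute names and rankings are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def info_gain(rows, attribute, target="interest"):
    """Entropy of the target minus its expected entropy after splitting on `attribute`."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[attribute]].append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r[target] for r in rows]) - remainder


def combine_rankings(rankings):
    """Order attributes by their average position across rankings (assumed rule)."""
    positions = defaultdict(list)
    for ranking in rankings:
        for pos, attr in enumerate(ranking, start=1):
            positions[attr].append(pos)
    return sorted(positions, key=lambda a: sum(positions[a]) / len(positions[a]))


# Hypothetical toy data: small_group separates the classes, state does not.
rows = [
    {"small_group": "Y", "state": "IN", "interest": "yes"},
    {"small_group": "Y", "state": "OH", "interest": "yes"},
    {"small_group": "N", "state": "IN", "interest": "no"},
    {"small_group": "N", "state": "OH", "interest": "no"},
]
gains = {a: info_gain(rows, a) for a in ("small_group", "state")}
combined = combine_rankings([["a", "b", "c"], ["b", "a", "c"], ["a", "c", "b"]])
print(gains, combined)
```

An attribute that perfectly separates the classes gets the full entropy of the label as its gain, while an attribute unrelated to the label gets a gain of zero.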

Selected Attributes

Location
  Tax State
  Contract State
  State Code
  Zip Code

Selected Attributes

Size
  Case Size Range

Industry
  Industry Classification
  Industry Classification Name
  SIC Code

Selected Attributes

Timing
  New Sale Flag
  Decision Maker Effective Month
  Decision Maker Effective Year
  Next Renewal Month
  Next Renewal Year

Selected Attributes

Internal
  Agency Number
  Office Name
  Pricing Category Code
  Product Line Name
  Small Group Flag

Relevance of Attribute Selection

Improved modeling
  Faster model induction
  Higher accuracy
  Easier to interpret models

Structural knowledge gained from the selection of attributes

Most Important Attributes

What attributes affect the purchasing decision of a customer group?

E.g., the five most important factors that determine whether a customer group purchases a particular insurance coverage:
  Agency Number
  Small Group Flag
  Zip Code
  Decision Maker Effective Year
  Next Renewal Month

Customer Segmentation

Unique groups of customers
  Similar characteristics
  Similar behavior in terms of interest in coverage

For example, separate predictive models for customer segments for a particular type of insurance

Customer Segments Used for Modeling

Results
  Three segments for one database
  Two segments for two databases
  One segment for two databases

Continue modeling for each segment independently

Modeling

Select modeling technique(s)

Calibrate modeling techniques

Make adjustments to data

Modeling

Mathematical models for predicting if a customer is interested in a coverage

Understand why a customer is interested

For example:
  If a customer’s state is Indiana and the office is Indianapolis_Office1 then the customer is interested in Coverage_3
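
The example rule reads naturally as a small predicate over a customer record. The sketch below is only a way of showing what such a transparent model looks like once written down. The record keys ("state", "office") and the function name are hypothetical, and a rule that does not fire returns None instead of predicting "not interested".

```python
def rule_indiana_office1(customer):
    """Return the coverage the rule predicts interest in, or None if the rule does not apply."""
    if customer.get("state") == "Indiana" and customer.get("office") == "Indianapolis_Office1":
        return "Coverage_3"
    return None


# Hypothetical customer records.
matching = {"state": "Indiana", "office": "Indianapolis_Office1"}
other_office = {"state": "Indiana", "office": "Indianapolis_Office2"}

print(rule_indiana_office1(matching))
print(rule_indiana_office1(other_office))
```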

Modeling Techniques

Three modeling techniques tried for predicting customer interest:
  Decision trees
  Artificial neural networks (ANN)
  Support vector machines (SVM)

Decision trees have the advantage of transparency

ANN and SVM did not have significantly better prediction accuracy

Insurance Coverage Interest (Type 6)

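[Figure: decision tree for interest in Type 6 coverage. Attributes tested: Small Group Flag (branches Y, N) and Product Line Name (branches Group_1, Group_2). Leaves: Yes, No.]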

Insurance Coverage Interest (Type 7)

[Figure: decision tree for interest in Type 7 coverage. Attributes tested: Pricing Category Code, Industry Classification Name, Agency Number, Next Renewal Year. Branch labels shown: A2, A4, Legal_Services, Transportation_and Public_Utilities, Group_1, Group_2, <= 430, > 430, <= 2000, > 2000, <= 2002, > 2002. Leaves: Yes, No. Other branches omitted.]

Accuracy of Predicting Customer Interest

Coverage   Accuracy
Type 1     84.0%
Type 2     97.2%
Type 3     98.3%
Type 4     99.5%
Type 5     88.4%
Type 6     100%
Type 7     76.3%
Type 8     85.0%
Type 9     94.8%
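
For reference, accuracy in this sense is the fraction of customers whose predicted class matches the actual class. The slides do not state how the figures above were estimated (for example on held-out data or by cross-validation), so the sketch below shows only the basic calculation on hypothetical labels.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the actual class labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical predictions for four customers; three of the four are correct.
predicted = ["yes", "no", "yes", "no"]
actual = ["yes", "no", "no", "no"]
score = accuracy(predicted, actual)
print(f"{score:.1%}")
```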

Modeling

Mathematical models for predicting if a customer will terminate coverage

Why do customers terminate a specific type of coverage?

What are the important factors in a customer’s decision to terminate coverage?

Who Terminates Type 3 Coverage?

[Figure: decision tree for termination of Type 3 coverage. Attributes tested: Customer Effective Year, Next Renewal Month, Coverage Effective Year (appears twice). Branch labels shown: 1999, 2000, 2001, 2002, 7. Leaves: Active, Terminated.]

Correct for 95% of customers

Who Terminates Type 1 Coverage?

Decision tree based on:
  Distribution number
  Underwriting department number
  Price category
  Rate type
  Rate Plan Year

Predicts 96.3% of terminations correctly

Accuracy of Predicting Termination

Model      Accuracy
Type 1     96.3%
Type 2     96.5%
Type 3     95.3%
Type 4     88.9%
Type 5     88.3%

Evaluation

Data analysis results in a good model

Are business objectives being achieved?

Is there an important business issue that has not been considered?

Should the results be used?

Deployment

Incorporate the results in the organization’s decision making process
  Report
  Decision support system
  Personalization of web pages
  Repeatable data mining process