K6255 – Knowledge Discovery and Data Mining
Statistical Analysis of Caravan Insurance using IBM SPSS
Muthu Kumaar Thangavelu (G1101765E)

1. INTRODUCTION:
The data set contains information on customers of an insurance company. It includes product usage data and socio-demographic data derived from zip area codes, and was supplied by the Dutch data mining company Sentient Machine Research. Our aim is to identify the group of customers who will be interested in buying caravan insurance, and to build a predictive model from the 86 given variables, which represent the socio-demographic, education, insurance-interest and income levels of customers.

2. STATISTICAL ANALYSIS
2.1. DATA PREPARATION:
2.1.1. ANALYZING AND CATEGORIZING THE VARIABLES:
We extract and analyze the raw variables with their labels, and categorize them based on our understanding of the insurance product and its buyers. We reduce the broad range of 86 variables to the significant predictors below.

CUST_SUB_LIFESTYLE_REFLECTION: The customer subtype variable MOSTYPE has 41 values, which can be categorised into two broad classes relating to age, social class, lifestyle and attitude towards investing or spending:
- Middle and Upper Class, middle aged and senior citizens, high risk cultured liberal investors (8, 9, 12, 13, 23, 25, 36, 2, 3, 4, 5, 15, and 27)
- Distributed age and social class, low risk cultured conservative investors (1, 6, 7, 10, 11, 14, 16, 17, 18, 19, 20, 21, 22, 24, 26, 28, 29, 30, 31, 32, 33, 34, 35, 37, 38, 39, 40, 41)

CUST_LEVEL_LIFECYCLE: The average age variable MGEMLEEF holds 6 values, which can be categorised into three groups based on family status and age:
- Young, family starters (1)
- Middle aged family men (2, 3, and 4)
- Senior, family men (5, 6)
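The two recodings described above can be sketched outside SPSS as follows. This is a minimal illustration in Python, not the SPSS syntax actually used for the report; the function names and the group labels returned for MGEMLEEF are hypothetical. The 1/0 coding for the lifestyle variable follows the legend given with Fig 1.2.

```python
# MOSTYPE values listed in the text as "high risk cultured liberal investors".
HIGH_RISK_LIBERAL = {2, 3, 4, 5, 8, 9, 12, 13, 15, 23, 25, 27, 36}


def recode_lifestyle(mostype: int) -> int:
    """Recode MOSTYPE (1..41) into the binary CUST_SUB_LIFESTYLE_REFLECTION.

    Returns 1 for the high-risk liberal group and 0 for the low-risk
    conservative group (all remaining MOSTYPE values).
    """
    if not 1 <= mostype <= 41:
        raise ValueError(f"MOSTYPE out of range: {mostype}")
    return 1 if mostype in HIGH_RISK_LIBERAL else 0


def recode_lifecycle(mgemleef: int) -> str:
    """Recode MGEMLEEF (1..6) into the three CUST_LEVEL_LIFECYCLE groups.

    The string labels are illustrative names, not taken from the report.
    """
    if mgemleef == 1:
        return "young_family_starters"
    if mgemleef in (2, 3, 4):
        return "middle_aged_family"
    if mgemleef in (5, 6):
        return "senior_family"
    raise ValueError(f"MGEMLEEF out of range: {mgemleef}")
```

Raising an error on out-of-range codes, rather than silently assigning them to the conservative group, makes data-entry problems visible before modelling.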
+ 0.237(PURCHASING_POWER_CLASS_INK) + (PBRAND by PBYSTAND by PPERSAUT by PWAPART) - 2.336
The model has a Nagelkerke R square of 19.2%. This is a pseudo R square that indicates modest explanatory power; it is not a measure of classification accuracy.
The model summary and predictor equation are described in Appendix-4.
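To show how such a predictor equation turns into a purchase probability, here is a minimal sketch in Python. The coefficients and the constant are copied from the Appendix-4 table; the function name, the argument names and the `interaction_b` parameter (the coefficient of whichever PBRAND by PBYSTAND by PPERSAUT by PWAPART cell applies to the customer) are assumptions made for illustration. The lifestyle argument is the value of the SPSS indicator CUST_SUB_LIFESTYLE_REFLECTION(1), whose reference category depends on the SPSS contrast settings.

```python
import math

# Coefficients (B column) from the Appendix-4 "Variables in the Equation" table.
COEFFICIENTS = {
    "CUST_SUB_LIFESTYLE_REFLECTION(1)": -0.345,
    "PURCHASING_POWER_CLASS_INK": 0.237,
    "MHHUUR": -0.024,
    "MAUT1": 0.093,
}
CONSTANT = -2.336


def caravan_probability(lifestyle_indicator, purchasing_power, mhhuur, maut1,
                        interaction_b=0.0):
    """Return the predicted probability from the logit of the fitted model.

    interaction_b is the B value of the applicable four-way interaction
    cell (0.0 for the reference cell); it is passed in by the caller
    because the full list of cells is elided in the appendix.
    """
    logit = (
        CONSTANT
        + COEFFICIENTS["CUST_SUB_LIFESTYLE_REFLECTION(1)"] * lifestyle_indicator
        + COEFFICIENTS["PURCHASING_POWER_CLASS_INK"] * purchasing_power
        + COEFFICIENTS["MHHUUR"] * mhhuur
        + COEFFICIENTS["MAUT1"] * maut1
        + interaction_b
    )
    # Logistic function: maps the logit to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-logit))
```

With all predictors at zero the probability reduces to 1 / (1 + exp(2.336)), which matches the odds of .097 reported as Exp(B) for the constant.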
3. MODEL INSIGHTS AND CONCLUSION:
The initial variables were examined and classified to reflect socio-demographic, education, lifestyle, income, car and insurance-interest properties relevant to the product type. The variables judged significant on logical grounds were then analyzed against the target variable using descriptive statistics in IBM SPSS. Dimension reduction, variable recoding and the definition of interaction variables were carried out to obtain accurate and independent predictors. Logistic regression then gives the required predictor model.
The model should generalize broadly: the categorizing and recoding of variables should rest on sound real-world reasoning, so that the model holds for most cases and avoids overfitting.
Appendix -1
DESCRIPTIVE STATISTICS – CROSS TAB RESULTS
Fig 1.0. Rental Home Residents Caravan Insurance Buying Pattern
Fig 1.1. Purchasing Power Class Caravan Insurance Buying Pattern
Fig 1.2. Social Lifestyle based Caravan Insurance Buying Pattern (RECODED VARIABLE)
1 - Middle and Upper Class, middle aged and senior citizens, high risk cultured liberal investors
0 - Distributed age and social class, low risk cultured conservative investors
Fig 1.3. Third Party Insurance Buyers and Caravan Insurance buyers
Fig 1.4. Car Insurance Buyers and Caravan Insurance Buyers
Fig 1.5. Fire Insurance Contribution and Caravan Insurance Interest
Fig 1.6. Social Security Insurance Vs Caravan Insurance Buyers
Appendix -2: (Logistic Regression Summary and Last Convergence Results without PCA Component)
Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      2220.272(a)         .069                   .189
2      2210.325(a)         .070                   .193
a. Estimation terminated at iteration number 20 because maximum iterations has been reached. Final solution cannot be found.
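For readers unfamiliar with the two pseudo R square columns, the sketch below shows how they are conventionally computed from the -2 log likelihood values. It is written in Python for illustration only; the function name is hypothetical, and the null-model -2 log likelihood and the sample size `n` are inputs the report does not state, so none of the report's figures are reproduced here.

```python
import math


def pseudo_r_squares(neg2ll_null, neg2ll_model, n):
    """Return (Cox & Snell R square, Nagelkerke R square).

    neg2ll_null  : -2 log likelihood of the intercept-only model
    neg2ll_model : -2 log likelihood of the fitted model
    n            : number of cases
    """
    cox_snell = 1.0 - math.exp(-(neg2ll_null - neg2ll_model) / n)
    # Largest value Cox & Snell can reach for this data set; Nagelkerke
    # rescales by it so that the statistic can reach 1.
    max_cox_snell = 1.0 - math.exp(-neg2ll_null / n)
    nagelkerke = cox_snell / max_cox_snell
    return cox_snell, nagelkerke
```

Because Nagelkerke is Cox & Snell divided by its maximum, the ratio of the two reported values (.069 / .189, about .365) gives the implied maximum Cox & Snell value for this data set.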
Converged Predictors and corresponding Coefficients in binary logistic regression (BLOCK 2 - Second Step)
Variables in the Equation
B S.E. Wald df Sig. Exp(B)
.
.
.
.
The Cross Product continuing up to (4x4 combinations)
a. Variable(s) entered on step 1: PBRAND * PBYSTAND * PPERSAUT * PWAPART .
b. Variable(s) entered on step 2: MINKGEM * MKOOPKLA .
Appendix -3: (Logistic Regression with reduced component with PCA)
PRINCIPAL COMPONENT ANALYSIS: FACTOR ANALYSIS:
Initial Components (Average Income and Purchasing Power Class) Vs Principal Component Extracted
Correlation Matrix
                             MINKGEM   MKOOPKLA
Correlation       MINKGEM    1.000     .452
                  MKOOPKLA   .452      1.000
Sig. (1-tailed)   MINKGEM              .000
                  MKOOPKLA   .000
After Principal Component Analysis -
Component Matrix(a)
           Component 1
MINKGEM    .852
MKOOPKLA   .852
Extraction Method: Principal Component Analysis.
a. 1 components extracted.
Reproduced Correlations
                                    MINKGEM   MKOOPKLA
Reproduced Correlation   MINKGEM    .726(a)   .726
                         MKOOPKLA   .726      .726(a)
Residual(b)              MINKGEM              -.274
                         MKOOPKLA   -.274
Extraction Method: Principal Component Analysis.
a. Reproduced communalities
b. Residuals are computed between observed and reproduced correlations. There are 1 (100.0%) nonredundant residuals with absolute values greater than 0.05.
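The figures in the tables above follow directly from the single correlation of .452. For two standardized variables with a positive correlation r, the first principal component has eigenvalue 1 + r and loads equally on both variables. The short Python sketch below (the function name is hypothetical) reproduces the loading, the communality and the residual from that one input.

```python
import math


def one_component_solution(r):
    """First principal component of a 2x2 correlation matrix [[1, r], [r, 1]].

    Assumes r > 0, so that the largest eigenvalue is 1 + r with
    eigenvector (1/sqrt(2), 1/sqrt(2)).
    """
    eigenvalue = 1.0 + r
    # Loading = eigenvector entry * sqrt(eigenvalue) = sqrt((1 + r) / 2).
    loading = math.sqrt(eigenvalue / 2.0)
    communality = loading ** 2          # variance of each variable explained
    reproduced = loading * loading      # reproduced off-diagonal correlation
    residual = r - reproduced           # observed minus reproduced
    return {
        "eigenvalue": eigenvalue,
        "loading": loading,
        "communality": communality,
        "residual": residual,
    }


solution = one_component_solution(0.452)
```

Rounded to three decimals, `solution` gives a loading of .852, a communality of .726 and a residual of -.274, matching the Component Matrix and Reproduced Correlations tables. The component therefore carries 72.6% of the variance of the two variables.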
APPENDIX -4:
After PCA with the Reduced Component – Binary Logistic Regression with other predictor variables
Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      2213.728(a)         .070                   .192
a. Estimation terminated at iteration number 20 because maximum iterations has been reached. Final solution cannot be found.
Variables in the Equation
Step 1(a)
Term                                                   B        S.E.       Wald     df   Sig.   Exp(B)
CUST_SUB_LIFESTYLE_REFLECTION(1)                       -.345    .124       7.778    1    .005   .709
PURCHASING_POWER_CLASS_INK                             .237     .068       12.009   1    .001   1.268
MHHUUR                                                 -.024    .024       1.049    1    .306   .976
MAUT1                                                  .093     .040       5.315    1    .021   1.098
PBRAND * PBYSTAND * PPERSAUT * PWAPART                                     207.422  112  .000
PBRAND(1) by PBYSTAND(1) by PPERSAUT(1) by PWAPART(1)  -1.467   .779       3.549    1    .060   .231
PBRAND(1) by PBYSTAND(1) by PPERSAUT(1) by PWAPART(2)  -18.885  7541.184   .000     1    .998   .000
PBRAND(1) by PBYSTAND(1) by PPERSAUT(1) by PWAPART(3)  -1.627   .960       2.874    1    .090   .197
PBRAND(1) by PBYSTAND(1) by PPERSAUT(2) by PWAPART(1)  -19.134  40192.970  .000     1    1.000  .000
PBRAND(1) by PBYSTAND(1) by PPERSAUT(3) by PWAPART(1)  -3.743   1.257      8.862    1    .003   .024
PBRAND(1) by PBYSTAND(1) by PPERSAUT(3) by PWAPART(3)  -.218    1.065      .042     1    .838   .804
...
PBRAND(7) by PBYSTAND(4) by PPERSAUT(4) by PWAPART(1)  -19.341  23141.295  .000     1    .999   .000
PBRAND(8) by PBYSTAND(1) by PPERSAUT(1) by PWAPART(1)  -18.797  28317.506  .000     1    .999   .000
PBRAND(8) by PBYSTAND(1) by PPERSAUT(1) by PWAPART(3)  -19.114  40192.970  .000     1    1.000  .000
PBRAND(8) by PBYSTAND(1) by PPERSAUT(4) by PWAPART(1)  -19.252  28290.099  .000     1    .999   .000
PBRAND(8) by PBYSTAND(1) by PPERSAUT(4) by PWAPART(3)  -18.921  28301.176  .000     1    .999   .000
PBRAND(8) by PBYSTAND(1) by PPERSAUT(5) by PWAPART(3)  -19.476  40192.970  .000     1    1.000  .000
Constant                                               -2.336   .812       8.271    1    .004   .097
a. Variable(s) entered on step 1: PBRAND * PBYSTAND * PPERSAUT * PWAPART .
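The Wald and Exp(B) columns can be recomputed from B and S.E., which is a useful check when reading the table. The Python sketch below does this and also flags rows whose standard error is implausibly large. Such rows (B near -19 with S.E. in the thousands) are often a sign of sparse or empty interaction cells, which would also be consistent with the estimation stopping at the iteration limit. The function names and the threshold of 100 are assumptions chosen for illustration, not SPSS settings.

```python
import math


def wald_and_odds_ratio(b, se):
    """Return (Wald statistic, Exp(B)) for a single-df coefficient."""
    wald = (b / se) ** 2
    odds_ratio = math.exp(b)
    return wald, odds_ratio


def has_inflated_standard_error(se, threshold=100.0):
    """Flag a coefficient whose S.E. exceeds an illustrative threshold."""
    return se > threshold


# CUST_SUB_LIFESTYLE_REFLECTION(1): B = -.345, S.E. = .124 (from the table).
wald, odds_ratio = wald_and_odds_ratio(-0.345, 0.124)
```

The recomputed values agree with the table only up to rounding, because B and S.E. are themselves printed to three decimals.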