
De Jong and Heller SAS miner preliminary analysis

In this document we go through the data exploration that one should always undertake before embarking on the construction of a statistical model. The occurrence of a claim (claim flag) and claim amount (clm amt) are considered to be the response variables. The effects of the other variables on these responses are considered here, graphically and numerically.

Kidsdriv is the number of kids in the car when driving. The left panel in Figure 1 displays a histogram of kidsdriv. It shows that driving without any kids in the car is the most common case and that the maximum number of kids in the car is 4. The right panel displays the box plot of log claim amount by the number of kids in the car. This indicates that, given that a claim occurs, the claim amount does not vary with the number of kids. Therefore, kidsdriv is not a promising predictor for claim amount.

[Figure 1, two panels. Left: histogram (frequency) of number of kids. Right: box plots of log claim amount by number of kids.]

Figure 1: Kids in the car

In Table 1, the probability of a claim generally increases as the number of kids increases (the slight dip at 4 kids rests on only 0.04% of policies). This indicates that kidsdriv is a potential candidate to predict claim occurrence.

Table 1: Kids in the car

No. of kids    Frequency    Percent claims
0              88.0%        25%
1              7.8%         39%
2              3.4%         40%
3              0.7%         53%
4              0.04%        50%
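As a quick consistency check on Table 1, the per-level claim rates can be combined into the overall claim rate they imply, and each level can be compared with the no-kids baseline. The sketch below is only illustrative: it uses the rounded percentages from Table 1, and the names `table1` and `overall_claim_rate` are introduced here rather than taken from the original analysis.

```python
# Table 1 as (share of policies, claim rate) per number of kids in the car.
# Values are the rounded percentages printed in the table.
table1 = {
    0: (0.880, 0.25),
    1: (0.078, 0.39),
    2: (0.034, 0.40),
    3: (0.007, 0.53),
    4: (0.0004, 0.50),
}


def overall_claim_rate(table):
    """Frequency-weighted average of the per-level claim rates."""
    total_share = sum(share for share, _ in table.values())
    weighted = sum(share * rate for share, rate in table.values())
    # Divide by the total share because rounded shares need not sum to 1.
    return weighted / total_share


overall = overall_claim_rate(table1)
baseline = table1[0][1]
# Claim rate of each level relative to policies with no kids in the car.
relative = {kids: rate / baseline for kids, (_, rate) in table1.items()}

print(f"implied overall claim rate: {overall:.3f}")
for kids, ratio in relative.items():
    print(f"{kids} kids: {ratio:.2f} x the no-kids rate")
```

With these rounded inputs the implied overall claim rate is close to 27%, which gives a reference point for reading the "Percent claims" columns of the later tables.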

Plcydate describes the date that the policy starts. Policy starting dates are spread evenly from March 1993 to June 1998. The average starting dates are similar for policies with and without claims, which means that there is no difference between old and new policy holders in terms of claim occurrence. There is no pattern between claim amount and policy starting date. Therefore, this variable is not used for predicting either clm flag or clm amt.

Travtime The top left panel in Figure 2 displays the histogram of travel time between home and work. The mean travel time is 33.4 minutes. The top right panel displays the box plots of travel time without a claim (left) and with a claim (right). The medians and spreads are similar, which suggests that travtime does not affect claim occurrence. In the bottom left panel, claim amount (given a claim occurred) is plotted against travel time. In the bottom right panel, log claim amount (given a claim occurred) is plotted against travel time. The horizontal smooth lines in both plots indicate that claim amount is independent of travel time.

October 25, 2007

[Figure 2, four panels. Top left: histogram (density) of travel time. Top right: box plots of travel time by claim (No, Yes). Bottom left: claim amount against travel time. Bottom right: log claim amount against travel time.]

Figure 2: Travel time from home to work

Car use There are two types of car usage: commercial and private. 36.8% of cars are for commercial use and private cars account for 63.2%. As shown in Table 2, the probability of a claim is higher for commercial cars. Thus, car usage is a potential explanatory variable for claim occurrence.

Figure 3 displays box plots of claim amount (left) and log claim amount (right) by car usage. Car usage does not look promising as an explanatory variable for claim amount.

Table 2: Car use

Car use       Frequency    Percent claims
Commercial    36.8%        35%
Private       63.2%        22%
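The gap between the two claim rates in Table 2 can be expressed as a relative risk or as an odds ratio. The sketch below works through both using only the two percentages from Table 2. The helper `odds` is a name introduced here, and the remark about logistic regression is a general property of that model, not something the original analysis reports.

```python
import math


def odds(p):
    """Convert a probability into odds p / (1 - p)."""
    return p / (1.0 - p)


# Claim rates from Table 2.
p_commercial = 0.35
p_private = 0.22

relative_risk = p_commercial / p_private
odds_ratio = odds(p_commercial) / odds(p_private)
# In a logistic regression with car use as the only predictor, the fitted
# coefficient for "commercial" would equal this log odds ratio.
log_odds_ratio = math.log(odds_ratio)

print(f"relative risk:  {relative_risk:.2f}")
print(f"odds ratio:     {odds_ratio:.2f}")
print(f"log odds ratio: {log_odds_ratio:.2f}")
```

On these figures the odds of a claim for a commercial car are roughly 1.9 times those for a private car.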


[Figure 3, two panels. Left: box plots of claim amount by car use (Commercial, Private). Right: box plots of log claim amount by car use.]

Figure 3: Car use

Bluebook describes the value of the car. The top left panel of Figure 4 displays the histogram of bluebook. The box plots indicate that the average bluebook value is lower for policies without a claim than for those with a claim. The smooth lines indicate the relationship between bluebook and claim amount. There is an upward linear relationship between bluebook and claim amount.

[Figure 4, four panels. Top left: histogram (density) of bluebook. Top right: box plots of bluebook by claim (No, Yes). Bottom left: claim amount against bluebook. Bottom right: log claim amount against bluebook.]

Figure 4: Bluebook

Retained measures the number of years the customer has been with the company. The histogram in Figure 5 shows that 15% of customers have held policies for less than one year. The box plots show that the average number of years retained is higher for non-claim policies. Thus, it is a potential explanatory variable for the occurrence of a claim. The flat lines in the bottom plots indicate that claim amount does not depend on customer loyalty.

[Figure 5, four panels. Top left: histogram (density) of retained. Top right: box plots of retained by claim (No, Yes). Bottom left: claim amount against retained. Bottom right: log claim amount against retained.]

Figure 5: Retained: number of years with the company

Npolicy is the number of policies the customer holds. From Table 3, about 53% of customers hold one policy and about 30% hold two. The proportion that make a claim does not increase with increasing number of policies. The splines of claim amount against npolicy are flat (bottom panels of Figure 6).

Table 3: Number of policies

Number of policies    Frequency    Percent claims
1                     53.4%        26.8%
2                     30.7%        27.2%
3                     10.9%        27.1%
4                     3.6%         24.7%
5                     1.0%         13.0%
6                     0.1%         0.0%
7                     0.2%         0.0%
8                     0.0%         -
9                     0.05%        0.0%

Car type SUV and Sedan are the two most popular car types, as shown in Table 4. The probability of a claim varies across car types, as indicated in Table 4. The claim amount does not vary much across car types (bottom panels of Figure 7).

Red car describes whether the car's color is red. 29% of insured cars are red. The probability of making a claim is similar for red and non-red cars, as shown in Table 5. Figure 8 shows that the claim amount is similar for red and non-red cars.

Clm freq measures the number of claims in the past 5 years. The top left panel in Figure 9 indicates that the majority of policies (61.1%) have no claim in the past 5 years. The customers that incurred a claim this year have a higher number of past claims, on average.


[Figure 6, four panels. Top left: histogram (density) of number of policies. Top right: box plots of number of policies by claim (No, Yes). Bottom left: claim amount against number of policies. Bottom right: log claim amount against number of policies.]

Figure 6: Number of policies

Table 4: Car Type

Car type       Frequency    Percent claims
Panel Truck    8.3%         26%
Pickup         17.2%        31%
Sedan          26.1%        17%
Sports Car     11.4%        35%
SUV            28.0%        29%
Van            8.9%         27%

[Figure 7, two panels: claim amount by car type, and log claim amount by car type.]

Figure 7: Car type


Table 5: Red car

Red car    Frequency    Percent claims
Not Red    71.1%        27%
Red        28.9%        26%

[Figure 8, two panels: claim amount by red car (no, yes), and log claim amount by red car.]

Figure 8: Red car

Because customers with a claim this year have more past claims on average, clm freq is potentially a good predictor of claim occurrence. The bottom two panels of Figure 9 indicate that claim amount is not related to clm freq.

Oldclaim records old claim amounts incurred in the past 5 years. The top right panel of Figure 10 indicates that a claim is more likely to occur with a higher past claim amount. The bottom two panels indicate that there is no relationship between claim amount and old claim amount, given a claim occurred.

Revoked measures whether the policy holder's license has been suspended in the last 7 years. Around 12.2% of licenses have been suspended in the past 7 years. The occurrence of a claim is related to the variable revoked. If the policy holder's license has been suspended in the last 7 years, the chance of incurring a claim is 45%, compared with 24% if the license has not been suspended. The claim amount is not related to license suspension, as shown in Figure 11.
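The three percentages quoted for revoked are enough to work out, via the law of total probability and Bayes' rule, what share of claiming policies come from holders with a suspended license. The sketch below is a worked calculation on those quoted figures only, and the variable names are introduced here for illustration.

```python
# Figures quoted in the text for the variable revoked.
p_revoked = 0.122              # share of policy holders with a suspension
p_claim_given_revoked = 0.45   # claim rate with a suspension
p_claim_given_clean = 0.24     # claim rate without a suspension

# Law of total probability: overall claim rate implied by the two groups.
p_claim = (p_revoked * p_claim_given_revoked
           + (1.0 - p_revoked) * p_claim_given_clean)

# Bayes' rule: share of claiming policies whose holder had a suspension.
p_revoked_given_claim = p_revoked * p_claim_given_revoked / p_claim

print(f"implied overall claim rate:      {p_claim:.3f}")
print(f"share of claims with suspension: {p_revoked_given_claim:.3f}")
```

On these figures, the roughly 12% of policy holders with a suspension account for about 21% of the policies with a claim.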

Mvr pts is motor vehicle points. As we do not have any information on the data definition, an educated guess is that a low mvr pts value is good. The top right panel in Figure 12 indicates that the average motor vehicle points for policies with claims is higher. The bottom two panels indicate that claim amount is not related to motor vehicle points.

Age The average age of drivers making a claim is lower than that of drivers without a claim. Age looks promising as a predictor of the occurrence of a claim. The claim amount appears to be independent of age.

Homekids is the number of kids at home. This variable is not related to either claim occurrence or claim amount, as shown in Figure 14.

Yoj is the number of years the customer has been working. The top left panel of Figure 15 shows that yoj follows an approximately normal distribution with a mean of about 10 years, apart from a hump at 0 that corresponds to student policy holders. The average number of years of working is higher for policy holders with no claim. This indicates that yoj is a good predictor for claim occurrence. The number of years of working is not related to claim size.


[Figure 9, four panels. Top left: histogram (density) of claim frequency. Top right: box plots of claim frequency by claim (No, Yes). Bottom left: claim amount against claim frequency. Bottom right: log claim amount against claim frequency.]

Figure 9: Claim Frequency

[Figure 10, four panels. Top left: histogram (density) of old claim amount. Top right: box plots of old claim amount by claim (No, Yes). Bottom left: claim amount against old claim amount. Bottom right: log claim amount against old claim amount.]

Figure 10: Old claim amount


[Figure 11, two panels: claim amount by revoked (No, Yes), and log claim amount by revoked.]

Figure 11: Revoked

[Figure 12, four panels. Top left: histogram (density) of motor vehicle points. Top right: box plots of motor vehicle points by claim (No, Yes). Bottom left: claim amount against motor vehicle points. Bottom right: log claim amount against motor vehicle points.]

Figure 12: Motor vehicle points


[Figure 13, four panels. Top left: histogram (density) of age. Top right: box plots of age by claim (No, Yes). Bottom left: claim amount against age. Bottom right: log claim amount against age.]

Figure 13: Age

[Figure 14, four panels. Top left: histogram (density) of home kids. Top right: box plots of home kids by claim (No, Yes). Bottom left: claim amount against home kids. Bottom right: log claim amount against home kids.]

Figure 14: Home kids


[Figure 15, four panels. Top left: histogram (density) of years of working. Top right: box plots of years of working by claim (No, Yes). Bottom left: claim amount against years of working. Bottom right: log claim amount against years of working.]

Figure 15: Years of working

Income The distribution of income is right skewed with a hump at $0 income, which corresponds to the student policy holders. The average income for policy holders with no claim is higher. Claim amount is not related to income.

Gender 54% of policy holders are female. 27.5% of female drivers have incurred a claim, while 25.5% of male drivers have incurred a claim. The average claim amounts, given that a claim is incurred, are similar, as shown in Figure 17.

Married 60% of policy holders are married. 34% of non-married policy holders have incurred a claim, while 22% of married policy holders have incurred a claim. Therefore, claim occurrence appears to be associated with marital status. Figure 18 shows that claim amount is not related to marital status.

Parent1 denotes a single parent. Around 13% of policy holders are single parents. A policy holder who is a single parent has a 45% probability of making a claim, compared with 24% for one who is not. The average claim amount does not differ.

Jobclass Table 6 shows that students and blue collar workers have a higher probability of making claims compared with managers and doctors. However, this variable might be correlated with years of working (yoj), as it is expected that managers have a longer working history. Caution should be taken when including both yoj and jobclass in a model.
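One way to gauge how strongly a numeric variable such as yoj is tied to a categorical variable such as jobclass is the correlation ratio (eta squared), the share of the total variation that lies between category means. The sketch below is a generic implementation and is not part of the original analysis. The function name is introduced here, and the toy values are made up purely to show the calculation, not drawn from the data set.

```python
def correlation_ratio(groups):
    """Return eta squared: between-group sum of squares / total sum of squares.

    `groups` maps a category label to the non-empty list of numeric values
    observed in that category. A value of 0 means the group means coincide;
    a value of 1 means the category determines the numeric value exactly.
    """
    values = [x for xs in groups.values() for x in xs]
    grand_mean = sum(values) / len(values)
    ss_total = sum((x - grand_mean) ** 2 for x in values)
    if ss_total == 0:
        return 0.0  # no variation at all, so nothing to explain
    ss_between = sum(
        len(xs) * (sum(xs) / len(xs) - grand_mean) ** 2
        for xs in groups.values()
    )
    return ss_between / ss_total


# Made-up years-of-working values, for illustration only (NOT from the data).
toy_yoj_by_jobclass = {
    "Manager": [12, 14, 16],
    "Student": [0, 0, 2],
}
eta_squared = correlation_ratio(toy_yoj_by_jobclass)
print(f"eta squared on the toy values: {eta_squared:.3f}")
```

A value near 1 on the real data would suggest that yoj and jobclass carry largely overlapping information, which is the situation the caution above is about.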

Max educ From Table 7, the probability of claim occurrence is lower for policy holders with higher education. Therefore, it is potentially a useful predictor for claim flag. From Figure 21, the claim amount is approximately constant for all education categories. Therefore, it is not a useful predictor for claim amount.

Home value From the top right panel of Figure 22, the average home value is lower for non-claim policy holders. Therefore, it is a useful predictor for claim flag. The bottom two panels indicate that claim amount is constant over home values.


[Figure 16, four panels. Top left: histogram (density) of income. Top right: box plots of income by claim (No, Yes). Bottom left: claim amount against income. Bottom right: log claim amount against income.]

Figure 16: Income

[Figure 17, two panels: claim amount by gender (F, M), and log claim amount by gender.]

Figure 17: Gender

[Figure 18, two panels: claim amount by married (No, Yes), and log claim amount by married.]

Figure 18: Married


[Figure 19, two panels: claim amount by single parent (No, Yes), and log claim amount by single parent.]

Figure 19: Single Parent

Table 6: Job Class

Job class       Frequency    Percent claims
Blue Collar     23.7%        35.1%
Clerical        16.5%        30.3%
Doctor          3.3%         11.5%
Home maker      8.7%         27.2%
Lawyer          10.7%        18.0%
Manager         13.0%        13.5%
Professional    14.6%        23.1%
Student         9.3%         37.9%

[Figure 20, two panels: claim amount by job class, and log claim amount by job class.]

Figure 20: Job Class

Table 7: Maximum education level

Max education level    Frequency    Percent claims
< High School          15%          32%
High School            29%          35%
Bachelors              27%          24%
Masters                20%          19%
PhD                    9%           16%

[Figure 21, two panels: claim amount by maximum education, and log claim amount by maximum education.]

Figure 21: Max Education Level

[Figure 22, four panels. Top left: histogram (density) of home value. Top right: box plots of home value by claim (No, Yes). Bottom left: claim amount against home value. Bottom right: log claim amount against home value.]

Figure 22: Home Value


Table 8: Density

Density         Frequency    Percent claims
Urban           45%          20%
Highly Urban    35%          46%
Rural           15%          7%
Highly Rural    5%           6%

Density is the population density of the policy holder's living area. There are 4 categories: highly rural, rural, urban and highly urban. In Table 8, the probability of making a claim is far higher in urban areas. Therefore, density might be a good predictor for claim occurrence. As shown in Figure 23, the claim amount does not vary between different areas.
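The tabulated categorical variables can be compared on a single crude scale: the gap between the highest and lowest claim rate across their levels. The sketch below ranks the variables from Tables 1 to 8 this way. It is a rough screening aid, not the method used in the original analysis. The 2% minimum share used to drop sparse levels is an arbitrary assumption made here, and the names `tables`, `MIN_SHARE` and `claim_rate_spread` are hypothetical.

```python
# (share of policies in %, percent claims) per level, copied from Tables 1-8.
# Level 8 of npolicy is omitted because Table 3 reports no claim rate for it.
tables = {
    "kidsdriv": [(88.0, 25), (7.8, 39), (3.4, 40), (0.7, 53), (0.04, 50)],
    "car use": [(36.8, 35), (63.2, 22)],
    "npolicy": [(53.4, 26.8), (30.7, 27.2), (10.9, 27.1), (3.6, 24.7),
                (1.0, 13.0), (0.1, 0.0), (0.2, 0.0), (0.05, 0.0)],
    "car type": [(8.3, 26), (17.2, 31), (26.1, 17), (11.4, 35),
                 (28.0, 29), (8.9, 27)],
    "red car": [(71.1, 27), (28.9, 26)],
    "jobclass": [(23.7, 35.1), (16.5, 30.3), (3.3, 11.5), (8.7, 27.2),
                 (10.7, 18.0), (13.0, 13.5), (14.6, 23.1), (9.3, 37.9)],
    "max educ": [(15, 32), (29, 35), (27, 24), (20, 19), (9, 16)],
    "density": [(45, 20), (35, 46), (15, 7), (5, 6)],
}

# Assumption: ignore levels holding less than 2% of policies, since their
# claim rates rest on very few observations.
MIN_SHARE = 2.0


def claim_rate_spread(levels, min_share=MIN_SHARE):
    """Gap (percentage points) between highest and lowest claim rate."""
    rates = [rate for share, rate in levels if share >= min_share]
    return max(rates) - min(rates)


# Sort variables from the widest to the narrowest spread in claim rate.
ranking = sorted(
    ((claim_rate_spread(levels), name) for name, levels in tables.items()),
    reverse=True,
)
for spread, name in ranking:
    print(f"{name:10s} {spread:5.1f} points")
```

Under these assumptions density and jobclass show the widest spread and red car the narrowest, in line with the variable-by-variable conclusions above. The measure ignores level sizes beyond the cutoff, so it should be read only as a first screen.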

[Figure 23, two panels: claim amount by density, and log claim amount by density.]

Figure 23: Density
