De Jong and Heller SAS miner preliminary analysis
October 25, 2007
In this document we go through the data exploration that one should always undertake before embarking on the construction of a statistical model. The occurrence of a claim (claim flag) and the claim amount (clm amt) are taken as the response variables. The effects of the other variables on these responses are examined here, graphically and numerically.
Kidsdriv is the number of kids in the car when driving. The left panel in Figure 1 displays a histogram of kidsdriv. It shows that driving without any kids in the car is the most common case and that the maximum number of kids in the car is 4. The right panel displays box plots of log claim amount by the number of kids in the car. These indicate that, given that a claim occurs, the claim amount does not vary with the number of kids. Therefore, kidsdriv is not a promising predictor for claim amount.
Figure 1: Kids in the car (left: frequency by number of kids; right: log claim amount by number of kids)
In Table 1, the probability of a claim tends to increase with the number of kids. This indicates that kidsdriv is a potential candidate to predict claim occurrence.

Table 1: Kids in the car
No. of kids   Frequency   Percent claims
0             88.0%       25%
1             7.8%        39%
2             3.4%        40%
3             0.7%        53%
4             0.04%       50%
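A table of this kind can be built directly from policy-level records. The following is a minimal sketch only: the function name claim_table and the toy records are assumptions made for illustration and are not part of the original analysis.

```python
from collections import Counter

def claim_table(records):
    """Return {level: (share_of_policies, claim_rate)} for (level, claim_flag) pairs."""
    totals = Counter(level for level, _ in records)
    claims = Counter(level for level, flag in records if flag)
    n = len(records)
    return {
        level: (totals[level] / n, claims[level] / totals[level])
        for level in sorted(totals)
    }

# Toy records (hypothetical): (number of kids in the car, claim flag)
records = [(0, 0), (0, 0), (0, 1), (0, 0), (1, 1), (1, 0), (2, 1), (2, 1)]
for level, (share, rate) in claim_table(records).items():
    print(f"{level}  {share:6.1%}  {rate:4.0%}")
```

The first number per level corresponds to the Frequency column and the second to the Percent claims column.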
Plcydate describes the date that the policy starts. Policy starting dates are spread evenly from March 1993 to June 1998. The average starting dates are similar for policies with and without claims, which means that there is no difference between old and new policy holders in terms of claim occurrence. There is no pattern between claim amount and policy starting date. Therefore, this variable is not used for predicting either clm flag or clm amt.
Travtime The top left panel in Figure 2 displays the histogram of travel time between home and work. The mean travel time is 33.4 minutes. The top right panel displays the boxplots of travel time without a claim (left) and with a claim (right). The medians and spreads are similar, which suggests that travtime does not affect claim occurrence. In the bottom left panel, claim amount (given a claim occurred) is plotted against travel time. In the bottom right panel, log claim amount (given a claim occurred) is plotted against travel time. The horizontal smooth lines in both plots indicate that claim amount does not depend on travel time.
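The boxplot comparison of medians and spreads can also be made numerically. The sketch below is illustrative: the function summarise_by_flag, the variable names and the toy travel times are assumptions, and the interquartile range is used here as one possible measure of spread.

```python
from statistics import median, quantiles

def summarise_by_flag(values, flags):
    """Return {flag: (median, interquartile range)} of values grouped by flag."""
    groups = {}
    for value, flag in zip(values, flags):
        groups.setdefault(flag, []).append(value)
    summary = {}
    for flag, group in groups.items():
        q1, _, q3 = quantiles(group, n=4)  # quartiles of the group
        summary[flag] = (median(group), q3 - q1)
    return summary

# Toy travel times in minutes (hypothetical) with a 0/1 claim flag
travtime = [10, 20, 30, 40, 50, 15, 25, 35, 45, 55]
clm_flag = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(summarise_by_flag(travtime, clm_flag))
```

Similar medians and interquartile ranges in the two groups would support the reading taken from the boxplots.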
Figure 2: Travel time from home to work (top left: density of travel time; top right: travel time by claim, No/Yes; bottom left: claim amount against travel time; bottom right: log claim amount against travel time)
Car use There are two types of car usage: commercial and private. Commercial cars account for 36.8% of cars and private cars for 63.2%. As shown in Table 2, the probability of a claim is higher for commercial cars. Thus, car usage is a potential explanatory variable for claim occurrence.
Figure 3 displays boxplots of claim amount (left) and log claim amount (right) by car usage. Car usage does not look promising as an explanatory variable for claim amount.
Table 2: Car use
Car use       Frequency   Percent claims
Commercial    36.8%       35%
Private       63.2%       22%
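One way to summarise the difference in Table 2 is an odds ratio, which is the scale a logistic model for claim occurrence would work on. This is a sketch under that assumption; the original analysis does not report odds ratios, and the helper odds is introduced here for illustration.

```python
def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)

# Claim probabilities taken from Table 2
p_commercial = 0.35
p_private = 0.22

odds_ratio = odds(p_commercial) / odds(p_private)
print(f"Odds ratio, commercial vs private: {odds_ratio:.2f}")
```

The odds of a claim for commercial cars come out at about 1.9 times those for private cars.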
Figure 3: Car use (left: claim amount by car use, Commercial/Private; right: log claim amount by car use)
Bluebook describes the value of the car. The top left panel of Figure 4 displays the histogram of bluebook. The boxplots indicate that the average bluebook of non-claim policies is lower than that of policies with a claim. The smooth lines show the relationship between bluebook and claim amount: there is an upward linear relationship between the two.
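An upward linear relationship can be checked with a straight-line fit. Note that this swaps in ordinary least squares for the smooth lines used in the figure; the function fit_line and the toy values are assumptions for illustration only.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Toy data (hypothetical): car value and claim amount
bluebook = [5000, 10000, 15000, 20000]
clm_amt = [2100, 2900, 4100, 4900]
intercept, slope = fit_line(bluebook, clm_amt)
print(f"intercept={intercept:.0f}, slope={slope:.3f}")
```

A positive fitted slope is what an upward relationship between car value and claim amount would produce.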
Figure 4: Bluebook (top left: density of bluebook; top right: bluebook by claim, No/Yes; bottom left: claim amount against bluebook; bottom right: log claim amount against bluebook)
Retained measures the number of years the customer has been with the company. The histogram in Figure 5 shows that 15% of customers have held policies for less than one year. The boxplots show that the average number of years retained is higher for non-claim policies. Thus, it is a potential explanatory variable for the occurrence of a claim. The flat lines in the bottom plots indicate that claim amount does not depend on customer loyalty.
Figure 5: Retained: number of years with the company (top left: density of retained; top right: retained by claim, No/Yes; bottom left: claim amount against retained; bottom right: log claim amount against retained)
Npolicy is the number of policies the customer holds. From Table 3, about 53% of customers hold one policy and about 30% hold two. The proportion that make a claim does not increase with the number of policies. The splines of claim amount against npolicy are flat (bottom panels of Figure 6).
Car type SUV and Sedan are the two most popular car types, as shown in Table 4. The probability of a claim varies across car types, as also indicated in Table 4. The claim amount does not vary much across car types (bottom panels of Figure 7).
Red car describes whether the car’s color is red. 29% of insured cars are red. The probability of making a claim is similar for red and non-red cars, as shown in Table 5. Figure 8 shows that the claim amount is also similar for red and non-red cars.
Clm freq measures the number of claims in the past 5 years. The top left panel in Figure 9 indicates that the majority of policies (61.1%) have no claim in the past 5 years. The customers that incurred a claim this year have a higher number of past claims, on average. This means that clm freq is potentially a good predictor of claim occurrence. The bottom two panels indicate that claim amount is not related to clm freq.
Oldclaim records the old claim amounts incurred in the past 5 years. The top right panel of Figure 10 indicates that a claim is more likely to occur with a higher past claim amount. The bottom two panels indicate that, given a claim occurred, there is no relationship between claim amount and old claim amount.
Revoked indicates whether the policy holder’s license has been suspended in the last 7 years. Around 12.2% of licenses have been suspended in the past 7 years. The occurrence of a claim is related to revoked: a policy holder whose license has been suspended in the last 7 years has a 45% chance of incurring a claim, compared with 24% if the license has not been suspended. The claim amount is not related to license suspension, as shown in Figure 11.
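The three percentages quoted for revoked can be combined with the law of total probability and Bayes' rule. The sketch below uses only the figures quoted in the text; the variable names are illustrative assumptions.

```python
# Figures quoted in the text for the variable revoked
p_revoked = 0.122            # share of policy holders with a suspended license
p_claim_if_revoked = 0.45    # claim probability given a suspension
p_claim_if_not = 0.24        # claim probability given no suspension

# Law of total probability: overall claim probability
p_claim = p_revoked * p_claim_if_revoked + (1 - p_revoked) * p_claim_if_not

# Bayes' rule: share of claimants whose license was suspended
p_revoked_if_claim = p_revoked * p_claim_if_revoked / p_claim

print(f"Overall claim probability: {p_claim:.3f}")
print(f"P(revoked | claim): {p_revoked_if_claim:.3f}")
```

On these figures roughly one claimant in five has had a license suspension, against about one policy holder in eight overall.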
Mvr pts is motor vehicle points. As we do not have any information on the data definition, an educated guess is that a low mvr pts is good. The top right panel in Figure 12 indicates that the average motor vehicle points for policies with claims is higher. The bottom two panels indicate that claim amount is not related to motor vehicle points.
Age The average age of drivers making a claim is lower than that of drivers without a claim. Age looks promising as a predictor of the occurrence of a claim. The claim amount appears to be independent of age.
Homekids is the number of kids at home. This variable is related to neither claim occurrence nor claim amount, as shown in Figure 14.
Yoj is the number of years the customer has been working. The top left panel of Figure 15 shows that yoj is approximately normally distributed with a mean of about 10 years, apart from a hump at 0 that corresponds to student policy holders. The average number of years of working is higher for policy holders with no claim. This indicates that yoj is a good predictor for claim occurrence. The years of working are not related to claim size.
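Because of the hump at 0, it can help to summarise the zeros separately from the rest of the distribution. The following is a sketch only; the function split_zero_hump and the toy values are assumptions for illustration.

```python
from statistics import mean, stdev

def split_zero_hump(values):
    """Return (share of zeros, mean of non-zero values, stdev of non-zero values)."""
    non_zero = [v for v in values if v != 0]
    share_zero = 1 - len(non_zero) / len(values)
    return share_zero, mean(non_zero), stdev(non_zero)

# Toy years-of-working values (hypothetical)
yoj = [0, 0, 8, 9, 10, 11, 12, 10, 9, 11]
share_zero, centre, spread = split_zero_hump(yoj)
print(f"zeros: {share_zero:.0%}, mean of the rest: {centre:.1f}, sd: {spread:.2f}")
```

The mean and standard deviation of the non-zero part describe the bell-shaped portion without being pulled down by the zeros.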
Figure 9: Claim Frequency (top left: density of claim frequency; top right: claim frequency by claim, No/Yes; bottom left: claim amount against claim frequency; bottom right: log claim amount against claim frequency)
Figure 10: Old claim amount (top left: density of old claim; top right: old claim by claim, No/Yes; bottom left: claim amount against old claim; bottom right: log claim amount against old claim)
Figure 11: Revoked (left: claim amount by revoked, No/Yes; right: log claim amount by revoked)
Figure 12: Motor vehicle points (top left: density of motor vehicle points; top right: motor vehicle points by claim, No/Yes; bottom left: claim amount against motor vehicle points; bottom right: log claim amount against motor vehicle points)
Figure 13: Age (top left: density of age; top right: age by claim, No/Yes; bottom left: claim amount against age; bottom right: log claim amount against age)
Figure 14: Home kids (top left: density of home kids; top right: home kids by claim, No/Yes; bottom left: claim amount against home kids; bottom right: log claim amount against home kids)
Figure 15: Years of working (top left: density of years of working; top right: years of working by claim, No/Yes; bottom left: claim amount against years of working; bottom right: log claim amount against years of working)
Income The distribution of income is right skewed with a hump at $0 income, which corresponds to the student policy holders. The average income of policy holders with no claim is higher. Claim amount is not related to income.
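Right skewness can be quantified as well as seen in a histogram. The sketch below computes a moment-based skewness coefficient and compares mean and median; the function sample_skewness and the toy incomes are assumptions for illustration.

```python
from statistics import mean, median

def sample_skewness(values):
    """Moment-based skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    centre = mean(values)
    m2 = sum((v - centre) ** 2 for v in values) / n
    m3 = sum((v - centre) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Toy incomes (hypothetical), including zero-income policy holders
income = [0, 0, 20000, 30000, 35000, 40000, 50000, 60000, 90000, 175000]
print(f"skewness={sample_skewness(income):.2f}")
print(f"mean={mean(income):.0f}, median={median(income):.0f}")
```

A positive coefficient, together with a mean above the median, is the numerical signature of a right-skewed variable.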
Gender 54% of policy holders are female. 27.5% of female drivers have incurred a claim, compared with 25.5% of male drivers. The average claim amounts, given that a claim is incurred, are similar, as shown in Figure 17.
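Whether a two-point gap in claim rates matters depends on the number of policies, which is not stated in the text. The sketch below shows a pooled two-proportion z statistic; the function two_proportion_z is illustrative and the portfolio size n_total is a purely hypothetical value, so the resulting number should not be read as a finding.

```python
from math import sqrt

def two_proportion_z(p1, n1, p2, n2):
    """z statistic for the difference between two proportions, pooled standard error."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Claim rates come from the text; the portfolio size is NOT given there,
# so n_total is a hypothetical value chosen only for illustration.
n_total = 10000
n_female = round(0.54 * n_total)
n_male = n_total - n_female
z = two_proportion_z(0.275, n_female, 0.255, n_male)
print(f"z = {z:.2f}")
```

With a different portfolio size the statistic changes, so the same two-point gap can be negligible or noticeable.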
Married 60% of policy holders are married. 34% of non-married policy holders have incurred a claim, compared with 22% of married policy holders. Therefore, claim occurrence appears to be associated with marital status. Figure ?? shows that claim amount is not related to marital status.
Parent1 denotes a single parent. Around 13% of policy holders are single parents. A policy holder who is a single parent has a 45% probability of making a claim, compared with 24% for one who is not. The average claim amount does not differ.
Jobclass Table 6 shows that students and blue collar workers have a higher probability of making a claim than managers and doctors. However, this variable might be correlated with years of working (yoj), as managers are expected to have a longer working history. Caution is therefore needed when including both yoj and jobclass in a model.
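The strength of association between a categorical variable such as jobclass and a numeric one such as yoj can be gauged with the correlation ratio (eta squared). This is one possible check, not one used in the original analysis; the function correlation_ratio and the toy values are assumptions.

```python
def correlation_ratio(groups):
    """Eta squared: between-group sum of squares over total sum of squares."""
    values = [v for group in groups.values() for v in group]
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups.values()
    )
    return ss_between / ss_total

# Toy years-of-working values by job class (hypothetical)
yoj_by_job = {
    "Student": [0, 0, 1],
    "Blue Collar": [8, 10, 12],
    "Manager": [12, 14, 15],
}
print(f"eta squared = {correlation_ratio(yoj_by_job):.2f}")
```

Values near 1 mean that job class explains most of the variation in years of working, in which case the two variables carry largely overlapping information.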
Max educ From Table 7, the probability of claim occurrence is generally lower for policy holders with higher education. Therefore, it is potentially a useful predictor for claim flag. From Figure 21, the claim amount is approximately constant across education categories. Therefore, it is not a useful predictor for claim amount.
Home value From the top right panel of Figure 22, the average home value is lower for non-claim policy holders. Therefore, it is a useful predictor for claim flag. The bottom two panels indicate that claim amount is constant over home values.
Table 7: Maximum education level
Max education level   Frequency   Percent claims
< High School         15%         32%
High School           29%         35%
Bachelors             27%         24%
Masters               20%         19%
PhD                   9%          16%
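Each breakdown of frequency and percent claims implies an overall claim probability, and these should agree across variables. The sketch below runs that consistency check on figures quoted in this document; the dictionary layout and the function overall_rate are illustrative assumptions.

```python
# (share of policies, claim probability) pairs as quoted in the text and tables
breakdowns = {
    "kidsdriv (Table 1)": [(0.880, 0.25), (0.078, 0.39), (0.034, 0.40),
                           (0.007, 0.53), (0.0004, 0.50)],
    "car use (Table 2)": [(0.368, 0.35), (0.632, 0.22)],
    "gender": [(0.54, 0.275), (0.46, 0.255)],
    "married": [(0.60, 0.22), (0.40, 0.34)],
    "max educ (Table 7)": [(0.15, 0.32), (0.29, 0.35), (0.27, 0.24),
                           (0.20, 0.19), (0.09, 0.16)],
}

def overall_rate(pairs):
    """Share-weighted average claim probability, renormalising rounded shares."""
    total_share = sum(share for share, _ in pairs)
    return sum(share * rate for share, rate in pairs) / total_share

rates = {name: overall_rate(pairs) for name, pairs in breakdowns.items()}
for name, rate in rates.items():
    print(f"{name:20s} {rate:.1%}")
```

All five breakdowns imply an overall claim probability between 26% and 27%, so the quoted percentages are mutually consistent up to rounding.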
Density is the population density of the policy holder’s living area. There are 4 categories: highly rural, rural, urban and highly urban. In Table 8, the probability of making a claim is far higher in urban areas. Therefore, density might be a good predictor for claim occurrence. As shown in Figure 23, the claim amount does not vary between areas.