Decision Support Systems Final Paper
MIS-648-101
Professor Jerry Fjermestad
Presented By
Andrey Skroznikov
Kevin Matos
Edwin Zankang
Onyedikachi Achilike
Table of contents
Introduction
Define the problem
Develop a proposal
    Justification
    Benefits
SPSS data model
    Step 1 Excel .csv database
    Step 2 Import into SPSS
    Step 3 Cleaning the data
Summary
Research results
    Multiple Linear Regression in R
    Descriptive analyses in Tableau 8.2
    Ordinal Linear Regression in SPSS
    Descriptive analyses in IBM Cognos Insight
Contribution
Attributes, Independent and dependent variables
Dimensions of the model
    Ordinal linear
    Multiple linear
Building the model
References
Introduction
It is becoming increasingly apparent that every business competes as an information
business. At some level, all businesses, whether they manufacture goods or provide services,
compete in the global economy as information-driven enterprises. This has fueled the rapid
development and deployment, over the past few decades, of various forms of Information
Technology (IT). Examples include business intelligence systems such as IBM Cognos, SPSS,
Tableau, and many more. Auto insurance is one industry where such tools apply. Each state has
its own laws that outline the types and amount of auto insurance that a driver is required to
have. Coverage requirements differ depending on where a driver lives and what his or her
personal insurance needs are. An agent can help customers understand their state's insurance
requirements so that they can make an educated decision about the coverage levels and
deductibles they want.
The objective of this paper is, first, to examine Allstate's auto insurance premium
pricing and the risk factors involved in the insurance problem and, second, to develop a
proposal on how to solve the problem. We outline the benefits of this project and use
software that is not complicated, so that we can predict and analyze efficiently. After
identifying the data, we import it into SPSS for a series of analyses. The results of these
analyses will give us a better idea of how to strategize and how to identify and allocate
resources as a growing business.
A Decision Support System (DSS) is an automated information system used to support
decision-making within an organization or business. A DSS enables users to filter and analyze
large volumes of data and to gather information that can be used to solve problems and make
better decisions. The benefits of decision support systems include more informed
decision-making, timely problem solving, greater efficiency and better learning (11). A DSS can
compile and present information for many aspects of a business, including sales trends, actual
versus projected sales, worker productivity, profitability mix and so on. A DSS enhances
business operations in several ways. First, it saves time: research has demonstrated reduced
decision cycle time, increased employee productivity and more timely information for decision
making. Second, it enhances effectiveness and improves interpersonal communication. Third, it
provides competitive advantage and reduces cost, through labor savings in making decisions and
through lower infrastructure or technology costs.
Define the problem
The insurance business is one of the leading elements of the financial industry. “While the use of
big data and business intelligence will matter across sectors, some sectors are set for greater
gains, the computer and electronic products and information sectors, as well as finance and
insurance, are poised to gain substantially from the use of BI”. Major global and national
insurance companies have a rich, long-standing history and hold a huge share of the insurance
market. Among the many types of insurance products, car insurance is perhaps the most common.
Different insurance companies use various models for quoting customers their car insurance
premiums. As a startup insurance company, our primary goal is to investigate how car age,
duration of previous policies with the company and average age of the customers on the policy
affect the quoted insurance premium. Our secondary goal is to examine how risk is attributed to
each policy. We investigate the influence of the following factors on risk categories: car age,
duration of previous policies, premium cost, average customer age on the policy, home ownership
and marital status.
Develop a proposal
Justification
As a startup insurance company, we realize that car insurance products can be a
significant source of revenue. We also realize that the insurance business is extremely
competitive and that the only way we can capture a share of the market is to properly identify
how other insurance companies price their products. Initially, as an insurance startup, we do
not feel the need to reinvent the wheel. Before we come up with our own pricing strategies, we
need to understand the robust models that are being used by others in the market. To execute
our investigative queries, we acquired the Allstate car insurance database, which contained all
our variables of interest and, after cleaning, almost 420,000 observations related to unique car
insurance policies. To provide a sound analysis of premium pricing and of the factors
attributed to composite risk categories, we decided to get acquainted with trending BI
platforms. For the descriptive parts of our analyses we decided to use IBM Cognos and Tableau,
because both platforms provide strong visualization tools based on similar constructs of
dimensions and measures. For the predictive parts of our analyses we decided to use R and SPSS,
because both platforms are popular in academic and business environments for their reliable
regression tools. We justify our BI platform selections as a means to reconstruct a
quantitative picture of the variable relationships behind the core processes of premium pricing
and evaluation of risk categories.
Benefits
“We can measure and therefore manage more precisely than ever before. We can make
better predictions and smarter decisions. We can target more-effective interventions, and can
do so in areas that so far have been dominated by gut and intuition rather than by data and
rigor.” There are a number of direct benefits that we hope to gain from our investigations. By
determining the influence of composite factors on the premium prices quoted by our competition,
we could form a sound strategy to undercut those prices and to optimize our future customer
portfolio based on car age, average age on the policy and duration of previous policies. We
also hope to build a profile of customers who currently pay large premiums, as a potential
target for our future direct marketing. Understanding the mechanics behind risk categories will
give us a guideline as to which metrics should be used in determining whether potential
customers represent high or low risk. Knowledge of risk factor categories will also help us to
optimize the future portfolio of customers. The indirect benefits of our investigations come
from experimenting with four different BI platforms. After we complete our analyses, we should
have a clear picture of which specific BI tools suit our needs best. As a startup insurance
company, we need reliable BI tools that provide intuitive, high-quality charts at short notice;
we also need BI tools that are not complicated to use for predictive analytics.
SPSS data model
The original Allstate insurance database was downloaded from www.kaggle.com. The
database was obtained as an Excel .csv file. The original data contained many errors and a
great number of variables that were not needed for the intended analyses, so we imported the
original Excel data into SPSS for further cleaning and consolidation.
Step 1 Excel .csv database
Step 2 Import into SPSS
Step 3 Cleaning the data
A significant part of our SPSS data model creation was spent on removing observations with
missing values and removing variables for which the insurance company provided no detailed
explanation. Some observations were also removed because their values did not fall in the range
defined by the variable details. The original database contained 665,250 unique observations
and 25 variables. The cleaned database contained 418,439 observations and 11 variables. Here is
a list of some of the changes to the original database:
Removed observations with policy group sizes of more than 2 people
Removed the following variables due to irrelevance: customer ID, shopping pt, record type, day of the purchase, time of the purchase, location of the purchase, car value, etc.
Removed age oldest and age youngest on the policy and created an “average age on the policy” variable
Removed noisy data such as NA for risk factors and NA for duration previous
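The cleaning steps listed above can be sketched in Python. This is only a minimal illustration under stated assumptions, not the SPSS procedure used in this project: the column names (group_size, age_oldest, age_youngest, risk_factor, duration_previous and so on) and the sample rows are hypothetical stand-ins for the real file.

```python
# Hypothetical raw rows as they might be read from the .csv file.
# Column names and values are assumptions for illustration only.
RAW_ROWS = [
    {"customer_ID": "1", "group_size": "2", "age_oldest": "46", "age_youngest": "42",
     "risk_factor": "3", "duration_previous": "2", "car_age": "2", "cost": "633"},
    {"customer_ID": "2", "group_size": "3", "age_oldest": "50", "age_youngest": "18",
     "risk_factor": "1", "duration_previous": "5", "car_age": "7", "cost": "701"},
    {"customer_ID": "3", "group_size": "1", "age_oldest": "31", "age_youngest": "31",
     "risk_factor": "NA", "duration_previous": "4", "car_age": "1", "cost": "655"},
    {"customer_ID": "4", "group_size": "1", "age_oldest": "60", "age_youngest": "60",
     "risk_factor": "2", "duration_previous": "NA", "car_age": "12", "cost": "598"},
    {"customer_ID": "5", "group_size": "1", "age_oldest": "28", "age_youngest": "28",
     "risk_factor": "4", "duration_previous": "13", "car_age": "10", "cost": "755"},
]

# Stand-in for the longer list of variables judged irrelevant.
DROPPED_COLUMNS = {"customer_ID", "age_oldest", "age_youngest"}


def clean(rows):
    """Apply the listed cleaning rules and return the surviving records."""
    cleaned = []
    for row in rows:
        # Rule: drop policies with group sizes of more than 2 people.
        if int(row["group_size"]) > 2:
            continue
        # Rule: drop observations with NA risk factor or NA previous duration.
        if row["risk_factor"] == "NA" or row["duration_previous"] == "NA":
            continue
        record = {k: v for k, v in row.items() if k not in DROPPED_COLUMNS}
        # Rule: replace oldest/youngest age with one average-age variable.
        record["average_age"] = (int(row["age_oldest"]) + int(row["age_youngest"])) / 2
        cleaned.append(record)
    return cleaned


cleaned_rows = clean(RAW_ROWS)
```

In this toy input, three of the five rows are dropped (one for group size, two for NA values), which mirrors on a small scale how the database shrank during cleaning.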
Summary
Research questions:
1. How is the quoted insurance premium affected by car age, duration of previous coverage and
average customer age on the policy?
2. How is the policy risk category affected by car age, duration of previous policies, premium
cost, average customer age on the policy, home ownership and marital status?
To investigate our first research question, we imported the SPSS database file into R. We
then conducted a multiple linear regression analysis, which allowed us to determine which
variables were significant drivers in predicting the insurance premium quoted to a customer.
To validate the regression results, we imported the SPSS database into Tableau 8.2 and conducted
a descriptive analysis of each driver versus the quoted insurance premium.
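For readers unfamiliar with multiple linear regression, the fitting step can be sketched as an ordinary least squares solve of the normal equations. The sketch below is written in plain Python rather than R, is not the code used in this project, and runs on synthetic, noise-free data generated from a made-up rule so that the recovered coefficients can be checked. The function names and the sample policies are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Move the row with the largest pivot into place for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, target):
    """Return [intercept, b1, ..., bk] that minimize the squared error."""
    xs = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, target)) for i in range(k)]
    return solve(xtx, xty)


# Synthetic policies: (car_age, duration_previous, average_age). NOT the Allstate data.
policies = [(0, 1, 25.0), (2, 0, 40.5), (5, 3, 33.0), (9, 7, 61.0), (12, 15, 48.0), (3, 2, 19.5)]
# Costs generated from a made-up linear rule without noise.
costs = [700.0 - 2.0 * c - 1.0 * d - 0.5 * a for c, d, a in policies]

coefficients = fit_ols(policies, costs)  # approximately [700.0, -2.0, -1.0, -0.5]
```

On real data the fit is never exact; the share of variance left unexplained is what the adjusted R-squared reports.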
To investigate our second research question, we conducted an ordinal linear regression
in SPSS. We chose ordinal linear regression because we were trying to predict the risk
category variable (a categorical variable measured on an ordinal scale). With the help of the
ordinal regression, we were able to see how the risk categories of the insurance policies were
affected by demographic variables of the customers and by insurance construct variables. To
validate the ordinal regression results, we imported the SPSS database into IBM Cognos Insight
and conducted a descriptive analysis of each driver variable versus the risk category variable.
For both analyses, our methodology was to compare and contrast the regression results
with matching descriptive inquiries into the relationships between variables. Visual
representations of these relationships helped us to find anomalies that called into question
the validity of the results of both regression analyses. The anomalies we discovered helped us
to understand the hidden mechanics behind the pricing of the insurance premium and the ranking
of risk categories.
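The comparison described above can be sketched as follows: compute the mean of the target variable for each value of a driver, then flag the points where the trend reverses, since a single regression slope cannot represent such reversals. This is a minimal illustration in Python with made-up observations; the helper names are hypothetical and do not correspond to anything in Tableau or Cognos.

```python
from collections import defaultdict


def mean_by_group(pairs):
    """Average the value of each (group, value) pair per group, in sorted group order."""
    totals = defaultdict(lambda: [0.0, 0])
    for group, value in pairs:
        totals[group][0] += value
        totals[group][1] += 1
    return {g: s / n for g, (s, n) in sorted(totals.items())}


def direction_changes(means):
    """Return the groups at which the trend in the group means reverses direction."""
    groups = list(means)
    changes = []
    for prev, cur, nxt in zip(groups, groups[1:], groups[2:]):
        before = means[cur] - means[prev]
        after = means[nxt] - means[cur]
        if before * after < 0:  # rise followed by fall, or fall followed by rise
            changes.append(cur)
    return changes


# Made-up (car_age, cost) observations, for illustration only.
observations = [(0, 600.0), (0, 620.0), (1, 660.0), (1, 680.0),
                (2, 650.0), (2, 640.0), (3, 630.0)]

car_age_means = mean_by_group(observations)
turning_points = direction_changes(car_age_means)
```

With these sample values the mean cost rises from age 0 to age 1 and falls afterwards, so age 1 is reported as a turning point, the kind of anomaly a purely linear model would hide.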
Research results
1. Multiple Linear Regression in R (Illustrations are in building the model)
Multiple Regression
                    Estimate   Std. Error   T value    Sig
Intercept           681.95     0.213        3199.7     0.00**
Car_Age             -2.179     0.011        -184.2     0.00**
Duration_Previous   -1.17      0.014        -81.68     0.00**
Average_age         -0.525     0.003        -133.78    0.00**
The adjusted R-squared for our multiple linear regression model was 0.1389. This finding
indicates that the model is poor and explains only approximately 14% of the variance in quoted
premium cost. Even though the model does not explain the majority of the variance, it provided
us with driver loadings. Car_age was the strongest driver in the model (Beta -2.179), which
indicated that the lower the age of the car, the higher the quoted premium.
Duration_Previous was the second strongest driver in the model (Beta -1.17), which indicated
that the shorter the duration of previous coverage, the higher the quoted premium. Finally,
average customer age on the policy was the weakest driver (Beta -0.525), which indicated that
the lower the average age on the policy, the higher the quoted premium.
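To make the coefficients concrete, the fitted equation can be evaluated directly. The sketch below copies the estimates from the regression table in this section; the function name and the two example policies are hypothetical.

```python
# Coefficients copied from the multiple regression table in this section.
INTERCEPT = 681.95
B_CAR_AGE = -2.179
B_DURATION_PREVIOUS = -1.17
B_AVERAGE_AGE = -0.525


def predicted_cost(car_age, duration_previous, average_age):
    """Evaluate the fitted linear equation for one policy."""
    return (INTERCEPT
            + B_CAR_AGE * car_age
            + B_DURATION_PREVIOUS * duration_previous
            + B_AVERAGE_AGE * average_age)


# Two hypothetical policies that differ only in car age.
new_car = predicted_cost(car_age=0, duration_previous=3, average_age=40)
old_car = predicted_cost(car_age=10, duration_previous=3, average_age=40)
difference = old_car - new_car  # ten years of car age times the Car_Age estimate
```

Holding the other drivers fixed, ten extra years of car age lowers the predicted quote by about 21.79. Because the model explains only about 14% of the variance, such predictions should be read as rough averages rather than as quotes for individual customers.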
Descriptive analyses in Tableau 8.2 (Illustrations are in building the model)
A. To further examine the variable relationships in our multiple regression model, we
conducted a series of descriptive analyses between the driver variables and the target variable
(quoted cost of the insurance premium). The first relationship that we looked at was the cost
of the insurance premium versus car age. Our regression model indicated that the lower the car
age, the higher the premium. However, when the two were plotted against each other, we saw that
the most significant jump in the premium happens between a brand-new car and a one-year-old car
(age 0 to age 1). From 3 to 9 years of car age the premium stays approximately the same. The
Tableau model also showed a significant drop in premium in year five, which was leveled off by
a premium increase in the next year. After year 9 the premium gradually declines up to a car
age of 29 years.
Possible explanations: The premium of the insured in year zero is much lower than in year one,
most likely because the insured purchased a new car and just started paying premium. Another
possible explanation is that the insured receive a major discount from the insurer for being new
customers. A possible reason for the drop in premium in year five is that most insured
individuals finish paying off their vehicles around then, and most manufacturer warranties
expire during this time frame, which leads people to maintain their vehicles before the warranty
expires. After year six people start showing neglect for their vehicles, and after warranties
expire vehicles generally start to display functional issues.
B. The second relationship that we looked at was the cost of the insurance premium versus the
duration of previous policies. Our regression model suggested that the shorter the duration,
the higher the quoted premium. However, when cost is plotted against duration, we see that the
premium increases from a policy duration of zero to year one. After this initial increase over
the first three years, the premium steadily declines up to year fifteen, where it jumps up,
surpassing the premium for any other duration year.
Possible explanations: the Tableau model visually illustrated that once insured customers had
been with Allstate for at least three years, they graduated onto the loyal customer list and
started to receive loyalty discounts, thus paying less premium to Allstate. These loyalty
discounts continued until loyalty year fifteen. We attributed the major spike in premium at year
fifteen to loyal senior customers. A possible reason for the increase is that most customers who
had been loyal for long periods of time were over sixty years old and much more prone to having
accidents, which increased the risk and raised the premium Allstate charged them.
C. The third relationship that we looked at was the cost of the insurance premium versus the
average insured age on the policy. Our regression model indicated that the lower the average age
on the policy, the higher the quoted premium. However, when cost was plotted against average
age, we saw a steady increase in the premium from the age of 17.5 to the age of 25, followed by
a gradual decrease that levels off up to the age group of 75. Policies with an average age of 75
showed the highest premium costs, which appeared to be an anomaly.
Possible explanations: our graph illustrated that younger individuals pay less premium. The
likely reason is that we had less data for new customers and that younger teenagers were insured
through their parents. According to the model in Tableau, once a young adult reaches the age of
twenty-six, the insurance company considers this age group to be at lower risk of suffering a
loss. Therefore, Allstate lowers the premium charged to the insured and keeps that rate
relatively flat for the duration of the individual's policy life, provided that they do not get
into car accidents or incur speeding tickets. Once individuals reached the age of seventy-five,
there was a tremendous spike in the premium they had to pay. A possible reason for the premium
increase is that most individuals no longer have the same reactions or mentality as when they
were younger. Anyone over the age of seventy-five was considered much more prone to getting into
a car accident and was therefore in a higher risk category, which directly impacted the premium
for the age group. An insurance company will not make money if it keeps suffering losses because
it did not charge enough premium to cover them.
2. Ordinal Linear Regression in SPSS (Illustrations are in building the model)
Parameter Estimates
                     Estimate   Std. Error   Sig
Car_age              0.032      0.001        .000
Duration_previous    -0.033     0.001        .000
Cost                 0.003      0            .000
Average_age          -0.26      0            .000
Homeowner = 0        0.117      0.006        .000
Married_couple = 0   0.055      0.007        .000
If Beta > 0: higher scores more likely
If Beta = 0: equally likely scores
If Beta < 0: lower scores more likely
The pseudo R-squared for our ordinal regression was 0.11. This finding indicates that the model
is poor and explains only about 11% of the variance in the rank categories for risk factors.
However, this model provided us with driver loadings, which we can use to determine
relationships. Home ownership was the strongest driver in the model (Beta 0.117): no home
ownership results in a higher probability of a high risk category. Average age on the policy was
the second best driver (Beta -0.26): the higher the average age on the policy, the lower the
risk category. Marital status (Beta 0.055) indicated that being single results in a higher risk
category. Car age (Beta 0.032) indicated that the older the car on the policy, the higher the
risk category. Duration previous (Beta -0.033) indicated that the longer the history of previous
policies with the company, the lower the risk category. Finally, Cost (Beta 0.003) has almost no
per-unit influence on risk categories, because its beta is close to 0.
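One way to read these estimates is to convert them to odds ratios. The sketch below assumes the ordinal regression used a logit link, which the paper does not state; under that assumption, exp(Beta) is the multiplicative change in the odds of falling into a higher risk category per one-unit increase in the driver. The estimates are copied from the Parameter Estimates table above; the function name and the `units` parameter are hypothetical.

```python
import math

# Estimates copied from the Parameter Estimates table above.
ESTIMATES = {
    "Car_age": 0.032,
    "Duration_previous": -0.033,
    "Cost": 0.003,
    "Average_age": -0.26,
    "Homeowner = 0": 0.117,
    "Married_couple = 0": 0.055,
}


def odds_ratio(beta, units=1.0):
    """Odds ratio for a change of `units` in the driver, ASSUMING a logit link."""
    return math.exp(beta * units)


# Per-unit odds ratios: above 1 favors higher risk categories, below 1 lower ones.
odds_ratios = {name: odds_ratio(beta) for name, beta in ESTIMATES.items()}

# The per-unit view depends on each variable's scale: Cost is measured in
# currency units, so a larger change can matter even with a tiny Beta.
cost_per_100 = odds_ratio(ESTIMATES["Cost"], units=100)
```

Note that per-unit estimates are not directly comparable across drivers measured on different scales (years, currency units, 0/1 indicators), so ranking drivers by raw Beta is only a rough guide.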
Descriptive analyses in IBM COGNOS Insight (Illustrations are in building the model)
A. To examine the relationships in our ordinal regression model, we conducted descriptive
analyses between each driver variable and the target variable (risk rank). Our ordinal
regression model indicated that the older the car, the higher the risk rank for the policy.
However, plotting risk categories against car age illustrated that the risk of the policy
increased substantially only when cars aged from zero to one year; the descriptive model showed
that after year one the risk categories decreased consistently as car age progressed.
Possible explanations: we believe that the initial risk increase was related to new car defects
that require recalls for correction. For example, a few years ago the Toyota Camry had an
accelerator issue in which the car would accelerate when drivers tried to brake, causing major
accidents. Our Cognos descriptive therefore showed that after the one-year trial period the risk
factors lowered, because by then the majority of factory defects have been corrected. We believe
that the risk increase in year five resulted from the average lifespan of car parts, which
usually start to show wear and tear during this period; Allstate factored wear and tear into the
construct of risk. After year seven, all levels of the risk factor showed a gradual decrease,
because on average people trade in older vehicles for new models. In general it is logical to
assume that cars do not last much longer than twelve years and end up in junkyards after
reaching the age of twenty-one years.
B. We further examined the relationship between risk categories and the duration of previous
coverage under the same policy. In other words, we investigated how a customer's loyalty
duration affects risk. The ordinal regression results indicated that the longer the previous
duration history, the lower the risk category. However, between year 0 and year 1 we noticed an
increase in the amount of risk, with the highest risk categories (3, 4) switching places. Apart
from this anomaly, our plot confirmed the regression expectations up to years 8-9, during which
all of the customers were deemed to be in the same risk category.
Possible explanations: during the first 0 to 1 years of policy duration, customers present the
highest turnover risk. Allstate gradually lowered the policy turnover risk by providing
continual discounts to people who continued to use the services of the company. We think that
the period between 8 and 9 years of continual coverage indicates a crossover between risk
categories, because the number of people in the highest risk categories (3, 4) declines while
the number of people in the lowest risk categories (1, 2) increases.
C. The next relationship that we investigated was average age on the policy versus risk factor.
Our ordinal regression results indicated that the higher the average age on the policy, the
lower the risk factor. When we plotted age versus risk factor, we saw that the regression
results hold, but mostly for the highest risk categories (3, 4). The third risk factor category
was the most commonly assigned category from the age of 17 to the age of 23. Our plot also
showed us that the lower risk factor categories (1, 2) actually increase throughout the age
progression.
Possible explanations:
We attributed the initial decrease (from the age of 17 to 23) in the highest risk factors to the
increasing maturity of younger individuals. The Cognos descriptive also illustrated that after
individuals turn twenty-five the risk factor lowers and remains consistent. We think that after
individuals turn twenty-five they are generally out of college and do not party as much. In
other words, people become even more mature and thus even less prone to risky accidents. For the
most part, this descriptive analysis confirmed our regression results.
D. Our ordinal regression results indicated that home ownership was the strongest driver in the
model. No home ownership resulted in higher risk categories. To further examine this
relationship, we plotted home ownership versus risk. The descriptive for the relationship
between the two variables did not show any anomalies; it supported the regression results.
Possible explanations: individuals who own a home may be less risky because, in the event of a
lawsuit, they have much more to lose (such as their home) and have more responsibilities than
individuals who do not own a home. Another reason is that homeowners most likely have a garage
in which to keep their vehicles secure.
E. One of our last analyses examined the relationship between marital status and risk
categories. The regression results indicated that being single results in higher risk
categories. We plotted marital status versus risk categories and found that the plot confirmed
the regression findings.
Possible explanations: we visually determined that married individuals are less risky than
individuals who are not married. A possible reason is that individuals who are not married do
not have as many responsibilities as people who are married. Another viable reason is that
married people usually share a policy, which drives the cost of premium down.
Contribution
“Successful companies, defined as those that outperform their peers in profitability, have
leaders who support the use of data” (6). Every company's goal is to grow and expand, and
integrating relevant big data will be imperative to this objective. However, big data alone will
not be enough; the data has to be mined and analyzed with BI platforms such as the ones
presented in this report. BI tools enable firms to glean all kinds of information, such as
customer-to-business relationships, and also aid in business process optimization. Business
intelligence systems combine operational data with analytical tools to present complex and
competitive information to planners and decision makers, in order to improve the timeliness and
quality of the decision-making process (7).
Consumers are always in search of a combination of quality and reasonable pricing; any
company that is able to offer both will gain a significant competitive advantage. For a startup
company looking to offer attractive products, the above analysis gave some insight into the
consumer base and into how to formulate and hyper-target our products. It is impossible to
offer great product packages without the right knowledge of customers, and this knowledge is
enabled by the use of BI platforms.
In order to create mutually beneficial premium price points, firms must account for and
mitigate certain risk factors. We advise that companies put customers in categories that reflect
such risks. Some of the risk factors we examined were home ownership, marital status and
customer age. In setting a price with those factors, a family bundle pricing strategy could be a
likely solution in a scenario where an individual lives with family and is not married. Such an
individual poses a risk because of marital and home ownership status. But if he or she has a
long duration with a previous insurer, is older (26+) and has an accident-free driving record,
these factors can also be examined; they help in reducing the estimated risk and potentially in
offering such a candidate a better price.
Another scenario can be seen in the relationship between car age and premium: the newest cars
(age 0) paid less premium, but according to the ordinal analysis the newest car posed the
highest risk. As mentioned before, this could be a result of young drivers being under their
parents' insurance; however, the risk is still there, just masked. To address this factor,
companies can set a flat starting rate for all new cars and add to or subtract from that amount
based on the other risk factors present. Also, for individuals who pose a lot of risk, loyalty
programs can be offered whereby they earn points for good driving (going for a long period of
time without an accident: 2+ years). This will diminish risk and also create a relationship in
which customers take accountability for premium pricing. It also fosters a trusting relationship
between business and client, as it shows that the insurer is considerate.
Using these descriptive and predictive analytical platforms helps a firm see the big picture of
its business and look for more relationships similar to the above examples. It shows that “one
size does not fit all”, and such analysis enables effective customer segmentation for
tailor-made pricing.
In summation, Business Intelligence is essential in an organization. “A sustainable business
model in today’s market is one that strategizes in congruence with effective knowledge
management, and business intelligence analytics is key for such management of knowledge.”
Attributes, Independent and dependent variables
Home_ownership – variable that describes whether the policy holder owns a home (0 – no home ownership, 1 – owns a home)
Car_age – variable that describes the age of the car on the policy
Married_couple – variable that describes the policy holder’s marital status (0 – not married, 1 – married)
Duration_previous – variable that describes the length of the previous policy with the same insurance company (Allstate)
Cost – variable that describes the cost of the quoted premium
Average_Age – variable that indicates the average age of the people on the same insurance policy
1. Multiple regression had one dependent variable, cost; the analysis had three independent variables: Car_age, Duration_previous and Average_Age.
2. Ordinal regression had one dependent variable, risk; the analysis had six independent variables: Cost, Home_ownership, Car_age, Married_couple, Duration_previous, Average_age.
Dimensions of the model
1. Multiple linear
2. Ordinal linear
Building the model (multiple linear regression in R)
[Diagram: Average Age, Car Age, Duration_previous -> Cost]
[Diagram: Homeownership, Cost, Car Age, Married_couple, Duration previous, Average Age -> Risk (1-4)]
Cost versus Car Age
Cost Versus Duration Previous
Cost Versus Average Age
Ordinal Linear Regression in SPSS
Risk versus Car_age
Risk versus Duration Previous
Risk versus Average Age
Risk versus Home ownership
Risk versus Marital Status
References.
(1) U. The Mandate for Business Analytics (n.d.): n. pag. Http://spotfire.tibco.com. Spotfire. Web. 16 Nov. 2015. http://spotfire.tibco.com/assets/bltb1c81526719735f0/info-advantage.pdf.
(2) Manyika, James. "Big Data: The Next Frontier for Innovation, Competition, and Productivity", McKinsey Quarterly. May 2011. http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation. Viewed 15 November 2014
(3) McAfee, A., and Brynjolfsson, E. 2012. “Big data: the management revolution,” Harvard Business Review, (November 2014), p. 62.
(4) www.Kaggle.com
(5) "Do Low-Income Households Pay More For Auto Insurance?" Online Wholesale Insurance, Superior Access Insurance Services (SAIS). Web. 15 Nov. 2014. http://www1.superioraccess.com/news/insurance-industry-news/do-low-income-households-pay-more-for-auto-insurance.
(6) Cukier, Kenneth. “Ideas economy: Finding Value in Big Data”, Oracle. June 2013. http://www.oracle.com/us/technologies/big-data/finding-value-in-big-data-1991047.pdf. Viewed 16 November 2014
(7) R. Kalakota, Gartner says – BI and Analytics a $12.2 Bln market, 2011. http://practicalanalytics.wordpress.com/2011/04/24/gartner-says-bi-and-analyticsa-10-5-bln-market/
(8) Dalkir, K., (2011) Knowledge management in theory and practice. 2nd ed. Cambridge, Mass.: MIT Press, Print.
(9) B. Crabtree, N.R. Jenning (Eds.), The Practical Application of Intelligent Agents and Multi-Agent Technology (1996) London, UK
(10) Teo, T. and W.Y. Choo (2001) “Assessing the Impact of Using the Internet for Competitive Intelligence”, Information & Management, (39)1, pp. 67.
(11) http://www.investopedia.com/terms/d/decision-support-system.asp