BAS 250 Lesson 3: K-Means Clustering
This Week’s Learning Objectives
Effectively employ the CRISP-DM method
Develop a k-means cluster data mining model
Interpret output generated by model
Explain what k-Means clusters are, how they are
found, and their benefits
Demonstrate the necessary format for data in
order to create k-Means clusters
Interpret the clusters generated by a k-Means
model and explain their significance, if any
K-Means Clustering
Clustering means: “Grouping of data or dividing a large data set into
smaller data sets of some similarity”
The “k” in k-Means clustering stands for the number of groups, or
clusters. You choose this number yourself, but the method is unsupervised
learning: it needs no labeled outcome to learn from.
Enables the user to find natural groups within a data set by
comparing each observation's attribute values with the group means
Means are susceptible to undue influence by extreme outliers, so
watching for inconsistent data is very important with k-Means
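To illustrate the point about outliers, here is a small arithmetic sketch. The weight values are made up for illustration and are not taken from the course data set.

```python
from statistics import mean

# Hypothetical weights in pounds; illustrative values only.
weights = [150, 155, 160, 165, 170]

# Add one extreme value, such as a data-entry error.
with_outlier = weights + [900]

print(mean(weights))       # mean of the consistent values
print(mean(with_outlier))  # the single extreme value pulls the mean far upward
```

Because a cluster is described by its means, one value like this can shift a whole cluster's center, which is why inconsistent data should be handled before modeling.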
K-Means Clustering
The k-Means algorithm picks a sample of observations as starting points and then
compares every other observation's attribute values to those starting means
The process is repeated in order to ‘circle in’ on the best matches,
forming groups of observations which become
clusters as the means settle from one pass to the next
Sometimes takes a while to run, especially if using a large
number of ‘max runs’ or seeking a large number of clusters (k)
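The repeat-until-stable process described above can be sketched in a few lines of plain Python. This is a minimal illustration of the general k-means idea, not the implementation used by the course software; the function name `k_means`, the `max_iterations` and `seed` parameters, and the sample points are all assumptions made for the example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal length."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, max_iterations=100, seed=0):
    """Group points (tuples of numbers) into k clusters; a single run only."""
    rng = random.Random(seed)
    # Start from k observations picked at random as the initial means.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Assign each observation to the nearest current mean.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no observation changed cluster, so the means are stable
        assignments = new_assignments
        # Recompute each mean from the observations now assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old mean if a cluster ends up empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignments

# Hypothetical two-attribute observations, for illustration only.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

The result depends on the random starting points, which is one reason tools offer repeated runs: each run starts from a different sample, and more runs or more clusters mean more distance calculations.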
K-Means Clustering
For every business problem going forward, you will work
through the CRISP-DM method to complete your work.
K-Means Clustering: CRISP-DM
Context:
o You work for a major health insurance provider and have been
asked to create a weight and cholesterol management program
for policy holders to reduce policy payouts due to heart disease.
You have a limited budget to communicate with potential policy
holders who would benefit from such a program. Your message
must be targeted to those who are most at risk for heart disease
due to weight issues and high cholesterol.
K-Means Clustering
Business Understanding:
o You will need to search through thousands of
policy holders to find groups of people with similar
characteristics and develop programs and
communications that will be relevant to these
groups.
K-Means Clustering
Data Understanding:
o Instead of searching thousands of policy holders, you have access to a clean sample of roughly
550. Each row is a policy holder, described by 3 attributes (weight, cholesterol, and gender).
Gender is coded 1 for male and 0 for female.
K-Means Clustering
Data Preparation:
o None of the values appear to be inconsistent:
there are no missing values, and the standard deviations are reasonable.
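The checks named above (missing values and standard deviations) could be sketched as follows. The rows are hypothetical stand-ins rather than the course data set, and the helper name `summarize` is an assumption made for this example.

```python
from statistics import mean, stdev

# Hypothetical policy holder rows; not the course data set.
rows = [
    {"weight": 120, "cholesterol": 130, "gender": 0},
    {"weight": 150, "cholesterol": 165, "gender": 1},
    {"weight": 180, "cholesterol": 210, "gender": 1},
    {"weight": 210, "cholesterol": 235, "gender": 0},
]

def summarize(rows, attribute):
    """Count missing values and report the mean and sample standard deviation."""
    values = [row.get(attribute) for row in rows]
    present = [v for v in values if v is not None]
    return {
        "missing": len(values) - len(present),
        "mean": mean(present),
        "std_dev": stdev(present),
    }

for attribute in ("weight", "cholesterol", "gender"):
    print(attribute, summarize(rows, attribute))
```

A standard deviation that is very large relative to the mean would be a hint to look for outliers before clustering.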
K-Means Clustering
Data Modeling:
o We will use k-means clustering to determine the natural groups.
We will not be predicting who will have heart disease, as k-means
is not predictive.
o We want more than 2 clusters (high and low risk of heart
disease), as there are likely several different types of groups.
o For this exercise, we will use 4 potential groups (k = 4).
K-Means Clustering
Once we run the cluster process, below is the output.
Observations:
• Clusters are fairly balanced.
• We will keep these groups for evaluation.
K-Means Clustering
Evaluation:
The Centroid Table shows the means for each attribute in the four (k) clusters
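As a sketch of what a centroid table contains, the following computes the mean of each attribute within each cluster. The labeled rows and the helper name `centroid_table` are hypothetical and are used only for illustration; they are not the values from the course output.

```python
from collections import defaultdict

# Hypothetical (cluster label, weight, cholesterol, gender) rows.
labeled_rows = [
    ("cluster_0", 120, 130, 0),
    ("cluster_0", 110, 120, 0),
    ("cluster_3", 180, 215, 1),
    ("cluster_3", 190, 225, 1),
]

def centroid_table(rows):
    """Return {cluster label: [mean of each attribute]}."""
    grouped = defaultdict(list)
    for label, *values in rows:
        grouped[label].append(values)
    return {
        label: [sum(column) / len(column) for column in zip(*members)]
        for label, members in grouped.items()
    }

table = centroid_table(labeled_rows)
for label, means in table.items():
    print(label, means)
```

With gender coded 1 for male and 0 for female, a gender mean close to 1 indicates a cluster made up mostly of men.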
K-Means Clustering
To see which policy holders belong to a cluster, go to “Folder View”
and click on “cluster_3”.
Each observation listed there is one policy holder; Observation #6, for example, refers to a single policy holder’s information.
K-Means Clustering
Deployment:
To deploy the information from the analysis, we will go back to the design tab…
Add a filter process to choose only “cluster_3” using the attribute_value_filter
K-Means Clustering
Deployment:
The results from filtering on cluster_3…
K-Means Clustering
Deployment:
You can now go back to your company’s database and issue a SQL
query to pull all records…
SELECT First_Name, Last_Name, Policy_Num, Address, Phone_Num
FROM PolicyHolders_view
WHERE Weight >= 167
AND Cholesterol >= 204
AND Gender = 1;
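To try the query without access to a company database, one option is an in-memory SQLite table that stands in for PolicyHolders_view. This is only a sketch: the table definition, the column types, and every row below are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A plain table standing in for the view; column types are assumptions.
conn.execute("""
    CREATE TABLE PolicyHolders_view (
        First_Name TEXT, Last_Name TEXT, Policy_Num TEXT,
        Address TEXT, Phone_Num TEXT,
        Weight REAL, Cholesterol REAL, Gender INTEGER
    )
""")

# Hypothetical policy holders, for illustration only.
conn.executemany(
    "INSERT INTO PolicyHolders_view VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("Al", "Example", "P-001", "1 Main St", "555-0101", 185, 220, 1),
        ("Bea", "Example", "P-002", "2 Main St", "555-0102", 180, 215, 0),
        ("Cal", "Example", "P-003", "3 Main St", "555-0103", 150, 160, 1),
    ],
)

# The query from the slide, unchanged.
query = """
SELECT First_Name, Last_Name, Policy_Num, Address, Phone_Num
FROM PolicyHolders_view
WHERE Weight >= 167
AND Cholesterol >= 204
AND Gender = 1;
"""
matches = conn.execute(query).fetchall()
print(matches)
conn.close()
```

Only rows that meet all three conditions are returned, so in this made-up sample the second row is excluded by gender and the third by weight and cholesterol.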
K-Means Clustering
Deployment:
By targeting the group at highest risk of heart disease, you can reduce
payouts and thus increase profits for your company.
Note: Your next targeted communication would go to “cluster_2”. This group is
women with a high risk of heart disease. The message for this group may need to be communicated
differently from the one for men.
K-Means Clustering
k-Means does not predict values. It simply
takes known indicators from the attributes in a data set and
groups observations together based on how similar those attributes are
to the group averages
It helps the user to understand where one group begins
and the other ends; in other words, where the natural
breaks occur between groups in a data set
K-Means Clustering: Summary
Summary - Learning Objectives
Effectively employ the CRISP-DM method
Develop a k-means cluster data mining model
Interpret output generated by models
Copyright Information
“This workforce solution was funded by a grant awarded by the U.S. Department of Labor’s Employment
and Training Administration. The solution was created by the grantee and does not necessarily reflect the
official position of the U.S. Department of Labor. The Department of Labor makes no guarantees,
warranties, or assurances of any kind, express or implied, with respect to such information, including any
information on linked sites and including, but not limited to, accuracy of the information or its
completeness, timeliness, usefulness, adequacy, continued availability, or ownership.”
Except where otherwise stated, this work by Wake Technical Community College Building Capacity in
Business Analytics, a Department of Labor, TAACCCT funded project, is licensed under the Creative
Commons Attribution 4.0 International License. To view a copy of this license, visit
http://creativecommons.org/licenses/by/4.0/