BAS 250 Lecture 3

BAS 250 Lesson 3: K-Means Clustering

Effectively employ the CRISP-DM method

Develop a k-means cluster data mining model

Interpret output generated by model

This Week’s Learning Objectives

Explain what k-Means clusters are, how they are found, and their benefits

Demonstrate the necessary format for data in order to create k-Means clusters

Interpret the clusters generated by a k-Means model and explain their significance, if any

K-Means Clustering

Clustering means: “Grouping of data or dividing a large data set into smaller data sets of some similarity”

The “k” in k-Means clustering stands for some number of groups, or clusters. You control this number, but k-Means is still unsupervised learning: the data has no labeled target attribute

Enables the user to find natural groups within a data set by comparing each observation’s attribute values to the group means

Means are susceptible to undue influence by extreme outliers, so watching for inconsistent data is very important with k-Means
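The effect of an outlier on a mean can be seen with a few numbers. The sketch below uses made-up cholesterol readings (not values from the course data set) and adds one hypothetical data-entry error to show how far a single inconsistent value can pull the mean.

```python
from statistics import mean

# Hypothetical cholesterol readings for one small group (illustrative values only).
readings = [180, 185, 190, 195, 200]

# The same group plus one assumed data-entry error (1900 typed instead of 190).
with_outlier = readings + [1900]

clean_mean = mean(readings)
skewed_mean = mean(with_outlier)

print("mean without outlier:", clean_mean)
print("mean with outlier:   ", skewed_mean)
```

Because k-Means builds its clusters around means, one value like this can drag a cluster center away from the observations it is supposed to represent.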

K-Means Clustering

The k-Means algorithm samples k observations as starting points and then compares the attribute values of every other observation in the data set to those samples’ means

The process is repeated in order to ‘circle in’ on the best matches: each observation is grouped with its closest mean, the means are recalculated, and the groups become clusters once the means stop changing

It sometimes takes a while to run, especially when using a large number of ‘max runs’ or seeking a large number of clusters (k)
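The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used by the course software: the function names, the squared Euclidean distance measure, the random seed, and the sample points are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(rows, k, max_iterations=100, seed=0):
    """Group rows into k clusters; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Start from k randomly sampled observations (an assumed initialization).
    centroids = rng.sample(rows, k)
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: attach each row to its closest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        # Stop once no row changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [row for row, a in zip(rows, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(column) / len(members) for column in zip(*members)
                )
    return centroids, assignments


# Made-up two-attribute points forming two obvious groups.
rows = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = k_means(rows, k=2)
print(centroids)
print(assignments)
```

The `max_iterations` parameter here is only a cap on the number of passes inside a single run, so it is not the same setting as ‘max runs’. Cluster numbers are arbitrary labels: which group ends up as cluster 0 depends on the starting sample.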

K-Means Clustering

For every business problem going forward, you will work through the CRISP-DM method to complete your work.

K-Means Clustering: CRISP-DM

Context:

o You work for a major health insurance provider and have been asked to create a weight and cholesterol management program for policy holders to reduce policy payouts due to heart disease. You have a limited budget to communicate with potential policy holders who would benefit from such a program. Your message must be targeted to those who are most at risk for heart disease due to weight issues and high cholesterol.

K-Means Clustering

Business Understanding:

o You will need to search through thousands of policy holders to find groups of people with similar characteristics and develop programs and communications that will be relevant to these groups.

K-Means Clustering

Data Understanding:

o Instead of searching thousands of policy holders, you have access to a clean sample of roughly 550. Each row is a policy holder, described by 3 attributes (weight, cholesterol, and gender). Gender is coded 1 for male and 0 for female.

K-Means Clustering

Data Preparation:

o None of the values seem to be inconsistent. There are no missing values and the standard deviations are reasonable.
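The checks behind this conclusion (missing values, ranges, standard deviations) can be sketched as a small summary function. The rows below are hypothetical and include one deliberately missing value so the check has something to report; the column order (weight, cholesterol, gender) and the function name are assumptions for the example.

```python
from statistics import mean, stdev

# Hypothetical rows: (weight, cholesterol, gender). Not the course data.
rows = [
    (185, 210, 1),
    (120, 150, 0),
    (160, None, 1),  # assumed missing cholesterol value, to show the check
    (140, 175, 0),
]
columns = ["weight", "cholesterol", "gender"]


def summarize(rows, columns):
    """Return per-attribute counts of missing values and basic statistics."""
    summary = {}
    for index, name in enumerate(columns):
        values = [row[index] for row in rows]
        present = [v for v in values if v is not None]
        summary[name] = {
            "missing": len(values) - len(present),
            "min": min(present),
            "max": max(present),
            "mean": mean(present),
            "stdev": stdev(present),
        }
    return summary


summary = summarize(rows, columns)
for name, stats in summary.items():
    print(name, stats)
```

A nonzero "missing" count, or a minimum or maximum far from the mean relative to the standard deviation, would be a reason to revisit the data before clustering.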

K-Means Clustering

Data Modeling:

o We will use k-means clustering to determine the natural groups. We will not be predicting who will have heart disease, as k-means is not predictive.

o We want more than 2 clusters (high and low risk of heart disease), as there are likely a number of different types of groups.

o For this exercise, we will use 4 potential groups.

K-Means Clustering

Once we run the cluster process, the output is shown below.

Observations:

• Clusters are fairly balanced.

• We will keep these groups for evaluation.

K-Means Clustering

Evaluation:

The Centroid Table shows the means for each attribute in the four (k) clusters
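A centroid table of this kind is simply the per-cluster mean of every attribute. The following sketch computes one from rows that already carry a cluster label; the rows, labels, and function name are made up for illustration and do not reproduce the course output.

```python
from collections import defaultdict


def centroid_table(rows, labels):
    """Map each cluster label to the mean of every attribute in that cluster."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return {
        label: tuple(sum(column) / len(members) for column in zip(*members))
        for label, members in sorted(groups.items())
    }


# Hypothetical rows: (weight, cholesterol, gender), with assumed cluster labels.
rows = [(110, 140, 0), (180, 210, 1), (120, 150, 0), (190, 220, 1)]
labels = ["cluster_0", "cluster_3", "cluster_0", "cluster_3"]

table = centroid_table(rows, labels)
for label, centroid in table.items():
    print(label, centroid)
```

Reading across one row of the table describes the "typical" member of that cluster, which is how a high-weight, high-cholesterol group can be spotted.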

K-Means Clustering

To see who these policy holders are, you can see more details by going to “Folder View” and clicking on “cluster_3”.

Observation #6 refers to a policy holder’s information.

K-Means Clustering

Deployment:

To deploy the information from the analysis, we will go back to the design tab…

Add a filter process to choose only “cluster_3” using the attribute_value_filter
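Outside the modeling tool, the same filtering step amounts to keeping only the rows whose cluster label matches. A minimal sketch, assuming each record is a dictionary with a "cluster" key (the record layout, field names, and values are hypothetical):

```python
def filter_by_cluster(records, cluster_label):
    """Keep only the records assigned to the given cluster."""
    return [record for record in records if record["cluster"] == cluster_label]


# Hypothetical clustered records, for illustration only.
records = [
    {"id": 1, "weight": 185, "cholesterol": 215, "gender": 1, "cluster": "cluster_3"},
    {"id": 2, "weight": 115, "cholesterol": 145, "gender": 0, "cluster": "cluster_0"},
    {"id": 3, "weight": 190, "cholesterol": 220, "gender": 1, "cluster": "cluster_3"},
]

high_risk = filter_by_cluster(records, "cluster_3")
print([record["id"] for record in high_risk])
```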

K-Means Clustering

Deployment:

The results from filtering on cluster_3…

K-Means Clustering

Deployment:

You can now go back to your company’s database and issue a SQL query to pull all records…

SELECT First_Name, Last_Name, Policy_Num, Address, Phone_Num
FROM PolicyHolders_view
WHERE Weight >= 167
  AND Cholesterol >= 204
  AND Gender = 1;  -- 1 = male

K-Means Clustering

Deployment:

By targeting the group at highest risk of heart disease, you can reduce payouts, thus increasing profits for your company.

Note: Your next targeted communication would be to “cluster_2”. This group is women with a high risk of heart disease. For them, the message may need to be communicated differently than for men.

K-Means Clustering

k-Means does not predict values. It simply takes known indicators from the attributes in a data set and groups observations together based on how similar their attribute values are to the group averages

It helps the user to understand where one group begins and the other ends; in other words, where the natural breaks occur between groups in a data set

K-Means Clustering: Summary

Summary - Learning Objectives

Effectively employ the CRISP-DM method

Develop a k-means cluster data mining model

Interpret output generated by model

Copyright Information

“This workforce solution was funded by a grant awarded by the U.S. Department of Labor’s Employment and Training Administration. The solution was created by the grantee and does not necessarily reflect the official position of the U.S. Department of Labor. The Department of Labor makes no guarantees, warranties, or assurances of any kind, express or implied, with respect to such information, including any information on linked sites and including, but not limited to, accuracy of the information or its completeness, timeliness, usefulness, adequacy, continued availability, or ownership.”

Except where otherwise stated, this work by Wake Technical Community College Building Capacity in Business Analytics, a Department of Labor, TAACCCT funded project, is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
