BAS 250 Lesson 3: K-Means Clustering
BAS 250 Lecture 3

Apr 14, 2017

Wake Tech BAS
Transcript
Page 1: BAS 250 Lecture 3

BAS 250 Lesson 3: K-Means Clustering

Page 2: BAS 250 Lecture 3

Effectively employ the CRISP-DM method

Develop a k-means cluster data mining model

Interpret output generated by model

This Week’s Learning Objectives

Page 3: BAS 250 Lecture 3

Explain what k-Means clusters are, how they are found, and their benefits

Demonstrate the necessary format for data in order to create k-Means clusters

Interpret the clusters generated by a k-Means model and explain their significance, if any

K-Means Clustering

Page 4: BAS 250 Lecture 3

Clustering means: “Grouping of data or dividing a large data set into smaller data sets of some similarity”

The “k” in k-Means clustering stands for the number of groups, or clusters. You choose this number yourself, but the method is still unsupervised learning: the data carry no predefined labels.

Enables the user to find natural groups within a data set by comparing each observation’s attribute values to the cluster means

Means are susceptible to undue influence by extreme outliers, so watching for inconsistent data is very important with k-Means

K-Means Clustering
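The point about outliers can be seen with a few numbers. The sketch below uses made-up cholesterol readings (not values from the course data set) and shows how a single extreme value shifts the mean.

```python
from statistics import mean

# Hypothetical cholesterol readings, for illustration only.
typical = [180, 190, 200, 210, 220]

# The same readings plus one extreme value, e.g. a data-entry error.
with_outlier = typical + [900]

mean_typical = mean(typical)
mean_with_outlier = mean(with_outlier)

print("mean without outlier:", mean_typical)
print("mean with outlier:   ", round(mean_with_outlier, 1))
```

Because k-Means places each cluster center at a mean, one inconsistent value like this can pull a whole cluster center away from the observations it is supposed to represent.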

Page 5: BAS 250 Lecture 3

The k-Means algorithm samples k observations as starting points and then compares the other observations in the data set to those samples’ attribute means

The process is repeated in order to ‘circle in’ on the best matches and then formulate groups of observations, which become clusters as the means within each group become more and more similar

It sometimes takes a while to run, especially if using a large number of ‘max runs’ or seeking a large number of clusters (k)

K-Means Clustering
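The repeated compare-and-regroup process described on this slide can be sketched in a few lines. This is a minimal illustration of the standard k-Means loop, not the implementation inside the course software; the function names, the `seed` parameter, and the sample points are all assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(rows, k, max_iterations=100, seed=0):
    """Return (centroids, assignments) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Step 1: sample k observations as the starting cluster centers.
    centroids = rng.sample(rows, k)
    assignments = None
    for _ in range(max_iterations):
        # Step 2: assign every observation to its nearest center.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        # Stop once the groups no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each center to the mean of its members.
        for c in range(k):
            members = [row for row, a in zip(rows, assignments) if a == c]
            if members:  # keep the old center if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignments

# Hypothetical two-attribute observations forming two obvious groups.
rows = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(rows, k=2)
print(labels)
```

Each pass through the loop costs one distance calculation per observation per cluster, which is why large data sets or large values of k take longer to run.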

Page 6: BAS 250 Lecture 3

K-Means Clustering

Page 7: BAS 250 Lecture 3

For every business problem going forward, you will work through the CRISP-DM method.

K-Means Clustering: CRISP-DM

Page 8: BAS 250 Lecture 3

Context:

o You work for a major health insurance provider and have been asked to create a weight and cholesterol management program for policy holders to reduce policy payouts due to heart disease. You have a limited budget to communicate with potential policy holders who would benefit from such a program. Your message must be targeted to those who are most at risk for heart disease due to weight issues and high cholesterol.

K-Means Clustering

Page 9: BAS 250 Lecture 3

Business Understanding:

o You will need to search through thousands of policy holders to find groups of people with similar characteristics and develop programs and communications that will be relevant to these groups.

K-Means Clustering

Page 10: BAS 250 Lecture 3

Data Understanding:

o Instead of searching thousands of policy holders, you have access to a clean sample of roughly 550 policy holders. There are 3 attributes. Each row is a policy holder. Gender is coded 1 for male and 0 for female.

K-Means Clustering

Page 11: BAS 250 Lecture 3

Data Preparation:

o None of the values seems to be inconsistent. There are no missing values, and the standard deviations are reasonable.

K-Means Clustering
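The checks named on this slide (missing values, ranges, standard deviations) can be run on any numeric table. The sketch below is one way to do it with the Python standard library. The rows are invented, and the attribute names weight, cholesterol, and gender are an assumption based on the columns used in the deployment query later in the lesson.

```python
from statistics import mean, stdev

# Hypothetical rows: (weight, cholesterol, gender). None marks a missing value.
rows = [
    (150, 180, 0),
    (190, 230, 1),
    (165, 205, 1),
    (140, 170, 0),
]
names = ("weight", "cholesterol", "gender")

def summarize(rows, names):
    """Count missing values and compute basic statistics per attribute."""
    summary = {}
    for index, name in enumerate(names):
        column = [row[index] for row in rows]
        present = [value for value in column if value is not None]
        summary[name] = {
            "missing": len(column) - len(present),
            "min": min(present),
            "max": max(present),
            "mean": mean(present),
            "stdev": stdev(present),
        }
    return summary

summary = summarize(rows, names)
for name, stats in summary.items():
    print(name, stats)
```

A minimum or maximum far outside the plausible range, or a standard deviation much larger than expected, is a signal to inspect the data before clustering.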

Page 12: BAS 250 Lecture 3

Data Modeling:

o We will use k-means clustering to determine the natural groups. We will not be predicting who will have heart disease, as k-means is not predictive.

o We want more than 2 clusters (high and low risk of heart disease), as there are likely a number of different types of groups.

o For this exercise, we will use 4 potential groups (k = 4).

K-Means Clustering

Page 13: BAS 250 Lecture 3

K-Means Clustering

Page 14: BAS 250 Lecture 3

Once we run the cluster process…below is the output.

Observations:

• Clusters are fairly balanced.

• We will keep these groups for evaluation.

K-Means Clustering
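Whether clusters are "fairly balanced" comes down to counting the members of each one. A minimal sketch, assuming the cluster assignments are available as a list of labels; the labels below are invented and are not the output of the course model.

```python
from collections import Counter

# Hypothetical cluster assignment for ten policy holders.
labels = [
    "cluster_0", "cluster_1", "cluster_2", "cluster_3", "cluster_0",
    "cluster_1", "cluster_2", "cluster_3", "cluster_3", "cluster_2",
]

sizes = Counter(labels)
shares = {cluster: count / len(labels) for cluster, count in sizes.items()}

for cluster in sorted(sizes):
    print(cluster, sizes[cluster], f"{shares[cluster]:.0%}")
```

If one cluster held nearly all of the observations, or only a handful, that would be a reason to revisit the choice of k or to look for outliers.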

Page 15: BAS 250 Lecture 3

Evaluation:

The Centroid Table shows the mean of each attribute in each of the four (k = 4) clusters

K-Means Clustering
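A centroid table is simply the per-cluster mean of every attribute. The sketch below rebuilds one from labeled rows. The records and the attribute order (weight, cholesterol, gender) are assumptions for illustration, not the values from the lecture's output.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical labeled rows: (cluster, weight, cholesterol, gender).
records = [
    ("cluster_3", 180, 220, 1),
    ("cluster_3", 190, 230, 1),
    ("cluster_2", 170, 210, 0),
    ("cluster_2", 160, 200, 0),
]

def centroid_table(records):
    """Return {cluster: tuple of attribute means} for labeled rows."""
    groups = defaultdict(list)
    for label, *values in records:
        groups[label].append(values)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in sorted(groups.items())
    }

table = centroid_table(records)
for label, centroid in table.items():
    print(label, centroid)
```

For a 0/1 attribute such as gender, the centroid value reads as the share of members coded 1, so a value of 1 means the cluster is entirely male.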

Page 16: BAS 250 Lecture 3

To see who these policy holders are, go to “Folder View” and click on “cluster_3” for more details.

For example, Observation #6 is one policy holder’s information.

K-Means Clustering

Page 17: BAS 250 Lecture 3

Deployment:

To deploy the information from the analysis, we will go back to the design tab…

Add a filter process to choose only “cluster_3” using the attribute_value_filter

K-Means Clustering
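Outside the modeling tool, the same filtering step is a one-line selection on the cluster label. A minimal sketch, assuming each record carries a `cluster` field; the field names and records are hypothetical.

```python
# Hypothetical clustered records, for illustration only.
records = [
    {"id": 1, "cluster": "cluster_3", "weight": 185, "cholesterol": 225},
    {"id": 2, "cluster": "cluster_0", "weight": 120, "cholesterol": 150},
    {"id": 3, "cluster": "cluster_3", "weight": 192, "cholesterol": 231},
    {"id": 4, "cluster": "cluster_2", "weight": 168, "cholesterol": 210},
]

def filter_by_cluster(records, cluster_name):
    """Keep only the records assigned to the named cluster."""
    return [record for record in records if record["cluster"] == cluster_name]

high_risk = filter_by_cluster(records, "cluster_3")
print([record["id"] for record in high_risk])
```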

Page 18: BAS 250 Lecture 3

Deployment:

The results from filtering on cluster_3…

K-Means Clustering

Page 19: BAS 250 Lecture 3

Deployment:

You can now go back to your company’s database and issue a SQL query to pull all records…

SELECT First_Name, Last_Name, Policy_Num, Address, Phone_Num
FROM PolicyHolders_view
WHERE Weight >= 167
AND Cholesterol >= 204
AND Gender = 1;

K-Means Clustering
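To try the query without access to a company database, it can be run against a small in-memory SQLite table. This is only a sketch: the table definition, column types, and the three policy holders are invented, and the thresholds are passed as parameters rather than written into the SQL text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed schema: a plain table standing in for the PolicyHolders_view view.
conn.execute(
    """CREATE TABLE PolicyHolders_view (
           First_Name TEXT, Last_Name TEXT, Policy_Num TEXT,
           Address TEXT, Phone_Num TEXT,
           Weight REAL, Cholesterol REAL, Gender INTEGER)"""
)

# Hypothetical policy holders, for illustration only.
conn.executemany(
    "INSERT INTO PolicyHolders_view VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("Alex", "Example", "P-001", "1 Sample St", "555-0100", 185, 230, 1),
        ("Blake", "Example", "P-002", "2 Sample St", "555-0101", 150, 180, 1),
        ("Casey", "Example", "P-003", "3 Sample St", "555-0102", 190, 240, 0),
    ],
)

query = """SELECT First_Name, Last_Name, Policy_Num, Address, Phone_Num
           FROM PolicyHolders_view
           WHERE Weight >= ?
             AND Cholesterol >= ?
             AND Gender = ?"""

# Thresholds taken from the slide: weight 167, cholesterol 204, gender 1 (male).
rows = conn.execute(query, (167, 204, 1)).fetchall()
for row in rows:
    print(row)

conn.close()
```

Passing the thresholds as parameters makes it easy to rerun the same query for another cluster by changing only the three values.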

Page 20: BAS 250 Lecture 3

Deployment:

By targeting the group at highest risk of heart disease, you can reduce the payouts, thus increasing profits for your company.

Note: Your next targeted communication would be “cluster_2”. This group is women with a high risk of heart disease. There, the message may need to be communicated differently than it is for men.

K-Means Clustering

Page 21: BAS 250 Lecture 3

k-Means does not necessarily predict values; it simply takes known indicators from the attributes in a data set and groups observations together based on those attributes’ similarity to group averages

It helps the user to understand where one group begins and the other ends, in other words, where the natural breaks occur between groups in a data set

K-Means Clustering: Summary

Page 22: BAS 250 Lecture 3

Effectively employ the CRISP-DM method

Develop a k-means cluster data mining model

Interpret output generated by models

Summary - Learning Objectives

Page 23: BAS 250 Lecture 3

“This workforce solution was funded by a grant awarded by the U.S. Department of Labor’s Employment and Training Administration. The solution was created by the grantee and does not necessarily reflect the official position of the U.S. Department of Labor. The Department of Labor makes no guarantees, warranties, or assurances of any kind, express or implied, with respect to such information, including any information on linked sites and including, but not limited to, accuracy of the information or its completeness, timeliness, usefulness, adequacy, continued availability, or ownership.”

Except where otherwise stated, this work by Wake Technical Community College Building Capacity in Business Analytics, a Department of Labor, TAACCCT funded project, is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

Copyright Information