Data mining and statistical learning - lecture 14
Clustering methods
Partitional clustering in which clusters are represented by their centroids (proc FASTCLUS)
Agglomerative hierarchical clustering in which the closest clusters are repeatedly merged (proc CLUSTER)
Density-based clustering in which core points and associated border points are clustered (proc MODECLUS)
Proc FASTCLUS
Select k initial centroids
Repeat the following until the clusters remain unchanged:
Form k clusters by assigning each point to its nearest centroid
Update the centroid of each cluster
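The loop above can be sketched in a few lines of Python. This is only an illustration of the basic k-means idea, not the PROC FASTCLUS implementation; the function name `kmeans`, the random choice of initial centroids, and the toy data are assumptions made for the example.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Basic k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Select k initial centroids (assumption: k distinct random data points).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Form k clusters by assigning each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Stop when the clusters remain unchanged.
        if new_labels == labels:
            break
        labels = new_labels
        # Update the centroid of each cluster to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Toy data, made up for illustration: two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
```

The result depends on the initial centroids, so in practice the procedure is often run from several starting configurations.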
Identification of water samples with incorrect total nitrogen levels
[Figure: scatter plot of total nitrogen (Kjeldahl), mg/l, against total nitrogen (persulfate), mg/l]
Identification of water samples with incorrect total nitrogen levels - 2-means clustering
[Figure: scatter plot of total nitrogen (Kjeldahl) against total nitrogen (persulfate digestion); points marked as Cluster 1 or Cluster 2]
Initialization problems?
Limitations of K-means clustering
1. Difficult to detect clusters with non-spherical shapes
2. Difficult to detect clusters of widely different sizes
3. Difficult to detect clusters of different densities
Proc MODECLUS
Use a smoother to estimate the (local) density of the given dataset
A cluster is loosely defined as a region surrounding a local maximum of the probability density function
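A minimal sketch of this idea, assuming a uniform (fixed-radius) smoother: the density at a point is estimated by the number of points within a radius, and each point is then linked uphill to its densest neighbour until a local maximum is reached. This is a simplification for illustration and should not be read as the algorithm PROC MODECLUS uses; the function name `mode_clusters` and the toy data are hypothetical.

```python
import math


def mode_clusters(points, radius):
    """Assign each point to a local density maximum (a 'mode')."""
    n = len(points)
    # Neighbours of i: all points within `radius` of point i (including i).
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= radius]
        for i in range(n)
    ]
    # Uniform-kernel density estimate: the neighbour count (unnormalised).
    density = [len(nb) for nb in neighbours]
    # Link each point to its densest neighbour; ties go to the lower index,
    # which gives a strict ordering and therefore rules out cycles.
    parent = [max(nb, key=lambda j: (density[j], -j)) for nb in neighbours]

    def find_mode(i):
        # Follow the links uphill until a point is its own densest neighbour.
        while parent[i] != i:
            i = parent[i]
        return i

    # The cluster label of a point is the index of the mode it climbs to.
    return [find_mode(i) for i in range(n)]


# Toy data, made up for illustration: two compact groups.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (11.0, 10.0), (12.0, 10.0), (11.0, 11.0)]
labels_small = mode_clusters(points, radius=0.5)    # every point is its own mode
labels_medium = mode_clusters(points, radius=1.5)   # one mode per group
labels_large = mode_clusters(points, radius=20.0)   # a single mode overall
```

In this sketch the radius plays the role of the smoothing parameter: the larger it is, the fewer local maxima survive.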
Identification of water samples with incorrect total nitrogen levels - proc MODECLUS, R = 1000
[Figure: scatter plot of Tot_N (ps), mg/l, against Tot_N (Kj), mg/l, with smoothing parameter R = 1000; legend: Cluster 1 to Cluster 5 and other clusters]
What will happen if R is increased?
Identification of water samples with incorrect total nitrogen levels - proc MODECLUS, R = 4000
[Figure: scatter plot of Tot_N (ps), mg/l, against Tot_N (Kj), mg/l, with smoothing parameter R = 4000; legend: Cluster 1, Cluster 2]
Identification of water samples with incorrect total nitrogen levels - proc MODECLUS, method 6
[Figure: scatter plot of total nitrogen (Kjeldahl) against total nitrogen (persulfate digestion); legend: Cluster 1 to Cluster 5, Clusters 6 - 18, no cluster assigned]
Why did the clustering fail?
Limitations of density-based clustering
1. Difficult to control (requires repeated runs)
2. Collapses in high dimensions
Strength of density-based clustering
Given a sufficiently large sample, nonparametric density-based clustering methods are capable of detecting clusters of unequal size and dispersion and with highly irregular shapes
Identification of water samples with incorrect total nitrogen levels - transformed data
[Figure: scatter plot of Total N (ps) - Total N (Kj) against Total N (Kj)]
Identification of water samples with incorrect total nitrogen levels - proc MODECLUS, R = 2000, transformed data
[Figure: scatter plot of Total N (ps) - Total N (Kj) against Total N (Kj); legend: Cluster 1, Cluster 2, Cluster 3-6]
Preprocessing
1. Standardization
2. Linear transformation
3. Dimension reduction
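The first two steps can be sketched as follows. The linear transformation shown is the difference used for the transformed nitrogen data (Total N (ps) - Total N (Kj) against Total N (Kj)); the function names and the measurement values are hypothetical, and dimension reduction is not shown.

```python
import statistics


def standardize(column):
    """Rescale a column to mean 0 and (sample) standard deviation 1."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)
    return [(v - mean) / sd for v in column]


def difference_transform(kj, ps):
    """Linear transformation (kj, ps) -> (kj, ps - kj)."""
    return [(a, b - a) for a, b in zip(kj, ps)]


# Hypothetical measurements in mg/l, made up for illustration only.
tot_n_kj = [1000.0, 2000.0, 3000.0, 4000.0]
tot_n_ps = [1100.0, 2100.0, 2900.0, 9000.0]

z_kj = standardize(tot_n_kj)
transformed = difference_transform(tot_n_kj, tot_n_ps)
```

After the difference transformation, samples whose two nitrogen measurements agree lie close to zero on the second coordinate, so a sample with inconsistent measurements (the last one here) stands apart.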
Postprocessing
1. Split a cluster
• Usually, the cluster with the largest SSE is split
2. Introduce a new cluster centroid
• Often the point that is farthest from any cluster center is chosen
3. Disperse a cluster
• Remove one centroid and reassign the points to other clusters
4. Merge two clusters
• Typically, the clusters with the closest centroids are chosen
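As an illustration of the split and merge rules, the sketch below computes the SSE (sum of squared distances to the centroid) of each cluster and picks the cluster to split and the pair to merge. The helper names and the toy clusters are assumptions made for the example, and the actual splitting or merging step is left out.

```python
import math
from itertools import combinations


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))


def sse(cluster):
    """Sum of squared Euclidean distances from the points to their centroid."""
    c = centroid(cluster)
    return sum(math.dist(p, c) ** 2 for p in cluster)


def split_candidate(clusters):
    """Index of the cluster with the largest SSE."""
    return max(range(len(clusters)), key=lambda i: sse(clusters[i]))


def merge_candidates(clusters):
    """Pair of cluster indices whose centroids are closest."""
    cents = [centroid(c) for c in clusters]
    return min(
        combinations(range(len(clusters)), 2),
        key=lambda ij: math.dist(cents[ij[0]], cents[ij[1]]),
    )


# Toy clusters, made up for illustration.
clusters = [
    [(0.0, 0.0), (0.0, 2.0)],     # SSE = 2
    [(1.0, 1.0), (1.0, 3.0)],     # SSE = 2
    [(10.0, 0.0), (20.0, 0.0)],   # SSE = 50
]
to_split = split_candidate(clusters)   # the widely spread third cluster
to_merge = merge_candidates(clusters)  # the two neighbouring small clusters
```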
Profiling website visitors
1. A total of 296 pages at a Microsoft website are grouped into 13 homogeneous categories
• Initial
• Support
• Entertainment
• Office
• Windows
• Othersoft
• Download
• …..
2. For each of 32711 visitors we have recorded how many times they have visited the different categories of pages
3. We would like to make a behavioural segmentation of the users (a cluster analysis) that can be used in future marketing decisions
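Before clustering, each visitor has to be represented by a vector of visit counts per category. A minimal sketch of that step, assuming the raw data are available as (visitor, category) pairs; the log format, the function name and the two example visitors are hypothetical, and only the seven categories named above are used.

```python
from collections import Counter, defaultdict

# Only the categories named on the slide; the remaining ones are not listed.
CATEGORIES = ["Initial", "Support", "Entertainment", "Office",
              "Windows", "Othersoft", "Download"]


def visit_profiles(log, categories):
    """Turn (visitor, category) pairs into one count vector per visitor."""
    counts = defaultdict(Counter)
    for visitor, category in log:
        counts[visitor][category] += 1
    # A Counter returns 0 for categories the visitor never opened.
    return {v: [c[cat] for cat in categories] for v, c in counts.items()}


# Hypothetical page views for two visitors, made up for illustration.
log = [(1, "Initial"), (1, "Support"), (1, "Support"),
       (2, "Office"), (2, "Windows"), (2, "Office"), (2, "Download")]
profiles = visit_profiles(log, CATEGORIES)
```

The resulting count vectors are the input that a clustering procedure would segment.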