8/3/2019 Clustering
Data Mining:
Clustering
Clustering
Unsupervised learning, or clustering, builds models from data without predefined classes.
The goal is to place records into groups where the records within a group are highly similar to each other and dissimilar to records in other groups.
The k-means algorithm is a simple yet effective clustering technique.
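The alternating assign/update loop of k-means can be sketched in a few lines of Python (a minimal sketch, not a production implementation; points are assumed to be 2-D tuples, and the function name is illustrative):

```python
import math

def kmeans(points, centers, max_iters=100):
    """Minimal k-means: alternate assignment and center-update steps."""
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster of its closest center.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster
        # (keep the old center if a cluster happens to be empty).
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    return centers, clusters
```

The loop terminates when no center moves between iterations, which is exactly the stopping condition used in the worked example later in these slides.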
Clustering Example
K-means example, step 1
(Figure: points in the X-Y plane with three cluster centers k1, k2, k3.)
Pick 3 initial cluster centers (randomly).
K-means example, step 2
(Figure: points colored by their nearest center k1, k2, or k3.)
Assign each point to the closest cluster center.
K-means example, step 3
(Figure: old and new positions of centers k1, k2, k3 in the X-Y plane.)
Move each cluster center to the mean of its cluster.
K-means example, step 4
(Figure: points and the moved centers k1, k2, k3.)
Reassign points that are now closest to a different cluster center.
Q: Which points are reassigned?
K-means example, step 4
(Figure: the reassigned points highlighted near centers k1, k2, k3.)
A: three points are reassigned.
K-means example, step 4b
(Figure: clusters around centers k1, k2, k3 after reassignment.)
Re-compute the cluster means.
K-means example, step 5
(Figure: final positions of centers k1, k2, k3 in the X-Y plane.)
Move the cluster centers to the cluster means.
Finding the Distance between Two Points

Point  X  Y
a      2  7
b      4  5
c      6  3
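The Euclidean distance between two such points can be computed with a small helper (a minimal sketch; in Python 3.8+ the standard library's `math.dist` does the same thing):

```python
import math

def distance(p, q):
    """Euclidean distance between 2-D points p = (x1, y1) and q = (x2, y2)."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

# The three points from the table above.
a, b, c = (2, 7), (4, 5), (6, 3)
print(distance(a, b))  # sqrt(8)  ≈ 2.83
print(distance(a, c))  # sqrt(32) ≈ 5.66
print(distance(b, c))  # sqrt(8)  ≈ 2.83
```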
Example Data for Clustering

RID  Age  Years of service
1    30   5
2    50   25
3    50   15
4    25   5
5    30   10
6    55   25

Distance(r_j, r_k) = sqrt((Age_j - Age_k)^2 + (Service_j - Service_k)^2)
Steps of the k-means algorithm: K = 2, meaning that the number of clusters is 2: clusters C1 and C2. Let Rid = 3 and Rid = 6 be the initial centers for clusters C1 and C2, respectively.

Distance(r1, r3) = sqrt((30 - 50)^2 + (5 - 15)^2) = sqrt(400 + 100) = sqrt(500) ≈ 22.4
Distance(r1, r6) = sqrt((30 - 55)^2 + (5 - 25)^2) = sqrt(1025) ≈ 32.0

So r1 is placed in cluster C1, since it is closer to the center of C1.
Similarly:
Distance(r2, r3) = 10 and Distance(r2, r6) = 5, therefore r2 is added to C2.
Distance(r4, r3) = sqrt(725) ≈ 26.9 and Distance(r4, r6) = sqrt(1300) ≈ 36.1, therefore r4 is added to C1.
Distance(r5, r3) = sqrt(425) ≈ 20.6 and Distance(r5, r6) = sqrt(850) ≈ 29.2, therefore r5 is added to C1.

Finally:
C1 = {r1, r3, r4, r5}
C2 = {r2, r6}
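This first assignment pass can be reproduced in a few lines of Python (a sketch using the data and initial centers from the slides; `math.dist` computes the same Euclidean distance as above):

```python
import math

# (RID: (Age, Years of service)) from the example table.
records = {1: (30, 5), 2: (50, 25), 3: (50, 15), 4: (25, 5), 5: (30, 10), 6: (55, 25)}
cen1, cen2 = records[3], records[6]  # initial centers: r3 and r6

C1, C2 = [], []
for rid, r in records.items():
    # Assign each record to the cluster with the nearer center.
    if math.dist(r, cen1) <= math.dist(r, cen2):
        C1.append(rid)
    else:
        C2.append(rid)

print(C1)  # [1, 3, 4, 5]
print(C2)  # [2, 6]
```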
Find the new means (centers) for the two clusters.

C_i = (mean of Age, mean of Years of service) over the records in C_i

C1 = ((30 + 50 + 25 + 30)/4, (5 + 15 + 5 + 10)/4)
C1 = (33.75, 8.75)

Similarly, C2 = ((50 + 55)/2, (25 + 25)/2) = (52.5, 25).
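The new centers can be checked directly (a quick sketch; the cluster memberships are those found on the previous slide):

```python
def mean_center(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

cluster1 = [(30, 5), (50, 15), (25, 5), (30, 10)]  # r1, r3, r4, r5
cluster2 = [(50, 25), (55, 25)]                    # r2, r6

print(mean_center(cluster1))  # (33.75, 8.75)
print(mean_center(cluster2))  # (52.5, 25.0)
```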
2nd iteration of the k-means algorithm
- Find Distance(r1, Cen1), Distance(r2, Cen1), ...
- Find Distance(r1, Cen2), Distance(r2, Cen2), ...

Example: Distance(r1, Cen1) = sqrt((30 - 33.75)^2 + (5 - 8.75)^2) ≈ 5.3

We obtain: C1 = {r1, r4, r5} and C2 = {r2, r3, r6}, i.e., r3 moves from C1 to C2.
3rd iteration of the k-means algorithm
Find the new centers, then recompute the distances. You will find that the clusters do not change:
C1 still has {r1, r4, r5} and C2 still has {r2, r3, r6}.
Since no record changes cluster, this iteration ends the k-means clustering algorithm.
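Running the full loop on the example data confirms this result (a self-contained sketch; clusters are reported by RID, and the loop stops when no center moves):

```python
import math

records = {1: (30, 5), 2: (50, 25), 3: (50, 15), 4: (25, 5), 5: (30, 10), 6: (55, 25)}
centers = [records[3], records[6]]  # initial centers: r3 and r6

while True:
    # Assignment step: each record joins the cluster with the nearer center.
    clusters = [[], []]
    for rid, r in records.items():
        i = 0 if math.dist(r, centers[0]) <= math.dist(r, centers[1]) else 1
        clusters[i].append(rid)
    # Update step: move each center to the mean of its cluster.
    new_centers = [
        (sum(records[rid][0] for rid in cl) / len(cl),
         sum(records[rid][1] for rid in cl) / len(cl))
        for cl in clusters
    ]
    if new_centers == centers:  # no center moved: converged
        break
    centers = new_centers

print(clusters)  # [[1, 4, 5], [2, 3, 6]]
```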
Conclusions
Clustering is unsupervised learning.
The k-means algorithm is a simple, effective clustering technique.
K-means is mainly suited to numeric data.