8/3/2019 Clustering
Data Mining:
Clustering
Clustering
Unsupervised learning, or clustering, builds models from data without predefined classes.
The goal is to place records into groups where the records within a group are highly similar to each other and dissimilar to records in other groups.
The k-means algorithm is a simple yet effective clustering technique.
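The alternating assign/update loop of k-means can be sketched in a few lines of Python (a minimal sketch, not a production implementation; points are assumed to be 2-D tuples, and the function name is illustrative):

```python
import math

def kmeans(points, centers, max_iters=100):
    """Minimal k-means: alternate assignment and center-update steps."""
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster of its closest center.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster
        # (keep the old center if a cluster happens to be empty).
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    return centers, clusters
```

The loop terminates when no center moves between iterations, which is exactly the stopping condition used in the worked example later in these slides.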
Clustering Example
K-means example, step 1
(Figure: points in the X-Y plane with three cluster centers k1, k2, k3.)
Pick 3 initial cluster centers (randomly).
K-means example, step 2
(Figure: points colored by their nearest center k1, k2, or k3.)
Assign each point to the closest cluster center.
K-means example, step 3
(Figure: old and new positions of centers k1, k2, k3 in the X-Y plane.)
Move each cluster center to the mean of its cluster.
K-means example, step 4
(Figure: points and the moved centers k1, k2, k3.)
Reassign points that are now closest to a different cluster center.
Q: Which points are reassigned?
K-means example, step 4
(Figure: the reassigned points highlighted near centers k1, k2, k3.)
A: three points are reassigned.
K-means example, step 4b
(Figure: clusters around centers k1, k2, k3 after reassignment.)
Re-compute the cluster means.
K-means example, step 5
(Figure: final positions of centers k1, k2, k3 in the X-Y plane.)
Move the cluster centers to the cluster means.
Finding the Distance between Two Points

Point  X  Y
a      2  7
b      4  5
c      6  3
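The Euclidean distance between two such points can be computed with a small helper (a minimal sketch; in Python 3.8+ the standard library's `math.dist` does the same thing):

```python
import math

def distance(p, q):
    """Euclidean distance between 2-D points p = (x1, y1) and q = (x2, y2)."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

# The three points from the table above.
a, b, c = (2, 7), (4, 5), (6, 3)
print(distance(a, b))  # sqrt(8)  ≈ 2.83
print(distance(a, c))  # sqrt(32) ≈ 5.66
print(distance(b, c))  # sqrt(8)  ≈ 2.83
```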
Example Data for Clustering

RID  Age  Years of service
1    30   5
2    50   25
3    50   15
4    25   5
5    30   10
6    55   25

Distance(r_j, r_k) = sqrt((Age_j - Age_k)^2 + (Service_j - Service_k)^2)
Steps of the k-means algorithm: K = 2, meaning that the number of clusters is 2: clusters C1 and C2. Let Rid = 3 and Rid = 6 be the initial centers for clusters C1 and C2, respectively.

Distance(r1, r3) = sqrt((30 - 50)^2 + (5 - 15)^2) = sqrt(400 + 100) = sqrt(500) ≈ 22.4
Distance(r1, r6) = sqrt((30 - 55)^2 + (5 - 25)^2) = sqrt(1025) ≈ 32.0

So r1 is placed in cluster C1, since it is closer to the center of C1.
Similarly:
Distance(r2, r3) = 10 and Distance(r2, r6) = 5, therefore r2 is added to C2.
Distance(r4, r3) = sqrt(725) ≈ 26.9 and Distance(r4, r6) = sqrt(1300) ≈ 36.1, therefore r4 is added to C1.
Distance(r5, r3) = sqrt(425) ≈ 20.6 and Distance(r5, r6) = sqrt(850) ≈ 29.2, therefore r5 is added to C1.

Finally:
C1 = {r1, r3, r4, r5}
C2 = {r2, r6}
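This first assignment pass can be reproduced in a few lines of Python (a sketch using the data and initial centers from the slides; `math.dist` computes the same Euclidean distance as above):

```python
import math

# (RID: (Age, Years of service)) from the example table.
records = {1: (30, 5), 2: (50, 25), 3: (50, 15), 4: (25, 5), 5: (30, 10), 6: (55, 25)}
cen1, cen2 = records[3], records[6]  # initial centers: r3 and r6

C1, C2 = [], []
for rid, r in records.items():
    # Assign each record to the cluster with the nearer center.
    if math.dist(r, cen1) <= math.dist(r, cen2):
        C1.append(rid)
    else:
        C2.append(rid)

print(C1)  # [1, 3, 4, 5]
print(C2)  # [2, 6]
```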
Find the new means (centers) for the two clusters.

C_i = (mean of Age, mean of Years of service) over the records in C_i

C1 = ((30 + 50 + 25 + 30)/4, (5 + 15 + 5 + 10)/4)
C1 = (33.75, 8.75)

Similarly, C2 = ((50 + 55)/2, (25 + 25)/2) = (52.5, 25).
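The new centers can be checked directly (a quick sketch; the cluster memberships are those found on the previous slide):

```python
def mean_center(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

cluster1 = [(30, 5), (50, 15), (25, 5), (30, 10)]  # r1, r3, r4, r5
cluster2 = [(50, 25), (55, 25)]                    # r2, r6

print(mean_center(cluster1))  # (33.75, 8.75)
print(mean_center(cluster2))  # (52.5, 25.0)
```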
2nd iteration of the k-means algorithm
- Find Distance(r1, Cen1), Distance(r2, Cen1), ...
- Find Distance(r1, Cen2), Distance(r2, Cen2), ...

Example: Distance(r1, Cen1) = sqrt((30 - 33.75)^2 + (5 - 8.75)^2) ≈ 5.3

We obtain: C1 = {r1, r4, r5} and C2 = {r2, r3, r6}, i.e., r3 moves from C1 to C2.
3rd iteration of the k-means algorithm
Find the new centers, then recompute the distances. You will find that the clusters do not change:
C1 still has {r1, r4, r5} and C2 still has {r2, r3, r6}.
Since no record changes cluster, this iteration ends the k-means clustering algorithm.
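Running the full loop on the example data confirms this result (a self-contained sketch; clusters are reported by RID, and the loop stops when no center moves):

```python
import math

records = {1: (30, 5), 2: (50, 25), 3: (50, 15), 4: (25, 5), 5: (30, 10), 6: (55, 25)}
centers = [records[3], records[6]]  # initial centers: r3 and r6

while True:
    # Assignment step: each record joins the cluster with the nearer center.
    clusters = [[], []]
    for rid, r in records.items():
        i = 0 if math.dist(r, centers[0]) <= math.dist(r, centers[1]) else 1
        clusters[i].append(rid)
    # Update step: move each center to the mean of its cluster.
    new_centers = [
        (sum(records[rid][0] for rid in cl) / len(cl),
         sum(records[rid][1] for rid in cl) / len(cl))
        for cl in clusters
    ]
    if new_centers == centers:  # no center moved: converged
        break
    centers = new_centers

print(clusters)  # [[1, 4, 5], [2, 3, 6]]
```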
Conclusions
Clustering is unsupervised learning.
The k-means algorithm is a simple, effective clustering technique.
K-means is mainly suited to numeric data.