Institute of Intelligent Power Electronics – IPE. Data Clustering Methods. Docent Xiao-Zhi Gao, Department of Electrical Engineering and Automation.



Data Clustering

Data clustering is used for data organization, data compression, and model construction

Clustering partitions a data set into groups such that the similarity within a group is larger than that among groups

Similarity needs to be defined
– A metric of the difference between two input vectors


Clusters in Data

Data need to be normalized into a hypercube beforehand
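One common way to do this normalization is min-max scaling of each feature into [0, 1], which maps the data into the unit hypercube. The sketch below assumes that choice; the function name `normalize_to_unit_hypercube` is hypothetical, and other scalings are possible.

```python
def normalize_to_unit_hypercube(data):
    """Min-max scale each feature (column) of `data` into [0, 1]."""
    columns = list(zip(*data))            # one tuple per feature
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in data:
        scaled.append([
            # A constant feature has no spread, so it is mapped to 0.0.
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return scaled


points = [[2.0, 10.0], [4.0, 30.0], [6.0, 20.0]]
print(normalize_to_unit_hypercube(points))
# [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5]]
```

Without such scaling, a feature with a large numeric range would dominate any distance-based similarity.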


Similarity?


Similarity

Similarity can be defined as the distance between two vectors in the data space

$$\mathbf{x} = (x_1, x_2, \ldots, x_n), \qquad \mathbf{y} = (y_1, y_2, \ldots, y_n)$$

There are a few choices
– Euclidean distance (real values)
– Hamming distance (binary or symbols)
– Manhattan distance (any)


Euclidean Distance

The Euclidean distance between two vectors is defined as:

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
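A minimal sketch of this definition in Python, using only the standard library; the function name `euclidean_distance` is chosen for illustration.

```python
import math


def euclidean_distance(x, y):
    """Square root of the summed squared coordinate differences."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```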


Hamming Distance

The Hamming distance is the number of positions at which the corresponding symbols of two vectors are different

For example, the Hamming distance between
– "toned" and "roses" is 3
– "1011101" and "1001001" is 2
– "2173896" and "2233796" is 3
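The three examples above can be checked with a short sketch; the function name `hamming_distance` is chosen for illustration, and equal-length inputs are assumed.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(sa != sb for sa, sb in zip(a, b))


print(hamming_distance("toned", "roses"))      # 3
print(hamming_distance("1011101", "1001001"))  # 2
print(hamming_distance("2173896", "2233796"))  # 3
```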


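Manhattan Distance

The Manhattan distance (city block distance) is the sum of the absolute coordinate differences between two vectors, i.e. the length of a shortest path connecting them along axis-parallel segments

$$d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} |x_i - y_i|$$

Taxicab geometry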
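As a small sketch, the Manhattan distance is the sum of absolute coordinate differences; the function name `manhattan_distance` is chosen for illustration.

```python
def manhattan_distance(x, y):
    """Sum of the absolute coordinate differences (city block distance)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(xi - yi) for xi, yi in zip(x, y))


# A taxi going from (0, 0) to (3, 4) on a street grid drives 3 + 4 blocks.
print(manhattan_distance([0, 0], [3, 4]))  # 7
```

For the same pair of points the Euclidean distance is 5, so the two measures generally rank neighbours differently.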


K-Means Clustering Method

The K-means clustering method partitions a collection of n vectors into c groups $G_i$, i = 1, 2, ..., c, and finds the cluster centers of these groups so as to minimize a given dissimilarity measurement


K-Means Clustering Method

The dissimilarity measurement (cost function) can be calculated using Euclidean distance in K-means clustering method
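With the squared Euclidean distance as the dissimilarity measure, the cost function is commonly written as follows. The notation is an assumption here: $\mathbf{x}_k$ denotes a data vector and $\mathbf{c}_i$ the center of group $G_i$.

$$J = \sum_{i=1}^{c} J_i = \sum_{i=1}^{c} \Bigl( \sum_{k,\ \mathbf{x}_k \in G_i} \|\mathbf{x}_k - \mathbf{c}_i\|^2 \Bigr)$$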



The binary membership matrix U is a c × n matrix defined as follows:

$\mathbf{x}_j$ belongs to group i if $\mathbf{c}_i$ is the closest center among all the centers
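In the usual formulation (assumed here, with squared Euclidean distance), the elements $u_{ij}$ of U can be written as:

$$u_{ij} = \begin{cases} 1 & \text{if } \|\mathbf{x}_j - \mathbf{c}_i\|^2 \le \|\mathbf{x}_j - \mathbf{c}_k\|^2 \text{ for each } k \ne i \\ 0 & \text{otherwise} \end{cases}$$

Each column of U then contains exactly one 1, provided ties are broken so that every vector is assigned to a single group.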



To minimize the cost function J, the optimal center of a group should be the mean of all the vectors in that group:
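Written out under the same assumed notation, with $|G_i|$ the number of vectors in group $G_i$, the mean is:

$$\mathbf{c}_i = \frac{1}{|G_i|} \sum_{k,\ \mathbf{x}_k \in G_i} \mathbf{x}_k$$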


K-Means Clustering Method

The K-means clustering method is an iterative algorithm for finding cluster centers
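The iteration can be sketched as alternating between assigning each vector to its closest center and moving each center to the mean of its group. The code below is a minimal illustration under stated assumptions, not a reference implementation: the initial centers are c randomly chosen data points, the loop stops when the memberships no longer change, and the names `k_means`, `max_iter` and `seed` are hypothetical.

```python
import random


def squared_distance(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def k_means(data, c, max_iter=100, seed=0):
    """Return (centers, assignment, cost) for c clusters."""
    rng = random.Random(seed)
    # Assumption: initial centers are c distinct, randomly chosen data points.
    centers = [list(p) for p in rng.sample(data, c)]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each vector joins the group of its closest center.
        new_assignment = [
            min(range(c), key=lambda i: squared_distance(x, centers[i]))
            for x in data
        ]
        if new_assignment == assignment:
            break  # memberships unchanged, so the centers would not move
        assignment = new_assignment
        # Update step: each center becomes the mean of its group's vectors.
        for i in range(c):
            members = [x for x, g in zip(data, assignment) if g == i]
            if members:  # an empty group keeps its previous center
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    # Cost J: summed squared distance of every vector to its own center.
    cost = sum(squared_distance(x, centers[g]) for x, g in zip(data, assignment))
    return centers, assignment, cost


data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centers, assignment, cost = k_means(data, c=2)
print(centers, assignment, cost)
```

On this small, well-separated data set the first three vectors end up in one group and the last three in the other; on harder data the result depends on the initial centers.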


K-Means Clustering Method

There is no guarantee that it converges to an optimal solution
– Optimization methods might be used to deal with the cost function J

The performance of the K-means clustering method depends on the initial cluster centers
– Front-end methods should be employed to find good initial centers


K-Means Clustering Method

The K-means clustering method might have problems with clusters of
– different densities
– non-globular shapes

The K-means clustering method is a 'hard' data clustering approach
– Data should belong to clusters to degrees
» Fuzzy k-means method
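To illustrate what "belonging to degrees" can mean, the sketch below computes membership degrees with the rule commonly used in fuzzy c-means, $u_i = 1 / \sum_k (d_i / d_k)^{2/(m-1)}$, where $d_i$ is the distance to center i. Both the rule and the fuzzifier value m = 2 are assumptions taken from the standard formulation, and the function name `fuzzy_memberships` is hypothetical.

```python
def fuzzy_memberships(x, centers, m=2.0):
    """Degrees (summing to 1) to which x belongs to each cluster center."""
    dists = [
        sum((xi - ci) ** 2 for xi, ci in zip(x, center)) ** 0.5
        for center in centers
    ]
    if 0.0 in dists:
        # x coincides with one or more centers: share full membership there.
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dk) ** exponent for dk in dists) for di in dists]


# A point at distance 1 from the first center and 3 from the second
# belongs mostly, but not entirely, to the first cluster.
print(fuzzy_memberships([1.0, 0.0], [[0.0, 0.0], [4.0, 0.0]]))
```

Here the degrees come out at roughly 0.9 and 0.1, whereas hard K-means would assign the point to the first cluster with membership 1.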


Clusters of Different Densities

c=3


Clusters of Non-globular Shapes

c=2


Butterfly Data


Mountain Clustering Method

The mountain clustering method (Yager, 1994) approximates cluster centers based on a density measure of the data

The mountain clustering method can be used either as a stand-alone algorithm or for obtaining initial cluster centers for other data clustering approaches


Mountain Clustering Method

Step 1: Form a grid in the data space; the intersections of the grid lines are considered as candidate cluster centers, denoted as a set V
– Not necessarily evenly spaced
– A fine grid is needed, but it increases the computational burden
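A sketch of Step 1 for an evenly spaced grid (the grid need not be even; this is just the simplest case). The function name `make_grid` and its parameters are hypothetical.

```python
import itertools


def make_grid(lows, highs, points_per_axis):
    """Candidate centers V: all intersections of an evenly spaced grid.

    Assumes points_per_axis >= 2.
    """
    axes = []
    for lo, hi in zip(lows, highs):
        step = (hi - lo) / (points_per_axis - 1)
        axes.append([lo + k * step for k in range(points_per_axis)])
    return [list(v) for v in itertools.product(*axes)]


V = make_grid([0.0, 0.0], [1.0, 1.0], points_per_axis=5)
print(len(V))  # 25
```

The number of candidates is `points_per_axis` raised to the data dimension, so the same resolution in three dimensions already gives 125 grid points.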


Mountain Clustering Method

Step 2: Construct mountain functions representing a data density measure. The height of the mountain function at v is:
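A commonly used form of the mountain function, assumed here, sums one Gaussian term per data point. $N$ is the number of data points $\mathbf{x}_i$ and $\sigma$ is an application-specific width constant; both symbols are part of the assumed notation.

$$m(\mathbf{v}) = \sum_{i=1}^{N} \exp\left( -\frac{\|\mathbf{v} - \mathbf{x}_i\|^2}{2\sigma^2} \right)$$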


Mountain Clustering Method

Each input vector x contributes to the height of the mountain function at v

The contribution decreases as the distance d(x, v) increases

The mountain function is a measure of data density (higher if more data points are located nearby)
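The density interpretation can be seen in a small sketch. It assumes a Gaussian-sum mountain function with width `sigma`, which is one common choice; the function name `mountain_height` is hypothetical.

```python
import math


def mountain_height(v, data, sigma=0.1):
    """Sum of Gaussian contributions of all data points at grid point v."""
    return sum(
        math.exp(-sum((vi - xi) ** 2 for vi, xi in zip(v, x)) / (2.0 * sigma ** 2))
        for x in data
    )


data = [[0.1, 0.1], [0.12, 0.1], [0.9, 0.9]]
near_data = mountain_height([0.1, 0.1], data)   # two data points close by
far_away = mountain_height([0.5, 0.5], data)    # no data points close by
print(near_data > far_away)  # True
```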


Mountain Clustering Method

Step 3: Select cluster centers and destruct mountain functions

The points with the largest mountain heights are selected as cluster centers


Mountain Clustering Method

The just-identified centers are often surrounded by input data with high density

The effects of the just-identified centers should be eliminated

The mountain functions are revised by subtracting a scaled Gaussian function
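In a common form of this revision (assumed here), the Gaussian is centered at the just-identified center $\mathbf{c}_1$ and scaled by the mountain height there, with $\beta$ a width constant:

$$m_{\text{new}}(\mathbf{v}) = m(\mathbf{v}) - m(\mathbf{c}_1) \exp\left( -\frac{\|\mathbf{v} - \mathbf{c}_1\|^2}{2\beta^2} \right)$$

After the subtraction, $m_{\text{new}}(\mathbf{c}_1) = 0$, so the same point cannot be selected again.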


Mountain Functions

The width parameter σ affects the smoothness of the mountain functions

[Figure: mountain functions for σ = 0.02, 0.1, and 0.2]


Mountain Destruction

Cluster centers are selected, and mountains are destructed sequentially
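The sequential select-and-destruct loop can be sketched as follows. This assumes Gaussian mountain functions with width `sigma` and a Gaussian destruction term with width `beta`; the function name `mountain_clustering` and its parameters are hypothetical, and the number of centers is fixed in advance for simplicity.

```python
import math


def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mountain_clustering(data, grid, n_centers, sigma=0.1, beta=0.1):
    """Pick n_centers grid points by repeated peak selection and destruction."""
    # Mountain heights at every grid point (Step 2).
    heights = [
        sum(math.exp(-sq_dist(v, x) / (2.0 * sigma ** 2)) for x in data)
        for v in grid
    ]
    centers = []
    for _ in range(n_centers):
        # Step 3a: the highest remaining peak becomes a cluster center.
        best = max(range(len(grid)), key=lambda k: heights[k])
        peak = heights[best]
        centers.append(grid[best])
        # Step 3b: subtract a Gaussian scaled by the peak height.
        heights = [
            h - peak * math.exp(-sq_dist(v, grid[best]) / (2.0 * beta ** 2))
            for v, h in zip(grid, heights)
        ]
    return centers


grid = [[i / 4, j / 4] for i in range(5) for j in range(5)]
data = [[0.0, 0.0], [0.05, 0.0], [0.0, 0.05], [1.0, 1.0], [0.95, 1.0]]
print(mountain_clustering(data, grid, n_centers=2))
# [[0.0, 0.0], [1.0, 1.0]]
```

The denser group near the origin is found first; after its mountain is removed, the peak near (1, 1) is the highest one left.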


Subtractive Clustering

The mountain clustering method is simple, but its computation time grows quickly with the dimension of the data, since the number of grid points grows exponentially

Replacing the grid points with the data points in mountain clustering gives subtractive clustering (Chiu, 1994)
– Only data points are considered as cluster center candidates


Subtractive Clustering

The density measure of data point $\mathbf{x}_i$

The density measure of each data point is revised sequentially
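A minimal sketch of subtractive clustering, assuming the commonly cited form of the density measure, $D_i = \sum_j \exp\bigl(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / (r_a/2)^2\bigr)$, and the revision $D_i \leftarrow D_i - D_{c} \exp\bigl(-\|\mathbf{x}_i - \mathbf{x}_{c}\|^2 / (r_b/2)^2\bigr)$ after data point $\mathbf{x}_c$ is selected as a center. The radii `r_a` and `r_b`, the default `r_b = 1.5 * r_a`, and the function name are assumptions for illustration.

```python
import math


def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def subtractive_clustering(data, n_centers, r_a=0.5, r_b=None):
    """Select n_centers data points as cluster centers by density."""
    if r_b is None:
        r_b = 1.5 * r_a  # assumed default; larger than r_a to avoid close centers
    # Density measure of every data point.
    density = [
        sum(math.exp(-sq_dist(x, y) / (r_a / 2.0) ** 2) for y in data)
        for x in data
    ]
    centers = []
    for _ in range(n_centers):
        # The data point with the highest density becomes the next center.
        best = max(range(len(data)), key=lambda k: density[k])
        peak = density[best]
        centers.append(data[best])
        # Revise all densities to remove the influence of the new center.
        density = [
            d - peak * math.exp(-sq_dist(x, data[best]) / (r_b / 2.0) ** 2)
            for x, d in zip(data, density)
        ]
    return centers


data = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.05, 0.05],
        [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
print(subtractive_clustering(data, n_centers=2))
# [[0.05, 0.05], [1.0, 1.0]]
```

The cost depends on the number of data points rather than on a grid, so it does not grow exponentially with the data dimension.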


Conclusions

Three typical off-line data clustering methods are introduced

They often operate in batch mode

The prototypes characterizing data sets found by the data clustering methods can be used as 'codebooks'


An Application Example


Computer Exercises I

[Figure: plot with horizontal axis from 0 to 6 and vertical axis from 0 to 8, with labels A and B]


Computer Exercises II
