Clustering methods Course code: 175314 Pasi Fränti 10.3.2014 Speech & Image Processing Unit School of Computing University of Eastern Finland Joensuu, FINLAND Part 1: Introduction

Dec 27, 2015

Clustering methods
Course code: 175314

Pasi Fränti

10.3.2014

Speech & Image Processing Unit
School of Computing

University of Eastern Finland
Joensuu, FINLAND

Part 1: Introduction

Sample data

Sources of RGB vectors

Red-Green plot of the vectors

Sample data

Employment statistics:

Application example 1: Color reconstruction

Image with compression artifacts

Image with original colors

Application example 2: Speaker modeling for voice biometrics

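[Figure: training data from the speakers Matti, Mikko and Tomi passes through feature extraction and clustering to produce one speaker model per speaker. Speech from an unknown speaker (?) passes through feature extraction and is compared against the speaker models. Best match: Matti!]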

Speaker modeling

Speech data Result of clustering

Application example 3: Image segmentation

Normalized color plots according to red and green components.

Image with 4 color clusters

[Plot axes: red, green]

Application example 4: Quantization

Quantized signal Original signal

Approximation of continuous range values (or a very large set of possible discrete values) by a small set of discrete symbols or integer values
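
As a rough illustration of the definition above, the following sketch maps values from a continuous range onto a small set of evenly spaced levels. The function name `quantize`, the choice of uniform spacing, and the sample values are assumptions made for this example; they are not taken from the slides.

```python
def quantize(values, levels, lo, hi):
    """Map each value to the nearest of `levels` evenly spaced points in [lo, hi]."""
    step = (hi - lo) / (levels - 1)
    result = []
    for v in values:
        v = min(max(v, lo), hi)           # clip to the representable range
        index = round((v - lo) / step)    # integer symbol for this value
        result.append(lo + index * step)  # reconstruction value of that symbol
    return result


# Hypothetical sample signal with values in [0, 1].
signal = [0.03, 0.26, 0.49, 0.74, 0.98]
quantized = quantize(signal, levels=5, lo=0.0, hi=1.0)
print(quantized)
```

Clustering-based quantization replaces the evenly spaced levels with cluster prototypes fitted to the data, which is what the color quantization slide illustrates.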

Color quantization of images

Color image RGB samples

Clustering

Application example 5: Clustering of spatial data

Clustered locations of users

Clustered locations of users

Clustering of photos

Timeline clustering

Clustering GPS trajectories

Mobile users, taxi routes, fleet management

Conclusions from clusters

Cluster 1: Office

Cluster 2: Home

Part I: Clustering problem

Subproblems of clustering

1. Where are the clusters? (Algorithmic problem)

2. How many clusters? (Methodological problem: which criterion?)

3. Selection of attributes (Application related problem)

4. Preprocessing the data (Practical problems: normalization, outliers)

Clustering result as partition

Illustrated by Voronoi diagram

Illustrated by Convex hulls

Cluster prototypes

Partition of data

Cluster prototypes

Partition of data

Centroids as prototypes

Partition by nearest prototype mapping

Duality of partition and centroids

Challenges in clustering

Incorrect cluster allocation

Incorrect number of clusters

[Figure labels: cluster missing; clusters missing; too many clusters]

How to solve?

Solve the clustering (algorithmic problem): Given input data (X) of N data vectors and the number of clusters (M), find the clusters. The result is given as a set of prototypes or as a partition.

Solve the number of clusters (mathematical problem): Define an appropriate cluster validity function f. Repeat the clustering algorithm for several M. Select the best result according to f.

Solve the problem efficiently (computer science problem).
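
The second step above (repeat the clustering for several M and pick the best according to f) might be sketched as follows. The slides do not say which validity function or clustering algorithm to use, so everything concrete here is an assumption: the mean silhouette coefficient stands in for f, k-means with a deterministic farthest-point initialization stands in for the clustering algorithm, and the toy data and function names are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((ak - bk) ** 2 for ak, bk in zip(a, b)))


def k_means(X, M, max_iter=100):
    # Assumption: farthest-point initialization, so the sketch is repeatable.
    C = [X[0]]
    while len(C) < M:
        C.append(max(X, key=lambda x: min(dist(x, c) for c in C)))
    P = None
    for _ in range(max_iter):
        new_P = [min(range(M), key=lambda j: dist(x, C[j])) for x in X]
        if new_P == P:                      # partition unchanged: stop
            break
        P = new_P
        for j in range(M):
            members = [x for x, p in zip(X, P) if p == j]
            if members:                     # an empty cluster keeps its old centroid
                C[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return C, P


def silhouette(X, P, M):
    """Assumed validity function f: mean silhouette coefficient (higher is better)."""
    clusters = [[x for x, p in zip(X, P) if p == j] for j in range(M)]
    total = 0.0
    for x, p in zip(X, P):
        own = clusters[p]
        if len(own) == 1:
            continue                        # singleton clusters contribute 0
        a = sum(dist(x, y) for y in own) / (len(own) - 1)
        b = min(sum(dist(x, y) for y in other) / len(other)
                for j, other in enumerate(clusters) if j != p and other)
        total += (b - a) / max(a, b)
    return total / len(X)


def select_number_of_clusters(X, candidates):
    """Cluster once per candidate M and keep the M with the best validity score."""
    scores = {M: silhouette(X, k_means(X, M)[1], M) for M in candidates}
    return max(scores, key=scores.get), scores


# Hypothetical toy data: three tight, well-separated groups.
X = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0), (5, 9), (5, 10), (6, 9)]
best_M, scores = select_number_of_clusters(X, range(2, 6))
print(best_M, scores)
```

Other validity functions can be substituted for `silhouette` without changing the selection loop; only the direction of "best" (maximum or minimum) has to match the chosen index.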

Taxonomy of clustering

[Jain, Murty, Flynn, Data clustering: A review, ACM Computing Surveys, 1999.]

• One possible classification based on cost function.

• MSE is well defined and most popular.

Definitions and data

Set of N data points: X={x1, x2, …, xN}

Set of M cluster prototypes (centroids): C={c1, c2, …, cM}

Partition of the data: P={p1, p2, …, pN}, where pi is the index of the cluster to which xi is assigned

Distance and cost function

Euclidean distance of data vectors:

$$d(x_i, x_j) = \sqrt{\sum_{k=1}^{K} \left(x_i^k - x_j^k\right)^2}$$

Mean square error:

$$MSE(C, P) = \frac{1}{N} \sum_{i=1}^{N} \left\| x_i - c_{p_i} \right\|^2$$
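
As a rough illustration of the Euclidean distance and mean square error defined on this slide, the sketch below computes both for a tiny data set. The function names, the 0-based cluster indices, and the sample vectors are assumptions made for the example.

```python
import math


def squared_distance(a, b):
    """Sum of squared attribute differences between two vectors."""
    return sum((ak - bk) ** 2 for ak, bk in zip(a, b))


def euclidean(a, b):
    """Euclidean distance d(a, b)."""
    return math.sqrt(squared_distance(a, b))


def mse(X, C, P):
    """Mean squared distance between each x_i and its assigned centroid c_{p_i}."""
    return sum(squared_distance(x, C[p]) for x, p in zip(X, P)) / len(X)


# Hypothetical data: four points, two centroids, 0-based partition labels.
X = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
C = [(0.0, 1.0), (10.0, 1.0)]
P = [0, 0, 1, 1]

print(euclidean((0.0, 0.0), (3.0, 4.0)))
print(mse(X, C, P))
```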

Dependency of data structures

Centroid condition: for a given partition (P), the optimal cluster centroids (C) for minimizing MSE are the average vectors of the clusters:

$$c_j = \frac{\sum_{p_i = j} x_i}{\sum_{p_i = j} 1}, \quad \forall j \in [1, M]$$

Optimal partition: for given centroids (C), the optimal partition is the one that maps each data vector to its nearest centroid:

$$p_i = \arg\min_{1 \le j \le M} d(x_i, c_j)^2, \quad \forall i \in [1, N]$$
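
The two conditions on this slide can be sketched as a pair of functions, one for each direction of the dependency. The function names, the 0-based indices, and the handling of an empty cluster (returned as `None`) are assumptions of this example, since the slides do not cover that case.

```python
def squared_distance(a, b):
    return sum((ak - bk) ** 2 for ak, bk in zip(a, b))


def optimal_partition(X, C):
    """For given centroids, assign each vector to its nearest centroid."""
    return [min(range(len(C)), key=lambda j: squared_distance(x, C[j])) for x in X]


def optimal_centroids(X, P, M):
    """For a given partition, each centroid is the average vector of its cluster."""
    dims = len(X[0])
    sums = [[0.0] * dims for _ in range(M)]
    counts = [0] * M
    for x, p in zip(X, P):
        counts[p] += 1
        for k in range(dims):
            sums[p][k] += x[k]
    # Assumption: an empty cluster has no average, so it is reported as None.
    return [tuple(s / counts[j] for s in sums[j]) if counts[j] else None
            for j in range(M)]


# Hypothetical data and starting centroids.
X = [(1.0, 1.0), (1.0, 3.0), (9.0, 1.0), (9.0, 3.0)]
C = [(0.0, 0.0), (10.0, 0.0)]

P = optimal_partition(X, C)
new_C = optimal_centroids(X, P, M=2)
print(P, new_C)
```

Alternating these two steps is the core of the k-means procedure.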

Complexity of clustering

• Clustering problem is NP-complete [Garey et al., 1982]

• Optimal solution by branch-and-bound in exponential time.

• Practical solutions by heuristic algorithms.

• Number of possible clusterings:

$$\begin{Bmatrix} N \\ M \end{Bmatrix} = \frac{1}{M!} \sum_{j=1}^{M} (-1)^{M-j} \binom{M}{j} j^{N}$$
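
The number of ways to split N data vectors into M non-empty clusters is the Stirling number of the second kind. A small sketch of that count, with a hypothetical function name, gives a feel for how quickly exhaustive search becomes impractical:

```python
from math import comb, factorial


def count_clusterings(N, M):
    """Number of partitions of N items into M non-empty clusters."""
    total = sum((-1) ** (M - j) * comb(M, j) * j ** N for j in range(1, M + 1))
    return total // factorial(M)   # the sum is always divisible by M!


print(count_clusterings(4, 2))    # 7
print(count_clusterings(10, 3))   # 9330
print(count_clusterings(100, 5))  # already an astronomically large integer
```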

Cluster software

• Main area: working space for data

• Input area: inputs to be processed

• Output area: obtained results

• Menu Process: selection of operation

http://cs.joensuu.fi/sipu/soft/cluster2009.exe

Procedure to simulate k-means

[Figure labels: clustering image, data set, codebook, partition]

Open data set (file *.ts), move it into Input area

Process – Random codebook, select number of clusters

REPEAT

Move obtained codebook from Output area into Input area

Process – Optimal partition, select Error function

Move codebook into Main area, partition into Input area

Process – Optimal codebook

UNTIL DESIRED CLUSTERING
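
The same loop can be sketched in code, independently of the Cluster software. This is a minimal sketch under assumptions: the random codebook is drawn from the data vectors with a fixed seed, "desired clustering" is taken to mean that the partition no longer changes, and an empty cluster keeps its previous codebook vector. None of these choices are specified in the slides, and the function names and data are hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((ak - bk) ** 2 for ak, bk in zip(a, b))


def k_means(X, M, seed=0, max_iter=100):
    rng = random.Random(seed)
    C = rng.sample(X, M)                 # random codebook: M distinct data vectors
    P = None
    for _ in range(max_iter):            # REPEAT
        # Optimal partition: nearest codebook vector for every data vector.
        new_P = [min(range(M), key=lambda j: squared_distance(x, C[j])) for x in X]
        if new_P == P:                   # UNTIL the partition stops changing (assumption)
            break
        P = new_P
        # Optimal codebook: average vector of every non-empty cluster.
        for j in range(M):
            members = [x for x, p in zip(X, P) if p == j]
            if members:
                C[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return C, P


# Hypothetical data: two well-separated groups of three points.
X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
codebook, partition = k_means(X, M=2)
print(codebook, partition)
```

Because the starting codebook is random, different seeds can lead to different final clusterings on less clearly separated data.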

XLMiner software

http://www.resample.com/xlminer/help/HClst/HClst_ex.htm

Example of data in XLMiner

Distance matrix & dendrogram

Conclusions

Clustering is a fundamental tool needed in speech and image processing.

Failing to do clustering properly may undermine the application analysis.

A good clustering tool is needed so that researchers can focus on application requirements.

Literature

1. S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 3rd edition, 2006.

2. C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

3. A.K. Jain, M.N. Murty and P.J. Flynn, Data clustering: A review, ACM Computing Surveys, 31(3): 264-323, September 1999.

4. M.R. Garey, D.S. Johnson and H.S. Witsenhausen, The complexity of the generalized Lloyd-Max problem, IEEE Transactions on Information Theory, 28(2): 255-256, March 1982.

5. F. Aurenhammer, Voronoi diagrams: a survey of a fundamental geometric data structure, ACM Computing Surveys, 23(3): 345-405, September 1991.