Kernel Methods for Clustering
Francesco Camastra
DISI, Università di Genova
Talk Outline
Preliminaries (Unsupervised Learning, Kernel Methods)
Kernel Methods for Clustering
Experimental Results
Conclusions and Future Work
The Learning Problem
The learning problem can be described as finding a general rule (description)
that explains data, given only a sample of limited size.
Learning Algorithms
Learning algorithms can be grouped into three main families:
Supervised algorithms
Reinforcement Learning algorithms
Unsupervised algorithms
Supervised Algorithms
If the data is a sample of input-output patterns, a data description is a function that
produces the output given the input.
The learning is called supervised because target values (e.g. classes, real values) are associated with the data.
Unsupervised Algorithms
If the data is only a sample of objects without associated target values, the
problem is known as unsupervised learning.
A data description can be:
a set of clusters, or a probability density function stating the probability of
observing a certain object in the future;
a manifold that contains all the data without information loss (manifold
learning).
Kernel Methods
Kernel Methods are algorithms that, by replacing the inner product with an
appropriate Mercer kernel, implicitly perform a nonlinear mapping of the
input data into a high-dimensional Feature Space.
Mercer Kernel
We call the kernel G a Mercer kernel (or positive definite kernel) if and only if it is
symmetric (i.e. $G(x, y) = G(y, x) \ \forall x, y \in X$) and

$$\sum_{j,k=1}^{n} c_j c_k G(x_j, x_k) > 0$$

for all $n \ge 2$, $x_1, \ldots, x_n \in X$ and $c_1, \ldots, c_n \in \mathbb{R}$.

Each Mercer kernel $G(x, y)$ can be represented as:

$$G(x, y) = \Phi(x) \cdot \Phi(y), \qquad \Phi : X \to F.$$

$F$ is called the Feature Space. If $\Phi$ is known, the mapping is explicit, otherwise it is implicit.
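As a quick illustration, the two Mercer conditions can be checked numerically on a Gram matrix. A minimal Python sketch, assuming a Gaussian kernel and a toy random data set (both are illustrative choices, not part of the talk):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian Mercer kernel G(x, y) = exp(-||x - y||^2 / sigma^2)."""
    return np.exp(-np.linalg.norm(x - y) ** 2 / sigma ** 2)

def gram_matrix(X, kernel):
    """Gram matrix G[j, k] = kernel(x_j, x_k) over a data set X."""
    m = len(X)
    return np.array([[kernel(X[j], X[k]) for k in range(m)] for j in range(m)])

X = np.random.rand(10, 2)               # toy data set: 10 points in R^2
G = gram_matrix(X, gaussian_kernel)
print(np.allclose(G, G.T))              # symmetry: G(x, y) = G(y, x)
print(np.linalg.eigvalsh(G).min() > 0)  # positive eigenvalues <=> sum_jk c_j c_k G(x_j, x_k) > 0
```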
Mercer Kernel Examples
Square kernel $S(x, y) = (x \cdot y)^2$. The square kernel mapping is explicit.
If we consider $x = (x_1, x_2)$, $y = (y_1, y_2)$ we have:

$$S(x, y) = (x_1 y_1 + x_2 y_2)^2 = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2$$

Therefore $\Phi$ is:

$$\Phi(x) = (x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2), \qquad \Phi : \mathbb{R}^2 \to \mathbb{R}^3$$

(in the $n$-dimensional case $x \in \mathbb{R}^n$, $\Phi : \mathbb{R}^n \to \mathbb{R}^{n(n+1)/2}$).

Gaussian kernel $G(x, y) = \exp\!\left(-\frac{\|x - y\|^2}{\sigma^2}\right)$.
The Gaussian kernel mapping is implicit.
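The explicit map of the square kernel can be verified numerically: the kernel evaluated in input space and the inner product of the mapped vectors coincide. A minimal sketch (function names are illustrative):

```python
import numpy as np

def square_kernel(x, y):
    """Square kernel S(x, y) = (x . y)^2."""
    return np.dot(x, y) ** 2

def phi(x):
    """Explicit feature map of the square kernel, R^2 -> R^3."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, y = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(square_kernel(x, y))        # 16.0, computed in input space
print(phi(x) @ phi(y))            # 16.0, computed as an inner product in feature space
```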
Distance in Feature Space
Given two vectors x and y, recall that $G(x, y) = \Phi(x) \cdot \Phi(y)$. Therefore it is always possible to compute their distance in the Feature Space:

$$\|\Phi(x) - \Phi(y)\|^2 = (\Phi(x) - \Phi(y)) \cdot (\Phi(x) - \Phi(y))$$
$$= \Phi(x) \cdot \Phi(x) - 2\, \Phi(x) \cdot \Phi(y) + \Phi(y) \cdot \Phi(y)$$
$$= G(x, x) - 2G(x, y) + G(y, y)$$
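This identity is the basis of all the algorithms below: distances in $F$ are computed from kernel evaluations alone, without ever constructing $\Phi$. A minimal sketch:

```python
import numpy as np

def feature_space_distance(x, y, kernel):
    """||Phi(x) - Phi(y)|| obtained purely from kernel evaluations."""
    return np.sqrt(kernel(x, x) - 2 * kernel(x, y) + kernel(y, y))

gaussian = lambda x, y, sigma=1.0: np.exp(-np.linalg.norm(x - y) ** 2 / sigma ** 2)
x, y = np.array([0.0, 1.0]), np.array([1.0, 1.0])
print(feature_space_distance(x, y, gaussian))
```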
Clustering Methods
Following Jain's approach (Jain et al., 1999), clustering methods can be
categorized into:
Hierarchical Clustering
Hierarchical schemes sequentially build nested clusters with a graphical
representation (dendrogram).
Partitioning Clustering
Partitioning methods directly assign all the data points, according to some
appropriate criteria (e.g. similarity and density), to different groups (clusters).
Research has focused on prototype-based clustering algorithms,
which are the most popular class of Partitioning Clustering methods.
Clustering: Some Definitions
Let D be a data set, whose cardinality is m, formed by vectors $\xi \in \mathbb{R}^n$. The
Codebook is the set $W = \{w_1, w_2, \ldots, w_K\}$ where each element (codevector)
$w_c \in \mathbb{R}^n$ and $K \ll m$.
The Voronoi set $V_c$ of the codevector $w_c$ is the set of all vectors in D for which
$w_c$ is the nearest codevector:

$$V_c = \{\xi \in D \mid c = \arg\min_j \|\xi - w_j\|\}$$

A codebook is optimal if it minimizes the quantization error J:

$$J = \frac{1}{2|D|} \sum_{c=1}^{K} \sum_{\xi \in V_c} \|\xi - w_c\|^2$$

where |D| is the cardinality of D.
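These definitions translate directly into code. A minimal sketch of the quantization error (the vectorized distance computation is an implementation choice):

```python
import numpy as np

def quantization_error(D, W):
    """J = (1 / (2|D|)) sum_c sum_{xi in V_c} ||xi - w_c||^2,
    with each vector assigned to its nearest codevector (Voronoi set)."""
    dists = np.linalg.norm(D[:, None] - W[None], axis=2)  # |D| x K distance matrix
    return np.sum(dists.min(axis=1) ** 2) / (2 * len(D))
```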
K-Means (Lloyd, 1957)
1. Initialize the codebook $W = \{w_1, w_2, \ldots, w_K\}$ with K vectors chosen from the data set D.
2. Compute for each codevector $w_i \in W$ its Voronoi set $V_i$:
$$V_i = \{\xi \in D \mid i = \arg\min_j \|\xi - w_j\|\}$$
3. Move each codevector $w_i$ to the mean of its Voronoi set:
$$w_i = \frac{1}{|V_i|} \sum_{\xi \in V_i} \xi$$
4. Go to step 2 if any codevector $w_i$ has changed, otherwise return the codebook.
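A direct rendering of the four steps above as a minimal sketch (the convergence test and the assumption that no Voronoi set becomes empty are simplifications):

```python
import numpy as np

def k_means(D, K, rng=np.random.default_rng(0)):
    """Lloyd's K-Means: alternate Voronoi assignment and mean update."""
    W = D[rng.choice(len(D), K, replace=False)].copy()    # step 1
    while True:
        # step 2: nearest-codevector (Voronoi) assignment
        labels = np.argmin(np.linalg.norm(D[:, None] - W[None], axis=2), axis=1)
        # step 3: move each codevector to the mean of its Voronoi set
        new_W = np.array([D[labels == i].mean(axis=0) for i in range(K)])
        if np.allclose(new_W, W):                         # step 4: stop when stable
            return W
        W = new_W
```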
Kernel Methods for Clustering
Methods that kernelise the metric (Yu et al., 2002), i.e. the metric is
computed by means of a Mercer Kernel in a Feature Space.
Kernel K-Means (Girolami, 2002)
Kernel methods based on support vector data description (Camastra and
Verri, 2005)
Kernelising the metric
Given $x, y \in \mathbb{R}^n$, the metric $d_G(x, y)$ in the Feature Space is:

$$d_G(x, y) = \|\Phi(x) - \Phi(y)\| = \left( G(x, x) - 2G(x, y) + G(y, y) \right)^{\frac{1}{2}}.$$

Given a data set $D = \{x_i \in \mathbb{R}^n,\ i = 1, 2, \ldots, m\}$, the goal is to get a codebook
$W = \{w_c \in \mathbb{R}^n,\ c = 1, 2, \ldots, K\}$ that minimizes the quantization error E(D):

$$E(D) = \frac{1}{2|D|} \sum_{c=1}^{K} \sum_{x_i \in V_c} \|x_i - w_c\|^2$$

where $V_c$ is the Voronoi set of the codevector $w_c$.
Hence we can compute the metric in the Feature Space, i.e. we have:

$$E_G(D) = \frac{1}{2|D|} \sum_{c=1}^{K} \sum_{x_i \in V_c} \left( G(x_i, x_i) - 2G(x_i, w_c) + G(w_c, w_c) \right)$$
Kernelising the metric (cont.)
A naive way to minimize $E_G(D)$ consists in computing $\frac{\partial E_G(D)}{\partial w_c}$ and using a
steepest gradient descent algorithm. Hence some classical clustering
algorithms can be kernelised. For instance, we consider online K-Means. Its learning rule is

$$\Delta w_c = \epsilon (\xi - w_c) = -\epsilon \frac{\partial E(D)}{\partial w_c}$$

where $\xi$ is the input vector and $w_c$ is the winner codevector for $\xi$. Hence it can be
rewritten as:

$$\Delta w_c = -\epsilon \frac{\partial E_G(D)}{\partial w_c}$$

In the case of $G(x, y) = \exp\!\left(-\frac{\|x - y\|^2}{\sigma^2}\right)$ the equation becomes:

$$\Delta w_c = \epsilon\, (\xi - w_c)\, \frac{2}{\sigma^2} \exp\!\left(-\frac{\|\xi - w_c\|^2}{\sigma^2}\right)$$
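A minimal sketch of one such online update (the step size, σ and the winner search are illustrative choices):

```python
import numpy as np

def online_kernel_kmeans_step(xi, W, epsilon=0.1, sigma=1.0):
    """One kernelised online K-Means update with a Gaussian kernel.
    The winner minimizes the feature-space distance
    G(x,x) - 2 G(x,w) + G(w,w) = 2 - 2 exp(-||x - w||^2 / sigma^2),
    i.e. it maximizes the kernel value."""
    d2 = np.linalg.norm(W - xi, axis=1) ** 2
    c = np.argmax(np.exp(-d2 / sigma ** 2))     # winner codevector
    W[c] += epsilon * (xi - W[c]) * (2 / sigma ** 2) * np.exp(-d2[c] / sigma ** 2)
    return W
```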
One-Class SVM
One-Class SVM (1SVM) (Tax and Duin, 1999) (Schölkopf et al., 2001)
searches for the hypersphere in the Feature Space $F$ with centre $a$ and minimal
radius $R$ containing most data. The problem can be expressed as:

$$\min_{R, a, \xi}\; R^2 + C \sum_i \xi_i$$

subject to $\|\Phi(x_i) - a\|^2 \le R^2 + \xi_i$ and $\xi_i \ge 0$, $i = 1, \ldots, m$,
where $x_1, \ldots, x_m$ is the data set. To solve the problem the Lagrangian $L$ is introduced:

$$L = R^2 - \sum_j \alpha_j \left( R^2 + \xi_j - \|\Phi(x_j) - a\|^2 \right) - \sum_j \beta_j \xi_j + C \sum_j \xi_j$$

where $\alpha_j \ge 0$ and $\beta_j \ge 0$ are Lagrange multipliers and $C$ is a constant.
1SVM (cont.)
Setting to zero the derivatives of $L$ w.r.t. $R$, $a$ and $\xi_j$, and substituting back, we turn
the Lagrangian into the Wolfe dual form $W$:

$$W = \sum_j \alpha_j \Phi(x_j) \cdot \Phi(x_j) - \sum_i \sum_j \alpha_i \alpha_j \Phi(x_i) \cdot \Phi(x_j)
= \sum_j \alpha_j G(x_j, x_j) - \sum_i \sum_j \alpha_i \alpha_j G(x_i, x_j)$$

with $\sum_j \alpha_j = 1$ and $0 \le \alpha_j \le C$.

$\alpha_j = 0$: $\Phi(x_j)$ is inside the sphere.
$\alpha_j = C$: $\Phi(x_j)$ is outside the sphere.
$0 < \alpha_j < C$: $\Phi(x_j)$ is on the surface of the sphere. These points are called support vectors.
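The dual is a small quadratic program. A minimal sketch that solves it with a general-purpose solver (scipy's SLSQP here; a dedicated QP or SMO solver would be the usual choice in practice):

```python
import numpy as np
from scipy.optimize import minimize

def solve_1svm_dual(K_gram, C):
    """Maximize W(alpha) = sum_j a_j G_jj - sum_ij a_i a_j G_ij
    subject to sum_j a_j = 1 and 0 <= a_j <= C (solved as a minimization)."""
    m = K_gram.shape[0]
    objective = lambda a: -(a @ np.diag(K_gram) - a @ K_gram @ a)
    res = minimize(objective, np.full(m, 1.0 / m),
                   bounds=[(0.0, C)] * m,
                   constraints={"type": "eq", "fun": lambda a: a.sum() - 1.0})
    return res.x   # the alphas; support vectors have 0 < alpha_j < C
```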
1SVM (cont.)
The center $a$ is:

$$a = \sum_j \alpha_j \Phi(x_j)$$

Therefore the center position can be unknown (when $\Phi$ is implicit).
Nevertheless, the distance $R(x)$ of a point $\Phi(x)$ from the center $a$ can
always be computed:

$$R^2(x) = G(x, x) - 2 \sum_j \alpha_j G(x_j, x) + \sum_i \sum_j \alpha_i \alpha_j G(x_i, x_j)$$

The Gaussian is a usual choice for the kernel $G(\cdot)$.
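$R^2(x)$ only needs the multipliers and kernel evaluations, so it can be computed even when the center $a$ is implicit. A minimal sketch, assuming the $\alpha_j$ come from a dual solver such as the one above:

```python
import numpy as np

def sphere_distance2(x, X_sv, alphas, kernel):
    """R^2(x) = G(x,x) - 2 sum_j a_j G(x_j, x) + sum_ij a_i a_j G(x_i, x_j),
    for an implicit center a = sum_j a_j Phi(x_j)."""
    k_x = np.array([kernel(xj, x) for xj in X_sv])
    K = np.array([[kernel(xi, xj) for xj in X_sv] for xi in X_sv])
    return kernel(x, x) - 2 * alphas @ k_x + alphas @ K @ alphas
```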
Camastra-Verri Algorithm: Definitions
Given a data set $D$, we map the data into a Feature Space $F$. We consider $K$
centers $a_i \in F$, $i = 1, \ldots, K$. We call the set $A = \{a_1, \ldots, a_K\}$ the Feature Space Codebook.
We define for each center $a_c$ its Voronoi set in Feature Space:

$$FV_c = \{x_i \in D \mid c = \arg\min_j \|\Phi(x_i) - a_j\|\}$$
Camastra-Verri Algorithm: Strategy
Our algorithm uses a K-Means-like strategy, i.e. it repeatedly moves the
centers, computing a 1SVM for each center, until no center changes.
To make the algorithm more robust with respect to outliers, the 1SVM is
computed on $FV_c(\rho)$ of each center $a_c$:

$$FV_c(\rho) = \{x_i \in FV_c \ \text{and}\ \|\Phi(x_i) - a_c\| < \rho\}$$

$FV_c(\rho)$ can be seen as the Voronoi set in the Feature Space of the center $a_c$
without outliers.
The parameter $\rho$ can be set up using model selection techniques.
The Algorithm
1. Project the data set $D$ into a Feature Space $F$, by means of a nonlinear mapping $\Phi$. Initialize the centers $a_c \in F$, $c = 1, \ldots, K$.
2. Compute $FV_c(\rho)$ for each center $a_c$.
3. Apply the 1SVM to each $FV_c(\rho)$ and assign to $a_c$ the center yielded, i.e. $a_c = \mathrm{1SVM}(FV_c(\rho))$.
4. Go to step 2 until no $a_c$ changes.
5. Return the Feature Space codebook.
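A minimal sketch of this loop. The routine one_class_svm is assumed to solve the Wolfe dual on a subset and return the support points with their multipliers; the fixed iteration budget, the single-point initialization and non-empty Voronoi sets are simplifying assumptions:

```python
import numpy as np

def r2(x, sv, alphas, kernel):
    """Squared feature-space distance from the implicit center sum_j a_j Phi(sv_j)."""
    K = np.array([[kernel(a, b) for b in sv] for a in sv])
    k_x = np.array([kernel(a, x) for a in sv])
    return kernel(x, x) - 2 * alphas @ k_x + alphas @ K @ alphas

def camastra_verri(X, n_centers, kernel, rho, one_class_svm, n_iter=20):
    rng = np.random.default_rng(0)
    # step 1: each center starts as a single mapped data point (alpha = 1)
    centers = [(X[[j]], np.ones(1))
               for j in rng.choice(len(X), n_centers, replace=False)]
    for _ in range(n_iter):                        # steps 2-4 (fixed budget here)
        # step 2: feature-space Voronoi sets via R^2(x) w.r.t. each center
        d2 = np.array([[r2(x, sv, al, kernel) for sv, al in centers] for x in X])
        labels = d2.argmin(axis=1)
        new_centers = []
        for c, (sv, al) in enumerate(centers):
            fv = X[labels == c]
            # FV_c(rho): discard points farther than rho from the center
            mask = np.array([r2(x, sv, al, kernel) < rho ** 2 for x in fv], dtype=bool)
            new_centers.append(one_class_svm(fv[mask], kernel))   # step 3
        centers = new_centers
    return centers                                 # step 5
```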
Kernel K-Means (Girolami, 2002)
1. Project the data set $D$ into a Feature Space $F$, by means of a nonlinear mapping $\Phi$. Initialize the centers $a_c \in F$, $c = 1, \ldots, K$.
2. Compute $FV_c$ for each center $a_c$.
3. Move each center $a_i$ to the mean of its Feature Voronoi set:
$$a_i = \frac{1}{|FV_i|} \sum_{\xi \in FV_i} \Phi(\xi)$$
4. Go to step 2 if any $a_c$ changes, otherwise return the Feature Space codebook.
Kernel K-Means (cont.)
It works even if we do not know $\Phi$: we are always able to compute the distance of any point $\Phi(x)$ from any centroid $a_c$. Since $a_c$ is the mean of $FV_c$, after some maths we have:

$$\|\Phi(x) - a_c\|^2 = G(x, x) - \frac{2}{|FV_c|} \sum_{x_j \in FV_c} G(x_j, x) + \frac{1}{|FV_c|^2} \sum_{x_i \in FV_c} \sum_{x_j \in FV_c} G(x_i, x_j)$$

Hence, even if we do not know $\Phi$, we are always able to compute the Feature Voronoi sets.
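Since both the assignment and the (implicit) mean update only need kernel values, Kernel K-Means can run on the Gram matrix alone. A minimal sketch, assuming no cluster becomes empty during the iterations:

```python
import numpy as np

def kernel_kmeans(K_gram, n_clusters, n_iter=100, rng=np.random.default_rng(0)):
    """Kernel K-Means on the Gram matrix, using
    ||Phi(x) - a_c||^2 = G(x,x) - 2 mean_j G(x, x_j) + mean_ij G(x_i, x_j)."""
    m = K_gram.shape[0]
    labels = rng.integers(0, n_clusters, m)         # random initial assignment
    for _ in range(n_iter):
        d2 = np.empty((m, n_clusters))
        for c in range(n_clusters):
            idx = np.flatnonzero(labels == c)
            d2[:, c] = (np.diag(K_gram)
                        - 2 * K_gram[:, idx].mean(axis=1)
                        + K_gram[np.ix_(idx, idx)].mean())
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):      # stop when assignments are stable
            return labels
        labels = new_labels
    return labels
```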
Experiments with Camastra-Verri algorithm
Synthetic data set (Delta Set)
Iris data (Fisher, 1936)
Wisconsin breast cancer database (Wolberg and Mangasarian, 1990)
Spam data
K-Means (Lloyd, 1957), Self-Organizing Map (Kohonen, 1982), Neural Gas
(Martinetz et al., 1992), the Ng-Jordan algorithm (Ng et al., 2001) and our
algorithm have been tried.
Delta Set: K-Means
[Plot: the Delta Set partitioned by K-Means; x in [0, 1], y in [-1, 1]]
Delta Set: Our Algorithm
[Plot: the Delta Set partitioned by our algorithm]
Delta Set: Our Algorithm (I iteration)
[Plot: Delta Set, codevector configuration after the first iteration]
Delta Set: Our Algorithm (II iteration)
[Plot: Delta Set, codevector configuration after the second iteration]
Delta Set: Our Algorithm (III iteration)
[Plot: Delta Set, codevector configuration after the third iteration]
Delta Set: Our Algorithm (IV iteration)
[Plot: Delta Set, codevector configuration after the fourth iteration]
Delta Set: Our Algorithm (V iteration)
[Plot: Delta Set, codevector configuration after the fifth iteration]
Delta Set: Our Algorithm (VI iteration)
[Plot: Delta Set, final codevector configuration after the sixth iteration]
Iris Data
Iris data is formed by 150 data points of three different classes. One class
(setosa) is linearly separable from the other two (versicolor, virginica), but
the other two are not linearly separable from each other.
The Iris data dimension is 4.
K-Means, Self-Organizing Map (SOM), Neural Gas, the Ng-Jordan algorithm
and our algorithm have been tried.
Experiments have been performed using three codevectors.
Iris Data
[Plot: two-dimensional projection of the Iris data]
Iris Data: K-Means
[Plot: Iris data with the K-Means codevectors]
Iris Data: Camastra-Verri algorithm
[Plot: Iris data with the codevectors found by the Camastra-Verri algorithm]
Iris data: Results
Model                  Points Classified Correctly
SOM                    121.5 ± 1.5 (81.0%)
K-Means                133.5 ± 0.5 (89.0%)
Neural Gas             137.5 ± 1.5 (91.7%)
Ng-Jordan Algorithm    126.5 ± 7.5 (84.3%)
Our Algorithm          142 ± 1 (94.7%)

Average performances of SOM, K-Means, Neural Gas, the Ng-Jordan algorithm
and our algorithm on the Iris data. The results have been obtained using twenty
different runs for each algorithm.
Wisconsin Data
Wisconsin breast cancer data is formed by 699 patterns (patients) of two
different classes. The classes are not linearly separable from each other.
The database considered in the experiments has 683 samples, since we
have removed 16 patterns with missing values.
The Wisconsin data dimension is 9. K-Means, Self-Organizing Map (SOM),
Neural Gas, the Ng-Jordan algorithm and our algorithm have been tried.
Experiments have been performed using two codevectors.
Wisconsin database: Results
Model                  Points Classified Correctly
K-Means                656.5 ± 0.5 (96.1%)
Neural Gas             656.5 ± 0.5 (96.1%)
SOM                    660.5 ± 0.5 (96.7%)
Ng-Jordan Algorithm    652 ± 2 (95.5%)
Our Algorithm          662.5 ± 0.5 (97.0%)

Average performances of SOM, K-Means, Neural Gas, the Ng-Jordan algorithm
and our algorithm on the Wisconsin breast cancer database.
The results have been obtained using twenty different runs for each algorithm.
Spam Data
Spam data is formed by 1534 patterns of two different classes (spam and
not-spam). The classes are not linearly separable from each other.
The Spam data dimension is 57.
K-Means, Self-Organizing Map (SOM), Neural Gas, the Ng-Jordan algorithm
and our algorithm have been tried.
Experiments have been performed using two codevectors.
Spam data: Results
Model                  Points Classified Correctly
K-Means                1083 ± 153 (70.6%)
Neural Gas             1050 ± 120 (68.4%)
SOM                    1210 ± 30 (78.9%)
Ng-Jordan Algorithm    929 ± 0 (60.6%)
Our Algorithm          1247 ± 3 (81.3%)

Average performances of SOM, K-Means, Neural Gas, the Ng-Jordan algorithm
and our algorithm on the Spam data.
The results have been obtained using twenty different runs for each algorithm.
Conclusions and Future Work
Our algorithm performs better than K-Means, SOM, Neural Gas and the
Ng-Jordan algorithm on a synthetic data set and three UCI benchmarks
(Iris data, Wisconsin Breast Cancer Database, Spam Database).
Future efforts will be devoted to the application of our algorithm to
computer vision problems (e.g. color image segmentation).
At present we are investigating how kernel methods can be generalized in
terms of fuzzy logic (kernel-fuzzy methods).
Finally, experimental comparisons between our algorithm and Girolami's
algorithm are in progress.
Dedication
To my mother, Antonia Nicoletta Corbascio, in the most difficult moment of her life.