Kernel Methods for Clustering
Francesco Camastra
DISI, Università di Genova
Talk Outline
Preliminaries (Unsupervised Learning, Kernel Methods)
Kernel Methods for Clustering
Experimental Results
Conclusions and Future Work
The Learning Problem
The learning problem can be described as finding a general rule (description)
that explains data, given only a sample of limited size.
Learning Algorithms
Learning algorithms can be grouped into three main families:
Supervised algorithms
Reinforcement Learning algorithms
Unsupervised algorithms
Supervised Algorithms
If the data is a sample of input-output patterns, a data description is a function that
produces the output given the input.
The learning is called supervised because target values (e.g. classes, real values) are associated with the data.
Unsupervised Algorithms
If the data is only a sample of objects without associated target values, the
problem is known as unsupervised learning.
A data description can be:
a set of clusters, or a probability density function stating the probability of
observing a certain object in the future;
a manifold that contains all the data without information loss (manifold
learning).
Kernel Methods
Kernel Methods are algorithms that, by replacing the inner product with an
appropriate Mercer kernel, implicitly perform a nonlinear mapping of the
input data into a high-dimensional Feature Space.
Mercer Kernel
We call the kernel G a Mercer kernel (or positive definite kernel) if and only if it is
symmetric (i.e. $G(x, y) = G(y, x) \ \forall x, y \in X$) and

$$\sum_{j,k=1}^{n} c_j c_k G(x_j, x_k) > 0$$

for all $n \ge 2$, $x_1, \ldots, x_n \in X$ and $c_1, \ldots, c_n \in \mathbb{R}$.

Each Mercer kernel $G(x, y)$ can be represented as:

$$G(x, y) = \Phi(x) \cdot \Phi(y), \qquad \Phi : X \to F.$$

$F$ is called the Feature Space. If $\Phi$ is known, the mapping is explicit, otherwise it is implicit.
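As a quick illustration, the two Mercer conditions can be checked numerically on a Gram matrix. A minimal Python sketch, assuming a Gaussian kernel and a toy random data set (both are illustrative choices, not part of the talk):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian Mercer kernel G(x, y) = exp(-||x - y||^2 / sigma^2)."""
    return np.exp(-np.linalg.norm(x - y) ** 2 / sigma ** 2)

def gram_matrix(X, kernel):
    """Gram matrix G[j, k] = kernel(x_j, x_k) over a data set X."""
    m = len(X)
    return np.array([[kernel(X[j], X[k]) for k in range(m)] for j in range(m)])

X = np.random.rand(10, 2)               # toy data set: 10 points in R^2
G = gram_matrix(X, gaussian_kernel)
print(np.allclose(G, G.T))              # symmetry: G(x, y) = G(y, x)
print(np.linalg.eigvalsh(G).min() > 0)  # positive eigenvalues <=> sum_jk c_j c_k G(x_j, x_k) > 0
```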
Mercer Kernel Examples
Square kernel $S(x, y) = (x \cdot y)^2$. The square kernel mapping is explicit.
If we consider $x = (x_1, x_2)$, $y = (y_1, y_2)$ we have:

$$S(x, y) = (x_1 y_1 + x_2 y_2)^2 = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2$$

Therefore $\Phi$ is:

$$\Phi(x) = (x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2), \qquad \Phi : \mathbb{R}^2 \to \mathbb{R}^3$$

(in the $n$-dimensional case $x \in \mathbb{R}^n$, $\Phi : \mathbb{R}^n \to \mathbb{R}^{n(n+1)/2}$).

Gaussian kernel $G(x, y) = \exp\!\left(-\frac{\|x - y\|^2}{\sigma^2}\right)$.
The Gaussian kernel mapping is implicit.
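The explicit map of the square kernel can be verified numerically: the kernel evaluated in input space and the inner product of the mapped vectors coincide. A minimal sketch (function names are illustrative):

```python
import numpy as np

def square_kernel(x, y):
    """Square kernel S(x, y) = (x . y)^2."""
    return np.dot(x, y) ** 2

def phi(x):
    """Explicit feature map of the square kernel, R^2 -> R^3."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, y = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(square_kernel(x, y))        # 16.0, computed in input space
print(phi(x) @ phi(y))            # 16.0, computed as an inner product in feature space
```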
Distance in Feature Space
Given two vectors x and y, recall that $G(x, y) = \Phi(x) \cdot \Phi(y)$. Therefore it is always possible to compute their distance in the Feature Space:

$$\|\Phi(x) - \Phi(y)\|^2 = (\Phi(x) - \Phi(y)) \cdot (\Phi(x) - \Phi(y))$$
$$= \Phi(x) \cdot \Phi(x) - 2\, \Phi(x) \cdot \Phi(y) + \Phi(y) \cdot \Phi(y)$$
$$= G(x, x) - 2G(x, y) + G(y, y)$$
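This identity is the basis of all the algorithms below: distances in $F$ are computed from kernel evaluations alone, without ever constructing $\Phi$. A minimal sketch:

```python
import numpy as np

def feature_space_distance(x, y, kernel):
    """||Phi(x) - Phi(y)|| obtained purely from kernel evaluations."""
    return np.sqrt(kernel(x, x) - 2 * kernel(x, y) + kernel(y, y))

gaussian = lambda x, y, sigma=1.0: np.exp(-np.linalg.norm(x - y) ** 2 / sigma ** 2)
x, y = np.array([0.0, 1.0]), np.array([1.0, 1.0])
print(feature_space_distance(x, y, gaussian))
```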
Clustering Methods
Following Jain's approach (Jain et al., 1999), clustering methods can be
categorized into:
Hierarchical Clustering
Hierarchical schemes sequentially build nested clusters with a graphical
representation (dendrogram).
Partitioning Clustering
Partitioning methods directly assign all the data points, according to some
appropriate criteria (e.g. similarity and density), to different groups (clusters).
Research has focused on prototype-based clustering algorithms,
which are the most popular class of Partitioning Clustering methods.
Clustering: Some Definitions
Let D be a data set, whose cardinality is m, formed by vectors $\xi \in \mathbb{R}^n$. The
Codebook is the set $W = \{w_1, w_2, \ldots, w_K\}$ where each element (codevector)
$w_c \in \mathbb{R}^n$ and $K \ll m$.
The Voronoi set $V_c$ of the codevector $w_c$ is the set of all vectors in D for which
$w_c$ is the nearest codevector:

$$V_c = \{\xi \in D \mid c = \arg\min_j \|\xi - w_j\|\}$$

A codebook is optimal if it minimizes the quantization error J:

$$J = \frac{1}{2|D|} \sum_{c=1}^{K} \sum_{\xi \in V_c} \|\xi - w_c\|^2$$

where |D| is the cardinality of D.
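These definitions translate directly into code. A minimal sketch of the quantization error (the vectorized distance computation is an implementation choice):

```python
import numpy as np

def quantization_error(D, W):
    """J = (1 / (2|D|)) sum_c sum_{xi in V_c} ||xi - w_c||^2,
    with each vector assigned to its nearest codevector (Voronoi set)."""
    dists = np.linalg.norm(D[:, None] - W[None], axis=2)  # |D| x K distance matrix
    return np.sum(dists.min(axis=1) ** 2) / (2 * len(D))
```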
K-Means (Lloyd, 1957)
1. Initialize the codebook $W = \{w_1, w_2, \ldots, w_K\}$ with K vectors chosen from the data set D.
2. Compute for each codevector $w_i \in W$ its Voronoi set $V_i$:
$$V_i = \{\xi \in D \mid i = \arg\min_j \|\xi - w_j\|\}$$
3. Move each codevector $w_i$ to the mean of its Voronoi set:
$$w_i = \frac{1}{|V_i|} \sum_{\xi \in V_i} \xi$$
4. Go to step 2 if any codevector $w_i$ has changed, otherwise return the codebook.
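A direct rendering of the four steps above as a minimal sketch (the convergence test and the assumption that no Voronoi set becomes empty are simplifications):

```python
import numpy as np

def k_means(D, K, rng=np.random.default_rng(0)):
    """Lloyd's K-Means: alternate Voronoi assignment and mean update."""
    W = D[rng.choice(len(D), K, replace=False)].copy()    # step 1
    while True:
        # step 2: nearest-codevector (Voronoi) assignment
        labels = np.argmin(np.linalg.norm(D[:, None] - W[None], axis=2), axis=1)
        # step 3: move each codevector to the mean of its Voronoi set
        new_W = np.array([D[labels == i].mean(axis=0) for i in range(K)])
        if np.allclose(new_W, W):                         # step 4: stop when stable
            return W
        W = new_W
```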
Kernel Methods for Clustering
Methods that kernelise the metric (Yu et al., 2002), i.e. the metric is
computed by means of a Mercer Kernel in a Feature Space.
Kernel K-Means (Girolami, 2002)
Kernel methods based on support vector data description (Camastra and
Verri, 2005)
Kernelising the metric
Given $x, y \in \mathbb{R}^n$, the metric $d_G(x, y)$ in the Feature Space is:

$$d_G(x, y) = \|\Phi(x) - \Phi(y)\| = \left( G(x, x) - 2G(x, y) + G(y, y) \right)^{\frac{1}{2}}.$$

Given a data set $D = \{x_i \in \mathbb{R}^n,\ i = 1, 2, \ldots, m\}$, the goal is to get a codebook
$W = \{w_c \in \mathbb{R}^n,\ c = 1, 2, \ldots, K\}$ that minimizes the quantization error E(D):

$$E(D) = \frac{1}{2|D|} \sum_{c=1}^{K} \sum_{x_i \in V_c} \|x_i - w_c\|^2$$

where $V_c$ is the Voronoi set of the codevector $w_c$.
Hence we can compute the metric in the Feature Space, i.e. we have:

$$E_G(D) = \frac{1}{2|D|} \sum_{c=1}^{K} \sum_{x_i \in V_c} \left( G(x_i, x_i) - 2G(x_i, w_c) + G(w_c, w_c) \right)$$
Kernelising the metric (cont.)
A naive way to minimize $E_G(D)$ consists in computing $\frac{\partial E_G(D)}{\partial w_c}$ and using a
steepest gradient descent algorithm. Hence some classical clustering
algorithms can be kernelised. For instance, we consider online K-Means. Its learning rule is

$$\Delta w_c = \epsilon (\xi - w_c) = -\epsilon \frac{\partial E(D)}{\partial w_c}$$

where $\xi$ is the input vector and $w_c$ is the winner codevector for $\xi$. Hence it can be
rewritten as:

$$\Delta w_c = -\epsilon \frac{\partial E_G(D)}{\partial w_c}$$

In the case of $G(x, y) = \exp\!\left(-\frac{\|x - y\|^2}{\sigma^2}\right)$ the equation becomes:

$$\Delta w_c = \epsilon\, (\xi - w_c)\, \frac{2}{\sigma^2} \exp\!\left(-\frac{\|\xi - w_c\|^2}{\sigma^2}\right)$$
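A minimal sketch of one such online update (the step size, σ and the winner search are illustrative choices):

```python
import numpy as np

def online_kernel_kmeans_step(xi, W, epsilon=0.1, sigma=1.0):
    """One kernelised online K-Means update with a Gaussian kernel.
    The winner minimizes the feature-space distance
    G(x,x) - 2 G(x,w) + G(w,w) = 2 - 2 exp(-||x - w||^2 / sigma^2),
    i.e. it maximizes the kernel value."""
    d2 = np.linalg.norm(W - xi, axis=1) ** 2
    c = np.argmax(np.exp(-d2 / sigma ** 2))     # winner codevector
    W[c] += epsilon * (xi - W[c]) * (2 / sigma ** 2) * np.exp(-d2[c] / sigma ** 2)
    return W
```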
One-Class SVM
One-Class SVM (1SVM) (Tax and Duin, 1999) (Schölkopf et al., 2001)
searches for the hypersphere in the Feature Space $F$ with centre $a$ and minimal
radius $R$ containing most data. The problem can be expressed as:

$$\min_{R, a, \xi}\; R^2 + C \sum_i \xi_i$$

subject to $\|\Phi(x_i) - a\|^2 \le R^2 + \xi_i$ and $\xi_i \ge 0$, $i = 1, \ldots, m$,
where $x_1, \ldots, x_m$ is the data set. To solve the problem the Lagrangian $L$ is introduced:

$$L = R^2 - \sum_j \alpha_j \left( R^2 + \xi_j - \|\Phi(x_j) - a\|^2 \right) - \sum_j \beta_j \xi_j + C \sum_j \xi_j$$

where $\alpha_j \ge 0$ and $\beta_j \ge 0$ are Lagrange multipliers and $C$ is a constant.
1SVM (cont.)
Setting to zero the derivatives of $L$ w.r.t. $R$, $a$ and $\xi_j$, and substituting back, we turn
the Lagrangian into the Wolfe dual form $W$:

$$W = \sum_j \alpha_j \Phi(x_j) \cdot \Phi(x_j) - \sum_i \sum_j \alpha_i \alpha_j \Phi(x_i) \cdot \Phi(x_j)
= \sum_j \alpha_j G(x_j, x_j) - \sum_i \sum_j \alpha_i \alpha_j G(x_i, x_j)$$

with $\sum_j \alpha_j = 1$ and $0 \le \alpha_j \le C$.

$\alpha_j = 0$: $\Phi(x_j)$ is inside the sphere.
$\alpha_j = C$: $\Phi(x_j)$ is outside the sphere.
$0 < \alpha_j < C$: $\Phi(x_j)$ is on the surface of the sphere. These points are called support vectors.
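The dual is a small quadratic program. A minimal sketch that solves it with a general-purpose solver (scipy's SLSQP here; a dedicated QP or SMO solver would be the usual choice in practice):

```python
import numpy as np
from scipy.optimize import minimize

def solve_1svm_dual(K_gram, C):
    """Maximize W(alpha) = sum_j a_j G_jj - sum_ij a_i a_j G_ij
    subject to sum_j a_j = 1 and 0 <= a_j <= C (solved as a minimization)."""
    m = K_gram.shape[0]
    objective = lambda a: -(a @ np.diag(K_gram) - a @ K_gram @ a)
    res = minimize(objective, np.full(m, 1.0 / m),
                   bounds=[(0.0, C)] * m,
                   constraints={"type": "eq", "fun": lambda a: a.sum() - 1.0})
    return res.x   # the alphas; support vectors have 0 < alpha_j < C
```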
1SVM (cont.)
The center $a$ is:

$$a = \sum_j \alpha_j \Phi(x_j)$$

Therefore the center position can be unknown (when $\Phi$ is implicit).
Nevertheless, the distance $R(x)$ of a point $\Phi(x)$ from the center $a$ can
always be computed:

$$R^2(x) = G(x, x) - 2 \sum_j \alpha_j G(x_j, x) + \sum_i \sum_j \alpha_i \alpha_j G(x_i, x_j)$$

The Gaussian is a usual choice for the kernel $G(\cdot)$.
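$R^2(x)$ only needs the multipliers and kernel evaluations, so it can be computed even when the center $a$ is implicit. A minimal sketch, assuming the $\alpha_j$ come from a dual solver such as the one above:

```python
import numpy as np

def sphere_distance2(x, X_sv, alphas, kernel):
    """R^2(x) = G(x,x) - 2 sum_j a_j G(x_j, x) + sum_ij a_i a_j G(x_i, x_j),
    for an implicit center a = sum_j a_j Phi(x_j)."""
    k_x = np.array([kernel(xj, x) for xj in X_sv])
    K = np.array([[kernel(xi, xj) for xj in X_sv] for xi in X_sv])
    return kernel(x, x) - 2 * alphas @ k_x + alphas @ K @ alphas
```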
Camastra-Verri Algorithm: Definitions
Given a data set $D$, we map the data into a Feature Space $F$. We consider $K$
centers $a_i \in F$, $i = 1, \ldots, K$. We call the set $A = \{a_1, \ldots, a_K\}$ the Feature Space Codebook.
We define for each center $a_c$ its Voronoi set in Feature Space:

$$FV_c = \{x_i \in D \mid c = \arg\min_j \|\Phi(x_i) - a_j\|\}$$
Camastra-Verri Algorithm: Strategy
Our algorithm uses a K-Means-like strategy, i.e. it repeatedly moves the
centers, computing a 1SVM for each center, until no center changes.
To make the algorithm more robust with respect to outliers, the 1SVM is
computed on $FV_c(\rho)$ of each center $a_c$:

$$FV_c(\rho) = \{x_i \in FV_c \ \text{and}\ \|\Phi(x_i) - a_c\| < \rho\}$$

$FV_c(\rho)$ can be seen as the Voronoi set in the Feature Space of the center $a_c$
without outliers.
The parameter $\rho$ can be set up using model selection techniques.
The Algorithm
1. Project the data set $D$ into a Feature Space $F$, by means of a nonlinear mapping $\Phi$. Initialize the centers $a_c \in F$, $c = 1, \ldots, K$.
2. Compute $FV_c(\rho)$ for each center $a_c$.
3. Apply the 1SVM to each $FV_c(\rho)$ and assign to $a_c$ the center yielded, i.e. $a_c = \mathrm{1SVM}(FV_c(\rho))$.
4. Go to step 2 until no $a_c$ changes.
5. Return the Feature Space codebook.
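A minimal sketch of this loop. The routine one_class_svm is assumed to solve the Wolfe dual on a subset and return the support points with their multipliers; the fixed iteration budget, the single-point initialization and non-empty Voronoi sets are simplifying assumptions:

```python
import numpy as np

def r2(x, sv, alphas, kernel):
    """Squared feature-space distance from the implicit center sum_j a_j Phi(sv_j)."""
    K = np.array([[kernel(a, b) for b in sv] for a in sv])
    k_x = np.array([kernel(a, x) for a in sv])
    return kernel(x, x) - 2 * alphas @ k_x + alphas @ K @ alphas

def camastra_verri(X, n_centers, kernel, rho, one_class_svm, n_iter=20):
    rng = np.random.default_rng(0)
    # step 1: each center starts as a single mapped data point (alpha = 1)
    centers = [(X[[j]], np.ones(1))
               for j in rng.choice(len(X), n_centers, replace=False)]
    for _ in range(n_iter):                        # steps 2-4 (fixed budget here)
        # step 2: feature-space Voronoi sets via R^2(x) w.r.t. each center
        d2 = np.array([[r2(x, sv, al, kernel) for sv, al in centers] for x in X])
        labels = d2.argmin(axis=1)
        new_centers = []
        for c, (sv, al) in enumerate(centers):
            fv = X[labels == c]
            # FV_c(rho): discard points farther than rho from the center
            mask = np.array([r2(x, sv, al, kernel) < rho ** 2 for x in fv], dtype=bool)
            new_centers.append(one_class_svm(fv[mask], kernel))   # step 3
        centers = new_centers
    return centers                                 # step 5
```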
Kernel K-Means (Girolami, 2002)
1. Project the data set $D$ into a Feature Space $F$, by means of a nonlinear mapping $\Phi$. Initialize the centers $a_c \in F$, $c = 1, \ldots, K$.
2. Compute $FV_c$ for each center $a_c$.
3. Move each center $a_i$ to the mean of its Feature Voronoi set:
$$a_i = \frac{1}{|FV_i|} \sum_{\xi \in FV_i} \Phi(\xi)$$
4. Go to step 2 if any $a_c$ changes, otherwise return the Feature Space codebook.
Kernel K-Means (cont.)
It works even if we do not know $\Phi$: we are always able to compute the distance of any point $\Phi(x)$ from any centroid $a_c$. Since $a_c$ is the mean of $FV_c$, after some maths we have:

$$\|\Phi(x) - a_c\|^2 = G(x, x) - \frac{2}{|FV_c|} \sum_{x_j \in FV_c} G(x_j, x) + \frac{1}{|FV_c|^2} \sum_{x_i \in FV_c} \sum_{x_j \in FV_c} G(x_i, x_j)$$

Hence, even if we do not know $\Phi$, we are always able to compute the Feature Voronoi sets.
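Since both the assignment and the (implicit) mean update only need kernel values, Kernel K-Means can run on the Gram matrix alone. A minimal sketch, assuming no cluster becomes empty during the iterations:

```python
import numpy as np

def kernel_kmeans(K_gram, n_clusters, n_iter=100, rng=np.random.default_rng(0)):
    """Kernel K-Means on the Gram matrix, using
    ||Phi(x) - a_c||^2 = G(x,x) - 2 mean_j G(x, x_j) + mean_ij G(x_i, x_j)."""
    m = K_gram.shape[0]
    labels = rng.integers(0, n_clusters, m)         # random initial assignment
    for _ in range(n_iter):
        d2 = np.empty((m, n_clusters))
        for c in range(n_clusters):
            idx = np.flatnonzero(labels == c)
            d2[:, c] = (np.diag(K_gram)
                        - 2 * K_gram[:, idx].mean(axis=1)
                        + K_gram[np.ix_(idx, idx)].mean())
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):      # stop when assignments are stable
            return labels
        labels = new_labels
    return labels
```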
Experiments with Camastra-Verri algorithm
Synthetic data set (Delta Set)
Iris data (Fisher, 1936)
Wisconsin breast cancer database (Wolberg and Mangasarian, 1990)
Spam data
K-Means (Lloyd, 1957), Self-Organizing Map (Kohonen, 1982), Neural Gas
(Martinetz et al., 1992), the Ng-Jordan algorithm (Ng et al., 2001) and our
algorithm have been tried.
Delta Set: K-Means
[Plot: the Delta Set partitioned by K-Means; x in [0, 1], y in [-1, 1]]
Delta Set: Our Algorithm
[Plot: the Delta Set partitioned by our algorithm]
Delta Set: Our Algorithm (I iteration)
[Plot: Delta Set, codevector configuration after the first iteration]
Delta Set: Our Algorithm (II iteration)
[Plot: Delta Set, codevector configuration after the second iteration]
Delta Set: Our Algorithm (III iteration)
[Plot: Delta Set, codevector configuration after the third iteration]
Delta Set: Our Algorithm (IV iteration)
[Plot: Delta Set, codevector configuration after the fourth iteration]
Delta Set: Our Algorithm (V iteration)
[Plot: Delta Set, codevector configuration after the fifth iteration]
Delta Set: Our Algorithm (VI iteration)
[Plot: Delta Set, final codevector configuration after the sixth iteration]
Iris Data
Iris data is formed by 150 data points of three different classes. One class
(setosa) is linearly separable from the other two (versicolor, virginica), but
the other two are not linearly separable from each other.
The Iris data dimension is 4.
K-Means, Self-Organizing Map (SOM), Neural Gas, the Ng-Jordan algorithm
and our algorithm have been tried.
Experiments have been performed using three codevectors.
Iris Data
[Plot: two-dimensional projection of the Iris data]
Iris Data: K-Means
[Plot: Iris data with the K-Means codevectors]
Iris Data: Camastra-Verri algorithm
[Plot: Iris data with the codevectors found by the Camastra-Verri algorithm]
Iris data: Results
Model                  Points Classified Correctly
SOM                    121.5 ± 1.5 (81.0%)
K-Means                133.5 ± 0.5 (89.0%)
Neural Gas             137.5 ± 1.5 (91.7%)
Ng-Jordan Algorithm    126.5 ± 7.5 (84.3%)
Our Algorithm          142 ± 1 (94.7%)

Average performances of SOM, K-Means, Neural Gas, the Ng-Jordan algorithm
and our algorithm on the Iris data. The results have been obtained using twenty
different runs for each algorithm.
Wisconsin Data
Wisconsin breast cancer data is formed by 699 patterns (patients) of two
different classes. The classes are not linearly separable from each other.
The database considered in the experiments has 683 samples, since we
have removed 16 patterns with missing values.
The Wisconsin data dimension is 9. K-Means, Self-Organizing Map (SOM),
Neural Gas, the Ng-Jordan algorithm and our algorithm have been tried.
Experiments have been performed using two codevectors.
Wisconsin database: Results
Model                  Points Classified Correctly
K-Means                656.5 ± 0.5 (96.1%)
Neural Gas             656.5 ± 0.5 (96.1%)
SOM                    660.5 ± 0.5 (96.7%)
Ng-Jordan Algorithm    652 ± 2 (95.5%)
Our Algorithm          662.5 ± 0.5 (97.0%)

Average performances of SOM, K-Means, Neural Gas, the Ng-Jordan algorithm
and our algorithm on the Wisconsin breast cancer database.
The results have been obtained using twenty different runs for each algorithm.
Spam Data
Spam data is formed by 1534 patterns of two different classes (spam and
not-spam). The classes are not linearly separable from each other.
The Spam data dimension is 57.
K-Means, Self-Organizing Map (SOM), Neural Gas, the Ng-Jordan algorithm
and our algorithm have been tried.
Experiments have been performed using two codevectors.
Spam data: Results
Model                  Points Classified Correctly
K-Means                1083 ± 153 (70.6%)
Neural Gas             1050 ± 120 (68.4%)
SOM                    1210 ± 30 (78.9%)
Ng-Jordan Algorithm    929 ± 0 (60.6%)
Our Algorithm          1247 ± 3 (81.3%)

Average performances of SOM, K-Means, Neural Gas, the Ng-Jordan algorithm
and our algorithm on the Spam data.
The results have been obtained using twenty different runs for each algorithm.
Conclusions and Future Work
Our algorithm performs better than K-Means, SOM, Neural Gas and the
Ng-Jordan algorithm on a synthetic data set and three UCI benchmarks
(Iris data, Wisconsin Breast Cancer Database, Spam Database).
Future efforts will be devoted to the application of our algorithm to
computer vision problems (e.g. color image segmentation).
At present we are investigating how kernel methods can be generalized in
terms of fuzzy logic (kernel-fuzzy methods).
Finally, experimental comparisons between our algorithm and Girolami's
algorithm are in progress.
Dedication
To my mother, Antonia Nicoletta Corbascio, in the most difficult moment of her life.