Page 1: Kernel Methods and its applications

Introduction to Kernel methods

ML Workshop, ISI Kolkata
Chiranjib Bhattacharyya
Machine Learning lab, Dept of CSA, IISc
[email protected]
http://drona.csa.iisc.ernet.in/~chiru

19th Oct, 2012

Page 2: Kernel Methods and its applications

Introduction

Kernel methods make Machine Learning more widely applicable.
Kernels are similarity measures.
Kernels can help integrate different sources of data.

Page 3: Kernel Methods and its applications

Agenda

1. Kernel Trick: SVM and Non-linear Classification

2. Definition of Kernel functions

3. Kernels and Hilbert Spaces: RKHS, Representer theorem, etc.

Page 4: Kernel Methods and its applications

PART 1: KERNEL TRICK

Page 5: Kernel Methods and its applications

Binary classification

Classifier

$$f : \mathcal{X} \to \{-1, 1\}, \qquad f(x) = \mathrm{sign}(w^\top x + b)$$

Data: $D = \{(x_i, y_i) \mid i = 1, \dots, m\}$, $x_i \in \mathcal{X}$, $y_i \in \{1, -1\}$

Goal: find $f$ from $D$.

Page 7: Kernel Methods and its applications

Review of C-SVM

Primal problem:

$$\min_{w,b}\; C \sum_{i=1}^{m} \max\big(1 - y_i(w^\top x_i + b),\, 0\big) + \frac{1}{2}\|w\|^2$$

C-SVM dual formulation:

$$\begin{aligned}
\underset{\alpha}{\text{maximize}} \quad & -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j + \sum_{i=1}^{m} \alpha_i \\
\text{subject to} \quad & 0 \le \alpha_i \le C, \quad \sum_{i} \alpha_i y_i = 0
\end{aligned}$$

At optimality $w = \sum_{i=1}^{m} \alpha_i y_i x_i$, so

$$f(x) = \mathrm{sign}\Big(\sum_{i=1}^{m} \alpha_i y_i\, x_i^\top x + b\Big)$$
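To see the optimality condition in practice, here is a minimal sketch, assuming scikit-learn (whose SVC solves exactly this dual and stores $\alpha_i y_i$ in `dual_coef_`), that recovers $w$ from the dual variables:

```python
# Minimal check of w = sum_i alpha_i y_i x_i using scikit-learn's C-SVM.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only
# (alpha_i = 0 for all other points).
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(clf.coef_, w_from_dual))  # True
```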

Page 8: Kernel Methods and its applications

C-SVM in feature spaces

Let us work with a feature map, Φ(x).

$$\begin{aligned}
\underset{\alpha}{\text{maximize}} \quad & -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, \Phi(x_i)^\top \Phi(x_j) + \sum_{i=1}^{m} \alpha_i \\
\text{subject to} \quad & 0 \le \alpha_i \le C, \quad \sum_{i} \alpha_i y_i = 0
\end{aligned}$$

and our classifier is

$$f(x) = \mathrm{sign}\Big(\sum_{i=1}^{m} \alpha_i y_i\, \Phi(x_i)^\top \Phi(x) + b\Big)$$

Let the dot product between any pair of examples, computed in the feature space, be denoted by

$$K(x, z) = \Phi(x)^\top \Phi(z)$$

Page 9: Kernel Methods and its applications

C-SVM in feature spaces

Let us work with a feature map, Φ(x).

$$\begin{aligned}
\underset{\alpha}{\text{maximize}} \quad & -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, K(x_i, x_j) + \sum_{i=1}^{m} \alpha_i \\
\text{subject to} \quad & 0 \le \alpha_i \le C, \quad \sum_{i} \alpha_i y_i = 0
\end{aligned}$$

and our classifier is

$$f(x) = \mathrm{sign}\Big(\sum_{i=1}^{m} \alpha_i y_i\, K(x_i, x) + b\Big)$$

where $K(x, z) = \Phi(x)^\top \Phi(z)$ denotes the dot product between any pair of examples computed in the feature space.
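Because the dual and the classifier touch the data only through $K$, the SVM can be trained from the Gram matrix alone. A minimal sketch, assuming scikit-learn (the Gaussian kernel here anticipates Page 14):

```python
# Training the C-SVM from kernel evaluations only, with no explicit Phi.
# kernel="precomputed" makes SVC accept Gram matrices directly.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

K_tr = rbf_kernel(X_tr, X_tr, gamma=1.0)   # K(x_i, x_j) on training pairs
clf = SVC(kernel="precomputed", C=1.0).fit(K_tr, y_tr)

K_te = rbf_kernel(X_te, X_tr, gamma=1.0)   # K(test point, training point)
print(clf.score(K_te, y_te))               # accuracy on held-out data
```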

Page 10: Kernel Methods and its applications

An example

Let $x \in \mathbb{R}^2$ and $\Phi(x) = [\,x_1^2 \;\; x_2^2 \;\; \sqrt{2}\,x_1 x_2\,]^\top$. Then

$$K(x, z) = \Phi(x)^\top \Phi(z) = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2 = \langle x, z \rangle^2$$

More generally, for $x, z \in \mathbb{R}^d$, $K(x, z) = (x^\top z)^r$ is a dot product in a feature space of dimension $\binom{d+r-1}{r}$.

If $d = 256$, $r = 4$, the feature space size is 635,376.

However, if we know $K$, one can still solve the SVM formulation without explicitly evaluating $\Phi$.
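A quick numeric sanity check of the identity above (a plain NumPy sketch):

```python
# Check Phi(x)^T Phi(z) == (x^T z)^2 for Phi(x) = [x1^2, x2^2, sqrt(2) x1 x2].
import numpy as np

def phi(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
print(np.isclose(phi(x) @ phi(z), (x @ z) ** 2))  # True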

Page 11: Kernel Methods and its applications

Kernel function

Kernel function

$K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a Kernel function if

$K(x, z) = K(z, x)$ (symmetric), and

$K$ is positive semidefinite, i.e. for all $n$ and all $x_1, \dots, x_n \in \mathcal{X}$, the matrix $K_{ij} = K(x_i, x_j)$ is psd.

Recall that a matrix $K \in \mathbb{R}^{n \times n}$ is psd if $u^\top K u \ge 0$ for all $u \in \mathbb{R}^n$.
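An empirical check of the definition, a NumPy-only sketch: the Gram matrix of the Gaussian kernel (Page 14) is symmetric with nonnegative eigenvalues.

```python
# Gram matrix of the Gaussian kernel: symmetric and psd up to round-off.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                     # 50 points in R^5
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq_dists)                      # K_ij = exp(-gamma ||xi - xj||^2)

print(np.allclose(K, K.T))                       # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)     # nonnegative spectrum
```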

Page 12: Kernel Methods and its applications

Examples of Kernel function

$K(x, z) = \Phi(x)^\top \Phi(z)$ where $\Phi : \mathcal{E} \to \mathbb{R}^d$

$K$ is symmetric, i.e. $K(x, z) = K(z, x)$.

Positive semidefinite: let $D = \{x_1, x_2, \dots, x_n\}$ be a set of $n$ arbitrarily chosen elements of $\mathcal{E}$, and define

$$K_{ij} = \Phi(x_i)^\top \Phi(x_j)$$

For any $u \in \mathbb{R}^n$ it is straightforward to see that

$$u^\top K u = \|\Phi(D) u\|_2^2 \ge 0, \qquad \Phi(D) = [\Phi(x_1), \dots, \Phi(x_n)]$$
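The identity $u^\top K u = \|\Phi(D) u\|_2^2$ can be verified directly; a NumPy sketch using the explicit feature map from Page 10:

```python
# Verify u^T K u == ||Phi(D) u||^2 for a feature-map kernel.
import numpy as np

def phi(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

rng = np.random.default_rng(0)
D = rng.normal(size=(8, 2))                      # n = 8 points in R^2
PhiD = np.stack([phi(x) for x in D], axis=1)     # columns are Phi(x_i)
K = PhiD.T @ PhiD                                # K_ij = Phi(x_i)^T Phi(x_j)

u = rng.normal(size=8)
print(np.isclose(u @ K @ u, np.linalg.norm(PhiD @ u) ** 2))  # True
```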

Page 14: Kernel Methods and its applications

Examples of Kernel functions

$K(x, z) = x^\top z$, with $\Phi(x) = x$

$K(x, z) = (x^\top z)^r$, with $\Phi_{t_1 t_2 \dots t_d}(x) = \sqrt{\dfrac{r!}{t_1!\, t_2! \cdots t_d!}}\; x_1^{t_1} x_2^{t_2} \cdots x_d^{t_d}$, indexed by multi-indices with $\sum_{i=1}^{d} t_i = r$

$K(x, z) = e^{-\gamma \|x - z\|^2}$
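The polynomial case can be checked against this feature map by enumerating the multi-indices; a small sketch using only the standard library and NumPy:

```python
# Check (x^T z)^r == Phi(x)^T Phi(z) for the degree-r monomial feature map,
# enumerating all multi-indices t with t1 + ... + td = r.
import itertools
import math
import numpy as np

def poly_features(x, r):
    d = len(x)
    feats = []
    for t in itertools.product(range(r + 1), repeat=d):
        if sum(t) != r:
            continue
        coef = math.sqrt(math.factorial(r) / math.prod(math.factorial(ti) for ti in t))
        feats.append(coef * math.prod(x[i] ** t[i] for i in range(d)))
    return np.array(feats)

rng = np.random.default_rng(0)
x, z, r = rng.normal(size=3), rng.normal(size=3), 4
print(np.isclose(poly_features(x, r) @ poly_features(z, r), (x @ z) ** r))  # True
```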

Page 15: Kernel Methods and its applications

Kernel Construction

Let $K_1$ and $K_2$ be two valid kernels, i.e. each admits a feature map $K(x, y) = \Phi(x)^\top \Phi(y)$. Then the following are also valid kernels:

$K(u, v) = K_1(u, v)\, K_2(u, v)$ (product)

$K = \alpha K_1 + \beta K_2$, $\alpha, \beta \ge 0$ (conic combination)

$\tilde{K}(x, y) = \dfrac{K(x, y)}{\sqrt{K(x, x)}\sqrt{K(y, y)}}$ (normalization of a valid kernel $K$)

These rules build up the Gaussian kernel from the linear one:

$K(x, y) = x^\top y$

$K(x, y) = (x^\top y)^i$

$K(x, y) = \lim_{N \to \infty} \sum_{i=0}^{N} \dfrac{(x^\top y)^i}{i!} = e^{x^\top y}$

$K(x, y) = e^{-\frac{1}{2}\|x - y\|^2}$ (normalize $e^{x^\top y}$)
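The closure rules can be checked empirically on Gram matrices; a NumPy sketch:

```python
# Elementwise product, conic combination, and normalization of psd
# Gram matrices stay psd (checked via the smallest eigenvalue).
import numpy as np

def min_eig(K):
    return np.linalg.eigvalsh(K).min()

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
K1 = X @ X.T                                   # linear kernel Gram matrix
sq = ((X[:, None] - X[None, :]) ** 2).sum(-1)
K2 = np.exp(-0.5 * sq)                         # Gaussian kernel Gram matrix

prod = K1 * K2                                 # elementwise (Schur) product
conic = 2.0 * K1 + 3.0 * K2                    # alpha, beta >= 0
d = np.sqrt(np.diag(K2))
norm = K2 / np.outer(d, d)                     # normalized kernel

for K in (prod, conic, norm):
    print(min_eig(K) >= -1e-8)                 # psd up to round-off
```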

Page 17: Kernel Methods and its applications

Kernel function and feature map

A theorem due to Mercer guarantees a feature map for symmetric, psd kernel functions.

Loosely stated: for a symmetric function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, there exists an expansion $K(x, z) = \Phi(x)^\top \Phi(z)$ iff

$$\int_{\mathcal{X}} \int_{\mathcal{X}} g(x)\, g(z)\, K(x, z)\, dx\, dz \ge 0 \quad \text{for all } g \in L_2(\mathcal{X})$$

Page 18: Kernel Methods and its applications

PART 2: Kernels and Hilbert spaces

Page 19: Kernel Methods and its applications

What is a Dot product (aka Inner Product)?

Let $\mathcal{X}$ be a vector space.

What is a Dot product

Symmetry: $\langle u, v \rangle = \langle v, u \rangle$ for $u, v \in \mathcal{X}$

Bilinearity: $\langle \alpha u + \beta v, w \rangle = \alpha \langle u, w \rangle + \beta \langle v, w \rangle$ for $u, v, w \in \mathcal{X}$

Positive definiteness: $\langle u, u \rangle \ge 0$ for $u \in \mathcal{X}$, and $\langle u, u \rangle = 0$ iff $u = 0$

Norm

$\|x\| = \sqrt{\langle x, x \rangle}$

$\|x\| = 0 \implies x = 0$

Page 20: Kernel Methods and its applications

Examples of Dot products

$\mathcal{X} = \mathbb{R}^n$, $\langle u, v \rangle = u^\top v$

$\mathcal{X} = \mathbb{R}^n$, $\langle u, v \rangle = \sum_{i=1}^{n} \lambda_i u_i v_i$, $\lambda_i \ge 0$

$\mathcal{X} = L_2(X) = \left\{ f : \int_X f(x)^2\, dx < \infty \right\}$, with $\langle f, g \rangle = \int_X f(x)\, g(x)\, dx$ for $f, g \in \mathcal{X}$

Page 21: Kernel Methods and its applications

Cauchy-Schwarz inequality

Cauchy-Schwarz inequality

Let $\mathcal{X}$ be an inner product space. Then

$$|\langle x, z \rangle| \le \|x\|\, \|z\| \qquad \forall\, x, z \in \mathcal{X}$$

and equality holds iff $x = \alpha z$ for some scalar $\alpha$.

Proof: for all $\alpha \in \mathbb{R}$, $\|x - \alpha z\|^2 \ge 0$, i.e.

$$\|x\|^2 - 2\alpha \langle x, z \rangle + \alpha^2 \|z\|^2 \ge 0 \quad \forall \alpha$$

Let $\alpha = \frac{\langle x, z \rangle}{\|z\|^2}$; this gives $\|x\|^2 \ge \frac{\langle x, z \rangle^2}{\|z\|^2}$, and the inequality follows by taking square roots. The claim about equality follows from the definition of the norm: $\|x - \alpha z\| = 0$ forces $x = \alpha z$.

Page 22: Kernel Methods and its applications

Hilbert Space: Basic facts

Defn: An inner product space $(\mathcal{H}, \langle \cdot, \cdot \rangle_{\mathcal{H}})$ is a Hilbert space if it is separable and complete. We denote the norm by $\| \cdot \|_{\mathcal{H}}$. The orthogonal complement of a subspace $M \subset \mathcal{H}$ is defined as

$$M^{\perp} = \{ z \mid \langle x, z \rangle_{\mathcal{H}} = 0 \;\; \forall x \in M \}$$

Hilbert space Projection theorem

Let $M$ be a closed subspace of a Hilbert space $(\mathcal{H}, \langle \cdot, \cdot \rangle_{\mathcal{H}})$. For every $x \in \mathcal{H}$ the following hold:

There exists a unique $\Pi_M(x) \in M$ such that $\Pi_M(x) = \mathrm{argmin}_{z \in M} \|x - z\|_{\mathcal{H}}$

$x - \Pi_M(x) \in M^{\perp}$, i.e. $\langle z, x - \Pi_M(x) \rangle_{\mathcal{H}} = 0 \;\; \forall z \in M$

$\|x\|_{\mathcal{H}}^2 = \|\Pi_M(x)\|_{\mathcal{H}}^2 + \|y\|_{\mathcal{H}}^2$, where $x = \Pi_M(x) + y$ with $y \in M^{\perp}$
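A finite-dimensional illustration of the theorem, a NumPy sketch where $M$ is the column span of a matrix $A$ inside $\mathcal{H} = \mathbb{R}^{10}$ and the projection is computed by least squares:

```python
# Projection theorem in R^10: Pi_M(x) = argmin_{z in M} ||x - z|| via least
# squares; the residual is orthogonal to M and the Pythagoras identity holds.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 3))          # M = column span of A (closed subspace)
x = rng.normal(size=10)

coef, *_ = np.linalg.lstsq(A, x, rcond=None)
proj = A @ coef                       # Pi_M(x)
resid = x - proj                      # lies in M-perp

print(np.allclose(A.T @ resid, 0))                     # resid orthogonal to M
print(np.isclose(x @ x, proj @ proj + resid @ resid))  # Pythagoras
```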

Page 23: Kernel Methods and its applications

Reproducing kernel Hilbert Space(RKHS)

Let $K$ be any kernel function. Consider the following set of functions:

$$\mathcal{H} = \left\{ f \;\middle|\; f(\cdot) = \sum_{i=1}^{m} \alpha_i K(\cdot, x_i), \;\; x_i \in \mathcal{X}, \; m \in \mathbb{N} \right\}$$

Dot product

For any $f, g \in \mathcal{H}$ with

$$f(\cdot) = \sum_{i=1}^{m_1} \alpha_i K(\cdot, x_i), \qquad g(\cdot) = \sum_{j=1}^{m_2} \beta_j K(\cdot, z_j)$$

define

$$\langle f, g \rangle_{\mathcal{H}} = \sum_{i=1}^{m_1} \sum_{j=1}^{m_2} \alpha_i \beta_j K(x_i, z_j)$$

Is it a dot product?

Page 24: Kernel Methods and its applications

Reproducing kernel Hilbert Space(RKHS)

As $K$ is symmetric, $\langle f, g \rangle_{\mathcal{H}} = \langle g, f \rangle_{\mathcal{H}}$.

$$\langle f(\cdot), f(\cdot) \rangle = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j K(x_i, x_j)$$

Recall that the matrix $K_{ij} = K(x_i, x_j)$ is psd when $K$ is a kernel function, so $\langle f(\cdot), f(\cdot) \rangle_{\mathcal{H}} \ge 0$.

Reproducing property

For any $f \in \mathcal{H}$,

$$f(x) = \sum_{i=1}^{m} \alpha_i K(x, x_i) = \Big\langle \sum_{i=1}^{m} \alpha_i K(\cdot, x_i),\, K(\cdot, x) \Big\rangle = \langle f(\cdot), K(\cdot, x) \rangle$$

Applying the Cauchy-Schwarz inequality, $|f(x)| \le \sqrt{\langle f, f \rangle_{\mathcal{H}}} \sqrt{K(x, x)}$, which gives $|f(x)| = 0$ whenever $\langle f, f \rangle_{\mathcal{H}} = 0$, so $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ is indeed a dot product.
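These definitions can be exercised numerically. A minimal sketch assuming the Gaussian kernel from Page 14 (`k` and `inner` are illustrative helper names): it stores $f$ by its coefficients and centers and confirms $f(x) = \langle f, K(\cdot, x) \rangle_{\mathcal{H}}$.

```python
# Represent f = sum_i alpha_i K(., x_i) by (alpha, centers) and check the
# reproducing property f(x) = <f, K(., x)>_H numerically.
import numpy as np

def k(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def inner(alpha, xs, beta, zs):
    # <f, g>_H = sum_ij alpha_i beta_j K(x_i, z_j)
    G = np.array([[k(xi, zj) for zj in zs] for xi in xs])
    return alpha @ G @ beta

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 2))           # centers x_i of f
alpha = rng.normal(size=5)             # coefficients alpha_i

x = rng.normal(size=2)                 # evaluation point
f_x = sum(a * k(x, xi) for a, xi in zip(alpha, xs))
# K(., x) is itself an element of H, with one center and coefficient 1:
print(np.isclose(f_x, inner(alpha, xs, np.array([1.0]), x[None, :])))  # True
```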

Page 25: Kernel Methods and its applications

Representer theorem

Representer theorem

Let $K$ be a valid kernel defined on $\mathcal{X}$ and let $\mathcal{H}$ be the corresponding RKHS. Let $\Omega$ be an increasing function. The optimization problem

$$\min_{g \in \mathcal{H}} \; G(g) = \sum_{i=1}^{m} l(g(x_i), y_i) + \Omega(\|g\|_{\mathcal{H}}^2)$$

admits a minimizer of the form $g^* = \sum_{i=1}^{m} \alpha_i K(\cdot, x_i)$.

Proof: Let $M = \mathrm{span}\{K(\cdot, x_i) \mid i = 1, \dots, m\}$. Clearly $M$ is a (finite-dimensional, hence closed) subspace of $\mathcal{H}$. Take any $g \in \mathcal{H}$ and, by the projection theorem, write $g = g_M + g_{\perp}$ with $g_M \in M$, $g_{\perp} \in M^{\perp}$. Then

$$g(x_i) = \langle g, K(\cdot, x_i) \rangle = \langle g_M + g_{\perp}, K(\cdot, x_i) \rangle = \langle g_M, K(\cdot, x_i) \rangle + \langle g_{\perp}, K(\cdot, x_i) \rangle = \langle g_M, K(\cdot, x_i) \rangle = g_M(x_i)$$

so the loss terms of $G(g)$ and $G(g_M)$ agree. As $\Omega$ is an increasing function and $\|g\|_{\mathcal{H}}^2 = \|g_M\|_{\mathcal{H}}^2 + \|g_{\perp}\|_{\mathcal{H}}^2 \ge \|g_M\|_{\mathcal{H}}^2$, we get $\Omega(\|g\|_{\mathcal{H}}^2) \ge \Omega(\|g_M\|_{\mathcal{H}}^2)$, hence $G(g) \ge G(g_M)$.
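The theorem is what makes kernel methods practical: with squared loss and $\Omega(t) = \lambda t$, the problem becomes kernel ridge regression (a standard consequence, not spelled out on the slides), and the coefficients solve a linear system, $\alpha = (K + \lambda I)^{-1} y$. A minimal NumPy sketch of this special case:

```python
# Kernel ridge regression: by the representer theorem the minimizer of
# sum_i (g(x_i) - y_i)^2 + lambda ||g||_H^2 is g* = sum_i alpha_i K(., x_i);
# substituting that form gives alpha = (K + lambda I)^{-1} y.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

gamma, lam = 0.5, 1e-2
sq = ((X[:, None] - X[None, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq)                           # Gaussian kernel Gram matrix
alpha = np.linalg.solve(K + lam * np.eye(40), y)  # representer coefficients

def g_star(x_new):                                # g*(x) = sum_i alpha_i K(x, x_i)
    k_vec = np.exp(-gamma * ((x_new - X[:, 0]) ** 2))
    return alpha @ k_vec

print(g_star(1.0), np.sin(1.0))                   # prediction vs. ground truth
```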

Page 27: Kernel Methods and its applications

References

Kernel Methods in Computational Biology, Schölkopf et al., 2004.
Kernel Methods for Pattern Analysis, John Shawe-Taylor and N. Cristianini.
Learning with Kernels, Schölkopf and Smola, 2002.