Page 1: Kernel Methods and its applications

Introduction to Kernel methods

ML Workshop, ISI Kolkata
Chiranjib Bhattacharyya
Machine Learning lab, Dept of CSA, IISc
[email protected]
http://drona.csa.iisc.ernet.in/~chiru

19th Oct, 2012

Page 2: Kernel Methods and its applications

Introduction

Kernel methods make Machine Learning more widely applicable.
Kernels are similarity measures.
Kernels can help integrate different sources of data.

Page 3: Kernel Methods and its applications

Agenda

1. Kernel Trick: SVM and Non-linear Classification

2. Definition of Kernel functions

3. Kernels and Hilbert Spaces: RKHS, Representer theorem, etc.

Page 4: Kernel Methods and its applications

PART 1: KERNEL TRICK

Page 5: Kernel Methods and its applications

Binary classification

Classifier

$$f : \mathcal{X} \to \{-1, 1\}, \qquad f(x) = \mathrm{sign}(w^\top x + b)$$

Data: $D = \{(x_i, y_i) \mid i = 1, \dots, m\}$, $x_i \in \mathcal{X}$, $y_i \in \{1, -1\}$

Goal: find $f$ from $D$.

Page 7: Kernel Methods and its applications

Review of C-SVM

Primal problem:

$$\min_{w,b}\; C \sum_{i=1}^{m} \max\big(1 - y_i(w^\top x_i + b),\, 0\big) + \frac{1}{2}\|w\|^2$$

C-SVM dual formulation:

$$\begin{aligned}
\underset{\alpha}{\text{maximize}} \quad & -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j + \sum_{i=1}^{m} \alpha_i \\
\text{subject to} \quad & 0 \le \alpha_i \le C, \quad \sum_{i} \alpha_i y_i = 0
\end{aligned}$$

At optimality $w = \sum_{i=1}^{m} \alpha_i y_i x_i$, so

$$f(x) = \mathrm{sign}\Big(\sum_{i=1}^{m} \alpha_i y_i\, x_i^\top x + b\Big)$$
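To see the optimality condition in practice, here is a minimal sketch, assuming scikit-learn (whose SVC solves exactly this dual and stores $\alpha_i y_i$ in `dual_coef_`), that recovers $w$ from the dual variables:

```python
# Minimal check of w = sum_i alpha_i y_i x_i using scikit-learn's C-SVM.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only
# (alpha_i = 0 for all other points).
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(clf.coef_, w_from_dual))  # True
```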

Page 8: Kernel Methods and its applications

C-SVM in feature spaces

Let us work with a feature map, Φ(x).

$$\begin{aligned}
\underset{\alpha}{\text{maximize}} \quad & -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, \Phi(x_i)^\top \Phi(x_j) + \sum_{i=1}^{m} \alpha_i \\
\text{subject to} \quad & 0 \le \alpha_i \le C, \quad \sum_{i} \alpha_i y_i = 0
\end{aligned}$$

and our classifier is

$$f(x) = \mathrm{sign}\Big(\sum_{i=1}^{m} \alpha_i y_i\, \Phi(x_i)^\top \Phi(x) + b\Big)$$

Let the dot product between any pair of examples, computed in the feature space, be denoted by

$$K(x, z) = \Phi(x)^\top \Phi(z)$$

Page 9: Kernel Methods and its applications

C-SVM in feature spaces

Let us work with a feature map, Φ(x).

$$\begin{aligned}
\underset{\alpha}{\text{maximize}} \quad & -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, K(x_i, x_j) + \sum_{i=1}^{m} \alpha_i \\
\text{subject to} \quad & 0 \le \alpha_i \le C, \quad \sum_{i} \alpha_i y_i = 0
\end{aligned}$$

and our classifier is

$$f(x) = \mathrm{sign}\Big(\sum_{i=1}^{m} \alpha_i y_i\, K(x_i, x) + b\Big)$$

where $K(x, z) = \Phi(x)^\top \Phi(z)$ denotes the dot product between any pair of examples computed in the feature space.
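Because the dual and the classifier touch the data only through $K$, the SVM can be trained from the Gram matrix alone. A minimal sketch, assuming scikit-learn (the Gaussian kernel here anticipates Page 14):

```python
# Training the C-SVM from kernel evaluations only, with no explicit Phi.
# kernel="precomputed" makes SVC accept Gram matrices directly.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

K_tr = rbf_kernel(X_tr, X_tr, gamma=1.0)   # K(x_i, x_j) on training pairs
clf = SVC(kernel="precomputed", C=1.0).fit(K_tr, y_tr)

K_te = rbf_kernel(X_te, X_tr, gamma=1.0)   # K(test point, training point)
print(clf.score(K_te, y_te))               # accuracy on held-out data
```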

Page 10: Kernel Methods and its applications

An example

Let $x \in \mathbb{R}^2$ and $\Phi(x) = [\,x_1^2 \;\; x_2^2 \;\; \sqrt{2}\,x_1 x_2\,]^\top$. Then

$$K(x, z) = \Phi(x)^\top \Phi(z) = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2 = \langle x, z \rangle^2$$

More generally, for $x, z \in \mathbb{R}^d$, $K(x, z) = (x^\top z)^r$ is a dot product in a feature space of dimension $\binom{d+r-1}{r}$.

If $d = 256$, $r = 4$, the feature space size is 635,376.

However, if we know $K$, one can still solve the SVM formulation without explicitly evaluating $\Phi$.
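A quick numeric sanity check of the identity above (a plain NumPy sketch):

```python
# Check Phi(x)^T Phi(z) == (x^T z)^2 for Phi(x) = [x1^2, x2^2, sqrt(2) x1 x2].
import numpy as np

def phi(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
print(np.isclose(phi(x) @ phi(z), (x @ z) ** 2))  # True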

Page 11: Kernel Methods and its applications

Kernel function

Kernel function

$K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a Kernel function if

$K(x, z) = K(z, x)$ (symmetric), and

$K$ is positive semidefinite, i.e. for all $n$ and all $x_1, \dots, x_n \in \mathcal{X}$, the matrix $K_{ij} = K(x_i, x_j)$ is psd.

Recall that a matrix $K \in \mathbb{R}^{n \times n}$ is psd if $u^\top K u \ge 0$ for all $u \in \mathbb{R}^n$.
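An empirical check of the definition, a NumPy-only sketch: the Gram matrix of the Gaussian kernel (Page 14) is symmetric with nonnegative eigenvalues.

```python
# Gram matrix of the Gaussian kernel: symmetric and psd up to round-off.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                     # 50 points in R^5
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq_dists)                      # K_ij = exp(-gamma ||xi - xj||^2)

print(np.allclose(K, K.T))                       # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)     # nonnegative spectrum
```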

Page 12: Kernel Methods and its applications

Examples of Kernel function

$K(x, z) = \Phi(x)^\top \Phi(z)$ where $\Phi : \mathcal{E} \to \mathbb{R}^d$

$K$ is symmetric, i.e. $K(x, z) = K(z, x)$.

Positive semidefinite: let $D = \{x_1, x_2, \dots, x_n\}$ be a set of $n$ arbitrarily chosen elements of $\mathcal{E}$, and define

$$K_{ij} = \Phi(x_i)^\top \Phi(x_j)$$

For any $u \in \mathbb{R}^n$ it is straightforward to see that

$$u^\top K u = \|\Phi(D) u\|_2^2 \ge 0, \qquad \Phi(D) = [\Phi(x_1), \dots, \Phi(x_n)]$$
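The identity $u^\top K u = \|\Phi(D) u\|_2^2$ can be verified directly; a NumPy sketch using the explicit feature map from Page 10:

```python
# Verify u^T K u == ||Phi(D) u||^2 for a feature-map kernel.
import numpy as np

def phi(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

rng = np.random.default_rng(0)
D = rng.normal(size=(8, 2))                      # n = 8 points in R^2
PhiD = np.stack([phi(x) for x in D], axis=1)     # columns are Phi(x_i)
K = PhiD.T @ PhiD                                # K_ij = Phi(x_i)^T Phi(x_j)

u = rng.normal(size=8)
print(np.isclose(u @ K @ u, np.linalg.norm(PhiD @ u) ** 2))  # True
```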

Page 14: Kernel Methods and its applications

Examples of Kernel functions

$K(x, z) = x^\top z$, with $\Phi(x) = x$

$K(x, z) = (x^\top z)^r$, with $\Phi_{t_1 t_2 \dots t_d}(x) = \sqrt{\dfrac{r!}{t_1!\, t_2! \cdots t_d!}}\; x_1^{t_1} x_2^{t_2} \cdots x_d^{t_d}$, indexed by multi-indices with $\sum_{i=1}^{d} t_i = r$

$K(x, z) = e^{-\gamma \|x - z\|^2}$
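The polynomial case can be checked against this feature map by enumerating the multi-indices; a small sketch using only the standard library and NumPy:

```python
# Check (x^T z)^r == Phi(x)^T Phi(z) for the degree-r monomial feature map,
# enumerating all multi-indices t with t1 + ... + td = r.
import itertools
import math
import numpy as np

def poly_features(x, r):
    d = len(x)
    feats = []
    for t in itertools.product(range(r + 1), repeat=d):
        if sum(t) != r:
            continue
        coef = math.sqrt(math.factorial(r) / math.prod(math.factorial(ti) for ti in t))
        feats.append(coef * math.prod(x[i] ** t[i] for i in range(d)))
    return np.array(feats)

rng = np.random.default_rng(0)
x, z, r = rng.normal(size=3), rng.normal(size=3), 4
print(np.isclose(poly_features(x, r) @ poly_features(z, r), (x @ z) ** r))  # True
```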

Page 15: Kernel Methods and its applications

Kernel Construction

Let $K_1$ and $K_2$ be two valid kernels, i.e. each admits a feature map $K(x, y) = \Phi(x)^\top \Phi(y)$. Then the following are also valid kernels:

$K(u, v) = K_1(u, v)\, K_2(u, v)$ (product)

$K = \alpha K_1 + \beta K_2$, $\alpha, \beta \ge 0$ (conic combination)

$\tilde{K}(x, y) = \dfrac{K(x, y)}{\sqrt{K(x, x)}\sqrt{K(y, y)}}$ (normalization of a valid kernel $K$)

These rules build up the Gaussian kernel from the linear one:

$K(x, y) = x^\top y$

$K(x, y) = (x^\top y)^i$

$K(x, y) = \lim_{N \to \infty} \sum_{i=0}^{N} \dfrac{(x^\top y)^i}{i!} = e^{x^\top y}$

$K(x, y) = e^{-\frac{1}{2}\|x - y\|^2}$ (normalize $e^{x^\top y}$)
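The closure rules can be checked empirically on Gram matrices; a NumPy sketch:

```python
# Elementwise product, conic combination, and normalization of psd
# Gram matrices stay psd (checked via the smallest eigenvalue).
import numpy as np

def min_eig(K):
    return np.linalg.eigvalsh(K).min()

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
K1 = X @ X.T                                   # linear kernel Gram matrix
sq = ((X[:, None] - X[None, :]) ** 2).sum(-1)
K2 = np.exp(-0.5 * sq)                         # Gaussian kernel Gram matrix

prod = K1 * K2                                 # elementwise (Schur) product
conic = 2.0 * K1 + 3.0 * K2                    # alpha, beta >= 0
d = np.sqrt(np.diag(K2))
norm = K2 / np.outer(d, d)                     # normalized kernel

for K in (prod, conic, norm):
    print(min_eig(K) >= -1e-8)                 # psd up to round-off
```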

Page 17: Kernel Methods and its applications

Kernel function and feature map

A theorem due to Mercer guarantees a feature map for symmetric, psd kernel functions.

Loosely stated: for a symmetric function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, there exists an expansion $K(x, z) = \Phi(x)^\top \Phi(z)$ iff

$$\int_{\mathcal{X}} \int_{\mathcal{X}} g(x)\, g(z)\, K(x, z)\, dx\, dz \ge 0 \quad \text{for all } g \in L_2(\mathcal{X})$$

Page 18: Kernel Methods and its applications

PART 2: Kernels and Hilbert spaces

Page 19: Kernel Methods and its applications

What is a Dot product (aka Inner Product)?

Let $\mathcal{X}$ be a vector space.

What is a Dot product

Symmetry: $\langle u, v \rangle = \langle v, u \rangle$ for $u, v \in \mathcal{X}$

Bilinearity: $\langle \alpha u + \beta v, w \rangle = \alpha \langle u, w \rangle + \beta \langle v, w \rangle$ for $u, v, w \in \mathcal{X}$

Positive definiteness: $\langle u, u \rangle \ge 0$ for $u \in \mathcal{X}$, and $\langle u, u \rangle = 0$ iff $u = 0$

Norm

$\|x\| = \sqrt{\langle x, x \rangle}$

$\|x\| = 0 \implies x = 0$

Page 20: Kernel Methods and its applications

Examples of Dot products

$\mathcal{X} = \mathbb{R}^n$, $\langle u, v \rangle = u^\top v$

$\mathcal{X} = \mathbb{R}^n$, $\langle u, v \rangle = \sum_{i=1}^{n} \lambda_i u_i v_i$, $\lambda_i \ge 0$

$\mathcal{X} = L_2(X) = \left\{ f : \int_X f(x)^2\, dx < \infty \right\}$, with $\langle f, g \rangle = \int_X f(x)\, g(x)\, dx$ for $f, g \in \mathcal{X}$

Page 21: Kernel Methods and its applications

Cauchy-Schwarz inequality

Cauchy-Schwarz inequality

Let $\mathcal{X}$ be an inner product space. Then

$$|\langle x, z \rangle| \le \|x\|\, \|z\| \qquad \forall\, x, z \in \mathcal{X}$$

and equality holds iff $x = \alpha z$ for some scalar $\alpha$.

Proof: for all $\alpha \in \mathbb{R}$, $\|x - \alpha z\|^2 \ge 0$, i.e.

$$\|x\|^2 - 2\alpha \langle x, z \rangle + \alpha^2 \|z\|^2 \ge 0 \quad \forall \alpha$$

Let $\alpha = \frac{\langle x, z \rangle}{\|z\|^2}$; this gives $\|x\|^2 \ge \frac{\langle x, z \rangle^2}{\|z\|^2}$, and the inequality follows by taking square roots. The claim about equality follows from the definition of the norm: $\|x - \alpha z\| = 0$ forces $x = \alpha z$.

Page 22: Kernel Methods and its applications

Hilbert Space: Basic facts

Defn: An inner product space $(\mathcal{H}, \langle \cdot, \cdot \rangle_{\mathcal{H}})$ is a Hilbert space if it is separable and complete. We denote the norm by $\| \cdot \|_{\mathcal{H}}$. The orthogonal complement of a subspace $M \subset \mathcal{H}$ is defined as

$$M^{\perp} = \{ z \mid \langle x, z \rangle_{\mathcal{H}} = 0 \;\; \forall x \in M \}$$

Hilbert space Projection theorem

Let $M$ be a closed subspace of a Hilbert space $(\mathcal{H}, \langle \cdot, \cdot \rangle_{\mathcal{H}})$. For every $x \in \mathcal{H}$ the following hold:

There exists a unique $\Pi_M(x) \in M$ such that $\Pi_M(x) = \mathrm{argmin}_{z \in M} \|x - z\|_{\mathcal{H}}$

$x - \Pi_M(x) \in M^{\perp}$, i.e. $\langle z, x - \Pi_M(x) \rangle_{\mathcal{H}} = 0 \;\; \forall z \in M$

$\|x\|_{\mathcal{H}}^2 = \|\Pi_M(x)\|_{\mathcal{H}}^2 + \|y\|_{\mathcal{H}}^2$, where $x = \Pi_M(x) + y$ with $y \in M^{\perp}$
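A finite-dimensional illustration of the theorem, a NumPy sketch where $M$ is the column span of a matrix $A$ inside $\mathcal{H} = \mathbb{R}^{10}$ and the projection is computed by least squares:

```python
# Projection theorem in R^10: Pi_M(x) = argmin_{z in M} ||x - z|| via least
# squares; the residual is orthogonal to M and the Pythagoras identity holds.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 3))          # M = column span of A (closed subspace)
x = rng.normal(size=10)

coef, *_ = np.linalg.lstsq(A, x, rcond=None)
proj = A @ coef                       # Pi_M(x)
resid = x - proj                      # lies in M-perp

print(np.allclose(A.T @ resid, 0))                     # resid orthogonal to M
print(np.isclose(x @ x, proj @ proj + resid @ resid))  # Pythagoras
```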

Page 23: Kernel Methods and its applications

Reproducing kernel Hilbert Space(RKHS)

Let $K$ be any kernel function. Consider the following set of functions:

$$\mathcal{H} = \left\{ f \;\middle|\; f(\cdot) = \sum_{i=1}^{m} \alpha_i K(\cdot, x_i), \;\; x_i \in \mathcal{X}, \; m \in \mathbb{N} \right\}$$

Dot product

For any $f, g \in \mathcal{H}$ with

$$f(\cdot) = \sum_{i=1}^{m_1} \alpha_i K(\cdot, x_i), \qquad g(\cdot) = \sum_{j=1}^{m_2} \beta_j K(\cdot, z_j)$$

define

$$\langle f, g \rangle_{\mathcal{H}} = \sum_{i=1}^{m_1} \sum_{j=1}^{m_2} \alpha_i \beta_j K(x_i, z_j)$$

Is it a dot product?

Page 24: Kernel Methods and its applications

Reproducing kernel Hilbert Space(RKHS)

As $K$ is symmetric, $\langle f, g \rangle_{\mathcal{H}} = \langle g, f \rangle_{\mathcal{H}}$.

$$\langle f(\cdot), f(\cdot) \rangle = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j K(x_i, x_j)$$

Recall that the matrix $K_{ij} = K(x_i, x_j)$ is psd when $K$ is a kernel function, so $\langle f(\cdot), f(\cdot) \rangle_{\mathcal{H}} \ge 0$.

Reproducing property

For any $f \in \mathcal{H}$,

$$f(x) = \sum_{i=1}^{m} \alpha_i K(x, x_i) = \Big\langle \sum_{i=1}^{m} \alpha_i K(\cdot, x_i),\, K(\cdot, x) \Big\rangle = \langle f(\cdot), K(\cdot, x) \rangle$$

Applying the Cauchy-Schwarz inequality, $|f(x)| \le \sqrt{\langle f, f \rangle_{\mathcal{H}}} \sqrt{K(x, x)}$, which gives $|f(x)| = 0$ whenever $\langle f, f \rangle_{\mathcal{H}} = 0$, so $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ is indeed a dot product.
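These definitions can be exercised numerically. A minimal sketch assuming the Gaussian kernel from Page 14 (`k` and `inner` are illustrative helper names): it stores $f$ by its coefficients and centers and confirms $f(x) = \langle f, K(\cdot, x) \rangle_{\mathcal{H}}$.

```python
# Represent f = sum_i alpha_i K(., x_i) by (alpha, centers) and check the
# reproducing property f(x) = <f, K(., x)>_H numerically.
import numpy as np

def k(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def inner(alpha, xs, beta, zs):
    # <f, g>_H = sum_ij alpha_i beta_j K(x_i, z_j)
    G = np.array([[k(xi, zj) for zj in zs] for xi in xs])
    return alpha @ G @ beta

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 2))           # centers x_i of f
alpha = rng.normal(size=5)             # coefficients alpha_i

x = rng.normal(size=2)                 # evaluation point
f_x = sum(a * k(x, xi) for a, xi in zip(alpha, xs))
# K(., x) is itself an element of H, with one center and coefficient 1:
print(np.isclose(f_x, inner(alpha, xs, np.array([1.0]), x[None, :])))  # True
```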

Page 25: Kernel Methods and its applications

Representer theorem

Representer theorem

Let $K$ be a valid kernel defined on $\mathcal{X}$ and let $\mathcal{H}$ be the corresponding RKHS. Let $\Omega$ be an increasing function. The optimization problem

$$\min_{g \in \mathcal{H}} \; G(g) = \sum_{i=1}^{m} l(g(x_i), y_i) + \Omega(\|g\|_{\mathcal{H}}^2)$$

admits a minimizer of the form $g^* = \sum_{i=1}^{m} \alpha_i K(\cdot, x_i)$.

Proof: Let $M = \mathrm{span}\{K(\cdot, x_i) \mid i = 1, \dots, m\}$. Clearly $M$ is a (finite-dimensional, hence closed) subspace of $\mathcal{H}$. Take any $g \in \mathcal{H}$ and, by the projection theorem, write $g = g_M + g_{\perp}$ with $g_M \in M$, $g_{\perp} \in M^{\perp}$. Then

$$g(x_i) = \langle g, K(\cdot, x_i) \rangle = \langle g_M + g_{\perp}, K(\cdot, x_i) \rangle = \langle g_M, K(\cdot, x_i) \rangle + \langle g_{\perp}, K(\cdot, x_i) \rangle = \langle g_M, K(\cdot, x_i) \rangle = g_M(x_i)$$

so the loss terms of $G(g)$ and $G(g_M)$ agree. As $\Omega$ is an increasing function and $\|g\|_{\mathcal{H}}^2 = \|g_M\|_{\mathcal{H}}^2 + \|g_{\perp}\|_{\mathcal{H}}^2 \ge \|g_M\|_{\mathcal{H}}^2$, we get $\Omega(\|g\|_{\mathcal{H}}^2) \ge \Omega(\|g_M\|_{\mathcal{H}}^2)$, hence $G(g) \ge G(g_M)$.
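The theorem is what makes kernel methods practical: with squared loss and $\Omega(t) = \lambda t$, the problem becomes kernel ridge regression (a standard consequence, not spelled out on the slides), and the coefficients solve a linear system, $\alpha = (K + \lambda I)^{-1} y$. A minimal NumPy sketch of this special case:

```python
# Kernel ridge regression: by the representer theorem the minimizer of
# sum_i (g(x_i) - y_i)^2 + lambda ||g||_H^2 is g* = sum_i alpha_i K(., x_i);
# substituting that form gives alpha = (K + lambda I)^{-1} y.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

gamma, lam = 0.5, 1e-2
sq = ((X[:, None] - X[None, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq)                           # Gaussian kernel Gram matrix
alpha = np.linalg.solve(K + lam * np.eye(40), y)  # representer coefficients

def g_star(x_new):                                # g*(x) = sum_i alpha_i K(x, x_i)
    k_vec = np.exp(-gamma * ((x_new - X[:, 0]) ** 2))
    return alpha @ k_vec

print(g_star(1.0), np.sin(1.0))                   # prediction vs. ground truth
```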

Page 27: Kernel Methods and its applications

References

Kernel Methods in Computational Biology, Schölkopf et al., 2004.
Kernel Methods for Pattern Analysis, John Shawe-Taylor and N. Cristianini.
Learning with Kernels, Schölkopf and Smola, 2002.