Kernel Methods and Nonlinear Classification

Piyush Rai

CS5350/6350: Machine Learning

September 15, 2011

Kernel Methods: Motivation

Often we want to capture nonlinear patterns in the data:

Nonlinear regression: the input-output relationship may not be linear.
Nonlinear classification: the classes may not be separable by a linear boundary.

Linear models (e.g., linear regression, linear SVM) are just not rich enough.

Kernels make linear models work in nonlinear settings:

By mapping the data to higher dimensions where it exhibits linear patterns.
We then apply the linear model in this new input space.
Mapping ≡ changing the feature representation.

Note: such mappings can be expensive to compute in general. Kernels give such mappings for (almost) free: in most cases, the mappings need not even be computed, thanks to the Kernel Trick!

Classifying non-linearly separable data

Consider this binary classification problem:

Each example is represented by a single feature x.
No linear separator exists for this data.

Now map each example as x → {x, x^2}. Each example now has two features ("derived" from the old representation).

Data now becomes linearly separable in the new representation.

Linear in the new representation ≡ nonlinear in the old representation.
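
To make this concrete, here is a minimal numerical sketch (not from the original slides, assuming NumPy): a 1-D dataset that no single threshold can separate becomes separable by a linear rule once the derived feature x^2 is added.

```python
import numpy as np

# 1-D data: negatives cluster near the origin, positives sit at both extremes,
# so no single threshold on x separates the two classes.
x = np.array([-4.0, -3.0, -0.5, 0.0, 0.5, 3.0, 4.0])
y = np.array([+1, +1, -1, -1, -1, +1, +1])

# Map each example as x -> (x, x^2): the second feature is "derived" from the first.
phi = np.stack([x, x ** 2], axis=1)

# In the new space the classes are separated by the linear boundary x^2 = 2
# (a horizontal line in (x, x^2) space), which is the nonlinear rule |x| > sqrt(2)
# back in the original 1-D space.
predictions = np.where(phi[:, 1] > 2.0, +1, -1)
print(np.array_equal(predictions, y))  # True
```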

Classifying non-linearly separable data

Let's look at another example:

Each example is defined by two features x = {x_1, x_2}.
No linear separator exists for this data.

Now map each example as x = {x_1, x_2} → z = {x_1^2, √2 x_1x_2, x_2^2}.
Each example now has three features ("derived" from the old representation).

Data now becomes linearly separable in the new representation.

Feature Mapping

Consider the following mapping φ for an example x = {x_1, . . . , x_D}:

φ : x → {x_1^2, x_2^2, . . . , x_D^2, x_1x_2, x_1x_3, . . . , x_1x_D, . . . , x_{D-1}x_D}

It's an example of a quadratic mapping: each new feature uses a pair of the original features.

Problem: such a mapping usually leads to a blow-up in the number of features!

Computing the mapping itself can be inefficient in such cases.
Moreover, using the mapped representation could be inefficient too;
e.g., imagine computing the similarity between two examples: φ(x)⊤φ(z).

Thankfully, kernels help us avoid both these issues!

The mapping doesn't have to be explicitly computed.
Computations with the mapped features remain efficient.

Kernels as High Dimensional Feature Mapping

Consider two examples x = {x_1, x_2} and z = {z_1, z_2}.
Let's assume we are given a function k (a kernel) that takes as inputs x and z:

k(x, z) = (x⊤z)^2
        = (x_1 z_1 + x_2 z_2)^2
        = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2
        = (x_1^2, √2 x_1 x_2, x_2^2)⊤ (z_1^2, √2 z_1 z_2, z_2^2)
        = φ(x)⊤φ(z)

The above k implicitly defines a mapping φ to a higher dimensional space:

φ(x) = {x_1^2, √2 x_1 x_2, x_2^2}

Note that we didn't have to define/compute this mapping.
Simply defining the kernel a certain way gives a higher-dimensional mapping φ.
Moreover, the kernel k(x, z) also computes the dot product φ(x)⊤φ(z), which would otherwise be much more expensive to compute explicitly.

All kernel functions have these properties.
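
As a quick sanity check (not part of the original slides, assuming NumPy), the identity k(x, z) = φ(x)⊤φ(z) for this quadratic kernel can be verified numerically:

```python
import numpy as np

def phi(v):
    """Explicit feature map for the quadratic kernel (x^T z)^2 in 2-D."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

def k_quad(x, z):
    """Quadratic kernel, computed directly in the original 2-D space."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# Both routes give the same similarity, but k_quad never builds phi(x) explicitly.
print(k_quad(x, z))      # 1.0
print(phi(x) @ phi(z))   # 1.0 (up to floating-point error)
```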

Kernels: Formally Defined

Recall: each kernel k has an associated feature mapping φ.

φ takes input x ∈ X (input space) and maps it to F ("feature space").
Kernel k(x, z) takes two inputs and gives their similarity in the F space:

φ : X → F
k : X × X → R,   k(x, z) = φ(x)⊤φ(z)

F needs to be a vector space with a dot product defined on it, also called a Hilbert space.

Can just any function be used as a kernel function? No. It must satisfy Mercer's Condition.

Mercer's Condition

For k to be a kernel function, there must exist a Hilbert space F for which k defines a dot product.

The above is true if k is a positive definite function:

∫ dx ∫ dz f(x) k(x, z) f(z) > 0   (∀f ∈ L_2)

This is Mercer's Condition.

Let k_1 and k_2 be two kernel functions; then the following are kernel functions as well:

k(x, z) = k_1(x, z) + k_2(x, z): direct sum
k(x, z) = α k_1(x, z): scalar product (with α > 0)
k(x, z) = k_1(x, z) k_2(x, z): direct product

Kernels can also be constructed by composing these rules (see the numerical sketch below).
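
A small numerical illustration of the closure rules (not from the slides, assuming NumPy): Gram matrices built from k_1 + k_2 and k_1 · k_2 stay symmetric and have no negative eigenvalues, i.e., they are valid kernel matrices themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # 20 toy examples with 3 features

def gram(kernel, X):
    """Pairwise kernel (Gram) matrix for a dataset X."""
    return np.array([[kernel(x, z) for z in X] for x in X])

k1 = lambda x, z: x @ z                                # linear kernel
k2 = lambda x, z: np.exp(-0.5 * np.sum((x - z) ** 2))  # RBF kernel

K_sum  = gram(k1, X) + gram(k2, X)     # direct sum rule
K_prod = gram(k1, X) * gram(k2, X)     # direct (elementwise) product rule

# Both combinations should have no negative eigenvalues (up to round-off).
for K in (K_sum, K_prod):
    print(np.linalg.eigvalsh(K).min() >= -1e-10)   # True
```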

The Kernel Matrix

The kernel function k also defines the Kernel Matrix K over the data.

Given N examples {x_1, . . . , x_N}, the (i, j)-th entry of K is defined as:

K_ij = k(x_i, x_j) = φ(x_i)⊤φ(x_j)

K_ij: similarity between the i-th and j-th example in the feature space F.
K: N × N matrix of pairwise similarities between examples in the F space.

K is a symmetric matrix.
K is a positive definite matrix (except for a few exceptions).

For a P.D. matrix: z⊤Kz > 0, ∀z ∈ R^N (also, all eigenvalues positive).

The Kernel Matrix K is also known as the Gram Matrix.
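
As an aside (not in the slides, assuming NumPy), the identity K_ij = φ(x_i)⊤φ(x_j) and the symmetry/eigenvalue properties can be checked numerically for the quadratic kernel:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))                     # N = 5 examples, D = 2

# Gram matrix for the quadratic kernel k(x, z) = (x^T z)^2.
K = (X @ X.T) ** 2

# Same matrix via the explicit map phi(x) = (x1^2, sqrt(2) x1 x2, x2^2).
Phi = np.column_stack([X[:, 0] ** 2, np.sqrt(2) * X[:, 0] * X[:, 1], X[:, 1] ** 2])
print(np.allclose(K, Phi @ Phi.T))              # True: K_ij = phi(x_i)^T phi(x_j)

# Gram-matrix properties: symmetric, no negative eigenvalues.
print(np.allclose(K, K.T))                      # True
print(np.linalg.eigvalsh(K).min() >= -1e-10)    # True
```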

Some Examples of Kernels

The following are the most popular kernels for real-valued vector inputs.

Linear (trivial) Kernel:
k(x, z) = x⊤z (the mapping function φ is the identity - no mapping)

Quadratic Kernel:
k(x, z) = (x⊤z)^2 or (1 + x⊤z)^2

Polynomial Kernel (of degree d):
k(x, z) = (x⊤z)^d or (1 + x⊤z)^d

Radial Basis Function (RBF) Kernel:
k(x, z) = exp[−γ ||x − z||^2]

γ is a hyperparameter (also called the kernel bandwidth).
The RBF kernel corresponds to an infinite dimensional feature space F (i.e., you can't actually write down the vector φ(x)).

Note: kernel hyperparameters (e.g., d, γ) are chosen via cross-validation.
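
For reference, here is a hedged sketch (not from the slides) of these kernels as plain NumPy functions; the parameter names d, c, and gamma are illustrative defaults:

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, d=2, c=1.0):
    # c = 0 gives (x^T z)^d; c = 1 gives (1 + x^T z)^d; d = 2 is the quadratic kernel.
    return (c + x @ z) ** d

def rbf_kernel(x, z, gamma=1.0):
    # gamma is the kernel bandwidth hyperparameter.
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 0.0])
z = np.array([0.5, 0.5])
print(linear_kernel(x, z), polynomial_kernel(x, z), rbf_kernel(x, z))
```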

Using Kernels

Kernels can turn a linear model into a nonlinear one.

Recall: the kernel k(x, z) represents a dot product in some high-dimensional feature space F.

Any learning algorithm in which examples only appear as dot products (x_i⊤x_j) can be kernelized (i.e., made nonlinear) by replacing the x_i⊤x_j terms with φ(x_i)⊤φ(x_j) = k(x_i, x_j).

Most learning algorithms are like that: Perceptron, SVM, linear regression, etc.

Many unsupervised learning algorithms can be kernelized too (e.g., K-means clustering, Principal Component Analysis). A kernelized Perceptron sketch follows below.
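
As one concrete instance of this recipe, here is a hedged sketch (not from the slides, assuming NumPy) of a kernelized Perceptron: every dot product in the standard mistake-driven update is replaced by a kernel evaluation, so the learned classifier is sign(Σ_i α_i y_i k(x_i, x)).

```python
import numpy as np

def train_kernel_perceptron(X, y, kernel, epochs=10):
    """Mistake-driven training; alpha[i] counts the mistakes made on example i."""
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    alpha = np.zeros(n)
    for _ in range(epochs):
        for i in range(n):
            # The prediction uses only kernel values, never an explicit phi(x).
            if np.sign(np.sum(alpha * y * K[:, i])) != y[i]:
                alpha[i] += 1.0
    return alpha

def predict(x, X, y, alpha, kernel):
    return np.sign(sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X)))

# Toy nonlinear problem: label is +1 iff the point lies outside the unit circle.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(60, 2))
y = np.where(np.sum(X ** 2, axis=1) > 1.0, 1.0, -1.0)

quad = lambda a, b: (1.0 + a @ b) ** 2          # quadratic kernel
alpha = train_kernel_perceptron(X, y, quad)
train_preds = np.array([predict(x, X, y, alpha, quad) for x in X])
print((train_preds == y).mean())                # training accuracy, usually 1.0 or close
```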

Kernelized SVM Training

Recall the SVM dual Lagrangian:

Maximize  L_D(w, b, ξ, α, β) = Σ_{n=1}^{N} α_n − (1/2) Σ_{m,n=1}^{N} α_m α_n y_m y_n (x_m⊤x_n)

subject to  Σ_{n=1}^{N} α_n y_n = 0,   0 ≤ α_n ≤ C;   n = 1, . . . , N

Replacing x_m⊤x_n by φ(x_m)⊤φ(x_n) = k(x_m, x_n) = K_mn, where k(., .) is some suitable kernel function:

Maximize  L_D(w, b, ξ, α, β) = Σ_{n=1}^{N} α_n − (1/2) Σ_{m,n=1}^{N} α_m α_n y_m y_n K_mn

subject to  Σ_{n=1}^{N} α_n y_n = 0,   0 ≤ α_n ≤ C;   n = 1, . . . , N

The SVM now learns a linear separator in the kernel-defined feature space F.
This corresponds to a nonlinear separator in the original space X.
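
The slides do not tie this to any particular library, but as one possible sketch, scikit-learn's SVC accepts a precomputed kernel matrix, which mirrors the kernelized dual above: you hand the solver K_mn and it finds the α's.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: +1 outside the unit circle, -1 inside (not linearly separable in 2-D).
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(np.sum(X ** 2, axis=1) > 1.0, 1, -1)

# Precompute the N x N kernel matrix K_mn = k(x_m, x_n) for an RBF kernel,
# then let the solver maximize the kernelized dual.
gamma = 1.0
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-gamma * sq_dists)

svm = SVC(C=10.0, kernel="precomputed")
svm.fit(K, y)
print(svm.score(K, y))   # training accuracy; close to 1.0 on this toy set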

Kernelized SVM Prediction

Prediction for a test example x (assume b = 0):

y = sign(w⊤x) = sign( Σ_{n∈SV} α_n y_n x_n⊤x )

SV is the set of support vectors (i.e., examples for which α_n > 0).

Replacing each example with its feature-mapped representation (x → φ(x)):

y = sign( Σ_{n∈SV} α_n y_n φ(x_n)⊤φ(x) ) = sign( Σ_{n∈SV} α_n y_n k(x_n, x) )

The weight vector for the kernelized case can be expressed as:

w = Σ_{n∈SV} α_n y_n φ(x_n) = Σ_{n∈SV} α_n y_n k(x_n, .)

Important: the kernelized SVM needs the support vectors at test time (except when you can write φ(x_n) as an explicit, reasonably-sized vector).

In the unkernelized version, w = Σ_{n∈SV} α_n y_n x_n can be computed and stored as a D × 1 vector, so the support vectors need not be stored.
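
A minimal sketch of this prediction rule (not from the slides, assuming NumPy); the support vectors, labels, and α values below are hypothetical stand-ins for what a dual SVM solver would return after training:

```python
import numpy as np

def rbf(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def svm_predict(x, support_vectors, sv_labels, sv_alphas, kernel, b=0.0):
    """y = sign( sum_{n in SV} alpha_n y_n k(x_n, x) + b ): only the support vectors are needed."""
    score = sum(a * yn * kernel(xn, x)
                for a, yn, xn in zip(sv_alphas, sv_labels, support_vectors))
    return np.sign(score + b)

# Hypothetical solver output (in practice these come from training).
support_vectors = np.array([[0.0, 1.0], [1.5, 1.5], [-1.2, 0.3]])
sv_labels = np.array([-1.0, 1.0, 1.0])
sv_alphas = np.array([0.8, 0.5, 0.3])

x_test = np.array([1.4, 1.4])
print(svm_predict(x_test, support_vectors, sv_labels, sv_alphas, rbf))
```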

SVM with an RBF Kernel

The learned decision boundary in the original space is nonlinear.

Kernels: concluding notes

Kernels give a modular way to learn nonlinear patterns using linear models.

All you need to do is replace the inner products with the kernel.
All the computations remain as efficient as in the original space.

Choice of the kernel is an important factor.

Many kernels are tailor-made for specific types of data:
Strings (string kernels): DNA matching, text classification, etc.
Trees (tree kernels): comparing parse trees of phrases/sentences.

Kernels can even be learned from the data (a hot research topic!).
Kernel learning means learning the similarities between examples (instead of using some pre-defined notion of similarity).

A question worth thinking about: wouldn't mapping the data to a higher-dimensional space cause my classifier (say, an SVM) to overfit?
The answer lies in the concepts of large margins and generalization.

Next class..

Intro to probabilistic methods for supervised learning:

Linear Regression (probabilistic version)
Logistic Regression