
Support Vector Machines
M.W. Mak

1. Introduction to SVMs
2. Linear SVMs
3. Non-linear SVMs

References:

1. S.Y. Kung, M.W. Mak, and S.H. Lin. Biometric Authentication: A Machine Learning Approach, Prentice Hall, to appear.

2. S.R. Gunn, 1998. Support Vector Machines for Classification and Regression. (http://www.isis.ecs.soton.ac.uk/resources/svminfo/)

3. Bernhard Schölkopf. Statistical learning and kernel methods. MSR-TR 2000-23, Microsoft Research, 2000. (ftp://ftp.research.microsoft.com/pub/tr/tr-2000-23.pdf)

4. For more resources on support vector machines, see http://www.kernel-machines.org/


Introduction

SVMs were developed by Vapnik in 1995 and have become popular due to their attractive theoretical properties and promising performance.

Conventional neural networks are based on empirical risk minimization: the network weights are determined by minimizing the mean squared error between the actual outputs and the desired outputs.

SVMs are based on the structural risk minimization principle: the parameters are optimized by minimizing a bound on the generalization error (equivalently, by maximizing the margin) rather than the training error alone.

SVMs have been shown to possess better generalization capability than conventional neural networks.


Introduction (Cont.)

Given N labeled empirical data:

$(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N) \in X \times \{-1, +1\},$    (1)

where $X \subset \mathbb{R}^D$ is the set of input data and the $y_i$ are the labels.

[Figure: two classes ($y_i = +1$ and $y_i = -1$) in the domain $X$, with class means $\mathbf{c}_1$ and $\mathbf{c}_2$, plotted against $x_1$ and $x_2$.]


Introduction (Cont.)

We construct a simple classifier by computing the means of the two classes:

$\mathbf{c}_1 = \frac{1}{N_1}\sum_{i:\,y_i=+1}\mathbf{x}_i \quad\text{and}\quad \mathbf{c}_2 = \frac{1}{N_2}\sum_{i:\,y_i=-1}\mathbf{x}_i,$

where $N_1$ and $N_2$ are the numbers of data points in the class with positive and negative labels, respectively.

We assign a new point $\mathbf{x}$ to the class whose mean is closer to it.

To achieve this, we compute

$\mathbf{c} = \tfrac{1}{2}(\mathbf{c}_1 + \mathbf{c}_2).$    (2)


Introduction (Cont.)

Then, we determine the class of $\mathbf{x}$ by checking whether the vector connecting $\mathbf{x}$ and $\mathbf{c}$ encloses an angle smaller than $\pi/2$ with the vector $\mathbf{w} = \mathbf{c}_1 - \mathbf{c}_2$:

$y = \operatorname{sgn}\langle \mathbf{x}-\mathbf{c}, \mathbf{w}\rangle
   = \operatorname{sgn}\langle \mathbf{x}-\tfrac{1}{2}(\mathbf{c}_1+\mathbf{c}_2),\; \mathbf{c}_1-\mathbf{c}_2\rangle
   = \operatorname{sgn}(\langle \mathbf{x}, \mathbf{w}\rangle + b),$

where $b = \tfrac{1}{2}\left(\lVert\mathbf{c}_2\rVert^2 - \lVert\mathbf{c}_1\rVert^2\right).$

[Figure: the two class means $\mathbf{c}_1$ and $\mathbf{c}_2$, their midpoint $\mathbf{c}$, and a test point $\mathbf{x}$ in the domain $X$.]
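To make the construction concrete, here is a minimal NumPy sketch of this nearest-mean classifier (the toy data are made up for illustration and are not part of the slides):

import numpy as np

def nearest_mean_classifier(X, y):
    # Class means c1 (positive class) and c2 (negative class)
    c1 = X[y == +1].mean(axis=0)
    c2 = X[y == -1].mean(axis=0)
    w = c1 - c2                                      # w = c1 - c2
    b = 0.5 * (np.dot(c2, c2) - np.dot(c1, c1))      # b = (||c2||^2 - ||c1||^2) / 2
    return lambda x: np.sign(np.dot(x, w) + b)       # y = sgn(<x, w> + b)

# Illustrative data: two clusters labelled -1 and +1
X = np.array([[0.0, 0.0], [1.0, 0.2], [2.0, 2.0], [2.5, 1.8]])
y = np.array([-1, -1, +1, +1])
f = nearest_mean_classifier(X, y)
print(f(np.array([0.5, 0.1])), f(np.array([2.2, 1.9])))   # -> -1.0  1.0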


Introduction (Cont.)

In the special case where $b = 0$, we have

$y = \operatorname{sgn}\left(\langle \mathbf{x}, \mathbf{c}_1\rangle - \langle \mathbf{x}, \mathbf{c}_2\rangle\right)
   = \operatorname{sgn}\left(\frac{1}{N_1}\sum_{i:\,y_i=+1}\langle \mathbf{x}, \mathbf{x}_i\rangle
   - \frac{1}{N_2}\sum_{i:\,y_i=-1}\langle \mathbf{x}, \mathbf{x}_i\rangle\right).$    (3)

This means that we use ALL data points $\mathbf{x}_i$, each weighted equally by $1/N_1$ or $1/N_2$, to define the decision plane.


Introduction (Cont.)

[Figure: the decision plane (perpendicular to $\mathbf{w}$) separating the $y_i = +1$ and $y_i = -1$ classes, with class means $\mathbf{c}_1$ and $\mathbf{c}_2$ and a test point $\mathbf{x}$ in the domain $X$.]


Introduction (Cont.)

However, we might want to remove the influence of patterns that are far away from the decision boundary, because their influence is usually small.

We may also select only a few important data points (called support vectors) and weight them differently.

Then, we have a support vector machine.


Introduction (Cont.)

[Figure: a decision plane with maximum margin; the support vectors lie on the margin boundaries separating the $y_i = +1$ and $y_i = -1$ classes.]

We aim to find a decision plane that maximizes the margin.


Linear SVMs

Assume that all training data satisfy the constraints:

$\langle \mathbf{w}, \mathbf{x}_i\rangle + b \ge +1 \quad\text{for } y_i = +1,$
$\langle \mathbf{w}, \mathbf{x}_i\rangle + b \le -1 \quad\text{for } y_i = -1,$    (4)

which means

$y_i\left(\langle \mathbf{w}, \mathbf{x}_i\rangle + b\right) - 1 \ge 0 \quad \forall i.$    (5)

Training data points for which the above equality holds lie on hyperplanes parallel to the decision plane.


Linear SVMs (Cont.)

The points closest to the decision plane $\langle\mathbf{w},\mathbf{x}\rangle + b = 0$ satisfy $\langle \mathbf{w}, \mathbf{x}_1\rangle + b = +1$ and $\langle \mathbf{w}, \mathbf{x}_2\rangle + b = -1$. Subtracting gives

$\langle \mathbf{w}, (\mathbf{x}_1 - \mathbf{x}_2)\rangle = 2
\;\Longrightarrow\;
\left\langle \frac{\mathbf{w}}{\lVert\mathbf{w}\rVert}, (\mathbf{x}_1 - \mathbf{x}_2)\right\rangle = \frac{2}{\lVert\mathbf{w}\rVert},$

so the margin is $d = 2/\lVert\mathbf{w}\rVert$.

Therefore, maximizing the margin is equivalent to minimizing $\lVert\mathbf{w}\rVert^2$.

[Figure: the decision plane $\langle\mathbf{w},\mathbf{x}\rangle + b = 0$ and the margin hyperplanes $\langle\mathbf{w},\mathbf{x}\rangle + b = \pm 1$, separated by the margin $d$.]


Linear SVMs (Lagrangian)

We minimize $\lVert\mathbf{w}\rVert^2$ subject to the constraint

$y_i\left(\langle \mathbf{w}, \mathbf{x}_i\rangle + b\right) - 1 \ge 0 \quad \forall i.$    (6)

This can be achieved by introducing Lagrange multipliers $\{\alpha_i \ge 0\}_{i=1}^{N}$ and a Lagrangian

$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\lVert\mathbf{w}\rVert^2 - \sum_{i=1}^{N}\alpha_i\left(y_i(\langle \mathbf{w}, \mathbf{x}_i\rangle + b) - 1\right).$    (7)

The Lagrangian has to be minimized with respect to $\mathbf{w}$ and $b$ and maximized with respect to $\alpha_i \ge 0$.


Linear SVMs (Lagrangian)

Setting

$\frac{\partial}{\partial b}L(\mathbf{w}, b, \boldsymbol{\alpha}) = 0 \quad\text{and}\quad \frac{\partial}{\partial \mathbf{w}}L(\mathbf{w}, b, \boldsymbol{\alpha}) = 0,$

we obtain

$\sum_{i=1}^{N}\alpha_i y_i = 0 \quad\text{and}\quad \mathbf{w} = \sum_{i=1}^{N}\alpha_i y_i \mathbf{x}_i.$    (8)

Patterns for which $\alpha_k > 0$ are called support vectors. These vectors lie on the margin and satisfy

$y_k\left(\langle \mathbf{w}, \mathbf{x}_k\rangle + b\right) - 1 = 0, \quad k \in S,$

where $S$ contains the indexes of the support vectors.

Patterns for which $\alpha_k = 0$ are considered irrelevant to the classification.


Linear SVMs (Wolfe Dual)

Substituting (8) into (7), we obtain the Wolfe dual:

Maximize: $L(\boldsymbol{\alpha}) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j \langle \mathbf{x}_i, \mathbf{x}_j\rangle$    (9)

subject to $\alpha_i \ge 0,\; i = 1,\ldots,N,$ and $\sum_{i=1}^{N}\alpha_i y_i = 0.$

The decision hyperplane is thus

$f(\mathbf{x}) = \operatorname{sgn}\left(\langle \mathbf{w}, \mathbf{x}\rangle + b\right) = \operatorname{sgn}\left(\sum_{i=1}^{N}\alpha_i y_i \langle \mathbf{x}, \mathbf{x}_i\rangle + b\right),$

where $b = \frac{1}{y_k} - \langle \mathbf{w}, \mathbf{x}_k\rangle$ and $\mathbf{x}_k$ is a support vector.
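In practice the dual in Eq. (9) is solved by a quadratic-programming routine rather than by hand. The sketch below uses scikit-learn's SVC with a linear kernel (an assumed, illustrative setup; the data are synthetic and a large C approximates the hard-margin case) and checks that the w recovered from Eq. (8) matches the fitted hyperplane:

import numpy as np
from sklearn.svm import SVC

# Illustrative, linearly separable data (not from the slides)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)), rng.normal(3.0, 0.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

clf = SVC(kernel='linear', C=1e3).fit(X, y)        # large C ~ hard margin

# Eq. (8): w = sum_k alpha_k y_k x_k, summed over the support vectors
alpha_y = clf.dual_coef_[0]                        # alpha_k * y_k
w = (alpha_y[:, None] * clf.support_vectors_).sum(axis=0)
b = clf.intercept_[0]
print(np.allclose(w, clf.coef_[0]))                # True: same hyperplane
print('#SV =', clf.n_support_.sum())

# Decision function f(x) = sgn(<w, x> + b)
x_new = np.array([[1.5, 1.5]])
print(np.sign(x_new @ w + b), clf.predict(x_new))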


Linear SVMs (Example)

Analytical example (3-point problem):

$\mathbf{x}_1 = [0.0 \;\; 0.0]^T, \quad y_1 = -1$
$\mathbf{x}_2 = [1.0 \;\; 0.0]^T, \quad y_2 = +1$
$\mathbf{x}_3 = [0.0 \;\; 1.0]^T, \quad y_3 = +1$

Objective function:

Maximize: $L(\boldsymbol{\alpha}) = \sum_{i=1}^{3}\alpha_i - \frac{1}{2}\sum_{i=1}^{3}\sum_{j=1}^{3}\alpha_i\alpha_j y_i y_j \langle \mathbf{x}_i, \mathbf{x}_j\rangle$

subject to $\alpha_i \ge 0,\; i = 1,\ldots,3,$ and $\sum_{i=1}^{3}\alpha_i y_i = 0.$


Linear SVMs (Example)

We introduce another Lagrange multiplier λ to obtain the Lagrangian

$F(\boldsymbol{\alpha}, \lambda) = \sum_{i=1}^{3}\alpha_i - \frac{1}{2}\left(\alpha_2^2 + \alpha_3^2\right) + \lambda\left(\alpha_2 + \alpha_3 - \alpha_1\right),$

where we have used $\langle\mathbf{x}_2,\mathbf{x}_2\rangle = \langle\mathbf{x}_3,\mathbf{x}_3\rangle = 1$ and the fact that all other inner products are zero.

Differentiating $F(\boldsymbol{\alpha}, \lambda)$ with respect to λ and $\alpha_i$ and setting the results to zero, we obtain

$\alpha_1 = 4, \quad \alpha_2 = 2, \quad \alpha_3 = 2, \quad \lambda = 1.$
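The same maximization can be checked numerically. Below is a small sketch (assuming SciPy is available; it is not part of the slides) that solves the three-point dual with SLSQP by minimizing the negative objective:

import numpy as np
from scipy.optimize import minimize

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([-1.0, 1.0, 1.0])
K = X @ X.T                                        # Gram matrix <x_i, x_j>

def neg_dual(a):
    # -L(alpha) = -(sum_i a_i - 0.5 * sum_ij a_i a_j y_i y_j <x_i, x_j>)
    return -(a.sum() - 0.5 * a @ (np.outer(y, y) * K) @ a)

constraints = {'type': 'eq', 'fun': lambda a: a @ y}   # sum_i alpha_i y_i = 0
result = minimize(neg_dual, x0=np.ones(3), method='SLSQP',
                  bounds=[(0.0, None)] * 3, constraints=[constraints])
print(np.round(result.x, 3))                       # approximately [4. 2. 2.]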


Linear SVMs (Example)

Substituting the Lagrange multipliers into Eq. (8):

$\mathbf{w} = \sum_{i=1}^{3}\alpha_i y_i \mathbf{x}_i = [2 \;\; 2]^T,$

$b = \frac{1}{y_1} - \langle \mathbf{w}, \mathbf{x}_1\rangle = -1.$

Decision boundary: $\langle \mathbf{w}, \mathbf{x}\rangle + b = 0$, i.e. $x_1 + x_2 = 0.5.$

[Figure: the three points, the decision boundary $x_1 + x_2 = 0.5$, and the margins. Caption: Linear SVM, C=100, #SV=3, acc=100.00%, normW=2.83.]
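As a cross-check, the same three points can be fed to an off-the-shelf solver. A sketch with scikit-learn (assumed tooling, not part of the slides) should reproduce w ≈ [2 2]ᵀ, b ≈ −1, ‖w‖ ≈ 2.83 and 3 support vectors:

import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([-1, 1, 1])

clf = SVC(kernel='linear', C=100.0).fit(X, y)
print(clf.coef_[0])                       # approx [2. 2.]
print(clf.intercept_[0])                  # approx -1.0
print(np.linalg.norm(clf.coef_[0]))       # approx 2.83
print(clf.n_support_.sum())               # 3 support vectors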


Linear SVMs (Example)

4-point linearly separable problem:

[Figure (left): Linear SVM, C=100, #SV=4, accuracy=100.00% (4 SVs).
Figure (right): Linear SVM, C=100, #SV=3, accuracy=100.00% (3 SVs).]


Linear SVMs (Non-linearly separable)

Non-linearly separable: patterns that cannot be separated by a linear decision boundary without incurring classification error.

[Figure: a 20-point data set in which some points cause classification errors for a linear SVM.]


Linear SVMs (Non-linearly separable)

We introduce a set of slack variables $\boldsymbol{\xi} = \{\xi_1, \xi_2, \ldots, \xi_N\}$ with $\xi_i \ge 0$ such that

$y_i\left(\langle \mathbf{w}, \mathbf{x}_i\rangle + b\right) \ge 1 - \xi_i \quad \forall i.$

The slack variables allow some data to violate the constraint defined for the linearly separable case (Eq. 6):

$y_i\left(\langle \mathbf{w}, \mathbf{x}_i\rangle + b\right) \ge 1 \quad \forall i.$

Therefore, for some $\xi_k > 0$ we have

$y_k\left(\langle \mathbf{w}, \mathbf{x}_k\rangle + b\right) < 1, \quad\text{e.g. } y_k\left(\langle \mathbf{w}, \mathbf{x}_k\rangle + b\right) = 0.5 \text{ with } \xi_k = 0.8.$


Linear SVMs (Non-linearly separable)

E.g. $\xi_{10} = \xi_{19} = 0.667$ because $\mathbf{x}_{10}$ and $\mathbf{x}_{19}$ are inside the margins, i.e. they violate the constraint (Eq. 6).

[Figure: Linear SVM, C=1000.0, #SV=7, acc=95.00%, normW=0.94; points x10 and x19 lie inside the margins.]


Linear SVMs (Non-linearly separable)

For non-separable cases:

Minimize: $\frac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_{i=1}^{N}\xi_i$

subject to $y_i\left(\langle \mathbf{w}, \mathbf{x}_i\rangle + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0,$

where C is a user-defined penalty parameter that penalizes any violation of the margins.

The Lagrangian becomes

$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_{i=1}^{N}\xi_i - \sum_{i=1}^{N}\alpha_i\left(y_i(\langle \mathbf{w}, \mathbf{x}_i\rangle + b) - 1 + \xi_i\right) - \sum_{i=1}^{N}\beta_i\xi_i,$

where the additional multipliers $\beta_i \ge 0$ enforce $\xi_i \ge 0$.


Linear SVMs (Non-linearly separable)

Wolfe dual optimization:

Maximize: $L(\boldsymbol{\alpha}) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j \langle \mathbf{x}_i, \mathbf{x}_j\rangle$

subject to $0 \le \alpha_i \le C,\; i = 1,\ldots,N,$ and $\sum_{i=1}^{N}\alpha_i y_i = 0.$

The output weight vector and bias term are

$\mathbf{w} = \sum_{i=1}^{N}\alpha_i y_i \mathbf{x}_i \quad\text{and}\quad b = \frac{1}{y_k} - \langle \mathbf{w}, \mathbf{x}_k\rangle,$

where $\mathbf{x}_k$ is a support vector.
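A short soft-margin sketch on overlapping data (synthetic, not the 20-point set from the figures): the slack of every training point can be recovered as ξ_i = max(0, 1 − y_i f(x_i)), which makes the effect of C easy to inspect.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.2, (30, 2)), rng.normal(2.5, 1.2, (30, 2))])
y = np.hstack([-np.ones(30), np.ones(30)])

clf = SVC(kernel='linear', C=10.0).fit(X, y)     # C penalizes margin violations

f = clf.decision_function(X)                     # f(x_i) = <w, x_i> + b
xi = np.maximum(0.0, 1.0 - y * f)                # slack variables xi_i
print('#SV =', clf.n_support_.sum())
print('#points with xi > 0 =', int((xi > 1e-6).sum()))
print('sum of slacks =', xi.sum())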

2. Linear SVMs (Types of SVs)

[Figure: Linear SVM, C=10.0, #SV=7, acc=95.00%, normW=0.94; the 20-point data set with its decision boundary and margins.]

Three types of support vectors:

1. On the margin:
$0 < \alpha_i < C, \quad \xi_i = 0, \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) = 1$
e.g. the two on-margin SVs ($\mathbf{x}_1$ and $\mathbf{x}_{11}$) have $\alpha = 0.44$ and $\alpha = 2.85$, both with $\xi = 0$.

2. Inside the margin:
$\alpha_i = C, \quad 0 < \xi_i < 2, \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) < 1$
e.g. $\alpha_{10} = C = 10,\; \xi_{10} = 0.667.$

3. Outside the margin:
$\alpha_i = C, \quad \xi_i > 2, \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) < -1$
e.g. $\alpha_{20} = C = 10,\; \xi_{20} = 2.667.$

(For comparison, a non-support vector such as $\mathbf{x}_{17}$ has $\alpha_{17} = 0$ and $\xi_{17} = 0$.)
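Given a fitted linear SVC, these three categories can be identified programmatically. A sketch (the thresholds follow the ξ ranges above; the tolerance and helper name are illustrative):

import numpy as np

def categorise_svs(clf, X, y, C, tol=1e-6):
    # alpha_i for every training point (zero for non-support vectors)
    alpha = np.zeros(len(X))
    alpha[clf.support_] = np.abs(clf.dual_coef_[0])
    f = clf.decision_function(X)                  # <w, x_i> + b
    xi = np.maximum(0.0, 1.0 - y * f)             # slack variables
    on_margin = (alpha > tol) & (alpha < C - tol)               # 0 < alpha < C, xi = 0
    inside    = np.isclose(alpha, C) & (xi > tol) & (xi < 2.0)  # alpha = C, 0 < xi < 2
    outside   = np.isclose(alpha, C) & (xi >= 2.0)              # alpha = C, xi > 2
    return on_margin, inside, outside

Applied to the soft-margin example above, categorise_svs(clf, X, y, C=10.0) returns three boolean masks over the training points.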


2. Linear SVMs (Types of SVs)

[Figure: Linear SVM, C=10.0, #SV=7, acc=95.00%, normW=0.94, with the value of $\mathbf{w}^T\mathbf{x} + b$ annotated at several points ($\pm 1$ on the margins, $\pm 0.33$ for the points inside the margin, $-1.67$ for the outlier $\mathbf{x}_{20}$).]

For a support vector on the margin with $y_i = +1$ ($\alpha_i > 0$, $\xi_i = 0$):
$y_i(\mathbf{w}^T\mathbf{x}_i + b) = 1 \;\Longrightarrow\; \mathbf{w}^T\mathbf{x}_i + b = +1.$

For a support vector on the margin with $y_i = -1$ ($\alpha_i > 0$, $\xi_i = 0$):
$y_i(\mathbf{w}^T\mathbf{x}_i + b) = 1 \;\Longrightarrow\; \mathbf{w}^T\mathbf{x}_i + b = -1.$

For the outlier $\mathbf{x}_{20}$ ($y_{20} = +1$, $\alpha_{20} = C$, $\xi_{20} = 2.67$):
$y_{20}(\mathbf{w}^T\mathbf{x}_{20} + b) = 1 - \xi_{20} = 1 - 2.67 = -1.67 \;\Longrightarrow\; \mathbf{w}^T\mathbf{x}_{20} + b = -1.67.$


2. Linear SVMs (Types of SVs)

The same analysis after swapping Class 1 and Class 2: the annotated values of $\mathbf{w}^T\mathbf{x} + b$ change sign, e.g. the on-margin SVs now give $\mathbf{w}^T\mathbf{x}_i + b = \mp 1$ and the outlier $\mathbf{x}_{20}$ gives $\mathbf{w}^T\mathbf{x}_{20} + b = +1.67$.

[Figure: the same 20-point data set with the class labels swapped; Linear SVM, C=10.0, #SV=7, acc=95.00%, normW=0.94.]


2. Linear SVMs (Types of SVs)

Effect of varying C:

[Figure (left): Linear SVM, C=0.1, #SV=10, acc=95.00%, normW=0.57, with $\sum_i \xi_i = 4.0$.
Figure (right): Linear SVM, C=100.0, #SV=7, acc=95.00%, normW=0.94, with $\sum_i \xi_i = 2.5$.]

A larger C penalizes margin violations more heavily, giving a larger $\lVert\mathbf{w}\rVert$ (narrower margin), fewer support vectors, and a smaller total slack.


3. Non-linear SVMs

In case the training data X are not linearly separable, we may use a kernel function to map the data from the input space to a feature space in which the data become linearly separable.

[Figure: a non-linear decision boundary in the input space (domain X) becomes a linear decision boundary in the feature space induced by the kernel $K(\mathbf{x}, \mathbf{x}_i)$.]


3. Non-linear SVMs (Cont.)

The decision function becomes

$f(\mathbf{x}) = \operatorname{sgn}\left(\sum_{i=1}^{N}\alpha_i y_i K(\mathbf{x}, \mathbf{x}_i) + b\right).$



3. Non-linear SVMs (Cont.)

The decision function becomes

$f(\mathbf{x}) = \operatorname{sgn}\left(\sum_{i=1}^{N}\alpha_i y_i K(\mathbf{x}, \mathbf{x}_i) + b\right).$

For RBF kernels:

$K(\mathbf{x}, \mathbf{x}_i) = \exp\left(-\frac{\lVert\mathbf{x} - \mathbf{x}_i\rVert^2}{2\sigma^2}\right).$

For polynomial kernels:

$K(\mathbf{x}, \mathbf{x}_i) = \left(\langle \mathbf{x}, \mathbf{x}_i\rangle + 1\right)^p, \quad p > 0.$
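Both kernels translate directly into code. A minimal NumPy sketch of the two formulas (σ and p are the kernel parameters above; the test vectors are illustrative):

import numpy as np

def rbf_kernel(x, xi, sigma=2.0):
    # K(x, x_i) = exp(-||x - x_i||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))

def poly_kernel(x, xi, p=2):
    # K(x, x_i) = (<x, x_i> + 1)^p, p > 0
    return (np.dot(x, xi) + 1.0) ** p

x  = np.array([1.0, 2.0])
xi = np.array([0.5, 1.0])
print(rbf_kernel(x, xi), poly_kernel(x, xi))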


3. Non-linear SVMs (Cont.)

The decision function becomes

$f(\mathbf{x}) = \operatorname{sgn}\left(\sum_{i=1}^{N}\alpha_i y_i K(\mathbf{x}, \mathbf{x}_i) + b\right).$

The optimization problem becomes:

Maximize: $W(\boldsymbol{\alpha}) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$    (10)

subject to $\alpha_i \ge 0,\; i = 1,\ldots,N,$ and $\sum_{i=1}^{N}\alpha_i y_i = 0.$
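An end-to-end sketch of non-linear SVMs on data that no linear boundary can separate (an XOR-like toy set assumed for illustration; scikit-learn's gamma corresponds to 1/(2σ²) in the RBF kernel above):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.uniform(-1.0, 1.0, (200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)       # XOR-like labels: not linearly separable

rbf = SVC(kernel='rbf', gamma=0.5, C=10.0).fit(X, y)
poly = SVC(kernel='poly', degree=2, coef0=1.0, C=10.0).fit(X, y)

print('RBF  train acc:', rbf.score(X, y), ' #SV:', rbf.n_support_.sum())
print('Poly train acc:', poly.score(X, y), ' #SV:', poly.n_support_.sum())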


3. Non-linear SVMs (Cont.)

The effect of varying C on RBF-SVMs:

[Figure (left): RBF SVM, 2*sigma2=8.0, C=10.0, #SV=9, acc=90.00%, with $\sum_i \xi_i = 3.09$.
Figure (right): RBF SVM, 2*sigma2=8.0, C=1000.0, #SV=7, acc=100.00%, with $\sum_i \xi_i = 0.0$.]


3. Non-linear SVMs (Cont.)

The effect of varying C on polynomial SVMs:

[Figure (left): Polynomial SVM, degree=2, C=10.0, #SV=7, acc=90.00%, with $\sum_i \xi_i = 2.99$.
Figure (right): Polynomial SVM, degree=2, C=1000.0, #SV=8, acc=90.00%, with $\sum_i \xi_i = 2.97$.]
