
Support Vector Machines
M.W. Mak

1. Introduction to SVMs
2. Linear SVMs
3. Non-linear SVMs

References:

1. S.Y. Kung, M.W. Mak, and S.H. Lin. Biometric Authentication: A Machine Learning Approach, Prentice Hall, to appear.

2. S.R. Gunn, 1998. Support Vector Machines for Classification and Regression. (http://www.isis.ecs.soton.ac.uk/resources/svminfo/)

3. Bernhard Schölkopf. Statistical learning and kernel methods. MSR-TR 2000-23, Microsoft Research, 2000. (ftp://ftp.research.microsoft.com/pub/tr/tr-2000-23.pdf)

4. For more resources on support vector machines, see http://www.kernel-machines.org/


Introduction

SVMs were developed by Vapnik in 1995 and have become popular due to their attractive theoretical properties and promising performance.

Conventional neural networks are based on empirical risk minimization: the network weights are determined by minimizing the mean squared error between the actual outputs and the desired outputs.

SVMs are based on the structural risk minimization principle: the parameters are optimized by minimizing a bound on the generalization error (equivalently, by maximizing the margin) rather than the training error alone.

SVMs have been shown to possess better generalization capability than conventional neural networks.


Introduction (Cont.)

Given N labeled empirical data:

$(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N) \in X \times \{-1, +1\},$    (1)

where $X \subset \mathbb{R}^D$ is the set of input data and the $y_i$ are the labels.

[Figure: two classes ($y_i = +1$ and $y_i = -1$) in the domain $X$, with class means $\mathbf{c}_1$ and $\mathbf{c}_2$, plotted against $x_1$ and $x_2$.]


Introduction (Cont.)

We construct a simple classifier by computing the means of the two classes:

$\mathbf{c}_1 = \frac{1}{N_1}\sum_{i:\,y_i=+1}\mathbf{x}_i \quad\text{and}\quad \mathbf{c}_2 = \frac{1}{N_2}\sum_{i:\,y_i=-1}\mathbf{x}_i,$

where $N_1$ and $N_2$ are the numbers of data points in the class with positive and negative labels, respectively.

We assign a new point $\mathbf{x}$ to the class whose mean is closer to it.

To achieve this, we compute

$\mathbf{c} = \tfrac{1}{2}(\mathbf{c}_1 + \mathbf{c}_2).$    (2)


Introduction (Cont.)

Then, we determine the class of $\mathbf{x}$ by checking whether the vector connecting $\mathbf{x}$ and $\mathbf{c}$ encloses an angle smaller than $\pi/2$ with the vector $\mathbf{w} = \mathbf{c}_1 - \mathbf{c}_2$:

$y = \operatorname{sgn}\langle \mathbf{x}-\mathbf{c}, \mathbf{w}\rangle
   = \operatorname{sgn}\langle \mathbf{x}-\tfrac{1}{2}(\mathbf{c}_1+\mathbf{c}_2),\; \mathbf{c}_1-\mathbf{c}_2\rangle
   = \operatorname{sgn}(\langle \mathbf{x}, \mathbf{w}\rangle + b),$

where $b = \tfrac{1}{2}\left(\lVert\mathbf{c}_2\rVert^2 - \lVert\mathbf{c}_1\rVert^2\right).$

[Figure: the two class means $\mathbf{c}_1$ and $\mathbf{c}_2$, their midpoint $\mathbf{c}$, and a test point $\mathbf{x}$ in the domain $X$.]
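To make the construction concrete, here is a minimal NumPy sketch of this nearest-mean classifier (the toy data are made up for illustration and are not part of the slides):

import numpy as np

def nearest_mean_classifier(X, y):
    # Class means c1 (positive class) and c2 (negative class)
    c1 = X[y == +1].mean(axis=0)
    c2 = X[y == -1].mean(axis=0)
    w = c1 - c2                                      # w = c1 - c2
    b = 0.5 * (np.dot(c2, c2) - np.dot(c1, c1))      # b = (||c2||^2 - ||c1||^2) / 2
    return lambda x: np.sign(np.dot(x, w) + b)       # y = sgn(<x, w> + b)

# Illustrative data: two clusters labelled -1 and +1
X = np.array([[0.0, 0.0], [1.0, 0.2], [2.0, 2.0], [2.5, 1.8]])
y = np.array([-1, -1, +1, +1])
f = nearest_mean_classifier(X, y)
print(f(np.array([0.5, 0.1])), f(np.array([2.2, 1.9])))   # -> -1.0  1.0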


Introduction (Cont.)

In the special case where $b = 0$, we have

$y = \operatorname{sgn}\left(\langle \mathbf{x}, \mathbf{c}_1\rangle - \langle \mathbf{x}, \mathbf{c}_2\rangle\right)
   = \operatorname{sgn}\left(\frac{1}{N_1}\sum_{i:\,y_i=+1}\langle \mathbf{x}, \mathbf{x}_i\rangle
   - \frac{1}{N_2}\sum_{i:\,y_i=-1}\langle \mathbf{x}, \mathbf{x}_i\rangle\right).$    (3)

This means that we use ALL data points $\mathbf{x}_i$, each weighted equally by $1/N_1$ or $1/N_2$, to define the decision plane.


Introduction (Cont.)

[Figure: the decision plane (perpendicular to $\mathbf{w}$) separating the $y_i = +1$ and $y_i = -1$ classes, with class means $\mathbf{c}_1$ and $\mathbf{c}_2$ and a test point $\mathbf{x}$ in the domain $X$.]


Introduction (Cont.)

However, we might want to remove the influence of patterns that are far away from the decision boundary, because their influence is usually small.

We may also select only a few important data points (called support vectors) and weight them differently.

Then, we have a support vector machine.


Introduction (Cont.)

[Figure: a decision plane with maximum margin; the support vectors lie on the margin boundaries separating the $y_i = +1$ and $y_i = -1$ classes.]

We aim to find a decision plane that maximizes the margin.


Linear SVMs

Assume that all training data satisfy the constraints:

$\langle \mathbf{w}, \mathbf{x}_i\rangle + b \ge +1 \quad\text{for } y_i = +1,$
$\langle \mathbf{w}, \mathbf{x}_i\rangle + b \le -1 \quad\text{for } y_i = -1,$    (4)

which means

$y_i\left(\langle \mathbf{w}, \mathbf{x}_i\rangle + b\right) - 1 \ge 0 \quad \forall i.$    (5)

Training data points for which the above equality holds lie on hyperplanes parallel to the decision plane.


Linear SVMs (Cont.)

The points closest to the decision plane $\langle\mathbf{w},\mathbf{x}\rangle + b = 0$ satisfy $\langle \mathbf{w}, \mathbf{x}_1\rangle + b = +1$ and $\langle \mathbf{w}, \mathbf{x}_2\rangle + b = -1$. Subtracting gives

$\langle \mathbf{w}, (\mathbf{x}_1 - \mathbf{x}_2)\rangle = 2
\;\Longrightarrow\;
\left\langle \frac{\mathbf{w}}{\lVert\mathbf{w}\rVert}, (\mathbf{x}_1 - \mathbf{x}_2)\right\rangle = \frac{2}{\lVert\mathbf{w}\rVert},$

so the margin is $d = 2/\lVert\mathbf{w}\rVert$.

Therefore, maximizing the margin is equivalent to minimizing $\lVert\mathbf{w}\rVert^2$.

[Figure: the decision plane $\langle\mathbf{w},\mathbf{x}\rangle + b = 0$ and the margin hyperplanes $\langle\mathbf{w},\mathbf{x}\rangle + b = \pm 1$, separated by the margin $d$.]


Linear SVMs (Lagrangian)

We minimize $\lVert\mathbf{w}\rVert^2$ subject to the constraint

$y_i\left(\langle \mathbf{w}, \mathbf{x}_i\rangle + b\right) - 1 \ge 0 \quad \forall i.$    (6)

This can be achieved by introducing Lagrange multipliers $\{\alpha_i \ge 0\}_{i=1}^{N}$ and a Lagrangian

$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\lVert\mathbf{w}\rVert^2 - \sum_{i=1}^{N}\alpha_i\left(y_i(\langle \mathbf{w}, \mathbf{x}_i\rangle + b) - 1\right).$    (7)

The Lagrangian has to be minimized with respect to $\mathbf{w}$ and $b$ and maximized with respect to $\alpha_i \ge 0$.


Linear SVMs (Lagrangian)

Setting

$\frac{\partial}{\partial b}L(\mathbf{w}, b, \boldsymbol{\alpha}) = 0 \quad\text{and}\quad \frac{\partial}{\partial \mathbf{w}}L(\mathbf{w}, b, \boldsymbol{\alpha}) = 0,$

we obtain

$\sum_{i=1}^{N}\alpha_i y_i = 0 \quad\text{and}\quad \mathbf{w} = \sum_{i=1}^{N}\alpha_i y_i \mathbf{x}_i.$    (8)

Patterns for which $\alpha_k > 0$ are called support vectors. These vectors lie on the margin and satisfy

$y_k\left(\langle \mathbf{w}, \mathbf{x}_k\rangle + b\right) - 1 = 0, \quad k \in S,$

where $S$ contains the indexes of the support vectors.

Patterns for which $\alpha_k = 0$ are considered irrelevant to the classification.


Linear SVMs (Wolfe Dual)

Substituting (8) into (7), we obtain the Wolfe dual:

Maximize: $L(\boldsymbol{\alpha}) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j \langle \mathbf{x}_i, \mathbf{x}_j\rangle$    (9)

subject to $\alpha_i \ge 0,\; i = 1,\ldots,N,$ and $\sum_{i=1}^{N}\alpha_i y_i = 0.$

The decision hyperplane is thus

$f(\mathbf{x}) = \operatorname{sgn}\left(\langle \mathbf{w}, \mathbf{x}\rangle + b\right) = \operatorname{sgn}\left(\sum_{i=1}^{N}\alpha_i y_i \langle \mathbf{x}, \mathbf{x}_i\rangle + b\right),$

where $b = \frac{1}{y_k} - \langle \mathbf{w}, \mathbf{x}_k\rangle$ and $\mathbf{x}_k$ is a support vector.
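In practice the dual in Eq. (9) is solved by a quadratic-programming routine rather than by hand. The sketch below uses scikit-learn's SVC with a linear kernel (an assumed, illustrative setup; the data are synthetic and a large C approximates the hard-margin case) and checks that the w recovered from Eq. (8) matches the fitted hyperplane:

import numpy as np
from sklearn.svm import SVC

# Illustrative, linearly separable data (not from the slides)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)), rng.normal(3.0, 0.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

clf = SVC(kernel='linear', C=1e3).fit(X, y)        # large C ~ hard margin

# Eq. (8): w = sum_k alpha_k y_k x_k, summed over the support vectors
alpha_y = clf.dual_coef_[0]                        # alpha_k * y_k
w = (alpha_y[:, None] * clf.support_vectors_).sum(axis=0)
b = clf.intercept_[0]
print(np.allclose(w, clf.coef_[0]))                # True: same hyperplane
print('#SV =', clf.n_support_.sum())

# Decision function f(x) = sgn(<w, x> + b)
x_new = np.array([[1.5, 1.5]])
print(np.sign(x_new @ w + b), clf.predict(x_new))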


Linear SVMs (Example)

Analytical example (3-point problem):

$\mathbf{x}_1 = [0.0 \;\; 0.0]^T, \quad y_1 = -1$
$\mathbf{x}_2 = [1.0 \;\; 0.0]^T, \quad y_2 = +1$
$\mathbf{x}_3 = [0.0 \;\; 1.0]^T, \quad y_3 = +1$

Objective function:

Maximize: $L(\boldsymbol{\alpha}) = \sum_{i=1}^{3}\alpha_i - \frac{1}{2}\sum_{i=1}^{3}\sum_{j=1}^{3}\alpha_i\alpha_j y_i y_j \langle \mathbf{x}_i, \mathbf{x}_j\rangle$

subject to $\alpha_i \ge 0,\; i = 1,\ldots,3,$ and $\sum_{i=1}^{3}\alpha_i y_i = 0.$


Linear SVMs (Example)

We introduce another Lagrange multiplier λ to obtain the Lagrangian

$F(\boldsymbol{\alpha}, \lambda) = \sum_{i=1}^{3}\alpha_i - \frac{1}{2}\left(\alpha_2^2 + \alpha_3^2\right) + \lambda\left(\alpha_2 + \alpha_3 - \alpha_1\right),$

where we have used $\langle\mathbf{x}_2,\mathbf{x}_2\rangle = \langle\mathbf{x}_3,\mathbf{x}_3\rangle = 1$ and the fact that all other inner products are zero.

Differentiating $F(\boldsymbol{\alpha}, \lambda)$ with respect to λ and $\alpha_i$ and setting the results to zero, we obtain

$\alpha_1 = 4, \quad \alpha_2 = 2, \quad \alpha_3 = 2, \quad \lambda = 1.$
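The same maximization can be checked numerically. Below is a small sketch (assuming SciPy is available; it is not part of the slides) that solves the three-point dual with SLSQP by minimizing the negative objective:

import numpy as np
from scipy.optimize import minimize

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([-1.0, 1.0, 1.0])
K = X @ X.T                                        # Gram matrix <x_i, x_j>

def neg_dual(a):
    # -L(alpha) = -(sum_i a_i - 0.5 * sum_ij a_i a_j y_i y_j <x_i, x_j>)
    return -(a.sum() - 0.5 * a @ (np.outer(y, y) * K) @ a)

constraints = {'type': 'eq', 'fun': lambda a: a @ y}   # sum_i alpha_i y_i = 0
result = minimize(neg_dual, x0=np.ones(3), method='SLSQP',
                  bounds=[(0.0, None)] * 3, constraints=[constraints])
print(np.round(result.x, 3))                       # approximately [4. 2. 2.]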


Linear SVMs (Example)

Substituting the Lagrange multipliers into Eq. (8):

$\mathbf{w} = \sum_{i=1}^{3}\alpha_i y_i \mathbf{x}_i = [2 \;\; 2]^T,$

$b = \frac{1}{y_1} - \langle \mathbf{w}, \mathbf{x}_1\rangle = -1.$

Decision boundary: $\langle \mathbf{w}, \mathbf{x}\rangle + b = 0$, i.e. $x_1 + x_2 = 0.5.$

[Figure: the three points, the decision boundary $x_1 + x_2 = 0.5$, and the margins. Caption: Linear SVM, C=100, #SV=3, acc=100.00%, normW=2.83.]
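As a cross-check, the same three points can be fed to an off-the-shelf solver. A sketch with scikit-learn (assumed tooling, not part of the slides) should reproduce w ≈ [2 2]ᵀ, b ≈ −1, ‖w‖ ≈ 2.83 and 3 support vectors:

import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([-1, 1, 1])

clf = SVC(kernel='linear', C=100.0).fit(X, y)
print(clf.coef_[0])                       # approx [2. 2.]
print(clf.intercept_[0])                  # approx -1.0
print(np.linalg.norm(clf.coef_[0]))       # approx 2.83
print(clf.n_support_.sum())               # 3 support vectors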


Linear SVMs (Example)

4-point linearly separable problem:

[Figure (left): Linear SVM, C=100, #SV=4, accuracy=100.00% (4 SVs).
Figure (right): Linear SVM, C=100, #SV=3, accuracy=100.00% (3 SVs).]


Linear SVMs (Non-linearly separable)

Non-linearly separable: patterns that cannot be separated by a linear decision boundary without incurring classification error.

[Figure: a 20-point data set in which some points cause classification errors for a linear SVM.]


Linear SVMs (Non-linearly separable)

We introduce a set of slack variables $\boldsymbol{\xi} = \{\xi_1, \xi_2, \ldots, \xi_N\}$ with $\xi_i \ge 0$ such that

$y_i\left(\langle \mathbf{w}, \mathbf{x}_i\rangle + b\right) \ge 1 - \xi_i \quad \forall i.$

The slack variables allow some data to violate the constraint defined for the linearly separable case (Eq. 6):

$y_i\left(\langle \mathbf{w}, \mathbf{x}_i\rangle + b\right) \ge 1 \quad \forall i.$

Therefore, for some $\xi_k > 0$ we have

$y_k\left(\langle \mathbf{w}, \mathbf{x}_k\rangle + b\right) < 1, \quad\text{e.g. } y_k\left(\langle \mathbf{w}, \mathbf{x}_k\rangle + b\right) = 0.5 \text{ with } \xi_k = 0.8.$


Linear SVMs (Non-linearly separable)

E.g. $\xi_{10} = \xi_{19} = 0.667$ because $\mathbf{x}_{10}$ and $\mathbf{x}_{19}$ are inside the margins, i.e. they violate the constraint (Eq. 6).

[Figure: Linear SVM, C=1000.0, #SV=7, acc=95.00%, normW=0.94; points x10 and x19 lie inside the margins.]


Linear SVMs (Non-linearly separable)

For non-separable cases:

Minimize: $\frac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_{i=1}^{N}\xi_i$

subject to $y_i\left(\langle \mathbf{w}, \mathbf{x}_i\rangle + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0,$

where C is a user-defined penalty parameter that penalizes any violation of the margins.

The Lagrangian becomes

$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_{i=1}^{N}\xi_i - \sum_{i=1}^{N}\alpha_i\left(y_i(\langle \mathbf{w}, \mathbf{x}_i\rangle + b) - 1 + \xi_i\right) - \sum_{i=1}^{N}\beta_i\xi_i,$

where the additional multipliers $\beta_i \ge 0$ enforce $\xi_i \ge 0$.


Linear SVMs (Non-linearly separable)

Wolfe dual optimization:

Maximize: $L(\boldsymbol{\alpha}) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j \langle \mathbf{x}_i, \mathbf{x}_j\rangle$

subject to $0 \le \alpha_i \le C,\; i = 1,\ldots,N,$ and $\sum_{i=1}^{N}\alpha_i y_i = 0.$

The output weight vector and bias term are

$\mathbf{w} = \sum_{i=1}^{N}\alpha_i y_i \mathbf{x}_i \quad\text{and}\quad b = \frac{1}{y_k} - \langle \mathbf{w}, \mathbf{x}_k\rangle,$

where $\mathbf{x}_k$ is a support vector.
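A short soft-margin sketch on overlapping data (synthetic, not the 20-point set from the figures): the slack of every training point can be recovered as ξ_i = max(0, 1 − y_i f(x_i)), which makes the effect of C easy to inspect.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.2, (30, 2)), rng.normal(2.5, 1.2, (30, 2))])
y = np.hstack([-np.ones(30), np.ones(30)])

clf = SVC(kernel='linear', C=10.0).fit(X, y)     # C penalizes margin violations

f = clf.decision_function(X)                     # f(x_i) = <w, x_i> + b
xi = np.maximum(0.0, 1.0 - y * f)                # slack variables xi_i
print('#SV =', clf.n_support_.sum())
print('#points with xi > 0 =', int((xi > 1e-6).sum()))
print('sum of slacks =', xi.sum())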

2. Linear SVMs (Types of SVs)

[Figure: Linear SVM, C=10.0, #SV=7, acc=95.00%, normW=0.94; the 20-point data set with its decision boundary and margins.]

Three types of support vectors:

1. On the margin:
$0 < \alpha_i < C, \quad \xi_i = 0, \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) = 1$
e.g. the two on-margin SVs ($\mathbf{x}_1$ and $\mathbf{x}_{11}$) have $\alpha = 0.44$ and $\alpha = 2.85$, both with $\xi = 0$.

2. Inside the margin:
$\alpha_i = C, \quad 0 < \xi_i < 2, \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) < 1$
e.g. $\alpha_{10} = C = 10,\; \xi_{10} = 0.667.$

3. Outside the margin:
$\alpha_i = C, \quad \xi_i > 2, \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) < -1$
e.g. $\alpha_{20} = C = 10,\; \xi_{20} = 2.667.$

(For comparison, a non-support vector such as $\mathbf{x}_{17}$ has $\alpha_{17} = 0$ and $\xi_{17} = 0$.)
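Given a fitted linear SVC, these three categories can be identified programmatically. A sketch (the thresholds follow the ξ ranges above; the tolerance and helper name are illustrative):

import numpy as np

def categorise_svs(clf, X, y, C, tol=1e-6):
    # alpha_i for every training point (zero for non-support vectors)
    alpha = np.zeros(len(X))
    alpha[clf.support_] = np.abs(clf.dual_coef_[0])
    f = clf.decision_function(X)                  # <w, x_i> + b
    xi = np.maximum(0.0, 1.0 - y * f)             # slack variables
    on_margin = (alpha > tol) & (alpha < C - tol)               # 0 < alpha < C, xi = 0
    inside    = np.isclose(alpha, C) & (xi > tol) & (xi < 2.0)  # alpha = C, 0 < xi < 2
    outside   = np.isclose(alpha, C) & (xi >= 2.0)              # alpha = C, xi > 2
    return on_margin, inside, outside

Applied to the soft-margin example above, categorise_svs(clf, X, y, C=10.0) returns three boolean masks over the training points.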


2. Linear SVMs (Types of SVs)

[Figure: Linear SVM, C=10.0, #SV=7, acc=95.00%, normW=0.94, with the value of $\mathbf{w}^T\mathbf{x} + b$ annotated at several points ($\pm 1$ on the margins, $\pm 0.33$ for the points inside the margin, $-1.67$ for the outlier $\mathbf{x}_{20}$).]

For a support vector on the margin with $y_i = +1$ ($\alpha_i > 0$, $\xi_i = 0$):
$y_i(\mathbf{w}^T\mathbf{x}_i + b) = 1 \;\Longrightarrow\; \mathbf{w}^T\mathbf{x}_i + b = +1.$

For a support vector on the margin with $y_i = -1$ ($\alpha_i > 0$, $\xi_i = 0$):
$y_i(\mathbf{w}^T\mathbf{x}_i + b) = 1 \;\Longrightarrow\; \mathbf{w}^T\mathbf{x}_i + b = -1.$

For the outlier $\mathbf{x}_{20}$ ($y_{20} = +1$, $\alpha_{20} = C$, $\xi_{20} = 2.67$):
$y_{20}(\mathbf{w}^T\mathbf{x}_{20} + b) = 1 - \xi_{20} = 1 - 2.67 = -1.67 \;\Longrightarrow\; \mathbf{w}^T\mathbf{x}_{20} + b = -1.67.$


2. Linear SVMs (Types of SVs)

The same analysis after swapping Class 1 and Class 2: the annotated values of $\mathbf{w}^T\mathbf{x} + b$ change sign, e.g. the on-margin SVs now give $\mathbf{w}^T\mathbf{x}_i + b = \mp 1$ and the outlier $\mathbf{x}_{20}$ gives $\mathbf{w}^T\mathbf{x}_{20} + b = +1.67$.

[Figure: the same 20-point data set with the class labels swapped; Linear SVM, C=10.0, #SV=7, acc=95.00%, normW=0.94.]


2. Linear SVMs (Types of SVs)

Effect of varying C:

[Figure (left): Linear SVM, C=0.1, #SV=10, acc=95.00%, normW=0.57, with $\sum_i \xi_i = 4.0$.
Figure (right): Linear SVM, C=100.0, #SV=7, acc=95.00%, normW=0.94, with $\sum_i \xi_i = 2.5$.]

A larger C penalizes margin violations more heavily, giving a larger $\lVert\mathbf{w}\rVert$ (narrower margin), fewer support vectors, and a smaller total slack.


3. Non-linear SVMs

In case the training data X are not linearly separable, we may use a kernel function to map the data from the input space to a feature space in which the data become linearly separable.

[Figure: a non-linear decision boundary in the input space (domain X) becomes a linear decision boundary in the feature space induced by the kernel $K(\mathbf{x}, \mathbf{x}_i)$.]


3. Non-linear SVMs (Cont.)

The decision function becomes

$f(\mathbf{x}) = \operatorname{sgn}\left(\sum_{i=1}^{N}\alpha_i y_i K(\mathbf{x}, \mathbf{x}_i) + b\right).$



3. Non-linear SVMs (Cont.)

The decision function becomes

$f(\mathbf{x}) = \operatorname{sgn}\left(\sum_{i=1}^{N}\alpha_i y_i K(\mathbf{x}, \mathbf{x}_i) + b\right).$

For RBF kernels:

$K(\mathbf{x}, \mathbf{x}_i) = \exp\left(-\frac{\lVert\mathbf{x} - \mathbf{x}_i\rVert^2}{2\sigma^2}\right).$

For polynomial kernels:

$K(\mathbf{x}, \mathbf{x}_i) = \left(\langle \mathbf{x}, \mathbf{x}_i\rangle + 1\right)^p, \quad p > 0.$
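Both kernels translate directly into code. A minimal NumPy sketch of the two formulas (σ and p are the kernel parameters above; the test vectors are illustrative):

import numpy as np

def rbf_kernel(x, xi, sigma=2.0):
    # K(x, x_i) = exp(-||x - x_i||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))

def poly_kernel(x, xi, p=2):
    # K(x, x_i) = (<x, x_i> + 1)^p, p > 0
    return (np.dot(x, xi) + 1.0) ** p

x  = np.array([1.0, 2.0])
xi = np.array([0.5, 1.0])
print(rbf_kernel(x, xi), poly_kernel(x, xi))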


3. Non-linear SVMs (Cont.)

The decision function becomes

$f(\mathbf{x}) = \operatorname{sgn}\left(\sum_{i=1}^{N}\alpha_i y_i K(\mathbf{x}, \mathbf{x}_i) + b\right).$

The optimization problem becomes:

Maximize: $W(\boldsymbol{\alpha}) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$    (10)

subject to $\alpha_i \ge 0,\; i = 1,\ldots,N,$ and $\sum_{i=1}^{N}\alpha_i y_i = 0.$
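An end-to-end sketch of non-linear SVMs on data that no linear boundary can separate (an XOR-like toy set assumed for illustration; scikit-learn's gamma corresponds to 1/(2σ²) in the RBF kernel above):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.uniform(-1.0, 1.0, (200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)       # XOR-like labels: not linearly separable

rbf = SVC(kernel='rbf', gamma=0.5, C=10.0).fit(X, y)
poly = SVC(kernel='poly', degree=2, coef0=1.0, C=10.0).fit(X, y)

print('RBF  train acc:', rbf.score(X, y), ' #SV:', rbf.n_support_.sum())
print('Poly train acc:', poly.score(X, y), ' #SV:', poly.n_support_.sum())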


3. Non-linear SVMs (Cont.)

The effect of varying C on RBF-SVMs:

[Figure (left): RBF SVM, 2*sigma2=8.0, C=10.0, #SV=9, acc=90.00%, with $\sum_i \xi_i = 3.09$.
Figure (right): RBF SVM, 2*sigma2=8.0, C=1000.0, #SV=7, acc=100.00%, with $\sum_i \xi_i = 0.0$.]


3. Non-linear SVMs (Cont.)

The effect of varying C on polynomial SVMs:

[Figure (left): Polynomial SVM, degree=2, C=10.0, #SV=7, acc=90.00%, with $\sum_i \xi_i = 2.99$.
Figure (right): Polynomial SVM, degree=2, C=1000.0, #SV=8, acc=90.00%, with $\sum_i \xi_i = 2.97$.]
