An Introduction to Support Vector Machine

Page 1:

An Introduction to Support Vector Machine

Page 2:

Support Vector Machine (SVM)

- A classifier derived from statistical learning theory by Vapnik et al. in 1992

- SVM became famous when, using images as input, it achieved accuracy comparable to neural networks with hand-designed features on a handwriting recognition task

- Currently, SVM is widely used in object detection and recognition, content-based image retrieval, text recognition, biometrics, speech recognition, etc.

- Also used for regression

Page 3:

Outline

- Linear Discriminant Function
- Large Margin Linear Classifier
- Nonlinear SVM: The Kernel Trick

Page 4:

Linear Discriminant Function

- g(x) is a linear function:

  g(x) = w^T x + b

- A hyperplane in the feature space: w^T x + b = 0

- (Unit-length) normal vector of the hyperplane:

  n = w / ||w||

[Figure: the hyperplane w^T x + b = 0 in the (x1, x2) plane, with w^T x + b > 0 on one side, w^T x + b < 0 on the other, and normal vector n]
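
As a concrete illustration (not from the slides), here is a minimal Python sketch of this decision rule; the values of w and b are made up:

```python
import numpy as np

# Minimal sketch of the linear discriminant; w and b are made-up values.
w = np.array([2.0, -1.0])
b = 0.5

def g(x):
    """g(x) = w^T x + b"""
    return w @ x + b

x = np.array([1.0, 3.0])
print(g(x))                      # -0.5: x lies on the w^T x + b < 0 side
label = 1 if g(x) > 0 else -1    # classify by the sign of g(x)
print(label)                     # -1
```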

Page 5:

Linear Discriminant Function

- How would you classify these points using a linear discriminant function in order to minimize the error rate?

- Infinite number of answers!

[Figure: +1 and -1 points in the (x1, x2) plane with one candidate separating line]

Pages 6-7: (the same slide repeated with different candidate separating lines)

Page 8:

Linear Discriminant Function

- How would you classify these points using a linear discriminant function in order to minimize the error rate?

- Infinite number of answers!

- Which one is the best?

[Figure: +1 and -1 points in the (x1, x2) plane with several candidate separating lines]

Page 9:

Large Margin Linear Classifier

- The linear discriminant function (classifier) with the maximum margin is the best

- Margin is defined as the width that the boundary could be increased by before hitting a data point

- Why is it the best?
  - Robust to outliers, and thus strong generalization ability
  - Good according to PAC (Probably Approximately Correct) theory

[Figure: +1 and -1 points in the (x1, x2) plane, the maximum-margin boundary, the margin, and the "safe zone" between the classes]

Page 10:

Maximum Margin Classification

- Distance from point x_i to the hyperplane is:

  r = (w^T x_i + b) / ||w||

- Examples closest to the hyperplane are support vectors.

- Margin M of the classifier is the distance between support vectors on both sides.

- Only support vectors matter; other training points are ignorable.

[Figure: support vectors x+ and x- on the margin boundaries, the normal vector n, and the margin width M in the (x1, x2) plane]
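
A quick numeric sketch of this distance formula (values are illustrative):

```python
import numpy as np

# Sketch of r = (w^T x_i + b) / ||w|| with illustrative values.
w = np.array([3.0, 4.0])    # ||w|| = 5
b = -5.0
x_i = np.array([2.0, 1.0])

r = (w @ x_i + b) / np.linalg.norm(w)
print(r)   # (6 + 4 - 5) / 5 = 1.0
```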

Page 11:

Large Margin Linear Classifier

- Given a set of data points {(x_i, y_i)}, i = 1, 2, ..., n, where:

  w^T x_i + b ≥ M/2    if y_i = +1
  w^T x_i + b ≤ -M/2   if y_i = -1

- With a scale transformation on both w and b, the above is equivalent to:

  w^T x_i + b ≥ 1    for y_i = +1
  w^T x_i + b ≤ -1   for y_i = -1

[Figure: +1 and -1 points in the (x1, x2) plane separated by a boundary of margin M]

Page 12:

Large Margin Linear Classifier

- We know that:

  w^T x+ + b = 1
  w^T x- + b = -1

  Thus w^T (x+ - x-) = 2

- The margin width is:

  M = (x+ - x-) · n
    = (x+ - x-) · w / ||w||
    = 2 / ||w||

[Figure: support vectors x+ and x- on the margin boundaries, the unit normal n, and the margin width M]
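
The identity M = 2/||w|| is easy to check numerically; a small sketch with a made-up pair of support vectors:

```python
import numpy as np

# Check M = 2 / ||w|| on a made-up pair of support vectors.
# Take x+ = (2, 0) and x- = (0, 0); the hyperplane x1 = 1 gives w = (1, 0), b = -1.
w = np.array([1.0, 0.0])
b = -1.0
x_plus, x_minus = np.array([2.0, 0.0]), np.array([0.0, 0.0])

assert w @ x_plus + b == 1 and w @ x_minus + b == -1   # canonical constraints hold
n = w / np.linalg.norm(w)                              # unit normal of the hyperplane
M = (x_plus - x_minus) @ n                             # projection of x+ - x- onto n
print(M, 2 / np.linalg.norm(w))                        # both print 2.0
```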

Page 13:

Large Margin Linear Classifier

- Formulation:

  maximize 2 / ||w||

  such that:

  w^T x_i + b ≥ 1    for y_i = +1
  w^T x_i + b ≤ -1   for y_i = -1

[Figure: +1 and -1 points, support vectors x+ and x-, and the margin in the (x1, x2) plane]

Page 14:

Large Margin Linear Classifier

- Formulation:

  minimize (1/2) ||w||^2

  such that:

  w^T x_i + b ≥ 1    for y_i = +1
  w^T x_i + b ≤ -1   for y_i = -1

[Figure: +1 and -1 points, support vectors x+ and x-, and the margin in the (x1, x2) plane]

Page 15:

Large Margin Linear Classifier

- Formulation:

  minimize (1/2) ||w||^2

  such that:

  y_i (w^T x_i + b) ≥ 1

[Figure: +1 and -1 points, support vectors x+ and x-, and the margin in the (x1, x2) plane]

Page 16:

Solving the Optimization Problem

  minimize (1/2) ||w||^2

  s.t. y_i (w^T x_i + b) ≥ 1

- Quadratic programming with linear constraints

- Lagrangian Function:

  minimize L_p(w, b, α) = (1/2) ||w||^2 - Σ_{i=1}^n α_i [ y_i (w^T x_i + b) - 1 ]

  s.t. α_i ≥ 0
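
Because this is a standard QP, a generic convex solver can handle small instances directly. A minimal sketch, assuming the cvxpy package and a made-up toy dataset:

```python
import cvxpy as cp
import numpy as np

# Toy, linearly separable data (illustrative values only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# Primal hard-margin SVM: minimize (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
problem.solve()

print(w.value, b.value)
```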

Page 17:

Solving the Optimization Problem

  minimize L_p(w, b, α) = (1/2) ||w||^2 - Σ_{i=1}^n α_i [ y_i (w^T x_i + b) - 1 ]

  s.t. α_i ≥ 0

Setting the partial derivatives to zero:

  ∂L_p/∂w = 0  ⟹  w = Σ_{i=1}^n α_i y_i x_i

  ∂L_p/∂b = 0  ⟹  Σ_{i=1}^n α_i y_i = 0

Page 18:

Solving the Optimization Problem

  minimize L_p(w, b, α) = (1/2) ||w||^2 - Σ_{i=1}^n α_i [ y_i (w^T x_i + b) - 1 ]

  s.t. α_i ≥ 0

Lagrangian Dual Problem:

  maximize Σ_{i=1}^n α_i - (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i^T x_j

  s.t. α_i ≥ 0, and Σ_{i=1}^n α_i y_i = 0

Page 19:

Solving the Optimization Problem

- The solution has the form:

  w = Σ_{i=1}^n α_i y_i x_i = Σ_{i∈SV} α_i y_i x_i

  get b from y_k (w^T x_k + b) - 1 = 0, where x_k is any support vector

- From the KKT (Karush-Kuhn-Tucker) conditions, we know:

  α_i [ y_i (w^T x_i + b) - 1 ] = 0

- Thus, only support vectors have α_i ≠ 0

  Thus, b = y_k - Σ_i α_i y_i x_i^T x_k for any k with α_k > 0

[Figure: support vectors x+ and x- in the (x1, x2) plane]
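
A numpy sketch of this recovery step, assuming the multipliers α have already been found by a QP solver (all values below are illustrative and self-consistent):

```python
import numpy as np

# Recover w and b from the dual solution; alpha is assumed to come from a QP solver.
X = np.array([[2.0, 2.0], [-1.0, -1.0]])   # toy support vectors
y = np.array([1.0, -1.0])
alpha = np.array([1.0 / 9.0, 1.0 / 9.0])   # illustrative multipliers, sum(alpha*y) = 0

w = (alpha * y) @ X                         # w = sum_i alpha_i y_i x_i

k = 0                                       # any index with alpha_k > 0
b = y[k] - w @ X[k]                         # from y_k (w^T x_k + b) = 1

print(w, b)                                 # [0.333... 0.333...] -0.333...
# Sanity check: both constraints are tight, as the KKT condition requires.
print(y * (X @ w + b))                      # [1. 1.]
```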

Page 20:

Solving the Optimization Problem

- The linear discriminant function is:

  g(x) = w^T x + b = Σ_{i∈SV} α_i y_i x_i^T x + b

- Notice it relies on a dot product between the test point x and the support vectors x_i

- Also keep in mind that solving the optimization problem involved computing the dot products x_i^T x_j between all pairs of training points

That is, no need to compute w explicitly for classification.
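
Continuing the toy values from the previous sketch, classification can be written entirely in terms of dot products with the support vectors:

```python
import numpy as np

# Evaluate g(x) from support vectors only, never forming w explicitly.
X_sv = np.array([[2.0, 2.0], [-1.0, -1.0]])       # toy support vectors (as above)
y_sv = np.array([1.0, -1.0])
alpha_sv = np.array([1.0 / 9.0, 1.0 / 9.0])
b = -1.0 / 3.0

def g(x):
    """g(x) = sum_{i in SV} alpha_i y_i (x_i^T x) + b"""
    return np.sum(alpha_sv * y_sv * (X_sv @ x)) + b

print(g(np.array([3.0, 3.0])))   # positive -> classified +1
```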

Page 21:

Large Margin Linear Classifier

n  What if data is not linear separable? (noisy data, outliers, etc.)

n  Slack variables ξi can be added to allow mis-classification of difficult or noisy data points

x1

x2

denotes +1

denotes -1

1ξ2ξ

Page 22:

Large Margin Linear Classifier

- Formulation:

  minimize (1/2) ||w||^2 + C Σ_{i=1}^n ξ_i

  such that:

  y_i (w^T x_i + b) ≥ 1 - ξ_i
  ξ_i ≥ 0

- Parameter C can be viewed as a way to control over-fitting: it "trades off" the relative importance of maximizing the margin and fitting the training data.

- For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points.
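
One way to see this trade-off empirically is the sketch below, assuming scikit-learn is available; the dataset and the outlier are synthetic:

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-class data with one injected outlier.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.array([1] * 20 + [-1] * 20)
X[0] = [-2.0, -2.0]   # an outlier deep inside the -1 cluster, labeled +1

for C in (0.01, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # Margin width is 2/||w||: larger C -> larger ||w|| -> smaller margin,
    # as the optimizer tries harder to classify the outlier correctly.
    print(C, 2 / np.linalg.norm(clf.coef_))
```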

Page 23:

Solving the Optimization Problem

- Formulation: (Lagrangian Dual Problem)

  maximize Σ_{i=1}^n α_i - (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i^T x_j

  such that:

  0 ≤ α_i ≤ C
  Σ_{i=1}^n α_i y_i = 0

Page 24:

Solving the Optimization Problem

- Again, x_i with non-zero α_i will be support vectors.

- Solution to the dual problem is:

  w = Σ_{i=1}^n α_i y_i x_i = Σ_{i∈SV} α_i y_i x_i

  b = y_k (1 - ξ_k) - Σ_i α_i y_i x_i^T x_k    for any k s.t. α_k > 0

- Again, we don't need to compute w explicitly for classification:

  g(x) = w^T x + b = Σ_{i∈SV} α_i y_i x_i^T x + b

Page 25:

Non-linear SVMs

- Datasets that are linearly separable with noise work out great:

- But what are we going to do if the dataset is just too hard?

- How about... mapping data to a higher-dimensional space:

[Figure: three 1-D panels along the axis x: a separable dataset, a hard dataset, and the hard dataset mapped to (x, x^2), where it becomes linearly separable]

Page 26:

Non-linear SVMs: Feature Space

- General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable:

  Φ: x → φ(x)

Page 27:

Nonlinear SVMs: The Kernel Trick

- With this mapping, our discriminant function is now:

  g(x) = w^T φ(x) + b = Σ_{i∈SV} α_i y_i φ(x_i)^T φ(x) + b

- No need to know this mapping explicitly, because we only use the dot product of feature vectors in both training and testing.

- A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space:

  K(x_i, x_j) ≡ φ(x_i)^T φ(x_j)

Page 28:

Nonlinear SVMs: The Kernel Trick

- An example:

  2-dimensional vectors x = [x1 x2]; let K(x_i, x_j) = (1 + x_i^T x_j)^2.

  Need to show that K(x_i, x_j) = φ(x_i)^T φ(x_j):

  K(x_i, x_j) = (1 + x_i^T x_j)^2
              = 1 + x_i1^2 x_j1^2 + 2 x_i1 x_j1 x_i2 x_j2 + x_i2^2 x_j2^2 + 2 x_i1 x_j1 + 2 x_i2 x_j2
              = [1, x_i1^2, √2 x_i1 x_i2, x_i2^2, √2 x_i1, √2 x_i2]^T [1, x_j1^2, √2 x_j1 x_j2, x_j2^2, √2 x_j1, √2 x_j2]
              = φ(x_i)^T φ(x_j),  where φ(x) = [1, x1^2, √2 x1 x2, x2^2, √2 x1, √2 x2]

This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt
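
This algebra is easy to verify numerically; a small sketch (test vectors are arbitrary):

```python
import numpy as np

# Verify K(xi, xj) = (1 + xi^T xj)^2 = phi(xi)^T phi(xj) for the phi defined above.
def phi(x):
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

xi = np.array([0.7, -1.2])   # arbitrary 2-D test vectors
xj = np.array([2.0, 0.5])

lhs = (1 + xi @ xj) ** 2
rhs = phi(xi) @ phi(xj)
print(lhs, rhs)              # equal up to floating-point rounding
```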

Page 29:

Nonlinear SVMs: The Kernel Trick

- Examples of commonly-used kernel functions:

  - Linear kernel:  K(x_i, x_j) = x_i^T x_j

  - Polynomial kernel:  K(x_i, x_j) = (1 + x_i^T x_j)^p

  - Gaussian (Radial Basis Function, RBF) kernel:  K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2))

  - Sigmoid:  K(x_i, x_j) = tanh(β_0 x_i^T x_j + β_1)

Mercer's theorem: Every positive semi-definite symmetric function is a kernel.
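
For reference, these four kernels are a few lines each in Python; the parameter names p, sigma, beta0, beta1 follow the formulas above, and the defaults are illustrative:

```python
import numpy as np

# Sketches of the kernels listed above; parameter defaults are illustrative.
def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, p=2):
    return (1 + xi @ xj) ** p

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(xi, xj, beta0=1.0, beta1=0.0):
    return np.tanh(beta0 * (xi @ xj) + beta1)
```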

Page 30:

Nonlinear SVM: Optimization

- Formulation: (Lagrangian Dual Problem)

  maximize Σ_{i=1}^n α_i - (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j K(x_i, x_j)

  such that:

  0 ≤ α_i ≤ C
  Σ_{i=1}^n α_i y_i = 0

- The solution of the discriminant function is:

  g(x) = Σ_{i∈SV} α_i y_i K(x_i, x) + b

- The optimization technique is the same.
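
A sketch of the kernelized discriminant, reusing the rbf_kernel above; the support vectors, multipliers, and b are illustrative stand-ins for a solver's output:

```python
import numpy as np

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

def g(x, X_sv, y_sv, alpha_sv, b, kernel=rbf_kernel):
    """g(x) = sum_{i in SV} alpha_i y_i K(x_i, x) + b"""
    return sum(a * y * kernel(x_i, x)
               for a, y, x_i in zip(alpha_sv, y_sv, X_sv)) + b

# Illustrative stand-ins for what a QP solver would return:
X_sv = np.array([[0.0, 0.0], [1.0, 1.0]])
y_sv = np.array([1.0, -1.0])
alpha_sv = np.array([0.5, 0.5])
print(g(np.array([0.2, 0.1]), X_sv, y_sv, alpha_sv, b=0.0))
```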

Page 31:

Support Vector Machine: Algorithm

- 1. Choose a kernel function

- 2. Choose a value for C

- 3. Solve the quadratic programming problem (many software packages available)

- 4. Construct the discriminant function from the support vectors
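
As a concrete end-to-end sketch of these four steps, assuming scikit-learn (whose SVC class wraps LibSVM) and a synthetic dataset:

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-class data (illustrative).
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]])
y = np.array([1] * 50 + [-1] * 50)

clf = SVC(kernel='rbf',   # step 1: choose a kernel function
          C=1.0)          # step 2: choose a value for C
clf.fit(X, y)             # step 3: solve the QP (SVC wraps LibSVM)

# Step 4: the discriminant function is built from the support vectors.
print(clf.support_vectors_.shape)
print(clf.decision_function(X[:3]))   # g(x) for the first three points
```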

Page 32:

Some Issues

- Choice of kernel
  - Gaussian or polynomial kernel is the default
  - if ineffective, more elaborate kernels are needed
  - domain experts can give assistance in formulating appropriate similarity measures

- Choice of kernel parameters
  - e.g. σ in the Gaussian kernel
  - σ is the distance between closest points with different classifications
  - In the absence of reliable criteria, applications rely on the use of a validation set or cross-validation to set such parameters.

- Optimization criterion: hard margin vs. soft margin
  - a lengthy series of experiments in which various parameters are tested

This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt

Page 33:

Summary: Support Vector Machine

- 1. Large Margin Classifier
  - Better generalization ability & less over-fitting

- 2. The Kernel Trick
  - Map data points to a higher-dimensional space in order to make them linearly separable.
  - Since only the dot product is used, we do not need to represent the mapping explicitly.

Page 34:

Demo of LibSVM

http://www.csie.ntu.edu.tw/~cjlin/libsvm/

Page 35:

References on SVM and Stock Prediction

http://www.svms.org/finance/HuangNakamoriWang2005.pdf
http://cs229.stanford.edu/proj2012/ShenJiangZhang-StockMarketForecastingusingMachineLearningAlgorithms.pdf
http://research.ijcaonline.org/volume41/number3/pxc3877555.pdf
and other references online ...