Ch 9: Support Vector Machines
This material is prepared following James et al. (2013) and the slides at
https://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/
9.0 Introduction
• The support vector machine is a generalization of a simple and intuitive classifier called
the maximal margin classifier.
• We discuss the support vector classifier, an extension of the maximal margin classifier
that can be applied in a broader range of cases.
• We further consider the support vector machine, which is a further extension of the
support vector classifier in order to accommodate non-linear class boundaries.
9.1 Maximal Margin Classifier
9.1.1 What Is a Hyperplane?
• A hyperplane in p dimensions is defined by
{X = (X1, . . . , Xp)T : f(X) = β0 + β1X1 + · · · + βpXp = 0}
• The mathematical definition of a hyperplane is quite simple. In two dimensions, a
hyperplane is defined by the equation
β0 + β1X1 + β2X2 = 0
for parameters β0, β1, β2. Any X = (X1, X2)T for which the above equation holds is
a point on the hyperplane. Note that the above equation is simply the equation of a
line, since indeed in two dimensions a hyperplane is a line.
• If f(X) = β0 + β1X1 + · · · + βpXp, then f(X) > 0 for points on one side of the
hyperplane, and f(X) < 0 for points on the other. Thus, one can determine on which
side of the hyperplane a point X lies simply by computing the sign of f(X).
[Figure: axes X1 and X2] The hyperplane 1 + 2X1 + 3X2 = 0 is shown. The blue region is the set of points for which 1 + 2X1 + 3X2 > 0, and the purple region is the set of points for which 1 + 2X1 + 3X2 < 0.
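As a quick arithmetic check of the sign rule, the short Python sketch below (with hypothetical points, not from the text) evaluates f(X) = 1 + 2X1 + 3X2 at a few points and reports which region of the figure each falls in.

    # Evaluate f(X) = 1 + 2*X1 + 3*X2 and use its sign to place each point.
    def f(x1, x2):
        return 1 + 2 * x1 + 3 * x2

    for x1, x2 in [(1.0, 1.0), (0.0, -1.0), (-1.0, 0.5)]:
        value = f(x1, x2)
        side = "blue region (f > 0)" if value > 0 else "purple region (f < 0)"
        print(f"({x1}, {x2}): f = {value:+.1f} -> {side}")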
9.1.2 Classification Using a Separating Hyperplane
• Suppose that we have an n × p data matrix X that consists of n training observations
in p–dimensional space and that these observations fall into two classes – that is,
y1, . . . , yn ∈ {−1, 1}.
• We also have a test observation, a p-vector of observed features x∗ = (x∗1, . . . , x∗p)T.
Our goal is to develop a classifier based on the training data that will correctly classify
the test observation using its feature measurements.
• We will now see a new approach that is based upon the concept of a separating hyper-
plane.
• Consider a hyperplane that separates the training observations perfectly according to
their class labels. Then a separating hyperplane has the property that
β0 + β1xi1 + · · ·+ βpxip > 0 if yi = 1
and
β0 + β1xi1 + · · ·+ βpxip < 0 if yi = −1.
Equivalently, a separating hyperplane has the property that
yi(β0 + β1xi1 + · · ·+ βpxip) > 0
for all i = 1, . . . , n. If a separating hyperplane exists, we can use it to construct a very
natural classifier: a test observation is assigned a class depending on which side of the
hyperplane it lies.
[Figure: two panels, axes X1 and X2]
9.1.3 The Maximal Margin Classifier
• Among all separating hyperplanes, find the one that makes the biggest gap or margin
between the two classes.
• That is, we can compute the (perpendicular) distance from each training observation
to a given separating hyperplane; the smallest such distance is the minimal distance
from the observations to the hyperplane, and is known as the margin.
• The maximal margin hyperplane is the separating hyperplane for which the margin is
largest.
[Figure: axes X1 and X2]
Three training observations are equidistant from the maximal margin hyperplane and lie along the
dashed lines indicating the width of the margin.
• These three observations are known as support vectors, since they are vectors in
p-dimensional space (in the figure, p = 2) and they “support” the maximal margin
hyperplane in the sense that if these points were moved slightly, the maximal margin
hyperplane would move as well.
9.1.4 Construction of the Maximal Margin Classifier
• The maximal margin hyperplane is the solution to the optimization problem
max_{β0, β1, . . . , βp, M} M
subject to ∑_{j=1}^p βj² = 1,
yi(β0 + β1xi1 + · · · + βpxip) ≥ M for all i = 1, . . . , n.
• The two constraints ensure that each observation is on the correct side of the hyperplane
and at least a distance M from the hyperplane. Hence, M represents the margin of
our hyperplane, and the optimization problem chooses β0, β1, . . . , βp to maximize M .
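Most software does not expose the maximal margin classifier directly, but it can be approximated by a linear support vector machine whose violation penalty is made very large. The sketch below is a minimal illustration with scikit-learn on a hypothetical, clearly separable toy data set (all data and parameter values are illustrative); note that scikit-learn's C penalizes violations, so a huge value mimics the hard margin.

    import numpy as np
    from sklearn.svm import SVC

    # Hypothetical linearly separable toy data: two well-separated clusters.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(loc=-3.0, size=(20, 2)),
                   rng.normal(loc=+3.0, size=(20, 2))])
    y = np.array([-1] * 20 + [1] * 20)

    # A very large penalty approximates the maximal margin (hard-margin) classifier.
    clf = SVC(kernel="linear", C=1e6).fit(X, y)

    print("(beta1, beta2):", clf.coef_[0])
    print("beta0:", clf.intercept_[0])
    print("support vectors:\n", clf.support_vectors_)  # the observations that define the margin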
[Figure: two panels, axes X1 and X2]
Need a better classifier?
9.2 Support Vector Classifier
• It could be worthwhile to misclassify a few training observations in order to do a better
job in classifying the remaining observations.
• The support vector classifier, sometimes called a soft margin classifier, does exactly
this.
• The solution is the following optimization:
max_{β0, β1, . . . , βp, ε1, . . . , εn, M} M
subject to ∑_{j=1}^p βj² = 1,
yi(β0 + β1xi1 + · · · + βpxip) ≥ M(1 − εi),
εi ≥ 0, ∑_{i=1}^n εi ≤ C,
where C is a nonnegative tuning parameter.
– ε1, . . . , εn are termed slack variables that allow individual observations to be on
the wrong side of the margin or the hyperplane: if εi = 0, the observation is on
the correct side of the margin; if 0 < εi ≤ 1, the observation is on the wrong side
of the margin; and if εi > 1, it is on the wrong side of the hyperplane.
– The tuning parameter C is selected by cross-validation (see the sketch after this
list). As one can see, C bounds the sum of the εi's, and so it determines the number
and severity of the violations to the margin (and to the hyperplane) that we will tolerate.
– If C = 0, then there is no budget for violations to the margin, and it must be the
case that ε1 = · · · = εn = 0.
– As C increases, we become more tolerant of violations to the margin, and so
the margin will widen. Conversely, as C decreases, we become less tolerant of
violations to the margin, and so the margin narrows.
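A minimal sketch of selecting the tuning parameter by cross-validation with scikit-learn, on a hypothetical overlapping two-class data set. One caveat: scikit-learn's C is a penalty on margin violations and so plays the opposite role of the budget C above; a large scikit-learn C tolerates few violations (narrow margin), while a small one tolerates many (wide margin).

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV

    # Hypothetical two-class data with overlap, so some margin violations are unavoidable.
    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(loc=-1.0, size=(50, 2)),
                   rng.normal(loc=+1.0, size=(50, 2))])
    y = np.array([-1] * 50 + [1] * 50)

    # Cross-validate the penalty parameter of a linear support vector classifier.
    search = GridSearchCV(SVC(kernel="linear"),
                          {"C": [0.001, 0.01, 0.1, 1, 10, 100]}, cv=5).fit(X, y)

    print("chosen penalty:", search.best_params_["C"])
    print("cross-validated accuracy:", round(search.best_score_, 3))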
[Figure: four panels, axes X1 and X2]
A support vector classifier was fit using four different values of the tuning parameter C. The largest
value of C was used in the top left panel, and smaller values were used in the top right, bottom left,
and bottom right panels. When C is large, there is a high tolerance for observations being on
the wrong side of the margin, and so the margin will be large. As C decreases, the tolerance for
observations being on the wrong side of the margin decreases, and the margin narrows.
[Figure: two panels, axes X1 and X2]
Left: The observations fall into two classes, with a non-linear boundary between them. Right: The
support vector classifier seeks a linear boundary, and consequently performs very poorly.
9.3 Support Vector Machines
9.3.1 Classification with Non-linear Decision Boundaries
• Enlarge the space of features by including transformations, e.g. X1², X1³, X1X2, X1X2², . . . .
Hence go from a p–dimensional space to an M > p dimensional space.
• Fit a support–vector classifier in the enlarged space. This results in non-linear decision
boundaries in the original space.
• Example: Suppose we use (X1, X2, X1², X2², X1X2, X1³, X2³, X1X2², X1²X2) instead of (X1, X2).
Then the decision boundary would be of the form
β0 + β1X1 + β2X2 + β3X1² + β4X2² + β5X1X2 + β6X1³ + β7X2³ + β8X1X2² + β9X1²X2 = 0
This leads to nonlinear decision boundaries in the original space.
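A sketch of this idea in scikit-learn, on a hypothetical data set whose true boundary is a circle: a support vector classifier on the original two features has a linear boundary and does poorly, while the same classifier does much better after the feature space is enlarged with degree-2 polynomial terms. The data set and parameter values are illustrative only.

    from sklearn.datasets import make_circles
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.svm import SVC

    # Hypothetical data with a circular (non-linear) class boundary.
    X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

    # Support vector classifier on the original two features: a linear boundary.
    linear = SVC(kernel="linear", C=1.0)

    # The same classifier after enlarging the feature space with degree-2 polynomial terms.
    expanded = make_pipeline(PolynomialFeatures(degree=2), SVC(kernel="linear", C=1.0))

    print("original features:", cross_val_score(linear, X, y, cv=5).mean())
    print("enlarged features:", cross_val_score(expanded, X, y, cv=5).mean())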
[Figure: two panels, axes X1 and X2]
9.3.2 The Support Vector Machine
• Polynomials (especially high-dimensional ones) get wild rather fast.
• There is a more elegant and controlled way to introduce nonlinearities in support–
vector classifiers – through the use of kernels.
• The support vector machine (SVM) is an extension of the support vector classifier that
results from enlarging the feature space in a specific way, using kernels.
• Before we discuss these, we must understand the role of inner products in support-
vector classifiers.
• The inner product of two observations xi, xi′ is given by
〈xi, xi′〉 = ∑_{j=1}^p xij xi′j.
• The linear support vector classifier can be represented as
f(x) = β0 + ∑_{i=1}^n αi〈x, xi〉.
• To estimate the parameters α1, . . . , αn and β0, all we need are the n(n − 1)/2 inner
products 〈xi, xi′〉 between all pairs of training observations.
• It turns out that most of the α̂i can be zero:
f(x) = β0 + ∑_{i∈S} α̂i〈x, xi〉,
where S is the support set of indices i such that α̂i > 0.
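A quick illustration of this sparsity with scikit-learn on a hypothetical overlapping two-class data set: only a fraction of the n training observations end up in the support set S, and only those enter the fitted decision function.

    import numpy as np
    from sklearn.svm import SVC

    # Hypothetical two-class data with moderate overlap.
    rng = np.random.default_rng(4)
    X = np.vstack([rng.normal(loc=-1.5, size=(100, 2)),
                   rng.normal(loc=+1.5, size=(100, 2))])
    y = np.array([-1] * 100 + [1] * 100)

    clf = SVC(kernel="linear", C=1.0).fit(X, y)

    # Indices of the observations with nonzero alpha-hat, i.e. the support set S.
    print("n =", len(y), "  |S| =", len(clf.support_))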
• We consider a generalization of the inner product of the form K(xi, xi′), where K is
some function that we will refer to as a kernel. A kernel is a function that quantifies
the similarity of two observations.
• The solution has the form
f(x) = β0 + ∑_{i∈S} α̂iK(x, xi).
• An example of a possible non-linear kernel is the polynomial kernel
K(xi, xi′) = (1 + ∑_{j=1}^p xij xi′j)^d,
where d is a positive integer. Another popular choice is the radial kernel, which takes
the form
K(xi, xi′) = exp(−γ ∑_{j=1}^p (xij − xi′j)²)
with a positive constant γ.
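The sketch below fits an SVM with the radial kernel in scikit-learn on hypothetical data and then rebuilds f(x) = β0 + ∑_{i∈S} α̂iK(x, xi) by hand from the fitted support vectors, to show that the decision function really is a kernel-weighted sum over the support set. This assumes scikit-learn's convention that dual_coef_ stores the coefficients α̂i (with the class labels folded in); all data and parameter values are illustrative.

    import numpy as np
    from sklearn.svm import SVC

    # Hypothetical data with a non-linear class boundary.
    rng = np.random.default_rng(5)
    X = rng.normal(size=(150, 2))
    y = np.where(X[:, 0]**2 + X[:, 1]**2 > 1.2, 1, -1)

    gamma = 0.7
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

    sv = clf.support_vectors_          # the x_i in the support set S
    alpha = clf.dual_coef_.ravel()     # fitted coefficients for each support vector
    beta0 = clf.intercept_[0]

    # Rebuild f(x_new) = beta0 + sum_i alpha_i * K(x_new, x_i) with the radial kernel.
    x_new = np.array([0.3, -0.8])
    K = np.exp(-gamma * ((sv - x_new) ** 2).sum(axis=1))
    f_manual = beta0 + (alpha * K).sum()

    print(np.isclose(f_manual, clf.decision_function(x_new.reshape(1, -1))[0]))  # True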
• Example: Heart Data
– Use 13 predictors such as Age, Sex, and Chol in order to predict whether an
individual has heart disease.
[Figure] ROC curves for the Heart data training set. Left: The support vector classifier and LDA are compared. Right: The support vector classifier is compared to an SVM using a radial basis kernel with γ = 10⁻³, 10⁻², and 10⁻¹.
[Figure] ROC curves for the test set of the Heart data. Left: The support vector classifier and LDA are compared. Right: The support vector classifier is compared to an SVM using a radial basis kernel with γ = 10⁻³, 10⁻², and 10⁻¹.
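The Heart data themselves are not reproduced here, but the comparison in these figures can be sketched with scikit-learn on any binary data set. The snippet below uses a hypothetical synthetic stand-in with 13 predictors and summarizes each classifier's ROC curve by its test-set AUC, using the decision-function values as the scores (sklearn.metrics.roc_curve would give the full curves).

    from sklearn.datasets import make_classification
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Hypothetical stand-in for the Heart data: 13 predictors, two classes.
    X, y = make_classification(n_samples=300, n_features=13, n_informative=6, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

    models = {"support vector classifier": SVC(kernel="linear")}
    for gamma in [1e-3, 1e-2, 1e-1]:
        models[f"SVM, gamma={gamma}"] = SVC(kernel="rbf", gamma=gamma)

    for name, model in models.items():
        scores = model.fit(X_train, y_train).decision_function(X_test)  # ROC scores
        print(f"{name}: test AUC = {roc_auc_score(y_test, scores):.3f}")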
9.4 SVMs with More than Two Classes
• The SVM as defined works for K = 2 classes. What do we do if we have K > 2 classes?
• One-Versus-One Classification (OVO): Fit all K(K − 1)/2 pairwise classifiers f̂kℓ(x). Classify
x∗ to the class that wins the most pairwise competitions.
• One-Versus-All Classification (OVA): Fit K different 2-class SVM classifiers f̂k(x),
k = 1, . . . , K; each class versus the rest. Classify x∗ to the class for which f̂k(x∗) is
largest.
• Which to choose? If K is not too large, use OVO.
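For reference, scikit-learn's SVC uses the one-versus-one construction for multiclass problems; the small sketch below (hypothetical three-class data) confirms that the pairwise decision values number K(K − 1)/2.

    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    # Hypothetical data with K = 3 classes.
    X, y = make_blobs(n_samples=150, centers=3, random_state=0)

    # SVC fits one-versus-one pairwise classifiers for multiclass problems.
    clf = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)

    # With K = 3 classes there are K(K - 1)/2 = 3 pairwise decision values per observation.
    print(clf.decision_function(X[:1]).shape)  # (1, 3)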
9.5 Relationship to Logistic Regression
• One can rewrite the criterion for fitting the support vector classifier
f(X) = β0 + β1X1 + · · · + βpXp as
min_{β0, β1, . . . , βp} { ∑_{i=1}^n max(0, 1 − yif(xi)) + λ ∑_{j=1}^p βj² }.
• The above form is like “Loss + Penalty”
min_{β0, β1, . . . , βp} {L(X, y, β) + λP(β)}.
• In our case, the loss function is
L(X, y, β) = ∑_{i=1}^n max(0, 1 − yi(β0 + β1xi1 + · · · + βpxip)),
which is called hinge loss.
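As a small numerical illustration (plain Python/NumPy, hypothetical margin values): the hinge loss is exactly zero once yif(xi) ≥ 1, whereas the logistic regression loss log(1 + exp(−yif(xi))) is positive everywhere but decays quickly. This is the comparison drawn in the figure below.

    import numpy as np

    def hinge_loss(m):
        """SVM (hinge) loss as a function of the margin m = y_i * f(x_i)."""
        return np.maximum(0.0, 1.0 - m)

    def logistic_loss(m):
        """Logistic regression loss (negative log-likelihood) as a function of the margin."""
        return np.log1p(np.exp(-m))

    for m in [-2.0, 0.0, 0.5, 1.0, 3.0]:
        print(f"margin {m:+.1f}: hinge = {hinge_loss(m):.3f}, logistic = {logistic_loss(m):.3f}")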
[Figure: SVM (hinge) loss and logistic regression loss plotted against yi(β0 + β1xi1 + · · · + βpxip)]