Ch 9: Support Vector Machines
This material is prepared following James et al. (2013) and the slides at
https://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/
9.0 Introduction
• The support vector machine is a generalization of a simple and intuitive classifier called
the maximal margin classifier.
• We discuss the support vector classifier, an extension of the maximal margin classifier
that can be applied in a broader range of cases.
• We further consider the support vector machine, which is a further extension of the
support vector classifier in order to accommodate non-linear class boundaries.
9.1 Maximal Margin Classifier
9.1.1 What Is a Hyperplane?
• A hyperplane in p dimensions is defined by
{X = (X1, . . . , Xp)T : f(X) = β0 + β1X1 + · · · + βpXp = 0}
• The mathematical definition of a hyperplane is quite simple. In two dimensions, a
hyperplane is defined by the equation
β0 + β1X1 + β2X2 = 0
for parameters β0, β1, β2. Any X = (X1, X2)T for which the above equation holds is
a point on the hyperplane. Note that the above equation is simply the equation of a
line, since indeed in two dimensions a hyperplane is a line.
• If f(X) = β0 + β1X1 + · · · + βpXp, then f(X) > 0 for points on one side of the
hyperplane, and f(X) < 0 for points on the other. Thus, one can determine on which
side of the hyperplane a point X lies simply by computing the sign of f(X).
[Figure: axes X1 and X2] The hyperplane 1 + 2X1 + 3X2 = 0 is shown. The blue region is the set of points for which 1 + 2X1 + 3X2 > 0, and the purple region is the set of points for which 1 + 2X1 + 3X2 < 0.
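As a quick arithmetic check of the sign rule, the short Python sketch below (with hypothetical points, not from the text) evaluates f(X) = 1 + 2X1 + 3X2 at a few points and reports which region of the figure each falls in.

    # Evaluate f(X) = 1 + 2*X1 + 3*X2 and use its sign to place each point.
    def f(x1, x2):
        return 1 + 2 * x1 + 3 * x2

    for x1, x2 in [(1.0, 1.0), (0.0, -1.0), (-1.0, 0.5)]:
        value = f(x1, x2)
        side = "blue region (f > 0)" if value > 0 else "purple region (f < 0)"
        print(f"({x1}, {x2}): f = {value:+.1f} -> {side}")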
9.1.2 Classification Using a Separating Hyperplane
• Suppose that we have an n × p data matrix X that consists of n training observations
in p–dimensional space and that these observations fall into two classes – that is,
y1, . . . , yn ∈ {−1, 1}.
• We also have a test observation, a p-vector of observed features x∗ = (x∗1, . . . , x∗p)T.
Our goal is to develop a classifier based on the training data that will correctly classify
the test observation using its feature measurements.
• We will now see a new approach that is based upon the concept of a separating hyper-
plane.
• Consider a hyperplane that separates the training observations perfectly according to
their class labels. Then a separating hyperplane has the property that
β0 + β1xi1 + · · ·+ βpxip > 0 if yi = 1
and
β0 + β1xi1 + · · ·+ βpxip < 0 if yi = −1.
Equivalently, a separating hyperplane has the property that
yi(β0 + β1xi1 + · · ·+ βpxip) > 0
for all i = 1, . . . , n. If a separating hyperplane exists, we can use it to construct a very
natural classifier: a test observation is assigned a class depending on which side of the
hyperplane it lies.
[Figure: two panels, axes X1 and X2]
9.1.3 The Maximal Margin Classifier
• Among all separating hyperplanes, find the one that makes the biggest gap or margin
between the two classes.
• That is, we can compute the (perpendicular) distance from each training observation
to a given separating hyperplane; the smallest such distance is the minimal distance
from the observations to the hyperplane, and is known as the margin.
• The maximal margin hyperplane is the separating hyperplane for which the margin is
largest.
[Figure: axes X1 and X2]
Three training observations are equidistant from the maximal margin hyperplane and lie along the
dashed lines indicating the width of the margin.
• These three observations are known as support vectors, since they are vectors in
p-dimensional space (in the figure, p = 2) and they “support” the maximal margin
hyperplane in the sense that if these points were moved slightly, the maximal margin
hyperplane would move as well.
9.1.4 Construction of the Maximal Margin Classifier
• The maximal margin hyperplane is the solution to the optimization problem
max_{β0, β1, . . . , βp, M} M
subject to ∑_{j=1}^p βj² = 1,
yi(β0 + β1xi1 + · · · + βpxip) ≥ M for all i = 1, . . . , n.
• The two constraints ensure that each observation is on the correct side of the hyperplane
and at least a distance M from the hyperplane. Hence, M represents the margin of
our hyperplane, and the optimization problem chooses β0, β1, . . . , βp to maximize M .
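Most software does not expose the maximal margin classifier directly, but it can be approximated by a linear support vector machine whose violation penalty is made very large. The sketch below is a minimal illustration with scikit-learn on a hypothetical, clearly separable toy data set (all data and parameter values are illustrative); note that scikit-learn's C penalizes violations, so a huge value mimics the hard margin.

    import numpy as np
    from sklearn.svm import SVC

    # Hypothetical linearly separable toy data: two well-separated clusters.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(loc=-3.0, size=(20, 2)),
                   rng.normal(loc=+3.0, size=(20, 2))])
    y = np.array([-1] * 20 + [1] * 20)

    # A very large penalty approximates the maximal margin (hard-margin) classifier.
    clf = SVC(kernel="linear", C=1e6).fit(X, y)

    print("(beta1, beta2):", clf.coef_[0])
    print("beta0:", clf.intercept_[0])
    print("support vectors:\n", clf.support_vectors_)  # the observations that define the margin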
[Figure: two panels, axes X1 and X2]
Need a better classifier?
9.2 Support Vector Classifier
• It could be worthwhile to misclassify a few training observations in order to do a better
job in classifying the remaining observations.
• The support vector classifier, sometimes called a soft margin classifier, does exactly
this.
• The solution is the following optimization:
max_{β0, β1, . . . , βp, ε1, . . . , εn, M} M
subject to ∑_{j=1}^p βj² = 1,
yi(β0 + β1xi1 + · · · + βpxip) ≥ M(1 − εi),
εi ≥ 0, ∑_{i=1}^n εi ≤ C,
where C is a nonnegative tuning parameter.
– ε1, . . . , εn are termed slack variables that allow individual observations to be on
the wrong side of the margin or the hyperplane: if εi = 0, the observation is on
the correct side of the margin; if 0 < εi ≤ 1, the observation is on the wrong side
of the margin; and if εi > 1, it is on the wrong side of the hyperplane.
– The tuning parameter C is selected by cross-validation (see the sketch after this
list). As one can see, C bounds the sum of the εi's, and so it determines the number
and severity of the violations to the margin (and to the hyperplane) that we will tolerate.
– If C = 0, then there is no budget for violations to the margin, and it must be the
case that ε1 = · · · = εn = 0.
– As C increases, we become more tolerant of violations to the margin, and so
the margin will widen. Conversely, as C decreases, we become less tolerant of
violations to the margin, and so the margin narrows.
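A minimal sketch of selecting the tuning parameter by cross-validation with scikit-learn, on a hypothetical overlapping two-class data set. One caveat: scikit-learn's C is a penalty on margin violations and so plays the opposite role of the budget C above; a large scikit-learn C tolerates few violations (narrow margin), while a small one tolerates many (wide margin).

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV

    # Hypothetical two-class data with overlap, so some margin violations are unavoidable.
    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(loc=-1.0, size=(50, 2)),
                   rng.normal(loc=+1.0, size=(50, 2))])
    y = np.array([-1] * 50 + [1] * 50)

    # Cross-validate the penalty parameter of a linear support vector classifier.
    search = GridSearchCV(SVC(kernel="linear"),
                          {"C": [0.001, 0.01, 0.1, 1, 10, 100]}, cv=5).fit(X, y)

    print("chosen penalty:", search.best_params_["C"])
    print("cross-validated accuracy:", round(search.best_score_, 3))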
[Figure: four panels, axes X1 and X2]
A support vector classifier was fit using four different values of the tuning parameter C. The largest
value of C was used in the top left panel, and smaller values were used in the top right, bottom left,
and bottom right panels. When C is large, there is a high tolerance for observations being on
the wrong side of the margin, and so the margin will be large. As C decreases, the tolerance for
observations being on the wrong side of the margin decreases, and the margin narrows.
[Figure: two panels, axes X1 and X2]
Left: The observations fall into two classes, with a non-linear boundary between them. Right: The
support vector classifier seeks a linear boundary, and consequently performs very poorly.
9.3 Support Vector Machines
9.3.1 Classification with Non-linear Decision Boundaries
• Enlarge the space of features by including transformations, e.g. X1², X1³, X1X2, X1X2², . . . .
Hence go from a p–dimensional space to an M > p dimensional space.
• Fit a support–vector classifier in the enlarged space. This results in non-linear decision
boundaries in the original space.
• Example: Suppose we use (X1, X2, X1², X2², X1X2, X1³, X2³, X1X2², X1²X2) instead of (X1, X2).
Then the decision boundary would be of the form
β0 + β1X1 + β2X2 + β3X1² + β4X2² + β5X1X2 + β6X1³ + β7X2³ + β8X1X2² + β9X1²X2 = 0
This leads to nonlinear decision boundaries in the original space.
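A sketch of this idea in scikit-learn, on a hypothetical data set whose true boundary is a circle: a support vector classifier on the original two features has a linear boundary and does poorly, while the same classifier does much better after the feature space is enlarged with degree-2 polynomial terms. The data set and parameter values are illustrative only.

    from sklearn.datasets import make_circles
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.svm import SVC

    # Hypothetical data with a circular (non-linear) class boundary.
    X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

    # Support vector classifier on the original two features: a linear boundary.
    linear = SVC(kernel="linear", C=1.0)

    # The same classifier after enlarging the feature space with degree-2 polynomial terms.
    expanded = make_pipeline(PolynomialFeatures(degree=2), SVC(kernel="linear", C=1.0))

    print("original features:", cross_val_score(linear, X, y, cv=5).mean())
    print("enlarged features:", cross_val_score(expanded, X, y, cv=5).mean())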
[Figure: two panels, axes X1 and X2]
9.3.2 The Support Vector Machine
• Polynomials (especially high-dimensional ones) get wild rather fast.
• There is a more elegant and controlled way to introduce nonlinearities in support–
vector classifiers – through the use of kernels.
• The support vector machine (SVM) is an extension of the support vector classifier that
results from enlarging the feature space in a specific way, using kernels.
• Before we discuss these, we must understand the role of inner products in support-
vector classifiers.
• The inner product of two observations xi, xi′ is given by
〈xi, xi′〉 = ∑_{j=1}^p xij xi′j.
• The linear support vector classifier can be represented as
f(x) = β0 + ∑_{i=1}^n αi〈x, xi〉.
• To estimate the parameters α1, . . . , αn and β0, all we need are the n(n − 1)/2 inner
products 〈xi, xi′〉 between all pairs of training observations.
• It turns out that most of the α̂i can be zero:
f(x) = β0 + ∑_{i∈S} α̂i〈x, xi〉,
where S is the support set of indices i such that α̂i > 0.
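A quick illustration of this sparsity with scikit-learn on a hypothetical overlapping two-class data set: only a fraction of the n training observations end up in the support set S, and only those enter the fitted decision function.

    import numpy as np
    from sklearn.svm import SVC

    # Hypothetical two-class data with moderate overlap.
    rng = np.random.default_rng(4)
    X = np.vstack([rng.normal(loc=-1.5, size=(100, 2)),
                   rng.normal(loc=+1.5, size=(100, 2))])
    y = np.array([-1] * 100 + [1] * 100)

    clf = SVC(kernel="linear", C=1.0).fit(X, y)

    # Indices of the observations with nonzero alpha-hat, i.e. the support set S.
    print("n =", len(y), "  |S| =", len(clf.support_))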
• We consider a generalization of the inner product of the form K(xi, xi′), where K is
some function that we will refer to as a kernel. A kernel is a function that quantifies
the similarity of two observations.
• The solution has the form
f(x) = β0 + ∑_{i∈S} α̂iK(x, xi).
• An example of a possible non-linear kernel is the polynomial kernel
K(xi, xi′) = (1 + ∑_{j=1}^p xij xi′j)^d,
where d is a positive integer. Another popular choice is the radial kernel, which takes
the form
K(xi, xi′) = exp(−γ ∑_{j=1}^p (xij − xi′j)²)
with a positive constant γ.
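The sketch below fits an SVM with the radial kernel in scikit-learn on hypothetical data and then rebuilds f(x) = β0 + ∑_{i∈S} α̂iK(x, xi) by hand from the fitted support vectors, to show that the decision function really is a kernel-weighted sum over the support set. This assumes scikit-learn's convention that dual_coef_ stores the coefficients α̂i (with the class labels folded in); all data and parameter values are illustrative.

    import numpy as np
    from sklearn.svm import SVC

    # Hypothetical data with a non-linear class boundary.
    rng = np.random.default_rng(5)
    X = rng.normal(size=(150, 2))
    y = np.where(X[:, 0]**2 + X[:, 1]**2 > 1.2, 1, -1)

    gamma = 0.7
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

    sv = clf.support_vectors_          # the x_i in the support set S
    alpha = clf.dual_coef_.ravel()     # fitted coefficients for each support vector
    beta0 = clf.intercept_[0]

    # Rebuild f(x_new) = beta0 + sum_i alpha_i * K(x_new, x_i) with the radial kernel.
    x_new = np.array([0.3, -0.8])
    K = np.exp(-gamma * ((sv - x_new) ** 2).sum(axis=1))
    f_manual = beta0 + (alpha * K).sum()

    print(np.isclose(f_manual, clf.decision_function(x_new.reshape(1, -1))[0]))  # True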
• Example: Heart Data
– Use 13 predictors such as Age, Sex, and Chol in order to predict whether an
individual has heart disease.
[Figure] ROC curves for the Heart data training set. Left: The support vector classifier and LDA are compared. Right: The support vector classifier is compared to an SVM using a radial basis kernel with γ = 10⁻³, 10⁻², and 10⁻¹.
[Figure] ROC curves for the test set of the Heart data. Left: The support vector classifier and LDA are compared. Right: The support vector classifier is compared to an SVM using a radial basis kernel with γ = 10⁻³, 10⁻², and 10⁻¹.
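The Heart data themselves are not reproduced here, but the comparison in these figures can be sketched with scikit-learn on any binary data set. The snippet below uses a hypothetical synthetic stand-in with 13 predictors and summarizes each classifier's ROC curve by its test-set AUC, using the decision-function values as the scores (sklearn.metrics.roc_curve would give the full curves).

    from sklearn.datasets import make_classification
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Hypothetical stand-in for the Heart data: 13 predictors, two classes.
    X, y = make_classification(n_samples=300, n_features=13, n_informative=6, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

    models = {"support vector classifier": SVC(kernel="linear")}
    for gamma in [1e-3, 1e-2, 1e-1]:
        models[f"SVM, gamma={gamma}"] = SVC(kernel="rbf", gamma=gamma)

    for name, model in models.items():
        scores = model.fit(X_train, y_train).decision_function(X_test)  # ROC scores
        print(f"{name}: test AUC = {roc_auc_score(y_test, scores):.3f}")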
9.4 SVMs with More than Two Classes
• The SVM as defined works for K = 2 classes. What do we do if we have K > 2 classes?
• One-Versus-One Classification (OVO): Fit all K(K − 1)/2 pairwise classifiers f̂kℓ(x). Classify
x∗ to the class that wins the most pairwise competitions.
• One-Versus-All Classification (OVA): Fit K different 2-class SVM classifiers f̂k(x),
k = 1, . . . , K; each class versus the rest. Classify x∗ to the class for which f̂k(x∗) is
largest.
• Which to choose? If K is not too large, use OVO.
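For reference, scikit-learn's SVC uses the one-versus-one construction for multiclass problems; the small sketch below (hypothetical three-class data) confirms that the pairwise decision values number K(K − 1)/2.

    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    # Hypothetical data with K = 3 classes.
    X, y = make_blobs(n_samples=150, centers=3, random_state=0)

    # SVC fits one-versus-one pairwise classifiers for multiclass problems.
    clf = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)

    # With K = 3 classes there are K(K - 1)/2 = 3 pairwise decision values per observation.
    print(clf.decision_function(X[:1]).shape)  # (1, 3)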
9.5 Relationship to Logistic Regression
• One can rewrite the criterion for fitting the support vector classifier
f(X) = β0 + β1X1 + · · · + βpXp as
min_{β0, β1, . . . , βp} { ∑_{i=1}^n max(0, 1 − yif(xi)) + λ ∑_{j=1}^p βj² }.
• The above form is like “Loss + Penalty”
min_{β0, β1, . . . , βp} {L(X, y, β) + λP(β)}.
• In our case, the loss function is
L(X, y, β) = ∑_{i=1}^n max(0, 1 − yi(β0 + β1xi1 + · · · + βpxip)),
which is called hinge loss.
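As a small numerical illustration (plain Python/NumPy, hypothetical margin values): the hinge loss is exactly zero once yif(xi) ≥ 1, whereas the logistic regression loss log(1 + exp(−yif(xi))) is positive everywhere but decays quickly. This is the comparison drawn in the figure below.

    import numpy as np

    def hinge_loss(m):
        """SVM (hinge) loss as a function of the margin m = y_i * f(x_i)."""
        return np.maximum(0.0, 1.0 - m)

    def logistic_loss(m):
        """Logistic regression loss (negative log-likelihood) as a function of the margin."""
        return np.log1p(np.exp(-m))

    for m in [-2.0, 0.0, 0.5, 1.0, 3.0]:
        print(f"margin {m:+.1f}: hinge = {hinge_loss(m):.3f}, logistic = {logistic_loss(m):.3f}")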
[Figure: SVM (hinge) loss and logistic regression loss plotted against yi(β0 + β1xi1 + · · · + βpxip)]