Support Vector Machines
Mark Stamp
Dec 30, 2015
Supervised vs Unsupervised
• Often we use supervised learning
o That is, training relies on labeled data
o Training data must be pre-processed
• In contrast, unsupervised learning
o That is, uses unlabeled data
o No pre-processing required for training
• Also semi-supervised algorithms
o Supervised, but not too much…
HMM for Supervised Learning
• Suppose we want to use an HMM for malware detection
• Train a model on a set of malware
o All from a particular family
o Labeled as malware of that type
o Test to see how well it distinguishes
• This is an example of supervised learning
Semi-Supervised Learning
• Recall the HMM for English text example
• Using N = 2, we find hidden states correspond to…
o Consonants and vowels
o We did not specify consonants/vowels
o HMM extracted this info from raw data
• Semi-supervised learning?
o Maybe, depending on definitions…
Unsupervised Learning
• Clustering
o Good example of unsupervised learning
o The only example?
• For a mixed dataset, the goal of clustering is to reveal hidden structure
• No pre-processing
o Often no idea how to pre-process
o Usually used in “data exploration” mode
Supervised Learning
• English text example
o Preprocess by marking consonants and vowels
o Then train on this labeled data
• SVM is one of the most popular supervised learning methods
• Here, we only consider binary classification
o Only 2 classes, such as consonant vs vowel
o Other examples of binary classification?
Support Vector Machine
• SVM is based on 3 main ideas
1. Maximize the “margin”
o Max separation between classes
2. Work in a higher dimensional space
o More “room”, so easier to separate
3. Kernel trick
o This is intimately related to 2
• Both 1 and 2 are fairly intuitive
Separating Classes
• Consider labeled data for a binary classifier
o Denote the red class as 1
o And blue as class -1
• Easy to see the separation
• How to separate?
o We’ll use a “hyperplane”…
o …which is a line in 2-d
Separating Hyperplanes
• Consider labeled data
o Easy to separate
• Draw a hyperplane to separate the points
o Classify new data based on the separating hyperplane
o But which hyperplane is best?
o And why?
Maximize Margin
• Margin is the minimum distance to misclassifications
• Maximize the margin
o So, the yellow hyperplane is better than the purple
• Seems like a good idea
o But, not always so easy
o See next slide…
Separating… NOT
• What about this case?
• The yellow line is not an option
o Why not?
o No longer “separating”
• What to do?
o Allow for some errors
o Hyperplane need not completely separate
Soft Margin
• Ideally, large margin and no errors
• But allowing some misclassifications might increase the margin by a lot
o Relax the separating requirement
• How many errors to allow?
o User-defined parameter
o Tradeoff errors vs larger margin
o In practice, find the “best” value by trial and error (as sketched below)
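A minimal sketch of this trial-and-error search, using scikit-learn's SVC (the slides do not name any particular library; the toy data and the grid of C values are assumptions):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy 2-d data: two overlapping blobs, labeled -1 and 1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
z = np.array([-1] * 50 + [1] * 50)

# C controls the error-vs-margin tradeoff; try several values with cross-validation
search = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
search.fit(X, z)
print(search.best_params_)  # the "best" tradeoff found by trial and error
```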
Feature Space
• Transform data to “feature space”
o Feature space is in a higher dimension
o But usually we try to reduce dimensionality
• Q: Why increase dimensionality???
• A: Easier to separate in feature space
• Goal is to make data “linearly separable”
o Want to separate classes with a hyperplane
o But not pay a price for high dimensionality
Higher Dimensional Space
• Why transform to a “higher” dimension?
o One advantage is that a nonlinear boundary can become linear
[Figure: a map ϕ from input space to feature space (pretend it’s in a higher dimension)]
Cool Picture
• A real example of what can happen by transforming to a higher dimension
Feature Space
• Usually, higher dimension is bad news
o From a computational complexity POV
o The so-called “curse of dimensionality”
• But a higher dimension feature space can make data linearly separable
• Can we have our cake and eat it too?
o Linearly separable and easy to compute
• Yes, thanks to the kernel trick
Kernel Trick
• Enables us to work in input space
o With results mapped to feature space
o No work done explicitly in feature space
• Computations in input space
o Lower dimension, so computation is easier
• Results actually in feature space
o Higher dimension, so easier to separate
• Very cool trick!
Kernel Trick
• Unfortunately, to understand the kernel trick, we must dig a little deeper
• This also makes other aspects clearer
• We won’t cover every detail here
• Just enough to get the idea across
o Well, maybe a little more than that…
• We need Lagrange multipliers
o But first, constrained optimization
Constrained Optimization
• “No brainer” example
• Maximize: f(x) = 4 – x² subject to x – 1 = 0
• Solution?
o Max is at x = 1
o Max value is f(1) = 3
• Consider the more general case next…
[Figure: plot of f(x) = 4 – x² with the constrained maximum at x = 1]
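As a preview of the machinery introduced next, here is this no-brainer example worked with a Lagrange multiplier (a sketch; the slides solve it by inspection):

```latex
L(x,\lambda) = 4 - x^2 + \lambda(x - 1), \qquad
\frac{\partial L}{\partial x} = -2x + \lambda = 0, \quad
\frac{\partial L}{\partial \lambda} = x - 1 = 0
\implies x = 1,\ \lambda = 2,\ f(1) = 3
```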
Lagrange Multipliers
• Optimize f(x,y) subject to g(x,y) = c
• Define the Lagrangian
L(x,y,λ) = f(x,y) + λ (g(x,y) – c)
• “Stationary points” of L are possible solutions to the original problem
o All solutions must be stationary points
o Not all stationary points are solutions
• Generalize: more variables/constraints
Stationary Points
• Has nothing to do with fancy paper
o That’s stationery, not stationary…
• A stationary point means the partial derivatives are all 0, that is
∂L/∂x = 0 and ∂L/∂y = 0 and ∂L/∂λ = 0
• As mentioned, this generalizes to…
o More variables in functions f and g
o More constraints: Σ λi (gi(x,y) – ci)
A Realistic Example
• Lots of cool geometric examples
• We look at something different
• Consider a discrete probability distribution on n points: p1,p2,p3,…,pn
• What distribution has max entropy?
o Maximize the entropy function
o Subject to the constraint that the pj form a probability distribution
Maximize Entropy
• Shannon entropy is –Σ pj log2 pj
• What is a probability distribution?
o Require 0 ≤ pj ≤ 1 for all j, and Σ pj = 1
• So, we want to solve the following:
o Maximize f(p1,…,pn) = –Σ pj log2 pj
o Subject to the constraint Σ pj = 1
• How should we solve this?
o Do you really have to ask?
Entropy Example
• Recall L(x,y,λ) = f(x,y) + λ (g(x,y) – c)
• Problem statement
o Maximize f(p1,…,pn) = –Σ pj log2 pj
o Subject to the constraint Σ pj = 1
• In this case, the Lagrangian is
L(p1,…,pn,λ) = –Σ pj log2 pj + λ (Σ pj – 1)
• Compute partial derivatives wrt each pj and the partial derivative wrt λ
Entropy Example
• Have L(p1,…,pn,λ) = –Σ pj log2 pj + λ (Σ pj – 1)
• The partial derivative wrt any pj yields
–log2 pj – 1/ln(2) + λ = 0   (#)
• And wrt λ yields the constraint
Σ pj – 1 = 0, or Σ pj = 1   (##)
• Equation (#) implies all pj are equal
• With equation (##), all pj = 1/n
• Conclusion? (see below)
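Finishing the argument (a short completion; the slides leave the conclusion as a question): the uniform distribution maximizes entropy, and the maximum value is

```latex
p_j = \frac{1}{n} \implies
H = -\sum_{j=1}^{n} \frac{1}{n}\log_2\frac{1}{n} = \log_2 n
```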
Notation
• Let x = (x1,x2,…,xn) and λ = (λ1,λ2,…,λm)
• Then we write the Lagrangian as
L(x,λ) = f(x) + Σ λi (gi(x) – ci)
• Note: L is a function of n+m variables
• Can view the problem as follows
o The gi functions define a feasible region
o Maximize f over this feasible region
Lagrange Multiplier Example
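The example on this slide is a figure; as a stand-in, here is a small worked example in the same spirit (the function and constraint are assumptions, not the slide's):

```latex
\text{Maximize } f(x,y) = xy \text{ subject to } x + y = 4, \qquad
L(x,y,\lambda) = xy + \lambda(x + y - 4)
```
```latex
\frac{\partial L}{\partial x} = y + \lambda = 0, \quad
\frac{\partial L}{\partial y} = x + \lambda = 0, \quad
\frac{\partial L}{\partial \lambda} = x + y - 4 = 0
\implies x = y = 2,\ f(2,2) = 4
```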
Lagrangian Duality
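This slide's content is a figure; as a hedged summary of the standard facts it presumably covers: every constrained minimization (the primal problem) has an associated dual problem, obtained by maximizing the Lagrangian over the multipliers, and the dual optimum bounds the primal optimum (weak duality), with equality (strong duality) under convexity conditions that SVM training satisfies. Schematically:

```latex
\underbrace{\min_{x}\,\max_{\lambda \ge 0} L(x,\lambda)}_{\text{primal}}
\;\ge\;
\underbrace{\max_{\lambda \ge 0}\,\min_{x} L(x,\lambda)}_{\text{dual}}
```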
Lagrange Multipliers and SVM
• Lagrange multipliers are very cool indeed
o But what does this have to do with SVM?
• Can view the (soft) margin computation as a constrained optimization problem
o In this form, the kernel trick will be clear
• We can kill 2 birds with 1 stone
o Make the margin calculation clearer
o Make the kernel trick perfectly clear
Problem Setup
• Let X1,X2,…,Xn be data points
o Each Xi = (xi,yi) is a point in the plane
o In general, could be higher dimension
• Let z1,z2,…,zn be the corresponding class labels, where each zi ∈ {-1,1}
o Where zi = 1 if classified as “red” type
o And zi = -1 if classified as “blue” type
• Note that this is a binary classifier
Geometric View
• Equation of the yellow line: w1x + w2y + b = 0
• Equation of the red line: w1x + w2y + b = 1
• Equation of the blue line: w1x + w2y + b = -1
• Margin is the distance between red and blue
Geometric View
• All red points X = (x,y) satisfy
w1x + w2y + b ≥ 1
• All blue points X = (x,y) satisfy
w1x + w2y + b ≤ -1
• Want the inequalities all true after training
Geometric View
• With the lines defined…
• Given a new data point X = (x,y) to classify
o “Red” provided that w1x + w2y + b > 0
o “Blue” provided that w1x + w2y + b < 0
• This is the scoring phase
Geometric View
• The real question is...
• How to find the equation of the yellow line?
o Given {Xi} and {zi}
o Where Xi is a point in the plane
o And zi is its classification
• Finding the yellow line is the training phase…
Geometric View
• Distance from the origin to the line Ax + By + C = 0 is
|C| / sqrt(A² + B²)
• Origin to red line: |1-b| / ||W||, where W = (w1,w2)
• Origin to blue line: |-1-b| / ||W||
• Margin is m = 2/||W|| (see below)
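The margin value follows in one step: the distance between two parallel lines W·X + b = c1 and W·X + b = c2 is |c1 – c2| / ||W||, so for the red (c1 = 1) and blue (c2 = -1) lines

```latex
m = \frac{|1 - (-1)|}{\|W\|} = \frac{2}{\|W\|}
```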
Training Phase
• Given {Xi} and {zi}, find the largest margin m that classifies all points correctly
• Want to find the red and blue lines in the picture
• Recall the red line is of the form
w1x + w2y + b = 1
• The blue line is of the form
w1x + w2y + b = -1
• And maximize the margin: m = 2/||W||
Training
• Since zi ∈ {-1,1}, correct classification occurs provided
zi (w1xi + w2yi + b) ≥ 1 for all i
• Training problem to solve:
o Maximize: m = 2/||W||
o Subject to the constraints:
zi (w1xi + w2yi + b) ≥ 1 for i = 1,2,…,n
• Can we determine W and b?
Training
• The problem on the previous slide is equivalent to the following
• Minimize: F(W) = ||W||² / 2 = (w1² + w2²) / 2
o Maximizing 2/||W|| is the same as minimizing ||W||; squaring just makes the calculus nicer
• Subject to the constraints:
1 - zi (w1xi + w2yi + b) ≤ 0 for all i
• Should be starting to look familiar…
Lagrangian
• Ignoring the inequalities, we have…
L(w1,w2,b,λ) = (w1² + w2²) / 2 + Σ λi (1 - zi (w1xi + w2yi + b))
• Compute
∂L/∂w1 = w1 - Σ λizixi = 0
∂L/∂w2 = w2 - Σ λiziyi = 0
∂L/∂b = -Σ λizi = 0
∂L/∂λi = 1 - zi (w1xi + w2yi + b) = 0
Lagrangian
• The derivatives yield the constraints and
W = Σ λiziXi and Σ λizi = 0
• Substituting these into L yields
L(λ) = Σ λi – ½ ΣΣ λiλjzizj (Xi·Xj)
• Where “·” is the dot product: Xi·Xj = xixj + yiyj
• Here, L is only a function of λ
o We still have the constraint Σ λizi = 0
o Note: If we find the λi then we know W
New-and-Improved Problem
• Max: L(λ) = Σ λi – ½ ΣΣ λiλjzizj (Xi·Xj)
• Subject to: Σ λizi = 0 and all λi ≥ 0
• Why maximize L(λ)? The intuition may be…
o Goal is to minimize F(W) = (w1² + w2²) / 2
o Subject to the constraints in the L(λ) function
o Maximizing L(λ) finds the “best” parameters λ
o And the “best” λ will solve this min problem
o This version is known as the dual problem
Dual Version of Problem
• Max: L(λ) = Σ λi – ½ ΣΣ λiλjzizj (Xi·Xj)
• Subject to: Σ λizi = 0 and all λi ≥ 0
• Note that this is the dual problem
• Can always solve it (if a solution exists)
o And will find a global maximum
• It doesn’t get any better than that!
o Note that with HMM (for example), there is no guarantee of a global maximum
All Together Now: Training
• Given data points X1,X2,…,Xn
• Label each Xi with zi ∈ {-1,1}
• Solve the dual problem (previous slide)
o Solving it yields λ
o Once λ is known, compute W = (w1,w2) and b
o Obtain the equation of the line: w1x + w2y + b = 0
• What have we accomplished? (a concrete sketch below)
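A minimal sketch of this training procedure in Python, solving the dual with a generic optimizer (scipy's SLSQP here; a production SVM would use a dedicated QP solver). The upper bound C anticipates the soft-margin condition C ≥ λi introduced later in the deck, and the tolerance for deciding λi > 0 is an assumption:

```python
import numpy as np
from scipy.optimize import minimize

def train_svm_dual(X, z, C=10.0):
    # Solve the dual: max sum(lam) - 1/2 sum_ij lam_i lam_j z_i z_j (Xi . Xj)
    n = len(z)
    K = X @ X.T                      # all pairwise dot products Xi . Xj
    def neg_dual(lam):               # minimize the negative of L(lambda)
        v = lam * z
        return -(lam.sum() - 0.5 * (v @ K @ v))
    res = minimize(neg_dual, np.zeros(n), method="SLSQP",
                   bounds=[(0.0, C)] * n,
                   constraints={"type": "eq", "fun": lambda lam: lam @ z})
    lam = res.x
    W = (lam * z) @ X                # W = sum_i lam_i z_i Xi
    sv = lam > 1e-6                  # support vectors (tolerance is assumed)
    b = np.mean(z[sv] - X[sv] @ W)   # from z_i (W . Xi + b) = 1 on the margin
    return lam, W, b

# usage: lam, W, b = train_svm_dual(X, z), with X of shape (n, 2), z in {-1, 1}
```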
All Together Now: Scoring
• From training, we find λ
o This yields W = (w1,w2) and b in w1x + w2y + b
• Given a new data point X = (x,y)
o That is, X not in the training set
• Compute w1x + w2y + b
o If greater than 0, classify X as red type
o Otherwise, classify X as blue type
• What happened, in terms of the picture?
Geometric Viewpoint
• Training?
o Find the equation of the yellow line, f(X)
• Score X = (x,y)?
o If f(X) > 0, then X is above the yellow line (classify as red)
o Else X is below the line (classify as blue)
Scoring Revisited
• Use the equation of the yellow line for scoring
• There is an alternative (better) way
o Let f(X) = w1x + w2y + b = W·X + b
o And recall that W = Σ λiziXi
• Then, f(X) = Σ λizi(Xi·X) + b (see the sketch below)
• Why is this better?
o No need to explicitly compute W
o Any better reasons why it’s better?
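A minimal sketch of this alternative scoring form (assumes training has produced arrays lam, z, X and the scalar b, e.g., from the training sketch above):

```python
import numpy as np

def score(Xnew, X, z, lam, b):
    # f(X) = sum_i lam_i z_i (Xi . Xnew) + b -- W is never formed explicitly
    return (lam * z) @ (X @ Xnew) + b

def classify(Xnew, X, z, lam, b):
    return "red" if score(Xnew, X, z, lam, b) > 0 else "blue"
```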
Support Vectors
• When solving L(λ), we find that mostly λi = 0
• Specifically, λi = 0 for the Xi for which
zi (w1xi + w2yi + b) > 1
• The only constraints that can matter are those where
zi (w1xi + w2yi + b) = 1
• The latter are the support vectors
o Not known in advance; training determines the support vectors
Support Vectors
• Picture worth 1k words?
• Where are the support vectors?
o Other vectors (training points) don’t matter
o Why not?
[Figure: separating line with margin; the points on the margin are the support vectors]
Scoring Re-revisited
• Score X using f(X) = Σ λizi(Xi·X) + b
• Generally, most of the λi are 0
• So, the sum is not really from i=1 to n
o Instead, the sum is actually from i=1 to s
o Where s is the number of support vectors
• Why does this matter?
o Typically, n is large and s is small, so scoring is fast
o And this form of f(X) is very useful…
Training: Soft Margin
• Suppose we relax “linearly separable”
• Tradeoff errors for a bigger margin m
o More errors, but gain a bigger margin
• Note that 2 kinds of errors are illustrated
[Figure: soft margin m with errors inside the margin and on the wrong side of the line]
Errors
• To account for errors, introduce “slack variables” εi ≥ 0 into the optimization
• For a red point Xi = (xi,yi), the constraint is
w1xi + w2yi + b ≥ 1 - εi
• For a blue point Xi = (xi,yi), the constraint is
w1xi + w2yi + b ≤ -1 + εi
• Minimize: ||W||²/2 + C Σ εi
• Subject to the constraints above
Dual Problem
• Working through the details, the dual problem is…
• Max: L(λ) = Σ λi – ½ ΣΣ λiλjzizj (Xi·Xj)
• Subject to: Σ λizi = 0 and C ≥ λi ≥ 0
o Note that this is the same as before…
o …except for the C ≥ λi condition
• We specify C when training
• The non-linearly separable case is very similar to the linearly separable case
Training and Scoring Re-re-revisited
• Training
o Maximize: L(λ) = Σ λi – ½ ΣΣ λiλjzizj (Xi·Xj)
o Subject to: Σ λizi = 0 and C ≥ λi ≥ 0
o Where C is specified by the user
• Scoring: Given X = (x,y)
o Compute f(X) = Σ λizi(Xi·X) + b, where the sum is over the support vectors
o If f(X) < 0, then X is “blue”; else it’s “red”
Kernel Trick
• Finally, we can make sense of the kernel trick
• Recall X1,X2,…,Xn are the training vectors
o For training, the Xi only appear as Xi·Xj
o When scoring X, the Xi only appear as Xi·X
• The dot product is a type of inner product
o There are many other inner products
• Can replace “·” with any inner product
o E.g., one defined in higher dimensions
Kernel Example
• Suppose we define a map ϕ from 2-d input space into a higher dimension, e.g.,
ϕ(x,y) = (1, √2·x, √2·y, x², y², √2·xy)
• For Xi = (xi,yi) and Xj = (xj,yj), this ϕ satisfies
ϕ(Xi)·ϕ(Xj) = (1 + xixj + yiyj)²
• Define the kernel function K as
K(Xi,Xj) = (1 + xixj + yiyj)²
• Note: K is the composition of ϕ and “·”
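A quick numeric check of this identity (the ϕ above is one standard choice of feature map; the test points are arbitrary):

```python
import numpy as np

def phi(p):
    # feature map with phi(Xi) . phi(Xj) = (1 + Xi . Xj)^2
    x, y = p
    r2 = np.sqrt(2)
    return np.array([1, r2 * x, r2 * y, x * x, y * y, r2 * x * y])

Xi, Xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(Xi) @ phi(Xj))   # 4.0, computed in 6-d feature space
print((1 + Xi @ Xj) ** 2)  # 4.0, same value, computed in 2-d input space
```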
The Big Picture
• Training data lives in input space
o Where the data is not linearly separable
• Map input space to a higher dimension feature space using a function ϕ
• Do training & scoring in feature space
o Where the data is linearly separable
• But we don’t want to suffer a performance penalty due to the higher dimension
Training & Scoring with Kernel
• Can simply replace Xi·Xj with K(Xi,Xj)
• Training
o Max: L(λ) = Σ λi – ½ ΣΣ λiλjzizj K(Xi,Xj)
o Subject to: Σ λizi = 0 and C ≥ λi ≥ 0
o Where C is specified by the user
• Scoring: Given X = (x,y)
o Compute f(X) = Σ λizi K(Xi,X) + b
o If f(X) < 0, then X is “blue”; else “red” (sketch below)
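A minimal kernelized scoring sketch (the Gaussian RBF kernel and σ value are one possible choice; lam, z, X, b are assumed to come from training with the same kernel):

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    # Gaussian radial-basis kernel: K(a, b) = exp(-||a - b||^2 / (2 sigma^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kernel_score(Xnew, X, z, lam, b, sigma=1.0):
    # f(X) = sum_i lam_i z_i K(Xi, X) + b -- the dot product replaced by K
    return (lam * z) @ rbf(X, Xnew[None, :], sigma)[:, 0] + b
```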
Kernel Trick
• No need to map the input to feature space
• We don’t even need to know ϕ
o Only need to know the kernel function K
• Bottom line
o Obtain the benefit of working in a higher dimension space (linearly separable)…
o …with no significant performance penalty
o That’s really an awesome trick
Popular Kernels
• Polynomial learning machine
K(Xi,Xj) = (Xi·Xj + 1)^p
• Gaussian radial-basis function
K(Xi,Xj) = exp(-(Xi – Xj)·(Xi – Xj)/(2σ²))
• Two-layer perceptron
K(Xi,Xj) = tanh(β0 Xi·Xj + β1)
• Many other possibilities
o Selecting the “right” kernel is the real trick (see below)
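For reference, these three kernels correspond to built-in options in scikit-learn's SVC (a library mapping, not something the slides specify; parameter names are scikit-learn's):

```python
from sklearn.svm import SVC

models = {
    "polynomial":   SVC(kernel="poly", degree=3, coef0=1),  # (Xi.Xj + 1)^p, p = degree
    "gaussian RBF": SVC(kernel="rbf", gamma=0.5),           # gamma plays the role of 1/(2 sigma^2)
    "perceptron":   SVC(kernel="sigmoid", coef0=1),         # tanh(gamma Xi.Xj + coef0)
}
```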
SVM +’s and –’s
• Strengths
o In training, we obtain a global maximum, not just a local max
o Can tradeoff margin and training errors
o Kernel trick is totally awesome
• Weaknesses
o Choosing a kernel is more art than science
o Success depends heavily on the kernel choice
References
• R. Berwick, An idiot’s guide to support vector machines
• E. Kim, Everything you wanted to know about the kernel trick (but were too afraid to ask)
• M. Law, A simple introduction to support vector machines
• W.S. Noble, What is a support vector machine?, Nature Biotechnology, 24(12):1565-1567, 2006
References: Lagrange Multipliers
• D. Klein, Lagrange multipliers without permanent scarring
• Wikipedia, Lagrange multiplier