Support Vector Machines
Mark Stamp
Dec 30, 2015
Supervised vs Unsupervised
• Often we use supervised learning
o That is, training relies on labeled data
o Training data must be pre-processed
• In contrast, unsupervised learning
o That is, uses unlabeled data
o No pre-processing required for training
• Also semi-supervised algorithms
o Supervised, but not too much…
HMM for Supervised Learning
• Suppose we want to use an HMM for malware detection
• Train a model on a set of malware
o All from a particular family
o Labeled as malware of that type
o Test to see how well it distinguishes
• This is an example of supervised learning
Semi-Supervised Learning
• Recall the HMM for English text example
• Using N = 2, we find hidden states correspond to…
o Consonants and vowels
o We did not specify consonants/vowels
o HMM extracted this info from raw data
• Semi-supervised learning?
o Maybe, depending on definitions…
Unsupervised Learning
• Clustering
o Good example of unsupervised learning
o The only example?
• For a mixed dataset, the goal of clustering is to reveal hidden structure
• No pre-processing
o Often no idea how to pre-process
o Usually used in “data exploration” mode
Supervised Learning
• English text example
o Preprocess by marking consonants and vowels
o Then train on this labeled data
• SVM is one of the most popular supervised learning methods
• Here, we only consider binary classification
o Only 2 classes, such as consonant vs vowel
o Other examples of binary classification?
Support Vector Machine
• SVM is based on 3 main ideas
1. Maximize the “margin”
o Max separation between classes
2. Work in a higher dimensional space
o More “room”, so easier to separate
3. Kernel trick
o This is intimately related to 2
• Both 1 and 2 are fairly intuitive
Separating Classes
• Consider labeled data for a binary classifier
o Denote the red class as 1
o And blue as class -1
• Easy to see the separation
• How to separate?
o We’ll use a “hyperplane”…
o …which is a line in 2-d
Separating Hyperplanes
• Consider labeled data
o Easy to separate
• Draw a hyperplane to separate the points
o Classify new data based on the separating hyperplane
o But which hyperplane is best?
o And why?
Maximize Margin
• Margin is the minimum distance to misclassifications
• Maximize the margin
o So, the yellow hyperplane is better than the purple
• Seems like a good idea
o But, not always so easy
o See next slide…
Separating… NOT
• What about this case?
• The yellow line is not an option
o Why not?
o No longer “separating”
• What to do?
o Allow for some errors
o Hyperplane need not completely separate
Soft Margin
• Ideally, large margin and no errors
• But allowing some misclassifications might increase the margin by a lot
o Relax the separating requirement
• How many errors to allow?
o User-defined parameter
o Tradeoff errors vs larger margin
o In practice, find the “best” value by trial and error (as sketched below)
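A minimal sketch of this trial-and-error search, using scikit-learn's SVC (the slides do not name any particular library; the toy data and the grid of C values are assumptions):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy 2-d data: two overlapping blobs, labeled -1 and 1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
z = np.array([-1] * 50 + [1] * 50)

# C controls the error-vs-margin tradeoff; try several values with cross-validation
search = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
search.fit(X, z)
print(search.best_params_)  # the "best" tradeoff found by trial and error
```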
Feature Space
• Transform data to “feature space”
o Feature space is in a higher dimension
o But usually we try to reduce dimensionality
• Q: Why increase dimensionality???
• A: Easier to separate in feature space
• Goal is to make data “linearly separable”
o Want to separate classes with a hyperplane
o But not pay a price for high dimensionality
Higher Dimensional Space
• Why transform to a “higher” dimension?
o One advantage is that a nonlinear boundary can become linear
[Figure: a map ϕ from input space to feature space (pretend it’s in a higher dimension)]
Cool Picture
• A real example of what can happen by transforming to a higher dimension
Feature Space
• Usually, higher dimension is bad news
o From a computational complexity POV
o The so-called “curse of dimensionality”
• But a higher dimension feature space can make data linearly separable
• Can we have our cake and eat it too?
o Linearly separable and easy to compute
• Yes, thanks to the kernel trick
Kernel Trick
• Enables us to work in input space
o With results mapped to feature space
o No work done explicitly in feature space
• Computations in input space
o Lower dimension, so computation is easier
• Results actually in feature space
o Higher dimension, so easier to separate
• Very cool trick!
Kernel Trick
• Unfortunately, to understand the kernel trick, we must dig a little deeper
• This also makes other aspects clearer
• We won’t cover every detail here
• Just enough to get the idea across
o Well, maybe a little more than that…
• We need Lagrange multipliers
o But first, constrained optimization
Constrained Optimization
• “No brainer” example
• Maximize: f(x) = 4 – x² subject to x – 1 = 0
• Solution?
o Max is at x = 1
o Max value is f(1) = 3
• Consider the more general case next…
[Figure: plot of f(x) = 4 – x² with the constrained maximum at x = 1]
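As a preview of the machinery introduced next, here is this no-brainer example worked with a Lagrange multiplier (a sketch; the slides solve it by inspection):

```latex
L(x,\lambda) = 4 - x^2 + \lambda(x - 1), \qquad
\frac{\partial L}{\partial x} = -2x + \lambda = 0, \quad
\frac{\partial L}{\partial \lambda} = x - 1 = 0
\implies x = 1,\ \lambda = 2,\ f(1) = 3
```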
Lagrange Multipliers
• Optimize f(x,y) subject to g(x,y) = c
• Define the Lagrangian
L(x,y,λ) = f(x,y) + λ (g(x,y) – c)
• “Stationary points” of L are possible solutions to the original problem
o All solutions must be stationary points
o Not all stationary points are solutions
• Generalize: more variables/constraints
Stationary Points
• Has nothing to do with fancy paper
o That’s stationery, not stationary…
• A stationary point means the partial derivatives are all 0, that is
∂L/∂x = 0 and ∂L/∂y = 0 and ∂L/∂λ = 0
• As mentioned, this generalizes to…
o More variables in functions f and g
o More constraints: Σ λi (gi(x,y) – ci)
A Realistic Example
• Lots of cool geometric examples
• We look at something different
• Consider a discrete probability distribution on n points: p1,p2,p3,…,pn
• What distribution has max entropy?
o Maximize the entropy function
o Subject to the constraint that the pj form a probability distribution
Maximize Entropy
• Shannon entropy is –Σ pj log2 pj
• What is a probability distribution?
o Require 0 ≤ pj ≤ 1 for all j, and Σ pj = 1
• So, we want to solve the following:
o Maximize f(p1,…,pn) = –Σ pj log2 pj
o Subject to the constraint Σ pj = 1
• How should we solve this?
o Do you really have to ask?
Entropy Example
• Recall L(x,y,λ) = f(x,y) + λ (g(x,y) – c)
• Problem statement
o Maximize f(p1,…,pn) = –Σ pj log2 pj
o Subject to the constraint Σ pj = 1
• In this case, the Lagrangian is
L(p1,…,pn,λ) = –Σ pj log2 pj + λ (Σ pj – 1)
• Compute partial derivatives wrt each pj and the partial derivative wrt λ
Entropy Example
• Have L(p1,…,pn,λ) = –Σ pj log2 pj + λ (Σ pj – 1)
• The partial derivative wrt any pj yields
–log2 pj – 1/ln(2) + λ = 0   (#)
• And wrt λ yields the constraint
Σ pj – 1 = 0, or Σ pj = 1   (##)
• Equation (#) implies all pj are equal
• With equation (##), all pj = 1/n
• Conclusion? (see below)
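Finishing the argument (a short completion; the slides leave the conclusion as a question): the uniform distribution maximizes entropy, and the maximum value is

```latex
p_j = \frac{1}{n} \implies
H = -\sum_{j=1}^{n} \frac{1}{n}\log_2\frac{1}{n} = \log_2 n
```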
Notation
• Let x = (x1,x2,…,xn) and λ = (λ1,λ2,…,λm)
• Then we write the Lagrangian as
L(x,λ) = f(x) + Σ λi (gi(x) – ci)
• Note: L is a function of n+m variables
• Can view the problem as follows
o The gi functions define a feasible region
o Maximize f over this feasible region
Lagrange Multiplier Example
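The example on this slide is a figure; as a stand-in, here is a small worked example in the same spirit (the function and constraint are assumptions, not the slide's):

```latex
\text{Maximize } f(x,y) = xy \text{ subject to } x + y = 4, \qquad
L(x,y,\lambda) = xy + \lambda(x + y - 4)
```
```latex
\frac{\partial L}{\partial x} = y + \lambda = 0, \quad
\frac{\partial L}{\partial y} = x + \lambda = 0, \quad
\frac{\partial L}{\partial \lambda} = x + y - 4 = 0
\implies x = y = 2,\ f(2,2) = 4
```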
Lagrangian Duality
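This slide's content is a figure; as a hedged summary of the standard facts it presumably covers: every constrained minimization (the primal problem) has an associated dual problem, obtained by maximizing the Lagrangian over the multipliers, and the dual optimum bounds the primal optimum (weak duality), with equality (strong duality) under convexity conditions that SVM training satisfies. Schematically:

```latex
\underbrace{\min_{x}\,\max_{\lambda \ge 0} L(x,\lambda)}_{\text{primal}}
\;\ge\;
\underbrace{\max_{\lambda \ge 0}\,\min_{x} L(x,\lambda)}_{\text{dual}}
```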
Lagrange Multipliers and SVM
• Lagrange multipliers are very cool indeed
o But what does this have to do with SVM?
• Can view the (soft) margin computation as a constrained optimization problem
o In this form, the kernel trick will be clear
• We can kill 2 birds with 1 stone
o Make the margin calculation clearer
o Make the kernel trick perfectly clear
Problem Setup
• Let X1,X2,…,Xn be data points
o Each Xi = (xi,yi) is a point in the plane
o In general, could be higher dimension
• Let z1,z2,…,zn be the corresponding class labels, where each zi ∈ {-1,1}
o Where zi = 1 if classified as “red” type
o And zi = -1 if classified as “blue” type
• Note that this is a binary classifier
Geometric View
• Equation of the yellow line: w1x + w2y + b = 0
• Equation of the red line: w1x + w2y + b = 1
• Equation of the blue line: w1x + w2y + b = -1
• Margin is the distance between red and blue
Geometric View
• All red points X = (x,y) satisfy
w1x + w2y + b ≥ 1
• All blue points X = (x,y) satisfy
w1x + w2y + b ≤ -1
• Want the inequalities all true after training
Geometric View
• With the lines defined…
• Given a new data point X = (x,y) to classify
o “Red” provided that w1x + w2y + b > 0
o “Blue” provided that w1x + w2y + b < 0
• This is the scoring phase
Geometric View
• The real question is...
• How to find the equation of the yellow line?
o Given {Xi} and {zi}
o Where Xi is a point in the plane
o And zi is its classification
• Finding the yellow line is the training phase…
Geometric View
• Distance from the origin to the line Ax + By + C = 0 is
|C| / sqrt(A² + B²)
• Origin to red line: |1-b| / ||W||, where W = (w1,w2)
• Origin to blue line: |-1-b| / ||W||
• Margin is m = 2/||W|| (see below)
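The margin value follows in one step: the distance between two parallel lines W·X + b = c1 and W·X + b = c2 is |c1 – c2| / ||W||, so for the red (c1 = 1) and blue (c2 = -1) lines

```latex
m = \frac{|1 - (-1)|}{\|W\|} = \frac{2}{\|W\|}
```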
Training Phase
• Given {Xi} and {zi}, find the largest margin m that classifies all points correctly
• Want to find the red and blue lines in the picture
• Recall the red line is of the form
w1x + w2y + b = 1
• The blue line is of the form
w1x + w2y + b = -1
• And maximize the margin: m = 2/||W||
Training
• Since zi ∈ {-1,1}, correct classification occurs provided
zi (w1xi + w2yi + b) ≥ 1 for all i
• Training problem to solve:
o Maximize: m = 2/||W||
o Subject to the constraints:
zi (w1xi + w2yi + b) ≥ 1 for i = 1,2,…,n
• Can we determine W and b?
Training
• The problem on the previous slide is equivalent to the following
• Minimize: F(W) = ||W||² / 2 = (w1² + w2²) / 2
o Maximizing 2/||W|| is the same as minimizing ||W||; squaring just makes the calculus nicer
• Subject to the constraints:
1 - zi (w1xi + w2yi + b) ≤ 0 for all i
• Should be starting to look familiar…
Lagrangian
• Ignoring the inequalities, we have…
L(w1,w2,b,λ) = (w1² + w2²) / 2 + Σ λi (1 - zi (w1xi + w2yi + b))
• Compute
∂L/∂w1 = w1 - Σ λizixi = 0
∂L/∂w2 = w2 - Σ λiziyi = 0
∂L/∂b = -Σ λizi = 0
∂L/∂λi = 1 - zi (w1xi + w2yi + b) = 0
Lagrangian
• The derivatives yield the constraints and
W = Σ λiziXi and Σ λizi = 0
• Substituting these into L yields
L(λ) = Σ λi – ½ ΣΣ λiλjzizj (Xi·Xj)
• Where “·” is the dot product: Xi·Xj = xixj + yiyj
• Here, L is only a function of λ
o We still have the constraint Σ λizi = 0
o Note: If we find the λi then we know W
New-and-Improved Problem
• Max: L(λ) = Σ λi – ½ ΣΣ λiλjzizj (Xi·Xj)
• Subject to: Σ λizi = 0 and all λi ≥ 0
• Why maximize L(λ)? The intuition may be…
o Goal is to minimize F(W) = (w1² + w2²) / 2
o Subject to the constraints in the L(λ) function
o Maximizing L(λ) finds the “best” parameters λ
o And the “best” λ will solve this min problem
o This version is known as the dual problem
Dual Version of Problem
• Max: L(λ) = Σ λi – ½ ΣΣ λiλjzizj (Xi·Xj)
• Subject to: Σ λizi = 0 and all λi ≥ 0
• Note that this is the dual problem
• Can always solve it (if a solution exists)
o And will find a global maximum
• It doesn’t get any better than that!
o Note that with HMM (for example), there is no guarantee of a global maximum
All Together Now: Training
• Given data points X1,X2,…,Xn
• Label each Xi with zi ∈ {-1,1}
• Solve the dual problem (previous slide)
o Solving it yields λ
o Once λ is known, compute W = (w1,w2) and b
o Obtain the equation of the line: w1x + w2y + b = 0
• What have we accomplished? (a concrete sketch below)
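A minimal sketch of this training procedure in Python, solving the dual with a generic optimizer (scipy's SLSQP here; a production SVM would use a dedicated QP solver). The upper bound C anticipates the soft-margin condition C ≥ λi introduced later in the deck, and the tolerance for deciding λi > 0 is an assumption:

```python
import numpy as np
from scipy.optimize import minimize

def train_svm_dual(X, z, C=10.0):
    # Solve the dual: max sum(lam) - 1/2 sum_ij lam_i lam_j z_i z_j (Xi . Xj)
    n = len(z)
    K = X @ X.T                      # all pairwise dot products Xi . Xj
    def neg_dual(lam):               # minimize the negative of L(lambda)
        v = lam * z
        return -(lam.sum() - 0.5 * (v @ K @ v))
    res = minimize(neg_dual, np.zeros(n), method="SLSQP",
                   bounds=[(0.0, C)] * n,
                   constraints={"type": "eq", "fun": lambda lam: lam @ z})
    lam = res.x
    W = (lam * z) @ X                # W = sum_i lam_i z_i Xi
    sv = lam > 1e-6                  # support vectors (tolerance is assumed)
    b = np.mean(z[sv] - X[sv] @ W)   # from z_i (W . Xi + b) = 1 on the margin
    return lam, W, b

# usage: lam, W, b = train_svm_dual(X, z), with X of shape (n, 2), z in {-1, 1}
```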
All Together Now: Scoring
• From training, we find λ
o This yields W = (w1,w2) and b in w1x + w2y + b
• Given a new data point X = (x,y)
o That is, X not in the training set
• Compute w1x + w2y + b
o If greater than 0, classify X as red type
o Otherwise, classify X as blue type
• What happened, in terms of the picture?
Geometric Viewpoint
• Training?
o Find the equation of the yellow line, f(X)
• Score X = (x,y)?
o If f(X) > 0, then X is above the yellow line (classify as red)
o Else X is below the line (classify as blue)
Scoring Revisited
• Use the equation of the yellow line for scoring
• There is an alternative (better) way
o Let f(X) = w1x + w2y + b = W·X + b
o And recall that W = Σ λiziXi
• Then, f(X) = Σ λizi(Xi·X) + b (see the sketch below)
• Why is this better?
o No need to explicitly compute W
o Any better reasons why it’s better?
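A minimal sketch of this alternative scoring form (assumes training has produced arrays lam, z, X and the scalar b, e.g., from the training sketch above):

```python
import numpy as np

def score(Xnew, X, z, lam, b):
    # f(X) = sum_i lam_i z_i (Xi . Xnew) + b -- W is never formed explicitly
    return (lam * z) @ (X @ Xnew) + b

def classify(Xnew, X, z, lam, b):
    return "red" if score(Xnew, X, z, lam, b) > 0 else "blue"
```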
Support Vectors
• When solving L(λ), we find that mostly λi = 0
• Specifically, λi = 0 for the Xi for which
zi (w1xi + w2yi + b) > 1
• The only constraints that can matter are those where
zi (w1xi + w2yi + b) = 1
• The latter are the support vectors
o Not known in advance; training determines the support vectors
Support Vectors
• Picture worth 1k words?
• Where are the support vectors?
o Other vectors (training points) don’t matter
o Why not?
[Figure: separating line with margin; the points on the margin are the support vectors]
Scoring Re-revisited
• Score X using f(X) = Σ λizi(Xi·X) + b
• Generally, most of the λi are 0
• So, the sum is not really from i=1 to n
o Instead, the sum is actually from i=1 to s
o Where s is the number of support vectors
• Why does this matter?
o Typically, n is large and s is small, so scoring is fast
o And this form of f(X) is very useful…
Training: Soft Margin
• Suppose we relax “linearly separable”
• Tradeoff errors for a bigger margin m
o More errors, but gain a bigger margin
• Note that 2 kinds of errors are illustrated
[Figure: soft margin m with errors inside the margin and on the wrong side of the line]
Errors
• To account for errors, introduce “slack variables” εi ≥ 0 into the optimization
• For a red point Xi = (xi,yi), the constraint is
w1xi + w2yi + b ≥ 1 - εi
• For a blue point Xi = (xi,yi), the constraint is
w1xi + w2yi + b ≤ -1 + εi
• Minimize: ||W||²/2 + C Σ εi
• Subject to the constraints above
Dual Problem
• Working through the details, the dual problem is…
• Max: L(λ) = Σ λi – ½ ΣΣ λiλjzizj (Xi·Xj)
• Subject to: Σ λizi = 0 and C ≥ λi ≥ 0
o Note that this is the same as before…
o …except for the C ≥ λi condition
• We specify C when training
• The non-linearly separable case is very similar to the linearly separable case
Training and Scoring Re-re-revisited
• Training
o Maximize: L(λ) = Σ λi – ½ ΣΣ λiλjzizj (Xi·Xj)
o Subject to: Σ λizi = 0 and C ≥ λi ≥ 0
o Where C is specified by the user
• Scoring: Given X = (x,y)
o Compute f(X) = Σ λizi(Xi·X) + b, where the sum is over the support vectors
o If f(X) < 0, then X is “blue”; else it’s “red”
Kernel Trick
• Finally, we can make sense of the kernel trick
• Recall X1,X2,…,Xn are the training vectors
o For training, the Xi only appear as Xi·Xj
o When scoring X, the Xi only appear as Xi·X
• The dot product is a type of inner product
o There are many other inner products
• Can replace “·” with any inner product
o E.g., one defined in higher dimensions
Kernel Example
• Suppose we define a map ϕ from 2-d input space into a higher dimension, e.g.,
ϕ(x,y) = (1, √2·x, √2·y, x², y², √2·xy)
• For Xi = (xi,yi) and Xj = (xj,yj), this ϕ satisfies
ϕ(Xi)·ϕ(Xj) = (1 + xixj + yiyj)²
• Define the kernel function K as
K(Xi,Xj) = (1 + xixj + yiyj)²
• Note: K is the composition of ϕ and “·”
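A quick numeric check of this identity (the ϕ above is one standard choice of feature map; the test points are arbitrary):

```python
import numpy as np

def phi(p):
    # feature map with phi(Xi) . phi(Xj) = (1 + Xi . Xj)^2
    x, y = p
    r2 = np.sqrt(2)
    return np.array([1, r2 * x, r2 * y, x * x, y * y, r2 * x * y])

Xi, Xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(Xi) @ phi(Xj))   # 4.0, computed in 6-d feature space
print((1 + Xi @ Xj) ** 2)  # 4.0, same value, computed in 2-d input space
```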
The Big Picture
• Training data lives in input space
o Where the data is not linearly separable
• Map input space to a higher dimension feature space using a function ϕ
• Do training & scoring in feature space
o Where the data is linearly separable
• But we don’t want to suffer a performance penalty due to the higher dimension
Training & Scoring with Kernel
• Can simply replace Xi·Xj with K(Xi,Xj)
• Training
o Max: L(λ) = Σ λi – ½ ΣΣ λiλjzizj K(Xi,Xj)
o Subject to: Σ λizi = 0 and C ≥ λi ≥ 0
o Where C is specified by the user
• Scoring: Given X = (x,y)
o Compute f(X) = Σ λizi K(Xi,X) + b
o If f(X) < 0, then X is “blue”; else “red” (sketch below)
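A minimal kernelized scoring sketch (the Gaussian RBF kernel and σ value are one possible choice; lam, z, X, b are assumed to come from training with the same kernel):

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    # Gaussian radial-basis kernel: K(a, b) = exp(-||a - b||^2 / (2 sigma^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kernel_score(Xnew, X, z, lam, b, sigma=1.0):
    # f(X) = sum_i lam_i z_i K(Xi, X) + b -- the dot product replaced by K
    return (lam * z) @ rbf(X, Xnew[None, :], sigma)[:, 0] + b
```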
Kernel Trick
• No need to map the input to feature space
• We don’t even need to know ϕ
o Only need to know the kernel function K
• Bottom line
o Obtain the benefit of working in a higher dimension space (linearly separable)…
o …with no significant performance penalty
o That’s really an awesome trick
Popular Kernels
• Polynomial learning machine
K(Xi,Xj) = (Xi·Xj + 1)^p
• Gaussian radial-basis function
K(Xi,Xj) = exp(-(Xi – Xj)·(Xi – Xj)/(2σ²))
• Two-layer perceptron
K(Xi,Xj) = tanh(β0 Xi·Xj + β1)
• Many other possibilities
o Selecting the “right” kernel is the real trick (see below)
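For reference, these three kernels correspond to built-in options in scikit-learn's SVC (a library mapping, not something the slides specify; parameter names are scikit-learn's):

```python
from sklearn.svm import SVC

models = {
    "polynomial":   SVC(kernel="poly", degree=3, coef0=1),  # (Xi.Xj + 1)^p, p = degree
    "gaussian RBF": SVC(kernel="rbf", gamma=0.5),           # gamma plays the role of 1/(2 sigma^2)
    "perceptron":   SVC(kernel="sigmoid", coef0=1),         # tanh(gamma Xi.Xj + coef0)
}
```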
SVM +’s and –’s
• Strengths
o In training, we obtain a global maximum, not just a local max
o Can tradeoff margin and training errors
o Kernel trick is totally awesome
• Weaknesses
o Choosing a kernel is more art than science
o Success depends heavily on the kernel choice
References
• R. Berwick, An idiot’s guide to support vector machines
• E. Kim, Everything you wanted to know about the kernel trick (but were too afraid to ask)
• M. Law, A simple introduction to support vector machines
• W.S. Noble, What is a support vector machine?, Nature Biotechnology, 24(12):1565-1567, 2006
References: Lagrange Multipliers
• D. Klein, Lagrange multipliers without permanent scarring
• Wikipedia, Lagrange multiplier