
1

Support Vector Machines

SVM

Mark Stamp


2

Supervised vs Unsupervised

Often use supervised learning
o That is, training relies on labeled data
o Training data must be pre-processed

In contrast, unsupervised learning
o That is, uses unlabeled data
o No pre-processing required for training

Also semi-supervised algorithms
o Supervised, but not too much…


3

HMM for Supervised Learning

Suppose we want to use HMM for malware detection

Train model on set of malware
o All from a particular family
o Labeled as malware of that type
o Test to see how well it distinguishes

This is example of supervised learning


4

Semi-Supervised Learning

Recall HMM for English text example

Using N = 2, we find hidden states correspond to…
o Consonants and vowels
o We did not specify consonants/vowels
o HMM extracted this info from raw data

Semi-supervised learning?
o Maybe, depending on definitions…


5

Unsupervised Learning

Clustering
o Good example of unsupervised learning
o The only example?

For mixed dataset, goal of clustering is to reveal hidden structure

No pre-processing
o Often no idea how to pre-process
o Usually used in “data exploration” mode


6

Supervised Learning

English text example
o Preprocess by marking consonants and vowels
o Then train on this labeled data

SVM is one of the most popular supervised learning methods

Here, only consider binary classification
o Only 2 classes, such as consonant vs vowel
o Other examples of binary classification?


7

Support Vector Machine

SVM based on 3 main ideas
1. Maximize the “margin”
o Max separation between classes
2. Work in a higher dimensional space
o More “room”, so easier to separate
3. Kernel trick
o This is intimately related to 2

Both 1 and 2 are fairly intuitive


8

Separating Classes

Consider labeled data

Binary classifier
o Denote red class as 1
o And blue is class -1

Easy to see separation

How to separate?
o We’ll use a “hyperplane”…
o …which is a line in 2-d


9

Separating Hyperplanes

Consider labeled data
o Easy to separate

Draw a hyperplane to separate points
o Classify new data based on separating hyperplane
o But which hyperplane is best?
o And why?


10

Maximize Margin

Margin is min distance to misclassifications

Maximize the margin
o So, yellow hyperplane is better than purple

Seems like a good idea
o But, not always so easy
o See next slide…


11

Separating… NOT

What about this case?

Yellow line not an option
o Why not?
o No longer “separating”

What to do?
o Allow for some errors
o Hyperplane need not completely separate


12

Soft Margin

Ideally, large margin and no errors

But allowing some misclassifications might increase the margin by a lot
o Relax separating requirement

How many errors to allow?
o User defined parameter
o Tradeoff errors vs larger margin
o In practice, find “best” by trial and error


13

Feature Space

Transform data to “feature space”
o Feature space is in higher dimension
o But usually we try to reduce dimensionality

Q: Why increase dimensionality???
A: Easier to separate in feature space

Goal is to make data “linearly separable”
o Want to separate classes with hyperplane
o But not pay a price for high dimensionality


14

Higher Dimensional Space

Why transform to “higher” dimension?
o One advantage is that a nonlinear boundary can become linear


[Figure: ϕ maps the input space to the feature space (pretend it’s in a higher dimension)]


15

Cool Picture

A real example of what can happen by transforming to higher dimension


16

Feature Space

Usually, higher dimension is bad news
o From computational complexity POV
o The so-called “curse of dimensionality”

But higher dimension feature space can make data linearly separable

Can we have our cake and eat it too?
o Linearly separable and easy to compute

Yes, thanks to the kernel trick


17

Kernel Trick

Enables us to work in input space
o With results mapped to feature space
o No work done explicitly in feature space

Computations in input space
o Lower dimension, so computation easier

Results actually in feature space
o Higher dimension, so easier to separate

Very cool trick!


18

Kernel Trick

Unfortunately, to understand kernel trick, must dig a little deeper

Also makes other aspects clearer

We won’t cover every detail here

Just enough to get idea across
o Well, maybe a little more than that…

We need Lagrange multipliers
o But first, constrained optimization


19

Constrained Optimization

“No brainer” example

Maximize: f(x) = 4 – x²  subject to  x – 1 = 0

Solution?
o Max is at x = 1
o Max value is f(1) = 3

Consider more general case next…

[Figure: graph of f(x) = 4 – x² with the constraint line x = 1 marked]


20

Lagrange Multipliers

Optimize f(x,y) subject to g(x,y) = c

Define the Lagrangian
L(x,y,λ) = f(x,y) + λ(g(x,y) – c)

“Stationary points” of L are possible solutions to original problem
o All solutions must be stationary points
o Not all stationary points are solutions

Generalize: More variables/constraints


21

Stationary Points

Has nothing to do with fancy paper
o That’s stationery, not stationary…

Stationary point means partial derivatives are all 0, that is
dL/dx = 0 and dL/dy = 0 and dL/dλ = 0

As mentioned, this generalizes to…
o More variables in functions f and g
o More constraints: Σ λi (gi(x,y) – ci)
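As a concrete check of the stationary-point condition, here is a minimal sketch (assuming Python with SymPy is available) applied to the “no brainer” example from the earlier slide: maximize f(x) = 4 – x² subject to x – 1 = 0.

```python
# Minimal sketch: Lagrange multipliers for the "no brainer" example,
# maximize f(x) = 4 - x^2 subject to x - 1 = 0 (assumes SymPy).
import sympy as sp

x, lam = sp.symbols("x lam", real=True)
f = 4 - x**2                     # objective
g = x - 1                        # constraint, g(x) = 0
L = f + lam * g                  # Lagrangian L(x, lam)

# Stationary points: set all partial derivatives of L to 0
sol = sp.solve([sp.diff(L, x), sp.diff(L, lam)], [x, lam], dict=True)
print(sol)                       # [{x: 1, lam: 2}]
print(f.subs(sol[0]))            # 3, the constrained maximum f(1) = 3
```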


22

A Realistic Example

Lots of cool geometric examples

We look at something different

Consider discrete probability distribution on n points: p1,p2,p3,…,pn

What distribution has max entropy?
o Maximize entropy function
o Subject to constraint that pj form a probability distribution


23

Maximize Entropy

Shannon entropy is –Σ pj log2 pj

What is a probability distribution?
o Require 0 ≤ pj ≤ 1 for all j, and Σ pj = 1

So, we want to solve the following:
o Maximize f(p1,…,pn) = –Σ pj log2 pj
o Subject to constraint Σ pj = 1

How should we solve this?
o Do you really have to ask?


24

Entropy Example

Recall L(x,y,λ) = f(x,y) + λ(g(x,y) – c)

Problem statement
o Maximize f(p1,…,pn) = –Σ pj log2 pj
o Subject to constraint Σ pj = 1

In this case, Lagrangian is
L(p1,…,pn,λ) = –Σ pj log2 pj + λ(Σ pj – 1)

Compute partial derivatives wrt each pj and partial derivative wrt λ


25

Entropy Example

Have L(p1,…,pn,λ) = –Σ pj log2 pj + λ(Σ pj – 1)

Partial derivative wrt any pj yields
–(log2 pj + 1/ln(2)) + λ = 0   (#)

And wrt λ yields the constraint
Σ pj – 1 = 0, or Σ pj = 1   (##)

Equation (#) implies all pj are equal

With equation (##), all pj = 1/n

Conclusion?
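A numerical sanity check of this conclusion, as a sketch assuming NumPy and SciPy are available (the choice n = 5 and the random starting distribution are arbitrary):

```python
# Numerically confirm that the max-entropy distribution on n points is uniform.
import numpy as np
from scipy.optimize import minimize

n = 5

def neg_entropy(p):
    # Minimizing sum p_j log2 p_j is the same as maximizing H = -sum p_j log2 p_j
    return np.sum(p * np.log2(p))

constraint = {"type": "eq", "fun": lambda p: np.sum(p) - 1.0}   # sum p_j = 1
bounds = [(1e-9, 1.0)] * n                                      # 0 <= p_j <= 1

p0 = np.random.dirichlet(np.ones(n))          # arbitrary starting distribution
res = minimize(neg_entropy, p0, bounds=bounds, constraints=[constraint])

print(np.round(res.x, 4))                     # ~[0.2 0.2 0.2 0.2 0.2], i.e. p_j = 1/n
print(-res.fun, np.log2(n))                   # both ~ log2(5)
```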


26

Notation

Let x = (x1,x2,…,xn) and λ = (λ1,λ2,…,λm)

Then we write Lagrangian as
L(x,λ) = f(x) + Σ λi (gi(x) – ci)

Note: L is a function of n+m variables

Can view the problem as follows
o The gi functions define a feasible region
o Maximize f over this feasible region


27

Lagrange Multiplier Example


28

Lagrangian Duality


29

Lagrange Multipliers and SVM

Lagrange multipliers very cool indeed
o But what does this have to do with SVM?

Can view (soft) margin computation as constrained optimization problem
o In this form, kernel trick will be clear

We can kill 2 birds with 1 stone
o Make margin calculation clearer
o Make kernel trick perfectly clear


30

Problem Setup

Let X1,X2,…,Xn be data points
o Each Xi = (xi,yi) a point in the plane
o In general, could be higher dimension

Let z1,z2,…,zn be corresponding class labels, where each zi ∈ {-1,1}
o Where zi = 1 if classified as “red” type
o And zi = -1 if classified as “blue” type

Note that this is a binary classifier


31

Geometric View

Equation of yellow line:
w1x + w2y + b = 0

Equation of red line:
w1x + w2y + b = 1

Equation of blue line:
w1x + w2y + b = -1

Margin is distance between red and blue


32

Geometric View

All red points X = (x,y) satisfy
w1x + w2y + b ≥ 1

All blue points X = (x,y) satisfy
w1x + w2y + b ≤ -1

Want inequalities all true after training


33

Geometric View

With lines defined…

Given new data point X = (x,y) to classify
o “Red” provided that w1x + w2y + b > 0
o “Blue” provided that w1x + w2y + b < 0

This is the scoring phase


34

Geometric View

The real question is...

How to find equation of the yellow line?
o Given {Xi} and {zi}
o Where Xi a point in plane
o And zi its classification

Finding yellow line is the training phase…


35

Geometric View

Distance from origin to line Ax + By + C = 0 is
|C| / sqrt(A² + B²)

Origin to red line: |1 – b| / ||W||, where W = (w1,w2)

Origin to blue line: |-1 – b| / ||W||

Margin is m = 2/||W||


36

Training Phase

Given {Xi} and {zi}, find largest margin m that classifies all points correctly

Want to find red, blue lines in picture

Recall red line is of the form
w1x + w2y + b = 1

Blue line is of the form
w1x + w2y + b = -1

And maximize margin: m = 2/||W||


37

Training

Since zi ∈ {-1,1}, correct classification occurs provided
zi(w1xi + w2yi + b) ≥ 1 for all i

Training problem to solve:
o Maximize: m = 2/||W||
o Subject to constraints: zi(w1xi + w2yi + b) ≥ 1 for i = 1,2,…,n

Can we determine W and b?


38

Training

The problem on previous slide is equivalent to the following

Minimize: F(W) = ||W||²/2 = (w1² + w2²)/2

Subject to constraints: 1 - zi (w1xi + w2yi + b) ≤ 0 for all i

Should be starting to look familiar…
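Below is a minimal sketch (assuming NumPy and SciPy; the four 2-d points and their labels are made up for illustration) of solving this primal problem directly with a general-purpose constrained optimizer:

```python
# Sketch of the primal training problem on toy data:
# minimize ||W||^2 / 2 subject to 1 - z_i (w1*x_i + w2*y_i + b) <= 0 for all i.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0],     # "red" points, label +1
              [0.0, 0.0], [1.0, 0.0]])    # "blue" points, label -1
z = np.array([1.0, 1.0, -1.0, -1.0])

def objective(params):
    w1, w2, b = params
    return 0.5 * (w1**2 + w2**2)

def margin_constraints(params):
    w1, w2, b = params
    # SciPy wants inequality constraints in the form "fun(params) >= 0"
    return z * (X[:, 0] * w1 + X[:, 1] * w2 + b) - 1.0

res = minimize(objective, x0=[1.0, 1.0, 0.0],
               constraints=[{"type": "ineq", "fun": margin_constraints}])
w1, w2, b = res.x
print(w1, w2, b)                 # separating line: w1*x + w2*y + b = 0
print(2 / np.hypot(w1, w2))      # margin m = 2 / ||W||
```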


39

Lagrangian

Ignoring inequalities, we have…
L(w1,w2,b,λ) = (w1² + w2²)/2 + Σ λi(1 – zi(w1xi + w2yi + b))

Compute
dL/dw1 = w1 – Σ λizixi = 0
dL/dw2 = w2 – Σ λiziyi = 0
dL/db = –Σ λizi = 0
dL/dλi = 1 – zi(w1xi + w2yi + b) = 0


40

Lagrangian

Derivatives yield constraints and
W = Σ λiziXi and Σ λizi = 0

Substitute these into L yields
L(λ) = Σ λi – ½ ΣΣ λiλjzizj (Xi·Xj)

Where “·” is dot product: Xi·Xj = xixj + yiyj

Here, L is only a function of λ
o We still have the constraint Σ λizi = 0
o Note: If we find λi then we know W


41

New-and-Improved Problem

Max: L(λ) = Σ λi – ½ ΣΣ λiλjzizj (Xi·Xj)

Subject to: Σ λizi = 0 and all λi ≥ 0

Why maximize L(λ)? Intuition may be…
o Goal is to minimize F(W) = (w1² + w2²)/2
o Subject to constraints in L(λ) function
o Maximize L(λ) finds “best” parameters λ
o And “best” λ will solve this min problem
o This version is known as the dual problem


42

Dual Version of Problem

Max: L(λ) = Σ λi – ½ ΣΣ λiλjzizj (Xi·Xj)

Subject to: Σ λizi = 0 and all λi ≥ 0

Note that this is the dual problem

Can always solve it (if solution exists)
o And will find a global maximum

It doesn’t get any better than that!
o Note that with HMM (for example), no guarantee of global maximum
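For concreteness, here is a sketch (assuming NumPy and SciPy) of solving this dual on the same toy points used in the primal sketch above, then recovering W and b from the λi:

```python
# Sketch of the dual problem: maximize
#   L(lam) = sum(lam_i) - 1/2 sum_ij lam_i lam_j z_i z_j (X_i . X_j)
# subject to sum(lam_i z_i) = 0 and lam_i >= 0.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
z = np.array([1.0, 1.0, -1.0, -1.0])
G = (z[:, None] * z[None, :]) * (X @ X.T)       # z_i z_j (X_i . X_j)

def neg_dual(lam):
    # SciPy minimizes, so maximize L(lam) by minimizing -L(lam)
    return 0.5 * lam @ G @ lam - np.sum(lam)

res = minimize(neg_dual, x0=np.full(len(z), 0.1),
               bounds=[(0.0, None)] * len(z),
               constraints=[{"type": "eq", "fun": lambda lam: lam @ z}])
lam = res.x

W = (lam * z) @ X                               # W = sum_i lam_i z_i X_i
sv = np.argmax(lam)                             # any support vector (lam_i > 0)
b = z[sv] - W @ X[sv]                           # from z_i (W . X_i + b) = 1
print(np.round(lam, 3), W, b)
```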


43

All Together Now: Training

Given data points X1,X2,…,Xn

Label each Xi with zi ∈ {-1,1}

Solve dual problem (previous slide)
o Solving it yields λ
o Once λ known, compute W = (w1,w2) and b
o Obtain equation of line: w1x + w2y + b = 0

What have we accomplished?


44

All Together Now: Scoring

From training, find λ
o Yields W = (w1,w2) and b in w1x + w2y + b

Given new data point X = (x,y)
o That is, X not in training set

Compute w1x + w2y + b
o If greater than 0, classify X as red type
o Otherwise, classify X as blue type

What happened, in terms of picture?


45

Geometric Viewpoint

Training?
o Find equation of yellow line, f(X)

Score X = (x,y)?
o If f(X) > 0, then X is above yellow line (classify as red)
o Else X below line (classify as blue)


46

Scoring Revisited

Use equation of yellow line for scoring

There is an alternative (better) way
o Let f(X) = w1x + w2y + b = W·X + b
o And recall that W = Σ λiziXi

Then, f(X) = Σ λizi(Xi·X) + b

Why is this better?
o No need to explicitly compute W
o Any better reasons why it’s better?


47

Support Vectors

When solving L(λ), we find mostly λi = 0

Specifically, λi = 0 for the Xi for which
zi(w1xi + w2yi + b) > 1

Only constraints that can matter are
zi(w1xi + w2yi + b) = 1

The latter are support vectors
o Not known in advance; training determines the support vectors


48

Support Vectors

Picture worth 1k words?

Where are the support vectors?
o Other vectors (training points) don’t matter
o Why not?


[Figure: labeled training points in the (x,y) plane, with the support vectors highlighted]


49

Scoring Re-revisited

Score X using f(X) = Σ λizi(Xi·X) + b

Generally, most of the λi are 0

So, sum is not really from i=1 to n
o Instead, sum actually from i=1 to s
o Where s is number of support vectors

Why does this matter?
o Typically, n large, s small, so fast scoring
o And this form of f(X) is very useful…
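As a sketch of this form of scoring (reusing the lam, z, X, and b computed in the dual-problem sketch earlier; the function name and tolerance are arbitrary):

```python
# Dual-form scoring: f(X) = sum over support vectors of lam_i z_i (X_i . X) + b
import numpy as np

def score(x_new, lam, z, X, b, tol=1e-8):
    sv = lam > tol                   # most lam_i are 0; keep only the support vectors
    return np.sum(lam[sv] * z[sv] * (X[sv] @ x_new)) + b

# score(np.array([2.5, 2.0]), lam, z, X, b) > 0 would classify the point as "red"
```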


50

Training: Soft Margin

Suppose we relax “linearly separable”

Tradeoff errors for bigger margin m
o More errors, but gain bigger margin

Note that 2 kinds of errors illustrated


51

Errors

To account for errors, introduce “slack variables” εi ≥ 0 in optimization

For red point Xi = (xi,yi), constraint is
w1xi + w2yi + b ≥ 1 – εi

For blue point Xi = (xi,yi), constraint is
w1xi + w2yi + b ≤ -1 + εi

Minimize: ||W||²/2 + C Σ εi

Subject to constraints above
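A sketch of this soft-margin primal (assuming NumPy and SciPy; the toy points extend the earlier primal sketch with one deliberately misplaced "blue" point, and C = 10 is an arbitrary choice):

```python
# Soft-margin primal: minimize ||W||^2/2 + C * sum(eps_i), with slack eps_i >= 0
# and constraints z_i (w1*x_i + w2*y_i + b) >= 1 - eps_i.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0],               # "red", label +1
              [0.0, 0.0], [1.0, 0.0], [2.5, 2.5]])  # "blue", label -1 (last one misplaced)
z = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
C, n = 10.0, len(z)

def objective(params):
    (w1, w2, b), eps = params[:3], params[3:]
    return 0.5 * (w1**2 + w2**2) + C * np.sum(eps)

def constraints(params):
    (w1, w2, b), eps = params[:3], params[3:]
    # z_i (w1*x_i + w2*y_i + b) - 1 + eps_i >= 0 for every training point
    return z * (X @ np.array([w1, w2]) + b) - 1.0 + eps

res = minimize(objective, x0=np.zeros(3 + n),
               bounds=[(None, None)] * 3 + [(0.0, None)] * n,
               constraints=[{"type": "ineq", "fun": constraints}])
print(res.x[:3])   # w1, w2, b
print(res.x[3:])   # slack values; nonzero entries mark the margin violations
```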


52

Dual Problem

Work thru details, dual problem is…

Max: L(λ) = Σ λi – ½ ΣΣ λiλjzizj (Xi·Xj)

Subject to: Σ λizi = 0 and C ≥ λi ≥ 0
o Note that this is the same as before…
o …except for C ≥ λi condition

We specify C when training

Non-linearly separable case is very similar to linearly separable case
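In the dual-problem sketch shown earlier, this change amounts to one extra bound; a self-contained version (same made-up points as the soft-margin primal sketch, C = 10 as an example value):

```python
# Soft-margin dual sketch: identical to the hard-margin dual above except
# each lam_i is now bounded above by C.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0], [2.5, 2.5]])
z = np.array([1.0, 1.0, -1.0, -1.0, -1.0])      # last "blue" point is misplaced
G = (z[:, None] * z[None, :]) * (X @ X.T)
C = 10.0

res = minimize(lambda lam: 0.5 * lam @ G @ lam - np.sum(lam),
               x0=np.full(len(z), 0.1),
               bounds=[(0.0, C)] * len(z),       # the only change: 0 <= lam_i <= C
               constraints=[{"type": "eq", "fun": lambda lam: lam @ z}])
print(np.round(res.x, 3))                        # the misplaced point ends up at lam_i = C
```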


53

Training and Scoring Re-re-revisited

Training
o Maximize: L(λ) = Σ λi – ½ ΣΣ λiλjzizj (Xi·Xj)
o Subject to: Σ λizi = 0 and C ≥ λi ≥ 0
o Where C specified by user

Scoring: Given X = (x,y)
o Compute f(X) = Σ λizi(Xi·X) + b, where sum is over support vectors
o If f(X) < 0, then X is “blue”; else it’s “red”

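Since C is user-specified and found by trial and error, one common approach is a cross-validated grid search; here is a sketch assuming scikit-learn is available (the toy Gaussian data and the candidate C values are made up):

```python
# Choose C by cross-validated grid search over a small toy dataset.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal( 2.0, 1.0, size=(30, 2)),   # toy "red" class
               rng.normal(-2.0, 1.0, size=(30, 2))])  # toy "blue" class
z = np.array([1] * 30 + [-1] * 30)

search = GridSearchCV(SVC(kernel="linear"),
                      param_grid={"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
search.fit(X, z)
print(search.best_params_)       # the C with the best cross-validated accuracy
```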

54

Kernel Trick

Finally, can make sense of kernel trick

Recall X1,X2,…,Xn are training vectors
o For training, the Xi only appear as Xi·Xj
o When scoring X, the Xi only appear as Xi·X

Dot product is a type of inner product
o Many other inner products

Can replace “·” with any inner product
o E.g., one defined in higher dimensions


55

Kernel Example

Suppose we define a mapping ϕ

Then ϕ maps an element in 2-d to 5-d

For Xi = (xi,yi) and Xj = (xj,yj), we have
ϕ(Xi)·ϕ(Xj) = (1 + xixj + yiyj)²

Define the kernel function K as
K(Xi,Xj) = (1 + xixj + yiyj)²

Note: K is composition of ϕ and “·”
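The original ϕ was given in a figure that is not reproduced here; as an illustration, the sketch below (assuming NumPy) uses one standard feature map whose ordinary dot product reproduces this kernel, and checks the identity numerically on made-up points:

```python
# Numerical check that K(Xi, Xj) = (1 + xi*xj + yi*yj)^2 equals phi(Xi) . phi(Xj)
# for one feature map phi that realizes this kernel (an assumed choice here).
import numpy as np

def phi(p):
    x, y = p
    return np.array([1.0, np.sqrt(2)*x, np.sqrt(2)*y,
                     x*x, np.sqrt(2)*x*y, y*y])

def K(p, q):
    return (1.0 + p @ q) ** 2             # computed entirely in input space

Xi, Xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(Xi) @ phi(Xj))                   # 4.0 -- explicit trip through feature space
print(K(Xi, Xj))                           # 4.0 -- same value, no mapping needed
```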


56

The Big Picture

Training data lives in input space
o Where data is not linearly separable

Map input space to higher dimension feature space using a function ϕ

Do training & scoring in feature space
o Where data is linearly separable

But don’t want to suffer performance penalty due to higher dimension


57

Training & Scoring with Kernel

Can simply replace Xi·Xj with K(Xi,Xj)

Training
o Max: L(λ) = Σ λi – ½ ΣΣ λiλjzizj K(Xi,Xj)
o Subject to: Σ λizi = 0 and C ≥ λi ≥ 0
o Where C specified by user

Scoring: Given X = (x,y)
o Compute f(X) = Σ λizi K(Xi,X) + b
o If f(X) < 0, then X is “blue”; else “red”
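As an end-to-end sketch (assuming scikit-learn; the toy points are the same made-up ones used earlier), note that SVC’s “poly” kernel with gamma=1, coef0=1, degree=2 is exactly the (1 + Xi·Xj)² kernel from the earlier example:

```python
# Kernel-SVM training and scoring with scikit-learn.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
z = np.array([1, 1, -1, -1])                 # +1 "red", -1 "blue"

clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=10.0)
clf.fit(X, z)                                # training: solves the dual problem

print(clf.support_vectors_)                  # support vectors found in training
print(clf.decision_function([[2.5, 2.0]]))   # f(X); > 0 means "red"
print(clf.predict([[2.5, 2.0]]))             # class label, +1 ("red") or -1 ("blue")
```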


58

Kernel Trick

No need to map input to feature space

We don’t even need to know ϕ
o Only need to know kernel function K

Bottom line
o Obtain the benefit of working in higher dimension space (linearly separable)…
o …with no significant performance penalty
o That’s really an awesome trick


59

Popular Kernels

Polynomial learning machine
K(Xi,Xj) = (Xi·Xj + 1)^p

Gaussian radial-basis function
K(Xi,Xj) = exp(-(Xi – Xj)·(Xi – Xj)/(2σ²))

Two-layer perceptron
K(Xi,Xj) = tanh(β0 Xi·Xj + β1)

Many other possibilities
o Selecting “right” kernel is the real trick
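The three kernels above, written as plain functions for reference (a sketch assuming NumPy; the default parameter values are arbitrary):

```python
# The three listed kernels as functions of two input-space vectors.
import numpy as np

def poly_kernel(Xi, Xj, p=2):
    return (Xi @ Xj + 1.0) ** p

def rbf_kernel(Xi, Xj, sigma=1.0):
    d = Xi - Xj
    return np.exp(-(d @ d) / (2.0 * sigma**2))

def two_layer_perceptron_kernel(Xi, Xj, beta0=1.0, beta1=0.0):
    return np.tanh(beta0 * (Xi @ Xj) + beta1)
```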


60

SVM +’s and –’s

Strengths
o In training, obtain a global maximum, not just local max
o Can tradeoff margin and training errors
o Kernel trick is totally awesome

Weaknesses
o Choosing kernel is more art than science
o Success depends heavily on kernel choice


62

References: Lagrange Multipliers

D. Klein, Lagrange multipliers without permanent scarring

Wikipedia, Lagrange multiplier
