Support Vector Machines and Kernels

Adapted from slides by Tim Oates
Cognition, Robotics, and Learning (CORAL) Lab
University of Maryland Baltimore County

Doing Really Well with Linear Decision Surfaces
Outline

Prediction: Why might predictions be wrong?
Support vector machines: Doing really well with linear models
Kernels: Making the non-linear linear
Supervised ML = Prediction

Given training instances (x, y)
Learn a model f
Such that f(x) = y
Use f to predict y for new x
Many variations on this basic theme
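The loop above can be sketched in a few lines. This is an illustrative stand-in only: the deck does not prescribe a model, so a 1-nearest-neighbor rule plays the role of f here.

```python
# Minimal supervised learning loop: fit f on (x, y) pairs, then predict y for new x.
# f is a 1-nearest-neighbor rule, used only as a stand-in for "some model".

def fit(training_instances):
    """'Learn' by memorizing the training instances."""
    return list(training_instances)

def predict(model, x):
    """Predict y for a new x: copy the label of the closest training x."""
    nearest_x, nearest_y = min(model, key=lambda xy: abs(xy[0] - x))
    return nearest_y

data = [(1.0, -1), (2.0, -1), (8.0, 1), (9.0, 1)]
f = fit(data)
print(predict(f, 1.5))   # a query near the -1 cluster
print(predict(f, 8.5))   # a query near the +1 cluster
```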
Why might predictions be Why might predictions be wrong?wrong?
True Non-Determinism True Non-Determinism Flip a biased coinFlip a biased coin p(p(headsheads) = ) = Estimate Estimate If If > 0.5 predict > 0.5 predict headsheads, else , else tailstails Lots of ML research on problems like Lots of ML research on problems like
thisthis Learn a modelLearn a model Do the best you can in expectationDo the best you can in expectation
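The biased-coin case is small enough to simulate end to end: estimate the heads probability from samples, then predict the majority outcome. The true probability 0.7 below is an arbitrary choice for illustration.

```python
import random

random.seed(0)

# True non-determinism: flip a biased coin with p(heads) = theta.
theta_true = 0.7  # arbitrary illustrative value
flips = [1 if random.random() < theta_true else 0 for _ in range(10000)]

# "Learn a model": estimate theta as the fraction of heads observed.
theta_hat = sum(flips) / len(flips)

# Best prediction in expectation: heads iff the estimate exceeds 0.5.
prediction = "heads" if theta_hat > 0.5 else "tails"
print(round(theta_hat, 2), prediction)
```

No predictor can do better than this in expectation; the residual error rate min(θ, 1-θ) is irreducible.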
Why might predictions be wrong?

Partial observability
Something needed to predict y is missing from observation x
N-bit parity problem
x contains N-1 bits (hard PO)
x contains N bits but learner ignores some of them (soft PO)
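The hard-PO case can be checked exhaustively: when the learner sees only N-1 of the N parity bits, every observation is consistent with both labels, so no learner can beat chance.

```python
from itertools import product

N = 4  # small enough to enumerate every case

# N-bit parity: the label is the XOR (sum mod 2) of all N bits.
def parity(bits):
    return sum(bits) % 2

# Hard partial observability: the learner sees only the first N-1 bits.
# For every observable prefix, the two completions of the hidden bit
# produce one example of each label.
ambiguous = all(
    {parity(obs + (hidden,)) for hidden in (0, 1)} == {0, 1}
    for obs in product([0, 1], repeat=N - 1)
)
print(ambiguous)
```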
Why might predictions be wrong?

Having the right features (x) is crucial

[Figure: X and O points that are not separable along the original feature axis]
Support Vector Machines

Doing Really Well with Linear Decision Surfaces
Strengths of SVMs

Good generalization in theory
Good generalization in practice
Work well with few training instances
Find globally best model
Efficient algorithms
Amenable to the kernel trick
Linear Separators

Training instances: x ∈ ℝⁿ, y ∈ {-1, 1}
Parameters: w ∈ ℝⁿ, b ∈ ℝ
Hyperplane: ⟨w, x⟩ + b = 0, i.e. w₁x₁ + w₂x₂ + … + wₙxₙ + b = 0
Decision function: f(x) = sign(⟨w, x⟩ + b)
Math Review

Inner (dot) product: ⟨a, b⟩ = a · b = Σᵢ aᵢbᵢ = a₁b₁ + a₂b₂ + … + aₙbₙ
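The two definitions above translate directly to code. The weights below are hand-picked for illustration, not learned.

```python
# Linear separator in R^n: f(x) = sign(<w, x> + b).

def dot(a, b):
    """Inner product <a, b> = sum_i a_i * b_i."""
    return sum(ai * bi for ai, bi in zip(a, b))

def decision(w, b, x):
    """Return +1 if x is on the positive side of the hyperplane, else -1."""
    return 1 if dot(w, x) + b >= 0 else -1

w = (2.0, -1.0)   # illustrative weights, not learned
b = -1.0
print(decision(w, b, (3.0, 1.0)))   # 2*3 - 1*1 - 1 = 4 > 0  -> +1
print(decision(w, b, (0.0, 2.0)))   # 0 - 2 - 1 = -3 < 0     -> -1
```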
Intuitions

[Figure: X and O points with one of several candidate separating lines]
A “Good” Separator

[Figure: the same X and O points with a separator that leaves room on both sides]
Noise in the Observations

[Figure: X and O points, each subject to observation noise]
Ruling Out Some Separators

[Figure: separators that pass too close to the points are ruled out by the noise]
Lots of Noise

[Figure: with more noise, few separators remain acceptable]
Maximizing the Margin

[Figure: the separator with the largest gap to the nearest X's and O's]
“Fat” Separators

[Figure: the separator drawn as a thick band between the two classes]
Why Maximize Margin?

Increasing margin reduces capacity
Must restrict capacity to generalize
m training instances
2ᵐ ways to label them
What if a function class can separate them all?
It shatters the training instances
VC dimension is the largest m such that the function class can shatter some set of m points
R[f] = risk, test error
R_emp[f] = empirical risk, train error
h = VC dimension
m = number of training instances
δ = probability that the bound does not hold

R[f] ≤ R_emp[f] + √( (h(ln(2m/h) + 1) + ln(4/δ)) / m )
Support Vectors

[Figure: the maximum-margin separator; the X's and O's lying on the margin are the support vectors]
The Math

Training instances: x ∈ ℝⁿ, y ∈ {-1, 1}
Decision function: f(x) = sign(⟨w, x⟩ + b), with w ∈ ℝⁿ, b ∈ ℝ
Find w and b that
Perfectly classify the training instances (assuming linear separability)
Maximize the margin
The Math

For perfect classification, we want yᵢ(⟨w, xᵢ⟩ + b) ≥ 0 for all i. Why?
To maximize the margin, we want the w that minimizes |w|²
Maximize over α:
W(α) = Σᵢ αᵢ - 1/2 Σᵢ,ⱼ αᵢαⱼ yᵢyⱼ ⟨xᵢ, xⱼ⟩
Subject to: αᵢ ≥ 0 and Σᵢ αᵢyᵢ = 0
Decision function: f(x) = sign(Σᵢ αᵢyᵢ ⟨x, xᵢ⟩ + b)
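The dual-form decision function f(x) = sign(Σᵢ αᵢyᵢ⟨x, xᵢ⟩ + b) is easy to evaluate once the multipliers are known. The αᵢ below are made-up placeholders, not the output of an actual quadratic-program solve; in a trained SVM they are nonzero only for the support vectors.

```python
# Dual-form decision function: f(x) = sign(sum_i alpha_i y_i <x, x_i> + b).

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def dual_decision(alphas, ys, xs, b, x):
    s = sum(a * y * dot(x, xi) for a, y, xi in zip(alphas, ys, xs))
    return 1 if s + b >= 0 else -1

xs = [(0.0, 1.0), (1.0, 0.0)]   # hypothetical support vectors
ys = [1, -1]
alphas = [0.5, 0.5]             # placeholder multipliers, not solver output
b = 0.0
print(dual_decision(alphas, ys, xs, b, (0.0, 2.0)))   # lies toward the +1 vector
```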
What if Data Are Not Perfectly Linearly Separable?

Cannot find w and b that satisfy yᵢ(⟨w, xᵢ⟩ + b) ≥ 1 for all i
Introduce slack variables ξᵢ ≥ 0
Require yᵢ(⟨w, xᵢ⟩ + b) ≥ 1 - ξᵢ for all i
Minimize |w|² + C Σᵢ ξᵢ
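The soft-margin objective |w|² + C Σᵢ ξᵢ can be evaluated directly, since at the optimum each slack is forced down to ξᵢ = max(0, 1 - yᵢ(⟨w, xᵢ⟩ + b)). The w, b, and data below are illustrative, not an optimum.

```python
# Soft-margin objective: |w|^2 + C * sum_i xi_i, where the slack
# xi_i = max(0, 1 - y_i(<w, x_i> + b)) measures each margin violation.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def soft_margin_objective(w, b, data, C):
    slacks = [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in data]
    return dot(w, w) + C * sum(slacks), slacks

data = [((2.0, 0.0), 1), ((0.5, 0.0), 1), ((-2.0, 0.0), -1)]  # illustrative points
w, b, C = (1.0, 0.0), 0.0, 1.0
obj, slacks = soft_margin_objective(w, b, data, C)
print(slacks)  # only the second point violates the margin: slack 0.5
print(obj)
```

Larger C punishes margin violations more heavily, trading margin width against training error.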
Strengths of SVMs

Good generalization in theory
Good generalization in practice
Work well with few training instances
Find globally best model
Efficient algorithms
Amenable to the kernel trick …
What if Surface is Non-Linear?

[Figure: X's surrounding a cluster of O's; no linear separator exists]
Image from http://www.atrandomresearch.com/iclass/
Kernel Methods

Making the Non-Linear Linear

When Linear Separators Fail
[Figure: points not separable along x₁ become separable when replotted against x₁ and x₁²]
Mapping into a New Feature Space

Rather than run SVM on xᵢ, run it on Φ(xᵢ)
Find a non-linear separator in input space
What if Φ(xᵢ) is really big?
Use kernels to compute it implicitly!

Φ: x → X = Φ(x)
Φ(x₁, x₂) = (x₁, x₂, x₁², x₂², x₁x₂)
Image from http://web.engr.oregonstate.edu/~afern/classes/cs534/
Kernels

Find a kernel K such that K(x₁, x₂) = ⟨Φ(x₁), Φ(x₂)⟩
Computing K(x₁, x₂) should be efficient, much more so than computing Φ(x₁) and Φ(x₂)
Use K(x₁, x₂) in the SVM algorithm rather than ⟨x₁, x₂⟩
Remarkably, this is possible
Establishing “kernel-hood” from first principles is non-trivial
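One such K can be checked numerically. For the degree-2 polynomial kernel K(a, b) = (⟨a, b⟩ + 1)² on ℝ², an explicit feature map (a scaled variant of the Φ on the previous slide, with a constant component added) is φ(x₁, x₂) = (1, √2·x₁, √2·x₂, x₁², x₂², √2·x₁x₂), and K computes the 6-D inner product without ever building the 6-D vectors.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Degree-2 polynomial kernel: K(a, b) = (<a, b> + 1)^2.
def K(a, b):
    return (dot(a, b) + 1) ** 2

# Matching explicit feature map into R^6.
def phi(x):
    x1, x2 = x
    r2 = math.sqrt(2)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)

a, b = (1.0, 2.0), (3.0, -1.0)
print(K(a, b), dot(phi(a), phi(b)))  # the two numbers agree
```

The saving grows with the degree d: the kernel is still one dot product plus a power, while the explicit feature space has dimension O(n^d).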
The Kernel Trick

“Given an algorithm which is formulated in terms of a positive definite kernel K1, one can construct an alternative algorithm by replacing K1 with another positive definite kernel K2”

SVMs can use the kernel trick
Using a Different Kernel in the Dual Optimization Problem

For example, using the polynomial kernel with d = 4 (including lower-order terms):

Maximize over α:
W(α) = Σᵢ αᵢ - 1/2 Σᵢ,ⱼ αᵢαⱼ yᵢyⱼ (⟨xᵢ, xⱼ⟩ + 1)⁴
Subject to: αᵢ ≥ 0 and Σᵢ αᵢyᵢ = 0
Decision function: f(x) = sign(Σᵢ αᵢyᵢ (⟨x, xᵢ⟩ + 1)⁴ + b)

The inner products ⟨xᵢ, xⱼ⟩ and ⟨x, xᵢ⟩ are themselves kernels, so by the kernel trick we just replace them with (⟨·, ·⟩ + 1)⁴.
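The kernelized decision function is the dual decision function with the inner product swapped for the kernel. As before, the αᵢ below are placeholders rather than the output of an actual QP solve.

```python
# Kernelized decision function: f(x) = sign(sum_i alpha_i y_i K(x, x_i) + b),
# here with the degree-4 polynomial kernel K(u, v) = (<u, v> + 1)^4.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def poly4(u, v):
    return (dot(u, v) + 1) ** 4

def kernel_decision(kernel, alphas, ys, xs, b, x):
    s = sum(a * y * kernel(x, xi) for a, y, xi in zip(alphas, ys, xs))
    return 1 if s + b >= 0 else -1

xs = [(1.0, 0.0), (-1.0, 0.0)]   # hypothetical support vectors
ys = [1, -1]
alphas = [0.1, 0.1]              # placeholder multipliers, not solver output
b = 0.0
print(kernel_decision(poly4, alphas, ys, xs, b, (0.5, 0.0)))
```

Nothing else in the algorithm changes: the same code works for any positive definite kernel passed as `kernel`.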
Exotic Kernels

Strings
Trees
Graphs

The hard part is establishing kernel-hood