Support Vector Machines & Kernels, Lecture 5
David Sontag, New York University
Slides adapted from Luke Zettlemoyer and Carlos Guestrin
Page 1: Support Vector Machines & Kernels, Lecture 5

David Sontag

New York University

Slides adapted from Luke Zettlemoyer and Carlos Guestrin

Page 2: Support Vector Machines

QP form:

More “natural” form:

(Empirical loss + regularization term)

Equivalent if C and λ are chosen consistently (see the sketch below)
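For reference, a standard statement of the two forms (a sketch using the usual soft-margin conventions; the exact relation between C and λ depends on whether the empirical loss is averaged over the n training examples):

```latex
% QP (soft-margin) form:
\min_{w,b,\xi}\ \tfrac{1}{2}\|w\|^2 + C\sum_{j=1}^{n}\xi_j
\quad\text{s.t.}\quad y_j(w\cdot x_j + b)\ \ge\ 1-\xi_j,\qquad \xi_j \ge 0

% "Natural" (regularized hinge-loss) form:
\min_{w,b}\ \frac{1}{n}\sum_{j=1}^{n}\max\bigl(0,\ 1-y_j(w\cdot x_j+b)\bigr) \;+\; \frac{\lambda}{2}\|w\|^2

% The two have the same minimizers when, e.g., \lambda = 1/(nC).
```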

Page 3: Subgradient method

Page 4: Subgradient method

Step size:
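As a concrete instance of what a subgradient step looks like here (a sketch; this is the standard subgradient of the regularized hinge objective above, with a generic step size η_t, e.g. constant or diminishing):

```latex
% Objective:  f(w) = (\lambda/2)\|w\|^2 + (1/n)\sum_j \max(0,\, 1 - y_j (w\cdot x_j))
% One subgradient of f at w:
g \;=\; \lambda w \;-\; \frac{1}{n}\sum_{j:\, y_j (w\cdot x_j) < 1} y_j x_j

% Subgradient update:
w_{t+1} \;=\; w_t \;-\; \eta_t\, g
```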

Page 5: Stochastic subgradient

Subgradient

Page 6: PEGASOS

Pegasos alternates a subgradient step on a subset A_t ⊆ S of the training data with a projection step:

–  A_t = S recovers the (batch) subgradient method
–  |A_t| = 1 gives stochastic (sub)gradient descent
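A minimal Python sketch of the |A_t| = 1 case (illustrative only: no bias term, no mini-batching, no averaging of iterates; the optional projection step is included):

```python
import numpy as np

def pegasos(X, y, lam=0.1, n_iters=1000, seed=0):
    """Stochastic subgradient descent on
    (lam/2)*||w||^2 + (1/n) * sum_i max(0, 1 - y_i * (w . x_i)),
    with labels y_i in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, n_iters + 1):
        i = rng.integers(n)              # pick one example uniformly at random
        eta = 1.0 / (lam * t)            # step size 1/(lambda * t)
        if y[i] * X[i].dot(w) < 1:       # hinge active: subgradient is lam*w - y_i*x_i
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:                            # hinge inactive: subgradient is lam*w
            w = (1 - eta * lam) * w
        # optional projection onto the ball of radius 1/sqrt(lam)
        norm = np.linalg.norm(w)
        if norm > 1.0 / np.sqrt(lam):
            w *= 1.0 / (np.sqrt(lam) * norm)
    return w
```

The 1/(λt) step size is what drives the iteration bound discussed on the next slide.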

Page 7: Run-Time of Pegasos

•  Choosing |A_t| = 1

Run-time required for Pegasos to find an ε-accurate solution w.p. ≥ 1 − δ:

•  Run-time does not depend on #examples

•  Depends on “difficulty” of problem (λ and ε)

n = # of features
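For orientation, the bound has roughly this shape (a hedged reconstruction; logarithmic factors and the precise dependence on δ are omitted):

```latex
\text{run-time} \;=\; \tilde{O}\!\left(\frac{n}{\lambda\,\varepsilon}\right),
\qquad n = \#\text{features}
```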

Page 8: Experiments

•  3 datasets (provided by Joachims):

–  Reuters CCAT (800K examples, 47k features)

–  Physics ArXiv (62k examples, 100k features)

–  Covertype (581k examples, 54 features)

Training Time (in seconds):

                 Pegasos   SVM-Perf   SVM-Light
  Reuters            2         77       20,075
  Covertype          6         85       25,514
  Astro-Physics      2          5           80

Page 9: What’s Next!

•  Learn one of the most interesting and exciting recent advancements in machine learning – The “kernel trick”

– High dimensional feature spaces at no extra cost

•  But first, a detour – Constrained optimization!

Page 10: Constrained optimization

Example: minimize x² over x, under different constraints:

–  No constraint: x* = 0
–  x ≥ −1: constraint is inactive, x* = 0
–  x ≥ 1: constraint is active, x* = 1

How do we solve with constraints? Lagrange Multipliers!!!

Page 11: Lagrange multipliers – Dual variables

Introduce the Lagrangian (objective): L(x, α) = x² − α(x − b)

We will solve: min_x max_{α ≥ 0} L(x, α)

Add a Lagrange multiplier α for the constraint x ≥ b, together with the new constraint α ≥ 0.

Why is this equivalent?

•  min is fighting max!

–  If x < b: then (x − b) < 0, so max_{α ≥ 0} −α(x − b) = ∞. The min won’t let this happen.
–  If x > b (with α ≥ 0): then (x − b) > 0, so max_{α ≥ 0} −α(x − b) = 0, attained at α* = 0. The min is fine with 0, and L(x, α) = x² (the original objective).
–  If x = b: α can be anything, and L(x, α) = x² (the original objective).

The min on the outside forces the max to behave, so the constraint x ≥ b will be satisfied.

Page 12: Dual SVM derivation (1) – the linearly separable case (hard margin SVM)

Original optimization problem:

Lagrangian:

Rewrite constraints

One Lagrange multiplier per example

Our goal now is to solve:
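A standard way to write these pieces out (a sketch; α_j ≥ 0 is the multiplier attached to example j):

```latex
% Original problem:
\min_{w,b}\ \tfrac{1}{2}\|w\|^2
\quad\text{s.t.}\quad y_j(w\cdot x_j + b) \ge 1\ \ \forall j

% Constraints rewritten as  1 - y_j(w\cdot x_j + b) \le 0,  giving the Lagrangian:
L(w,b,\alpha) \;=\; \tfrac{1}{2}\|w\|^2 \;-\; \sum_j \alpha_j\bigl[y_j(w\cdot x_j + b) - 1\bigr]

% Goal:
\min_{w,b}\ \max_{\alpha \ge 0}\ L(w,b,\alpha)
```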

Page 13: Dual SVM derivation (2) – the linearly separable case (hard margin SVM)

Swap min and max

Slater’s condition from convex optimization guarantees that these two optimization problems are equivalent!

(Primal)

(Dual)
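In symbols (sketch):

```latex
\text{(Primal)}\quad \min_{w,b}\ \max_{\alpha\ge 0}\ L(w,b,\alpha)
\;=\;
\max_{\alpha\ge 0}\ \min_{w,b}\ L(w,b,\alpha)
\quad\text{(Dual)}
```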

Page 14: Dual SVM derivation (3) – the linearly separable case (hard margin SVM)

Can solve for optimal w, b as function of α:

∂L/∂w = w − Σ_j α_j y_j x_j = 0   ⟹   w = Σ_j α_j y_j x_j

Substituting these values back in (and simplifying), we obtain:

(Dual)

(The sums run over all training examples; x_i · x_j is a dot product, and the α’s and y’s are scalars.)
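The resulting dual problem, in its standard form (a sketch consistent with w = Σ_j α_j y_j x_j above):

```latex
\max_{\alpha}\ \sum_i \alpha_i \;-\; \tfrac{1}{2}\sum_{i,j}\alpha_i\,\alpha_j\,y_i\,y_j\,(x_i\cdot x_j)
\quad\text{s.t.}\quad \alpha_i \ge 0\ \ \forall i,
\qquad \sum_i \alpha_i y_i = 0
```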

Page 15: Reminder: What if the data is not linearly separable?

Use features of features of features of features….

Feature space can get really large really quickly!

φ(x) = ( x^(1), …, x^(n), x^(1)x^(2), x^(1)x^(3), …, e^(x^(1)), … )

Page 16: Higher order polynomials

[Plot: number of monomial terms vs. number of input dimensions, for polynomial degrees d = 2, 3, 4]

m = number of input features, d = degree of the polynomial

The number of terms grows fast! For d = 6 and m = 100, there are about 1.6 billion terms (see the count below).
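To see where the 1.6 billion comes from: the number of monomials of degree exactly d in m variables is a binomial coefficient (a standard stars-and-bars count, added here as a worked check):

```latex
\#\{\text{monomials of degree } d \text{ in } m \text{ variables}\} = \binom{m + d - 1}{d},
\qquad \binom{100 + 6 - 1}{6} = \binom{105}{6} = 1{,}609{,}344{,}100 \approx 1.6 \times 10^{9}
```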

Page 17: Dual formulation only depends on dot-products of the features!

First, we introduce a feature mapping:

Next, replace the dot product with an equivalent kernel function:
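Concretely (a sketch): with feature map φ, the dual involves the data only through inner products φ(x_i)·φ(x_j), so we can swap in a kernel K:

```latex
K(x_i, x_j) \;=\; \phi(x_i)\cdot\phi(x_j)

\max_{\alpha}\ \sum_i \alpha_i \;-\; \tfrac{1}{2}\sum_{i,j}\alpha_i\,\alpha_j\,y_i\,y_j\,K(x_i, x_j)
\quad\text{s.t.}\quad \alpha_i \ge 0,\ \ \sum_i \alpha_i y_i = 0
```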

Page 18: Polynomial kernel

Polynomials of degree exactly d

d = 1:
φ(u)·φ(v) = (u1, u2)·(v1, v2) = u1 v1 + u2 v2 = u·v

d = 2:
φ(u)·φ(v) = (u1^2, u1 u2, u2 u1, u2^2)·(v1^2, v1 v2, v2 v1, v2^2)
          = u1^2 v1^2 + 2 u1 v1 u2 v2 + u2^2 v2^2
          = (u1 v1 + u2 v2)^2
          = (u·v)^2

For any d (we will skip the proof):
φ(u)·φ(v) = (u·v)^d
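A quick numeric sanity check of the d = 2 identity above (a small sketch; phi2 below is the explicit degree-2 monomial map used in the derivation):

```python
import numpy as np

def phi2(x):
    """Explicit degree-2 feature map for 2-D input: (x1^2, x1*x2, x2*x1, x2^2)."""
    return np.array([x[0]*x[0], x[0]*x[1], x[1]*x[0], x[1]*x[1]])

u = np.array([0.5, -2.0])
v = np.array([3.0, 1.0])

lhs = phi2(u).dot(phi2(v))   # dot product in the expanded feature space
rhs = (u.dot(v)) ** 2        # kernel evaluated directly on the inputs
print(lhs, rhs)              # both print 0.25 -> phi(u).phi(v) == (u.v)^2
```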

Page 19: Common kernels

•  Polynomials of degree exactly d

•  Polynomials of degree up to d

•  Gaussian kernels

•  Sigmoid

•  And many others: very active area of research!
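For reference, standard forms of these kernels (a sketch; the parameter names σ, η, ν are illustrative and conventions vary):

```latex
\text{Polynomial (exactly degree } d\text{)}:\quad K(u,v) = (u\cdot v)^d\\
\text{Polynomial (up to degree } d\text{)}:\quad K(u,v) = (u\cdot v + 1)^d\\
\text{Gaussian (RBF)}:\quad K(u,v) = \exp\!\left(-\frac{\|u-v\|^2}{2\sigma^2}\right)\\
\text{Sigmoid}:\quad K(u,v) = \tanh(\eta\, u\cdot v + \nu)
```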