Binary Classification Problem: Learn a Classifier from the Training Set

Transcript
Page 1: Binary Classification Problem: Learn a Classifier from the Training Set

Given a training dataset

$S = \{(x_i, y_i) \mid x_i \in \mathbb{R}^n,\ y_i \in \{-1, 1\},\ i = 1, \dots, m\}$

Main goal: predict the unseen class label for new data.

$x_i \in A_+ \Leftrightarrow y_i = 1 \quad \text{and} \quad x_i \in A_- \Leftrightarrow y_i = -1$

Find a function $f : \mathbb{R}^n \to \mathbb{R}$ by learning from the data, such that

$f(x) > 0 \Rightarrow x \in A_+ \quad \text{and} \quad f(x) < 0 \Rightarrow x \in A_-$

The simplest such function is linear: $f(x) = w'x + b$. (Throughout, a prime denotes transpose: $w'x = w^\top x$.)

Page 2: Binary Classification Problem: Linearly Separable Case

[Figure: two linearly separable classes, $A_-$ (malignant) and $A_+$ (benign), separated by the plane $x'w + b = 0$ with normal vector $w$, between the bounding planes $x'w + b = -1$ and $x'w + b = +1$.]

Page 3: Support Vector Machines: Maximizing the Margin between Bounding Planes

[Figure: the classes $A_+$ and $A_-$ with the bounding planes $x'w + b = 1$ and $x'w + b = -1$; the margin between the bounding planes is maximized.]

Page 4: Why Do We Maximize the Margin? (Based on Statistical Learning Theory)

Structural Risk Minimization (SRM): the expected risk will be less than or equal to the empirical risk (training error) plus a VC (error) bound, and

$\|w\|_2^2 \propto \text{VC bound}$

so

$\min \text{VC bound} \Leftrightarrow \min \tfrac{1}{2}\|w\|_2^2 \Leftrightarrow \max \text{Margin}$

Page 5: Summary of the Notation

Let $S = \{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)\}$ be a training dataset, represented by the matrices

$A = \begin{bmatrix} (x_1)' \\ (x_2)' \\ \vdots \\ (x_m)' \end{bmatrix} \in \mathbb{R}^{m \times n}, \qquad D = \begin{bmatrix} y_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & y_m \end{bmatrix} \in \mathbb{R}^{m \times m}$

and $e = [1, 1, \dots, 1]' \in \mathbb{R}^m$. Then $D(Aw + eb) \ge e$ is equivalent to

$A_i w + b \ge +1 \ \text{ for } D_{ii} = +1, \qquad A_i w + b \le -1 \ \text{ for } D_{ii} = -1.$
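A minimal NumPy sketch of this notation (the toy data points here are made up for illustration):

```python
import numpy as np

# Toy 2-D training set: the rows of A are the points (x_i)', y holds the labels.
A = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

D = np.diag(y)       # D_ii = y_i
e = np.ones(len(y))  # e = [1, ..., 1]'

w, b = np.array([1.0, 1.0]), -1.0  # some candidate separating plane

# D(Aw + eb) >= e holds iff every point is on the correct side of its bounding plane.
print(D @ (A @ w + e * b) >= e)    # [ True  True  True  True]
```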

Page 6: Support Vector Classification (Linearly Separable Case, Primal)

The hyperplane $(w, b)$ is determined by solving the minimization problem:

$\min_{(w,b) \in \mathbb{R}^{n+1}} \ \tfrac{1}{2}\|w\|_2^2 \quad \text{s.t.} \quad D(Aw + eb) \ge e$

It realizes the maximal-margin hyperplane, with geometric margin $\gamma = \dfrac{1}{\|w\|_2}$.
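This quadratic program can be handed to any QP solver. A sketch with CVXPY, which is not part of the slides (toy data as above):

```python
import cvxpy as cp
import numpy as np

A = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
D, e = np.diag(y), np.ones(len(y))

w, b = cp.Variable(2), cp.Variable()

# min (1/2)||w||^2  s.t.  D(Aw + eb) >= e   (hard-margin primal)
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                  [D @ (A @ w + e * b) >= e])
prob.solve()
print(w.value, b.value, 1 / np.linalg.norm(w.value))  # geometric margin = 1/||w||
```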

Page 7: Support Vector Classification (Linearly Separable Case, Dual Form)

The dual problem of the previous mathematical program:

$\max_{\alpha \in \mathbb{R}^m} \ e'\alpha - \tfrac{1}{2}\alpha' DAA'D\alpha$

subject to

$e'D\alpha = 0, \quad \alpha \ge 0.$

Applying the KKT optimality conditions, we have $w = A'D\alpha$. But where is $b$?

Don't forget: $0 \le \alpha \perp D(Aw + eb) - e \ge 0$
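A sketch of solving this dual with CVXPY and recovering $b$ from the complementarity condition: any point with $\alpha_i > 0$ lies on its bounding plane, so $b = y_i - \langle w, x_i \rangle$. (`psd_wrap`, available in recent CVXPY versions, just asserts that the Gram matrix is positive semidefinite.)

```python
import cvxpy as cp
import numpy as np

A = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
D = np.diag(y)

alpha = cp.Variable(len(y))
Q = D @ A @ A.T @ D
prob = cp.Problem(
    cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, cp.psd_wrap(Q))),
    [y @ alpha == 0,      # e'D alpha = y'alpha = 0
     alpha >= 0])
prob.solve()

a = alpha.value
w = A.T @ D @ a           # KKT: w = A'D alpha
sv = np.argmax(a)         # index of a support vector (alpha_i > 0)
b = y[sv] - A[sv] @ w     # complementarity gives b
print(w, b)
```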

Page 8: Dual Representation of SVM (Key of Kernel Methods)

The hypothesis is determined by $(\alpha^*, b^*)$:

$h(x) = \operatorname{sgn}(\langle x, A'D\alpha^* \rangle + b^*) = \operatorname{sgn}\Big(\sum_{i=1}^{m} y_i \alpha_i^* \langle x_i, x \rangle + b^*\Big) = \operatorname{sgn}\Big(\sum_{\alpha_i^* > 0} y_i \alpha_i^* \langle x_i, x \rangle + b^*\Big)$

$w = A'D\alpha^* = \sum_{i=1}^{m} y_i \alpha_i^* (A_i)'$

Remember: $(A_i)' = x_i$.
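The decision rule in dual form, continuing the sketch above; note that the new point $x$ enters only through inner products with the training points:

```python
import numpy as np

def h(x, A, y, alpha, b):
    """Dual-form hypothesis: sgn( sum_i y_i alpha_i <x_i, x> + b )."""
    return np.sign(np.sum(y * alpha * (A @ x)) + b)
```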

Page 9: Soft Margin SVM (Nonseparable Case)

If the data are not linearly separable, the primal problem is infeasible and the dual problem is unbounded above.

Introduce a slack variable $\xi_i$ for each training point:

$y_i(w'x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \quad \forall i$

This inequality system is always feasible, e.g., $w = 0,\ b = 0,\ \xi = e$.

Page 10:

[Figure: a nonseparable dataset; most points lie on the correct side of the margin of width $\gamma$, while points $x_j$ and $o_i$ that violate their bounding planes incur slacks $\xi_j$ and $\xi_i$.]

Page 11: Robust Linear Programming: Preliminary Approach to SVM

$\min_{w,b,\xi} \ e'\xi \quad \text{s.t.} \quad D(Aw + eb) + \xi \ge e, \quad \xi \ge 0 \qquad \text{(LP)}$

where $\xi$ is a nonnegative slack (error) vector.

The term $e'\xi$, the 1-norm measure of the error vector, is called the training error.

For the linearly separable case, at the solution of (LP): $\xi = 0$.
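A sketch of this LP with `scipy.optimize.linprog`, stacking the decision variables as $z = [w;\ b;\ \xi]$ (toy data as before):

```python
import numpy as np
from scipy.optimize import linprog

A = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, n = A.shape
D, e = np.diag(y), np.ones(m)

c = np.concatenate([np.zeros(n + 1), e])             # minimize e'xi
# -DA w - De b - xi <= -e  encodes  D(Aw + eb) + xi >= e
A_ub = np.hstack([-D @ A, -(D @ e)[:, None], -np.eye(m)])
bounds = [(None, None)] * (n + 1) + [(0, None)] * m  # w, b free; xi >= 0

res = linprog(c, A_ub=A_ub, b_ub=-e, bounds=bounds)
w, b, xi = res.x[:n], res.x[n], res.x[n + 1:]
print(w, b, xi)   # separable toy data, so xi is (numerically) zero
```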

Page 12: Support Vector Machine Formulations (Two Different Measures of Training Error)

2-Norm Soft Margin:

$\min_{(w,b,\xi) \in \mathbb{R}^{n+1+m}} \ \tfrac{1}{2}\|w\|_2^2 + \tfrac{C}{2}\|\xi\|_2^2 \quad \text{s.t.} \quad D(Aw + eb) + \xi \ge e$

1-Norm Soft Margin (Conventional SVM):

$\min_{(w,b,\xi) \in \mathbb{R}^{n+1+m}} \ \tfrac{1}{2}\|w\|_2^2 + Ce'\xi \quad \text{s.t.} \quad D(Aw + eb) + \xi \ge e, \quad \xi \ge 0$

Page 13: Tuning Procedure: How to Determine C?

[Figure: training and testing set correctness as the parameter $C$ varies; very large $C$ leads to overfitting.]

The final value of the parameter $C$ is the one with the maximum testing set correctness!
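In practice this tuning is done by a cross-validated grid search. A sketch with scikit-learn (the dataset here is synthetic, only to make the snippet runnable):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Try C on a logarithmic grid; keep the value with the best held-out correctness.
grid = GridSearchCV(SVC(kernel="linear"),
                    param_grid={"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
print(grid.best_params_["C"], grid.best_score_)
```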

Page 14: Lagrangian Dual Problem

The primal problem

$\min_{x} f(x)$ subject to $g(x) \le 0$

has the Lagrangian dual

$\max_{\alpha} \theta(\alpha)$ subject to $\alpha \ge 0$, where $\theta(\alpha) = \inf_{x} \{ f(x) + \alpha' g(x) \}.$

Page 15: 1-Norm Soft Margin SVM: Dual Formulation

The Lagrangian for the 1-norm soft margin:

$L(w, b, \xi, \alpha, r) = \tfrac{1}{2}w'w + Ce'\xi + \alpha'[e - D(Aw + eb) - \xi] - r'\xi$, where $\alpha \ge 0$ and $r \ge 0$.

Set the partial derivatives with respect to the primal variables to zero:

$\dfrac{\partial L}{\partial w} = w - A'D\alpha = 0$

$\dfrac{\partial L}{\partial b} = e'D\alpha = 0$

$\dfrac{\partial L}{\partial \xi} = Ce - \alpha - r = 0$

Page 16:

Substitute

$w = A'D\alpha, \qquad Ce'\xi = (\alpha + r)'\xi, \qquad e'D\alpha = 0$

into $L(w, b, \xi, \alpha, r)$:

$\theta(\alpha, r) = \tfrac{1}{2}\alpha'DAA'D\alpha + e'\alpha - \alpha'DA(A'D\alpha) = -\tfrac{1}{2}\alpha'DAA'D\alpha + e'\alpha$

s.t. $e'D\alpha = 0, \quad \alpha + r = Ce, \quad \alpha \ge 0 \text{ and } r \ge 0$

Page 17: Dual Maximization Problem for the 1-Norm Soft Margin

Dual (eliminating $r \ge 0$ via $\alpha + r = Ce$ turns the constraint into the box $0 \le \alpha \le Ce$):

$\max_{\alpha \in \mathbb{R}^m} \ e'\alpha - \tfrac{1}{2}\alpha'DAA'D\alpha \quad \text{s.t.} \quad e'D\alpha = 0, \quad 0 \le \alpha \le Ce$

The corresponding KKT complementarity conditions:

$0 \le \alpha \perp D(Aw + eb) + \xi - e \ge 0$

$0 \le \xi \perp \alpha - Ce \le 0$

Page 18: Slack Variables for the 1-Norm Soft Margin SVM

Nonzero slack can only occur when $\alpha_i^* = C$.

The points for which $0 < \alpha_i^* < C$ lie on the bounding planes; this will help us find $b^*$.

The contribution of an outlier to the decision rule will be at most $C$.

The trade-off between accuracy and regularization is directly controlled by $C$.

$f(x) = \sum_{\alpha_i^* > 0} y_i \alpha_i^* \langle x_i, x \rangle + b^*$

Page 19: Two-Spiral Dataset (94 White Dots & 94 Red Dots)

[Figure: the two-spiral dataset, two interleaved spirals of 94 points each.]

Page 20: Learning in Feature Space (Could Simplify the Classification Task)

Learning in a high-dimensional space could degrade generalization performance; this phenomenon is called the curse of dimensionality.

By using a kernel function, which represents the inner product of training examples in feature space, we never need to know the nonlinear map explicitly; we do not even need to know the dimensionality of the feature space.

There is no free lunch: we have to deal with a huge and dense kernel matrix. A reduced kernel can avoid this difficulty.

Page 21:

[Figure: the nonlinear map $\phi$ carries points from the input space $X$ into the feature space $F$.]

Page 22: Linear Machine in Feature Space

Let $\phi : X \to F$ be a nonlinear map from the input space to some feature space. The classifier will be in the form (primal):

$f(x) = \Big(\sum_{j=1}^{?} w_j \phi_j(x)\Big) + b$

where the "?" stands for the (possibly unknown, even infinite) dimension of $F$. Make it into the dual form:

$f(x) = \Big(\sum_{i=1}^{m} \alpha_i y_i \langle \phi(x_i) \cdot \phi(x) \rangle\Big) + b$

Page 23: Kernel: Represent the Inner Product in Feature Space

Definition: a kernel is a function $K : X \times X \to \mathbb{R}$ such that for all $x, z \in X$

$K(x, z) = \langle \phi(x) \cdot \phi(z) \rangle$, where $\phi : X \to F$.

The classifier will become:

$f(x) = \Big(\sum_{i=1}^{m} \alpha_i y_i K(x_i, x)\Big) + b$

Page 24: A Simple Example of a Kernel

Polynomial kernel of degree 2: $K(x, z) = \langle x, z \rangle^2$.

Let $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix},\ z = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} \in \mathbb{R}^2$, and define the nonlinear map $\phi : \mathbb{R}^2 \mapsto \mathbb{R}^3$ by $\phi(x) = \begin{bmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\,x_1 x_2 \end{bmatrix}$.

Then $\langle \phi(x), \phi(z) \rangle = \langle x, z \rangle^2 = K(x, z)$.

There are many other nonlinear maps, $\psi(x)$, that satisfy the relation $\langle \psi(x), \psi(z) \rangle = \langle x, z \rangle^2 = K(x, z)$.
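A short numerical check of this identity:

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map from the slide.
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(z))   # inner product in feature space: 1.0
print((x @ z) ** 2)      # kernel value <x, z>^2:          1.0
```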

Page 25: Power of the Kernel Technique

Consider a nonlinear map $\phi : \mathbb{R}^n \mapsto \mathbb{R}^p$ that consists of distinct features of all the monomials of degree $d$. Then

$p = \dbinom{n + d - 1}{d}.$

For example: $n = 10,\ d = 10 \Rightarrow p = 92378$.

(Each monomial corresponds to a stars-and-bars pattern, e.g. $x_1^3 x_2^1 x_3^4 x_4^4 \Rightarrow |\,\bullet\bullet\bullet\,|\,\bullet\,|\,\bullet\bullet\bullet\bullet\,|\,\bullet\bullet\bullet\bullet$.)

Is it necessary to compute $\phi$? We only need to know $\langle \phi(x), \phi(z) \rangle$! This can be achieved with

$K(x, z) = \langle x, z \rangle^d$
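Checking the monomial count from the formula above takes one line:

```python
from math import comb

# Number of distinct monomials of degree d = 10 in n = 10 variables.
print(comb(10 + 10 - 1, 10))  # 92378
```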

Page 26: Kernel Technique: Based on Mercer's Condition (1909)

The value of the kernel function represents the inner product of two training points in feature space.

Kernel functions merge two steps:

1. map the input data from input space to feature space (which might be infinite-dimensional);
2. compute the inner product in the feature space.

Page 27: More Examples of Kernels

$K(A, B) : \mathbb{R}^{m \times n} \times \mathbb{R}^{n \times l} \mapsto \mathbb{R}^{m \times l}$

$A \in \mathbb{R}^{m \times n},\ a \in \mathbb{R}^m,\ \mu \in \mathbb{R},\ d$ is an integer:

Polynomial kernel: $(AA' + \mu a a')_\bullet^d$ (componentwise $d$-th power); the linear kernel $AA'$ is the special case $\mu = 0,\ d = 1$.

Gaussian (radial basis) kernel: $K(A, A')_{ij} = e^{-\mu\|A_i - A_j\|_2^2},\quad i, j = 1, \dots, m$

The $ij$-entry of $K(A, A')$ represents the "similarity" of the data points $A_i$ and $A_j$.
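A vectorized NumPy sketch of the Gaussian kernel matrix; the expansion $\|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2\langle a, b \rangle$ avoids explicit loops:

```python
import numpy as np

def gaussian_kernel(A, B, mu):
    """K_ij = exp(-mu * ||A_i - B_j||^2) for the rows of A and B."""
    sq = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-mu * sq)

A = np.random.randn(5, 3)
K = gaussian_kernel(A, A, mu=0.5)
print(K.shape, np.allclose(K, K.T), np.allclose(np.diag(K), 1.0))
# (5, 5) True True -- symmetric, with ones on the diagonal
```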

Page 28: Nonlinear 1-Norm Soft Margin SVM in Dual Form

Linear SVM:

$\max_{\alpha \in \mathbb{R}^m} \ e'\alpha - \tfrac{1}{2}\alpha'DAA'D\alpha \quad \text{s.t.} \quad e'D\alpha = 0, \quad 0 \le \alpha \le Ce$

Nonlinear SVM (replace $AA'$ by the kernel matrix $K(A, A')$):

$\max_{\alpha \in \mathbb{R}^m} \ e'\alpha - \tfrac{1}{2}\alpha'DK(A, A')D\alpha \quad \text{s.t.} \quad e'D\alpha = 0, \quad 0 \le \alpha \le Ce$

Page 29: 1-Norm Support Vector Machines: Good for Feature Selection

Solve the following mathematical program for some $C > 0$:

$\min_{w,b,\xi \ge 0} \ Ce'\xi + \|w\|_1 \quad \text{s.t.} \quad D(Aw + eb) + \xi \ge e$

where $D_{ii} = \pm 1$ denotes $A_+$ or $A_-$ membership.

This is equivalent to solving a linear program, as follows:

$\min_{w,b,s,\xi \ge 0} \ Ce'\xi + e's \quad \text{s.t.} \quad D(Aw + eb) + \xi \ge e, \quad -s \le w \le s$

Page 30: SVM as an Unconstrained Minimization Problem

$\min_{w,b,\xi} \ \tfrac{C}{2}\|\xi\|_2^2 + \tfrac{1}{2}(\|w\|_2^2 + b^2) \quad \text{s.t.} \quad D(Aw + eb) + \xi \ge e, \quad \xi \ge 0 \qquad \text{(QP)}$

At the solution of (QP): $\xi = (e - D(Aw + eb))_+$, where $(\cdot)_+ = \max\{\cdot, 0\}$.

Hence (QP) is equivalent to the nonsmooth SVM:

$\min_{w,b} \ \tfrac{C}{2}\|(e - D(Aw + eb))_+\|_2^2 + \tfrac{1}{2}(\|w\|_2^2 + b^2)$

This changes (QP) into an unconstrained mathematical program and reduces the $n+1+m$ variables to $n+1$.

Page 31: Smooth the Plus Function: Integrate

[Figure: the step function $x_*$ next to the sigmoid function $(1 + e^{-5x})^{-1}$, and the plus function $x_+$ next to the p-function $p(x, 5)$.]

Integrating the sigmoid function $(1 + e^{-\beta x})^{-1}$ yields a smooth approximation of the plus function:

$p(x, \beta) := x + \tfrac{1}{\beta}\log(1 + e^{-\beta x})$
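A quick sketch showing how closely $p(\cdot, \beta)$ tracks the plus function even at $\beta = 5$ (`log1p` keeps the evaluation numerically stable):

```python
import numpy as np

def p(x, beta):
    """Smooth approximation of the plus function: x + log(1 + e^(-beta*x))/beta."""
    return x + np.log1p(np.exp(-beta * x)) / beta

x = np.linspace(-2.0, 2.0, 5)
print(np.maximum(x, 0))  # the plus function: [0. 0. 0. 1. 2.]
print(p(x, 5.0))         # close to x_+ everywhere except near the kink at 0
```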

Page 32: SSVM: Smooth Support Vector Machine

Replacing the plus function $(\cdot)_+$ in the nonsmooth SVM by the smooth $p(\cdot, \beta)$ gives our SSVM:

$\min_{(w,b) \in \mathbb{R}^{n+1}} \ \tfrac{C}{2}\|p(e - D(Aw + eb), \beta)\|_2^2 + \tfrac{1}{2}(\|w\|_2^2 + b^2)$

The solution of SSVM converges to the solution of the nonsmooth SVM as $\beta$ goes to infinity. (Typically, $\beta = 5$.)

Page 33: Newton-Armijo Method: Quadratic Approximation of SSVM

The sequence $\{(w^i, b^i)\}$, generated by solving a quadratic approximation of SSVM, converges to the unique solution $(w^*, b^*)$ of SSVM at a quadratic rate.

At each iteration we solve a linear system of $n+1$ equations in $n+1$ variables, so the complexity depends on the dimension of the input space.

Converges in 6 to 8 iterations.

A stepsize might need to be selected.

Page 34: Newton-Armijo Algorithm

Start with any $(w^0, b^0) \in \mathbb{R}^{n+1}$. Having $(w^i, b^i)$, stop if $\nabla \Phi_\beta(w^i, b^i) = 0$; else:

(i) Newton direction: solve

$\nabla^2 \Phi_\beta(w^i, b^i)\, d^i = -\nabla \Phi_\beta(w^i, b^i)'$

(ii) Armijo stepsize:

$(w^{i+1}, b^{i+1}) = (w^i, b^i) + \lambda_i d^i$, with $\lambda_i \in \{1, \tfrac{1}{2}, \tfrac{1}{4}, \dots\}$ such that Armijo's rule is satisfied.

The iterates converge globally, and quadratically, to the unique solution in a finite number of steps, where

$\Phi_\beta(w, b) = \tfrac{C}{2}\|p(e - D(Aw + eb), \beta)\|_2^2 + \tfrac{1}{2}(\|w\|_2^2 + b^2)$
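A compact NumPy sketch of this algorithm under the stated objective; the gradient and Hessian formulas follow from $p'(x, \beta) = (1 + e^{-\beta x})^{-1}$ (the sigmoid), and the Armijo constant $10^{-4}$ with a halving schedule is a conventional choice, not from the slides:

```python
import numpy as np

def ssvm_newton_armijo(A, y, C=1.0, beta=5.0, tol=1e-8, max_iter=50):
    """Newton-Armijo for Phi(w,b) = C/2 ||p(e - D(Aw+eb), beta)||^2 + (||w||^2 + b^2)/2."""
    m, n = A.shape
    DM = np.diag(y) @ np.hstack([A, np.ones((m, 1))])  # D [A e], so r = e - DM z
    z = np.zeros(n + 1)                                # z = (w, b)

    def parts(z):
        r = 1.0 - DM @ z
        s = 1.0 / (1.0 + np.exp(-beta * r))            # sigmoid = p'(r, beta)
        pr = r + np.log1p(np.exp(-beta * r)) / beta    # p(r, beta)
        return s, pr

    def phi(z):
        _, pr = parts(z)
        return 0.5 * C * pr @ pr + 0.5 * z @ z

    for _ in range(max_iter):
        s, pr = parts(z)
        grad = -C * DM.T @ (pr * s) + z
        if np.linalg.norm(grad) < tol:
            break                                      # stop: gradient ~ 0
        # Newton direction: solve  Hess d = -grad
        hess = C * DM.T @ (DM * (s**2 + beta * pr * s * (1 - s))[:, None]) \
               + np.eye(n + 1)
        d = np.linalg.solve(hess, -grad)
        lam = 1.0                                      # Armijo: halve until decrease
        while phi(z + lam * d) > phi(z) + 1e-4 * lam * (grad @ d):
            lam *= 0.5
        z = z + lam * d
    return z[:n], z[n]                                 # (w, b)
```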

Page 35: Nonlinear Smooth SVM

Nonlinear classifier: $K(x', A')D\alpha + b = 0$

Replace $AA'$ by a nonlinear kernel $K(A, A')$:

$\min_{\alpha, b} \ \tfrac{C}{2}\|p(e - D(K(A, A')D\alpha + eb), \beta)\|_2^2 + \tfrac{1}{2}(\|\alpha\|_2^2 + b^2)$

Use the Newton-Armijo algorithm to solve the problem; each iteration solves $m+1$ linear equations in $m+1$ variables.

The nonlinear classifier depends only on the data points with nonzero coefficients:

$K(x', A')D\alpha + b = \sum_{\alpha_j > 0} \alpha_j y_j K(A_j, x) + b = 0$

Page 36: Conclusion

SSVM: a new formulation of the support vector machine as a smooth unconstrained minimization problem.

No optimization (LP, QP) package is needed; the problem can be solved by a fast Newton-Armijo algorithm.

This lecture gave an overview of SVMs for classification. There are many important issues it did not address, such as:

How to solve the conventional SVM?

How to deal with massive datasets?

How to select the parameters $C$ and $\mu$?

Page 37: Perceptron

Linear threshold unit (LTU):

[Figure: inputs $x_1, x_2, \dots, x_n$ with weights $w_1, w_2, \dots, w_n$, plus a constant input $x_0 = 1$ with bias weight $w_0$, feeding a summation node $\sum_{i=0}^{n} w_i x_i$ followed by a threshold function $g$.]

$o(x) = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i x_i > 0 \\ -1 & \text{otherwise} \end{cases}$
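A direct Python transcription of the LTU (the weights and input below are illustrative):

```python
import numpy as np

def ltu(x, w):
    """Linear threshold unit; w[0] is the bias weight on the constant input x0 = 1."""
    return 1 if w[0] + w[1:] @ x > 0 else -1

print(ltu(np.array([2.0, 1.0]), np.array([-2.0, 0.5, 0.5])))  # -1
```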

Page 38: Possibilities for the Function g

Step function, sign function, sigmoid (logistic) function:

$\text{step}(x) = \begin{cases} 1 & \text{if } x > \text{threshold} \\ 0 & \text{if } x \le \text{threshold} \end{cases}$ (in the picture above, threshold $= 0$)

$\text{sign}(x) = \begin{cases} +1 & \text{if } x > 0 \\ -1 & \text{if } x \le 0 \end{cases}$

$\text{sigmoid}(x) = \dfrac{1}{1 + e^{-x}}$

Adding an extra input with activation $x_0 = 1$ and weight $w_0 = -T$ (called the bias weight) is equivalent to having a threshold at $T$. This way we can always assume a threshold of 0.

Page 39: Using a Bias Weight to Standardize the Threshold

[Figure: a unit with constant input 1 weighted by $-T$, and inputs $x_1, x_2$ weighted by $w_1, w_2$.]

$w_1 x_1 + w_2 x_2 < T \iff w_1 x_1 + w_2 x_2 - T < 0$

Page 40: Perceptron Learning Rule

[Figure: four snapshots of the perceptron learning rule on a small 2-D dataset. Starting from $w = [0.25, -0.1, 0.5]$ (decision line $x_2 = 0.2x_1 - 0.5$), each misclassified point, e.g. $(x, t) = ([-1, -1], 1)$ with $o = \operatorname{sgn}(0.25 + 0.1 - 0.5) = -1$, triggers a weight update, ending with a line satisfying $-0.5x_1 + 0.3x_2 + 0.45 > 0 \Rightarrow o = 1$ that places the $t = 1$ and $t = -1$ points on the correct sides.]

Page 41: The Perceptron Algorithm (Rosenblatt, 1956)

Given a linearly separable training set $S$, a learning rate $\eta > 0$, and the initial weight vector and bias $w_0 = 0$, $b_0 = 0$, let

$R = \max_{1 \le i \le \ell} \|x_i\|, \qquad k = 0.$

Page 42: The Perceptron Algorithm (Primal Form)

$R = \max_{1 \le i \le \ell} \|x_i\|;\ k = 0.$

Repeat:
  for $i = 1$ to $\ell$
    if $y_i(\langle w_k \cdot x_i \rangle + b_k) \le 0$ then
      $w_{k+1} \leftarrow w_k + \eta y_i x_i$
      $b_{k+1} \leftarrow b_k + \eta y_i R^2$
      $k \leftarrow k + 1$
    end if
  end for
until no mistakes are made within the for loop.

Return: $k, (w_k, b_k)$. What is $k$?
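A runnable sketch of the primal form, including the slide's bias update $b \leftarrow b + \eta y_i R^2$; it terminates only if the data really are linearly separable:

```python
import numpy as np

def perceptron_primal(X, y, eta=0.1):
    """Primal perceptron; X has one point per row, y has labels in {-1, +1}."""
    w, b, k = np.zeros(X.shape[1]), 0.0, 0
    R2 = np.max(np.linalg.norm(X, axis=1)) ** 2
    mistakes = True
    while mistakes:                      # repeat until one clean pass
        mistakes = False
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:   # mistake: update
                w += eta * yi * xi
                b += eta * yi * R2
                k += 1
                mistakes = True
    return k, w, b                       # k counts the updates
```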

Page 43:

After the update $w_{k+1} \leftarrow w_k + \eta y_i x_i$ and $b_{k+1} \leftarrow b_k + \eta y_i R^2$, is

$y_i(\langle w_{k+1} \cdot x_i \rangle + b_{k+1}) > y_i(\langle w_k \cdot x_i \rangle + b_k)$?

$y_i(\langle w_{k+1} \cdot x_i \rangle + b_{k+1}) = y_i(\langle (w_k + \eta y_i x_i) \cdot x_i \rangle + b_k + \eta y_i R^2)$

$= y_i(\langle w_k \cdot x_i \rangle + b_k) + y_i\,\eta y_i(\langle x_i \cdot x_i \rangle + R^2)$

$= y_i(\langle w_k \cdot x_i \rangle + b_k) + \eta(\langle x_i \cdot x_i \rangle + R^2)$

So yes: since $y_i^2 = 1$ and $\eta(\langle x_i \cdot x_i \rangle + R^2) > 0$, the update strictly increases the functional margin on the misclassified point $x_i$.

Page 44: The Perceptron Algorithm (Stops in Finite Steps)

Theorem (Novikoff). Let $S$ be a non-trivial training set and let

$R = \max_{1 \le i \le \ell} \|x_i\|_2.$

Suppose that there exists a vector $w_{opt} \in \mathbb{R}^n$ with $\|w_{opt}\| = 1$ such that

$y_i(\langle w_{opt} \cdot x_i \rangle + b_{opt}) \ge \gamma \quad \text{for all } 1 \le i \le \ell.$

Then the number of mistakes made by the on-line perceptron algorithm on $S$ is at most $\left(\dfrac{2R}{\gamma}\right)^2$.

Page 45: Proof of Finite Termination

Proof: Let

$\hat{x}_i = \begin{bmatrix} x_i \\ R \end{bmatrix}, \qquad \hat{w} = \begin{bmatrix} w \\ b/R \end{bmatrix}.$

The algorithm starts with the augmented weight vector $\hat{w}_0 = 0$ and updates it at each mistake. Let $\hat{w}_{t-1}$ be the augmented weight vector prior to the $t$-th mistake. The $t$-th update is performed when

$y_i \langle \hat{w}_{t-1}, \hat{x}_i \rangle = y_i(\langle w_{t-1}, x_i \rangle + b_{t-1}) \le 0$

where $(x_i, y_i) \in S$ is the point incorrectly classified by $\hat{w}_{t-1}$.

Page 46: Update Rule of the Perceptron

$w_t \leftarrow w_{t-1} + \eta y_i x_i, \qquad b_t \leftarrow b_{t-1} + \eta y_i R^2$

In augmented form:

$\hat{w}_t = \begin{bmatrix} w_t \\ b_t/R \end{bmatrix} = \begin{bmatrix} w_{t-1} \\ b_{t-1}/R \end{bmatrix} + \eta y_i \begin{bmatrix} x_i \\ R \end{bmatrix} = \hat{w}_{t-1} + \eta y_i \hat{x}_i$

Hence

$\langle \hat{w}_t, \hat{w}_{opt} \rangle = \langle \hat{w}_{t-1}, \hat{w}_{opt} \rangle + \eta y_i \langle \hat{x}_i, \hat{w}_{opt} \rangle \ge \langle \hat{w}_{t-1}, \hat{w}_{opt} \rangle + \eta\gamma \ge \langle \hat{w}_{t-2}, \hat{w}_{opt} \rangle + 2\eta\gamma \ge \dots \ge t\eta\gamma$

(Note: $y_i(\langle w_{opt} \cdot x_i \rangle + b_{opt}) \ge \gamma$ for all $1 \le i \le \ell$.)

Similarly,

$\|\hat{w}_t\|_2^2 = \|\hat{w}_{t-1}\|_2^2 + 2\eta y_i \langle \hat{w}_{t-1}, \hat{x}_i \rangle + \eta^2\|\hat{x}_i\|_2^2 \le \|\hat{w}_{t-1}\|_2^2 + \eta^2\|\hat{x}_i\|_2^2 = \|\hat{w}_{t-1}\|_2^2 + \eta^2(\|x_i\|_2^2 + R^2) \le \|\hat{w}_{t-1}\|_2^2 + 2\eta^2 R^2 \le \dots \le 2t\eta^2 R^2$

(the middle term is $\le 0$ because the point was misclassified).

Page 47: Update Rule of the Perceptron

Combining

$\langle \hat{w}_t, \hat{w}_{opt} \rangle \ge t\eta\gamma \quad \text{and} \quad \|\hat{w}_t\|_2^2 \le 2t\eta^2 R^2:$

$\sqrt{2t}\,\eta R\,\|\hat{w}_{opt}\|_2 \ge \|\hat{w}_{opt}\|_2\|\hat{w}_t\|_2 \ge \langle \hat{w}_t, \hat{w}_{opt} \rangle \ge t\eta\gamma$

$\Rightarrow \quad t \le 2\left(\frac{R}{\gamma}\right)^2\|\hat{w}_{opt}\|_2^2 \le \left(\frac{2R}{\gamma}\right)^2$

Note: $b_{opt} \le R$, so $\|\hat{w}_{opt}\|_2^2 \le \|w_{opt}\|_2^2 + 1 = 2$. $\square$

Page 48: The Perceptron Algorithm (Dual Form)

Given a linearly separable training set $S$, let

$\alpha = 0,\ \alpha \in \mathbb{R}^{\ell}; \quad b = 0; \quad R = \max_{1 \le i \le \ell} \|x_i\|$

so that $w = \sum_{i=1}^{\ell} \alpha_i y_i x_i$.

Repeat:
  for $i = 1$ to $\ell$
    if $y_i\big(\sum_{j=1}^{\ell} \alpha_j y_j \langle x_j \cdot x_i \rangle + b\big) \le 0$ then
      $\alpha_i \leftarrow \alpha_i + 1; \quad b \leftarrow b + y_i R^2$
    end if
  end for
until no mistakes are made within the for loop.

Return: $(\alpha, b)$
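A sketch of the dual form; the data enter only through the Gram matrix $G_{ij} = \langle x_i, x_j \rangle$, which is what makes the algorithm kernelizable:

```python
import numpy as np

def perceptron_dual(X, y):
    """Dual perceptron for a linearly separable set; labels y in {-1, +1}."""
    l = len(y)
    G = X @ X.T                           # Gram matrix G_ij = <x_i, x_j>
    alpha, b = np.zeros(l), 0.0
    R2 = np.max(np.sum(X**2, axis=1))     # R^2
    mistakes = True
    while mistakes:
        mistakes = False
        for i in range(l):
            if y[i] * (np.sum(alpha * y * G[:, i]) + b) <= 0:
                alpha[i] += 1             # count one more mistake on point i
                b += y[i] * R2
                mistakes = True
    return alpha, b
```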

Page 49: What Do We Get from the Dual Form of the Perceptron Algorithm?

The number of updates equals $\sum_{i=1}^{\ell} \alpha_i = \|\alpha\|_1 \le \left(\frac{2R}{\gamma}\right)^2$.

$\alpha_i > 0$ implies that the training point $(x_i, y_i)$ has been misclassified at least once during the training process.

$\alpha_i = 0$ implies that removing the training point $(x_i, y_i)$ will not affect the final result.

The training data appear in the algorithm only through the entries of the Gram matrix $G \in \mathbb{R}^{\ell \times \ell}$, defined by $G_{ij} = \langle x_i, x_j \rangle$.