
Support vector machines
Techniques d’intelligence artificielle

Andreas Classen

Professor Pierre-Yves Schobbens

FUNDP Namur, Institut d’Informatique
Rue Grandgagnage, 21
B-5000 Namur

13 March 2007


Outline

Part I - Background

Part II - Support Vector Machines

Part III - Conclusion


Part I

Background


Outline

1 Introduction

2 Machine Learning
    Supervised vs. Unsupervised learning
    Classification
    SVM Overview

3 VC-Dimension
    Risk Minimisation
    VC-Dimension


Introduction

Support Vector Machines are supervised learning methods used for classification and regression.

Note that this introduction only covers the classification case.

Introduced in the early 1990s by Vapnik
Popular due to success in handwritten digit recognition


What is Machine Learning?

Part of the artificial intelligence research field
The goal is the development of algorithms that allow computers to learn
The idea is to generate knowledge from experience (induction)
Strong links to statistics
Classification is a subfield of machine learning


Supervised learning

Learning a function based on given training data
Pairs of inputs and desired outputs are given
The algorithm learns the function that generates these pairs
The objective is to predict the output value of unseen inputs
Examples: support vector learning, neural networks, nearest neighbor, decision trees


Unsupervised learning

Learning a function based on given training data
The training data only contains the inputs
Outputs are learned through clustering of the inputs
The algorithm learns the probability distribution (or density) of the input set
The objective is to predict the output value of unseen inputs
Examples: data compression, clustering, principal components analysis


Classification

Part of almost all areas of nature and technology
The idea is to put observations into categories
Without classification, abstraction is not possible
Hence, it is a prerequisite for the formation of concepts (and finally intelligence)

⇒ Classification is thus an integral part of any form of intelligence


Classification (cont’d)

A classifier is basically a labelling function

$f : A \to B$

where A is the feature space and B a discrete set of labels (or classes)
f is generally implemented as an algorithm
A support vector machine is a classifier


SVM Overview

An SVM is a supervised learning method used for classification, which means: given a set of training data

$(x_1, y_1), \ldots, (x_\ell, y_\ell) \in \mathbb{R}^n \times \{1, -1\}$

estimate a classifier

$f : \mathbb{R}^n \to \{1, -1\}$

such that f correctly classifies the training data as well as unseen test data,

$f(x_k) = y_k,$

generated using the same probability distribution P(x, y) as the training data.
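To make this setup concrete, here is a minimal usage sketch (an addition to these slides, assuming scikit-learn is available; the data points are made up): training pairs in $\mathbb{R}^n \times \{1, -1\}$, an estimated classifier f, and predictions on unseen inputs.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical training pairs (x_i, y_i) in R^2 x {1, -1}
X = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]])
y = np.array([-1, -1, 1, 1])

f = SVC(kernel="linear").fit(X, y)            # estimate the classifier f
print(f.predict([[0.5, 0.5], [4.5, 4.5]]))    # -> [-1  1] on unseen inputs
```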


Terminology

Symbols introduced on the previous slide will be used throughout all slides.

x and other bold symbols designate vectors
y designates the class of a training input (−1 or 1)
there are ℓ training examples
P(x, y) is the probability distribution underlying the data


Errors and Risk Management

Remember, SVMs are a supervised classification method
These methods are not without errors
There is always a certain risk of getting wrong results, i.e.

$\exists f : \forall i = 1..\ell : f(x_i) = y_i$

such that for a test set

$(x_j, y_j), \ldots, (x_k, y_k) \in \mathbb{R}^n \times \{-1, 1\}$

where $\{x_j, \ldots, x_k\} \cap \{x_1, \ldots, x_\ell\} = \emptyset$, f produces only wrong results, i.e.

$\forall j = 1..k : f(x_j) \neq y_j$
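A small illustration of this danger (an addition, not from the slides): a classifier that merely memorises the training pairs satisfies $f(x_i) = y_i$ for all i, yet performs at chance level on fresh data drawn from the same distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 2))
y_train = rng.choice([-1, 1], size=20)          # labels independent of inputs

table = {tuple(x): yi for x, yi in zip(X_train, y_train)}
f = lambda x: table.get(tuple(x), 1)            # memorise training pairs

train_err = np.mean([f(x) != yi for x, yi in zip(X_train, y_train)])
X_test = rng.normal(size=(1000, 2))
y_test = rng.choice([-1, 1], size=1000)
test_err = np.mean([f(x) != yi for x, yi in zip(X_test, y_test)])
print(train_err, test_err)                      # 0.0 on training, ~0.5 unseen
```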


Empirical Risk Minimisation

In the supervised case we have training data with outputs
⇒ We can thus evaluate the training error:

$$R_{emp}[f] = \frac{1}{\ell} \sum_{i=1}^{\ell} \frac{1}{2} \left| f(x_i) - y_i \right|$$

The training error is called the empirical risk
The first idea is to minimise this empirical risk
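In code this is a one-liner; a sketch (an addition, assuming NumPy and labels in {−1, +1}) follows. Since $|f(x_i) - y_i|$ is 0 or 2, each summand is the 0/1 loss, so $R_{emp}$ is just the training error rate.

```python
import numpy as np

def empirical_risk(f, X, y):
    """R_emp[f] = (1/ell) * sum_i (1/2)|f(x_i) - y_i|, with y_i in {-1, +1}."""
    return np.mean(0.5 * np.abs(f(X) - y))    # each term is 0 or 1
```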


Empirical Risk Minimisation (cont’d)

But there are functions with minimal empirical risk and maximal test error!
How to choose if only the empirical risk is known?
The actual risk, the test error:

$$R[f] = \int \frac{1}{2} \left| f(x) - y \right| \, dP(x, y)$$

cannot be minimised by only minimising the empirical risk.
Instead, choose f such that its capacity is suitable for the amount of training data (statistical learning (VC) theory, by Vapnik and Chervonenkis).


Structural Risk Minimisation

Using capacity concepts, VC theory places bounds on the test error.

⇒ Minimise these bounds w.r.t. capacity and training error
⇒ This is called structural risk minimisation.

A well-known capacity measure is the VC dimension h.


VC-Dimension

The VC-dimension h of a set of functions f(α) is the maximum number of points that can be shattered by f(α).
Property: the VC-dimension of the set of oriented hyperplanes in $\mathbb{R}^n$ is n + 1.
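This property can be probed empirically. Below is a small sketch (an addition, assuming scikit-learn is available) that tries every labelling of three non-collinear points and of the four corners of a square in $\mathbb{R}^2$ with a nearly hard-margin linear SVM: the three points can be shattered, the four cannot (the XOR labelling fails), matching h = 2 + 1 = 3.

```python
from itertools import product

import numpy as np
from sklearn.svm import SVC

def shatters(points):
    """True if a linear classifier realises every labelling of the points."""
    for labels in product([-1, 1], repeat=len(points)):
        if len(set(labels)) == 1:          # one-class labellings are trivial
            continue
        clf = SVC(kernel="linear", C=1e6).fit(points, np.array(labels))
        if clf.score(points, np.array(labels)) < 1.0:
            return False
    return True

three = np.array([[0, 0], [1, 0], [0, 1]])          # non-collinear points
four = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])   # XOR labelling fails here
print(shatters(three), shatters(four))              # -> True False
```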


VC-Dimension (cont’d)

Usage of the VC dimension gives us the following bound (which holds with confidence 1 − η):

$$R[\alpha] \leq R_{emp}[\alpha] + \sqrt{\frac{h\left(\log\frac{2\ell}{h} + 1\right) - \log\frac{\eta}{4}}{\ell}}$$

Note that this bound holds for h < ℓ and does not depend on P(x, y)

⇒ Risk can be minimised by minimising h!
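The bound is easy to evaluate numerically; a small sketch (an addition, assuming NumPy; the example numbers are made up) follows.

```python
import numpy as np

def vc_bound(r_emp, h, ell, eta=0.05):
    """Vapnik's bound on the actual risk R, valid for h < ell,
    holding with confidence 1 - eta."""
    conf = np.sqrt((h * (np.log(2 * ell / h) + 1) - np.log(eta / 4)) / ell)
    return r_emp + conf

# e.g. 5% training error, h = 10, 1000 samples  ->  R <= ~0.31
print(vc_bound(r_emp=0.05, h=10, ell=1000))
```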


Part II

Support Vector Machines


Reminder

An SVM is a supervised learning method used for classification: given a set of training data

$(x_1, y_1), \ldots, (x_\ell, y_\ell) \in \mathbb{R}^n \times \{1, -1\}$

estimate a classifier

$f : \mathbb{R}^n \to \{1, -1\}$

such that f correctly classifies the training data as well as unseen data,

$f(x_p) = y_p,$

generated using the same probability distribution P(x, y) as the training data.


Different Situations

There are different types of SV machines, based on the properties of the training set.

1 Linear machines on separable data
    Both classes form disjoint convex sets
    No training error
    Easiest case

2 Linear machines on nonseparable data
    Generalisation of the previous case
    Both classes overlap at some point
    There will thus be training errors

3 Nonlinear machines on nonseparable data
    Generalisation of both previous cases
    Both classes are only separable with a non-linear function


Linear machines on separable data

[Figure: the training data; blue and yellow dots represent the classes.]


Linear machines on separable data

[Figure: the estimated classes, represented by blue and yellow areas.]


Linear machines on nonseparable data

[Figure: the training data; blue and yellow dots represent the classes.]


Linear machines on nonseparable data

[Figure: the estimated classes, represented by blue and yellow areas.]


Nonlinear machines on (non)separable data

[Figure: the training data; blue and yellow dots represent the classes.]


Nonlinear machines on (non)separable data

[Figure: the estimated classes, represented by blue and yellow areas.]


Outline

4 Linear machines on separable data
    Situation
    Optimisation problem

5 Linear machines on nonseparable data
    Situation
    Optimisation problem

6 Nonlinear machines on nonseparable data
    The Kernel Trick
    Example


Precondition

If both classes of the training data are disjoint convex sets, a hyperplane exists separating the two convex sets, i.e.

$\exists\, \mathbf{a}, b : \ \mathbf{a} \cdot \mathbf{x} < b \ \forall \mathbf{x} \in T_1 \quad \text{and} \quad \mathbf{a} \cdot \mathbf{x} > b \ \forall \mathbf{x} \in T_2$

[Figure: the hyperplane a · x = b between the sets T1 and T2.]


Situation

[Figure: a separating hyperplane {x | (w · x) + b = 0} with its normal vector w; points with y_i = −1 on one side and y_i = +1 on the other.]

Given

$(x_1, y_1), \ldots, (x_\ell, y_\ell) \in \mathbb{R}^d \times \{1, -1\}$

there exists a separating hyperplane:

$\mathbf{w} \cdot \mathbf{x} + b = 0$

with w being normal to it (the red vector in the figure).


Situation (cont’d)

[Figure: as before, with the outer hyperplanes {x | (w · x) + b = −1} and {x | (w · x) + b = +1} added.]

We suppose the training data to satisfy:

$\mathbf{x}_i \cdot \mathbf{w} + b \geq +1 \ \text{ for } y_i = +1$

$\mathbf{x}_i \cdot \mathbf{w} + b \leq -1 \ \text{ for } y_i = -1$

which is the same as:

$y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \geq 0 \quad \forall i$


Situation (cont’d)

[Figure: as before, with the margin and the distances from the origin marked.]

The distance from the origin to the separating hyperplane is:

$$\frac{|b|}{\|\mathbf{w}\|}$$

The distances from the origin to the two outer hyperplanes are $\frac{|1-b|}{\|\mathbf{w}\|}$ and $\frac{|-1-b|}{\|\mathbf{w}\|}$, so the margin between the outer hyperplanes is:

$$\frac{2}{\|\mathbf{w}\|}$$


Situation (cont’d)

[Figure: as before.]

The goal is then to maximise the margin:

$$\max \frac{2}{\|\mathbf{w}\|} \ \equiv \ \min \|\mathbf{w}\|^2$$

subject to:

$$y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \geq 0 \quad \forall i$$


Lagrangian formulation of the maximisation problem

We have reduced the problem to an optimisation problem.
But ||w|| is non-linear

⇒ Use the Lagrangian formulation

$$L_P \equiv \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{\ell} \alpha_i \left( y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \right)$$

$$\min L_P \ \text{w.r.t. } \mathbf{w}, b \quad \text{s.t.} \quad \begin{cases} \frac{\partial L_P(\mathbf{w}, b, \alpha)}{\partial \alpha_i} = 0 \\ \alpha_i \geq 0 \end{cases}$$

This is a convex quadratic programming problem (the objective is quadratic, the constraints are linear and all are convex)

⇒ Solve the dual problem given by the Karush-Kuhn-Tucker conditions, called the Wolfe dual


The Wolfe dual problem

$$\max L_P \ \text{w.r.t. } \alpha \quad \text{s.t.} \quad \begin{cases} \frac{\partial L_P(\mathbf{w}, b, \alpha)}{\partial \mathbf{w}} = 0 \ \Rightarrow \ \sum_i^{\ell} \alpha_i y_i \mathbf{x}_i = \mathbf{w} \\ \frac{\partial L_P(\mathbf{w}, b, \alpha)}{\partial b} = 0 \ \Rightarrow \ \sum_i^{\ell} \alpha_i y_i = 0 \end{cases}$$

$$\Rightarrow \quad L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j$$

$$\Rightarrow \quad \max L_D \ \text{w.r.t. } \alpha \quad \text{s.t.} \quad \begin{cases} \sum_i^{\ell} \alpha_i y_i = 0 \\ \forall i = 1..\ell : \alpha_i \geq 0 \end{cases}$$


Conclusion

Solving the dual problem gives a value for w, b and α
Training points $(x_i, y_i)$ for which the Lagrange multipliers $\alpha_i \neq 0$ are called support vectors
The solution only depends on these support vectors
All other training points are irrelevant
Our classifier is finally

$$f(\mathbf{x}) = \operatorname{sign}\left((\mathbf{w} \cdot \mathbf{x}) + b\right) = \operatorname{sign}\left(\sum_{i=1}^{\ell} \alpha_i y_i \,(\mathbf{x} \cdot \mathbf{x}_i) + b\right) = \operatorname{sign}\left(\sum_{i \in \{\text{support vectors}\}} \alpha_i y_i \,(\mathbf{x} \cdot \mathbf{x}_i) + b\right)$$


Difference

In this case, the data is linearly non-separable
The objective function will grow arbitrarily large

⇒ Relax the conditions that are too strong
⇒ This admits training errors


Situation

[Figure: as before, but one point violates the margin; the violation distance ξ/||w|| is marked.]

The training data is now supposed to satisfy the weaker conditions:

$\mathbf{x}_i \cdot \mathbf{w} + b \geq +1 - \xi_i \ \text{ for } y_i = +1$

$\mathbf{x}_i \cdot \mathbf{w} + b \leq -1 + \xi_i \ \text{ for } y_i = -1$

where $\xi_i \geq 0$, $i = 1..\ell$, are slack variables.


Situation

[Figure: as before.]

$\xi_i / \|\mathbf{w}\|$ is the distance from $\mathbf{x}_i$ to its class's outer hyperplane; if $\xi_i > 1$, $\mathbf{x}_i$ will be classified wrongly

⇒ an upper bound for the number of training errors is:

$$\sum_i \xi_i$$


Situation

[Figure: as before.]

The goal is still to maximise the margin, but the $\xi_i$ count as penalties:

$$\min \|\mathbf{w}\|^2 + C \sum_{i=1}^{\ell} \xi_i$$

subject to:

$$y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 + \xi_i \geq 0 \quad \forall i$$


Lagrangian formulation of the maximisation problem

The optimisation problem is similar to the separable case.
A multiplier $\mu_i$, $i = 1..\ell$, is added for each $\xi_i$

$$L_P \equiv \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{\ell} \xi_i - \sum_{i=1}^{\ell} \alpha_i \left( y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 + \xi_i \right) - \sum_{i=1}^{\ell} \mu_i \xi_i$$

$$\min L_P \ \text{w.r.t. } \mathbf{w}, b, \xi \quad \text{s.t.} \quad \begin{cases} \frac{\partial L_P(\mathbf{w}, b, \xi, \alpha, \mu)}{\partial \alpha_i} = 0 \\ \frac{\partial L_P(\mathbf{w}, b, \xi, \alpha, \mu)}{\partial \mu_i} = 0 \\ \alpha_i \geq 0, \quad \mu_i \geq 0 \end{cases}$$

⇒ Same approach as before: solve the Wolfe dual problem


The Wolfe dual problem

$$\max L_P \ \text{w.r.t. } \alpha, \mu \quad \text{s.t.} \quad \begin{cases} \frac{\partial L_P(\mathbf{w}, b, \xi, \alpha, \mu)}{\partial \mathbf{w}} = 0 \ \Rightarrow \ \sum_i^{\ell} \alpha_i y_i \mathbf{x}_i = \mathbf{w} \\ \frac{\partial L_P(\mathbf{w}, b, \xi, \alpha, \mu)}{\partial b} = 0 \ \Rightarrow \ \sum_i^{\ell} \alpha_i y_i = 0 \\ \frac{\partial L_P(\mathbf{w}, b, \xi, \alpha, \mu)}{\partial \xi_i} = 0 \ \Rightarrow \ C - \alpha_i - \mu_i = 0 \end{cases}$$

$$\Rightarrow \quad L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j$$

$$\Rightarrow \quad \max L_D \ \text{w.r.t. } \alpha \quad \text{s.t.} \quad \begin{cases} \sum_i^{\ell} \alpha_i y_i = 0 \\ \forall i = 1..\ell : 0 \leq \alpha_i \leq C \end{cases}$$


Conclusion

The Wolfe dual is the same as in the separable case
The only difference is $\alpha_i \leq C$

⇒ This prevents the $\alpha_i$ from growing too large

The classifier is the same as before:

$$f(\mathbf{x}) = \operatorname{sign}\left((\mathbf{w} \cdot \mathbf{x}) + b\right) = \operatorname{sign}\left(\sum_{i \in \{\text{support vectors}\}} \alpha_i y_i \,(\mathbf{x} \cdot \mathbf{x}_i) + b\right)$$
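In the dual-solver sketch from the separable case, this is literally a one-line change; C below is a hypothetical penalty value.

```python
C = 1.0                                   # hypothetical penalty parameter
res = minimize(neg_dual, np.zeros(ell), method="SLSQP",
               bounds=[(0, C)] * ell,     # 0 <= alpha_i <= C instead of >= 0
               constraints=[cons])
# b should now be averaged only over margin SVs with 0 < alpha_i < C.
```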


Difference

In this case, the data is linearly non-separable
And a linear classifier would yield a large training error

⇒ Generalise to non-linear classifiers
But how can this be done without too much trouble?


The Idea

The data currently lies in a Euclidean space L of dimension d
In this space it is linearly non-separable
Perhaps in a space of dimension d′ > d the data can actually be linearly separated?

⇒ Map the data to a space of higher dimension, in which it is linearly separable
Formally: find

$(\Theta, H) \quad \text{such that} \quad \Theta : L \to H$

hoping that in H the data is linearly separable.


The Idea (cont’d)

But...
we have to find Θ and H
a rising dimension implies rising algorithmic complexity
overfitting?

⇒ How can we do this efficiently? We need an easier way of doing this.


The Kernel Trick

If we look at our formula:

$$L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \left[ \mathbf{x}_i \cdot \mathbf{x}_j \right]$$

$$\max L_D \ \text{w.r.t. } \alpha \quad \text{s.t.} \quad \begin{cases} \sum_i^{\ell} \alpha_i y_i = 0 \\ \forall i = 1..\ell : 0 \leq \alpha_i \leq C \end{cases}$$

The training data x only appears in the form of dot products $\mathbf{x}_i \cdot \mathbf{x}_j$!

⇒ In H, it would only appear as $\Theta(\mathbf{x}_i) \cdot \Theta(\mathbf{x}_j)$


The Kernel Trick (cont’d)

Now imagine there were a K such that $K(\mathbf{x}_i, \mathbf{x}_j) = \Theta(\mathbf{x}_i) \cdot \Theta(\mathbf{x}_j)$
K is called a kernel function, and there are many
The training can be done without changes; we simply replace $\mathbf{x}_i \cdot \mathbf{x}_j$ by $K(\mathbf{x}_i, \mathbf{x}_j)$

⇒ We don't even need to know what Θ is!

But what about w, which now also lies in H?

$$f(\mathbf{x}) = \operatorname{sign}\left((\mathbf{w} \cdot \mathbf{x}) + b\right) = \operatorname{sign}\left(\sum_{i \in \{\text{support vectors}\}} \alpha_i y_i K(\mathbf{x}, \mathbf{x}_i) + b\right)$$

⇒ We still don't need to calculate Θ explicitly
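Continuing the earlier dual-solver sketch (still an illustrative addition, not the slides' code), the change is confined to how the Gram matrix is built and how f is evaluated; Θ never appears.

```python
def K(U, V):                     # example kernel: homogeneous polynomial, p = 2
    return (U @ V.T) ** 2

G = K(X, X) * np.outer(y, y)     # replaces y_i y_j (x_i . x_j) in the dual

# After solving the dual as before, classify without ever computing Theta.
# (b must likewise be computed via K on the support vectors.)
def f(x):
    return np.sign((alpha * y) @ K(X, x[None, :]).ravel() + b)
```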


Finding the kernel

Several questions remain open.

How to find the kernel?
    Mercer's condition tells whether, for a given K, a couple (Θ, H) exists, but doesn't tell how to find them
    ⇒ Research topic

Rising algorithmic complexity?
    For $K = (\mathbf{x}_i \cdot \mathbf{x}_j)^p$, computing $\Theta(\mathbf{x}_i) \cdot \Theta(\mathbf{x}_j)$ explicitly is $O\!\left(\binom{\dim(L) + p - 1}{p}\right)$
    K, however, can be computed in $O(\dim(L))$

Overfitting?
    Mapping data to a high-dimensional space is generally bad for generalisation
    The classifier in H has dim(H) + 1 parameters!
    But the intrinsic dimension is still dim(L), and there are only ℓ parameters


Example

We have

$L = \mathbb{R}^2, \qquad K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j)^2$

⇒ what is (Θ, H)?

Choose:

$H = \mathbb{R}^3$

$$\Theta : \mathbb{R}^2 \to \mathbb{R}^3 : \ \Theta\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} x_1^2 \\ \sqrt{2}\, x_1 x_2 \\ x_2^2 \end{pmatrix}$$


Example (cont’d)

$$\begin{aligned}
\Theta(\mathbf{x}) \cdot \Theta(\mathbf{x}') &= \begin{pmatrix} x_1^2 \\ \sqrt{2}\, x_1 x_2 \\ x_2^2 \end{pmatrix} \cdot \begin{pmatrix} x_1'^2 \\ \sqrt{2}\, x_1' x_2' \\ x_2'^2 \end{pmatrix} \\
&= x_1^2 x_1'^2 + 2\, x_1 x_2 x_1' x_2' + x_2^2 x_2'^2 \\
&= (x_1 x_1' + x_2 x_2')^2 \\
&= (\mathbf{x} \cdot \mathbf{x}')^2 \\
&= K(\mathbf{x}, \mathbf{x}')
\end{aligned}$$
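This identity is easy to check numerically; a tiny sketch (an addition, assuming NumPy; the test vectors are arbitrary) follows.

```python
import numpy as np

Theta = lambda x: np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])
x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(Theta(x) @ Theta(xp))   # 1.0
print((x @ xp) ** 2)          # 1.0  -> identical, as derived above
```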


Graphic interpretation

[Figure: the data in R², and its image under Θ : R² → R³.]


Graphic interpretation (cont’d)

[Figure: under Θ : R² → R³ the mapped classes become linearly separable.]


Other examples of Kernels

Polynomial of degree p:

$$K(\mathbf{x}_1, \mathbf{x}_2) = (\mathbf{x}_1 \cdot \mathbf{x}_2 + 1)^p$$

Gaussian radial basis function:

$$K(\mathbf{x}_1, \mathbf{x}_2) = e^{-\frac{\|\mathbf{x}_1 - \mathbf{x}_2\|^2}{2\sigma^2}}$$

Two-layer sigmoidal neural network:

$$K(\mathbf{x}_1, \mathbf{x}_2) = \tanh(\kappa\, \mathbf{x}_1 \cdot \mathbf{x}_2 - \delta)$$
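As plain functions (a sketch; the default parameter values are made up), these three kernels read:

```python
import numpy as np

def poly_kernel(x1, x2, p=3):
    return (x1 @ x2 + 1.0) ** p                    # polynomial of degree p

def rbf_kernel(x1, x2, sigma=1.0):
    return np.exp(-np.sum((x1 - x2) ** 2) / (2 * sigma ** 2))   # Gaussian RBF

def sigmoid_kernel(x1, x2, kappa=1.0, delta=0.0):
    return np.tanh(kappa * (x1 @ x2) - delta)      # two-layer sigmoidal NN
```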


Part III

Conclusion


Outline

7 Complexity results

8 Face Recognition Illustration
    System
    Conclusion

9 References


Solving SVMs

Training an SVM means solving the optimisation problem
Solving it analytically is rarely possible

⇒ In the real world, the problem has to be solved numerically
Existing tools for solving linearly constrained convex quadratic problems can be reused
Large problems, however, require their own solution strategy

⇒ The training complexity depends on the chosen solution method


Advantage of kernels

As seen before, even in high-dimensional spaces, kernel functions reduce complexity dramatically
Constructing hyperplanes in high-dimensional spaces is done with a reasonable amount of computing

In the following:
    d = dim(L)
    N_s = number of support vectors
    ℓ = number of training samples


Computational complexity of the testing

Testing is simply computing the formula

$$f(\mathbf{x}) = \operatorname{sign}\left((\mathbf{w} \cdot \mathbf{x}) + b\right) = \operatorname{sign}\left(\sum_{i \in \{\text{support vectors}\}} \alpha_i y_i K(\mathbf{x}, \mathbf{x}_i) + b\right)$$

Complexity:

$$O(M N_s)$$

with $M = O(K) = O(d)$ if K is a dot product or RBF kernel.


The problem

Taken from [Osuna1997]
Input: an image
Output: the coordinates of the faces found in this image

A difficult problem, with important pattern variations:
    facial appearance differs
    presence of other objects like glasses or moustaches
    the light source and resulting shadows


Approach

The problem is divided into two tasks:

1 Find all possible face-like details of the image
    Scan the image exhaustively
    ⇒ Identify every sub-image for every considered scale

2 Classify these details into face and non-face
    Train a SV machine for face detection
    Classify each sub-image with this machine


Training

Training is based on a database of face and non-face 19 × 19 pixel images
The kernel is a 2nd-degree polynomial
Preprocessing of the image is performed (a sketch follows below):
    Masking: reduce the pixels from 19 × 19 = 361 to 283, in order to reduce noise
    Illumination correction: reduce light and heavy shadows
    Histogram equalisation: compensate for differences between cameras
Training is extended by bootstrapping, i.e. the testing errors of a trained machine are stored and used as negative examples in the next training
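A hedged sketch of the three preprocessing steps named above (this is not Osuna et al.'s code; the plane-fit illumination correction and the rank-based equalisation are one plausible reading of the description), operating on a 19 × 19 grayscale window:

```python
import numpy as np

def preprocess(win, mask):
    """win: (19, 19) grayscale floats; mask: boolean (19, 19), 283 True pixels."""
    # Illumination correction: subtract the best-fit intensity plane.
    yy, xx = np.mgrid[0:19, 0:19]
    A = np.column_stack([xx.ravel(), yy.ravel(), np.ones(361)])
    coeff, *_ = np.linalg.lstsq(A, win.ravel(), rcond=None)
    win = win - (A @ coeff).reshape(19, 19)
    # Histogram equalisation: spread the masked intensities over [0, 1).
    vals = win[mask]
    ranks = np.argsort(np.argsort(vals))
    win[mask] = ranks / (ranks.max() + 1)
    # Masking: keep only the 283 retained pixels as the feature vector.
    return win[mask]
```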


Testing

Re-scale the image to get different scales of faces
Cut out the 19 × 19 windows
For each one (a sketch follows below):
    Apply the same preprocessing as in the training
    Classify the sub-image
    If positive, indicate the position of the face on the original image
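The scan itself can be sketched as follows (again an illustrative addition: `classify` stands for the trained SVM, `rescale` for any image-resampling routine, `MASK` for the 283-pixel mask, and `preprocess` for the sketch above; the scale values are made up):

```python
def detect_faces(image, classify, rescale, scales=(1.0, 0.8, 0.64)):
    """Scan every 19x19 window at several scales; return face positions."""
    hits = []
    for s in scales:
        img = rescale(image, s)                    # shrink to find larger faces
        rows, cols = img.shape
        for r in range(rows - 18):
            for c in range(cols - 18):
                window = preprocess(img[r:r + 19, c:c + 19], MASK)
                if classify(window) == +1:         # classified as a face
                    hits.append((int(r / s), int(c / s), s))
    return hits
```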


Conclusion

Developing an intelligent system using SVMs goes well beyond applying the mathematical formulae
Training examples have to be collected carefully
Raw data generally doesn't work; some preprocessing has to be done


Results

[Figures: face detection result images from [Osuna1997], spanning several slides; not reproduced in this transcript.]


References I

C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov., 2(2):121–167, 1998.

L. Kaufman. Solving the quadratic programming problem arising in support vector classification. pages 147–167, 1999.

R. S. Ledley and C. Suen, editors. Pattern Recognition Journal. Elsevier, 2007.

E. Osuna, R. Freund, and F. Girosi. Support vector machines: Training and applications. Technical Report AIM-1602, Massachusetts Institute of Technology, 1997.


References II

F. Provost, editor. Machine Learning Journal. Springer, 2007.

C. C. Rodriguez. The kernel trick. Technical report, University at Albany, 2004.

B. Schölkopf, C. Burges, and A. J. Smola. Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, MA, 1999.


Summary

Support vector machines are trained classifiers
They are based on VC theory and implement structural risk minimisation
Although they only classify into the classes 1 and −1, they can easily be extended to classify an arbitrary number of classes
They have good generalisation performance
The choice of the kernel is tricky
Even with high-dimensional kernels, the computational complexity is reasonable
Large area of applications


The End - Thank you for your attention.
