Computational Pattern Analysis and Statistical Learning
Lecture 5: Supervised learning

Lecture 5A: Regression and classification
Lecture 5B: Kernel regression and classification, and stability analysis

Tijl De Bie, Konstantin Tretyakov
(Largely based on joint work with Nello Cristianini and John Shawe-Taylor)
Tartu, Estonia, November 2006
Outline

1 Lecture 5A: Regression and classification
  Linear regression
  Fisher's discriminant analysis
  Support Vector Machines

2 Lecture 5B: Kernel regression and classification, and stability analysis
  Kernel ridge regression
  How to "kernelise" an algorithm? (you should know now)
  Kernel support vector machines
  Statistical analysis of ridge regression

3 Wrap-up Lecture 5
Overview
- Recapitulation of ridge regression, now with offset
- Fisher's discriminant analysis
- Support Vector Machines
Least squares regression
- We want to approximate y_i as a linear function of x_i.
- In terms of a weight vector w, this means y_i ≈ x_i'w, or ‖y_i − x_i'w‖ ≈ 0.
- The pattern function is parameterised by w (note the minus sign):

  \pi_w(Z) = -\frac{1}{n}\sum_{i=1}^{n} (y_i - x_i'w)^2 = -\frac{1}{n}\|y - Xw\|^2

- Formal pattern recognition problem:

  \max_w \pi_w(Z) \iff \max_w -\frac{1}{n}\|y - Xw\|^2 \iff \min_w \|y - Xw\|^2
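To make the optimisation problem concrete, here is a minimal sketch (my own illustration on synthetic data, not part of the original slides) that solves min_w ‖y − Xw‖² via the normal equations X'Xw = X'y:

```python
import numpy as np

# Illustrative sketch: least squares on synthetic data (data and names are assumptions).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                 # 50 data points, 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)   # noisy linear targets

# Minimiser of ||y - Xw||^2 via the normal equations X'X w = X'y.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)                                 # should be close to w_true
```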
Least squares regression with offset
- We want to approximate y_i as an affine function of x_i.
- In terms of a weight vector w and offset b, this means y_i ≈ x_i'w + b, or ‖y_i − (x_i'w + b)‖ ≈ 0.
- The pattern function is parameterised by w and b (note the minus sign):

  \pi_{w,b}(Z) = -\frac{1}{n}\sum_{i=1}^{n} \big(y_i - (x_i'w + b)\big)^2 = -\frac{1}{n}\|y - Xw - \mathbf{1}b\|^2

- Formal pattern recognition problem:

  \max_{w,b} \pi_{w,b}(Z) \iff \max_{w,b} -\frac{1}{n}\|y - Xw - \mathbf{1}b\|^2 \iff \min_{w,b} \|y - Xw - \mathbf{1}b\|^2
Ridge regression with offset
- Danger of overfitting: usually not an issue in 1- or low-dimensional regression, but it is in high-dimensional spaces, such as when the kernel trick is used to do nonlinear regression.
- Capacity control: regularise by additionally controlling C(π_{w,b}) = ‖w‖².
  \min_{w,b} \|y - Xw - \mathbf{1}b\|^2 + \gamma\|w\|^2

Solve by taking the gradient w.r.t. w and the derivative w.r.t. b, and equating them to 0:

  (\gamma I + X'X)w + X'\mathbf{1}b - X'y = 0
  \mathbf{1}'Xw + \mathbf{1}'\mathbf{1}b - \mathbf{1}'y = 0

Solved by a linear system of equations:

  \begin{pmatrix} \gamma I + X'X & X'\mathbf{1} \\ \mathbf{1}'X & \mathbf{1}'\mathbf{1} \end{pmatrix}
  \begin{pmatrix} w \\ b \end{pmatrix}
  =
  \begin{pmatrix} X'y \\ \mathbf{1}'y \end{pmatrix}
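A minimal sketch of this linear system in code (an illustration under the assumption that X is an n x d data matrix and y a length-n target vector; the helper name ridge_with_offset is mine, not from the slides):

```python
import numpy as np

def ridge_with_offset(X, y, gamma):
    """Sketch: solve [gamma*I + X'X, X'1; 1'X, 1'1][w; b] = [X'y; 1'y]."""
    n, d = X.shape
    ones = np.ones(n)
    A = np.zeros((d + 1, d + 1))
    A[:d, :d] = gamma * np.eye(d) + X.T @ X   # gamma*I + X'X
    A[:d, d] = X.T @ ones                     # X'1
    A[d, :d] = ones @ X                       # 1'X
    A[d, d] = n                               # 1'1
    rhs = np.concatenate([X.T @ y, [ones @ y]])
    sol = np.linalg.solve(A, rhs)
    return sol[:d], sol[d]                    # weight vector w and offset b
```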
Fisher's discriminant analysis
- Let's assume binary classification: y_i ∈ {−1, 1}.
- Pattern function: learn the classifier as a thresholded linear function, ŷ = sign(x'w + b). Then:

  -g_{\pi_{w,b}}(x, y) = \left(\frac{1 - \mathrm{sign}\big(y(x'w + b)\big)}{2}\right)^2

- However, this is hard to optimise... non-convex!
- Hence, use a convex upper bound:

  -g_{\pi_{w,b}}(x, y) = \big(1 - y(x'w + b)\big)^2
Ideal:

  -g_{\pi_{w,b}}(x, y) = \left(\frac{1 - \mathrm{sign}\big(y(x'w + b)\big)}{2}\right)^2

Convex upper bound:

  -g_{\pi_{w,b}}(x, y) = \big(1 - y(x'w + b)\big)^2
Note that, for binary y ∈ {−1, 1},

  -g_{\pi_{w,b}}(x, y) = \big(1 - y(x'w + b)\big)^2 = \big(y - (x'w + b)\big)^2

(since y² = 1). This is the same cost as for ridge regression! Hence, exactly the same methodology as for (ridge) regression can be used.
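As a small illustration of this equivalence (my own synthetic example, not from the slides), binary classification can be carried out with the ordinary least squares machinery on ±1 labels, thresholding the output with the sign function:

```python
import numpy as np

# Two Gaussian clouds labelled +1 and -1 (illustrative data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1.0, 1.0, size=(30, 2)),
               rng.normal(-1.0, 1.0, size=(30, 2))])
y = np.concatenate([np.ones(30), -np.ones(30)])

X1 = np.hstack([X, np.ones((60, 1))])        # extra all-ones column plays the role of the offset b
wb, *_ = np.linalg.lstsq(X1, y, rcond=None)  # min ||y - Xw - 1b||^2
w, b = wb[:2], wb[2]
print((np.sign(X @ w + b) == y).mean())      # training accuracy of the thresholded classifier
```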
This is the cost associated with each (x_i, y_i). It is quite sensitive to outliers (quadratic!).
Support Vector Machines for robust regression
Solution: use another cost (not quadratic) that is also an upper bound on

  -g_{\pi_{w,b}}(x, y) = \left(\frac{1 - \mathrm{sign}\big(y(x'w + b)\big)}{2}\right)^2

but keep it convex...
Support vector machines
Averaging pattern function with:

  g_{w,b}(x_i) = -\max\big(0,\, 1 - y_i(x_i'w + b)\big)

The pattern function itself:

  \pi_{w,b}(X) = -\frac{1}{n}\sum_{i=1}^{n} \max\big(0,\, 1 - y_i(x_i'w + b)\big)

Capacity functional:

  C(\pi_{w,b}(X)) = \|w\|^2

Pattern recognition problem:

  \min_{w,b} \frac{1}{n}\sum_{i=1}^{n} \max\big(0,\, 1 - y_i(x_i'w + b)\big) + \gamma\|w\|^2
Introduce new variables ξ_i ≥ 0 with ξ_i ≥ 1 − y_i(x_i'w + b). Then

  \sum_{i=1}^{n} \max\big(0,\, 1 - y_i(x_i'w + b)\big) = \min_{\xi} \sum_i \xi_i

Hence:

  \min_{w,b,\xi} \frac{1}{n}\sum_{i=1}^{n} \xi_i + \gamma\|w\|^2
  \text{s.t. } \xi_i \geq 0,\quad \xi_i \geq 1 - y_i(x_i'w + b)

This is easy to solve using any quadratic programming toolbox...
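The slides suggest a quadratic programming toolbox; as a self-contained alternative, the sketch below (my own illustration, not the method prescribed by the slides) minimises the equivalent unconstrained objective (1/n) Σ_i max(0, 1 − y_i(x_i'w + b)) + γ‖w‖² by subgradient descent:

```python
import numpy as np

def svm_subgradient(X, y, gamma=0.1, steps=2000, lr=0.01):
    """Sketch: subgradient descent on the hinge loss plus gamma*||w||^2 objective."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        margins = y * (X @ w + b)
        active = margins < 1                                      # points with non-zero hinge loss
        grad_w = -(y[active, None] * X[active]).sum(axis=0) / n + 2 * gamma * w
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```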
Property: many ξ_i = 0, corresponding to y_i(x_i'w + b) ≥ 1, i.e.

  x_i'w + b \geq 1 \text{ if } y_i = 1
  x_i'w + b \leq -1 \text{ if } y_i = -1

Hence: many (x_i, y_i) can be separated by a certain margin. The points for which y_i(x_i'w + b) ≤ 1 are known as the support vectors. For some of them, x_i'w + b = y_i holds (they lie exactly on the margin).
Size of the margin: take a point on the margin, i.e. one for which x_i'w + b = y_i = 1, and another point for which x_j'w + b = −1.
The margin is the length of the projection of x_i − x_j onto w:

  (x_i - x_j)'w / \|w\| = 2/\|w\|
- The capacity functional ‖w‖² makes sure the margin is large...
- At the same time, the pattern function makes sure the classification error on the training set is small...
- The combination of these two features makes sure that the error on another set of data points, a test set, can be expected to be small.
Ridge regression: recapitulation
The optimal w and b are found as:

  \begin{pmatrix} \gamma I + X'X & X'\mathbf{1} \\ \mathbf{1}'X & \mathbf{1}'\mathbf{1} \end{pmatrix}
  \begin{pmatrix} w \\ b \end{pmatrix}
  =
  \begin{pmatrix} X'y \\ \mathbf{1}'y \end{pmatrix}

Estimate the label for a data point x as y = x'w + b.
Kernel ridge regression
Note:

  (X'X + \gamma I)w + X'\mathbf{1}b - X'y = 0 \iff w = \tfrac{1}{\gamma}\, X'(y - Xw - \mathbf{1}b)

Let's denote α = (1/γ)(y − Xw − 1b); then

  w = X'\alpha = \sum_{i=1}^{n} \alpha_i x_i

- The weight vector is a linear combination of the data points (representer theorem).
- The projection of a data point onto the weight vector is a weighted sum of kernel values (inner products):

  x'w + b = x'X'\alpha + b = \sum_{i=1}^{n} \alpha_i k(x, x_i) + b
Let's plug this into the equations (assuming that K = XX' is full rank):

  \begin{pmatrix} \gamma I + X'X & X'\mathbf{1} \\ \mathbf{1}'X & \mathbf{1}'\mathbf{1} \end{pmatrix}
  \begin{pmatrix} w \\ b \end{pmatrix}
  =
  \begin{pmatrix} X'y \\ \mathbf{1}'y \end{pmatrix}

  \begin{pmatrix} X & 0 \\ 0 & 1 \end{pmatrix}
  \begin{pmatrix} \gamma I + X'X & X'\mathbf{1} \\ \mathbf{1}'X & \mathbf{1}'\mathbf{1} \end{pmatrix}
  \begin{pmatrix} X'\alpha \\ b \end{pmatrix}
  =
  \begin{pmatrix} X & 0 \\ 0 & 1 \end{pmatrix}
  \begin{pmatrix} X'y \\ \mathbf{1}'y \end{pmatrix}

  \begin{pmatrix} \gamma K + K^2 & K\mathbf{1} \\ \mathbf{1}'K & \mathbf{1}'\mathbf{1} \end{pmatrix}
  \begin{pmatrix} \alpha \\ b \end{pmatrix}
  =
  \begin{pmatrix} K & 0 \\ 0 & 1 \end{pmatrix}
  \begin{pmatrix} y \\ \mathbf{1}'y \end{pmatrix}
  \begin{pmatrix} \gamma K + K^2 & K\mathbf{1} \\ \mathbf{1}'K & \mathbf{1}'\mathbf{1} \end{pmatrix}
  \begin{pmatrix} \alpha \\ b \end{pmatrix}
  =
  \begin{pmatrix} K & 0 \\ 0 & 1 \end{pmatrix}
  \begin{pmatrix} y \\ \mathbf{1}'y \end{pmatrix}

Multiplying the first block row by K^{-1} (K is full rank by assumption):

  \begin{pmatrix} \gamma I + K & \mathbf{1} \\ \mathbf{1}'K & \mathbf{1}'\mathbf{1} \end{pmatrix}
  \begin{pmatrix} \alpha \\ b \end{pmatrix}
  =
  \begin{pmatrix} y \\ \mathbf{1}'y \end{pmatrix}

Again: a set of linear equations...
In summary, the dual vector α and the offset b can be found efficiently by solving

  \begin{pmatrix} \gamma I + K & \mathbf{1} \\ \mathbf{1}'K & \mathbf{1}'\mathbf{1} \end{pmatrix}
  \begin{pmatrix} \alpha \\ b \end{pmatrix}
  =
  \begin{pmatrix} y \\ \mathbf{1}'y \end{pmatrix}

Then, for a test object x the label y can be predicted as

  y = \sum_{i=1}^{n} \alpha_i k(x, x_i) + b
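A minimal sketch of these two steps (assuming a precomputed kernel matrix K and, for prediction, the vector of kernel evaluations k(x, x_i) of the test object against the training points; the helper names are mine, not from the slides):

```python
import numpy as np

def kernel_ridge_fit(K, y, gamma):
    """Sketch: solve [gamma*I + K, 1; 1'K, 1'1][alpha; b] = [y; 1'y]."""
    n = len(y)
    ones = np.ones(n)
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = gamma * np.eye(n) + K   # gamma*I + K
    A[:n, n] = ones                     # 1
    A[n, :n] = ones @ K                 # 1'K
    A[n, n] = n                         # 1'1
    rhs = np.concatenate([y, [ones @ y]])
    sol = np.linalg.solve(A, rhs)
    return sol[:n], sol[n]              # dual vector alpha and offset b

def kernel_ridge_predict(k_test, alpha, b):
    """k_test[i] = k(x, x_i) for the test object x; returns sum_i alpha_i k(x, x_i) + b."""
    return k_test @ alpha + b
```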
Kernel Fisher discriminant analysis
- Just a different use of kernel ridge regression: apply it with binary labels y ∈ {−1, 1}.
- We will not discuss this in greater detail here.
Recurring themes and tricks
You should have noticed that all methods relying on inner products, distances, ... can be expressed in terms of kernel functions:

1 The 1st step in kernelising invokes an instance of the representer theorem: the parameters (weight vector, cluster centre) can be represented as a linear combination of the data: w = X'α.

2 The 2nd step plugs in this representation, and left-multiplies the equations to obtain inner products XX' where possible...

3 Kernel trick: substitute the inner products with kernel evaluations (see the kernel sketch below).
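As a sketch of that last step, the Gaussian (RBF) kernel below is one common choice that can replace the inner products XX' (an illustration; the choice of kernel and the bandwidth parameter sigma are assumptions, not prescribed by the slides):

```python
import numpy as np

def rbf_kernel_matrix(X, Z, sigma=1.0):
    """Sketch: Gaussian kernel matrix K[i, j] = exp(-||x_i - z_j||^2 / (2 sigma^2))."""
    sq_dists = (np.sum(X ** 2, axis=1)[:, None]
                + np.sum(Z ** 2, axis=1)[None, :]
                - 2.0 * X @ Z.T)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))
```

K = rbf_kernel_matrix(X, X) can then play the role of XX' in, for example, the kernel ridge regression system above.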
Kernel support vector machines
- The same trick works for support vector machines.
- But a different approach is more common here, relying on optimisation theory.
- It can be used for ridge regression, PCA, etc. as well!
The support vector machine:

  \min_{w,b,\xi} \frac{1}{n}\sum_{i=1}^{n} \xi_i + \gamma\|w\|^2
  \text{s.t. } \xi_i \geq 0,\quad \xi_i \geq 1 - y_i(x_i'w + b)

Use Lagrange multipliers α ≥ 0 and β ≥ 0 for the two sets of inequalities.
  \min_{w,b,\xi}\; \max_{\alpha,\beta}\;\; \frac{1}{n}\sum_{i=1}^{n} \xi_i + \gamma\|w\|^2 - \beta'\xi - \alpha'\big(\xi - \mathbf{1} + y \circ (Xw + \mathbf{1}b)\big)

  \max_{\alpha,\beta}\; \min_{w,b,\xi}\;\; \frac{1}{n}\mathbf{1}'\xi + \gamma\|w\|^2 - (\beta' + \alpha')\xi + \alpha'\mathbf{1} - \alpha'y\, b - \alpha'\big(y \circ Xw\big)

where ∘ denotes the elementwise product, i.e. y ∘ v = diag(y) v.

Take the gradient w.r.t. w and equate to 0:

  2\gamma w = X'\,\mathrm{diag}(y)\,\alpha

Same for ξ:

  \frac{1}{n}\mathbf{1} = \beta + \alpha

Same for b:

  \alpha'y = 0
Plugging all of this into the objective gives:

  \max_{\alpha,\beta}\; -\frac{1}{4\gamma}\,\alpha'\big(\mathrm{diag}(y)\,XX'\,\mathrm{diag}(y)\big)\alpha + \alpha'\mathbf{1}

Hence, using kernels (note that diag(y) XX' diag(y) = K ∘ yy', the elementwise product) and with the constraints on α:

  \max_{\alpha}\; -\frac{1}{4\gamma}\,\alpha'\big(K \circ yy'\big)\alpha + \alpha'\mathbf{1}
  \text{s.t. } \frac{1}{n}\mathbf{1} \geq \alpha \geq \mathbf{0},\quad \alpha'y = 0

This is the Lagrange dual formulation; Lagrange duals are often directly in kernel form...
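As a rough sketch (my own illustration with a generic solver, not a dedicated SVM package), the dual above can be handed to SciPy's SLSQP routine; Q below is the matrix (K ∘ yy')/(2γ), so that the negated dual objective is (1/2)α'Qα − α'1:

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual(K, y, gamma):
    """Sketch: solve max_a -1/(4*gamma) a'(K*yy')a + a'1  s.t. 0 <= a <= 1/n, a'y = 0."""
    n = len(y)
    Q = (K * np.outer(y, y)) / (2.0 * gamma)
    neg_obj = lambda a: 0.5 * a @ Q @ a - a.sum()     # minimise the negated dual objective
    grad = lambda a: Q @ a - np.ones(n)
    cons = [{"type": "eq", "fun": lambda a: a @ y, "jac": lambda a: y}]
    res = minimize(neg_obj, np.full(n, 1.0 / (2 * n)), jac=grad,
                   bounds=[(0.0, 1.0 / n)] * n, constraints=cons, method="SLSQP")
    return res.x                                      # dual variables alpha
```

From the stationarity condition 2γw = X'diag(y)α derived above, a test object x is then scored as (1/(2γ)) Σ_i α_i y_i k(x, x_i) + b.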
Averaging pattern functions
- The analysis will follow the same pattern as the bound for PCA.
- This is because both are based on an averaging pattern function.
- Let us first do the study in full generality, for averaging pattern functions:

  \pi(X) = \frac{1}{n}\sum_{i=1}^{n} g_\pi(x_i)
In general:

  \pi(X) - E_X\{\pi(X)\} \;\leq\; \max_{\pi\in\Pi}\big(\pi(X) - E_X\{\pi(X)\}\big)
  \;\lessapprox\; E_Z\Big\{\max_{\pi\in\Pi}\big(\pi(Z) - E_X\{\pi(X)\}\big)\Big\}
  \;\leq\; E_{XZ}\Big\{\max_{\pi\in\Pi}\big(\pi(Z) - \pi(X)\big)\Big\}

- We should make the approximate inequality into a rigorous inequality...
- Then devise an upper bound for the last quantity.
The approximate equality for averaging pattern functions:

  \max_{\pi\in\Pi}\big(\pi(X) - E_X\{\pi(X)\}\big) \;\approx\; E_Z\Big\{\max_{\pi\in\Pi}\big(\pi(Z) - E_X\{\pi(X)\}\big)\Big\}

- Let us assume that |g_π(x) − g_π(x̃)| ≤ M (true e.g. if 0 ≤ g_π(x) ≤ M).
- Then, replacing one data point x_i by a different value x̃_i can change the value of this function of X by at most M/n (this requires some thought... check it!)
- Hence McDiarmid's inequality applies...
McDiarmid's inequality (again):

Theorem (McDiarmid's inequality)
For f a function of X = {x_1, x_2, ..., x_i, ..., x_n} with the x_i i.i.d., if f(X) has bounded differences c_i, meaning that |f(X) − f(X^i)| ≤ c_i whenever X^i differs from X only in the i-th point, we have that

  P\big(f(X) - E\{f(X)\} < \varepsilon\big) \;\geq\; 1 - \exp\left(\frac{-2\varepsilon^2}{\sum_{i=1}^{n} c_i^2}\right).
McDiarmid's inequality with c_i = M/n: with probability at least 1 − exp(−2nε²/M²),

  \max_{\pi\in\Pi}\big(\pi(X) - E_X\{\pi(X)\}\big) - E_Z\Big\{\max_{\pi\in\Pi}\big(\pi(Z) - E_X\{\pi(X)\}\big)\Big\} < \varepsilon

In other words, with a probability of at least 1 − δ/2, we have that:

  \max_{\pi\in\Pi}\big(\pi(X) - E_X\{\pi(X)\}\big) \;\leq\; E_Z\Big\{\max_{\pi\in\Pi}\big(\pi(Z) - E_X\{\pi(X)\}\big)\Big\} + M\sqrt{\frac{\ln(2/\delta)}{2n}}.
⇒ The quantity to be bounded: E_{XZ}{max_{π∈Π}(π(Z) − π(X))}.
It is bounded by the Rademacher complexity (with σ_i i.i.d., equal to 1 or −1, each with probability 1/2):

  E_{XZ}\Big\{\max_{\pi}\big(\pi(Z) - \pi(X)\big)\Big\}
  = E_{XZ}\left\{\max_{\pi\in\Pi} \frac{1}{n}\sum_{i=1}^{n}\big(g_\pi(z_i) - g_\pi(x_i)\big)\right\}
  = E_{XZ\sigma}\left\{\max_{\pi\in\Pi} \left|\frac{1}{n}\sum_{i=1}^{n}\sigma_i\big(g_\pi(z_i) - g_\pi(x_i)\big)\right|\right\}
  \leq E_{X\sigma}\left\{\max_{\pi\in\Pi} \left|\frac{2}{n}\sum_{i=1}^{n}\sigma_i\, g_\pi(x_i)\right|\right\} \;=:\; R(\Pi)
Rademacher complexity
Definition (Rademacher and empirical Rademacher complexity)
The Rademacher complexity R(Π) of a space Π of averaging pattern functions π with π(X) = (1/n) Σ_{i=1}^n g_π(x_i) is given by

  R(\Pi) = E_{X\sigma}\left\{\max_{\pi\in\Pi} \left|\frac{2}{n}\sum_{i=1}^{n}\sigma_i\, g_\pi(x_i)\right|\right\}.

The empirical Rademacher complexity \hat{R}_X(Π) of the same pattern space and for given data X = {x_1, x_2, ..., x_n} is given by

  \hat{R}_X(\Pi) = E_{\sigma}\left\{\max_{\pi\in\Pi} \left|\frac{2}{n}\sum_{i=1}^{n}\sigma_i\, g_\pi(x_i)\right|\right\}.
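For intuition, the expectation over σ can be estimated by Monte Carlo sampling. The sketch below (my own illustration, not from the slides) does this for the simple class of linear functions g_w(x) = x'w with ‖w‖ ≤ B, where the inner maximisation has the closed form B · ‖Σ_i σ_i x_i‖:

```python
import numpy as np

def empirical_rademacher_linear(X, B=1.0, n_samples=1000, seed=0):
    """Sketch: Monte Carlo estimate of E_sigma max_{||w||<=B} |(2/n) sum_i sigma_i x_i'w|."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    vals = np.empty(n_samples)
    for s in range(n_samples):
        sigma = rng.choice([-1.0, 1.0], size=n)   # Rademacher signs
        vals[s] = B * np.linalg.norm(sigma @ X)   # closed-form inner maximum for the linear class
    return 2.0 / n * vals.mean()
```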
- McDiarmid's inequality ⇒ \hat{R}_X(Π) ≈ R(Π) with high probability.
- Indeed, it is easy to verify that McDiarmid's theorem applies with c_i = 2M/n, showing that with a probability of at least 1 − δ/2,

  R(\Pi) \;\leq\; \hat{R}_X(\Pi) + 2M\sqrt{\frac{\ln(2/\delta)}{2n}}
Rademacher bounds
The resulting empirical Rademacher-type bound is given by

  \pi(X) - E_X\{\pi(X)\} = \pi(X) - E_x\{g_\pi(x)\} \;\leq\; \hat{R}_X(\Pi) + 3M\sqrt{\frac{\ln(2/\delta)}{2n}}

which holds with probability 1 − 2·(δ/2) = 1 − δ over random draws of X. Here, M is an upper bound on |g_π(x) − g_π(x̃)| for all x, x̃.
The power of this type of bound:
- It is quite tight and data-dependent.
- \hat{R}_X(Π) is usually easy to bound.
Ridge regression stability bound (without offset)
We prove stability for the norm-constrained formulation:

  \min_{w} \frac{1}{n}\|Xw - y\|^2 \quad\text{s.t.}\quad \|w\|^2 \leq c

Assume ‖x‖² ≤ R_x² and |y| ≤ R_y. Then

  0 \leq g_{\pi_w}(x, y) = (x'w - y)^2 \leq cR_x^2 + 2\sqrt{c}\,R_x R_y + R_y^2,

so

  M = cR_x^2 + 2\sqrt{c}\,R_x R_y + R_y^2

Empirical Rademacher complexity:

  \hat{R}_X(\Pi) \;\leq\; \frac{2c}{n}\sqrt{\sum_{i=1}^{n}(x_i'x_i)^2} + \frac{2}{n}\sqrt{\sum_{i=1}^{n}y_i^4} + \frac{4\sqrt{c}}{n}\sqrt{\sum_{i=1}^{n}y_i^2\,(x_i'x_i)}.
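The three sums in this bound are directly computable from the data; a small sketch (the helper name is mine) that evaluates the right-hand side:

```python
import numpy as np

def ridge_rademacher_bound(X, y, c):
    """Sketch: (2c/n)*sqrt(sum (x_i'x_i)^2) + (2/n)*sqrt(sum y_i^4)
    + (4*sqrt(c)/n)*sqrt(sum y_i^2 * x_i'x_i)."""
    n = len(y)
    sq_norms = np.sum(X * X, axis=1)             # x_i'x_i for each data point
    return ((2.0 * c / n) * np.sqrt(np.sum(sq_norms ** 2))
            + (2.0 / n) * np.sqrt(np.sum(y ** 4))
            + (4.0 * np.sqrt(c) / n) * np.sqrt(np.sum(y ** 2 * sq_norms)))
```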
Derivation:

  \hat{R}_X(\Pi) = E_\sigma\left\{\max_w \left|\frac{2}{n}\sum_{i=1}^{n}\sigma_i\,(x_i'w - y_i)^2\right|\right\}
  = E_\sigma\left\{\max_w \left|\frac{2}{n}\sum_{i=1}^{n}\sigma_i\,\big((x_i'w)^2 + y_i^2 - 2y_i x_i'w\big)\right|\right\}
  \leq \frac{2}{n}\, E_\sigma\left\{\max_w \left(\left|\sum_{i=1}^{n}\sigma_i (x_i'w)^2\right| + \left|\sum_{i=1}^{n}\sigma_i y_i^2\right| + 2\left|\sum_{i=1}^{n}\sigma_i y_i x_i'w\right|\right)\right\}
  \leq \frac{2}{n}\, E_\sigma\left\{\max_w \left|\left\langle \sum_{i=1}^{n}\sigma_i x_i x_i',\, ww'\right\rangle\right| + \left|\sum_{i=1}^{n}\sigma_i y_i^2\right| + 2\max_w \left|\left\langle \sum_{i=1}^{n}\sigma_i y_i x_i,\, w\right\rangle\right|\right\}
  \hat{R}_X(\Pi) \;\leq\; \frac{2}{n}\, E_\sigma\left\{ c\,\sqrt{\sum_{i,j=1}^{n}\left\langle \sigma_i x_i x_i',\, \sigma_j x_j x_j'\right\rangle} + \sqrt{\sum_{i,j=1}^{n}\sigma_i\sigma_j\, y_i^2 y_j^2} + 2\sqrt{c}\,\sqrt{\sum_{i,j=1}^{n}\left\langle \sigma_i y_i x_i,\, \sigma_j y_j x_j\right\rangle} \right\}

  \leq\; \frac{2c}{n}\sqrt{E_\sigma\left\{\sum_{i,j=1}^{n}\sigma_i\sigma_j \left\langle x_i x_i',\, x_j x_j'\right\rangle\right\}} + \frac{2}{n}\sqrt{E_\sigma\left\{\sum_{i,j=1}^{n}\sigma_i\sigma_j\, y_i^2 y_j^2\right\}} + \frac{4\sqrt{c}}{n}\sqrt{E_\sigma\left\{\sum_{i,j=1}^{n}\sigma_i\sigma_j \left\langle y_i x_i,\, y_j x_j\right\rangle\right\}}
Since E_σ{σ_i σ_j} = 0 for i ≠ j and E_σ{σ_i²} = 1, only the diagonal terms survive in each expectation, giving

  \hat{R}_X(\Pi) \;\leq\; \frac{2c}{n}\sqrt{\sum_{i=1}^{n}(x_i'x_i)^2} + \frac{2}{n}\sqrt{\sum_{i=1}^{n}y_i^4} + \frac{4\sqrt{c}}{n}\sqrt{\sum_{i=1}^{n}y_i^2\,(x_i'x_i)}
Wrap-up
Supervised learning methods:
- Ridge regression revisited (now with offset)
- Fisher's discriminant analysis
- Support vector machines

Kernel versions of these methods.

Statistical study:
- In general, for averaging pattern functions, using Rademacher complexities
- In particular, applied to ridge regression