Computational Pattern Analysis and Statistical Learning
Lecture 5: Supervised learning

Lecture 5A: Regression and classification
Lecture 5B: Kernel regression and classification, and stability analysis

Tijl De Bie, Konstantin Tretyakov
(Largely based on joint work with Nello Cristianini and John Shawe-Taylor)
Tartu, Estonia, November 2006
Outline

1 Lecture 5A: Regression and classification
  Linear regression
  Fisher's discriminant analysis
  Support Vector Machines

2 Lecture 5B: Kernel regression and classification, and stability analysis
  Kernel ridge regression
  How to "kernelise" an algorithm? (you should know now)
  Kernel support vector machines
  Statistical analysis of ridge regression

3 Wrap-up Lecture 5
Overview
- Recapitulation of ridge regression, now with offset
- Fisher's discriminant analysis
- Support Vector Machines
Least squares regression
- We want to approximate y_i as a linear function of x_i.
- In terms of a weight vector w, this means y_i ≈ x_i'w, or ‖y_i − x_i'w‖ ≈ 0.
- The pattern function is parameterised by w (note the minus sign):

  \pi_w(Z) = -\frac{1}{n}\sum_{i=1}^{n} (y_i - x_i'w)^2 = -\frac{1}{n}\|y - Xw\|^2

- Formal pattern recognition problem:

  \max_w \pi_w(Z) \iff \max_w -\frac{1}{n}\|y - Xw\|^2 \iff \min_w \|y - Xw\|^2
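To make the optimisation problem concrete, here is a minimal sketch (my own illustration on synthetic data, not part of the original slides) that solves min_w ‖y − Xw‖² via the normal equations X'Xw = X'y:

```python
import numpy as np

# Illustrative sketch: least squares on synthetic data (data and names are assumptions).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                 # 50 data points, 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)   # noisy linear targets

# Minimiser of ||y - Xw||^2 via the normal equations X'X w = X'y.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)                                 # should be close to w_true
```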
Least squares regression with offset
- We want to approximate y_i as an affine function of x_i.
- In terms of a weight vector w and offset b, this means y_i ≈ x_i'w + b, or ‖y_i − (x_i'w + b)‖ ≈ 0.
- The pattern function is parameterised by w and b (note the minus sign):

  \pi_{w,b}(Z) = -\frac{1}{n}\sum_{i=1}^{n} \big(y_i - (x_i'w + b)\big)^2 = -\frac{1}{n}\|y - Xw - \mathbf{1}b\|^2

- Formal pattern recognition problem:

  \max_{w,b} \pi_{w,b}(Z) \iff \max_{w,b} -\frac{1}{n}\|y - Xw - \mathbf{1}b\|^2 \iff \min_{w,b} \|y - Xw - \mathbf{1}b\|^2
Ridge regression with offset
- Danger of overfitting: usually not an issue in 1- or low-dimensional regression, but it is in high-dimensional spaces, such as when the kernel trick is used to do nonlinear regression.
- Capacity control: regularise by additionally controlling C(π_{w,b}) = ‖w‖².
  \min_{w,b} \|y - Xw - \mathbf{1}b\|^2 + \gamma\|w\|^2

Solve by taking the gradient w.r.t. w and the derivative w.r.t. b, and equating them to 0:

  (\gamma I + X'X)w + X'\mathbf{1}b - X'y = 0
  \mathbf{1}'Xw + \mathbf{1}'\mathbf{1}b - \mathbf{1}'y = 0

Solved by a linear system of equations:

  \begin{pmatrix} \gamma I + X'X & X'\mathbf{1} \\ \mathbf{1}'X & \mathbf{1}'\mathbf{1} \end{pmatrix}
  \begin{pmatrix} w \\ b \end{pmatrix}
  =
  \begin{pmatrix} X'y \\ \mathbf{1}'y \end{pmatrix}
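A minimal sketch of this linear system in code (an illustration under the assumption that X is an n x d data matrix and y a length-n target vector; the helper name ridge_with_offset is mine, not from the slides):

```python
import numpy as np

def ridge_with_offset(X, y, gamma):
    """Sketch: solve [gamma*I + X'X, X'1; 1'X, 1'1][w; b] = [X'y; 1'y]."""
    n, d = X.shape
    ones = np.ones(n)
    A = np.zeros((d + 1, d + 1))
    A[:d, :d] = gamma * np.eye(d) + X.T @ X   # gamma*I + X'X
    A[:d, d] = X.T @ ones                     # X'1
    A[d, :d] = ones @ X                       # 1'X
    A[d, d] = n                               # 1'1
    rhs = np.concatenate([X.T @ y, [ones @ y]])
    sol = np.linalg.solve(A, rhs)
    return sol[:d], sol[d]                    # weight vector w and offset b
```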
Fisher's discriminant analysis
- Let's assume binary classification: y_i ∈ {−1, 1}.
- Pattern function: learn the classifier as a thresholded linear function, ŷ = sign(x'w + b). Then:

  -g_{\pi_{w,b}}(x, y) = \left(\frac{1 - \mathrm{sign}\big(y(x'w + b)\big)}{2}\right)^2

- However, this is hard to optimise... non-convex!
- Hence, use a convex upper bound:

  -g_{\pi_{w,b}}(x, y) = \big(1 - y(x'w + b)\big)^2
Ideal:

  -g_{\pi_{w,b}}(x, y) = \left(\frac{1 - \mathrm{sign}\big(y(x'w + b)\big)}{2}\right)^2

Convex upper bound:

  -g_{\pi_{w,b}}(x, y) = \big(1 - y(x'w + b)\big)^2
Note that, for binary y ∈ {−1, 1},

  -g_{\pi_{w,b}}(x, y) = \big(1 - y(x'w + b)\big)^2 = \big(y - (x'w + b)\big)^2

(since y² = 1). This is the same cost as for ridge regression! Hence, exactly the same methodology as for (ridge) regression can be used.
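As a small illustration of this equivalence (my own synthetic example, not from the slides), binary classification can be carried out with the ordinary least squares machinery on ±1 labels, thresholding the output with the sign function:

```python
import numpy as np

# Two Gaussian clouds labelled +1 and -1 (illustrative data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1.0, 1.0, size=(30, 2)),
               rng.normal(-1.0, 1.0, size=(30, 2))])
y = np.concatenate([np.ones(30), -np.ones(30)])

X1 = np.hstack([X, np.ones((60, 1))])        # extra all-ones column plays the role of the offset b
wb, *_ = np.linalg.lstsq(X1, y, rcond=None)  # min ||y - Xw - 1b||^2
w, b = wb[:2], wb[2]
print((np.sign(X @ w + b) == y).mean())      # training accuracy of the thresholded classifier
```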
This is the cost associated with each (x_i, y_i). It is quite sensitive to outliers (quadratic!).
Support Vector Machines for robust regression
Solution: use another cost (not quadratic) that is also an upper bound on

  -g_{\pi_{w,b}}(x, y) = \left(\frac{1 - \mathrm{sign}\big(y(x'w + b)\big)}{2}\right)^2

but keep it convex...
Support vector machines
Averaging pattern function with:

  g_{w,b}(x_i) = -\max\big(0,\, 1 - y_i(x_i'w + b)\big)

The pattern function itself:

  \pi_{w,b}(X) = -\frac{1}{n}\sum_{i=1}^{n} \max\big(0,\, 1 - y_i(x_i'w + b)\big)

Capacity functional:

  C(\pi_{w,b}(X)) = \|w\|^2

Pattern recognition problem:

  \min_{w,b} \frac{1}{n}\sum_{i=1}^{n} \max\big(0,\, 1 - y_i(x_i'w + b)\big) + \gamma\|w\|^2
Introduce new variables ξ_i ≥ 0 with ξ_i ≥ 1 − y_i(x_i'w + b). Then

  \sum_{i=1}^{n} \max\big(0,\, 1 - y_i(x_i'w + b)\big) = \min_{\xi} \sum_i \xi_i

Hence:

  \min_{w,b,\xi} \frac{1}{n}\sum_{i=1}^{n} \xi_i + \gamma\|w\|^2
  \text{s.t. } \xi_i \geq 0,\quad \xi_i \geq 1 - y_i(x_i'w + b)

This is easy to solve using any quadratic programming toolbox...
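The slides suggest a quadratic programming toolbox; as a self-contained alternative, the sketch below (my own illustration, not the method prescribed by the slides) minimises the equivalent unconstrained objective (1/n) Σ_i max(0, 1 − y_i(x_i'w + b)) + γ‖w‖² by subgradient descent:

```python
import numpy as np

def svm_subgradient(X, y, gamma=0.1, steps=2000, lr=0.01):
    """Sketch: subgradient descent on the hinge loss plus gamma*||w||^2 objective."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        margins = y * (X @ w + b)
        active = margins < 1                                      # points with non-zero hinge loss
        grad_w = -(y[active, None] * X[active]).sum(axis=0) / n + 2 * gamma * w
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```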
Property: many ξ_i = 0, corresponding to y_i(x_i'w + b) ≥ 1, i.e.

  x_i'w + b \geq 1 \text{ if } y_i = 1
  x_i'w + b \leq -1 \text{ if } y_i = -1

Hence: many (x_i, y_i) can be separated by a certain margin. The points for which y_i(x_i'w + b) ≤ 1 are known as the support vectors. For some of them, x_i'w + b = y_i holds (they lie exactly on the margin).
Size of the margin: take a point on the margin, i.e. one for which x_i'w + b = y_i = 1, and another point for which x_j'w + b = −1.
The margin is the length of the projection of x_i − x_j onto w:

  (x_i - x_j)'w / \|w\| = 2/\|w\|
- The capacity functional ‖w‖² makes sure the margin is large...
- At the same time, the pattern function makes sure the classification error on the training set is small...
- The combination of these two features makes sure that the error on another set of data points, a test set, can be expected to be small.
Ridge regression: recapitulation
The optimal w and b are found as:

  \begin{pmatrix} \gamma I + X'X & X'\mathbf{1} \\ \mathbf{1}'X & \mathbf{1}'\mathbf{1} \end{pmatrix}
  \begin{pmatrix} w \\ b \end{pmatrix}
  =
  \begin{pmatrix} X'y \\ \mathbf{1}'y \end{pmatrix}

Estimate the label for a data point x as y = x'w + b.
Kernel ridge regression
Note:

  (X'X + \gamma I)w + X'\mathbf{1}b - X'y = 0 \iff w = \tfrac{1}{\gamma}\, X'(y - Xw - \mathbf{1}b)

Let's denote α = (1/γ)(y − Xw − 1b); then

  w = X'\alpha = \sum_{i=1}^{n} \alpha_i x_i

- The weight vector is a linear combination of the data points (representer theorem).
- The projection of a data point onto the weight vector is a weighted sum of kernel values (inner products):

  x'w + b = x'X'\alpha + b = \sum_{i=1}^{n} \alpha_i k(x, x_i) + b
Let's plug this into the equations (assuming that K = XX' is full rank):

  \begin{pmatrix} \gamma I + X'X & X'\mathbf{1} \\ \mathbf{1}'X & \mathbf{1}'\mathbf{1} \end{pmatrix}
  \begin{pmatrix} w \\ b \end{pmatrix}
  =
  \begin{pmatrix} X'y \\ \mathbf{1}'y \end{pmatrix}

  \begin{pmatrix} X & 0 \\ 0 & 1 \end{pmatrix}
  \begin{pmatrix} \gamma I + X'X & X'\mathbf{1} \\ \mathbf{1}'X & \mathbf{1}'\mathbf{1} \end{pmatrix}
  \begin{pmatrix} X'\alpha \\ b \end{pmatrix}
  =
  \begin{pmatrix} X & 0 \\ 0 & 1 \end{pmatrix}
  \begin{pmatrix} X'y \\ \mathbf{1}'y \end{pmatrix}

  \begin{pmatrix} \gamma K + K^2 & K\mathbf{1} \\ \mathbf{1}'K & \mathbf{1}'\mathbf{1} \end{pmatrix}
  \begin{pmatrix} \alpha \\ b \end{pmatrix}
  =
  \begin{pmatrix} K & 0 \\ 0 & 1 \end{pmatrix}
  \begin{pmatrix} y \\ \mathbf{1}'y \end{pmatrix}
  \begin{pmatrix} \gamma K + K^2 & K\mathbf{1} \\ \mathbf{1}'K & \mathbf{1}'\mathbf{1} \end{pmatrix}
  \begin{pmatrix} \alpha \\ b \end{pmatrix}
  =
  \begin{pmatrix} K & 0 \\ 0 & 1 \end{pmatrix}
  \begin{pmatrix} y \\ \mathbf{1}'y \end{pmatrix}

Multiplying the first block row by K^{-1} (K is full rank by assumption):

  \begin{pmatrix} \gamma I + K & \mathbf{1} \\ \mathbf{1}'K & \mathbf{1}'\mathbf{1} \end{pmatrix}
  \begin{pmatrix} \alpha \\ b \end{pmatrix}
  =
  \begin{pmatrix} y \\ \mathbf{1}'y \end{pmatrix}

Again: a set of linear equations...
In summary, the dual vector α and the offset b can be found efficiently by solving

  \begin{pmatrix} \gamma I + K & \mathbf{1} \\ \mathbf{1}'K & \mathbf{1}'\mathbf{1} \end{pmatrix}
  \begin{pmatrix} \alpha \\ b \end{pmatrix}
  =
  \begin{pmatrix} y \\ \mathbf{1}'y \end{pmatrix}

Then, for a test object x the label y can be predicted as

  y = \sum_{i=1}^{n} \alpha_i k(x, x_i) + b
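A minimal sketch of these two steps (assuming a precomputed kernel matrix K and, for prediction, the vector of kernel evaluations k(x, x_i) of the test object against the training points; the helper names are mine, not from the slides):

```python
import numpy as np

def kernel_ridge_fit(K, y, gamma):
    """Sketch: solve [gamma*I + K, 1; 1'K, 1'1][alpha; b] = [y; 1'y]."""
    n = len(y)
    ones = np.ones(n)
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = gamma * np.eye(n) + K   # gamma*I + K
    A[:n, n] = ones                     # 1
    A[n, :n] = ones @ K                 # 1'K
    A[n, n] = n                         # 1'1
    rhs = np.concatenate([y, [ones @ y]])
    sol = np.linalg.solve(A, rhs)
    return sol[:n], sol[n]              # dual vector alpha and offset b

def kernel_ridge_predict(k_test, alpha, b):
    """k_test[i] = k(x, x_i) for the test object x; returns sum_i alpha_i k(x, x_i) + b."""
    return k_test @ alpha + b
```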
Kernel Fisher discriminant analysis
- Just a different use of kernel ridge regression: apply it with binary labels y ∈ {−1, 1}.
- We will not discuss this in greater detail here.
Recurring themes and tricks
You should have noticed that all methods relying on inner products, distances, ... can be expressed in terms of kernel functions:

1 The 1st step in kernelising invokes an instance of the representer theorem: the parameters (weight vector, cluster centre) can be represented as a linear combination of the data: w = X'α.

2 The 2nd step plugs in this representation, and left-multiplies the equations to obtain inner products XX' where possible...

3 Kernel trick: substitute the inner products with kernel evaluations (see the kernel sketch below).
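As a sketch of that last step, the Gaussian (RBF) kernel below is one common choice that can replace the inner products XX' (an illustration; the choice of kernel and the bandwidth parameter sigma are assumptions, not prescribed by the slides):

```python
import numpy as np

def rbf_kernel_matrix(X, Z, sigma=1.0):
    """Sketch: Gaussian kernel matrix K[i, j] = exp(-||x_i - z_j||^2 / (2 sigma^2))."""
    sq_dists = (np.sum(X ** 2, axis=1)[:, None]
                + np.sum(Z ** 2, axis=1)[None, :]
                - 2.0 * X @ Z.T)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))
```

K = rbf_kernel_matrix(X, X) can then play the role of XX' in, for example, the kernel ridge regression system above.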
Kernel support vector machines
- The same trick works for support vector machines.
- But a different approach is more common here, relying on optimisation theory.
- It can be used for ridge regression, PCA, etc. as well!
The support vector machine:

  \min_{w,b,\xi} \frac{1}{n}\sum_{i=1}^{n} \xi_i + \gamma\|w\|^2
  \text{s.t. } \xi_i \geq 0,\quad \xi_i \geq 1 - y_i(x_i'w + b)

Use Lagrange multipliers α ≥ 0 and β ≥ 0 for the two sets of inequalities.
  \min_{w,b,\xi}\; \max_{\alpha,\beta}\;\; \frac{1}{n}\sum_{i=1}^{n} \xi_i + \gamma\|w\|^2 - \beta'\xi - \alpha'\big(\xi - \mathbf{1} + y \circ (Xw + \mathbf{1}b)\big)

  \max_{\alpha,\beta}\; \min_{w,b,\xi}\;\; \frac{1}{n}\mathbf{1}'\xi + \gamma\|w\|^2 - (\beta' + \alpha')\xi + \alpha'\mathbf{1} - \alpha'y\, b - \alpha'\big(y \circ Xw\big)

where ∘ denotes the elementwise product, i.e. y ∘ v = diag(y) v.

Take the gradient w.r.t. w and equate to 0:

  2\gamma w = X'\,\mathrm{diag}(y)\,\alpha

Same for ξ:

  \frac{1}{n}\mathbf{1} = \beta + \alpha

Same for b:

  \alpha'y = 0
Plugging all of this into the objective gives:

  \max_{\alpha,\beta}\; -\frac{1}{4\gamma}\,\alpha'\big(\mathrm{diag}(y)\,XX'\,\mathrm{diag}(y)\big)\alpha + \alpha'\mathbf{1}

Hence, using kernels (note that diag(y) XX' diag(y) = K ∘ yy', the elementwise product) and with the constraints on α:

  \max_{\alpha}\; -\frac{1}{4\gamma}\,\alpha'\big(K \circ yy'\big)\alpha + \alpha'\mathbf{1}
  \text{s.t. } \frac{1}{n}\mathbf{1} \geq \alpha \geq \mathbf{0},\quad \alpha'y = 0

This is the Lagrange dual formulation; Lagrange duals are often directly in kernel form...
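As a rough sketch (my own illustration with a generic solver, not a dedicated SVM package), the dual above can be handed to SciPy's SLSQP routine; Q below is the matrix (K ∘ yy')/(2γ), so that the negated dual objective is (1/2)α'Qα − α'1:

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual(K, y, gamma):
    """Sketch: solve max_a -1/(4*gamma) a'(K*yy')a + a'1  s.t. 0 <= a <= 1/n, a'y = 0."""
    n = len(y)
    Q = (K * np.outer(y, y)) / (2.0 * gamma)
    neg_obj = lambda a: 0.5 * a @ Q @ a - a.sum()     # minimise the negated dual objective
    grad = lambda a: Q @ a - np.ones(n)
    cons = [{"type": "eq", "fun": lambda a: a @ y, "jac": lambda a: y}]
    res = minimize(neg_obj, np.full(n, 1.0 / (2 * n)), jac=grad,
                   bounds=[(0.0, 1.0 / n)] * n, constraints=cons, method="SLSQP")
    return res.x                                      # dual variables alpha
```

From the stationarity condition 2γw = X'diag(y)α derived above, a test object x is then scored as (1/(2γ)) Σ_i α_i y_i k(x, x_i) + b.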
Averaging pattern functions
- The analysis will follow the same pattern as the bound for PCA.
- This is because both are based on an averaging pattern function.
- Let us first do the study in full generality, for averaging pattern functions:

  \pi(X) = \frac{1}{n}\sum_{i=1}^{n} g_\pi(x_i)
In general:

  \pi(X) - E_X\{\pi(X)\} \;\leq\; \max_{\pi\in\Pi}\big(\pi(X) - E_X\{\pi(X)\}\big)
  \;\lessapprox\; E_Z\Big\{\max_{\pi\in\Pi}\big(\pi(Z) - E_X\{\pi(X)\}\big)\Big\}
  \;\leq\; E_{XZ}\Big\{\max_{\pi\in\Pi}\big(\pi(Z) - \pi(X)\big)\Big\}

- We should make the approximate inequality into a rigorous inequality...
- Then devise an upper bound for the last quantity.
The approximate equality for averaging pattern functions:

  \max_{\pi\in\Pi}\big(\pi(X) - E_X\{\pi(X)\}\big) \;\approx\; E_Z\Big\{\max_{\pi\in\Pi}\big(\pi(Z) - E_X\{\pi(X)\}\big)\Big\}

- Let us assume that |g_π(x) − g_π(x̃)| ≤ M (true e.g. if 0 ≤ g_π(x) ≤ M).
- Then, replacing one data point x_i by a different value x̃_i can change the value of this function of X by at most M/n (this requires some thought... check it!)
- Hence McDiarmid's inequality applies...
McDiarmid's inequality (again):

Theorem (McDiarmid's inequality)
For f a function of X = {x_1, x_2, ..., x_i, ..., x_n} with the x_i i.i.d., if f(X) has bounded differences c_i, meaning that |f(X) − f(X^i)| ≤ c_i whenever X^i differs from X only in the i-th point, we have that

  P\big(f(X) - E\{f(X)\} < \varepsilon\big) \;\geq\; 1 - \exp\left(\frac{-2\varepsilon^2}{\sum_{i=1}^{n} c_i^2}\right).
McDiarmid's inequality with c_i = M/n: with probability at least 1 − exp(−2nε²/M²),

  \max_{\pi\in\Pi}\big(\pi(X) - E_X\{\pi(X)\}\big) - E_Z\Big\{\max_{\pi\in\Pi}\big(\pi(Z) - E_X\{\pi(X)\}\big)\Big\} < \varepsilon

In other words, with a probability of at least 1 − δ/2, we have that:

  \max_{\pi\in\Pi}\big(\pi(X) - E_X\{\pi(X)\}\big) \;\leq\; E_Z\Big\{\max_{\pi\in\Pi}\big(\pi(Z) - E_X\{\pi(X)\}\big)\Big\} + M\sqrt{\frac{\ln(2/\delta)}{2n}}.
⇒ The quantity to be bounded: E_{XZ}{max_{π∈Π}(π(Z) − π(X))}.
It is bounded by the Rademacher complexity (with σ_i i.i.d., equal to 1 or −1, each with probability 1/2):

  E_{XZ}\Big\{\max_{\pi}\big(\pi(Z) - \pi(X)\big)\Big\}
  = E_{XZ}\left\{\max_{\pi\in\Pi} \frac{1}{n}\sum_{i=1}^{n}\big(g_\pi(z_i) - g_\pi(x_i)\big)\right\}
  = E_{XZ\sigma}\left\{\max_{\pi\in\Pi} \left|\frac{1}{n}\sum_{i=1}^{n}\sigma_i\big(g_\pi(z_i) - g_\pi(x_i)\big)\right|\right\}
  \leq E_{X\sigma}\left\{\max_{\pi\in\Pi} \left|\frac{2}{n}\sum_{i=1}^{n}\sigma_i\, g_\pi(x_i)\right|\right\} \;=:\; R(\Pi)
Rademacher complexity
Definition (Rademacher and empirical Rademacher complexity)
The Rademacher complexity R(Π) of a space Π of averaging pattern functions π with π(X) = (1/n) Σ_{i=1}^n g_π(x_i) is given by

  R(\Pi) = E_{X\sigma}\left\{\max_{\pi\in\Pi} \left|\frac{2}{n}\sum_{i=1}^{n}\sigma_i\, g_\pi(x_i)\right|\right\}.

The empirical Rademacher complexity \hat{R}_X(Π) of the same pattern space and for given data X = {x_1, x_2, ..., x_n} is given by

  \hat{R}_X(\Pi) = E_{\sigma}\left\{\max_{\pi\in\Pi} \left|\frac{2}{n}\sum_{i=1}^{n}\sigma_i\, g_\pi(x_i)\right|\right\}.
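For intuition, the expectation over σ can be estimated by Monte Carlo sampling. The sketch below (my own illustration, not from the slides) does this for the simple class of linear functions g_w(x) = x'w with ‖w‖ ≤ B, where the inner maximisation has the closed form B · ‖Σ_i σ_i x_i‖:

```python
import numpy as np

def empirical_rademacher_linear(X, B=1.0, n_samples=1000, seed=0):
    """Sketch: Monte Carlo estimate of E_sigma max_{||w||<=B} |(2/n) sum_i sigma_i x_i'w|."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    vals = np.empty(n_samples)
    for s in range(n_samples):
        sigma = rng.choice([-1.0, 1.0], size=n)   # Rademacher signs
        vals[s] = B * np.linalg.norm(sigma @ X)   # closed-form inner maximum for the linear class
    return 2.0 / n * vals.mean()
```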
- McDiarmid's inequality ⇒ \hat{R}_X(Π) ≈ R(Π) with high probability.
- Indeed, it is easy to verify that McDiarmid's theorem applies with c_i = 2M/n, showing that with a probability of at least 1 − δ/2,

  R(\Pi) \;\leq\; \hat{R}_X(\Pi) + 2M\sqrt{\frac{\ln(2/\delta)}{2n}}
Rademacher bounds
The resulting empirical Rademacher-type bound is given by

  \pi(X) - E_X\{\pi(X)\} = \pi(X) - E_x\{g_\pi(x)\} \;\leq\; \hat{R}_X(\Pi) + 3M\sqrt{\frac{\ln(2/\delta)}{2n}}

which holds with probability 1 − 2·(δ/2) = 1 − δ over random draws of X. Here, M is an upper bound on |g_π(x) − g_π(x̃)| for all x, x̃.
The power of this type of bound:
- It is quite tight and data-dependent.
- \hat{R}_X(Π) is usually easy to bound.
Ridge regression stability bound (without offset)
We prove stability for the norm-constrained formulation:

  \min_{w} \frac{1}{n}\|Xw - y\|^2 \quad\text{s.t.}\quad \|w\|^2 \leq c

Assume ‖x‖² ≤ R_x² and |y| ≤ R_y. Then

  0 \leq g_{\pi_w}(x, y) = (x'w - y)^2 \leq cR_x^2 + 2\sqrt{c}\,R_x R_y + R_y^2,

so

  M = cR_x^2 + 2\sqrt{c}\,R_x R_y + R_y^2

Empirical Rademacher complexity:

  \hat{R}_X(\Pi) \;\leq\; \frac{2c}{n}\sqrt{\sum_{i=1}^{n}(x_i'x_i)^2} + \frac{2}{n}\sqrt{\sum_{i=1}^{n}y_i^4} + \frac{4\sqrt{c}}{n}\sqrt{\sum_{i=1}^{n}y_i^2\,(x_i'x_i)}.
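The three sums in this bound are directly computable from the data; a small sketch (the helper name is mine) that evaluates the right-hand side:

```python
import numpy as np

def ridge_rademacher_bound(X, y, c):
    """Sketch: (2c/n)*sqrt(sum (x_i'x_i)^2) + (2/n)*sqrt(sum y_i^4)
    + (4*sqrt(c)/n)*sqrt(sum y_i^2 * x_i'x_i)."""
    n = len(y)
    sq_norms = np.sum(X * X, axis=1)             # x_i'x_i for each data point
    return ((2.0 * c / n) * np.sqrt(np.sum(sq_norms ** 2))
            + (2.0 / n) * np.sqrt(np.sum(y ** 4))
            + (4.0 * np.sqrt(c) / n) * np.sqrt(np.sum(y ** 2 * sq_norms)))
```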
Derivation:

  \hat{R}_X(\Pi) = E_\sigma\left\{\max_w \left|\frac{2}{n}\sum_{i=1}^{n}\sigma_i\,(x_i'w - y_i)^2\right|\right\}
  = E_\sigma\left\{\max_w \left|\frac{2}{n}\sum_{i=1}^{n}\sigma_i\,\big((x_i'w)^2 + y_i^2 - 2y_i x_i'w\big)\right|\right\}
  \leq \frac{2}{n}\, E_\sigma\left\{\max_w \left(\left|\sum_{i=1}^{n}\sigma_i (x_i'w)^2\right| + \left|\sum_{i=1}^{n}\sigma_i y_i^2\right| + 2\left|\sum_{i=1}^{n}\sigma_i y_i x_i'w\right|\right)\right\}
  \leq \frac{2}{n}\, E_\sigma\left\{\max_w \left|\left\langle \sum_{i=1}^{n}\sigma_i x_i x_i',\, ww'\right\rangle\right| + \left|\sum_{i=1}^{n}\sigma_i y_i^2\right| + 2\max_w \left|\left\langle \sum_{i=1}^{n}\sigma_i y_i x_i,\, w\right\rangle\right|\right\}
  \hat{R}_X(\Pi) \;\leq\; \frac{2}{n}\, E_\sigma\left\{ c\,\sqrt{\sum_{i,j=1}^{n}\left\langle \sigma_i x_i x_i',\, \sigma_j x_j x_j'\right\rangle} + \sqrt{\sum_{i,j=1}^{n}\sigma_i\sigma_j\, y_i^2 y_j^2} + 2\sqrt{c}\,\sqrt{\sum_{i,j=1}^{n}\left\langle \sigma_i y_i x_i,\, \sigma_j y_j x_j\right\rangle} \right\}

  \leq\; \frac{2c}{n}\sqrt{E_\sigma\left\{\sum_{i,j=1}^{n}\sigma_i\sigma_j \left\langle x_i x_i',\, x_j x_j'\right\rangle\right\}} + \frac{2}{n}\sqrt{E_\sigma\left\{\sum_{i,j=1}^{n}\sigma_i\sigma_j\, y_i^2 y_j^2\right\}} + \frac{4\sqrt{c}}{n}\sqrt{E_\sigma\left\{\sum_{i,j=1}^{n}\sigma_i\sigma_j \left\langle y_i x_i,\, y_j x_j\right\rangle\right\}}
Since E_σ{σ_i σ_j} = 0 for i ≠ j and E_σ{σ_i²} = 1, only the diagonal terms survive in each expectation, giving

  \hat{R}_X(\Pi) \;\leq\; \frac{2c}{n}\sqrt{\sum_{i=1}^{n}(x_i'x_i)^2} + \frac{2}{n}\sqrt{\sum_{i=1}^{n}y_i^4} + \frac{4\sqrt{c}}{n}\sqrt{\sum_{i=1}^{n}y_i^2\,(x_i'x_i)}
Wrap-up
Supervised learning methods:
- Ridge regression revisited (now with offset)
- Fisher's discriminant analysis
- Support vector machines

Kernel versions of these methods.

Statistical study:
- In general, for averaging pattern functions, using Rademacher complexities
- In particular, applied to ridge regression