Kernel Regression
Prof. Bennett
Math Model of Learning and Discovery 2/24/03
Based on Chapter 2 of Shawe-Taylor and Cristianini
Ridge Regression Review
Use the least norm solution for fixed $\lambda > 0$.

Regularized problem:
$$\min_{w} L(w, S) = \lambda \|w\|^2 + \|y - Xw\|^2$$

Optimality condition:
$$\frac{\partial L(w, S)}{\partial w} = 2\lambda w - 2X'y + 2X'Xw = 0$$
$$(X'X + \lambda I_n)\, w = X'y, \qquad \lambda > 0$$

Requires $O(n^3)$ operations.
Dual Representation
The inverse always exists for any $\lambda > 0$.

Alternative representation:
$$w = (X'X + \lambda I)^{-1} X'y, \qquad \lambda > 0$$

Rearranging the optimality condition,
$$(X'X + \lambda I)\, w = X'y \;\Rightarrow\; \lambda w = X'(y - Xw) \;\Rightarrow\; w = X'\alpha, \quad \text{where } \alpha = \lambda^{-1}(y - Xw)$$

Substituting $w = X'\alpha$ back in,
$$\lambda \alpha = y - XX'\alpha \;\Rightarrow\; (G + \lambda I_l)\,\alpha = y, \quad \text{where } G = XX'$$

Solving this $l \times l$ system is $O(l^3)$.
Dual Ridge Regression
To predict a new point $x$:
$$g(x) = \langle w, x\rangle = \sum_{i=1}^{l} \alpha_i \langle x_i, x\rangle = y'(G + \lambda I)^{-1} z, \quad \text{where } z_i = \langle x_i, x\rangle$$

Note we need only compute $G$, the Gram matrix:
$$G = XX', \qquad G_{ij} = \langle x_i, x_j\rangle$$
Ridge Regression requires only inner products between data points
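As a quick numerical sketch of this equivalence (assuming NumPy and small synthetic data; not from the slides), the primal and dual solves recover the same $w$:

import numpy as np

rng = np.random.default_rng(0)
l, n = 10, 3                       # l points in n dimensions
X = rng.standard_normal((l, n))
y = rng.standard_normal(l)
lam = 0.5                          # regularization parameter lambda

# Primal: (X'X + lam I_n) w = X'y, an n x n solve, O(n^3)
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Dual: (G + lam I_l) alpha = y with G = XX', an l x l solve, O(l^3)
G = X @ X.T
alpha = np.linalg.solve(G + lam * np.eye(l), y)
w_dual = X.T @ alpha               # w = X' alpha

print(np.allclose(w_primal, w_dual))   # True: same predictor either way

Whether the primal or dual solve is cheaper depends on whether $n$ or $l$ is smaller.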
Linear Regression in Feature Space
Key Idea:
Map data to higher dimensional space (feature space) and perform linear regression in embedded space.
Embedding map:
$$\Phi : \mathbb{R}^n \to F \subseteq \mathbb{R}^N, \quad N \gg n, \quad x \mapsto \Phi(x)$$
Kernel Function
A kernel is a function K such that
$$K(x, u) = \langle \Phi(x), \Phi(u)\rangle_F$$
where $\Phi$ is a mapping from input space to feature space $F$.

There are many possible kernels. The simplest is the linear kernel:
$$K(x, u) = \langle x, u\rangle$$
Ridge Regression in Feature Space
To predict a new point $x$:
$$g(x) = \langle w, \Phi(x)\rangle = \sum_{i=1}^{l} \alpha_i \langle \Phi(x_i), \Phi(x)\rangle = y'(G + \lambda I)^{-1} z, \quad \text{where } z_i = \langle \Phi(x_i), \Phi(x)\rangle$$

To compute the Gram matrix, use the kernel to compute the inner products:
$$G = \Phi(X)\Phi(X)', \qquad G_{ij} = \langle \Phi(x_i), \Phi(x_j)\rangle = K(x_i, x_j)$$
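A minimal kernel ridge sketch (assuming NumPy; the polynomial kernel and the synthetic data are illustrative stand-ins, since the slides only name the linear kernel):

import numpy as np

def kernel(A, B, degree=2):
    # A polynomial kernel as one illustrative choice; degree=1 recovers
    # the linear kernel named in the slides.
    return (A @ B.T + 1.0) ** degree

rng = np.random.default_rng(1)
l = 20
X = rng.standard_normal((l, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(l)
lam = 0.1

G = kernel(X, X)                         # Gram matrix, G_ij = K(x_i, x_j)
alpha = np.linalg.solve(G + lam * np.eye(l), y)

x_new = rng.standard_normal((1, 2))
z = kernel(X, x_new).ravel()             # z_i = K(x_i, x_new)
g = alpha @ z                            # g(x) = sum_i alpha_i K(x_i, x)
print(g)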
Alternative Dual Derivation
Original math model:
$$\min_{w} f(w) = \lambda \|w\|^2 + \|y - Xw\|^2$$

Equivalent math model (scaling the objective by $\tfrac{1}{2}$ does not change the minimizer):
$$\min_{w,z} f(w, z) = \tfrac{\lambda}{2}\|w\|^2 + \tfrac{1}{2}\sum_{i=1}^{l} z_i^2$$
$$\text{s.t. } y_i - \langle x_i, w\rangle = z_i, \quad i = 1, \ldots, l$$

Construct the dual using Wolfe duality.
Lagrangian Function
Consider the problem
$$\min_{r} f(r), \quad f : \mathbb{R}^n \to \mathbb{R} \text{ differentiable}$$
$$\text{s.t. } h_i(r) = 0, \quad i = 1, \ldots, m, \quad \text{each } h_i : \mathbb{R}^n \to \mathbb{R} \text{ differentiable}$$

The Lagrangian function is
$$L(r, u) = f(r) + \sum_{i=1}^{m} u_i h_i(r)$$
with gradient
$$\nabla_r L(r, u) = \nabla f(r) + \sum_{i=1}^{m} u_i \nabla h_i(r)$$
Wolfe Dual Problem
Primal:
$$\min_{r} f(r), \quad f : \mathbb{R}^n \to \mathbb{R} \text{ differentiable and convex}$$
$$\text{s.t. } h_i(r) = 0, \quad i = 1, \ldots, m$$

Dual:
$$\max_{r,u} L(r, u) = f(r) + \sum_{i=1}^{m} u_i h_i(r)$$
$$\text{s.t. } \nabla_r L(r, u) = \nabla f(r) + \sum_{i=1}^{m} u_i \nabla h_i(r) = 0$$
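A one-variable worked example (not from the slides) may make the mechanics concrete: take $f(r) = r^2$ with the single constraint $h(r) = r - 1 = 0$.
$$\begin{aligned}
&\text{Primal: } \min_r r^2 \ \text{s.t. } r - 1 = 0, \quad \text{optimum } r^* = 1,\ f(r^*) = 1.\\
&L(r, u) = r^2 + u(r - 1), \qquad \nabla_r L = 2r + u = 0 \;\Rightarrow\; r = -u/2.\\
&\text{Dual: } \max_u L(-u/2, u) = -\tfrac{u^2}{4} - u \;\Rightarrow\; u^* = -2, \ \text{dual value } 1.
\end{aligned}$$
Both optimal values equal 1, as Wolfe duality guarantees for a convex differentiable problem.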
Lagrangian Function
Primal:
$$\min_{w,z} f(w, z) = \tfrac{\lambda}{2}\|w\|^2 + \tfrac{1}{2}\sum_{i=1}^{l} z_i^2$$
$$\text{s.t. } y_i - \langle x_i, w\rangle = z_i, \quad i = 1, \ldots, l$$

Lagrangian:
$$L(w, z, \alpha) = \tfrac{\lambda}{2}\|w\|^2 + \tfrac{1}{2}\sum_{i=1}^{l} z_i^2 + \sum_{i=1}^{l} \alpha_i \left(y_i - \langle x_i, w\rangle - z_i\right)$$
$$\nabla_w L(w, z, \alpha) = \lambda w - X'\alpha = 0$$
$$\nabla_z L(w, z, \alpha) = z - \alpha = 0$$
Wolfe Dual Problem
Construct the Wolfe dual:
$$\max_{w,z,\alpha} L(w, z, \alpha) = \tfrac{\lambda}{2}\|w\|^2 + \tfrac{1}{2}\sum_{i=1}^{l} z_i^2 + \sum_{i=1}^{l} \alpha_i \left(y_i - \langle x_i, w\rangle - z_i\right)$$
$$\text{s.t. } \nabla_w L(w, z, \alpha) = \lambda w - X'\alpha = 0$$
$$\phantom{\text{s.t. }} \nabla_z L(w, z, \alpha) = z - \alpha = 0$$

Simplify by eliminating $z = \alpha$.
Simplified Problem
Get rid of $z$ by substituting $z = \alpha$:
$$\max_{w,\alpha} \tfrac{\lambda}{2}\|w\|^2 + \tfrac{1}{2}\sum_{i=1}^{l} \alpha_i^2 + \sum_{i=1}^{l} \alpha_i y_i - \alpha'Xw - \sum_{i=1}^{l} \alpha_i^2$$
$$= \max_{w,\alpha} \tfrac{\lambda}{2}\|w\|^2 - \tfrac{1}{2}\sum_{i=1}^{l} \alpha_i^2 + \sum_{i=1}^{l} \alpha_i y_i - \alpha'Xw$$
$$\text{s.t. } \nabla_w L = \lambda w - X'\alpha = 0$$

Simplify by eliminating $w = \tfrac{1}{\lambda} X'\alpha$.
Simplified Problem
Get rid of w
$$\max_{\alpha} \; -\tfrac{1}{2\lambda} \sum_{i,j=1}^{l} \alpha_i \alpha_j \langle x_i, x_j\rangle - \tfrac{1}{2}\sum_{i=1}^{l} \alpha_i^2 + \sum_{i=1}^{l} \alpha_i y_i$$

$\alpha$ is unconstrained.
Optimal solution
Problem in matrix notation with $G = XX'$:
$$\min_{\alpha} f(\alpha) = \tfrac{1}{2\lambda}\, \alpha'G\alpha + \tfrac{1}{2}\, \alpha'\alpha - y'\alpha$$

Solution satisfies
$$\nabla f(\alpha) = \tfrac{1}{\lambda}\, G\alpha + \alpha - y = 0$$
$$(G + \lambda I_l)\,\alpha = \lambda y$$
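A small numerical check (assuming NumPy; not from the slides) that this dual system recovers the primal solution via $w = \tfrac{1}{\lambda} X'\alpha$:

import numpy as np

rng = np.random.default_rng(2)
l, n = 8, 3
X = rng.standard_normal((l, n))
y = rng.standard_normal(l)
lam = 0.3

# Dual optimality condition from this derivation: (G + lam I) alpha = lam y
G = X @ X.T
alpha = np.linalg.solve(G + lam * np.eye(l), lam * y)
w_dual = X.T @ alpha / lam         # w = (1/lam) X' alpha

# Primal solution of min (lam/2)||w||^2 + (1/2)||y - Xw||^2
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
print(np.allclose(w_dual, w_primal))   # True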
What about Bias?
If we limit the regression function to $f(x) = w'x$, the solution must pass through the origin.
Many models require a bias or constant term: $f(x) = w'x + b$.
Eliminate Bias
One way to eliminate bias is to “center” the data
Make the data have mean 0:
$$\bar{y} = \tfrac{1}{l}\sum_{i=1}^{l} y_i \ \text{(mean)}, \qquad \sigma_y = \sqrt{\tfrac{1}{l}\sum_{i=1}^{l} (y_i - \bar{y})^2} \ \text{(standard deviation)}$$

Center $y$:
$$\hat{y} = y - \bar{y}\, e$$
$\hat{y}$ now has sample mean 0.

It is frequently good to also make $y$ have standard length:
$$y_{\text{normalized}} = \frac{y - \bar{y}\, e}{\sigma_y}$$
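A short sketch of this step (assuming NumPy; the sample values are arbitrary):

import numpy as np

y = np.array([2.0, 5.0, 3.0, 6.0])
y_bar = y.mean()                      # sample mean
sigma_y = y.std()                     # 1/l convention, matching the slide
y_hat = y - y_bar                     # centered: sample mean is now 0
y_normalized = y_hat / sigma_y        # standard length: unit standard deviation
print(y_normalized.mean(), y_normalized.std())   # ~0.0 and 1.0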
Center X
Mean of $X$:
$$\mu = \tfrac{1}{l}\sum_{i=1}^{l} x_i = \tfrac{1}{l} X'e, \quad \text{where } e \text{ is the vector of ones}$$

Center $X$:
$$\hat{x}_i = x_i - \mu \quad \text{(centered point)}$$
$$\hat{X} = X - e\mu' = X - \tfrac{1}{l} ee'X = (I - \tfrac{1}{l} ee')X \quad \text{(centered data)}$$
You Try
Consider a data matrix with 3 points in 4 dimensions. Compute the centered data matrix by hand and with the following formula:
$$\hat{X} = (I - \tfrac{1}{l} ee')X, \qquad X = \begin{pmatrix} 1 & 2 & 4 & 1\\ 3 & 4 & 1 & 4\\ 2 & 0 & 9 & 1 \end{pmatrix}$$
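A quick check of the exercise (assuming NumPy):

import numpy as np

X = np.array([[1., 2., 4., 1.],
              [3., 4., 1., 4.],
              [2., 0., 9., 1.]])
l = X.shape[0]
e = np.ones((l, 1))

X_hat = (np.eye(l) - e @ e.T / l) @ X    # X_hat = (I - (1/l) e e') X
print(X_hat)
print(X_hat.mean(axis=0))                # every column mean is now 0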
Center Φ(X) in Feature Space
We cannot center $\Phi(X)$ directly in feature space. Instead, center $G = XX'$:
$$\hat{G} = \hat{X}\hat{X}' = (I - \tfrac{1}{l} ee')XX'(I - \tfrac{1}{l} ee')' = (I - \tfrac{1}{l} ee')\, G\, (I - \tfrac{1}{l} ee')'$$

This works in feature space too. For $G = \Phi(X)\Phi(X)' = K$ in kernel space:
$$\hat{K} = (I - \tfrac{1}{l} ee')\, K\, (I - \tfrac{1}{l} ee')'$$
Centering Kernel
Practical Computation:
$$\hat{K} = (I - \tfrac{1}{l} ee')\, K\, (I - \tfrac{1}{l} ee')'$$

Step by step:
Let $\mu' = \tfrac{1}{l} e'K$ (row average of $K$).
Let $\bar{K} = K - e\mu'$ (subtract row average).
Let $c = \tfrac{1}{l} \bar{K}e$ (row average of $\bar{K}$).
Let $\hat{K} = \bar{K} - ce'$ (subtract column average).
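A sketch verifying that the two routes agree (assuming NumPy; linear kernel for illustration):

import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((5, 2))
K = X @ X.T                          # linear kernel for illustration
l = K.shape[0]
e = np.ones((l, 1))
C = np.eye(l) - e @ e.T / l          # centering matrix I - (1/l) e e'

K_hat_matrix = C @ K @ C.T           # K_hat = (I - ee'/l) K (I - ee'/l)'

# Practical computation: subtract row average, then column average
mu = e.T @ K / l                     # 1 x l row of averages
K_bar = K - e @ mu                   # subtract row average
c = K_bar @ e / l                    # l x 1 column of averages
K_hat_steps = K_bar - c @ e.T        # subtract column average

print(np.allclose(K_hat_matrix, K_hat_steps))   # True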
Ridge Regression in Feature Space
Original way:
$$g(\Phi(x)) = \sum_{i=1}^{l} \alpha_i K(x_i, x), \qquad \alpha = (G + \lambda I)^{-1} y$$

Predicted normalized $y$:
$$\hat{g}(\hat{\Phi}(x)) = \sum_{i=1}^{l} \hat{\alpha}_i \hat{K}(x_i, x), \qquad \hat{\alpha} = (\hat{G} + \lambda I)^{-1} \hat{y}$$

Predicted original $y$:
$$g(\Phi(x)) = \sigma_y \sum_{i=1}^{l} \hat{\alpha}_i \hat{K}(x_i, x) + \bar{y}$$
Centering Test Data
Calculate the test kernel just like the training kernel, using the training row average $\mu = \tfrac{1}{l} K_{tr} e$:
$$\hat{K}_{tr} = (K_{tr} - e\mu')(I - \tfrac{1}{l} ee')'$$
$$\hat{K}_{tst} = (K_{tst} - e\mu')(I - \tfrac{1}{l} ee')'$$
(in the test formula, the leading $e$ has one entry per test point)

Prediction of the test data becomes:
$$\hat{g}(\Phi(X_{tst})) = \hat{K}_{tst}\, \hat{\alpha}, \qquad \hat{\alpha} = (\hat{G} + \lambda I)^{-1} \hat{y}$$
$$g(\Phi(X_{tst})) = \sigma_y\, \hat{K}_{tst}\, \hat{\alpha} + \bar{y}\, e$$
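A sketch of the full centered pipeline (assuming NumPy; the linear kernel and the synthetic data, which carry a true bias of 3, are illustrative):

import numpy as np

def linear_kernel(A, B):
    # Linear kernel for illustration; any kernel works the same way here.
    return A @ B.T

rng = np.random.default_rng(4)
X_tr, X_tst = rng.standard_normal((10, 3)), rng.standard_normal((4, 3))
y = X_tr @ np.array([1.0, -2.0, 0.5]) + 3.0       # data with a true bias
lam = 0.1
l = X_tr.shape[0]
e = np.ones((l, 1))
C = np.eye(l) - e @ e.T / l                       # I - (1/l) e e'

# Center the training kernel and the targets
K_tr = linear_kernel(X_tr, X_tr)
mu = e.T @ K_tr / l                               # training row average
K_tr_hat = (K_tr - e @ mu) @ C.T
y_bar, sigma_y = y.mean(), y.std()
y_hat = (y - y_bar) / sigma_y

# Solve for alpha_hat on centered, normalized data
alpha_hat = np.linalg.solve(K_tr_hat + lam * np.eye(l), y_hat)

# Center the test kernel with the *training* row average mu
e_tst = np.ones((X_tst.shape[0], 1))
K_tst = linear_kernel(X_tst, X_tr)
K_tst_hat = (K_tst - e_tst @ mu) @ C.T

# Predict in normalized units, then map back to the original units
g = sigma_y * (K_tst_hat @ alpha_hat) + y_bar
print(g)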
Alternate Approach
Directly add bias to the model:
$$y_i \approx \langle x_i, w\rangle + b, \qquad b \text{ is the bias}$$

The optimization problem becomes:
$$\min_{w,b,z} f(w, b, z) = \tfrac{\lambda}{2}\|w\|^2 + \tfrac{1}{2}\sum_{i=1}^{l} z_i^2$$
$$\text{s.t. } y_i - \langle x_i, w\rangle - b = z_i, \quad i = 1, \ldots, l$$
Lagrangian Function
Primal:
$$\min_{w,b,z} f(w, b, z) = \tfrac{\lambda}{2}\|w\|^2 + \tfrac{1}{2}\sum_{i=1}^{l} z_i^2$$
$$\text{s.t. } y_i - \langle x_i, w\rangle - b = z_i, \quad i = 1, \ldots, l$$

Lagrangian:
$$L(w, b, z, \alpha) = \tfrac{\lambda}{2}\|w\|^2 + \tfrac{1}{2}\sum_{i=1}^{l} z_i^2 + \sum_{i=1}^{l} \alpha_i \left(y_i - \langle x_i, w\rangle - b - z_i\right)$$
$$\nabla_w L(w, b, z, \alpha) = \lambda w - X'\alpha = 0$$
$$\nabla_z L(w, b, z, \alpha) = z - \alpha = 0$$
$$\frac{\partial L(w, b, z, \alpha)}{\partial b} = -\sum_{i=1}^{l} \alpha_i = -e'\alpha = 0$$
Wolfe Dual Problem
$$\max_{w,b,z,\alpha} \tfrac{\lambda}{2}\|w\|^2 + \tfrac{1}{2}\sum_{i=1}^{l} z_i^2 + \sum_{i=1}^{l} \alpha_i \left(y_i - \langle x_i, w\rangle - b - z_i\right)$$
$$\text{s.t. } \nabla_w L = \lambda w - X'\alpha = 0$$
$$\phantom{\text{s.t. }} \nabla_z L = z - \alpha = 0$$
$$\phantom{\text{s.t. }} \partial L/\partial b = -e'\alpha = 0$$

Simplify by eliminating $z = \alpha$ and using $e'\alpha = 0$.
Simplified Problem
After eliminating $z = \alpha$ (the $b\,e'\alpha$ term vanishes because $e'\alpha = 0$):
$$\max_{w,\alpha} \tfrac{\lambda}{2}\|w\|^2 - \tfrac{1}{2}\sum_{i=1}^{l} \alpha_i^2 + \sum_{i=1}^{l} \alpha_i y_i - \alpha'Xw$$
$$\text{s.t. } \nabla_w L = \lambda w - X'\alpha = 0$$
$$\phantom{\text{s.t. }} e'\alpha = 0$$

Simplify by eliminating $w = \tfrac{1}{\lambda} X'\alpha$.
Simplified Problem
Get rid of $w$:
$$\max_{\alpha} \; -\tfrac{1}{2\lambda} \sum_{i,j=1}^{l} \alpha_i \alpha_j \langle x_i, x_j\rangle - \tfrac{1}{2}\sum_{i=1}^{l} \alpha_i^2 + \sum_{i=1}^{l} \alpha_i y_i$$
$$\text{s.t. } \sum_{i=1}^{l} \alpha_i = 0$$
New Problem to be solved
Problem in matrix notation with G=XX’
This is a constrained optimization problem. The solution still comes from a system of equations (the KKT conditions), but it is not as simple:
$$\min_{\alpha} f(\alpha) = \tfrac{1}{2\lambda}\, \alpha'G\alpha + \tfrac{1}{2}\, \alpha'\alpha - y'\alpha$$
$$\text{s.t. } e'\alpha = 0$$
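A sketch (assuming NumPy; not from the slides) of solving this constrained problem through its KKT conditions, which form one larger linear system:

import numpy as np

rng = np.random.default_rng(5)
l = 6
X = rng.standard_normal((l, 2))
y = rng.standard_normal(l)
lam = 0.2
G = X @ X.T
e = np.ones(l)

# KKT system for min (1/(2 lam)) a'Ga + (1/2) a'a - y'a  s.t. e'a = 0:
#   (G/lam + I) a + nu e = y   (stationarity; nu is the multiplier)
#   e'a = 0                    (feasibility)
A = np.block([[G / lam + np.eye(l), e[:, None]],
              [e[None, :],          np.zeros((1, 1))]])
rhs = np.concatenate([y, [0.0]])
sol = np.linalg.solve(A, rhs)
alpha, nu = sol[:l], sol[l]

w = X.T @ alpha / lam       # recover w = (1/lam) X' alpha
b = nu                      # comparing KKT with the primal constraints gives nu = b
print(e @ alpha)            # ~0: the constraint holds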
Kernel Ridge Regression
The centered algorithm just requires centering the kernel and solving one linear system.
Can also add bias directly.
+ Lots of fast equation solvers.
+ Theory supports generalization.
- Requires the full training kernel to compute.
- Requires the full training kernel to predict future points.