Data mining and statistical learning - lecture 6
Overview
• Basis expansion
• Splines
• (Natural) cubic splines
• Smoothing splines
• Nonparametric logistic regression
• Multidimensional splines
• Wavelets
Linear basis expansion (1)
Linear regression
True model: $y = f(x) = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$
Data:
x1    x2    x3    y
1     -3    6     12
…     …     …     …
Question: How to find $\hat{f}$?
Answer: Solve a system of linear equations to obtain $\hat\beta_1, \hat\beta_2, \hat\beta_3$
Linear basis expansion (2)
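Nonlinear model
True model: $y = \beta_1 x_1 x_2 + \beta_2 x_2 e^{x_3} + \beta_3 \sin x_3 + \beta_4 x_1^2$
Data:
x1    x2    x3    y
1     -3    -1    12
…     …     …     …
Question: How to find $\hat{f}$?
Answer: A) Introduce new variables
$$u_1 = x_1 x_2, \quad u_2 = x_2 e^{x_3}, \quad u_3 = \sin x_3, \quad u_4 = x_1^2$$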
Linear basis expansion (3)
Nonlinear model
B) Transform the data set
u1    u2      u3       u4    y
-3    -1.1    -0.84    1     12
…     …       …        …     …
True model: $y = \beta_1 u_1 + \beta_2 u_2 + \beta_3 u_3 + \beta_4 u_4$
C) Apply linear regression to obtain $\hat\beta_1, \hat\beta_2, \hat\beta_3, \hat\beta_4$
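Steps A) to C) can be sketched in a few lines of NumPy. This is a minimal illustration, not part of the lecture: the coefficient values in `beta_true`, the sample size and the noise level are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1, x2, x3 = rng.normal(size=(3, n))

# Hypothetical coefficients (assumed for illustration only)
beta_true = np.array([2.0, -1.0, 0.5, 3.0])

# A) and B): new variables u1..u4 give the transformed data set
U = np.column_stack([x1 * x2, x2 * np.exp(x3), np.sin(x3), x1 ** 2])
y = U @ beta_true + 0.01 * rng.normal(size=n)

# C) ordinary least squares in the new variables (no intercept, as in the model)
beta_hat, *_ = np.linalg.lstsq(U, y, rcond=None)
print(np.round(beta_hat, 2))
```

The model is nonlinear in x but linear in the coefficients, which is why a linear solver is enough.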
Linear basis expansion (4)
Conclusion:
We can easily fit any model of the type
$$f(X) = \sum_{m=1}^{M} \beta_m h_m(X)$$
i.e., we can easily undertake a linear basis expansion in X
Example: If the model is known to be nonlinear, but the exact form is unknown, we can try to introduce interaction terms
$$f(X) = \beta_1 X_1 + \dots + \beta_p X_p + \beta_{11} X_1^2 + \beta_{12} X_1 X_2 + \dots$$
Piecewise polynomial functions
Assume X is one-dimensional
Def. Assume the domain [a, b] of X is split into intervals [a, ξ1], [ξ1, ξ2], ..., [ξn, b]. Then f(X) is said to be piecewise polynomial if f(X) is represented by separate polynomials in the different intervals.
Note. The points ξ1, ..., ξn are called knots
Piecewise polynomials
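Example. Continuous piecewise linear function
Alternative A. Introduce linear functions on each interval and a set of constraints (4 free parameters)
$$y_1 = \alpha_1 + \beta_1 x, \quad y_2 = \alpha_2 + \beta_2 x, \quad y_3 = \alpha_3 + \beta_3 x$$
$$y_1(\xi_1) = y_2(\xi_1), \quad y_2(\xi_2) = y_3(\xi_2)$$
[Figure 5.1, lower left panel]
Alternative B. Use a basis expansion (4 free parameters)
$$h_1(X) = 1, \quad h_2(X) = X, \quad h_3(X) = (X - \xi_1)_+, \quad h_4(X) = (X - \xi_2)_+$$
Theorem. The given formulations are equivalent.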
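The equivalence can be checked numerically. In the sketch below, the knots (1 and 2) and the three line segments in `three_lines` are hypothetical values chosen so that the constraints of Alternative A hold; the function names are likewise made up for the example.

```python
import numpy as np

def piecewise_linear_basis(x, knot1, knot2):
    """Columns h1..h4: 1, X, (X - knot1)_+, (X - knot2)_+."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([
        np.ones_like(x),
        x,
        np.maximum(x - knot1, 0.0),
        np.maximum(x - knot2, 0.0),
    ])

def three_lines(x):
    """Alternative A: three lines that meet continuously at x = 1 and x = 2."""
    x = np.asarray(x, dtype=float)
    return np.where(x < 1, 1 + 2 * x, np.where(x < 2, 4 - x, x))

x = np.linspace(0, 3, 31)
H = piecewise_linear_basis(x, 1.0, 2.0)

# Alternative B: the same function as a linear combination of h1..h4
beta, *_ = np.linalg.lstsq(H, three_lines(x), rcond=None)
print(np.round(beta, 6))
print(np.allclose(H @ beta, three_lines(x)))
```

The coefficients of h3 and h4 are the changes in slope at the two knots.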
Splines
Definition. A piecewise polynomial of order M (degree M-1) is called an order-M spline if it has continuous derivatives up to order M-2 at the knots.
Alternative definition. An order-M spline is a function which can be represented by the basis functions (K = #knots)
$$h_j(X) = X^{j-1}, \quad j = 1, \dots, M$$
$$h_{M+l}(X) = (X - \xi_l)_+^{M-1}, \quad l = 1, \dots, K$$
Theorem. The definitions above are equivalent.
Terminology. An order-4 spline is called a cubic spline
[Figure 5.2, lower right panel] (look at the basis and compare #free parameters)
Note. Cubic splines: the knot discontinuity is not visible
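The truncated power basis above translates directly into a design matrix. The following is a sketch under assumptions: the function name `truncated_power_basis`, the knot positions and the simulated data are invented for the example, and the construction is only meant for M >= 2.

```python
import numpy as np

def truncated_power_basis(x, knots, M=4):
    """Design matrix with columns h_1..h_M and h_{M+1}..h_{M+K} (assumes M >= 2)."""
    x = np.asarray(x, dtype=float)
    powers = [x ** (j - 1) for j in range(1, M + 1)]                 # h_j = X^(j-1)
    truncated = [np.maximum(x - k, 0.0) ** (M - 1) for k in knots]   # (X - xi_l)_+^(M-1)
    return np.column_stack(powers + truncated)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
knots = [0.25, 0.5, 0.75]                    # hypothetical knot positions
H = truncated_power_basis(x, knots, M=4)     # cubic spline: M + K = 7 columns

# Regression spline: least squares fit of noisy data in this basis
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)
coef, *_ = np.linalg.lstsq(H, y, rcond=None)
print(H.shape, np.std(y - H @ coef))
```

This basis is easy to write down but can be poorly conditioned for many knots, so it is used here only for illustration.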
Variance of spline estimators – boundary effects
[Figure 5.3]
Natural cubic spline
Def. A cubic spline f is called a natural cubic spline if its 2nd and 3rd derivatives are zero at a and b
Note. This implies that f is linear on the extreme intervals
Basis functions of natural cubic splines
$$N_1(X) = 1, \quad N_2(X) = X, \quad N_{k+2}(X) = d_k(X) - d_{K-1}(X), \quad k = 1, \dots, K-2$$
where
$$d_k(X) = \frac{(X - \xi_k)_+^3 - (X - \xi_K)_+^3}{\xi_K - \xi_k}$$
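A short sketch of this basis, with a numerical check that every basis function is linear beyond the last knot. The function name `natural_cubic_basis` and the knot positions are assumptions for the example.

```python
import numpy as np

def natural_cubic_basis(x, knots):
    """Columns N_1..N_K of the natural cubic spline basis for K sorted knots."""
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    K = len(knots)

    def d(k):  # d_k(X), with k a 0-based knot index
        num = (np.maximum(x - knots[k], 0.0) ** 3
               - np.maximum(x - knots[K - 1], 0.0) ** 3)
        return num / (knots[K - 1] - knots[k])

    cols = [np.ones_like(x), x]
    cols += [d(k) - d(K - 2) for k in range(K - 2)]   # N_{k+2} = d_k - d_{K-1}
    return np.column_stack(cols)

knots = [0.0, 1.0, 2.0, 3.0]            # hypothetical knots
x_out = np.linspace(3.5, 5.5, 9)        # equally spaced points right of the last knot
B = natural_cubic_basis(x_out, knots)

# Second differences vanish for functions that are linear on this range
print(np.allclose(np.diff(B, n=2, axis=0), 0.0, atol=1e-8))
```

K knots give K basis functions, compared with K + 4 for an unconstrained cubic spline.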
Fitting smooth functions to data
Minimize a penalized sum of squared residuals
$$\mathrm{RSS}(f, \lambda) = \sum_{i=1}^{N} (y_i - f(x_i))^2 + \lambda \int (f''(t))^2 \, dt$$
where λ is the smoothing parameter.
λ = 0: any function interpolating the data
λ = +∞: least squares line fit
Optimality of smoothing splines
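Theorem. The function f minimizing RSS(f, λ) for a given λ is a natural cubic spline with knots at all unique values of $x_i$ (NOTE: N knots!)
The optimal spline can be computed as follows.
$$f(x) = \sum_{j=1}^{N} N_j(x)\,\theta_j$$
$$\mathrm{RSS}(\theta, \lambda) = (\mathbf{y} - \mathbf{N}\theta)^T (\mathbf{y} - \mathbf{N}\theta) + \lambda\, \theta^T \boldsymbol{\Omega}_N \theta$$
$$\{\mathbf{N}\}_{ij} = N_j(x_i), \qquad \{\boldsymbol{\Omega}_N\}_{ij} = \int N_i''(t)\, N_j''(t)\, dt$$
$$\hat\theta = (\mathbf{N}^T \mathbf{N} + \lambda \boldsymbol{\Omega}_N)^{-1} \mathbf{N}^T \mathbf{y}$$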
A smoothing spline is a linear smoother
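The fitted function
$$\hat{\mathbf{f}} = \mathbf{N} (\mathbf{N}^T \mathbf{N} + \lambda \boldsymbol{\Omega}_N)^{-1} \mathbf{N}^T \mathbf{y} = \mathbf{S}_\lambda \mathbf{y}$$
is linear in the response values.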
Degrees of freedom of smoothing splines
The effective degrees of freedom are
$$\mathrm{df}_\lambda = \operatorname{trace}(\mathbf{S}_\lambda)$$
i.e., the sum of the diagonal elements of $\mathbf{S}_\lambda$.
Smoothing splines and eigenvectors
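It can be shown that
$$\mathbf{S}_\lambda = (\mathbf{I} + \lambda \mathbf{K})^{-1}$$
where K is the so-called penalty matrix
Furthermore, the eigen-decomposition is
$$\mathbf{S}_\lambda = \sum_{k=1}^{N} \rho_k(\lambda)\, \mathbf{u}_k \mathbf{u}_k^T, \qquad \rho_k(\lambda) = \frac{1}{1 + \lambda d_k}$$
Note: $d_k$ and $\mathbf{u}_k$ are eigenvalues and eigenvectors, respectively, of K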
Smoothing splines and shrinkage
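• The smoothing spline decomposes the vector y with respect to the basis of eigenvectors and shrinks the respective contributions:
$$\mathbf{S}_\lambda \mathbf{y} = \sum_{k=1}^{N} \mathbf{u}_k\, \rho_k(\lambda)\, \langle \mathbf{u}_k, \mathbf{y} \rangle$$
• Ordered by decreasing $\rho_k(\lambda)$, the eigenvectors increase in complexity. The higher the complexity, the more the contribution is shrunk.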
Smoothing splines and local curve fitting
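• The eigenvalues $\rho_k(\lambda) = 1/(1 + \lambda d_k)$ of the smoother matrix are decreasing functions of λ. The higher λ, the stronger the penalization.
$$\mathrm{df}_\lambda = \operatorname{trace}(\mathbf{S}_\lambda) = \sum_{k=1}^{N} \frac{1}{1 + \lambda d_k}$$
• The smoother matrix has a banded nature -> local fitting method
• [Figure 5.8]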
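The eigen-decomposition, the shrinkage formula and the trace identity can be verified numerically for any symmetric positive semi-definite K. The sketch below does not build the true spline penalty matrix; as a loudly labelled stand-in it uses a discrete second-difference penalty, which has the same two zero eigenvalues (constant and linear vectors are not penalized). The values of `n` and `lam` are arbitrary.

```python
import numpy as np

n = 30
D = np.diff(np.eye(n), n=2, axis=0)    # second-difference operator, shape (n-2, n)
K = D.T @ D                            # stand-in penalty matrix (assumption, not the spline K)
lam = 5.0

d, U = np.linalg.eigh(K)               # d_k in ascending order, columns of U are u_k
rho = 1.0 / (1.0 + lam * d)            # eigenvalues of S_lambda, in descending order
S = np.linalg.inv(np.eye(n) + lam * K)

rng = np.random.default_rng(0)
y = rng.normal(size=n)
shrunk = U @ (rho * (U.T @ y))         # sum_k u_k rho_k(lambda) <u_k, y>

print(np.allclose(S @ y, shrunk))      # shrinkage form agrees with S_lambda y
print(np.trace(S), rho.sum())          # two ways of computing df_lambda
```

The two eigenvectors with d_k = 0 are never shrunk, so the effective degrees of freedom cannot fall below 2.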
Fitting smoothing splines in practice (1)
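Reinsch form: $\mathbf{S}_\lambda = (\mathbf{I} + \lambda \mathbf{K})^{-1}$
Theorem. If f is a natural cubic spline with values $\mathbf{f}$ at the knots and second derivatives $\boldsymbol{\gamma}$ at the knots, then
$$\mathbf{Q}^T \mathbf{f} = \mathbf{R} \boldsymbol{\gamma}$$
where Q and R are band matrices, dependent on the knots ξ only.
Theorem. $\mathbf{K} = \mathbf{Q} \mathbf{R}^{-1} \mathbf{Q}^T$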
Fitting smoothing splines in practice (2)
Reinsch algorithm
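• Evaluate $\mathbf{Q}^T \mathbf{y}$
• Compute $\mathbf{R} + \lambda \mathbf{Q}^T \mathbf{Q}$ and find its Cholesky decomposition (in linear time!)
• Solve the matrix equation $(\mathbf{R} + \lambda \mathbf{Q}^T \mathbf{Q}) \boldsymbol{\gamma} = \mathbf{Q}^T \mathbf{y}$ for $\boldsymbol{\gamma}$ (in linear time!)
• Obtain $\mathbf{f} = \mathbf{y} - \lambda \mathbf{Q} \boldsymbol{\gamma}$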
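The steps of the algorithm can be sketched as follows. The slides do not give the entries of Q and R; the sketch assumes the standard band matrices built from the knot spacings h_i = x_{i+1} - x_i, and the function names are hypothetical. For readability it uses dense NumPy solvers, so it does not achieve the linear running time; that would need a banded Cholesky routine.

```python
import numpy as np

def band_matrices(x):
    """Q (n x (n-2)) and R ((n-2) x (n-2)) for sorted, distinct knots x (assumed form)."""
    n = len(x)
    h = np.diff(x)
    Q = np.zeros((n, n - 2))
    R = np.zeros((n - 2, n - 2))
    for j in range(n - 2):                     # column j belongs to interior knot j + 1
        Q[j, j] = 1.0 / h[j]
        Q[j + 1, j] = -1.0 / h[j] - 1.0 / h[j + 1]
        Q[j + 2, j] = 1.0 / h[j + 1]
        R[j, j] = (h[j] + h[j + 1]) / 3.0
        if j + 1 < n - 2:
            R[j, j + 1] = R[j + 1, j] = h[j + 1] / 6.0
    return Q, R

def reinsch_fit(x, y, lam):
    Q, R = band_matrices(x)
    rhs = Q.T @ y                              # step 1: evaluate Q^T y
    L = np.linalg.cholesky(R + lam * Q.T @ Q)  # step 2: Cholesky factor, A = L L^T
    gamma = np.linalg.solve(L.T, np.linalg.solve(L, rhs))  # step 3: solve for gamma
    return y - lam * Q @ gamma                 # step 4: fitted values at the knots

rng = np.random.default_rng(0)
x = np.cumsum(rng.uniform(0.5, 1.5, size=20))  # hypothetical sorted design points
y = np.sin(x / 3.0) + 0.1 * rng.normal(size=x.size)
f_hat = reinsch_fit(x, y, lam=2.0)

# Cross-check against the Reinsch form S_lambda = (I + lambda K)^{-1}, K = Q R^{-1} Q^T
Q, R = band_matrices(x)
K = Q @ np.linalg.solve(R, Q.T)
print(np.allclose(f_hat, np.linalg.solve(np.eye(len(x)) + 2.0 * K, y)))
```

Because each column of Q is a second divided difference, data lying exactly on a straight line give Q^T y = 0 and are returned unchanged.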
Automated selection of smoothing parameters (1)
What can be selected:
Regression splines
• Degree of spline
• Placement of knots
->MARS procedure
Smoothing spline
• Penalization parameter
Automated selection of smoothing parameters (2)
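Fixing the degrees of freedom
• If we fix $\mathrm{df}_\lambda$ then we can find λ by solving the following equation numerically:
$$\mathrm{df}_\lambda = \operatorname{trace}(\mathbf{S}_\lambda) = \sum_{k=1}^{N} \frac{1}{1 + \lambda d_k}$$
• One could try two different values of $\mathrm{df}_\lambda$ and choose one based on F-tests, residual plots etc.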
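Since the degrees of freedom decrease monotonically in λ, a simple bisection is enough to solve the equation. The sketch below is one possible approach, not necessarily the one used in the course; the eigenvalues `d`, the search bounds and the function names are assumptions.

```python
import numpy as np

def df_of_lambda(lam, d):
    """Effective degrees of freedom: sum_k 1 / (1 + lambda * d_k)."""
    return np.sum(1.0 / (1.0 + lam * d))

def lambda_for_df(target_df, d, lo=1e-12, hi=1e12, iters=200):
    """Bisection on the log scale; df_lambda decreases monotonically in lambda."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)                 # geometric midpoint
        if df_of_lambda(mid, d) > target_df:
            lo = mid                           # too flexible: penalize more
        else:
            hi = mid
    return np.sqrt(lo * hi)

# Hypothetical penalty eigenvalues: two zeros (unpenalized linear part) plus 28 positive ones
d = np.concatenate([[0.0, 0.0], np.logspace(-2, 4, 28)])
lam = lambda_for_df(5.0, d)
print(lam, df_of_lambda(lam, d))
```

The target must lie between the number of zero eigenvalues (here 2) and N (here 30), otherwise no λ satisfies the equation.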
Automated selection of smoothing parameters (3)
The bias-variance trade-off
[Figure 5.9]
EPE: integrated squared prediction error
CV: cross-validation
$$\mathrm{df}_\lambda = \operatorname{trace}(\mathbf{S}_\lambda) = \sum_{k=1}^{N} \frac{1}{1 + \lambda d_k}$$
Nonparametric logistic regression
Logistic regression model
$$\log \frac{\Pr(Y = 1 \mid X = x)}{\Pr(Y = 0 \mid X = x)} = f(x)$$
Note: X is one-dimensional
What is f?
• Linear -> ordinary logistic regression (Chapter 4)
• Sufficiently smooth -> nonparametric logistic regression (splines and others)
• Other choices are possible
Nonparametric logistic regression
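Problem formulation:
Maximize the penalized log-likelihood (equivalently, minimize its negative)
$$l_p(f, \lambda) = l_u(f) - \frac{1}{2} \lambda \int (f''(t))^2 \, dt$$
where $l_u$ is the unpenalized log-likelihood.
Good news: The solution is still a natural cubic spline.
Bad news: There is no analytic expression for that spline function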
Nonparametric logistic regression
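How to proceed?
Use Newton-Raphson to compute the spline numerically, i.e.
• Compute (analytically)
$$\frac{\partial l_p}{\partial \theta}, \qquad \frac{\partial^2 l_p}{\partial \theta\, \partial \theta^T}$$
1. Compute the Newton direction using the current value of the parameter and derivative information
2. Compute the new value of the parameter using the old value and the update formula
$$\theta^{new} = \theta^{old} - \left( \frac{\partial^2 l_p}{\partial \theta\, \partial \theta^T} \right)^{-1} \frac{\partial l_p}{\partial \theta}$$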
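The Newton-Raphson iteration can be sketched for a generic basis matrix N and penalty matrix Ω, assuming the penalized log-likelihood has the form l_u(θ) - (λ/2) θᵀΩθ with a binomial log-likelihood l_u. To keep the example short it uses a plain cubic polynomial basis on [-1, 1] rather than a natural spline basis, with the matching exact penalty matrix; the data, λ and the function name are hypothetical.

```python
import numpy as np

def penalized_logistic_newton(N, y, Omega, lam, n_iter=25):
    """Newton-Raphson for l_p(theta) = l_u(theta) - 0.5 * lam * theta' Omega theta."""
    theta = np.zeros(N.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(N @ theta)))              # fitted probabilities
        grad = N.T @ (y - p) - lam * Omega @ theta          # first derivative of l_p
        hess = -(N.T * (p * (1.0 - p))) @ N - lam * Omega   # second derivative of l_p
        theta = theta - np.linalg.solve(hess, grad)         # Newton update
    return theta

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=300)
prob = 1.0 / (1.0 + np.exp(-2.0 * np.sin(3.0 * x)))         # hypothetical true model
y = (rng.uniform(size=x.size) < prob).astype(float)

# Assumed basis 1, x, x^2, x^3; Omega_jk = integral over [-1, 1] of N_j''(t) N_k''(t) dt
N = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])
Omega = np.diag([0.0, 0.0, 8.0, 24.0])
theta_hat = penalized_logistic_newton(N, y, Omega, lam=1.0)

p_hat = 1.0 / (1.0 + np.exp(-(N @ theta_hat)))
print(np.linalg.norm(N.T @ (y - p_hat) - 1.0 * Omega @ theta_hat))  # gradient norm at the fit
```

The sketch has no step-size control; for badly separated data a damped update would be safer.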
Multidimensional splines
How to fit data smoothly in higher dimensions?
A) Use a basis of one-dimensional functions and produce a multidimensional basis by tensor product
$$g_{jk}(X) = h_{1j}(X_1)\, h_{2k}(X_2)$$
$$g(X) = \sum_{j} \sum_{k} \theta_{jk}\, g_{jk}(X)$$
Problem: Exponential growth of the basis with the dimension
[FIG. 6.10]
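A two-dimensional tensor product basis is simply the set of all pairwise products of the one-dimensional basis columns. In the sketch below the function name and the cubic polynomial bases are assumptions chosen for brevity; any one-dimensional basis matrices could be used.

```python
import numpy as np

def tensor_product_basis(H1, H2):
    """Columns are all products h_1j(X1) * h_2k(X2); column index is j * M2 + k."""
    n = H1.shape[0]
    return np.einsum("ij,ik->ijk", H1, H2).reshape(n, -1)

rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 2))

# Hypothetical one-dimensional bases: 1, x, x^2, x^3 in each coordinate
H1 = np.vander(X[:, 0], 4, increasing=True)
H2 = np.vander(X[:, 1], 4, increasing=True)
G = tensor_product_basis(H1, H2)
print(G.shape)

# Number of basis functions with M functions per coordinate in dim dimensions
M = 4
for dim in (1, 2, 3, 5, 10):
    print(dim, M ** dim)
```

The loop illustrates the exponential growth: the basis size is M raised to the number of dimensions.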
Multidimensional splines
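How to fit data smoothly in higher dimensions?
B) Formulate a new problem
$$\min_f \sum_{i} (y_i - f(x_i))^2 + \lambda J[f]$$
• The solution is a thin-plate spline
• Its behaviour for λ = 0 is similar to the one-dimensional case
• In two dimensions the solution is essentially a sum of radial basis functions:
$$f(x) = \beta_0 + \beta^T x + \sum_{j} \alpha_j\, \eta(\lVert x - x_j \rVert)$$
where η denotes a radial function.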
Wavelets
Introduction
• The idea: fit a bumpy function by removing noise
• Application areas: signal processing, compression
• How it works: the function is represented in a basis of bumpy functions, and the small coefficients are filtered out.
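The filtering idea can be sketched with the Haar wavelet, the simplest case. This is an illustration under assumptions: it uses hard thresholding of the detail coefficients, a signal length that is a power of two, and a made-up test signal and threshold; the function names are hypothetical.

```python
import numpy as np

def haar_forward(signal):
    """Orthonormal Haar transform; the signal length must be a power of two."""
    approx = np.asarray(signal, dtype=float)
    details = []
    while len(approx) > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))   # detail coefficients
        approx = (even + odd) / np.sqrt(2.0)          # coarser approximation
    return approx, details                            # details ordered fine -> coarse

def haar_inverse(approx, details):
    approx = np.asarray(approx, dtype=float)
    for det in reversed(details):                     # rebuild coarse -> fine
        out = np.empty(2 * len(approx))
        out[0::2] = (approx + det) / np.sqrt(2.0)
        out[1::2] = (approx - det) / np.sqrt(2.0)
        approx = out
    return approx

def haar_denoise(signal, threshold):
    """Set detail coefficients with absolute value <= threshold to zero."""
    approx, details = haar_forward(signal)
    kept = [np.where(np.abs(det) > threshold, det, 0.0) for det in details]
    return haar_inverse(approx, kept)

rng = np.random.default_rng(0)
clean = np.where(np.arange(256) < 128, 0.0, 1.0)      # hypothetical step signal
noisy = clean + 0.1 * rng.normal(size=clean.size)
denoised = haar_denoise(noisy, threshold=0.3)

print(np.mean((noisy - clean) ** 2), np.mean((denoised - clean) ** 2))
```

Because the transform is orthonormal, noise of standard deviation 0.1 stays at that scale in every coefficient, which motivates a threshold of a few noise standard deviations.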
Wavelets
Basis functions (Haar Wavelets, Symmlet-8 Wavelets)
[Figure 5.13]
Wavelets
Example
[Figure 5.14]