Smoothing

R.W. Oldford
Contents

1 Fitting locally
  1.1 Piecewise fitting
  1.2 Multidimensional splines
  1.3 Local neighbourhoods
  1.4 Multidimensional x

2 Linear smoothers
  2.1 Complementary viewpoints

3 More than one explanatory variate
  3.1 Multiple regression smooths
  3.2 Additive models
  3.3 Advantages of additive models
1 Fitting locally
Recall the Facebook data on like and Impressions, where we fit a cubic to capture the average of the y's as a function of x:
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + r.$$
Here y was log(like + 1) and x was log(Impressions).
[Figure: Facebook — log(like + 1) versus log(Impressions).]
There are a few things worth noting about this model:

• it is a polynomial
• it is defined for any x ∈ ℝ, that is, for any x within the range of the x_i's and for any x outside that range
• the corresponding generative model has
  $$Y = \mu(x) + R$$
  with E(R) = 0; that is, the mean or expected value of Y is a function of x
• that function is
  $$\mu(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3.$$
We might reasonably ask which, if any, of these implicit assumptions generalize. It is not clear, for example, whether a cubic is appropriate here. Perhaps a higher degree polynomial would be better? Or a lower degree? Or perhaps some non-polynomial curve might make more sense? We might want to be guided by whatever prior information we have about the functional form of the model. Barring the availability of such prior information, we might prefer a less prescriptive model for the functional form of µ(x).

In most cases, we would rather "let the data speak for themselves" in suggesting what shape the dependence of y on x might take. We would rather not have to specify that shape in advance.
A less prescriptive approach might be to take the generative model at face value, namely that we are modelling the mean of Y at any given point x. Having data values (x_1, y_1), ..., (x_N, y_N), it might make sense to simply estimate µ(x) by the arithmetic average of the y_i values for those points whose corresponding x_i is either equal to the x of interest or nearly so.

For example, a plot of log(Impressions) versus Post.Month might simply "connect the dots" of the monthly averages:

plot(fb$Post.Month, fb$x,
     main = "Facebook",
     xlab = "Post.Month",
     ylab = "log(Impressions)",
     pch = 19,
     col = adjustcolor("firebrick", 0.7))
[Figure: Facebook — log(Impressions) versus Post.Month, showing the LS line, the monthly Averages, and the monthly Medians.]
The averages might better reflect the month to month differences than does the least-squares fitted line, though the latter is simpler. The medians, on the other hand, produce a smoother curve that is not so influenced by outlying y values.

With the monthly posts, we have many observations available for each month, that is, many y values for every unique x. More generally, we might choose to have
$$\widehat{\mu}(x) = \frac{1}{k_x} \sum_{x_i \in Nbhd(x)} y_i$$
where Nbhd(x) denotes a neighbourhood of x and k_x is the number of points in that neighbourhood. It might, for example, be a fixed number of nearest neighbours or a distance based neighbourhood. Either way, µ(x) is being determined locally, in this case by the average of all of the y_i's in the local neighbourhood.

For example, consider the following artificially generated set of point pairs:
[Figure: Fake data — scatterplot of y versus x.]
1.1 Piecewise fitting
We could, for example, cut the range of the x values up into several fixed neighbourhoods and use the average y in each neighbourhood as µ̂(x) for any value of x in the neighbourhood. This might be accomplished as follows:

# varying numbers, constant width
breaks_v <- ...
local_v <- ...
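The assignments above were truncated in this copy. A minimal sketch of the idea, assuming the fake data vectors x and y and using cut() to form the neighbourhoods (the helper objects mu_v and mu_p here are illustrative, not necessarily the original names):

# Constant-width neighbourhoods: equal-length intervals over the range of x
breaks_v <- seq(min(x), max(x), length.out = 11)
local_v  <- cut(x, breaks = breaks_v, include.lowest = TRUE)
mu_v     <- tapply(y, local_v, mean)   # one average per neighbourhood

# Constant-proportion neighbourhoods: intervals holding about 10% of the x's each
breaks_p <- quantile(x, probs = seq(0, 1, by = 0.1))
local_p  <- cut(x, breaks = breaks_p, include.lowest = TRUE)
mu_p     <- tapply(y, local_p, mean)

Each element of mu_v (or mu_p) is then used as µ̂(x) for every x falling in the corresponding interval.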
[Figure: Constant width nbhd — piecewise averages over constant-width neighbourhoods.]
plot(x,y,col="grey80", pch=19, cex=0.5,main = "Constant
proportion nbhd")
plot_ave(local_p, nbhd_p, x, mu_p,col="red", lwd=5)
[Figure: Constant proportion nbhd — piecewise averages over neighbourhoods each containing about 10% of the points.]
# Note that the first neighbourhoods had constant widths,
# and hence varying numbers of points
local_v

##  [1] "[-0.306,-0.154]"   "(-0.154,-0.00201]" "(-0.00201,0.15]"
##  [4] "(0.15,0.302]"      "(0.302,0.454]"     "(0.454,0.606]"
##  [7] "(0.606,0.758]"     "(0.758,0.91]"      "(0.91,1.06]"
## [10] "(1.06,1.21]"

# The second neighbourhoods had approximately constant numbers
# of points, this being about 10% of the x values in each.
# Hence they had varying widths.
#
local_p

##  [1] "[-0.306,0.0714]" "(0.0714,0.19]"   "(0.19,0.291]"
##  [4] "(0.291,0.38]"    "(0.38,0.45]"     "(0.45,0.539]"
##  [7] "(0.539,0.636]"   "(0.636,0.773]"   "(0.773,0.869]"
## [10] "(0.869,1.21]"
And now both in the one plot.
[Figure: Both at once — the constant-width and constant-proportion piecewise averages overlaid.]
Each plot has a separate mean value covering each entire neighbourhood. Note however:

- they visibly have different neighbourhood sizes depending on the density of the points (in x)
- they are in some agreement, particularly about the coarse shape of µ(x)
- they disagree in places; some regions have quite different values
- they suggest that µ(x) is
  - locally flat (zero slope)
  - discontinuous

We might for example look at larger, or smaller, neighbourhoods. First, let's make the neighbourhoods larger.
[Figures: Constant width nbhd and Constant proportion nbhd — piecewise averages using larger neighbourhoods.]
And now smaller neighbourhoods:
[Figures: Constant width nbhd and Constant proportion nbhd — piecewise averages using smaller neighbourhoods.]
(Note that the intervals were not drawn here to cover the entire neighbourhood but only the points in each neighbourhood.)

Not surprisingly, the smaller neighbourhoods capture finer structure and the larger neighbourhoods coarser structure. But we still have these "flats" everywhere. This, together with the obvious discontinuities of µ(x), also suggests something about the derivative µ′(x), namely that µ′(x) = 0 almost everywhere except at the discontinuities, where it is not defined (or essentially infinite).

We could replace the flats by local lines. This allows µ(x) to change somewhat more smoothly and lets the derivative take non-zero values within the neighbourhoods. We need only change the way we calculate the estimates of µ(x).

## Now we need to adapt the local average function
#  to fit local lines.
#
get_ave <- ...
mu[nbhd_i] <- ...
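The adapted function was truncated in this copy. A minimal sketch of the idea, assuming the neighbourhood factor local_p from before (the function and variable names here are illustrative, not the original ones):

# For each neighbourhood, fit a straight line by least-squares
# and use its fitted values as mu-hat at the x's in that neighbourhood.
get_line_fits <- function(x, y, nbhd) {
  mu <- numeric(length(x))
  for (lev in levels(nbhd)) {
    nbhd_i <- which(nbhd == lev)        # points in this neighbourhood
    fit <- lm(y[nbhd_i] ~ x[nbhd_i])    # local least-squares line
    mu[nbhd_i] <- fitted(fit)
  }
  mu
}
mu_line <- get_line_fits(x, y, local_p)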
[Figure: Constant proportion nbhd — local lines fitted within each neighbourhood.]
Clearly, this follows the data more closely and more smoothly in each neighbourhood. Unfortunately, at the neighbourhood boundaries discontinuities can occur. How do we rid ourselves of the discontinuities?
1.1.1 Splines
First, we note that within each neighbourhood we fit a linear model, say a polynomial. For example, in the jth neighbourhood a degree p polynomial would look like
$$\mu^{(j)}(x) = \beta_0^{(j)} + \beta_1^{(j)} x + \cdots + \beta_p^{(j)} x^p \quad \forall\, x \in Nbhd_j.$$
Since no x is in more than one neighbourhood, we could fit all parameters at once in a single model involving all of the data. If there were J neighbourhoods, this means a linear model having a total of J(p + 1) parameters. The entire model could be written as
$$\mu(x) = \sum_{j=1}^{J} I_{Nbhd_j}(x)\, \mu^{(j)}(x)$$
where I_A(x) is the indicator function evaluating to 1 whenever x ∈ A and to zero otherwise.
To rid ourselves of discontinuities, we need only restrict the parameters so that the discontinuities disappear. For example, suppose the discontinuities occur at x = k_j for j = 1, ..., K with K = J − 1. (The boundary points k_j are sometimes called knots.) Then we would restrict the parameters by forcing the curves to meet at every boundary. That is
$$\beta_0^{(j)} + \beta_1^{(j)} k_j + \cdots + \beta_p^{(j)} k_j^p = \beta_0^{(j+1)} + \beta_1^{(j+1)} k_j + \cdots + \beta_p^{(j+1)} k_j^p$$
for j = 1, ..., K. This introduces K = J − 1 restrictions, so that there are really only J(p + 1) − (J − 1) = pJ + 1 parameters.

The curve µ(x) would be continuous but could now have "kinks" at the joins. We could make these go away by forcing the first derivatives to match at the boundaries as well. This would introduce another set of K = J − 1 restrictions, giving only pJ + 1 − (J − 1) = (p − 1)J + 2 free parameters. We could then match the second derivative, giving a still smoother function that is constrained by a further K restrictions. If we match the function and also match d derivatives, the total number of parameters which remain is
$$J(p + 1) - (d + 1)K = pJ + J - (d + 1)J + (d + 1) = (p - d)J + d + 1 = (p - d)K + p + 1.$$
If we choose d = p − 1 this becomes simply K + p + 1. A piecewise polynomial of degree p that matches on the function and on p − 1 of its derivatives is called a p-degree spline or a pth order spline. (The name comes from the flexible tool called the spline that was used in drafting to draw smooth curves by hand; see https://en.wikipedia.org/wiki/Flat_spline.) A popular choice is the cubic spline, having p = 3. These provide enough flexibility for most purposes and, at least according to statistical folklore, are smooth enough that the human visual system cannot detect the locations of the knots!

The linear model that would result, say for a cubic spline having p = 3, would be written as a linear combination like
$$\mu(x) = \beta_0 + \beta_1 b_1(x) + \cdots + \beta_{K+3} b_{K+3}(x)$$
for some particular choices of b_j(x). These functions can be thought of as a set of basis functions in the same way that the N-dimensional vectors
$$b_j = (b_j(x_1), \ldots, b_j(x_N))^T$$
that result from evaluating the functions at the points x_1, ..., x_N form a set of basis vectors (together with the 1 vector multiplying β_0) for the vector space spanned by the column vectors of an N × (K + 4) regression matrix X = [1, b_1, ..., b_{K+3}].

As a basis, the set of functions generates a space (having K + 4 dimensions) and, conversely, for that space there are any number of bases which would generate it. For example, let
$$h(x, k_j) = (x - k_j)_+^3 = \begin{cases} (x - k_j)^3 & \text{when } x > k_j \\ 0 & \text{otherwise} \end{cases}$$
where k_j is the location of the jth knot. Using these functions, it can be shown that the cubic spline can be written as
$$\mu(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 h(x, k_1) + \cdots + \beta_{K+3} h(x, k_K).$$
In this way, our basis has been formed by starting with the basis functions for a simple cubic and then adding a truncated power basis function for each knot. This holds more generally for a p-degree spline, where we can write
$$\mu(x) = \beta_0 + \beta_1 x + \cdots + \beta_p x^p + \beta_{p+1} h_1(x, k_1) + \cdots + \beta_{K+p} h_K(x, k_K)$$
in terms of the basis functions being those of a p-degree polynomial plus the truncated power basis functions
$$h_j(x, k_j) = (x - k_j)_+^p \quad \text{for } j = 1, \ldots, K.$$
Note again that this choice is just one possible choice of basis functions. This particular choice is fairly easy to understand conceptually, but it is unfortunately a poor choice for computation. The problem is that powers of large numbers can lead to numerical instability and round-off error. Instead, an equivalent set of basis functions which is better computationally is the so-called B-spline basis, which also allows for efficient computation even when K, the number of knots, is large.

The p-degree splines with fixed knots are a very flexible set of functions for µ(x). Because we are free to choose the knots, we can choose how many there are and place them where we please. More knots means more flexibility, and so we could even choose to add more in x regions where we think that µ(x) varies more. These fixed-knot splines are also sometimes called regression splines and, as in any polynomial regression problem where we might add more terms to better fit the data, with regression splines we could add more knots for the same reason and in a more targeted fashion.

This can be used for any x; all we need are the coefficient estimates and the knot values to make a prediction at any x.
1.1.1.1 Fitting splines in R

There is a splines package in R that contains the function bs() which will calculate an X matrix (excluding the intercept term unless it is specifically requested) corresponding to a B-spline basis for a p-degree spline at fixed knots. Unless the value of the argument degree is provided, the default spline will be cubic. For example:

library(splines)
p <- ...
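The rest of the chunk was truncated in this copy. A minimal sketch of constructing the basis matrix, assuming a cubic (p = 3) and interior knots at the deciles of x (the name knots_p echoes the object printed later in these notes):

library(splines)
p <- 3
# interior knots at the 10%, 20%, ..., 90% quantiles of x
knots_p <- quantile(x, probs = seq(0.1, 0.9, by = 0.1))
# B-spline basis matrix (no intercept column by default)
Xmat <- bs(x, degree = p, knots = knots_p)
dim(Xmat)   # N rows, K + p = 12 columns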
parOptions <- ...
[Figures: Basis vectors 5 to 8 of the B-spline basis, each plotted against x.]
[Figures: Basis vectors 9 to 12 of the B-spline basis, each plotted against x.]
par(parOptions)
Clearly, the basis functions are not polynomials. The estimated smooth µ̂(x) will be a linear combination of these functions. With this X matrix in hand, we can now fit the cubic spline to this data:

fit <- ...
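The fitting call itself was truncated; a minimal sketch, assuming the fake data x and y and the knots_p above (the formula mirrors the coefficient names printed below):

fit <- lm(y ~ bs(x, degree = p, knots = knots_p))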
[Figure: Cubic spline, knots at lines — the fitted cubic regression spline with vertical lines marking the knots.]
The linear combination of the basis functions (excluding an intercept) is given by the fitted coefficient estimates, namely:

coef(fit)

##                           (Intercept)  bs(x, degree = p, knots = knots_p)1
##                            -1.4686086                            0.6263294
##  bs(x, degree = p, knots = knots_p)2  bs(x, degree = p, knots = knots_p)3
##                             0.3132570                            0.4536932
##  bs(x, degree = p, knots = knots_p)4  bs(x, degree = p, knots = knots_p)5
##                             0.6135543                            0.5744901
##  bs(x, degree = p, knots = knots_p)6  bs(x, degree = p, knots = knots_p)7
##                             0.6375489                            2.6419658
##  bs(x, degree = p, knots = knots_p)8  bs(x, degree = p, knots = knots_p)9
##                             1.7455348                            2.0888355
## bs(x, degree = p, knots = knots_p)10 bs(x, degree = p, knots = knots_p)11
##                             2.3340356                            2.8483187
## bs(x, degree = p, knots = knots_p)12
##                             2.6743683
Note also that we need not have fit this via least-squares:
library(robust)
fit <- ...
fit1 <- ...
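The robust fitting calls were truncated; a plausible reconstruction, not the original code, using lmRob() from the robust package on the same B-spline basis:

library(robust)
# robust fit on the same spline basis used for the least-squares fit
fit1 <- lmRob(y ~ bs(x, degree = p, knots = knots_p))
coef(fit1)   # compare with coef(fit) above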
1.1.2 Natural splines
As we have seen in several examples already, polynomial functions can be very wild at the ends of the range of the x data and beyond. With p-degree splines, the polynomials fit at the outside edges of the range of the x data are likely to be even wilder, since they are based on many fewer points.

To help address this problem, the spline is sometimes constrained to be only a linear function beyond the largest and smallest knots. For a p-degree spline (with odd p), we define a natural spline with knots at k_1 < k_2 < ··· < k_K to be a function µ(x) such that

• µ(x) is a polynomial of degree p in each interior neighbourhood [k_1, k_2], ..., [k_{K−1}, k_K],
• µ(x) is a polynomial of degree (p − 1)/2 on (−∞, k_1] and on [k_K, ∞), and
• µ(x) is continuous and has continuous derivatives of order 1, ..., (p − 1) at its knots k_1, ..., k_K.
This forces the polynomials at either end to fluctuate less by severely reducing their degree. Perhaps the most common natural spline used is the natural cubic spline, which forces the fit at either end to be a straight line.

Recall that the model degrees of freedom associated with a p-degree spline was K + p + 1, where K was the number of knots. In the case of a natural p-degree spline the model degrees of freedom are
$$\underbrace{(p + 1)(K - 1)}_{\text{interior neighbourhoods}} \;+\; \underbrace{2\left(1 + \frac{p - 1}{2}\right)}_{\text{exterior neighbourhoods}} \;-\; \underbrace{Kp}_{\text{continuity constraints}} \;=\; K.$$
This means that the model degrees of freedom for natural splines depends only on the number of knots! Which is, you have to admit, pretty amazing.
1.1.2.1 Fitting natural splines

In the splines package a basis matrix is provided by the function ns(...), and only for natural cubic splines. Degree 3 polynomials are by far the most common and, as it turns out, are all we really need in most circumstances (see "smoothing splines" below).

In R, in addition to any interior knots provided by the user (or determined automatically from user supplied arguments), two additional boundary knots, k_0 and k_{K+1}, may be supplied (k_0 < k_i and k_{K+1} > k_K). These determine the points beyond which the lower degree polynomials are fit. By default, ns chooses the boundary knots at the minimum and maximum x values in the data.

Previously we fit a cubic spline to this data with knots at

knots_p

##        10%        20%        30%        40%        50%        60%
## 0.07138955 0.18966397 0.29106870 0.38030525 0.45015999 0.53857863
##        70%        80%        90%
## 0.63629916 0.77296504 0.86884995

and considered what the basis functions looked like. Analogously, we can plot the natural spline basis functions for this data as well.

Xmat.ns <- ...
ylim=blim,xlim = extendrange(x),xlab="x",
ylab="Basis",main=paste("ns basis vector", j),col="firebrick")
}
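The basis construction and most of the plotting loop were truncated above; a minimal sketch under the same interior knots (blim, a common vertical range for the panels, is an assumption):

library(splines)
Xmat.ns <- ns(x, knots = knots_p)        # natural cubic spline basis (K + 1 = 10 columns)
blim <- extendrange(Xmat.ns)             # common vertical range for all panels
xord <- order(x)
for (j in 1:ncol(Xmat.ns)) {
  plot(x[xord], Xmat.ns[xord, j], type = "l",
       ylim = blim, xlim = extendrange(x), xlab = "x",
       ylab = "Basis", main = paste("ns basis vector", j),
       col = "firebrick")
}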
[Figures: ns basis vectors 1 to 4, each plotted against x.]
[Figures: ns basis vectors 5 to 8, each plotted against x.]
par(parOptions)
[Figures: ns basis vectors 9 and 10, each plotted against x.]

We can also compare the two fits that would result for our fake data.

fit.bs <- ...
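The comparison code was truncated; a minimal sketch of fitting both models and predicting on a grid (the names fit.bs, fit.ns, ypred.bs and ypred.ns echo objects referred to below):

fit.bs <- lm(y ~ bs(x, degree = 3, knots = knots_p))
fit.ns <- lm(y ~ ns(x, knots = knots_p))
xnew <- seq(min(x), max(x), length.out = 200)
ypred.bs <- predict(fit.bs, newdata = data.frame(x = xnew))
ypred.ns <- predict(fit.ns, newdata = data.frame(x = xnew))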
[Figure: Comparing cubic spline, knots at lines — the bs (cubic spline) and ns (natural spline) fits overlaid, with the knots marked by vertical lines.]
The straight line fits at the ends of the natural cubic spline can be seen in the plot. Note the essential agreement between the two nearly everywhere else.

Now, fit.bs used fit.bs$rank = 13 degrees of freedom in building its fit, whereas the natural spline fit.ns used only fit.ns$rank = 11 degrees of freedom. Note that ns added two boundary knots to the 9 we provided it, hence the 11 degrees of freedom for this model. The natural spline has two fewer degrees of freedom. We might "spend" these two degrees of freedom on the placement of two more interior knots in the natural spline.

Suppose we use these extra model degrees of freedom by having two more interior knots, say between the 60% and 70% quantiles. This should allow us to fit the abrupt change in the middle a little better.

knots_p2 <- ...
ypred.ns2 <- ...
1.1.3 Smoothing splines

Suppose that µ(x) is at least twice differentiable. Then the first derivative µ′(x) measures the slope of the function at any point x. A function of any slope can be smooth. However, if that slope changes frequently and/or abruptly, then the function would not be smooth but rather rough.

The second derivative µ″(x) measures how quickly the slope changes at any given point x. If this is large (positive or negative) then the slope is changing quickly. Large values of (µ″(x))² indicate that there is an abrupt change in slope at the point x. One possible measure of roughness then might be
$$\int (\mu''(t))^2 \, dt$$
as the average of (µ″(x))² over all x. The smaller this average, the smoother is µ(x).

One way to proceed would be to find µ̂ that minimizes
$$RSS(\mu, \lambda) = \sum_{i=1}^{N} (y_i - \mu(x_i))^2 + \lambda \int (\mu''(t))^2 \, dt$$
with λ ≥ 0. This is a penalized residual sum of squares: the first term is the residual sum of squares, the second a penalty function which is larger the rougher is µ(x).

Alternatively, had we been using a Gaussian generative model for the residuals, then we might cast the penalty function in probabilistic terms. For example, we have
$$Y_i \sim N(\mu(x_i), \sigma^2)$$
as the generative model for the response Y conditional on the explanatory variate x_i. Seeing that x_i appears only through the function µ(x_i), this is in some sense the conditional distribution of Y given both x_i and µ(·). We might still want to condition on x_i but would like to constrain the functional form of µ(·) in some way. A useful fiction that helps us think about this might be to imagine that the µ(·) that has generated the observed responses was itself randomly drawn from some collection of possible functions µ(·). Because we think that we are more likely to have been served up a smooth µ(·) than a rough one, we might imagine that the probability of any particular function µ(·) is proportional to
$$e^{-\lambda \int (\mu''(t))^2 dt}.$$
Clearly functions with larger (average over the whole line) changes in the slope have lower probability; smoother µ(·) have higher probability. If this were indeed a probability, then we might apply Bayes's theorem and find the conditional distribution of µ(x_i) given Y_i = y_i and x_i. We could then choose µ̂(x) to maximize this probability. This turns out to be equivalent to minimizing the penalized residual sum of squares.

Given the Gaussian generative model, the penalized residual sum of squares may be thought of as a so-called "Bayesian" method in that it turns out to be the objective function which results from maximizing the posterior probability of µ(·) (in this language the marginal probability distribution of µ(·) is called the prior distribution because it is available prior to any data). Alternatively, the objective function is simply a penalized log-likelihood function whose penalty might have been constructed (or not) by imagining a prior distribution for µ(·).

We might also recognize the penalized residual sum of squares in this case as the objective function that would result from using λ as a Lagrange multiplier when minimizing the residual sum of squares subject to the constraint that ∫(µ″(t))² dt = 0. The constraint would enforce complete smoothness (i.e. a straight line).
line).Note, for example, that however one might think of the
penalized residual sum of squares, it is clear that thevalue taken
by λ (which is still our choice) determines the smoothness of our
estimated µ(x). If, for example,λ = ∞ then no change in the slope
is allowed and we have the smoothest possible function, a line.
Indeed, wewould have the least-squares fitted line had by
minimizing the first term alone. At the other extreme, if we
28
-
had λ = 0, then only the residual sum of squares would need to
be minimized. With no further restrictionson µ, we would have
µ̂(xi) = yi which amounts to connecting the dots in the scatterplot
from left to right(assuming no ties among the xi, otherwise we
average y at those points having the same value of x).It turns out
that for any fixed λ, the solution to this penalized residual sun
of squares problem, is a naturalcubic spline with knots at every
unique value of the xi! That is the solution requires µ(x) to be
ofthe form
$$\mu(x) = \sum_{j=1}^{N} N_j(x)\, \beta_j$$
where the N_j(x) are a set of N basis functions for this family of natural cubic splines.

Knowing that the solution must be a natural cubic spline of the above form, we can rewrite the penalized residual sum of squares as
$$RSS(\mu, \lambda) = (y - N\beta)^T (y - N\beta) + \lambda\, \beta^T \Omega_N\, \beta$$
where N = [N_{ij}] is an N × N matrix whose (i, j) element is N_{ij} = N_j(x_i), the jth natural cubic spline basis function evaluated at x_i, and Ω_N = [ω_{ij}] is an N × N matrix whose (i, j) element is
$$\omega_{ij} = \int N_i''(t)\, N_j''(t)\, dt.$$
The solution is now easily found to be
$$\widehat{\beta} = \left(N^T N + \lambda \Omega_N\right)^{-1} N^T y$$
and the fitted smoothing spline is
$$\widehat{\mu}(x) = \sum_{j=1}^{N} N_j(x)\, \widehat{\beta}_j.$$

Looking at this solution, it would appear to be overparameterized: there are as many β_j's as there are observations y_i. This solution, however, has been constrained, to a greater or lesser extent according to the value of λ. No longer are the number and location of knots chosen to make the function smoother or rougher (for the smoothing spline the knots must be at the unique x_i); instead we choose a value for the smoothing parameter λ. The larger is λ, the smoother is the resulting estimated function µ̂(x).
1.1.3.1 Choosing λ via degrees of freedom

But how should we choose λ? One way would be to somehow connect the value of λ to a measure of the complexity (or roughness) of the fitted model.

A traditional measure of the complexity of a linear model is the number of linear parameters in that model. This in turn was equivalent to the rank of the X matrix, or of the "hat-matrix"
$$H = X\left(X^T X\right)^{-1} X^T.$$
Recall that the role of the hat matrix is to determine the N-dimensional fitted mean vector µ̂ = Hy, which it does by orthogonally projecting the vector y onto the space spanned by the columns of X, colsp(X). Note that colsp(H) = colsp(X), that is, it is the same space. The difference is that the columns of X form a basis for that space (assuming that X has full column rank), whereas the columns of H are generators of the space, but there are too many of them to be a basis. There are N generators in H when we need only r = rank(X), which is typically smaller than N.

One way to get a set of r basis vectors from H is to find its eigen-decomposition. Suppose we do that and that the ordered eigenvalues are ρ_1 ≥ ρ_2 ≥ ··· ≥ ρ_N ≥ 0. Let the corresponding eigenvectors be u_1, ..., u_N. Then the decomposition gives
$$H = \sum_{i=1}^{N} \rho_i\, u_i u_i^T$$
and so
$$\widehat{\mu} = \sum_{i=1}^{N} \rho_i\, u_i u_i^T y = \sum_{i=1}^{N} u_i\, \rho_i\, \langle u_i, y \rangle$$
where < ·, · > indicates the inner product of its vector
arguments. An interpretation of the last piece isthat the vector y
is first decomposed with respect to the orthonormal basis {u1, . .
. , uN } for ℜN . Each ρimoderates the contribution of each piece
to the corresponding basis vector ui. Now, the nature or H is
that
ρi =
1 i = 1, . . . , r0 i = (r + 1), . . . , N.Being a projection
matrix (idempotent) the eigen-values ρi of H either select a basis
vector (ρi = 1) or donot (ρi = 0). Note that the dimension of this
space is the sum of the eigenvalues or the trace, tr(H), of H.When
fitting cubic splines, instead of X we had a basis matrix B, and
corresponding projection matrixHB = B
(BT B
)−1 BT . When using bs(...) or ns(...) in R we could specify the
degrees of freedom dfand have the function choose the appropriate
number (and location) of the knots with which to build B.The
question is whether we can parameterize a smoothing spline in a
similar way?A smoothing spline estimates the mean value at the
points x1, . . . , xN by
µ̂ = (µ̂(x1), µ̂(x2), . . . µ̂(xN ))T
= Nβ̂
= N(NT N + λΩN
)−1 NT y= Sλy , say.
So Sλ acts here like the hat-matrix HB for a cubic spline.
Unfortunately, it is not idempotent and hencenot a projection
matrix.We can however see what it’s made of. Note that N is N × N
and of full rank N .
$$S_\lambda = N\left(N^T N + \lambda \Omega_N\right)^{-1} N^T = \left(N^{-T}\left(N^T N + \lambda \Omega_N\right) N^{-1}\right)^{-1} = \left(I_N + \lambda N^{-T} \Omega_N N^{-1}\right)^{-1} = (I_N + \lambda K)^{-1}$$
where K = N^{-T} Ω_N N^{-1}. Note that K does not involve the smoothing parameter λ. Note also that the objective function being minimized can now be rewritten as
$$RSS(\mu, \lambda) = (y - \mu)^T (y - \mu) + \lambda\, \mu^T K \mu.$$
Consequently, K is sometimes called the penalty matrix.

If K = VDV^T is the eigen decomposition of the (real symmetric) matrix K, with D = diag(d_1, ..., d_N) and d_1 ≥ ··· ≥ d_N ≥ 0, then it would appear that the components of µ in the directions of eigenvectors of K are penalized more when they correspond to large eigenvalues d_i than when they correspond to small eigenvalues d_i.

Note that both any constant function µ̂_1(x) = a and any straight line function µ̂_2(x) = a + bx are in the space spanned by the natural spline basis functions N_i(x). Hence there is one non-zero linear combination of these basis functions that leads to the constant and another that leads to a straight line. This implies that the two smallest eigenvalues of K are d_{N−1} = d_N = 0.

This can be made more apparent by examining the solution vector µ̂ = S_λ y. First, note that
$$S_\lambda = V\,(I_N + \lambda D)^{-1}\, V^T$$
is the eigen decomposition of S_λ, having eigenvalues
$$\rho_i(\lambda) = \frac{1}{1 + \lambda\, d_{N-i+1}}$$
for i = 1, ..., N. Large values of d_i will produce small values of ρ_i(λ). Similarly, large values of the smoothing parameter λ will produce small values of ρ_i(λ). A closer look at µ̂ reveals
$$\widehat{\mu} = \sum_{i=1}^{N} \rho_i(\lambda)\, v_i v_i^T y = \sum_{i=1}^{N} v_i\, \rho_i(\lambda)\, \langle v_i, y \rangle$$
where the ith eigenvector v_i here is of S_λ and corresponds to its ith largest eigenvalue ρ_i(λ). This is the reverse order of the eigenvectors of K, in that the largest ρ_i(λ) corresponds to the smallest d_i.

This is a lot like the relationship we saw for p-order splines. The difference here is that the eigenvalues ρ_i(λ) are not just zero or one. The two largest are 1 (corresponding to d_{N−1} = d_N = 0) and the rest are less than one. With H_B of the ordinary pth order splines, the eigenvalues of H_B selected the components of y in the directions of the eigenvectors corresponding to the largest eigenvalues (namely 1) and dropped the components in the directions corresponding to the smallest eigenvalues (namely 0). That is the nature of a projection operator, and so this kind of spline is sometimes called a projection smoother.

In contrast, the effect of S_λ on y is to shrink the components of y in the directions of its eigenvectors. It shrinks more in those directions with small eigenvalues ρ_i(λ) and less in directions with large ones. Unlike H_B, S_λ × S_λ ≠ S_λ, and hence S_λ is not a projection matrix. Instead, S_λ × S_λ ⪯ S_λ: the product has smaller eigenvalues than the original. Because of this shrinkage, the smoothing spline is sometimes called a shrinking smoother.

Analogous to the model degrees of freedom of a projection smoother being the trace of its projection matrix, tr(H_B), we will take the effective degrees of freedom to be the trace of the smoother matrix, tr(S_λ). Both are the sums of their respective eigenvalues. For the smoothing spline, then, the effective degrees of freedom can be expressed as
$$df_\lambda = tr(S_\lambda) = \sum_{i=1}^{N} \frac{1}{1 + \lambda d_i}.$$
Which in turn means that rather than specify the smoothing parameter λ, we could specify the effective degrees of freedom df_λ and solve the above equation for λ.

Now as λ → 0, df_λ → N and S_λ → I_N. The result is a perfect (and likely very non-smooth) fit. Similarly, as λ → ∞, df_λ → 2 (corresponding to d_{N−1} = d_N = 0) and S_λ → H, the hat matrix for a straight line regression of y on x. For λ values in between these two extremes, as λ increases we have greater shrinkage of the eigenvalues ρ_i(λ) and a lower value of the effective degrees of freedom df_λ. Again, the shrinkage is greatest in the direction of the eigenvectors v_i corresponding to the smallest eigenvalues of S_λ.

To get some sense of what basis functions these correspond to, we could get the eigen decomposition of S_λ for our fake data example. In R we will first fit the smoothing spline on the data set using df_λ = 11 to try to match our earlier natural spline fit. This is accomplished using the function smooth.spline(...) as follows.

df <- ...
ypred.sm <- ...
S[,i] <- ...
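The chunk above was truncated in this copy. A minimal sketch of the fit and of building the smoother matrix S_λ column by column by smoothing unit vectors (a standard device for linear smoothers; the names df, ypred.sm, S and eigS echo objects referred to below):

df <- 11
sm <- smooth.spline(x, y, df = df)
ypred.sm <- predict(sm, x = seq(min(x), max(x), length.out = 200))

# Build the smoother matrix: column i is the smooth of the ith unit vector
N <- length(x)
S <- matrix(0, N, N)
for (i in 1:N) {
  e_i <- numeric(N); e_i[i] <- 1
  S[, i] <- predict(smooth.spline(x, e_i, df = df), x = x)$y
}
# symmetrize to guard against small numerical asymmetry before the eigen decomposition
eigS <- eigen((S + t(S)) / 2)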
We see that the ρ_i(λ) values drop off very quickly, indicating that the components of y in the directions of the smallest eigenvalues are shrunk a great deal (effectively made zero). Note also that the eigenvalues do seem to sum to our intended effective degrees of freedom df_λ, or sum(eigS$values) = 11.0018443.

So, the question is: what do the eigenvectors corresponding to the largest eigenvalues look like? And how do they compare to those corresponding to the small eigenvalues? To that end, we plot them as curves evaluated at the x values in our data set. First, let's look at the curves for some small values of i.

plotEigenBases <- ...
[Figures: eigenvector functions for i = 1, 2, 3, 4 with ρ = 1, 1, 0.9986, 0.9913, each plotted against x.]
par(parOptions)
The next few are:

parOptions <- ...
[Figures: eigenvector functions for i = 5, 6, 7, 8 with ρ = 0.9747, 0.9373, 0.8749, 0.7851.]
par(parOptions)
And how about a few farther out?

parOptions <- ...
[Figures: eigenvector functions for i = 10, 20, 30, 40 with ρ = 0.5529, 0.0551, 0.0079, 0.002.]

par(parOptions)
parOptions <- ...
[Figures: eigenvector functions for i = 50, 60, 150, 200 with ρ = 6e−04, 2e−04, 0, 0.]
par(parOptions)
As these plots indicate, the eigenvectors contribute bumpier and bumpier basis vectors as the associated eigenvalues diminish. The smoother radically downweights these according to the corresponding eigenvalue ρ_i(λ). It does seem that the smaller the effective degrees of freedom, the smoother is the fitted function µ̂(x), just as is the case with the usual model degrees of freedom.

More formally, axioms for an effective dimension of a finite collection of vectors, say the N column vectors of a matrix M = [m_1, ..., m_N], have been proposed (Oldford 1985, https://www.researchgate.net/publication/275521797_NEW_GEOMETRIC_THEORY_FOR_THE_LINEAR_MODEL) which are satisfied by the function
$$d_\alpha\left(\{m_1, \ldots, m_N\}\right) = d_\alpha(M) = \sum_{i=1}^{N} \left(\frac{\gamma_i}{\gamma_{max}}\right)^\alpha$$
for at least α = 1, 2. Here γ_i is the ith largest singular value of M and γ_max = max_i γ_i. This "d_α effective dimension" also arises in a variety of estimation and diagnostic problems related to linear response models. For example, d_α(H) = p for all α ≠ 0.
The effective degrees of freedom are therefore also an effective dimension of the set of column (or row) vectors of S_λ when α = 1.
1.2 Multidimensional splines
The splines we have used above have been designed to fit a curve to a single explanatory variate x. But what if we have more than one explanatory variate? Suppose, for example, that we have only two explanatory variates, x and z say.

One way to proceed would be to use an additive model
$$\mu(x, z) = \beta_0 + f_1(x) + f_2(z)$$
and use a penalty function
$$\left(\int f_1''(t)^2\, dt\right) + \left(\int f_2''(t)^2\, dt\right).$$
The resulting minimization is obtained when each f_i(·) is itself a univariate spline. This result extends to any number of explanatory variates.

Another suggestion would be to build a set of basis functions from existing univariate ones. For example, if we have a set of m basis functions b_1(x), ..., b_m(x) for x and another set of n basis functions c_1(z), ..., c_n(z) for z, then we could introduce basis functions
$$g_{jk}(x, z) = b_j(x) \times c_k(z)$$
for j = 1, ..., m and k = 1, ..., n, a so-called tensor product basis. Then
$$\mu(x, z) = \beta_0 + \sum_{j=1}^{m} \sum_{k=1}^{n} \beta_{jk}\, g_{jk}(x, z)$$
would be fitted via least-squares as before (a small sketch of constructing such a basis appears at the end of this section).

Clearly both of these methods generalize to any number of dimensions, but the effective model degrees of freedom grows multiplicatively with the number of explanatory variates using the tensor product bases, and only linearly using additive bases.

One dimensional splines can also be generalized to higher dimensions via an appropriate penalty for high curvature in µ(x) with x ∈ ℝ^d. For example, when d = 2 we could choose our roughness penalty as
$$\int\!\!\int \left[\left(\frac{\partial^2 \mu(x, z)}{\partial x^2}\right)^2 + 2\left(\frac{\partial^2 \mu(x, z)}{\partial x \partial z}\right)^2 + \left(\frac{\partial^2 \mu(x, z)}{\partial z^2}\right)^2\right] dx\, dz.$$
Minimizing the residual sum of squares plus λ times this penalty function leads to a smooth two dimensional surface called a thin-plate spline.
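As mentioned above, a minimal sketch of building a tensor product basis in R from two univariate B-spline bases and fitting it by least-squares (x, z and y are assumed data vectors; the sizes m = 5 and n = 4 are illustrative):

library(splines)
Bx <- bs(x, df = 5)     # m = 5 basis functions for x
Cz <- bs(z, df = 4)     # n = 4 basis functions for z
# Tensor product: one column for every pair (j, k)
G <- matrix(NA, nrow = length(x), ncol = ncol(Bx) * ncol(Cz))
col <- 0
for (j in 1:ncol(Bx)) {
  for (k in 1:ncol(Cz)) {
    col <- col + 1
    G[, col] <- Bx[, j] * Cz[, k]
  }
}
fit.tp <- lm(y ~ G)     # least-squares fit on the tensor product basis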
1.3 Local neighbourhoods
Compared to fitting splines with prespecified knots, smoothing splines seem to fit the data more locally, in that they have knots at every unique point x_i in the data. Another approach to finding a flexible function to fit the data would be to focus on each point x and fit a mean function at that point based on its local neighbourhood.
1.3.1 K nearest neighbour fitting
The simplest way to proceed might again be to fit a local average at every value x = x_i in the data set. Values at other values of x might be found by simple linear interpolation between these fitted values (i.e. simply "connect the dots").

One way to define a local neighbourhood would be to find the k nearest neighbours of that x value. There exists a function called knn.reg in the R package FNN that will compute this average for every point x_i in the data.

require(FNN)
# Let's try a few values for k
#
knn.fit5 <- ...
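The calls were truncated in this copy; a minimal sketch, assuming the fake data x and y (the objects knn.fit5, knn.fit21, knn.fit51 and the ordering Xorder echo names used in the plotting code below):

require(FNN)
X <- matrix(x, ncol = 1)
knn.fit5  <- knn.reg(train = X, test = X, y = y, k = 5)
knn.fit21 <- knn.reg(train = X, test = X, y = y, k = 21)
knn.fit51 <- knn.reg(train = X, test = X, y = y, k = 51)
Xorder <- order(x)     # so the fits can be drawn as lines from left to right

plot(x, y, col = "grey80", pch = 19, cex = 0.5,
     main = "5 nearest neighbours")
lines(x[Xorder], knn.fit5$pred[Xorder],
      col = adjustcolor("firebrick", 0.75), lwd = 2)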
[Figure: 5 nearest neighbours — the k = 5 nearest-neighbour average fit.]
plot(x,y,col="grey80", pch=19, cex=0.5,main = "21 nearest
neighbours")
lines(x[Xorder],
knn.fit21$pred[Xorder],col=adjustcolor("firebrick", 0.75),lwd=2,
lty=2)
[Figure: 21 nearest neighbours — the k = 21 nearest-neighbour average fit.]
plot(x,y,col="grey80", pch=19, cex=0.5,main = "51 nearest
neighbours")
lines(x[Xorder],
knn.fit51$pred[Xorder],col=adjustcolor("firebrick", 0.75),lwd=2,
lty=5)
[Figure: 51 nearest neighbours — the k = 51 nearest-neighbour average fit.]
As was the case with piecewise neighbourhoods, the larger these local neighbourhoods are, the smoother the fitted function becomes. Just as we did for the case of piecewise defined neighbourhoods, we might also replace the averages with any fitted model based on the k nearest neighbours of any location x. To that end, here is a little function that will allow us to experiment a bit.

library(FNN)
library(robust)
# This function allows us to see how
# the local fits behave as we change
# various elements
#
KNNfit <- function(x, y, xloc, k,
                   method = c("lm", "hampel", "bisquare", "lms", "lts"),
                   newplot = TRUE,       # create a new plot or add to an existing one
                   showPoints = FALSE,   # highlight points used?
                   pcol = "red",         # highlight colour
                   showWeights = FALSE,  # the weights used to define the nbhd for the fit
                   wcol = "pink",        # weight colour
                   showLine = TRUE,      # show the fitted line at xloc
                   fullLine = TRUE,      # full or partial line
                   lcol = "steelblue",   # line colour
                   cex = 0.5,            # point size
                   pch = 19,             # point character
                   col = "grey80",       # point colour
                   lwd = 2,              # line width
                   ...                   # other plot parameters
                   ) {
  if (newplot) {
    plot(x = x, y = y, cex = cex, pch = pch, col = col, ...)
  }
  data <- ...
  # We can select these points via weights
  if (showWeights) {
    N <- ...
  }
  ...
}

KNNfit(x, y, xloc = xloc, k = k, method = "lm",
       fullLine = TRUE, showPoints = TRUE,
       main = paste("least-squares", "fit at x =", xloc, "with k =", k))

KNNfit(x, y, xloc = xloc, k = k, method = "lms",
       fullLine = TRUE, showPoints = TRUE,
       main = paste("LMS", "fit at x =", xloc, "with k =", k))
[Figures: least-squares fit at x = 0.25 with k = 5, and LMS fit at x = 0.25 with k = 5.]
par(parOptions)
Note that the fit might produce a good location, even though the slope of the line used to get that location is somewhat surprising. That is because it is responding to very local structure (it is based only on the red points in the plot).

So what happens if we increase the size of the neighbourhood? How about k = 30?

xloc <- ...
[Figures: least-squares fit at x = 0.25 with k = 30, and LTS fit at x = 0.25 with k = 30.]
par(parOptions)
Although the lines are still different, there is much better agreement in this case. When the neighbourhood is sized right, so that a few points do not influence the outcome, these fitting methods should be in near agreement (especially as regards the fitted line at the single location xloc). If it is too small, a few points could dominate a least-squares fit. If it is too large, the underlying function may have changed enough that a simple straight line model might be too simple.

We might compare the fits over a wide range of x locations. We'll do that now, but using only a short line segment to represent the fitted line at each location. The horizontal range of each line segment covers the neighbourhood on which the line was constructed. First for least-squares and k = 30:

k <- ...
[Figure: lm fit with k = 30 — short local least-squares line segments at many x locations.]
And now for lts and k = 30:
[Figure: lts fit with k = 30 — short local LTS line segments at many x locations.]
We can see that both sets of fits find the middle peak of the plot, but that the peak is higher for "lts" here because it ignores the rightmost points in that neighbourhood as outliers.
1.3.2 Local weighting
Another way to think about this local fitting is to imagine that all points in the data set are being used, except that those outside of the local neighbourhood have a zero weight (in the sense of weighted least squares). For example, our least-squares fit on only the k nearest neighbours looks like a least-squares fit on all of the data but with weights that are 1 for points in the neighbourhood and zero for points outside the neighbourhood. The two lines could be compared as follows:

## Our data here are simply x and y
#
ourdata <- ...
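The chunk was truncated here; a minimal sketch of what the plotting code below appears to rely on (ourdata and the overall least-squares estimates alpha_hat and beta_hat), assuming the fake data x and y:

ourdata <- data.frame(x = x, y = y)
fit.ls <- lm(y ~ x, data = ourdata)
alpha_hat <- coef(fit.ls)[1]
beta_hat  <- coef(fit.ls)[2]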
plot(x,y,col="grey80", pch=19, cex=0.5,main = "Locally weighted
least-squares")
abline(a=alpha_hat, b=beta_hat,col=adjustcolor("steelblue",
0.75),lty=2, lwd=3)
KNNfit(x, y, xloc=xloc, k=30, method="lm",fullLine=TRUE,
showPoints=FALSE,showWeights = TRUE,
newplot=FALSE,lcol=adjustcolor("steelblue", 0.75),lwd=3)
points(x,y, col="grey80",pch=19, cex=0.5)
legend("topleft",legend=c("Weight 1 for all data","Non-zero red
weights"),lty=c(2,1), lwd=3,col=c(adjustcolor("steelblue",
0.75),
adjustcolor("steelblue", 0.75)))
[Figure: Locally weighted least-squares — the overall least-squares line ("Weight 1 for all data") and the local fit using non-zero weights only in the neighbourhood ("Non-zero red weights").]
When we plot the weights like this, a few things become obvious. First, µ̂(x) is seen to be the value that minimizes a weighted sum of squares
$$\sum_{i=1}^{N} w(x, x_i)\, r_i^2 = \sum_{i=1}^{N} w(x, x_i)\,(y_i - \mu(x_i))^2$$
where
$$w(x, x_i) = \begin{cases} 1 & x_i \in Nbhd(x) \\ 0 & \text{otherwise.} \end{cases}$$
And second, the choice of weight function is a bit peculiar, namely the indicator function I_{Nbhd(x)}(x_i).

Note also that the neighbourhood of the indicator function is defined by the k nearest neighbours of x, which has some obvious consequences. For example, for some x the neighbourhood could be very unbalanced, inasmuch as most (even all) neighbours could be on one side of x and fewer (even none) could be on the other. If x happens to be located in a place where there are few data points, the neighbourhood could stretch some distance to find its k near neighbours. Clearly this would stabilize µ̂(x) there, but at the expense of relying on points possibly quite far away.

Another approach we might try would be to base our neighbourhood on the Euclidean distance from the location of interest, x. For example, we could choose our neighbourhood to be all points x_i within a distance h, say, of x, that is
$$Nbhd(x, h) = \{x_i : |x - x_i| \le h\}.$$
Just as with k nearest neighbours, we might then give a weight of one to all points x_i within this neighbourhood and a weight of zero to points outside the neighbourhood. One problem with this neighbourhood is that, depending on its size, it might include many points or few points. We'll return to this later.

Alternatively, we might not worry so much about the neighbourhood but rather simply choose the weights more judiciously. Since we are trying to fit locally, we could choose higher weights for the closer points and lower weights for those farther away. For example, we might consider weights that are proportional to a function, say K(t), having the following properties:
$$\int K(t)\, dt = 1, \qquad \int t\, K(t)\, dt = 0, \qquad \text{and} \qquad \int t^2\, K(t)\, dt < \infty.$$
The first two of these standardize K(t); the last makes sure that there is some spread in the weight along the real line but also that there is not too much weight in the extremes. The function K(t) maps K : ℝ → ℝ and is called a kernel function. (Aside: this is not to be confused with the "kernel" functions you may have seen in other courses, e.g. in reproducing kernel Hilbert space methods, which map K : ℝ × ℝ → ℝ.)

To help get some intuition on these, imagine that we also have K(t) ≥ 0 for all t. Then K(t) could be a density function (integrating to 1), with mean 0 and finite variance. Note that it need not be symmetric, though that is all that we will consider. A Gaussian kernel would be one such example, defined as
$$K(t) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{t^2}{2}\right).$$
Some other examples include:

1. An Epanechnikov kernel
$$K(t) = \begin{cases} \frac{3}{4}\left(1 - t^2\right) & \text{for } |t| < 1 \\ 0 & \text{otherwise.} \end{cases}$$

2. Tukey's tri-cube weight
$$K(t) = \begin{cases} \left(1 - |t|^3\right)^3 & \text{for } |t| \le 1 \\ 0 & \text{otherwise.} \end{cases}$$
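A minimal sketch of these three kernels in R, vectorized so they can be applied to a whole vector of t values (the function names are illustrative):

gauss_kernel   <- function(t) dnorm(t)                              # (1/sqrt(2*pi)) * exp(-t^2/2)
epan_kernel    <- function(t) ifelse(abs(t) < 1, 0.75 * (1 - t^2), 0)
tricube_kernel <- function(t) ifelse(abs(t) <= 1, (1 - abs(t)^3)^3, 0)

# e.g. unnormalized weights for points x relative to a location xloc and scale h:
# tricube_kernel((x - xloc) / h)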
Since we are interested in applying these functions locally, the kernel function K(·) is applied not to each x_i, but rather to the difference x_i − x. Similarly, a means of controlling how quickly the weights diminish for any kernel is to introduce a scale parameter, say h > 0. Our weight function would be calculated as
$$w(x, x_i) = \frac{K\left(\frac{x_i - x}{h}\right)}{\sum_{j=1}^{N} K\left(\frac{x_j - x}{h}\right)}$$

1.3.2.1 Illustration

To illustrate how this might work, we can construct a naive locally weighted sum of squares estimator as follows.

# Construct a weight function shaped like
# a Gaussian or Normal density.
#
GaussWeight <- function(x, xloc, h) {
  # Normal density
  dnorm(x, mean = xloc, sd = h)
}
Now we choose a point in the x range, say xloc = 0.5, and fit a straight line using all of the data but using the above weight function to determine the weight that will be given to each point in the estimation.

# location at which we are estimating
xloc <- ...

for (i in 1:length(wts)) {
  lines(x = rep(x[i], 2),
        y = c(ybottom, ybottom + wts[i] * yheight),
        col = "pink", lty = 1)
}
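The estimation code itself was truncated; a minimal sketch of the weighted least-squares fit using the Gaussian weights, where wts, ybottom and yheight are the quantities used by the drawing loop above and h is an assumed, purely illustrative bandwidth:

xloc <- 0.5
h <- 0.3                                    # illustrative bandwidth
wts <- GaussWeight(x, xloc = xloc, h = h)
fit.w <- lm(y ~ x, weights = wts)            # weighted least-squares line
alpha_hat_w <- coef(fit.w)[1]
beta_hat_w  <- coef(fit.w)[2]
# quantities used to draw the weights along the bottom of the plot
ybottom <- min(y)
yheight <- diff(range(y)) / (4 * max(wts))   # tallest weight bar is 1/4 of the plot height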
[Figure: Gaussian weights — the weighted least-squares line, with the Gaussian weights drawn along the bottom of the plot.]
This doesn't look that different from the least-squares line. For the original least-squares fit we had α̂ = −1.3343939 and β̂ = 2.4593346. With our Gaussian weights, the weighted least-squares estimates were α̂ = −1.4725965 and β̂ = 2.7143856, which is different, but not that different. This is to be expected when you look at the weights for each point shown across the bottom of the plot. These are not that different from one another, and no point has a very small weight.
1.3.3 Scaling also determines locality
To be more responsive to the local structure, we need only consider changing the size of the scale parameter h that determines the standard deviation used in the Gaussian weight function. To see the effect of changing this, and to illustrate a few other points, we'll first wrap all of the above local fitting and drawing in a single demo function:

# This demo function allows us to see how
# the local fits behave as we change various elements
#
demoLoWeSS <- ...

  points(xloc, pred, pch = 19, col = pcol)
  if (showWeights) {
    weights <- ...
  }
  ...
[Figures: demoLoWeSS fits with span = 0.95, 0.82, 0.68, and 0.55.]
[Figures: demoLoWeSS fits with span = 0.41, 0.28, 0.14, and 0.01.]
par(parOptions)
1.3.4 Putting the pieces together
We can now easily imagine fitting lines locally at a whole series of x values.

##
demoPieces <- ...
xrange <- ...
[Figures: local line segments fitted at a series of x locations, shown in successive panels.]
par(parOptions)
# Here's the plot again
plot(x, y, col = "grey80", pch = 19, cex = 0.5,
     main = "Local linear smooth: span=0.8")
#
lines(dotLocs, pch = 19, col = "steelblue", type = "b", lwd = 2)
[Figure: Local linear smooth: span=0.8 — the local fits connected into a single smooth curve.]
## Decrease the span
parOptions <- ...
[Figures: local line segments fitted at a series of x locations with a smaller span, shown in successive panels.]
par(parOptions)
[Figures: two further panels of local line fits with the smaller span.]
# Connect the dots
plot(x, y, col = "grey80", pch = 19, cex = 0.5,
     main = "Local linear smooth: span=0.4")
#
lines(dotLocs, pch = 19, col = "steelblue", type = "b", lwd = 2)
[Figure: Local linear smooth: span=0.4.]
#
parOptions <- ...
[Figures: local line segments fitted at a series of x locations with span = 0.1, shown in many successive panels.]
par(parOptions)
# Connect the dots
plot(x, y, col = "grey80", pch = 19, cex = 0.5,
     main = "Local linear smooth: span=0.1")
#
lines(dotLocs, pch = 19, col = "steelblue", type = "b", lwd = 2)
[Figure: Local linear smooth: span=0.1.]
##
## And finally, let's look at the small bandwidth
#  with many more points.
#  Get the locations BUT do not plot lines yet.
dotLocs <- ...
[Figure: 100 locations: span = 0.1 — the local fits evaluated at 100 equally spaced locations.]
So the thing to do is make a function that just does this.

1.3.4.1 Putting it all together

Here is one such function that will produce a smooth curve by minimizing the Locally Weighted Sum of Squares, or LoWeSS.

## Given a weight function, the corresponding
#  weighted least squares estimate at any point(s) x
#  is easily constructed.
#
## It requires:
#  x, y     - the data
#  xloc     - x locations at which the estimate is to be calculated
#  span     - a bandwidth
#  weightFn - a weighting function, default will be GaussWeight
#  nlocs    - number of equi-spaced locations at which the estimate
#             will be calculated if xloc=NULL, ignored otherwise.
#
LoWeSS <- ...
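The body of the function was truncated in this copy; a naive minimal sketch consistent with the documented arguments (this is an illustration, not the original implementation):

LoWeSS <- function(x, y, xloc = NULL, span = 0.3,
                   weightFn = GaussWeight, nlocs = 100) {
  if (is.null(xloc)) {
    xloc <- seq(min(x), max(x), length.out = nlocs)
  }
  # interpret the span as a fraction of the range of x
  h <- span * diff(range(x))
  ypred <- numeric(length(xloc))
  for (i in seq_along(xloc)) {
    wts <- weightFn(x, xloc = xloc[i], h = h)       # local weights
    fit <- lm(y ~ x, weights = wts)                 # weighted least-squares line
    ypred[i] <- predict(fit, newdata = data.frame(x = xloc[i]))
  }
  list(x = xloc, y = ypred)
}

# e.g. smooth <- LoWeSS(x, y, span = 0.3); lines(smooth, col = "firebrick", lwd = 2)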
[Figure: Our data — scatterplot of the fake data.]
Now ourdata was actually generated as
$$Y_i = \mu(x_i) + R_i \quad \text{with } R_i \sim N(0, 0.2),$$
so we are in the completely artificial position of being able to compare the fitted smooth with the true µ(x_i) that was used to produce these data. Here's how the data were constructed.

# The fake data
#
# Get some x's
#
set.seed(12032345)
N <- ...
breaks <- ...
[Figure: Our LoWeSS smoother — the fitted LoWeSS curve overlaid on the data.]
What we have constructed here is a naive locally weighted sum of squares estimator. Clearly, there are some difficulties that require attention, such as:

• choosing the bandwidth. What value? Also, ours only looked at x locations; another approach might choose a proportion of the nearest x values.
• what weight function? Perhaps a more robust choice that actually gives zero weight to points that are far away.
• what about the ends? It seems that you can only estimate using data from one side. Should that affect the choice of bandwidth there? Or the weight function?
1.3.5 Using both knn and local weights
The naive locally weighted sum of squares above did not define neighbourhoods, but rather used the scale parameter h to determine how the weights would diminish. Again this means that where the data are densest in x, more points will appear in the estimation than where they are sparser.

So we now return to asserting that there will be the same specified number of points, k, in a neighbourhood. Only points in that neighbourhood may have non-zero weights: all points outside of the neighbourhood will have zero weight. We will also use some kernel function to downweight points within the neighbourhood. The kernel will again be evaluated at (x_i − x)/h, but now we will choose h to be a function of the maximum distance |x_i − x| over all points in the neighbourhood.
1.3.5.1 loess: locally weighted sums of squares

In R there is a function called loess that fits a LOcally wEighted Sum of Squares estimate that pays a little more attention to some of these problems. For example,

fit <- ...
fit <- ...
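The calls were truncated in this copy; a minimal sketch of fitting and drawing a loess smooth on the fake data (the span value is illustrative):

fit <- loess(y ~ x, span = 0.75, degree = 2, family = "gaussian")
xnew <- seq(min(x), max(x), length.out = 200)
plot(x, y, col = "grey80", pch = 19, cex = 0.5, main = "loess fit")
lines(xnew, predict(fit, newdata = data.frame(x = xnew)),
      col = adjustcolor("firebrick", 0.75), lwd = 2)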
Given the parameter a < 1, the default weighting is given
Tukey’s tricube weight
K(t) ={
(1 − |t|3)3 for |t| ≤ 10 elsewhere
For the ith point in the neighbourhood, we take
ti =|xi − x|
maxj∈Nbhd(x) |xj − x|.
If a > 1, the maximum distance in the above denominator is
taken to be a1/p times the maximum distanceto x of all the xi s.
Note that here p is the number of explanatory variates in case
there is more than 1. (inthis case, |xi − x| is everywhere replaced
by the Euclidean distance ||xi − x||)Loess is not restricted to
fitting local lines. loess can fit any degree polynomial locally
(though typicallyonly degree 1 or 2 is used in practice). Its
default is 2.The fitting mechanism is given by the parameter family
and can be either gaussian (the default), whichwill use
least-squares to fit the local polynomial, or symmetric which will
begin with least squares and thenperform a few iterations of an
M-estimation using Tukey’s bisquare weight function.
1.4 Multidimensional x
As was suggested in the above discussion of loess, all of these local fitting methods are easily extended to cases with more than one explanatory variate, simply by using distances ||x_i − x|| in place of |x_i − x| wherever the latter appears. For example, the kernel weighting functions all become
$$K\left(\frac{\|x_i - x\|}{h}\right)$$
for some span parameter h.
2 Linear smoothers
A smoother is sometimes called a linear smoother if its vector of fitted values µ̂ at the x_i's can be written as
$$\widehat{\mu} = S\, y$$
for some N × N matrix S whose elements depend only on the values of x_1, ..., x_N. The smoothers are called linear because they are linear in y.

Nearly all of the smoothers discussed above are linear smoothers. The only exceptions are those where the fitting (locally or globally) has been done using weights that depend on the values of y. This would be the case, for example, if any of the iteratively reweighted M-estimates or the high breakdown estimates were used in the fitting: the resulting smooths would no longer be strictly linear in y. There are other examples as well, but we will not be considering them here.

We have really considered only three different classes of smoother. First are the splines based on a fixed set of K knots (or equivalently a specified number of effective degrees of freedom). Of these we considered two possibilities (p-degree splines and natural splines). For p-degree splines we had
$$\widehat{\mu} = S_b\, y = B\left(B^T B\right)^{-1} B^T y$$
and similarly for natural splines we had
$$\widehat{\mu} = S_n\, y = N\left(N^T N\right)^{-1} N^T y.$$
Both of these were called "projection smoothers" since they orthogonally project y onto the space spanned by the columns of B for p-degree splines, or onto the space spanned by the columns of N for natural splines.

The second class was that of the "smoothing splines". For these we had
$$\widehat{\mu} = S_\lambda\, y$$
and we showed that
$$S_\lambda = N\left(N^T N + \lambda \Omega_N\right)^{-1} N^T = (I_N + \lambda K)^{-1}.$$

The third class is the set of local polynomial regression smoothers. We include kernel smoothers in this class (where the local polynomial is simply a constant). In this case, at each point x we fit a polynomial of some degree (typically 0, 1, or 2) by weighted least-squares, with weights depending on the distance of the observed points x_i from the point of interest x. When we look at the fitted values at the observed x_i we have
$$\widehat{\mu}_i = \widehat{\mu}(x_i) = x_i^T \widehat{\beta}_i$$
where
$$\widehat{\beta}_i = \left(X^T W_i X\right)^{-1} X^T W_i\, y$$
is the weighted least-squares estimate of the coefficient vector with a diagonal matrix W_i of weights that are peculiar to each observation i. This means that each element µ̂_i of µ̂ can be written as
$$\widehat{\mu}_i = s_i^T\, y$$
where
$$s_i^T = x_i^T \left(X^T W_i X\right)^{-1} X^T W_i$$
is a 1 × N vector dependent only on the values of the x_i. This in turn means we can write any local polynomial regression smoother's estimate of the fitted values as
$$\widehat{\mu} = S\, y$$
for
$$S = \begin{pmatrix} s_1^T \\ s_2^T \\ \vdots \\ s_N^T \end{pmatrix}.$$
Note that this matrix, unlike the others, need not be symmetric.

All of these smoothers are therefore linear smoothers in that they share this common structure:
$$\widehat{\mu} = S\, y$$
for variously defined S, each being defined independently of the values of y.
2.1 Complementary viewpoints
Since all linear smoothers have the same form, we should be able to look at them in much the same way – no matter how they were motivated or derived.

For example, local regression smoothers like loess were built on kernel functions which gave higher weight to observations nearest the point x of interest. We might expect, therefore, that in computing the fit at any given x the coefficient multiplying y_i would be higher when its corresponding x_i is close to x than when it is farther away. To check this, we might have a look at the coefficients of each y_i as a function of x_i.

First we need to have the smoother matrix that corresponds to a loess fit. As we did with the smoothing splines, we could compute this for any particular set of x values as follows:
[Code: a helper function smootherMatrixLoess() was defined to compute the loess smoother matrix for the x values in ourdata; two rows of that matrix (chosen after set.seed(12231299)) were then plotted against x, with a vertical line at ourdata$x[row[i]] marking each focal point.]
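The helper itself is not reproduced here; a minimal sketch of one way such a function might be written (exploiting the linearity of the smoother by smoothing each unit vector in turn; the span and degree defaults are illustrative only) is:

smootherMatrixLoess <- function(x, span = 0.75, degree = 2) {
  # Column j of S is the vector of fitted values obtained when y is the
  # j-th unit vector, since a linear smoother satisfies muhat = S y.
  N <- length(x)
  S <- matrix(0, N, N)
  for (j in 1:N) {
    ej <- numeric(N)
    ej[j] <- 1
    fit <- loess(ej ~ x, span = span, degree = degree,
                 control = loess.control(surface = "direct"))
    S[, j] <- fitted(fit)
  }
  S
}

# e.g. S_l <- smootherMatrixLoess(ourdata$x)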
[Figure: “Two rows of smoother matrix”, the coefficient of each y_i plotted against x for two rows of the loess smoother matrix.]
As was the case with the local regression, the coefficient of y_i is higher the closer its x_i value is to the value of x.

Similarly, when we determined the various spline smoothers we considered them as linear combinations of various basis functions. In particular, we looked at some of the orthogonal basis functions that went into determining any smoother.

We can do the same for any linear smoother. While the spline-based smoother matrices S were symmetric, this need not be the case in general and it is not the case for the local regression estimators. So, rather than working with the eigen decomposition of the smoother matrix, we work with its singular value decomposition. That is, we decompose any smoother matrix S as

S = U D_ρ Vᵀ

for N × N matrices U = [U_1, . . . , U_N], V = [V_1, . . . , V_N], and D_ρ = diag(ρ_1, . . . , ρ_N) with ρ_1 ≥ ρ_2 ≥ · · · ≥ ρ_N ≥ 0 and

UᵀU = I_N = VᵀV.

The smooth can now be written as

µ̂ = U D_ρ Vᵀ y = ∑_{i=1}^{N} U_i ρ_i ⟨V_i , y⟩
which separates the smooth into the basis vectors U_i, the singular values ρ_i, and the components ⟨V_i, y⟩ of y along the direction vectors V_i.

If we consider the smoothing spline, as we did before, we can plot these various components as follows:

[Code: svd_s, the singular value decomposition of the smoothing spline's smoother matrix, was computed and its singular values, y components, and leading basis vectors were plotted.]
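A minimal sketch of that computation (assuming the smoothing spline's smoother matrix has already been stored as S_s and the data are in ourdata, as earlier in the notes) might look like:

svd_s <- svd(S_s)   # S_s: the smoothing spline smoother matrix (assumed available)

# the singular values rho_i
plot(svd_s$d, type = "b", pch = 19, cex = 0.5,
     col = adjustcolor("firebrick", 0.5),
     main = "singular values", xlab = "index", ylab = "rho")

# the components <V_i, y> of y along the right singular vectors
plot(t(svd_s$v) %*% ourdata$y, type = "b", pch = 19, cex = 0.5,
     col = adjustcolor("firebrick", 0.5),
     main = "y components", xlab = "index", ylab = "y-component")

# the first few basis vectors U_i plotted against x
parOptions <- par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
for (i in 1:4) {
  plot(ourdata$x, svd_s$u[, i], type = "l",
       main = paste("basis", i), xlab = "x", ylab = "u value")
}
par(parOptions)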
[Figures: the y components ⟨V_i, y⟩ plotted against index, and the first twelve orthogonal basis vectors (“basis 1” through “basis 12”) of the smoothing spline smoother matrix plotted against x.]
As can be seen, the singular values and the y components die off quickly. These are the multipliers of the basis functions. The orthogonal basis functions increase in complexity as i increases. These higher frequency basis functions are largely obliterated by the small singular values and y components.

Now, since loess is also a linear smoother, we can do the same for loess on this data, computing svd_l, the singular value decomposition of its smoother matrix, and plotting its singular values and y components:

plot(svd_l$d, type = "b", main = "singular values",
     col = adjustcolor("steelblue", 0.5), cex = 0.5, pch = 19,
     xlab = "index", ylab = "rho")
plot(t(svd_l$v) %*% ourdata$y,
     col = adjustcolor("steelblue", 0.5), cex = 0.5, pch = 19,
     type = "b", main = "y components", xlab = "index",
     ylab = "y-component")
[Figures: the singular values and y components for the loess smoother matrix plotted against index, followed by its first twelve orthogonal basis vectors (“basis 1” through “basis 12”) plotted against x.]
The local weighted least squares estimate shows much the same pattern as the smoothing spline. It too has a set of orthogonal basis functions for which the singular values and the y components die out quickly. Again, the orthogonal basis functions increase in complexity as i increases and the higher frequency basis functions are largely obliterated by the small singular values and y components.

To make a few direct comparisons, we could overplot some of these. With the spline and loess smoother matrices stored as S_s and S_l, and the same two rows (row) as before, each chosen row can be drawn for both smoothers, e.g.

plot(ourdata$x, S_s[row[i], ], type = "l",
     xlim = extendrange(ourdata$x),
     ylim = extendrange(c(range(S_l[row, ]), range(S_s[row, ]))),
     xlab = "x", ylab = "Coefficient of y",
     main = "Two rows of smoother matrix",
     lwd = 3, col = adjustcolor("firebrick", 0.5))
abline(v = ourdata$x[row[i]], col = "grey80")
# And another row i
[Figure: “Two rows of smoother matrix”, the coefficient of each y_i plotted against x for two rows of each smoother matrix, spline and loess.]
plot(svd_l$d, type = "b", main = "singular values",
     col = adjustcolor("steelblue", 0.5), cex = 0.5, pch = 19,
     xlab = "index", ylab = "rho")
points(svd_s$d, type = "b", col = adjustcolor("firebrick", 0.5),
       cex = 0.5, pch = 19)
[Figure: “singular values”, the singular values of the loess (steelblue) and spline (firebrick) smoother matrices plotted against index.]
plot(t(svd_s$v) %*% ourdata$y, col = adjustcolor("firebrick", 0.5),
     cex = 0.5, pch = 19, ylim = c(-5, 15), type = "b",
     main = "y components", xlab = "index", ylab = "y-component")
points(t(svd_l$v) %*% ourdata$y, col = adjustcolor("steelblue", 0.5),
       cex = 0.5, pch = 19, type = "b")
[Figure: “y components”, the components ⟨V_i, y⟩ for the spline (firebrick) and loess (steelblue) smoother matrices plotted against index.]
3 More than one explanatory variate
Suppose we have more than one explanatory variate. For example, consider the ozone data set from the package ElemStatLearn:

library(ElemStatLearn)
head(ozone)

##   ozone radiation temperature wind
## 1    41       190          67  7.4
## 2    36       118          72  8.0
## 3    12       149          74 12.6
## 4    18       313          62 11.5
## 5    23       299          65  8.6
## 6    19        99          59 13.8
Here we have 111 daily measurements taken from May until September 1973 in New York on four variates:

• ozone, the ozone concentration in parts per billion (ppb),
• radiation, the solar radiation energy measured in langleys (https://en.wikipedia.org/wiki/Langley_(unit)),
• temperature, the maximum temperature that day in degrees Fahrenheit (https://en.wikipedia.org/wiki/Fahrenheit), and
• wind, the wind speed in miles per hour.

Interest lies in modelling how ozone depends on the other variates.

pairs(ozone, pch=19, col=adjustcolor("firebrick",0.4))
[Figure: pairs plot of the four variates ozone, radiation, temperature, and wind.]
To begin, let's just try modelling ozone levels as a function of only two variates. That will give three variates in total and allow us to see what's going on using some three dimensional graphics called scatter3d from the car package.

For example, we could use a linear model for the mean ozone level,

µ(x_i, z_i) = α + β(x_i − x̄) + γ(z_i − z̄)

where x_i is the radiation and z_i is the temperature. In this case, we are fitting a plane to a three dimensional point cloud. Execute the following code and explore.

library(rgl)  # Access to all of openGL graphics
# Get some graphing code (very slightly) adapted from John Fox's
# "car" package to accommodate loess.
# source("../../Code/scatter3d.R")   # from the course home page
# scatter3d(ozone ~ radiation + temperature, data=ozone)

# Get a png snapshot of this using the rgl package's
# snapshot3d("ozoneRegressionPlane.png")
Figure 1: Fitting a plane to ozone, radiation, and temperature
Trying some power transformations on the ozone variate, it might be better to model its cube root, ozone^(1/3), instead.

scatter3d(ozone^(1/3) ~ radiation + temperature,
          data=ozone,
          model.summary=TRUE  # print out a summary of the fitted model
          )
## $linear
##
## Call:
## lm(formula = y ~ x + z)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.23670 -0.08050 -0.00671  0.05930  0.38993
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.10578    0.02951   3.585 0.000508 ***
## x            0.14024    0.04010   3.497 0.000685 ***
## z            0.55325    0.05044  10.968  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1121 on 108 degrees of freedom
## Multiple R-squared:  0.6112, Adjusted R-squared:  0.604
## F-statistic: 84.88 on 2 and 108 DF,  p-value: < 2.2e-16

# snapshot3d("ozone3RegressionPlane.png")

# Note that the above coefficient estimates
# do not match the ones below
lmfit <- lm(ozone^(1/3) ~ radiation + temperature, data=ozone)
summary(lmfit)

## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.1530326  0.4398302  -4.895 3.46e-06 ***
## radiation    0.0021443  0.0006132   3.497 0.000685 ***
## temperature  0.0643317  0.0058653  10.968  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5603 on 108 degrees of freedom
## Multiple R-squared:  0.6112, Adjusted R-squared:  0.604
## F-statistic: 84.88 on 2 and 108 DF,  p-value: < 2.2e-16
# The difference in the coefficients is because
# scatter3d standardizes all of the variates before fitting.
# Note that the R-squared, etc. is the same.

Figure 2: Fitting a plane to ozone, radiation, and temperature
What happens if we add an interaction term?

scatter3d(ozone^(1/3) ~ radiation + temperature,
          data=ozone,
          fit = "interaction")
# snapshot3d("ozone3interaction.png")

The difference might best be seen when we have these two together on the same plot.

scatter3d(ozone^(1/3) ~ radiation + temperature,
          data=ozone,
          fit = c("linear", "interaction"))
# snapshot3d("ozone3linearandinteraction.png")

The plane (linear) fit is in blue, the one with an interaction in green.
Figure 3: Fitting a plane to ozone, radiation, and temperature

Figure 4: Fitting an interaction to ozone, radiation, and temperature
The plane fitted a surface of the form

µ̂(x, z) = β̂_0 + β̂_1(x − x̄) + β̂_2(z − z̄)

which is strictly additive in each of x and z. Moreover, each additive component for x and for z is itself linear (in x and in z respectively). Note that we choose to centre each of the explanatory variates. The “intercept” term β̂_0 in this model is interpreted as the point on the surface (or average of the ys) where x = x̄ and z = z̄.

When we add an interaction term, we lose the additivity. The surface being fit now has the form

µ̂(x, z) = β̂_0 + β̂_1(x − x̄) + β̂_2(z − z̄) + β̂_3(x − x̄)(z − z̄).

This is no longer additive in x or in z. Instead, the interaction term means that the component for x will depend on the value of z and conversely the component for z will depend on the value of x. That is, for any choice of value z_0 of z, the component for x is still a straight line

(β̂_0 + β̂_2(z_0 − z̄)) + (β̂_1 + β̂_3(z_0 − z̄))(x − x̄)

except that now both the intercept and the slope depend on the value of z_0. A similar line holds for the component of z at any value x_0 of x.

Suppose we add quadratic components for each variate.

scatter3d(ozone^(1/3) ~ radiation + temperature,
          data=ozone,
          fit=c("quadratic"))
# snapshot3d("ozone3RegressionQuadratic.png")
The fitted surface has the form

µ̂(x, z) = β̂_0 + β̂_1(x − x̄) + β̂_2(z − z̄) + β̂_3(x − x̄)² + β̂_4(z − z̄)².

As was the case with the simple planar surface, this surface is also additive in each of x and z. Instead of a straight line, though, each additive function is now a quadratic.

A non-additive quadratic fit would include a cross product (or interaction) term xz:

scatter3d(ozone^(1/3) ~ radiation + temperature,
          data=ozone,
          fit=c("quadint"))
# snapshot3d("ozone3quadint.png")

The effect of the additional interaction term is best appreciated by having both surfaces appear together:

scatter3d(ozone^(1/3) ~ radiation + temperature,
          data=ozone,
          fit=c("quadratic", "quadint"))

Again, blue is the additive quadratic, green the quadratic with interaction. The effect of the interaction term on the quadratic additive fit is very much like that of the same term on the linear additive fit.
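The same surfaces can also be fit directly with lm(); here is a minimal sketch (not from these notes, centring the variates by hand as in the formulas above) that may help connect the scatter3d fits to ordinary linear models:

# Centre the two explanatory variates, as in the formulas above.
x <- ozone$radiation   - mean(ozone$radiation)
z <- ozone$temperature - mean(ozone$temperature)
y <- ozone$ozone^(1/3)

fit.plane    <- lm(y ~ x + z)                              # additive and linear
fit.interact <- lm(y ~ x + z + I(x * z))                   # adds the interaction
fit.quad     <- lm(y ~ x + z + I(x^2) + I(z^2))            # additive quadratic
fit.quadint  <- lm(y ~ x + z + I(x^2) + I(z^2) + I(x * z)) # quadratic with interaction

# e.g. is the interaction worth adding to the additive quadratic?
anova(fit.quad, fit.quadint)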
3.1 Multiple regression smooths.
Rather than use simple polynomial models, we could also use smoothers. And, in the same fashion, we can choose between additive models and non-additive, or interaction, models.
Figure 5: Fitting an additive paraboloid to ozone, radiation, and temperature

Figure 6: Fitting a paraboloid to ozone, radiation, and temperature
Two smoothing approaches which generalize to the multiple regression case are local polynomial regression using loess, and thin-plate splines, the smoothing splines with the basis functions described earlier.

We can have a look at each of these in turn.
3.1.1 loess
We begin with using local polynomial regression via loess. With default settings we have:

scatter3d(ozone^(1/3) ~ radiation + temperature,
          data=ozone,
          fit=c("loess"))
# snapshot3d("ozone3loess.png")
Figure 7: loess smooth - to ozone, radiation, and temperature
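Behind such a surface is an ordinary loess fit with two explanatory variates; a minimal sketch (the span and degree here are illustrative choices, not the values scatter3d uses) of fitting it directly and predicting over a grid:

lfit <- loess(ozone^(1/3) ~ radiation + temperature, data = ozone,
              degree = 2, span = 0.75)

# predictions over a grid of (radiation, temperature) values
grid <- expand.grid(
  radiation   = seq(min(ozone$radiation),   max(ozone$radiation),   length.out = 30),
  temperature = seq(min(ozone$temperature), max(ozone$temperature), length.out = 30))
grid$fit <- predict(lfit, newdata = grid)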
We can see the effect of a more flexible fit by increasing the effective degrees of freedom for the smooth.

scatter3d(ozone^(1/3) ~ radiation + temperature,
          data=ozone,
          fit=c("loess"),
          df.loess=20)
# snapshot3d("ozone3loessdf20.png")

Figure 8: loess with 20 df - to ozone, radiation, and temperature

scatter3d(ozone^(1/3) ~ radiation + temperature,
          data=ozone,
          fit=c("loess"),
          df.loess=30)
# snapshot3d("ozone3loessdf30.png")
Figure 9: loess with 30 df - to ozone, radiation, and temperature
3.1.2 Thin-plate splines
The default smooth here is a “thin-plate spline” as described earlier.

scatter3d(ozone^(1/3) ~ radiation + temperature,
          data=ozone,
          fit=c("smooth"))
# snapshot3d("ozone3smooth.png")
Figure 10: thin plate smooth - to ozone, radiation, and temperature
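The notes fit this surface through scatter3d; an alternative route, not used here but perhaps useful for experimentation, is mgcv::gam, whose bivariate smooth s(x, z) uses a thin-plate regression spline basis by default (the basis dimension k = 30 below is an arbitrary illustrative choice):

library(mgcv)
tps <- gam(ozone^(1/3) ~ s(radiation, temperature, k = 30), data = ozone)
summary(tps)   # reports the effective degrees of freedom of the fitted smooth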
We can compare this flexible fit with loess by trying to match the effective degrees of freedom for each smooth.

scatter3d(ozone^(1/3) ~ radiation + temperature,
          data=ozone,
          fit=c("smooth"),
          df.smooth=20)
# snapshot3d("ozone3smoothdf20.png")
Figure 11: thin plate smooth with 20 df - to ozone, radiation, and temperature
We can compare this directly with loess by placing them on the same plot.

scatter3d(ozone^(1/3) ~ radiation + temperature,
          data=ozone,
          fit=c("loess","smooth"),
          df.loess=20,
          df.smooth=20)

Here loess is blue, thin-plate green. The major differences appear to be at the corners where there are no points – the differences reflect different strategies for extrapolation.

Again, for comparison, we look at the thin-plate smooth for 30 effective degrees of freedom.
scatter3d(ozone^(1/3) ~ radiation + temperature,
          data=ozone,
          fit=c("smooth"),
          df.smooth=30)
# snapshot3d("ozone3smoothdf30.png")
Figure 12: thin plate smooth with 30 df - to ozone, radiation, and temperature
And compare the thin-plate spline (green) with the analogous loess (blue):

scatter3d(ozone^(1/3) ~ radiation + temperature,
          data=ozone,
          fit=c("loess", "smooth"),
          df.loess=30,
          df.smooth=30)
# snapshot3d("ozone3bothdf30.png")
Figure 13: loess and thin plate smooths each with 30 df - to ozone, radiation, and temperature
Again the major differences are in locations with no points in
the dataset.
3.2 Additive models
One of the challenges of multiple regression, particularly for thin-plate splines and local regression methods, is that as the dimensionality increases so must the local neighbourhood in order to get enough points to construct the fit.

One way around this is to use additive models. Since each additive component is a function only of a single variate, the neighbourhoods needed for estimation need not be as large. The model surface fitted now has the form

µ̂(x, z) = β̂_0 + β̂_1 f_1(x − x̄) + β̂_2 f_2(z − z̄)

where f_1(·) and f_2(·) are splines of some sort (typically either regression or smoothing splines). The default in our demonstration software scatter3d uses cubic regression splines. These behave much like additive terms in the usual linear model, except that now the functional form of each component is very flexible.

scatter3d(ozone^(1/3) ~ radiation + temperature,
          data=ozone,
          fit=c("additive"))
# snapshot3d("ozone3additive.png")
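Outside of scatter3d, such an additive model could also be fit directly; here is a minimal sketch with mgcv::gam (one smooth per explanatory variate; mgcv is not otherwise used in these notes):

library(mgcv)
addfit <- gam(ozone^(1/3) ~ s(radiation) + s(temperature), data = ozone)
summary(addfit)          # effective degrees of freedom for each smooth component
plot(addfit, pages = 1)  # the two estimated additive components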
As with adding an interaction term, we might compare this with, for example, a thin plate spline:

scatter3d(ozone^(1/3) ~ radiation + temperature,
          data=ozone,
          fit=c("additive", "smooth"))

What we are giving up in flexibility, we are gaining in estimability and in interpretability.

We can make this a more flexible fit by specifying the span or, to make it comparable to the other smoothers, by specifying the effective degrees of freedom. Now the argument specifies the effective or target degrees of freedom for each component. This means that the total degrees of freedom for the smooth part (aside from the intercept) will be the sum of the degrees of freedom for each smooth component.

scatter3d(ozone^(1/3) ~ radiation + temperature,
          data=ozone,
          fit=c("additive"),
          df.additive=10)
# snapshot3d("ozone3additive10.png")
The corresponding thin-plate spline, for example, would have effective degrees of freedom of 20.

scatter3d(ozone^(1/3) ~ radia