Statistical Science, 1996, Vol. 11, No. 2, 89–121
Flexible Smoothing with B-splines and Penalties

Paul H. C. Eilers and Brian D. Marx
Abstract. B-splines are attractive for nonparametric modelling, but choosing the optimal number and positions of knots is a complex task. Equidistant knots can be used, but their small and discrete number allows only limited control over smoothness and fit. We propose to use a relatively large number of knots and a difference penalty on coefficients of adjacent B-splines. We show connections to the familiar spline penalty on the integral of the squared second derivative. A short overview of B-splines, of their construction and of penalized likelihood is presented. We discuss properties of penalized B-splines and propose various criteria for the choice of an optimal penalty parameter. Nonparametric logistic regression, density estimation and scatterplot smoothing are used as examples. Some details of the computations are presented.
Key words and phrases: Generalized linear models, smoothing, nonparametric models, splines, density estimation.
1. INTRODUCTION
There can be little doubt that smoothing has a respectable place in statistics today. Many papers and a number of books have appeared (Silverman, 1986; Eubank, 1988; Hastie and Tibshirani, 1990; Härdle, 1990; Wahba, 1990; Wand and Jones, 1993; Green and Silverman, 1994). There are several reasons for this popularity: many data sets are too “rich” to be fully modeled with parametric models; graphical presentation has become increasingly more important and easier to use; and exploratory analysis of data has become more common.
Actually, the name nonparametric is not always well chosen. It might apply to kernel smoothers and running statistics, but spline smoothers are described by parameters, although their number can be large. It might be better to talk about “overparametric” techniques or “anonymous” models; the parameters have no scientific interpretation.
Paul H. C. Eilers is Department Head in the computing section of DCMR Milieudienst Rijnmond, 's-Gravelandseweg 565, 3119 XT Schiedam, The Netherlands (e-mail: [email protected]). Brian D. Marx is Associate Professor, Department of Experimental Statistics, Louisiana State University, Baton Rouge, LA 70803-5606 (e-mail: [email protected]).
There exist several refinements of running statistics, like kernel smoothers (Silverman, 1986; Härdle, 1990) and LOWESS (Cleveland, 1979). Splines come in several varieties: smoothing splines, regression splines (Eubank, 1988) and B-splines (de Boor, 1978; Dierckx, 1993). With so many techniques available, why should we propose a new one? We believe that a combination of B-splines and difference penalties (on the estimated coefficients), which we call P-splines, has very attractive properties. P-splines have no boundary effects, they are a straightforward extension of (generalized) linear regression models, conserve moments (means, variances) of the data and have polynomial curve fits as limits. The computations, including those for cross-validation, are relatively inexpensive and easily incorporated into standard software.

B-splines are constructed from polynomial pieces, joined at certain values of x, the knots. Once the knots are given, it is easy to compute the B-splines recursively, for any desired degree of the polynomial; see de Boor (1977, 1978), Cox (1981) or Dierckx (1993). The choice of knots has been a subject of much research: too many knots lead to overfitting of the data, too few knots lead to underfitting. Some authors have proposed automatic schemes for optimizing the number and the positions of the knots (Friedman and Silverman, 1989; Kooperberg and Stone, 1991, 1992). This is a difficult numerical problem and, to our knowledge, no attractive all-purpose scheme exists.
A different track was chosen by O'Sullivan (1986, 1988). He proposed to use a relatively large number of knots. To prevent overfitting, a penalty on the second derivative restricts the flexibility of the fitted curve, similar to the penalty pioneered for smoothing splines by Reinsch (1967), which has become the standard in much of the spline literature; see, for example, Eubank (1988), Wahba (1990) and Green and Silverman (1994). In this paper we simplify and generalize the approach of O'Sullivan, in such a way that it can be applied in any context where regression on B-splines is useful. Only small modifications of the regression equations are necessary.
The basic idea is not to use the integral of a squared higher derivative of the fitted curve in the penalty, but instead to use a simple difference penalty on the coefficients themselves of adjacent B-splines. We show that both approaches are very similar for second-order differences. In some applications, however, it can be useful to use differences of a lower or higher order in the penalty. With our approach it is simple to incorporate a penalty of any order in the (generalized) regression equations.
A major problem of any smoothing technique is the choice of the optimal amount of smoothing, in our case the optimal weight of the penalty. We use cross-validation and the Akaike information criterion (AIC). In the latter the effective dimension, that is, the effective number of parameters, of a model plays a crucial role. We follow Hastie and Tibshirani (1990) in using the trace of the smoother matrix as the effective dimension. Because we use standard regression techniques, this quantity can be computed easily. We find the trace very useful to compare the effective amount of smoothing for different numbers of knots, different degrees of the B-splines and different orders of penalties.
We investigate the conservation of moments of different order, in relation to the degree of the B-splines and the order of the differences in the penalty. To illustrate the use of P-splines, we present the following as applications: smoothing of scatterplots; modeling of dose–response curves; and density estimation.
2. B-SPLINES IN A NUTSHELL
Not all readers will be familiar with B-splines. Basic references are de Boor (1978) and Dierckx (1993), but, to illustrate the basic simplicity of the ideas, we explain some essential background here. A B-spline consists of polynomial pieces, connected in a special way. A very simple example is shown at the left of Figure 1(a): one B-spline of degree 1. It consists of two linear pieces; one piece from x_1 to x_2, the other from x_2 to x_3. The knots are x_1, x_2 and x_3. To the left of x_1 and to the right of x_3 this B-spline is zero. In the right part of Figure 1(a), three more B-splines of degree 1 are shown: each one based on three knots. Of course, we can construct as large a set of B-splines as we like, by introducing more knots.
In the left part of Figure 1(b), a B-spline of degree 2 is shown. It consists of three quadratic pieces, joined at two knots. At the joining points not only the ordinates of the polynomial pieces match, but also their first derivatives are equal (but not their second derivatives). The B-spline is based on four adjacent knots: x_1, …, x_4. In the right part of Figure 1(b), three more B-splines of degree 2 are shown.
Note that the B-splines overlap each other. First-degree B-splines overlap with two neighbors, second-degree B-splines with four neighbors, and so on. Of course, the leftmost and rightmost splines have less overlap. At a given x, two first-degree (or three second-degree) B-splines are nonzero.
These examples illustrate the general properties of a B-spline of degree q:

• it consists of q + 1 polynomial pieces, each of degree q;
• the polynomial pieces join at q inner knots;
• at the joining points, derivatives up to order q − 1 are continuous;
• the B-spline is positive on a domain spanned by q + 2 knots; everywhere else it is zero;
• except at the boundaries, it overlaps with 2q polynomial pieces of its neighbors;
• at a given x, q + 1 B-splines are nonzero.

Let the domain from x_min to x_max be divided into n′ equal intervals by n′ + 1 knots. Each interval will be covered by q + 1 B-splines of degree q. The total number of knots for construction of the B-splines will be n′ + 2q + 1. The number of B-splines in the regression is n = n′ + q. This is easily verified by constructing graphs like those in Figure 1.

B-splines are very attractive as base functions for (“nonparametric”) univariate regression. A linear combination of (say) third-degree B-splines gives a smooth curve. Once one can compute the B-splines themselves, their application is no more difficult than polynomial regression.
Fig. 1. Illustrations of one isolated B-spline and several overlapping ones: (a) degree 1; (b) degree 2.

De Boor (1978) gave an algorithm to compute B-splines of any degree from B-splines of lower degree. Because a zero-degree B-spline is just a constant on one interval between two knots, it is simple to compute B-splines of any degree. In this paper we use only equidistant knots, but de Boor's algorithm also works for any placement of knots. For equidistant knots, the algorithm can be further simplified, as is illustrated by a small MATLAB function in the Appendix.
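De Boor's recursion for equidistant knots is compact enough to sketch in a few lines. The following is a minimal numpy translation of the idea (the paper's MATLAB appendix is not reproduced here; the function name and argument layout are our own). It builds the degree-0 indicator basis and applies the recursion deg times; the result has n = n′ + q columns, as stated above.

```python
import numpy as np

def bspline_basis(x, xmin, xmax, nseg, deg):
    """B-spline basis on equidistant knots via de Boor's recursion.

    Returns a matrix with len(x) rows and nseg + deg columns,
    matching n = n' + q in the text.
    """
    dx = (xmax - xmin) / nseg
    # knot grid with deg extra knots on each side of the domain
    t = xmin + dx * np.arange(-deg, nseg + deg + 1)
    x = np.asarray(x, dtype=float)
    # degree 0: indicator of each (half-open) knot interval
    B = ((x[:, None] >= t[None, :-1]) & (x[:, None] < t[None, 1:])).astype(float)
    for q in range(1, deg + 1):
        # Cox-de Boor recursion; all knot differences equal q * dx here
        B = ((x[:, None] - t[None, : -(q + 1)]) * B[:, :-1]
             + (t[None, q + 1 :] - x[:, None]) * B[:, 1:]) / (q * dx)
    return B

# three arbitrary points, 20 would-be knots reduced to 10 segments for brevity
B = bspline_basis(np.array([0.137, 0.512, 0.898]), 0.0, 1.0, 10, 3)
```

At any x inside the domain the rows sum to 1 (partition of unity) and exactly q + 1 entries are nonzero, in line with the properties listed above.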
Let B_j(x; q) denote the value at x of the j-th B-spline of degree q for a given equidistant grid of knots. A fitted curve ŷ to data (x_i, y_i) is the linear combination ŷ(x) = ∑_{j=1}^n â_j B_j(x; q). When the degree of the B-splines is clear from the context, or immaterial, we use B_j(x) instead of B_j(x; q).
The indexing of B-splines needs some care, especially when we are going to use derivatives. The indexing connects a B-spline to a knot; that is, it gives the index of the knot that characterizes the position of the B-spline. Our choice is to take the leftmost knot, the knot at which the B-spline starts to become nonzero. In Figure 1(a), x_1 is the positioning knot for the first B-spline. This choice of indexing demands that we introduce q knots to the left of the domain of x. In the formulas that follow for derivatives, the exact bounds of the index in the sums are immaterial, so we have left them out.
De Boor (1978) gives a simple formula for derivatives of B-splines:

h ∑_j a_j B′_j(x; q) = ∑_j a_j B_j(x; q − 1) − ∑_j a_{j+1} B_j(x; q − 1)
                    = −∑_j Δa_{j+1} B_j(x; q − 1),   (1)

where h is the distance between knots and Δa_j = a_j − a_{j−1}.
By induction we find the following for the second derivative:

h² ∑_j a_j B″_j(x; q) = ∑_j Δ²a_j B_j(x; q − 2),   (2)

where Δ²a_j = Δ(Δa_j) = a_j − 2a_{j−1} + a_{j−2}. This fact will prove very useful when we compare continuous and discrete roughness penalties in the next section.
3. PENALTIES
Consider the regression of m data points (x_i, y_i) on a set of n B-splines B_j(·). The least squares objective function to minimize is

S = ∑_{i=1}^m {y_i − ∑_{j=1}^n a_j B_j(x_i)}².   (3)
Let the number of knots be relatively large, such that the fitted curve will show more variation than is justified by the data. To make the result less flexible, O'Sullivan (1986, 1988) introduced a penalty on the second derivative of the fitted curve and so formed the objective function

S = ∑_{i=1}^m {y_i − ∑_{j=1}^n a_j B_j(x_i)}² + λ ∫_{x_min}^{x_max} {∑_{j=1}^n a_j B″_j(x)}² dx.   (4)
The integral of the square of the second derivative of a fitted function has become common as a smoothness penalty, since the seminal work on smoothing splines by Reinsch (1967). There is nothing special about the second derivative; in fact, lower or higher orders might be used as well. In the context of smoothing splines, the first derivative leads to simple equations and a piecewise linear fit, while higher derivatives lead to rather complex mathematics, systems of equations with a high bandwidth, and a very smooth fit.
We propose to base the penalty on (higher-order) finite differences of the coefficients of adjacent B-splines:

S = ∑_{i=1}^m {y_i − ∑_{j=1}^n a_j B_j(x_i)}² + λ ∑_{j=k+1}^n (Δ^k a_j)².   (5)
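In matrix form the penalty in (5) is λ‖D_k a‖², where D_k is the (n − k) × n matrix that forms kth differences of a vector. A small numpy check (our own illustration, not from the paper) confirms that differencing the identity matrix produces this D_k:

```python
import numpy as np

n, k = 8, 2
a = np.arange(n, dtype=float) ** 2        # example coefficient vector a_j = j^2
Dk = np.diff(np.eye(n), n=k, axis=0)      # (n - k) x n matrix form of Delta^k
penalty_via_matrix = float(a @ Dk.T @ Dk @ a)
penalty_direct = float(np.sum(np.diff(a, n=k) ** 2))
# for a_j = j^2 every second difference is 2, so both equal (n - k) * 4 = 24
```

This mechanical construction works for any order k, which is the point made in the text: changing the order of the penalty changes only D_k.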
This approach reduces the dimensionality of the problem to n, the number of B-splines, instead of m, the number of observations, as with smoothing splines. We still have a parameter λ for continuous control over smoothness of the fit. The difference penalty is a good discrete approximation to the integrated square of the kth derivative. What is more important: with this penalty, moments of the data are conserved and polynomial regression models occur as limits for large values of λ. See Section 5 for details.
We will show below that there is a very strong connection between a penalty on second-order differences of the B-spline coefficients and O'Sullivan's choice of a penalty on the second derivative of the fitted function. However, our penalty can be handled mechanically for any order of the differences (see the implementation in the Appendix).
Difference penalties have a long history that goes back at least to Whittaker (1923); recent applications have been described by Green and Yandell (1985) and Eilers (1989, 1991a, b, 1995).
The difference penalty is easily introduced into the regression equations. That makes it possible to experiment with different orders of the differences. In some cases it is useful to work with even the fourth or higher order. This stems from the fact that for high values of λ the fitted curve approaches a parametric (polynomial) model, as will be shown below.
O'Sullivan (1986, 1988) used third-degree B-splines and the following penalty:

P = λ ∫_{x_min}^{x_max} {∑_j a_j B″_j(x; 3)}² dx.   (6)
From the derivative property (2) of B-splines it follows that

h⁴P = λ ∫_{x_min}^{x_max} {∑_j Δ²a_j B_j(x; 1)}² dx.   (7)
This can be written as

h⁴P = λ ∫_{x_min}^{x_max} ∑_j ∑_k Δ²a_j Δ²a_k B_j(x; 1) B_k(x; 1) dx.   (8)
Most of the cross products of B_j(x; 1) and B_k(x; 1) disappear, because B-splines of degree 1 only overlap when j is k − 1, k or k + 1. We thus have that

h⁴P = λ ∫_{x_min}^{x_max} [∑_j (Δ²a_j)² B_j²(x; 1) + 2 ∑_j Δ²a_j Δ²a_{j−1} B_j(x; 1) B_{j−1}(x; 1)] dx,   (9)
or

h⁴P = λ ∑_j (Δ²a_j)² ∫_{x_min}^{x_max} B_j²(x; 1) dx + 2λ ∑_j Δ²a_j Δ²a_{j−1} ∫_{x_min}^{x_max} B_j(x; 1) B_{j−1}(x; 1) dx,   (10)
which can be written as

h⁴P = λ {c_1 ∑_j (Δ²a_j)² + c_2 ∑_j Δ²a_j Δ²a_{j−1}},   (11)

where c_1 and c_2 are constants for given (equidistant) knots:

c_1 = ∫_{x_min}^{x_max} B_j²(x; 1) dx;   c_2 = ∫_{x_min}^{x_max} B_j(x; 1) B_{j−1}(x; 1) dx.   (12)
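For an interior first-degree B-spline on equidistant knots the constants in (12) have simple closed forms, c_1 = 2h/3 and c_2 = h/6 (a standard exercise in integrating the hat function; these values are our addition, not stated in the text). A quick numerical check with a trapezoidal sum:

```python
import numpy as np

h = 0.1
xg = np.linspace(-h, 2 * h, 60001)                # grid covering both supports
w = xg[1] - xg[0]
B0 = np.maximum(0.0, 1.0 - np.abs(xg) / h)        # hat B_j(x; 1), centered at 0
B1 = np.maximum(0.0, 1.0 - np.abs(xg - h) / h)    # neighboring hat B_{j-1} shifted by h

def trap(f):
    # trapezoidal rule on the equidistant grid xg
    return float(np.sum((f[1:] + f[:-1]) * 0.5) * w)

c1 = trap(B0 * B0)   # should be close to 2*h/3
c2 = trap(B0 * B1)   # should be close to h/6
```

The overlap integral c_2 is what generates the cross-product term in (11).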
The first term in (11) is equivalent to our second-order difference penalty; the second term contains cross products of neighboring second differences. This leads to more complex equations when minimizing the penalized likelihood (equations in which seven adjacent a_j's occur, compared to five if only squares of second differences occur in the penalty). The higher complexity of the penalty equations stems from the overlapping of B-splines. With higher order differences and/or higher degrees of the B-splines, the complications grow rapidly and make it rather difficult to construct an automatic procedure for incorporating the penalty in the likelihood equations. With the use of a difference penalty on the coefficients of the B-splines this problem disappears.
4. PENALIZED LIKELIHOOD
For least squares smoothing we have to minimize S in (5). The system of equations that follows from the minimization of S can be written as

Bᵀy = (BᵀB + λ D_kᵀ D_k) a,   (13)

where D_k is the matrix representation of the difference operator Δ^k, and the elements of B are b_ij = B_j(x_i). When λ = 0, we have the standard normal equations of linear regression with a B-spline basis. With k = 0 we have a special case of ridge regression. When λ > 0, the penalty only influences the main diagonal and k subdiagonals (on both sides of the main diagonal) of the system of equations. This system has a banded structure because of the limited overlap of the B-splines. It is seldom worth the trouble to exploit this special structure, as the number of equations is equal to the number of splines, which is generally moderate (10–20).
In a generalized linear model (GLM), we introduce a linear predictor η_i = ∑_{j=1}^n b_ij a_j and a (canonical) link function η_i = g(μ_i), where μ_i is the expectation of y_i. The penalty now is subtracted from the log-likelihood l(y; a) to form the penalized likelihood function

L = l(y; a) − (λ/2) ∑_{j=k+1}^n (Δ^k a_j)².   (14)
The optimization of L leads to the following system of equations:

Bᵀ(y − μ) = λ D_kᵀ D_k a.   (15)

These are solved as usual with iterative weighted linear regressions with the system

Bᵀ W̃ (y − μ̃) + Bᵀ W̃ B ã = (Bᵀ W̃ B + λ D_kᵀ D_k) a,   (16)

where ã and μ̃ are current approximations to the solution and W̃ is a diagonal matrix of weights

w_ii = (1/v_i) (∂μ_i/∂η_i)²,   (17)

where v_i is the variance of y_i, given μ_i. The only difference from the standard procedure for fitting of GLM's (McCullagh and Nelder, 1989), with B-splines as regressors, is the modification of Bᵀ W̃ B by λ D_kᵀ D_k (which itself is constant for fixed λ) at each iteration.
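For Poisson counts with the canonical log link, v_i = μ_i and ∂μ_i/∂η_i = μ_i, so w_ii = μ_i and each pass of (16) is a plain weighted penalized solve. A minimal sketch (our own, with simulated counts and a hat-function basis for brevity); note that at convergence ∑ y_i = ∑ μ̂_i, an instance of the moment conservation discussed in Section 5:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k, lam = 200, 12, 2, 1.0
x = np.linspace(0.0, 1.0, m)
y = rng.poisson(np.exp(1.0 + np.sin(2.0 * np.pi * x)))   # simulated counts

centers = np.linspace(0.0, 1.0, n)
h = centers[1] - centers[0]
B = np.maximum(0.0, 1.0 - np.abs(x[:, None] - centers[None, :]) / h)
Dk = np.diff(np.eye(n), n=k, axis=0)
P = lam * Dk.T @ Dk

a = np.zeros(n)
for _ in range(50):
    eta = B @ a
    mu = np.exp(eta)
    w = mu                                    # (17): w_ii = (1/v_i)(dmu/deta)^2 = mu_i
    # one pass of (16): (B'WB + lam D'D) a_new = B'W eta + B'(y - mu)
    lhs = B.T @ (w[:, None] * B) + P
    rhs = B.T @ (w * eta) + B.T @ (y - mu)
    a_new = np.linalg.solve(lhs, rhs)
    if np.max(np.abs(a_new - a)) < 1e-9:
        a = a_new
        break
    a = a_new

mu_hat = np.exp(B @ a)
```

Only the n × n system changes between iterations; the penalty matrix P is fixed for given λ, exactly as the text observes.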
5. PROPERTIES OF P-SPLINES
P-splines have a number of useful properties, partially inherited from B-splines. We give a short overview, with somewhat informal proofs.
In the first place: P-splines show no boundary effects, as many types of kernel smoothers do. By this we mean the spreading of a fitted curve or density outside of the (physical) domain of the data, generally accompanied by bending toward zero. In Section 8 this aspect is considered in some detail, in the context of density smoothing.

P-splines can fit polynomial data exactly. Let data (x_i, y_i) be given. If the y_i are a polynomial in x of degree k, then B-splines of degree k or higher will exactly fit the data (de Boor, 1977). The same is true for P-splines, if the order of the penalty is k + 1 or higher, whatever the value of λ. To see that this is true, take the case of a first-order penalty and the fit to data y that are constant (a polynomial of degree 0). Because ∑_{j=1}^n â_j B_j(x) = c, we have that ∑_{j=1}^n â_j B′_j(x) = 0, for all x. Then it follows from the relationship between differences and derivatives in (1) that all Δâ_j are zero, and thus that ∑_{j=2}^n (Δâ_j)² = 0. Consequently, the penalty has no effect and the fit is the same as for unpenalized B-splines. This reasoning can easily be extended by induction to data with a linear relationship between x and y, and a second-order difference penalty.

P-splines can conserve moments of the data. For a linear model with P-splines of degree k + 1 and a penalty of order k + 1, or higher, it holds that

∑_{i=1}^m x_i^k y_i = ∑_{i=1}^m x_i^k ŷ_i,   (18)

for all values of λ, where ŷ_i = ∑_{j=1}^n b_ij â_j are the fitted values. For GLM's with canonical links it holds that

∑_{i=1}^m x_i^k y_i = ∑_{i=1}^m x_i^k μ̂_i.   (19)

This property is especially useful in the context of density smoothing: the mean and variance of the estimated density will be equal to the mean and variance of the data, for any amount of smoothing. This is an advantage compared to kernel smoothers: these inflate the variance increasingly with stronger smoothing.
The limit of a P-spline fit with strong smoothing is a polynomial. For large values of λ and a penalty of order k, the fitted series will approach a polynomial of degree k − 1, if the degree of the B-splines is equal to, or higher than, k. Once again, the relationships between derivatives of a B-spline fit and differences of coefficients, as in (1) and (2), are the key. Take the example of a second-order difference penalty: when λ is large, ∑_{j=3}^n (Δ²a_j)² has to be very near zero. Thus each of the second differences has to be near zero, and thus the second derivative of the fit has to be near zero everywhere. In view of these very useful results, it seems that B-splines and difference penalties are the ideal marriage.
It is important to focus on the linearized smoothing problem that is solved at each iteration, because we will make use of properties of the smoothing matrix. From (16) follows for the hat matrix H:

H = B (Bᵀ W̃ B + λ D_kᵀ D_k)⁻¹ Bᵀ W̃.   (20)
The trace of H will approach k as λ increases. A proof goes as follows. Let

Q_B = Bᵀ W̃ B  and  Q_λ = λ DᵀD.   (21)

Write tr(H) as

tr(H) = tr[(Q_B + Q_λ)⁻¹ Q_B]
      = tr[Q_B^{1/2} (Q_B + Q_λ)⁻¹ Q_B^{1/2}]
      = tr[(I + Q_B^{−1/2} Q_λ Q_B^{−1/2})⁻¹].   (22)

This can be written as

tr(H) = tr[(I + λL)⁻¹] = ∑_{j=1}^n 1/(1 + λγ_j),   (23)

where

L = Q_B^{−1/2} (DᵀD) Q_B^{−1/2}   (24)

and γ_j, for j = 1, …, n, are the eigenvalues of L. Because k eigenvalues of DᵀD are zero, L has k zero eigenvalues. When λ is large, only the k terms with γ_j = 0 contribute to the sum in (23), and thus to the trace of H. Hence tr(H) approaches k for large λ.
6. OPTIMAL SMOOTHING, AIC AND CROSS-VALIDATION
Now that we can easily influence the smoothness of a fitted curve with λ, we need some way to choose an “optimal” value for it. We propose to use the Akaike information criterion (AIC).
The basic idea of AIC is to correct the log-likelihood of a fitted model for the effective number of parameters. An extensive discussion and applications can be found in Sakamoto, Ishiguro and Kitagawa (1986). Instead of the log-likelihood, the deviance is easier to use. The definition of AIC is equivalent to

AIC(λ) = dev(y; a, λ) + 2 dim(a, λ),   (25)

where dim(a, λ) is the (effective) dimension of the vector of parameters, a, and dev(y; a, λ) is the deviance.
Computation of the deviance is straightforward, but how shall we determine the effective dimension of our P-spline fit? We find a solution in Hastie and Tibshirani (1990). They discuss the effective dimensions of linear smoothers and propose to use the trace of the smoother matrix as an approximation. In our case that means dim(a) = tr(H). Note that tr(H) = n when λ = 0, as in (nonsingular) standard linear regression.
As tr(AB) = tr(BA) (for conformable matrices), it is computationally advantageous to use

tr(H) = tr[B (BᵀWB + λD_kᵀD_k)⁻¹ BᵀW]
      = tr[(BᵀWB + λD_kᵀD_k)⁻¹ BᵀWB].   (26)

The latter expression involves only n-by-n matrices, whereas H is an m-by-m matrix.
In some GLM's, the scale of the data is known, as for counts with a Poisson distribution and for binomial data; then the deviance can be computed directly. For linear data, an estimate of the variance is needed. One approach is to take the variance of the residuals from the ŷ_i that are computed when λ = 0, say σ̂_0²:

AIC = ∑_{i=1}^m (y_i − μ̂_i)²/σ̂_0² + 2 tr(H) + 2m ln σ̂_0 + m ln 2π.   (27)
This choice for the variance is rather arbitrary, as it depends on the number of knots. Alternatives can be based on (generalized) cross-validation. For ordinary cross-validation we compute

CV(λ) = ∑_{i=1}^m {(y_i − ŷ_i)/(1 − h_ii)}²,   (28)

where the h_ii are the diagonal elements of the hat matrix H. For generalized cross-validation (Wahba, 1990), we compute

GCV(λ) = ∑_{i=1}^m (y_i − ŷ_i)² / (m − ∑_{i=1}^m h_ii)².   (29)

The difference between both quantities is generally small. The best λ is the value that minimizes CV(λ) or GCV(λ). The variance of the residuals at the optimal λ is a natural choice to use as an estimate of σ_0² for the computation of AIC(λ). It is practical to work with modified versions of CV(λ) and GCV(λ), with values that can be interpreted as estimates of the cross-validation standard deviation:

CV*(λ) = √(CV(λ)/m);   GCV*(λ) = √(m GCV(λ)).   (30)
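A direct transcription of (28)–(30) (our own sketch, with W = I, simulated data and a hat-function basis for brevity): the diagonal h_ii comes from the hat matrix in (20), and the two rescaled quantities in (30) are typically close to each other.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, k = 80, 12, 2
x = np.linspace(0.0, 1.0, m)
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, m)

centers = np.linspace(0.0, 1.0, n)
h = centers[1] - centers[0]
B = np.maximum(0.0, 1.0 - np.abs(x[:, None] - centers[None, :]) / h)
Dk = np.diff(np.eye(n), n=k, axis=0)
P = Dk.T @ Dk

def cv_gcv(lam):
    # hat matrix (20) with identity weights
    H = B @ np.linalg.solve(B.T @ B + lam * P, B.T)
    yhat, hii = H @ y, np.diag(H)
    cv = np.sum(((y - yhat) / (1.0 - hii)) ** 2)          # ordinary CV, (28)
    gcv = np.sum((y - yhat) ** 2) / (m - hii.sum()) ** 2  # generalized CV, (29)
    # rescaled as in (30), interpretable as cross-validation standard deviations
    return float(np.sqrt(cv / m)), float(np.sqrt(m * gcv))

cv1, gcv1 = cv_gcv(1.0)
```

In practice one evaluates cv_gcv over a geometric grid of λ values and keeps the minimizer, as done in the examples of the next section.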
The two terms in AIC(λ) represent the deviance and the trace of the smoother matrix. The latter term, say T(λ) = tr(H(λ)), is of interest on its own, because it can be interpreted as the effective dimension of the fitted curve. T(λ) is useful to compare fits for different numbers of knots and orders of penalties, whereas λ can vary over a large range of values and has no clear intuitive appeal. We will show in an example below that a plot of AIC against T is a useful diagnostic tool.

Table 1
Values of several diagnostics for the motorcycle impact data, for several values of λ

λ       0.001   0.01    0.1     0.2     0.5     1       2       5       10
CV      24.77   24.02   23.52   23.37   23.26   23.38   23.90   25.50   27.49
GCV     25.32   24.93   24.17   23.94   23.74   23.81   24.28   25.87   27.85
AIC     159.6   156.2   149.0   146.7   144.7   145.4   150.6   169.1   194.3
tr(H)   21.2    19.4    15.13   13.6    11.7    10.4    9.2     7.7     6.8
In the case of P-splines, the maximum value that T(λ) can attain is equal to the number of B-splines (when λ = 0). The actual maximum depends on the number and the distribution of the data points. The minimum value of T(λ) occurs when λ goes to infinity; it is equal to the order of the difference penalty. This agrees with the fact that for high values of λ the fit of P-splines approaches a polynomial of degree k − 1.
7. APPLICATIONS TO GENERALIZED LINEAR MODELLING
In this section we apply P-splines to a number of nonparametric modelling situations, with normal as well as nonnormal data.
First we look at a problem with additive errors. Silverman (1985) used motorcycle crash helmet impact data to illustrate smoothing of a scatterplot with splines; the data can be found in Härdle (1990) and (also on diskette) in Hand et al. (1994). The data give head acceleration in units of g, at different times after impact in simulated accidents. We smooth with B-splines of degree 3 and a second-order penalty. The chosen knots divide the domain of x (0–60) into 20 intervals of equal width. When we vary λ on an approximately geometric grid, we get the results in Table 1, where σ̂_0 is computed from GCV(λ) at the optimal value of λ. At the optimal value of λ as determined by GCV, we get the results as plotted in Figure 2.
It is interesting to note that the amount of work to investigate several values of λ is largely independent of the number of data points when using GCV. The system to be solved is

(BᵀB + λD_kᵀD_k) a = Bᵀy.   (31)

The sum of squares is

S = ‖y − Ba‖² = yᵀy − 2aᵀBᵀy + aᵀBᵀBa.   (32)

So BᵀB and Bᵀy have to be computed only once. The hat matrix H is m by m, but for its trace we found an expression in (26) that involves only BᵀB and D_kᵀD_k. So we do not need the original data for cross-validation at any value of λ.
Our second example concerns logistic regression. The model is

ln{p_i/(1 − p_i)} = η_i = ∑_{j=1}^n a_j B_j(x_i).   (33)

The observations are triples (x_i, t_i, y_i), where t_i is the number of individuals under study at dose x_i, and y_i is the number of “successes.” We assume that y_i has a binomial distribution with probability p_i and t_i trials. The expected value of y_i is t_i p_i and the variance is t_i p_i (1 − p_i).
Figure 3 shows data from Ashford and Walker (1972) on the numbers of Trypanosome organisms killed at different doses of a certain poison. The data points and two fitted curves are shown. For the thick line curve λ = 1 and AIC = 13.4; this value of λ is optimal for the chosen B-splines of degree 3 and a penalty of order 2. The thin line curve shows the fit for λ = 10⁸ (AIC = 27.8). With a second-order penalty, this is essentially a logistic fit.
Figure 4 shows curves of AIC(λ) against T(λ) at different values of k, the order of the penalty. We find that k = 3 can give a lower value of AIC (for λ = 5, AIC = 11.8). For k = 4 we find that a very high value of λ is allowed; then AIC = 11.4, hardly different from the lowest possible value (11.1). A large value of λ with a fourth-order penalty means that effectively the fitted curve for η is a third-order polynomial. The limit of the fit with P-splines thus indicates a cubic logistic fit as a good parametric model. Here we have seen an application where a fourth-order penalty is useful.
Our third example is a time series of counts y_i, which we will model with a Poisson distribution with smoothly changing expectation:

ln μ_i = η_i = ∑_{j=1}^n a_j B_j(x_i).   (34)

In this special case the x_i are equidistant, but this is immaterial. Figure 5 shows the numbers of disasters in British coal mines for the years 1850–1962, as presented in Diggle and Marron (1988). The counts are drawn as narrow vertical bars; the line is the fitted trend. The number of intervals is 20, the B-splines have degree 3 and the order of the penalty is 2. An optimal value of λ was searched for on the approximately geometric grid 1, 2, 5, 10 and so on. The minimum of AIC (126.0) was found for λ = 1,000.

Fig. 2. Motorcycle crash helmet impact data: optimal fit with B-splines of third degree, a second-order penalty and λ = 0.5.

Fig. 3. Nonparametric logistic regression of Trypanosome data: P-splines of degree 3 with 13 knots, difference penalty of order 2; λ = 1 and AIC = 13.4 (thick line); the thin line is effectively the logistic fit (λ = 10⁸ and AIC = 27.8).

Fig. 4. AIC(λ) versus T(λ), the effective dimension, for several orders of the penalty k.
The raw data of the coal mining accidents presumably were the dates on which they occurred. So the data we use here are in fact a histogram with one-year-wide bins. With events on a time scale it seems natural to smooth counts over intervals, but the same idea applies to any form of histogram (bin counts) or density smoothing. This was already noted by Diggle and Marron (1988). In the next section we take a detailed look at density smoothing with P-splines.
8. DENSITY SMOOTHING
In the preceding section we noted that a time series of counts is just a histogram on the time axis. Any other histogram might be smoothed in the same way. However, it is our experience that this idea is hard to swallow for many colleagues. They see the construction of a frequency histogram as an unallowable discretization of the data and as a prelude to disaster. Perhaps this feeling stems from the well-known fact that maximum likelihood estimation of histograms leads to pathological results, namely, delta functions at the observations (Scott, 1992). However, if we optimize a penalized likelihood, we arrive at stable and very useful results, as we will show below.

Fig. 5. Numbers of severe accidents in British coal mines: number per year shown as vertical lines; fitted trend of the expectation of the Poisson distribution; B-splines of degree 3; penalty of order 3; 20 intervals between 1850 and 1970; λ = 1,000 and AIC = 126.0.
Let y_i, i = 1, …, m, be a histogram. Let the origin of x be chosen in such a way that the midpoints of the bins are x_i = ih; thus y_i is the number of raw observations with x_i − h/2 ≤ x < x_i + h/2. If p_i is the probability of finding a raw observation in cell i, then the likelihood of the given histogram is proportional to the multinomial likelihood ∏_{i=1}^m p_i^{y_i}. Equivalently (see Bishop, Fienberg and Holland, 1975, Chapter 13), one can work with the likelihood of m Poisson distributions with expectations μ_i = p_i y₊, where y₊ = ∑_{i=1}^m y_i.
To smooth the histogram, we again use a generalized linear model with the canonical log link (which guarantees positive μ):

ln μ_i = η_i = ∑_{j=1}^n a_j B_j(x_i)   (35)

and construct the penalized log-likelihood

L = ∑_{i=1}^m y_i ln μ_i − ∑_{i=1}^m μ_i − (λ/2) ∑_{j=k+1}^n (Δ^k a_j)²,   (36)

with n a suitable (i.e., relatively large) number of knots for the B-splines. The penalized likelihood equations follow from the maximization of L:

∑_{i=1}^m (y_i − μ_i) B_j(x_i) = λ ∑_l d_jl a_l,   (37)

with d_jl the elements of D_kᵀD_k.
These equations are solved with iteratively reweighted regression, as described in Section 4.
Now we let h, the width of the cells of the histogram, shrink to a very small value. If the raw data are given to infinite precision, we will eventually arrive at a situation in which each cell of the histogram has at most one observation. In other words, we have a very large number (m) of cells, of which y₊ are 1 and all others 0. Let I be the set of indices of cells for which y_i = 1. Then

∑_{i=1}^m y_i B_j(x_i) = ∑_{i∈I} B_j(x_i).   (38)
If the raw observations are u_t for t = 1, …, r, with r = y₊, then we can write

∑_{i∈I} B_j(x_i) = ∑_{t=1}^r B_j(u_t) = B₊j,   (39)

and the penalized likelihood equations in (37) change to

B₊j − ∑_{i=1}^m μ_i B_j(x_i) = λ ∑_l d_jl a_l.   (40)
For any j, the first term on the left-hand side of (40) can be interpreted as the “empirical sum” of B-spline j, while the second term on the left can be interpreted as the “expected sum” of that B-spline for the fitted density. When λ = 0, these terms have to be equal to each other for each j.

Note that the second term on the left-hand side of (40) is in fact a numerical approximation of an integral:

∑_{i=1}^m μ_i B_j(x_i) / y₊ ≈ ∫_{x_min}^{x_max} B_j(x) exp{∑_{l=1}^n a_l B_l(x)} dx.   (41)
Table 2
The value of AIC at several values of λ for the Old Faithful density estimate

λ     0.001   0.01    0.02    0.05    0.1     0.2     0.5     1       10
AIC   50.79   48.21   47.67   47.37   47.70   48.61   50.59   52.81   65.66
The smaller h (the larger m), the better the approximation. In other words: the discretization is only needed to solve an integral numerically for which, as far as we know, no closed form solution exists. For practical purposes the simple sum is sufficient, but a more sophisticated integration scheme is possible. Note that the sums to calculate B₊j involve all raw observations, but in fact at each of these only q + 1 terms B_j(u_t) add to their corresponding B₊j.
The necessary computations can be done in terms of the sufficient statistics B₊j: we have seen their role in the penalized likelihood equations above. But also the deviance, and thus AIC, can be computed directly:

dev(y; a) = 2 ∑_{i=1}^m y_i ln(y_i/μ_i)
          = 2 ∑_{i=1}^m y_i ln y_i − 2 ∑_{i=1}^m y_i ∑_{j=1}^n a_j B_j(x_i)
          = 2 ∑_{i=1}^m y_i ln y_i − 2 ∑_{j=1}^n a_j B₊j.   (42)
In the extreme case, when the yi are either 0 or1, the term
∑yi lnyi vanishes. In any case it is
independent of the fitted density.The density smoother with
P-splines is very
attractive: the estimated density is positive andcontinuous, it
can be described relatively parsimo-niously in terms of the
coefficients of the B-splines,and it is a proper density. Moments
are conserved,as follows from (19). This implies that with
third-degree B-splines and a third-order penalty, meanand variance
of the estimated distribution are equalto those of the raw data,
whatever the amount ofsmoothing; the limit for high λ is a normal
distri-bution.
The P-spline density smoother is not troubled by boundary effects, as for instance kernel smoothers are. Marron and Ruppert (1994) give examples and a rather complicated remedy, based on transformations. With P-splines no special precautions are necessary, but it is important to specify the domain of the data correctly. We will present an example below.
We now take as a first example a data set from Silverman (1986). The data are durations of 107 eruptions of the Old Faithful geyser. Third-degree B-splines were used, with a third-order penalty. The domain from 0 to 6 was divided into 20 intervals to determine the knots. In Figure 6 two fits are shown, for λ = 0.001 and for λ = 0.05. The latter value gives the minimum of AIC, as Table 2 shows. We see that of the two clearly separated humps, the right one seems to be a mixture of two peaks.
The second example also comes from Silverman (1986). The data are lengths of spells of psychiatric treatments in a suicide study. Figure 7 shows the raw data and the estimated density when the domain is chosen from 0 to 1,000. Third-degree B-splines were used, with a second-order penalty. A fairly large amount of smoothing (λ = 100) is indicated by AIC; the fitted density is nearly exponential. In fact, if one considers only the domain from 0 to 500, then λ can become arbitrarily large and a pure exponential density results. However, if we choose the domain from −200 to 800 we get a quite different fit, as Figure 8 shows. By extending the domain we force the estimated density also to cover negative values of x, where there are no data (which means zero counts). Consequently, it has to drop toward zero, missing the peak for small positive values. The optimal value of λ now is 0.01 and a much more wiggly fit results, with an appreciably higher value of AIC. This nicely illustrates how, with a proper choice of the domain, the P-spline density smoother can be free from the boundary effects that give so much trouble with kernel smoothers.
9. DISCUSSION
We believe that P-splines come near to being theideal smoother.
With their grounding in classic re-gression methods and generalized
linear models,their properties are easy to verify and
understand.Moments of the data are conserved and the
limitingbehavior with a strong penalty is well defined andgives a
connection to polynomial models. Bound-ary effects do not occur if
the domain of the data isproperly specified.
The necessary computations, including cross-validation, are comparable in size to those for a medium-sized regression problem. The regression context makes it natural to extend P-splines to semiparametric models, in which additional explanatory variables occur. The computed fit is described compactly by the coefficients of the B-splines.
Fig. 6. Density smoothing of durations of Old Faithful geyser eruptions: density histogram and fitted densities; thin line, third-order penalty with λ = 0.001, AIC = 84.05; thick line, optimal λ = 0.05, with AIC = 80.17; B-splines of degree 3 with 20 intervals on the domain from 1 to 6.
Fig. 7. Density smoothing of suicide data: positive domain (0–1,000); B-splines of degree 3; penalty of order 2; 20 intervals; λ = 100; AIC = 69.9.
Fig. 8. Density smoothing of suicide data: the domain includes negative values (−200 to 800); B-splines of degree 3; penalty of order 2; 20 intervals; λ = 0.01; AIC = 83.6.
P-splines can be very useful in (generalized) additive models. For each dimension a B-spline basis and a penalty are introduced. With n knots in each basis and d dimensions, a system of nd-by-nd (weighted) regression equations results. Backfitting, the iterative smoothing for each separate dimension, is eliminated. We have reported on this application elsewhere (Marx and Eilers, 1994, 1996).
Penalized likelihood is a subject with a growing popularity. We already mentioned the work of
O’Sullivan. In the book by Green and Silverman (1994), many applications and references can be found. Almost exclusively, penalties are defined in terms of the square of the second derivative of the fitted curve. Generalizations to penalties on higher derivatives have been mentioned in the literature, but to our knowledge, practical applications are very rare. The shift from the continuous penalty to the discrete penalty in terms of the coefficients of the B-splines is not spectacular in itself. But we have seen that it leads to very useful results, while giving a mechanical way to work with higher-order penalties. The modelling of binomial dose–response in Section 7 showed the usefulness of higher-order penalties.
A remarkable property of AIC is that it is easier to compute for certain nonnormal distributions, like the Poisson and binomial, than for normal distributions. This is so because for these distributions the relationship between mean and variance is known. We should warn the reader that AIC may lead to undersmoothing when the data are overdispersed, since the assumed variance of the data may then be too low. We are presently investigating smoothing with P-splines and overdispersed distributions like the negative binomial and the beta-binomial. Ideas of quasilikelihood will also be incorporated.
We have paid extra attention to density smoothing, because we feel that in this area the advantages of P-splines really shine. Traditionally, kernel smoothers have been popular in this field, but they inflate the variance and have trouble with the boundaries of data domains; their computation is expensive, cross-validation even more so, and one cannot report an estimated density in a compact way.
Possibly kernel smoothers still have advantages in two or more dimensions, but it seems that P-splines can also be used for two-dimensional smoothing with Kronecker products of B-splines. With a grid of, say, 10 by 10 knots and a third-order penalty, a system of 130 equations results, with a half bandwidth of approximately 30. This can easily be handled on a personal computer. The automatic construction of the equations will be more difficult than in one dimension. First experiments with this approach look promising; we will report on them in due time.
We have not touched on many obvious and interesting extensions to P-splines. Robustness can be obtained with any nonlinear reweighting scheme that can be used with regression models. Circular domains can be handled by wrapping the B-splines and the penalty around the origin. The penalty can be extended with weights, to give a fit with nonconstant stiffness. In this way it will be easy to specify a varying stiffness, but it is quite another matter to estimate the weights from the data.
Finally, we would like to remark that P-splines form a bridge between the purely discrete smoothing problem, as set forth originally by Whittaker (1923), and continuous smoothing. B-splines of degree zero are constant on an interval between two knots, and zero elsewhere; they have no overlap. Thus the fitted function gives for each interval the value of the coefficient of the corresponding B-spline.
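The discrete end of this bridge is easy to exhibit. A minimal Python sketch of Whittaker's graduation (the series length, noise level and λ are arbitrary illustrative choices): the smoothed series z minimizes ∑(y_i − z_i)² + λ∑(Δ²z_i)², so it solves (I + λD_k^T D_k) z = y with D_k the difference matrix of order k = 2.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 100
t = np.linspace(0.0, 1.0, T)
y = np.sin(2 * np.pi * t) + rng.normal(0.0, 0.3, T)

lam = 10.0
D = np.diff(np.eye(T), 2, axis=0)          # second-order difference matrix
z = np.linalg.solve(np.eye(T) + lam * D.T @ D, y)
```

Because D annihilates constant and linear sequences, the smoother reproduces them exactly; in particular the mean of z equals the mean of y, while the summed squared second differences (the roughness) are reduced.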
APPENDIX: COMPUTATIONAL DETAILS
Here we look at the computation of B-splinesand derivatives of
the penalty. We use S-PLUS andMATLAB as example languages because
of theirwidespread use. Also we give some impressions ofthe speed
of the computations.
In the linear case we have to solve the system ofequations
BTB+ λDTkDkâ = BTy(43)
and to compute y−Bâ2 and trBTB+λDTD−1 ·BTB. We need a function
to compute B, the B-spline base matrix. In S-PLUS, this is a simple
mat-ter, as there is a built-in function spline.des() thatcomputes
(derivatives) of B-splines. We only have toconstruct the sequence
of knots. Let us assume thatxl is the left of the x-domain, xr the
right, and thatthere are ndx intervals on that domain. To computeB
for a given vector x, based on B-splines of degreebdeg, we can use
the following function:
[The S-PLUS listings for the function bspline and for the computation of yhat are not reproduced here.]
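Since the S-PLUS listings are not reproduced here, the computation of (43) and its diagnostics can be sketched in Python instead. The basis builder, the knot scheme, the simulated test data and the value of λ below are our own illustrative assumptions, not the authors' code.

```python
import numpy as np

def bspline_basis(x, xl, xr, ndx, bdeg):
    """Equally spaced B-spline basis matrix B (Cox-de Boor recursion)."""
    dx = (xr - xl) / ndx
    knots = xl + dx * np.arange(-bdeg, ndx + bdeg + 1)
    B = ((x[:, None] >= knots[None, :-1]) &
         (x[:, None] < knots[None, 1:])).astype(float)
    for d in range(1, bdeg + 1):
        B = ((x[:, None] - knots[None, :-(d + 1)]) * B[:, :-1]
             + (knots[None, (d + 1):] - x[:, None]) * B[:, 1:]) / (d * dx)
    return B

# solve (B'B + lam D_k'D_k) a = B'y, as in (43), and get the diagnostics
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0.0, 1.0, 100))
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, 100)

B = bspline_basis(x, 0.0, 1.0, ndx=20, bdeg=3)
n = B.shape[1]                                 # n = ndx + bdeg = 23
Dk = np.diff(np.eye(n), 2, axis=0)             # second-order differences
lam = 0.1
lhs = B.T @ B + lam * Dk.T @ Dk
a = np.linalg.solve(lhs, B.T @ y)
yhat = B @ a
rss = np.sum((y - yhat) ** 2)
edf = np.trace(np.linalg.solve(lhs, B.T @ B))  # tr{(B'B + lam D'D)^{-1} B'B}
```

The trace term edf is the effective dimension used in AIC and cross-validation; it runs from the penalty order (here 2) for very large λ up to n for λ = 0.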
REFERENCES
Ashford, R. and Walker, P. J. (1972). Quantal response analysis for a mixture of populations. Biometrics 28 981–988.
Bishop, Y. M. M., Fienberg, S. E. and Holland, P. W. (1975). Discrete Multivariate Analysis: Theory and Practice. MIT Press.
Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. J. Amer. Statist. Assoc. 74 829–836.
Cox, M. G. (1981). Practical spline approximation. In Topics in Numerical Analysis (P. R. Turner, ed.). Springer, Berlin.
de Boor, C. (1977). Package for calculating with B-splines. SIAM J. Numer. Anal. 14 441–472.
de Boor, C. (1978). A Practical Guide to Splines. Springer, Berlin.
Dierckx, P. (1993). Curve and Surface Fitting with Splines. Clarendon, Oxford.
Diggle, P. and Marron, J. S. (1988). Equivalence of smoothing parameter selectors in density and intensity estimation. J. Amer. Statist. Assoc. 83 793–800.
Eilers, P. H. C. (1990). Smoothing and interpolation with generalized linear models. Quaderni di Statistica e Matematica Applicata alle Scienze Economico-Sociali 12 21–32.
Eilers, P. H. C. (1991a). Penalized regression in action: estimating pollution roses from daily averages. Environmetrics 2 25–48.
Eilers, P. H. C. (1991b). Nonparametric density estimation with grouped observations. Statist. Neerlandica 45 255–270.
Eilers, P. H. C. (1995). Indirect observations, composite link models and penalized likelihood. In Statistical Modelling (G. U. H. Seeber et al., eds.). Springer, New York.
Eilers, P. H. C. and Marx, B. D. (1992). Generalized linear models with P-splines. In Advances in GLIM and Statistical Modelling (L. Fahrmeir et al., eds.). Springer, New York.
Eubank, R. L. (1988). Spline Smoothing and Nonparametric Regression. Dekker, New York.
Friedman, J. and Silverman, B. W. (1989). Flexible parsimonious smoothing and additive modeling (with discussion). Technometrics 31 3–39.
Green, P. J. and Silverman, B. W. (1994). Nonparametric Regression and Generalized Linear Models. Chapman and Hall, London.
Green, P. J. and Yandell, B. S. (1985). Semi-parametric generalized linear models. In Generalized Linear Models (B. Gilchrist et al., eds.). Springer, New York.
Hand, D. J., Daly, F., Lunn, A. D., McConway, K. J. and Ostrowski, E. (1994). A Handbook of Small Data Sets. Chapman and Hall, London.
Härdle, W. (1990). Applied Nonparametric Regression. Cambridge Univ. Press.
Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. Chapman and Hall, London.
Kooperberg, C. and Stone, C. J. (1991). A study of logspline density estimation. Comput. Statist. Data Anal. 12 327–347.
Kooperberg, C. and Stone, C. J. (1992). Logspline density estimation for censored data. J. Comput. Graph. Statist. 1 301–328.
Marron, J. S. and Ruppert, D. (1994). Transformations to reduce boundary bias in kernel density estimation. J. Roy. Statist. Soc. Ser. B 56 653–671.
Marx, B. D. and Eilers, P. H. C. (1994). Direct generalized additive modelling with penalized likelihood. Paper presented at the 9th Workshop on Statistical Modelling, Exeter, 1994.
Marx, B. D. and Eilers, P. H. C. (1996). Direct generalized additive modelling with penalized likelihood. Unpublished manuscript.
McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, 2nd ed. Chapman and Hall, London.
O'Sullivan, F. (1986). A statistical perspective on ill-posed inverse problems (with discussion). Statist. Sci. 1 505–527.
O'Sullivan, F. (1988). Fast computation of fully automated log-density and log-hazard estimators. SIAM J. Sci. Statist. Comput. 9 363–379.
Reinsch, C. (1967). Smoothing by spline functions. Numer. Math. 10 177–183.
Sakamoto, Y., Ishiguro, M. and Kitagawa, G. (1986). Akaike Information Criterion Statistics. Reidel, Dordrecht.
Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley, New York.
Silverman, B. W. (1985). Some aspects of the spline smoothing approach to nonparametric regression curve fitting (with discussion). J. Roy. Statist. Soc. Ser. B 47 1–52.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.
Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia.
Wand, M. P. and Jones, M. C. (1993). Kernel Smoothing. Chapman and Hall, London.
Whittaker, E. T. (1923). On a new method of graduation. Proc. Edinburgh Math. Soc. 41 63–75.
Comment
S-T. Chiu
Authors Paul Eilers and Brian Marx provide a very interesting approach to nonparametric curve fitting. They give a brief but very concise review of B-splines. I also enjoyed reading the part where the authors applied their procedure to some examples. As shown in the paper, the approach has several merits which deserve to be studied in more detail.
Similar to any nonparametric smoother, the proposed procedure needs a smoothing parameter λ to control the smoothness of the fitted curve. My comments mainly concern the selection of the smoothing parameter.
S-T. Chiu is with the Department of Statistics, Colorado State University, Fort Collins, Colorado 80523-0001.
It is well known that the classical selectors such as AIC, GCV, Mallows's Cp and so on do not give a satisfactory result. For the regression case, more details about the defects can be found in Rice (1984) and Chiu (1991a). Scott and Terrell (1987) and Chiu (1991b) discuss the case of density estimation. The classical selectors have a large sample variation and a tendency to select a small smoothing parameter, thus producing a very rough curve estimate. It is natural to expect that they have a similar problem when applied to selecting the smoothing parameter for P-splines.
Several procedures have been suggested to remedy the defects of the classical procedures. Chiu (1996) provides a survey of some of these newer selectors for density estimation. For the regression case, some procedures are suggested in Chiu (1991a), Hall and Johnstone (1992) and Hall, Marron and Park (1992).
In the following, I provide a brief review to explain the defects of, and some remedies for, the classical selectors for the kernel regression estimate. Let us assume the simplest model of a circular design with equally spaced design points: y_t = µ(x_t) + ε_t, where the ε_t are i.i.d. noise. For the kernel estimate µ̂_β with bandwidth β, we often use the mean sum of squared errors
\[ R(\beta) = E\Big[\sum_t \{\hat\mu_\beta(x_t) - \mu(x_t)\}^2\Big] \tag{1} \]
to measure the closeness between µ̂(x) and µ(x). The goal of bandwidth selection is to select the optimal bandwidth, which minimizes R(β). Since in practice µ is unknown, we have to estimate R(β) and use the minimizer of the estimated R(β) as an estimate of the optimal bandwidth. For example, Mallows's Cp has the form
\[ \hat R(\beta) = \mathrm{RSS}(\beta) - T\sigma^2 + 2\sigma^2 w(0)/\beta. \tag{2} \]
Here w(x) is the kernel and σ² is the error variance. Other classical procedures such as AIC and GCV have a similar form and were shown to be asymptotically equivalent in Rice (1984). All of these procedures rely on the residual sum of squares RSS(β).
Mallows (1973) proposed the procedure based on the observation that
\[ R(\beta) = E\{\mathrm{RSS}(\beta)\} - T\sigma^2 + 2\sigma^2 w(0)/\beta. \]
As we will explain later, the main problem here is that RSS(β) is not a good estimate of its expected value.
By using the Fourier transform, (1) and (2) can be written, respectively, as
\[ R(\beta) = 4\pi \sum_{j=1}^{N} I_S(\lambda_j)\{1 - W_\beta(\lambda_j)\}^2 + \sigma^2 \sum_{j=1}^{N} W_\beta(\lambda_j)^2 + \sigma^2 \tag{3} \]
and
\[ \hat R(\beta) = 4\pi \sum_{j=1}^{N} \Big\{I_Y(\lambda_j) - \frac{\sigma^2}{2\pi}\Big\}\{1 - W_\beta(\lambda_j)\}^2 + \sigma^2 \sum_{j=1}^{N} W_\beta(\lambda_j)^2 + \sigma^2, \tag{4} \]
where I_Y and I_S are the periodograms of Y_t and of the signal S_t = µ(t/T), respectively, and λ_j = 2πj/T, j = 1, …, N = T/2. Also, W_β(λ) is the transfer function of w(t/(βT))/(βT).
Comparing (3) and (4), we see that R̂ attempts to use I_Y(λ) − σ²/(2π) to estimate I_S(λ). The difficulty is that at high frequencies I_Y is dominated by the noise and thus does not give a good estimate of I_S.
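This noise domination is easy to see numerically. A small Python sketch (the signal, noise level and sample size are arbitrary illustrative choices): for a signal made of a few low Fourier frequencies plus white noise, the periodogram of Y at high frequencies fluctuates around the noise level σ²/(2π) and carries essentially no information about I_S.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 4096
t = np.arange(T)
# signal concentrated at the two lowest Fourier frequencies
signal = np.sin(2 * np.pi * t / T) + 0.5 * np.cos(4 * np.pi * t / T)
y = signal + rng.normal(0.0, 1.0, T)          # sigma = 1

# periodogram I_Y(lambda_j) = |sum_t y_t exp(-i lambda_j t)|^2 / (2 pi T)
IY = np.abs(np.fft.rfft(y)) ** 2 / (2 * np.pi * T)
low = IY[1:3]                # frequencies carrying the signal
high = IY[T // 4: T // 2]    # upper half: essentially pure noise
```

The mean of the high-frequency ordinates is close to σ²/(2π) ≈ 0.159, while the low-frequency ordinates are larger by orders of magnitude because they are dominated by I_S.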
Chiu (1991a) suggested truncating the high-frequency portion when we estimate R(β):
\[ \tilde R(\beta) = 4\pi \sum_{j=1}^{J} \Big\{I_Y(\lambda_j) - \frac{\sigma^2}{2\pi}\Big\}\{1 - W_\beta(\lambda_j)\}^2 + \sigma^2 \sum_{j=1}^{N} W_\beta(\lambda_j)^2 + \sigma^2. \tag{5} \]
Here J is selected in such a way that there is no significant I_S beyond frequency λ_J. The selector R̃(β) has a much better performance than the classical ones. Hall, Marron and Park (1992) proposed another procedure which downweights the contribution from the high-frequency part.
It is clear that the basis functions of kernel regression are the sinusoid waves. The primary reason for the success of criterion (5) is that most of the information about µ is concentrated at low frequencies. In other words, we need only quite a few basis functions to approximate the true curve well.
However, since each basis function of the B-spline is very local to a certain interval, we cannot use just a few basis functions to approximate the curve over the whole region. In my opinion, this could be a big obstacle to the understanding and improvement of the classical smoothing parameter selectors.
REFERENCES
Chiu, S.-T. (1991a). Some stabilized bandwidth selectors for nonparametric regression. Ann. Statist. 19 1528–1546.
Chiu, S.-T. (1991b). Bandwidth selection for kernel density estimation. Ann. Statist. 19 1883–1905.
Chiu, S.-T. (1996). A comparative review of bandwidth selection for kernel density estimation. Statist. Sinica 6 129–145.
Hall, P. and Johnstone, I. (1992). Empirical functionals and efficient smoothing parameter selection. J. Roy. Statist. Soc. Ser. B 54 519–521.
Hall, P., Marron, J. S. and Park, B. U. (1992). Smoothed cross-validation. Probab. Theory Related Fields 92 1–20.
Mallows, C. (1973). Some comments on Cp. Technometrics 15 661–675.
Rice, J. (1984). Bandwidth choice for nonparametric regression. Ann. Statist. 12 1215–1230.
Scott, D. W. and Terrell, G. R. (1987). Biased and unbiased cross-validation in density estimation. J. Amer. Statist. Assoc. 82 1131–1146.
Comment
Douglas Nychka and David Cummins
One strength of the authors' presentation is the simple ridge regression formulas that result for the estimator. We would like to point out a decomposition using a different set of basis functions that helps to interpret this smoother. This alternative basis, derived from B-splines, facilitates the computation of the GCV function and of confidence bands for the estimated curve.
To simplify this discussion assume that W = I, so that the hat matrix is
\[ H = B(B^T B + \lambda D^T D)^{-1} B^T = G(I + \lambda \Gamma)^{-1} G^T, \]
where G = BQ_2^{-1/2}U, Q_2 = B^T B, Γ = diag(γ_ν), and U is an orthogonal matrix such that
\[ Q_2^{-1/2} D^T D\, Q_2^{-1/2} = U \Gamma U^T. \]
The columns of G can be identified with a new set of functions known as the Demmler–Reinsch (DR) basis. Specifically, these are piecewise polynomial functions ψ_ν such that the elements of G satisfy ψ_ν(x_i) = G_{iν}. Besides having useful orthogonality properties, the DR basis can be ordered by frequency: basis functions with larger values of γ_ν exhibit more oscillations (in fact, ν − 1 zero crossings). Figure 1(a) plots several of the basis functions for m = 133 equally spaced x's and 20 equally spaced interior knots. Figure 1(b) illustrates the expected polynomial increase in the size of γ_ν as a function of ν.
The Demmler–Reinsch basis provides an informative interpretation of the spline estimate. Let f̂ denote the P-spline and let α = G^T y denote the least squares coefficients from regressing y on the DR basis functions:
\[ \hat f(x_i) = (Hy)_i = \{G(I + \lambda \Gamma)^{-1} G^T y\}_i = \sum_{\nu=1}^{m} \psi_\nu(x_i)\,\frac{\alpha_\nu}{1 + \lambda\gamma_\nu}. \]
Douglas Nychka is Professor of Statistics and David Cummins is with the Department of Statistics, North Carolina State University, Raleigh, North Carolina 27695-8203.
Note that the smoother is just a linear combination of the DR basis functions using coefficients that are downweighted (or tapered) by the factor 1/(1 + λγ_ν) from the least squares estimates. Because of the relationship between γ_ν and ψ_ν (see Figure 1), the basis functions that represent higher-frequency structure will have coefficients that are more severely downweighted. In this way the smoother is a low-pass filter, tending to preserve low-frequency structure and downweight higher-frequency terms. The residual sum of squares and the trace of H can be computed rapidly (order n) using the DR representation. Thus the GCV function can also be evaluated in order n operations for a given value of λ.
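The identity between the two forms of H is easy to check numerically. A minimal Python sketch (the design matrix below is a random stand-in for a B-spline basis, and the sizes and λ are arbitrary choices; the identity holds for any full-column-rank B):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, lam = 40, 12, 5.0
B = rng.normal(size=(m, n))                  # stand-in for a B-spline basis
D = np.diff(np.eye(n), 2, axis=0)            # second-order differences

# direct hat matrix H = B (B'B + lam D'D)^{-1} B'
H = B @ np.linalg.solve(B.T @ B + lam * D.T @ D, B.T)

# Demmler-Reinsch form: Q2^{-1/2} D'D Q2^{-1/2} = U Gamma U'
Q2 = B.T @ B
w, V = np.linalg.eigh(Q2)
Q2ih = V @ np.diag(w ** -0.5) @ V.T          # symmetric inverse square root
gamma, U = np.linalg.eigh(Q2ih @ D.T @ D @ Q2ih)
G = B @ Q2ih @ U
H_dr = G @ np.diag(1.0 / (1.0 + lam * gamma)) @ G.T
```

Both constructions give the same hat matrix, the columns of G are orthonormal (G^T G = I), and tr(H) = ∑_ν 1/(1 + λγ_ν), which is why RSS, tr(H) and hence GCV are cheap to evaluate once the γ_ν are in hand.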
Another application of the DR form is in computing a confidence band. Consider a set of candidate functions that contains the true function with the correct level of confidence. The confidence band is then the envelope implied by considering all functions in this set. For example, let f̂ denote the function estimate and for C_1, C_2 > 0 let
\[ \mathcal{B} = \Big\{h \colon h \text{ is a B-spline with coefficients } b,\ \sum_{i=1}^{n} \{\hat f(x_i) - h(x_i)\}^2 \le C_1 \text{ and } b^T D^T D\, b \le C_2 \Big\}.
\]
Fig. 1. Illustration of several Demmler–Reinsch basis functions and the associated eigenvalues for 20 equally spaced knots, 133 equally spaced observations and second divided differences (k = 2): the upper plot (a) shows ψ_ν for ν = 3, 5, 10, 15; the numerals identify the order of these basis functions and in the second plot (b) identify the eigenvalues for these functions.
The constants C_1 and C_2 are determined so that P(f ∈ B) equals the desired confidence level. The upper and lower boundaries of the confidence band are then
\[ U(x) = \max\{h(x)\colon h \in \mathcal{B}\} \]
and
\[ L(x) = \min\{h(x)\colon h \in \mathcal{B}\}. \]
In practice we work with the coefficients, and thus the computation of U and L at each x is a minimization problem with two quadratic constraints. Using the DR basis reduces both constraints to quadratic forms with diagonal matrices, and thus both are computable in order n operations. Moreover, this strategy does not depend on the roughness penalty being divided differences but will work for any nonnegative definite matrix used as a penalty (e.g., thin-plate splines). Currently we are investigating the choice of C_1 and C_2 based on the GCV estimate of f.
ACKNOWLEDGMENT
This work was supported by NSF Grant DMS-92-17866.
Comment
Chong Gu
I would like to begin by congratulating the authors Eilers and Marx for a clear exposition of an interesting variant of penalized regression splines. My comments center around three questions: Are P-splines really better? What does optimal smoothing stand for? And what does the future hold for nonparametric function estimation?
ARE P-SPLINES REALLY BETTER?
P-splines can certainly be as useful as other variants of penalized regression splines, but I am not sure that they are really advantageous over the others. It is true that with huge sample sizes one may choose n much smaller than m to save on computation without sacrificing performance, but other variants of regression splines also share the same advantage. The mechanical handling of the difference penalty is certainly very interesting computationally, but as far as the end users are concerned, I do not see why the discrete penalties are necessarily advantageous over the continuous ones. Higher-order derivative penalties are certainly as feasible as discrete penalties computationally, albeit more difficult to implement, but the difference is irrelevant to the end users whose main interest is the interface.
The users may be more interested in what the program computes rather than how it computes it, however, and in this respect I only see P-splines lose out to penalized regression splines with the usual derivative penalties that everyone can understand. Being told that B-splines provide a good basis for function approximation, the users may simply ignore whatever other properties B-splines have and still have a clear picture of what they are getting from derivative penalties or, for that matter, from Whittaker's discrete penalties, which use the differences of adjacent function values. With the P-splines, however, this intuition is unfortunately taken away from the users, and even with a thorough knowledge of all the properties of B-splines, I am not sure one can easily perceive what the penalty is really doing, other than that it is reducing the effective dimension in some not so easily comprehensible way.
Chong Gu is Assistant Professor, Department of Statistics, Purdue University, West Lafayette, Indiana 47907.
Penalized smoothers with quadratic penalties are known to be equivalent to Bayes estimates with Gaussian priors. When Q = D_k^T D_k is of full rank, the corresponding prior for the B-spline coefficients a has mean 0 and covariance proportional to Q^{-1}. When Q is rank-deficient, the prior has a "fixed effect" component diffuse in the null space of Q and a "random effect" component with mean 0 and covariance proportional to Q^+, the Moore–Penrose inverse of Q. From this perspective, P-splines differ from other variants of penalized regression splines only in the specification of Q.
WHAT DOES OPTIMAL SMOOTHING STAND FOR?
One probably can never overstate the importance of smoothing parameter selection for any successful practical application of any smoothing method. AIC and cross-validation are among the most accepted (and successful) working criteria for model selection, yet their optimalities are established, theoretically or empirically, only for specific problem settings under appropriate conditions. Naive adaptations of these criteria in new problem settings do not necessarily deliver fits that are nearly optimal.
Specifically, I am somewhat worried about the "optimality" of the naive adaptations of these criteria proclaimed in Section 6. First, it is not clear in what sense these criteria are "optimal" in the problem settings to which they are applied; second, there is no empirical (or theoretical) evidence illustrating the presumed "optimality." AIC or cross-validation may deliver nearly optimal fits, but they surely do not by themselves define the notion of optimality.
My worries stem from previous empirical experiments with smoothing parameter selection by myself and by others, especially in non-Gaussian regression problems (commonly referred to as generalized linear models). Using the Kullback–Leibler discrepancy or its symmetrized version to define optimality, it has been found that a naive adaptation of GCV in non-Gaussian regression, which appears similar to what the authors suggest in Section 7, may return anything but nearly optimal fits. See, for example, Cox and Chang (1990), Gu (1992) and Xiang and Wahba (1996). For the density estimation problem in Section 8, I could not find the definition of the H matrix to understand the AIC proposed, but whatever it is, it should be subject to the same scrutiny before being recommended as "optimal."
In ordinary Gaussian regression, the optimality of GCV is well established in the literature. For the AIC score presented in (27), however, I would like some empirical evidence to be convinced of its optimality. The skepticism is partly due to some empirical evidence suggesting that the trace of H may not be a consistent characterization of the effective dimension of the model. Such evidence can be found in Gu (1996), available online at http://www.stat.lsa.umich.edu/~chong/ps/modl.ps.
WHAT DOES THE FUTURE HOLD FOR FUNCTION ESTIMATION?
In response to Statistical Science's desideratum of speculations regarding future research directions, I would like to take this opportunity to offer some of my thoughts.
It has long been said that all smoothing methods perform similarly in one dimension, provided that the smoothing parameter selection is done properly, yet time and again new and not so new methods keep being invented. The real challenge, however, seems to lie in multivariate problems. Amid the curse of dimensionality and the potential structures associated with multivariate problems, the choice of methods can make a real difference in multiple dimensions: in the ease of computation and smoothing parameter selection, in the convenience of incorporating structures, and so on. Among the methods with the most potential are the adaptive regression splines developed by Friedman, Stone and co-workers, and the smoothing splines developed by the Wisconsin spline school led by Wahba. The penalized regression spline approach, however, seems somewhat handicapped by the lack of an effective basis, say in dimensions beyond two or three.
More challenging still, an important line of research that has been largely neglected is inference. What one usually gets from the function estimation literature are point estimates, possibly with asymptotic convergence rates, and intuitive smoothing parameter selectors not always accompanied by justifications. Besides a few entries based on the Bayes model of smoothing splines by Wahba (1983), Cox, Koh, Wahba and Yandell (1988), Barry (1993) and some follow-ups, practical procedures that offer interval estimates, tests of hypotheses, and so on are largely missing in the literature. To guard against the danger of overinterpreting data by the use of nonparametric methods, such inferential tools should be a top priority in future research. Under a Bayes model, where the target function is treated as a realization of a stochastic process, the development may proceed within the conventional inferential framework. Under the traditional setting, where the target function is considered fixed, however, one may have to turn his back on the conventional Neyman–Pearson thinking before he can call any useful inferential tools non-ad-hoc.
REFERENCES
Barry, D. (1993). Testing for additivity of a regression function. Ann. Statist. 21 235–254.
Cox, D. D. and Chang, Y.-F. (1990). Iterated state space algorithms and cross validation for generalized smoothing splines. Technical Report 49, Dept. Statistics, Univ. Illinois.
Cox, D. D., Koh, E., Wahba, G. and Yandell, B. S. (1988). Testing the (parametric) null model hypothesis in (semiparametric) partial and generalized spline models. Ann. Statist. 16 113–119.
Gu, C. (1992). Cross validating non Gaussian data. J. Comput. Graph. Statist. 1 169–179.
Gu, C. (1996). Model indexing and smoothing parameter selection in nonparametric function estimation. Technical Report 93-55 (rev.), Dept. Statistics, Purdue Univ.
Wahba, G. (1983). Bayesian "confidence intervals" for the cross-validated smoothing spline. J. Roy. Statist. Soc. Ser. B 45 133–150.
Xiang, D. and Wahba, G. (1996). A generalized approximate cross validation for smoothing splines with non-Gaussian data. Statist. Sinica. To appear.
Comment
M. C. Jones
Eilers and Marx present a clear and interesting account of their P-spline smoothing methodology. Clearly, P-splines constitute another respectable approach to smoothing. However, their good properties appear to be, broadly, on a par with those of various other approaches; the method is no nearer to, or further from, "being the ideal smoother" than others.
"P-splines have no boundary effects, they are a straightforward extension of (generalized) linear regression models, conserve moments (means, variances) of the data, and have polynomial curve fits as limits." Except for the third point, the same claims can be made of spline smoothing (Green and Silverman, 1994) or local polynomial fitting (Fan and Gijbels, 1996).
Conservation of moments seems unimportant. In regression, I do not see its desirability. In density estimation, simple corrections of kernel density estimates for variance inflation exist, but they make little difference away from the normal density (Jones, 1991). Indeed, getting means and variances right is a normality-based concept, so corrected kernel estimators act in a normal-driven semiparametric manner. Efron and Tibshirani (1996) propose more sophisticated moment conservation, but initial indications are that this is no better nor worse than alternative semiparametric density estimators (Hjort, 1996).
“The computations, including those for cross-validation, are relatively inexpensive and easily incorporated into standard software.” Again, proponents of the two competing methods I have mentioned would claim the same for the first half of this, and advocates of regression splines would claim the lot.
The authors make no particularly novel contribution to automatic bandwidth selection. Cross-validation and AIC are in a class of methods (e.g., Härdle, 1990, pages 166–167) which, while not being downright bad, allow scope for improvement.
M. C. Jones is Reader in Statistical Science, Department of Statistics, The Open University, Walton Hall, Milton Keynes, MK7 6AA, United Kingdom.
Calculating these bandwidth selectors quickly is less important than developing better selectors. For local polynomials, improvements are offered (for normal errors) by Fan and Gijbels (1995) and Ruppert, Sheather and Wand (1995), and unpublished work extends these to more general situations.
The comparison of (5) with (11) focusses on the small extra complexity of the latter. But which is more interpretable: a roughness penalty on a curve or on a series of coefficients? Changing the penalty in a smoothing spline setup allows different parametric limits (e.g., Ansley, Kohn and Wong, 1993); how can P-splines cope with this?
An exasperating aspect of spline-based approaches is the lack of straightforward (asymptotic) mean squared error–type results to indicate theoretical performance relative to kernel/local polynomial approaches, for which such results are simply obtained and, within limitations, informative. I doubt whether P-splines can facilitate such developments (reason given below).
It seems that P-splines have no particular attractiveness for multivariate applications. The examples are noteworthy only for looking like results obtainable by other methods too.
The idea behind density estimation P-splines is to treat a fine binning as Poisson regression data. OK, but this is again equally applicable to other approaches and already investigated for local polynomial smoothing. Simonoff (1996, Section 6.4) and Jones (1996) explain how such regression approaches to density estimation are discretized versions of certain “direct” local likelihood density estimation methods (Hjort and Jones, 1996; Loader, 1996). Binning is the major computational device of all kernel-type estimators (Fan and Marron, 1994). The local likelihood approach is already deeply understood theoretically.
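The binned-Poisson recipe Jones refers to can be prototyped in a few lines of numpy: bin the data finely, then fit a penalized Poisson regression of the counts on a B-spline basis by iteratively reweighted least squares. The sketch below is our own schematic of this idea, not code from the paper; the function names and defaults (100 bins, 20 knot segments, a third-order difference penalty) are illustrative choices.

```python
import numpy as np

def bspline_basis(x, xl, xr, nseg, deg=3):
    # Equidistant-knot B-spline basis via the Cox-de Boor recursion.
    dx = (xr - xl) / nseg
    knots = xl + dx * np.arange(-deg, nseg + deg + 1)
    B = ((x[:, None] >= knots[None, :-1])
         & (x[:, None] < knots[None, 1:])).astype(float)
    for d in range(1, deg + 1):
        B = ((x[:, None] - knots[None, :-(d + 1)]) * B[:, :-1]
             + (knots[None, d + 1:] - x[:, None]) * B[:, 1:]) / (d * dx)
    return B

def pspline_density(data, xl, xr, nbins=100, nseg=20, deg=3, pord=3, lam=1.0):
    # Treat a fine binning as Poisson counts and fit a penalized
    # Poisson regression on the log scale by penalized IRLS.
    counts, edges = np.histogram(data, bins=nbins, range=(xl, xr))
    mids = 0.5 * (edges[:-1] + edges[1:])
    B = bspline_basis(mids, xl, xr, nseg, deg)
    n = B.shape[1]
    D = np.diff(np.eye(n), pord, axis=0)       # difference penalty matrix
    P = lam * D.T @ D
    a = np.linalg.lstsq(B, np.log(counts + 1.0), rcond=None)[0]  # safe start
    for _ in range(50):
        eta = B @ a
        mu = np.exp(eta)
        anew = np.linalg.solve(B.T @ (mu[:, None] * B) + P,
                               B.T @ (mu * eta + counts - mu))
        if np.max(np.abs(anew - a)) < 1e-8:
            a = anew
            break
        a = anew
    mu = np.exp(B @ a)
    dens = mu / (mu.sum() * (edges[1] - edges[0]))  # normalize to a density
    return mids, dens
```

Because the fit is of the log counts, the resulting density is automatically nonnegative, which is one of the attractions of the Poisson formulation.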
Comparison of P-splines' reasonable boundary performance with local polynomials' reasonable boundary performance is not yet available through theory or simulations.
An interesting point mentioned in the paper is the apparent continuum between few-parameter parametric fits at one end and fully “nonparametric” techniques at the other, with many-parameter parametric models and semiparametric approaches in between: a dichotomy into parametric and nonparametric is inappropriate, and there is a huge grey area of overlap. The equivalent degrees-of-freedom ideas of Hastie and Tibshirani (1990) provide a fine (but possibly improvable?) attempt to give this continuum a scale. Theoretical development might be made more difficult by P-splines for reasons associated with quantifying the “nonparametricness” of intermediate methods.
Finally, we come back to my main point. In an admirable “personal view of smoothing and statistics,” Marron (1996) gives a list of smoothing methods and another of factors (to which I might add others) involved in the choice between methods. Marron says “All of the methods … listed … have differing strengths and weaknesses in … divergent senses. None of these methods dominates any other in all of the senses. … Since these factors are so different, almost any method can be ‘best’, simply by an appropriate personal weighting of the various factors involved.” P-splines are a reasonable addition to Marron's first list, but have no special status with respect to his second.
REFERENCES
Ansley, C. F., Kohn, R. and Wong, C. M. (1993). Nonparametric spline regression with prior information. Biometrika 80 75–88.
Efron, B. and Tibshirani, R. (1996). Using specially designed exponential families for density estimation. Ann. Statist. 24 000–000.
Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. Chapman and Hall, London.
Fan, J. and Marron, J. S. (1994). Fast implementations of nonparametric curve estimators. J. Comput. Graph. Statist. 3 35–56.
Hjort, N. L. (1996). Performance of Efron and Tibshirani's semiparametric density estimator. Unpublished manuscript.
Hjort, N. L. and Jones, M. C. (1996). Locally parametric nonparametric density estimation. Ann. Statist. 24 1619–1647.
Jones, M. C. (1991). On correcting for variance inflation in kernel density estimation. Comput. Statist. Data Anal. 11 3–15.
Jones, M. C. (1996). On close relations of local likelihood density estimation. Unpublished manuscript.
Loader, C. R. (1996). Local likelihood density estimation. Ann. Statist. 24 1602–1618.
Marron, J. S. (1996). A personal view of smoothing and statistics (with discussion). Comput. Statist. To appear.
Ruppert, D., Sheather, S. J. and Wand, M. P. (1995). An effective bandwidth selector for local least squares regression. J. Amer. Statist. Assoc. 90 1257–1270.
Simonoff, J. S. (1996). Smoothing Methods in Statistics. Springer, New York.
Comment
Joachim Engel and Alois Kneip
Paul Eilers and Brian Marx have provided us with a nice and flexible addition to the smoother's toolkit. Their proposed P-spline estimator can be considered as some compromise between the usual B-spline estimation and the smoothing spline approach. Different from many papers on B-splines, however, they do not consider the delicate problem of optimal knot selection. Instead, they propose to use a large number of equidistant knots. Smoothing is introduced by a roughness penalty on the differences of spline coefficients.
Joachim Engel is with Wirtschaftstheorie II, Universität Bonn, and Department of Mathematics, PH Ludwigsburg, Germany. Alois Kneip is with Institut de Statistique, Université Catholique de Louvain, Belgium.
P-spline estimation is equivalent to smoothing splines when choosing as many knots as there are observations (n = m), with a knot placed at each data point. However, this is not the situation the authors have in mind. They propose to choose a large number n of knots, but n < m. Such an approach is of considerable interest. We know from personal experience that nonparametric regression fits based on B-splines are often visually more appealing than, for example, kernel estimates. The same seems to be true for P-splines if a moderate number of knots is used. Furthermore, as the authors indicate, P-splines together with the difference penalty enjoy many important practical advantages and are flexible enough to be applied in different modelling situations, for example, in additive models or self-modelling regression where the backfitting algorithm is used.
Nevertheless, we do not yet see much evidence for the authors' claim that P-splines “come near being the ideal smoother.” For example, local polynomial regression is known to exhibit no boundary problems (in first order) and to possess certain optimality and minimax properties (Fan, 1993). For density estimation Engel and Gasser (1995) show a minimax property of the fixed bandwidth kernel method within a large class of estimators containing penalized likelihood estimators. The presented paper does not provide any argument, theoretical or by simulation, supporting any superiority of P-splines over their many competitors.
In the regression case, the theoretical properties of P-splines might be evaluated by combining arguments of de Boor (1978) on the asymptotic bias and variance of B-splines (in dependence on m, the spline order k and the smoothness of the underlying function) with the well-known results on smoothing splines.
The authors propose to use AIC or cross-validation to select the smoothing parameter λ. However, a careful look at their method reveals that there are in fact two free parameters: λ and the number n of knots. If n ≈ m, then we essentially obtain a smoothing spline fit, while results might be very different if n ≪ m. Indeed, the estimate might crucially depend on n. Therefore, why not determine λ and n by cross-validation or a related method? The following theoretical arguments may suggest that such a procedure will work. Note that AIC and cross-validation are very close to unbiased risk estimation, which consists of estimating the optimal values of λ and n by minimizing
$$\sum_{i=1}^{m} (y_i - \hat{\mu}_i)^2 + 2\sigma^2 \operatorname{tr}(H_{\lambda,n}),$$

where $H \equiv H_{\lambda,n}$ is the corresponding smoother matrix. Let $\mathrm{ASE}(\lambda, n)$ denote the average squared error of the fit obtained by using some parameters λ and n. Under some technical conditions, it then follows from results of Kneip (1994) that, as $m \to \infty$,

$$\mathrm{ASE}(\hat{\lambda}, \hat{n}) / \mathrm{ASE}(\lambda_{\mathrm{opt}}, n_{\mathrm{opt}}) \to_P 1.$$

Here $\hat{\lambda}$ and $\hat{n}$ are the parameters estimated by unbiased risk estimation, while $\lambda_{\mathrm{opt}}$ and $n_{\mathrm{opt}}$ represent the optimal choice of the parameters minimizing the ASE.
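Engel and Kneip's suggestion is easy to prototype: for each candidate pair (λ, n), form the smoother matrix H of the P-spline fit and evaluate the unbiased risk criterion above. The sketch below (our own function names; σ² assumed known; a brute-force grid search with a dense H, whereas a serious implementation would exploit the banded structure) illustrates the idea.

```python
import numpy as np

def bspline_basis(x, xl, xr, nseg, deg=3):
    # Equidistant-knot B-spline basis via the Cox-de Boor recursion.
    dx = (xr - xl) / nseg
    knots = xl + dx * np.arange(-deg, nseg + deg + 1)
    B = ((x[:, None] >= knots[None, :-1])
         & (x[:, None] < knots[None, 1:])).astype(float)
    for d in range(1, deg + 1):
        B = ((x[:, None] - knots[None, :-(d + 1)]) * B[:, :-1]
             + (knots[None, d + 1:] - x[:, None]) * B[:, 1:]) / (d * dx)
    return B

def risk_select(x, y, sigma2, nsegs, lambdas, pord=2):
    # Unbiased risk estimation: minimize RSS + 2*sigma^2*tr(H) over
    # both the penalty parameter lambda and the number of knot segments.
    best = None
    for nseg in nsegs:
        B = bspline_basis(x, x.min(), x.max() + 1e-9, nseg)
        D = np.diff(np.eye(B.shape[1]), pord, axis=0)
        for lam in lambdas:
            H = B @ np.linalg.solve(B.T @ B + lam * D.T @ D, B.T)
            muhat = H @ y
            crit = np.sum((y - muhat) ** 2) + 2.0 * sigma2 * np.trace(H)
            if best is None or crit < best[0]:
                best = (crit, lam, nseg, muhat)
    return best
```

With σ² known, the criterion is an unbiased estimate (up to a constant) of the risk, so the grid minimizer tracks the ASE-optimal pair in the sense of the Kneip (1994) result quoted above.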
Comment
Charles Kooperberg
Eilers and Marx present an interesting approach to spline modeling. While function estimation based on smoothing splines often yields reasonable results, the computational burden can be very large. If the number of basis functions is limited, however, the computations become much easier, and when the knots are equally spaced, the solution indeed becomes rather elegant. To increase the credibility of the claim that P-splines are close to the “ideal smoother,” several issues need to be addressed:
1. In density estimation, when the range of the data is ℝ (or ℝ⁺), it is useful that a density estimate be positive on ℝ (or ℝ⁺), for example, for resampling. Some methods can estimate densities on bounded or unbounded intervals. P-splines do not seem to have this property: lower and upper bounds have to be specified and there seems to be no natural way to extrapolate beyond these bounds. Is there any way around that? Can infinity be a bound? How would one specify the bounds? From the suicide example it appears that this may influence the results considerably.
Charles Kooperberg is Assistant Professor, Department of Statistics, University of Washington, Seattle, Washington 98195-0001.
2. To use P-splines, additional choices need to be made. How many knots should one use? Is the procedure insensitive to the number of knots provided that there are enough of them? If so, how many is enough? How does the computational burden depend on the number of knots?
What order of penalty should be used? Do you advocate examining several possible penalties, as in the logistic regression example, or do you have another recommendation, such as using k = 3 for density estimation so that the limit of your estimate as λ → ∞ is a normal density? Since many smoothing and density estimation procedures are used as EDA tools, good defaults are very worthwhile.
3. It would be interesting to see an application of the P-spline methodology to more challenging data, such as the income data described below, which involves thousands of cases, a narrow peak and a severe outlier.
How would the P-spline algorithm, where knots are positioned equidistantly, behave when there are severe outliers, which would dominate the positioning of the knots? Is it possible to position knots nonequidistantly, for example, based on order statistics?
4. Are there theoretical results about the large sample behavior of P-splines?
POLYNOMIAL SPLINES AND LOGSPLINE DENSITY ESTIMATION
Besides the penalized likelihood approach, there is an entirely different approach to function estimation based on splines. Whereas for P-splines both the number and the locations of the knots are fixed in advance and the smoothness is governed by a smoothing parameter, in the polynomial spline framework the number and location of the knots are determined adaptively using a stepwise algorithm and no smoothing parameter is needed. Such polynomial spline methods have been used for regression (Friedman, 1991), density estimation (Kooperberg and Stone, 1992), polychotomous (multiple logistic) regression (Kooperberg, Bose and Stone, 1997), survival analysis (Kooperberg, Stone and Truong, 1995a) and spectral density estimation (Kooperberg, Stone and Truong, 1995b).
In univariate polynomial spline methodologies the algorithm starts with a fairly small number of knots. It then adds knots in those regions where an added knot would have the most influence, using Rao (score) statistics to decide on the best location; after a prespecified maximum number of knots is reached, knots are deleted one at a time, using Wald statistics to decide which knot to remove. Out of the sequence of fitted models, the one having the smallest value of the BIC criterion is selected.
Polynomial spline algorithms for multivariate function estimation are similar, except that at each addition step the algorithm adds either a knot in one variable or a tensor product of two or more univariate basis functions. We have successfully applied such methodologies to data sets as small as 50 for one-dimensional density estimation and as large as 112,000 for a 63-dimensional polychotomous regression problem with 46 classes. For nonadaptive polynomial spline methodologies theoretical results regarding the L2-rate of convergence are established. Stone, Hansen, Kooperberg and Truong (1996) provide an overview of polynomial splines and their applications.
Logspline density estimation, in which a (univariate) log-density is modeled by a cubic spline, is discussed in Kooperberg and Stone (1992) and Stone et al. (1996). Software for the 1992 version, written in C and interfaced to S-PLUS, is publicly available from Statlib. (The 1992 version of LOGSPLINE employs only knot deletion; here, however, we focus on the 1996 version, which uses both knot addition and knot deletion.) LOGSPLINE can provide estimates on both finite and infinite intervals, and it can handle censored data.
The results of LOGSPLINE on the Old Faithful data and the suicide data are very similar to the corresponding results of P-splines [the suicide data is an example in Kooperberg and Stone (1992)]. Here we consider a much more challenging data set. The solid line in Figure 1 shows the logspline density estimate based on a random sample of 7,125 annual net incomes in the United Kingdom [Family Expenditure Survey (1968–1983)]. (The data have been rescaled to have mean 1.) The nine knots that were selected by LOGSPLINE are indicated. Note that four of these knots are extremely close to the peak near 0.24. This peak is due to the UK old age pension, which caused many people to have nearly identical incomes. In Kooperberg and Stone (1992) we concluded that the height and location of this peak are accurately estimated by LOGSPLINE. There are several reasons why this data set is more challenging than the Old Faithful and suicide data: the data set is much larger, so that it is more of a challenge to computing resources (the LOGSPLINE estimate took 9 seconds on a Sparc 10 workstation); the width of the peak is about 0.02, compared to the range 11.5 of the data; there is a severe outlier (the largest observation is 11.5, the second largest is 7.8); and the rise of the density to the left of the peak is very steep.
To get an impression of what the P-splines procedure would yield for this data, I first removed the largest observation so that there would not be any long gaps in the data, reducing the maximum observation to 7.8. The dashed line in Figure 1 is the LOGSPLINE estimate for the data with fixed knots at i/20 × 7.8, for i = 0, 1, …, 20 (using 20 intervals, as in most P-spline examples). The resulting fit should be similar to a P-spline fit with λ = 0. In this estimate it appears that the narrow peak is completely missed and that, because of the steep rise of the density to the left of the peak and the lack of sufficiently many knots near the peak, two modes are estimated where only one mode exists.
Fig. 1. Logspline density estimate for the income data (solid line); the x's indicate the locations of the knots; logspline approximation of the P-spline estimate with penalty parameter 0 (dashed line).
It would be very much of interest to see how the P-spline methodology behaves on this data, and in particular whether it can accurately represent the sharp peak near 0.24.
ACKNOWLEDGMENT
Research supported in part by NSF Grant DMS-94-03371.
Comment
Dennis D. Cox
The main new idea in this paper is a roughness penalty based on the B-spline coefficients. There will be critics—I give some criticisms below—but there is considerable appeal in the simplicity of the idea. If I had to develop the software ab initio, it is clear that the roughness penalties proposed here would require less effort to implement than the standard ones based on the L2-norm of a second derivative.
There is a precedent for the use of the B-spline coefficients in such a direct way, from computer graphics (CG) and computer-aided design (CAD). The “control points” typically used in parametric B-spline representations of curves and surfaces basically consist of the B-spline coefficients. See Foley and van Dam (1995, Section 11.2.3). This is demonstrated in Figure 1, where the control points for the solid curve are just uniform random values added to a linear trend, and the same points are shrunk toward 0.5 before adding the trend to obtain the control points for the dashed curve. The ordinate of each control point is the cubic cardinal B-spline coefficient and the abscissa is the midpoint of its support. In CG/CAD applications, the control points are manipulated to obtain a curve or surface with desirable shape or smoothness. The CG/CAD practitioners become familiar with these control points and develop a feel for their influence on the curve or surface. Similarly, statisticians may find after some effort that B-spline coefficients are very natural.
Dennis D. Cox is with the Department of Statistics, Rice University, P.O. Box 1892, Houston, Texas 77251.
Fig. 1. Example of control points: the solid curve derives from the solid control points, and the dashed curve from the triangular control points.
If I had equally easy-to-use software for smoothing splines or P-splines, I would prefer the former, partially from Bayesian considerations. The Bayesian interpretation of P-splines (i.e., the differenced B-spline coefficients are Gaussian white noise under the prior) is more artificial than the usual priors as in Wahba (1978). In particular, the usual priors are specified independently of sample size, whereas one would want to use more B-splines with a larger sample. Furthermore, the integral of the squared second derivative is easier to interpret from a non-Bayesian perspective than the sum of squares of second differences of B-spline coefficients.
I take issue with the authors' claim that their method does not have boundary problems. P-splines are approximately equivalent to smoothing splines, which do have boundary effects (Speckman, 1983). To explain, consider minimizing, from equation (5),

$$S(a) = \sum_{i=1}^{m} \Big\{ y_i - \sum_{j=1}^{n} a_j B_j(x_i) \Big\}^2 + \lambda \sum_{j=3}^{n} (\Delta^2 a_j)^2.$$
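For concreteness, the minimizer of S(a) solves the penalized normal equations (BᵀB + λD₂ᵀD₂)a = Bᵀy, where the rows of D₂ take second differences of the coefficient vector. The following small numpy check (a random matrix stands in for the B-spline design, since the point here is the algebra rather than the basis; for an actual B-spline basis the system is banded) verifies that solving this system does set the gradient of S to zero.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, lam = 50, 10, 2.0
B = rng.standard_normal((m, n))     # stand-in for the B-spline design matrix
y = rng.standard_normal(m)
D2 = np.diff(np.eye(n), 2, axis=0)  # rows compute the second differences (Delta^2 a_j)

def S(a):
    # the penalized sum of squares displayed above
    return np.sum((y - B @ a) ** 2) + lam * np.sum((D2 @ a) ** 2)

# Setting the gradient of S to zero gives the penalized normal equations.
a_hat = np.linalg.solve(B.T @ B + lam * D2.T @ D2, B.T @ y)
grad = -2 * B.T @ (y - B @ a_hat) + 2 * lam * D2.T @ (D2 @ a_hat)
```

The boundary behavior Cox derives below is visible in D₂ itself: only one penalty row touches a₁ and aₙ, and only two touch a₂ and aₙ₋₁, so the end coefficients are penalized less heavily than the interior ones.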
A discrete form of the variational derivation in Speckman (1983) leads to the system
$$\begin{aligned}
\lambda \Delta^2 a_3 + \sum_i B_1(x_i) \sum_j a_j B_j(x_i) &= \sum_i y_i B_1(x_i),\\
\lambda \Delta^3 a_4 - \lambda \Delta^2 a_3 + \sum_i B_2(x_i) \sum_j a_j B_j(x_i) &= \sum_i y_i B_2(x_i),\\
\lambda \Delta^4 a_k + \sum_i B_k(x_i) \sum_j a_j B_j(x_i) &= \sum_i y_i B_k(x_i), \quad 3 \le k \le n-2,\\
-\lambda \Delta^3 a_n - \lambda \Delta^2 a_n + \sum_i B_{n-1}(x_i) \sum_j a_j B_j(x_i) &= \sum_i y_i B_{n-1}(x_i),\\
\lambda \Delta^2 a_n + \sum_i B_n(x_i) \sum_j a_j B_j(x_i) &= \sum_i y_i B_n(x_i).
\end{aligned}$$
Notice that the equations for coefficients near the end involve lower-order differencing, so there is less smoothness imposed.
ACKNOWLEDGMENT
Research supported by NSF Grant DMS-90-01726.
Comment
Stephan R. Sain and David W. Scott
We have been interested in formulations of the smoothing problem that are simultaneously global in nature with locally adaptive behavior. Roughness penalties based on functionals such as the integral of squared second derivatives of the fitted curve have enjoyed much popularity. The solution to such optimization problems is often a spline. The authors are to be congratulated for introducing the idea of penalizing on the smoothness of the spline coefficients, which reduces the dimensionality of the problem as well as reducing the complexity of the calculations. There is much to say for this approach.
It is generally of interest to try to work out the equivalent kernel formulation of all smoothing methods. This was done for Nadaraya–Watson regression smoothing by Silverman (1984), who demonstrated the asymptotic manner in which the estimator adapted locally.
In the density estimation setting, we have been investigating the nature of the best locally adaptive density estimator along the lines of the Breiman–Meisel–