Statistical Science
1996, Vol. 11, No. 2, 89–121

Flexible Smoothing with B-splines and Penalties

Paul H. C. Eilers and Brian D. Marx

Abstract. B-splines are attractive for nonparametric modelling, but choosing the optimal number and positions of knots is a complex task. Equidistant knots can be used, but their small and discrete number allows only limited control over smoothness and fit. We propose to use a relatively large number of knots and a difference penalty on coefficients of adjacent B-splines. We show connections to the familiar spline penalty on the integral of the squared second derivative. A short overview of B-splines, of their construction and of penalized likelihood is presented. We discuss properties of penalized B-splines and propose various criteria for the choice of an optimal penalty parameter. Nonparametric logistic regression, density estimation and scatterplot smoothing are used as examples. Some details of the computations are presented.

Key words and phrases: Generalized linear models, smoothing, nonparametric models, splines, density estimation.

    1. INTRODUCTION

There can be little doubt that smoothing has a respectable place in statistics today. Many papers and a number of books have appeared (Silverman, 1986; Eubank, 1988; Hastie and Tibshirani, 1990; Härdle, 1990; Wahba, 1990; Wand and Jones, 1993; Green and Silverman, 1994). There are several reasons for this popularity: many data sets are too “rich” to be fully modeled with parametric models; graphical presentation has become increasingly more important and easier to use; and exploratory analysis of data has become more common.

Actually, the name nonparametric is not always well chosen. It might apply to kernel smoothers and running statistics, but spline smoothers are described by parameters, although their number can be large. It might be better to talk about “overparametric” techniques or “anonymous” models; the parameters have no scientific interpretation.

Paul H. C. Eilers is Department Head in the computing section of DCMR Milieudienst Rijnmond, 's-Gravelandseweg 565, 3119 XT Schiedam, The Netherlands (e-mail: [email protected]). Brian D. Marx is Associate Professor, Department of Experimental Statistics, Louisiana State University, Baton Rouge, LA 70803-5606 (e-mail: [email protected]).

There exist several refinements of running statistics, like kernel smoothers (Silverman, 1986; Härdle, 1990) and LOWESS (Cleveland, 1979). Splines come in several varieties: smoothing splines, regression splines (Eubank, 1988) and B-splines (de Boor, 1978; Dierckx, 1993). With so many techniques available, why should we propose a new one? We believe that a combination of B-splines and difference penalties (on the estimated coefficients), which we call P-splines, has very attractive properties. P-splines have no boundary effects, they are a straightforward extension of (generalized) linear regression models, conserve moments (means, variances) of the data and have polynomial curve fits as limits. The computations, including those for cross-validation, are relatively inexpensive and easily incorporated into standard software.

B-splines are constructed from polynomial pieces, joined at certain values of $x$, the knots. Once the knots are given, it is easy to compute the B-splines recursively, for any desired degree of the polynomial; see de Boor (1977, 1978), Cox (1981) or Dierckx (1993). The choice of knots has been a subject of much research: too many knots lead to overfitting of the data, too few knots lead to underfitting. Some authors have proposed automatic schemes for optimizing the number and the positions of the knots (Friedman and Silverman, 1989; Kooperberg and Stone, 1991, 1992). This is a difficult numerical problem and, to our knowledge, no attractive all-purpose scheme exists.

A different track was chosen by O'Sullivan (1986, 1988). He proposed to use a relatively large number of knots. To prevent overfitting, a penalty on the second derivative restricts the flexibility of the fitted curve, similar to the penalty pioneered for smoothing splines by Reinsch (1967), which has become the standard in much of the spline literature; see, for example, Eubank (1988), Wahba (1990) and Green and Silverman (1994). In this paper we simplify and generalize the approach of O'Sullivan, in such a way that it can be applied in any context where regression on B-splines is useful. Only small modifications of the regression equations are necessary.

The basic idea is not to use the integral of a squared higher derivative of the fitted curve in the penalty, but instead to use a simple difference penalty on the coefficients themselves of adjacent B-splines. We show that both approaches are very similar for second-order differences. In some applications, however, it can be useful to use differences of a smaller or higher order in the penalty. With our approach it is simple to incorporate a penalty of any order in the (generalized) regression equations.

A major problem of any smoothing technique is the choice of the optimal amount of smoothing, in our case the optimal weight of the penalty. We use cross-validation and the Akaike information criterion (AIC). In the latter the effective dimension, that is, the effective number of parameters, of a model plays a crucial role. We follow Hastie and Tibshirani (1990) in using the trace of the smoother matrix as the effective dimension. Because we use standard regression techniques, this quantity can be computed easily. We find the trace very useful to compare the effective amount of smoothing for different numbers of knots, different degrees of the B-splines and different orders of penalties.

We investigate the conservation of moments of different order, in relation to the degree of the B-splines and the order of the differences in the penalty. To illustrate the use of P-splines, we present the following as applications: smoothing of scatterplots; modeling of dose–response curves; and density estimation.

    2. B-SPLINES IN A NUTSHELL

Not all readers will be familiar with B-splines. Basic references are de Boor (1978) and Dierckx (1993), but, to illustrate the basic simplicity of the ideas, we explain some essential background here. A B-spline consists of polynomial pieces, connected in a special way. A very simple example is shown at the left of Figure 1(a): one B-spline of degree 1. It consists of two linear pieces; one piece from $x_1$ to $x_2$, the other from $x_2$ to $x_3$. The knots are $x_1$, $x_2$ and $x_3$. To the left of $x_1$ and to the right of $x_3$ this B-spline is zero. In the right part of Figure 1(a), three more B-splines of degree 1 are shown, each one based on three knots. Of course, we can construct as large a set of B-splines as we like, by introducing more knots.

In the left part of Figure 1(b), a B-spline of degree 2 is shown. It consists of three quadratic pieces, joined at two knots. At the joining points not only the ordinates of the polynomial pieces match, but also their first derivatives are equal (but not their second derivatives). The B-spline is based on four adjacent knots: $x_1, \ldots, x_4$. In the right part of Figure 1(b), three more B-splines of degree 2 are shown.

Note that the B-splines overlap each other. First-degree B-splines overlap with two neighbors, second-degree B-splines with four neighbors, and so on. Of course, the leftmost and rightmost splines have less overlap. At a given $x$, two first-degree (or three second-degree) B-splines are nonzero.

These examples illustrate the general properties of a B-spline of degree $q$:

• it consists of $q + 1$ polynomial pieces, each of degree $q$;
• the polynomial pieces join at $q$ inner knots;
• at the joining points, derivatives up to order $q - 1$ are continuous;
• the B-spline is positive on a domain spanned by $q + 2$ knots; everywhere else it is zero;
• except at the boundaries, it overlaps with $2q$ polynomial pieces of its neighbors;
• at a given $x$, $q + 1$ B-splines are nonzero.

Let the domain from $x_{\min}$ to $x_{\max}$ be divided into $n'$ equal intervals by $n' + 1$ knots. Each interval will be covered by $q + 1$ B-splines of degree $q$. The total number of knots for construction of the B-splines will be $n' + 2q + 1$. The number of B-splines in the regression is $n = n' + q$. This is easily verified by constructing graphs like those in Figure 1.

B-splines are very attractive as base functions for (“nonparametric”) univariate regression. A linear combination of (say) third-degree B-splines gives a smooth curve. Once one can compute the B-splines themselves, their application is no more difficult than polynomial regression.
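These counts are easy to verify numerically as well. A sketch (S-PLUS/R style; spline.des() is the B-spline evaluator used in the Appendix, available in R's splines package; all names are illustrative):

    library(splines)                          # in R; spline.des() is built in under S-PLUS
    xl <- 0; xr <- 1; ndx <- 10; bdeg <- 3    # n' = 10 intervals, degree q = 3
    dx <- (xr - xl) / ndx
    knots <- seq(xl - bdeg * dx, xr + bdeg * dx, by = dx)  # n' + 2q + 1 = 17 knots
    x <- seq(xl, xr, length = 100)
    B <- spline.des(knots, x, bdeg + 1)$design
    dim(B)                                    # 100 x 13: n = n' + q = 13 B-splines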

Fig. 1. Illustrations of one isolated B-spline and several overlapping ones: (a) degree 1; (b) degree 2.

De Boor (1978) gave an algorithm to compute B-splines of any degree from B-splines of lower degree. Because a zero-degree B-spline is just a constant on one interval between two knots, it is simple to compute B-splines of any degree. In this paper we use only equidistant knots, but de Boor's algorithm also works for any placement of knots. For equidistant knots, the algorithm can be further simplified, as is illustrated by a small MATLAB function in the Appendix.

Let $B_j(x; q)$ denote the value at $x$ of the $j$th B-spline of degree $q$ for a given equidistant grid of knots. A fitted curve $\hat{y}$ to data $(x_i, y_i)$ is the linear combination $\hat{y}(x) = \sum_{j=1}^n \hat{a}_j B_j(x; q)$. When the degree of the B-splines is clear from the context, or immaterial, we use $B_j(x)$ instead of $B_j(x; q)$.

The indexing of B-splines needs some care, especially when we are going to use derivatives. The indexing connects a B-spline to a knot; that is, it gives the index of the knot that characterizes the position of the B-spline. Our choice is to take the leftmost knot, the knot at which the B-spline starts to become nonzero. In Figure 1(a), $x_1$ is the positioning knot for the first B-spline. This choice of indexing demands that we introduce $q$ knots to the left of the domain of $x$. In the formulas that follow for derivatives, the exact bounds of the index in the sums are immaterial, so we have left them out.

De Boor (1978) gives a simple formula for derivatives of B-splines:

$$h \sum_j a_j B_j'(x; q) = \sum_j a_j B_j(x; q-1) - \sum_j a_{j-1} B_j(x; q-1) = \sum_j \Delta a_j\, B_j(x; q-1), \tag{1}$$

where $h$ is the distance between knots and $\Delta a_j = a_j - a_{j-1}$.

By induction we find the following for the second derivative:

$$h^2 \sum_j a_j B_j''(x; q) = \sum_j \Delta^2 a_j\, B_j(x; q-2), \tag{2}$$

where $\Delta^2 a_j = \Delta(\Delta a_j) = a_j - 2a_{j-1} + a_{j-2}$. This fact will prove very useful when we compare continuous and discrete roughness penalties in the next section.
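Relation (2) can be checked numerically. In the sketch below (S-PLUS/R style; spline.des() is the B-spline evaluator used in the Appendix, and coefficients outside $1, \ldots, n$ are taken as zero, hence the zero padding), both sides agree to machine precision:

    q <- 3; ndx <- 8; xl <- 0; xr <- 1
    h <- (xr - xl) / ndx
    knots <- seq(xl - q * h, xr + q * h, by = h)
    n <- ndx + q
    a <- rnorm(n)
    x <- seq(xl, xr, length = 200)
    B2 <- spline.des(knots, x, q + 1, derivs = rep(2, length(x)))$design  # B_j''(x; q)
    C  <- spline.des(knots, x, q - 1)$design                              # B_j(x; q - 2)
    lhs <- h^2 * (B2 %*% a)
    rhs <- C %*% diff(c(0, 0, a, 0, 0), differences = 2)                  # Delta^2 a, zero-padded
    max(abs(lhs - rhs))                                                   # close to machine precision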

    3. PENALTIES

Consider the regression of $m$ data points $(x_i, y_i)$ on a set of $n$ B-splines $B_j(\cdot)$. The least squares objective function to minimize is

$$S = \sum_{i=1}^m \Big\{ y_i - \sum_{j=1}^n a_j B_j(x_i) \Big\}^2. \tag{3}$$

Let the number of knots be relatively large, such that the fitted curve will show more variation than is justified by the data. To make the result less flexible, O'Sullivan (1986, 1988) introduced a penalty on the second derivative of the fitted curve and so formed the objective function

$$S = \sum_{i=1}^m \Big\{ y_i - \sum_{j=1}^n a_j B_j(x_i) \Big\}^2 + \lambda \int_{x_{\min}}^{x_{\max}} \Big\{ \sum_{j=1}^n a_j B_j''(x) \Big\}^2\, dx. \tag{4}$$

The integral of the square of the second derivative of a fitted function has become common as a smoothness penalty, since the seminal work on smoothing splines by Reinsch (1967). There is nothing special about the second derivative; in fact, lower or higher orders might be used as well. In the context of smoothing splines, the first derivative leads to simple equations and a piecewise linear fit, while higher derivatives lead to rather complex mathematics, systems of equations with a high bandwidth, and a very smooth fit.

We propose to base the penalty on (higher-order) finite differences of the coefficients of adjacent B-splines:

$$S = \sum_{i=1}^m \Big\{ y_i - \sum_{j=1}^n a_j B_j(x_i) \Big\}^2 + \lambda \sum_{j=k+1}^n (\Delta^k a_j)^2. \tag{5}$$

This approach reduces the dimensionality of the problem to $n$, the number of B-splines, instead of $m$, the number of observations, as with smoothing splines. We still have a parameter $\lambda$ for continuous control over smoothness of the fit. The difference penalty is a good discrete approximation to the integrated square of the $k$th derivative. What is more important: with this penalty moments of the data are conserved and polynomial regression models occur as limits for large values of $\lambda$. See Section 5 for details.

We will show below that there is a very strong connection between a penalty on second-order differences of the B-spline coefficients and O'Sullivan's choice of a penalty on the second derivative of the fitted function. However, our penalty can be handled mechanically for any order of the differences (see the implementation in the Appendix).

Difference penalties have a long history that goes back at least to Whittaker (1923); recent applications have been described by Green and Yandell (1985) and Eilers (1989, 1991a, b, 1995).

The difference penalty is easily introduced into the regression equations. That makes it possible to experiment with different orders of the differences. In some cases it is useful to work with even the fourth or a higher order. This stems from the fact that for high values of $\lambda$ the fitted curve approaches a parametric (polynomial) model, as will be shown below.

O'Sullivan (1986, 1988) used third-degree B-splines and the following penalty:

$$P = \lambda \int_{x_{\min}}^{x_{\max}} \Big\{ \sum_j a_j B_j''(x; 3) \Big\}^2\, dx. \tag{6}$$

From the derivative properties of B-splines in (2) it follows that

$$h^4 P = \lambda \int_{x_{\min}}^{x_{\max}} \Big\{ \sum_j \Delta^2 a_j\, B_j(x; 1) \Big\}^2\, dx. \tag{7}$$

This can be written as

$$h^4 P = \lambda \int_{x_{\min}}^{x_{\max}} \sum_j \sum_k \Delta^2 a_j\, \Delta^2 a_k\, B_j(x; 1)\, B_k(x; 1)\, dx. \tag{8}$$

Most of the cross products of $B_j(x; 1)$ and $B_k(x; 1)$ disappear, because B-splines of degree 1 only overlap when $j$ is $k - 1$, $k$ or $k + 1$. We thus have that

$$h^4 P = \lambda \int_{x_{\min}}^{x_{\max}} \Big[ \sum_j (\Delta^2 a_j)^2 B_j^2(x; 1) + 2 \sum_j \Delta^2 a_j\, \Delta^2 a_{j-1}\, B_j(x; 1)\, B_{j-1}(x; 1) \Big]\, dx, \tag{9}$$

or

$$h^4 P = \lambda \sum_j (\Delta^2 a_j)^2 \int_{x_{\min}}^{x_{\max}} B_j^2(x; 1)\, dx + 2\lambda \sum_j \Delta^2 a_j\, \Delta^2 a_{j-1} \int_{x_{\min}}^{x_{\max}} B_j(x; 1)\, B_{j-1}(x; 1)\, dx, \tag{10}$$

which can be written as

$$h^4 P = \lambda \Big\{ c_1 \sum_j (\Delta^2 a_j)^2 + c_2 \sum_j \Delta^2 a_j\, \Delta^2 a_{j-1} \Big\}, \tag{11}$$

where $c_1$ and $c_2$ are constants for given (equidistant) knots:

$$c_1 = \int_{x_{\min}}^{x_{\max}} B_j^2(x; 1)\, dx; \qquad c_2 = \int_{x_{\min}}^{x_{\max}} B_j(x; 1)\, B_{j-1}(x; 1)\, dx. \tag{12}$$

The first term in (11) is equivalent to our second-order difference penalty; the second term contains cross products of neighboring second differences. This leads to more complex equations when minimizing the penalized likelihood (equations in which seven adjacent $a_j$'s occur, compared to five if only squares of second differences occur in the penalty). The higher complexity of the penalty equations stems from the overlapping of B-splines. With higher-order differences and/or higher degrees of the B-splines, the complications grow rapidly and make it rather difficult to construct an automatic procedure for incorporating the penalty in the likelihood equations. With the use of a difference penalty on the coefficients of the B-splines this problem disappears.

    4. PENALIZED LIKELIHOOD

For least squares smoothing we have to minimize $S$ in (5). The system of equations that follows from the minimization of $S$ can be written as

$$B^T y = (B^T B + \lambda D_k^T D_k)\, a, \tag{13}$$

where $D_k$ is the matrix representation of the difference operator $\Delta^k$, and the elements of $B$ are $b_{ij} = B_j(x_i)$. When $\lambda = 0$, we have the standard normal equations of linear regression with a B-spline basis. With $k = 0$ we have a special case of ridge regression. When $\lambda > 0$, the penalty only influences the main diagonal and $k$ subdiagonals (on both sides of the main diagonal) of the system of equations. This system has a banded structure because of the limited overlap of the B-splines. It is seldom worth the trouble to exploit this special structure, as the number of equations is equal to the number of splines, which is generally moderate (10–20).

In a generalized linear model (GLM), we introduce a linear predictor $\eta_i = \sum_{j=1}^n b_{ij} a_j$ and a (canonical) link function $\eta_i = g(\mu_i)$, where $\mu_i$ is the expectation of $y_i$. The penalty now is subtracted from the log-likelihood $l(y; a)$ to form the penalized likelihood function

$$L = l(y; a) - \frac{\lambda}{2} \sum_{j=k+1}^n (\Delta^k a_j)^2. \tag{14}$$

The optimization of $L$ leads to the following system of equations:

$$B^T (y - \mu) = \lambda D_k^T D_k\, a. \tag{15}$$

These are solved as usual with iterative weighted linear regressions with the system

$$B^T \tilde{W} (y - \tilde{\mu}) + B^T \tilde{W} B \tilde{a} = (B^T \tilde{W} B + \lambda D_k^T D_k)\, a, \tag{16}$$

where $\tilde{a}$ and $\tilde{\mu}$ are current approximations to the solution and $\tilde{W}$ is a diagonal matrix of weights

$$w_{ii} = \frac{1}{v_i} \left( \frac{\partial \mu_i}{\partial \eta_i} \right)^2, \tag{17}$$

where $v_i$ is the variance of $y_i$, given $\mu_i$. The only difference with the standard procedure for fitting of GLM's (McCullagh and Nelder, 1989), with B-splines as regressors, is the modification of $B^T \tilde{W} B$ by $\lambda D_k^T D_k$ (which itself is constant for fixed $\lambda$) at each iteration.
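To make the scoring step concrete, here is a minimal sketch of such a fit for Poisson counts (S-PLUS/R style; this is not the authors' code: bspline() is the basis function given in the Appendix, the update is (16) rewritten in the usual working-response form, and all other names are illustrative):

    pspline.poisson <- function(x, y, xl, xr, ndx, bdeg = 3, k = 2, lambda = 1) {
      B <- bspline(x, xl, xr, ndx, bdeg)     # m x n basis matrix
      n <- ncol(B)
      D <- diff(diag(n), differences = k)
      P <- lambda * crossprod(D)             # lambda * D_k' D_k
      eta <- log(y + 1)                      # crude starting value
      for (it in 1:50) {
        mu <- exp(eta)                       # canonical log link
        z  <- eta + (y - mu) / mu            # working response
        a  <- solve(crossprod(B, mu * B) + P, crossprod(B, mu * z))
        etanew <- as.vector(B %*% a)
        if (max(abs(etanew - eta)) < 1e-8) { eta <- etanew; break }
        eta <- etanew
      }
      list(a = as.vector(a), mu = exp(eta))
    }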

    5. PROPERTIES OF P-SPLINES

P-splines have a number of useful properties, partially inherited from B-splines. We give a short overview, with somewhat informal proofs.

In the first place: P-splines show no boundary effects, as many types of kernel smoothers do. By this we mean the spreading of a fitted curve or density outside of the (physical) domain of the data, generally accompanied by bending toward zero. In Section 8 this aspect is considered in some detail, in the context of density smoothing.

P-splines can fit polynomial data exactly. Let data $(x_i, y_i)$ be given. If the $y_i$ are a polynomial in $x$ of degree $k$, then B-splines of degree $k$ or higher will exactly fit the data (de Boor, 1977). The same is true for P-splines, if the order of the penalty is $k + 1$ or higher, whatever the value of $\lambda$. To see that this is true, take the case of a first-order penalty and the fit to data $y$ that are constant (a polynomial of degree 0). Because $\sum_{j=1}^n \hat{a}_j B_j(x) = c$, we have that $\sum_{j=1}^n \hat{a}_j B_j'(x) = 0$ for all $x$. Then it follows from the relationship between differences and derivatives in (1) that all $\Delta \hat{a}_j$ are zero, and thus that $\sum_{j=2}^n (\Delta \hat{a}_j)^2 = 0$. Consequently, the penalty has no effect and the fit is the same as for unpenalized B-splines. This reasoning can easily be extended by induction to data with a linear relationship between $x$ and $y$, and a second-order difference penalty.

P-splines can conserve moments of the data. For a linear model with P-splines of degree $k + 1$ and a penalty of order $k + 1$, or higher, it holds that

$$\sum_{i=1}^m x_i^k y_i = \sum_{i=1}^m x_i^k \hat{y}_i \tag{18}$$

for all values of $\lambda$, where $\hat{y}_i = \sum_{j=1}^n b_{ij} \hat{a}_j$ are the fitted values. For GLM's with canonical links it holds that

$$\sum_{i=1}^m x_i^k y_i = \sum_{i=1}^m x_i^k \hat{\mu}_i. \tag{19}$$

This property is especially useful in the context of density smoothing: the mean and variance of the estimated density will be equal to the mean and variance of the data, for any amount of smoothing. This is an advantage compared to kernel smoothers: these inflate the variance increasingly with stronger smoothing.

The limit of a P-spline fit with strong smoothing is a polynomial. For large values of $\lambda$ and a penalty of order $k$, the fitted series will approach a polynomial of degree $k - 1$, if the degree of the B-splines is equal to, or higher than, $k$. Once again, the relationships between derivatives of a B-spline fit and differences of coefficients, as in (1) and (2), are the key. Take the example of a second-order difference penalty: when $\lambda$ is large, $\sum_{j=3}^n (\Delta^2 a_j)^2$ has to be very near zero. Thus each of the second differences has to be near zero, and thus the second derivative of the fit has to be near zero everywhere. In view of these very useful results, it seems that B-splines and difference penalties are the ideal marriage.

It is important to focus on the linearized smoothing problem that is solved at each iteration, because we will make use of properties of the smoothing matrix. From (16) follows for the hat matrix $H$:

$$H = B (B^T \tilde{W} B + \lambda D_k^T D_k)^{-1} B^T \tilde{W}. \tag{20}$$

The trace of $H$ will approach $k$ as $\lambda$ increases. A proof goes as follows. Let

$$Q_B = B^T \tilde{W} B \quad \text{and} \quad Q_\lambda = \lambda D_k^T D_k. \tag{21}$$

Write $\mathrm{tr}(H)$ as

$$\mathrm{tr}(H) = \mathrm{tr}[(Q_B + Q_\lambda)^{-1} Q_B] = \mathrm{tr}[Q_B^{1/2} (Q_B + Q_\lambda)^{-1} Q_B^{1/2}] = \mathrm{tr}[(I + Q_B^{-1/2} Q_\lambda Q_B^{-1/2})^{-1}]. \tag{22}$$

This can be written as

$$\mathrm{tr}(H) = \mathrm{tr}[(I + \lambda L)^{-1}] = \sum_{j=1}^n \frac{1}{1 + \lambda \gamma_j}, \tag{23}$$

where

$$L = Q_B^{-1/2} D_k^T D_k Q_B^{-1/2} \tag{24}$$

and $\gamma_j$, for $j = 1, \ldots, n$, are the eigenvalues of $L$. Because $k$ eigenvalues of $D_k^T D_k$ are zero, $L$ has $k$ zero eigenvalues. When $\lambda$ is large, only the $k$ terms with $\gamma_j = 0$ contribute to the sum, and thus to the trace of $H$. Hence $\mathrm{tr}(H)$ approaches $k$ for large $\lambda$.
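This limit is easy to observe numerically. A sketch (S-PLUS/R style, with $W = I$ and bspline() as reconstructed in the Appendix; names are illustrative):

    x <- seq(0, 1, length = 100)
    B <- bspline(x, 0, 1, ndx = 10, bdeg = 3)
    n <- ncol(B); k <- 2
    D <- diff(diag(n), differences = k)
    for (lambda in c(0, 1, 100, 1e6)) {
      trH <- sum(diag(solve(crossprod(B) + lambda * crossprod(D), crossprod(B))))
      cat("lambda =", lambda, " tr(H) =", trH, "\n")   # decreases from n toward k = 2
    }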

6. OPTIMAL SMOOTHING, AIC AND CROSS-VALIDATION

Now that we can easily influence the smoothness of a fitted curve with $\lambda$, we need some way to choose an “optimal” value for it. We propose to use the Akaike information criterion (AIC).

The basic idea of AIC is to correct the log-likelihood of a fitted model for the effective number of parameters. An extensive discussion and applications can be found in Sakamoto, Ishiguro and Kitagawa (1986). Instead of the log-likelihood, the deviance is easier to use. The definition of AIC is equivalent to

$$\mathrm{AIC}(\lambda) = \mathrm{dev}(y; a, \lambda) + 2\, \mathrm{dim}(a, \lambda), \tag{25}$$

where $\mathrm{dim}(a, \lambda)$ is the (effective) dimension of the vector of parameters, $a$, and $\mathrm{dev}(y; a, \lambda)$ is the deviance.

Computation of the deviance is straightforward, but how shall we determine the effective dimension of our P-spline fit? We find a solution in Hastie and Tibshirani (1990). They discuss the effective dimensions of linear smoothers and propose to use the trace of the smoother matrix as an approximation. In our case that means $\mathrm{dim}(a) = \mathrm{tr}(H)$. Note that $\mathrm{tr}(H) = n$ when $\lambda = 0$, as in (nonsingular) standard linear regression.

As $\mathrm{tr}(AB) = \mathrm{tr}(BA)$ (for conformable matrices), it is computationally advantageous to use

$$\mathrm{tr}(H) = \mathrm{tr}[B (B^T W B + \lambda D_k^T D_k)^{-1} B^T W] = \mathrm{tr}[(B^T W B + \lambda D_k^T D_k)^{-1} B^T W B]. \tag{26}$$

The latter expression involves only $n$-by-$n$ matrices, whereas $H$ is an $m$-by-$m$ matrix.

In some GLM's, the scale of the data is known, as for counts with a Poisson distribution and for binomial data; then the deviance can be computed directly. For linear data, an estimate of the variance is needed. One approach is to take the variance of the residuals from the $\hat{y}_i$ that are computed when $\lambda = 0$, say, $\hat{\sigma}_0^2$:

$$\mathrm{AIC} = \sum_{i=1}^m \frac{(y_i - \hat{\mu}_i)^2}{\hat{\sigma}_0^2} + 2\, \mathrm{tr}(H) + 2m \ln \hat{\sigma}_0 + m \ln 2\pi. \tag{27}$$

This choice for the variance is rather arbitrary, as it depends on the number of knots. Alternatives can be based on (generalized) cross-validation. For ordinary cross-validation we compute

$$\mathrm{CV}(\lambda) = \sum_{i=1}^m \left( \frac{y_i - \hat{y}_i}{1 - h_{ii}} \right)^2, \tag{28}$$

where the $h_{ii}$ are the diagonal elements of the hat matrix $H$. For generalized cross-validation (Wahba, 1990), we compute

$$\mathrm{GCV}(\lambda) = \sum_{i=1}^m \frac{(y_i - \hat{y}_i)^2}{(m - \sum_{i=1}^m h_{ii})^2}. \tag{29}$$

The difference between both quantities is generally small. The best $\lambda$ is the value that minimizes $\mathrm{CV}(\lambda)$ or $\mathrm{GCV}(\lambda)$. The variance of the residuals at the optimal $\lambda$ is a natural choice to use as an estimate of $\sigma_0^2$ for the computation of $\mathrm{AIC}(\lambda)$. It is practical to work with modified versions of $\mathrm{CV}(\lambda)$ and $\mathrm{GCV}(\lambda)$, with values that can be interpreted as estimates of the cross-validation standard deviation:

$$\mathrm{CV}(\lambda) = \sqrt{\mathrm{CV}(\lambda)/m}; \qquad \mathrm{GCV}(\lambda) = \sqrt{m\, \mathrm{GCV}(\lambda)}. \tag{30}$$

The two terms in $\mathrm{AIC}(\lambda)$ represent the deviance and the trace of the smoother matrix. The latter term, say $T(\lambda) = \mathrm{tr}[H(\lambda)]$, is of interest on its own, because it can be interpreted as the effective dimension of the fitted curve. $T(\lambda)$ is useful to compare fits for different numbers of knots and orders of penalties, whereas $\lambda$ can vary over a large range of values and has no clear intuitive appeal. We will show in an example below that a plot of AIC against $T$ is a useful diagnostic tool.

Table 1
Values of several diagnostics for the motorcycle impact data, for several values of λ

    λ       0.001   0.01    0.1     0.2     0.5     1       2       5       10
    CV      24.77   24.02   23.52   23.37   23.26   23.38   23.90   25.50   27.49
    GCV     25.32   24.93   24.17   23.94   23.74   23.81   24.28   25.87   27.85
    AIC     159.6   156.2   149.0   146.7   144.7   145.4   150.6   169.1   194.3
    tr(H)   21.2    19.4    15.13   13.6    11.7    10.4    9.2     7.7     6.8

In the case of P-splines, the maximum value that $T(\lambda)$ can attain is equal to the number of B-splines (when $\lambda = 0$). The actual maximum depends on the number and the distribution of the data points. The minimum value of $T(\lambda)$ occurs when $\lambda$ goes to infinity; it is equal to the order of the difference penalty. This agrees with the fact that for high values of $\lambda$ the fit of P-splines approaches a polynomial of degree $k - 1$.

7. APPLICATIONS TO GENERALIZED LINEAR MODELLING

In this section we apply P-splines to a number of nonparametric modelling situations, with normal as well as nonnormal data.

First we look at a problem with additive errors. Silverman (1985) used motorcycle crash helmet impact data to illustrate smoothing of a scatterplot with splines; the data can be found in Härdle (1990) and (also on diskette) in Hand et al. (1994). The data give head acceleration in units of $g$, at different times after impact in simulated accidents. We smooth with B-splines of degree 3 and a second-order penalty. The chosen knots divide the domain of $x$ (0–60) into 20 intervals of equal width. When we vary $\lambda$ on an approximately geometric grid, we get the results in Table 1, where $\hat{\sigma}_0$ is computed from $\mathrm{GCV}(\lambda)$ at the optimal value of $\lambda$. At the optimal value of $\lambda$ as determined by GCV, we get the results as plotted in Figure 2.

It is interesting to note that the amount of work to investigate several values of $\lambda$ is largely independent of the number of data points when using GCV. The system to be solved is

$$(B^T B + \lambda D_k^T D_k)\, a = B^T y. \tag{31}$$

The sum of squares is

$$S = \|y - Ba\|^2 = y^T y - 2 a^T B^T y + a^T B^T B a. \tag{32}$$

So $B^T B$ and $B^T y$ have to be computed only once. The hat matrix $H$ is $m$ by $m$, but for its trace we found an expression in (26) that involves only $B^T B$ and $D_k^T D_k$. So we do not need the original data for cross-validation at any value of $\lambda$.
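A sketch of such a scan (S-PLUS/R style; cv.profile and all other names are illustrative, not from the paper). It forms $B^TB$ and $B^Ty$ once and then evaluates (28)–(30) and $\mathrm{tr}(H)$ for each $\lambda$; the hat diagonal in (28) does still use the basis matrix itself:

    cv.profile <- function(B, y, k = 2, lambdas = 10^seq(-3, 1, by = 0.5)) {
      m <- length(y); n <- ncol(B)
      D   <- diff(diag(n), differences = k)
      Pen <- crossprod(D)
      BtB <- crossprod(B); Bty <- crossprod(B, y)    # computed only once
      out <- NULL
      for (lambda in lambdas) {
        Ainv <- solve(BtB + lambda * Pen)
        a    <- Ainv %*% Bty
        yhat <- as.vector(B %*% a)
        hii  <- rowSums((B %*% Ainv) * B)            # diagonal of the hat matrix
        cv   <- sqrt(mean(((y - yhat) / (1 - hii))^2))
        gcv  <- sqrt(m * sum((y - yhat)^2) / (m - sum(hii))^2)
        out  <- rbind(out, c(lambda, cv, gcv, sum(hii)))
      }
      dimnames(out) <- list(NULL, c("lambda", "CV", "GCV", "trH"))
      out
    }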

Our second example concerns logistic regression. The model is

$$\ln\left( \frac{p_i}{1 - p_i} \right) = \eta_i = \sum_{j=1}^n a_j B_j(x_i). \tag{33}$$

The observations are triples $(x_i, t_i, y_i)$, where $t_i$ is the number of individuals under study at dose $x_i$, and $y_i$ is the number of “successes.” We assume that $y_i$ has a binomial distribution with probability $p_i$ and $t_i$ trials. The expected value of $y_i$ is $t_i p_i$ and the variance is $t_i p_i (1 - p_i)$.

Figure 3 shows data from Ashford and Walker (1972) on the numbers of Trypanosome organisms killed at different doses of a certain poison. The data points and two fitted curves are shown. For the thick-line curve $\lambda = 1$ and AIC = 13.4; this value of $\lambda$ is optimal for the chosen B-splines of degree 3 and a penalty of order 2. The thin-line curve shows the fit for $\lambda = 10^8$ (AIC = 27.8). With a second-order penalty, this is essentially a logistic fit.

Figure 4 shows curves of $\mathrm{AIC}(\lambda)$ against $T(\lambda)$ at different values of $k$, the order of the penalty. We find that $k = 3$ can give a lower value of AIC (for $\lambda = 5$, AIC = 11.8). For $k = 4$ we find that a very high value of $\lambda$ is allowed; then AIC = 11.4, hardly different from the lowest possible value (11.1). A large value of $\lambda$ with a fourth-order penalty means that effectively the fitted curve for $\eta$ is a third-degree polynomial. The limit of the fit with P-splines thus indicates a cubic logistic fit as a good parametric model. Here we have seen an application where a fourth-order penalty is useful.

Fig. 2. Motorcycle crash helmet impact data: optimal fit with B-splines of third degree, a second-order penalty and λ = 0.5.

Fig. 3. Nonparametric logistic regression of Trypanosome data: P-splines of order 3 with 13 knots, difference penalty of order 2; λ = 1 and AIC = 13.4 (thick line); the thin line is effectively the logistic fit (λ = 10^8 and AIC = 27.8).

Fig. 4. AIC(λ) versus T(λ), the effective dimension, for several orders of the penalty k.

Our third example is a time series of counts $y_i$, which we will model with a Poisson distribution with smoothly changing expectation:

$$\ln \mu_i = \eta_i = \sum_{j=1}^n a_j B_j(x_i). \tag{34}$$

In this special case the $x_i$ are equidistant, but this is immaterial. Figure 5 shows the numbers of disasters in British coal mines for the years 1850–1962, as presented in Diggle and Marron (1988). The counts are drawn as narrow vertical bars; the line is the fitted trend. The number of intervals is 20, the B-splines have degree 3 and the order of the penalty is 2. An optimal value of $\lambda$ was sought on the approximately geometric grid 1, 2, 5, 10 and so on. The minimum of AIC (126.0) was found for $\lambda = 1{,}000$.

The raw data of the coal mining accidents presumably were the dates on which they occurred. So the data we use here are in fact a histogram with one-year-wide bins. With events on a time scale it seems natural to smooth counts over intervals, but the same idea applies to any form of histogram (bin counts) or density smoothing. This was already noted by Diggle and Marron (1988). In the next section we take a detailed look at density smoothing with P-splines.

    8. DENSITY SMOOTHING

In the preceding section we noted that a time series of counts is just a histogram on the time axis. Any other histogram might be smoothed in the same way. However, it is our experience that this idea is hard to swallow for many colleagues. They see the construction of a frequency histogram as an unallowable discretization of the data and as a prelude to disaster. Perhaps this feeling stems from the well-known fact that maximum likelihood estimation of histograms leads to pathological results, namely, delta functions at the observations (Scott, 1992). However, if we optimize a penalized likelihood, we arrive at stable and very useful results, as we will show below.

Fig. 5. Numbers of severe accidents in British coal mines: number per year shown as vertical lines; fitted trend of the expectation of the Poisson distribution; B-splines of degree 3; penalty of order 3; 20 intervals between 1850 and 1970; λ = 1,000 and AIC = 126.0.

Let $y_i$, $i = 1, \ldots, m$, be a histogram. Let the origin of $x$ be chosen in such a way that the midpoints of the bins are $x_i = ih$; thus $y_i$ is the number of raw observations with $x_i - h/2 \leq x < x_i + h/2$. If $p_i$ is the probability of finding a raw observation in cell $i$, then the likelihood of the given histogram is proportional to the multinomial likelihood $\prod_{i=1}^m p_i^{y_i}$. Equivalently (see Bishop, Fienberg and Holland, 1975, Chapter 13), one can work with the likelihood of $m$ Poisson distributions with expectations $\mu_i = p_i y_+$, where $y_+ = \sum_{i=1}^m y_i$.

To smooth the histogram, we again use a generalized linear model with the canonical log link (which guarantees positive $\mu$):

$$\ln \mu_i = \eta_i = \sum_{j=1}^n a_j B_j(x_i) \tag{35}$$

and construct the penalized log-likelihood

$$L = \sum_{i=1}^m y_i \ln \mu_i - \sum_{i=1}^m \mu_i - \frac{\lambda}{2} \sum_{j=k+1}^n (\Delta^k a_j)^2, \tag{36}$$

with $n$ a suitable (i.e., relatively large) number of knots for the B-splines. The penalized likelihood equations follow from the maximization of $L$:

$$\sum_{i=1}^m (y_i - \mu_i) B_j(x_i) = \lambda \sum_{l=1}^n d_{jl} a_l, \tag{37}$$

where the $d_{jl}$ are the elements of $D_k^T D_k$. These equations are solved with iteratively reweighted regression, as described in Section 4.

Now we let $h$, the width of the cells of the histogram, shrink to a very small value. If the raw data are given to infinite precision, we will eventually arrive at a situation in which each cell of the histogram has at most one observation. In other words, we have a very large number ($m$) of cells, of which $y_+$ are 1 and all others 0. Let $I$ be the set of indices of cells for which $y_i = 1$. Then

$$\sum_{i=1}^m y_i B_j(x_i) = \sum_{i \in I} B_j(x_i). \tag{38}$$

If the raw observations are $u_t$ for $t = 1, \ldots, r$, with $r = y_+$, then we can write

$$\sum_{i \in I} B_j(x_i) = \sum_{t=1}^r B_j(u_t) = B_j^+, \tag{39}$$

and the penalized likelihood equations in (37) change to

$$B_j^+ - \sum_{i=1}^m \mu_i B_j(x_i) = \lambda \sum_{l=1}^n d_{jl} a_l. \tag{40}$$

For any $j$, the first term on the left-hand side of (40) can be interpreted as the “empirical sum” of B-spline $j$, while the second term on the left can be interpreted as the “expected sum” of that B-spline for the fitted density. When $\lambda = 0$, these terms have to be equal to each other for each $j$.

Note that the second term on the left-hand side of (40) is in fact a numerical approximation of an integral:

$$\sum_{i=1}^m \mu_i B_j(x_i) / y_+ \approx \int_{x_{\min}}^{x_{\max}} B_j(x) \exp\Big\{ \sum_{l=1}^n a_l B_l(x) \Big\}\, dx. \tag{41}$$

The smaller $h$ (the larger $m$), the better the approximation. In other words: the discretization is only needed to solve an integral numerically for which, as far as we know, no closed form solution exists. For practical purposes the simple sum is sufficient, but a more sophisticated integration scheme is possible. Note that the sums to calculate $B_j^+$ involve all raw observations, but in fact at each of these only $q + 1$ terms $B_j(u_t)$ add to their corresponding $B_j^+$.

Table 2
The value of AIC at several values of λ for the Old Faithful density estimate

    λ     0.001   0.01    0.02    0.05    0.1     0.2     0.5     1       10
    AIC   50.79   48.21   47.67   47.37   47.70   48.61   50.59   52.81   65.66
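Once the basis function of the Appendix is available, the sufficient statistics are one line of code. A sketch (S-PLUS/R style; faithful$eruptions is R's built-in Old Faithful data set of 272 eruptions, used here only for illustration, not the 107-observation set analyzed below):

    u <- faithful$eruptions                              # raw observations u_t
    Bu <- bspline(u, xl = 0, xr = 6, ndx = 20, bdeg = 3)
    Bplus <- colSums(Bu)                                 # B_j^+ = sum_t B_j(u_t)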

The necessary computations can be done in terms of the sufficient statistics $B_j^+$: we have seen their role in the penalized likelihood equations above. But also the deviance, and thus AIC, can be computed directly:

$$\mathrm{dev}(y; a) = 2 \sum_{i=1}^m y_i \ln(y_i / \mu_i) = 2 \sum_{i=1}^m y_i \ln y_i - 2 \sum_{i=1}^m y_i \sum_{j=1}^n a_j B_j(x_i) = 2 \sum_{i=1}^m y_i \ln y_i - 2 \sum_{j=1}^n a_j B_j^+. \tag{42}$$

In the extreme case, when the $y_i$ are either 0 or 1, the term $\sum y_i \ln y_i$ vanishes. In any case it is independent of the fitted density.

The density smoother with P-splines is very attractive: the estimated density is positive and continuous, it can be described relatively parsimoniously in terms of the coefficients of the B-splines, and it is a proper density. Moments are conserved, as follows from (19). This implies that with third-degree B-splines and a third-order penalty, mean and variance of the estimated distribution are equal to those of the raw data, whatever the amount of smoothing; the limit for high $\lambda$ is a normal distribution.

The P-spline density smoother is not troubled by boundary effects, as for instance kernel smoothers are. Marron and Ruppert (1994) give examples and a rather complicated remedy, based on transformations. With P-splines no special precautions are necessary, but it is important to specify the domain of the data correctly. We will present an example below.

We now take as a first example a data set from Silverman (1986). The data are durations of 107 eruptions of the Old Faithful geyser. Third-degree B-splines were used, with a third-order penalty. The domain from 0 to 6 was divided into 20 intervals to determine the knots. In Figure 6 two fits are shown, for $\lambda = 0.001$ and for $\lambda = 0.05$. The latter value gives the minimum of AIC, as Table 2 shows. We see that of the two clearly separated humps, the right one seems to be a mixture of two peaks.

The second example also comes from Silverman (1986). The data are lengths of spells of psychiatric treatments in a suicide study. Figure 7 shows the raw data and the estimated density when the domain is chosen from 0 to 1,000. Third-degree B-splines were used, with a second-order penalty. A fairly large amount of smoothing ($\lambda = 100$) is indicated by AIC; the fitted density is nearly exponential. In fact, if one considers only the domain from 0 to 500, then $\lambda$ can become arbitrarily large and a pure exponential density results. However, if we choose the domain from $-200$ to 800 we get a quite different fit, as Figure 8 shows. By extending the domain we force the estimated density also to cover negative values of $x$, where there are no data (which means zero counts). Consequently, it has to drop toward zero, missing the peak for small positive values. The optimal value of $\lambda$ now is 0.01 and a much more wiggly fit results, with an appreciably higher value of AIC. This nicely illustrates how, with a proper choice of the domain, the P-spline density smoother can be free from the boundary effects that give so much trouble with kernel smoothers.

    9. DISCUSSION

We believe that P-splines come near to being the ideal smoother. With their grounding in classic regression methods and generalized linear models, their properties are easy to verify and understand. Moments of the data are conserved and the limiting behavior with a strong penalty is well defined and gives a connection to polynomial models. Boundary effects do not occur if the domain of the data is properly specified.

The necessary computations, including cross-validation, are comparable in size to those for a medium-sized regression problem. The regression context makes it natural to extend P-splines to semiparametric models, in which additional explanatory variables occur. The computed fit is described compactly by the coefficients of the B-splines.

Fig. 6. Density smoothing of durations of Old Faithful geyser eruptions: density histogram and fitted densities; thin line, third-order penalty with λ = 0.001, AIC = 84.05; thick line, optimal λ = 0.05, with AIC = 80.17; B-splines of degree 3 with 20 intervals on the domain from 1 to 6.

Fig. 7. Density smoothing of suicide data: positive domain (0–1,000); B-splines of degree 3; penalty of order 2; 20 intervals; λ = 100; AIC = 69.9.

Fig. 8. Density smoothing of suicide data: the domain includes negative values (−200 to 800); B-splines of degree 3; penalty of order 2; 20 intervals; λ = 0.01; AIC = 83.6.

P-splines can be very useful in (generalized) additive models. For each dimension a B-spline basis and a penalty are introduced. With $n$ knots in each base and $d$ dimensions, a system of $nd$-by-$nd$ (weighted) regression equations results. Backfitting, the iterative smoothing for each separate dimension, is eliminated. We have reported on this application elsewhere (Marx and Eilers, 1994, 1996).

Penalized likelihood is a subject with a growing popularity. We already mentioned the work of O'Sullivan. In the book by Green and Silverman (1994), many applications and references can be found. Almost exclusively, penalties are defined in terms of the square of the second derivative of the fitted curve. Generalizations to penalties on higher derivatives have been mentioned in the literature, but to our knowledge, practical applications are very rare. The shift from the continuous penalty to the discrete penalty in terms of the coefficients of the B-splines is not spectacular in itself. But we have seen that it leads to very useful results, while giving a mechanical way to work with higher-order penalties. The modelling of binomial dose–response in Section 7 showed the usefulness of higher-order penalties.

A remarkable property of AIC is that it is easier to compute it for certain nonnormal distributions, like the Poisson and binomial, than for normal distributions. This is so because for these distributions the relationship between mean and variance is known. We should warn the reader that AIC may lead to undersmoothing when the data are overdispersed, since the assumed variance of the data may then be too low. We are presently investigating smoothing with P-splines and overdispersed distributions like the negative binomial and the beta-binomial. Also ideas of quasilikelihood will be incorporated.

We have paid extra attention to density smoothing, because we feel that in this area the advantages of P-splines really shine. Traditionally, kernel smoothers have been popular in this field, but they inflate the variance and have trouble with boundaries of data domains; their computation is expensive, cross-validation even more so, and one cannot report an estimated density in a compact way.

Possibly kernel smoothers still have advantages in two or more dimensions, but it seems that P-splines can also be used for two-dimensional smoothing with Kronecker products of B-splines. With a grid of, say, 10 by 10 knots and a third-order penalty, a system of 130 equations results, with a half bandwidth of approximately 30. This can easily be handled on a personal computer. The automatic construction of the equations will be more difficult than in one dimension. First experiments with this approach look promising; we will report on them in due time.

We have not touched on many obvious and interesting extensions to P-splines. Robustness can be obtained with any nonlinear reweighting scheme that can be used with regression models. Circular domains can be handled by wrapping the B-splines and the penalty around the origin. The penalty can be extended with weights, to give a fit with nonconstant stiffness. In this way it will be easy to specify a varying stiffness, but it is quite another matter to estimate the weights from the data.

Finally, we would like to remark that P-splines form a bridge between the purely discrete smoothing problem, as set forth originally by Whittaker (1923), and continuous smoothing. B-splines of degree zero are constant on an interval between two knots, and zero elsewhere; they have no overlap. Thus the fitted function gives for each interval the value of the coefficient of the corresponding B-spline.

    APPENDIX: COMPUTATIONAL DETAILS

Here we look at the computation of B-splines and derivatives of the penalty. We use S-PLUS and MATLAB as example languages because of their widespread use. Also we give some impressions of the speed of the computations.

In the linear case we have to solve the system of equations

$$(B^T B + \lambda D_k^T D_k)\, \hat{a} = B^T y \tag{43}$$

and to compute $\|y - B\hat{a}\|^2$ and $\mathrm{tr}[(B^T B + \lambda D^T D)^{-1} B^T B]$. We need a function to compute $B$, the B-spline basis matrix. In S-PLUS, this is a simple matter, as there is a built-in function spline.des() that computes (derivatives of) B-splines. We only have to construct the sequence of knots. Let us assume that xl is the left of the $x$-domain, xr the right, and that there are ndx intervals on that domain. To compute $B$ for a given vector x, based on B-splines of degree bdeg, we can use the following function:
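A minimal version of such a function, consistent with the description above (a sketch; not necessarily the paper's original listing):

    bspline <- function(x, xl, xr, ndx, bdeg) {
      dx <- (xr - xl) / ndx
      knots <- seq(xl - bdeg * dx, xr + bdeg * dx, by = dx)  # equidistant knots
      spline.des(knots, x, bdeg + 1, 0 * x)$design           # B-spline basis matrix
    }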

With bspline() available, the remaining steps are those of Sections 4 and 6: solve (43) for $\hat{a}$, form the fitted values yhat, and compute the residual sum of squares and $\mathrm{tr}(H)$ from (32) and (26).
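A sketch of these steps (S-PLUS/R style; psfit and the other names are illustrative, not from the paper):

    psfit <- function(x, y, xl, xr, ndx, bdeg = 3, pord = 2, lambda = 1) {
      B <- bspline(x, xl, xr, ndx, bdeg)
      n <- ncol(B)
      D <- diff(diag(n), differences = pord)
      A <- crossprod(B) + lambda * crossprod(D)    # B'B + lambda D'D
      a <- solve(A, crossprod(B, y))
      yhat <- as.vector(B %*% a)
      trH <- sum(diag(solve(A, crossprod(B))))     # as in (26), with W = I
      list(a = as.vector(a), yhat = yhat, rss = sum((y - yhat)^2), trH = trH)
    }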

    REFERENCES

Ashford, R. and Walker, P. J. (1972). Quantal response analysis for a mixture of populations. Biometrics 28 981–988.

Bishop, Y. M. M., Fienberg, S. E. and Holland, P. W. (1975). Discrete Multivariate Analysis: Theory and Practice. MIT Press, Cambridge, MA.

Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. J. Amer. Statist. Assoc. 74 829–836.

Cox, M. G. (1981). Practical spline approximation. In Topics in Numerical Analysis (P. R. Turner, ed.). Springer, Berlin.

de Boor, C. (1977). Package for calculating with B-splines. SIAM J. Numer. Anal. 14 441–472.

de Boor, C. (1978). A Practical Guide to Splines. Springer, Berlin.

Dierckx, P. (1993). Curve and Surface Fitting with Splines. Clarendon, Oxford.

Diggle, P. and Marron, J. S. (1988). Equivalence of smoothing parameter selectors in density and intensity estimation. J. Amer. Statist. Assoc. 83 793–800.

Eilers, P. H. C. (1990). Smoothing and interpolation with generalized linear models. Quaderni di Statistica e Matematica Applicata alle Scienze Economico-Sociali 12 21–32.

Eilers, P. H. C. (1991a). Penalized regression in action: estimating pollution roses from daily averages. Environmetrics 2 25–48.

Eilers, P. H. C. (1991b). Nonparametric density estimation with grouped observations. Statist. Neerlandica 45 255–270.

Eilers, P. H. C. (1995). Indirect observations, composite link models and penalized likelihood. In Statistical Modelling (G. U. H. Seeber et al., eds.). Springer, New York.

Eilers, P. H. C. and Marx, B. D. (1992). Generalized linear models with P-splines. In Advances in GLIM and Statistical Modelling (L. Fahrmeir et al., eds.). Springer, New York.

Eubank, R. L. (1988). Spline Smoothing and Nonparametric Regression. Dekker, New York.

Friedman, J. and Silverman, B. W. (1989). Flexible parsimonious smoothing and additive modeling (with discussion). Technometrics 31 3–39.

Green, P. J. and Silverman, B. W. (1994). Nonparametric Regression and Generalized Linear Models. Chapman and Hall, London.

Green, P. J. and Yandell, B. S. (1985). Semi-parametric generalized linear models. In Generalized Linear Models (B. Gilchrist et al., eds.). Springer, New York.

Hand, D. J., Daly, F., Lunn, A. D., McConway, K. J. and Ostrowski, E. (1994). A Handbook of Small Data Sets. Chapman and Hall, London.

Härdle, W. (1990). Applied Nonparametric Regression. Cambridge Univ. Press.

Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. Chapman and Hall, London.

Kooperberg, C. and Stone, C. J. (1991). A study of logspline density estimation. Comput. Statist. Data Anal. 12 327–347.

Kooperberg, C. and Stone, C. J. (1992). Logspline density estimation for censored data. J. Comput. Graph. Statist. 1 301–328.

Marron, J. S. and Ruppert, D. (1994). Transformations to reduce boundary bias in kernel density estimation. J. Roy. Statist. Soc. Ser. B 56 653–671.

Marx, B. D. and Eilers, P. H. C. (1994). Direct generalized additive modelling with penalized likelihood. Paper presented at the 9th Workshop on Statistical Modelling, Exeter, 1994.

Marx, B. D. and Eilers, P. H. C. (1996). Direct generalized additive modelling with penalized likelihood. Unpublished manuscript.

McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, 2nd ed. Chapman and Hall, London.

O'Sullivan, F. (1986). A statistical perspective on ill-posed inverse problems (with discussion). Statist. Sci. 1 505–527.

O'Sullivan, F. (1988). Fast computation of fully automated log-density and log-hazard estimators. SIAM J. Sci. Statist. Comput. 9 363–379.

Reinsch, C. (1967). Smoothing by spline functions. Numer. Math. 10 177–183.

Sakamoto, Y., Ishiguro, M. and Kitagawa, G. (1986). Akaike Information Criterion Statistics. Reidel, Dordrecht.

Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley, New York.

Silverman, B. W. (1985). Some aspects of the spline smoothing approach to nonparametric regression curve fitting (with discussion). J. Roy. Statist. Soc. Ser. B 47 1–52.

Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.

Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia.

Wand, M. P. and Jones, M. C. (1993). Kernel Smoothing. Chapman and Hall, London.

Whittaker, E. T. (1923). On a new method of graduation. Proc. Edinburgh Math. Soc. 41 63–75.

Comment

S-T. Chiu

Authors Paul Eilers and Brian Marx provide a very interesting approach to nonparametric curve fitting. They give a brief but very concise review of B-splines. I also enjoyed reading the part where the authors applied their procedure to some examples. As shown in the paper, the approach has several merits which deserve to be studied in more detail.

S-T. Chiu is with the Department of Statistics, Colorado State University, Fort Collins, Colorado 80523-0001.

Similar to any nonparametric smoother, the proposed procedure needs a smoothing parameter $\lambda$ to control the smoothness of the fitted curve. My comments mainly concern the selection of the smoothing parameter.

It is well known that the classical selectors such as AIC, GCV, Mallows's $C_p$ and so on do not give a satisfactory result. For the regression case, more details about the defects can be found in Rice (1984) and Chiu (1991a). Scott and Terrell (1987) and Chiu (1991b) discuss the case of density estimation. The classical selectors have a large sample variation and a tendency to select a small smoothing parameter, thus producing a very rough curve estimate. It is natural to expect that they have a similar problem when applied to selecting the smoothing parameter for P-splines.

Several procedures have been suggested to remedy the defects of the classical procedures. Chiu (1996) provides a survey of some of these newer selectors for density estimation. For the regression case, some procedures are suggested in Chiu (1991a), Hall and Johnstone (1992) and Hall, Marron and Park (1992).

In the following, I provide a brief review to explain the defects of, and some remedies for, the classical selectors for the kernel regression estimate. Let us assume the simplest model of a circular design with equally spaced design points: $y_t = \mu(x_t) + \varepsilon_t$, where the $\varepsilon_t$ are i.i.d. noise. For the kernel estimate $\hat{\mu}_\beta$ with a bandwidth $\beta$, we often use the mean sum of squared errors

$$R(\beta) = E\Big[ \sum_t \{\hat{\mu}_\beta(x_t) - \mu(x_t)\}^2 \Big] \tag{1}$$

to measure the closeness between $\hat{\mu}(x)$ and $\mu(x)$.

The goal of bandwidth selection is to select the optimal bandwidth which minimizes $R(\beta)$. Since in practice $\mu$ is unknown, we have to estimate $R(\beta)$ and use the minimizer of the estimated $R(\beta)$ as an estimate of the optimal bandwidth. For example, Mallows's $C_p$ has the form

$$\hat{R}(\beta) = \mathrm{RSS}(\beta) - T\sigma^2 + 2\sigma^2 w(0)/\beta. \tag{2}$$

Here $w(x)$ is the kernel and $\sigma^2$ is the error variance. Other classical procedures such as AIC and GCV have a similar form and were shown to be asymptotically equivalent in Rice (1984). All of these procedures rely on the residual sum of squares $\mathrm{RSS}(\beta)$.

Mallows (1973) proposed the procedure based on the observation that

$$R(\beta) = E[\mathrm{RSS}(\beta)] - T\sigma^2 + 2\sigma^2 w(0)/\beta.$$

As we will explain later, the main problem here is that $\mathrm{RSS}(\beta)$ is not a good estimate of its expected value.

By using the Fourier transform, (1) and (2) could be written, respectively, as

$$R(\beta) = 4\pi \sum_{j=1}^N I_S(\lambda_j)\{1 - W_\beta(\lambda_j)\}^2 + \sigma^2 \sum_{j=1}^N W_\beta(\lambda_j)^2 + \sigma^2 \tag{3}$$

and

$$\hat{R}(\beta) = 4\pi \sum_{j=1}^N \Big\{ I_Y(\lambda_j) - \frac{\sigma^2}{2\pi} \Big\}\{1 - W_\beta(\lambda_j)\}^2 + \sigma^2 \sum_{j=1}^N W_\beta(\lambda_j)^2 + \sigma^2, \tag{4}$$

where $I_Y$ and $I_S$ are the periodograms of $y_t$ and of the signal $S_t = \mu(t/T)$, respectively, and $\lambda_j = 2\pi j/T$, $j = 1, \ldots, N = T/2$. Also, $W_\beta(\lambda)$ is the transfer function of $w(t/(\beta T))/(\beta T)$.

Comparing (3) and (4), we see that $\hat{R}$ attempts to use $I_Y(\lambda) - \sigma^2/(2\pi)$ to estimate $I_S(\lambda)$. The difficulty is that at high frequency, $I_Y$ is dominated by the noise and thus does not give a good estimate of $I_S$.

Chiu (1991a) suggested truncating the high-frequency portion when we estimate $R(\beta)$:

$$\tilde{R}(\beta) = 4\pi \sum_{j=1}^J \Big\{ I_Y(\lambda_j) - \frac{\sigma^2}{2\pi} \Big\}\{1 - W_\beta(\lambda_j)\}^2 + \sigma^2 \sum_{j=1}^N W_\beta(\lambda_j)^2 + \sigma^2. \tag{5}$$

Here $J$ is selected in such a way that there is no significant $I_S$ beyond frequency $\lambda_J$. The selector $\tilde{R}(\beta)$ has a much better performance than the classical ones. Hall, Marron and Park (1992) proposed another procedure which downweights the contribution from the high-frequency part.

It is clear that the bases of the kernel regression are the sinusoid waves. The primary reason for the success of criterion (5) is that most information about $\mu$ concentrates at low frequency. In other words, we need just a few bases to approximate the true curve well.

However, since each basis function of the B-spline basis is local to a certain interval, we cannot use just a few basis functions to approximate the curve over the whole region. In my opinion, this could be a big obstacle to the understanding and improvement of the classical smoothing parameter selectors.

    REFERENCES

Chiu, S.-T. (1991a). Some stabilized bandwidth selectors for nonparametric regression. Ann. Statist. 19 1528–1546.

Chiu, S.-T. (1991b). Bandwidth selection for kernel density estimation. Ann. Statist. 19 1883–1905.

Chiu, S.-T. (1996). A comparative review of bandwidth selection for kernel density estimation. Statist. Sinica 6 129–145.

Hall, P. and Johnstone, I. (1992). Empirical functionals and efficient smoothing parameter selection. J. Roy. Statist. Soc. Ser. B 54 519–521.

Hall, P., Marron, J. S. and Park, B. U. (1992). Smoothed cross-validation. Probab. Theory Related Fields 92 1–20.

Mallows, C. (1973). Some comments on $C_p$. Technometrics 15 661–675.

Rice, J. (1984). Bandwidth choice for nonparametric regression. Ann. Statist. 12 1215–1230.

Scott, D. W. and Terrell, G. R. (1987). Biased and unbiased cross-validation in density estimation. J. Amer. Statist. Assoc. 82 1131–1146.

Comment

Douglas Nychka and David Cummins

    One strength of the authors’s presentation is thesimple ridge regression formulas that result for theestimator. We would like to point out a decompo-sition using a different set of basis functions thathelps to interpret this smoother. This alternativebasis, derived from B-splines, facilitates the compu-tation of the GCV function and confidence bands forthe estimated curve.

To simplify this discussion, assume that W = I, so that the hat matrix is

H = B(B^T B + \lambda D^T D)^{-1} B^T = G(I + \lambda\Gamma)^{-1} G^T,

where G = B Q_2^{-1/2} U, Q_2 = B^T B, \Gamma = \mathrm{diag}(\gamma) and U is an orthogonal matrix such that Q_2^{-1/2} D^T D\, Q_2^{-1/2} = U \Gamma U^T. The columns of G can be identified with a new set of functions known as the Demmler–Reinsch (DR) basis. Specifically, these are piecewise polynomial functions ψ_ν such that the elements of G satisfy ψ_ν(x_i) = G_{iν}. Besides having useful orthogonality properties, the DR basis can be ordered by frequency, and basis functions with larger values of γ_ν will exhibit more oscillations (in fact, ν − 1 zero crossings). Figure 1(a) plots several of the basis functions for m = 133 equally spaced x's and 20 equally spaced interior knots. Figure 1(b) illustrates the expected polynomial increase in the size of γ_ν as a function of ν.
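For readers who want to reproduce this decomposition numerically, a few lines of standard linear algebra suffice; this sketch is mine, with the B-spline basis matrix B and difference matrix D assumed to be given.

```python
import numpy as np

def demmler_reinsch(B, D):
    """Demmler-Reinsch basis G and eigenvalues gamma such that
    H = B (B'B + lam D'D)^{-1} B' = G (I + lam*diag(gamma))^{-1} G'.
    A minimal sketch; assumes B'B is well conditioned."""
    Q2 = B.T @ B
    w, V = np.linalg.eigh(Q2)
    Q2_inv_sqrt = (V / np.sqrt(w)) @ V.T   # symmetric Q2^{-1/2}
    M = Q2_inv_sqrt @ D.T @ D @ Q2_inv_sqrt
    gamma, U = np.linalg.eigh(M)           # gamma in increasing order
    G = B @ Q2_inv_sqrt @ U                # column nu is psi_nu at the x_i
    return G, gamma
```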

The Demmler–Reinsch basis provides an informative interpretation of the spline estimate. Let f̂ denote the P-spline and let α = G^T y denote the least squares coefficients from regressing y on the DR basis functions:

\hat{f}(x_i) = (Hy)_i = [G(I + \lambda\Gamma)^{-1} G^T y]_i = \sum_{\nu=1}^{m} \psi_\nu(x_i)\, \frac{\alpha_\nu}{1 + \lambda\gamma_\nu}.

Note that the smoother is just a linear combination of the DR basis functions using coefficients that are downweighted (or tapered) by the factor 1/(1 + λγ_ν) from the least squares estimates. Because of the relationship between γ_ν and ψ_ν (see Figure 1), the basis functions that represent higher-frequency structure will have coefficients that are more severely downweighted. In this way the smoother is a low-pass filter, tending to preserve low-frequency structure and downweighting higher-frequency terms. The residual sum of squares and the trace of H can be computed rapidly (order n) using the DR representation. Thus the GCV function can also be evaluated in order n operations for a given value of λ.
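A sketch of that computation (my own naming, building on demmler_reinsch above): because G has orthonormal columns, both the residual sum of squares and tr(H) reduce to sums over the tapered coefficients.

```python
import numpy as np

def pspline_dr_fit(G, gamma, y, lam):
    """Tapered fit and GCV score from the DR form (sketch).
    G'G = I, so RSS and tr(H) need only O(n) work per lambda."""
    alpha = G.T @ y                       # least squares DR coefficients
    s = 1.0 / (1.0 + lam * gamma)         # taper factor per basis function
    fhat = G @ (s * alpha)                # smoothed values at the x_i
    rss = y @ y - np.sum((2 * s - s**2) * alpha**2)  # ||y - fhat||^2
    tr_H = s.sum()                        # trace of the hat matrix
    m = len(y)
    return fhat, m * rss / (m - tr_H)**2  # (fit, GCV score)
```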

Another application of the DR form is in computing a confidence band. Consider a set of candidate functions that contains the true function with the correct level of confidence. The confidence band is then the envelope implied by considering all functions in this set. For example, let f̂ denote the function estimate and for C_1, C_2 > 0 let

B = \Bigl\{ h(x) :\ h \text{ is a B-spline with coefficients } b,\ \sum_{i=1}^{n} [\hat{f}(x_i) - h(x_i)]^2 \le C_1 \text{ and } b^T D^T D\, b \le C_2 \Bigr\}.


Fig. 1. Illustration of several Demmler–Reinsch basis functions and the associated eigenvalues for 20 equally spaced knots, 133 equally spaced observations and second divided differences (k = 2): the upper plot (a) is ψ_ν for ν = 3, 5, 10, 15; the numerals identify the order of these basis functions and in the second plot (b) identify the eigenvalues for these functions.

The constants C_1 and C_2 are determined so that P(f ∈ B) equals the desired confidence level. The upper and lower boundaries of the confidence band are then

U(x) = \max\{h(x) :\ h \in B\}

    and

L(x) = \min\{h(x) :\ h \in B\}.

In practice we work with the coefficients, and thus the computation of U and L at each x is an optimization problem with two quadratic constraints. Using the DR basis reduces both constraints to quadratic forms with diagonal matrices, and thus both are computable in order n operations. Moreover, this strategy does not depend on the roughness penalty being divided differences but will work for any nonnegative definite matrix used as a penalty (e.g., thin plate splines). Currently we are investigating the choice of C_1 and C_2 based on the GCV estimate of f.
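A sketch of one way to evaluate the band at a single x (my construction, not the discussants' code): transform to DR coordinates c = U^T Q_2^{1/2} b, in which both constraints become diagonal, and solve the two constrained problems numerically.

```python
import numpy as np
from scipy.optimize import minimize

def band_at_x(psi_x, c_hat, gamma, C1, C2):
    """Lower/upper confidence band values at one x (sketch).
    psi_x: DR basis functions evaluated at x; c_hat: DR coefficients
    of f_hat. Assumes C2 >= c_hat' diag(gamma) c_hat, so c_hat is feasible."""
    cons = [
        {"type": "ineq", "fun": lambda c: C1 - np.sum((c - c_hat) ** 2)},
        {"type": "ineq", "fun": lambda c: C2 - np.sum(gamma * c**2)},
    ]
    hi = minimize(lambda c: -(psi_x @ c), c_hat, method="SLSQP", constraints=cons)
    lo = minimize(lambda c: psi_x @ c, c_hat, method="SLSQP", constraints=cons)
    return lo.fun, -hi.fun   # (L(x), U(x))
```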

    ACKNOWLEDGMENT

    This work was supported by NSF Grant DMS-92-17866.


Comment

Chong Gu

Chong Gu is Assistant Professor, Department of Statistics, Purdue University, West Lafayette, Indiana 47907.

I would like to begin by congratulating the authors Eilers and Marx for a clear exposition of an interesting variant of penalized regression splines. My comments center around three questions: Are P-splines really better? What does optimal smoothing stand for? And what does the future hold for nonparametric function estimation?

    ARE P-SPLINES REALLY BETTER?

P-splines can certainly be as useful as other variants of penalized regression splines, but I am not sure that they are really advantageous over the others. It is true that with huge sample sizes, one may choose n much smaller than m to save on computation without sacrificing performance, but other variants of regression splines also share the same advantage. The mechanical handling of the difference penalty is certainly very interesting computationally, but as far as the end users are concerned, I do not see why the discrete penalties are necessarily advantageous over the continuous ones. Higher-order derivative penalties are certainly as feasible as discrete penalties computationally, albeit more difficult to implement, but the difference is irrelevant to the end users whose main interest is the interface.

The users may be more interested in what the program computes rather than how it computes, however, and in this respect, I only see P-splines lose out to penalized regression splines with the usual derivative penalties that everyone can understand. Being told that B-splines provide a good basis for function approximation, the users may simply ignore whatever other properties B-splines have and still have a clear picture about what they are getting from derivative penalties or, for that matter, from Whittaker's discrete penalties which use the differences of adjacent function values. With the P-splines, however, the intuition is unfortunately taken away from the users, and even with a thorough knowledge of all the properties of B-splines, I am not sure one can easily perceive what the penalty is really doing, other than that it is reducing the effective dimension in some not so easily comprehensible way.

Penalized smoothers with quadratic penalties are known to be equivalent to Bayes estimates with Gaussian priors. When Q = D_k^T D_k is of full rank, the corresponding prior for the B-spline coefficients a has mean 0 and covariance proportional to Q^{-1}. When Q is rank-deficient, the prior has a "fixed effect" component diffuse in the null space of Q and a "random effect" component with mean 0 and covariance proportional to Q^+, the Moore–Penrose inverse of Q. From this perspective, P-splines differ from other variants of penalized regression splines only in the specification of Q.
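To make the correspondence tangible, here is a small sketch (mine, not Gu's) that draws coefficient vectors from the prior implied by a rank-deficient Q = D_k^T D_k, using the Moore–Penrose inverse for the random-effect part and a wide Gaussian as a stand-in for the diffuse fixed-effect part; all parameter values are illustrative.

```python
import numpy as np

def sample_prior_coefficients(n, k=2, tau2=1.0, diffuse_sd=10.0, seed=None):
    """Draw B-spline coefficients a from the Gaussian prior implied by
    the penalty Q = Dk'Dk (sketch; all parameter values illustrative)."""
    rng = np.random.default_rng(seed)
    D = np.diff(np.eye(n), k, axis=0)          # k-th order difference matrix
    Q = D.T @ D
    Q_plus = np.linalg.pinv(Q)                 # Moore-Penrose inverse of Q
    # "random effect": mean 0, covariance proportional to Q^+
    a = rng.multivariate_normal(np.zeros(n), tau2 * Q_plus, check_valid="ignore")
    # "fixed effect": (nearly) diffuse over the null space of Q, which for
    # k-th differences is spanned by polynomials of degree < k in the index
    null_basis = np.vander(np.linspace(0.0, 1.0, n), k, increasing=True)
    return a + null_basis @ rng.normal(0.0, diffuse_sd, size=k)
```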

WHAT DOES OPTIMAL SMOOTHING STAND FOR?

One probably can never overstate the importance of smoothing parameter selection for any successful practical application of any smoothing method. AIC and cross-validation are among the most accepted (and successful) working criteria for model selection, yet their optimalities are established, theoretically or empirically, only for specific problem settings under appropriate conditions. Naive adaptations of these criteria in new problem settings do not necessarily deliver fits that are nearly optimal.

Specifically, I am somewhat worried about the "optimality" of the naive adaptations of these criteria proclaimed in Section 6. First, it is not clear in what sense these criteria are "optimal" in the problem settings to which they are applied; second, there is no empirical (or theoretical) evidence illustrating the presumed "optimality." AIC or cross-validation may deliver nearly optimal fits, but they surely do not by themselves define the notion of optimality.

My worries stem from previous empirical experiments with smoothing parameter selection by myself and by others, especially in non-Gaussian regression problems (commonly referred to as generalized linear models). Using Kullback–Leibler discrepancy or its symmetrized version to define optimality, it has been found that a naive adaptation of GCV in non-Gaussian regression, which appears similar to what the authors suggest in Section 7, may return anything but nearly optimal fits. See, for example, Cox and Chang (1990), Gu (1992)


and Xiang and Wahba (1996). For the density estimation problem in Section 8, I could not find the definition of the H matrix to understand the AIC proposed, but whatever it is, it should be subject to the same scrutiny before being recommended as "optimal."

In ordinary Gaussian regression, the optimality of GCV is well established in the literature. For the AIC score presented in (27), however, I would like some empirical evidence to be convinced of its optimality. The skepticism is partly due to some empirical evidence suggesting that the trace of H may not be a consistent characterization of the effective dimension of the model. Such evidence can be found in Gu (1996), available online at http://www.stat.lsa.umich.edu/~chong/ps/modl.ps.

WHAT DOES THE FUTURE HOLD FOR FUNCTION ESTIMATION?

In response to Statistical Science's desideration for speculations regarding future research directions, I would like to take this opportunity to offer some of my thoughts.

It has long been said that all smoothing methods perform similarly in one dimension, provided that the smoothing parameter selection is done properly, yet time and again new and not so new methods keep being invented. The real challenge, however, seems to lie in multivariate problems. Amid the curse of dimensionality and potential structures associated with multivariate problems, the choice of methods can make a real difference in multiple dimensions, in the ease of computation and smoothing parameter selection, in the convenience of incorporation of structures, and so on. Among methods with the most potential are the adaptive regression splines developed by Friedman, Stone and co-workers, and the smoothing splines developed by the Wisconsin spline school led by Wahba. The penalized regression spline approach, however, seems somewhat handicapped by the lack of effective bases, say in dimensions beyond two or three.

More challenging still, an important line of research that has been largely neglected is inference. What one usually gets from the function estimation literature are point estimates, possibly with asymptotic convergence rates, and intuitive smoothing parameter selectors not always accompanied by justifications. Besides a few entries based on the Bayes model of smoothing splines by Wahba (1983), Cox, Koh, Wahba and Yandell (1988), Barry (1993) and some follow-ups, practical procedures that offer interval estimates, tests of hypotheses, and so on, are largely missing in the literature. To guard against the danger of overinterpreting data by the use of nonparametric methods, such inferential tools should be a top priority in future research. Under a Bayes model where the target function is treated as a realization of a stochastic process, the development may proceed within the conventional inferential framework. Under the traditional setting where the target function is considered fixed, however, one may have to turn his back on the conventional Neyman–Pearson thinking before he can call any useful inferential tools non-ad-hoc.

    REFERENCES

Barry, D. (1993). Testing for additivity of a regression function. Ann. Statist. 21 235–254.

Cox, D. D. and Chang, Y.-F. (1990). Iterated state space algorithms and cross validation for generalized smoothing splines. Technical Report 49, Dept. Statistics, Univ. Illinois.

Cox, D. D., Koh, E., Wahba, G. and Yandell, B. S. (1988). Testing the (parametric) null model hypothesis in (semiparametric) partial and generalized spline models. Ann. Statist. 16 113–119.

Gu, C. (1992). Cross-validating non-Gaussian data. J. Comput. Graph. Statist. 1 169–179.

Gu, C. (1996). Model indexing and smoothing parameter selection in nonparametric function estimation. Technical Report 93-55 (rev.), Dept. Statistics, Purdue Univ.

Wahba, G. (1983). Bayesian "confidence intervals" for the cross-validated smoothing spline. J. Roy. Statist. Soc. Ser. B 45 133–150.

Xiang, D. and Wahba, G. (1996). A generalized approximate cross validation for smoothing splines with non-Gaussian data. Statist. Sinica. To appear.


Comment

M. C. Jones

Eilers and Marx present a clear and interesting account of their P-spline smoothing methodology. Clearly, P-splines constitute another respectable approach to smoothing. However, their good properties appear to be, broadly, on a par with those of various other approaches; the method is no nearer to, or further from, "being the ideal smoother" than others.

"P-splines have no boundary effects, they are a straightforward extension of (generalized) linear regression models, conserve moments (means, variances) of the data, and have polynomial curve fits as limits." Except for the third point, the same claims can be made of spline smoothing (Green and Silverman, 1994) or local polynomial fitting (Fan and Gijbels, 1996).

Conservation of moments seems unimportant. In regression, I do not see the desirability. In density estimation, simple corrections of kernel density estimates for variance inflation exist, but make little difference away from the normal density (Jones, 1991). Indeed, getting means and variances right is a normality-based concept, so corrected kernel estimators act in a normal-driven semiparametric manner. Efron and Tibshirani (1996) propose more sophisticated moment conservation, but initial indications are that this is no better nor worse than alternative semiparametric density estimators (Hjort, 1996).

"The computations, including those for cross-validation, are relatively inexpensive and easily incorporated into standard software." Again, proponents of the two competing methods I have mentioned would claim the same for the first half of this, and advocates of regression splines would claim the lot.

The authors make no particularly novel contribution to automatic bandwidth selection. Cross-validation and AIC are in a class of methods (e.g., Härdle, 1990, pages 166–167) which, while not being downright bad, allow scope for improvement.

M. C. Jones is Reader in Statistical Science, Department of Statistics, The Open University, Walton Hall, Milton Keynes, MK7 6AA, United Kingdom.

Calculating these bandwidth selectors quickly is less important than developing better selectors. For local polynomials, improvements are offered (for normal errors) by Fan and Gijbels (1995) and Ruppert, Sheather and Wand (1995), and unpublished work extends these to more general situations.

The comparison of (5) with (11) focusses on the small extra complexity of the latter. But which is more interpretable: a roughness penalty on a curve or on a series of coefficients? Changing the penalty in a smoothing spline setup allows different parametric limits (e.g., Ansley, Kohn and Wong, 1993); how can P-splines cope with this?

An exasperating aspect of spline-based approaches is the lack of straightforward (asymptotic) mean squared error–type results to indicate theoretical performance relative to kernel/local polynomial approaches, for which such results are simply obtained and, within limitations, informative. I doubt whether P-splines can facilitate such developments (reason given below).

It seems that P-splines have no particular attractiveness for multivariate applications. The examples are noteworthy only for looking like results obtainable by other methods too.

The idea behind density estimation P-splines is to treat a fine binning as Poisson regression data. OK, but again equally applicable to other approaches and already investigated for local polynomial smoothing. Simonoff (1996, Section 6.4) and Jones (1996) explain how such regression approaches to density estimation are discretized versions of certain "direct" local likelihood density estimation methods (Hjort and Jones, 1996; Loader, 1996). Binning is the major computational device of all kernel-type estimators (Fan and Marron, 1994). The local likelihood approach is already deeply understood theoretically.

Comparison of P-splines' reasonable boundary performance with local polynomials' reasonable boundary performance is not yet available through theory or simulations.

An interesting point mentioned in the paper is the apparent continuum between few-parameter parametric fits at one end and fully "nonparametric" techniques at the other, with many-parameter


parametric models and semiparametric approaches in between: a dichotomy into parametric and nonparametric is inappropriate, and there is a huge grey area of overlap. The equivalent degrees-of-freedom ideas of Hastie and Tibshirani (1990) provide a fine (but possibly improvable?) attempt to give this continuum a scale. Theoretical development might be made more difficult by P-splines for reasons associated with quantifying the "nonparametricness" of intermediate methods.

Finally, we come back to my main point. In an admirable "personal view of smoothing and statistics," Marron (1996) gives a list of smoothing methods and another of factors (to which I might add others) involved in the choice between methods. Marron says "All of the methods … listed … have differing strengths and weaknesses in … divergent senses. None of these methods dominates any other in all of the senses. … Since these factors are so different, almost any method can be 'best', simply by an appropriate personal weighting of the various factors involved." P-splines are a reasonable addition to Marron's first list, but have no special status with respect to his second.

    REFERENCES

Ansley, C. F., Kohn, R. and Wong, C. M. (1993). Nonparametric spline regression with prior information. Biometrika 80 75–88.

Efron, B. and Tibshirani, R. (1996). Using specially designed exponential families for density estimation. Ann. Statist. 24 000–000.

Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. Chapman and Hall, London.

Fan, J. and Marron, J. S. (1994). Fast implementations of nonparametric curve estimators. J. Comput. Graph. Statist. 3 35–56.

Hjort, N. L. (1996). Performance of Efron and Tibshirani's semiparametric density estimator. Unpublished manuscript.

Hjort, N. L. and Jones, M. C. (1996). Locally parametric nonparametric density estimation. Ann. Statist. 24 1619–1647.

Jones, M. C. (1991). On correcting for variance inflation in kernel density estimation. Comput. Statist. Data Anal. 11 3–15.

Jones, M. C. (1996). On close relations of local likelihood density estimation. Unpublished manuscript.

Loader, C. R. (1996). Local likelihood density estimation. Ann. Statist. 24 1602–1618.

Marron, J. S. (1996). A personal view of smoothing and statistics (with discussion). Comput. Statist. To appear.

Ruppert, D., Sheather, S. J. and Wand, M. P. (1995). An effective bandwidth selector for local least squares regression. J. Amer. Statist. Assoc. 90 1257–1270.

Simonoff, J. S. (1996). Smoothing Methods in Statistics. Springer, New York.

Comment

Joachim Engel and Alois Kneip

Joachim Engel is with Wirtschaftstheorie II, Universität Bonn, and Department of Mathematics, PH Ludwigsburg, Germany. Alois Kneip is with Institut de Statistique, Université Catholique de Louvain, Belgium.

Paul Eilers and Brian Marx have provided us with a nice and flexible addition to the smoother's toolkit. Their proposed P-spline estimator can be considered as a compromise between the usual B-spline estimation and the smoothing spline approach. Different from many papers on B-splines, however, they do not consider the delicate problem of optimal knot selection. Instead, they propose to use a large number of equidistant knots. Smoothing is introduced by a roughness penalty on the differences of the spline coefficients.

P-spline estimation is equivalent to smoothing splines when choosing as many knots as there are observations, n = m, with a knot placed at each data point. However, this is not the situation the authors have in mind. They propose to choose a large number n of knots, but n < m. Such an approach is of considerable interest. We know from personal experience that nonparametric regression fits based on B-splines are often visually more appealing than, for example, kernel estimates. The same seems to be true for P-splines if a moderate number of knots is used. Furthermore, as the authors indicate, P-splines together with the difference penalty enjoy many important practical advantages and are flexible enough to be applied in different modelling situations, for example, in additive models or self-modelling regression where the backfitting algorithm is used.

Nevertheless, we do not yet see much evidence for the authors' claim that P-splines "come near being the ideal smoother." For example, local polynomial regression is known to exhibit no boundary


problems (in first order) and to possess certain optimality and minimax properties (Fan, 1993). For density estimation, Engel and Gasser (1995) show a minimax property of the fixed bandwidth kernel method within a large class of estimators containing penalized likelihood estimators. The presented paper does not provide any argument, neither theoretical nor by simulations, supporting any superiority of P-splines over their many competitors.

In the regression case, the theoretical properties of P-splines might be evaluated by combining arguments of de Boor (1978) on the asymptotic bias and variance of B-splines (in dependence on m, the spline order k and the smoothness of the underlying function) with the well-known results on smoothing splines.

The authors propose to use AIC or cross-validation to select the smoothing parameter λ. However, a careful look at their method reveals that there are in fact two free parameters: λ and the number n of knots. If n ≈ m, then we essentially obtain a smoothing spline fit, while results

might be very different if n ≪ m. Indeed, the estimate might crucially depend on n. Therefore, why not determine λ and n by cross-validation or a related method? The following theoretical arguments may suggest that such a procedure will work. Note that AIC and cross-validation are very close to unbiased risk estimation, which consists of estimating the optimal values of λ and n by minimizing

\sum_{i=1}^{m} (y_i - \hat{\mu}_i)^2 + 2\sigma^2\,\mathrm{tr}\,H_{\lambda,n},

where H ≡ H_{λ,n} is the corresponding smoother matrix. Let ASE(λ, n) denote the average squared error of the fit obtained by using some parameters λ and n. Under some technical conditions, it then follows from results of Kneip (1994) that, as m → ∞,

\mathrm{ASE}(\hat{\lambda}, \hat{n})/\mathrm{ASE}(\lambda_{\mathrm{opt}}, n_{\mathrm{opt}}) \to_P 1.

Here λ̂ and n̂ are the parameters estimated by unbiased risk estimation, while λ_opt and n_opt represent the optimal choice of the parameters minimizing ASE.
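A direct way to try this in practice is a small grid search; the sketch below is my own construction (not the discussants' code), with the equally spaced basis built via scipy and σ² assumed known, evaluating the unbiased risk criterion over candidate pairs (λ, n).

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(x, n_seg, deg=3):
    """Equally spaced B-spline basis on [min(x), max(x)] with n_seg intervals."""
    xl, xr = x.min(), x.max()
    dx = (xr - xl) / n_seg
    knots = xl + dx * np.arange(-deg, n_seg + deg + 1)
    n_basis = n_seg + deg
    B = np.empty((len(x), n_basis))
    for j in range(n_basis):
        c = np.zeros(n_basis)
        c[j] = 1.0
        B[:, j] = BSpline(knots, c, deg)(x)
    return B

def unbiased_risk(y, B, lam, sigma2, k=2):
    """Residual sum of squares plus 2*sigma^2*tr(H(lambda, n))."""
    D = np.diff(np.eye(B.shape[1]), k, axis=0)
    A = np.linalg.solve(B.T @ B + lam * D.T @ D, B.T)  # so that H = B A
    mu = B @ (A @ y)
    tr_H = np.einsum("ij,ji->", B, A)                  # tr(H) without forming H
    return np.sum((y - mu) ** 2) + 2.0 * sigma2 * tr_H

# Pick both lambda and the number of knots (x, y, sigma2 assumed given):
#   best = min((unbiased_risk(y, bspline_basis(x, n), lam, sigma2), lam, n)
#              for n in (10, 20, 40) for lam in 10.0 ** np.arange(-3.0, 4.0))
```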

Comment

Charles Kooperberg

Charles Kooperberg is Assistant Professor, Department of Statistics, University of Washington, Seattle, Washington 98195-0001.

Eilers and Marx present an interesting approach to spline modeling. While function estimation based on smoothing splines often yields reasonable results, the computational burden can be very large. If the number of basis functions is limited, however, the computations become much easier, and when the knots are equally spaced, the solution indeed becomes rather elegant. To increase the credibility of the claim that P-splines are close to the "ideal smoother," several issues need to be addressed:

1. In density estimation, when the range of the data is ℝ (or ℝ₊), it is useful that a density estimate be positive on ℝ (ℝ₊), for example, for resampling. Some methods can estimate densities on bounded or unbounded intervals. P-splines do not seem to have this property: lower and upper bounds have to be specified and there seems to be no natural way to extrapolate beyond these bounds. Is there any way around that? Can infinity be a bound? How would one specify the bounds? From the suicide example it appears that this may influence the results considerably.

2. To use P-splines, additional choices need to be made. How many knots should one use? Is the procedure insensitive to the number of knots, provided that there are enough of them? If so, how many is enough? How does the computational burden depend on the number of knots?

What order of penalty should be used? Do you advocate examining several possible penalties, as in the logistic regression example, or do you have another recommendation, such as using k = 3 for density estimation so that the limit of your estimate as λ → ∞ is a normal density? Since many smoothing and density estimation procedures are used as EDA tools, good defaults are very worthwhile.

3. It would be interesting to see an application of the P-spline methodology to more challenging data, such as the income data described below,


which involves thousands of cases, a narrow peak and a severe outlier.

How would the P-spline algorithm, where knots are positioned equidistantly, behave when there are severe outliers, which would dominate the positioning of the knots? Is it possible to position knots nonequidistantly, for example, based on order statistics?

4. Are there theoretical results about the large sample behavior of P-splines?

POLYNOMIAL SPLINES AND LOGSPLINE DENSITY ESTIMATION

Besides the penalized likelihood approach, there is an entirely different approach to function estimation based on splines. Whereas for P-splines both the number and the locations of the knots are fixed in advance and the smoothness is governed by a smoothing parameter, in the polynomial spline framework the number and location of the knots are determined adaptively using a stepwise algorithm and no smoothing parameter is needed. Such polynomial spline methods have been used for regression (Friedman, 1991), density estimation (Kooperberg and Stone, 1992), polychotomous (multiple logistic) regression (Kooperberg, Bose and Stone, 1997), survival analysis (Kooperberg, Stone and Truong, 1995a) and spectral density estimation (Kooperberg, Stone and Truong, 1995b).

In univariate polynomial spline methodologies the algorithm starts with a fairly small number of knots. It then adds knots in those regions where an added knot would have the most influence, using Rao (score) statistics to decide on the best location; after a prespecified maximum number of knots is reached, knots are deleted one at a time, using Wald statistics to decide which knot to remove. Out of the sequence of fitted models, the one having the smallest value of the BIC criterion is selected.

Polynomial spline algorithms for multivariate function estimation are similar, except that at each addition step the algorithm adds either a knot in one variable or a tensor product of two or more univariate basis functions. We have successfully applied such methodologies to data sets as small as 50 for one-dimensional density estimation and as large as 112,000 for a 63-dimensional polychotomous regression problem with 46 classes. For nonadaptive polynomial spline methodologies, theoretical results regarding the L₂-rate of convergence are established. Stone, Hansen, Kooperberg and Truong (1996) provide an overview of polynomial splines and their applications.

Logspline density estimation, in which a (univariate) log-density is modeled by a cubic spline, is discussed in Kooperberg and Stone (1992) and Stone et al. (1996). Software for the 1992 version, written in C and interfaced to S-PLUS, is publicly available from Statlib. (The 1992 version of LOGSPLINE employs only knot deletion; here, however, we focus on the 1996 version, which uses both knot addition and knot deletion.) LOGSPLINE can provide estimates on both finite and infinite intervals, and it can handle censored data.

The results of LOGSPLINE on the Old Faithful data and the suicide data are very similar to the corresponding results of P-splines [the suicide data is an example in Kooperberg and Stone (1992)]. Here we consider a much more challenging data set. The solid line in Figure 1 shows the logspline density estimate based on a random sample of 7,125 annual net incomes in the United Kingdom [Family Expenditure Survey (1968–1983)]. (The data have been rescaled to have mean 1.) The nine knots that were selected by LOGSPLINE are indicated. Note that four of these knots are extremely close to the peak near 0.24. This peak is due to the UK old age pension, which caused many people to have nearly identical incomes. In Kooperberg and Stone (1992) we concluded that the height and location of this peak are accurately estimated by LOGSPLINE. There are several reasons why this data is more challenging than the Old Faithful and suicide data: the data set is much larger, so that it is more of a challenge to computing resources (the LOGSPLINE estimate took 9 seconds on a Sparc 10 workstation); the width of the peak is about 0.02, compared to the range 11.5 of the data; there is a severe outlier (the largest observation is 11.5, the second largest is 7.8); and the rise of the density to the left of the peak is very steep.

To get an impression of what the P-splines procedure would yield for this data, I first removed the largest observation so that there would not be any long gaps in the data, reducing the maximum observation to 7.8. The dashed line in Figure 1 is the LOGSPLINE estimate for the data with fixed knots at (i/20) × 7.8, for i = 0, 1, ..., 20 (using 20 intervals, as in most P-spline examples). The resulting fit should be similar to a P-spline fit with λ = 0. In this estimate it appears that the narrow peak is completely missed and that, because of the steep rise of the density to the left of the peak and the lack of sufficiently many knots near the peak, two modes are estimated where only one mode exists.


Fig. 1. Logspline density estimate for the income data (solid line); the ×'s indicate the locations of the knots; logspline approximation of the P-spline estimate with penalty parameter 0 (dashed line).

It would be very much of interest to see how the P-spline methodology behaves on this data, and in particular whether it can accurately represent the sharp peak near 0.24.

    ACKNOWLEDGMENT

    Research supported in part by NSF Grant DMS-94-03371.

Comment

Dennis D. Cox

Dennis D. Cox is with the Department of Statistics, Rice University, P.O. Box 1892, Houston, Texas 77251.

The main new idea in this paper is a roughness penalty based on the B-spline coefficients. There will be critics (I give some criticisms below), but there is considerable appeal in the simplicity of the idea. If I had to develop the software ab initio, it is clear that the roughness penalties proposed here would require less effort to implement than the standard ones based on the L₂-norm of a second derivative.

There is a precedent for the use of the B-spline coefficients in such a direct way, from computer graphics (CG) and computer-aided design (CAD). The "control point" typically used in parametric B-spline representations of curves and surfaces basically consists of the B-spline coefficients. See Foley and van Dam (1995, Section 11.2.3). This is demonstrated in Figure 1, where the control points for the solid curve are just random uniforms added to a linear trend, and the same points are shrunk toward 0.5 before adding the trend to obtain the control points for the dashed curve. The ordinate of each control point is the cubic cardinal B-spline coefficient and the abscissa is the midpoint of its support. In CG/CAD applications, the control points are manipulated to obtain a curve or surface with desirable shape or smoothness. The CG/CAD practitioners become familiar with these control points and develop


a feel for their influence on the curve or surface. Similarly, statisticians may find after some effort that B-spline coefficients are very natural.

Fig. 1. Example of control points: the solid curve derives from the solid control points, and the dashed curve from the triangular control points.
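Cox's demonstration is easy to emulate; the sketch below (my own numbers throughout, not the discussant's code) builds two cubic B-spline curves whose control-point ordinates are the coefficients, one curve being the shrunken version of the other.

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(1)
n_seg, deg = 12, 3
knots = np.arange(-deg, n_seg + deg + 1, dtype=float)  # cardinal knots, unit spacing
m = n_seg + deg                                        # number of basis functions
trend = np.linspace(0.0, 1.0, m)
u = rng.uniform(0.0, 1.0, m)
a_solid = trend + u                        # control ordinates: uniforms plus trend
a_dashed = trend + 0.5 + 0.5 * (u - 0.5)   # same points shrunk toward 0.5
ctrl_x = 0.5 * (knots[: -(deg + 1)] + knots[deg + 1 :])  # midpoints of support
x = np.linspace(0.0, n_seg, 400)           # base interval of the spline
y_solid = BSpline(knots, a_solid, deg)(x)
y_dashed = BSpline(knots, a_dashed, deg)(x)
# Plotting (ctrl_x, a_solid) alongside y_solid shows how each control
# point pulls the curve locally, in the spirit of Cox's Figure 1.
```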

If I had equally easy to use software for smoothing splines or P-splines, I would prefer the former, partially from Bayesian considerations. The Bayesian interpretation of P-splines (i.e., the differenced B-spline coefficients are a Gaussian white noise under the prior) is more artificial than the usual priors as in Wahba (1978). In particular, the usual priors are specified independently of sample size, whereas one would want to use more B-splines with a larger sample. Furthermore, the integral of the second derivative squared is easier to interpret from a non-Bayesian perspective than the sum of squares of second differences of B-spline coefficients.

I take issue with the authors' claim that their method does not have boundary problems. P-splines are approximately equivalent to smoothing splines, which do have boundary effects (Speckman, 1983). To explain, consider minimizing, from equation (5),

S(a) = \sum_{i=1}^{m} \Bigl\{ y_i - \sum_{j=1}^{n} a_j B_j(x_i) \Bigr\}^2 + \lambda \sum_{j=3}^{n} (\Delta^2 a_j)^2.

A discrete form of the variational derivation in Speckman (1983) leads to the system

\lambda \Delta^2 a_3 + \sum_i B_1(x_i) \sum_j a_j B_j(x_i) = \sum_i y_i B_1(x_i),

\lambda \Delta^3 a_4 - \lambda \Delta^2 a_3 + \sum_i B_2(x_i) \sum_j a_j B_j(x_i) = \sum_i y_i B_2(x_i),

\lambda \Delta^4 a_{k+2} + \sum_i B_k(x_i) \sum_j a_j B_j(x_i) = \sum_i y_i B_k(x_i), \qquad 3 \le k \le n-2,

-\lambda \Delta^3 a_n - \lambda \Delta^2 a_n + \sum_i B_{n-1}(x_i) \sum_j a_j B_j(x_i) = \sum_i y_i B_{n-1}(x_i),

\lambda \Delta^2 a_n + \sum_i B_n(x_i) \sum_j a_j B_j(x_i) = \sum_i y_i B_n(x_i).


Notice that the equations for coefficients near the end involve lower-order differencing, so there is less smoothness imposed.
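The lower-order differencing at the ends is visible directly in the penalty matrix D^T D; a quick numerical check (my own, for second-order differences):

```python
import numpy as np

n, k = 8, 2
D = np.diff(np.eye(n), k, axis=0)  # second-order difference matrix
P = D.T @ D                        # penalty matrix in the normal equations
print(P[0])  # [ 1. -2.  1.  0. ...]        only a second difference acts at the boundary
print(P[3])  # [ 0.  1. -4.  6. -4.  1. ...] full fourth-difference row in the interior
```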

    ACKNOWLEDGMENT

    Research supported by NSF Grant DMS-90-01726.

Comment

Stephan R. Sain and David W. Scott

We have been interested in formulations of the smoothing problem that are simultaneously global in nature with locally adaptive behavior. Roughness penalties based on functionals such as the integral of squared second derivatives of the fitted curve have enjoyed much popularity. The solution to such optimization problems is often a spline. The authors are to be congratulated for introducing the idea of penalizing on the smoothness of the spline coefficients, which reduces the dimensionality of the problem as well as reducing the complexity of the calculations. There is much to say for this approach.

It is generally of interest to try to work out the equivalent kernel formulation of all smoothing methods. This was done for Nadaraya–Watson regression smoothing by Silverman (1984), who demonstrated the asymptotic manner in which the estimator adapted locally.

In the density estimation setting, we have been investigating the nature of the best locally adaptive density estimator along the lines of the Breiman–Meisel–