Nonparametric Inference for Partly Linear Additive Cox Models based on Polynomial Spline Estimation

Journal: Journal of the American Statistical Association
Manuscript ID: Draft
Manuscript Type: Article – Theory & Methods
Keywords: Conditional hazard rate, Hypothesis testing, Local Asymptotics, Partial likelihood
Abstract
The global smoothing method based on polynomial splines is a popular technique for nonparametric regression estimation and has received great attention in the literature. However, it is tremendously challenging to obtain local asymptotic properties of the polynomial spline estimators and to make inference for the regression functions. We develop a general theory of local asymptotics for the polynomial spline estimation of partly linear additive Cox models. We obtain a uniform Bahadur representation of, and design-adaptive asymptotic normality of, the resulting nonparametric estimators. Furthermore, we propose a distance-based statistic for specification tests of the additive components and establish the limiting distribution of the test statistic. We propose a bootstrap procedure to calculate the p-value of the above test statistic and prove its consistency. Based on the polynomial-spline estimation, we also introduce a two-step estimation method, which possesses an oracle property in the sense that any additive component can be estimated as if the other additive components were known. All of the above local asymptotics and testing results are also established for this two-step procedure. Simulations demonstrate nice finite sample performance of the proposed procedure. Analysis of the Framingham Heart Study data illustrates the use of our methodology.
Keywords: Conditional hazard rate, Hypothesis testing, Local Asymptotics, Partial likelihood.
1 Introduction
The global smoothing method based on polynomial splines is a popular technique for nonparametric regression estimation. Its main advantage over the popular kernel smoothing is its fast implementation and nice finite sample performance. This is significant in high dimensional smoothing. Global convergence rates of the resulting polynomial spline estimators have been exhaustively studied; see Stone (1985, 1986, 1994), Kooperberg et al. (1995a, b), Huang (1998, 1999, 2001), Huang and Stone (1998), Huang et al. (2000), and Huang and Shen (2004), among others. However, it is very challenging to obtain local asymptotic distributions of the polynomial spline estimators, which hampers development of the methodology. Zhou et al. (1998) and Huang (2003) established the local asymptotic normality of such estimators in the framework of nonparametric regression models. To our knowledge, this kind of result is not available for other models. This motivates us to study local asymptotics of the polynomial spline estimators for partly linear additive Cox models. The local asymptotic results are then used for statistical inference, in addition to providing theoretical insights about properties of the estimators that cannot be explained by global asymptotic results.
As an extension of the Cox (1972) model, the partly linear additive Cox model (Huang, 1999) specifies the conditional hazard of the failure time T given the covariate (x, w) ∈ R^d × R^J as

λ{t; x, w} = lim_{∆↓0} ∆^{−1} Pr{t ≤ T < t + ∆ | T ≥ t, x, w} = λ0(t) exp{β′x + ϕ(w)},   (1.1)

where λ0(·) ≥ 0 is an unspecified baseline hazard, β is a d-vector of parameters, and ϕ(w) is an unknown function of w with the additive structure ϕ(w) = ϕ1(w1) + · · · + ϕJ(wJ). The parameters of interest are β and the ϕj's. This model avoids the curse of dimensionality inherent in the saturated multivariate semiparametric hazard regression model (Sasieni, 1992); see the discussions in Hastie and Tibshirani (1990). It allows one to explore nonlinearity of certain covariates and retains the nice interpretability of the linear structure in Cox's (1972) model.
The polynomial spline estimator of β is efficient in the sense that it achieves the semiparametric information bound (Huang, 1999). This indicates that the estimator of β is asymptotically most efficient among all regular estimators (van der Vaart, 1991; Bickel et al., 1993, Chapter 3). Since the information lower bound cannot be consistently estimated, Jiang and Jiang (2011) proposed a bootstrap-based inference method for β. However, only a global rate of convergence for the resulting estimators of the ϕj's is available in the literature (Huang, 1999). Establishing the local asymptotics of the polynomial spline estimators of the ϕj's has remained unsolved since the publication of Huang (1999). In this article we solve this long-standing problem, as well as others. Since many Cox-type models are specific examples of model (1.1), the local asymptotics for the estimation of the ϕj's can also be used to justify the appropriateness of these models, for example by testing whether the ϕj(·)'s admit some specific parametric forms. This makes our theoretical results significant in applications.
Unlike the least squares based polynomial spline estimation in Zhou et al. (1998) and Huang (2003), there is no explicit formula for our estimators in the current setting. It is a tremendous new challenge to obtain the asymptotic distributions of our estimators. Such a challenge also arises from the fact that the score functions involving a diverging number of parameters are asymptotically infinite dimensional, in contrast to the local polynomial estimation of the additive components, which involves only finitely many local parameters (Fan and Yao, 2003). Although for additive regression models the unknown additive functions can be estimated by the backfitting algorithm (Buja et al., 1989) using kernel smoothing as the building block, the local asymptotics of such backfitting estimators of the ϕj's based on kernel smoothing is hard to establish and is not available in the literature, let alone for the polynomial spline estimation. In fact, there are no formal results in the literature about the local asymptotics for any estimation method of the partly linear additive Cox model. We braved this difficulty and have made determined efforts to derive a uniform Bahadur representation and asymptotic normality of the polynomial spline estimators of the ϕj's, which are important for establishing confidence bands and for hypothesis testing about the model structure. We also provide consistent estimators of the baseline hazard and of the asymptotic variance of the nonparametric part estimation. This variance estimator allows one to construct pointwise interval estimates of the ϕj's. Our techniques are very different from those in Zhou et al. (1998) and Huang (2003) for nonparametric regression models, because of the nature of the counting process with a diverging number of parameters. Our local asymptotic results show some remarkable properties of our estimators: they have no boundary problems and are design-adaptive (Fan, 1992; Fan and Gijbels, 1996).
However, like the kernel estimation for nonparametric additive regression models (Opsomer and Ruppert, 1997; Opsomer, 2000), the polynomial-spline estimation of one additive component of interest depends on the other additive nuisance components. This is not a desirable property. Motivated by the two-stage estimation methods for varying-coefficient regression models (Fan and Zhang, 1999) and for nonparametric additive regression models (Horowitz and Mammen, 2004), we introduce a two-step estimation method to solve this problem. It is shown that the two-step estimator is more efficient than the polynomial spline estimator. In particular, the two-step approach possesses an oracle property: any additive component can be estimated as well as if the remaining additive components were known.
With the uniform Bahadur representation, we further study a nonparametric specification test for the additive components. Tests of this kind are available for kernel smoothing, but for polynomial spline smoothing there is no formal nonparametric test for any model in the literature. We propose two test statistics and establish the asymptotic null and alternative distributions of the proposed test statistics. Other testing problems can be dealt with analogously. Like other nonparametric tests, the asymptotic null distribution may not give a
good approximation in finite sample situations due to the low convergence rate (Bickel and Rosenblatt, 1973; Fan, Zhang and Zhang, 2001; Fan and Jiang, 2005; Hong and Lee, 2013). Hence, we propose a bootstrap approach to calculate the p-values of the proposed tests and prove its consistency. We expect that our test methodology will promote the development of nonparametric inference using the global polynomial spline estimation as a building block, owing to its popularity and its advantages of fast implementation and stable performance in high dimensional nonparametric screening (Fan, Feng and Wu, 2011) and in large financial time series data analysis (Liu and Yang, 2016). Our methodology can also be extended to the partly linear additive hazards model, an important alternative to model (1.1) considered in Lu and Song (2015), and to other non- and semi-parametric regression models.
The remainder of this paper is organized as follows. In Section 2 we introduce the partial likelihood of model (1.1) along with the polynomial spline estimation. In Section 3 we introduce some notation and technical assumptions for our theoretical results. In Section 4 we concentrate on the uniform linear representation and asymptotic normality of the estimators of the ϕj's. In Section 5 we suggest the two-step estimation. In Section 6 we study the specification test. In Section 7 we present details of the implementation of the proposed method, conduct simulations, and employ a real example to illustrate the use of our methodology. Some concluding remarks are given in Section 8. Technical proofs are provided in the Appendix.
2 Partial likelihood and polynomial spline estimation
Based on Cox's (1975) partial likelihood, Stone (1986) proposed polynomial spline estimation for the fully nonparametric additive Cox model and studied the global convergence rates of the resulting estimator, and Huang (1999) extended this estimation method to the partly linear additive Cox model. The idea of this approach is to approximate the function ϕ(·) in the partial likelihood by a polynomial spline.
Suppose that there are n independent individuals in a study cohort. In practice, not all of the survival times T1, · · · , Tn are fully observable, due to termination of the study or early withdrawal from the study. Instead one observes an event time Si = min(Ti, Ci) for the i-th subject, where Ci is the censoring time. Let δi = I(Ti ≤ Ci) be the censoring indicator and (Xi, Wi) be an associated vector of covariates. The observed data are {(Si, δi, Xi, Wi) : i = 1, · · · , n}, which is an i.i.d. sample from the population (S, δ, X, W) with S = min(T, C) and δ = I(T ≤ C). Then the partial likelihood function (Cox, 1975) for model (1.1) is

L(β, ϕ) = ∏_{i=1}^n { r_i(β, ϕ) / ∑_{j∈R_i} r_j(β, ϕ) }^{δ_i},   (2.2)

where r_i(β, ϕ) = exp{β′X_i + ϕ(W_i)}, and R_i = {j : S_j ≥ S_i} is the risk set at time S_i.
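To make (2.2) concrete, here is a minimal sketch (the function name and array layout are ours, not the paper's) of evaluating the log partial likelihood for given risk scores η_i = β′X_i + ϕ(W_i), assuming no tied event times:

```python
import numpy as np

def log_partial_likelihood(time, event, eta):
    """log L(beta, phi) = sum over events i of
    [eta_i - log(sum_{j in R_i} exp(eta_j))], with R_i = {j : S_j >= S_i}."""
    time, event, eta = map(np.asarray, (time, event, eta))
    ll = 0.0
    for i in np.flatnonzero(event == 1):
        at_risk = time >= time[i]              # risk set R_i
        ll += eta[i] - np.log(np.exp(eta[at_risk]).sum())
    return ll
```

In practice one maximizes this over the spline coefficients once ϕ is replaced by its basis expansion; standard Cox regression software can do this directly.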
Maximizing the above partial likelihood leads to over-fitting in the absence of any restriction on the form of ϕ; in particular, the parameters will be unidentifiable. Huang (1999) resorted to an approximation of ϕ(·) in a space of polynomial splines, using the same number of knots for all ϕj(·)'s. Here we allow different numbers of knots for different ϕj(·)'s. This relaxation brings us two benefits: on the one hand, we can employ different smoothing parameters to accommodate the different degrees of smoothness of the additive functions ϕj; on the other hand, we can under-smooth the nuisance functions in our hypothesis testing problems and obtain the Wilks result. Specifically, without loss of generality, assume that W = (W1, . . . , WJ)′ takes values in W = [0, 1]^J, and let Wi = (Wi1, . . . , WiJ)′. For approximating the function ϕj(·) (j = 1, . . . , J), we need a knot sequence ξj = {ξ_{j,i}}_{i=0}^{Kj+1} such that 0 = ξ_{j,0} < ξ_{j,1} < · · · < ξ_{j,Kj} < ξ_{j,Kj+1} = 1. Let I_{j,i} = [ξ_{j,i}, ξ_{j,i+1}) for i = 0, . . . , Kj − 1, and I_{j,Kj} = [ξ_{j,Kj}, ξ_{j,Kj+1}]. Then {I_{j,i}}_{i=0}^{Kj} is a partition of [0, 1]. The number of knots can be chosen by the data, with Kj = K_{j,n} → ∞ as n → ∞. The space of polynomial splines of order ℓj (degree ℓj − 1) with knot sequence ξj, denoted by S(ℓj, ξj), consists of functions s(·) satisfying

(i) for 0 ≤ i ≤ Kj, s(·) is a polynomial of degree ℓj − 1 on I_{j,i};

(ii) for ℓj ≥ 2, s(·) is ℓj − 2 times continuously differentiable on [0, 1].

Since S(ℓj, ξj) is a qj-dimensional linear space with qj = Kj + ℓj, there exists a local basis {B_{j,i}(·)}_{i=1}^{qj} for S(ℓj, ξj) such that any ϕ_{nj}(·) ∈ S(ℓj, ξj) can be written as ϕ_{nj}(wj) = ∑_{i=1}^{qj} b_{ji} B_{j,i}(wj), for j = 1, . . . , J (Schumaker, 1981, page 124). For example, for fixed ξj and ℓj, one may take {B_{j,i}} to be the B-spline basis.
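As a concrete illustration (the helper below and its scipy-based construction are ours, not the paper's), a B-spline basis of order ℓj on [0, 1] with Kj interior knots has dimension qj = Kj + ℓj and can be evaluated as:

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(w, interior_knots, order=4):
    """Evaluate the q = K + order B-spline basis functions of the given
    order (degree = order - 1) at the points w in [0, 1]."""
    deg = order - 1
    # clamped knot vector: boundary knots 0 and 1 each repeated `order` times
    t = np.r_[np.zeros(order), np.asarray(interior_knots, float), np.ones(order)]
    q = len(interior_knots) + order            # dimension K_j + l_j
    B = np.empty((len(w), q))
    for i in range(q):
        coef = np.zeros(q)
        coef[i] = 1.0                          # pick out the i-th basis spline
        B[:, i] = BSpline(t, coef, deg)(w)
    return B

w = np.linspace(0.0, 1.0, 9)
B = bspline_basis(w, interior_knots=[0.25, 0.5, 0.75])   # cubic, q = 3 + 4 = 7
```

Each basis function is nonzero on at most ℓj adjacent intervals, which is the locality used throughout the paper's asymptotic analysis.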
Let the maximizer of (2.4) be (β̂, b̂). Since the ϕj's can only be identified up to an additive constant, we assume E{δϕj(Wj)} = 0, or equivalently E{ϕj(Wj) | δ = 1} = 0, and center the estimators of the ϕj's as in Huang (1999). Specifically, let

ϕ*_j(wj) = b̂′_j B_j(wj),   ϕ̄*_j = ∑_{i=1}^n δ_i ϕ*_j(W_{ij}) / ∑_{i=1}^n δ_i,

ϕ*(w) = b̂′B(w),   and   ϕ̄* = ∑_{i=1}^n δ_i ϕ*(W_i) / ∑_{i=1}^n δ_i.

Then ϕj(wj) is estimated by ϕ̂j(wj) = ϕ*_j(wj) − ϕ̄*_j, and ϕ(w) by ϕ̂(w) = ∑_{j=1}^J ϕ̂j(wj) = ϕ*(w) − ϕ̄*. Under mild conditions, the cumulative baseline hazard function Λ0(t) = ∫_0^t λ0(u) du is estimated by the Breslow estimator (Breslow, 1972, 1974)

Λ̂0(t) = ∫_0^t [ ∑_{i=1}^n Y_i(u) exp{β̂′X_i + ϕ̂(W_i)} ]^{−1} ∑_{i=1}^n dN_i(u),

where Y_i(u) = I(S_i ≥ u) and N_i(u) = I(S_i ≤ u, δ_i = 1). When there is no X-variable, the above approach reduces to the polynomial spline estimators in Stone (1986), for which the local asymptotics is still unavailable. Computationally, the maximization problem (2.4) can easily be implemented with existing Cox regression programs; see Section 7.2. In the next section we will establish the local asymptotic normality of ϕ̂ and ϕ̂j for j = 1, . . . , J. Several appealing properties of these estimators will be revealed.
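The Breslow estimator above jumps by the reciprocal of the at-risk sum at each event time. A minimal sketch (the function below assumes the fitted risk scores β̂′X_i + ϕ̂(W_i) are already available and that event times are distinct; names are ours):

```python
import numpy as np

def breslow_cumhaz(time, event, risk_score):
    """Breslow estimator: Lambda0_hat(t) jumps by
    1 / sum_{j: S_j >= S_i} exp(risk_score_j) at each event time S_i <= t."""
    time, event = np.asarray(time, float), np.asarray(event)
    r = np.exp(np.asarray(risk_score, float))
    order = np.argsort(time)
    t, d, r = time[order], event[order], r[order]
    at_risk = np.cumsum(r[::-1])[::-1]        # sum of r over {j : S_j >= t_i}
    jumps = np.where(d == 1, 1.0 / at_risk, 0.0)
    return t, np.cumsum(jumps)                # Lambda0_hat at the sorted times
```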
3 Notation and Conditions
The following regularity conditions are needed for our theoretical results.

Condition (A):

(A1) Let Y(s) = I(S ≥ s). Then P(Y(τ) = 1) > 0, where τ is the end point of the observation period for the event time S.

(A2) Denote by Qn(w) the empirical distribution function of {Wi}_{i=1}^n. Let Q(w) be the distribution function of Wi, which has a positive continuous density q(w) on its support W.

(A3) Let r*(s) = Y(s) exp{β′X + ϕ(W)} and f(w, s) = E{r*(s) | W = w}. Assume that

Condition (A1) is commonly used in the literature. In Condition (A2), a positive design density q(w) on its support is required, which ensures that there are enough data points for smoothing. Condition (A3) is mild. Since P(Y(τ) = 1) > 0 and Y(τ) exp{β′X + ϕ(W)} ≤ r*(s) ≤ exp{β′X + ϕ(W)}, it is trivial to show that our condition (A3)(i) is weaker than 0 < E[exp{β′X + ϕ(W)}] < ∞. Note that E{r*(s) − f(W, s)}² ≤ Var{r*(s)}, so condition (A3)(ii) holds if Var{r*(s)} < ∞.
The following Condition (B) was used in Huang (1999) and is listed for convenience. Under Condition (B), the estimator of β is √n-consistent, so that each ϕj(·) can be estimated as well as if β were known (Opsomer and Ruppert, 1999; Jiang et al., 2007).
Condition (B):
(B1) (i) The regression parameter β belongs to an open subset (not necessarily bounded) of R^d, and each ϕj lies in Aj for j = 1, . . . , J, where Aj is the class of functions ϕj on [0, 1] whose ℓj-th derivative exists and satisfies the following Lipschitz condition of order αj:

|ϕ_j^{(ℓj)}(s) − ϕ_j^{(ℓj)}(t)| ≤ C|s − t|^{αj} for s, t ∈ [0, 1],

where αj ∈ (0, 1] satisfies pj = ℓj + αj > 0.5. Let p = min_{j=1,...,J} pj.
(ii) E(δX) = 0 and E{δϕj(Wj)} = 0, 1 ≤ j ≤ J.
(B2) The failure time T and the censoring time C are conditionally independent given the covariate (X, W).
(B3) (i) Only the observations for which the event time Si (1 ≤ i ≤ n) is in a finite interval, say [0, τ], are used in the partial likelihood. The baseline cumulative hazard function satisfies Λ0(τ) = ∫_0^τ λ0(s) ds < ∞. (ii) The covariate X takes values in a bounded subset of R^d, and the covariate W takes values in W.
(B4) There exists a small positive constant ε such that (i) P(δ = 1 | X, W) > ε and (ii) P(C > τ | X, W) > ε almost surely with respect to the probability measure of (X, W).
(B5) Let 0 < c1 < c2 < ∞ be two constants. The joint density f(t, w, δ = 1) of (S, W, δ = 1) satisfies c1 ≤ f(t, w, δ = 1) < c2 for all (t, w) ∈ [0, τ] × W.
(B6) For a positive integer q ≥ 1, assume that the q-th partial derivative of the joint density f(t, x, w, δ = 1) of (S, X, W, δ = 1) with respect to t or w exists and is bounded. [For a discrete covariate X, f(t, x, w, δ = 1) is defined to be (∂²/∂t∂w) P(S ≤ t, X = x, W1 ≤ w1, . . . , WJ ≤ wJ, δ = 1).]
(B7) Let Kj ≡ K_{j,n} be a positive integer such that K_{j,n} = O(n^v) for 0.25/p < v < 0.5. Assume that h = O(n^{−v}).
(B8) The information matrix I(β) for estimation of β is positive definite, where I(β) wasdefined in Theorem 3.1 of Huang (1999).
Note that the condition 0.25/p < v < 0.5 in (B7) is used to ensure that β̂ is √n-consistent (Theorem 3.2 of Huang, 1999).
4 Local asymptotics
Let h_{j,i} = ξ_{j,i+1} − ξ_{j,i}, hj ≡ max_{0≤i≤Kj} h_{j,i}, h̲ = min_{1≤j≤J} hj, and h̄ = max_{1≤j≤J} hj. Assume that

max_{1≤i≤Kj} |h_{j,i} − h_{j,i−1}| = o(K_j^{−1})   and   hj / min_{0≤k≤Kj} h_{j,k} ≤ M,

where M > 0 is a predetermined constant. This condition was employed for univariate nonparametric regression models by Zhou et al. (1998). Under this condition, we have M^{−1} < Kj hj < M, the condition required for numerical computation. Throughout this paper, denote by A^{⊗2} the matrix AA′ for any vector or matrix A. Put N_i(t) = I(S_i ≤ t, δ_i = 1) and N̄(t) = ∑_{i=1}^n N_i(t). Let
F_i(t) = σ{N_i(s), Y_i(s+), X_i, W_i, δ_i, s ≤ t}

represent the failure time, censoring and covariate information for the i-th subject up to time t. Then

M_i(t) = N_i(t) − ∫_0^t r*_i(s) λ0(s) ds   (4.5)

is an orthogonal local square integrable martingale with respect to F_i(t), such that ⟨M_i(t), M_j(t)⟩ = 0 for i ≠ j (Kalbfleisch and Prentice, 1980; Fleming and Harrington, 1991), where r*_i(s) = Y_i(s) exp{β′X_i + ϕ(W_i)}. Let F(t) be the smallest σ-algebra containing ∪_{i=1}^n F_i(t). Then M̄(t) = ∑_{i=1}^n M_i(t) is a martingale with respect to F(t). For k = 0, 1, 2, let

R*_k(s) = E[B(W)^{⊗k} Y(s) exp{β′X + ϕ(W)}],

Σ0 = ∫_0^τ [ R*_2(s)/R*_0(s) − {R*_1(s)/R*_0(s)}^{⊗2} ] R*_0(s) dΛ0(s),

ξ_{n1} = n^{−1} ∑_{i=1}^n ∫_0^τ { B(W_i) − R*_1(s)/R*_0(s) } dM_i(s).

Since Σ0 is positive definite (see Lemma 8), it has an inverse Σ0^{−1}. Let e*_j be the J × J diagonal matrix with the j-th diagonal element equal to 1 and all other elements 0, and let I_{qj} be the qj × qj identity matrix. Set e_j = e*_j ⊗ I_{qj}, where ⊗ denotes the Kronecker product of matrices. Let
ℓ = min_{1≤j≤J} ℓj. The notation a_n ≍ b_n means that a_n and b_n are of the same order. Then we have the following uniform Bahadur representation.

Theorem 1. Under Conditions (A) and (B), if ϕj(wj) is ℓ ≥ 2 times continuously differentiable on [0, 1], n h̲³ → ∞, and h̄ ≍ h̲, then

ϕ̂(w) − ϕ(w) − α(w) = B′(w) Σ0^{−1} ξ_{n1} + o_p(h̄^ℓ + 1/√(n h̲)),

ϕ̂j(wj) − ϕj(wj) − αj(wj) = B′(w) e_j Σ0^{−1} ξ_{n1} + o_p(h̄^ℓ + 1/√(n h̲)),

uniformly in w ∈ W, where α(w) = ∑_{j=1}^J αj(wj),

αj(wj) = − ∑_{i=0}^{Kj} (1/ℓj!) h_{j,i}^{ℓj} ϕ_j^{(ℓj)}(wj) B*_{ℓj}( (wj − ξ_{j,i}) / h_{j,i} ) I(wj ∈ I_{j,i}),

and B*_ℓ(·) is the Bernoulli polynomial defined inductively as follows:

B*_0(x) = 1,   B*_ℓ(x) = ∫_0^x ℓ B*_{ℓ−1}(z) dz + b*_ℓ,

with b*_ℓ = −ℓ ∫_0^1 ∫_0^x B*_{ℓ−1}(z) dz dx being the ℓ-th Bernoulli number.
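The recursion defining B*_ℓ can be carried out symbolically; the sketch below (our own helper, using sympy) reproduces the classical Bernoulli polynomials, e.g. B*_1(x) = x − 1/2 and B*_2(x) = x² − x + 1/6:

```python
import sympy as sp

x, z = sp.symbols('x z')

def bernoulli_star(l):
    """B*_0(x) = 1; B*_l(x) = int_0^x l*B*_{l-1}(z) dz + b*_l,
    where b*_l = -l * int_0^1 int_0^x B*_{l-1}(z) dz dx."""
    B = sp.Integer(1)
    for k in range(1, l + 1):
        inner = sp.integrate(k * B.subs(x, z), (z, 0, x))
        b = -sp.integrate(inner, (x, 0, 1))    # the k-th Bernoulli number b*_k
        B = sp.expand(inner + b)
    return B
```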
The above uniform Bahadur representation is useful for establishing the local asymptotic distribution of ϕ̂j(·) and for statistical inference about the additive components. In Section 6, we derive the asymptotic distributions of the specification test using this Bahadur representation. Although this representation is established for model (1.1) with the number of additive components J fixed, the result can be extended to models with J diverging as n goes to ∞, if the partial likelihood in (2.4) is penalized using the group lasso (Yuan and Lin, 2006; Ma et al., 2015) or the elastic net (Zou and Hastie, 2005), among others. This should facilitate statistical inference after model selection for Cox-type models in high dimensional settings, but it will be investigated in our next project.
Let q = ∑_{j=1}^J qj. Then Σ0 is a q × q matrix with q → ∞. Partition Σ0 and Σ0^{−1} into J × J block matrices with the j-th diagonal blocks of size qj × qj. For i, j = 1, . . . , J, let Σ_{0,ij} and Σ0^{ij} be the (i, j)-th blocks of the matrices Σ0 and Σ0^{−1}, respectively, and let σ_{n,j}(wj) = {n^{−1} B′_j(wj) Σ0^{jj} B_j(wj)}^{1/2} and σ_n(w) = {n^{−1} B′(w) Σ0^{−1} B(w)}^{1/2}. The following theorem describes the asymptotic normality of the polynomial spline estimators.
Theorem 2. Under the conditions of Theorem 1, if ϕj(wj) is ℓ ≥ 2 times continuously differentiable on [0, 1] and n h̲³ → ∞, then

(i) {ϕ̂j(wj) − ϕj(wj) − αj(wj)} / σ_{n,j}(wj) →_d N(0, 1), and

(ii) {ϕ̂(w) − ϕ(w) − α(w)} / σ_n(w) →_d N(0, 1),
where σ_{n,j}(wj) ≍ (n h̲)^{−1/2} and σ_n(w) ≍ (n h̲)^{−1/2}, uniformly in w.
Corollary 1. Under the conditions of Theorem 2, if J = 1, then

{ϕ̂1(w1) − ϕ1(w1) − α1(w1)} / σ_{n,1}(w1) →_d N(0, 1).
Remark 1. Theorem 2 shows that the asymptotic bias αj(wj) of ϕ̂j(wj) shares the same form as that of the regression spline estimator in Zhou et al. (1998) and does not depend on the design distribution Q(w). This reflects that the polynomial spline estimators are design-adaptive: the bias does not depend on the design density, a property shared by local polynomial regression (Fan, 1992). The asymptotic variance of ϕ̂j(wj) is σ²_{n,j}(wj) ≍ (n h̲)^{−1}, which suggests that ϕ̂j(wj) is √(n h̲)-consistent.
By Theorem 2, the asymptotic mean squared error of the estimator ϕ̂j can be defined as

AMSE{ϕ̂j(wj)} = α²_j(wj) + σ²_{n,j}(wj).

If ξ_{j,i} ≤ wj < ξ_{j,i+1} (i = 0, . . . , Kj) and h_{j,i} = hj = h, then

AMSE{ϕ̂j(wj)} = h^{2ℓ} { (1/ℓ!) ϕ_j^{(ℓ)}(wj) B*_ℓ( (wj − ξ_{j,i}) / h ) }² + n^{−1} B′_j(wj) Σ0^{jj} B_j(wj).

Minimizing the above AMSE over h, one obtains the theoretically optimal value of h of order n^{−1/(2ℓ+1)}, which is the same order as the optimal bandwidth for kernel smoothing (Fan and Gijbels, 1996).
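The bias-variance trade-off behind the n^{−1/(2ℓ+1)} rate can be checked numerically. In this sketch (the constants a, b and the helper are illustrative assumptions, not quantities from the paper) we minimize a·h^{2ℓ} + b/(nh) and compare with the closed form h* = {b/(2ℓan)}^{1/(2ℓ+1)} obtained by setting the derivative to zero:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def optimal_h(n, l=4, a=1.0, b=1.0):
    """Minimize the AMSE-type criterion a*h^(2l) + b/(n*h) over h in (0, 1]."""
    res = minimize_scalar(lambda h: a * h**(2 * l) + b / (n * h),
                          bounds=(1e-4, 1.0), method='bounded')
    return res.x

# closed form: h* = (b / (2*l*a*n)) ** (1 / (2*l + 1)), of order n^(-1/(2l+1))
```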
Theorem 3. Under the conditions of Theorem 2,

lim_{C→∞} limsup_{n→∞} sup_{w∈W} P{ |ϕ̂(w) − ϕ(w)| ≥ C(h̄^ℓ + 1/√(n h̲)) } = 0.

This extends the result of Corollary 3.1 in Huang (2003) to the current situation. In light of Theorem 3, if n h̄^{2ℓ+1} → 0, then √(n h̲) |ϕ̂(w) − ϕ(w)| = O_p(1) uniformly in w ∈ W, and Theorem 2 can be interpreted as follows: the asymptotic normality holds for all w ∈ W.
Remark 2. If h = c n^{−1/(2ℓ+1)} for some c > 0, then, by Theorem 3, ϕ̂j(wj) − ϕj(wj) = O_p(n^{−ℓ/(2ℓ+1)}) uniformly for wj ∈ [0, 1]. This indicates that the nonparametric part estimation enjoys a nice property, freedom from boundary effects (Gasser and Muller, 1984), shared with the local polynomial estimation for hazard regression with one covariate (Fan et al., 1997). A similar property was also revealed for the regression spline estimation in Zhou et al. (1998).
The following result indicates that the estimator of the cumulative baseline hazard is uniformly consistent.
Theorem 4. Under the conditions of Theorem 2, we have

sup_{t∈[0,τ]} |Λ̂0(t) − Λ0(t)| = o_p(1).   (4.6)
The definition of σ²_n(w) suggests the following plug-in estimator:

σ̂²_n(w) = n^{−1} B′(w) Σ̂0^{−1} B(w),

where Σ̂0 = ∫_0^τ [ R_{n2}(s)/R_{n0}(s) − {R_{n1}(s)/R_{n0}(s)}^{⊗2} ] R_{n0}(s) dΛ̂0(s) and, for k = 0, 1, 2, R_{nk}(s) = n^{−1} ∑_{i=1}^n B(W_i)^{⊗k} Y_i(s) exp{β̂′X_i + ϕ̂(W_i)}. Similarly, σ²_{n,j}(wj) is estimated by

σ̂²_{n,j}(wj) = n^{−1} B′_j(wj) Σ̂0^{jj} B_j(wj),

where Σ̂0^{jj} is the j-th diagonal block of Σ̂0^{−1}. The following theorem shows that σ̂²_n(w) and σ̂²_{n,j}(wj) are consistent.
Theorem 5. Under the conditions of Theorem 2, we have

(i) σ̂²_n(w) − σ²_n(w) = o_p{1/(n h̲)} and σ̂²_{n,j}(wj) − σ²_{n,j}(wj) = o_p{1/(n h̲)}, uniformly for w ∈ W;

(ii) {ϕ̂j(wj) − ϕj(wj) − αj(wj)} / σ̂_{n,j}(wj) →_d N(0, 1);

(iii) {ϕ̂(w) − ϕ(w) − α(w)} / σ̂_n(w) →_d N(0, 1).
Remark 3. Since the convergence rate of ϕ̂j is √(n h̲) (see Theorem 2), the result of Theorem 5(i) indicates that the variance estimator of ϕ̂j is consistent. This contrasts with estimating the variance of β̂, for which there is no direct consistent variance estimator in the literature (Huang, 1999).

By the proof of Theorem 2, σ̂²_n(w) = O_p{1/(n h̲)}. Since αj = O(h̄^ℓ), the optimal h in the sense of minimizing AMSE(ϕ̂) is of order n^{−1/(2ℓ+1)}. With the asymptotic normality in Theorem 5, if h = o(n^{−1/(2ℓ+1)}) (undersmoothing), then {ϕ̂j(wj) − ϕj(wj)} / σ̂_{n,j}(wj) →_d N(0, 1), which can be used to construct pointwise confidence intervals for ϕj(wj).
5 Two-step estimation
The local asymptotic results in the previous theorems show that the estimator of each ϕj depends on the remaining ϕk's (k ≠ j). In the following we propose a two-step estimation method to remove this kind of dependence. This method needs initial estimates of β and of the ϕk's (k ≠ j), which are taken to be β̂ and the ϕ̂k's from Section 2.
Consider estimating the ϕj(wj) of interest. Regarding the remaining parameters as nuisance parameters and replacing them by the corresponding initial estimators, similarly to (2.4) we obtain the logarithm of an approximated partial likelihood:

ℓj(bj) = ∑_{i=1}^n δ_i [ β̂′X_i + b′_j B̃_j(W_{ij}) + ϕ̂_{−j}(W_{i,−j}) − log ∑_{k∈R_i} exp{ β̂′X_k + b′_j B̃_j(W_{kj}) + ϕ̂_{−j}(W_{k,−j}) } ],   (5.7)
where ϕ̂_{−j}(W_{i,−j}) = ∑_{k=1, k≠j}^J ϕ̂_k(W_{ik}), and B̃_j is defined similarly to B_j in Section 2 but with a new number of knots, q̃_j, so that the corresponding mesh size is different from before; we denote it by h̃_j to stress this difference. We also use α̃_j(·) in place of α_j(·) to reflect this change.
Denote by b̃_j the maximizer of (5.7). Let ϕ̃*_j(wj) = b̃′_j B̃_j(wj) and ϕ̃̄*_j = ∑_{i=1}^n δ_i ϕ̃*_j(W_{ij}) / ∑_{i=1}^n δ_i. Similarly to ϕ̂_j, we define the two-step estimator of ϕ_j as

ϕ̃_j(wj) = ϕ̃*_j(wj) − ϕ̃̄*_j.

Then the two-step estimator of ϕ(w) is simply ϕ̃(w) = ∑_{j=1}^J ϕ̃_j(wj).
For k = 0, 1, 2, let

R̃*_{kj}(s) = E[ B̃_j(W_j)^{⊗k} Y(s) exp{β′X + ϕ(W)} ],

Σ̃_{0,jj} = ∫_0^τ [ R̃*_{2j}(s)/R̃*_{0j}(s) − {R̃*_{1j}(s)/R̃*_{0j}(s)}^{⊗2} ] R̃*_{0j}(s) dΛ0(s),

σ̃_{n,j}(wj) = { n^{−1} B̃′_j(wj) Σ̃^{−1}_{0,jj} B̃_j(wj) }^{1/2},

and

ξ̃_{n,j} = n^{−1} ∑_{i=1}^n ∫_0^τ { B̃_j(W_{ij}) − R̃*_{1j}(s)/R̃*_{0j}(s) } dM_i(s).

Then ξ̃_{n,j} is a martingale term with mean zero and variance Σ̃_{0,jj}. The following theorem gives a uniform Bahadur representation of the two-step estimator and, as a natural by-product, its limiting distribution.
Theorem 6. Under the conditions of Theorem 1, if n h̃_j h̄^ℓ → 0, h̄ = o(h̃_j) and n h̃_j³ → ∞, then

ϕ̃_j(wj) − ϕ_j(wj) − α̃_j(wj) = B̃′_j(wj) Σ̃^{−1}_{0,jj} ξ̃_{n,j} + o_p( h̃_j^ℓ + 1/√(n h̃_j) ),

uniformly in wj ∈ [0, 1]. Furthermore,

{ ϕ̃_j(wj) − ϕ_j(wj) − α̃_j(wj) } / σ̃_{n,j}(wj) →_d N(0, 1).
The results of Theorem 6 are the same as those in Theorems 1 and 2 in the J = 1 case, where W is univariate. This indicates that the two-step estimation is essentially equivalent to an oracle method, in the sense that ϕ̃_j estimates ϕ_j as well as if β and the remaining ϕ_k's were all known. The result holds regardless of the (finite) dimension of W, so asymptotically there is no curse of dimensionality. This is a desirable property shared by other two-step estimation approaches (Fan and Zhang, 1999; Horowitz and Mammen, 2004; Jiang and Li, 2008; Liu, Yang and Härdle, 2013; Ma et al., 2015).
In Theorem 6, the undersmoothing condition h̄ = o(h̃_j) is imposed on the initial estimate. In the practical implementation of the two-step estimation, the initial estimate ϕ̂_{−j} in the first stage should therefore be different from the best polynomial spline estimate, which is certainly not undersmoothed. When the same knots are used for the polynomial spline estimator ϕ̂_j in Theorem 2 and for the two-step estimator ϕ̃_j in the second stage, α̃_j(wj) = α_j(wj) and Σ̃^{−1}_{0,jj} = Σ^{−1}_{0,jj}. Since Σ^{−1}_{0,jj} ≤ Σ^{jj}_0, it is seen from Theorems 2 and 6 that ϕ̃_j is asymptotically more efficient than ϕ̂_j. The actual advantage of the two-step estimation depends on how far Σ0 is from a block diagonal matrix. If the correlation, weighted by the risk function r*(s), between the variables B_i(W_i) used for approximating ϕ_i and the variables B_j(W_j) used for approximating ϕ_j is zero for all i ≠ j, then the off-diagonal blocks of Σ0 vanish and ϕ̂_j and ϕ̃_j have the same asymptotic distribution.
6 Nonparametric hypothesis testing
The uniform Bahadur representation of the global spline estimators in Theorem 1 facilitates statistical inference for the nonparametric additive components. For illustration we consider the following testing problem for ϕ1:

H0 : ϕ1(·) = ϕ_{1,0}(·)   versus   H1 : ϕ1(·) ≠ ϕ_{1,0}(·),

where ϕ_{1,0}(·) is a given function. For this testing problem, there is no formal work using global spline estimation in the literature for any nonparametric model, including the nonparametric regression models considered in Zhou et al. (1998). When ϕ_{1,0}(·) = 0, the problem reduces to testing the significance of ϕ1. The testing problem is a nonparametric null hypothesis versus a nonparametric alternative, because the nuisance parameters under H0 are still nonparametric. Other testing problems, for example testing the significance of a group of variables, can be solved analogously.
We consider the intuitive discrepancy measures

T_n = n h1 ∫_0^1 { ϕ̂1(w1) − ϕ_{1,0}(w1) }² a(w1) dw1,

T̃_n = n h̃1 ∫_0^1 { ϕ̃1(w1) − ϕ_{1,0}(w1) }² a(w1) dw1,
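On a grid, T_n can be approximated by quadrature. The sketch below (our own helper; the trapezoidal rule and the uniform weight a ≡ 1 are illustrative choices) computes the statistic from an estimated and a hypothesized curve:

```python
import numpy as np

def distance_statistic(n, h1, grid, phi_est, phi_null, a=None):
    """T_n ~= n*h1 * integral over grid of {phi_est(w) - phi_null(w)}^2 * a(w),
    using the trapezoidal rule; a=None means the uniform weight a(w) = 1."""
    w = np.asarray(grid, float)
    f = (phi_est(w) - phi_null(w)) ** 2
    if a is not None:
        f = f * a(w)
    integral = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(w))   # trapezoidal rule
    return n * h1 * integral
```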
where a(·) is a bounded, nonnegative and integrable weighting function. The above distance-based statistics were used for density estimation in Bickel and Rosenblatt (1973). They can be regarded as extensions of the Kolmogorov-Smirnov and Cramér-von Mises types of statistics. Other tests can be developed using our uniform Bahadur representations, such as the generalized likelihood ratio tests in Fan, Zhang and Zhang (2001) and Fan and Jiang (2005). While the generalized likelihood ratio tests have Wilks' property and asymptotic optimality in terms of rates of convergence for nonparametric hypothesis testing according to the formulations of Ingster (1993), the distance-based test has its own advantages, as advocated in Hong and Lee (2013). Let σ_n(u, v) = n^{−1} B′_1(u) Σ0^{11} B_1(v) and σ̃_n(u, v) = n^{−1} B̃′_1(u) Σ̃^{−1}_{0,11} B̃_1(v). The following theorem gives the limiting null distributions of our test statistics, demonstrating that the Wilks phenomenon is still observed in the current situation.
where Φ(·) is the standard normal distribution function. Hence, as n → ∞ the power goes to one, since s*_n^{−1} c* → +∞ and 0 ≤ s*_n^{−1} σ*_n ≤ 1. This shows that the test is consistent. With the above alternative distribution, the asymptotic optimality of the test can be obtained using the argument for the generalized likelihood ratio test in Fan, Zhang and Zhang (2001) and Fan and Jiang (2005).
To implement the proposed test, we need to obtain the null distribution of T_n (or T̃_n). Theoretically, the asymptotic null distribution of T_n (or T̃_n) can be used to determine the p-value, but it may not give a good approximation in a finite sample setting because of the low convergence rate in Theorem 7, which is a common phenomenon in nonparametric hypothesis testing (Bickel and Rosenblatt, 1973; Fan and Jiang, 2005; Hong and Lee, 2013). To deal with this difficulty, we propose the following bootstrap method to find the p-value. Let F_n be the empirical distribution of the observations {S_i, δ_i, W_i, X_i}_{i=1}^n. The bootstrap procedure is detailed as follows.
1. Resample a bootstrap sample {S∗i, δ∗i, W∗i, X∗i}, i = 1, . . . , n, from Fn.

2. Based on the bootstrap sample, fit model (1.1) to obtain the estimate of ϕ1, denoted by ϕ∗1, using the same routine as for ϕ̂1. Then compute the bootstrap version of the test statistic Tn:

T∗n = nh1 ∫_0^1 {ϕ∗1(w1) − ϕ̂1(w1)}² a(w1) dw1.

3. Repeat steps 1 and 2 to obtain a sample of T∗n values, say T∗(k)n, k = 1, . . . , B.

4. Use the bootstrap sample {T∗(k)n, k = 1, . . . , B} to determine the quantiles of Tn.
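The resampling scheme above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the routine `fit_phi1` (a stand-in for the spline fit of ϕ1 under model (1.1)), the bandwidth-type rate used for h1, and the grid-based integration are all our own assumptions.

```python
import numpy as np

def bootstrap_pvalue(data, fit_phi1, T_obs, a=lambda w: 1.0,
                     n_boot=200, n_grid=201, rng=None):
    """Steps 1-4 of the bootstrap calibration for the distance-based test.

    `data` holds the observations (S_i, delta_i, W_i, X_i) as rows, and
    `fit_phi1(data, grid)` returns an estimate of phi_1 on a grid over
    [0, 1]; both are hypothetical placeholders for the spline routine.
    """
    rng = np.random.default_rng(rng)
    n = data.shape[0]
    h1 = n ** (-1.0 / 5.0)                  # illustrative bandwidth-type rate
    w_grid = np.linspace(0.0, 1.0, n_grid)
    dw = w_grid[1] - w_grid[0]
    phi1_hat = fit_phi1(data, w_grid)       # fit on the original sample
    T_boot = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)    # step 1: resample rows from F_n
        phi1_star = fit_phi1(data[idx], w_grid)   # step 2: refit on the bootstrap sample
        integrand = (phi1_star - phi1_hat) ** 2 * a(w_grid)
        # trapezoidal rule for the integral defining T*_n
        T_boot[b] = n * h1 * dw * (integrand[:-1] + integrand[1:]).sum() / 2.0
    # steps 3-4: locate the observed statistic within the bootstrap distribution
    return float(np.mean(T_boot >= T_obs)), T_boot
```

With a real spline fitter in place of `fit_phi1`, the returned p-value would be compared with the nominal level of the test.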
The above bootstrap method can be modified for T̃n in the obvious way, and we use T̃∗n to denote the bootstrap test statistic corresponding to T̃n. The following theorem establishes the consistency of the proposed bootstrap method.
Theorem 9. Assume that the conditions in Theorem 7 hold. Then under H0, supt |P(T∗n < t | Fn) − P(Tn < t)| → 0 a.s. and supt |P(T̃∗n < t | Fn) − P(T̃n < t)| → 0 a.s.
Although the asymptotic null distribution of Tn (or T̃n) may not be approximated well, the null distribution of T∗n (or T̃∗n) can be obtained by resampling. Theorem 9 ensures that the null distribution of Tn (or T̃n) is well approximated by the conditional distribution of T∗n (or T̃∗n), given the original sample. See Figure 2 in the simulation section for the finite sample performance of the proposed test.
To implement the estimation method in (2.4), we need to specify the locations of the knot sequence {ξj,k, k = 1, . . . , Kj,n}. Theoretically, asymptotically optimal knot placement can be derived from our asymptotic results in Section 4 by following an argument similar to that in Agarwal and Studden (1980). In practice, equally spaced knots and quantile knots are the two common placement methods. Throughout this section we use the latter, which places the knots at sample quantiles of the variable, so that approximately the same number of observed values falls between any two adjacent knots. The numbers of knots {Kj,n, j = 1, . . . , J} are smoothing parameters, chosen by minimizing the BIC:

BIC = −2 log(likelihood) + log(n){d + Σ_{j=1}^{J}(Kj,n + ℓj − 1)},

where Kj,n and ℓj − 1 denote, respectively, the number of knots and the degree of the spline used for estimating ϕj. For Cox-type models with a single index, λ(t|x) = λ0(t) exp{ψ(β′0x)}, Huang and Liu (2006) used the BIC and showed via simulations that polynomial spline estimation performs well in general. Other methods, such as cross-validation and generalized cross-validation, can also be used to choose Kj,n; for details, see O'Sullivan (1988), Hastie and Tibshirani (1990), and Nan et al. (2005).
7.2 Simulations
We conduct simulations to illustrate the performance of the polynomial spline smoother, to demonstrate that the proposed bootstrap method gives an accurate estimate of the distribution of our test statistic Tn, and to check the consistency and power of the proposed test. We also compare the polynomial spline smoother with other methods.
There are several estimation methods for model (1.1), for example, polynomial spline estimation and kernel-based estimation, but there is no solid limiting theory for the kernel-based profile partial likelihood (ppl) estimation except for the univariate ϕ(w) (J = 1) in Cai et al. (2007). Polynomial spline estimation is easy to compute and has very good finite sample performance. We compare the polynomial spline estimation with the two-step estimation and with the oracle estimator, which, when estimating a specific additive component, assumes the remaining components are known. For the univariate ϕ(w), we also compare our method with the ppl method in Cai et al. (2007); our estimators remain easy to compute.
For each simulation, we use natural cubic splines without intercept and with degrees of freedom (df) between 3 and 20 for approximating each ϕj. With this option, the quantile knots method is used to place the df − 1 knots at the sample quantiles of the variable. We employ the BIC criterion in Section 7.1 to select the number of knots for each ϕj. Our code is available upon request. Jiang and Jiang (2001) investigated the polynomial spline estimator of β in detail; to save space, we focus here on estimation of the nonparametric components.
Example 1. (Bivariate ϕ) We sample data according to the following scheme. First, we generate the covariate X from the bivariate normal distribution with N(0, 1) marginals and correlation 0.5, and Z1 and Z2 independently from U(0, 1). Next, we set W1 = Z1 and W2 = 0.5Z1 + 0.5√3 Z2, so that W1 and W2 have correlation 0.5. Then, we generate the failure time from an exponential distribution with hazard function

λ(t;X,W) = λ0 exp{β1X1 + β2X2 + ϕ(W)},

where λ0 = 1, β1 = 0.6, β2 = 0.4, and ϕ(w) = ϕ1(w1) + ϕ2(w2) with ϕ1(w1) = exp(2w1) − 0.5{exp(2) − 1} and ϕ2(w2) = 0.5π sin(πw2) − 1. Finally, given (X,W), we generate the censoring variable from U[0, 1], independent of the failure time. For this setting the censoring rate is about 30%.
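The sampling scheme of Example 1 can be sketched as follows; this is an assumed reconstruction for illustration, and the function name and use of numpy generators are ours, not the authors' code.

```python
import numpy as np

def generate_example1(n, rng=None):
    """Sketch of the Example 1 design: correlated normal X, correlated
    uniform-based (W1, W2), exponential failure times, U[0, 1] censoring."""
    rng = np.random.default_rng(rng)
    # X ~ bivariate normal with N(0, 1) marginals and correlation 0.5
    X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=n)
    Z1, Z2 = rng.uniform(size=n), rng.uniform(size=n)
    W1 = Z1
    W2 = 0.5 * Z1 + 0.5 * np.sqrt(3) * Z2          # corr(W1, W2) = 0.5
    phi = (np.exp(2 * W1) - 0.5 * (np.exp(2) - 1)) \
        + (0.5 * np.pi * np.sin(np.pi * W2) - 1)
    hazard = np.exp(0.6 * X[:, 0] + 0.4 * X[:, 1] + phi)   # lambda_0 = 1
    T = rng.exponential(1.0 / hazard)               # exponential failure time
    C = rng.uniform(0.0, 1.0, size=n)               # censoring variable
    S = np.minimum(T, C)                            # observed time
    delta = (T <= C).astype(int)                    # failure indicator
    return S, delta, W1, W2, X
```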
Figure 1: Estimated functions with 95% confidence intervals. Top panel: polynomial spline estimation; bottom panel: oracle polynomial spline estimation. Solid: true; dashed: median; dot-dash: 2.5% and 97.5% percentiles.
To compare the polynomial spline estimator with the oracle one, we run 1,000 simulations; in each we draw a random sample of size n = 200 and compute the estimates. Figure 1 shows the median curves with envelopes formed by the pointwise 2.5% and 97.5% sample percentiles for the two estimation approaches. Both estimators perform well in this example, although the confidence intervals of the oracle estimator are slightly narrower. We also calculate the two-step polynomial spline estimator. As expected, its performance is quite close to that of the oracle estimator; we omit the results to save space.
To investigate the difference between the distribution of T∗n and that of Tn, we run 1,000 simulations. For each simulation, we draw three bootstrap samples (the results are almost the same if more bootstrap samples are drawn). In total we have 3,000 bootstrap samples, which gives 3,000 realized values of T∗n; applying a kernel density estimate to these values yields the distribution of T∗n. We also calculate the value of Tn in each simulation, obtaining 1,000 realized values of Tn and hence a kernel density estimate of the distribution of Tn. Figure 2 displays the estimated densities of Tn and T∗n for sample sizes n = 400 and n = 1200. The two densities nearly coincide, so it is reasonable to use the bootstrap method to approximate the null distribution of Tn with a moderately large sample. This result agrees with Theorem 9.
Figure 2: Estimated densities. Left panel: n = 400; right panel: n = 1200. Solid: true; dotted: the bootstrap approximation.
Last, we check the power of our test. We evaluate the power over a sequence of alternative models indexed by γ,

Hγ : ϕ1(w1) = exp(2w1) − 0.5{exp(2) − 1} + γg(w1),

where g(w1) = 2w1 − 1 and γ = 0, 0.2, . . . , 0.8, 1. The alternative sequence ranges from the null model to models reasonably far from it. When γ = 0, the alternative becomes the null, and the
corresponding power should be about the significance level, indicating that the test holds its size. As γ increases, the alternative moves further from the null hypothesis and the power grows. For a fixed value of γ, the rejection rate of the null hypothesis should increase to one as n goes to ∞, which shows that the test is consistent. These phenomena are observed in Table 1, which demonstrates the good performance of the proposed methodology. For this example, the two-step estimation and the corresponding test T̃n perform very similarly, so the results are omitted to save space. ⋄
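The rejection-rate computation summarized above can be illustrated with a toy power study. For illustration only, a one-sided z-test for a mean shift γ stands in for the distance-based test Tn (which would require the full spline machinery); only the simulate-test-average bookkeeping carries over.

```python
import numpy as np

def empirical_power(gammas, n, n_sim=500, rng=None):
    """Empirical rejection rates over a grid of alternatives indexed by gamma.

    Surrogate test (our assumption): reject H0: mean = 0 when
    sqrt(n) * xbar exceeds the upper 5% normal critical value.
    """
    rng = np.random.default_rng(rng)
    z_crit = 1.6448536269514722          # upper 5% point of N(0, 1)
    power = []
    for g in gammas:
        rejections = 0
        for _ in range(n_sim):
            x = rng.normal(loc=g, scale=1.0, size=n)
            z = np.sqrt(n) * x.mean()    # test statistic; N(0, 1) under H0
            rejections += (z > z_crit)
        power.append(float(rejections) / n_sim)
    return power
```

At γ = 0 the rejection rate should hover near the 5% level, and it should climb toward one as γ grows, which is exactly the pattern reported for Tn in Table 1.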
Example 2. In the previous example, the polynomial spline estimator is very close to the two-step estimator and the oracle estimator. One may wonder whether the two-step estimator is better only in its asymptotic theory. We consider a setting similar to Example 1 but with W1 = Z1, W2 = −(9/√19)Z1 + Z2, ϕ1(w1) = −8w1(1 − w1)(1 − 2w1)(1 + w1) − 2/15, and ϕ2(w2) = (π/2) sin(2πw2). The correlation coefficient between W1 and W2 is then as high as −0.9. We run 1,000 simulations. For each simulation we draw a random sample of size n = 400 and calculate the polynomial spline estimates, the oracle estimates and the two-step estimates, using natural cubic B-splines with the number of knots selected by the BIC criterion for each of them. The initial estimator for the two-step procedure uses fifth-order polynomial spline estimation, so that it is undersmoothed. Figure 3 displays the estimated percentiles of the additive components. The two-step estimator is similar to the oracle estimator and is clearly better than the polynomial spline estimator, since the envelopes formed by the 2.5th and 97.5th percentiles for the two-step estimator are much narrower. This reflects the two-step estimator's oracle property. ⋄
Example 3. (Univariate ϕ) In this example, we compare our estimator with the ppl estimator in Cai et al. (2007). Since the ppl estimation can deal only with a one-dimensional ϕ(·), we consider the following model:
λ(t;X,W ) = λ0 exp{β1X1 + β2X2 + ϕ(W )},
where λ0 = 1, β1 = 0.6, β2 = 0.4, and ϕ(w) = −8w(1 − w²). The survival function is S(t) = exp[−tλ0 exp{β1X1 + β2X2 + ϕ(W)}]. We generate W from U(0, 1) and X = (X1, X2)′ from a bivariate normal distribution with correlation coefficient 0.5 and N(0, 1) marginals. Given X and W, we generate the failure time T from the above survival function. The censoring time is generated from U(0, 6), which produces about 63.2% censoring. The sample size is n = 200. We run 600 simulations and calculate both estimators of ϕ(·). Figure 4 displays the estimated ϕ(·). The two estimates are close except near the boundary, as expected, since neither estimator dominates the other in univariate nonparametric regression. However, the computational advantage of the polynomial spline estimator is substantial: its computing time is about 1% of that of the ppl estimator. Specifically, the CPU time is 55.24 seconds for the former versus 5,216 seconds for the latter on a personal laptop (8GB RAM, Intel Core i5-5200U CPU at 2.20GHz).
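Because S(t) = exp[−tλ] with λ = λ0 exp{β1X1 + β2X2 + ϕ(W)}, the failure time can be drawn by inverse-transform sampling: T = −log(U)/λ for U ~ U(0, 1). Below is a hedged sketch of the Example 3 design; the function name and generator choices are ours, not the authors' code.

```python
import numpy as np

def generate_example3(n, rng=None):
    """Sketch of the Example 3 design via inverse-transform sampling:
    S(t) = exp(-t * lam) gives T = -log(U) / lam for U ~ U(0, 1)."""
    rng = np.random.default_rng(rng)
    W = rng.uniform(size=n)
    # X ~ bivariate normal with N(0, 1) marginals and correlation 0.5
    X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=n)
    phi = -8.0 * W * (1.0 - W ** 2)
    lam = np.exp(0.6 * X[:, 0] + 0.4 * X[:, 1] + phi)   # lambda_0 = 1
    T = -np.log(rng.uniform(size=n)) / lam               # failure time
    C = rng.uniform(0.0, 6.0, size=n)                    # censoring time
    S = np.minimum(T, C)
    delta = (T <= C).astype(int)
    return S, delta, W, X
```

With this design a sizable fraction of failure times exceed the U(0, 6) censoring times, matching the heavy censoring reported in the example.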
7.3 A real data example
We use our proposed methodology to analyze the Framingham Heart Study (FHS) data (Dawber, 1980). There are 1,571 observations and about 90.42% censoring in the dataset. We are interested in the failure time, measured from the time of the "age 45" exam to the
occurrence of coronary heart disease (CHD). The risk factors include age (at the "age 45" exam), gender, systolic blood pressure (SBP), body mass index (BMI), cholesterol level, waiting time, and cigarette smoking status. We first fit the data using model (1.1) with
x = (Smoking status, Gender)′ and ϕ(w) = Σ_{j=1}^{5} ϕj(wj),
where w1 = Cholesterol, w2 = BMI, w3 = SBP, w4 = Age, and w5 = Waiting time. This model includes the Cox proportional hazards model as a special case.
We use natural cubic splines without intercept and with degrees of freedom (df) between 1 and 10 for approximating each ϕj. With this option, the df − 1 knots are chosen according to the quantile knots method. Based on the BIC criterion in Section 7.1 for selecting the number of knots for each ϕj, we estimate the model parameters and functions and check the significance of each additive component using the proposed test statistic. Our test suggests that Age and Waiting time are not significant, with p-values 0.58 and 0.98, respectively. Since Age is a confounding variable, we remove w5 but retain Age, now entering linearly, and fit the data using model (1.1) with
x = (Age, Smoking status, Gender)′ and ϕ(w) = Σ_{j=1}^{3} ϕj(wj),
where w1 = Cholesterol, w2 = BMI, and w3 = SBP. The BIC method automatically chooses df = 1 for each of these ϕj's, giving the estimates in Figure 5. The two sets of estimates are very close, which is expected because the sample size is large and both estimators are consistent. The effect of each variable is strictly increasing in the level of that variable. We then consider testing H0 : ϕj = 0 versus H1 : ϕj ≠ 0 for j = 1, 2, 3. The corresponding p-values are reported in Table 2, which indicates that all these functions are significant. This supports the use of Cox's model for the FHS data in Clegg et al. (1999).
Table 2: Significance tests of the ϕj's

p-values   ϕ1(·)   ϕ2(·)   ϕ3(·)
Tn          0.003   0.054   0.004
T̃n         0.010   0.048   0.005
8 Discussion
We have studied the local asymptotics of the polynomial spline estimators of partly linear additive Cox models and hypothesis testing problems for the additive nonparametric components.
Figure 5: Estimated additive components for the FHS data. Dotted: polynomial spline estimates; solid: two-step estimates.
We have made dedicated efforts to establish the uniform Bahadur representation and the design-adaptive asymptotic normality of the polynomial spline estimators. We have also proposed a two-step estimation procedure for estimating the additive components and established its oracle property, in the sense that one component can be estimated as if all other components were known in advance; it has been demonstrated that the two-step estimators are more efficient. We have proposed a distance-based statistic for specification tests of the additive components and obtained the asymptotic distributions of the proposed test. We have also proposed a consistent bootstrap approach to calculate the p-value of the test. Our simulations demonstrate the good performance of the estimation methods and the test. We have applied our approach to analyze the FHS data.
Since our approach is based on the partial likelihood of Cox's models, it can essentially be extended to likelihood-based inference. The proposed methodology is also applicable to other models fitted via polynomial spline estimation, for example, generalized additive models and transformation models. An interesting project is to study quantile estimation of single-index models (Kong and Xia, 2007) and transformation models (Chen, Jin and Ying, 2002; Ma and Kosorok, 2005; Chen and Tong, 2010; Lu and Zhang, 2010) using polynomial spline approximations to the nonparametric components. This is among our future projects.
Appendix. Proofs of Theorems
Throughout the proofs, for any column vector a, let ∥a∥ = (a′a)^{1/2} be the Euclidean norm, and for any square matrix A, let ∥A∥ = sup{∥Ax∥ : ∥x∥ = 1} be the operator norm of A, which reduces to the Euclidean norm when A is a column vector. Denote by λmin(B) and λmax(B) the minimum and maximum eigenvalues of a square matrix B, respectively. For any probability measure P, define L2(P) = {f : ∫ f² dP < ∞}. Let ∥·∥2 be the usual L2-norm with respect to P, that is, ∥f∥2 = (∫ f² dP)^{1/2}, and let ∥·∥∞ denote the supremum norm. For k = 0, 1, 2, let

Vnk(s, b) = n^{−1} Σ_{i=1}^{n} B(Wi)^{⊗k} Yi(s) exp{β′Xi + b′B(Wi)}.
For ease of notation, we introduce the following matrices:

Σn0 = ∫_0^τ [Rn2(s)/Rn0(s) − {Rn1(s)/Rn0(s)}^{⊗2}] Rn0(s) dΛ0(s),

Σn1(b) = n^{−1} Σ_{i=1}^{n} ∫_0^τ [Vn2(s, b)/Vn0(s, b) − {Vn1(s, b)/Vn0(s, b)}^{⊗2}] dMi(s),

Σn2(b) = ∫_0^τ [Vn2(s, b)/Vn0(s, b) − {Vn1(s, b)/Vn0(s, b)}^{⊗2}] Rn0(s) dΛ0(s).
We denote the score function by U(β, b) = ∂ℓ(β, b)/∂b and the Hessian matrix by Σn(b) = −n^{−1}∂U(β, b)/∂b′. Then by (4.5),

Σn(b) = n^{−1} ∫_0^τ [Vn2(s, b)/Vn0(s, b) − {Vn1(s, b)/Vn0(s, b)}^{⊗2}] dN̄(s) = Σn1(b) + Σn2(b).
To ease the arguments in the proofs, we introduce the centered versions of the variables, B̃(w) = B(w) − Σ_{j=1}^{n} δjB(Wj)/Σ_{j=1}^{n} δj and X̃i = Xi − Σ_{j=1}^{n} δjXj/Σ_{j=1}^{n} δj. By (2.4), it is straightforward to verify that

ℓ(β, b) = Σ_{i=1}^{n} δi [β′X̃i + b′B̃(Wi) − log Σ_{k∈Ri} exp{β′X̃k + b′B̃(Wk)}].  (A.1)
In the following, we present the proofs of our theorems. To streamline the arguments, we use several technical lemmas, which are relegated to the supplementary material.
Proof of Theorem 1. The proof consists of the following four steps. (i) Taylor expansion for the score function. Let U(β, b) = ∂ℓ(β, b)/∂b. Using (2.4) and
αn(w) = op(h^ℓ) uniformly for w ∈ W. Since β̂ − β0 = Op(1/√n), it is easy to show that B′(w)rn0 = Op(n^{−1/2}). Therefore, by (A.7),

ϕ∗(w) − ϕ∗n0(w) = vn(w) + op(h^ℓ + 1/√(nh)),  (A.8)

uniformly in w ∈ W. Then, by (A.5) and (A.8),

ϕ∗(w) − ϕ(w) − α(w) = vn(w) + op(h^ℓ + 1/√(nh)),  (A.9)

uniformly in w ∈ W.

(iii) Asymptotic analysis of vn. Let v∗n = n^{−1} Σ_{i=1}^{n} ∫_0^τ {B(Wi) − Vn1(s, b0)/Vn0(s, b0)} dMi(s). Then

vn(w) = B′(w)Σn^{−1}(b0) v∗n.
Note that Mi(t) is a martingale and B(Wi) − Vn1(s, b0)/Vn0(s, b0) is Fi(s)-predictable. It follows that E(v∗n) = 0 and

E∥v∗n∥² = tr{E(v∗n^{⊗2})} = n^{−1} ∫_0^τ E[tr{Vn(s)}] Rn0(s) dΛ0(s),

where Vn(s) = Vn2(s, b0)/Vn0(s, b0) − {Vn1(s, b0)/Vn0(s, b0)}^{⊗2}. Let

Gn(a, s) = EPn{v²(W)r∗n(s)}/EPn{r∗n(s)} − [EPn{v(W)r∗n(s)}/EPn{r∗n(s)}]²,

where Pn is the empirical distribution function of {Wi, Xi, Yi(s)}, i = 1, . . . , n, r∗n(s) = Y(s) exp{β′X + ϕ∗n0(W)}, and v(w) = a′B(w). Hence, for any vector a such that ∥a∥ = 1, we have

a′Vn(s)a = Gn(a, s).

By Lemma 8(i), a′Vn(s)a = G(a, s) + o(h), where G(a, s) is the population version of Gn(a, s). Then, by Lemma 4, a′Vn(s)a = O(h), uniformly for s ∈ [0, τ], and hence the eigenvalues of Vn(s) are all of order O(h) and tr{Vn(s)} = O(1), uniformly for s ∈ [0, τ]. Therefore, E∥v∗n∥² = O(1/n). Applying the Markov inequality, we obtain that ∥v∗n∥ = Op(1/√n). Let
ξn1 = n^{−1} Σ_{i=1}^{n} ∫_0^τ {B(Wi) − R∗1(s)/R∗0(s)} dMi(s),

ξn2 = n^{−1} Σ_{i=1}^{n} ∫_0^τ {R∗1(s)/R∗0(s) − Rn1(s)/Rn0(s)} dMi(s),

ξn3 = n^{−1} Σ_{i=1}^{n} ∫_0^τ {Rn1(s)/Rn0(s) − Vn1(s, b0)/Vn0(s, b0)} dMi(s) ≡ n^{−1} Σ_{i=1}^{n} ∫_0^τ Rn(s) dMi(s).
uniformly w ∈ W . Naturally, it can be written that
vn(w) = B′(w)Σ−1n0 ξn1
+B′(w){Σ−1n (b0)−Σ−1
n0 }ξn1 + op(1/√nh), (A.14)
uniformly in w ∈ W . By Lemma 8, ∥Σn1(b0)∥ = Op(h/√n). By Lemma 1, |ϕ(Wi) −
ϕ∗n0(Wi)| = O(hℓ), uniformly for i = 1, . . . , n. By Lemma 1, Condition (A3) and Taylor’s
expansion,
Vnk(s,b0)− Rnk(s) =1
n
n∑i=1
r∗i (s)B(Wi)⊗k{ϕ∗
n0(Wi)− ϕ(Wi)}
+O(h2ℓ), a.s. (A.15)
uniformly for components and for s ∈ [0, τ ]. By (2.3), (A.15) and an argument similar to thatfor Lnk(s) in Lemma 11, it is easy to show that, for ℓ ≥ 2, Rnk(s) − Vnk(s,b0) = op(h
ℓ+1)
uniformly for components. Thus, Σn2(b0)−Σn0 = op(hℓ+1), uniformly for components. This
leads to ∥Σn2(b0)−Σn0∥ = o(hℓ). Hence, by Lemma 8, for ℓ ≥ 2,
∥Σn(b0)−Σn0∥ ≤ ∥Σn1(b0)∥+ ∥Σn2(b0)−Σn0∥
= Op(h/√n) + op(h
ℓ), (A.16)
which, combined with Lemma 8, yields that ∥Σn0^{−1}{Σn(b0) − Σn0}∥ = op(1). Then applying Lemma 12, we establish that

Σn^{−1}(b0) = [I + Σn0^{−1}{Σn(b0) − Σn0}]^{−1}Σn0^{−1} = Σn0^{−1} − Σn0^{−1}{Σn(b0) − Σn0}Σn0^{−1} + γnΣn0^{−1},  (A.17)

where ∥γn∥ = O(∥Σn0^{−1}{Σn(b0) − Σn0}∥²). This, combined with (A.16) and Lemma 8, leads to ∥γn∥ = Op(1/n) + op(h^{2ℓ−2}). Then, for ℓ ≥ 2,

∥γnΣn0^{−1}∥ = {Op(1/n) + op(h^{2ℓ−2})}h^{−1} = op(1/√h).
Note that E(ξn1^{⊗2}) = n^{−1}Σ0. It follows that

E{tr(ξn1^{⊗2})} = tr{E(ξn1^{⊗2})} = n^{−1}tr(Σ0) ≤ qnλmax(Σ0)/n = O(1/n),

and hence

∥ξn1∥ = Op(1/√n).  (A.18)

Then, by (A.14) and (A.17), we have

vn(w) = B′(w)Σn0^{−1}ξn1 − B′(w)Σn0^{−1}{Σn(b0) − Σn0}Σn0^{−1}ξn1 + op(1/√(nh)),  (A.19)
uniformly in t ∈ [0, τ ]. This, combined with the Doob-Meyer decomposition, leads to
Λ0(t)− Λ0(t) = n−1
∫ t
0
r−10 (u)dM(u) + op(1)
≡ γn(t) + op(1),
uniformly in t ∈ [0, τ ]. Since r0(u) is F(u)-predictable and M(u) is a martingale with respectto F(u), by the martingale central limit theorem, γn(t) is obviously of order Op(1/
√n). By
the Borel-Lebesgue covering theorem, for any small ε > 0, there exist a finite number of openintervals, (τj − ε, τj + ε) for j = 1, . . . , L, such that τj ∈ (0, τ) and [0, τ ] ⊂ ∩L
j=1(τj − ε, τj + ε).Since each γn(t) is of order op(1), max1≤j≤L γn(τj) = op(1). For any t ∈ [0, τ ], it must be inone of the intervals, for example, the kth interval (τk − ε, τk + ε). Then |t− τk| < ε. Note thatr0(u) = n−1
∑ni=1 r
∗i (u). It follows that
|γn(t) − γn(τk)| = |∫_{τk}^{t} r0(u)^{−1} n^{−1} Σ_{i=1}^{n} {dNi(u) − r∗i(u) dΛ0(u)}| ≤ |∫_{τk}^{t} r0(u)^{−1} n^{−1} Σ_{i=1}^{n} dNi(u)| + |∫_{τk}^{t} dΛ0(u)| = O(ε),
uniformly in t. Hence,

sup_{t∈[0,τ]} |γn(t)| ≤ max_{1≤j≤L} |γn(τj)| + sup_{t∈[0,τ]} |γn(t) − γn(τk)| = op(1).

Then the result of the theorem follows.
Proof of Theorem 5. By the definitions of Σ̂0, Σn0 and Σ0, we have

Σ̂0 − Σn0 = ∫_0^τ [Rn2(s)/Rn0(s) − {Rn1(s)/Rn0(s)}^{⊗2}] Rn0(s) d{Λ̂0(s) − Λ0(s)},  (A.34)

and

σ̂n²(w) − σn²(w) = n^{−1}B′(w)(Σ̂0^{−1} − Σ0^{−1})B(w) = n^{−1}B′(w)(Σ̂0^{−1} − Σn0^{−1})B(w) + n^{−1}B′(w)(Σn0^{−1} − Σ0^{−1})B(w) ≡ Un1(w) + Un2(w).
By (A.20) and Lemma 15, we have

|Un2(w)| ≤ n^{−1}∥B′(w)Σn0^{−1}∥∥Σn0 − Σ0∥∥Σ0^{−1}B(w)∥ = op{1/(nh)},  (A.35)

uniformly in w. Then

σ̂n²(w) − σn²(w) = Un1(w) + op{1/(nh)},  (A.36)
uniformly in w. For any vector a such that ∥a∥ = 1, we have |a′Σ̂0a| = |a′Σ0a|{1 + op(1)}. Then, similar to (A.20),

∥B′(w)Σ̂0^{−1}∥ = Op(h^{−1}),  (A.37)

uniformly in w. Note that Rnk(s) = Op(1) for k = 0, 1, 2. By (A.34), Theorem 4 and the argument in Lemma 15, it can be shown that

∥Σ̂0 − Σn0∥ = op(1).  (A.38)

Then by (A.20), (A.37) and (A.38),

|Un1(w)| ≤ n^{−1}∥B′(w)Σ̂0^{−1}∥∥Σ̂0 − Σn0∥∥Σn0^{−1}B(w)∥ = n^{−1}Op(h^{−1})op(1)Op(h^{−1}) = op{1/(nh)},

uniformly in w. This, combined with (A.36), leads to

σ̂n²(w) − σn²(w) = op{1/(nh)},

uniformly in w. Similarly, σ̂n,j²(wj) − σn,j²(wj) = op{1/(nh)}, uniformly in wj. This, together with Theorem 2, completes the proof of the theorem.
Proof of Theorem 6. Similar to the proof of Theorem 1, we have a Bahadur representation; we give an outline here. Let Uj(bj) = ∂ℓj(bj)/∂bj and

Vnk(s, bj) = n^{−1} Σ_{i=1}^{n} Bj(Wij)^{⊗k} Yi(s) exp{β̂′Xi + bj′Bj(Wij) + ϕ̂−j(Wi,−j)}.

Then Uj(b̂j) = 0. Similar to (A.9), we have

ϕ∗j(wj) − ϕj(wj) − αj(wj) = vnj(wj) + op(h^{ℓj} + 1/√(nhj)),  (A.39)

uniformly in wj, where

vnj(wj) = Bj′(wj)Σnj(b0j)^{−1} n^{−1} Σ_{i=1}^{n} ∫_0^τ {Bj(Wij) − Vn1(s, bj)/Vn0(s, bj)} dMi(s).

Using an argument similar to that for (A.24), we obtain that

vnj(wj) = Bj′(wj)Σ0,jj^{−1} ξnj + op(1/√(nhj)),  (A.40)

uniformly in wj. Combining (A.39) and (A.40) and using the same argument as for (A.30), we establish the Bahadur representation:

ϕ̂j(wj) − ϕj(wj) − αj(wj) = Bj′(wj)Σ0,jj^{−1} ξnj + op(h^{ℓj} + 1/√(nhj)).

Then by the same argument as in Theorem 2, we obtain the asymptotic normality result.
It is easy to see that d∗1n is uncorrelated with Tn2 and is asymptotically normal with E(d∗1n) = 0 and var(d∗1n) = 4chb′Σ0^{11}b{1 + o(1)}. Let bi be the ith component of b. Then

bi = ∫_{ξ1,i−ℓ1}^{ξ1,i} B1,i(w1)g(w1)a(w1) dw1 = O(h1),

uniformly for i = 1, . . . , q1. Hence, by Lemma 8,

|b′Σ0^{11}b| ≤ λmax(Σ0^{11})tr(b^{⊗2}) = O(1).

This produces var(d∗1n) = O(h), and hence d∗1n = op(1). Then, by (A.42),

sn^{∗−1}{Tn − μ∗n(1 + op(1)) − c∗} →D N(0, 1),

where sn^{∗2} = σn^{∗2} + 4chb′Σ0^{11}b = O(h). That is, the result in part (i) holds. Using the same argument, we establish the result in part (ii).
Proof of Theorem 9. The results are proved by drawing a parallel between the approximated partial likelihood and its bootstrap analogue. The argument employed here can be useful for proving consistency of bootstrap methods in other scenarios. Let ωki = 1(S∗k = Si, δ∗k = δi, W∗k = Wi, X∗k = Xi), for k, i = 1, . . . , n. Then

P(ωki = 1 | Fn) = 1/n and P(ωki = 0 | Fn) = 1 − 1/n,

where Fn is the empirical distribution of {Si, δi, Wi, Xi}, i = 1, . . . , n. Given the bootstrap sample {S∗i, δ∗i, W∗i, X∗i}, i = 1, . . . , n, the logarithm of the approximated partial likelihood is

ℓ∗(β, b) = Σ_{i=1}^{n} δ∗i [β′X∗i + ϕn(W∗i) − log Σ_{k∈R∗i} exp{β′X∗k + ϕn(W∗k)}],

where R∗i = {j : S∗j ≥ S∗i}. Let ωi = Σ_{k=1}^{n} ωki for i = 1, . . . , n; that is, ωi is the number of times the ith original sample point is drawn in the bootstrap sample. Then

ℓ∗(β, b) = Σ_{i=1}^{n} ωiδi [β′Xi + ϕn(Wi) − log Σ_{k∈Ri} ωk exp{β′Xk + ϕn(Wk)}].  (A.43)
This is just a randomly weighted version of the approximated partial likelihood in (2.4). Note that the bootstrap estimators (β∗, b∗) of (β, b) maximize the likelihood in (A.43). Similar to (A.30) in the proof of Theorem 1, we have

ϕ(b)j(wj) − ϕj(wj) − αj(wj) = B′(w)ejΣ0^{−1}ξ∗n1 + op(h^ℓ + 1/√(nh)),  (A.44)
where ϕ(b)j(wj) is the bootstrap estimate of ϕj(wj), defined in the same way as ϕ̂j(wj) but with b̂ replaced by b∗, and

ξ∗n1 = n^{−1} Σ_{i=1}^{n} ωi ∫_0^τ {B(Wi) − R∗1(s)/R∗0(s)} dMi(s).

Hence, by (A.30) and (A.44),

ϕ(b)j(wj) − ϕ̂j(wj) = B′(w)ejΣ0^{−1}ξ̃n1 + op(h^ℓ + 1/√(nh)),  (A.45)

where ξ̃n1 = n^{−1} Σ_{i=1}^{n} (ωi − 1) ∫_0^τ {B(Wi) − R∗1(s)/R∗0(s)} dMi(s). Similar to (A.41), we have

T∗n = μ∗n + op(1) + T∗n2,  (A.46)
where T∗n2 = 2n^{−1}h Σ_{i<j}(ωi − 1)(ωj − 1)ε′iΓ′1AnΓ1εj. Since ωi = Σ_{k=1}^{n} ωki, we can write

T∗n2 = 2n^{−1}h Σ_{k=1}^{n} ηkk + 2n^{−1}h Σ_{k<ℓ} ηkℓ ≡ T∗n21 + T∗n22,

where ηkℓ = Σ_{i<j}(ωki − 1/n)(ωℓj − 1/n)ε′iΓ′1AnΓ1εj. Note that E[(ωki − 1/n)² | Fn] = 1/n − 1/n², and for i ≠ j, E[(ωki − 1/n)(ωkj − 1/n) | Fn] = −1/n², almost surely. It is easy to show that T∗n21 → 0 almost surely. Furthermore, conditional on Fn, for k ≠ ℓ, E(ηkℓ | Fn) = 0,

var(ηkℓ | Fn) = (1/n − 1/n²)² Σ_{i<j}(ε′iΓ′1AnΓ1εj)²{1 + o(1)},

and for (k, ℓ) ≠ (k′, ℓ′), E(ηkℓηk′ℓ′ | Fn) = 0. Therefore,

var(σn^{∗−1}T∗n22 | Fn) = 4h²n^{−2} Σ_{i<j}(ε′iΓ′1AnΓ1εj/σn∗)²{1 + o(1)} → 1,

almost surely. Using an argument similar to that in the proof of Lemma 16, we obtain that, conditional on Fn, σn^{∗−1}T∗n2 is asymptotically standard normal. Therefore, by (A.46) and Theorem 7, the asymptotic normal distribution of T∗n conditional on Fn is the same as the asymptotic normal distribution of Tn. This, combined with Pólya's theorem, completes the proof of the first result of the theorem. The second result follows from the same argument.
References
Agarwal, G. G. and Studden, W. J. (1980). Asymptotic integrated mean square error using least squares and bias minimizing splines. The Annals of Statistics 8, 1307–1325.
Barrow, D. L. and Smith, P. W. (1978). Asymptotic properties of best L2[0, 1] approximation by splines with variable knots. Quarterly of Applied Mathematics 36, 293–304.

Bickel, P. J. (1975). One-step Huber estimates in linear models. Journal of the American Statistical Association 70, 428–433.

Bickel, P. J., Klaassen, C. A. J., Ritov, Y. and Wellner, J. A. (1993). Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press.

Bickel, P. J. and Rosenblatt, M. (1973). On some global measures of the deviations of density function estimates. The Annals of Statistics 1, 1071–1095.

Breslow, N. E. (1972). Contribution to the discussion on the paper by D. R. Cox, "Regression models and life-tables". Journal of the Royal Statistical Society B 34, 216–217.

Breslow, N. E. (1974). Covariance analysis of censored survival data. Biometrics 30, 89–99.

Buja, A., Hastie, T. J., and Tibshirani, R. J. (1989). Linear smoothers and additive models. The Annals of Statistics 17, 453–555.

Cai, J., Fan, J., Jiang, J., and Zhou, H. (2007). Partially linear hazard regression for multivariate survival data. Journal of the American Statistical Association 102, 538–551.

Chen, K. N., Jin, Z., and Ying, Z. (2002). Semiparametric analysis of transformation models with censored data. Biometrika 89, 659–668.

Chen, K. N. and Tong, X. (2010). Varying coefficient transformation models with censored data. Biometrika 97, 969–976.

Clegg, L. X., Cai, J., and Sen, P. K. (1999). A marginal mixed baseline hazards model for multivariate failure time data. Biometrics 55, 805–812.

Cox, D. R. (1972). Regression models and life-tables (with discussion). Journal of the Royal Statistical Society B 34, 187–220.

Cox, D. R. (1975). Partial likelihood. Biometrika 62, 269–276.

Dawber, T. R. (1980). The Framingham Study: The Epidemiology of Atherosclerotic Disease. Cambridge, MA: Harvard University Press.

de Boor, C. (1978). A Practical Guide to B-splines. Springer-Verlag, New York.
Fan, J. (1992). Design-adaptive nonparametric regression. Journal of the American StatisticalAssociation 87, 998–1004.
Fan, J., Feng, Y. and Song, R. (2011). Nonparametric independence screening in sparse ultra-high-dimensional additive models. Journal of the American Statistical Association 106, 544–557.

Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. New York: Chapman & Hall.

Fan, J., Gijbels, I. and King, M. (1997). Local likelihood and local partial likelihood in hazard regression. The Annals of Statistics 25, 1661–1690.

Fan, J. and Jiang, J. (2005). Generalized likelihood ratio tests for additive models. Journal of the American Statistical Association 100, 890–907.

Fan, J. and Yao, Q. (2003). Nonlinear Time Series: Nonparametric and Parametric Methods. New York: Springer-Verlag.

Fan, J. and Zhang, W. (1999). Statistical estimation in varying coefficient models. The Annals of Statistics 27, 1491–1518.

Fan, J., Zhang, C.-M. and Zhang, J. (2001). Generalized likelihood ratio statistics and Wilks phenomenon. The Annals of Statistics 29, 153–193.

Fleming, T. R. and Harrington, D. P. (1991). Counting Processes and Survival Analysis. New York: Wiley.

Gasser, T. and Müller, H.-G. (1984). Estimating regression functions and their derivatives by the kernel method. Scandinavian Journal of Statistics 11, 171–185.

Hall, P. and Heyde, C. C. (1980). Martingale Limit Theory and Its Application. Academic Press.

Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models. New York: Chapman and Hall.

Hong, Y. and Lee, Y.-J. (2013). A loss function approach to model specification testing and its relative efficiency. The Annals of Statistics 41, 1166–1203.

Horowitz, J. L. and Mammen, E. (2004). Nonparametric estimation of an additive model with a link function. The Annals of Statistics 32, 2412–2443.

Huang, J. (1999). Efficient estimation of the partly linear additive Cox model. The Annals of Statistics 27, 1536–1563.

Huang, J. Z. (1998). Projection estimation for multiple regression with application to functional ANOVA models. The Annals of Statistics 26, 242–272.
Huang, J. Z. (2001). Concave extended linear modeling: A theoretical synthesis. Statistica Sinica 11, 173–197.

Huang, J. Z. (2003). Local asymptotics for polynomial spline regression. The Annals of Statistics 31, 1600–1635.

Huang, J. Z. and Liu, L. (2006). Polynomial spline estimation and inference of proportional hazards regression models with flexible relative risk form. Biometrics 62, 793–802.

Huang, J. Z., Kooperberg, C., Stone, C. J. and Truong, Y. K. (2000). Functional ANOVA modeling for proportional hazards regression. The Annals of Statistics 28, 961–999.

Huang, J. Z. and Shen, H. (2004). Functional coefficient regression models for nonlinear time series: A polynomial spline approach. Scandinavian Journal of Statistics 31, 515–534.

Huang, J. Z. and Stone, C. J. (1998). The L2 rate of convergence for event history regression with time-dependent covariates. Scandinavian Journal of Statistics 25, 603–620.

Liu, R., Yang, L. and Härdle, W. K. (2013). Oracally efficient two-step estimation of generalized additive model. Journal of the American Statistical Association 108, 619–631.

Lu, X. and Song, P. X.-K. (2015). Efficient estimation of the partly linear additive hazards model with current status data. Scandinavian Journal of Statistics 42, 306–328.

Lu, W. and Zhang, H. (2010). On estimation of partially linear transformation models. Journal of the American Statistical Association 105, 683–691.

Ma, S., Carroll, R. J., Liang, H. and Xu, S. (2015). Estimation and inference in generalized additive coefficient models for nonlinear interactions with high-dimensional covariates. The Annals of Statistics 43, 2102–2131.

Ma, S. and Kosorok, M. R. (2005). Penalized log-likelihood estimation for partly linear transformation models with current status data. The Annals of Statistics 33, 2256–2290.

Nan, B., Lin, X., Lisabeth, L. D. and Harlow, S. D. (2005). A varying-coefficient Cox model for the effect of age at a marker event on age at menopause. Biometrics 61, 576–583.

Opsomer, J.-D. (2000). Asymptotic properties of backfitting estimators. Journal of Multivariate Analysis 73, 166–179.

Opsomer, J.-D. and Ruppert, D. (1997). Fitting a bivariate additive model by local polynomial regression. The Annals of Statistics 25, 186–211.

Opsomer, J.-D. and Ruppert, D. (1999). A root-n consistent backfitting estimator for semiparametric additive modeling. Journal of Computational and Graphical Statistics 8, 715–732.
Ø’Sullivan, F. (1988). Nonparametric estimation of relative risk using splines and crossval-idation. Siam J. Sci. Stat. Comput. 9, 531–542.
Sasieni, P. (1992). Information bounds for the conditional hazard ratio in a nested familyof regression models. Journal Royal Statistical Society B 54, 617–635.
Schumaker, L. (1981). Spline Functions: Basic Theory, Wiley, New York.
Stone, C. J. (1985). Additive regression and other nonparametric models. The Annals ofStatistics 13, 689–705.
Stone, C. J. (1986). The dimensionality reduction principle for generalized additive models.The Annals of Statistics 14, 590–606.
40
Page 40 of 58Journal of the American Statistical Association
Stone, C. J. (1994). The use of polynomial splines and their tensor products in multivariatefunction estimation (with discussion). The Annals of Statistics 22, 118–184.
van der Vaart, A. W. (1991). On differentiable functionals. The Annals of Statistics 19,178–204.
Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with groupedvariables. Journal Royal Statistical Society B 68, 49–67.
Zhou, S., Shen, X. and Wolfe, D. A. (1998). Local asymptotics for regression splinesand confidence regions. The Annals of Statistics 26, 1760–1782.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net.Journal Royal Statistical Society B 67, 301–320.
41
Page 41 of 58 Journal of the American Statistical Association
where $u_j(\xi_{j,i}) = \rho_{ij} g_j(\xi_{j,i})$ is a continuous function of $\xi_{j,i}$ for each $j$. Then (A.51) follows from (A.52) and the facts that $h_{j,i}^k - h_{j,i+1}^k = o(h_j^k)$ and $\xi_{j,i+1} - \xi_{j,i} = h_{j,i}$.
Lemma 6. Let $r_i^*(s,\mathbf{b}) = Y_i(s)\exp\{\beta' X_i + \mathbf{b}' B(W_i)\}$. For $k = 1, \ldots, q$, there exists a unique pair $(j, k')$ such that $k = \sum_{j'=1}^{j-1} q_{j'} + k'$, where $j = 1, \ldots, J$ and $k' = 1, \ldots, q_j$, so that $B_{j,k'}(w)$ and $B_{j,k'}(w)$ are the $k$th components of $B(w)$ and $B(w)$, respectively. For $v = 1, 2$, let
$$K_{nv}^*(s,\mathbf{b}) = n^{-1} \sum_{i=1}^n r_i^*(s,\mathbf{b})\, B_{j,k'}(W_i)\{\phi(W_i) - \phi_{n0}(W_i)\}^v.$$
Assume that $\mathbf{b}^*$ lies between $\mathbf{b}$ and $\mathbf{b}_0$, where $\mathbf{b}_0$ is given in Lemma 1. Under the conditions of Theorem 1, $\sup_{s\in[0,\tau]} |K_{n1}^*(s,\mathbf{b}^*)| = O_p(h^{\ell+0.5} + n^{-0.5})$ and $\sup_{s\in[0,\tau]} |K_{n2}^*(s,\mathbf{b}^*)| = O_p(h^{2\ell+1} + n^{-1})$.
Proof. Since there is an $\omega \in [0,1]$ such that $\mathbf{b}^* = \mathbf{b}_0 + \omega(\mathbf{b} - \mathbf{b}_0)$, by Lemma 3 we have $|(\mathbf{b}^* - \mathbf{b}_0)' B(W_i)| = \omega|\phi(W_i) - \phi_{n0}(W_i)| \le \|\phi - \phi_{n0}\|_\infty \xrightarrow{p} 0$, uniformly for $i = 1, \ldots, n$, if
where $\xi = \mathbf{a}' B(W)\{r^*(s)\}^{1/2}$ and $\eta = \{r^*(s)\}^{1/2}$. Using the Hölder inequality, we obtain that $\mathbf{a}' A(s)\mathbf{a} \ge 0$ (that is, $A(s)$ is nonnegative definite), with the equality holding if and only if there exists a real number $c$ such that $c\xi + \eta = 0$ almost surely, or equivalently $\{c\,\mathbf{a}' B(W) + 1\}\{r^*(s)\}^{1/2} = 0$ almost surely, which is not possible. Hence, there exists at least one $s_0 \in [0,\tau]$ such that $A(s_0) > 0$.
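The nonnegative definiteness argument is a weighted Cauchy–Schwarz (Hölder with $p = q = 2$) computation: $E\{B(W)^{\otimes 2} r^*(s)\} - E\{B(W) r^*(s)\}^{\otimes 2}/E\{r^*(s)\}$ is, up to scaling, a covariance matrix under the $r^*$-reweighted law. A minimal numerical sketch, with a hypothetical basis matrix and positive weights standing in for the paper's spline basis $B(W_i)$ and relative-risk weights $r_i^*(s)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 500, 6

# Hypothetical stand-ins: "basis" values B(W_i) and strictly positive
# weights r*_i (not the paper's actual B-spline basis or risk weights).
B = rng.normal(size=(n, q))
r = np.exp(rng.normal(size=n))

# A = E_n{B B' r} - E_n{B r} E_n{B r}' / E_n{r}: a covariance under the
# r-reweighted empirical law, hence nonnegative definite.
EBBr = (B.T * r) @ B / n
EBr = (B * r[:, None]).mean(axis=0)
Er = r.mean()
A = EBBr - np.outer(EBr, EBr) / Er

eigs = np.linalg.eigvalsh(A)
print(eigs.min())  # smallest eigenvalue: nonnegative up to rounding error
```

By the weighted Cauchy–Schwarz inequality, $\mathbf{a}'A\mathbf{a} = E_n\{v^2 r\} - E_n\{v r\}^2/E_n\{r\} \ge 0$ for $v = \mathbf{a}'B$, so every eigenvalue is nonnegative up to floating-point rounding.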
Lemma 8. For any unit vector $\mathbf{a}$, let $v(w) = \mathbf{a}' B(w)$ and $r_n^*(s) = Y(s)\exp\{\beta' X + \phi_{n0}^*(W)\}$, where $\phi_{n0}^*(w) = B'(w)\mathbf{b}_0$ is defined in Lemma 1. Under the conditions in Theorem 1, we have
(i) $E_{P_n}\{v^k(W) r_n^*(s)\} = E_{P_n}\{v^k(W) r^*(s)\} + o(h)$, uniformly for $s \in [0,\tau]$;
(ii) there exist constants $0 < c_i \le d_i < \infty$ (independent of $n$ and $q_n$, for $i = 1, 2, 3$) such that
Proof. (i) Using the Rayleigh–Ritz theorem, one has
$$\lambda_{\min}\{\Sigma_n(\mathbf{b}_0)\} = \min_{\|\mathbf{a}\|=1} \mathbf{a}'\Sigma_n(\mathbf{b}_0)\mathbf{a} \quad\text{and}\quad \lambda_{\max}\{\Sigma_n(\mathbf{b}_0)\} = \max_{\|\mathbf{a}\|=1} \mathbf{a}'\Sigma_n(\mathbf{b}_0)\mathbf{a},$$
where $\mathbf{a}$ is a $Jq_n \times 1$ unit vector. Let $\mathbf{a} = (a_{11}, \ldots, a_{1q_n}, \ldots, a_{J1}, \ldots, a_{Jq_n})'$, $v(w) = \mathbf{a}'B(w)$, and $r_{nj}^*(s) = Y_j(s)\exp\{\beta' X_j + \phi_{n0}^*(W_j)\}$. By the definition of $\Sigma_n(\mathbf{b})$, it is easy to see that
$$\mathbf{a}'\Sigma_n(\mathbf{b}_0)\mathbf{a} = n^{-1}\int_0^\tau \frac{V_{n0}(s,\mathbf{b}_0)\,\eta_2(\mathbf{a},s) - \eta_1(\mathbf{a},s)^{\otimes 2}}{V_{n0}(s,\mathbf{b}_0)^2}\, dN(s),$$
where $\eta_1(\mathbf{a},s) = n^{-1}\sum_{i=1}^n r_{ni}^*(s) B'(W_i)\mathbf{a}$, and
$$\eta_2(\mathbf{a},s) = n^{-1}\sum_{i=1}^n r_{ni}^*(s)\, \mathbf{a}' B(W_i)^{\otimes 2}\mathbf{a}.$$
Then $\eta_1(\mathbf{a},s) = E_{P_n}\{v(W) r_n^*(s)\}$, and
$$\eta_2(\mathbf{a},s) = n^{-1}\sum_{i=1}^n \{\mathbf{a}'B(W_i)\}^2 r_{ni}^*(s) = E_{P_n}\{v^2(W) r_n^*(s)\},$$
where $v(w)$ is defined in Lemma 4. Put
$$G_n(\mathbf{a},s) = \frac{E_{P_n}\{v^2(W) r_n^*(s)\}}{E_{P_n}\{r_n^*(s)\}} - \left[\frac{E_{P_n}\{v(W) r_n^*(s)\}}{E_{P_n}\{r_n^*(s)\}}\right]^2. \qquad (A.62)$$
It can be written that
$$\mathbf{a}'\Sigma_n(\mathbf{b}_0)\mathbf{a} = n^{-1}\int_0^\tau G_n(\mathbf{a},s)\, dN(s). \qquad (A.63)$$
By Taylor's expansion, Lemma 1 and Condition (B3), $r_n^*(s) = r^*(s)\{1 + O(h^\ell)\}$, uniformly for $s \in [0,\tau]$ and $w \in \mathcal{W}$. Since $|v(w)| = |\mathbf{a}'B(w)| \le \|\mathbf{a}\|\,\|B(w)\| \le J^{1/2}$ for any unit vector $\mathbf{a}$,
$$v^k(w)\, r_n^*(s) = v^k(w)\, r^*(s) + O(h^\ell),$$
uniformly for $s \in [0,\tau]$ and $w \in \mathcal{W}$. It follows that, for $k = 0, 1, 2$,
$$E_{P_n}\{v^k(W) r_n^*(s)\} = E_{P_n}\{v^k(W) r^*(s)\} + o(h), \qquad (A.64)$$
uniformly for $s \in [0,\tau]$.
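The Rayleigh–Ritz characterization used at the start of this proof — the extreme eigenvalues of a symmetric matrix are the extrema of the quadratic form $\mathbf{a}'\Sigma\mathbf{a}$ over unit vectors, attained at the corresponding eigenvectors — is easy to confirm numerically (on a generic symmetric matrix, not the paper's $\Sigma_n(\mathbf{b}_0)$):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
M = rng.normal(size=(d, d))
Sigma = (M + M.T) / 2                 # a generic symmetric matrix

lam, V = np.linalg.eigh(Sigma)        # eigenvalues ascending, orthonormal V

# The quadratic form over many random unit vectors stays within
# [lambda_min, lambda_max]; the eigenvectors attain the endpoints.
a = rng.normal(size=(10000, d))
a /= np.linalg.norm(a, axis=1, keepdims=True)
quad = np.einsum('ij,jk,ik->i', a, Sigma, a)

print(lam[0], quad.min(), quad.max(), lam[-1])
```

The sampled quadratic forms bracket strictly inside the eigenvalue range, while plugging in the extreme eigenvectors recovers $\lambda_{\min}$ and $\lambda_{\max}$ exactly.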
uniformly for $s \in [0,\tau]$. Note that $v(w) = \mathbf{a}'B(w)$. It follows that
$$G_n(\mathbf{a},s) = \frac{\mathbf{a}' E_P\{B(W)^{\otimes 2} r^*(s)\}\mathbf{a}}{E_P\{r^*(s)\}} - \left[\frac{\mathbf{a}' E_P\{B(W) r^*(s)\}}{E_P\{r^*(s)\}}\right]^2 + o(h) = \mathbf{a}' A(s)\mathbf{a}/R_0^*(s) + o(h),$$
uniformly for $s \in [0,\tau]$, where $A(s)$ is defined in Lemma 7. By Condition (A3), with probability tending to one $R_0^*(s)$ is bounded away from zero and infinity. Therefore, by Lemma 7, there exists an $s_0 \in [0,\tau]$ such that $A(s_0) > 0$. It is easy to see that $A(s)$ is continuous. Therefore, there is a neighborhood of $s_0$ in which $A(s) > 0$ and thus $G_n(\mathbf{a},s) > 0$. Combining (A.63), (A.65) and Lemma 4 leads to the result in (ii)(a). Parts (ii)(b) and (ii)(c) hold by the same argument as above.
(iii) Similar to (ii), for any $Jq_n \times 1$ unit vector $\mathbf{a}$, one has
$$\mathbf{a}'\Sigma_{n1}(\mathbf{b}_0)\mathbf{a} = n^{-1}\sum_{i=1}^n \int_0^1 G_n(\mathbf{a},s)\, dM_i(s).$$
Since $M_i(s)$ is a martingale, $\mathbf{a}'\Sigma_{n1}\mathbf{a}$ has mean zero and variance
$$n^{-1}\int_0^1 E\{G_n^2(\mathbf{a},s)\, Y_i(s)\, r_i(\beta,\phi)\}\, d\Lambda_0(s),$$
which is of order $O(h^2/n)$ by Lemma 4 and (A.62). Therefore,
$$\mathbf{a}'\Sigma_{n1}(\mathbf{b}_0)\mathbf{a} = O_p(h/\sqrt{n}).$$
It follows from the Rayleigh–Ritz theorem that $\lambda_{\min}\{\Sigma_{n1}(\mathbf{b}_0)\} = O_p(h/\sqrt{n})$ and $\lambda_{\max}\{\Sigma_{n1}(\mathbf{b}_0)\} = O_p(h/\sqrt{n})$. Since $\Sigma_{n1}(\mathbf{b}_0)$ is a symmetric matrix, $\|\Sigma_{n1}(\mathbf{b}_0)\| = O_p(h/\sqrt{n})$.
(iv) Since $\Sigma_n(\mathbf{b}_0)$, $\Sigma_{n0}$, and $\Sigma_0$ are all symmetric, the results for them follow from (ii). The result for $\Sigma_{n2}$ can be proved similarly.
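The final step of (iii) uses the fact that for a symmetric matrix the operator (spectral) norm equals the largest absolute eigenvalue, so the $O_p$ bounds on $\lambda_{\min}$ and $\lambda_{\max}$ transfer directly to $\|\cdot\|$. A quick numerical confirmation on a generic symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.normal(size=(10, 10))
S = (M + M.T) / 2                        # generic symmetric matrix

op_norm = np.linalg.norm(S, 2)           # operator norm (largest singular value)
max_abs_eig = np.max(np.abs(np.linalg.eigvalsh(S)))
print(op_norm, max_abs_eig)              # agree for symmetric matrices
```

For a symmetric matrix the singular values are the absolute eigenvalues, which is exactly why symmetry is invoked before passing from eigenvalue bounds to a norm bound.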
Lemma 9. Under the conditions in Theorem 1, we have $B(w)' r_n = o_p(h^\ell + 1/\sqrt{nh})$ and $B'(w) r_{n1} = o_p(h^\ell + 1/\sqrt{nh})$, where $r_n$ and $r_{n1}$ are defined in the proof of Theorem 1.
Proof. The two statements can be proved along the same lines, but the first is much more difficult than the second, since $\|\hat\beta - \beta\|$ converges faster than $\|\hat{\mathbf{b}} - \mathbf{b}\|$. In the following we provide only the proof of the first statement. Let $H_k(\mathbf{b}) = n^{-1}(\partial^2 U_k/\partial\mathbf{b}\,\partial\mathbf{b}')$,
Lemma 10 (Barrow and Smith 1978; Lemma 6.8 of Agarwal and Studden 1980). Assume that $\phi_j \in C^{\ell_j}[0,1]$ and $\xi \in [0,1)$. Let $i$ be chosen so that $\xi_{j,i} \le \xi < \xi_{j,i+1}$ and let $h_{j,i} = \xi_{j,i+1} - \xi_{j,i}$, where the $\xi_{j,i}$ are the knots defined in Section 2. For $y \in [0,1)$, let
$$R_k(y,\xi) = k^{\ell_j}(\phi_j - \phi_{n0,j}^*)(\xi_{j,i} + y h_{j,i}),$$
and $K(y,\xi) = \{\phi_j^{(\ell_j)}(\xi)/\ell_j!\}\, p^{-\ell_j}(\xi)\, B_{\ell_j}^*(y)$. Then there exists a sequence of positive constants $\{\epsilon_k\}_{k=1}^\infty$ tending to zero, which may be chosen independently of $\xi$, such that
$$\sup_y |R_k(y,\xi) - K(y,\xi)| < \epsilon_k.$$
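Lemma 10 describes the local limit of the scaled approximation error $\phi_j - \phi_{n0,j}^*$; its global counterpart, used repeatedly above, is the rate $\|\phi - \phi_{n0}^*\|_\infty = O(h^\ell)$. The rate is visible numerically: halving the knot spacing of a cubic spline ($\ell = 4$) shrinks the sup-norm error by roughly $2^4 = 16$. A sketch using a cubic interpolating spline as a stand-in for the paper's best sup-norm spline approximant:

```python
import numpy as np
from scipy.interpolate import make_interp_spline

f = np.sin
grid = np.linspace(0, 1, 100001)         # fine grid approximating the sup-norm

def sup_err(nknots):
    """Sup-norm error of the cubic spline interpolant on nknots knots."""
    x = np.linspace(0, 1, nknots)
    spl = make_interp_spline(x, f(x), k=3)
    return np.max(np.abs(spl(grid) - f(grid)))

e1, e2 = sup_err(21), sup_err(41)        # knot spacings h and h/2
ratio = e1 / e2                          # should be close to 2**4 = 16
print(e1, e2, ratio)
```

The observed ratio will not be exactly $16$ (boundary conditions and the fourth derivative of $\sin$ vary over $[0,1]$), but it sits near that value, consistent with the $O(h^\ell)$ rate for $\ell = 4$.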
Lemma 11. For $k = 0, 1$, let $L_{nk}(s) \equiv n^{-1}\sum_{i=1}^n B^k(W_i)\, r_i^*(s)\{\phi(W_i) - \phi_{n0}^*(W_i)\}$, where $r_i^*(s) = Y_i(s)\exp\{\beta' X_i + \phi(W_i)\}$. Under the conditions in Theorem 1, we have
(i) $\int_0^\tau L_{nk}(s)\, g(s)\, d\Lambda_0(s) = o_p(h^{\ell+1})$, uniformly over components, where $g(s)$ is a bounded function;
(ii) $\alpha_n(w) = n^{-1}\sum_{i=1}^n \int_0^\tau B'(w)\,\Sigma_n(\mathbf{b}_0)^{-1}\,\Gamma_n(W_i,s)\, r_i^*(s)\, d\Lambda_0(s) = o_p(h^\ell)$, uniformly for $w \in \mathcal{W}$, where $\Gamma_n(W_i,s) = B(W_i) - V_{n1}(s,\mathbf{b}_0)/V_{n0}(s,\mathbf{b}_0)$.
Proof. (i) It can be rewritten that $L_{nk}(s) = L_{nk1}(s) + L_{nk2}(s)$, where
$$L_{nk1}(s) = n^{-1}\sum_{i=1}^n B^k(W_i)\, f(W_i,s)\{\phi(W_i) - \phi_{n0}^*(W_i)\},$$
$$L_{nk2}(s) = n^{-1}\sum_{i=1}^n B^k(W_i)\{r_i^*(s) - f(W_i,s)\}\{\phi(W_i) - \phi_{n0}^*(W_i)\},$$
and $f(\cdot,s)$ is defined in Condition (A3). Let $L_{nk2}^{[m]}(s)$ be the $m$th component of $L_{nk2}(s)$. Then, for any $\varepsilon > 0$, by Markov's inequality,
$$P\Big\{\sup_{1\le m\le Jq_n}\Big|\int_0^\tau L_{nk2}^{[m]}(s)\, g(s)\, d\Lambda_0(s)\Big| > \varepsilon h^{\ell+1}\Big\} \le \varepsilon^{-2} h^{-2(\ell+1)} \sum_{m=1}^{Jq_n} E\Big\{\int_0^\tau L_{nk2}^{[m]}(s)\, g(s)\, d\Lambda_0(s)\Big\}^2. \qquad (A.67)$$
By Lemma 1, $\|\phi(w) - \phi_{n0}^*(w)\|_\infty = O(h^\ell)$. Then, by Condition (A3) and by interchanging integration and expectation, it is straightforward to show that
$$E\Big\{\int_0^\tau L_{nk2}^{[m]}(s)\, g(s)\, d\Lambda_0(s)\Big\}^2 = O(h^{2\ell}/n).$$
This, combined with (A.67), yields that
$$P\Big\{\sup_{1\le m\le Jq_n}\Big|\int_0^\tau L_{nk2}^{[m]}(s)\, g(s)\, d\Lambda_0(s)\Big| > \varepsilon h^{\ell+1}\Big\} = O\{1/(nh^3)\} \to 0,$$
uniformly for $m$. Following the arguments for (6.30) and (6.31) in Agarwal and Studden (1980), one knows that the second factor on the right-hand side of (A.72) is $o(h)$. Then
$$q^\ell \int_0^\tau L_{nk11}^{(m)}(s)\, g(s)\, d\Lambda_0(s) = \sum_{j=1}^J \sum_{i=p-\ell_j}^{p-1} \rho_{ij} \int_{\xi_{j,i}}^{\xi_{j,i+1}} B_{j,p}^k(w)\, B_{\ell_j}^*\big((w - \xi_{j,i})/h_{j,i}\big)\, g^*(w)\, dQ_j(w) + o(h),$$
uniformly in $m$. Let $q_j(w) = dQ_j(w)/dw$. It can be written that
$$q^\ell \int_0^\tau L_{nk11}^{(m)}(s)\, g(s)\, d\Lambda_0(s) = \sum_{j=1}^J \sum_{i=p-\ell_j}^{p-1} \rho_{ij}\, g^*(\xi_{j,i})\, q_j(\xi_{j,i}) \int_{\xi_{j,i}}^{\xi_{j,i+1}} B_{j,p}^k(w)\, B_{\ell_j}^*\big((w - \xi_{j,i})/h_{j,i}\big)\, dw$$
$$\qquad + \sum_{j=1}^J \sum_{i=p-\ell_j}^{p-1} \rho_{ij} \int_{\xi_{j,i}}^{\xi_{j,i+1}} B_{j,p}^k(w)\, B_{\ell_j}^*\big((w - \xi_{j,i})/h_{j,i}\big) \{g^*(w)\, q_j(w) - g^*(\xi_{j,i})\, q_j(\xi_{j,i})\}\, dw + o(h),$$
uniformly for $m$. Then, by the continuity of $g^*(\cdot)$ and $q_j(\cdot)$, the second term on the right-hand side of the above equation is $o(h)$. Hence,
$$q^\ell \int_0^\tau L_{nk11}^{(m)}(s)\, g(s)\, d\Lambda_0(s) = \sum_{j=1}^J \sum_{i=p-\ell_j}^{p-1} \rho_{ij}\, g^*(\xi_{j,i})\, q_j(\xi_{j,i}) \int_{\xi_{j,i}}^{\xi_{j,i+1}} B_{j,p}^k(w)\, B_{\ell_j}^*\big((w - \xi_{j,i})/h_{j,i}\big)\, dw + o(h),$$
uniformly for $m$. Applying Lemma 5, we obtain that
$$q^\ell \int_0^\tau L_{nk11}^{(m)}(s)\, g(s)\, d\Lambda_0(s) = o(h)$$
uniformly for $m$. Note that $q^{-1} = O(h)$. It follows that
$$\int_0^\tau L_{nk11}^{(m)}(s)\, g(s)\, d\Lambda_0(s) = o(h^{\ell+1}), \qquad (A.73)$$
uniformly for $m$. Hence, (A.69) follows from (A.70), (A.71) and (A.73).
(ii) Each term $\Gamma_n(W_i,s)$ may be viewed as a covariate centered by its empirical average, computed with the probability mass function that assigns weight proportional to $r_{ni}^*(s) \equiv Y_i(s)\exp\{\beta' X_i + \phi_{n0}^*(W_i)\}$; the average of these centered terms with respect to this same discrete probability function is therefore zero. That is,
$n\{1 + o(1)\}$. Note that the $(i,i')$th component of $A_n$ is
$$A_{ii'} = \int_0^1 B_{1,i}(w_1)\, B_{1,i'}(w_1)\, a(w_1)\, dw_1.$$
By (2.3), we have
$$A_{ii'} = \begin{cases} 0, & \text{if } |i - i'| \ge \ell_1; \\ O(h_1), & \text{if } |i - i'| < \ell_1. \end{cases} \qquad (A.80)$$
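The zero pattern in (A.80) reflects the compact support of B-splines: two basis functions whose index gap is at least the spline order (here, $\ell_1$ plays that role) have disjoint supports, so their product integrates to zero, while overlapping pairs contribute $O(h_1)$ because each basis function lives on an interval of length $O(h_1)$. A sketch with SciPy's B-spline basis on a uniform knot grid, taking the weight $a(\cdot) \equiv 1$ for simplicity:

```python
import numpy as np
from scipy.interpolate import BSpline

k = 3                                   # cubic: spline order k + 1 = 4
ninter = 32                             # interior intervals; knot spacing h = 1/32
t = np.concatenate([np.zeros(k), np.linspace(0, 1, ninter + 1), np.ones(k)])
m = len(t) - k - 1                      # number of B-spline basis functions

# Evaluate all basis functions on a fine grid and form the Gram matrix
# G[i, j] ~ \int_0^1 B_i(w) B_j(w) dw by a Riemann sum.
w = np.linspace(0, 1, 20001)
D = BSpline.design_matrix(w, t, k).toarray()
G = D.T @ D * (w[1] - w[0])

h = 1.0 / ninter
d = np.diag(G)                          # diagonal entries, each O(h)
idx = np.arange(m)
off = np.abs(idx[:, None] - idx[None, :]) >= k + 1   # disjoint supports
print(np.abs(G[off]).max(), d.min() / h, d.max() / h)
```

Entries with index gap at least the order are exactly zero (disjoint supports), and the diagonal entries are a constant multiple of $h$, matching the two cases of (A.80).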
Then $\mathrm{tr}(A_n^{\otimes 2}) = O(h_1)$. By an argument similar to that for Lemma 8, $\lambda_{\max}(\Sigma_0^{11}) \asymp h_1^{-1}$, and hence it follows from Lemma 2 that
$$\mathrm{tr}(A_n \Sigma_0^{11} A_n \Sigma_0^{11}) \asymp h_1^{-2}\, \mathrm{tr}(A_n^{\otimes 2}) = O(h_1^{-1}).$$
Therefore, $\sigma_n^{*2} = O(h_1)$. Let $H_n(\varepsilon_i, \varepsilon_j) = 2 n^{-1} h_1\, \varepsilon_i' \Gamma_1' A_n \Gamma_1 \varepsilon_j$, $Y_{ni} = \sum_{j=1}^{i-1} H_n(\varepsilon_i, \varepsilon_j)$ (for $2 \le i \le n$), $S_{nm} = \sum_{i=2}^m Y_{ni}$, and let $\mathcal{F}_{nm} = \sigma(\varepsilon_1, \ldots, \varepsilon_m)$ be the $\sigma$-field generated by $\{\varepsilon_1, \ldots, \varepsilon_m\}$. Then $S_{nn} = T_{n2}$ and, for each $n$, $\{S_{nm}, \mathcal{F}_{nm}\}_{m=2}^n$ is a zero-mean, square-integrable martingale. Let $s_n^2 = E(S_{nn}^2)$. Then $s_n^2 = \mathrm{var}(T_{n2}) = \sigma_n^{*2}\{1 + o(1)\}$. Define $V_n^2 = \sum_{i=2}^n E(Y_{ni}^2 \mid \mathcal{F}_{n,i-1})$. It is straightforward to show that $s_n^{-2} V_n^2 \xrightarrow{p} 1$ and, for each $\epsilon > 0$,
$$s_n^{-2} \sum_{i=2}^n E\{Y_{ni}^2\, I(|Y_{ni}| > \epsilon s_n) \mid \mathcal{F}_{n,i-1}\} \xrightarrow{p} 0,$$
as $n \to \infty$. Then, applying the martingale central limit theorem (Corollary 3.1 of Hall and Heyde (1980)), $\sigma_n^{*-1} T_{n2} \to N(0,1)$ in distribution, as $n \to \infty$.
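The martingale CLT step can be illustrated by simulation: a degenerate quadratic form $T = \sum_{i<j} a_{ij}\varepsilon_i\varepsilon_j$ with banded coefficients, standardized by its exact standard deviation, has mean zero and unit variance and is approximately $N(0,1)$. This is a toy version (i.i.d. standard normal $\varepsilon$, a simple 0–1 banded coefficient matrix) of the statistic $T_{n2}$, not the paper's exact construction with $A_n$ and $\Gamma_1$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, band, reps = 400, 3, 2000

# Banded symmetric coefficients a_{ij} with zero diagonal, mimicking the
# band structure of A_n in (A.80).
i, j = np.indices((n, n))
A = ((np.abs(i - j) <= band) & (i != j)).astype(float)

# T = sum_{i<j} a_{ij} eps_i eps_j = (eps' A eps)/2; var(T) = sum_{i<j} a_{ij}^2.
sd = np.sqrt((A ** 2).sum() / 2)
eps = rng.standard_normal((reps, n))
T = 0.5 * (eps * (eps @ A)).sum(axis=1) / sd   # standardized statistic, reps draws

print(T.mean(), T.var())                        # near 0 and 1
```

Because the band is narrow relative to $n$, the summands are only weakly dependent, which is what the conditional-variance and Lindeberg conditions above formalize for $T_{n2}$.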