Methods for Sparse Functional Data
by
Edwin Kam Fai Lei
A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Statistical Sciences
University of Toronto

© Copyright 2014 by Edwin Kam Fai Lei
Chapter 2. Data Model for Genetically Correlated Subjects 25
between two individuals depends on G and the individuals’ relationship coefficient:
$$\operatorname{cov}\{g_{ij}(s), g_{ij'}(t)\} = \alpha_{i,jj'}\, G(s, t). \qquad (2.3)$$
The processes $e_{ij}(\cdot)$ and $e_{i'j'}(\cdot)$ are independent when $(i, j) \ne (i', j')$. Assume that the measurements are taken on a closed and bounded interval $\mathcal T$, i.e., $t \in \mathcal T$. Note that model
(2.2) is not the classical functional model that assumes that data come from independent
realizations of $X_{ij}(t) = \mu(t) + v_{ij}(t)$. In (2.2), we have decomposed the random deviation $v_{ij}(t)$ as $g_{ij}(t) + e_{ij}(t)$, where the genetic effect $g_{ij}(t)$ induces a within-family correlation.
A stochastic process with finite covariance admits a Karhunen–Loève expansion, and its covariance function admits a spectral basis expansion (Loève, 1978; Adler and Taylor, 2007). The key proposal is to exploit such expansions for both genetic and environmental
processes, whilst maintaining the dependence structure of related individuals. For the
genetic process $g_{ij}$, we have, for $s, t \in \mathcal T$,
$$g_{ij}(t) = \sum_{l=1}^{\infty} \xi_{ijl}\, \phi_l(t), \qquad G(s, t) = \sum_{l=1}^{\infty} \lambda_l\, \phi_l(s)\, \phi_l(t), \qquad (2.4)$$
where the $\phi_l$ are orthonormal eigenfunctions and $\xi_{ij1}, \xi_{ij2}, \dots$ are the FPC scores, uncorrelated random variables with zero mean and variances $\lambda_1 > \lambda_2 > \cdots$ satisfying $\sum_{l=1}^{\infty} \lambda_l < \infty$. Based on the underlying genetic model in equation (2.3), we can deduce that the covariance between $\xi_{ijl}$ and $\xi_{i'j'l'}$ is $\lambda_l\, \alpha_{i,jj'}$ for $i = i'$ and $l = l'$, and zero otherwise. This genetic association is the key to consistent parameter estimation, as it enables us to borrow information across related individuals.
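The deduction is immediate from matching (2.3) with the expansion (2.4), using the orthonormality of the $\phi_l$:
$$\alpha_{i,jj'}\, G(s,t) = \operatorname{cov}\{g_{ij}(s), g_{ij'}(t)\} = \sum_{l=1}^{\infty} \sum_{l'=1}^{\infty} \operatorname{cov}(\xi_{ijl}, \xi_{ij'l'})\, \phi_l(s)\, \phi_{l'}(t),$$
$$\operatorname{cov}(\xi_{ijl}, \xi_{ij'l'}) = \int_{\mathcal T} \int_{\mathcal T} \alpha_{i,jj'}\, G(s,t)\, \phi_l(s)\, \phi_{l'}(t)\, ds\, dt = \alpha_{i,jj'}\, \lambda_l\, \delta_{ll'},$$
where $\delta_{ll'}$ is the Kronecker delta: within family $i$, the $l$th scores of relatives $j$ and $j'$ have covariance $\alpha_{i,jj'} \lambda_l$, and scores of different components are uncorrelated.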
expansion in the context of selection and genetics were first described in Kirkpatrick
and Heckman (1989). Similar expansions hold for the environmental process $e_{ij}$, with orthonormal eigenfunctions $\{\psi_m\}_{m \ge 1}$ and nonincreasing eigenvalues $\{\rho_m\}_{m \ge 1}$; i.e., for
$s, t \in \mathcal T$,
$$e_{ij}(t) = \sum_{m=1}^{\infty} \zeta_{ijm}\, \psi_m(t), \qquad E(s, t) = \sum_{m=1}^{\infty} \rho_m\, \psi_m(s)\, \psi_m(t), \qquad (2.5)$$
where the $\zeta_{ijm}$ are uncorrelated FPC scores of $e_{ij}$ with zero mean and finite variances $\rho_m$. Since the environmental processes of distinct individuals are independent, the covariance between $\zeta_{ijm}$ and $\zeta_{i'j'm'}$ is zero whenever $(i, j) \ne (i', j')$.
Figure 2.2: Estimated mean function (dark) with observed trajectories (light) for the beef cattle data.
We are primarily interested in predicting the growth of beef cattle from sparsely ob-
served measurements. It is thus informative to assess the proposed method by comparing
it with the PACE method that treats all individuals independently, i.e., that does not
take familial genetic correlation into account. We calculate the leave-one-family-out
cross-validation error $\sum_i \sum_j \sum_k \{U_{ijk} - \hat X_{ij}^{-i}(T_{ijk})\}^2$, where $\hat X_{ij}^{-i}$ is the predicted phenotype of the $j$th cow in the $i$th family.
Figure 2.3: Non-negative definite estimates of the genetic (panel a) and environmental (panel b) covariance functions for the beef cattle data, as functions of age (days).
Figure 2.4: Shown are the first (solid), second (dashed), third (dash-dot), and fourth (dotted) eigenfunctions against age (days). Left: the first three eigenfunctions of the genetic process, accounting for 98% of the genetic variance. Right: the first four eigenfunctions of the environmental process, explaining 98.3% of the environmental variance.
Specifically, the model components are estimated based on data excluding family $i$ using the method described in Section 2.3.1. Then the FPC scores $\hat\xi_{ijl}^{-i}$ and $\hat\zeta_{ijm}^{-i}$ are obtained by substituting these leave-one-family-out estimates, $\hat\mu^{-i}$, $\hat\lambda_l^{-i}$, $\hat\rho_m^{-i}$, $\hat\phi_l^{-i}$, $\hat\psi_m^{-i}$, $\hat\Sigma_{i,jj'}^{-i}$, into (2.15) and (2.16), leading to $\hat X_{ij}^{-i}$. We use $K_g^{-i}$ and $K_e^{-i}$ leading eigenfunctions, chosen to explain 98% of, respectively, the genetic
and the environmental functional variation in the data. The reconstruction using the
PACE method is obtained in a similar manner. See Yao et al. (2005a) for details. Not
surprisingly, the proposed FACE method considerably improves upon the PACE method
by around 18%. Shown in Figure 2.5 are the cross-validated trajectory estimates for offspring of two of the fifteen families using the FACE and PACE methods. We observe that FACE offers improved predictions for these eight cows.
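As a computational note, the leave-one-family-out criterion is straightforward to evaluate once refitting without a given family is available. A minimal sketch in Python, where `predict_without_family` is a hypothetical refitting routine standing in for the estimation procedure of Section 2.3.1:

    import numpy as np

    def lofo_cv_error(U, T, predict_without_family):
        # U[i][j], T[i][j]: measurement and time arrays for cow j of family i.
        # Accumulates sum_i sum_j sum_k {U_ijk - Xhat^{-i}_ij(T_ijk)}^2.
        err = 0.0
        for i in range(len(U)):                 # leave out family i
            Xhat = predict_without_family(i)    # hypothetical: refit without
                                                # family i; (j, t) -> prediction
            for j in range(len(U[i])):
                err += np.sum((U[i][j] - Xhat(j, T[i][j])) ** 2)
        return err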
Figure 2.5: Estimated trajectories by leave-one-family-out cross-validation (CV) for two families of cows obtained using the FACE method (solid) and the PACE method (dashed), where the first row presents two half-siblings from one family and the bottom three rows present six half-siblings from another family. The legend shows the relative CV error of each cow, $\sum_{k=1}^{N_{ij}} \{U_{ijk} - \hat X_{ij}^{-i}(T_{ijk})\}^2 / U_{ijk}^2$, obtained from the two methods, where $\hat X_{ij}^{-i}$ is as described above.
2.5 Simulated Examples
To further illustrate the performance of the proposed method, we carry out two simulation
studies. For Simulation I, we closely mimic the cow data, using the same design, e.g.,
the same family sizes and times of weighings. The underlying model is (2.7) with Kg
terms for the genetic component and Ke terms for the environmental component. The
environmental covariance is derived from the first four estimated eigenfunctions, i.e.,
Ke = 4. In view of the importance of the genetic component, we examine three values of
Kg: Kg = 1, 2, 3, and we use the corresponding genetic eigenfunctions estimated from the
data. We use the half-sibling relationship coefficient αi,jj′ = 1/4 for all i, j and j′ 6= j.
The genetic and environmental FPC scores ξijl and ζijm and the measurement errors εijk
are independently generated from normal distributions, respectively, using the estimated
eigenvalues and error variance from the data. To focus our attention on the covariances
and FPCs, we set the mean function µ to 0 in the data generation but still treat it
as unknown in our analysis. For each underlying model, we generate 100 Monte Carlo
samples, and produce two versions of $\hat X_{ij}$: the FACE estimate that respects the familial genetic relationship, and the PACE estimate that ignores familial dependence. To select
Kg and Ke, we again use a 98% threshold for the fraction of variance explained. Within
each sample and for each estimation method, we calculate the integrated squared error
(ISE) for the $j$th individual in the $i$th family, $\mathrm{ISE}_{ij} = \int_{\mathcal T} \{\hat X_{ij}(t) - X_{ij}(t)\}^2\, dt$, and the overall ISE is defined as $\mathrm{ISE} = \sum_{i,j} \mathrm{ISE}_{ij}$. Improvements of the proposed FACE method
upon the PACE method are summarized in Table 2.1, which indicates a substantial
improvement of 21% to 25%.
In Simulation II, we again follow model (2.7), but with $\mu(t) = t + \sin(2\pi t)$, $\phi_1(t) = \psi_1(t) = -\cos(2\pi t/10)/\sqrt{5}$ and $\phi_2(t) = \psi_2(t) = \sin(2\pi t/10)/\sqrt{5}$, and corresponding eigenvalues $\lambda_1 = 10$, $\lambda_2 = 5$ and $\rho_1 = 100$, $\rho_2 = 10$. The genetic and environmental FPC
scores are generated from normal distributions, and the measurement error εijk is from
N(0, 0.01). We still generate data for 15 families, but the number of siblings within
family is chosen uniformly from {2, . . . , 6} and the number of observations per subject is
chosen uniformly from {5, . . . , 20}. The observation times are uniformly distributed on
[0, 10]. With 100 Monte Carlo samples, the ISE based on the FACE method incorporating
genetic correlation outperformed the PACE method by 30% for the case of half-sibling families with $\alpha_{i,jj'} = 1/4$ for $j \ne j'$, and by 25% for the case of full-sibling families with $\alpha_{i,jj'} = 1/2$ for $j \ne j'$. See Table 2.1.
Table 2.1: ISE improvement (%) of the proposed FACE method upon PACE, where Simulation I uses data-based models with different values of $(K_g, K_e)$ and Simulation II examines half-sibling ($\alpha = 0.25$) and full-sibling ($\alpha = 0.5$) family relationships.
Simulation I
(Kg, Ke)   Mean (SE)    1st Quartile   Median   3rd Quartile
(1, 4)     21.4 (1.5)   15.1           23.5     28.7
(2, 4)     25.1 (1.6)   12.9           28.9     36.3
(3, 4)     21.9 (1.6)   10.9           24.7     32.6

Simulation II
α          Mean (SE)    1st Quartile   Median   3rd Quartile
It is worth mentioning that, since the covariance operator $\Sigma$ is Hilbert–Schmidt, its inverse $\Sigma^{-1}$ is not well defined, so the EDR directions may not even exist in $H$. Following He et al. (2003) for functional canonical correlation, let $R_\Sigma$ denote the range of $\Sigma$ and $R_\Sigma^{-1} = \{b \in H : \sum_{j=1}^{\infty} \alpha_j^{-1} \langle b, \phi_j \rangle^2 < \infty,\ b \in R_\Sigma\}$. Restricted to $R_\Sigma^{-1}$, $\Sigma$ is a one-to-one operator from $R_\Sigma^{-1} \subset H$ onto $R_\Sigma$ whose inverse is defined by $\Sigma^{-1} = \sum_{j=1}^{\infty} \alpha_j^{-1}\, \phi_j \otimes \phi_j$.
Let $\xi_j = \langle X, \phi_j \rangle$ denote the $j$th principal component (or generalized Fourier coefficient) of $X$, and assume that
Assumption 3.3. $\sum_{j=1}^{\infty} \sum_{l=1}^{\infty} \alpha_j^{-2} \alpha_l^{-1}\, E^2\big\{E[\xi_j 1(Y \le \tilde Y) \mid \tilde Y]\, E[\xi_l 1(Y \le \tilde Y) \mid \tilde Y]\big\} < \infty$, where $\tilde Y$ denotes an independent copy of $Y$.
Proposition 3.1. Under Assumptions 3.1–3.3, the eigenspace associated with the $K$ nonnull eigenvalues of $\Sigma^{-1}\Lambda$ is well defined in $H$.
This is a direct analogue of Theorem 4.8 in He et al. (2003) and Theorem 2.1 in Ferré and Yao (2005); the proof is thus omitted for conciseness.
3.2.2 Functional Cumulative Slicing for Sparse Functional Data
For the data $\{(X_i, Y_i) : 1 \le i \le n\}$ independently and identically distributed (i.i.d.) as $(X, Y)$, the predictor trajectories $X_i$ are observed intermittently, contaminated with noise, and collected in the form of repeated measurements $\{(T_{ij}, U_{ij}) : 1 \le i \le n,\ 1 \le j \le N_i\}$, where $U_{ij} = X_i(T_{ij}) + \varepsilon_{ij}$ with i.i.d. measurement errors $\varepsilon_{ij}$ that have zero mean, constant variance $\sigma_x^2$, and are independent of all other random variables. When only
recover Xi is infeasible and one must adopt the strategy of pooling together data from
across subjects for consistent estimation.
To estimate the FCS kernel Λ defined in (3.2), the key quantity is the unconditional
mean $m(t, y) = E[X(t)\, 1(Y \le y)]$. For sparsely and irregularly observed $X_i$, the cross-
sectional estimation used in multivariate cumulative slicing is no longer applicable. To
maximize the use of available data, we propose to pool together the repeated measure-
ments across subjects via a scatterplot smoother, which works seamlessly in conjunction
with the strategy of cumulative slicing. For specificity, we use a local linear estimator
$\hat m(t, y) = \hat a_0$ (Fan and Gijbels, 1996), obtained by minimizing
$$\sum_{i=1}^{n} \sum_{j=1}^{N_i} \big\{ U_{ij} 1(Y_i \le y) - a_0 - a_1 (T_{ij} - t) \big\}^2 K_1\!\left( \frac{T_{ij} - t}{h_1} \right) \qquad (3.3)$$
over $(a_0, a_1)$,
where K1 is a non-negative and symmetric univariate kernel density and h1 = h1(n)
is the bandwidth to control the amount of smoothing. Here we follow the suggestion of
ignoring the dependency among the data from the same individual (Lin and Carroll, 2000,
for smoothing correlated data), and use leave-one-curve-out cross-validation to select h1
(Rice and Silverman, 1991). Then an estimator of the FCS kernel function Λ(s, t) is given
by its sample moment,
$$\hat\Lambda(s, t) = \frac{1}{n} \sum_{i=1}^{n} \hat m(s, Y_i)\, \hat m(t, Y_i)\, w(Y_i). \qquad (3.4)$$
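To make the pooled smoothing steps (3.3)–(3.4) concrete, here is a minimal sketch in Python. It assumes the sparse data are held as lists of per-subject arrays (`T_list`, `U_list`) with scalar responses `Y`; the function names, the Epanechnikov kernel, and the plain weighted least-squares solver are illustrative choices, not the thesis implementation.

    import numpy as np

    def local_linear_m(t, y, T_list, U_list, Y, h1):
        # Pool (T_ij, U_ij 1(Y_i <= y)) across all subjects, as in (3.3).
        T = np.concatenate(T_list)
        V = np.concatenate([U * (yi <= y) for U, yi in zip(U_list, Y)])
        u = (T - t) / h1
        w = np.maximum(0.75 * (1.0 - u**2), 0.0)        # Epanechnikov kernel K1
        D = np.column_stack([np.ones_like(T), T - t])    # design for (a0, a1)
        sw = np.sqrt(w)
        coef, *_ = np.linalg.lstsq(D * sw[:, None], V * sw, rcond=None)
        return coef[0]                                   # a0-hat = m-hat(t, y)

    def Lambda_hat(grid, T_list, U_list, Y, h1, w_fn=lambda y: 1.0):
        # FCS kernel (3.4): average outer product of smoothed slices m-hat(., Y_i).
        n = len(Y)
        M = np.array([[local_linear_m(t, yi, T_list, U_list, Y, h1)
                       for t in grid] for yi in Y])      # n x len(grid)
        wts = np.array([w_fn(yi) for yi in Y])
        return (M * wts[:, None]).T @ M / n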
For the covariance operator $\Sigma$, following Yao et al. (2005a), denote the observed raw covariances by $G_i(T_{ij}, T_{il}) = U_{ij} U_{il}$ and note that $E[G_i(T_{ij}, T_{il}) \mid T_{ij}, T_{il}] = \operatorname{cov}\{X(T_{ij}), X(T_{il})\} + \sigma^2 \delta_{jl}$, where $\delta_{jl}$ is 1 if $j = l$ and 0 otherwise. This suggests that the diagonal of the raw covariances should be removed, and minimizing
$$\min_{(b_0, b_1, b_2)} \sum_{i=1}^{n} \sum_{1 \le j \ne l \le N_i} \big\{ G_i(T_{ij}, T_{il}) - b_0 - b_1 (T_{ij} - s) - b_2 (T_{il} - t) \big\}^2 K_2\!\left( \frac{T_{ij} - s}{h_2}, \frac{T_{il} - t}{h_2} \right) \qquad (3.5)$$
yields $\hat\Sigma(s, t) = \hat b_0$, where $K_2$ is a non-negative bivariate kernel density and $h_2 = h_2(n)$ is the bandwidth chosen by leave-one-curve-out cross-validation; see Yao et al. (2005a) for details on the implementation.
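A corresponding sketch of the surface smoother (3.5), under the same assumptions as above (product Epanechnikov kernel, plain weighted least squares; names illustrative):

    import numpy as np

    def Sigma_hat(s, t, T_list, U_list, h2):
        # Local linear surface smoother (3.5) of the off-diagonal raw
        # covariances G_i(T_ij, T_il) = U_ij U_il; the diagonal is dropped
        # because it is contaminated by the measurement-error variance.
        S1, S2, G = [], [], []
        for T, U in zip(T_list, U_list):
            for j in range(len(T)):
                for l in range(len(T)):
                    if j != l:
                        S1.append(T[j]); S2.append(T[l]); G.append(U[j] * U[l])
        S1, S2, G = np.asarray(S1), np.asarray(S2), np.asarray(G)
        u1, u2 = (S1 - s) / h2, (S2 - t) / h2
        w = (np.maximum(0.75 * (1 - u1**2), 0.0)        # product Epanechnikov K2
             * np.maximum(0.75 * (1 - u2**2), 0.0))
        D = np.column_stack([np.ones_like(S1), S1 - s, S2 - t])
        sw = np.sqrt(w)
        coef, *_ = np.linalg.lstsq(D * sw[:, None], G * sw, rcond=None)
        return coef[0]                                  # b0-hat = Sigma-hat(s, t)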
Since the inverse operator $\Sigma^{-1}$ is unbounded, we regularize it by projection onto a truncated subspace. To be precise, let $s_n$ be a possibly divergent sequence and let $\Pi_{s_n} = \sum_{j=1}^{s_n} \phi_j \otimes \phi_j$ (resp. $\hat\Pi_{s_n} = \sum_{j=1}^{s_n} \hat\phi_j \otimes \hat\phi_j$) denote the orthogonal projector onto the eigensubspace associated with the $s_n$ largest eigenvalues of $\Sigma$ (resp. $\hat\Sigma$). Then $\Sigma_{s_n} = \Pi_{s_n} \Sigma \Pi_{s_n}$ (resp. $\hat\Sigma_{s_n} = \hat\Pi_{s_n} \hat\Sigma \hat\Pi_{s_n}$) is a sequence of finite-rank operators converging to $\Sigma$ (resp. $\hat\Sigma$) as $n \to \infty$, with bounded inverses
$$\Sigma_{s_n}^{-1} = \sum_{j=1}^{s_n} \alpha_j^{-1}\, \phi_j \otimes \phi_j, \qquad \hat\Sigma_{s_n}^{-1} = \sum_{j=1}^{s_n} \hat\alpha_j^{-1}\, \hat\phi_j \otimes \hat\phi_j, \qquad (3.6)$$
respectively. Finally, we obtain the eigenfunctions associated with the $K$ largest nonzero eigenvalues of $\hat\Sigma_{s_n}^{-1} \hat\Lambda$ as the estimates of the EDR directions, $\{\hat\beta_{k,s_n}\}_{k=1,\dots,K}$.
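With $\hat\Lambda$ and $\hat\Sigma$ evaluated on a common equispaced grid, the truncation (3.6) and the final eigenproblem reduce to matrix computations. A sketch, assuming simple quadrature with spacing `dt` (the grid-based discretization is an illustrative choice):

    import numpy as np

    def edr_directions(Sigma, Lam, grid, s_n, K):
        # Sigma, Lam: estimated covariance and FCS kernels on grid x grid.
        dt = grid[1] - grid[0]
        # Spectral decomposition of the discretized covariance operator.
        vals, vecs = np.linalg.eigh(Sigma * dt)
        order = np.argsort(vals)[::-1][:s_n]        # s_n largest eigenvalues
        alpha = vals[order]
        phi = vecs[:, order] / np.sqrt(dt)          # eigenfunctions, int phi^2 = 1
        # Kernel of the truncated inverse: sum_j alpha_j^{-1} phi_j(s) phi_j(t).
        Sinv = (phi / alpha) @ phi.T
        # Discretized composition Sigma_{s_n}^{-1} Lambda; operators compose
        # through the quadrature weight, hence the dt factors.
        M = Sinv @ Lam * dt**2
        lam, V = np.linalg.eig(M)
        top = np.argsort(np.abs(lam))[::-1][:K]     # K largest nonzero eigenvalues
        return np.real(V[:, top]) / np.sqrt(dt)     # EDR directions on the grid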
The situation for completely observed $X_i$ is similar to the multivariate case and considerably simpler. The quantities $m(t, y)$ and $\Sigma(s, t)$ are easily estimated by their respective sample moments $\hat m(t, y) = n^{-1} \sum_{i=1}^{n} X_i(t) 1(Y_i \le y)$ and $\hat\Sigma(s, t) = n^{-1} \sum_{i=1}^{n} X_i(s) X_i(t)$,
while the estimate of Λ remains the same as (3.4). For densely observed Xi, individual
smoothing can be used as a preprocessing step to recover smooth trajectories and the
estimation error introduced in this step can be shown to be asymptotically negligible un-
der certain design conditions, i.e., it is equivalent to the ideal situation of the completely
observed Xi’s (Hall et al., 2006).
Remarks. (i) For small values of $Y_i$, $\hat m(\cdot, Y_i)$ obtained from (3.3) may be unstable owing to the smaller number of pooled observations in the slice. A suitable weight function $w$ may be used to refine the estimator $\hat\Lambda(s, t)$. In our numerical studies, the naive choice $w \equiv 1$ performed fairly well compared with other choices. Analogous to the multivariate case, choosing an optimal $w$ remains an open question. (ii) Ferré and Yao (2005) avoided inverting $\Sigma$ with the claim that, for a finite-rank operator $\Lambda$, $\operatorname{range}(\Lambda^{-1}\Sigma) = \operatorname{range}(\Sigma^{-1}\Lambda)$; however, Cook et al. (2010) showed that this requires more stringent conditions that are not easily fulfilled. (iii) Regularization can also be tackled with a ridge penalty $(\hat\Sigma + \rho I)^{-1}$, where $\rho > 0$ and $I$ is the identity operator. However, numerical results from this regularization scheme were observed to be inferior to those from spectral truncation,
and thus is not pursued further. (iv) For selecting the structural dimension $K$, the only relevant work to date is Li and Hsing (2010), where sequential $\chi^2$ tests are developed to determine $K$ in the context of FSIR for completely observed functional data. How to extend such tests (if feasible at all) to sparse functional data is a substantive problem that deserves further exploration. Nevertheless, since prediction is the primary concern in many applications, both $K$ and $s_n$ can easily be chosen by minimizing the prediction error when a sensible model is in place. In the simulated and real examples we adopt this principle, which performs well empirically.
3.3 Asymptotic Properties
In this section we present asymptotic properties of the FCS kernel operator and the EDR
directions for sparsely observed functional data. Here the number of measurements Ni
and the observation times Tij are considered random to reflect a sparse and irregular
design. Specifically, we assume that
Assumption 3.4. The $N_i$ are random variables with $N_i \overset{\text{i.i.d.}}{\sim} N$, where $N$ is a bounded positive discrete random variable with $P\{N \ge 2\} > 0$, and $(\{T_{ij}, j \in J_i\}, \{U_{ij}, j \in J_i\})$ are independent of $N_i$ for $J_i \subseteq \{1, \dots, N_i\}$.

Writing $T_i = (T_{i1}, \dots, T_{iN_i})^\top$ and $U_i = (U_{i1}, \dots, U_{iN_i})^\top$, the data quadruplets $Z_i = \{T_i, U_i, Y_i, N_i\}$ are thus i.i.d. Note that extremely sparse designs are covered, with only
a few measurements for each subject. Other regularity conditions are standard and listed
in the Appendix, including assumptions on the smoothness of the mean and covariance
functions of X, the distributions of the observation times, the bandwidths and kernel
functions used in the smoothing steps. Denote $\|A\|_H^2 = \int_{\mathcal T} \int_{\mathcal T} A^2(s, t)\, ds\, dt$ for $A \in L^2(\mathcal T \times \mathcal T)$.
Theorem 3.2. Under Assumptions 3.1, 3.4, and 3.7–3.10 in the Appendix, we have
$$\big\|\hat\Lambda - \Lambda\big\|_H = O_p\Big(\frac{1}{\sqrt{n}\, h_1}\Big), \qquad \big\|\hat\Sigma - \Sigma\big\|_H = O_p\Big(\frac{1}{\sqrt{n}\, h_2}\Big).$$
The key result here is the $L^2$ convergence of the estimated FCS operator $\hat\Lambda$, in which we exploit the projections of nonparametric $U$-statistics coupled with an important decomposition of $\hat m(\cdot, y)$ to overcome the difficulty caused by the dependence among irregularly spaced measurements. Note that $\hat\Lambda$ is obtained by averaging the smoothers $\hat m(\cdot, Y_i)$ over $Y_i$, which is crucial to achieving the univariate convergence rate for this bivariate estimator. The convergence of the covariance operator $\hat\Sigma$ is presented for completeness, and is given in Theorem 2 of Yao and Müller (2010).
We are now ready to characterize the estimation of the central subspace $\mathcal S_{Y|X} = \operatorname{span}(\beta_1, \dots, \beta_K)$. Unlike the multivariate or finite-dimensional case, where the convergence of $\hat{\mathcal S}_{Y|X}$ follows immediately from the boundedness of $\Sigma^{-1}$, we have to approximate $\Sigma^{-1}$ with the sequence of truncated estimates $\hat\Sigma_{s_n}^{-1}$ in (3.6). Recall that we specifically regarded the index functions $\{\beta_1, \dots, \beta_K\}$ as the eigenfunctions associated with the $K$ largest eigenvalues of $\Sigma^{-1}\Lambda$ to suppress the identifiability concern. It is thus equivalent to consider $\{\hat\beta_{1,s_n}, \dots, \hat\beta_{K,s_n}\}$ in place of $\hat{\mathcal S}_{Y|X}$. For an arbitrary constant $C > 0$, we require the eigenvalues of $\Sigma$ to satisfy

Assumption 3.5. $\alpha_1 > \alpha_2 > \cdots > 0$, $E\xi_j^4 \le C \alpha_j^2$ for $j \ge 1$, and $\alpha_j - \alpha_{j+1} \ge C^{-1} j^{-a-1}$ for $j \ge 1$.
This condition on the decay rate of the eigenvalues $\alpha_j$ prevents the spacings between consecutive eigenvalues from being too small; it also implies $\alpha_j \ge C j^{-a}$ and, together with the boundedness of $\Sigma$, that $a > 1$. Expressing the index functions as $\beta_k = \sum_{j=1}^{\infty} b_{kj} \phi_j$, $k = 1, \dots, K$, we impose a decaying structure on the generalized Fourier coefficients $b_{kj} = \langle \beta_k, \phi_j \rangle$:

Assumption 3.6. $|b_{kj}| \le C j^{-b}$ for $j \ge 1$ and $1 \le k \le K$, where $a + \frac{1}{2} < b$.
This implies that the $\{\beta_k\}_{k=1,\dots,K}$ are smoother relative to $\Sigma$. Here we require a stronger condition than $a/2 + 1 < b$, assumed by Hall and Horowitz (2007) for the functional linear
model with completely observed Xi. This is not unexpected, as the index model (3.1) is
more flexible and we are dealing with sparse functional data.
Theorem 3.3. Under Assumptions 3.1–3.6 and 3.7–3.10 in the Appendix, for all $k = 1, \dots, K$, we have
$$\big\|\hat\beta_{k,s_n} - \beta_k\big\|_H = O_p\Big(\frac{s_n^{\frac{3}{2}a+1}}{\sqrt{n}\, h_1} + \frac{s_n^{(2a-b+2)_+}}{\sqrt{n}\, h_2} + \frac{1}{s_n^{\,b-a-\frac{1}{2}}}\Big),$$
where $(2a - b + 2)_+ = \max(0,\ 2a - b + 2)$.
This result explicitly associates the convergence of $\hat\beta_{k,s_n}$ with the regularizing truncation size $s_n$ and the decay rates of $\alpha_j$ and $b_{kj}$. Specifically, the first two terms are attributed to the variability of estimating $\Sigma_{s_n}^{-1}\Lambda$ by $\hat\Sigma_{s_n}^{-1}\hat\Lambda$, and the last to the approximation bias of $\Sigma_{s_n}^{-1}\Lambda$. This indicates a bias–variance tradeoff associated with the truncation size $s_n$. One may view $s_n$ as a tuning parameter that controls the resolution, or smoothness, of the covariance estimation. Furthermore, the first term of the variance is due to $\|\hat\Sigma_{s_n}^{-1}\hat\Lambda\hat\Sigma_{s_n}^{-1/2} - \Sigma_{s_n}^{-1}\Lambda\Sigma_{s_n}^{-1/2}\|$ (details in the Appendix) and becomes increasingly unstable with a larger truncation. The bias and the second part of the variance, contributed by $\|(\hat\Sigma_{s_n}^{-1} - \Sigma_{s_n}^{-1})\Lambda\Sigma_{s_n}^{-1/2}\|$, are to some extent determined by the relative smoothness of $\Sigma$ and $\beta_k$; i.e., a smoother $\beta_k$ with a larger $b$ leads to less discrepancy.
3.4 Simulations
In this section we illustrate the performance of the proposed FCS method in terms of esti-
mation and prediction. We compare the proposed FCS to (i) FSIR with 5 slices (FSIR5), (ii) FSIR with 10 slices (FSIR10), (iii) the functional index model with nonparametric link (FIND) proposed by Chen et al. (2011), and (iv) the functional linear model (FLM) as a misspecified baseline for assessing prediction. Although FCS and FSIR are “link-free” for estimating the index functions $\beta_k$, a general index model (3.1) may lead to model pre-
dictions with high variability, especially given relatively small sample sizes frequently
encountered in functional data analysis. Thus we follow Chen et al. (2011) by assuming
an additive structure on the link function $g$ in (3.1), i.e., $Y = \beta_0 + \sum_{k=1}^{K} g_k(\langle \beta_k, X \rangle) + \varepsilon$.
In each Monte Carlo run, a sample of $n = 200$ functional trajectories is generated from the process $X_i(t) = \sum_{j=1}^{50} \xi_{ij} \phi_j(t)$, where $\phi_j(t) = \sin(\pi t j/5)/\sqrt{5}$ for $j$ even and $\phi_j(t) = \cos(\pi t j/5)/\sqrt{5}$ for $j$ odd, the FPC scores $\xi_{ij}$ are i.i.d. $N(0, j^{-1.5})$, and $t \in [0, 10]$. For the setting of sparsely observed functional data, the number of observations per subject $N_i$ is chosen uniformly from $\{15, \dots, 20\}$, the observation times $T_{ij}$ are i.i.d. $U[0, 10]$, and the measurement errors $\varepsilon_{ij}$ are i.i.d. $N(0, 0.1)$. For densely observed functional data, let $T_{ij} = 0.1(j - 1)$ for $j = 1, \dots, 101$. The EDR directions are generated by $\beta_1(t) = \sum_{j=1}^{50} b_j \phi_j(t)$, where $b_j = 1$ for $j = 1, 2, 3$ and $b_j = 4(j - 2)^{-3}$ for $4 \le j \le 50$, and $\beta_2(t) = \sqrt{3/10}\,(t/5 - 1)$, which is not representable with finitely many Fourier terms.
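For concreteness, the sparse design above can be generated as follows; a minimal sketch in Python (seed and variable names are arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)
    n, J = 200, 50

    def phi(j, t):
        # Fourier basis of the simulation: sin(pi t j/5)/sqrt(5) for j even,
        # cos(pi t j/5)/sqrt(5) for j odd.
        trig = np.sin if j % 2 == 0 else np.cos
        return trig(np.pi * t * j / 5) / np.sqrt(5)

    T_list, U_list = [], []
    for i in range(n):
        Ni = rng.integers(15, 21)                    # 15..20 points per subject
        T = np.sort(rng.uniform(0, 10, Ni))          # irregular times on [0, 10]
        xi = rng.normal(0.0, np.arange(1.0, J + 1) ** -0.75)  # var(xi_j) = j^{-1.5}
        X = sum(xi[j - 1] * phi(j, T) for j in range(1, J + 1))
        T_list.append(T)
        U_list.append(X + rng.normal(0.0, np.sqrt(0.1), Ni))  # error variance 0.1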
Since neither FSIR nor FIND is directly applicable to sparse functional data for estimating $\beta_k$, we adopt a two-stage method as suggested by Ferré and Yao (2003) and Chen et al. (2011): first we use the PACE method (Yao et al., 2005a), a functional principal component approach specifically designed for sparse functional data and publicly available at http://www.stat.ucdavis.edu/PACE, to recover $X_i$ with very little dimension reduction (using a fraction of variance explained of 99%), denoted by $\hat X_i$; then we apply FSIR or FIND to obtain $\hat\beta_{k,s_n}$. The following single and multiple index models are considered:
Model I: $Y = \sin\big(\pi \langle \beta_1, X \rangle / 4\big) + \varepsilon$,
Model II: $Y = \arctan\big(\pi \langle \beta_1, X \rangle / 2\big) + \varepsilon$,
Model III: $Y = \sin\big(\pi \langle \beta_1, X \rangle / 3\big) + \exp\big(\langle \beta_2, X \rangle / 3\big) + \varepsilon$,
Model IV: $Y = \arctan\big(\pi \langle \beta_1, X \rangle\big) + \sin\big(\pi \langle \beta_2, X \rangle / 6\big)/2 + \varepsilon$,
where the regression error ε is i.i.d. N(0, 1) for all models. Due to the nonidentifiability
of the $\beta_k$'s, we examine the projection operator of the EDR space, i.e., $P = \sum_{k=1}^{K_0} \beta_k \otimes \beta_k$ with $K_0$ denoting the true structural dimension. To assess the estimation of the EDR space, we calculate the average of the singular values of $(\hat P_{K,s_n} - P)$ as the model error, i.e., its operator norm $\|\hat P_{K,s_n} - P\|$ normalized by the number of singular values, with $\hat P_{K,s_n} = \sum_{k=1}^{K} \hat\beta_{k,s_n} \otimes \hat\beta_{k,s_n}$. We compute the average model error and its standard error
over 100 Monte Carlo repetitions, shown in Table 3.1. The structural dimension $K$ and
the truncation parameter sn are chosen by minimizing the average model error. One can
see that, for sparse functional data, the proposed FCS outperforms the other methods for
all models, while FSIR and FIND may suffer from the two-stage approach for estimating
index functions. As expected, the gains in the setting of dense functional data are less
noticeable.
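The model-error metric can be computed directly from the discretized projection kernels; a sketch, where averaging over the leading $2K$ singular values (the maximum possible rank of $\hat P_{K,s_n} - P$) is our reading of “normalized by the number of singular values”:

    import numpy as np

    def model_error(B_hat, B_true, dt, K):
        # Columns of B_hat / B_true are estimated / true directions on a grid;
        # the kernel of a projection sum_k beta_k (x) beta_k is B @ B.T.
        D = B_hat @ B_hat.T - B_true @ B_true.T
        sv = np.linalg.svd(D * dt, compute_uv=False)    # operator singular values
        return sv[:2 * K].mean()                        # rank(D) <= 2K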
To assess model prediction, we use a backfitting algorithm (Hastie and Tibshirani, 1990) to nonparametrically estimate the link functions $g_k$ by fitting $Y_i = \beta_0 + \sum_{k=1}^{K} g_k(Z_{ik}) + \varepsilon_i$, where $Z_{ik} = \langle \hat\beta_{k,s_n}, X_i \rangle$. For dense functional data, $Z_{ik} = \langle \hat\beta_{k,s_n}, X_i \rangle$ is given by an integral approximation. When the $X_i$ are sparse, we substitute $X_i$ with its PACE estimate $\hat X_i$. Unlike FSIR and FCS, FIND jointly estimates the index and link functions.
To calculate the prediction error, we additionally generate a validation sample of size 500 in each run, and calculate the Monte Carlo average of the mean squared prediction error $\mathrm{MSPE} = 500^{-1} \sum_{i=1}^{500} (Y_i^* - \hat Y_i^*)^2$ over different values of $K$ and $s_n$, where $\hat Y_i^* = \hat\beta_0 + \sum_{k=1}^{K} \hat g_k(Z_{ik}^*)$ and $Z_{ik}^* = \langle \hat\beta_{k,s_n}, X_i^* \rangle$, with $X_i^*$ being the underlying trajectories in the testing sample.
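A minimal backfitting sketch for the additive fit, with a Nadaraya–Watson smoother standing in for whichever scatterplot smoother one prefers (bandwidth and iteration count are illustrative):

    import numpy as np

    def nw_smooth(z, r, z0, h):
        # Nadaraya-Watson smoother of responses r at query points z0.
        w = np.exp(-0.5 * ((z0[:, None] - z[None, :]) / h) ** 2)
        return (w @ r) / w.sum(axis=1)

    def backfit(Z, Y, h=0.5, iters=20):
        # Fit Y = b0 + sum_k g_k(Z[:, k]) + eps by cycling over components.
        n, K = Z.shape
        b0 = Y.mean()
        G = np.zeros((n, K))                        # g_k evaluated at the sample
        for _ in range(iters):
            for k in range(K):
                r = Y - b0 - G.sum(axis=1) + G[:, k]    # partial residuals
                G[:, k] = nw_smooth(Z[:, k], r, Z[:, k], h)
                G[:, k] -= G[:, k].mean()           # center for identifiability
        return b0, G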
We report the minimized average MSPE and its standard error with the corresponding choice of $\{K, s_n\}$, shown in Table 3.2. We see that FCS substantially improves prediction for sparse functional data in all models. In the dense-data setting, the predictions from FCS and FSIR are virtually indistinguishable, while FIND seems suboptimal and the misspecified FLM fails as expected. The structural dimension $K$ is an inherent parameter of the underlying model, while the truncation $s_n$ plays the role of a tuning pa-
Table 3.1: Shown are the model errors in the form of the operator norm $\|\hat P_{K,s_n} - P\|$ with standard errors (in parentheses), and the optimal $K$ and $s_n$ that minimize the average model error over 100 Monte Carlo repetitions.
Design  Model  FCS            FSIR5          FSIR10         FIND
Sparse  I      .476 (.016)    .540 (.016)    .555 (.018)    .492 (.014)
               K = 1, sn = 3  K = 1, sn = 3  K = 1, sn = 3  K = 1, sn = 3
        II     .415 (.013)    .508 (.014)    .511 (.016)    .424 (.010)
               K = 1, sn = 3  K = 1, sn = 3  K = 1, sn = 3  K = 1, sn = 3
        III    K = 2, sn = 3  K = 2, sn = 3  K = 2, sn = 3  K = 2, sn = 3
        IV     .539 (.005)    .535 (.005)    .543 (.005)    .537 (.008)
               K = 2, sn = 3  K = 2, sn = 3  K = 2, sn = 3  K = 2, sn = 3
rameter that might vary with the purpose of estimation or prediction. In our simulations, the structural dimension $K$ is correctly specified by both criteria, the average MSPE and the model error, in all cases. Since the model error is not obtainable in practice, we suggest approximating the prediction error with a suitable cross-validation procedure for choosing $K$ together with $s_n$.
3.5 Data Applications
3.5.1 eBay auction data
In this application, we study the relationship between the winning bid prices of $n = 156$ Palm M515 PDA devices auctioned on eBay between March and May 2003 and the bidding histories over the 7-day duration of each auction. The observation from a bidding
Table 3.2: Shown are the average MSPE with its standard error (in parentheses), and the optimal $K$ and $s_n$ that minimize the average MSPE over 100 Monte Carlo repetitions.

FCS             FSIR5           FSIR10          FIND            FLM
K = 2, sn = 3   K = 2, sn = 3   K = 2, sn = 3   K = 2, sn = 3   sn = 3
history represents a “live bid”, the actual price a winning bidder would pay for the device, known as the “willingness-to-pay” price. Further details on the bidding mechanism can be found in Liu and Müller (2009). We adopt the view that the bidding histories are i.i.d.
realizations of a smooth underlying price process. Due to the nature of online auctions, the $j$th bid of the $i$th auction usually arrives irregularly at time $T_{ij}$, and the numbers of bids $N_i$ vary widely, from 9 to 52 for this dataset. As is common in modeling prices, we take a log-transform of the bid prices. Figure 3.1 shows a sample of 9 randomly selected log
bid histories over the 7-day duration of the auction. Typically, the bid histories are very
sparse until the final hours of each auction when “bid sniping” occurs. At this point,
“snipers” place their bids at the last possible moments in an attempt to deny competing
bidders the chance of placing a higher bid.
Since the main interest is the predictive power of price histories up to time T for
the winning bid prices, we consider the regression of the winning price on the history
Figure 3.1: Irregularly and sparsely observed log bid price trajectories of 9 randomly selected auctions over the 7-day duration.
trajectory $X(t)$, $t \in [0, T]$, and set $T = 4.5, 4.6, 4.7, \dots, 6.8$ (in days). For each analysis
on the domain [0, T ], we select the optimal structural dimension K and the truncation
parameter sn by minimizing the average 5-fold cross-validated prediction error over 20
random partitions. Shown in Figure 3.2 are the minimized average cross-validated pre-
diction errors, compared with FSIR and FLM, where FSIR is obtained using 5 slices
(superior to FSIR using 10 slices). We do not show the results from FIND, which have considerably larger errors. The results are not surprising: the prediction error decreases as the bidding histories encompass more data and get closer to the end of the auction. The proposed FCS clearly outperforms the other methods, and FLM yields the least favorable prediction, until
the last moments of the auction when any sensible method could achieve high predictive
power.
As an illustration, we present the analysis for the case of T = 6. The estimated model
components using FCS are shown in Figure 3.3 with the parameters chosen as K = 2 and
sn = 2. The first index function assigns contrasting weights to bids made before and after
the first day, indicating that some bidders tend to underbid at the beginning only to quickly
overbid relative to the mean. The second index represents a cautious type of bidding
Figure 3.2: Average 5-fold cross-validated prediction errors over 20 random partitions across various time domains $[0, T]$ for the sparse eBay auction data, comparing FCS, FLM, and FSIR.
behavior, entering at a lower price and slowly increasing towards the average level. These
features contribute the most towards the prediction of the winning bid prices. Also seen
are the slightly nonlinear patterns in the estimated additive link functions.
3.5.2 Spectrometric data
In this example, we study the spectrometric data consisting of $n = 215$ pieces of finely chopped meat, publicly available at http://lib.stat.cmu.edu/datasets/tecator. For each meat sample, the moisture content and the absorbance spectrum, measured at 100 equally spaced wavelengths between 850 nm and 1050 nm, were recorded using a Tecator Infratec
Food and Feed Analyzer. Each absorbance spectrum is treated as an i.i.d. realization of
the absorbance process. Thus, the 215 absorbance trajectories, shown in Figure 3.4, can
be regarded as densely observed functional data.
In Table 3.3, we present the minimized average 5-fold cross-validated prediction error
over 20 random partitions for different methods, together with the selected structural
dimensions and the truncation sizes. Similar to our simulation study for dense functional
Figure 3.3: Estimated model components for the sparse eBay auction data using FCS with $K = 2$ and $s_n = 2$. The first and second rows of plots show the estimated index functions (i.e., the EDR directions) and the additive link functions, respectively.
Figure 3.4: Absorbance trajectories of 215 meat samples measured at 100 equally spaced wavelengths between 850 nm and 1050 nm.
Table 3.3: Average 5-fold cross-validated prediction error over 20 Monte Carlo runs with selected $K$ and $s_n$, for the dense spectrometric data.

FCS             FSIR5           FSIR10          FIND            FLM
.0093 (.0001)   .0096 (.0001)   .0095 (.0001)   .0222 (.0016)   .0128 (.0002)
K = 2, sn = 5   K = 2, sn = 5   K = 2, sn = 5   K = 2, sn = 5   sn = 8
data, the results for FCS and FSIR are virtually indistinguishable, and both improve
significantly upon FIND and FLM. The estimated EDR directions and additive link
functions are displayed in Figure 3.5 with K = 2 and sn = 5, where the link functions
appear to be nearly linear. The first index function emphasizes the rising trend above
the mean at wavelengths around 930 nm, and the second index picks up the contrast
between wavelengths 930 nm and 950 nm. Such EDR directions suggest that the rise
and fall around wavelengths 930 nm and 950 nm in the spectrometric trajectories, seen
in Figure 3.4, are important features for predicting moisture content.
Figure 3.5: Estimated model components for the spectrometric data using FCS with $(K, s_n) = (2, 5)$. The first and second rows of plots show the estimated EDR directions and additive link functions, respectively.
3.6 Concluding Remarks
In this chapter we introduce a new method of effective dimension reduction for sparse
functional data, where one observes only a few noisy and irregular measurements for
some or all of the subjects. The proposed FCS estimation is link-free and targets at the
EDR space directly by borrowing information across the entire sample. Theoretical anal-
ysis reveals the bias and variance tradeoff associated with the truncation parameter, and
the impact due to decaying structures of the predictor process and the EDR directions.
Numerical results from simulated and real examples are shown superior to existing meth-
ods for sparse functional data. It is worth mentioning that the proposed method in fact
opens a door to more sophisticated dimension reduction approaches for sparse functional
data. Following the strategy of “pooling information together”, we may further extend
the idea of functional cumulative slicing to variance estimation or direction regression, by
analogy to the multivariate case (Zhu et al., 2010). The usefulness and justifications of
these extensions deserve further study and shall be explored in our future investigation.
3.A Regularity Conditions
Without loss of generality, we assume that the known weight function $w(\cdot) \equiv 1$. Denote $\mathcal T = [a, b]$ and $\mathcal T^\delta = [a - \delta, b + \delta]$ for some $\delta > 0$, a single observation time by $T$, and a pair by $(T_1, T_2)^\top$, whose densities are $f_1(t)$ and $f_2(s, t)$, respectively. Recall that the unconditional mean function is $m(t, y) = E[X(t) 1(Y \le y)]$. The regularity conditions for the underlying moment functions and design densities are as follows, where $\ell_1, \ell_2$ are non-negative integers.
Assumption 3.7. $\frac{\partial^2}{\partial s^{\ell_1} \partial t^{\ell_2}} \Sigma$ is continuous on $\mathcal T^\delta \times \mathcal T^\delta$ for $\ell_1 + \ell_2 = 2$, and $\partial^2 m / \partial t^2$ is bounded and continuous with respect to $t \in \mathcal T$ for all $y \in \mathbb R$.
Assumption 3.8. $f_1^{(1)}(t)$ is continuous on $\mathcal T^\delta$ with $f_1(t) > 0$, and $\frac{\partial}{\partial s^{\ell_1} \partial t^{\ell_2}} f_2$ is continuous on $\mathcal T^\delta \times \mathcal T^\delta$ for $\ell_1 + \ell_2 = 1$ with $f_2(s, t) > 0$.
Assumption 3.7 is guaranteed by a twice-differentiable process, and Assumption 3.8 is standard and also implies the boundedness and Lipschitz continuity of $f_1$. Recall the bandwidths $h_1$ and $h_2$ used in the smoothing steps for $\hat m$ in (3.3) and $\hat\Sigma$ in (3.5), respectively. We assume
$\hat m(t, y) = n^{-1} \sum_{i=1}^{n} X_i(t) 1(Y_i \le y)$, while the estimate of $\Lambda_{\mathrm{FCV}}$ remains the same as (4.4). For densely observed $X_i$, individual smoothing can be used as a preprocessing step
to recover smooth trajectories and the estimation error introduced in this step can be
shown to be asymptotically negligible under certain design conditions, i.e., it is equivalent
to the ideal situation of the completely observed Xi’s (Hall et al., 2006).
4.3 Simulations
In this section we illustrate the practical performance of the proposed FCV method us-
ing reduced rank quadratic discriminant analysis (see Hastie et al., 2009, chap. 4.3.3)
to split the K-dimensional EDR space into C = 2 regions for class prediction. For
i = 1, . . . , n, let Zi =(〈β1,sn , Xi〉, . . . , 〈βK,sn , Xi〉
)>denote the K-variate random vari-
able Xi that has been projected onto the EDR space via FCV. For a new observation
Z0 =(〈β1,sn , X0〉, . . . , 〈βK,sn , X0〉
)>, we calculate the reduced rank quadratic discrimi-
nant function
δk(Z0) = −1
2log |Σk| −
1
2(Z0 − µk)>Σ−1
k (Z0 − µk) + log πk, (4.5)
where $\mu_k$ and $\Sigma_k$ are the mean vector and covariance matrix of subpopulation $\Pi_k$ calculated from the reduced variables $Z_i$, respectively, and $\pi_k$ is the estimated proportion of subpopulation $\Pi_k$. We classify $X_0$ to subpopulation $\Pi_0$ if $\delta_0(Z_0) > \delta_1(Z_0)$, and to $\Pi_1$ if $\delta_0(Z_0) < \delta_1(Z_0)$. We remind the reader from Section 3.4 that $\langle \hat\beta_{k,s_n}, X_i \rangle$ is given by an integral approximation when the functional data are dense, while $X_i$ is replaced by its PACE (Yao et al., 2005a) estimate $\hat X_i$ when the functional data are sparse.
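A compact sketch of the fit-and-classify steps on the projected scores (two classes, as in the simulations below; names are illustrative):

    import numpy as np

    def qda_fit(Z, labels):
        # Estimate (mean vector, covariance matrix, prior) of each subpopulation
        # from the projected scores Z (one row per subject).
        fit = {}
        for k in (0, 1):
            Zk = Z[labels == k]
            fit[k] = (Zk.mean(axis=0), np.atleast_2d(np.cov(Zk.T)),
                      len(Zk) / len(Z))
        return fit

    def qda_classify(z0, fit):
        # Evaluate the reduced-rank quadratic discriminant (4.5) for each class
        # and assign z0 to the maximizer.
        scores = []
        for k in (0, 1):
            mu, S, pi = fit[k]
            d = np.atleast_1d(z0 - mu)
            scores.append(-0.5 * np.log(np.linalg.det(S))
                          - 0.5 * d @ np.linalg.solve(S, d) + np.log(pi))
        return int(np.argmax(scores))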
We compare our proposal to (i) functional SAVE in the same reduced rank QDA
framework, (ii) FCS in a reduced rank LDA framework, (iii) QDA on the FPCs (Hall
et al., 2001), and (iv) a Naive Bayes (NB) classifier on the FPCs. In all of the following
simulations we generate a total of n = 100 curves from Π0 and Π1 with respective sizes
$n_0 = n/2$ and $n_1 = n/2$. For $k = 0, 1$, functional processes from $\Pi_k$ are generated as $X_{ki}(t) = \sum_{j=1}^{40} (\theta_{kj} + \mu_{kj}) \phi_j(t)$, where $\theta_{kj}$ is i.i.d. $N(-(\lambda_{kj}/2)^{1/2}, \lambda_{kj}/2)$ with probability $1/2$ and $N((\lambda_{kj}/2)^{1/2}, \lambda_{kj}/2)$ with probability $1/2$. The $\lambda_{kj}$ and $\mu_{kj}$ are selected depending on the property of FCV we want to illustrate below. In each case the measurement error on $X_{ki}$ is i.i.d. $N(0, 0.01)$, the domain of observation is $t \in [0, 1]$, and the eigenfunctions are $\phi_j(t) = \sin(\pi t j/2)/\sqrt{2}$ for $j$ even and $\phi_j(t) = \cos(\pi t j/2)/\sqrt{2}$ for $j$ odd. For dense
functional data the Tij are 101 equispaced points in [0, 1], while for sparse functional data
the number of observations per subject Ni is chosen uniformly from {5, . . . , 14} and the
observational times Tij are i.i.d. U(0, 1).
Shown in Table 4.1 are the combinations of λkj and µkj that are considered. Model A
captures the general classification problem where both the inter-class means and covari-
ances are different, model B depicts the scenario when only the inter-class covariances
are different, and model C describes the scenario when only the inter-class means are dif-
ferent. We compute the average percentage of misclassification and its standard error over 100 Monte Carlo repetitions, shown in Table 4.2 for the sparse design. The structural dimension $K$ and the truncation parameter $s_n$ are chosen by minimizing the misclassifi-
cation rate. These results suggest that FCV is optimal when inter-class covariances are
distinct, but that FCS is optimal otherwise. The results for FCV and FSAVE when inter-
class covariances are equal corroborate those of Zhu and Hastie (2003), who showed
that multivariate SAVE tends to over-emphasize second-order differences between classes,
while ignoring first-order differences.
Table 4.1: Shown are the combinations of $\lambda_{kj}$ and $\mu_{kj}$ used in our simulation study.

Model   λ0j      λ1j      µ0j                            µ1j
A       j^{-3}   4j^{-2}  µ01 = µ02 = µ03 = µ04 = 1      0 for all j
B       j^{-3}   4j^{-2}  0 for all j                    0 for all j
C       3j^{-2}  3j^{-2}  µ01 = µ02 = µ03 = µ04 = 1      0 for all j
4.4 Data Applications
In this section we study temporal gene expression data for the yeast cell cycle (Spellman et al., 1998). Each trajectory contains 18 observations of gene expression, measured
Table 4.2: Shown are the average misclassification errors (×100%) with standard errors (in parentheses), and the optimal $K$ and $s_n$ that minimize the average misclassification error over 100 Monte Carlo repetitions for sparse functional data.
Model   FCV             FSAVE           FCS             QDA           NB
A       16.11 (1.36)    19.48 (1.52)    22.94 (.44)     21.93 (.59)   28.83 (.79)
        K = 3, sn = 3   K = 3, sn = 3   K = 1, sn = 3   sn = 3        sn = 2
        K = 2, sn = 2   K = 3, sn = 3   K = 1, sn = 4   sn = 3        sn = 3
Table 4.3: Shown are the average misclassification errors (×100%) with standard errors (in parentheses), and the optimal $K$ and $s_n$ that minimize the average 5-fold cross-validated classification error for the temporal gene expression data.

FCV             FSAVE           FCS             QDA           NB
15.11 (.25)     15.12 (.27)     21.76 (.34)     37.47 (.51)   40.01 (.69)
K = 2, sn = 2   K = 2, sn = 2   K = 1, sn = 2   sn = 3        sn = 4
every 7 minutes between 0 and 119 minutes. In total, 92 genes were identified, of which 43 are known to regulate the G1 ($Y = 1$) phase and the remaining 49 are known to regulate the non-G1 ($Y = 0$) phase. The functional trajectories are shown in Figure 4.1. To artificially create sparse functional trajectories from these dense data, we randomly select 9 observations from each trajectory. In Table 4.3, we present the minimized average
5-fold cross-validated prediction error over 20 random partitions for different methods,
together with the selected structural dimensions and truncation sizes. The second-order EDR methods, FCV and FSAVE, are virtually indistinguishable from each other, but both compare very favorably to the other methods.
Figure 4.1: Temporal gene expression trajectories for the G1 phase (left panel) and the non-G1 phase (right panel).
4.A Appendix: Proof of Theorem 4.1
It suffices to show that, for any $b \in H$, $\langle b, \Sigma \beta_k \rangle = 0$ for all $k = 1, \dots, K$ implies $\langle b, \Lambda(y) b \rangle = 0$. First, observe that $\langle b, \Lambda(y) b \rangle = \langle b, \mathbb V[X 1(Y \le y)] b \rangle - F(y) \langle b, \Sigma b \rangle$. Then,