Faculté des Sciences
Département de Mathématique
Inference for stationary functional time series:
dimension reduction and regression
Łukasz KIDZIŃSKI
Thesis presented for the degree of Doctor of Sciences, orientation Statistics
Supervisor: Siegfried Hörmann
Jury: Maarten Jansen, Davy Paindaveine, Thomas Verdebout, Laurent Delsol, Piotr Kokoszka
September 2014
“Simplicity is the final achievement.
After one has played a vast quantity of notes and more
notes, it is simplicity that emerges as the crowning reward
of art.”
Fryderyk Chopin
in If Not God, Then What? by Joshua Fost.
Acknowledgements
First and foremost, my heartfelt thanks to Professor Siegfried Hörmann who, throughout three
years, taught me how to be a great scientist and a better person. I would like to thank him for all
these long hours spent in front of the blackboard, for the passage from hard theoretical problems to
neat and valuable solutions, for his incredible precision and attention to detail which finally drove
me to be more careful, for his constantly positive attitude, charisma and expertise which will always
be a unique example for me, for the trust he gave me by letting me follow my own paths. It is a
great honour and privilege to be his first PhD student.
My gratitude is also extended to Piotr Kokoszka, for the support he showed and keeps showing
me from the moment we met, for his hospitality and the priceless opportunity to work together at
the Colorado State University.
My thanks go to David Brillinger as well, who found time to share his experience with me at
UC Berkeley, regardless of the many obstacles.
My sincere thanks also go out to Cheng Soon Ong for accommodating me in the challenging
environment of the ETH Zurich.
My thanks also go to my thesis committee for their guidance and our yearly recaps, to Davy
Paindaveine for his sharp remarks and exceptional humor, to Maarten Jansen for giving a great
example of scientific commitment, to Pierre Patie for valuable remarks at the beginning of my work,
and to Thomas Bruss for sharing his experience through countless stories and digressions during
lunches and coffee breaks. Likewise, my thanks to other members of my jury, Thomas Verdebout,
Laurent Delsol, for accepting my request and for their time.
Next, I would like to acknowledge the Communauté française de Belgique, for the grant within
Actions de Recherche Concertées (2010–2015) and the Belgian Science Policy Office, for the grant
within Interuniversity Attraction Poles (2012–2017). Thanks for the indispensable means which
allowed me to spend three years on my project.
Furthermore, I am aware that a scientific journey starts much earlier than in a doctoral school.
I would not be who I am without all the support from teachers starting from my childhood up till
now – I know that this thesis is not just mine, but their success too. In particular, I would like to
thank my primary school teacher Krzysztof Lukasiewicz and my high school teacher Jerzy Konarski
who taught me to enjoy mathematics.
Many fellow students and faculty members also supported me at the Université Libre de Bruxelles.
A great thank you to my office colleague Remi for all the necessary breaks for random mathematical
problems, for refreshing algorithmic competitions, for chess games or simple discussions about the
essence of the universe. Thanks to my second office colleague Rabih, and neighbours, Sarah, Carine,
Stavros, Dominik, Germin and Christophe, fellow students Robson and Isabel and many others for
teaching me French and for maintaining my sanity through chats, dinners, jogging and more.
Thanks to the whole Gauss’oh Fast team, for the taste of victory and to the BSSM co-organisers,
i
Julien, Julie, Patrick, Yves, Thomas, Nicolas and others for quite the same reason.
I am also honoured by the support from outside of the university. Thanks to Daniel, Bella,
Felipe, Astrid, Senna, Thiago, Wolney, Anna, Omid, Maryam, and Sarah for enriching discussions
about science, politics, economics and any sort of regular gossip during Friday’s dinners. Thanks to
Jan, Dominika and Michał for being there whenever I needed help. Thanks to my fantastic Polish
friends, Sebastian for his persistence, Karol for finding time for me no matter what, Natalia who
makes me remember I can achieve everything and Kinga for her exceptional life attitude. Thanks
to Leo for his constant positive thinking.
Thanks to my family, to my mother and sister who taught me the value of time, who always
believed in me and who will always protect me, to my father who was always motivating me to reach
for more.
Last, but certainly not least, I must acknowledge with tremendous and deep gratitude my lovely
Magda, for her limitless smiles, trust and support for all my ideas and decisions no matter how
crazy they seem. Together we are a team and for such a team every challenge is feasible.
The continuous advances in data collection and storage techniques allow us to observe and record
real-life processes in great detail. Examples include financial transaction data, fMRI images, satellite
photos, the Earth's pollution distribution over time, etc. Due to the high dimensionality of such data,
classical statistical tools become inadequate and inefficient. The need for new methods emerges and
one of the most prominent techniques in this context is functional data analysis (FDA).
The main objective of this work is to analyze temporal dependence in FDA. Such dependence
occurs, for example, if the data consist of a continuous time process which has been cut into segments,
days for instance. We are then in the context of so-called functional time series.
Many classical time series problems arise in this new setup, like modeling or prediction. In
this work we will be concerned mainly with regression and dimension reduction, comparing
time–domain methods with frequency–domain methods.
In this chapter, we further discuss the motivational examples and introduce articles upon which
this thesis is based.
1 Functional data analysis
1.1 Motivation
The main concern of statistics is to obtain essential information from a sample of observations
X1, X2, ..., XN. We are given a finite sample of size N ∈ N, where the observations Xi can be scalars, vectors
or more complex objects, like genotypes, fMRI scans or images.
Functional data analysis deals with observations which can be naturally expressed as functions.
Figures 1, 2 and 3 present several cases from different areas of science which fit into the framework
of functional data analysis.
When we deal with a physical process it is often natural to assume that it behaves in a
continuous manner and that the observations do not oscillate significantly between the measurements.
Although in the Digital Era we rarely record analog processes continuously, we often have enough
datapoints that interpolation does not cause a significant measurement error. Models incorporating
this additional structure can lead to more precise and meaningful findings. In this context, FDA
can be seen as a tool which embeds the continuity feature into the model.
On the other hand, besides providing a good approximation of a continuous process, FDA can also
prove useful in the noisy, discontinuous case. Then FDA serves as a tool for denoising and
smoothing the data, which is beneficial whenever the underlying process is the main concern.
From a pragmatic perspective, functional data can be seen simply as infinite-dimensional
vectors, with extended notions of mean and variance, and thus we may be tempted to employ
classical multivariate techniques. However, there are many practical and theoretical problems that
need to be addressed. For example, in the context of linear models, the inversion of the (infinite-
dimensional) covariance operator is not straightforward and needs to be treated carefully, from
both the theoretical and the practical perspective. This issue, together with our novel approach to the
Introduction Functional data analysis
Figure 1: Berkeley Growth Data: Heights of 20 girls taken from ages 0 through 18 (left). The growth process is easier to visualize in terms of acceleration (right). Tuddenham and Snyder [49] and Ramsay and Silverman [43].
Figure 2: Lower lip movement (top), acceleration (middle) and EMG of a facial muscle (bottom) of a speaker pronouncing the syllable “bob” for 32 replications. Malfait, Ramsay, and Froda [32].
Figure 3: Projections of DNA minicircles on the planes given by the principal axes of inertia (three panels on the left side: TATA curves; right: CAP curves). Mean curves are plotted in white. Panaretos, Kraus and Maddocks [36].
classical functional regression problem, is the topic of Chapter 1.
The FDA approach is also useful for a parsimonious representation of the data, taking advantage
of their smoothness. Instead of looking at a function as a dense vector of values, we can often
represent it as a linear combination of a handful of (well-chosen) basis functions.
Finally, there are also advantages in the FDA approach which stem from the structure of the
data. For example, one of the drawbacks of the acclaimed multivariate Principal Component
Analysis (PCA) is its scale dependence. It makes no sense to rescale a function componentwise (with
different scaling factors at different arguments) and hence for the functional counterpart of PCA,
the Functional Principal Component Analysis (FPCA), the lack of scale-invariance is not an issue.
A detailed introduction to Functional Principal Components is given in Section 1.6. In Chapter 3
we describe an extension of the technique benefiting from the time–dependent framework.
1.2 Brief overview of functional data research
One of the most influential works in the field of FDA is the seminal book by Ramsay and
Silverman [43]. Together with the accompanying R and Matlab libraries, which significantly facilitate
both research and practice in the area, it is a main reference in the field. Many important results were
mapped from the multivariate case, often taking advantage of the unique features of functional
objects, whereas others, like the analysis of derivatives, were derived uniquely in this setting.
As a running example, Ramsay and Silverman [43] consider growth curves of 10 girls measured
at a set of 31 ages. They argue that statistics obtained on derivatives can be more informative than
the classical analysis of the curves themselves, performed earlier by Tuddenham and Snyder [49].
Practical applications of functional data analysis are spread across many areas of science and
engineering. Panaretos et al. [36] use [0, 1] → R³ closed curves to analyze the behavior of DNA
minicircles, providing a testing methodology for the comparison of two classes of curves. Aston and
Kirch [2] analyze the stationarity and change point detection for functional time series, with appli-
cations to fMRI data. Hadjipantelis et al. [18] analyze Mandarin language using functional principal
components. Functional time series also naturally emerge in financial applications – Kokoszka and
Reimherr [29] analyze predictability of the shape of intraday price curves.
From the theoretical perspective, Berkes et al. [5] extensively studied the problem of change
points within a set of functional observations, whereas Horváth et al. [26] recently investigated
testing for stationarity. Many multivariate techniques were extended to the infinite-dimensional
setup, like functional dynamic factor models [20] or functional depth [31].
These works are only a fraction of the ongoing research; for a more comprehensive survey of
applications and theory we refer to the books [43], [16], [25] and [6].
1.3 Hilbert spaces
For most of the results presented in this work we only require that the functional space is a separable
Hilbert space, i.e. a complete inner product space with a countable orthonormal basis. This allows us to state
more general results, so that the space of square-integrable functions L²([a, b]), a < b, is a special
case.
Although most of our examples concern real-valued functions defined on a finite interval,
one should keep in mind other possible applications, including, for example, multivariate functions
or images and audio files, as described in Section 1.2.
1.4 Notation
Let H1, H2 be two (not necessarily distinct) separable Hilbert spaces. We denote by L(Hi, Hj),
(i, j ∈ {1, 2}), the space of bounded linear operators from Hi to Hj. Further we write 〈·, ·〉_H for the
inner product on the Hilbert space H and ‖x‖_H = 〈x, x〉_H^{1/2} for the corresponding norm.
For Φ ∈ L(Hi, Hj) we denote by ‖Φ‖_{L(Hi,Hj)} = sup_{‖x‖_{Hi} ≤ 1} ‖Φ(x)‖_{Hj} the operator norm and by
‖Φ‖_{S(Hi,Hj)} = (∑_{k≥1} ‖Φ(ek)‖²_{Hj})^{1/2}, where e1, e2, ... ∈ Hi is any orthonormal basis (ONB) of Hi,
the Hilbert–Schmidt norm of Φ. It is well known that this norm is independent of the choice of
the basis. Furthermore, with the inner product 〈Φ, Θ〉_{S(H1,H2)} = ∑_{k≥1} 〈Φ(ek), Θ(ek)〉_{H2} the space
S(H1, H2) is again a separable Hilbert space. For simplifying the notation we use Lij instead of
L(Hi, Hj) and in the same spirit Sij, ‖·‖_{Lij}, ‖·‖_{Sij} and 〈·, ·〉_{Sij}.
All random variables appearing in this work will be assumed to be defined on some common
probability space (Ω, A, P). A random element X with values in H is said to be in L^p_H if
ν_{p,H}(X) := (E‖X‖^p_H)^{1/p} < ∞. More conveniently we shall say that X has p moments. If X
possesses a first moment, then X possesses a mean µ, determined as the unique element for which
E〈X, x〉_H = 〈µ, x〉_H, ∀x ∈ H. For x ∈ Hi and y ∈ Hj let x ⊗ y : Hi → Hj be an operator defined
as x ⊗ y(v) = 〈x, v〉 y. If X ∈ L²_H, then it possesses a covariance operator C, given by
C = E[(X − µ) ⊗ (X − µ)]. It can be easily seen that C is a Hilbert–Schmidt operator. Assume
X, Y ∈ L²_H. Following Bosq [6], we say that X and Y are orthogonal (X ⊥ Y) if E X ⊗ Y = 0.
A sequence of orthogonal elements in H with a constant mean and constant covariance operator is
called H–white noise.
1.5 Representation and fit
Since we are dealing with infinite-dimensional objects, we need to represent and approximate them
in a convenient way. This is important from the practical as well as the theoretical perspective. From
the practical point of view, due to limited computer memory, we will always work with approximations,
and we want to use low-dimensional approximations for computational reasons.
One possibility to represent a curve is to select a sufficiently fine grid and process the
vector of values of the function on the intervals induced by the gridpoints. This approach, often
used in practice, does not benefit from the continuity of the functions.
In this work, we follow the ideas popularized by Ramsay and Silverman [43], based on basis
function expansions, most prominently the Karhunen–Loève or Fourier expansion. Let (ei)i≥1 be
an orthonormal basis of a separable Hilbert space H. Then, any element x ∈ H can be uniquely
represented as

x = ∑_{i=1}^∞ 〈x, ei〉 ei.

Note that, by Parseval's formula,

‖x‖² = ∑_{i=1}^∞ |〈x, ei〉|².

Since this series converges, for any ε > 0 there exists d such that

∑_{i=d}^∞ |〈x, ei〉|² < ε.

We can therefore approximate the function with arbitrary precision ε > 0 using only the first d
basis elements. This approach is consistent with intuition. Indeed, if we use, for example, Fourier
basis functions, then the high-frequency components are expected to be negligible and can be
discarded.
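The truncation idea above can be illustrated numerically. The following Python sketch (NumPy assumed; the test curve and the choice of a Fourier basis on [0, 1] are illustrative assumptions, not part of the text) computes the first d basis scores of a sampled curve by Riemann sums and checks that the approximation error shrinks as d grows:

```python
import numpy as np

# Orthonormal Fourier basis on [0, 1]: 1, sqrt(2)cos(2*pi*k*t), sqrt(2)sin(2*pi*k*t), ...
def fourier_basis(t, d):
    B = [np.ones_like(t)]
    k = 1
    while len(B) < d:
        B.append(np.sqrt(2) * np.cos(2 * np.pi * k * t))
        if len(B) < d:
            B.append(np.sqrt(2) * np.sin(2 * np.pi * k * t))
        k += 1
    return np.array(B)                    # shape (d, len(t))

t = np.linspace(0, 1, 1001)
dt = t[1] - t[0]
x = np.exp(-t) * np.sin(4 * np.pi * t)    # a smooth curve observed on a dense grid

def truncation_error(d):
    B = fourier_basis(t, d)
    coef = B @ x * dt                     # scores <x, e_i> by a Riemann-sum approximation
    return np.sqrt(np.sum((x - coef @ B) ** 2) * dt)   # L2 distance to the projection

errs = [truncation_error(d) for d in (3, 9, 21)]
assert errs[0] > errs[1] > errs[2]        # the tail sum of squared scores shrinks with d
```

Since the truncated bases are nested, the projection error is necessarily non-increasing in d; the assertion simply confirms this on the example curve.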
Although the fitting and representation of functional data is an important and intensively studied
topic on its own, in this work we assume that observations are fully observed, i.e. we are given the
actual curves. For more information on fitting we refer to [43].
1.6 Dimension reduction
From a theoretical perspective a curve observation X is an intrinsically infinite–dimensional object.
Besides the choice of an appropriate basis, there is also the need for dimension reduction.
Arguably, functional principal component analysis (FPCA) is the key technique for this problem.
Like its multivariate equivalent, FPCA is based on the analysis of the covariance operator and it is
concerned with finding directions which contribute most to the variability of the observations.
Let X be a functional random variable taking values in some Hilbert space H and let C = E X ⊗ X
be its covariance operator. (Without loss of generality we assume here and in many places that
EX = 0.) For C to exist, we assume that E‖X‖² < ∞. One can show that C is a symmetric,
positive definite Hilbert–Schmidt operator and can hence by the spectral theorem be decomposed
into

C = ∑_{i=1}^∞ λi ei ⊗ ei,    (1)

where λ1 ≥ λ2 ≥ ... ≥ 0 are the eigenvalues of C and (ei)_{i∈N} are the corresponding
eigenfunctions, forming an orthonormal basis of the underlying Hilbert space H.
If we pick the first d basis elements (ei)_{i=1}^d and project the observation X on the space spanned
by them, we obtain the optimal d-dimensional approximation in terms of the mean square error, i.e.

E‖X − ∑_{i=1}^d 〈X, ei〉 ei‖² ≤ E‖X − ∑_{i=1}^d 〈X, e′i〉 e′i‖²,
Introduction Functional Time Series
Figure 4: Horizontal component of the magnetic field measured in one-minute resolution at the Honolulu magnetic observatory from 1/1/2001 00:00 UT to 1/7/2001 24:00 UT; 1440 measurements per day.
for any other orthonormal collection (e′i)_{1≤i≤d}. The directions ei are called the principal components
of X and the coefficients 〈X, ei〉 are called PC scores. A simple computation shows that PC scores
are uncorrelated, which is another key feature.
We remark again that a main advantage of FPCA over the multivariate version is that scale-
invariance is not relevant. Consequently, it is much easier to interpret functional PCs and linear
combinations thereof. For a detailed theory of multivariate principal components we refer to [28],
and to [45] for the functional setup.
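The construction above can be sketched numerically as follows (a Python/NumPy illustration; the sine shapes and eigenvalue decay of the simulated curves are toy assumptions, not part of the text). The empirical covariance operator is diagonalized on a grid, and the resulting PC scores come out ordered by variance contribution and uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 200)
dt = t[1] - t[0]
N = 500

# Toy data: curves built from 8 sine shapes with decaying score variances (assumed model)
shapes = np.array([np.sqrt(2) * np.sin((k + 1) * np.pi * t) for k in range(8)])
lam_true = 1.0 / np.arange(1, 9) ** 2
X = (rng.normal(size=(N, 8)) * np.sqrt(lam_true)) @ shapes
X -= X.mean(axis=0)                          # centre: we work with EX = 0

# Discretised covariance operator and its eigendecomposition, cf. (1)
C = X.T @ X / N * dt                         # matrix acting as the operator on the grid
w, V = np.linalg.eigh(C)
order = np.argsort(w)[::-1]
eigvals = w[order]                           # estimated lambda_1 >= lambda_2 >= ...
eigfuns = (V[:, order] / np.sqrt(dt)).T      # estimated eigenfunctions e_i on the grid

scores = X @ eigfuns[:3].T * dt              # first three PC scores <X, e_i>
corr = np.corrcoef(scores, rowvar=False)
assert np.all(np.abs(corr - np.eye(3)) < 1e-8)   # PC scores are uncorrelated
assert eigvals[0] > eigvals[1] > eigvals[2] > 0  # ordered variance contributions
```

The exact diagonality of the empirical score covariance mirrors the uncorrelatedness of the PC scores noted in the text.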
FPCA gained popularity in both the iid and the time-dependent setup. However, in Chapter 3 we
argue that this technique is no longer optimal for time series and may lead to misconceptions when
not used carefully. We then propose an extension of FPCA which benefits from the temporal
dependence structure.
2 Functional Time Series
In many practical situations functions are naturally ordered in time. For example, when we deal
with daily observations of the stock market or with sequences of tumor scans. Then, we are in the
context of a so–called functional time series (FTS).
As a motivating example consider Figure 4. Here, the assumption of independence can be too
strong – values at the beginning of each day are highly correlated with those at the end of the
preceding day. Moreover, we see that big jumps are often followed by significant drops.
These, and similar features, may indicate significant temporal dependence not just within a
subject, but also between different subjects (e.g. days). In this section we discuss possible frameworks
which allow us to quantify, test and use this additional information.
2.1 Stationarity
Many physical processes are known to have a time-invariant distribution. This motivates the
frequentist approach to time series, where we assume that the structure does not change in time
and we infer from estimated covariances. A functional test for stationarity was recently introduced
by Horváth et al. [26]. Non-stationary time series are also extensively studied; however, they are
beyond the scope of this work.
Let (Xt) be a series of random functions. We say that (Xt) is stationary in the strong sense if for
any h ∈ Z, k ∈ N and any sequence of indices t1, t2, ..., tk the vectors (Xt1, ..., Xtk) and (Xh+t1, ..., Xh+tk)
are identically distributed.
We also define weak stationarity by looking only at the second-order structure of the series. We
say that (Xt) is weakly stationary if E‖Xt‖² < ∞ and
1. EXt = EX0 for each t ∈ Z and
2. E[Xt ⊗ Xs] = E[Xt−s ⊗ X0] for each t, s ∈ Z.
2.2 Model approach
Arguably, one of the most popular models of temporal dependence is the functional autoregressive
model (FAR(p)), studied in great detail by Bosq [6]. In this model we assume that the state at time
t is a linear function of the p previous states plus an independent white noise innovation. The main
concern is the estimation of the p linear operators involved. Once the AR structure is identified, we
can profit from the explicit probabilistic structure and dynamics of the time series. We describe this
model in detail in Chapter 1.
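A minimal simulation sketch of a FAR(1) recursion follows (Python/NumPy; the integral kernel for the operator and the innovation-smoothing kernel are toy assumptions, chosen so that the operator norm stays below one and a stationary solution exists):

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 100)
dt = t[1] - t[0]

# Toy integral kernel for the autoregressive operator (an assumption), scaled so the
# recursion X_k = Psi(X_{k-1}) + eps_k admits a stationary solution
psi = 0.5 * np.exp(-(t[:, None] - t[None, :]) ** 2)

# Smooth iid Gaussian vectors into random innovation curves
smooth = np.exp(-np.abs(t[:, None] - t[None, :]) / 0.1) * dt

def simulate_far1(n, burn=100):
    X, path = np.zeros(len(t)), []
    for i in range(n + burn):
        eps = smooth @ rng.normal(size=len(t))
        X = psi @ X * dt + eps        # Psi(X) discretised as a Riemann sum
        if i >= burn:
            path.append(X.copy())
    return np.array(path)

sample = simulate_far1(500)
# Consecutive curves are dependent: their average inner product is positive
lag1 = np.mean(np.sum(sample[1:] * sample[:-1], axis=1)) * dt
assert lag1 > 0
```

The positive lag-1 inner product is exactly the kind of between-curve dependence discussed for Figure 4.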
Many time series can, however, not be approximated by a FAR(p) process and the need for
more complex models arises. ARMA or GARCH-type models (cf. [21]) could serve as alternatives,
but the required theoretical foundation beyond the relatively simple autoregressions is still sparse.
Furthermore, for many time series it is not clear which model they follow. Nonetheless, time series
procedures may still apply. It is then preferable to only impose a certain dependence structure,
rather than requiring a particular model. In the next sections we introduce three popular notions
of dependence and justify the choice of the framework employed throughout this work.
2.3 Lp-m-approximability
In this framework, weak dependence is defined by a “small” Lp distance between the process and
its approximation based on only the last m innovations. This idea is made precise in the following
two definitions.
Definition 1. Suppose (Xn)n≥1 is a random process with values in H and let F⁻_n = σ(..., X_{n−2}, X_{n−1}, X_n)
and F⁺_n = σ(X_n, X_{n+1}, X_{n+2}, ...) be the σ-algebras generated by the terms up to time n and from
time n on, respectively. The process (Xn) is said to be m-dependent if F⁻_n and F⁺_{n+m} are independent.
In practice, processes usually do not have the property from Definition 1; however, they can
often be approximated by such series. This motivates the following approach to weak dependence.
Definition 2 (Hörmann and Kokoszka [23]). A random sequence (Xn)n≥1 with values in H is called
Lp–m–approximable if it can be represented as

Xn = f(δn, δ_{n−1}, δ_{n−2}, ...),

where the δi are iid elements taking values in a measurable space S and f is a measurable function
f : S∞ → H. Moreover, if the δ′i are independent copies of the δi defined on the same probability space,
then for

X_n^{(m)} = f(δn, δ_{n−1}, ..., δ_{n−m+1}, δ′_{n−m}, δ′_{n−m−1}, ...)    (2)

we have

∑_{m=1}^∞ ν_{p,H}(X_m − X_m^{(m)}) < ∞.
Note that the independent copies in (2) are used for simplicity of the proofs; alternative
representations lead to analogous results. Let us also stress that representation (2) is rather general and
incorporates most time series models encountered in practice. Furthermore, checking the validity of
the dependence condition reduces to p-th order moments, which is typically much simpler
than establishing the classical mixing conditions explained next.
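The coupling in the definition can be illustrated on a toy linear process X_n = ∑_k 0.5^k δ_{n−k} (a Python/NumPy sketch; the geometric filter and truncation level are assumptions). We estimate ν_p(X_m − X_m^{(m)}) by Monte Carlo for a few values of m and observe the geometric decay that makes the series over m summable:

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.linspace(0, 1, 50)
K, p = 60, 2                          # filter truncation and number of moments (p = 2)
weights = 0.5 ** np.arange(K + 1)     # toy filter: X_n = sum_k 0.5^k delta_{n-k}

def nu_p_estimate(m, reps=400):
    # Monte-Carlo estimate of nu_p(X_m - X_m^(m)): innovations older than m steps
    # are replaced by independent copies delta'_i, exactly as in the coupling
    sq = []
    for _ in range(reps):
        deltas = rng.normal(size=(K + 1, len(t)))    # delta_m, delta_{m-1}, ...
        copies = rng.normal(size=(K + 1, len(t)))    # independent copies delta'_i
        keep = np.arange(K + 1)[:, None] < m
        X = weights @ deltas
        Xm = weights @ np.where(keep, deltas, copies)
        sq.append(np.mean((X - Xm) ** 2))            # squared L2 norm on the grid
    return np.mean(np.array(sq) ** (p / 2)) ** (1 / p)

nus = [nu_p_estimate(m) for m in (1, 4, 8)]
assert nus[0] > nus[1] > nus[2]       # geometric decay in m, hence a summable series
```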
2.4 Mixing conditions
There exist numerous variants of mixing. We introduce the strong mixing (or α-mixing) condition,
which is one of the most prominent ones. In the functional context it has been used, e.g., by Aston
and Kirch [1]. For an extensive introduction to mixing we refer to Bradley [8].
In this approach we quantify and bound the dependence of the sigma-fields generated by the
variables X0, X−1, ... and Xm, Xm+1, ... for a given m ∈ N.
Definition 3. A strictly stationary process {Xj : j ∈ Z} is called strong mixing with mixing
rate rm if

sup_{A,B} |P(A ∩ B) − P(A)P(B)| = O(rm),    rm → 0,

where the supremum is taken over all A ∈ σ(..., X−1, X0) and B ∈ σ(Xm, Xm+1, ...).
Introduction Linear models
2.5 Cumulant condition
Another approach to quantifying weak dependence is based on so-called cumulants, expressing the
higher-order cross-moment structure. In the finite-dimensional case it was popularized by Brillinger
[9]; in the context of functional time series it was recently introduced by Panaretos and Tavakoli
[37]. The k-th order cumulant kernel is given by

cum(X_{t1}(τ1), ..., X_{tk}(τk)) = ∑_{v=(v1,...,vp)} (−1)^{p−1} (p − 1)! ∏_{l=1}^p E[ ∏_{j∈vl} X_{tj}(τj) ],

where the sum extends over all unordered partitions of {1, ..., k}. If we assume that E‖X0‖₂^l < ∞
for all l ≥ 1, then the cumulant kernels are well defined in L². For a given cumulant kernel of order 2k
one can define a 2k-th order cumulant operator R_{t1,...,t2k−1} : L²([0, 1]^k, R) → L²([0, 1]^k, R),
namely the integral operator whose kernel is the corresponding cumulant kernel.
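The partition formula can be checked directly in the scalar case with a short sketch (Python/NumPy; the helper names are hypothetical). We enumerate all unordered set partitions and verify that the order-2 joint cumulant, computed from empirical moments, reduces exactly to the covariance:

```python
import math
import numpy as np

def partitions(items):
    # Enumerate all unordered set partitions of the list `items` (recursive construction)
    if len(items) == 1:
        yield [items]
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part

def joint_cumulant(samples):
    # Empirical joint cumulant via the partition formula:
    # sum over partitions v = (v_1,...,v_p) of (-1)^{p-1}(p-1)! prod_l E prod_{j in v_l} X_j
    k = samples.shape[0]
    total = 0.0
    for part in partitions(list(range(k))):
        p = len(part)
        prod = 1.0
        for block in part:
            prod *= np.mean(np.prod(samples[block], axis=0))
        total += (-1) ** (p - 1) * math.factorial(p - 1) * prod
    return total

rng = np.random.default_rng(3)
x = rng.normal(size=(2, 1000)) + 1.0          # two iid samples, shifted to mean 1
c2 = joint_cumulant(x)                        # order-2 joint cumulant
cov = np.mean(x[0] * x[1]) - np.mean(x[0]) * np.mean(x[1])
assert abs(c2 - cov) < 1e-10                  # the formula reduces to the covariance
```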
In this paper we are concerned with a regression problem of the form
Yk = Ψ(Xk) + εk, k ≥ 1, (I.1)
where Ψ is a bounded linear operator mapping from space H1 to H2. This model is fairly general
and many special cases have been intensively studied in the literature. Our main objective is the
study of this model when the regressor space H1 is infinite dimensional. Then model (I.1) can be
seen as a general formulation of a functional linear model, which is an integral part of functional
data literature. Its various forms are introduced in Chapters 12–17 of Ramsay and Silverman [25].
A few recent references are Cuevas et al. [11], Malfait and Ramsay [23], Cardot et al. [6], Chiou
et al. [8], Müller and Stadtmüller [24], Yao et al. [28], Cai and Hall [3], Li and Hsing [22], Hall
and Horowitz [15], Reiss and Ogden [26], Febrero-Bande et al. [13], Crambes et al. [10], Yuan and
Cai [29], Ferraty et al. [14], Crambes and Mas [9].
From an inferential point of view, a natural problem is the estimation of the ‘regression operator’
Ψ. Once an estimator Ψ̂ is obtained, we can use it in an obvious way for prediction of the responses Y.
Both the estimation and the prediction problem are addressed in this paper. In the existing literature,
these problems have been discussed from several angles. For example, there is the distinction between
the ‘functional regressors and responses’ model (e.g., Cuevas et al. [11]) or the perhaps more widely
studied ‘functional regressor and scalar response model’ (e.g., Cardot et al. [5]). Other papers deal
with the effect when random functions are not fully observed but are obtained from sparse, irregular
data measured with error (e.g., Yao et al. [28]). More recently, the focus was on establishing rates
of consistency (e.g., Cai and Hall [3], Cardot and Johannes [7]). The two most popular methods
∗Manuscript has been accepted for publication in Scandinavian Journal of Statistics
of estimation are based on principal component analysis (e.g., Bosq [1], Cardot et al. [5], Hall and
Horowitz [15]) or spline smoothing estimators (e.g., Hastie and Mallows [16], Marx and Eilers [12],
Crambes et al. [10]).
In this paper we address the estimation and prediction problem for this model when the data
are fully observed, using the principal component (PC) approach. Let us explain what the new
contribution is and what distinguishes our paper from previous work.
(i) The crucial difficulty for this type of problem is that the infinite-dimensional operator Ψ needs
to be approximated by a sample version Ψ̂K of finite dimension K, say. Clearly, K = Kn needs to
depend on the sample size and tend to ∞ in order to obtain an asymptotically unbiased estimator.
In existing papers, the determination of K and the proof of consistency require, among other things,
unnecessary moment assumptions and artificial restrictions concerning the spectrum of the covariance operator
of the regressor variables Xk. As our main result, we will complement the current literature by
showing that the PC estimator remains consistent without such technical constraints. We provide
a data-driven procedure for the choice of K, which may even be used as a practical alternative to
cross-validation.
(ii) We allow the regressors Xk to be dependent. This is important for two reasons. First, many
examples in FDA literature exhibit dependencies as the data stem from a continuous time process,
which is then segmented into a sequence of curves, e.g., by considering daily data. Examples of this
kind include intra-day patterns of pollution records, meteorological data, financial transaction data
or sequential fMRI recordings. See, e.g., Horváth and Kokoszka [20].
Second, our framework detailed below will include the important special case of a functional
autoregressive model which has been intensively investigated in the functional literature and is often
used to model autoregressive dynamics of a functional time series. This model is analyzed in detail
in Bosq [2]. We can not only greatly simplify the assumptions needed for consistent estimation,
but also allow for a more general setup. E.g., in our Theorem 2 we show that it is not necessary
to assume that Ψ is a Hilbert-Schmidt operator if our intention is prediction. This quite restrictive
assumption is standard in existing literature, though it even excludes the identity operator.
(iii) As we already mentioned before, the literature considers different forms of functional linear
models. Arguably the most common are the scalar response with functional regressor and the
functional response with functional regressor case. We will not distinguish between these cases, but work
with a linear model between two general Hilbert spaces.
In the next section we will introduce notation, assumptions, the estimator and our main results.
In Section 3 we provide a small simulation study which compares our data driven choice of K with
cross-validation (CV). As we will see, this procedure is quite competitive with CV in terms of mean
squared prediction error, while it is clearly preferable to the latter in terms of computational cost.
Finally, in Section 6, we give the proofs.
2 Estimation of Ψ
2.1 Notation
Let H1, H2 be two (not necessarily distinct) separable Hilbert spaces. We denote by L(Hi, Hj),
(i, j ∈ {1, 2}), the space of bounded linear operators from Hi to Hj. Further we write 〈·, ·〉_H for
the inner product on the Hilbert space H and ‖x‖²_H = 〈x, x〉_H for the corresponding norm. For
Φ ∈ L(Hi, Hj) we denote by ‖Φ‖_{L(Hi,Hj)} = sup_{‖x‖_{Hi} ≤ 1} ‖Φ(x)‖_{Hj} the operator norm and by
‖Φ‖²_{S(Hi,Hj)} = ∑_{k≥1} ‖Φ(ek)‖²_{Hj}, where e1, e2, ... ∈ Hi is any orthonormal basis (ONB) of Hi,
the Hilbert–Schmidt norm of Φ. It is well known that this norm is independent of the choice of
the basis. Furthermore, with the inner product 〈Φ, Θ〉_{S(H1,H2)} = ∑_{k≥1} 〈Φ(ek), Θ(ek)〉_{H2} the space
S(H1, H2) is again a separable Hilbert space. For simplifying the notation we use Lij instead of
L(Hi, Hj) and in the same spirit Sij, ‖·‖_{Lij}, ‖·‖_{Sij} and 〈·, ·〉_{Sij}.
All random variables appearing in this paper will be assumed to be defined on some common
probability space (Ω, A, P). A random element X with values in H is said to be in L^p_H if
ν_{p,H}(X) := (E‖X‖^p_H)^{1/p} < ∞. More conveniently we shall say that X has p moments. If X
possesses a first moment, then X possesses a mean µ, determined as the unique element for which
E〈X, x〉_H = 〈µ, x〉_H, ∀x ∈ H. For x ∈ Hi and y ∈ Hj let x ⊗ y : Hi → Hj be an operator defined
as x ⊗ y(v) = 〈x, v〉 y. If X ∈ L²_H, then it possesses a covariance operator C, given by
C = E[(X − µ) ⊗ (X − µ)]. It can be easily seen that C is a Hilbert–Schmidt operator. Assume
X, Y ∈ L²_H. Following Bosq [2], we say that X and Y are orthogonal (X ⊥ Y) if E X ⊗ Y = 0.
A sequence of orthogonal elements in H with a constant mean and constant covariance operator is
called H–white noise.
2.2 Setup
We consider the general regression problem (I.1) for fully observed data. Let us collect our main
assumptions.
(A): We have Ψ ∈ L12. Further, (εk) and (Xk) are zero-mean sequences which are assumed to be
L4–m–approximable in the sense of Hörmann and Kokoszka [18] (see below). In addition, (εk) is
H2–white noise. For any k ≥ 1 we have Xk ⊥ εk.
Here is the weak dependence concept that we impose.
Definition 5 (Hormann and Kokoszka [18]). A random sequence (X_n)_{n≥1} with values in H is called L^p–m–approximable if it can be represented as

X_n = f(δ_n, δ_{n−1}, δ_{n−2}, ...),

where the δ_i are i.i.d. elements taking values in a measurable space S and f : S^∞ → H is a measurable function. Moreover, if (δ′_i) are independent copies of (δ_i) defined on the same probability space, then for

X_n^{(m)} = f(δ_n, δ_{n−1}, ..., δ_{n−m+1}, δ′_{n−m}, δ′_{n−m−1}, ...)

we have

Σ_{m≥1} ν_{p,H}(X_m − X_m^{(m)}) < ∞.
Evidently, i.i.d. sequences with finite p-th moments are Lp–m–approximable. This leads to the
classical functional linear model. But it is also easily checked that functional linear processes fit in
this framework. More precisely, if X_n is of the form

X_n = Σ_{k≥0} b_k(δ_{n−k}),

where the b_k : H_0 → H_1 are bounded linear operators such that Σ_{m≥1} Σ_{k≥m} ‖b_k‖_{L01} < ∞, and (δ_n) is an i.i.d. noise with ν_{p,H0}(δ_0) < ∞, then (X_n) is L^p–m–approximable. Other (also non-linear) examples
of functional time series covered by Lp–m–approximability can be found in [18].
A very important example included in our framework is the autoregressive Hilbertian model of
order 1 (ARH(1)) given by the recursion Xk+1 = Ψ(Xk) + εk+1. It will be treated in more detail in
Section 2.4.
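For intuition, the ARH(1) recursion is easy to simulate once curves are identified with their coefficient vectors in a fixed finite orthonormal basis. The sketch below is a finite-dimensional illustration; the particular operator Ψ, the dimension d and the coordinate variances are illustrative choices, not taken from the text:

```python
import numpy as np

def simulate_arh1(n, d=15, rho=0.6, burn_in=200, seed=0):
    """Simulate an ARH(1) process X_{k+1} = Psi(X_k) + eps_{k+1}.

    Curves are identified with their first d coefficients in some fixed
    orthonormal basis, so Psi becomes a d x d matrix.  An operator norm
    rho < 1 guarantees a stationary solution."""
    rng = np.random.default_rng(seed)
    # A smooth illustrative operator, rescaled so that ||Psi||_op = rho < 1.
    psi = rng.standard_normal((d, d)) / np.outer(np.arange(1, d + 1),
                                                 np.arange(1, d + 1))
    psi *= rho / np.linalg.norm(psi, 2)
    noise_sd = 1.0 / np.arange(1, d + 1)   # decaying coordinate variances
    X = np.zeros((n + burn_in, d))
    for k in range(1, n + burn_in):
        X[k] = psi @ X[k - 1] + noise_sd * rng.standard_normal(d)
    return X[burn_in:], psi
```

The burn-in discards the transient so that the returned sample is approximately a draw from the stationary distribution.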
The notion of L⁴–m–approximability implies that the process is stationary and ergodic and that it has finite fourth moments. The latter is in line with the existing literature. We are not aware
of any article that works with less than 4 moments. In contrast, for several consistency results
finite moments of all orders (or even bounded random variables) are assumed. Since our estimator
below is a moment estimator, based on second order moments, one could be tempted to believe that
some of our results may be deduced directly from the ergodic theorem under finite second moment
assumptions. We will explain in the next section, after introducing the estimator, why this line of
argumentation is not working.
Our weak dependence assumption implies that a possible non-zero mean of Xk can be estimated
consistently by the sample mean. Moreover, we have (see [19])

√n ‖X̄ − μ‖_{H1} = O_P(1),

where X̄ denotes the sample mean.
We conclude that the mean can be accurately removed in a preprocessing step and that EXk = 0 is
not a stringent assumption. Since by Lemma 2.1 in [18] Yk will also be L4–m–approximable, the
same argument justifies that we study a linear model without intercept.
2.3 The estimator
The PC based estimator for Ψ described below was first studied by Bosq [1] and is based on a
finite basis approximation. To achieve optimal approximation in finite dimension, one chooses
eigenfunctions of the covariance operator C = E[X1 ⊗ X1] as a basis. Let ∆ = E[X1 ⊗ Y1]. By
Assumption (A), both ∆ and C are Hilbert–Schmidt operators. Let (λ_i, v_i)_{i≥1} be the eigenvalues and corresponding eigenfunctions of the operator C, such that λ_1 ≥ λ_2 ≥ ⋯. The eigenfunctions are orthonormal, and those belonging to a non-zero eigenvalue form an orthonormal basis of Im(C), the closure of the image of C. Note that, with probability one, we have X ∈ Im(C). Since Im(C) is again a Hilbert space, we can assume that H_1 = Im(C), i.e. that the operator is of full rank. In this case all eigenvalues are strictly positive. Using the linearity of Ψ and the requirement X_k ⊥ ε_k from Assumption (A), one obtains ∆ = Ψ ∘ C, and hence ∆(v_j) = λ_j Ψ(v_j). Then, for any x ∈ H_1, this leads to the representation

Ψ(x) = Ψ( Σ_{j≥1} ⟨v_j, x⟩ v_j ) = Σ_{j≥1} (∆(v_j)/λ_j) ⟨v_j, x⟩.   (I.2)
Here we assume implicitly that dim(H_1) = ∞. If dim(H_1) = M < ∞, then (I.2) still holds with ∞ replaced by M. This case is well understood and will therefore be excluded.
Equation (I.2) gives a core idea for estimation of Ψ. We will estimate ∆, vj and λj from
our sample X1, . . . , Xn, Y1, . . . , Yn and substitute the estimators into formula (I.2). The estimated
eigenelements (λ̂_{j,n}, v̂_{j,n} : 1 ≤ j ≤ n) will be obtained from the empirical covariance operator

Ĉ_n = (1/n) Σ_{k=1}^n X_k ⊗ X_k.

In a similar straightforward manner we set

∆̂_n = (1/n) Σ_{k=1}^n X_k ⊗ Y_k.
For ease of notation, we will suppress in the sequel the dependence on the sample size n of these
estimators.
Apparently, from the finite sample we cannot estimate the entire sequence (λj , vj), rather we
have to work with a truncated version. This leads to

Ψ̂_K(x) = Σ_{j=1}^K (∆̂(v̂_j)/λ̂_j) ⟨v̂_j, x⟩,   (I.3)

where the choice of K = K_n is crucial. Since we want our estimator to be consistent, K_n has to grow to infinity with the sample size. On the other hand, we know that λ_j → 0. Hence, it will be a delicate issue to control the behavior of 1/λ̂_j. A small error in the estimation of λ_j can have an enormous impact on (I.3).
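In coordinates, the estimator (I.3) amounts to an eigendecomposition of the empirical covariance matrix followed by a truncated inversion. A minimal sketch, with curves again represented by basis coefficient vectors (the matrix conventions are illustrative, not the authors' implementation):

```python
import numpy as np

def pc_estimator(X, Y, K):
    """Truncated principal-component estimator (I.3) in coordinates.

    Rows of X (n x d1) and Y (n x d2) hold the coefficient vectors of the
    regressors and responses.  Returns the d2 x d1 matrix representing
    Psi_hat_K(x) = sum_{j<=K} <v_j, x> Delta(v_j) / lambda_j."""
    n = X.shape[0]
    C = X.T @ X / n          # empirical covariance operator C_n
    Delta = Y.T @ X / n      # empirical cross-covariance: Delta(v) = (1/n) sum <X_k, v> Y_k
    lam, V = np.linalg.eigh(C)
    lam, V = lam[::-1], V[:, ::-1]   # eigenvalues in decreasing order
    return sum(np.outer(Delta @ V[:, j], V[:, j]) / lam[j] for j in range(K))
```

With K equal to the full dimension and well-conditioned data this reduces to ordinary least squares; the delicate case discussed in the text is precisely when K grows and the trailing λ̂_j are small.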
Define Ψ_K(x) = Σ_{j=1}^K (∆(v_j)/λ_j) ⟨v_j, x⟩. Via the ergodic theorem one can show that the individual terms λ̂_j, v̂_j and ∆̂ in (I.3) converge to their population counterparts. It follows that ‖Ψ̂_K − Ψ_K‖_{L12} → 0 a.s., as long as K is fixed. In fact, this holds true under finite second moments.
However, as is well known, the ergodic theorem does not provide rates of convergence. Even if the underlying random variables were bounded, convergence can be arbitrarily slow. Consequently, we cannot let K grow with the sample size in this approach. We need to impose further structure on the dynamics of the process and the existence of higher order moments. Both are combined in the concept of L⁴–m–approximability.
In most existing papers determination of Kn is related to the decay-rate of λj. For example,
To bring our data to the same scale and make results under different settings comparable, we choose c_1, c_2 and c_3 such that Σ_{k=1}^{35} λ_k = 1. This implies E‖X_i‖² = 1 in all settings. The noise ε_k is also assumed to be of the form (I.5), but now with E‖ε_i‖² = σ² ∈ {0.25, 1, 2.25, 4}.

We test three operators, all of the form Ψ(x) = Σ_{i=1}^{35} Σ_{j=1}^{35} ψ_ij ⟨x, v_i⟩ v_j:
• Ψ_1: for 1 ≤ i, j ≤ 35 we set ψ_ii = 1 and ψ_ij = 0 when i ≠ j;
• Ψ_2: the coefficients ψ_ij are generated as i.i.d. standard normal random variables;
• Ψ_3: for 1 ≤ i, j ≤ 35 we set ψ_ij = 1/(ij).
We standardize the operators such that the operator norm equals one. The operators Ψ_2 are generated once and then kept fixed for the entire simulation. We generate samples of size n + 1 = 80 × 4^ℓ + 1, ℓ = 0, ..., 4. Estimation is based on the first n observations. We run 200 simulations for each setup
(Λ, Ψ, σ, n). As a performance measure for our procedure we use the mean squared error on the (n+1)-st observation,

MSE = (1/200) Σ_{k=1}^{200} ‖Ψ(X_{n+1}^{(k)}) − Ψ̂(X_{n+1}^{(k)})‖²_{H2}.   (I.6)

Here X_i^{(k)} is the i-th observation of the k-th simulation run.
Now we compute the median truncation level K obtained from our data-driven procedure described in Theorem 2 with m_n = n^{1/2}/log n. We compare it to the median truncation level obtained by cross-validation (K_CV) on the same data. To this end, we divide the sample into training and test sets in proportion (n − n_test) : n_test, where n_test = max{n/10, 100}. The estimator is obtained from the training set for the different truncation levels k = 1, 2, ..., 35. Then, from the test set, we determine

K_CV = argmin_{k∈{1,...,35}} Σ_{ℓ=n−n_test}^{n} ‖Y_{ℓ+1} − Ψ̂_k(X_ℓ)‖²_{H2}.
The MSE and the sizes of K and K_CV are shown for different constellations in Table 1. We display the results only for σ = 1. Not surprisingly, the bigger the variance of the noise, the bigger the MSE; otherwise our findings were the same across all constellations of σ. The table shows that the
choice of K proposed by our method results in an MSE which is competitive with CV. We also see
that an optimal choice of K cannot be solely based on the decay of the eigenvalues as it is the case
in our approach. It clearly also depends on the unknown operator itself. Not surprisingly, the best
results are obtained under settings Λ1 (exponentially fast decay of eigenvalues) and Ψ3 (which is
the smoothest among the three operators).
4 Conclusion
Estimation of the regression operator in functional linear models has received much interest over recent years. Our objective in this paper was to show that one of the most widely applied estimators in this context remains consistent, even if several of the synthetic assumptions used in previous papers are removed. If our intention is prediction, we can further simplify the technical requirements. Our
approach comes with a data driven choice of the parameter which determines the dimension of the
estimator. While our main intention is to show that this choice leads to a consistent estimator,
we have seen in simulations that our method is performing remarkably well when compared to
cross-validation.
Table 1: Truncation levels obtained by Theorem 2 (K) and by cross-validation (K_CV), and the corresponding MSE. For each constellation we present med(K) over 200 runs.
By Lemmas 4, 5, 6, 7 and the assumption m_n^6 = o(n), we finally obtain for large enough n that

P(‖Ψ − Ψ̂_{K_n}‖_{L12} > ε)
  ≤ 4⁴ U m_n² / (ε⁴ n) + 4³ U ‖∆‖²_{S12} m_n⁴ / (ε² n) + 4² U (128 ‖∆‖²_{L12} + ε²/4) m_n⁶ / (ε² n) + P(‖Ψ − Ψ_{K_n}‖_{L12} > ε/4)
  → 0 as n → ∞.
5.2 Proof of Theorem 2
In order to simplify the notation we write K = K_n. This time, as a starting point, we take a representation of Ψ in the basis v_1, v_2, .... Let M_m = sp{v_1, ..., v_m} and M̂_m = sp{v̂_1, ..., v̂_m}, where sp{x_i : i ∈ I} denotes the closed span of the elements x_i, i ∈ I. If rank(Ĉ) = ℓ, then the v̂_i, i > ℓ, can be any ONB of M̂_ℓ^⊥. We write P_A for the projection operator which maps onto a closed linear space A. As usual, A^⊥ denotes the orthogonal complement of A. Since for any m ≥ 1 we can write x = P_{M̂_m}(x) + P_{M̂_m^⊥}(x), the linearity of Ψ and of the projection operator gives

Ψ(x) = Ψ(P_{M̂_m}(x)) + Ψ(P_{M̂_m^⊥}(x)) = Σ_{j=1}^m ⟨v̂_j, x⟩_{H1} Ψ(v̂_j) + Ψ(P_{M̂_m^⊥}(x)).
Now we evaluate Ψ at some v̂_j which is not in the kernel of Ĉ. By the definition of Ĉ and again by the linearity of the involved operators,

Ψ(v̂_j) = (1/λ̂_j) Ψ(Ĉ(v̂_j))
        = (1/λ̂_j) (1/n) Σ_{i=1}^n ⟨X_i, v̂_j⟩_{H1} Ψ(X_i)
        = (1/λ̂_j) (1/n) Σ_{i=1}^n ⟨X_i, v̂_j⟩_{H1} (Y_i − ε_i)
        = (1/λ̂_j) (∆̂(v̂_j) + Λ(v̂_j)),

where Λ = −(1/n) Σ_{i=1}^n X_i ⊗ ε_i. Hence, if m is such that λ̂_m > 0 (which will from now on be implicitly assumed), Ψ can be expressed as

Ψ(x) = Σ_{j=1}^m ⟨v̂_j, x⟩_{H1} (1/λ̂_j) ∆̂(v̂_j) + Σ_{j=1}^m ⟨v̂_j, x⟩_{H1} (1/λ̂_j) Λ(v̂_j) + Ψ(P_{M̂_m^⊥}(x)).
Note that the first term on the right-hand side is just Ψ̂_m(x). Therefore, for any x, the distance between Ψ(x) and Ψ̂_m(x) takes the form

‖Ψ(x) − Ψ̂_m(x)‖_{H2} = ‖ Σ_{j=1}^m ⟨v̂_j, x⟩_{H1} (1/λ̂_j) Λ(v̂_j) + Ψ(P_{M̂_m^⊥}(x)) ‖_{H2}.   (I.15)
To assess (I.15) we need the following four lemmas.
Lemma 9. Let (λ_i, v_i)_{i≥1} and (λ̂_i, v̂_i)_{i≥1} be the eigenvalues and eigenfunctions of C and Ĉ, respectively. Let j, m ∈ N be such that j ≤ m ≤ n. Then

‖v_j − P_{M̂_m}(v_j)‖²_{H1} ≤ 4 ‖Ĉ − C‖²_{L11} / (λ̂_{m+1} − λ̂_j)².

Proof. Note that by using Parseval's identity we get

‖v_j − P_{M̂_m}(v_j)‖²_{H1} = Σ_{k≥1} ⟨v_j − P_{M̂_m}(v_j), v̂_k⟩²_{H1} = Σ_{k>m} ⟨v_j, v̂_k⟩²_{H1}.

Now

(λ̂_{m+1} − λ̂_j)² Σ_{k>m} ⟨v_j, v̂_k⟩²_{H1} ≤ Σ_{k>m} (λ̂_k ⟨v_j, v̂_k⟩_{H1} − λ̂_j ⟨v_j, v̂_k⟩_{H1})² = Σ_{k>m} (⟨v_j, Ĉ(v̂_k)⟩_{H1} − λ̂_j ⟨v_j, v̂_k⟩_{H1})².

Since Ĉ is a self-adjoint operator, simple algebraic transformations yield

(λ̂_{m+1} − λ̂_j)² Σ_{k>m} ⟨v_j, v̂_k⟩²_{H1} ≤ Σ_{k>m} (⟨Ĉ(v_j), v̂_k⟩_{H1} − λ̂_j ⟨v_j, v̂_k⟩_{H1})²
  = Σ_{k>m} (⟨(Ĉ − C)(v_j), v̂_k⟩_{H1} − (λ̂_j − λ_j) ⟨v_j, v̂_k⟩_{H1})²
  ≤ 2 Σ_{k>m} |⟨(Ĉ − C)(v_j), v̂_k⟩_{H1}|² + 2 Σ_{k>m} ((λ̂_j − λ_j) ⟨v_j, v̂_k⟩_{H1})².

By Parseval's identity and Lemma 3,

(λ̂_{m+1} − λ̂_j)² Σ_{k>m} ⟨v_j, v̂_k⟩²_{H1} ≤ 2 ‖(Ĉ − C)(v_j)‖²_{H1} + 2 |λ̂_j − λ_j|² ≤ 4 ‖Ĉ − C‖²_{L11}.
Lemma 10. Let Ψ̂ be defined as in Lemma 2 and let K = K_n → ∞ in probability. Then ‖P_{M̂_K^⊥}(X_n)‖_{H1} → 0 in probability.

Proof. Here and in the sequel we write X = X_n. We first remark that for any ε > 0,

P(‖P_{M̂_K^⊥}(X)‖²_{H1} > ε) = P( Σ_{i>K} |⟨v̂_i, X⟩_{H1}|² > ε ).

Since Σ_{i≥1} |⟨v̂_i, X⟩_{H1}|² = ‖X‖²_{H1}, there exists a random variable J_ε such that Σ_{i≥J_ε} |⟨v̂_i, X⟩_{H1}|² < ε. Since by assumption E‖X‖²_{H1} < ∞, we conclude that J_ε is bounded in probability. Hence we obtain

P(‖P_{M̂_K^⊥}(X)‖²_{H1} > ε) ≤ P( {Σ_{i>K} |⟨v̂_i, X⟩_{H1}|² > ε} ∩ {K > J_ε} ) + P(K ≤ J_ε) = P(K ≤ J_ε),

where the last term converges to zero as n → ∞.
Lemma 11. Let L_n = argmax{r ≤ K : Σ_{i=1}^r (λ̂_{K+1} − λ̂_i)^{−2} ≤ ξ_n}, where K = K_n is given as in Theorem 2 and ξ_n → ∞. Then L_n → ∞ in probability.

Proof. Let r ∈ N be such that for all 1 ≤ i ≤ r we have λ_{r+1} ≠ λ_i. Note that E‖X‖²_{H1} < ∞ implies λ_i → 0, and since λ_i > 0 we can find infinitely many r satisfying this condition. We choose such an r and obtain

P(L_n < r) ≤ P( {Σ_{i=1}^r (λ̂_{K+1} − λ̂_i)^{−2} > ξ_n} ∩ {K ≥ r} ) + P(K < r).

Lemma 8 implies that P(K < r) → 0. The first term is bounded by P( Σ_{i=1}^r (λ̂_{r+1} − λ̂_i)^{−2} > ξ_n ). Since λ̂_i → λ_i in probability and r is fixed while ξ_n → ∞, it follows that P(L_n < r) → 0 as n → ∞. Since r can be chosen arbitrarily large, the proof is finished.
Lemma 12. Let Ψ̂ be defined as in Lemma 2. Then ‖P_{M_K}(X) − P_{M̂_K}(X)‖_{H1} → 0 in probability.

Proof. With L as in Lemma 11, define the two variables X^{(1)} = Σ_{i=1}^L ⟨X, v_i⟩_{H1} v_i and X^{(2)} = Σ_{i>L} ⟨X, v_i⟩_{H1} v_i. Again, for simplifying the notation, we write L instead of L_n. Since X = X^{(1)} + X^{(2)}, we derive

‖P_{M_K}(X) − P_{M̂_K}(X)‖_{H1} ≤ ‖P_{M_K}(X^{(1)}) − P_{M̂_K}(X^{(1)})‖_{H1} + ‖P_{M_K}(X^{(2)})‖_{H1} + ‖P_{M̂_K}(X^{(2)})‖_{H1}.   (I.16)

The last two terms are bounded by 2‖X^{(2)}‖_{H1}. For the first summand in (I.16) we get

‖P_{M_K}(X^{(1)}) − P_{M̂_K}(X^{(1)})‖_{H1} = ‖ Σ_{i=1}^L ⟨X, v_i⟩_{H1} (v_i − P_{M̂_K}(v_i)) ‖_{H1}.
Let us choose ξ_n = o(n) in Lemma 11. The triangle inequality, the Cauchy–Schwarz inequality, Lemma 9 and the definition of L entail

‖P_{M_K}(X^{(1)}) − P_{M̂_K}(X^{(1)})‖_{H1}
  ≤ Σ_{i=1}^L |⟨X, v_i⟩_{H1}| ‖v_i − P_{M̂_K}(v_i)‖_{H1}
  ≤ ( Σ_{i=1}^L |⟨X, v_i⟩_{H1}|² )^{1/2} ( Σ_{i=1}^L ‖v_i − P_{M̂_K}(v_i)‖²_{H1} )^{1/2}
  ≤ ‖X‖_{H1} ( Σ_{i=1}^L ‖v_i − P_{M̂_K}(v_i)‖²_{H1} )^{1/2}
  ≤ 2 ‖X‖_{H1} ‖Ĉ − C‖_{L11} ( Σ_{i=1}^L 1/(λ̂_{K+1} − λ̂_i)² )^{1/2}
  ≤ 2 ‖X‖_{H1} ‖Ĉ − C‖_{L11} √ξ_n.

This implies the inequality

‖P_{M_K}(X) − P_{M̂_K}(X)‖_{H1} ≤ 2 ‖X‖_{H1} ‖Ĉ − C‖_{L11} √ξ_n + 2 ‖X^{(2)}‖_{H1}.   (I.17)

Hence, by Lemma 1, we have 2‖X‖_{H1} ‖Ĉ − C‖_{L11} √ξ_n = o_P(1). Furthermore, ‖X^{(2)}‖_{H1} = ( Σ_{j>L} |⟨X, v_j⟩|² )^{1/2} → 0 in probability; this follows as in the proof of Lemma 10.
Lemma 13. Let Ψ̂ be defined as in Lemma 2. Then ‖Ψ(P_{M̂_K^⊥}(X))‖_{H2} → 0 in probability.

Proof. Some simple manipulations show

‖Ψ(P_{M̂_K^⊥}(X))‖_{H2} = ‖Ψ(X − P_{M̂_K}(X))‖_{H2}
  = ‖Ψ(P_{M_K}(X) + P_{M_K^⊥}(X) − P_{M̂_K}(X))‖_{H2}
  ≤ ‖Ψ(P_{M_K}(X)) − Ψ(P_{M̂_K}(X))‖_{H2} + ‖Ψ(P_{M_K^⊥}(X))‖_{H2}
  ≤ ‖Ψ‖_{L12} ( ‖P_{M_K}(X) − P_{M̂_K}(X)‖_{H1} + ‖P_{M_K^⊥}(X)‖_{H1} ).

Direct applications of Lemma 10 and Lemma 12 finish the proof.
Proof of Theorem 2. Set

Θ_n(x) = Σ_{j=1}^{K_n} (Λ(v̂_j)/λ̂_j) ⟨v̂_j, x⟩_{H1}.

By the representation (I.15) and the triangle inequality,

‖Ψ(X) − Ψ̂(X)‖_{H2} ≤ ‖Θ_n(X)‖_{H2} + ‖Ψ(P_{M̂_{K_n}^⊥}(X))‖_{H2}.

Lemma 13 shows that the second term tends to zero in probability.

If in Lemma 1 we define Ψ ≡ 0, then Λ plays the role of ∆̂ and, by the independence of ε_k and X_k, its population counterpart ∆ equals 0. By the arguments of Lemma 5 we infer P(‖Θ_n‖_{L12} > ε) ≤ U m_n² / (ε² n), which implies that ‖Θ_n(X)‖_{H2} → 0 in probability.
6 Acknowledgement
This research was supported by the Communauté française de Belgique (Actions de Recherche Concertées, 2010–2015) and by the Interuniversity Attraction Poles Programme (IAP network P7/06) of the Belgian Science Policy Office.
Bibliography
[1] Bosq, D. (1991). Modelization, nonparametric estimation and prediction for continuous time
processes. In Nonparametric functional estimation and related topics. NATO Adv. Sci. Inst.
Ser. C Math. Phys. Sci., 335, 509–529, Kluwer Acad. Publ.
[2] Bosq, D. (2000). Linear Processes in Function Spaces., Springer, New York.
[3] Cai, T. & Hall, P. (2006). Prediction in functional linear regression. Ann. Statist. 34, 2159–2179.
[4] Cai, T. & Zhou, H. (2008). Adaptive functional linear regression. Technical report.
[5] Cardot, H., Ferraty, F. & Sarda, P. (1999). Functional linear model. Statist. Probab. Lett. 45,
11–22.
[6] Cardot, H., Ferraty, F. & Sarda, P. (2003). Spline estimators for the functional linear model.
Statist. Sinica 13, 571–591.
[7] Cardot, H. & Johannes, J. (2010). Thresholding projection estimators in functional linear models.
Some further basic estimates show that

[ E‖X_{r+h} ⊗ X_r − X_{r+h}^{(r)} ⊗ X_r^{(r−h)}‖²_S ]^{1/2} ≤ √2 ν₄(X_0) [ ν₄(X_0 − X_0^{(r−h)}) + ν₄(X_0 − X_0^{(r)}) ].

Similar estimates can be obtained when r < 0, and the result follows from (2.6).

It is convenient to introduce the following remainder terms:

τ^X(h) = Σ_{|k|≥h} ‖C_k^X‖_S;   τ^{YX}(h) = Σ_{|k|≥h} ‖C_k^{YX}‖_S;   τ^b(h) = Σ_{|k|≥h} ‖b_k‖_S.
Proof of Lemma 1. By repeated application of the triangle inequality, we obtain

sup_{θ∈[−π,π]} ‖F̂_θ^X − F_θ^X‖_L ≤ (2π)^{−1} { Σ_{|h|≤q} ‖Ĉ_h^X − C_h^X‖_L + Σ_{|h|≤q} |1 − ω_q(h)| ‖C_h^X‖_L + τ^X(q) }.

Since, by Lemma 5, Σ_{|h|≤q} E‖Ĉ_h^X − C_h^X‖_L = O(q n^{−1/2}), the first term tends to zero. The term Σ_{|h|≤q} |1 − ω_q(h)| ‖C_h^X‖_L tends to zero by (2.3), Assumption 4 and dominated convergence. Again by (2.3), it follows that τ^X(q) → 0. For example, one may then choose

ψ_n^X = { q n^{−1/2} + Σ_{|h|≤q} |1 − ω_q(h)| ‖C_h^X‖_L + τ^X(q) }^{1−γ},   γ ∈ (0, 1).

The same arguments apply to the spectral cross-density operators.
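A lag-window estimator of this form is straightforward to implement for curves represented by coefficient vectors. The sketch below uses Bartlett weights ω_q(h) = 1 − |h|/q as an illustrative weight choice; the coordinate representation is an assumption for illustration, not the authors' implementation:

```python
import numpy as np

def spectral_density_op(X, theta, q):
    """Lag-window estimate of the spectral density operator at frequency theta:
    F_theta = (2*pi)^{-1} * sum_{|h|<=q} w_q(h) * C_h * exp(-i*h*theta),
    with Bartlett weights w_q(h) = 1 - |h|/q and empirical autocovariances
    C_h = (1/n) sum_t X_{t+h} (x) X_t.  Rows of X are coefficient vectors;
    the result is a complex d x d matrix."""
    n, d = X.shape
    Xc = X - X.mean(axis=0)
    F = np.zeros((d, d), dtype=complex)
    for h in range(-q, q + 1):
        if h >= 0:
            Ch = Xc[h:].T @ Xc[:n - h] / n
        else:
            Ch = (Xc[-h:].T @ Xc[:n + h] / n).T   # C_{-h} = C_h^T
        F += (1 - abs(h) / q) * Ch * np.exp(-1j * h * theta)
    return F / (2 * np.pi)
```

Because the negative-lag autocovariances are the transposes of the positive-lag ones, the returned matrix is Hermitian, as a spectral density operator must be.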
Proof of Theorem 1. Since

max_{h∈Z} ‖b̂_h − b_h‖_S ≤ (2π)^{−1} ∫_{−π}^{π} ‖B̂_θ − B_θ‖_S dθ,

we focus on the estimation of the frequency response operator B_θ. Define

B̃_θ = B̃_θ(K) = Σ_{m≤K} F_θ^{YX}( λ_m^{−1}(θ) φ_m(θ) ⊗ φ_m(θ) ).

Then

(1/2) ‖B̂_θ − B_θ‖²_S ≤ ‖B̂_θ − B̃_θ‖²_S + ‖B̃_θ − B_θ‖²_S.

Since, by (3.3),

Σ_{ℓ≥1} λ_ℓ^{−2}(θ) ‖F_θ^{YX}(φ_ℓ(θ))‖² = ‖B_θ‖²_S ≤ ( Σ_{k∈Z} ‖b_k‖_S )² < ∞,

we see that

‖B̃_θ − B_θ‖²_S = Σ_{ℓ>K} λ_ℓ^{−2}(θ) ‖F_θ^{YX}(φ_ℓ(θ))‖² → 0, K → ∞.
Thus, it remains to prove that

(2π)^{−1} ∫_{−π}^{π} ‖B̂_θ − B̃_θ‖_S dθ → 0 in probability,   (6.1)

and that K → ∞ in probability. Condition (6.1) can be replaced by

∫_{−π}^{π} ‖B̂_θ − B̃_θ‖_S dθ × I_{A_n} → 0 in probability,   (6.2)

where A_n ⊂ A is defined as

A_n := { sup_θ ‖F̂_θ^X − F_θ^X‖ ≤ ψ_n^X } ∩ { sup_θ ‖F̂_θ^{YX} − F_θ^{YX}‖ ≤ ψ_n^{YX} }.
This is because by Lemma 1 we have that P (An)→ 1.
We have

B̂_θ − B̃_θ = Σ_{m=1}^K [ F̂_θ^{YX}( λ̂_m^{−1}(θ) φ̂_m(θ) ⊗ φ̂_m(θ) ) − F_θ^{YX}( λ_m^{−1}(θ) φ_m(θ) ⊗ φ_m(θ) ) ]
  = Σ_{m=1}^K F_θ^{YX}( λ̂_m^{−1}(θ) φ̂_m(θ) ⊗ φ̂_m(θ) − λ_m^{−1}(θ) φ_m(θ) ⊗ φ_m(θ) )
  + Σ_{m=1}^K (F̂_θ^{YX} − F_θ^{YX})( λ̂_m^{−1}(θ) φ̂_m(θ) ⊗ φ̂_m(θ) ).

Thus, using ‖F ∘ G‖_S ≤ ‖F‖_L ‖G‖_S, we get

‖B̂_θ − B̃_θ‖_S ≤ ‖F_θ^{YX}‖_L ‖ Σ_{m=1}^K ( λ̂_m^{−1}(θ) φ̂_m(θ) ⊗ φ̂_m(θ) − λ_m^{−1}(θ) φ_m(θ) ⊗ φ_m(θ) ) ‖_S
  + ‖F̂_θ^{YX} − F_θ^{YX}‖_L ( Σ_{m=1}^K λ̂_m^{−2}(θ) )^{1/2}.
Since sup_{θ∈[−π,π]} ‖F_θ^{YX}‖_L ≤ π^{−1} τ^{YX}(0), (6.2) follows from

∫_{−π}^{π} ‖ Σ_{m=1}^K ( λ̂_m^{−1}(θ) φ̂_m(θ) ⊗ φ̂_m(θ) − λ_m^{−1}(θ) φ_m(θ) ⊗ φ_m(θ) ) ‖_S dθ × I_{A_n} = o_P(1)   (6.3)

and

ψ_n^{YX} ∫_{−π}^{π} W_λ^K(θ) dθ = O_P(1).   (6.4)

Relation (6.4) is already immediate from the condition K ≤ K^{(2)}.
Some routine estimates show that the integrand in (6.3) is bounded by

2 { Σ_{m=1}^K λ_m^{−1}(θ) ‖φ̂_m(θ) − ĉ_m(θ) φ_m(θ)‖ + Σ_{m=1}^K |λ_m(θ) − λ̂_m(θ)| / (λ_m(θ) λ̂_m(θ)) } × I_{A_n},   (6.5)

where ĉ_m(θ) is given as in Section 3. By Lemma 3.2 in Hormann and Kokoszka [13] we have that

‖φ̂_m(θ) − ĉ_m(θ) φ_m(θ)‖ ≤ (2√2 / α_m(θ)) sup_{θ∈[−π,π]} ‖F̂_θ^X − F_θ^X‖_L,

and

sup_{θ∈[−π,π]} sup_{m≥1} |λ̂_m(θ) − λ_m(θ)| ≤ sup_{θ∈[−π,π]} ‖F̂_θ^X − F_θ^X‖_L.   (6.6)
Thus we obtain for (6.5) the bound

4√2 Σ_{m=1}^K (ψ_n^X / λ_m(θ)) [ 1/α_m(θ) + 1/λ̂_m(θ) ] × I_{A_n}.   (6.7)

We further remark that on A_n we have λ̂_m(θ) ≥ λ_m(θ) − |λ_m(θ) − λ̂_m(θ)| ≥ λ_m(θ) − ψ_n^X. Therefore, since K ≤ K^{(1)}, we have that (6.7) is bounded by

4√2 Σ_{m=1}^K (ψ_n^X / λ_m(θ)) [ 1/α_m(θ) + 2/λ_m(θ) ] ≤ 4√2 ψ_n^X ( W_λ^K(θ) W_α^K(θ) + 2 (W_λ^K(θ))² ),   (6.8)

where we have made use of the Cauchy–Schwarz inequality in the last step. Using K ≤ K^{(3)} and K ≤ K^{(4)} it is now easy to infer that (6.2) holds.
It remains to show that K → ∞ in probability, i.e. that K^{(i)} → ∞ for 1 ≤ i ≤ 4.

Fix a large k and observe that P(K^{(1)} ≥ k) = P(inf_θ λ̂_k(θ) ≥ 2ψ_n^X). Now define B_{k;n} := {sup_θ |λ̂_k(θ) − λ_k(θ)| ≤ δ_k/2}, where δ_k := inf_θ λ_k(θ). From Assumption 2 it follows that δ_k > 0. Furthermore, it follows from Lemma 1 and (6.6) that P(B_{k;n}) → 1 as n → ∞. On the other hand, inf_θ λ̂_k(θ) ≥ inf_θ λ_k(θ) − sup_θ |λ̂_k(θ) − λ_k(θ)|, so that on B_{k;n} we have inf_θ λ̂_k(θ) ≥ δ_k/2. Hence, for n large enough, we have inf_θ λ̂_k(θ) ≥ 2ψ_n^X on B_{k;n}. Consequently P(K^{(1)} ≥ k) → 1 as n → ∞, irrespective of how large k was chosen.

Now we prove K^{(4)} → ∞. Fix again a big k and notice that it suffices to show that P( ∫_{−π}^{π} (min_{1≤m≤k} α̂_m(θ))^{−2} dθ > x_n ) → 0 for any x_n → ∞. Define B′_{k;n} := {sup_θ |α̂_k(θ) − α_k(θ)| ≤ δ′_k/2}, where δ′_k := inf_θ α_k(θ), and set A_{k;n} = ∩_{m=1}^k B′_{m;n}. Then for any fixed k we have P(A_{k;n}) → 1, and on A_{k;n} it holds that min_{1≤m≤k} α̂_m(θ) ≥ min_{1≤m≤k} δ′_m/2 =: r_k. By Assumption 5, r_k > 0 for any k. Hence, on A_{k;n} the integral ∫_{−π}^{π} (min_{1≤m≤k} α̂_m(θ))^{−2} dθ is bounded by 2π/r_k², and this is smaller than x_n when n is big enough. This proves K^{(4)} → ∞.

The proofs of K^{(2)} → ∞ and K^{(3)} → ∞ are similar and therefore omitted.
1 Appendix
In this appendix we derive the FPE method for selecting the dimension parameter K used in Sections 3 and 4. In Section 1.1, we discuss the relation of our spectral approach to time-domain estimation in functional regression. This motivates the derivation of the FPE method in Section 1.2. Section 1.3 contains the proofs of two results stated in Sections 1.1 and 1.2.
1.1 Relation to ordinary functional regression
As before, we consider complex Hilbert spaces H and H′ and define, for elements (a, f), (b, g) ∈ H′ × H, [(a, f), (b, g)] = ⟨a, b⟩ + ⟨f, g⟩. This defines an inner product on H′ × H, and with it the space becomes a Hilbert space. Let us fix a frequency θ ∈ [−π, π] and define a zero-mean complex random element ∆ = (Υ, Ξ) ∈ L²_{H′×H} such that

C^∆ = E∆ ⊗ ∆ = ( C^Υ, C^{ΥΞ} ; C^{ΞΥ}, C^Ξ ) = ( F_θ^Y, F_θ^{YX} ; F_θ^{XY}, F_θ^X ),   (1.1)

where the 2 × 2 blocks are written row by row.
Now we regress Υ on Ξ, i.e. we seek the h_0 ∈ L(H, H′) (the space of bounded linear operators from H to H′) which satisfies

h_0 = argmin_{h∈L(H,H′)} E‖Υ − h(Ξ)‖².

Then, by the usual projection arguments, h_0 solves the equation C^{ΥΞ} = h_0 ∘ C^Ξ. By the definition of C^{ΥΞ} and C^Ξ, it follows that h_0 is also the solution to (3.1) and hence, by Assumption 2, is equal to B_θ. Consequently h_0, or equivalently B_θ, can also be estimated from a random sample ((Υ_k, Ξ_k) : 1 ≤ k ≤ L) by standard methods known from functional linear models. A typical
estimator (see e.g. Cardot et al. [6]) is

ĥ_{0;d}(f) = Σ_{ℓ=1}^d ( Ĉ^{ΥΞ}(v̂_ℓ) / γ̂_ℓ ) ⟨f, v̂_ℓ⟩ =: Σ_{ℓ=1}^d b̂_ℓ ⟨f, v̂_ℓ⟩,   (1.2)

where Ĉ^{ΥΞ}(f) := L^{−1} Σ_{k=1}^L Υ_k ⟨f, Ξ_k⟩, and γ̂_ℓ and v̂_ℓ are the eigenvalues and eigenvectors of Ĉ^Ξ(f) := L^{−1} Σ_{k=1}^L Ξ_k ⟨f, Ξ_k⟩.
In practice we do not know C^∆ but, as we will see in Lemma 1 below, it can be consistently estimated from the data, which in turn allows us to generate a random sample ((Υ_i, Ξ_i) : 1 ≤ i ≤ L) with a covariance which is asymptotically equal to C^∆. A more direct approach is to define the functional discrete Fourier transforms

Υ_{k|p} = (2πp)^{−1/2} Σ_{t=p(k−1)+1}^{pk} Y_t e^{−i(t−p(k−1))θ}   and   Ξ_{k|p} = (2πp)^{−1/2} Σ_{t=p(k−1)+1}^{pk} X_t e^{−i(t−p(k−1))θ}.

If we denote by C_p^{ΥΞ} and C_p^Ξ the cross-covariance and covariance operators related to the sequence ((Υ_{k|p}, Ξ_{k|p}) : 1 ≤ k ≤ L), the following lemma holds:
Lemma 6. Consider the estimator F̂_{θ|p}^X with the Bartlett weights w_p(h) = 1 − |h|/p. Under Assumption 3 we have ‖F̂_{θ|p}^X − C_p^Ξ‖²_S = O_P(p³/n). Under the same conditions we have ‖F̂_{θ|p}^{YX} − C_p^{ΥΞ}‖²_S = O_P(p³/n).

The lemma, which we prove in Section 1.3, confirms that computing (1.2) from the variables (Υ_{k|p}, Ξ_{k|p}), which serve as an approximation to a random sample (Υ_k, Ξ_k), yields an estimator which closely resembles B̂_{θ|p,p,d} in (3.4).
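In coordinates, the functional discrete Fourier transforms of the blocks take only a few lines of numpy. A sketch, with curves as rows of coefficient vectors (the sanity check against EΞΞ* = C_0/(2π) for i.i.d. data in the usage below is an illustration, not a statement from the text):

```python
import numpy as np

def dft_blocks(X, theta, p):
    """Split the first L*p observations into L blocks of length p and return
    Xi_{k|p} = (2*pi*p)^{-1/2} * sum_{t=1}^{p} X_{t+(k-1)p} * exp(-i*t*theta),
    one row per block.  Rows of X are the coefficient vectors of the curves;
    note that t - p(k-1) runs over 1, ..., p within each block."""
    n, d = X.shape
    L = n // p
    phase = np.exp(-1j * theta * np.arange(1, p + 1))
    blocks = X[:L * p].reshape(L, p, d)
    return np.einsum('t,ltd->ld', phase, blocks) / np.sqrt(2 * np.pi * p)
```

For i.i.d. curves the blocks are independent, and the sample second moment of the transformed blocks approximates the spectral density operator at θ, which for white noise is C_0/(2π) at every frequency.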
1.2 Description of the FPE approach
In order to keep this discussion short, we only consider the scalar response case. This is in line
with our simulation study. Our starting point is the alternative interpretation of Bθ discussed in
Section 1.1. Suppose we have an estimator h0;d for Bθ from a sample ((Υk,Ξk) : 1 ≤ k ≤ L). Now
we pick (Υ, Ξ) independent of this sample and set K = K_θ = argmin_{d≥0} E|Υ − ĥ_{0;d}(Ξ)|². Note that here, by the Riesz representation theorem, ĥ_{0;d}(Ξ) is of the form ⟨Ξ, ĥ_{0;d}⟩. With d = K in (1.2) we minimize the mean squared prediction error in this functional regression. The related model selection criterion is commonly known as the final prediction error (FPE) criterion. Of course, computing K explicitly is mathematically infeasible, and therefore we resort to an approximation. For this purpose, we first note that the coefficients b̂_k in (1.2) satisfy
b̂_d := (b̂_1, ..., b̂_d)′ = argmin_{(b_1,...,b_d)∈C^d} Σ_{i=1}^L |Υ_i − Σ_{ℓ=1}^d b_ℓ ⟨Ξ_i, v̂_ℓ⟩|².

Our problem is greatly simplified if we replace the empirical principal component scores by the population ones and set

b̃_d := (b̃_1, ..., b̃_d)′ = argmin_{(b_1,...,b_d)∈C^d} Σ_{i=1}^L |Υ_i − Σ_{ℓ=1}^d b_ℓ ⟨Ξ_i, v_ℓ⟩|²,

and then define h̃_{0;d}(Ξ) = Σ_{ℓ=1}^d b̃_ℓ ⟨Ξ, v_ℓ⟩ and K̃ = argmin_{d≥0} E|Υ − h̃_{0;d}(Ξ)|².
Proposition 1. Suppose that ((Υ_i, Ξ_i) : 1 ≤ i ≤ L) constitute a Gaussian random sample with circularly-symmetric observations, i.e. E∆[∆, (a, f)] = 0 for any (a, f) ∈ H′ × H. Then for L > d we have

E|Υ − h̃_{0;d}(Ξ)|² = σ_d² × L/(L − d),

where σ_d² = (L − d)^{−1} E(Υ − X b̃_d)*(Υ − X b̃_d), with X = (⟨Ξ_i, v_ℓ⟩ : 1 ≤ i ≤ L; 1 ≤ ℓ ≤ d) and Υ = (Υ_1, ..., Υ_L)′.
The proof of this proposition is given in Section 1.3. Assuming Gaussianity is not a restriction, since our estimator only relies on the second order structure of the data. Furthermore, by Panaretos and Tavakoli [24] we know that, under general dependence assumptions, the discrete Fourier transforms Υ_{i|p} and Ξ_{i|p} are asymptotically (p → ∞) complex normal random elements.
The proposition then suggests choosing d such that σ_d² × L/(L − d) is minimized. An unbiased estimate for the unknown σ_d² is

(L − d)^{−1} (Υ − X b̃_d)*(Υ − X b̃_d).

Finally, replacing the theoretical scores by the empirical ones leads to the following dimension selection:

K̂ = argmin_{0≤d<L} L/(L − d)² (Υ − X̂ b̂_d)*(Υ − X̂ b̂_d),   (1.3)

where X̂ = (⟨Ξ_{i|p}, v̂_ℓ⟩ : 1 ≤ i ≤ L; 1 ≤ ℓ ≤ d) and Υ = (Υ_{1|p}, ..., Υ_{L|p})′.
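For a scalar response, the selection rule (1.3) is a one-dimensional search over d. A sketch on generic score matrices (real-valued data is used in the test for simplicity, although `np.linalg.lstsq` handles the complex scores of the text verbatim; the data-generating choices are illustrative):

```python
import numpy as np

def fpe_dimension(scores, upsilon, d_max):
    """FPE selection (1.3) for a scalar response:
    K = argmin_{0 <= d <= d_max} L/(L-d)^2 * ||Upsilon - X_d b_d||^2,
    where X_d holds the first d principal-component scores and b_d is the
    least-squares coefficient vector on those scores."""
    L = len(upsilon)
    rss0 = np.sum(np.abs(upsilon) ** 2)        # d = 0: predict by zero
    crit = [L / L ** 2 * rss0]
    for d in range(1, d_max + 1):
        Xd = scores[:, :d]
        b, *_ = np.linalg.lstsq(Xd, upsilon, rcond=None)
        rss = np.sum(np.abs(upsilon - Xd @ b) ** 2)
        crit.append(L / (L - d) ** 2 * rss)
    return int(np.argmin(crit))
```

The residual sum of squares decreases with d, while the factor L/(L − d)² penalizes additional components, so the criterion trades approximation error against estimation variance.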
1.3 Proofs of Lemma 6 and Proposition 1
Proof of Lemma 6. We define the block autocovariances

C̃_h^X = (Lp)^{−1} Σ_{k=0}^{L−1} Σ_{t=1}^{p−h} X_{t+h+kp} ⊗ X_{t+kp},  for 0 ≤ h < p,

and

C̃_h^X = (Lp)^{−1} Σ_{k=0}^{L−1} Σ_{t=|h|+1}^{p} X_{t−|h|+kp} ⊗ X_{t+kp},  for −p < h < 0

(the tilde distinguishes these block-based quantities from the full-sample autocovariances Ĉ_h^X). Direct verification shows that

C_p^Ξ = (2π)^{−1} Σ_{|h|<p} C̃_h^X e^{−ihθ}.

For two random operators A_n and B_n we write A_n = B_n + O_P(m_n) if ‖A_n − B_n‖_S = O_P(m_n). Then, for p > h ≥ 0, we deduce with the help of Lemma 5 that

n Ĉ_h^X − Lp C̃_h^X = Σ_{k=0}^{L−1} Σ_{t=p−h+1}^{p} X_{t+h+kp} ⊗ X_{t+kp} + Σ_{t=Lp+1}^{n−h} X_{t+h} ⊗ X_t
  = Lh (Ĉ_h^X + O_P(L^{−1/2})) = Lh (C_h^X + O_P(L^{−1/2})).

The same bound can be derived for h < 0. Thus,

C̃_h^X = (1 − |h|/p) Ĉ_h^X + (n/(Lp) − 1) Ĉ_h^X + O_P(L^{−1/2}),

and since n/(Lp) − 1 ≤ p/(n − p), we have

C̃_h^X = (1 − |h|/p) Ĉ_h^X + O_P((p/n)^{1/2}).

We conclude that ‖C_p^Ξ − F̂_{θ|p}^X‖²_S = O_P(p³/n). A similar bound can be obtained for C_p^{ΥΞ} − F̂_{θ|p}^{YX}. This proves Lemma 6.
Proof of Proposition 1. We have

E|Υ − h̃_{0;d}(Ξ)|² = E|Υ − Σ_{ℓ=1}^d b̃_ℓ ⟨Ξ, v_ℓ⟩|² = E|Σ_{ℓ=1}^d (b_ℓ − b̃_ℓ) ⟨Ξ, v_ℓ⟩ + Z|²,   (1.4)

where Z = (Υ − ⟨Ξ, h_0⟩) + Σ_{ℓ>d} b_ℓ ⟨Ξ, v_ℓ⟩. We set ε = Υ − ⟨Ξ, h_0⟩. By the projection theorem it follows that Cov(ε, Ξ) = 0. Furthermore, since the b̃_ℓ are independent of Ξ, and since principal component scores are orthogonal, it follows that (1.4) equals

E|Σ_{ℓ=1}^d (b_ℓ − b̃_ℓ) ⟨Ξ, v_ℓ⟩|² + E|Z|² = Σ_{ℓ=1}^d E|b_ℓ − b̃_ℓ|² γ_ℓ + E|Z|².
With Γ = diag(γ_1, ..., γ_d), Z = (Z_1, ..., Z_L)′ and Z_i = ε_i + Σ_{ℓ>d} b_ℓ ⟨Ξ_i, v_ℓ⟩, we get

Σ_{ℓ=1}^d E|b_ℓ − b̃_ℓ|² γ_ℓ = E[ (b̃_d − b_d)* Γ (b̃_d − b_d) ]
  = E[ Z* X (X*X)^{−1} Γ (X*X)^{−1} X* Z ]
  = tr( Γ E[ (X*X)^{−1} X* Z Z* X (X*X)^{−1} ] ).   (1.5)

We have E[Z Z*] = E|Z|² I_L. The imposed circular symmetry implies that

E Υ⟨Ξ, f⟩ = 0 and E⟨Ξ, f⟩⟨Ξ, g⟩ = 0 for all f, g ∈ H.   (1.6)

Consequently, by Gaussianity it follows that Z and X are independent. (Note that two complex Gaussian random variables U_1 and U_2, say, are independent if and only if Cov(U_1, U_2) = Cov(U_1, Ū_2) = 0.) We can therefore conclude by a simple conditioning argument that (1.5) simplifies to

E|Z|² tr( E[ ((XΓ^{−1/2})* (XΓ^{−1/2}))^{−1} ] ) =: E|Z|² tr(E W^{−1}).

The matrix W^{−1} is an inverse complex Wishart matrix with expectation E W^{−1} = I_d/(L − d). Thus E|Υ − h̃_{0;d}(Ξ)|² = E|Z|² × L/(L − d).
Bibliography
[1] A. Aue, S. Hormann, L. Horvath, and M. Reimherr. Break detection in the covariance structure
of multivariate time series models. The Annals of Statistics, 37:4046–4087, 2009.
[2] D. Bosq. Linear Processes in Function Spaces. Springer, 2000.
[3] G. E. P. Box, G. M. Jenkins, and G. C. Reinsel. Time Series Analysis: Forecasting and Control.
Prentice Hall, Englewood Cliffs, third edition, 1994.
[4] D. R. Brillinger. Time Series: Data Analysis and Theory. Holt, New York, 1975.
[5] T. Cai and P. Hall. Prediction in functional linear regression. The Annals of Statistics, 34:
2159–2179, 2006.
[6] H. Cardot, F. Ferraty, and P. Sarda. Functional linear model. Statistics and Probability Letters,
45:11–22, 1999.
[7] H. Cardot, F. Ferraty, A. Mas, and P. Sarda. Testing hypothesis in the functional linear model.
Scandinavian Journal of Statistics, 30:241–255, 2003.
[8] J-M. Chiou and H-G. Muller. Diagnostics for functional regression via residual processes.
Computational Statistics and Data Analysis, 15:4849–4863, 2007.
[9] F. Comte and J. Johannes. Adaptive functional linear regression. The Annals of Statistics, 40:
2765–2797, 2012.
[10] C. Crambes, A. Kneip, and P. Sarda. Smoothing splines estimators for functional linear regres-
sion. The Annals of Statistics, 37:35–72, 2009.
[11] R. Gabrys, L. Horvath, and P. Kokoszka. Tests for error correlation in the functional linear
model. Journal of the American Statistical Association, 105:1113–1125, 2010.
[12] S. Hormann and L. Kidzinski. A note on estimation in Hilbertian linear models. Scandinavian
Journal of Statistics, 2014. Forthcoming.
[13] S. Hormann and P. Kokoszka. Weakly dependent functional data. The Annals of Statistics, 38:
1845–1884, 2010.
[14] S. Hormann and P. Kokoszka. Functional time series. In C. R. Rao and T. Subba Rao, editors,
Time Series, volume 30 of Handbook of Statistics. Elsevier, 2012.
[15] S. Hormann, L. Horvath, and R. Reeder. A functional version of the ARCH model. Econometric
Theory, 29:267–288, 2013.
[16] S. Hormann, L. Kidzinski, and M. Hallin. Dynamic functional principal components. Journal
of the Royal Statistical Society: Series B, 2014. Forthcoming.
[17] L. Horvath and P. Kokoszka. Inference for Functional Data with Applications. Springer, 2012.
[18] G. M. James, J. Wang, and J. Zhu. Functional linear regression that’s interpretable. The
Annals of Statistics, 37:2083–2108, 2009.
[19] P. Kokoszka and M. Reimherr. Predictability of shapes of intraday price curves. The Econo-
metrics Journal, 16:285–308, 2013.
[20] A. N. Kolmogorov. Interpolation und Extrapolation von stationaren zufalligen Folgen. Bull.
Acad. Sci. U.S.S.R., 5:3–14, 1941.
[21] Y. Li and T. Hsing. On rates of convergence in functional linear regression. Journal of Multi-
variate Analysis, 98:1782–1804, 2007.
[22] I. McKeague and B. Sen. Fractals with point impacts in functional linear regression. The
Annals of Statistics, 38:2559–2586, 2010.
[23] H-G. Muller and U. Stadtmuller. Generalized functional linear models. The Annals of Statistics,
33:774–805, 2005.
[24] V. M. Panaretos and S. Tavakoli. Fourier analysis of stationary time series in function space.
The Annals of Statistics, 41:568–603, 2013.
[25] V. M. Panaretos and S. Tavakoli. Cramer–Karhunen–Loeve representation and harmonic prin-
cipal component analysis of functional time series. Stochastic Processes and their Applications,
123:2779–2807, 2013.
[26] M. B. Priestley. Spectral Analysis and Time Series. Academic Press, 1981.
[27] J. O. Ramsay and B. W. Silverman. Functional Data Analysis. Springer, 2005.
[28] X. Shao and W. B. Wu. Asymptotic spectral theory for nonlinear time series. The Annals of
Statistics, 35:1773–1801, 2007.
[29] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications with R Examples.
Springer, 2011.
[30] N. Wiener. The Extrapolation, Interpolation and Smoothing of Stationary Time Series with
Engineering Applications. Wiley, 1949.
[31] W. Wu. Nonlinear System Theory: Another Look at Dependence, volume 102. The National
Academy of Sciences of the United States, 2005.
[32] F. Yao, H-G. Muller, and J-L. Wang. Functional linear regression analysis for longitudinal data.
The Annals of Statistics, 33:2873–2903, 2005.
Appendix A
Dynamic Functional Principal Components
Dynamic Functional Principal Components∗
Siegfried Hormann1, Lukasz Kidzinski1, Marc Hallin2,3
1 Department of Mathematics, Universite libre de Bruxelles (ULB), CP210, Bd. du Triomphe, B-1050
3 ORFE, Princeton University, Sherrerd Hall, Princeton, NJ 08540, USA.
Abstract
In this paper, we address the problem of dimension reduction for time series of functional data (X_t : t ∈ Z). Such functional time series frequently arise, e.g., when a continuous-time process is segmented into
some smaller natural units, such as days. Then each Xt represents one intraday curve. We argue that
functional principal component analysis (FPCA), though a key technique in the field and a benchmark for
any competitor, does not provide an adequate dimension reduction in a time-series setting. FPCA indeed
is a static procedure which ignores the essential information provided by the serial dependence structure of
the functional data under study. Therefore, inspired by Brillinger’s theory of dynamic principal components,
we propose a dynamic version of FPCA, which is based on a frequency-domain approach. By means of a
simulation study and an empirical illustration, we show the considerable improvement the dynamic approach
entails when compared to the usual static procedure.
Keywords. Dimension reduction, frequency domain analysis, functional data analysis, functional
time series, functional spectral analysis, principal components, Karhunen-Loève expansion.
1 Introduction
The tremendous technical improvements in data collection and storage allow us to get an increasingly
complete picture of many common phenomena. In principle, most processes in real life are continuous
in time and, with improved data acquisition techniques, they can be recorded at arbitrarily high
frequency. To benefit from this increasing information, we need appropriate statistical tools that can
help extract the most important characteristics of some possibly high-dimensional specifications.
Functional data analysis (FDA), in recent years, has proven to be an appropriate tool in many
such cases and has consequently evolved into a very important field of research in the statistical
community.
Typically, functional data are considered as realizations of (smooth) random curves. Then every
observation X is a curve (X(u) : u ∈ U). One generally assumes, for simplicity, that U = [0, 1], but
U could be a more complex domain like a cube or the surface of a sphere. Since observations are
functions, we are dealing with high-dimensional – in fact intrinsically infinite-dimensional – objects.
So, not surprisingly, there is a demand for efficient data-reduction techniques. As such, functional
∗Manuscript has been accepted for publication in Journal of the Royal Statistical Society: Series B
principal component analysis (FPCA) has taken a leading role in FDA, and functional principal
components (FPC) arguably can be seen as the key technique in the field.
In analogy to classical multivariate PCA (see Jolliffe [22]), functional PCA relies on an eigen-
decomposition of the underlying covariance function. The mathematical foundations for this have
been laid several decades ago in the pioneering papers by Karhunen [23] and Loève [26], but it took
a while until the method was popularized in the statistical community. Some earlier contributions
are Besse and Ramsay [5], Ramsay and Dalzell [30] and, later, the influential books by Ramsay and
Silverman [31], [32] and Ferraty and Vieu [11]. Statisticians have been working on problems related
to estimation and inference (Kneip and Utikal [24], Benko et al. [3]), asymptotics (Dauxois et al. [10]
and Hall and Hosseini-Nasab [15]), smoothing techniques (Silverman [34]), sparse data (James et
al. [21], Hall et al. [16]), and robustness issues (Locantore et al. [25], Gervini [12]), to name just a
few. Important applications include FPC-based estimation of functional linear models (Cardot et
al. [9], Reiss and Ogden [33]) or forecasting (Hyndman and Ullah [20], Aue et al. [1]). The usefulness
of functional PCA has also been recognized in other scientific disciplines, like chemical engineering
(Gokulakrishnan et al. [14]) or functional magnetic resonance imaging (Aston and Kirch [2], Viviani
et al. [37]). Many more references can be found in the above cited papers and in Sections 8–10 of
Ramsay and Silverman [32], to which we refer for background reading.
Most existing concepts and methods in FDA, even though they may tolerate some amount of
serial dependence, have been developed for independent observations. This is a serious weakness, as
in numerous applications the functional data under study are obviously dependent, either in time or
in space. Examples include daily curves of financial transactions, daily patterns of geophysical and
environmental data, annual temperatures measured on the surface of the earth, etc. In such cases,
we should view the data as the realization of a functional time series (Xt(u) : t ∈ Z), where the time
parameter t is discrete and the parameter u is continuous. For example, in case of daily observations,
the curve Xt(u) may be viewed as the observation on day t with intraday time parameter u. A key
reference on functional time series techniques is Bosq [8], who studied functional versions of AR
processes. We also refer to Hörmann and Kokoszka [19] for a survey.
Ignoring serial dependence in this time-series context may result in misleading conclusions and
inefficient procedures. Hörmann and Kokoszka [18] investigate the robustness properties of some
classical FDA methods in the presence of serial dependence. Among others, they show that usual
FPCs still can be consistently estimated within a quite general dependence framework. Then the
basic problem, however, is not about consistently estimating traditional FPCs: the problem is that,
in a time-series context, traditional FPCs are not the adequate concept of dimension reduction
anymore – a fact which, since the seminal work of Brillinger [6], is well recognized in the usual
vector time-series setting. FPCA indeed operates in a static way: when applied to serially dependent
curves, it fails to take into account the potentially very valuable information carried by the past
values of the functional observations under study. In particular, a static FPC with small eigenvalue,
hence negligible instantaneous impact on Xt, may have a major impact on Xt+1, and high predictive
value.
Besides their failure to produce optimal dimension reduction, static FPCs, while cross-sectionally
uncorrelated at fixed time t, typically still exhibit lagged cross-correlations. Therefore the resulting
FPC scores cannot be analyzed componentwise as in the i.i.d. case, but need to be considered as
vector time series which are less easy to handle and interpret.
These major shortcomings are motivating the present development of dynamic functional prin-
cipal components (dynamic FPCs). The idea is to transform the functional time series into a vector
time series (of low dimension, ≤ 4, say), where the individual component processes are mutually
uncorrelated (at all leads and lags; autocorrelation is allowed, though), and account for most of
the dynamics and variability of the original process. The analysis of the functional time series can
then be performed on those dynamic FPCs; thanks to their mutual orthogonality, dynamic FPCs
moreover can be analyzed componentwise. In analogy to static FPCA, the curves can be optimally
reconstructed/approximated from the low-dimensional dynamic FPCs via a dynamic version of the
celebrated Karhunen-Loève expansion.
Dynamic principal components were first suggested by Brillinger [6] for vector time series.
The purpose of this article is to develop and study a similar approach in a functional setup. The
methodology relies on a frequency-domain analysis for functional data, a topic which is still in its
infancy (see, for instance, Panaretos and Tavakoli [27]).
The rest of the paper is organized as follows. In Section 2 we give a first illustration of the
procedure and sketch two typical applications. In Section 3, we describe our approach and state
a number of relevant propositions. We also provide some asymptotic features. In Section 4, we
discuss its computational implementation. After an illustration of the methodology by a real data
example on pollution curves in Section 5, we evaluate our approach in a simulation study (Section 6).
Appendices A and B detail the mathematical framework and contain the proofs. Some of the more
technical results and proofs are provided in a supplementary document.
After the present paper (which has been available on Arxiv since October 2012) was submitted,
another paper by Panaretos and Tavakoli [28] was published, where similar ideas are proposed. While
both papers aim at the same objective of a functional extension of Brillinger’s concept, there are
essential differences between the solutions developed. The main result in Panaretos and Tavakoli [28]
is the existence of a functional process (X∗t ) of rank q which serves as an “optimal approximation”
to the process (Xt) under study. The construction of (X∗t ), which is mathematically quite elegant,
is based on stochastic integration with respect to some orthogonal-increment (functional) stochas-
tic process (Zω). The disadvantage, from a statistical perspective, is that this construction is not
explicit, and that no finite-sample version of the concept is provided – only the limiting behavior
of the empirical spectral density operator and its eigenfunctions is obtained. Quite on the contrary,
our Theorem 4 establishes the consistency of an empirical, explicitly constructed and easily imple-
mentable version of the dynamic scores – which is what a statistician will be interested in. We also
remark that we are working under milder technical conditions.
2 Illustration of the method
An impression of how well the proposed method works can be obtained from Figure 1. Its left panel
shows ten consecutive intraday curves of some pollutant level (a detailed description of the underlying
data is given in Section 5). The two panels to the right show one-dimensional reconstructions of
these curves. We used static FPCA in the central panel and dynamic FPCA in the right panel. The
[Figure 1: three panels plotting Sqrt(PM10) against intraday time on [0, 1].]
Figure 1: Ten successive daily observations (left panel), the corresponding static Karhunen-Loève expansion based on one (static) principal component (middle panel), and the dynamic Karhunen-Loève expansion with one dynamic component (right panel). Colors provide the matching between the actual observations and their Karhunen-Loève approximations.
difference is notable. The static method merely provides an average level, exhibiting a completely
spurious and highly misleading intraday symmetry. In addition to daily average levels, the dynamic
approximation, to a large extent, also catches the intraday evolution of the curves. In particular,
it retrieves the intraday trend of pollution levels, and the location of their daily spikes and troughs
(which varies considerably from one curve to the other). For this illustrative example we chose
one-dimensional reconstructions, based on one single FPC; needless to say, increasing the number
of FPCs yields much better approximations – see Section 4 for details.
Applications of dynamic PCA in time series analysis are the same as those of static PCA in the
context of independent (or uncorrelated) observations. This is why obtaining mutually orthogonal
principal components – in the sense of mutually orthogonal processes – is a major issue here. This
orthogonality, at all leads and lags, of dynamic principal components, indeed, implies that any
second-order based method (which is the most common approach in time series) can be carried out
componentwise, i.e. via scalar methods. In contrast, static principal components still have to be
treated as a multivariate time series.
Let us illustrate this superiority of mutually orthogonal dynamic components over the auto- and
cross-correlated static ones by means of two examples.
Change point analysis: Suppose that we wish to find a structural break (change point) in a sequence
of functional observations X1, . . . , Xn. For example, Berkes et al. [4] consider the problem of detecting a change in the mean function of a sequence of independent functional data. They propose to first project the data on the p leading principal components and argue that a change in the mean will show in the score vectors, provided that the proportion of variance they account for is large enough. Then a CUSUM procedure is utilized. The test statistic is based on the functional
T_n(x) = \frac{1}{n} \sum_{m=1}^{p} \lambda_m^{-1} \Bigg( \sum_{1 \le k \le nx} Y^{\mathrm{stat}}_{mk} - x \sum_{1 \le k \le n} Y^{\mathrm{stat}}_{mk} \Bigg)^{\!2}, \qquad 0 \le x \le 1.
Here Y^{\mathrm{stat}}_{mk} is the m-th empirical PC score of X_k and \lambda_m is the m-th largest eigenvalue of the empirical
covariance operator related to the functional sample. The assumption of independence implies that
Tn(x) converges, under the no-change hypothesis, to the sum of p squared independent Brownian
bridges. Roughly speaking, this is due to the fact that the partial sums of score vectors (used in the
CUSUM statistic) converge in distribution to a multivariate normal with diagonal covariance. That
is, the partial sums of the individual scores become asymptotically independent, and we just obtain
p independent CUSUM test statistics – a separate one for each score sequence. The independent
test statistics are then aggregated.
This simple structure is lost when data are serially dependent. Then, if a CLT holds, the vector of partial sums \big(\sum_{1 \le k \le n} Y^{\mathrm{stat}}_{mk} : m = 1, \ldots, p\big)' converges to a normal vector whose covariance is no longer the diagonal matrix of the independent case: it has to be replaced by the long-run covariance of the score vectors, which is typically non-diagonal.
In contrast, using dynamic principal components, the long-run covariance of the score vectors
remains diagonal; see Proposition 4. Let \mathrm{diag}(\hat\lambda_1(0), \ldots, \hat\lambda_p(0)) be a consistent estimator of this long-run variance and Y^{\mathrm{dyn}}_{mk} be the dynamic scores. Then, replacing the test functional T_n(x) by

T^{\mathrm{dyn}}_n(x) = \frac{2\pi}{n} \sum_{m=1}^{p} \hat\lambda_m^{-1}(0) \Bigg( \sum_{1 \le k \le nx} Y^{\mathrm{dyn}}_{mk} - x \sum_{1 \le k \le n} Y^{\mathrm{dyn}}_{mk} \Bigg)^{\!2}, \qquad 0 \le x \le 1,
we get that (under appropriate technical assumptions ensuring a functional CLT) the same asymp-
totic behavior holds as for Tn(x), so that again p independent CUSUM test statistics can be aggre-
gated.
Dynamic principal components, thus, and not the static ones, provide a feasible extension of the
Berkes et al. [4] method to the time series context.
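To make the two statistics concrete, here is a small numerical sketch (the toy scores and all names are ours, not the paper's) of the common CUSUM functional applied to p score series; T_n and T_n^{dyn} differ only in the normalizers and the factor 2π:

```python
import numpy as np

def cusum_functional(Y, lam):
    """Y: (p, n) array of score series; lam: (p,) normalizers.
    Returns the CUSUM functional evaluated at x = k/n, k = 1, ..., n."""
    p, n = Y.shape
    S = np.cumsum(Y, axis=1)                 # partial sums over 1 <= k <= nx
    x = np.arange(1, n + 1) / n
    B = S - x[None, :] * S[:, -1:]           # bridge-type deviations
    return (B ** 2 / lam[:, None]).sum(axis=0) / n

rng = np.random.default_rng(0)
p, n = 3, 500
Y = rng.standard_normal((p, n))              # toy scores under the no-change null
T = cusum_functional(Y, np.ones(p))

Y_shift = Y.copy()
Y_shift[0, n // 2:] += 1.0                   # mean change in one score series
T_shift = cusum_functional(Y_shift, np.ones(p))
```

A level change in any single score series inflates the supremum of the functional; in practice the normalizers are replaced by estimated (long-run) eigenvalues and sup_x of the statistic is compared with quantiles of a sum of p squared Brownian bridges.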
Lagged regression: A lagged regression model is a linear model in which the response Wt ∈ Rq, say,
is allowed to depend on an unspecified number of lagged values of a series of regressor variables
(Xt) ∈ Rp. More specifically, the model equation is
W_t = a + \sum_{k \in \mathbb{Z}} b_k X_{t-k} + \varepsilon_t, \qquad (2.1)
with some i.i.d. noise (εt) which is independent of the regressor series. The intercept a ∈ Rq and
the matrices bk ∈ Rq×p are unknown. In time series analysis, the lagged regression is the natural
extension of the traditional linear model for independent data.
The main problem in this context, which can be tackled by a frequency domain approach, is
estimation of the parameters. See, for example, Shumway and Stoffer [35] for an introduction. Once
the parameters are known, the model can, e.g., be used for prediction.
Suppose now that Wt is a scalar response and that (Xk) constitutes a functional time series.
The corresponding lagged regression model can be formulated in analogy, but involves estimation
of an unspecified number of operators, which is quite delicate. A pragmatic way to proceed is
to have Xk in (2.1) replaced by the vector of the first p dynamic functional principal component
scores Yk = (Y1k, . . . , Ypk)′, say. The general theory implies that, under mild assumptions (basically
guaranteeing convergence of the involved series),
b_k = \frac{1}{2\pi} \int_{-\pi}^{\pi} B_\theta\, e^{ik\theta}\, d\theta, \quad \text{where} \quad B_\theta = \mathcal{F}^{WY}_\theta \big(\mathcal{F}^{Y}_\theta\big)^{-1},
and
\mathcal{F}^{Y}_\theta = \frac{1}{2\pi} \sum_{h \in \mathbb{Z}} \mathrm{cov}(Y_{t+h}, Y_t)\, e^{-ih\theta} \quad \text{and} \quad \mathcal{F}^{WY}_\theta = \frac{1}{2\pi} \sum_{h \in \mathbb{Z}} \mathrm{cov}(W_{t+h}, Y_t)\, e^{-ih\theta}
are the spectral density matrix of the score sequence and the cross-spectrum between (Wt) and
(Yt), respectively. In the present setting the structure greatly simplifies. Our theory will reveal (see
Proposition 9) that FYθ is diagonal at all frequencies and that
B_\theta = \Bigg( \frac{f^{WY_1}_\theta}{\lambda_1(\theta)},\ \ldots,\ \frac{f^{WY_p}_\theta}{\lambda_p(\theta)} \Bigg),
with f^{WY_m}_\theta the co-spectrum between (W_t) and (Y_{mt}), and \lambda_m(\theta) the m-th dynamic eigenvalue of the spectral density operator of the series (X_k) (see Section 3.2). As a consequence, the influence of each score sequence on the response can be assessed individually.
Of course, in applications, these population quantities are replaced by their empirical versions, and one may use some testing procedure for the null hypothesis H_0 : f^{WY_p}_\theta = 0 for all θ, in order to justify the choice of the dimension of the dynamic score vectors and to retain only those components which have a significant impact on W_t.
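As a numerical illustration of these formulas (a toy scalar example with p = 1; the model, bandwidth and frequency grid are our choices, not the paper's), one can recover the lag coefficients from Bartlett-weighted estimates of the spectrum and cross-spectrum:

```python
import numpy as np

rng = np.random.default_rng(1)
n, q, M = 20000, 50, 400
Y = rng.standard_normal(n)                               # scalar score series
W = (2.0 * Y - np.concatenate(([0.0], Y[:-1]))
     + 0.1 * rng.standard_normal(n))                     # W_t = 2 Y_t - Y_{t-1} + eps_t

def lag_covs(a, b):
    """cov(a_{t+h}, b_t) for |h| < q (biased sample version)."""
    return np.array([np.dot(a[h:], b[:n - h]) / n if h >= 0
                     else np.dot(a[:n + h], b[-h:]) / n
                     for h in range(-q + 1, q)])

hs = np.arange(-q + 1, q)
w = 1.0 - np.abs(hs) / q                                 # Bartlett weights
thetas = -np.pi + 2 * np.pi * np.arange(M) / M
E = np.exp(-1j * np.outer(hs, thetas))                   # e^{-ih theta}

fWY = (w * lag_covs(W, Y)) @ E / (2 * np.pi)             # cross-spectrum estimate
fYY = ((w * lag_covs(Y, Y)) @ E).real / (2 * np.pi)      # spectrum estimate (> 0)
B = fWY / fYY                                            # B_theta = F^{WY} / F^{Y}

def lag_coeff(k):
    """b_k = (1/2pi) int B_theta e^{ik theta} d theta, as a Riemann mean."""
    return (B * np.exp(1j * k * thetas)).mean().real

b0, b1, b2 = lag_coeff(0), lag_coeff(1), lag_coeff(2)
```

With the true coefficients b_0 = 2, b_1 = −1 and b_k = 0 otherwise, the estimates land close to these values up to the (small) Bartlett-window bias and sampling noise.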
3 Methodology for L2 curves
In this section, we introduce some necessary notation and tools. Most of the discussion on technical
details is postponed to the Appendices A, B and the supplementary document C. For simplicity, we
are focusing here on L2([0, 1])-valued processes, i.e. on square-integrable functions defined on the
unit interval; in the appendices, however, the theory is developed within a more general framework.
3.1 Notation and setup
Throughout this section, we consider a functional time series (Xt : t ∈ Z), where Xt takes values in
the space H := L2([0, 1]) of complex-valued square-integrable functions on [0, 1]. This means that
X_t = (X_t(u) : u \in [0,1]), \quad \text{with} \quad \int_0^1 |X_t(u)|^2\, du < \infty

(|z| := \sqrt{z\bar z}, where \bar z is the complex conjugate of z, stands for the modulus of z ∈ C). In most applications, observations are real but, since we will use spectral methods, a complex vector-space setting will prove useful.
The space H then is a Hilbert space, equipped with the inner product \langle x, y \rangle := \int_0^1 x(u)\overline{y(u)}\, du, so that \|x\| := \langle x, x \rangle^{1/2} defines a norm. The notation X ∈ L^p_H is used to indicate that, for some
p > 0, E[‖X‖p] <∞. Any X ∈ L1H then possesses a mean curve µ = (E[X(u)] : u ∈ [0, 1]), and any
X ∈ L2H a covariance operator C, defined by C(x) := E[(X − µ)〈x,X − µ〉]. The operator C is a
kernel operator given by
C(x)(u) = \int_0^1 c(u,v)\, x(v)\, dv, \quad \text{with} \quad c(u,v) := \mathrm{cov}(X(u), X(v)), \quad u, v \in [0,1],
with \mathrm{cov}(X, Y) := E(X - EX)\overline{(Y - EY)}. The process (X_t : t ∈ Z) is called weakly stationary if, for all t, (i) X_t ∈ L^2_H, (ii) EX_t = EX_0, and (iii) for all h ∈ Z and u, v ∈ [0, 1],

\mathrm{cov}(X_{t+h}(u), X_t(v)) =: c_h(u, v) \quad \text{does not depend on } t.

Denote by C_h, h ∈ Z, the operator corresponding to the autocovariance kernel c_h. Clearly, C_0 = C.
It is well known that, under quite general dependence assumptions, the mean of a stationary func-
tional sequence can be consistently estimated by the sample mean, with the usual √n-convergence
rate. Since, for our problem, the mean is not really relevant, we throughout suppose that the
data have been centered in some preprocessing step. For the rest of the paper, it is tacitly as-
sumed that (Xt : t ∈ Z) is a weakly stationary, zero mean process defined on some probability space
(Ω,A, P ).
As in the multivariate case, the covariance operator C of a random element X ∈ L2H admits an
eigendecomposition (see, e.g., p. 178, Theorem 5.1 in [13])
C(x) = \sum_{\ell=1}^{\infty} \lambda_\ell\, \langle x, v_\ell \rangle\, v_\ell, \qquad (3.1)
where (\lambda_\ell : \ell \ge 1) are C's eigenvalues (in descending order) and (v_\ell : \ell \ge 1) the corresponding normalized eigenfunctions, so that C(v_\ell) = \lambda_\ell v_\ell and \|v_\ell\| = 1. If C has full rank, then the sequence (v_\ell : \ell \ge 1) forms an orthonormal basis of L2([0, 1]). Hence, X admits the representation
X = \sum_{\ell=1}^{\infty} \langle X, v_\ell \rangle\, v_\ell, \qquad (3.2)
which is called the static Karhunen-Loève expansion of X. The eigenfunctions v_\ell are called the (static) functional principal components (FPCs) and the coefficients \langle X, v_\ell \rangle are called the (static) FPC scores or loadings. It is well known that the basis (v_\ell : \ell \ge 1) is optimal in representing X in
the following sense: if (w_\ell : \ell \ge 1) is any other orthonormal basis of H, then

E\Big\|X - \sum_{\ell=1}^{p} \langle X, v_\ell \rangle v_\ell\Big\|^2 \le E\Big\|X - \sum_{\ell=1}^{p} \langle X, w_\ell \rangle w_\ell\Big\|^2, \qquad \forall p \ge 1. \qquad (3.3)
Property (3.3) shows that a finite number of FPCs can be used to approximate the function X
by a vector of given dimension p with a minimum loss of “instantaneous” information. It should
be stressed, though, that this approximation is of a static nature, meaning that it is performed
observation by observation, and does not take into account the possible serial dependence of the
Xt’s, which is likely to exist in a time-series context. Globally speaking, we should be looking for an
approximation which also involves lagged observations, and is based on the whole family (Ch : h ∈ Z)
rather than on C0 only. To achieve this goal, we introduce below the spectral density operator, which
contains the full information on the family of operators (Ch : h ∈ Z).
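For intuition, the static expansion (3.1)–(3.3) is easy to mimic numerically. The following sketch (our own toy data; a grid-based inner product stands in for the L2([0, 1]) one) computes empirical FPCs by an eigendecomposition of the sample covariance kernel and truncates the Karhunen-Loève expansion:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 200, 50                          # n curves observed on an r-point grid
u = np.linspace(0.0, 1.0, r)

# Toy smooth curves: two principal modes of variation plus small noise.
X = (rng.standard_normal((n, 1)) * np.sin(2 * np.pi * u)
     + 0.5 * rng.standard_normal((n, 1)) * np.cos(2 * np.pi * u)
     + 0.05 * rng.standard_normal((n, r)))
X = X - X.mean(axis=0)                  # centre, as assumed throughout

c = X.T @ X / n                         # empirical covariance kernel c(u_i, u_j)
lam, V = np.linalg.eigh(c)              # eigh returns ascending eigenvalues
lam, V = lam[::-1], V[:, ::-1]          # descending order: v_1, v_2, ...

def reconstruct(p):
    """Truncated Karhunen-Loeve expansion with p static FPCs."""
    scores = X @ V[:, :p]               # static scores <X, v_l> (grid version)
    return scores @ V[:, :p].T

err = [np.mean((X - reconstruct(p)) ** 2) for p in (1, 2, 3)]
```

Property (3.3) shows up as the monotone decrease of the reconstruction error in p; in the time-series setting discussed next, this static, observation-by-observation optimality is no longer the relevant criterion.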
3.2 The spectral density operator
In analogy to the classical concept of a spectral density matrix, we define the spectral density
operator.
Definition 2. Let (Xt) be a stationary process. The operator FXθ whose kernel is
f^X_\theta(u,v) := \frac{1}{2\pi} \sum_{h \in \mathbb{Z}} c_h(u,v)\, e^{-ih\theta}, \qquad \theta \in [-\pi, \pi],
where i denotes the imaginary unit, is called the spectral density operator of (Xt) at frequency θ.
To ensure convergence (in an appropriate sense) of the series defining f^X_\theta(u,v) (see Appendix A.2), we impose the following summability condition on the autocovariances:

\sum_{h \in \mathbb{Z}} \left( \int_0^1\!\!\int_0^1 |c_h(u,v)|^2\, du\, dv \right)^{1/2} < \infty. \qquad (3.4)
The same condition is more conveniently expressed as

\sum_{h \in \mathbb{Z}} \|C_h\|_{\mathcal S} < \infty, \qquad (3.5)
where ‖ · ‖S denotes the Hilbert-Schmidt norm (see Section C.1 in the supplementary document).
A simple sufficient condition for (3.5) to hold will be provided in Proposition 7.
This concept of a spectral density operator has been introduced by Panaretos and Tavakoli [27].
In our context, this operator is used to create particular functional filters (see Sections 3.3 and A.3),
which are the building blocks for the construction of dynamic FPCs. A functional filter is defined via
a sequence Φ = (\Phi_\ell : \ell \in \mathbb{Z}) of linear operators between the spaces H = L2([0, 1]) and H' = R^p. The filtered variables Y_t have the form Y_t = \sum_{\ell \in \mathbb{Z}} \Phi_\ell(X_{t-\ell}) and, by the Riesz representation theorem, the linear operators \Phi_\ell are given as

x \mapsto \Phi_\ell(x) = (\langle x, \phi_{1\ell} \rangle, \ldots, \langle x, \phi_{p\ell} \rangle)', \quad \text{with } \phi_{1\ell}, \ldots, \phi_{p\ell} \in H.
We shall consider filters Φ for which the sequences \big(\sum_{\ell=-N}^{N} \phi_{m\ell}(u)\, e^{i\ell\theta} : N \ge 1\big), 1 \le m \le p, converge in L2([0, 1] × [−π, π]). Hence, we assume the existence of a square-integrable function \phi^\star_m(u\,|\,\theta) such that

\lim_{N\to\infty} \int_{-\pi}^{\pi}\!\!\int_0^1 \Big( \sum_{\ell=-N}^{N} \phi_{m\ell}(u)\, e^{i\ell\theta} - \phi^\star_m(u\,|\,\theta) \Big)^{2} du\, d\theta = 0. \qquad (3.6)
In addition, we suppose that

\sup_{\theta \in [-\pi,\pi]} \int_0^1 \big[\phi^\star_m(u\,|\,\theta)\big]^2\, du < \infty. \qquad (3.7)
Then, we write \phi^\star_m(\theta) := \sum_{\ell \in \mathbb{Z}} \phi_{m\ell}\, e^{i\ell\theta} or, in order to emphasize its functional nature, \phi^\star_m(u\,|\,\theta) := \sum_{\ell \in \mathbb{Z}} \phi_{m\ell}(u)\, e^{i\ell\theta}. We denote by C the family of filters Φ which satisfy (3.6) and (3.7). For example, if Φ is such that \sum_{\ell} \|\phi_{m\ell}\| < \infty, then Φ ∈ C.
The following proposition relates the spectral density operator of (X_t) to the spectral density matrix of the filtered sequence (Y_t = \sum_{\ell \in \mathbb{Z}} \Phi_\ell(X_{t-\ell})). This simple result plays a crucial role in our construction.

Proposition 2. Assume that Φ ∈ C and let \phi^\star_m(\theta) be given as above. Then the series \sum_{\ell \in \mathbb{Z}} \Phi_\ell(X_{t-\ell})
converges in mean square to a limit Yt. The p-dimensional vector process (Yt) is stationary, with
spectral density matrix
\mathcal{F}^Y_\theta = \begin{pmatrix}
\langle \mathcal{F}^X_\theta(\phi^\star_1(\theta)), \phi^\star_1(\theta) \rangle & \cdots & \langle \mathcal{F}^X_\theta(\phi^\star_p(\theta)), \phi^\star_1(\theta) \rangle \\
\vdots & \ddots & \vdots \\
\langle \mathcal{F}^X_\theta(\phi^\star_1(\theta)), \phi^\star_p(\theta) \rangle & \cdots & \langle \mathcal{F}^X_\theta(\phi^\star_p(\theta)), \phi^\star_p(\theta) \rangle
\end{pmatrix}.
Since we do not want to assume a priori absolute summability of the filter coefficients \Phi_\ell, the series \mathcal{F}^Y_\theta = (2\pi)^{-1} \sum_{h \in \mathbb{Z}} C^Y_h e^{-ih\theta}, where C^Y_h = \mathrm{cov}(Y_h, Y_0), may not converge absolutely, and hence not pointwise in θ. As our general theory will show, the operator \mathcal{F}^Y_\theta can be considered as an element of the space L^2_{\mathbb{C}^{p\times p}}([−π, π]), i.e. the collection of measurable mappings f : [−π, π] → \mathbb{C}^{p\times p} for which \int_{-\pi}^{\pi} \|f(\theta)\|_F^2\, d\theta < \infty, where \|\cdot\|_F denotes the Frobenius norm. Equality of f and g is thus understood as \int_{-\pi}^{\pi} \|f(\theta) - g(\theta)\|_F^2\, d\theta = 0. In particular, it implies that f(θ) = g(θ) for almost all θ.
To explain the important consequences of Proposition 2, first observe that under (3.5), for
every frequency θ, the operator FXθ is a non-negative, self-adjoint Hilbert-Schmidt operator (see
Section C.1 of the supplementary file). Hence, in analogy to (3.1), FXθ admits, for all θ, the spectral
representation
\mathcal{F}^X_\theta(x) = \sum_{m \ge 1} \lambda_m(\theta)\, \langle x, \varphi_m(\theta) \rangle\, \varphi_m(\theta),
where λm(θ) and ϕm(θ) denote the dynamic eigenvalues and eigenfunctions. We impose the order
λ1(θ) ≥ λ2(θ) ≥ . . . ≥ 0 for all θ ∈ [−π, π], and require that the eigenfunctions be standardized so
that ‖ϕm(θ)‖ = 1 for all m ≥ 1 and θ ∈ [−π, π].
Assume now that we could choose the functional filters (\phi_{m\ell} : \ell \in \mathbb{Z}) in such a way that

\lim_{N\to\infty} \int_{-\pi}^{\pi}\!\!\int_0^1 \Big( \sum_{\ell=-N}^{N} \phi_{m\ell}(u)\, e^{i\ell\theta} - \varphi_m(u\,|\,\theta) \Big)^{2} du\, d\theta = 0. \qquad (3.8)
We then have \mathcal{F}^Y_\theta = \mathrm{diag}(\lambda_1(\theta), \ldots, \lambda_p(\theta)) for almost all θ, implying that the coordinate processes of (Y_t) are uncorrelated at any lag: \mathrm{cov}(Y_{mt}, Y_{m's}) = 0 for all s, t and m \ne m'. As discussed in the Introduction, this is a desirable property which the static FPCs do not possess.
3.3 Dynamic FPCs
Motivated by the discussion above, we wish to define \phi_{m\ell} in such a way that \phi^\star_m = \varphi_m (in L2([0, 1] × [−π, π])). To this end, we suppose that the function \varphi_m(u\,|\,\theta) is jointly measurable in u and θ (this assumption is discussed in Appendix A.1). The fact that the eigenfunctions are standardized to unit length implies \int_{-\pi}^{\pi}\!\int_0^1 \varphi_m^2(u\,|\,\theta)\, du\, d\theta = 2\pi. We conclude from Tonelli's theorem that \int_{-\pi}^{\pi} \varphi_m^2(u\,|\,\theta)\, d\theta < \infty for almost all u ∈ [0, 1], i.e. that \varphi_m(u\,|\,\cdot) ∈ L2([−π, π]) for all u ∈ A_m ⊂ [0, 1], where A_m has Lebesgue measure one. We now define, for u ∈ A_m,
\phi_{m\ell}(u) := \frac{1}{2\pi} \int_{-\pi}^{\pi} \varphi_m(u\,|\,s)\, e^{-i\ell s}\, ds; \qquad (3.9)
for u \notin A_m, \phi_{m\ell}(u) is set to zero. Then, it follows from the results in Appendix A.1 that (3.8) holds. We conclude that the functional filters defined via (\phi_{m\ell} : \ell \in \mathbb{Z}, 1 \le m \le p) belong to the class C and that the resulting filtered process has diagonal autocovariances at all lags.
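The passage from eigenfunctions to filter coefficients is a plain Fourier-coefficient computation. Here is a small deterministic sketch of (3.9) (the "eigenfunction" below is a made-up surrogate, not derived from any process, and the grid sizes are our choices):

```python
import numpy as np

# Frequency grid on [-pi, pi) and a toy eigenfunction phi(u | theta)
# of the separable form cos(theta) a(u) + sin(theta) b(u).
J, r = 512, 30
thetas = -np.pi + 2 * np.pi * np.arange(J) / J
u = np.linspace(0.0, 1.0, r)
a, b = np.sin(np.pi * u), np.cos(np.pi * u)
phi = np.cos(thetas)[:, None] * a[None, :] + np.sin(thetas)[:, None] * b[None, :]

def filter_coeff(l):
    # phi_{m,l}(u) = (1/2pi) int phi_m(u|s) e^{-ils} ds, as a Riemann mean
    return (phi * np.exp(-1j * l * thetas)[:, None]).mean(axis=0)

# Only the lags l = 1 and l = -1 carry mass here: phi_{m,1} = (a - ib)/2 and
# phi_{m,-1} = (a + ib)/2; the conjugate-symmetric pair yields a real filter.
c1, c_1, c2 = filter_coeff(1), filter_coeff(-1), filter_coeff(2)
```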
Definition 3 (Dynamic functional principal components). Assume that (X_t : t \in \mathbb{Z}) is a mean-zero stationary process with values in L^2_H satisfying assumption (3.5). Let \phi_{m\ell} be defined as in (3.9). Then the m-th dynamic functional principal component score of (X_t) is

Y_{mt} := \sum_{\ell \in \mathbb{Z}} \langle X_{t-\ell}, \phi_{m\ell} \rangle, \qquad t \in \mathbb{Z}. \qquad (3.10)
The next theorem, which tells us how the original process (X_t(u) : t \in \mathbb{Z}, u \in [0, 1]) can be recovered from (Y_{mt} : t \in \mathbb{Z}, m \ge 1), is the dynamic analogue of the static Karhunen-Loève expansion (3.2) associated with static principal components.
Theorem 2 (Inversion formula). Let Y_{mt} be the dynamic FPC scores related to the process (X_t(u) : t \in \mathbb{Z}, u \in [0, 1]). Then,

X_t(u) = \sum_{m \ge 1} X_{mt}(u) \quad \text{with} \quad X_{mt}(u) := \sum_{\ell \in \mathbb{Z}} Y_{m,t+\ell}\, \phi_{m\ell}(u) \qquad (3.11)

(where convergence is in mean square). Call (3.11) the dynamic Karhunen-Loève expansion of X_t.
We have mentioned in Remark 2 that dynamic FPC scores are not unique. In contrast, our proofs show that the curves X_{mt}(u) are unique. To get some intuition, let us draw a simple analogy to the static case. There, each v_\ell in the Karhunen-Loève expansion (3.2) can be replaced by −v_\ell, i.e., the FPCs are defined up to their signs. The \ell-th score is \langle X, v_\ell \rangle or \langle X, -v_\ell \rangle, and thus is not unique either. However, the curves \langle X, v_\ell \rangle v_\ell and \langle X, -v_\ell \rangle(-v_\ell) are identical.
The sums \sum_{m=1}^{p} X_{mt}(u), p \ge 1, can be seen as p-dimensional reconstructions of X_t(u), which only involve the p time series (Y_{mt} : t \in \mathbb{Z}), 1 \le m \le p. Competitors to this reconstruction are obtained by replacing \phi_{m\ell} in (3.10) and (3.11) with alternative sequences \psi_{m\ell} and \upsilon_{m\ell}. The next theorem shows that, among all filters in C, the dynamic Karhunen-Loève expansion (3.11) approximates X_t(u) in an optimal way.
Theorem 3 (Optimality of Karhunen-Loève expansions). Let Y_{mt} be the dynamic FPC scores related to the process (X_t : t \in \mathbb{Z}), and define X_{mt} as in Theorem 2. Let \tilde X_{mt} = \sum_{\ell \in \mathbb{Z}} \tilde Y_{m,t+\ell}\, \upsilon_{m\ell}, with \tilde Y_{mt} = \sum_{\ell \in \mathbb{Z}} \langle X_{t-\ell}, \psi_{m\ell} \rangle, where (\psi_{mk} : k \in \mathbb{Z}) and (\upsilon_{mk} : k \in \mathbb{Z}) are sequences in H belonging to C. Then,

E\Big\|X_t - \sum_{m=1}^{p} X_{mt}\Big\|^2 = \sum_{m>p} \int_{-\pi}^{\pi} \lambda_m(\theta)\, d\theta \;\le\; E\Big\|X_t - \sum_{m=1}^{p} \tilde X_{mt}\Big\|^2 \qquad \forall p \ge 1. \qquad (3.12)
Inequality (3.12) can be interpreted as the dynamic version of (3.3). Theorem 3 also suggests the proportion

\sum_{m \le p} \int_{-\pi}^{\pi} \lambda_m(\theta)\, d\theta \Big/ E\|X_1\|^2 \qquad (3.13)

of variance explained by the first p dynamic FPCs as a natural measure of how well a functional time series can be represented in dimension p.
3.4 Estimation and asymptotics
In practice, dynamic FPC scores need to be calculated from an estimated version of FXθ . At the same
time, the infinite series defining the scores need to be replaced by finite approximations. Suppose
again that (Xt : t ∈ Z) is a weakly stationary zero-mean time series such that (3.5) holds. Then, a
natural estimator for Y_{mt} is

\hat Y_{mt} := \sum_{\ell=-L}^{L} \langle X_{t-\ell}, \hat\phi_{m\ell} \rangle, \qquad m = 1, \ldots, p \ \text{ and } \ t = L+1, \ldots, n-L, \qquad (3.14)
where L is some integer and \hat\phi_{m\ell} is computed from some estimated spectral density operator \hat{\mathcal F}^X_\theta. For the latter, we impose the following preliminary assumption.
Assumption B.1 The estimator \hat{\mathcal F}^X_\theta is consistent in integrated mean square, i.e.

\int_{-\pi}^{\pi} E\,\big\| \hat{\mathcal F}^X_\theta - \mathcal F^X_\theta \big\|_{\mathcal S}^2\, d\theta \to 0 \quad \text{as } n \to \infty. \qquad (3.15)
Panaretos and Tavakoli [27] propose an estimator \hat{\mathcal F}^X_\theta satisfying (3.15) under certain functional cumulant conditions. By stating (3.15) as an assumption, we intend to keep the theory more widely applicable. For example, the following proposition shows that estimators satisfying Assumption B.1 also exist under L4-m-approximability, a dependence concept for functional data introduced in Hörmann and Kokoszka [18]. Define
\hat{\mathcal F}^X_\theta = \frac{1}{2\pi} \sum_{|h| \le q} \Big(1 - \frac{|h|}{q}\Big)\, \hat C^X_h\, e^{-ih\theta}, \qquad 0 < q < n, \qquad (3.16)

where \hat C^X_h is the usual empirical autocovariance operator at lag h.
Proposition 5. Let (X_t : t \in \mathbb{Z}) be L4-m-approximable, and let q = q(n) \to \infty such that q^3 = o(n). Then the estimator \hat{\mathcal F}^X_\theta defined in (3.16) satisfies Assumption B.1. The approximation error is O(\alpha_{q,n}), where

\alpha_{q,n} = \frac{q^{3/2}}{\sqrt{n}} + \frac{1}{q} \sum_{|h| \le q} |h|\, \|C_h\|_{\mathcal S} + \sum_{|h| > q} \|C_h\|_{\mathcal S}.
Corollary 1. Under the assumptions of Proposition 5 and \sum_{h} |h|\, \|C_h\|_{\mathcal S} < \infty, the convergence rate of the estimator (3.16) is O(n^{-1/5}).
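As a quick plausibility check of the estimator (3.16) (toy data, grid discretization and bandwidth are our choices), note that the Bartlett window amounts to smoothing the periodogram with the non-negative Fejér kernel, so the estimated kernel is Hermitian and non-negative definite at every frequency:

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, q = 2000, 20, 12
eps = rng.standard_normal((n + 1, r))
X = eps[1:] + 0.5 * eps[:-1]              # toy functional MA(1) on an r-point grid
X = X - X.mean(axis=0)                    # centre

def f_theta(theta):
    """Kernel of the lag-window estimator (3.16), evaluated on the grid."""
    F = np.zeros((r, r), dtype=complex)
    for h in range(-q + 1, q):
        if h >= 0:
            c = X[h:].T @ X[:n - h] / n       # empirical kernel c_h(u_i, u_j)
        else:
            c = (X[-h:].T @ X[:n + h] / n).T  # c_{-h}(u, v) = c_h(v, u)
        F += (1 - abs(h) / q) * c * np.exp(-1j * h * theta)
    return F / (2 * np.pi)

F0 = f_theta(0.0)
eigs = np.linalg.eigvalsh(F0)             # real eigenvalues of a Hermitian matrix
```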
Since our method requires the estimation of eigenvectors of the spectral density operator, we also
need to introduce certain identifiability constraints on eigenvectors. Define \alpha_1(\theta) := \lambda_1(\theta) - \lambda_2(\theta) and

\alpha_m(\theta) := \min\{\lambda_{m-1}(\theta) - \lambda_m(\theta),\; \lambda_m(\theta) - \lambda_{m+1}(\theta)\} \quad \text{for } m > 1,

where \lambda_i(\theta) is the i-th largest eigenvalue of the spectral density operator evaluated at θ.
Assumption B.2 For all m, αm(θ) has finitely many zeros.
Assumption B.2 essentially guarantees distinct eigenvalues for all θ. It is a very common assumption
in functional PCA, as it ensures that eigenspaces are one-dimensional, and thus eigenfunctions are
unique up to their signs. To guarantee identifiability, it only remains to provide a rule for choosing
the signs. In our context, the situation is slightly more complicated, since we are working in a
complex setup. The eigenfunction ϕm(θ) is unique up to multiplication by a number on the complex
unit circle. A possible way to fix the direction of the eigenfunctions is to impose a constraint of the
form 〈ϕm(θ), v〉 ∈ (0,∞) for some given function v. In other words, we choose the orientation of
the eigenfunction such that its inner product with some reference curve v is a positive real number.
This rule identifies ϕm(θ), as long as it is not orthogonal to v. The following assumption ensures
that such identification is possible on a large enough set of frequencies θ ∈ [−π, π].
Assumption B.3 Denoting by \varphi_m(\theta) the m-th dynamic eigenvector of \mathcal F^X_\theta, there exists v such that \langle \varphi_m(\theta), v \rangle \ne 0 for almost all θ ∈ [−π, π].
From now on, we tacitly assume that the orientations of \varphi_m(\theta) and \hat\varphi_m(\theta) are chosen so that \langle \varphi_m(\theta), v \rangle and \langle \hat\varphi_m(\theta), v \rangle are in [0, ∞) for almost all θ. Then, we have the following result.

Theorem 4 (Consistency). Let \hat Y_{mt} be the random variable defined by (3.14) and suppose that Assumptions B.1–B.3 hold. Then, for some sequence L = L(n) \to \infty, we have \hat Y_{mt} \xrightarrow{P} Y_{mt} as n \to \infty.
Practical guidelines for the choice of L are given in the next section.
4 Practical implementation
In applications, data can only be recorded discretely. A curve x(u) is observed on grid points
0 ≤ u1 < u2 < · · · < ur ≤ 1. Often, though not necessarily so, r is very large (high frequency data).
The sampling frequency r and the sampling points ui may change from observation to observation.
Also, data may be recorded with or without measurement error, and time warping (registration)
may be required. For deriving limiting results, a common assumption is that r → ∞, while a
possible measurement error tends to zero. All these specifications have been extensively studied
in the literature, and we omit here the technical exercise to cast our theorems and propositions in
one of these setups. Rather, we show how to implement the proposed method, after the necessary
preprocessing steps have been carried out. Typically, data are then represented in terms of a finite
(but possibly large) number of basis functions (v_k : 1 \le k \le d), i.e., x(u) = \sum_{k=1}^{d} x_k v_k(u). Usually
Fourier bases, B-splines or wavelets are used. For an excellent survey on preprocessing the raw data, we refer to Ramsay and Silverman [32, Chapters 3–5].
In the sequel, we write (aij : 1 ≤ i, j ≤ d) for a d×d matrix with entry aij in row i and column j.
Let $x$ belong to the span $H_d := \mathrm{sp}(v_k : 1 \le k \le d)$ of $v_1, \dots, v_d$. Then $x$ is of the form $\mathbf{v}'\mathbf{x}$, where $\mathbf{v} = (v_1, \dots, v_d)'$ and $\mathbf{x} = (x_1, \dots, x_d)'$. We assume that the basis functions $v_1, \dots, v_d$ are linearly independent, but they need not be orthogonal. Any statement about $x$ can be expressed as an equivalent statement about $\mathbf{x}$. In particular, if $A : H_d \to H_d$ is a linear operator, then, for $x \in H_d$,
$$A(x) = \sum_{k=1}^{d} x_k A(v_k) = \sum_{k=1}^{d} \sum_{k'=1}^{d} x_k \langle A(v_k), v_{k'}\rangle v_{k'} = \mathbf{v}'\mathbf{A}\mathbf{x},$$
where $\mathbf{A}' = (\langle A(v_i), v_j\rangle : 1 \le i, j \le d)$. Call $\mathbf{A}$ the corresponding matrix of $A$ and $\mathbf{x}$ the corresponding vector of $x$.
The following simple results are stated without proof.
Lemma 1. Let A,B be linear operators on Hd, with corresponding matrices A and B, respectively.
Then,
(i) for any $\alpha, \beta \in \mathbb{C}$, the corresponding matrix of $\alpha A + \beta B$ is $\alpha\mathbf{A} + \beta\mathbf{B}$;
(ii) $A(e) = \lambda e$ iff $\mathbf{A}\mathbf{e} = \lambda\mathbf{e}$, where $e = \mathbf{v}'\mathbf{e}$;
(iii) letting $A := \sum_{i=1}^{d}\sum_{j=1}^{d} g_{ij}\, v_i \otimes v_j$, $\mathbf{G} := (g_{ij} : 1 \le i, j \le d)$ with $g_{ij} \in \mathbb{C}$, and $\mathbf{V} := (\langle v_i, v_j\rangle : 1 \le i, j \le d)$, the corresponding matrix of $A$ is $\mathbf{A} = \mathbf{G}\mathbf{V}'$.
To obtain the corresponding matrix of the spectral density operator $F^X_\theta$, first observe that, if $X_k = \sum_{i=1}^{d} X_{ki} v_i =: \mathbf{v}'\mathbf{X}_k$, then
$$C^X_h = E X_h \otimes X_0 = \sum_{i=1}^{d}\sum_{j=1}^{d} E X_{hi} X_{0j}\, v_i \otimes v_j.$$
It follows from Lemma 1 (iii) that $\mathbf{C}^X_h \mathbf{V}'$ is the corresponding matrix of $C^X_h$, where $\mathbf{C}^X_h := E\,\mathbf{X}_h \mathbf{X}_0'$; the linearity property (i) then implies that
$$\mathbf{F}^X_\theta = \frac{1}{2\pi}\Big(\sum_{h\in\mathbb{Z}} \mathbf{C}^X_h e^{-ih\theta}\Big)\mathbf{V}' \qquad (4.1)$$
is the corresponding matrix of $F^X_\theta$. Assume that $\lambda_m(\theta)$ is the $m$-th largest eigenvalue of $\mathbf{F}^X_\theta$, with eigenvector $\boldsymbol{\varphi}_m(\theta)$. Then $\lambda_m(\theta)$ is also an eigenvalue of $F^X_\theta$, and $\mathbf{v}'\boldsymbol{\varphi}_m(\theta)$ is the corresponding eigenfunction, from which we can compute, via its Fourier expansion, the dynamic FPCs. In particular, we have
$$\phi_{mk} = \frac{\mathbf{v}'}{2\pi}\int_{-\pi}^{\pi} \boldsymbol{\varphi}_m(s)\, e^{-iks}\, ds =: \mathbf{v}'\boldsymbol{\phi}_{mk},$$
and hence
$$Y_{mt} = \sum_{k\in\mathbb{Z}} \int_0^1 \mathbf{X}_{t-k}'\, \mathbf{v}(u)\mathbf{v}'(u)\, \boldsymbol{\phi}_{mk}\, du = \sum_{k\in\mathbb{Z}} \mathbf{X}_{t-k}'\, \mathbf{V}\, \boldsymbol{\phi}_{mk}. \qquad (4.2)$$
In view of (4.1), our task is now to replace the spectral density matrix
$$\mathbf{F}^X_\theta = \frac{1}{2\pi}\sum_{h\in\mathbb{Z}} \mathbf{C}^X_h e^{-ih\theta}$$
of the coefficient sequence $(\mathbf{X}_k)$ by some estimate. For this purpose, we can use existing multivariate techniques. Classically, we would put, for $|h| < n$,
$$\widehat{\mathbf{C}}^X_h := \frac{1}{n}\sum_{k=h+1}^{n} \mathbf{X}_k \mathbf{X}_{k-h}', \quad h \ge 0, \qquad \text{and} \qquad \widehat{\mathbf{C}}^X_h := \big(\widehat{\mathbf{C}}^X_{-h}\big)', \quad h < 0$$
(recall that we throughout assume that the data are centered) and use, for example, some lag window estimator
$$\widehat{\mathbf{F}}^X_\theta := \frac{1}{2\pi}\sum_{|h|\le q} w(h/q)\, \widehat{\mathbf{C}}^X_h e^{-ih\theta}, \qquad (4.3)$$
where $w$ is some appropriate weight function, $q = q_n \to \infty$ and $q_n/n \to 0$. For more details concerning common choices of $w$ and the tuning parameter $q_n$, we refer to Chapters 10–11 in Brockwell and Davis [7] and to Politis [29]. We then use $\widehat{\mathbf{F}}^X_\theta \mathbf{V}'$ as our estimate of the corresponding matrix in (4.1), and compute its eigenvalues and eigenvectors $\hat\lambda_m(\theta)$ and $\hat{\boldsymbol{\varphi}}_m(\theta)$, which serve as estimators of $\lambda_m(\theta)$ and $\boldsymbol{\varphi}_m(\theta)$, respectively. We estimate the filter coefficients by
$$\hat\phi_{mk} = \frac{\mathbf{v}'}{2\pi}\int_{-\pi}^{\pi} \hat{\boldsymbol{\varphi}}_m(s)\, e^{-iks}\, ds.$$
Usually, no analytic form of $\hat{\boldsymbol{\varphi}}_m(s)$ is available, and one has to perform numerical integration. We take the simplest approach, which is to set
$$\hat\phi_{mk} = \frac{\mathbf{v}'}{2N_\theta + 1}\sum_{j=-N_\theta}^{N_\theta} \hat{\boldsymbol{\varphi}}_m(\pi j/N_\theta)\, e^{-ik\pi j/N_\theta} =: \mathbf{v}'\hat{\boldsymbol{\phi}}_{mk} \qquad (N_\theta \gg 1).$$
The larger $N_\theta$, the better. This clearly depends on the available computing power.
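As an illustration of the lag window estimator (4.3), here is a minimal NumPy sketch (not the authors' R implementation; the function name and interface are our own) that computes the Bartlett lag-window spectral density matrix of a centered coefficient series:

```python
import numpy as np

def lag_window_spectral_density(X, q, thetas):
    """Bartlett lag-window estimator of the spectral density matrix, as in (4.3).

    X      : (n, d) array whose rows are the centered coefficient vectors X_1, ..., X_n
    q      : bandwidth q_n (q_n -> infinity, q_n / n -> 0)
    thetas : frequencies in [-pi, pi] at which the estimate is evaluated
    """
    n, d = X.shape
    # sample autocovariances C_h = (1/n) sum_k X_k X'_{k-h}, h = 0, ..., q
    C = [X[h:].T @ X[:n - h] / n for h in range(q + 1)]
    F = np.empty((len(thetas), d, d), dtype=complex)
    for i, th in enumerate(thetas):
        S = C[0].astype(complex)
        for h in range(1, q + 1):
            w = 1.0 - h / q                      # Bartlett weights w(h/q)
            # C_{-h} = C_h', so the +h and -h terms combine as below
            S += w * (C[h] * np.exp(-1j * h * th) + C[h].T * np.exp(1j * h * th))
        F[i] = S / (2 * np.pi)
    return F

# illustration on white noise, where F(theta) should be roughly I / (2*pi)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
F = lag_window_spectral_density(X, q=20, thetas=np.linspace(-np.pi, np.pi, 5))
```

The resulting matrices are Hermitian and (for Bartlett weights) non-negative definite, so their eigenvalues and eigenvectors can be extracted frequency by frequency with a Hermitian eigensolver.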
Now, we substitute $\hat{\boldsymbol{\phi}}_{mk}$ into (4.2), replacing the infinite sum with a rolling window
$$\widehat{Y}_{mt} = \sum_{k=-L}^{L} \mathbf{X}_{t-k}'\, \mathbf{V}\, \hat{\boldsymbol{\phi}}_{mk}. \qquad (4.4)$$
This expression can only be computed for $t \in \{L+1, \dots, n-L\}$; for $1 \le t \le L$ or $n-L+1 \le t \le n$, we set $\mathbf{X}_{-L+1} = \cdots = \mathbf{X}_0 = \mathbf{X}_{n+1} = \cdots = \mathbf{X}_{n+L} = E\mathbf{X}_1 = 0$. This, of course, creates a certain bias on the boundary of the observation period. As for the choice of $L$, we observe that $\sum_{\ell\in\mathbb{Z}} \|\phi_{m\ell}\|^2 = 1$. It is then natural to choose $L$ such that $\sum_{-L \le \ell \le L} \|\hat\phi_{m\ell}\|^2 \ge 1 - \varepsilon$, for some small threshold $\varepsilon$, e.g., $\varepsilon = 0.01$.
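The rule for selecting $L$ can be sketched as follows (illustrative Python; `choose_L` is a hypothetical helper, not from the thesis):

```python
def choose_L(phi_norms_sq, eps=0.01, L_max=60):
    """Smallest L with sum_{|l| <= L} ||phi_ml||^2 >= 1 - eps.

    phi_norms_sq : dict mapping lag l (possibly negative) to ||phi_ml||^2,
                   normalized so that the total sum is (approximately) 1.
    """
    total = 0.0
    for L in range(L_max + 1):
        total += phi_norms_sq.get(L, 0.0)
        if L > 0:
            total += phi_norms_sq.get(-L, 0.0)
        if total >= 1.0 - eps:
            return L
    return L_max
```

For instance, with filter norms decaying geometrically like $\|\phi_{m\ell}\|^2 = \tfrac{1}{3}\,0.5^{|\ell|}$ (which sum to 1), the rule with $\varepsilon = 0.01$ selects a moderate window of lags.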
Based on this definition of $\hat\phi_{mk}$, we obtain an empirical $p$-term dynamic Karhunen–Loève expansion
$$\widehat{X}_t = \sum_{m=1}^{p} \sum_{k=-L}^{L} \widehat{Y}_{m,t+k}\, \hat\phi_{mk}, \quad \text{with } \widehat{Y}_{mt} = 0 \text{ for } t \in \{-L+1, \dots, 0\} \cup \{n+1, \dots, n+L\}. \qquad (4.5)$$
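Steps (4.4)–(4.5) for a single component $m$ can be sketched as follows (an illustrative NumPy translation with the zero padding described above, not the authors' R code; the function name is ours):

```python
import numpy as np

def dynamic_scores_and_reconstruction(Xc, V, phi, L):
    """Scores (4.4) and one-component reconstruction (4.5) for component m.

    Xc  : (n, d) coefficient vectors of the centered curves
    V   : (d, d) Gram matrix of the basis functions
    phi : (2L+1, d) real filter coefficient vectors, phi[L+k] = phi_mk
    The series is padded with zeros (= E X_1) outside 1..n, as in the text.
    """
    n, d = Xc.shape
    Xpad = np.vstack([np.zeros((L, d)), Xc, np.zeros((L, d))])
    # Y_mt = sum_{k=-L}^{L} X'_{t-k} V phi_mk                          (4.4)
    Y = np.array([sum(Xpad[L + t - k] @ V @ phi[L + k] for k in range(-L, L + 1))
                  for t in range(n)])
    Ypad = np.concatenate([np.zeros(L), Y, np.zeros(L)])
    # Xhat_t = sum_{k=-L}^{L} Y_{m,t+k} phi_mk  (coefficient vectors)  (4.5)
    Xhat = np.array([sum(Ypad[L + t + k] * phi[L + k] for k in range(-L, L + 1))
                     for t in range(n)])
    return Y, Xhat
```

As a sanity check, with an orthonormal basis ($\mathbf{V} = \mathbf{I}$) and a one-term filter $L = 0$, the scores reduce to ordinary static projections and the reconstruction to the corresponding rank-one approximation.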
Parallel to (3.13), the proportion of variance explained by the first $p$ dynamic FPCs can be estimated through
$$\widehat{\mathrm{PV}}_{\mathrm{dyn}}(p) := \frac{\pi}{N_\theta} \sum_{m\le p} \sum_{j=-N_\theta}^{N_\theta} \hat\lambda_m(\pi j/N_\theta) \Big/ \frac{1}{n}\sum_{k=1}^{n} \|X_k\|^2.$$
We will use $(1 - \widehat{\mathrm{PV}}_{\mathrm{dyn}}(p))$ as a measure of the loss of information incurred when considering a dimension reduction to dimension $p$. Alternatively, one can also use the normalized mean squared error
$$\mathrm{NMSE}(p) := \sum_{k=1}^{n} \|X_k - \widehat{X}_k\|^2 \Big/ \sum_{k=1}^{n} \|X_k\|^2. \qquad (4.6)$$
Both quantities converge to the same limit.
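For curves sampled on a common grid, (4.6) is essentially a one-liner (illustrative Python; the function name is ours):

```python
import numpy as np

def nmse(X, X_hat):
    """Normalized mean squared error (4.6) for curves on a common grid.

    X, X_hat : (n, r) arrays whose rows are the discretized curves X_k
               and their p-term reconstructions.
    """
    return np.sum((X - X_hat) ** 2) / np.sum(X ** 2)
```

A perfect reconstruction gives an NMSE of 0, while the trivial reconstruction by the zero function gives 1.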
5 A real-life illustration
In this section, we draw a comparison between dynamic and static FPCA on the basis of a real data set. The observations are half-hourly measurements of the concentration (measured in µg/m³) of particulate matter with an aerodynamic diameter of less than 10 µm, abbreviated as PM10, in ambient air taken in Graz, Austria, from October 1, 2010 through March 31, 2011. Following Stadlober
et al. [36] and Aue et al. [1], a square-root transformation was performed in order to stabilize
the variance and avoid heavy-tailed observations. Also, we removed some outliers and a seasonal
(weekly) pattern induced from different traffic intensities on business days and weekends. Then we
use the software R to transform the raw data, which is discrete, to functional data, as explained
in Section 4, using 15 Fourier basis functions. The resulting curves for 175 daily observations,
X1, . . . , X175, say, roughly representing one winter season, for which pollution levels are known to
be high, are displayed in Figure 2.
Figure 2: A plot of 175 daily curves xt(u), 1 ≤ t ≤ 175, where xt(u) are the square-root transformed and detrended functional observations of PM10, based on 15 Fourier basis functions. The solid black line represents the sample mean curve µ̂(u).
From those data, we computed the (estimated) first dynamic FPC score sequence $(\widehat{Y}^{\mathrm{dyn}}_{1t} : 1 \le t \le 175)$. To this end, we centered the data at their empirical mean $\hat\mu(u)$, then implemented the procedure described in Section 4. We used the traditional Bartlett kernel $w(x) = 1 - |x|$ in (4.3) to obtain an estimator for the spectral density operator, with bandwidth $q = \lfloor n^{1/2}\rfloor = 13$. More sophisticated
estimation methods, such as those proposed, for example, by Politis [29], can of course be considered; but they also depend on additional tuning parameters, still leaving much of the selection to the practitioner's choice. From $\widehat{\mathbf{F}}^X_\theta$ we obtain the estimated filter elements $\hat\phi_{1\ell}$. It turns out that they fade away quite rapidly. In particular, $\sum_{\ell=-10}^{10} \|\hat\phi_{1\ell}\|^2 \approx 0.998$. Hence, for the calculation of the scores in (4.4), it is justified to choose $L = 10$. The five central filter elements $\hat\phi_{1\ell}(u)$, $\ell = -2, \dots, 2$, are plotted in Figure 3.
Figure 3: The five central filter elements φ̂1,−2(u), . . . , φ̂1,2(u) (from left to right).
Further components could be computed similarly, but for the purpose of demonstration we focus
on one component only. In fact, the first dynamic FPC already explains about 80% of the total
variance, compared to the 73% explained by the first static FPC. The latter was also computed,
resulting in the static FPC score sequence $(\widehat{Y}^{\mathrm{stat}}_{1t} : 1 \le t \le 175)$. Both sequences are shown in
Figure 4, along with their differences.
Figure 4: First static (left panel) and first dynamic (middle panel) FPC score sequences, plotted against time in days, and their differences (right panel).
Although based on entirely different ideas, the static and dynamic scores in Figure 4 (which, of course, are not loading the same functions) appear to be remarkably close to one another. The reason why the dynamic Karhunen–Loève expansion accounts for a significantly larger amount of the total variation is that, contrary to its static counterpart, it does not just involve the present observation.
To get more statistical insight into those results, let us consider the first static sample FPC, $\hat v_1(u)$, say, displayed in Figure 5. We see that $\hat v_1(u) \approx 1$ for all $u \in [0,1]$, so that the static FPC score $\widehat{Y}^{\mathrm{stat}}_{1t} = \int_0^1 (X_t(u) - \hat\mu(u))\,\hat v_1(u)\, du$ roughly coincides with the average deviation of $X_t(u)$ from
Figure 5: First static FPC v̂1(u) (solid line), and second static FPC v̂2(u) (dashed line) [left panel]. µ̂(u) ± v̂1(u) [middle panel] and µ̂(u) ± v̂2(u) [right panel] describe the effect of the first and second static FPC on the mean curve.
the sample mean $\hat\mu(u)$: the effect of a large (small) first score corresponds to a large (small) daily average of $\sqrt{\mathrm{PM10}}$. In view of the similarity between $\widehat{Y}^{\mathrm{dyn}}_{1t}$ and $\widehat{Y}^{\mathrm{stat}}_{1t}$, it is possible to attribute the same interpretation to the dynamic FPC scores. However, regarding the dynamic Karhunen–Loève expansion, dynamic FPC scores should be interpreted sequentially. To this end, let us take advantage of the fact that $\sum_{\ell=-1}^{1} \|\hat\phi_{1\ell}\|^2 \approx 0.92$. In the approximation by a single-term dynamic Karhunen–Loève expansion, we thus roughly have
$$X_t(u) \approx \hat\mu(u) + \sum_{\ell=-1}^{1} \widehat{Y}^{\mathrm{dyn}}_{1,t+\ell}\, \hat\phi_{1\ell}(u).$$
This suggests studying the impact of triples $(\widehat{Y}^{\mathrm{dyn}}_{1,t-1}, \widehat{Y}^{\mathrm{dyn}}_{1t}, \widehat{Y}^{\mathrm{dyn}}_{1,t+1})$ of consecutive scores on the pollution level of day $t$. We do this by adding the functions
$$\mathrm{eff}(\delta_{-1}, \delta_0, \delta_1) := \sum_{\ell=-1}^{1} \delta_\ell\, \hat\phi_{1\ell}(u), \qquad \text{with } \delta_i = \mathrm{const} \times (\pm 1),$$
to the overall mean curve $\hat\mu(u)$. In Figure 6, we do this with $\delta_i = \pm 1$. For instance, the upper left panel shows $\hat\mu(u) + \mathrm{eff}(-1,-1,-1)$, corresponding to the impact of three consecutive small dynamic FPC scores. The result is a negative shift of the mean curve. If two small scores are followed by a large one (second panel from the left in the top row), then the PM10 level increases as $u$ approaches 1. Since a large value of $\widehat{Y}^{\mathrm{dyn}}_{1,t+1}$ implies a large average concentration of $\sqrt{\mathrm{PM10}}$ on day $t+1$, and since the pollution curves are highly correlated at the transition from day $t$ to day $t+1$, this should indeed be reflected by a higher value of $\sqrt{\mathrm{PM10}}$ towards the end of day $t$. Similar interpretations can be given for the other panels in Figure 6.
It is interesting to observe that, in this example, the first dynamic FPC seems to take over the roles of the first two static FPCs. The second static FPC (see Figure 5) can indeed be interpreted as an intraday trend effect; if the second static score of day $t$ is large (small), then $X_t(u)$ is increasing (decreasing) over $u \in [0,1]$. Since we are working with sequentially dependent data, we can get information about such a trend from future and past observations, too. Hence, roughly speaking, we have
$$\sum_{\ell=-1}^{1} \widehat{Y}^{\mathrm{dyn}}_{1,t+\ell}\, \hat\phi_{1\ell}(u) \approx \sum_{m=1}^{2} \widehat{Y}^{\mathrm{stat}}_{mt}\, \hat v_m(u).$$
This is exemplified in Figure 1 of Section 1, which shows the ten consecutive curves $x_{71}(u) - \hat\mu(u), \dots, x_{80}(u) - \hat\mu(u)$ (left panel) and compares them to the single-term static (middle panel) and the single-term dynamic Karhunen–Loève expansions (right panel).
Figure 6: Mean curves µ̂(u) (solid line) and µ̂(u) + eff(δ−1, δ0, δ1), with δi = ±1 (dashed), plotted against intraday time. The eight panels correspond to (δ−1, δ0, δ1) = (−1,−1,−1), (−1,−1,+1), (−1,+1,−1), (−1,+1,+1), (+1,−1,−1), (+1,−1,+1), (+1,+1,−1) and (+1,+1,+1).
6 Simulation study
In this simulation study, we compare the performance of dynamic FPCA with that of static FPCA
for a variety of data-generating processes. For each simulated functional time series $(X_t)$, where $X_t = X_t(u)$, $u \in [0,1]$, we compute the static and dynamic scores, and recover the approximating series $(\widehat{X}^{\mathrm{stat}}_t(p))$ and $(\widehat{X}^{\mathrm{dyn}}_t(p))$ that result from the static and dynamic Karhunen–Loève expansions, respectively, of order $p$. The performances of these approximations are measured in terms of the corresponding normalized mean squared errors (NMSE)
$$\sum_{t=1}^{n} \|X_t - \widehat{X}^{\mathrm{stat}}_t(p)\|^2 \Big/ \sum_{t=1}^{n} \|X_t\|^2 \qquad \text{and} \qquad \sum_{t=1}^{n} \|X_t - \widehat{X}^{\mathrm{dyn}}_t(p)\|^2 \Big/ \sum_{t=1}^{n} \|X_t\|^2.$$
The smaller these quantities, the better the approximation.
Computations were implemented in R, using the fda package. The data were simulated according to a functional AR(1) model $X_{n+1} = \Psi(X_n) + \varepsilon_{n+1}$. In practice, this simulation has to be performed in finite dimension $d$, say. To this end, let $(v_i : i \in \mathbb{N})$ be the Fourier basis functions on $[0,1]$: for large $d$, due to the linearity of $\Psi$,
$$\langle X_{n+1}, v_j\rangle = \langle \Psi(X_n), v_j\rangle + \langle \varepsilon_{n+1}, v_j\rangle = \Big\langle \Psi\Big(\sum_{i=1}^{\infty} \langle X_n, v_i\rangle v_i\Big), v_j\Big\rangle + \langle \varepsilon_{n+1}, v_j\rangle \approx \sum_{i=1}^{d} \langle X_n, v_i\rangle \langle \Psi(v_i), v_j\rangle + \langle \varepsilon_{n+1}, v_j\rangle.$$
Hence, letting $\mathbf{X}_n = (\langle X_n, v_1\rangle, \dots, \langle X_n, v_d\rangle)'$ and $\boldsymbol{\varepsilon}_n = (\langle \varepsilon_n, v_1\rangle, \dots, \langle \varepsilon_n, v_d\rangle)'$, the first $d$ Fourier coefficients of $X_n$ approximately satisfy the VAR(1) equation
$$\mathbf{X}_{n+1} = \mathbf{P}\mathbf{X}_n + \boldsymbol{\varepsilon}_{n+1}, \qquad \text{where } \mathbf{P} = (\langle \Psi(v_i), v_j\rangle : 1 \le i, j \le d).$$
Based on this observation, we used
a VAR(1) model for generating the first $d$ Fourier coefficients of the process $(X_n)$. To obtain $\mathbf{P}$, we generate a matrix $\mathbf{G} = (G_{ij} : 1 \le i, j \le d)$, where the $G_{ij}$'s are mutually independent $N(0, \psi_{ij})$, and then set $\mathbf{P} := \kappa \mathbf{G}/\|\mathbf{G}\|$. Different choices of $\psi_{ij}$ are considered. Since $\Psi$ is bounded, we have $P_{ij} \to 0$ as $i, j \to \infty$. For the operators $\Psi_1$, $\Psi_2$ and $\Psi_3$, we used $\psi_{ij} = (i^2 + j^2)^{-1/2}$, $\psi_{ij} = (i^2/2 + j^3/2)^{-1}$, and $\psi_{ij} = e^{-(i+j)}$, respectively. For $d$ and $\kappa$, we considered the values $d = 15, 31, 51, 101$ and $\kappa = 0.1, 0.3, 0.6, 0.9$. The noise $(\varepsilon_t)$ is chosen as independent Gaussian and obtained as a linear combination of the functions $(v_i : 1 \le i \le d)$ with independent zero-mean normal coefficients $(C_i : 1 \le i \le d)$ such that $\mathrm{Var}(C_i) = \exp((i-1)/10)$. With this approach, we generate $n = 400$ observations. We then follow the methodology described in Section 4 and use the Bartlett kernel in (4.3) for estimation of the spectral density operator. The tuning parameter $q$ is set equal to $\sqrt{n} = 20$. A more sophisticated calibration could probably lead to even better results, but we also observed that moderate variations of $q$ do not fundamentally change our findings. The numerical integration for obtaining $\hat\phi_{mk}$ is performed on the basis of 1000 equidistant integration points. In (4.4), we chose $L = \min(L', 60)$, where $L' = \min\{j \ge 0 : \sum_{-j\le\ell\le j} \|\hat\phi_{m\ell}\|^2 \ge 0.99\}$. The limitation $L \le 60$ is imposed to keep computation times moderate. Usually, convergence is relatively fast.
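The data-generating mechanism described above can be sketched as follows (an illustrative NumPy translation, not the authors' R scripts; the use of the spectral norm for $\|\mathbf{G}\|$ and the burn-in length are our own assumptions):

```python
import numpy as np

def simulate_var1_coeffs(n, d, kappa, psi, rng, burn_in=200):
    """Simulate the VAR(1) coefficient process of Section 6:
    X_{t+1} = P X_t + eps_{t+1}, with P = kappa * G / ||G||,
    G_ij independent N(0, psi(i, j)), and noise coefficients
    C_i independent N(0, exp((i - 1) / 10)), as stated in the text.
    """
    i, j = np.indices((d, d)) + 1                       # 1-based indices
    G = rng.normal(size=(d, d)) * np.sqrt(psi(i, j))
    P = kappa * G / np.linalg.norm(G, 2)                # spectral norm: ||P|| = kappa
    noise_sd = np.exp((np.arange(1, d + 1) - 1) / 20)   # sd = sqrt(Var(C_i))
    x = np.zeros(d)
    out = np.empty((n, d))
    for t in range(n + burn_in):                        # burn-in towards stationarity
        x = P @ x + noise_sd * rng.normal(size=d)
        if t >= burn_in:
            out[t - burn_in] = x
    return out

coeffs = simulate_var1_coeffs(400, 15, 0.6,
                              lambda i, j: (i ** 2 + j ** 2) ** -0.5,
                              np.random.default_rng(1))
```

Since $\|\mathbf{P}\| = \kappa < 1$, the recursion is stable and the burn-in brings the chain close to its stationary distribution before the $n = 400$ observations are recorded.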
For each choice of d and κ, the experiment as described above is repeated 200 times. The mean
and standard deviation of NMSE in different settings and with values p = 1, 2, 3, 6 are reported in
Table 1. Results do not vary much among setups with d ≥ 31, and thus in Table 1 we only present
the cases d = 15 and d = 101.
We see that, in basically all settings, dynamic FPCA significantly outperforms static FPCA in terms of NMSE. As one could expect, the difference becomes more striking with increasing dependence coefficient $\kappa$. It is also interesting to observe that the variation of NMSE among the 200 replications is systematically smaller for the dynamic procedure.
Finally, it should be noted that, in contrast to static FPCA, the empirical version of our procedure is not "exact", but is subject to small approximation errors. These approximation errors can stem from numerical integration (which is required in the calculation of $\hat\phi_{mk}$) and are also due to the truncation of the filters at some finite lag $L$ (see Section 4). Such small deviations do not matter in practice if a component explains a significant proportion of variance. If, however, the additional contribution of a higher-order component is very small, then it can happen that it does not compensate for a possible approximation error. This becomes visible in the setting $\Psi_3$ with 3 or 6 components, where for some constellations the NMSE for dynamic components is slightly larger than for the static ones.
                      1 component                2 components               3 components               6 components
  d     κ        static        dynamic       static        dynamic       static        dynamic       static        dynamic

Ψ1
  15   0.1   0.697 (0.16)  0.637 (0.13)  0.546 (0.15)  0.447 (0.10)  0.443 (0.12)  0.325 (0.08)  0.256 (0.08)  0.138 (0.05)
       0.3   0.696 (0.16)  0.621 (0.14)  0.542 (0.15)  0.434 (0.11)  0.440 (0.13)  0.314 (0.08)  0.253 (0.08)  0.132 (0.05)
       0.6   0.687 (0.32)  0.571 (0.23)  0.526 (0.25)  0.392 (0.15)  0.423 (0.20)  0.283 (0.11)  0.240 (0.11)  0.119 (0.06)
       0.9   0.648 (0.76)  0.479 (0.47)  0.481 (0.56)  0.322 (0.29)  0.377 (0.43)  0.229 (0.20)  0.209 (0.22)  0.096 (0.09)
  101  0.1   0.805 (0.12)  0.740 (0.08)  0.708 (0.11)  0.587 (0.08)  0.642 (0.12)  0.478 (0.07)  0.519 (0.08)  0.274 (0.05)
       0.3   0.802 (0.13)  0.729 (0.11)  0.704 (0.12)  0.577 (0.09)  0.637 (0.11)  0.469 (0.08)  0.515 (0.10)  0.269 (0.05)
       0.6   0.792 (0.22)  0.690 (0.18)  0.689 (0.19)  0.545 (0.12)  0.619 (0.16)  0.441 (0.10)  0.495 (0.13)  0.252 (0.07)
       0.9   0.755 (0.66)  0.616 (0.45)  0.640 (0.50)  0.479 (0.31)  0.568 (0.40)  0.387 (0.23)  0.446 (0.34)  0.220 (0.15)

Ψ2
  15   0.1   0.524 (0.20)  0.491 (0.17)  0.355 (0.14)  0.306 (0.11)  0.263 (0.10)  0.208 (0.08)  0.129 (0.05)  0.082 (0.03)
       0.3   0.522 (0.21)  0.473 (0.18)  0.351 (0.16)  0.294 (0.12)  0.259 (0.12)  0.200 (0.08)  0.126 (0.06)  0.078 (0.04)
       0.6   0.507 (0.49)  0.413 (0.29)  0.331 (0.29)  0.255 (0.15)  0.240 (0.19)  0.174 (0.10)  0.114 (0.08)  0.068 (0.05)
       0.9   0.458 (1.15)  0.310 (0.59)  0.272 (0.64)  0.187 (0.32)  0.193 (0.41)  0.130 (0.21)  0.088 (0.17)  0.052 (0.09)
  101  0.1   0.585 (0.19)  0.549 (0.17)  0.436 (0.15)  0.378 (0.11)  0.356 (0.13)  0.282 (0.10)  0.240 (0.08)  0.146 (0.05)
       0.3   0.581 (0.21)  0.530 (0.18)  0.436 (0.12)  0.369 (0.11)  0.350 (0.13)  0.274 (0.09)  0.234 (0.10)  0.141 (0.06)
       0.6   0.564 (0.46)  0.469 (0.27)  0.405 (0.33)  0.321 (0.18)  0.323 (0.21)  0.242 (0.13)  0.212 (0.12)  0.125 (0.07)
       0.9   0.495 (1.06)  0.362 (0.59)  0.345 (0.68)  0.250 (0.39)  0.251 (0.58)  0.180 (0.34)  0.168 (0.26)  0.097 (0.14)

Ψ3
  15   0.1   0.367 (0.20)  0.344 (0.18)  0.134 (0.08)  0.127 (0.07)  0.049 (0.03)  0.054 (0.04)  0.002 (0.00)  0.017 (0.03)
       0.3   0.362 (0.24)  0.322 (0.17)  0.129 (0.09)  0.119 (0.07)  0.048 (0.03)  0.050 (0.04)  0.002 (0.00)  0.015 (0.03)
       0.6   0.334 (0.55)  0.253 (0.24)  0.113 (0.16)  0.097 (0.09)  0.041 (0.05)  0.040 (0.04)  0.002 (0.00)  0.011 (0.02)
       0.9   0.236 (1.12)  0.146 (0.43)  0.074 (0.28)  0.061 (0.16)  0.025 (0.08)  0.027 (0.07)  0.001 (0.00)  0.008 (0.04)
  101  0.1   0.366 (0.19)  0.344 (0.17)  0.134 (0.08)  0.127 (0.07)  0.049 (0.03)  0.054 (0.04)  0.002 (0.00)  0.017 (0.03)
       0.3   0.363 (0.25)  0.322 (0.18)  0.131 (0.10)  0.120 (0.07)  0.047 (0.03)  0.050 (0.04)  0.002 (0.00)  0.015 (0.03)
       0.6   0.325 (0.52)  0.251 (0.24)  0.113 (0.16)  0.098 (0.09)  0.040 (0.05)  0.040 (0.04)  0.002 (0.00)  0.011 (0.02)
       0.9   0.235 (1.05)  0.149 (0.43)  0.074 (0.28)  0.061 (0.16)  0.025 (0.09)  0.026 (0.07)  0.001 (0.00)  0.008 (0.04)

Table 1: Results of the simulations of Section 6. Each cell reports the mean NMSE for the static and dynamic procedures resulting from 200 simulation runs; the numbers in brackets are standard deviations multiplied by a factor 10. The values κ give the size of ‖Ψi‖L, i = 1, 2, 3. We consider dimensions d = 15 and d = 101 of the underlying models.
7 Conclusion
Functional principal component analysis is taking a leading role in the functional data literature. As
an extremely effective tool for dimension reduction, it is useful for empirical data analysis as well as
for many FDA-related methods, like functional linear models. A frequent situation in practice is that
functional data are observed sequentially over time and exhibit serial dependence. This happens, for
instance, when observations stem from a continuous-time process which is segmented into smaller
units, e.g., days. In such cases, classical static FPCA may still be useful but, in contrast to the i.i.d. setup, it does not lead to an optimal dimension-reduction technique.
In this paper, we propose a dynamic version of FPCA which takes advantage of the potential serial
dependencies in the functional observations. In the special case of uncorrelated data, the dynamic
FPC methodology reduces to the usual static one. But, in the presence of serial dependence, static
FPCA is (quite significantly, if serial dependence is strong) outperformed.
This paper also provides (i) guidelines for practical implementation, (ii) a toy example with
PM10 air pollution data, and (iii) a simulation study. Our application provides empirical evidence that dynamic FPCs have a clear edge over static FPCs in terms of their ability to represent
dependent functional data in small dimension. In the appendices, our results are cast into a rigorous
mathematical framework, and we show that the proposed estimators of dynamic FPC scores are
consistent.
Appendices of Chapter 3
A General methodology and proofs
In this appendix, we give a mathematically rigorous description of the methodology introduced in
Section 3.1. We adopt a more general framework which can be specialized to the functional setup of
Section 3.1. Throughout, H denotes some (complex) separable Hilbert space equipped with norm
‖ · ‖ and inner product 〈·, ·〉. We work in complex spaces, since our theory is based on a frequency
domain analysis. Nevertheless, all our functional time series observations Xt are assumed to be
real-valued functions.
A.1 Fourier series in Hilbert spaces.
For $p \ge 1$, consider the space $L^p_H([-\pi,\pi])$, that is, the space of measurable mappings $x : [-\pi,\pi] \to H$ such that $\int_{-\pi}^{\pi} \|x(\theta)\|^p\, d\theta < \infty$. Then, $\|x\|_p = \big(\frac{1}{2\pi}\int_{-\pi}^{\pi} \|x(\theta)\|^p\, d\theta\big)^{1/p}$ defines a norm. Equipped with this norm, $L^p_H([-\pi,\pi])$ is a Banach space and, for $p = 2$, a Hilbert space with inner product
$$(x, y) := \frac{1}{2\pi}\int_{-\pi}^{\pi} \langle x(\theta), y(\theta)\rangle\, d\theta.$$
One can show (see, e.g., [8, Lemma 1.4]) that, for any $x \in L^1_H([-\pi,\pi])$, there exists a unique element $I(x) \in H$ which satisfies
$$\int_{-\pi}^{\pi} \langle x(\theta), v\rangle\, d\theta = \langle I(x), v\rangle \quad \forall v \in H. \qquad (A.1)$$
We define $\int_{-\pi}^{\pi} x(\theta)\, d\theta := I(x)$.
For $x \in L^2_H([-\pi,\pi])$, define the $k$-th Fourier coefficient as
$$f_k := \frac{1}{2\pi}\int_{-\pi}^{\pi} x(\theta)\, e^{-ik\theta}\, d\theta, \quad k \in \mathbb{Z}. \qquad (A.2)$$
Below, we write $e_k$ for the function $\theta \mapsto e^{ik\theta}$, $\theta \in [-\pi,\pi]$.
Proposition 6. Suppose $x \in L^2_H([-\pi,\pi])$ and define $f_k$ by equation (A.2). Then, the sequence $S_n := \sum_{k=-n}^{n} f_k e_k$ has a mean square limit in $L^2_H([-\pi,\pi])$. If we denote the limit by $S$, then $x(\theta) = S(\theta)$ for almost all $\theta$.
Proof. See supplementary document.
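Proposition 6 can be illustrated numerically: for an $H = \mathbb{R}^2$-valued trigonometric polynomial, the quadrature-based Fourier coefficients (A.2) reconstruct $x$ exactly once the partial sum covers its degree, and the $L^2_H$ error of $S_m$ decreases as $m$ grows (illustrative Python; the quadrature grid and helper names are our own):

```python
import numpy as np

# H = R^2; x : [-pi, pi] -> H is a vector-valued trigonometric polynomial of degree 2.
theta = np.linspace(-np.pi, np.pi, 4001)
dth = theta[1] - theta[0]
x = np.stack([np.cos(theta) + 0.5 * np.sin(2 * theta),
              np.sin(theta) ** 2], axis=1)                 # shape (len(theta), 2)

def integrate(y):
    # trapezoidal rule over theta (first axis)
    return (y.sum(axis=0) - 0.5 * (y[0] + y[-1])) * dth

def fourier_coeff(k):
    # f_k = (1/2pi) int x(theta) e^{-ik theta} dtheta, cf. (A.2)
    return integrate(x * np.exp(-1j * k * theta)[:, None]) / (2 * np.pi)

def partial_sum(m):
    # S_m(theta) = sum_{k=-m}^{m} f_k e^{ik theta}
    S = np.zeros_like(x, dtype=complex)
    for k in range(-m, m + 1):
        S += fourier_coeff(k)[None, :] * np.exp(1j * k * theta)[:, None]
    return S

def l2_error(m):
    # norm of x - S_m in L^2_H([-pi, pi])
    return np.sqrt(integrate(np.sum(np.abs(x - partial_sum(m)) ** 2, axis=1)) / (2 * np.pi))
```

Since $x$ has degree 2, the error is essentially zero from $m = 2$ onwards, while $S_0$ and $S_1$ leave the corresponding higher-frequency terms unexplained.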
Let us turn to the Fourier expansion of the eigenfunctions $\varphi_m(\theta)$ used in the definition of the dynamic FPCs. Eigenvectors are scaled to unit length: $\|\varphi_m(\theta)\|_2 = 1$. In order for $\varphi_m$ to belong to $L^2_H([-\pi,\pi])$, we additionally need measurability, which cannot be taken for granted: since $\|z\varphi_m(\theta)\|_2 = 1$ for all $z$ on the complex unit circle, we could in principle choose the "signs" $z = z(\theta)$ in an extremely erratic way, such that $\theta \mapsto \varphi_m(\theta)$ is no longer measurable. To exclude such pathological choices, we tacitly impose in the sequel that versions of $\varphi_m(\theta)$ have been chosen in a "smooth enough way" to be measurable.
Now we can expand the eigenfunctions $\varphi_m(\theta)$ in a Fourier series in the sense explained above:
$$\varphi_m = \sum_{\ell\in\mathbb{Z}} \phi_{m\ell}\, e_\ell \quad \text{with} \quad \phi_{m\ell} = \frac{1}{2\pi}\int_{-\pi}^{\pi} \varphi_m(s)\, e^{-i\ell s}\, ds.$$
The coefficients $\phi_{m\ell}$ thus defined yield the definition (3.10) of dynamic FPCs. In the special case $H = L^2([0,1])$, $\phi_{m\ell} = \phi_{m\ell}(u)$ satisfies, by (A.1),
$$\int_0^1 \phi_{m\ell}(u)\, v(u)\, du = \frac{1}{2\pi}\int_{-\pi}^{\pi}\int_0^1 \varphi_m(u|s)\, v(u)\, du\; e^{-i\ell s}\, ds = \int_0^1 \Big(\frac{1}{2\pi}\int_{-\pi}^{\pi} \varphi_m(u|s)\, e^{-i\ell s}\, ds\Big) v(u)\, du \quad \forall v \in H.$$
This implies that $\phi_{m\ell}(u) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \varphi_m(u|s)\, e^{-i\ell s}\, ds$ for almost all $u \in [0,1]$, which is in line with the definition given in (3.9). Furthermore, (3.8) follows directly from Proposition 6.
A.2 The spectral density operator
Assume that the $H$-valued process $(X_t : t \in \mathbb{Z})$ is stationary with lag-$h$ autocovariance operator $C^X_h$ and spectral density operator
$$F^X_\theta := \frac{1}{2\pi}\sum_{h\in\mathbb{Z}} C^X_h e^{-ih\theta}. \qquad (A.3)$$
Let $S(H,H')$ be the set of Hilbert-Schmidt operators mapping from $H$ to $H'$ (both assumed to be separable Hilbert spaces). When $H = H'$ and when it is clear which space $H$ is meant, we sometimes simply write $S$. With the Hilbert-Schmidt norm $\|\cdot\|_{S(H,H')}$, this defines again a separable Hilbert space, and so does $L^2_{S(H,H')}([-\pi,\pi])$. We will impose that the series in (A.3) converges in $L^2_{S(H,H)}([-\pi,\pi])$; we then say that $(X_t)$ possesses a spectral density operator.
Remark 3. It follows that the results of the previous section can be applied. In particular, we may deduce that $C^X_k = \int_{-\pi}^{\pi} F^X_\theta\, e^{ik\theta}\, d\theta$.
A sufficient condition for convergence of (A.3) in $L^2_{S(H,H)}([-\pi,\pi])$ is Assumption (3.5). Then, it can easily be shown that the operator $F^X_\theta$ is self-adjoint, non-negative definite and Hilbert-Schmidt. Below, we introduce a weak dependence assumption established in [18], from which we can derive a sufficient condition for (3.5).
Definition 4 ($L^p$–$m$–approximability). A random $H$-valued sequence $(X_n : n \in \mathbb{Z})$ is called $L^p$–$m$–approximable if it can be represented as $X_n = f(\delta_n, \delta_{n-1}, \delta_{n-2}, \dots)$, where the $\delta_i$'s are i.i.d. elements taking values in some measurable space $S$ and $f$ is a measurable function $f : S^\infty \to H$. Moreover, if $\delta_1', \delta_2', \dots$ are independent copies of $\delta_1, \delta_2, \dots$ defined on the same probability space, then, for
$$X_n^{(m)} := f(\delta_n, \delta_{n-1}, \dots, \delta_{n-m+1}, \delta_{n-m}', \delta_{n-m-1}', \dots),$$
we have
$$\sum_{m=1}^{\infty} \big(E\|X_m - X_m^{(m)}\|^p\big)^{1/p} < \infty. \qquad (A.4)$$
Hörmann and Kokoszka [18] show that this notion is widely applicable to linear and non-linear functional time series. One of its main advantages is that it is a purely moment-based dependence measure that can be easily verified in many special cases.
Proposition 7. Assume that (Xt) is L2–m–approximable. Then (3.5) holds and the operators FXθ ,
θ ∈ [−π, π], are trace-class.
Proof. See supplementary document.
Instead of Assumption (3.5), Panaretos and Tavakoli [27] impose, for the definition of a spectral density operator, summability of $C^X_h$ in Schatten 1-norm, that is, $\sum_{h\in\mathbb{Z}} \|C^X_h\|_T < \infty$. Under this slightly more stringent assumption, it immediately follows that the resulting spectral density operator is trace-class. The verification of convergence may, however, be a bit delicate; at least, we could not find a simple criterion as in Proposition 7.
Proposition 8. Let $F^X_\theta$ be the spectral density operator of a stationary sequence $(X_t)$ for which the summability condition (3.5) holds. Let $\lambda_1(\theta) \ge \lambda_2(\theta) \ge \cdots$ denote its eigenvalues and $\varphi_m(\theta)$ be the corresponding eigenfunctions. Then, (a) the functions $\theta \mapsto \lambda_m(\theta)$ are continuous; (b) if we strengthen (3.5) into the more stringent condition $\sum_{h\in\mathbb{Z}} |h|\,\|C^X_h\|_S < \infty$, the $\lambda_m(\theta)$'s are Lipschitz-continuous functions of $\theta$; (c) assuming that $(X_t)$ is real-valued, for each $\theta \in [-\pi,\pi]$, $\lambda_m(\theta) = \lambda_m(-\theta)$ and $\varphi_m(\theta) = \overline{\varphi_m(-\theta)}$.
Proof. See supplementary document.
Let $\bar x$ be the conjugate element of $x$, i.e., $\langle \bar x, z\rangle = \langle \bar z, x\rangle$ for all $z \in H$. Then $x$ is real-valued iff $x = \bar x$.
Remark 4. Since $\theta \mapsto \varphi_m(\theta)$ is Hermitian in the above sense, it immediately follows that $\phi_{m\ell} = \overline{\phi_{m\ell}}$, implying that the dynamic FPCs are real if the process $(X_t)$ is.
A.3 Functional filters
Computation of dynamic FPCs requires applying time-invariant functional filters to the process $(X_t)$. Let $\Psi = (\Psi_k : k \in \mathbb{Z})$ be a sequence of linear operators mapping the separable Hilbert space $H$ to the separable Hilbert space $H'$. Let $B$ be the backshift or lag operator, defined by $B^k X_t := X_{t-k}$, $k \in \mathbb{Z}$. Then the functional filter $\Psi(B) := \sum_{k\in\mathbb{Z}} \Psi_k B^k$, when applied to the sequence $(X_t)$, produces an output series $(Y_t)$ in $H'$ via
$$Y_t = \Psi(B) X_t = \sum_{k\in\mathbb{Z}} \Psi_k(X_{t-k}). \qquad (A.5)$$
Call $\Psi$ the sequence of filter coefficients and, in the style of the scalar or vector time series terminology, call
$$\Psi_\theta = \Psi(e^{-i\theta}) = \sum_{k\in\mathbb{Z}} \Psi_k e^{-ik\theta} \qquad (A.6)$$
the frequency response function of the filter $\Psi(B)$. Of course, the series (A.5) and (A.6) only have a meaning if they converge in an appropriate sense. Below, we use the following technical result.
Proposition 9. Suppose that $(X_t)$ is a stationary sequence in $L^2_H$ and possesses a spectral density operator satisfying $\sup_\theta \mathrm{tr}(F^X_\theta) < \infty$. Consider a filter $(\Psi_k)$ such that $\Psi_\theta$ converges in $L^2_{S(H,H')}([-\pi,\pi])$, and suppose that $\sup_\theta \|\Psi_\theta\|_{S(H,H')} < \infty$. Then,
(i) the series $Y_t := \sum_{k\in\mathbb{Z}} \Psi_k(X_{t-k})$ converges in $L^2_{H'}$;
(ii) $(Y_t)$ possesses the spectral density operator $F^Y_\theta = \Psi_\theta F^X_\theta (\Psi_\theta)^*$;
(iii) $\sup_\theta \mathrm{tr}(F^Y_\theta) < \infty$.
Proof. See supplementary document.
In particular, the last proposition allows for iterated applications: if $\sup_\theta \mathrm{tr}(F^X_\theta) < \infty$ and $\Psi_\theta$ satisfies the above properties, then analogous results apply to the output $(Y_t)$. This is what we are using in the proofs of Theorems 1 and 2.
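A scalar sanity check of Proposition 9 (ii): filtering white noise with a short moving average and comparing a lag-window estimate of the output spectral density against $|\Psi_\theta|^2/(2\pi)$ (illustrative Python; the sample size and bandwidth are chosen ad hoc):

```python
import numpy as np

# For white noise X_t (F^X = 1/(2*pi)) filtered by Y_t = X_t + 0.5 X_{t-1}, the
# output spectral density is F^Y(theta) = |Psi_theta|^2 / (2*pi),
# with frequency response Psi_theta = 1 + 0.5 e^{-i theta}.
rng = np.random.default_rng(3)
n = 200_000
X = rng.normal(size=n)
Y = X[1:] + 0.5 * X[:-1]

theta = 1.0
target = np.abs(1 + 0.5 * np.exp(-1j * theta)) ** 2 / (2 * np.pi)

# crude Bartlett lag-window estimate of F^Y(theta)
q, m = 50, len(Y)
F_hat = np.mean(Y ** 2)                                  # lag-0 autocovariance
for h in range(1, q + 1):
    c_h = np.dot(Y[h:], Y[:m - h]) / m
    F_hat += 2 * (1 - h / q) * c_h * np.cos(h * theta)   # Bartlett weights
F_hat /= 2 * np.pi
```

The estimate agrees with the theoretical value up to the usual estimation error of order $\sqrt{q/n}$.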
A.4 Proofs for Section 3
To start with, observe that Propositions 2 and 4 directly follow from Proposition 9. Part (a) of Proposition 3 has also been established in the previous section (see Remark 4), and part (b) is immediate. Thus, we can proceed to the proof of Theorems 1 and 2.
Proof of Theorems 2 and 3. Assume we have filter coefficients $\Psi = (\Psi_k : k \in \mathbb{Z})$ and $\Upsilon = (\Upsilon_k : k \in \mathbb{Z})$, where $\Psi_k : H \to \mathbb{C}^p$ and $\Upsilon_k : \mathbb{C}^p \to H$ both belong to the class $\mathcal{C}$. If $(X_t)$ and $(Y_t)$ are $H$-valued and $\mathbb{C}^p$-valued processes, respectively, then there exist elements $\psi_{mk}$ and $\upsilon_{mk}$ in $H$ such that
$$\Psi(B)(X_t) = \sum_{k\in\mathbb{Z}} \big(\langle X_{t-k}, \psi_{1k}\rangle, \dots, \langle X_{t-k}, \psi_{pk}\rangle\big)'$$
and
$$\Upsilon(B)(Y_t) = \sum_{\ell\in\mathbb{Z}} \sum_{m=1}^{p} Y_{t+\ell,m}\, \upsilon_{m\ell}.$$
Hence, the $p$-dimensional reconstruction of $X_t$ in Theorem 3 is of the form
$$\sum_{m=1}^{p} \widehat{X}_{mt} = \Upsilon(B)[\Psi(B) X_t] =: \Upsilon\Psi(B) X_t.$$
Since $\Psi$ and $\Upsilon$ are required to belong to $\mathcal{C}$, we conclude from Proposition 9 that the processes $Y_t := \Psi(B) X_t$ and $\widehat{X}_t = \Upsilon(B) Y_t$ are mean-square convergent and possess a spectral density operator. Letting $\psi_m(\theta) = \sum_{k\in\mathbb{Z}} \psi_{mk} e^{ik\theta}$ and $\upsilon_m(\theta) = \sum_{\ell\in\mathbb{Z}} \upsilon_{m\ell} e^{i\ell\theta}$, we obtain, for $x \in H$ and $y = (y_1, \dots, y_p)' \in \mathbb{C}^p$, that the frequency response functions $\Psi_\theta$ and $\Upsilon_\theta$ satisfy
On $F'$, the integrand in (B.3) is greater than or equal to $\langle \varphi_m(\theta), v\rangle/2$. On $F''$, the inequality $\cos(z(\theta))\langle \varphi_m(\theta), v\rangle > \langle \varphi_m(\theta), v\rangle/2$ holds, and consequently
$$\langle \varphi_m(\theta), v\rangle\, |\sin(z(\theta))| > \frac{\langle \varphi_m(\theta), v\rangle}{2}\, |\sin(z(\theta))| > \frac{\langle \varphi_m(\theta), v\rangle}{\pi}\, |z(\theta)| \ge \frac{\langle \varphi_m(\theta), v\rangle}{\pi}\, \varepsilon'.$$
Altogether, this yields that the integrand in (B.3) is larger than or equal to $\langle \varphi_m(\theta), v\rangle \varepsilon'/\pi$. Now, it is easy to see that, due to Assumption B.3, (B.2) cannot hold. This leads to a contradiction.
Thus, we can conclude that $\max_{j\in\mathbb{Z}} \|\hat\phi_{mj} - \phi_{mj}\| = o_P(1)$, so that, for sufficiently slowly growing $L$, we also have $L \max_{j\in\mathbb{Z}} \|\hat\phi_{mj} - \phi_{mj}\| = o_P(1)$. Consequently,
$$\Big|\sum_{|j|\le L} \langle X_{k-j}, \hat\phi_{mj} - \phi_{mj}\rangle\Big| = o_P(1) \times \Big(L^{-1}\sum_{j=-L}^{L} \|X_{k-j}\|\Big). \qquad (B.4)$$
It remains to show that $L^{-1}\sum_{j=-L}^{L} \|X_{k-j}\| = O_P(1)$. By the weak stationarity assumption, we have $E\|X_k\|^2 = E\|X_1\|^2$, and hence, for any $x > 0$,
$$P\Big(L^{-1}\sum_{j=-L}^{L} \|X_{k-j}\| > x\Big) \le \frac{\sum_{j=-L}^{L} E\|X_{k-j}\|}{Lx} \le \frac{3\sqrt{E\|X_1\|^2}}{x}.$$
Lemma 3. Let $L = L(n) \to \infty$. Then, under condition (3.5), we have
$$\Big|\sum_{|j|>L} \langle X_{k-j}, \phi_{mj}\rangle\Big| = o_P(1).$$
Proof. This is immediate from Proposition 4, part (a).
Turning to the proof of Proposition 5, we first establish the following lemma, which is an extension to lag-$h$ autocovariance operators of a consistency result from [18] on the empirical covariance operator. Define, for $|h| < n$,
$$\widehat{C}_h = \frac{1}{n}\sum_{k=1}^{n-h} X_{k+h} \otimes X_k, \quad h \ge 0, \qquad \text{and} \qquad \widehat{C}_h = \widehat{C}_{-h}^{\,*}, \quad h < 0.$$
Lemma 4. Assume that $(X_t : t \in \mathbb{Z})$ is an $L^4$–$m$–approximable series. Then, for all $|h| < n$, $E\|\widehat{C}_h - C_h\|_S \le U\sqrt{(|h| \vee 1)/n}$, where the constant $U$ depends neither on $n$ nor on $h$.
Proof. See supplementary document.
Proof of Proposition 5. By the triangle inequality,
$$2\pi\|\widehat{F}^X_\theta - F^X_\theta\|_S = \Big\|\sum_{h\in\mathbb{Z}} C_h e^{-ih\theta} - \sum_{h=-q}^{q} \Big(1 - \frac{|h|}{q}\Big)\widehat{C}_h e^{-ih\theta}\Big\|_S$$
$$\le \Big\|\sum_{h=-q}^{q} \Big(1 - \frac{|h|}{q}\Big)(\widehat{C}_h - C_h)\, e^{-ih\theta}\Big\|_S + \Big\|\frac{1}{q}\sum_{h=-q}^{q} |h|\, C_h e^{-ih\theta}\Big\|_S + \Big\|\sum_{|h|>q} C_h e^{-ih\theta}\Big\|_S$$
$$\le \sum_{h=-q}^{q} \Big(1 - \frac{|h|}{q}\Big)\|\widehat{C}_h - C_h\|_S + \frac{1}{q}\sum_{h=-q}^{q} |h|\,\|C_h\|_S + \sum_{|h|>q} \|C_h\|_S.$$
The last two terms tend to 0 by condition (3.5) and Kronecker's lemma. For the first term, we may use Lemma 4. Taking expectations, we obtain that, for some $U_1$,
$$\sum_{h=-q}^{q} \Big(1 - \frac{|h|}{q}\Big) E\|\widehat{C}_h - C_h\|_S \le U_1 \frac{q^{3/2}}{\sqrt{n}}.$$
Note that the bound does not depend on $\theta$; hence $q^3 = o(n)$ and condition (3.5) jointly imply that $\sup_{\theta\in[-\pi,\pi]} E\|\widehat{F}^X_\theta - F^X_\theta\|_S \to 0$ as $n \to \infty$.
C Technical results and background
C.1 Linear operators
Consider the class L(H,H ′) of bounded linear operators between two Hilbert spaces H and H ′.
For Ψ ∈ L(H,H ′), the operator norm is defined as ‖Ψ‖L := sup‖x‖≤1 ‖Ψ(x)‖. The simplest operators
can be defined via a tensor product v ⊗ w; then v ⊗ w(z) := v〈z, w〉. Every operator Ψ ∈ L(H,H ′)
possesses an adjoint Ψ∗ ∈ L(H ′, H), which satisfies 〈Ψ(x), y〉 = 〈x,Ψ∗(y)〉 for all x ∈ H and y ∈ H ′.It holds that ‖Ψ∗‖L = ‖Ψ‖L. If H = H ′, then Ψ is called self-adjoint if Ψ = Ψ∗. It is called
non-negative definite if 〈Ψx, x〉 ≥ 0 for all x ∈ H.
A linear operator Ψ ∈ L(H,H′) is said to be Hilbert–Schmidt if, for some orthonormal basis (v_k : k ≥ 1) of H, we have ‖Ψ‖_S² := ∑_{k≥1} ‖Ψ(v_k)‖² < ∞. Then ‖Ψ‖_S defines a norm, the so-called Hilbert–Schmidt norm of Ψ, which bounds the operator norm (‖Ψ‖_L ≤ ‖Ψ‖_S) and can be shown to be independent of the choice of the orthonormal basis. Every Hilbert–Schmidt operator is compact. The class of Hilbert–Schmidt operators between H and H′ again defines a separable Hilbert space, with inner product 〈Ψ,Θ〉_S := ∑_{k≥1} 〈Ψ(v_k), Θ(v_k)〉; we denote this class by S(H,H′).
If Ψ ∈ L(H,H′) and Υ ∈ L(H′′, H), then ΨΥ is the operator mapping x ∈ H′′ to Ψ(Υ(x)) ∈ H′. Assume that Ψ is a compact operator in L(H,H′) and let (s_j²) be the eigenvalues of Ψ∗Ψ. Then Ψ is said to be trace class if ‖Ψ‖_T := ∑_{j≥1} s_j < ∞. In this case, ‖Ψ‖_T defines a norm, the so-called Schatten 1-norm. We have that ‖Ψ‖_S ≤ ‖Ψ‖_T, and hence any trace-class operator is Hilbert–Schmidt. For self-adjoint non-negative operators, it holds that ‖Ψ‖_T = tr(Ψ) := ∑_{k≥1} 〈Ψ(v_k), v_k〉. If ΨΨ = Ψ, then we have tr(Ψ) = ‖Ψ‖_S².
For further background on the theory of linear operators we refer to [13].
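In finite dimensions these are the familiar Schatten norms of a matrix, computable from its singular values. A small numerical sketch (our own illustration, not part of the thesis) verifies the chain ‖Ψ‖_L ≤ ‖Ψ‖_S ≤ ‖Ψ‖_T:

```python
import numpy as np

# For a matrix Psi with singular values s_1 >= s_2 >= ..., the norms above are:
#   operator norm       = s_1
#   Hilbert-Schmidt norm = sqrt(sum s_j^2)  (the Frobenius norm)
#   trace (Schatten-1) norm = sum s_j
rng = np.random.default_rng(1)
Psi = rng.standard_normal((4, 6))
s = np.linalg.svd(Psi, compute_uv=False)  # singular values, descending

op_norm = s[0]
hs_norm = np.sqrt((s**2).sum())
tr_norm = s.sum()

assert np.isclose(hs_norm, np.linalg.norm(Psi, 'fro'))
assert op_norm <= hs_norm <= tr_norm  # ||.||_L <= ||.||_S <= ||.||_T
```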
C.2 Random sequences in Hilbert spaces
All random elements that appear in the sequel are assumed to be defined on a common probability
space (Ω, A, P). We write X ∈ L^p_H(Ω, A, P) (in short, X ∈ L^p_H) if X is an H-valued random variable such that E‖X‖^p < ∞. Every element X ∈ L¹_H possesses an expectation, which is the unique µ ∈ H satisfying E〈X, y〉 = 〈µ, y〉 for all y ∈ H. Provided that X and Y are in L²_H, we can define the cross-covariance operator as C_XY := E(X − µ_X) ⊗ (Y − µ_Y), where µ_X and µ_Y are the expectations of X and Y, respectively. We have that ‖C_XY‖_T ≤ E‖(X − µ_X) ⊗ (Y − µ_Y)‖_T = E‖X − µ_X‖‖Y − µ_Y‖, and so these operators are trace class. An important specific role is played by the covariance operator C_XX. This operator is non-negative definite and self-adjoint, with tr(C_XX) = E‖X − µ_X‖². An H-valued process (X_t) is called (weakly) stationary if X_t ∈ L²_H, and EX_t and C_{X_{t+h} X_t} do not depend on t. In this case, we write C^X_h, or shortly C_h, for C_{X_{t+h} X_t} if it is clear to which process it belongs.
Many useful results on random processes in Hilbert spaces or more general Banach spaces are
collected in Chapters 1 and 2 of [8].
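The identity tr(C_XX) = E‖X − µ_X‖² has an exact empirical analogue once observations are discretized to vectors. The sketch below (our own illustration, with hypothetical variable names) checks it, along with self-adjointness and non-negative definiteness:

```python
import numpy as np

# Empirical covariance operator of vector-valued observations:
# C = (1/n) * sum_k (X_k - mean) (outer) (X_k - mean).
rng = np.random.default_rng(2)
X = rng.standard_normal((1000, 3)) @ np.diag([2.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / len(X)

# trace equals the average squared norm of the centered observations,
# the finite-sample version of tr(C_XX) = E||X - mu_X||^2
assert np.allclose(np.trace(C), (Xc**2).sum(axis=1).mean())
assert np.allclose(C, C.T)                        # self-adjoint
assert np.all(np.linalg.eigvalsh(C) >= -1e-12)    # non-negative definite
```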
C.3 Proofs for Appendix A
Proof of Proposition 6. Letting 0 < m < n, note that
\begin{align*}
\|S_n - S_m\|_2^2
&= \Big( \sum_{m\le|k|\le n} f_k e_k,\ \sum_{m\le|\ell|\le n} f_\ell e_\ell \Big) \\
&= \frac{1}{2\pi} \int_{-\pi}^{\pi} \sum_{m\le|k|\le n} \sum_{m\le|\ell|\le n} \langle f_k, f_\ell \rangle e^{i(k-\ell)\theta}\, d\theta
 = \sum_{m\le|k|\le n} \|f_k\|^2.
\end{align*}
To prove the first statement, we need to show that $(S_n)$ defines a Cauchy sequence in $L^2_H([-\pi,\pi])$, which follows if we show that $\sum_{k\in\mathbb{Z}} \|f_k\|^2 < \infty$. We use the fact that, for any $v \in H$, the function $\langle x(\theta), v\rangle$ belongs to $L^2([-\pi,\pi])$. Then, by Parseval's identity and (A.1), we have, for any $v \in H$,
\[
\frac{1}{2\pi} \int_{-\pi}^{\pi} |\langle x(\theta), v\rangle|^2\, d\theta
= \sum_{k\in\mathbb{Z}} \bigg| \frac{1}{2\pi} \int_{-\pi}^{\pi} \langle x(s), v\rangle e^{-iks}\, ds \bigg|^2
= \sum_{k\in\mathbb{Z}} |\langle f_k, v\rangle|^2.
\]
Let $(v_\ell : \ell \ge 1)$ be an orthonormal basis of $H$. Then, by the last result and Parseval's identity again, it follows that
\begin{align*}
\|x\|_2^2 = \frac{1}{2\pi} \int_{-\pi}^{\pi} \sum_{\ell\ge1} |\langle x(\theta), v_\ell\rangle|^2\, d\theta
&= \frac{1}{2\pi} \sum_{\ell\ge1} \int_{-\pi}^{\pi} |\langle x(\theta), v_\ell\rangle|^2\, d\theta \\
&= \sum_{\ell\ge1} \sum_{k\in\mathbb{Z}} |\langle f_k, v_\ell\rangle|^2
 = \sum_{k\in\mathbb{Z}} \|f_k\|^2.
\end{align*}
As for the second statement, we conclude from classical Fourier analysis results that, for each $v \in H$,
\[
\lim_{n\to\infty} \frac{1}{2\pi} \int_{-\pi}^{\pi} \bigg| \langle x(\theta), v\rangle - \sum_{k=-n}^{n} \Big( \frac{1}{2\pi} \int_{-\pi}^{\pi} \langle x(s), v\rangle e^{-iks}\, ds \Big) e^{ik\theta} \bigg|^2 d\theta = 0.
\]
Now, by the definition of $S_n$, this is equivalent to
\[
\lim_{n\to\infty} \frac{1}{2\pi} \int_{-\pi}^{\pi} |\langle x(\theta) - S_n(\theta), v\rangle|^2\, d\theta = 0, \quad \forall v \in H.
\]
Combined with the first statement of the proposition and
\[
\int_{-\pi}^{\pi} |\langle x(\theta) - S(\theta), v\rangle|^2\, d\theta
\le 2 \int_{-\pi}^{\pi} |\langle x(\theta) - S_n(\theta), v\rangle|^2\, d\theta
+ 2\|v\|^2 \int_{-\pi}^{\pi} \|S_n(\theta) - S(\theta)\|^2\, d\theta,
\]
this implies that
\[
\frac{1}{2\pi} \int_{-\pi}^{\pi} |\langle x(\theta) - S(\theta), v\rangle|^2\, d\theta = 0, \quad \forall v \in H. \tag{C.1}
\]
Let $(v_i)$, $i \in \mathbb{N}$, be an orthonormal basis of $H$, and define
\[
A_i := \{\theta \in [-\pi,\pi] : \langle x(\theta) - S(\theta), v_i\rangle \ne 0\}.
\]
By (C.1), we have that $\lambda(A_i) = 0$ ($\lambda$ denotes the Lebesgue measure), and hence $\lambda(A) = 0$ for $A = \cup_{i\ge1} A_i$. Consequently, since the $(v_i)$ form an orthonormal basis, for any $\theta \in [-\pi,\pi] \setminus A$ we have $\langle x(\theta) - S(\theta), v\rangle = 0$ for all $v \in H$, which in turn implies that $x(\theta) - S(\theta) = 0$.
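The Parseval-type identity $\sum_{k} \|f_k\|^2 = \|x\|_2^2$ used in the proof above can be checked numerically. A small sketch (our own, not part of the proof), treating $x$ as an $\mathbb{R}^d$-valued function sampled on an $N$-point grid of $[-\pi,\pi)$, so that the DFT coefficients play the role of the $f_k$ and both sides become Riemann sums:

```python
import numpy as np

# x(theta_j) in R^d on an N-point grid; f_k ~ (1/2pi) int x(s) e^{-iks} ds
# is approximated by the normalized DFT along the grid axis.
rng = np.random.default_rng(3)
N, d = 256, 4
x = rng.standard_normal((N, d))
f = np.fft.fft(x, axis=0) / N

# (1/2pi) int ||x(theta)||^2 dtheta  vs  sum_k ||f_k||^2 -- equal by the
# discrete Parseval identity (exactly, for the DFT).
lhs = (np.abs(x)**2).sum() / N
rhs = (np.abs(f)**2).sum()
assert np.allclose(lhs, rhs)
```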
Proof of Proposition 7. Without loss of generality, we assume that $EX_0 = 0$. Since $X_0$ and $X^{(h)}_h$,