Faculté des Sciences
Département de Mathématique
Inference for stationary functional time series:
dimension reduction and regression
Łukasz KIDZIŃSKI
Thesis presented for the degree of Doctor of Sciences, orientation Statistics
Supervisor: Siegfried Hörmann
Jury: Maarten Jansen, Davy Paindaveine, Thomas Verdebout, Laurent Delsol, Piotr Kokoszka
September 2014
“Simplicity is the final achievement.
After one has played a vast quantity of notes and more
notes, it is simplicity that emerges as the crowning reward
of art.”
Fryderyk Chopin
in If Not God, Then What? by Joshua Fost.
Acknowledgements
First and foremost, my heartfelt thanks to Professor Siegfried Hörmann who, throughout three
years, taught me how to be a great scientist and a better person. I would like to thank him for all
these long hours spent in front of the blackboard, for the passage from hard theoretical problems to
neat and valuable solutions, for his incredible precision and attention to detail which finally drove
me to be more careful, for his constantly positive attitude, charisma and expertise which will always
be a unique example for me, for the trust he gave me by letting me follow my own paths. It is a
great honour and privilege to be his first PhD student.
My gratitude is also extended to Piotr Kokoszka, for the support he showed and keeps showing
me from the moment we met, for his hospitality and the priceless opportunity to work together at
the Colorado State University.
My thanks go to David Brillinger as well, who found time to share his experience with me at
UC Berkeley, regardless of the many obstacles.
My sincere thanks also go out to Cheng Soon Ong for accommodating me in the challenging
environment of the ETH Zurich.
My thanks also go to my thesis committee for their guidance and our yearly recaps, to Davy
Paindaveine for his sharp remarks and exceptional humor, to Maarten Jansen for giving a great
example of scientific commitment, to Pierre Patie for valuable remarks at the beginning of my work,
and to Thomas Bruss for sharing his experience through countless stories and digressions during
lunches and coffee breaks. Likewise, my thanks to other members of my jury, Thomas Verdebout,
Laurent Delsol, for accepting my request and for their time.
Next, I would like to acknowledge the Communauté française de Belgique, for the grant within
Actions de Recherche Concertées (2010–2015) and the Belgian Science Policy Office, for the grant
within Interuniversity Attraction Poles (2012–2017). Thanks for the indispensable means which
allowed me to spend three years on my project.
Furthermore, I am aware that a scientific journey starts much earlier than in a doctoral school.
I would not be who I am without all the support from teachers starting from my childhood up till
now – I know that this thesis is not just mine, but their success too. In particular, I would like to
thank my primary school teacher Krzysztof Lukasiewicz and my high school teacher Jerzy Konarski
who taught me to enjoy mathematics.
Many fellow students and faculty members also supported me at the Université Libre de Bruxelles.
A great thank you to my office colleague Remi for all the necessary breaks for random mathematical
problems, for refreshing algorithmic competitions, for chess games or simple discussions about the
essence of the universe. Thanks to my second office colleague Rabih, and neighbours, Sarah, Carine,
Stavros, Dominik, Germin and Christophe, fellow students Robson and Isabel and many others for
teaching me French and for maintaining my sanity through chats, dinners, jogging and more.
Thanks to the whole Gauss’oh Fast team, for the taste of victory and to the BSSM co-organisers,
i
Julien, Julie, Patrick, Yves, Thomas, Nicolas and others for quite the same reason.
I am also honoured by the support from outside of the university. Thanks to Daniel, Bella,
Felipe, Astrid, Senna, Thiago, Wolney, Anna, Omid, Maryam, and Sarah for enriching discussions
about science, politics, economics and any sort of regular gossip during Friday’s dinners. Thanks to
Jan, Dominika and Michał for being there whenever I needed help. Thanks to my fantastic Polish
friends, Sebastian for his persistence, Karol for finding time for me no matter what, Natalia who
makes me remember I can achieve everything and Kinga for her exceptional life attitude. Thanks
to Leo for his constant positive thinking.
Thanks to my family, to my mother and sister who taught me the value of time, who always
believed in me and who will always protect me, to my father who was always motivating me to reach
for more.
Last, but certainly not least, I must acknowledge with tremendous and deep gratitude my lovely
Magda, for her limitless smiles, trust and support for all my ideas and decisions no matter how
crazy they seem. Together we are a team and for such a team every challenge is feasible.
The continuous advances in data collection and storage techniques allow us to observe and record
real-life processes in great detail. Examples include financial transaction data, fMRI images, satellite
photos, the Earth's pollution distribution over time, etc. Due to the high dimensionality of such data,
classical statistical tools become inadequate and inefficient. The need for new methods emerges and
one of the most prominent techniques in this context is functional data analysis (FDA).
The main objective of this work is to analyze temporal dependence in FDA. Such dependence
occurs, for example, if the data consist of a continuous time process which has been cut into segments,
days for instance. We are then in the context of so-called functional time series.
Many classical time series problems arise in this new setup, like modeling or prediction. In
this work we will be concerned mainly with regression and dimension reduction, comparing
time–domain methods with frequency–domain methods.
In this chapter, we further discuss the motivational examples and introduce articles upon which
this thesis is based.
1 Functional data analysis
1.1 Motivation
The main concern of statistics is to obtain essential information from a sample of observations
X1, X2, ..., XN. We are given a finite sample of size N ∈ N, where the observations Xi can be scalars, vectors
or more complex objects, like genotypes, fMRI scans or images.
Functional data analysis deals with observations which can be naturally expressed as functions.
Figures 1, 2 and 3 present several cases from different areas of science which fit into the framework
of functional data analysis.
When we deal with a physical process it is often natural to assume that it behaves in a
continuous manner and that the observations do not oscillate significantly between the measurements.
Although in the Digital Era we rarely record analog processes continuously, we often have enough
datapoints that interpolation does not cause a significant measurement error. Models incorporating
this additional structure can lead to more precise and meaningful findings. In this context, FDA
can be seen as a tool which embeds the continuity feature into the model.
On the other hand, besides providing a good approximation of a continuous process, FDA can also
prove useful in the noisy, discontinuous case. Then FDA serves as a tool for denoising and
smoothing the data, which is beneficial whenever the underlying process is the main concern.
From a pragmatic perspective, functional data can be seen simply as infinite-dimensional
vectors, with extended notions of mean and variance, and thus we may be tempted to employ
classical multivariate techniques. However, there are many practical and theoretical problems that
need to be addressed. For example, in the context of linear models, the inversion of the (infinite-
dimensional) covariance operator is not straightforward and needs to be treated carefully, from
both the theoretical and the practical perspective. This issue, together with our novel approach to the
Introduction Functional data analysis
Figure 1: Berkeley Growth Data: Heights of 20 girls taken from ages 0 through 18 (left). The growth process is easier to visualize in terms of acceleration (right). Tuddenham and Snyder [49] and Ramsay and Silverman [43].
Figure 2: Lower lip movement (top), acceleration (middle) and EMG of a facial muscle (bottom) of a speaker pronouncing the syllable “bob” for 32 replications. Malfait, Ramsay, and Froda [32].
Figure 3: Projections of DNA minicircles on the planes given by the principal axes of inertia (three panels on the left side: TATA curves; right: CAP curves). Mean curves are plotted in white. Panaretos, Kraus and Maddocks [36].
classical functional regression problem, is the topic of Chapter 1.
The FDA approach is also useful for a parsimonious representation of the data, taking advantage
of their smoothness. Instead of looking at a function as a dense vector of values, we can often
represent it as a linear combination of a handful of (well-chosen) basis functions.
Finally, there are also advantages in the FDA approach which stem from the structure of the
data. For example, one of the drawbacks of the acclaimed multivariate Principal Component
Analysis (PCA) is its scale dependence. It makes no sense to rescale a function componentwise (with
different scaling factors at different arguments) and hence for the functional counterpart of PCA,
the Functional Principal Component Analysis (FPCA), the lack of scale-invariance is not an issue.
A detailed introduction to Functional Principal Components is given in Section 1.6. In Chapter 3
we describe an extension of the technique benefiting from the time–dependent framework.
1.2 Brief overview of functional data research
One of the most influential works in the field of FDA is the seminal book by Ramsay and
Silverman [43]. Together with the accompanying R and Matlab libraries, which significantly facilitate
both research and practice in the area, it is a main reference in the field. Many important results were
mapped from the multivariate case, often taking advantage of the unique features of functional
objects, whereas others, like the analysis of derivatives, were derived uniquely in this setting.
As a running example, Ramsay and Silverman [43] consider growth curves of 10 girls measured
at a set of 31 ages. They argue that statistics obtained on derivatives can be more informative than
the classical analysis of the curves themselves, performed earlier by Tuddenham and Snyder [49].
Practical applications of functional data analysis are spread across many areas of science and
engineering. Panaretos et al. [36] use [0, 1] → R³ closed curves to analyze the behavior of DNA
minicircles, providing a testing methodology for the comparison of two classes of curves. Aston and
Kirch [2] analyze the stationarity and change point detection for functional time series, with appli-
cations to fMRI data. Hadjipantelis et al. [18] analyze Mandarin language using functional principal
components. Functional time series also naturally emerge in financial applications – Kokoszka and
Reimherr [29] analyze predictability of the shape of intraday price curves.
From the theoretical perspective, Berkes et al. [5] extensively studied the problem of change
points within a set of functional observations, whereas Horváth et al. [26] recently investigated
testing for stationarity. Many multivariate techniques were extended to the infinite-dimensional
setup, like functional dynamic factor models [20] or functional depth [31].
These works are only a fraction of the ongoing research; for a more comprehensive survey of
applications and theory we refer to the books [43], [16], [25] and [6].
1.3 Hilbert spaces
For most of the results presented in this work we only require that the functional space is a separable
Hilbert space, i.e. a complete inner product space with a countable orthonormal basis. This allows us to state
more general results, so that the space of square-integrable functions L²([a, b]), a < b, is a special
case.
Although most of our examples concern real-valued functions defined on a finite interval,
one should keep in mind other possible applications, including, for example, multivariate functions
or images and audio files, as described in Section 1.2.
1.4 Notation
Let H1, H2 be two (not necessarily distinct) separable Hilbert spaces. We denote by L(Hi, Hj),
(i, j ∈ {1, 2}), the space of bounded linear operators from Hi to Hj. Further we write 〈·, ·〉_H for the
inner product on the Hilbert space H and ‖x‖_H = 〈x, x〉_H^{1/2} for the corresponding norm.
For Φ ∈ L(Hi, Hj) we denote by ‖Φ‖_{L(Hi,Hj)} = sup_{‖x‖_{Hi} ≤ 1} ‖Φ(x)‖_{Hj} the operator norm and by
‖Φ‖_{S(Hi,Hj)} = (∑_{k≥1} ‖Φ(ek)‖²_{Hj})^{1/2}, where e1, e2, ... ∈ Hi is any orthonormal basis (ONB) of Hi,
the Hilbert–Schmidt norm of Φ. It is well known that this norm is independent of the choice of
the basis. Furthermore, with the inner product 〈Φ, Θ〉_{S(H1,H2)} = ∑_{k≥1} 〈Φ(ek), Θ(ek)〉_{H2} the space
S(H1, H2) is again a separable Hilbert space. For simplifying the notation we use Lij instead of
L(Hi, Hj) and in the same spirit Sij, ‖·‖_{Lij}, ‖·‖_{Sij} and 〈·, ·〉_{Sij}.
All random variables appearing in this work will be assumed to be defined on some common
probability space (Ω, A, P). A random element X with values in H is said to be in L^p_H if
ν_{p,H}(X) := (E‖X‖^p_H)^{1/p} < ∞. More conveniently we shall say that X has p moments. If X
possesses a first moment, then X possesses a mean µ, determined as the unique element for which
E〈X, x〉_H = 〈µ, x〉_H, ∀x ∈ H. For x ∈ Hi and y ∈ Hj let x ⊗ y : Hi → Hj be an operator defined
as x ⊗ y(v) = 〈x, v〉 y. If X ∈ L²_H, then it possesses a covariance operator C, given by
C = E[(X − µ) ⊗ (X − µ)]. It can be easily seen that C is a Hilbert–Schmidt operator. Assume
X, Y ∈ L²_H. Following Bosq [6], we say that X and Y are orthogonal (X ⊥ Y) if E X ⊗ Y = 0.
A sequence of orthogonal elements in H with a constant mean and constant covariance operator is
called H–white noise.
1.5 Representation and fit
Since we are dealing with infinite-dimensional objects, we need to represent and approximate them
in a convenient way. This is important from the practical as well as the theoretical perspective. From
the practical point of view, due to limited computer memory, we will always work with approximations,
and we want to use low-dimensional approximations for computational reasons.
One possibility to represent a curve is to select a sufficiently fine grid and process the
vector of values of the function on the intervals induced by the gridpoints. This approach, often
used in practice, does not benefit from the continuity of the functions.
In this work, we follow the ideas popularized by Ramsay and Silverman [43], based on basis
function expansions, most prominently the Karhunen–Loève or Fourier expansion. Let (ei)i≥1 be
an orthonormal basis of a separable Hilbert space H. Then, any element x ∈ H can be uniquely
represented as

x = ∑_{i=1}^∞ 〈x, ei〉 ei.

Note that, by Parseval's formula,

‖x‖² = ∑_{i=1}^∞ |〈x, ei〉|².

Since this series converges, for any ε > 0 there exists d such that

∑_{i=d}^∞ |〈x, ei〉|² < ε.

We can therefore approximate the function with arbitrary precision ε > 0 using only the first d
basis elements. This approach is consistent with intuition. Indeed, if we use, for example, Fourier
basis functions, then the high-frequency components are expected to be negligible and can be
discarded.
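The truncation idea above can be illustrated numerically. The following Python sketch (NumPy assumed; the test curve and the choice of a Fourier basis on [0, 1] are illustrative assumptions, not part of the text) computes the first d basis scores of a sampled curve by Riemann sums and checks that the approximation error shrinks as d grows:

```python
import numpy as np

# Orthonormal Fourier basis on [0, 1]: 1, sqrt(2)cos(2*pi*k*t), sqrt(2)sin(2*pi*k*t), ...
def fourier_basis(t, d):
    B = [np.ones_like(t)]
    k = 1
    while len(B) < d:
        B.append(np.sqrt(2) * np.cos(2 * np.pi * k * t))
        if len(B) < d:
            B.append(np.sqrt(2) * np.sin(2 * np.pi * k * t))
        k += 1
    return np.array(B)                    # shape (d, len(t))

t = np.linspace(0, 1, 1001)
dt = t[1] - t[0]
x = np.exp(-t) * np.sin(4 * np.pi * t)    # a smooth curve observed on a dense grid

def truncation_error(d):
    B = fourier_basis(t, d)
    coef = B @ x * dt                     # scores <x, e_i> by a Riemann-sum approximation
    return np.sqrt(np.sum((x - coef @ B) ** 2) * dt)   # L2 distance to the projection

errs = [truncation_error(d) for d in (3, 9, 21)]
assert errs[0] > errs[1] > errs[2]        # the tail sum of squared scores shrinks with d
```

Since the truncated bases are nested, the projection error is necessarily non-increasing in d; the assertion simply confirms this on the example curve.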
Although the fitting and representation of functional data is an important and intensively studied
topic on its own, in this work we assume that observations are fully observed, i.e. we are given the
actual curves. For more information on fitting we refer to [43].
1.6 Dimension reduction
From a theoretical perspective a curve observation X is an intrinsically infinite–dimensional object.
Besides the choice of an appropriate basis, there is also the need for dimension reduction.
Arguably, functional principal component analysis (FPCA) is the key technique for this problem.
Like its multivariate equivalent, FPCA is based on the analysis of the covariance operator and it is
concerned with finding directions which contribute most to the variability of the observations.
Let X be a functional random variable taking values in some Hilbert space H and let C = E X ⊗ X
be its covariance operator. (Without loss of generality we assume here and in many places that
EX = 0.) For C to exist, we assume that E‖X‖² < ∞. One can show that C is a symmetric,
positive definite Hilbert–Schmidt operator and can hence by the spectral theorem be decomposed
into

C = ∑_{i=1}^∞ λi ei ⊗ ei,    (1)

where λ1 ≥ λ2 ≥ ... ≥ 0 are the eigenvalues of C and (ei)_{i∈N} are the corresponding
eigenfunctions, forming an orthonormal basis of the underlying Hilbert space H.
If we pick the first d basis elements (ei)_{i=1}^d and project the observation X on the space spanned
by them, we obtain the optimal d-dimensional approximation in terms of the mean square error, i.e.

E‖X − ∑_{i=1}^d 〈X, ei〉 ei‖² ≤ E‖X − ∑_{i=1}^d 〈X, e′i〉 e′i‖²,
Introduction Functional Time Series
Figure 4: Horizontal component of the magnetic field measured in one-minute resolution at the Honolulu magnetic observatory from 1/1/2001 00:00 UT to 1/7/2001 24:00 UT; 1440 measurements per day.
for any other orthonormal collection (e′i)_{1≤i≤d}. The directions ei are called the principal components
of X and the coefficients 〈X, ei〉 are called PC scores. A simple computation shows that PC scores
are uncorrelated, which is another key feature.
We remark again that a main advantage of FPCA over the multivariate version is that scale-
invariance is not relevant. Consequently, it is much easier to interpret functional PCs and linear
combinations thereof. For a detailed theory of multivariate principal components we refer to [28],
and to [45] for the functional setup.
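The construction above can be sketched numerically as follows (a Python/NumPy illustration; the sine shapes and eigenvalue decay of the simulated curves are toy assumptions, not part of the text). The empirical covariance operator is diagonalized on a grid, and the resulting PC scores come out ordered by variance contribution and uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 200)
dt = t[1] - t[0]
N = 500

# Toy data: curves built from 8 sine shapes with decaying score variances (assumed model)
shapes = np.array([np.sqrt(2) * np.sin((k + 1) * np.pi * t) for k in range(8)])
lam_true = 1.0 / np.arange(1, 9) ** 2
X = (rng.normal(size=(N, 8)) * np.sqrt(lam_true)) @ shapes
X -= X.mean(axis=0)                          # centre: we work with EX = 0

# Discretised covariance operator and its eigendecomposition, cf. (1)
C = X.T @ X / N * dt                         # matrix acting as the operator on the grid
w, V = np.linalg.eigh(C)
order = np.argsort(w)[::-1]
eigvals = w[order]                           # estimated lambda_1 >= lambda_2 >= ...
eigfuns = (V[:, order] / np.sqrt(dt)).T      # estimated eigenfunctions e_i on the grid

scores = X @ eigfuns[:3].T * dt              # first three PC scores <X, e_i>
corr = np.corrcoef(scores, rowvar=False)
assert np.all(np.abs(corr - np.eye(3)) < 1e-8)   # PC scores are uncorrelated
assert eigvals[0] > eigvals[1] > eigvals[2] > 0  # ordered variance contributions
```

The exact diagonality of the empirical score covariance mirrors the uncorrelatedness of the PC scores noted in the text.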
FPCA gained popularity in both the iid and the time-dependent setup. However, in Chapter 3 we
argue that this technique is no longer optimal for time series and may lead to misconceptions when
not used carefully. We then propose an extension of FPCA which benefits from the temporal
dependence structure.
2 Functional Time Series
In many practical situations functions are naturally ordered in time. For example, when we deal
with daily observations of the stock market or with sequences of tumor scans. Then, we are in the
context of a so–called functional time series (FTS).
As a motivating example consider Figure 4. Here, the assumption of independence can be too
strong – values at the beginning of each day are highly correlated with those at the end of the
preceding day. Moreover, we see that big jumps are often followed by significant drops.
These, and similar features, may indicate significant temporal dependence not just within a
subject, but also between different subjects (e.g. days). In this section we discuss possible frameworks
which allow us to quantify, test and use this additional information.
2.1 Stationarity
Many physical processes are known to have a time-invariant distribution. This motivates the
frequentist approach to time series, where we assume that the structure does not change in time
and we infer from estimated covariances. A functional test for stationarity was recently introduced
by Horváth et al. [26]. Non-stationary time series are also extensively studied; however, they are
beyond the scope of this work.
Let (Xt) be a series of random functions. We say that (Xt) is stationary in the strong sense if for
any h ∈ Z, k ∈ N and any sequence of indices t1, t2, ..., tk the vectors (Xt1, ..., Xtk) and (Xh+t1, ..., Xh+tk)
are identically distributed.
We also define weak stationarity by looking only at the second-order structure of the series. We
say that (Xt) is weakly stationary if E‖Xt‖² < ∞ and
1. EXt = EX0 for each t ∈ Z and
2. E[Xt ⊗ Xs] = E[Xt−s ⊗ X0] for each t, s ∈ Z.
2.2 Model approach
Arguably, one of the most popular models of temporal dependence is the functional autoregressive
model (FAR(p)), studied in great detail by Bosq [6]. In this model we assume that the state at time
t is a linear function of the p previous states plus an independent white noise innovation. The main
concern is the estimation of the p linear operators involved. Once the AR structure is identified, we
can profit from the explicit probabilistic structure and dynamics of the time series. We describe this
model in detail in Chapter 1.
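A minimal simulation sketch of a FAR(1) recursion follows (Python/NumPy; the integral kernel for the operator and the innovation-smoothing kernel are toy assumptions, chosen so that the operator norm stays below one and a stationary solution exists):

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 100)
dt = t[1] - t[0]

# Toy integral kernel for the autoregressive operator (an assumption), scaled so the
# recursion X_k = Psi(X_{k-1}) + eps_k admits a stationary solution
psi = 0.5 * np.exp(-(t[:, None] - t[None, :]) ** 2)

# Smooth iid Gaussian vectors into random innovation curves
smooth = np.exp(-np.abs(t[:, None] - t[None, :]) / 0.1) * dt

def simulate_far1(n, burn=100):
    X, path = np.zeros(len(t)), []
    for i in range(n + burn):
        eps = smooth @ rng.normal(size=len(t))
        X = psi @ X * dt + eps        # Psi(X) discretised as a Riemann sum
        if i >= burn:
            path.append(X.copy())
    return np.array(path)

sample = simulate_far1(500)
# Consecutive curves are dependent: their average inner product is positive
lag1 = np.mean(np.sum(sample[1:] * sample[:-1], axis=1)) * dt
assert lag1 > 0
```

The positive lag-1 inner product is exactly the kind of between-curve dependence discussed for Figure 4.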
Many time series can, however, not be approximated by a FAR(p) process and the need for
more complex models arises. ARMA or GARCH-type models (cf. [21]) could serve as alternatives,
but the required theoretical foundation beyond the relatively simple autoregressions is still sparse.
Furthermore, for many time series it is not clear which model they follow. Nonetheless, time series
procedures may still apply. It is then preferable to only impose a certain dependence structure,
rather than requiring a particular model. In the next sections we introduce three popular notions
of dependence and justify the choice of the framework employed throughout this work.
2.3 Lp-m-approximability
In this framework, weak dependence is defined by a “small” Lp distance between the process and
its approximation based on only the last m innovations. This idea is made precise in the following
two definitions.
Definition 1. Suppose (Xn)n≥1 is a random process with values in H and let F⁻_n = σ(..., X_{n−2}, X_{n−1}, X_n)
and F⁺_n = σ(X_n, X_{n+1}, X_{n+2}, ...) be the σ-algebras generated by the terms up to time n and from
time n on, respectively. The process (Xn) is said to be m-dependent if F⁻_n and F⁺_{n+m} are independent.
In practice, processes usually do not have the property from Definition 1; however, they can
often be approximated by such series. This motivates the following approach to weak dependence.
Definition 2 (Hörmann and Kokoszka [23]). A random sequence (Xn)n≥1 with values in H is called
Lp–m–approximable if it can be represented as

Xn = f(δn, δ_{n−1}, δ_{n−2}, ...),

where the δi are iid elements taking values in a measurable space S and f is a measurable function
f : S∞ → H. Moreover, if the δ′i are independent copies of the δi defined on the same probability space,
then for

X_n^{(m)} = f(δn, δ_{n−1}, ..., δ_{n−m+1}, δ′_{n−m}, δ′_{n−m−1}, ...)    (2)

we have

∑_{m=1}^∞ ν_{p,H}(X_m − X_m^{(m)}) < ∞.
Note that the independent copies in (2) are used for simplicity of the proofs; alternative
representations lead to analogous results. Let us also stress that representation (2) is rather general and
incorporates most time series models encountered in practice. Furthermore, checking the validity of
the dependence condition reduces to p-th order moments, which is typically much simpler
than establishing the classical mixing conditions explained next.
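The coupling in the definition can be illustrated on a toy linear process X_n = ∑_k 0.5^k δ_{n−k} (a Python/NumPy sketch; the geometric filter and truncation level are assumptions). We estimate ν_p(X_m − X_m^{(m)}) by Monte Carlo for a few values of m and observe the geometric decay that makes the series over m summable:

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.linspace(0, 1, 50)
K, p = 60, 2                          # filter truncation and number of moments (p = 2)
weights = 0.5 ** np.arange(K + 1)     # toy filter: X_n = sum_k 0.5^k delta_{n-k}

def nu_p_estimate(m, reps=400):
    # Monte-Carlo estimate of nu_p(X_m - X_m^(m)): innovations older than m steps
    # are replaced by independent copies delta'_i, exactly as in the coupling
    sq = []
    for _ in range(reps):
        deltas = rng.normal(size=(K + 1, len(t)))    # delta_m, delta_{m-1}, ...
        copies = rng.normal(size=(K + 1, len(t)))    # independent copies delta'_i
        keep = np.arange(K + 1)[:, None] < m
        X = weights @ deltas
        Xm = weights @ np.where(keep, deltas, copies)
        sq.append(np.mean((X - Xm) ** 2))            # squared L2 norm on the grid
    return np.mean(np.array(sq) ** (p / 2)) ** (1 / p)

nus = [nu_p_estimate(m) for m in (1, 4, 8)]
assert nus[0] > nus[1] > nus[2]       # geometric decay in m, hence a summable series
```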
2.4 Mixing conditions
There exist numerous variants of mixing. We introduce the strong mixing (or α-mixing) condition,
which is one of the most prominent ones. In the functional context it has been used, e.g., by Aston
and Kirch [1]. For an extensive introduction to mixing we refer to Bradley [8].
In this approach we quantify and bound the dependence of the sigma-fields generated by the
variables X0, X−1, ... and Xm, Xm+1, ... for a given m ∈ N.
Definition 3. A strictly stationary process {Xj : j ∈ Z} is called strong mixing with mixing
rate rm if

sup_{A,B} |P(A ∩ B) − P(A)P(B)| = O(rm),    rm → 0,

where the supremum is taken over all A ∈ σ(..., X−1, X0) and B ∈ σ(Xm, Xm+1, ...).
Introduction Linear models
2.5 Cumulant condition
Another approach to quantifying weak dependence is based on so-called cumulants, expressing the
higher-order cross-moment structure. In the finite-dimensional case it was popularized by Brillinger
[9]; in the context of functional time series it was recently introduced by Panaretos and Tavakoli
[37]. The k-th order cumulant kernel is given by

cum(X_{t1}(τ1), ..., X_{tk}(τk)) = ∑_{v=(v1,...,vp)} (−1)^{p−1} (p − 1)! ∏_{l=1}^p E[ ∏_{j∈vl} X_{tj}(τj) ],

where the sum extends over all unordered partitions of {1, ..., k}. If we assume that E‖X0‖₂^l < ∞
for all l ≥ 1, then the cumulant kernels are well defined in L². For a given cumulant kernel of order 2k
one can define a 2k-th order cumulant operator R_{t1,...,t2k−1} : L²([0, 1]^k, R) → L²([0, 1]^k, R),
namely the integral operator whose kernel is the corresponding cumulant kernel.
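The partition formula can be checked directly in the scalar case with a short sketch (Python/NumPy; the helper names are hypothetical). We enumerate all unordered set partitions and verify that the order-2 joint cumulant, computed from empirical moments, reduces exactly to the covariance:

```python
import math
import numpy as np

def partitions(items):
    # Enumerate all unordered set partitions of the list `items` (recursive construction)
    if len(items) == 1:
        yield [items]
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part

def joint_cumulant(samples):
    # Empirical joint cumulant via the partition formula:
    # sum over partitions v = (v_1,...,v_p) of (-1)^{p-1}(p-1)! prod_l E prod_{j in v_l} X_j
    k = samples.shape[0]
    total = 0.0
    for part in partitions(list(range(k))):
        p = len(part)
        prod = 1.0
        for block in part:
            prod *= np.mean(np.prod(samples[block], axis=0))
        total += (-1) ** (p - 1) * math.factorial(p - 1) * prod
    return total

rng = np.random.default_rng(3)
x = rng.normal(size=(2, 1000)) + 1.0          # two iid samples, shifted to mean 1
c2 = joint_cumulant(x)                        # order-2 joint cumulant
cov = np.mean(x[0] * x[1]) - np.mean(x[0]) * np.mean(x[1])
assert abs(c2 - cov) < 1e-10                  # the formula reduces to the covariance
```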
In this paper we are concerned with a regression problem of the form
Yk = Ψ(Xk) + εk, k ≥ 1, (I.1)
where Ψ is a bounded linear operator mapping from space H1 to H2. This model is fairly general
and many special cases have been intensively studied in the literature. Our main objective is the
study of this model when the regressor space H1 is infinite dimensional. Then model (I.1) can be
seen as a general formulation of a functional linear model, which is an integral part of functional
data literature. Its various forms are introduced in Chapters 12–17 of Ramsay and Silverman [25].
A few recent references are Cuevas et al. [11], Malfait and Ramsay [23], Cardot et al. [6], Chiou
et al. [8], Müller and Stadtmüller [24], Yao et al. [28], Cai and Hall [3], Li and Hsing [22], Hall
and Horowitz [15], Reiss and Ogden [26], Febrero-Bande et al. [13], Crambes et al. [10], Yuan and
Cai [29], Ferraty et al. [14], Crambes and Mas [9].
From an inferential point of view, a natural problem is the estimation of the ‘regression operator’
Ψ. Once an estimator Ψ̂ is obtained, we can use it in an obvious way for prediction of the responses Y.
Both the estimation and the prediction problem are addressed in this paper. In the existing literature,
these problems have been discussed from several angles. For example, there is the distinction between
the ‘functional regressors and responses’ model (e.g., Cuevas et al. [11]) or the perhaps more widely
studied ‘functional regressor and scalar response model’ (e.g., Cardot et al. [5]). Other papers deal
with the effect when random functions are not fully observed but are obtained from sparse, irregular
data measured with error (e.g., Yao et al. [28]). More recently, the focus was on establishing rates
of consistency (e.g., Cai and Hall [3], Cardot and Johannes [7]). The two most popular methods
∗Manuscript has been accepted for publication in Scandinavian Journal of Statistics
of estimation are based on principal component analysis (e.g., Bosq [1], Cardot et al. [5], Hall and
Horowitz [15]) or spline smoothing estimators (e.g., Hastie and Mallows [16], Marx and Eilers [12],
Crambes et al. [10]).
In this paper we address the estimation and prediction problem for this model when the data
are fully observed, using the principal component (PC) approach. Let us explain what the new
contribution is and what distinguishes our paper from previous work.
(i) The crucial difficulty for this type of problem is that the infinite-dimensional operator Ψ needs
to be approximated by a sample version Ψ̂K of finite dimension K, say. Clearly, K = Kn needs to
depend on the sample size and tend to ∞ in order to obtain an asymptotically unbiased estimator.
In existing papers, the determination of K and the proof of consistency require, among other things,
unnecessary moment assumptions and artificial restrictions concerning the spectrum of the covariance operator
of the regressor variables Xk. As our main result, we will complement the current literature by
showing that the PC estimator remains consistent without such technical constraints. We provide
a data-driven procedure for the choice of K, which may even be used as a practical alternative to
cross-validation.
(ii) We allow the regressors Xk to be dependent. This is important for two reasons. First, many
examples in FDA literature exhibit dependencies as the data stem from a continuous time process,
which is then segmented into a sequence of curves, e.g., by considering daily data. Examples of this
kind include intra-day patterns of pollution records, meteorological data, financial transaction data
or sequential fMRI recordings. See, e.g., Horváth and Kokoszka [20].
Second, our framework detailed below will include the important special case of a functional
autoregressive model which has been intensively investigated in the functional literature and is often
used to model autoregressive dynamics of a functional time series. This model is analyzed in detail
in Bosq [2]. We can not only greatly simplify the assumptions needed for consistent estimation,
but also allow for a more general setup. E.g., in our Theorem 2 we show that it is not necessary
to assume that Ψ is a Hilbert-Schmidt operator if our intention is prediction. This quite restrictive
assumption is standard in existing literature, though it even excludes the identity operator.
(iii) As we already mentioned before, the literature considers different forms of functional linear
models. Arguably the most common are the scalar response with functional regressor and the
functional response with functional regressor case. We will not distinguish between these cases, but work
with a linear model between two general Hilbert spaces.
In the next section we will introduce notation, assumptions, the estimator and our main results.
In Section 3 we provide a small simulation study which compares our data driven choice of K with
cross-validation (CV). As we will see, this procedure is quite competitive with CV in terms of mean
squared prediction error, while it is clearly preferable to the latter in terms of computational cost.
Finally, in Section 6, we give the proofs.
2 Estimation of Ψ
2.1 Notation
Let H1, H2 be two (not necessarily distinct) separable Hilbert spaces. We denote by L(Hi, Hj),
(i, j ∈ {1, 2}), the space of bounded linear operators from Hi to Hj. Further we write 〈·, ·〉_H for
the inner product on the Hilbert space H and ‖x‖²_H = 〈x, x〉_H for the corresponding norm. For
Φ ∈ L(Hi, Hj) we denote by ‖Φ‖_{L(Hi,Hj)} = sup_{‖x‖_{Hi} ≤ 1} ‖Φ(x)‖_{Hj} the operator norm and by
‖Φ‖²_{S(Hi,Hj)} = ∑_{k≥1} ‖Φ(ek)‖²_{Hj}, where e1, e2, ... ∈ Hi is any orthonormal basis (ONB) of Hi,
the Hilbert–Schmidt norm of Φ. It is well known that this norm is independent of the choice of
the basis. Furthermore, with the inner product 〈Φ, Θ〉_{S(H1,H2)} = ∑_{k≥1} 〈Φ(ek), Θ(ek)〉_{H2} the space
S(H1, H2) is again a separable Hilbert space. For simplifying the notation we use Lij instead of
L(Hi, Hj) and in the same spirit Sij, ‖·‖_{Lij}, ‖·‖_{Sij} and 〈·, ·〉_{Sij}.
All random variables appearing in this paper will be assumed to be defined on some common
probability space (Ω, A, P). A random element X with values in H is said to be in L^p_H if
ν_{p,H}(X) := (E‖X‖^p_H)^{1/p} < ∞. More conveniently we shall say that X has p moments. If X
possesses a first moment, then X possesses a mean µ, determined as the unique element for which
E〈X, x〉_H = 〈µ, x〉_H, ∀x ∈ H. For x ∈ Hi and y ∈ Hj let x ⊗ y : Hi → Hj be an operator defined
as x ⊗ y(v) = 〈x, v〉 y. If X ∈ L²_H, then it possesses a covariance operator C, given by
C = E[(X − µ) ⊗ (X − µ)]. It can be easily seen that C is a Hilbert–Schmidt operator. Assume
X, Y ∈ L²_H. Following Bosq [2], we say that X and Y are orthogonal (X ⊥ Y) if E X ⊗ Y = 0.
A sequence of orthogonal elements in H with a constant mean and constant covariance operator is
called H–white noise.
2.2 Setup
We consider the general regression problem (I.1) for fully observed data. Let us collect our main
assumptions.
(A): We have Ψ ∈ L12. Further, (εk) and (Xk) are zero-mean sequences which are assumed to be
L4–m–approximable in the sense of Hörmann and Kokoszka [18] (see below). In addition, (εk) is
H2–white noise. For any k ≥ 1 we have Xk ⊥ εk.
Here is the weak dependence concept that we impose.
Definition 5 (Hormann and Kokoszka [18]). A random sequence (X_n)_{n≥1} with values in H is called L^p–m–approximable if it can be represented as

X_n = f(δ_n, δ_{n−1}, δ_{n−2}, ...),

where the δ_i are i.i.d. elements taking values in a measurable space S and f : S^∞ → H is a measurable function. Moreover, if (δ′_i) are independent copies of (δ_i) defined on the same probability space, then for

X_n^{(m)} = f(δ_n, δ_{n−1}, ..., δ_{n−m+1}, δ′_{n−m}, δ′_{n−m−1}, ...)

we have

Σ_{m≥1} ν_{p,H}(X_m − X_m^{(m)}) < ∞.
Evidently, i.i.d. sequences with finite p-th moments are Lp–m–approximable. This leads to the
classical functional linear model. But it is also easily checked that functional linear processes fit in
this framework. More precisely, if X_n is of the form

X_n = Σ_{k≥0} b_k(δ_{n−k}),

where the b_k : H_0 → H_1 are bounded linear operators such that Σ_{m≥1} Σ_{k≥m} ‖b_k‖_{L01} < ∞, and (δ_n) is an i.i.d. noise with ν_{p,H0}(δ_0) < ∞, then (X_n) is L^p–m–approximable. Other (also non-linear) examples
of functional time series covered by Lp–m–approximability can be found in [18].
A very important example included in our framework is the autoregressive Hilbertian model of
order 1 (ARH(1)) given by the recursion Xk+1 = Ψ(Xk) + εk+1. It will be treated in more detail in
Section 2.4.
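For intuition, the ARH(1) recursion is easy to simulate once curves are identified with their coefficient vectors in a fixed finite orthonormal basis. The sketch below is a finite-dimensional illustration; the particular operator Ψ, the dimension d and the coordinate variances are illustrative choices, not taken from the text:

```python
import numpy as np

def simulate_arh1(n, d=15, rho=0.6, burn_in=200, seed=0):
    """Simulate an ARH(1) process X_{k+1} = Psi(X_k) + eps_{k+1}.

    Curves are identified with their first d coefficients in some fixed
    orthonormal basis, so Psi becomes a d x d matrix.  An operator norm
    rho < 1 guarantees a stationary solution."""
    rng = np.random.default_rng(seed)
    # A smooth illustrative operator, rescaled so that ||Psi||_op = rho < 1.
    psi = rng.standard_normal((d, d)) / np.outer(np.arange(1, d + 1),
                                                 np.arange(1, d + 1))
    psi *= rho / np.linalg.norm(psi, 2)
    noise_sd = 1.0 / np.arange(1, d + 1)   # decaying coordinate variances
    X = np.zeros((n + burn_in, d))
    for k in range(1, n + burn_in):
        X[k] = psi @ X[k - 1] + noise_sd * rng.standard_normal(d)
    return X[burn_in:], psi
```

The burn-in discards the transient so that the returned sample is approximately a draw from the stationary distribution.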
The notion of L⁴–m–approximability implies that the process is stationary and ergodic and that it has finite fourth moments. The latter is in line with the existing literature. We are not aware
of any article that works with less than 4 moments. In contrast, for several consistency results
finite moments of all orders (or even bounded random variables) are assumed. Since our estimator
below is a moment estimator, based on second order moments, one could be tempted to believe that
some of our results may be deduced directly from the ergodic theorem under finite second moment
assumptions. We will explain in the next section, after introducing the estimator, why this line of
argumentation is not working.
Our weak dependence assumption implies that a possible non-zero mean of Xk can be estimated
consistently by the sample mean. Moreover, we have (see [19])

√n ‖X̄ − μ‖_{H1} = O_P(1),

where X̄ denotes the sample mean.
We conclude that the mean can be accurately removed in a preprocessing step and that EXk = 0 is
not a stringent assumption. Since by Lemma 2.1 in [18] Yk will also be L4–m–approximable, the
same argument justifies that we study a linear model without intercept.
2.3 The estimator
The PC based estimator for Ψ described below was first studied by Bosq [1] and is based on a
finite basis approximation. To achieve optimal approximation in finite dimension, one chooses
eigenfunctions of the covariance operator C = E[X1 ⊗ X1] as a basis. Let ∆ = E[X1 ⊗ Y1]. By
Assumption (A), both ∆ and C are Hilbert–Schmidt operators. Let (λ_i, v_i)_{i≥1} be the eigenvalues and corresponding eigenfunctions of the operator C, such that λ_1 ≥ λ_2 ≥ ⋯. The eigenfunctions are orthonormal, and those belonging to a non-zero eigenvalue form an orthonormal basis of Im(C), the closure of the image of C. Note that, with probability one, we have X ∈ Im(C). Since Im(C) is again a Hilbert space, we can assume that H_1 = Im(C), i.e. that the operator is of full rank. In this case all eigenvalues are strictly positive. Using the linearity of Ψ and the requirement X_k ⊥ ε_k from Assumption (A), one obtains ∆ = Ψ ∘ C, and hence ∆(v_j) = λ_j Ψ(v_j). Then, for any x ∈ H_1, this leads to the representation

Ψ(x) = Ψ( Σ_{j≥1} ⟨v_j, x⟩ v_j ) = Σ_{j≥1} (∆(v_j)/λ_j) ⟨v_j, x⟩.   (I.2)
Here we assume implicitly that dim(H_1) = ∞. If dim(H_1) = M < ∞, then (I.2) still holds with ∞ replaced by M. This case is well understood and will therefore be excluded.
Equation (I.2) gives a core idea for estimation of Ψ. We will estimate ∆, vj and λj from
our sample X1, . . . , Xn, Y1, . . . , Yn and substitute the estimators into formula (I.2). The estimated
eigenelements (λ̂_{j,n}, v̂_{j,n} : 1 ≤ j ≤ n) will be obtained from the empirical covariance operator

Ĉ_n = (1/n) Σ_{k=1}^n X_k ⊗ X_k.

In a similar straightforward manner we set

∆̂_n = (1/n) Σ_{k=1}^n X_k ⊗ Y_k.
For ease of notation, we will suppress in the sequel the dependence on the sample size n of these
estimators.
Apparently, from the finite sample we cannot estimate the entire sequence (λj , vj), rather we
have to work with a truncated version. This leads to

Ψ̂_K(x) = Σ_{j=1}^K (∆̂(v̂_j)/λ̂_j) ⟨v̂_j, x⟩,   (I.3)

where the choice of K = K_n is crucial. Since we want our estimator to be consistent, K_n has to grow to infinity with the sample size. On the other hand, we know that λ_j → 0. Hence, it will be a delicate issue to control the behavior of 1/λ̂_j. A small error in the estimation of λ_j can have an enormous impact on (I.3).
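In coordinates, the estimator (I.3) amounts to an eigendecomposition of the empirical covariance matrix followed by a truncated inversion. A minimal sketch, with curves again represented by basis coefficient vectors (the matrix conventions are illustrative, not the authors' implementation):

```python
import numpy as np

def pc_estimator(X, Y, K):
    """Truncated principal-component estimator (I.3) in coordinates.

    Rows of X (n x d1) and Y (n x d2) hold the coefficient vectors of the
    regressors and responses.  Returns the d2 x d1 matrix representing
    Psi_hat_K(x) = sum_{j<=K} <v_j, x> Delta(v_j) / lambda_j."""
    n = X.shape[0]
    C = X.T @ X / n          # empirical covariance operator C_n
    Delta = Y.T @ X / n      # empirical cross-covariance: Delta(v) = (1/n) sum <X_k, v> Y_k
    lam, V = np.linalg.eigh(C)
    lam, V = lam[::-1], V[:, ::-1]   # eigenvalues in decreasing order
    return sum(np.outer(Delta @ V[:, j], V[:, j]) / lam[j] for j in range(K))
```

With K equal to the full dimension and well-conditioned data this reduces to ordinary least squares; the delicate case discussed in the text is precisely when K grows and the trailing λ̂_j are small.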
Define Ψ_K(x) = Σ_{j=1}^K (∆(v_j)/λ_j) ⟨v_j, x⟩. Via the ergodic theorem one can show that the individual terms λ̂_j, v̂_j and ∆̂ in (I.3) converge to their population counterparts. It follows that ‖Ψ̂_K − Ψ_K‖_{L12} → 0 a.s., as long as K is fixed. In fact, this holds true under finite second moments.
However, as is well known, the ergodic theorem does not provide rates of convergence. Even if the underlying random variables were bounded, convergence can be arbitrarily slow. Consequently, we cannot let K grow with the sample size in this approach. We need to impose further structure on the dynamics of the process and the existence of higher order moments. Both are combined in the concept of L⁴–m–approximability.
In most existing papers determination of Kn is related to the decay-rate of λj. For example,
To bring our data to the same scale and make results under different settings comparable, we choose c_1, c_2 and c_3 such that Σ_{k=1}^{35} λ_k = 1. This implies E‖X_i‖² = 1 in all settings. The noise ε_k is also assumed to be of the form (I.5), but now with E‖ε_i‖² = σ² ∈ {0.25, 1, 2.25, 4}.

We test three operators, all of the form Ψ(x) = Σ_{i=1}^{35} Σ_{j=1}^{35} ψ_ij ⟨x, v_i⟩ v_j:
• Ψ_1: for 1 ≤ i, j ≤ 35 we set ψ_ii = 1 and ψ_ij = 0 when i ≠ j;
• Ψ_2: the coefficients ψ_ij are generated as i.i.d. standard normal random variables;
• Ψ_3: for 1 ≤ i, j ≤ 35 we set ψ_ij = 1/(ij).
We standardize the operators such that the operator norm equals one. The operators Ψ_2 are generated once and then kept fixed for the entire simulation. We generate samples of size n + 1 = 80 × 4^ℓ + 1, ℓ = 0, ..., 4. Estimation is based on the first n observations. We run 200 simulations for each setup
(Λ, Ψ, σ, n). As a performance measure for our procedure we use the mean squared error on the (n+1)-st observation,

MSE = (1/200) Σ_{k=1}^{200} ‖Ψ(X_{n+1}^{(k)}) − Ψ̂(X_{n+1}^{(k)})‖²_{H2}.   (I.6)

Here X_i^{(k)} is the i-th observation of the k-th simulation run.
Now we compute the median truncation level K obtained from our data-driven procedure described in Theorem 2 with m_n = n^{1/2}/log n. We compare it to the median truncation level obtained by cross-validation (K_CV) on the same data. To this end, we divide the sample into training and test sets in proportion (n − n_test) : n_test, where n_test = max{n/10, 100}. The estimator is obtained from the training set for the different truncation levels k = 1, 2, ..., 35. Then, from the test set, we determine

K_CV = argmin_{k∈{1,...,35}} Σ_{ℓ=n−n_test}^{n} ‖Y_{ℓ+1} − Ψ̂_k(X_ℓ)‖²_{H2}.
The MSE and the sizes of K and K_CV are shown for different constellations in Table 1. We display the results only for σ = 1. Not surprisingly, the bigger the variance of the noise, the bigger the MSE; otherwise our findings were the same across all constellations of σ. The table shows that the
choice of K proposed by our method results in an MSE which is competitive with CV. We also see
that an optimal choice of K cannot be solely based on the decay of the eigenvalues as it is the case
in our approach. It clearly also depends on the unknown operator itself. Not surprisingly, the best
results are obtained under settings Λ1 (exponentially fast decay of eigenvalues) and Ψ3 (which is
the smoothest among the three operators).
4 Conclusion
Estimation of the regression operator in functional linear models has received much interest over recent years. Our objective in this paper was to show that one of the most widely applied estimators in this context remains consistent, even if several of the synthetic assumptions used in previous papers are removed. If our intention is prediction, we can further simplify the technical requirements. Our
approach comes with a data driven choice of the parameter which determines the dimension of the
estimator. While our main intention is to show that this choice leads to a consistent estimator,
we have seen in simulations that our method is performing remarkably well when compared to
cross-validation.
Table 1: Truncation levels obtained by Theorem 2 (K) and by cross-validation (K_CV), and the corresponding MSE. For each constellation we present med(K) over 200 runs.
By Lemmas 4, 5, 6, 7 and the assumption m_n^6 = o(n), we finally obtain for large enough n that

P(‖Ψ − Ψ̂_{K_n}‖_{L12} > ε)
  ≤ 4⁴ U m_n² / (ε⁴ n) + 4³ U ‖∆‖²_{S12} m_n⁴ / (ε² n) + 4² U (128 ‖∆‖²_{L12} + ε²/4) m_n⁶ / (ε² n) + P(‖Ψ − Ψ_{K_n}‖_{L12} > ε/4)
  → 0 as n → ∞.
5.2 Proof of Theorem 2
In order to simplify the notation we write K = K_n. This time, as a starting point, we take a representation of Ψ in the basis v_1, v_2, .... Let M_m = sp{v_1, ..., v_m} and M̂_m = sp{v̂_1, ..., v̂_m}, where sp{x_i : i ∈ I} denotes the closed span of the elements x_i, i ∈ I. If rank(Ĉ) = ℓ, then the v̂_i, i > ℓ, can be any ONB of M̂_ℓ^⊥. We write P_A for the projection operator which maps onto a closed linear space A. As usual, A^⊥ denotes the orthogonal complement of A. Since for any m ≥ 1 we can write x = P_{M̂_m}(x) + P_{M̂_m^⊥}(x), the linearity of Ψ and of the projection operator gives

Ψ(x) = Ψ(P_{M̂_m}(x)) + Ψ(P_{M̂_m^⊥}(x)) = Σ_{j=1}^m ⟨v̂_j, x⟩_{H1} Ψ(v̂_j) + Ψ(P_{M̂_m^⊥}(x)).
Now we evaluate Ψ at some v̂_j which is not in the kernel of Ĉ. By the definition of Ĉ and again by the linearity of the involved operators,

Ψ(v̂_j) = (1/λ̂_j) Ψ(Ĉ(v̂_j))
        = (1/λ̂_j) (1/n) Σ_{i=1}^n ⟨X_i, v̂_j⟩_{H1} Ψ(X_i)
        = (1/λ̂_j) (1/n) Σ_{i=1}^n ⟨X_i, v̂_j⟩_{H1} (Y_i − ε_i)
        = (1/λ̂_j) (∆̂(v̂_j) + Λ(v̂_j)),

where Λ = −(1/n) Σ_{i=1}^n X_i ⊗ ε_i. Hence, if m is such that λ̂_m > 0 (which will from now on be implicitly assumed), Ψ can be expressed as

Ψ(x) = Σ_{j=1}^m ⟨v̂_j, x⟩_{H1} (1/λ̂_j) ∆̂(v̂_j) + Σ_{j=1}^m ⟨v̂_j, x⟩_{H1} (1/λ̂_j) Λ(v̂_j) + Ψ(P_{M̂_m^⊥}(x)).
Note that the first term on the right-hand side is just Ψ̂_m(x). Therefore, for any x, the distance between Ψ(x) and Ψ̂_m(x) takes the form

‖Ψ(x) − Ψ̂_m(x)‖_{H2} = ‖ Σ_{j=1}^m ⟨v̂_j, x⟩_{H1} (1/λ̂_j) Λ(v̂_j) + Ψ(P_{M̂_m^⊥}(x)) ‖_{H2}.   (I.15)
To assess (I.15) we need the following four lemmas.
Lemma 9. Let (λ_i, v_i)_{i≥1} and (λ̂_i, v̂_i)_{i≥1} be the eigenvalues and eigenfunctions of C and Ĉ, respectively. Let j, m ∈ N be such that j ≤ m ≤ n. Then

‖v_j − P_{M̂_m}(v_j)‖²_{H1} ≤ 4 ‖Ĉ − C‖²_{L11} / (λ̂_{m+1} − λ̂_j)².

Proof. Note that by using Parseval's identity we get

‖v_j − P_{M̂_m}(v_j)‖²_{H1} = Σ_{k≥1} ⟨v_j − P_{M̂_m}(v_j), v̂_k⟩²_{H1} = Σ_{k>m} ⟨v_j, v̂_k⟩²_{H1}.

Now

(λ̂_{m+1} − λ̂_j)² Σ_{k>m} ⟨v_j, v̂_k⟩²_{H1} ≤ Σ_{k>m} (λ̂_k ⟨v_j, v̂_k⟩_{H1} − λ̂_j ⟨v_j, v̂_k⟩_{H1})² = Σ_{k>m} (⟨v_j, Ĉ(v̂_k)⟩_{H1} − λ̂_j ⟨v_j, v̂_k⟩_{H1})².

Since Ĉ is a self-adjoint operator, simple algebraic transformations yield

(λ̂_{m+1} − λ̂_j)² Σ_{k>m} ⟨v_j, v̂_k⟩²_{H1} ≤ Σ_{k>m} (⟨Ĉ(v_j), v̂_k⟩_{H1} − λ̂_j ⟨v_j, v̂_k⟩_{H1})²
  = Σ_{k>m} (⟨(Ĉ − C)(v_j), v̂_k⟩_{H1} − (λ̂_j − λ_j) ⟨v_j, v̂_k⟩_{H1})²
  ≤ 2 Σ_{k>m} |⟨(Ĉ − C)(v_j), v̂_k⟩_{H1}|² + 2 Σ_{k>m} ((λ̂_j − λ_j) ⟨v_j, v̂_k⟩_{H1})².

By Parseval's identity and Lemma 3,

(λ̂_{m+1} − λ̂_j)² Σ_{k>m} ⟨v_j, v̂_k⟩²_{H1} ≤ 2 ‖(Ĉ − C)(v_j)‖²_{H1} + 2 |λ̂_j − λ_j|² ≤ 4 ‖Ĉ − C‖²_{L11}.
Lemma 10. Let Ψ̂ be defined as in Lemma 2 and let K = K_n → ∞ in probability. Then ‖P_{M̂_K^⊥}(X_n)‖_{H1} → 0 in probability.

Proof. Here and in the sequel we write X = X_n. We first remark that for any ε > 0,

P(‖P_{M̂_K^⊥}(X)‖²_{H1} > ε) = P( Σ_{i>K} |⟨v̂_i, X⟩_{H1}|² > ε ).

Since Σ_{i≥1} |⟨v̂_i, X⟩_{H1}|² = ‖X‖²_{H1}, there exists a random variable J_ε such that Σ_{i≥J_ε} |⟨v̂_i, X⟩_{H1}|² < ε. Since by assumption E‖X‖²_{H1} < ∞, we conclude that J_ε is bounded in probability. Hence we obtain

P(‖P_{M̂_K^⊥}(X)‖²_{H1} > ε) ≤ P( {Σ_{i>K} |⟨v̂_i, X⟩_{H1}|² > ε} ∩ {K > J_ε} ) + P(K ≤ J_ε) = P(K ≤ J_ε),

where the last term converges to zero as n → ∞.
Lemma 11. Let L_n = argmax{r ≤ K : Σ_{i=1}^r (λ̂_{K+1} − λ̂_i)^{−2} ≤ ξ_n}, where K = K_n is given as in Theorem 2 and ξ_n → ∞. Then L_n → ∞ in probability.

Proof. Let r ∈ N be such that for all 1 ≤ i ≤ r we have λ_{r+1} ≠ λ_i. Note that E‖X‖²_{H1} < ∞ implies λ_i → 0, and since λ_i > 0 we can find infinitely many r satisfying this condition. We choose such an r and obtain

P(L_n < r) ≤ P( {Σ_{i=1}^r (λ̂_{K+1} − λ̂_i)^{−2} > ξ_n} ∩ {K ≥ r} ) + P(K < r).

Lemma 8 implies that P(K < r) → 0. The first term is bounded by P( Σ_{i=1}^r (λ̂_{r+1} − λ̂_i)^{−2} > ξ_n ). Since λ̂_i → λ_i in probability and r is fixed while ξ_n → ∞, it follows that P(L_n < r) → 0 as n → ∞. Since r can be chosen arbitrarily large, the proof is finished.
Lemma 12. Let Ψ̂ be defined as in Lemma 2. Then ‖P_{M_K}(X) − P_{M̂_K}(X)‖_{H1} → 0 in probability.

Proof. With L as in Lemma 11, define the two variables X^{(1)} = Σ_{i=1}^L ⟨X, v_i⟩_{H1} v_i and X^{(2)} = Σ_{i>L} ⟨X, v_i⟩_{H1} v_i. Again, for simplifying the notation, we write L instead of L_n. Since X = X^{(1)} + X^{(2)}, we derive

‖P_{M_K}(X) − P_{M̂_K}(X)‖_{H1} ≤ ‖P_{M_K}(X^{(1)}) − P_{M̂_K}(X^{(1)})‖_{H1} + ‖P_{M_K}(X^{(2)})‖_{H1} + ‖P_{M̂_K}(X^{(2)})‖_{H1}.   (I.16)

The last two terms are bounded by 2‖X^{(2)}‖_{H1}. For the first summand in (I.16) we get

‖P_{M_K}(X^{(1)}) − P_{M̂_K}(X^{(1)})‖_{H1} = ‖ Σ_{i=1}^L ⟨X, v_i⟩_{H1} (v_i − P_{M̂_K}(v_i)) ‖_{H1}.
Let us choose ξ_n = o(n) in Lemma 11. The triangle inequality, the Cauchy–Schwarz inequality, Lemma 9 and the definition of L entail

‖P_{M_K}(X^{(1)}) − P_{M̂_K}(X^{(1)})‖_{H1}
  ≤ Σ_{i=1}^L |⟨X, v_i⟩_{H1}| ‖v_i − P_{M̂_K}(v_i)‖_{H1}
  ≤ ( Σ_{i=1}^L |⟨X, v_i⟩_{H1}|² )^{1/2} ( Σ_{i=1}^L ‖v_i − P_{M̂_K}(v_i)‖²_{H1} )^{1/2}
  ≤ ‖X‖_{H1} ( Σ_{i=1}^L ‖v_i − P_{M̂_K}(v_i)‖²_{H1} )^{1/2}
  ≤ 2 ‖X‖_{H1} ‖Ĉ − C‖_{L11} ( Σ_{i=1}^L 1/(λ̂_{K+1} − λ̂_i)² )^{1/2}
  ≤ 2 ‖X‖_{H1} ‖Ĉ − C‖_{L11} √ξ_n.

This implies the inequality

‖P_{M_K}(X) − P_{M̂_K}(X)‖_{H1} ≤ 2 ‖X‖_{H1} ‖Ĉ − C‖_{L11} √ξ_n + 2 ‖X^{(2)}‖_{H1}.   (I.17)

Hence, by Lemma 1, we have 2‖X‖_{H1} ‖Ĉ − C‖_{L11} √ξ_n = o_P(1). Furthermore, ‖X^{(2)}‖_{H1} = ( Σ_{j>L} |⟨X, v_j⟩|² )^{1/2} → 0 in probability; this follows as in the proof of Lemma 10.
Lemma 13. Let Ψ̂ be defined as in Lemma 2. Then ‖Ψ(P_{M̂_K^⊥}(X))‖_{H2} → 0 in probability.

Proof. Some simple manipulations show

‖Ψ(P_{M̂_K^⊥}(X))‖_{H2} = ‖Ψ(X − P_{M̂_K}(X))‖_{H2}
  = ‖Ψ(P_{M_K}(X) + P_{M_K^⊥}(X) − P_{M̂_K}(X))‖_{H2}
  ≤ ‖Ψ(P_{M_K}(X)) − Ψ(P_{M̂_K}(X))‖_{H2} + ‖Ψ(P_{M_K^⊥}(X))‖_{H2}
  ≤ ‖Ψ‖_{L12} ( ‖P_{M_K}(X) − P_{M̂_K}(X)‖_{H1} + ‖P_{M_K^⊥}(X)‖_{H1} ).

Direct applications of Lemma 10 and Lemma 12 finish the proof.
Proof of Theorem 2. Set

Θ_n(x) = Σ_{j=1}^{K_n} (Λ(v̂_j)/λ̂_j) ⟨v̂_j, x⟩_{H1}.

By the representation (I.15) and the triangle inequality,

‖Ψ(X) − Ψ̂(X)‖_{H2} ≤ ‖Θ_n(X)‖_{H2} + ‖Ψ(P_{M̂_{K_n}^⊥}(X))‖_{H2}.

Lemma 13 shows that the second term tends to zero in probability.

If in Lemma 1 we define Ψ ≡ 0, then Λ plays the role of ∆̂ and, by the independence of ε_k and X_k, its population counterpart ∆ equals 0. By the arguments of Lemma 5 we infer P(‖Θ_n‖_{L12} > ε) ≤ U m_n² / (ε² n), which implies that ‖Θ_n(X)‖_{H2} → 0 in probability.
6 Acknowledgement
This research was supported by the Communauté française de Belgique (Actions de Recherche Concertées, 2010–2015) and by the Interuniversity Attraction Poles Programme (IAP network P7/06) of the Belgian Science Policy Office.
Bibliography
[1] Bosq, D. (1991). Modelization, nonparametric estimation and prediction for continuous time
processes. In Nonparametric functional estimation and related topics. NATO Adv. Sci. Inst.
Ser. C Math. Phys. Sci., 335, 509–529, Kluwer Acad. Publ.
[2] Bosq, D. (2000). Linear Processes in Function Spaces., Springer, New York.
[3] Cai, T. & Hall, P. (2006). Prediction in functional linear regression. Ann. Statist. 34, 2159–2179.
[4] Cai, T. & Zhou, H. (2008). Adaptive functional linear regression. Technical report.
[5] Cardot, H., Ferraty, F. & Sarda, P. (1999). Functional linear model. Statist. Probab. Lett. 45,
11–22.
[6] Cardot, H., Ferraty, F. & Sarda, P. (2003). Spline estimators for the functional linear model.
Statist. Sinica 13, 571–591.
[7] Cardot, H. & Johannes, J. (2010). Thresholding projection estimators in functional linear models.
Some further basic estimates show that

[ E‖X_{r+h} ⊗ X_r − X_{r+h}^{(r)} ⊗ X_r^{(r−h)}‖²_S ]^{1/2} ≤ √2 ν₄(X_0) [ ν₄(X_0 − X_0^{(r−h)}) + ν₄(X_0 − X_0^{(r)}) ].

Similar estimates can be obtained when r < 0, and the result follows from (2.6).

It is convenient to introduce the following remainder terms:

τ^X(h) = Σ_{|k|≥h} ‖C_k^X‖_S;   τ^{YX}(h) = Σ_{|k|≥h} ‖C_k^{YX}‖_S;   τ^b(h) = Σ_{|k|≥h} ‖b_k‖_S.
Proof of Lemma 1. By repeated application of the triangle inequality, we obtain

sup_{θ∈[−π,π]} ‖F̂_θ^X − F_θ^X‖_L ≤ (2π)^{−1} { Σ_{|h|≤q} ‖Ĉ_h^X − C_h^X‖_L + Σ_{|h|≤q} |1 − ω_q(h)| ‖C_h^X‖_L + τ^X(q) }.

Since, by Lemma 5, Σ_{|h|≤q} E‖Ĉ_h^X − C_h^X‖_L = O(q n^{−1/2}), the first term tends to zero. The term Σ_{|h|≤q} |1 − ω_q(h)| ‖C_h^X‖_L tends to zero by (2.3), Assumption 4 and dominated convergence. Again by (2.3), it follows that τ^X(q) → 0. For example, one may then choose

ψ_n^X = { q n^{−1/2} + Σ_{|h|≤q} |1 − ω_q(h)| ‖C_h^X‖_L + τ^X(q) }^{1−γ},   γ ∈ (0, 1).

The same arguments apply to the spectral cross-density operators.
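A lag-window estimator of this form is straightforward to implement for curves represented by coefficient vectors. The sketch below uses Bartlett weights ω_q(h) = 1 − |h|/q as an illustrative weight choice; the coordinate representation is an assumption for illustration, not the authors' implementation:

```python
import numpy as np

def spectral_density_op(X, theta, q):
    """Lag-window estimate of the spectral density operator at frequency theta:
    F_theta = (2*pi)^{-1} * sum_{|h|<=q} w_q(h) * C_h * exp(-i*h*theta),
    with Bartlett weights w_q(h) = 1 - |h|/q and empirical autocovariances
    C_h = (1/n) sum_t X_{t+h} (x) X_t.  Rows of X are coefficient vectors;
    the result is a complex d x d matrix."""
    n, d = X.shape
    Xc = X - X.mean(axis=0)
    F = np.zeros((d, d), dtype=complex)
    for h in range(-q, q + 1):
        if h >= 0:
            Ch = Xc[h:].T @ Xc[:n - h] / n
        else:
            Ch = (Xc[-h:].T @ Xc[:n + h] / n).T   # C_{-h} = C_h^T
        F += (1 - abs(h) / q) * Ch * np.exp(-1j * h * theta)
    return F / (2 * np.pi)
```

Because the negative-lag autocovariances are the transposes of the positive-lag ones, the returned matrix is Hermitian, as a spectral density operator must be.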
Proof of Theorem 1. Since

max_{h∈Z} ‖b̂_h − b_h‖_S ≤ (2π)^{−1} ∫_{−π}^{π} ‖B̂_θ − B_θ‖_S dθ,

we focus on the estimation of the frequency response operator B_θ. Define

B̃_θ = B̃_θ(K) = Σ_{m≤K} F_θ^{YX}( λ_m^{−1}(θ) φ_m(θ) ⊗ φ_m(θ) ).

Then

(1/2) ‖B̂_θ − B_θ‖²_S ≤ ‖B̂_θ − B̃_θ‖²_S + ‖B̃_θ − B_θ‖²_S.

Since, by (3.3),

Σ_{ℓ≥1} λ_ℓ^{−2}(θ) ‖F_θ^{YX}(φ_ℓ(θ))‖² = ‖B_θ‖²_S ≤ ( Σ_{k∈Z} ‖b_k‖_S )² < ∞,

we see that

‖B̃_θ − B_θ‖²_S = Σ_{ℓ>K} λ_ℓ^{−2}(θ) ‖F_θ^{YX}(φ_ℓ(θ))‖² → 0, K → ∞.
Thus, it remains to prove that

(2π)^{−1} ∫_{−π}^{π} ‖B̂_θ − B̃_θ‖_S dθ → 0 in probability,   (6.1)

and that K → ∞ in probability. Condition (6.1) can be replaced by

∫_{−π}^{π} ‖B̂_θ − B̃_θ‖_S dθ × I_{A_n} → 0 in probability,   (6.2)

where A_n ⊂ A is defined as

A_n := { sup_θ ‖F̂_θ^X − F_θ^X‖ ≤ ψ_n^X } ∩ { sup_θ ‖F̂_θ^{YX} − F_θ^{YX}‖ ≤ ψ_n^{YX} }.
This is because by Lemma 1 we have that P (An)→ 1.
We have

B̂_θ − B̃_θ = Σ_{m=1}^K [ F̂_θ^{YX}( λ̂_m^{−1}(θ) φ̂_m(θ) ⊗ φ̂_m(θ) ) − F_θ^{YX}( λ_m^{−1}(θ) φ_m(θ) ⊗ φ_m(θ) ) ]
  = Σ_{m=1}^K F_θ^{YX}( λ̂_m^{−1}(θ) φ̂_m(θ) ⊗ φ̂_m(θ) − λ_m^{−1}(θ) φ_m(θ) ⊗ φ_m(θ) )
  + Σ_{m=1}^K (F̂_θ^{YX} − F_θ^{YX})( λ̂_m^{−1}(θ) φ̂_m(θ) ⊗ φ̂_m(θ) ).

Thus, using ‖F ∘ G‖_S ≤ ‖F‖_L ‖G‖_S, we get

‖B̂_θ − B̃_θ‖_S ≤ ‖F_θ^{YX}‖_L ‖ Σ_{m=1}^K ( λ̂_m^{−1}(θ) φ̂_m(θ) ⊗ φ̂_m(θ) − λ_m^{−1}(θ) φ_m(θ) ⊗ φ_m(θ) ) ‖_S
  + ‖F̂_θ^{YX} − F_θ^{YX}‖_L ( Σ_{m=1}^K λ̂_m^{−2}(θ) )^{1/2}.
Since sup_{θ∈[−π,π]} ‖F_θ^{YX}‖_L ≤ π^{−1} τ^{YX}(0), (6.2) follows from

∫_{−π}^{π} ‖ Σ_{m=1}^K ( λ̂_m^{−1}(θ) φ̂_m(θ) ⊗ φ̂_m(θ) − λ_m^{−1}(θ) φ_m(θ) ⊗ φ_m(θ) ) ‖_S dθ × I_{A_n} = o_P(1)   (6.3)

and

ψ_n^{YX} ∫_{−π}^{π} W_λ^K(θ) dθ = O_P(1).   (6.4)

Relation (6.4) is already immediate from the condition K ≤ K^{(2)}.
Some routine estimates show that the integrand in (6.3) is bounded by

2 { Σ_{m=1}^K λ_m^{−1}(θ) ‖φ̂_m(θ) − ĉ_m(θ) φ_m(θ)‖ + Σ_{m=1}^K |λ_m(θ) − λ̂_m(θ)| / (λ_m(θ) λ̂_m(θ)) } × I_{A_n},   (6.5)

where ĉ_m(θ) is given as in Section 3. By Lemma 3.2 in Hormann and Kokoszka [13] we have that

‖φ̂_m(θ) − ĉ_m(θ) φ_m(θ)‖ ≤ (2√2 / α_m(θ)) sup_{θ∈[−π,π]} ‖F̂_θ^X − F_θ^X‖_L,

and

sup_{θ∈[−π,π]} sup_{m≥1} |λ̂_m(θ) − λ_m(θ)| ≤ sup_{θ∈[−π,π]} ‖F̂_θ^X − F_θ^X‖_L.   (6.6)
Thus we obtain for (6.5) the bound

4√2 Σ_{m=1}^K (ψ_n^X / λ_m(θ)) [ 1/α_m(θ) + 1/λ̂_m(θ) ] × I_{A_n}.   (6.7)

We further remark that on A_n we have λ̂_m(θ) ≥ λ_m(θ) − |λ_m(θ) − λ̂_m(θ)| ≥ λ_m(θ) − ψ_n^X. Therefore, since K ≤ K^{(1)}, we have that (6.7) is bounded by

4√2 Σ_{m=1}^K (ψ_n^X / λ_m(θ)) [ 1/α_m(θ) + 2/λ_m(θ) ] ≤ 4√2 ψ_n^X ( W_λ^K(θ) W_α^K(θ) + 2 (W_λ^K(θ))² ),   (6.8)

where we have made use of the Cauchy–Schwarz inequality in the last step. Using K ≤ K^{(3)} and K ≤ K^{(4)} it is now easy to infer that (6.2) holds.
It remains to show that K → ∞ in probability, i.e. that K^{(i)} → ∞ for 1 ≤ i ≤ 4.

Fix a large k and observe that P(K^{(1)} ≥ k) = P(inf_θ λ̂_k(θ) ≥ 2ψ_n^X). Now define B_{k;n} := {sup_θ |λ̂_k(θ) − λ_k(θ)| ≤ δ_k/2}, where δ_k := inf_θ λ_k(θ). From Assumption 2 it follows that δ_k > 0. Furthermore, it follows from Lemma 1 and (6.6) that P(B_{k;n}) → 1 as n → ∞. On the other hand, inf_θ λ̂_k(θ) ≥ inf_θ λ_k(θ) − sup_θ |λ̂_k(θ) − λ_k(θ)|, so that on B_{k;n} we have inf_θ λ̂_k(θ) ≥ δ_k/2. Hence, for n large enough, we have inf_θ λ̂_k(θ) ≥ 2ψ_n^X on B_{k;n}. Consequently P(K^{(1)} ≥ k) → 1 as n → ∞, irrespective of how large k was chosen.

Now we prove K^{(4)} → ∞. Fix again a big k and notice that it suffices to show that P( ∫_{−π}^{π} (min_{1≤m≤k} α̂_m(θ))^{−2} dθ > x_n ) → 0 for any x_n → ∞. Define B′_{k;n} := {sup_θ |α̂_k(θ) − α_k(θ)| ≤ δ′_k/2}, where δ′_k := inf_θ α_k(θ), and set A_{k;n} = ∩_{m=1}^k B′_{m;n}. Then for any fixed k we have P(A_{k;n}) → 1, and on A_{k;n} it holds that min_{1≤m≤k} α̂_m(θ) ≥ min_{1≤m≤k} δ′_m/2 =: r_k. By Assumption 5, r_k > 0 for any k. Hence, on A_{k;n} the integral ∫_{−π}^{π} (min_{1≤m≤k} α̂_m(θ))^{−2} dθ is bounded by 2π/r_k², and this is smaller than x_n when n is big enough. This proves K^{(4)} → ∞.

The proofs of K^{(2)} → ∞ and K^{(3)} → ∞ are similar and therefore omitted.
1 Appendix
In this appendix we derive the FPE method for selecting the dimension parameter K used in Sections 3 and 4. In Section 1.1, we discuss the relation of our spectral approach to time-domain estimation in functional regression. This motivates the derivation of the FPE method in Section 1.2. Section 1.3 contains the proofs of two results stated in Sections 1.1 and 1.2.
1.1 Relation to ordinary functional regression
As before, we consider complex Hilbert spaces H and H′ and define, for elements (a, f), (b, g) ∈ H′ × H, [(a, f), (b, g)] = ⟨a, b⟩ + ⟨f, g⟩. This defines an inner product on H′ × H, and with it the space becomes a Hilbert space. Let us fix a frequency θ ∈ [−π, π] and define a zero-mean complex random element ∆ = (Υ, Ξ) ∈ L²_{H′×H} such that

C^∆ = E∆ ⊗ ∆ = ( C^Υ, C^{ΥΞ} ; C^{ΞΥ}, C^Ξ ) = ( F_θ^Y, F_θ^{YX} ; F_θ^{XY}, F_θ^X ),   (1.1)

where the 2 × 2 blocks are written row by row.
Now we regress Υ on Ξ, i.e. we seek the h_0 ∈ L(H, H′) (the space of bounded linear operators from H to H′) which satisfies

h_0 = argmin_{h∈L(H,H′)} E‖Υ − h(Ξ)‖².

Then, by the usual projection arguments, h_0 solves the equation C^{ΥΞ} = h_0 ∘ C^Ξ. By the definition of C^{ΥΞ} and C^Ξ, it follows that h_0 is also the solution to (3.1) and hence, by Assumption 2, is equal to B_θ. Consequently h_0, or equivalently B_θ, can also be estimated from a random sample ((Υ_k, Ξ_k) : 1 ≤ k ≤ L) by standard methods known from functional linear models. A typical
estimator (see e.g. Cardot et al. [6]) is

ĥ_{0;d}(f) = Σ_{ℓ=1}^d ( Ĉ^{ΥΞ}(v̂_ℓ) / γ̂_ℓ ) ⟨f, v̂_ℓ⟩ =: Σ_{ℓ=1}^d b̂_ℓ ⟨f, v̂_ℓ⟩,   (1.2)

where Ĉ^{ΥΞ}(f) := L^{−1} Σ_{k=1}^L Υ_k ⟨f, Ξ_k⟩, and γ̂_ℓ and v̂_ℓ are the eigenvalues and eigenvectors of Ĉ^Ξ(f) := L^{−1} Σ_{k=1}^L Ξ_k ⟨f, Ξ_k⟩.
In practice we do not know C^∆ but, as we will see in Lemma 1 below, it can be consistently estimated from the data, which in turn allows us to generate a random sample ((Υ_i, Ξ_i) : 1 ≤ i ≤ L) with a covariance which is asymptotically equal to C^∆. A more direct approach is to define the functional discrete Fourier transforms

Υ_{k|p} = (2πp)^{−1/2} Σ_{t=p(k−1)+1}^{pk} Y_t e^{−i(t−p(k−1))θ}   and   Ξ_{k|p} = (2πp)^{−1/2} Σ_{t=p(k−1)+1}^{pk} X_t e^{−i(t−p(k−1))θ}.

If we denote by C_p^{ΥΞ} and C_p^Ξ the cross-covariance and covariance operators related to the sequence ((Υ_{k|p}, Ξ_{k|p}) : 1 ≤ k ≤ L), the following lemma holds:
Lemma 6. Consider the estimator F̂_{θ|p}^X with the Bartlett weights w_p(h) = 1 − |h|/p. Under Assumption 3 we have ‖F̂_{θ|p}^X − C_p^Ξ‖²_S = O_P(p³/n). Under the same conditions we have ‖F̂_{θ|p}^{YX} − C_p^{ΥΞ}‖²_S = O_P(p³/n).

The lemma, which we prove in Section 1.3, confirms that computing (1.2) from the variables (Υ_{k|p}, Ξ_{k|p}), which serve as an approximation to a random sample (Υ_k, Ξ_k), yields an estimator which closely resembles B̂_{θ|p,p,d} in (3.4).
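In coordinates, the functional discrete Fourier transforms of the blocks take only a few lines of numpy. A sketch, with curves as rows of coefficient vectors (the sanity check against EΞΞ* = C_0/(2π) for i.i.d. data in the usage below is an illustration, not a statement from the text):

```python
import numpy as np

def dft_blocks(X, theta, p):
    """Split the first L*p observations into L blocks of length p and return
    Xi_{k|p} = (2*pi*p)^{-1/2} * sum_{t=1}^{p} X_{t+(k-1)p} * exp(-i*t*theta),
    one row per block.  Rows of X are the coefficient vectors of the curves;
    note that t - p(k-1) runs over 1, ..., p within each block."""
    n, d = X.shape
    L = n // p
    phase = np.exp(-1j * theta * np.arange(1, p + 1))
    blocks = X[:L * p].reshape(L, p, d)
    return np.einsum('t,ltd->ld', phase, blocks) / np.sqrt(2 * np.pi * p)
```

For i.i.d. curves the blocks are independent, and the sample second moment of the transformed blocks approximates the spectral density operator at θ, which for white noise is C_0/(2π) at every frequency.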
1.2 Description of the FPE approach
In order to keep this discussion short, we only consider the scalar response case. This is in line
with our simulation study. Our starting point is the alternative interpretation of Bθ discussed in
Section 1.1. Suppose we have an estimator h0;d for Bθ from a sample ((Υk,Ξk) : 1 ≤ k ≤ L). Now
we pick (Υ, Ξ) independent of this sample and set K = K_θ = argmin_{d≥0} E|Υ − ĥ_{0;d}(Ξ)|². Note that here, by the Riesz representation theorem, ĥ_{0;d}(Ξ) is of the form ⟨Ξ, ĥ_{0;d}⟩. With d = K in (1.2) we minimize the mean squared prediction error in this functional regression. The related model selection criterion is commonly known as the final prediction error (FPE) criterion. Of course, computing K explicitly is mathematically infeasible, and therefore we resort to an approximation. For this purpose, we first note that the coefficients b̂_k in (1.2) satisfy
b̂_d := (b̂_1, ..., b̂_d)′ = argmin_{(b_1,...,b_d)∈C^d} Σ_{i=1}^L |Υ_i − Σ_{ℓ=1}^d b_ℓ ⟨Ξ_i, v̂_ℓ⟩|².

Our problem is greatly simplified if we replace the empirical principal component scores by the population ones and set

b̃_d := (b̃_1, ..., b̃_d)′ = argmin_{(b_1,...,b_d)∈C^d} Σ_{i=1}^L |Υ_i − Σ_{ℓ=1}^d b_ℓ ⟨Ξ_i, v_ℓ⟩|²,

and then define h̃_{0;d}(Ξ) = Σ_{ℓ=1}^d b̃_ℓ ⟨Ξ, v_ℓ⟩ and K̃ = argmin_{d≥0} E|Υ − h̃_{0;d}(Ξ)|².
Proposition 1. Suppose that ((Υ_i, Ξ_i) : 1 ≤ i ≤ L) constitute a Gaussian random sample with circularly-symmetric observations, i.e. E∆[∆, (a, f)] = 0 for any (a, f) ∈ H′ × H. Then for L > d we have

E|Υ − h̃_{0;d}(Ξ)|² = σ_d² × L/(L − d),

where σ_d² = (L − d)^{−1} E(Υ − X b̃_d)*(Υ − X b̃_d), with X = (⟨Ξ_i, v_ℓ⟩ : 1 ≤ i ≤ L; 1 ≤ ℓ ≤ d) and Υ = (Υ_1, ..., Υ_L)′.
The proof of this proposition is given in Section 1.3. Assuming Gaussianity is not a restriction, since our estimator only relies on the second order structure of the data. Furthermore, by Panaretos and Tavakoli [24] we know that, under general dependence assumptions, the discrete Fourier transforms Υ_{i|p} and Ξ_{i|p} are asymptotically (p → ∞) complex normal random elements.
The proposition then suggests choosing d such that σ_d² × L/(L − d) is minimized. An unbiased estimate for the unknown σ_d² is

(L − d)^{−1} (Υ − X b̃_d)*(Υ − X b̃_d).

Finally, replacing the theoretical scores by the empirical ones leads to the following dimension selection:

K̂ = argmin_{0≤d<L} L/(L − d)² (Υ − X̂ b̂_d)*(Υ − X̂ b̂_d),   (1.3)

where X̂ = (⟨Ξ_{i|p}, v̂_ℓ⟩ : 1 ≤ i ≤ L; 1 ≤ ℓ ≤ d) and Υ = (Υ_{1|p}, ..., Υ_{L|p})′.
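For a scalar response, the selection rule (1.3) is a one-dimensional search over d. A sketch on generic score matrices (real-valued data is used in the test for simplicity, although `np.linalg.lstsq` handles the complex scores of the text verbatim; the data-generating choices are illustrative):

```python
import numpy as np

def fpe_dimension(scores, upsilon, d_max):
    """FPE selection (1.3) for a scalar response:
    K = argmin_{0 <= d <= d_max} L/(L-d)^2 * ||Upsilon - X_d b_d||^2,
    where X_d holds the first d principal-component scores and b_d is the
    least-squares coefficient vector on those scores."""
    L = len(upsilon)
    rss0 = np.sum(np.abs(upsilon) ** 2)        # d = 0: predict by zero
    crit = [L / L ** 2 * rss0]
    for d in range(1, d_max + 1):
        Xd = scores[:, :d]
        b, *_ = np.linalg.lstsq(Xd, upsilon, rcond=None)
        rss = np.sum(np.abs(upsilon - Xd @ b) ** 2)
        crit.append(L / (L - d) ** 2 * rss)
    return int(np.argmin(crit))
```

The residual sum of squares decreases with d, while the factor L/(L − d)² penalizes additional components, so the criterion trades approximation error against estimation variance.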
1.3 Proofs of Lemma 6 and Proposition 1
Proof of Lemma 6. We define the block autocovariances

C̃_h^X = (Lp)^{−1} Σ_{k=0}^{L−1} Σ_{t=1}^{p−h} X_{t+h+kp} ⊗ X_{t+kp},  for 0 ≤ h < p,

and

C̃_h^X = (Lp)^{−1} Σ_{k=0}^{L−1} Σ_{t=|h|+1}^{p} X_{t−|h|+kp} ⊗ X_{t+kp},  for −p < h < 0

(the tilde distinguishes these block-based quantities from the full-sample autocovariances Ĉ_h^X). Direct verification shows that

C_p^Ξ = (2π)^{−1} Σ_{|h|<p} C̃_h^X e^{−ihθ}.

For two random operators A_n and B_n we write A_n = B_n + O_P(m_n) if ‖A_n − B_n‖_S = O_P(m_n). Then, for p > h ≥ 0, we deduce with the help of Lemma 5 that

n Ĉ_h^X − Lp C̃_h^X = Σ_{k=0}^{L−1} Σ_{t=p−h+1}^{p} X_{t+h+kp} ⊗ X_{t+kp} + Σ_{t=Lp+1}^{n−h} X_{t+h} ⊗ X_t
  = Lh (Ĉ_h^X + O_P(L^{−1/2})) = Lh (C_h^X + O_P(L^{−1/2})).

The same bound can be derived for h < 0. Thus,

C̃_h^X = (1 − |h|/p) Ĉ_h^X + (n/(Lp) − 1) Ĉ_h^X + O_P(L^{−1/2}),

and since n/(Lp) − 1 ≤ p/(n − p), we have

C̃_h^X = (1 − |h|/p) Ĉ_h^X + O_P((p/n)^{1/2}).

We conclude that ‖C_p^Ξ − F̂_{θ|p}^X‖²_S = O_P(p³/n). A similar bound can be obtained for C_p^{ΥΞ} − F̂_{θ|p}^{YX}. This proves Lemma 6.
Proof of Proposition 1. We have

E|Υ − h̃_{0;d}(Ξ)|² = E|Υ − Σ_{ℓ=1}^d b̃_ℓ ⟨Ξ, v_ℓ⟩|² = E|Σ_{ℓ=1}^d (b_ℓ − b̃_ℓ) ⟨Ξ, v_ℓ⟩ + Z|²,   (1.4)

where Z = (Υ − ⟨Ξ, h_0⟩) + Σ_{ℓ>d} b_ℓ ⟨Ξ, v_ℓ⟩. We set ε = Υ − ⟨Ξ, h_0⟩. By the projection theorem it follows that Cov(ε, Ξ) = 0. Furthermore, since the b̃_ℓ are independent of Ξ, and since principal component scores are orthogonal, it follows that (1.4) equals

E|Σ_{ℓ=1}^d (b_ℓ − b̃_ℓ) ⟨Ξ, v_ℓ⟩|² + E|Z|² = Σ_{ℓ=1}^d E|b_ℓ − b̃_ℓ|² γ_ℓ + E|Z|².
With Γ = diag(γ_1, ..., γ_d), Z = (Z_1, ..., Z_L)′ and Z_i = ε_i + Σ_{ℓ>d} b_ℓ ⟨Ξ_i, v_ℓ⟩, we get

Σ_{ℓ=1}^d E|b_ℓ − b̃_ℓ|² γ_ℓ = E[ (b̃_d − b_d)* Γ (b̃_d − b_d) ]
  = E[ Z* X (X*X)^{−1} Γ (X*X)^{−1} X* Z ]
  = tr( Γ E[ (X*X)^{−1} X* Z Z* X (X*X)^{−1} ] ).   (1.5)

We have E[Z Z*] = E|Z|² I_L. The imposed circular symmetry implies that

E Υ⟨Ξ, f⟩ = 0 and E⟨Ξ, f⟩⟨Ξ, g⟩ = 0 for all f, g ∈ H.   (1.6)

Consequently, by Gaussianity it follows that Z and X are independent. (Note that two complex Gaussian random variables U_1 and U_2, say, are independent if and only if Cov(U_1, U_2) = Cov(U_1, Ū_2) = 0.) We can therefore conclude by a simple conditioning argument that (1.5) simplifies to

E|Z|² tr( E[ ((XΓ^{−1/2})* (XΓ^{−1/2}))^{−1} ] ) =: E|Z|² tr(E W^{−1}).

The matrix W^{−1} is an inverse complex Wishart matrix with expectation E W^{−1} = I_d/(L − d). Thus E|Υ − h̃_{0;d}(Ξ)|² = E|Z|² × L/(L − d).
Bibliography
[1] A. Aue, S. Hormann, L. Horvath, and M. Reimherr. Break detection in the covariance structure
of multivariate time series models. The Annals of Statistics, 37:4046–4087, 2009.
[2] D. Bosq. Linear Processes in Function Spaces. Springer, 2000.
[3] G. E. P. Box, G. M. Jenkins, and G. C. Reinsel. Time Series Analysis: Forecasting and Control.
Prentice Hall, Englewood Cliffs, third edition, 1994.
[4] D. R. Brillinger. Time Series: Data Analysis and Theory. Holt, New York, 1975.
[5] T. Cai and P. Hall. Prediction in functional linear regression. The Annals of Statistics, 34:
2159–2179, 2006.
[6] H. Cardot, F. Ferraty, and P. Sarda. Functional linear model. Statistics and Probability Letters,
45:11–22, 1999.
[7] H. Cardot, F. Ferraty, A. Mas, and P. Sarda. Testing hypothesis in the functional linear model.
Scandinavian Journal of Statistics, 30:241–255, 2003.
[8] J-M. Chiou and H-G. Muller. Diagnostics for functional regression via residual processes.
Computational Statistics and Data Analysis, 15:4849–4863, 2007.
[9] F. Comte and J. Johannes. Adaptive functional linear regression. The Annals of Statistics, 40:
2765–2797, 2012.
[10] C. Crambes, A. Kneip, and P. Sarda. Smoothing splines estimators for functional linear regres-
sion. The Annals of Statistics, 37:35–72, 2009.
[11] R. Gabrys, L. Horvath, and P. Kokoszka. Tests for error correlation in the functional linear
model. Journal of the American Statistical Association, 105:1113–1125, 2010.
[12] S. Hormann and L. Kidzinski. A note on estimation in Hilbertian linear models. Scandinavian
Journal of Statistics, 2014. Forthcoming.
[13] S. Hormann and P. Kokoszka. Weakly dependent functional data. The Annals of Statistics, 38:
1845–1884, 2010.
[14] S. Hormann and P. Kokoszka. Functional time series. In C. R. Rao and T. Subba Rao, editors,
Time Series, volume 30 of Handbook of Statistics. Elsevier, 2012.
[15] S. Hormann, L. Horvath, and R. Reeder. A functional version of the ARCH model. Econometric
Theory, 29:267–288, 2013.
[16] S. Hormann, L. Kidzinski, and M. Hallin. Dynamic functional principal components. Journal
of the Royal Statistical Society: Series B, 2014. Forthcoming.
[17] L. Horvath and P. Kokoszka. Inference for Functional Data with Applications. Springer, 2012.
[18] G. M. James, J. Wang, and J. Zhu. Functional linear regression that’s interpretable. The
Annals of Statistics, 37:2083–2108, 2009.
[19] P. Kokoszka and M. Reimherr. Predictability of shapes of intraday price curves. The Econo-
metrics Journal, 16:285–308, 2013.
[20] A. N. Kolmogorov. Interpolation und Extrapolation von stationaren zufalligen Folgen. Bull.
Acad. Sci. U.S.S.R., 5:3–14, 1941.
[21] Y. Li and T. Hsing. On rates of convergence in functional linear regression. Journal of Multi-
variate Analysis, 98:1782–1804, 2007.
[22] I. McKeague and B. Sen. Fractals with point impacts in functional linear regression. The
Annals of Statistics, 38:2559–2586, 2010.
[23] H-G. Muller and U. Stadtmuller. Generalized functional linear models. The Annals of Statistics,
33:774–805, 2005.
[24] V. M. Panaretos and S. Tavakoli. Fourier analysis of stationary time series in function space.
The Annals of Statistics, 41:568–603, 2013.
[25] V. M. Panaretos and S. Tavakoli. Cramer–Karhunen–Loeve representation and harmonic prin-
cipal component analysis of functional time series. Stochastic Processes and their Applications,
123:2779–2807, 2013.
[26] M. B. Priestley. Spectral Analysis and Time Series. Academic Press, 1981.
[27] J. O. Ramsay and B. W. Silverman. Functional Data Analysis. Springer, 2005.
[28] X. Shao and W. B. Wu. Asymptotic spectral theory for nonlinear time series. The Annals of
Statistics, 35:1773–1801, 2007.
[29] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications with R Examples.
Springer, 2011.
[30] N. Wiener. The Extrapolation, Interpolation and Smoothing of Stationary Time Series with
Engineering Applications. Wiley, 1949.
[31] W. Wu. Nonlinear System Theory: Another Look at Dependence, volume 102. The National
Academy of Sciences of the United States, 2005.
[32] F. Yao, H-G. Muller, and J-L. Wang. Functional linear regression analysis for longitudinal data.
The Annals of Statistics, 33:2873–2903, 2005.
Appendix A
Dynamic Functional Principal Components
Dynamic Functional Principal Components∗
Siegfried Hormann1, Lukasz Kidzinski1, Marc Hallin2,3
1 Department of Mathematics, Universite libre de Bruxelles (ULB), CP210, Bd. du Triomphe, B-1050
3 ORFE, Princeton University, Sherrerd Hall, Princeton, NJ 08540, USA.
Abstract
In this paper, we address the problem of dimension reduction for time series of functional data (X_t : t ∈ Z). Such functional time series frequently arise, e.g., when a continuous-time process is segmented into
some smaller natural units, such as days. Then each Xt represents one intraday curve. We argue that
functional principal component analysis (FPCA), though a key technique in the field and a benchmark for
any competitor, does not provide an adequate dimension reduction in a time-series setting. FPCA indeed
is a static procedure which ignores the essential information provided by the serial dependence structure of
the functional data under study. Therefore, inspired by Brillinger’s theory of dynamic principal components,
we propose a dynamic version of FPCA, which is based on a frequency-domain approach. By means of a
simulation study and an empirical illustration, we show the considerable improvement the dynamic approach
entails when compared to the usual static procedure.
Keywords. Dimension reduction, frequency domain analysis, functional data analysis, functional
time series, functional spectral analysis, principal components, Karhunen-Loève expansion.
1 Introduction
The tremendous technical improvements in data collection and storage allow us to get an increasingly
complete picture of many common phenomena. In principle, most processes in real life are continuous
in time and, with improved data acquisition techniques, they can be recorded at arbitrarily high
frequency. To benefit from this increasing information, we need appropriate statistical tools that can
help extract the most important characteristics of some possibly high-dimensional specifications.
Functional data analysis (FDA), in recent years, has proven to be an appropriate tool in many
such cases and has consequently evolved into a very important field of research in the statistical
community.
Typically, functional data are considered as realizations of (smooth) random curves. Then every
observation X is a curve (X(u) : u ∈ U). One generally assumes, for simplicity, that U = [0, 1], but
U could be a more complex domain like a cube or the surface of a sphere. Since observations are
functions, we are dealing with high-dimensional – in fact intrinsically infinite-dimensional – objects.
So, not surprisingly, there is a demand for efficient data-reduction techniques. As such, functional
∗Manuscript has been accepted for publication in Journal of the Royal Statistical Society: Series B
principal component analysis (FPCA) has taken a leading role in FDA, and functional principal
components (FPC) arguably can be seen as the key technique in the field.
In analogy to classical multivariate PCA (see Jolliffe [22]), functional PCA relies on an eigen-
decomposition of the underlying covariance function. The mathematical foundations for this have
been laid several decades ago in the pioneering papers by Karhunen [23] and Loève [26], but it took
a while until the method was popularized in the statistical community. Some earlier contributions
are Besse and Ramsay [5], Ramsay and Dalzell [30] and, later, the influential books by Ramsay and
Silverman [31], [32] and Ferraty and Vieu [11]. Statisticians have been working on problems related
to estimation and inference (Kneip and Utikal [24], Benko et al. [3]), asymptotics (Dauxois et al. [10]
and Hall and Hosseini-Nasab [15]), smoothing techniques (Silverman [34]), sparse data (James et
al. [21], Hall et al. [16]), and robustness issues (Locantore et al. [25], Gervini [12]), to name just a
few. Important applications include FPC-based estimation of functional linear models (Cardot et
al. [9], Reiss and Ogden [33]) or forecasting (Hyndman and Ullah [20], Aue et al. [1]). The usefulness
of functional PCA has also been recognized in other scientific disciplines, like chemical engineering
(Gokulakrishnan et al. [14]) or functional magnetic resonance imaging (Aston and Kirch [2], Viviani
et al. [37]). Many more references can be found in the above cited papers and in Sections 8–10 of
Ramsay and Silverman [32], to which we refer for background reading.
Most existing concepts and methods in FDA, even though they may tolerate some amount of
serial dependence, have been developed for independent observations. This is a serious weakness, as
in numerous applications the functional data under study are obviously dependent, either in time or
in space. Examples include daily curves of financial transactions, daily patterns of geophysical and
environmental data, annual temperatures measured on the surface of the earth, etc. In such cases,
we should view the data as the realization of a functional time series (Xt(u) : t ∈ Z), where the time
parameter t is discrete and the parameter u is continuous. For example, in case of daily observations,
the curve Xt(u) may be viewed as the observation on day t with intraday time parameter u. A key
reference on functional time series techniques is Bosq [8], who studied functional versions of AR
processes. We also refer to Hörmann and Kokoszka [19] for a survey.
Ignoring serial dependence in this time-series context may result in misleading conclusions and
inefficient procedures. Hörmann and Kokoszka [18] investigate the robustness properties of some
classical FDA methods in the presence of serial dependence. Among others, they show that usual
FPCs still can be consistently estimated within a quite general dependence framework. Then the
basic problem, however, is not about consistently estimating traditional FPCs: the problem is that,
in a time-series context, traditional FPCs are not the adequate concept of dimension reduction
anymore – a fact which, since the seminal work of Brillinger [6], is well recognized in the usual
vector time-series setting. FPCA indeed operates in a static way: when applied to serially dependent
curves, it fails to take into account the potentially very valuable information carried by the past
values of the functional observations under study. In particular, a static FPC with small eigenvalue,
hence negligible instantaneous impact on Xt, may have a major impact on Xt+1, and high predictive
value.
Besides their failure to produce optimal dimension reduction, static FPCs, while cross-sectionally
uncorrelated at fixed time t, typically still exhibit lagged cross-correlations. Therefore the resulting
FPC scores cannot be analyzed componentwise as in the i.i.d. case, but need to be considered as
vector time series which are less easy to handle and interpret.
These major shortcomings are motivating the present development of dynamic functional prin-
cipal components (dynamic FPCs). The idea is to transform the functional time series into a vector
time series (of low dimension, ≤ 4, say), where the individual component processes are mutually
uncorrelated (at all leads and lags; autocorrelation is allowed, though), and account for most of
the dynamics and variability of the original process. The analysis of the functional time series can
then be performed on those dynamic FPCs; thanks to their mutual orthogonality, dynamic FPCs
moreover can be analyzed componentwise. In analogy to static FPCA, the curves can be optimally
reconstructed/approximated from the low-dimensional dynamic FPCs via a dynamic version of the
celebrated Karhunen-Loève expansion.
Dynamic principal components were first suggested by Brillinger [6] for vector time series.
The purpose of this article is to develop and study a similar approach in a functional setup. The
methodology relies on a frequency-domain analysis for functional data, a topic which is still in its
infancy (see, for instance, Panaretos and Tavakoli [27]).
The rest of the paper is organized as follows. In Section 2 we give a first illustration of the
procedure and sketch two typical applications. In Section 3, we describe our approach and state
a number of relevant propositions. We also provide some asymptotic features. In Section 4, we
discuss its computational implementation. After an illustration of the methodology by a real data
example on pollution curves in Section 5, we evaluate our approach in a simulation study (Section 6).
Appendices A and B detail the mathematical framework and contain the proofs. Some of the more
technical results and proofs are provided in a supplementary document.
After the present paper (which has been available on Arxiv since October 2012) was submitted,
another paper by Panaretos and Tavakoli [28] was published, where similar ideas are proposed. While
both papers aim at the same objective of a functional extension of Brillinger’s concept, there are
essential differences between the solutions developed. The main result in Panaretos and Tavakoli [28]
is the existence of a functional process (X∗t ) of rank q which serves as an “optimal approximation”
to the process (Xt) under study. The construction of (X∗t ), which is mathematically quite elegant,
is based on stochastic integration with respect to some orthogonal-increment (functional) stochas-
tic process (Zω). The disadvantage, from a statistical perspective, is that this construction is not
explicit, and that no finite-sample version of the concept is provided – only the limiting behavior
of the empirical spectral density operator and its eigenfunctions is obtained. Quite on the contrary,
our Theorem 4 establishes the consistency of an empirical, explicitly constructed and easily imple-
mentable version of the dynamic scores – which is what a statistician will be interested in. We also
remark that we are working under milder technical conditions.
2 Illustration of the method
An impression of how well the proposed method works can be obtained from Figure 1. Its left panel
shows ten consecutive intraday curves of some pollutant level (a detailed description of the underlying
data is given in Section 5). The two panels to the right show one-dimensional reconstructions of
these curves. We used static FPCA in the central panel and dynamic FPCA in the right panel. The
[Figure 1: three panels plotting Sqrt(PM10) against intraday time on [0, 1].]
Figure 1: Ten successive daily observations (left panel), the corresponding static Karhunen-Loève expansion based on one (static) principal component (middle panel), and the dynamic Karhunen-Loève expansion with one dynamic component (right panel). Colors provide the matching between the actual observations and their Karhunen-Loève approximations.
difference is notable. The static method merely provides an average level, exhibiting a completely
spurious and highly misleading intraday symmetry. In addition to daily average levels, the dynamic
approximation, to a large extent, also catches the intraday evolution of the curves. In particular,
it retrieves the intraday trend of pollution levels, and the location of their daily spikes and troughs
(which varies considerably from one curve to the other). For this illustrative example we chose
one-dimensional reconstructions, based on one single FPC; needless to say, increasing the number
of FPCs yields much better approximations – see Section 4 for details.
Applications of dynamic PCA in time series analysis are the same as those of static PCA in the
context of independent (or uncorrelated) observations. This is why obtaining mutually orthogonal
principal components – in the sense of mutually orthogonal processes – is a major issue here. This
orthogonality, at all leads and lags, of dynamic principal components, indeed, implies that any
second-order based method (which is the most common approach in time series) can be carried out
componentwise, i.e. via scalar methods. In contrast, static principal components still have to be
treated as a multivariate time series.
Let us illustrate this superiority of mutually orthogonal dynamic components over the auto- and
cross-correlated static ones by means of two examples.
Change point analysis: Suppose that we wish to find a structural break (change point) in a sequence
of functional observations X1, . . . , Xn. For example, Berkes et al. [4] consider the problem of detecting a change in the mean function of a sequence of independent functional data. They propose to first project the data on the p leading principal components and argue that a change in the mean will show in the score vectors, provided that the proportion of variance they account for is large enough. Then a CUSUM procedure is utilized. The test statistic is based on the functional
T_n(x) = \frac{1}{n} \sum_{m=1}^{p} \lambda_m^{-1} \Bigg( \sum_{1 \le k \le nx} Y^{\mathrm{stat}}_{mk} - x \sum_{1 \le k \le n} Y^{\mathrm{stat}}_{mk} \Bigg)^{\!2}, \qquad 0 \le x \le 1.
Here Y^{\mathrm{stat}}_{mk} is the m-th empirical PC score of X_k and \lambda_m is the m-th largest eigenvalue of the empirical
covariance operator related to the functional sample. The assumption of independence implies that
Tn(x) converges, under the no-change hypothesis, to the sum of p squared independent Brownian
bridges. Roughly speaking, this is due to the fact that the partial sums of score vectors (used in the
CUSUM statistic) converge in distribution to a multivariate normal with diagonal covariance. That
is, the partial sums of the individual scores become asymptotically independent, and we just obtain
p independent CUSUM test statistics – a separate one for each score sequence. The independent
test statistics are then aggregated.
This simple structure is lost when data are serially dependent. Then, if a CLT holds, the vector of partial sums \big(\sum_{1 \le k \le n} Y^{\mathrm{stat}}_{mk} : m = 1, \ldots, p\big)' converges to a normal vector whose covariance is no longer the diagonal matrix of the independent case: it has to be replaced by the long-run covariance of the score vectors, which is typically non-diagonal.
In contrast, using dynamic principal components, the long-run covariance of the score vectors
remains diagonal; see Proposition 4. Let \mathrm{diag}(\hat\lambda_1(0), \ldots, \hat\lambda_p(0)) be a consistent estimator of this long-run variance and Y^{\mathrm{dyn}}_{mk} be the dynamic scores. Then, replacing the test functional T_n(x) by

T^{\mathrm{dyn}}_n(x) = \frac{2\pi}{n} \sum_{m=1}^{p} \hat\lambda_m^{-1}(0) \Bigg( \sum_{1 \le k \le nx} Y^{\mathrm{dyn}}_{mk} - x \sum_{1 \le k \le n} Y^{\mathrm{dyn}}_{mk} \Bigg)^{\!2}, \qquad 0 \le x \le 1,
we get that (under appropriate technical assumptions ensuring a functional CLT) the same asymp-
totic behavior holds as for Tn(x), so that again p independent CUSUM test statistics can be aggre-
gated.
Dynamic principal components, thus, and not the static ones, provide a feasible extension of the
Berkes et al. [4] method to the time series context.
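To make the two statistics concrete, here is a small numerical sketch (the toy scores and all names are ours, not the paper's) of the common CUSUM functional applied to p score series; T_n and T_n^{dyn} differ only in the normalizers and the factor 2π:

```python
import numpy as np

def cusum_functional(Y, lam):
    """Y: (p, n) array of score series; lam: (p,) normalizers.
    Returns the CUSUM functional evaluated at x = k/n, k = 1, ..., n."""
    p, n = Y.shape
    S = np.cumsum(Y, axis=1)                 # partial sums over 1 <= k <= nx
    x = np.arange(1, n + 1) / n
    B = S - x[None, :] * S[:, -1:]           # bridge-type deviations
    return (B ** 2 / lam[:, None]).sum(axis=0) / n

rng = np.random.default_rng(0)
p, n = 3, 500
Y = rng.standard_normal((p, n))              # toy scores under the no-change null
T = cusum_functional(Y, np.ones(p))

Y_shift = Y.copy()
Y_shift[0, n // 2:] += 1.0                   # mean change in one score series
T_shift = cusum_functional(Y_shift, np.ones(p))
```

A level change in any single score series inflates the supremum of the functional; in practice the normalizers are replaced by estimated (long-run) eigenvalues and sup_x of the statistic is compared with quantiles of a sum of p squared Brownian bridges.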
Lagged regression: A lagged regression model is a linear model in which the response Wt ∈ Rq, say,
is allowed to depend on an unspecified number of lagged values of a series of regressor variables
(Xt) ∈ Rp. More specifically, the model equation is
W_t = a + \sum_{k \in \mathbb{Z}} b_k X_{t-k} + \varepsilon_t, \qquad (2.1)
with some i.i.d. noise (εt) which is independent of the regressor series. The intercept a ∈ Rq and
the matrices bk ∈ Rq×p are unknown. In time series analysis, the lagged regression is the natural
extension of the traditional linear model for independent data.
The main problem in this context, which can be tackled by a frequency domain approach, is
estimation of the parameters. See, for example, Shumway and Stoffer [35] for an introduction. Once
the parameters are known, the model can, e.g., be used for prediction.
Suppose now that Wt is a scalar response and that (Xk) constitutes a functional time series.
The corresponding lagged regression model can be formulated in analogy, but involves estimation
of an unspecified number of operators, which is quite delicate. A pragmatic way to proceed is
to have Xk in (2.1) replaced by the vector of the first p dynamic functional principal component
scores Yk = (Y1k, . . . , Ypk)′, say. The general theory implies that, under mild assumptions (basically
guaranteeing convergence of the involved series),
b_k = \frac{1}{2\pi} \int_{-\pi}^{\pi} B_\theta\, e^{ik\theta}\, d\theta, \quad \text{where} \quad B_\theta = \mathcal{F}^{WY}_\theta \big(\mathcal{F}^{Y}_\theta\big)^{-1},
and
\mathcal{F}^{Y}_\theta = \frac{1}{2\pi} \sum_{h \in \mathbb{Z}} \mathrm{cov}(Y_{t+h}, Y_t)\, e^{-ih\theta} \quad \text{and} \quad \mathcal{F}^{WY}_\theta = \frac{1}{2\pi} \sum_{h \in \mathbb{Z}} \mathrm{cov}(W_{t+h}, Y_t)\, e^{-ih\theta}
are the spectral density matrix of the score sequence and the cross-spectrum between (Wt) and
(Yt), respectively. In the present setting the structure greatly simplifies. Our theory will reveal (see
Proposition 9) that FYθ is diagonal at all frequencies and that
B_\theta = \Bigg( \frac{f^{WY_1}_\theta}{\lambda_1(\theta)},\ \ldots,\ \frac{f^{WY_p}_\theta}{\lambda_p(\theta)} \Bigg),
with f^{WY_m}_\theta the co-spectrum between (W_t) and (Y_{mt}), and \lambda_m(\theta) the m-th dynamic eigenvalue of the spectral density operator of the series (X_k) (see Section 3.2). As a consequence, the influence of each score sequence on the response can be assessed individually.
Of course, in applications, these population quantities are replaced by their empirical versions, and one may use some testing procedure for the null hypothesis H_0 : f^{WY_p}_\theta = 0 for all θ, in order to justify the choice of the dimension of the dynamic score vectors and to retain only those components which have a significant impact on W_t.
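As a numerical illustration of these formulas (a toy scalar example with p = 1; the model, bandwidth and frequency grid are our choices, not the paper's), one can recover the lag coefficients from Bartlett-weighted estimates of the spectrum and cross-spectrum:

```python
import numpy as np

rng = np.random.default_rng(1)
n, q, M = 20000, 50, 400
Y = rng.standard_normal(n)                               # scalar score series
W = (2.0 * Y - np.concatenate(([0.0], Y[:-1]))
     + 0.1 * rng.standard_normal(n))                     # W_t = 2 Y_t - Y_{t-1} + eps_t

def lag_covs(a, b):
    """cov(a_{t+h}, b_t) for |h| < q (biased sample version)."""
    return np.array([np.dot(a[h:], b[:n - h]) / n if h >= 0
                     else np.dot(a[:n + h], b[-h:]) / n
                     for h in range(-q + 1, q)])

hs = np.arange(-q + 1, q)
w = 1.0 - np.abs(hs) / q                                 # Bartlett weights
thetas = -np.pi + 2 * np.pi * np.arange(M) / M
E = np.exp(-1j * np.outer(hs, thetas))                   # e^{-ih theta}

fWY = (w * lag_covs(W, Y)) @ E / (2 * np.pi)             # cross-spectrum estimate
fYY = ((w * lag_covs(Y, Y)) @ E).real / (2 * np.pi)      # spectrum estimate (> 0)
B = fWY / fYY                                            # B_theta = F^{WY} / F^{Y}

def lag_coeff(k):
    """b_k = (1/2pi) int B_theta e^{ik theta} d theta, as a Riemann mean."""
    return (B * np.exp(1j * k * thetas)).mean().real

b0, b1, b2 = lag_coeff(0), lag_coeff(1), lag_coeff(2)
```

With the true coefficients b_0 = 2, b_1 = −1 and b_k = 0 otherwise, the estimates land close to these values up to the (small) Bartlett-window bias and sampling noise.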
3 Methodology for L2 curves
In this section, we introduce some necessary notation and tools. Most of the discussion on technical
details is postponed to the Appendices A, B and the supplementary document C. For simplicity, we
are focusing here on L2([0, 1])-valued processes, i.e. on square-integrable functions defined on the
unit interval; in the appendices, however, the theory is developed within a more general framework.
3.1 Notation and setup
Throughout this section, we consider a functional time series (Xt : t ∈ Z), where Xt takes values in
the space H := L2([0, 1]) of complex-valued square-integrable functions on [0, 1]. This means that
X_t = (X_t(u) : u \in [0,1]), \quad \text{with} \quad \int_0^1 |X_t(u)|^2\, du < \infty

(|z| := \sqrt{z\bar z}, where \bar z is the complex conjugate of z, stands for the modulus of z ∈ C). In most applications, observations are real but, since we will use spectral methods, a complex vector-space setting will prove useful.
The space H then is a Hilbert space, equipped with the inner product \langle x, y \rangle := \int_0^1 x(u)\overline{y(u)}\, du, so that \|x\| := \langle x, x \rangle^{1/2} defines a norm. The notation X ∈ L^p_H is used to indicate that, for some
p > 0, E[‖X‖p] <∞. Any X ∈ L1H then possesses a mean curve µ = (E[X(u)] : u ∈ [0, 1]), and any
X ∈ L2H a covariance operator C, defined by C(x) := E[(X − µ)〈x,X − µ〉]. The operator C is a
kernel operator given by
C(x)(u) = \int_0^1 c(u,v)\, x(v)\, dv, \quad \text{with} \quad c(u,v) := \mathrm{cov}(X(u), X(v)), \quad u, v \in [0,1],
with \mathrm{cov}(X, Y) := E(X - EX)\overline{(Y - EY)}. The process (X_t : t ∈ Z) is called weakly stationary if, for all t, (i) X_t ∈ L^2_H, (ii) EX_t = EX_0, and (iii) for all h ∈ Z and u, v ∈ [0, 1],

\mathrm{cov}(X_{t+h}(u), X_t(v)) =: c_h(u, v) \quad \text{does not depend on } t.

Denote by C_h, h ∈ Z, the operator corresponding to the autocovariance kernel c_h. Clearly, C_0 = C.
It is well known that, under quite general dependence assumptions, the mean of a stationary func-
tional sequence can be consistently estimated by the sample mean, with the usual √n-convergence
rate. Since, for our problem, the mean is not really relevant, we throughout suppose that the
data have been centered in some preprocessing step. For the rest of the paper, it is tacitly as-
sumed that (Xt : t ∈ Z) is a weakly stationary, zero mean process defined on some probability space
(Ω,A, P ).
As in the multivariate case, the covariance operator C of a random element X ∈ L2H admits an
eigendecomposition (see, e.g., p. 178, Theorem 5.1 in [13])
C(x) = \sum_{\ell=1}^{\infty} \lambda_\ell\, \langle x, v_\ell \rangle\, v_\ell, \qquad (3.1)
where (\lambda_\ell : \ell \ge 1) are C's eigenvalues (in descending order) and (v_\ell : \ell \ge 1) the corresponding normalized eigenfunctions, so that C(v_\ell) = \lambda_\ell v_\ell and \|v_\ell\| = 1. If C has full rank, then the sequence (v_\ell : \ell \ge 1) forms an orthonormal basis of L2([0, 1]). Hence, X admits the representation
X = \sum_{\ell=1}^{\infty} \langle X, v_\ell \rangle\, v_\ell, \qquad (3.2)
which is called the static Karhunen-Loève expansion of X. The eigenfunctions v_\ell are called the (static) functional principal components (FPCs) and the coefficients \langle X, v_\ell \rangle are called the (static) FPC scores or loadings. It is well known that the basis (v_\ell : \ell \ge 1) is optimal in representing X in
the following sense: if (w_\ell : \ell \ge 1) is any other orthonormal basis of H, then

E\Big\|X - \sum_{\ell=1}^{p} \langle X, v_\ell \rangle v_\ell\Big\|^2 \le E\Big\|X - \sum_{\ell=1}^{p} \langle X, w_\ell \rangle w_\ell\Big\|^2, \qquad \forall p \ge 1. \qquad (3.3)
Property (3.3) shows that a finite number of FPCs can be used to approximate the function X
by a vector of given dimension p with a minimum loss of “instantaneous” information. It should
be stressed, though, that this approximation is of a static nature, meaning that it is performed
observation by observation, and does not take into account the possible serial dependence of the
Xt’s, which is likely to exist in a time-series context. Globally speaking, we should be looking for an
approximation which also involves lagged observations, and is based on the whole family (Ch : h ∈ Z)
rather than on C0 only. To achieve this goal, we introduce below the spectral density operator, which
contains the full information on the family of operators (Ch : h ∈ Z).
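For intuition, the static expansion (3.1)–(3.3) is easy to mimic numerically. The following sketch (our own toy data; a grid-based inner product stands in for the L2([0, 1]) one) computes empirical FPCs by an eigendecomposition of the sample covariance kernel and truncates the Karhunen-Loève expansion:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 200, 50                          # n curves observed on an r-point grid
u = np.linspace(0.0, 1.0, r)

# Toy smooth curves: two principal modes of variation plus small noise.
X = (rng.standard_normal((n, 1)) * np.sin(2 * np.pi * u)
     + 0.5 * rng.standard_normal((n, 1)) * np.cos(2 * np.pi * u)
     + 0.05 * rng.standard_normal((n, r)))
X = X - X.mean(axis=0)                  # centre, as assumed throughout

c = X.T @ X / n                         # empirical covariance kernel c(u_i, u_j)
lam, V = np.linalg.eigh(c)              # eigh returns ascending eigenvalues
lam, V = lam[::-1], V[:, ::-1]          # descending order: v_1, v_2, ...

def reconstruct(p):
    """Truncated Karhunen-Loeve expansion with p static FPCs."""
    scores = X @ V[:, :p]               # static scores <X, v_l> (grid version)
    return scores @ V[:, :p].T

err = [np.mean((X - reconstruct(p)) ** 2) for p in (1, 2, 3)]
```

Property (3.3) shows up as the monotone decrease of the reconstruction error in p; in the time-series setting discussed next, this static, observation-by-observation optimality is no longer the relevant criterion.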
3.2 The spectral density operator
In analogy to the classical concept of a spectral density matrix, we define the spectral density
operator.
Definition 2. Let (Xt) be a stationary process. The operator FXθ whose kernel is
f^X_\theta(u,v) := \frac{1}{2\pi} \sum_{h \in \mathbb{Z}} c_h(u,v)\, e^{-ih\theta}, \qquad \theta \in [-\pi, \pi],
where i denotes the imaginary unit, is called the spectral density operator of (Xt) at frequency θ.
To ensure convergence (in an appropriate sense) of the series defining f^X_\theta(u,v) (see Appendix A.2), we impose the following summability condition on the autocovariances:

\sum_{h \in \mathbb{Z}} \left( \int_0^1\!\!\int_0^1 |c_h(u,v)|^2\, du\, dv \right)^{1/2} < \infty. \qquad (3.4)
The same condition is more conveniently expressed as

\sum_{h \in \mathbb{Z}} \|C_h\|_{\mathcal S} < \infty, \qquad (3.5)
where ‖ · ‖S denotes the Hilbert-Schmidt norm (see Section C.1 in the supplementary document).
A simple sufficient condition for (3.5) to hold will be provided in Proposition 7.
This concept of a spectral density operator has been introduced by Panaretos and Tavakoli [27].
In our context, this operator is used to create particular functional filters (see Sections 3.3 and A.3),
which are the building blocks for the construction of dynamic FPCs. A functional filter is defined via
a sequence Φ = (\Phi_\ell : \ell \in \mathbb{Z}) of linear operators between the spaces H = L2([0, 1]) and H' = R^p. The filtered variables Y_t have the form Y_t = \sum_{\ell \in \mathbb{Z}} \Phi_\ell(X_{t-\ell}) and, by the Riesz representation theorem, the linear operators \Phi_\ell are given as

x \mapsto \Phi_\ell(x) = (\langle x, \phi_{1\ell} \rangle, \ldots, \langle x, \phi_{p\ell} \rangle)', \quad \text{with } \phi_{1\ell}, \ldots, \phi_{p\ell} \in H.
We shall consider filters Φ for which the sequences \big(\sum_{\ell=-N}^{N} \phi_{m\ell}(u)\, e^{i\ell\theta} : N \ge 1\big), 1 \le m \le p, converge in L2([0, 1] × [−π, π]). Hence, we assume the existence of a square-integrable function \phi^\star_m(u\,|\,\theta) such that

\lim_{N\to\infty} \int_{-\pi}^{\pi}\!\!\int_0^1 \Big( \sum_{\ell=-N}^{N} \phi_{m\ell}(u)\, e^{i\ell\theta} - \phi^\star_m(u\,|\,\theta) \Big)^{2} du\, d\theta = 0. \qquad (3.6)
In addition, we suppose that

\sup_{\theta \in [-\pi,\pi]} \int_0^1 \big[\phi^\star_m(u\,|\,\theta)\big]^2\, du < \infty. \qquad (3.7)
Then, we write \phi^\star_m(\theta) := \sum_{\ell \in \mathbb{Z}} \phi_{m\ell}\, e^{i\ell\theta} or, in order to emphasize its functional nature, \phi^\star_m(u\,|\,\theta) := \sum_{\ell \in \mathbb{Z}} \phi_{m\ell}(u)\, e^{i\ell\theta}. We denote by C the family of filters Φ which satisfy (3.6) and (3.7). For example, if Φ is such that \sum_{\ell} \|\phi_{m\ell}\| < \infty, then Φ ∈ C.
The following proposition relates the spectral density operator of (X_t) to the spectral density matrix of the filtered sequence (Y_t = \sum_{\ell \in \mathbb{Z}} \Phi_\ell(X_{t-\ell})). This simple result plays a crucial role in our construction.

Proposition 2. Assume that Φ ∈ C and let \phi^\star_m(\theta) be given as above. Then the series \sum_{\ell \in \mathbb{Z}} \Phi_\ell(X_{t-\ell})
converges in mean square to a limit Yt. The p-dimensional vector process (Yt) is stationary, with
spectral density matrix
\mathcal{F}^Y_\theta = \begin{pmatrix}
\langle \mathcal{F}^X_\theta(\phi^\star_1(\theta)), \phi^\star_1(\theta) \rangle & \cdots & \langle \mathcal{F}^X_\theta(\phi^\star_p(\theta)), \phi^\star_1(\theta) \rangle \\
\vdots & \ddots & \vdots \\
\langle \mathcal{F}^X_\theta(\phi^\star_1(\theta)), \phi^\star_p(\theta) \rangle & \cdots & \langle \mathcal{F}^X_\theta(\phi^\star_p(\theta)), \phi^\star_p(\theta) \rangle
\end{pmatrix}.
Since we do not want to assume a priori absolute summability of the filter coefficients \Phi_\ell, the series \mathcal{F}^Y_\theta = (2\pi)^{-1} \sum_{h \in \mathbb{Z}} C^Y_h e^{-ih\theta}, where C^Y_h = \mathrm{cov}(Y_h, Y_0), may not converge absolutely, and hence not pointwise in θ. As our general theory will show, the operator \mathcal{F}^Y_\theta can be considered as an element of the space L^2_{\mathbb{C}^{p\times p}}([−π, π]), i.e. the collection of measurable mappings f : [−π, π] → \mathbb{C}^{p\times p} for which \int_{-\pi}^{\pi} \|f(\theta)\|_F^2\, d\theta < \infty, where \|\cdot\|_F denotes the Frobenius norm. Equality of f and g is thus understood as \int_{-\pi}^{\pi} \|f(\theta) - g(\theta)\|_F^2\, d\theta = 0. In particular, it implies that f(θ) = g(θ) for almost all θ.
To explain the important consequences of Proposition 2, first observe that under (3.5), for
every frequency θ, the operator FXθ is a non-negative, self-adjoint Hilbert-Schmidt operator (see
Section C.1 of the supplementary file). Hence, in analogy to (3.1), FXθ admits, for all θ, the spectral
representation
\mathcal{F}^X_\theta(x) = \sum_{m \ge 1} \lambda_m(\theta)\, \langle x, \varphi_m(\theta) \rangle\, \varphi_m(\theta),
where λm(θ) and ϕm(θ) denote the dynamic eigenvalues and eigenfunctions. We impose the order
λ1(θ) ≥ λ2(θ) ≥ . . . ≥ 0 for all θ ∈ [−π, π], and require that the eigenfunctions be standardized so
that ‖ϕm(θ)‖ = 1 for all m ≥ 1 and θ ∈ [−π, π].
Assume now that we could choose the functional filters (\phi_{m\ell} : \ell \in \mathbb{Z}) in such a way that

\lim_{N\to\infty} \int_{-\pi}^{\pi}\!\!\int_0^1 \Big( \sum_{\ell=-N}^{N} \phi_{m\ell}(u)\, e^{i\ell\theta} - \varphi_m(u\,|\,\theta) \Big)^{2} du\, d\theta = 0. \qquad (3.8)
We then have \mathcal{F}^Y_\theta = \mathrm{diag}(\lambda_1(\theta), \ldots, \lambda_p(\theta)) for almost all θ, implying that the coordinate processes of (Y_t) are uncorrelated at any lag: \mathrm{cov}(Y_{mt}, Y_{m's}) = 0 for all s, t and m \ne m'. As discussed in the Introduction, this is a desirable property which the static FPCs do not possess.
3.3 Dynamic FPCs
Motivated by the discussion above, we wish to define \phi_{m\ell} in such a way that \phi^\star_m = \varphi_m (in L2([0, 1] × [−π, π])). To this end, we suppose that the function \varphi_m(u\,|\,\theta) is jointly measurable in u and θ (this assumption is discussed in Appendix A.1). The fact that the eigenfunctions are standardized to unit length implies \int_{-\pi}^{\pi}\!\int_0^1 \varphi_m^2(u\,|\,\theta)\, du\, d\theta = 2\pi. We conclude from Tonelli's theorem that \int_{-\pi}^{\pi} \varphi_m^2(u\,|\,\theta)\, d\theta < \infty for almost all u ∈ [0, 1], i.e. that \varphi_m(u\,|\,\cdot) ∈ L2([−π, π]) for all u ∈ A_m ⊂ [0, 1], where A_m has Lebesgue measure one. We now define, for u ∈ A_m,
\phi_{m\ell}(u) := \frac{1}{2\pi} \int_{-\pi}^{\pi} \varphi_m(u\,|\,s)\, e^{-i\ell s}\, ds; \qquad (3.9)
for u \notin A_m, \phi_{m\ell}(u) is set to zero. Then, it follows from the results in Appendix A.1 that (3.8) holds. We conclude that the functional filters defined via (\phi_{m\ell} : \ell \in \mathbb{Z}, 1 \le m \le p) belong to the class C and that the resulting filtered process has diagonal autocovariances at all lags.
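The passage from eigenfunctions to filter coefficients is a plain Fourier-coefficient computation. Here is a small deterministic sketch of (3.9) (the "eigenfunction" below is a made-up surrogate, not derived from any process, and the grid sizes are our choices):

```python
import numpy as np

# Frequency grid on [-pi, pi) and a toy eigenfunction phi(u | theta)
# of the separable form cos(theta) a(u) + sin(theta) b(u).
J, r = 512, 30
thetas = -np.pi + 2 * np.pi * np.arange(J) / J
u = np.linspace(0.0, 1.0, r)
a, b = np.sin(np.pi * u), np.cos(np.pi * u)
phi = np.cos(thetas)[:, None] * a[None, :] + np.sin(thetas)[:, None] * b[None, :]

def filter_coeff(l):
    # phi_{m,l}(u) = (1/2pi) int phi_m(u|s) e^{-ils} ds, as a Riemann mean
    return (phi * np.exp(-1j * l * thetas)[:, None]).mean(axis=0)

# Only the lags l = 1 and l = -1 carry mass here: phi_{m,1} = (a - ib)/2 and
# phi_{m,-1} = (a + ib)/2; the conjugate-symmetric pair yields a real filter.
c1, c_1, c2 = filter_coeff(1), filter_coeff(-1), filter_coeff(2)
```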
Definition 3 (Dynamic functional principal components). Assume that (X_t : t \in \mathbb{Z}) is a mean-zero stationary process with values in L^2_H satisfying assumption (3.5). Let \phi_{m\ell} be defined as in (3.9). Then the m-th dynamic functional principal component score of (X_t) is

Y_{mt} := \sum_{\ell \in \mathbb{Z}} \langle X_{t-\ell}, \phi_{m\ell} \rangle, \qquad t \in \mathbb{Z}. \qquad (3.10)
The next theorem, which tells us how the original process (X_t(u) : t \in \mathbb{Z}, u \in [0, 1]) can be recovered from (Y_{mt} : t \in \mathbb{Z}, m \ge 1), is the dynamic analogue of the static Karhunen-Loève expansion (3.2) associated with static principal components.
Theorem 2 (Inversion formula). Let Y_{mt} be the dynamic FPC scores related to the process (X_t(u) : t \in \mathbb{Z}, u \in [0, 1]). Then,

X_t(u) = \sum_{m \ge 1} X_{mt}(u) \quad \text{with} \quad X_{mt}(u) := \sum_{\ell \in \mathbb{Z}} Y_{m,t+\ell}\, \phi_{m\ell}(u) \qquad (3.11)

(where convergence is in mean square). Call (3.11) the dynamic Karhunen-Loève expansion of X_t.
We have mentioned in Remark 2 that dynamic FPC scores are not unique. In contrast, our proofs show that the curves X_{mt}(u) are unique. To get some intuition, let us draw a simple analogy to the static case. There, each v_\ell in the Karhunen-Loève expansion (3.2) can be replaced by −v_\ell, i.e., the FPCs are defined up to their signs. The \ell-th score is \langle X, v_\ell \rangle or \langle X, -v_\ell \rangle, and thus is not unique either. However, the curves \langle X, v_\ell \rangle v_\ell and \langle X, -v_\ell \rangle(-v_\ell) are identical.
The sums \sum_{m=1}^{p} X_{mt}(u), p \ge 1, can be seen as p-dimensional reconstructions of X_t(u), which only involve the p time series (Y_{mt} : t \in \mathbb{Z}), 1 \le m \le p. Competitors to this reconstruction are obtained by replacing \phi_{m\ell} in (3.10) and (3.11) with alternative sequences \psi_{m\ell} and \upsilon_{m\ell}. The next theorem shows that, among all filters in C, the dynamic Karhunen-Loève expansion (3.11) approximates X_t(u) in an optimal way.
Theorem 3 (Optimality of Karhunen-Loève expansions). Let Y_{mt} be the dynamic FPC scores related to the process (X_t : t \in \mathbb{Z}), and define X_{mt} as in Theorem 2. Let \tilde X_{mt} = \sum_{\ell \in \mathbb{Z}} \tilde Y_{m,t+\ell}\, \upsilon_{m\ell}, with \tilde Y_{mt} = \sum_{\ell \in \mathbb{Z}} \langle X_{t-\ell}, \psi_{m\ell} \rangle, where (\psi_{mk} : k \in \mathbb{Z}) and (\upsilon_{mk} : k \in \mathbb{Z}) are sequences in H belonging to C. Then,

E\Big\|X_t - \sum_{m=1}^{p} X_{mt}\Big\|^2 = \sum_{m>p} \int_{-\pi}^{\pi} \lambda_m(\theta)\, d\theta \;\le\; E\Big\|X_t - \sum_{m=1}^{p} \tilde X_{mt}\Big\|^2 \qquad \forall p \ge 1. \qquad (3.12)
Inequality (3.12) can be interpreted as the dynamic version of (3.3). Theorem 3 also suggests the proportion

\sum_{m \le p} \int_{-\pi}^{\pi} \lambda_m(\theta)\, d\theta \Big/ E\|X_1\|^2 \qquad (3.13)

of variance explained by the first p dynamic FPCs as a natural measure of how well a functional time series can be represented in dimension p.
3.4 Estimation and asymptotics
In practice, dynamic FPC scores need to be calculated from an estimated version of FXθ . At the same
time, the infinite series defining the scores need to be replaced by finite approximations. Suppose
again that (Xt : t ∈ Z) is a weakly stationary zero-mean time series such that (3.5) holds. Then, a
natural estimator for Y_{mt} is

\hat Y_{mt} := \sum_{\ell=-L}^{L} \langle X_{t-\ell}, \hat\phi_{m\ell} \rangle, \qquad m = 1, \ldots, p \ \text{ and } \ t = L+1, \ldots, n-L, \qquad (3.14)
where L is some integer and \hat\phi_{m\ell} is computed from some estimated spectral density operator \hat{\mathcal F}^X_\theta. For the latter, we impose the following preliminary assumption.
Assumption B.1 The estimator \hat{\mathcal F}^X_\theta is consistent in integrated mean square, i.e.

\int_{-\pi}^{\pi} E\,\big\| \hat{\mathcal F}^X_\theta - \mathcal F^X_\theta \big\|_{\mathcal S}^2\, d\theta \to 0 \quad \text{as } n \to \infty. \qquad (3.15)
Panaretos and Tavakoli [27] propose an estimator \hat{\mathcal F}^X_\theta satisfying (3.15) under certain functional cumulant conditions. By stating (3.15) as an assumption, we intend to keep the theory more widely applicable. For example, the following proposition shows that estimators satisfying Assumption B.1 also exist under L4-m-approximability, a dependence concept for functional data introduced in Hörmann and Kokoszka [18]. Define
\hat{\mathcal F}^X_\theta = \frac{1}{2\pi} \sum_{|h| \le q} \Big(1 - \frac{|h|}{q}\Big)\, \hat C^X_h\, e^{-ih\theta}, \qquad 0 < q < n, \qquad (3.16)

where \hat C^X_h is the usual empirical autocovariance operator at lag h.
Proposition 5. Let (X_t : t \in \mathbb{Z}) be L4-m-approximable, and let q = q(n) \to \infty such that q^3 = o(n). Then the estimator \hat{\mathcal F}^X_\theta defined in (3.16) satisfies Assumption B.1. The approximation error is O(\alpha_{q,n}), where

\alpha_{q,n} = \frac{q^{3/2}}{\sqrt{n}} + \frac{1}{q} \sum_{|h| \le q} |h|\, \|C_h\|_{\mathcal S} + \sum_{|h| > q} \|C_h\|_{\mathcal S}.
Corollary 1. Under the assumptions of Proposition 5 and \sum_{h} |h|\, \|C_h\|_{\mathcal S} < \infty, the convergence rate of the estimator (3.16) is O(n^{-1/5}).
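As a quick plausibility check of the estimator (3.16) (toy data, grid discretization and bandwidth are our choices), note that the Bartlett window amounts to smoothing the periodogram with the non-negative Fejér kernel, so the estimated kernel is Hermitian and non-negative definite at every frequency:

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, q = 2000, 20, 12
eps = rng.standard_normal((n + 1, r))
X = eps[1:] + 0.5 * eps[:-1]              # toy functional MA(1) on an r-point grid
X = X - X.mean(axis=0)                    # centre

def f_theta(theta):
    """Kernel of the lag-window estimator (3.16), evaluated on the grid."""
    F = np.zeros((r, r), dtype=complex)
    for h in range(-q + 1, q):
        if h >= 0:
            c = X[h:].T @ X[:n - h] / n       # empirical kernel c_h(u_i, u_j)
        else:
            c = (X[-h:].T @ X[:n + h] / n).T  # c_{-h}(u, v) = c_h(v, u)
        F += (1 - abs(h) / q) * c * np.exp(-1j * h * theta)
    return F / (2 * np.pi)

F0 = f_theta(0.0)
eigs = np.linalg.eigvalsh(F0)             # real eigenvalues of a Hermitian matrix
```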
Since our method requires the estimation of eigenvectors of the spectral density operator, we also
need to introduce certain identifiability constraints on eigenvectors. Define \alpha_1(\theta) := \lambda_1(\theta) - \lambda_2(\theta) and

\alpha_m(\theta) := \min\{\lambda_{m-1}(\theta) - \lambda_m(\theta),\; \lambda_m(\theta) - \lambda_{m+1}(\theta)\} \quad \text{for } m > 1,

where \lambda_i(\theta) is the i-th largest eigenvalue of the spectral density operator evaluated at θ.
Assumption B.2 For all m, αm(θ) has finitely many zeros.
Assumption B.2 essentially guarantees distinct eigenvalues for all θ. It is a very common assumption
in functional PCA, as it ensures that eigenspaces are one-dimensional, and thus eigenfunctions are
unique up to their signs. To guarantee identifiability, it only remains to provide a rule for choosing
the signs. In our context, the situation is slightly more complicated, since we are working in a
complex setup. The eigenfunction ϕm(θ) is unique up to multiplication by a number on the complex
unit circle. A possible way to fix the direction of the eigenfunctions is to impose a constraint of the
form 〈ϕm(θ), v〉 ∈ (0,∞) for some given function v. In other words, we choose the orientation of
the eigenfunction such that its inner product with some reference curve v is a positive real number.
This rule identifies ϕm(θ), as long as it is not orthogonal to v. The following assumption ensures
that such identification is possible on a large enough set of frequencies θ ∈ [−π, π].
Assumption B.3 Denoting by \varphi_m(\theta) the m-th dynamic eigenvector of \mathcal F^X_\theta, there exists v such that \langle \varphi_m(\theta), v \rangle \ne 0 for almost all θ ∈ [−π, π].
From now on, we tacitly assume that the orientations of \varphi_m(\theta) and \hat\varphi_m(\theta) are chosen so that \langle \varphi_m(\theta), v \rangle and \langle \hat\varphi_m(\theta), v \rangle are in [0, ∞) for almost all θ. Then, we have the following result.

Theorem 4 (Consistency). Let \hat Y_{mt} be the random variable defined by (3.14) and suppose that Assumptions B.1–B.3 hold. Then, for some sequence L = L(n) \to \infty, we have \hat Y_{mt} \xrightarrow{P} Y_{mt} as n \to \infty.
Practical guidelines for the choice of L are given in the next section.
4 Practical implementation
In applications, data can only be recorded discretely. A curve x(u) is observed on grid points
0 ≤ u1 < u2 < · · · < ur ≤ 1. Often, though not necessarily so, r is very large (high frequency data).
The sampling frequency r and the sampling points ui may change from observation to observation.
Also, data may be recorded with or without measurement error, and time warping (registration)
may be required. For deriving limiting results, a common assumption is that r → ∞, while a
possible measurement error tends to zero. All these specifications have been extensively studied
in the literature, and we omit here the technical exercise to cast our theorems and propositions in
one of these setups. Rather, we show how to implement the proposed method, after the necessary
preprocessing steps have been carried out. Typically, data are then represented in terms of a finite
(but possibly large) number of basis functions (v_k : 1 \le k \le d), i.e., x(u) = \sum_{k=1}^{d} x_k v_k(u). Usually
Fourier bases, B-splines or wavelets are used. For an excellent survey on preprocessing the raw data, we refer to Ramsay and Silverman [32, Chapters 3–5].
In the sequel, we write (aij : 1 ≤ i, j ≤ d) for a d×d matrix with entry aij in row i and column j.
Let $x$ belong to the span $H_d := \mathrm{sp}(v_k : 1 \le k \le d)$ of $v_1, \dots, v_d$. Then $x$ is of the form $\mathbf{v}'\mathbf{x}$, where $\mathbf{v} = (v_1, \dots, v_d)'$ and $\mathbf{x} = (x_1, \dots, x_d)'$. We assume that the basis functions $v_1, \dots, v_d$ are linearly independent, but they need not be orthogonal. Any statement about $x$ can be expressed as an equivalent statement about $\mathbf{x}$. In particular, if $A : H_d \to H_d$ is a linear operator, then, for $x \in H_d$,
$$A(x) = \sum_{k=1}^{d} x_k A(v_k) = \sum_{k=1}^{d} \sum_{k'=1}^{d} x_k \langle A(v_k), v_{k'}\rangle v_{k'} = \mathbf{v}'\mathbf{A}\mathbf{x},$$
where $\mathbf{A}' = (\langle A(v_i), v_j\rangle : 1 \le i, j \le d)$. Call $\mathbf{A}$ the corresponding matrix of $A$ and $\mathbf{x}$ the corresponding vector of $x$.
The following simple results are stated without proof.
Lemma 1. Let A,B be linear operators on Hd, with corresponding matrices A and B, respectively.
Then,
(i) for any $\alpha, \beta \in \mathbb{C}$, the corresponding matrix of $\alpha A + \beta B$ is $\alpha\mathbf{A} + \beta\mathbf{B}$;
(ii) $A(e) = \lambda e$ iff $\mathbf{A}\mathbf{e} = \lambda\mathbf{e}$, where $e = \mathbf{v}'\mathbf{e}$;
(iii) letting $A := \sum_{i=1}^{d}\sum_{j=1}^{d} g_{ij}\, v_i \otimes v_j$, $\mathbf{G} := (g_{ij} : 1 \le i, j \le d)$ with $g_{ij} \in \mathbb{C}$, and $\mathbf{V} := (\langle v_i, v_j\rangle : 1 \le i, j \le d)$, the corresponding matrix of $A$ is $\mathbf{A} = \mathbf{G}\mathbf{V}'$.
To obtain the corresponding matrix of the spectral density operator $F^X_\theta$, first observe that, if $X_k = \sum_{i=1}^{d} X_{ki} v_i =: \mathbf{v}'\mathbf{X}_k$, then
$$C^X_h = E X_h \otimes X_0 = \sum_{i=1}^{d}\sum_{j=1}^{d} E X_{hi} X_{0j}\, v_i \otimes v_j.$$
It follows from Lemma 1 (iii) that $\mathbf{C}^X_h \mathbf{V}'$ is the corresponding matrix of $C^X_h$, where $\mathbf{C}^X_h := E\,\mathbf{X}_h \mathbf{X}_0'$; the linearity property (i) then implies that
$$\mathbf{F}^X_\theta = \frac{1}{2\pi}\Big(\sum_{h\in\mathbb{Z}} \mathbf{C}^X_h e^{-ih\theta}\Big)\mathbf{V}' \qquad (4.1)$$
is the corresponding matrix of $F^X_\theta$. Assume that $\lambda_m(\theta)$ is the $m$-th largest eigenvalue of $\mathbf{F}^X_\theta$, with eigenvector $\boldsymbol{\varphi}_m(\theta)$. Then $\lambda_m(\theta)$ is also an eigenvalue of $F^X_\theta$, and $\mathbf{v}'\boldsymbol{\varphi}_m(\theta)$ is the corresponding eigenfunction, from which we can compute, via its Fourier expansion, the dynamic FPCs. In particular, we have
$$\phi_{mk} = \frac{\mathbf{v}'}{2\pi}\int_{-\pi}^{\pi} \boldsymbol{\varphi}_m(s)\, e^{-iks}\, ds =: \mathbf{v}'\boldsymbol{\phi}_{mk},$$
and hence
$$Y_{mt} = \sum_{k\in\mathbb{Z}} \int_0^1 \mathbf{X}_{t-k}'\, \mathbf{v}(u)\mathbf{v}'(u)\, \boldsymbol{\phi}_{mk}\, du = \sum_{k\in\mathbb{Z}} \mathbf{X}_{t-k}'\, \mathbf{V}\, \boldsymbol{\phi}_{mk}. \qquad (4.2)$$
In view of (4.1), our task is now to replace the spectral density matrix
$$\mathbf{F}^X_\theta = \frac{1}{2\pi}\sum_{h\in\mathbb{Z}} \mathbf{C}^X_h e^{-ih\theta}$$
of the coefficient sequence $(\mathbf{X}_k)$ by some estimate. For this purpose, we can use existing multivariate techniques. Classically, we would put, for $|h| < n$,
$$\widehat{\mathbf{C}}^X_h := \frac{1}{n}\sum_{k=h+1}^{n} \mathbf{X}_k \mathbf{X}_{k-h}', \quad h \ge 0, \qquad \text{and} \qquad \widehat{\mathbf{C}}^X_h := \big(\widehat{\mathbf{C}}^X_{-h}\big)', \quad h < 0$$
(recall that we throughout assume that the data are centered) and use, for example, some lag window estimator
$$\widehat{\mathbf{F}}^X_\theta := \frac{1}{2\pi}\sum_{|h|\le q} w(h/q)\, \widehat{\mathbf{C}}^X_h e^{-ih\theta}, \qquad (4.3)$$
where $w$ is some appropriate weight function, $q = q_n \to \infty$ and $q_n/n \to 0$. For more details concerning common choices of $w$ and the tuning parameter $q_n$, we refer to Chapters 10–11 in Brockwell and Davis [7] and to Politis [29]. We then use $\widehat{\mathbf{F}}^X_\theta \mathbf{V}'$ as our estimate of the corresponding matrix in (4.1), and compute its eigenvalues and eigenvectors $\hat\lambda_m(\theta)$ and $\hat{\boldsymbol{\varphi}}_m(\theta)$, which serve as estimators of $\lambda_m(\theta)$ and $\boldsymbol{\varphi}_m(\theta)$, respectively. We estimate the filter coefficients by
$$\hat\phi_{mk} = \frac{\mathbf{v}'}{2\pi}\int_{-\pi}^{\pi} \hat{\boldsymbol{\varphi}}_m(s)\, e^{-iks}\, ds.$$
Usually, no analytic form of $\hat{\boldsymbol{\varphi}}_m(s)$ is available, and one has to perform numerical integration. We take the simplest approach, which is to set
$$\hat\phi_{mk} = \frac{\mathbf{v}'}{2N_\theta + 1}\sum_{j=-N_\theta}^{N_\theta} \hat{\boldsymbol{\varphi}}_m(\pi j/N_\theta)\, e^{-ik\pi j/N_\theta} =: \mathbf{v}'\hat{\boldsymbol{\phi}}_{mk} \qquad (N_\theta \gg 1).$$
The larger $N_\theta$, the better. This clearly depends on the available computing power.
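As an illustration of the lag window estimator (4.3), here is a minimal NumPy sketch (not the authors' R implementation; the function name and interface are our own) that computes the Bartlett lag-window spectral density matrix of a centered coefficient series:

```python
import numpy as np

def lag_window_spectral_density(X, q, thetas):
    """Bartlett lag-window estimator of the spectral density matrix, as in (4.3).

    X      : (n, d) array whose rows are the centered coefficient vectors X_1, ..., X_n
    q      : bandwidth q_n (q_n -> infinity, q_n / n -> 0)
    thetas : frequencies in [-pi, pi] at which the estimate is evaluated
    """
    n, d = X.shape
    # sample autocovariances C_h = (1/n) sum_k X_k X'_{k-h}, h = 0, ..., q
    C = [X[h:].T @ X[:n - h] / n for h in range(q + 1)]
    F = np.empty((len(thetas), d, d), dtype=complex)
    for i, th in enumerate(thetas):
        S = C[0].astype(complex)
        for h in range(1, q + 1):
            w = 1.0 - h / q                      # Bartlett weights w(h/q)
            # C_{-h} = C_h', so the +h and -h terms combine as below
            S += w * (C[h] * np.exp(-1j * h * th) + C[h].T * np.exp(1j * h * th))
        F[i] = S / (2 * np.pi)
    return F

# illustration on white noise, where F(theta) should be roughly I / (2*pi)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
F = lag_window_spectral_density(X, q=20, thetas=np.linspace(-np.pi, np.pi, 5))
```

The resulting matrices are Hermitian and (for Bartlett weights) non-negative definite, so their eigenvalues and eigenvectors can be extracted frequency by frequency with a Hermitian eigensolver.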
Now, we substitute $\hat{\boldsymbol{\phi}}_{mk}$ into (4.2), replacing the infinite sum with a rolling window
$$\widehat{Y}_{mt} = \sum_{k=-L}^{L} \mathbf{X}_{t-k}'\, \mathbf{V}\, \hat{\boldsymbol{\phi}}_{mk}. \qquad (4.4)$$
This expression can only be computed for $t \in \{L+1, \dots, n-L\}$; for $1 \le t \le L$ or $n-L+1 \le t \le n$, we set $\mathbf{X}_{-L+1} = \cdots = \mathbf{X}_0 = \mathbf{X}_{n+1} = \cdots = \mathbf{X}_{n+L} = E\mathbf{X}_1 = 0$. This, of course, creates a certain bias on the boundary of the observation period. As for the choice of $L$, we observe that $\sum_{\ell\in\mathbb{Z}} \|\phi_{m\ell}\|^2 = 1$. It is then natural to choose $L$ such that $\sum_{-L \le \ell \le L} \|\hat\phi_{m\ell}\|^2 \ge 1 - \varepsilon$, for some small threshold $\varepsilon$, e.g., $\varepsilon = 0.01$.
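The rule for selecting $L$ can be sketched as follows (illustrative Python; `choose_L` is a hypothetical helper, not from the thesis):

```python
def choose_L(phi_norms_sq, eps=0.01, L_max=60):
    """Smallest L with sum_{|l| <= L} ||phi_ml||^2 >= 1 - eps.

    phi_norms_sq : dict mapping lag l (possibly negative) to ||phi_ml||^2,
                   normalized so that the total sum is (approximately) 1.
    """
    total = 0.0
    for L in range(L_max + 1):
        total += phi_norms_sq.get(L, 0.0)
        if L > 0:
            total += phi_norms_sq.get(-L, 0.0)
        if total >= 1.0 - eps:
            return L
    return L_max
```

For instance, with filter norms decaying geometrically like $\|\phi_{m\ell}\|^2 = \tfrac{1}{3}\,0.5^{|\ell|}$ (which sum to 1), the rule with $\varepsilon = 0.01$ selects a moderate window of lags.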
Based on this definition of $\hat\phi_{mk}$, we obtain an empirical $p$-term dynamic Karhunen–Loève expansion
$$\widehat{X}_t = \sum_{m=1}^{p} \sum_{k=-L}^{L} \widehat{Y}_{m,t+k}\, \hat\phi_{mk}, \quad \text{with } \widehat{Y}_{mt} = 0 \text{ for } t \in \{-L+1, \dots, 0\} \cup \{n+1, \dots, n+L\}. \qquad (4.5)$$
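Steps (4.4)–(4.5) for a single component $m$ can be sketched as follows (an illustrative NumPy translation with the zero padding described above, not the authors' R code; the function name is ours):

```python
import numpy as np

def dynamic_scores_and_reconstruction(Xc, V, phi, L):
    """Scores (4.4) and one-component reconstruction (4.5) for component m.

    Xc  : (n, d) coefficient vectors of the centered curves
    V   : (d, d) Gram matrix of the basis functions
    phi : (2L+1, d) real filter coefficient vectors, phi[L+k] = phi_mk
    The series is padded with zeros (= E X_1) outside 1..n, as in the text.
    """
    n, d = Xc.shape
    Xpad = np.vstack([np.zeros((L, d)), Xc, np.zeros((L, d))])
    # Y_mt = sum_{k=-L}^{L} X'_{t-k} V phi_mk                          (4.4)
    Y = np.array([sum(Xpad[L + t - k] @ V @ phi[L + k] for k in range(-L, L + 1))
                  for t in range(n)])
    Ypad = np.concatenate([np.zeros(L), Y, np.zeros(L)])
    # Xhat_t = sum_{k=-L}^{L} Y_{m,t+k} phi_mk  (coefficient vectors)  (4.5)
    Xhat = np.array([sum(Ypad[L + t + k] * phi[L + k] for k in range(-L, L + 1))
                     for t in range(n)])
    return Y, Xhat
```

As a sanity check, with an orthonormal basis ($\mathbf{V} = \mathbf{I}$) and a one-term filter $L = 0$, the scores reduce to ordinary static projections and the reconstruction to the corresponding rank-one approximation.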
Parallel to (3.13), the proportion of variance explained by the first $p$ dynamic FPCs can be estimated through
$$\widehat{\mathrm{PV}}_{\mathrm{dyn}}(p) := \frac{\pi}{N_\theta} \sum_{m\le p} \sum_{j=-N_\theta}^{N_\theta} \hat\lambda_m(\pi j/N_\theta) \Big/ \frac{1}{n}\sum_{k=1}^{n} \|X_k\|^2.$$
We will use $(1 - \widehat{\mathrm{PV}}_{\mathrm{dyn}}(p))$ as a measure of the loss of information incurred when considering a dimension reduction to dimension $p$. Alternatively, one can also use the normalized mean squared error
$$\mathrm{NMSE}(p) := \sum_{k=1}^{n} \|X_k - \widehat{X}_k\|^2 \Big/ \sum_{k=1}^{n} \|X_k\|^2. \qquad (4.6)$$
Both quantities converge to the same limit.
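For curves sampled on a common grid, (4.6) is essentially a one-liner (illustrative Python; the function name is ours):

```python
import numpy as np

def nmse(X, X_hat):
    """Normalized mean squared error (4.6) for curves on a common grid.

    X, X_hat : (n, r) arrays whose rows are the discretized curves X_k
               and their p-term reconstructions.
    """
    return np.sum((X - X_hat) ** 2) / np.sum(X ** 2)
```

A perfect reconstruction gives an NMSE of 0, while the trivial reconstruction by the zero function gives 1.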
5 A real-life illustration
In this section, we draw a comparison between dynamic and static FPCA on the basis of a real data set. The observations are half-hourly measurements of the concentration (measured in µg/m³) of particulate matter with an aerodynamic diameter of less than 10 µm, abbreviated as PM10, in ambient air taken in Graz, Austria, from October 1, 2010 through March 31, 2011. Following Stadlober
et al. [36] and Aue et al. [1], a square-root transformation was performed in order to stabilize
the variance and avoid heavy-tailed observations. Also, we removed some outliers and a seasonal
(weekly) pattern induced from different traffic intensities on business days and weekends. Then we
use the software R to transform the raw data, which is discrete, to functional data, as explained
in Section 4, using 15 Fourier basis functions. The resulting curves for 175 daily observations,
X1, . . . , X175, say, roughly representing one winter season, for which pollution levels are known to
be high, are displayed in Figure 2.
Figure 2: A plot of 175 daily curves xt(u), 1 ≤ t ≤ 175, where xt(u) are the square-root transformed and detrended functional observations of PM10, based on 15 Fourier basis functions. The solid black line represents the sample mean curve µ̂(u).
From those data, we computed the (estimated) first dynamic FPC score sequence $(\widehat{Y}^{\mathrm{dyn}}_{1t} : 1 \le t \le 175)$. To this end, we centered the data at their empirical mean $\hat\mu(u)$, then implemented the procedure described in Section 4. We used the traditional Bartlett kernel $w(x) = 1 - |x|$ in (4.3) to obtain an estimator for the spectral density operator, with bandwidth $q = \lfloor n^{1/2}\rfloor = 13$. More sophisticated
estimation methods, such as those proposed, for example, by Politis [29], can of course be considered; but they also depend on additional tuning parameters, still leaving much of the selection to the practitioner's choice. From $\widehat{\mathbf{F}}^X_\theta$ we obtain the estimated filter elements $\hat\phi_{1\ell}$. It turns out that they fade away quite rapidly. In particular, $\sum_{\ell=-10}^{10} \|\hat\phi_{1\ell}\|^2 \approx 0.998$. Hence, for the calculation of the scores in (4.4), it is justified to choose $L = 10$. The five central filter elements $\hat\phi_{1\ell}(u)$, $\ell = -2, \dots, 2$, are plotted in Figure 3.
Figure 3: The five central filter elements φ̂1,−2(u), . . . , φ̂1,2(u) (from left to right).
Further components could be computed similarly, but for the purpose of demonstration we focus
on one component only. In fact, the first dynamic FPC already explains about 80% of the total
variance, compared to the 73% explained by the first static FPC. The latter was also computed,
resulting in the static FPC score sequence $(\widehat{Y}^{\mathrm{stat}}_{1t} : 1 \le t \le 175)$. Both sequences are shown in
Figure 4, along with their differences.
Figure 4: First static (left panel) and first dynamic (middle panel) FPC score sequences, plotted against time in days, and their differences (right panel).
Although based on entirely different ideas, the static and dynamic scores in Figure 4 (which, of course, are not loading the same functions) appear to be remarkably close to one another. The reason why the dynamic Karhunen–Loève expansion accounts for a significantly larger amount of the total variation is that, contrary to its static counterpart, it does not just involve the present observation.
To get more statistical insight into those results, let us consider the first static sample FPC, $\hat v_1(u)$, say, displayed in Figure 5. We see that $\hat v_1(u) \approx 1$ for all $u \in [0,1]$, so that the static FPC score $\widehat{Y}^{\mathrm{stat}}_{1t} = \int_0^1 (X_t(u) - \hat\mu(u))\,\hat v_1(u)\, du$ roughly coincides with the average deviation of $X_t(u)$ from
Figure 5: First static FPC v̂1(u) (solid line), and second static FPC v̂2(u) (dashed line) [left panel]. µ̂(u) ± v̂1(u) [middle panel] and µ̂(u) ± v̂2(u) [right panel] describe the effect of the first and second static FPC on the mean curve.
the sample mean $\hat\mu(u)$: the effect of a large (small) first score corresponds to a large (small) daily average of $\sqrt{\mathrm{PM10}}$. In view of the similarity between $\widehat{Y}^{\mathrm{dyn}}_{1t}$ and $\widehat{Y}^{\mathrm{stat}}_{1t}$, it is possible to attribute the same interpretation to the dynamic FPC scores. However, regarding the dynamic Karhunen–Loève expansion, dynamic FPC scores should be interpreted sequentially. To this end, let us take advantage of the fact that $\sum_{\ell=-1}^{1} \|\hat\phi_{1\ell}\|^2 \approx 0.92$. In the approximation by a single-term dynamic Karhunen–Loève expansion, we thus roughly have
$$X_t(u) \approx \hat\mu(u) + \sum_{\ell=-1}^{1} \widehat{Y}^{\mathrm{dyn}}_{1,t+\ell}\, \hat\phi_{1\ell}(u).$$
This suggests studying the impact of triples $(\widehat{Y}^{\mathrm{dyn}}_{1,t-1}, \widehat{Y}^{\mathrm{dyn}}_{1t}, \widehat{Y}^{\mathrm{dyn}}_{1,t+1})$ of consecutive scores on the pollution level of day $t$. We do this by adding the functions
$$\mathrm{eff}(\delta_{-1}, \delta_0, \delta_1) := \sum_{\ell=-1}^{1} \delta_\ell\, \hat\phi_{1\ell}(u), \qquad \text{with } \delta_i = \mathrm{const} \times (\pm 1),$$
to the overall mean curve $\hat\mu(u)$. In Figure 6, we do this with $\delta_i = \pm 1$. For instance, the upper left panel shows $\hat\mu(u) + \mathrm{eff}(-1,-1,-1)$, corresponding to the impact of three consecutive small dynamic FPC scores. The result is a negative shift of the mean curve. If two small scores are followed by a large one (second panel from the left in the top row), then the PM10 level increases as $u$ approaches 1. Since a large value of $\widehat{Y}^{\mathrm{dyn}}_{1,t+1}$ implies a large average concentration of $\sqrt{\mathrm{PM10}}$ on day $t+1$, and since the pollution curves are highly correlated at the transition from day $t$ to day $t+1$, this should indeed be reflected by a higher value of $\sqrt{\mathrm{PM10}}$ towards the end of day $t$. Similar interpretations can be given for the other panels in Figure 6.
It is interesting to observe that, in this example, the first dynamic FPC seems to take over the roles of the first two static FPCs. The second static FPC (see Figure 5) can indeed be interpreted as an intraday trend effect; if the second static score of day $t$ is large (small), then $X_t(u)$ is increasing (decreasing) over $u \in [0,1]$. Since we are working with sequentially dependent data, we can get information about such a trend from future and past observations, too. Hence, roughly speaking, we have
$$\sum_{\ell=-1}^{1} \widehat{Y}^{\mathrm{dyn}}_{1,t+\ell}\, \hat\phi_{1\ell}(u) \approx \sum_{m=1}^{2} \widehat{Y}^{\mathrm{stat}}_{mt}\, \hat v_m(u).$$
This is exemplified in Figure 1 of Section 1, which shows the ten consecutive curves $x_{71}(u) - \hat\mu(u), \dots, x_{80}(u) - \hat\mu(u)$ (left panel) and compares them to the single-term static (middle panel) and the single-term dynamic Karhunen–Loève expansions (right panel).
Figure 6: Mean curves µ̂(u) (solid line) and µ̂(u) + eff(δ−1, δ0, δ1), with δi = ±1 (dashed), plotted against intraday time. The eight panels correspond to (δ−1, δ0, δ1) = (−1,−1,−1), (−1,−1,+1), (−1,+1,−1), (−1,+1,+1), (+1,−1,−1), (+1,−1,+1), (+1,+1,−1) and (+1,+1,+1).
6 Simulation study
In this simulation study, we compare the performance of dynamic FPCA with that of static FPCA
for a variety of data-generating processes. For each simulated functional time series $(X_t)$, where $X_t = X_t(u)$, $u \in [0,1]$, we compute the static and dynamic scores, and recover the approximating series $(\widehat{X}^{\mathrm{stat}}_t(p))$ and $(\widehat{X}^{\mathrm{dyn}}_t(p))$ that result from the static and dynamic Karhunen–Loève expansions, respectively, of order $p$. The performances of these approximations are measured in terms of the corresponding normalized mean squared errors (NMSE)
$$\sum_{t=1}^{n} \|X_t - \widehat{X}^{\mathrm{stat}}_t(p)\|^2 \Big/ \sum_{t=1}^{n} \|X_t\|^2 \qquad \text{and} \qquad \sum_{t=1}^{n} \|X_t - \widehat{X}^{\mathrm{dyn}}_t(p)\|^2 \Big/ \sum_{t=1}^{n} \|X_t\|^2.$$
The smaller these quantities, the better the approximation.
Computations were implemented in R, using the fda package. The data were simulated according to a functional AR(1) model $X_{n+1} = \Psi(X_n) + \varepsilon_{n+1}$. In practice, this simulation has to be performed in finite dimension $d$, say. To this end, let $(v_i : i \in \mathbb{N})$ be the Fourier basis functions on $[0,1]$: for large $d$, due to the linearity of $\Psi$,
$$\langle X_{n+1}, v_j\rangle = \langle \Psi(X_n), v_j\rangle + \langle \varepsilon_{n+1}, v_j\rangle = \Big\langle \Psi\Big(\sum_{i=1}^{\infty} \langle X_n, v_i\rangle v_i\Big), v_j\Big\rangle + \langle \varepsilon_{n+1}, v_j\rangle \approx \sum_{i=1}^{d} \langle X_n, v_i\rangle \langle \Psi(v_i), v_j\rangle + \langle \varepsilon_{n+1}, v_j\rangle.$$
Hence, letting $\mathbf{X}_n = (\langle X_n, v_1\rangle, \dots, \langle X_n, v_d\rangle)'$ and $\boldsymbol{\varepsilon}_n = (\langle \varepsilon_n, v_1\rangle, \dots, \langle \varepsilon_n, v_d\rangle)'$, the first $d$ Fourier coefficients of $X_n$ approximately satisfy the VAR(1) equation
$$\mathbf{X}_{n+1} = \mathbf{P}\mathbf{X}_n + \boldsymbol{\varepsilon}_{n+1}, \qquad \text{where } \mathbf{P} = (\langle \Psi(v_i), v_j\rangle : 1 \le i, j \le d).$$
Based on this observation, we used
a VAR(1) model for generating the first $d$ Fourier coefficients of the process $(X_n)$. To obtain $\mathbf{P}$, we generate a matrix $\mathbf{G} = (G_{ij} : 1 \le i, j \le d)$, where the $G_{ij}$'s are mutually independent $N(0, \psi_{ij})$, and then set $\mathbf{P} := \kappa \mathbf{G}/\|\mathbf{G}\|$. Different choices of $\psi_{ij}$ are considered. Since $\Psi$ is bounded, we have $P_{ij} \to 0$ as $i, j \to \infty$. For the operators $\Psi_1$, $\Psi_2$ and $\Psi_3$, we used $\psi_{ij} = (i^2 + j^2)^{-1/2}$, $\psi_{ij} = (i^2/2 + j^3/2)^{-1}$, and $\psi_{ij} = e^{-(i+j)}$, respectively. For $d$ and $\kappa$, we considered the values $d = 15, 31, 51, 101$ and $\kappa = 0.1, 0.3, 0.6, 0.9$. The noise $(\varepsilon_t)$ is chosen as independent Gaussian and obtained as a linear combination of the functions $(v_i : 1 \le i \le d)$ with independent zero-mean normal coefficients $(C_i : 1 \le i \le d)$ such that $\mathrm{Var}(C_i) = \exp((i-1)/10)$. With this approach, we generate $n = 400$ observations. We then follow the methodology described in Section 4 and use the Bartlett kernel in (4.3) for estimation of the spectral density operator. The tuning parameter $q$ is set equal to $\sqrt{n} = 20$. A more sophisticated calibration could probably lead to even better results, but we also observed that moderate variations of $q$ do not fundamentally change our findings. The numerical integration for obtaining $\hat\phi_{mk}$ is performed on the basis of 1000 equidistant integration points. In (4.4), we chose $L = \min(L', 60)$, where $L' = \min\{j \ge 0 : \sum_{-j\le\ell\le j} \|\hat\phi_{m\ell}\|^2 \ge 0.99\}$. The limitation $L \le 60$ is imposed to keep computation times moderate. Usually, convergence is relatively fast.
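The data-generating mechanism described above can be sketched as follows (an illustrative NumPy translation, not the authors' R scripts; the use of the spectral norm for $\|\mathbf{G}\|$ and the burn-in length are our own assumptions):

```python
import numpy as np

def simulate_var1_coeffs(n, d, kappa, psi, rng, burn_in=200):
    """Simulate the VAR(1) coefficient process of Section 6:
    X_{t+1} = P X_t + eps_{t+1}, with P = kappa * G / ||G||,
    G_ij independent N(0, psi(i, j)), and noise coefficients
    C_i independent N(0, exp((i - 1) / 10)), as stated in the text.
    """
    i, j = np.indices((d, d)) + 1                       # 1-based indices
    G = rng.normal(size=(d, d)) * np.sqrt(psi(i, j))
    P = kappa * G / np.linalg.norm(G, 2)                # spectral norm: ||P|| = kappa
    noise_sd = np.exp((np.arange(1, d + 1) - 1) / 20)   # sd = sqrt(Var(C_i))
    x = np.zeros(d)
    out = np.empty((n, d))
    for t in range(n + burn_in):                        # burn-in towards stationarity
        x = P @ x + noise_sd * rng.normal(size=d)
        if t >= burn_in:
            out[t - burn_in] = x
    return out

coeffs = simulate_var1_coeffs(400, 15, 0.6,
                              lambda i, j: (i ** 2 + j ** 2) ** -0.5,
                              np.random.default_rng(1))
```

Since $\|\mathbf{P}\| = \kappa < 1$, the recursion is stable and the burn-in brings the chain close to its stationary distribution before the $n = 400$ observations are recorded.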
For each choice of d and κ, the experiment as described above is repeated 200 times. The mean
and standard deviation of NMSE in different settings and with values p = 1, 2, 3, 6 are reported in
Table 1. Results do not vary much among setups with d ≥ 31, and thus in Table 1 we only present
the cases d = 15 and d = 101.
We see that, in basically all settings, dynamic FPCA significantly outperforms static FPCA in terms of NMSE. As one could expect, the difference becomes more striking with increasing dependence coefficient $\kappa$. It is also interesting to observe that the variation of NMSE among the 200 replications is systematically smaller for the dynamic procedure.
Finally, it should be noted that, in contrast to static FPCA, the empirical version of our procedure is not "exact", but is subject to small approximation errors. These approximation errors can stem from numerical integration (which is required in the calculation of $\hat\phi_{mk}$) and are also due to the truncation of the filters at some finite lag $L$ (see Section 4). Such small deviations do not matter in practice if a component explains a significant proportion of variance. If, however, the additional contribution of a higher-order component is very small, then it can happen that it does not compensate for a possible approximation error. This becomes visible in the setting $\Psi_3$ with 3 or 6 components, where for some constellations the NMSE for dynamic components is slightly larger than for the static ones.
                      1 component                2 components               3 components               6 components
  d     κ        static        dynamic       static        dynamic       static        dynamic       static        dynamic

Ψ1
  15   0.1   0.697 (0.16)  0.637 (0.13)  0.546 (0.15)  0.447 (0.10)  0.443 (0.12)  0.325 (0.08)  0.256 (0.08)  0.138 (0.05)
       0.3   0.696 (0.16)  0.621 (0.14)  0.542 (0.15)  0.434 (0.11)  0.440 (0.13)  0.314 (0.08)  0.253 (0.08)  0.132 (0.05)
       0.6   0.687 (0.32)  0.571 (0.23)  0.526 (0.25)  0.392 (0.15)  0.423 (0.20)  0.283 (0.11)  0.240 (0.11)  0.119 (0.06)
       0.9   0.648 (0.76)  0.479 (0.47)  0.481 (0.56)  0.322 (0.29)  0.377 (0.43)  0.229 (0.20)  0.209 (0.22)  0.096 (0.09)
  101  0.1   0.805 (0.12)  0.740 (0.08)  0.708 (0.11)  0.587 (0.08)  0.642 (0.12)  0.478 (0.07)  0.519 (0.08)  0.274 (0.05)
       0.3   0.802 (0.13)  0.729 (0.11)  0.704 (0.12)  0.577 (0.09)  0.637 (0.11)  0.469 (0.08)  0.515 (0.10)  0.269 (0.05)
       0.6   0.792 (0.22)  0.690 (0.18)  0.689 (0.19)  0.545 (0.12)  0.619 (0.16)  0.441 (0.10)  0.495 (0.13)  0.252 (0.07)
       0.9   0.755 (0.66)  0.616 (0.45)  0.640 (0.50)  0.479 (0.31)  0.568 (0.40)  0.387 (0.23)  0.446 (0.34)  0.220 (0.15)

Ψ2
  15   0.1   0.524 (0.20)  0.491 (0.17)  0.355 (0.14)  0.306 (0.11)  0.263 (0.10)  0.208 (0.08)  0.129 (0.05)  0.082 (0.03)
       0.3   0.522 (0.21)  0.473 (0.18)  0.351 (0.16)  0.294 (0.12)  0.259 (0.12)  0.200 (0.08)  0.126 (0.06)  0.078 (0.04)
       0.6   0.507 (0.49)  0.413 (0.29)  0.331 (0.29)  0.255 (0.15)  0.240 (0.19)  0.174 (0.10)  0.114 (0.08)  0.068 (0.05)
       0.9   0.458 (1.15)  0.310 (0.59)  0.272 (0.64)  0.187 (0.32)  0.193 (0.41)  0.130 (0.21)  0.088 (0.17)  0.052 (0.09)
  101  0.1   0.585 (0.19)  0.549 (0.17)  0.436 (0.15)  0.378 (0.11)  0.356 (0.13)  0.282 (0.10)  0.240 (0.08)  0.146 (0.05)
       0.3   0.581 (0.21)  0.530 (0.18)  0.436 (0.12)  0.369 (0.11)  0.350 (0.13)  0.274 (0.09)  0.234 (0.10)  0.141 (0.06)
       0.6   0.564 (0.46)  0.469 (0.27)  0.405 (0.33)  0.321 (0.18)  0.323 (0.21)  0.242 (0.13)  0.212 (0.12)  0.125 (0.07)
       0.9   0.495 (1.06)  0.362 (0.59)  0.345 (0.68)  0.250 (0.39)  0.251 (0.58)  0.180 (0.34)  0.168 (0.26)  0.097 (0.14)

Ψ3
  15   0.1   0.367 (0.20)  0.344 (0.18)  0.134 (0.08)  0.127 (0.07)  0.049 (0.03)  0.054 (0.04)  0.002 (0.00)  0.017 (0.03)
       0.3   0.362 (0.24)  0.322 (0.17)  0.129 (0.09)  0.119 (0.07)  0.048 (0.03)  0.050 (0.04)  0.002 (0.00)  0.015 (0.03)
       0.6   0.334 (0.55)  0.253 (0.24)  0.113 (0.16)  0.097 (0.09)  0.041 (0.05)  0.040 (0.04)  0.002 (0.00)  0.011 (0.02)
       0.9   0.236 (1.12)  0.146 (0.43)  0.074 (0.28)  0.061 (0.16)  0.025 (0.08)  0.027 (0.07)  0.001 (0.00)  0.008 (0.04)
  101  0.1   0.366 (0.19)  0.344 (0.17)  0.134 (0.08)  0.127 (0.07)  0.049 (0.03)  0.054 (0.04)  0.002 (0.00)  0.017 (0.03)
       0.3   0.363 (0.25)  0.322 (0.18)  0.131 (0.10)  0.120 (0.07)  0.047 (0.03)  0.050 (0.04)  0.002 (0.00)  0.015 (0.03)
       0.6   0.325 (0.52)  0.251 (0.24)  0.113 (0.16)  0.098 (0.09)  0.040 (0.05)  0.040 (0.04)  0.002 (0.00)  0.011 (0.02)
       0.9   0.235 (1.05)  0.149 (0.43)  0.074 (0.28)  0.061 (0.16)  0.025 (0.09)  0.026 (0.07)  0.001 (0.00)  0.008 (0.04)

Table 1: Results of the simulations of Section 6. Each cell reports the mean NMSE for the static and dynamic procedures resulting from 200 simulation runs; the numbers in brackets are standard deviations multiplied by a factor 10. The values κ give the size of ‖Ψi‖L, i = 1, 2, 3. We consider dimensions d = 15 and d = 101 of the underlying models.
7 Conclusion
Functional principal component analysis is taking a leading role in the functional data literature. As
an extremely effective tool for dimension reduction, it is useful for empirical data analysis as well as
for many FDA-related methods, like functional linear models. A frequent situation in practice is that
functional data are observed sequentially over time and exhibit serial dependence. This happens, for
instance, when observations stem from a continuous-time process which is segmented into smaller
units, e.g., days. In such cases, classical static FPCA may still be useful but, in contrast to the i.i.d. setup, it does not lead to an optimal dimension-reduction technique.
In this paper, we propose a dynamic version of FPCA which takes advantage of the potential serial
dependencies in the functional observations. In the special case of uncorrelated data, the dynamic
FPC methodology reduces to the usual static one. But, in the presence of serial dependence, static
FPCA is (quite significantly, if serial dependence is strong) outperformed.
This paper also provides (i) guidelines for practical implementation, (ii) a toy example with
PM10 air pollution data, and (iii) a simulation study. Our application provides empirical evidence that dynamic FPCs have a clear edge over static FPCs in terms of their ability to represent
dependent functional data in small dimension. In the appendices, our results are cast into a rigorous
mathematical framework, and we show that the proposed estimators of dynamic FPC scores are
consistent.
Appendices of Chapter 3
A General methodology and proofs
In this appendix, we give a mathematically rigorous description of the methodology introduced in
Section 3.1. We adopt a more general framework which can be specialized to the functional setup of
Section 3.1. Throughout, H denotes some (complex) separable Hilbert space equipped with norm
‖ · ‖ and inner product 〈·, ·〉. We work in complex spaces, since our theory is based on a frequency
domain analysis. Nevertheless, all our functional time series observations Xt are assumed to be
real-valued functions.
A.1 Fourier series in Hilbert spaces.
For $p \ge 1$, consider the space $L^p_H([-\pi,\pi])$, that is, the space of measurable mappings $x : [-\pi,\pi] \to H$ such that $\int_{-\pi}^{\pi} \|x(\theta)\|^p\, d\theta < \infty$. Then, $\|x\|_p = \big(\frac{1}{2\pi}\int_{-\pi}^{\pi} \|x(\theta)\|^p\, d\theta\big)^{1/p}$ defines a norm. Equipped with this norm, $L^p_H([-\pi,\pi])$ is a Banach space and, for $p = 2$, a Hilbert space with inner product
$$(x, y) := \frac{1}{2\pi}\int_{-\pi}^{\pi} \langle x(\theta), y(\theta)\rangle\, d\theta.$$
One can show (see, e.g., [8, Lemma 1.4]) that, for any $x \in L^1_H([-\pi,\pi])$, there exists a unique element $I(x) \in H$ which satisfies
$$\int_{-\pi}^{\pi} \langle x(\theta), v\rangle\, d\theta = \langle I(x), v\rangle \quad \forall v \in H. \qquad (A.1)$$
We define $\int_{-\pi}^{\pi} x(\theta)\, d\theta := I(x)$.
For $x \in L^2_H([-\pi,\pi])$, define the $k$-th Fourier coefficient as
$$f_k := \frac{1}{2\pi}\int_{-\pi}^{\pi} x(\theta)\, e^{-ik\theta}\, d\theta, \quad k \in \mathbb{Z}. \qquad (A.2)$$
Below, we write $e_k$ for the function $\theta \mapsto e^{ik\theta}$, $\theta \in [-\pi,\pi]$.
Proposition 6. Suppose $x \in L^2_H([-\pi,\pi])$ and define $f_k$ by equation (A.2). Then, the sequence $S_n := \sum_{k=-n}^{n} f_k e_k$ has a mean square limit in $L^2_H([-\pi,\pi])$. If we denote the limit by $S$, then $x(\theta) = S(\theta)$ for almost all $\theta$.
Proof. See supplementary document.
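Proposition 6 can be illustrated numerically: for an $H = \mathbb{R}^2$-valued trigonometric polynomial, the quadrature-based Fourier coefficients (A.2) reconstruct $x$ exactly once the partial sum covers its degree, and the $L^2_H$ error of $S_m$ decreases as $m$ grows (illustrative Python; the quadrature grid and helper names are our own):

```python
import numpy as np

# H = R^2; x : [-pi, pi] -> H is a vector-valued trigonometric polynomial of degree 2.
theta = np.linspace(-np.pi, np.pi, 4001)
dth = theta[1] - theta[0]
x = np.stack([np.cos(theta) + 0.5 * np.sin(2 * theta),
              np.sin(theta) ** 2], axis=1)                 # shape (len(theta), 2)

def integrate(y):
    # trapezoidal rule over theta (first axis)
    return (y.sum(axis=0) - 0.5 * (y[0] + y[-1])) * dth

def fourier_coeff(k):
    # f_k = (1/2pi) int x(theta) e^{-ik theta} dtheta, cf. (A.2)
    return integrate(x * np.exp(-1j * k * theta)[:, None]) / (2 * np.pi)

def partial_sum(m):
    # S_m(theta) = sum_{k=-m}^{m} f_k e^{ik theta}
    S = np.zeros_like(x, dtype=complex)
    for k in range(-m, m + 1):
        S += fourier_coeff(k)[None, :] * np.exp(1j * k * theta)[:, None]
    return S

def l2_error(m):
    # norm of x - S_m in L^2_H([-pi, pi])
    return np.sqrt(integrate(np.sum(np.abs(x - partial_sum(m)) ** 2, axis=1)) / (2 * np.pi))
```

Since $x$ has degree 2, the error is essentially zero from $m = 2$ onwards, while $S_0$ and $S_1$ leave the corresponding higher-frequency terms unexplained.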
Let us turn to the Fourier expansion of the eigenfunctions $\varphi_m(\theta)$ used in the definition of the dynamic FPCs. Eigenvectors are scaled to unit length: $\|\varphi_m(\theta)\|_2 = 1$. In order for $\varphi_m$ to belong to $L^2_H([-\pi,\pi])$, we additionally need measurability, which cannot be taken for granted: since $\|z\varphi_m(\theta)\|_2 = 1$ for all $z$ on the complex unit circle, we could in principle choose the "signs" $z = z(\theta)$ in an extremely erratic way, such that $\theta \mapsto \varphi_m(\theta)$ is no longer measurable. To exclude such pathological choices, we tacitly impose in the sequel that versions of $\varphi_m(\theta)$ have been chosen in a "smooth enough way" to be measurable.
Now we can expand the eigenfunctions $\varphi_m(\theta)$ in a Fourier series in the sense explained above:
$$\varphi_m = \sum_{\ell\in\mathbb{Z}} \phi_{m\ell}\, e_\ell \quad \text{with} \quad \phi_{m\ell} = \frac{1}{2\pi}\int_{-\pi}^{\pi} \varphi_m(s)\, e^{-i\ell s}\, ds.$$
The coefficients $\phi_{m\ell}$ thus defined yield the definition (3.10) of dynamic FPCs. In the special case $H = L^2([0,1])$, $\phi_{m\ell} = \phi_{m\ell}(u)$ satisfies, by (A.1),
$$\int_0^1 \phi_{m\ell}(u)\, v(u)\, du = \frac{1}{2\pi}\int_{-\pi}^{\pi}\int_0^1 \varphi_m(u|s)\, v(u)\, du\; e^{-i\ell s}\, ds = \int_0^1 \Big(\frac{1}{2\pi}\int_{-\pi}^{\pi} \varphi_m(u|s)\, e^{-i\ell s}\, ds\Big) v(u)\, du \quad \forall v \in H.$$
This implies that $\phi_{m\ell}(u) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \varphi_m(u|s)\, e^{-i\ell s}\, ds$ for almost all $u \in [0,1]$, which is in line with the definition given in (3.9). Furthermore, (3.8) follows directly from Proposition 6.
A.2 The spectral density operator
Assume that the $H$-valued process $(X_t : t \in \mathbb{Z})$ is stationary with lag-$h$ autocovariance operator $C^X_h$ and spectral density operator
$$F^X_\theta := \frac{1}{2\pi}\sum_{h\in\mathbb{Z}} C^X_h e^{-ih\theta}. \qquad (A.3)$$
Let $S(H,H')$ be the set of Hilbert-Schmidt operators mapping from $H$ to $H'$ (both assumed to be separable Hilbert spaces). When $H = H'$ and when it is clear which space $H$ is meant, we sometimes simply write $S$. With the Hilbert-Schmidt norm $\|\cdot\|_{S(H,H')}$, this defines again a separable Hilbert space, and so does $L^2_{S(H,H')}([-\pi,\pi])$. We will impose that the series in (A.3) converges in $L^2_{S(H,H)}([-\pi,\pi])$; we then say that $(X_t)$ possesses a spectral density operator.
Remark 3. It follows that the results of the previous section can be applied. In particular, we may deduce that $C^X_k = \int_{-\pi}^{\pi} F^X_\theta\, e^{ik\theta}\, d\theta$.
A sufficient condition for convergence of (A.3) in $L^2_{S(H,H)}([-\pi,\pi])$ is Assumption (3.5). Then, it can easily be shown that the operator $F^X_\theta$ is self-adjoint, non-negative definite and Hilbert-Schmidt. Below, we introduce a weak dependence assumption established in [18], from which we can derive a sufficient condition for (3.5).
Definition 4 ($L^p$–$m$–approximability). A random $H$-valued sequence $(X_n : n \in \mathbb{Z})$ is called $L^p$–$m$–approximable if it can be represented as $X_n = f(\delta_n, \delta_{n-1}, \delta_{n-2}, \dots)$, where the $\delta_i$'s are i.i.d. elements taking values in some measurable space $S$ and $f$ is a measurable function $f : S^\infty \to H$. Moreover, if $\delta_1', \delta_2', \dots$ are independent copies of $\delta_1, \delta_2, \dots$ defined on the same probability space, then, for
$$X_n^{(m)} := f(\delta_n, \delta_{n-1}, \dots, \delta_{n-m+1}, \delta_{n-m}', \delta_{n-m-1}', \dots),$$
we have
$$\sum_{m=1}^{\infty} \big(E\|X_m - X_m^{(m)}\|^p\big)^{1/p} < \infty. \qquad (A.4)$$
Hörmann and Kokoszka [18] show that this notion is widely applicable to linear and non-linear functional time series. One of its main advantages is that it is a purely moment-based dependence measure that can be easily verified in many special cases.
Proposition 7. Assume that (Xt) is L2–m–approximable. Then (3.5) holds and the operators FXθ ,
θ ∈ [−π, π], are trace-class.
Proof. See supplementary document.
Instead of Assumption (3.5), Panaretos and Tavakoli [27] impose, for the definition of a spectral density operator, summability of $C^X_h$ in Schatten 1-norm, that is, $\sum_{h\in\mathbb{Z}} \|C^X_h\|_T < \infty$. Under this slightly more stringent assumption, it immediately follows that the resulting spectral density operator is trace-class. The verification of convergence may, however, be a bit delicate; at least, we could not find a simple criterion as in Proposition 7.
Proposition 8. Let $F^X_\theta$ be the spectral density operator of a stationary sequence $(X_t)$ for which the summability condition (3.5) holds. Let $\lambda_1(\theta) \ge \lambda_2(\theta) \ge \cdots$ denote its eigenvalues and $\varphi_m(\theta)$ be the corresponding eigenfunctions. Then, (a) the functions $\theta \mapsto \lambda_m(\theta)$ are continuous; (b) if we strengthen (3.5) into the more stringent condition $\sum_{h\in\mathbb{Z}} |h|\,\|C^X_h\|_S < \infty$, the $\lambda_m(\theta)$'s are Lipschitz-continuous functions of $\theta$; (c) assuming that $(X_t)$ is real-valued, for each $\theta \in [-\pi,\pi]$, $\lambda_m(\theta) = \lambda_m(-\theta)$ and $\varphi_m(\theta) = \overline{\varphi_m(-\theta)}$.
Proof. See supplementary document.
Let $\bar x$ be the conjugate element of $x$, i.e., $\langle \bar x, z\rangle = \langle \bar z, x\rangle$ for all $z \in H$. Then $x$ is real-valued iff $x = \bar x$.
Remark 4. Since $\theta \mapsto \varphi_m(\theta)$ is Hermitian in the above sense, it immediately follows that $\phi_{m\ell} = \overline{\phi_{m\ell}}$, implying that the dynamic FPCs are real if the process $(X_t)$ is.
A.3 Functional filters
Computation of dynamic FPCs requires applying time-invariant functional filters to the process $(X_t)$. Let $\Psi = (\Psi_k : k \in \mathbb{Z})$ be a sequence of linear operators mapping the separable Hilbert space $H$ to the separable Hilbert space $H'$. Let $B$ be the backshift or lag operator, defined by $B^k X_t := X_{t-k}$, $k \in \mathbb{Z}$. Then the functional filter $\Psi(B) := \sum_{k\in\mathbb{Z}} \Psi_k B^k$, when applied to the sequence $(X_t)$, produces an output series $(Y_t)$ in $H'$ via
$$Y_t = \Psi(B) X_t = \sum_{k\in\mathbb{Z}} \Psi_k(X_{t-k}). \qquad (A.5)$$
Call $\Psi$ the sequence of filter coefficients and, in the style of the scalar or vector time series terminology, call
$$\Psi_\theta = \Psi(e^{-i\theta}) = \sum_{k\in\mathbb{Z}} \Psi_k e^{-ik\theta} \qquad (A.6)$$
the frequency response function of the filter $\Psi(B)$. Of course, the series (A.5) and (A.6) only have a meaning if they converge in an appropriate sense. Below, we use the following technical result.
Proposition 9. Suppose that $(X_t)$ is a stationary sequence in $L^2_H$ and possesses a spectral density operator satisfying $\sup_\theta \mathrm{tr}(F^X_\theta) < \infty$. Consider a filter $(\Psi_k)$ such that $\Psi_\theta$ converges in $L^2_{S(H,H')}([-\pi,\pi])$, and suppose that $\sup_\theta \|\Psi_\theta\|_{S(H,H')} < \infty$. Then,
(i) the series $Y_t := \sum_{k\in\mathbb{Z}} \Psi_k(X_{t-k})$ converges in $L^2_{H'}$;
(ii) $(Y_t)$ possesses the spectral density operator $F^Y_\theta = \Psi_\theta F^X_\theta (\Psi_\theta)^*$;
(iii) $\sup_\theta \mathrm{tr}(F^Y_\theta) < \infty$.
Proof. See supplementary document.
In particular, the last proposition allows for iterated applications: if $\sup_\theta \mathrm{tr}(F^X_\theta) < \infty$ and $\Psi_\theta$ satisfies the above properties, then analogous results apply to the output $(Y_t)$. This is what we are using in the proofs of Theorems 1 and 2.
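A scalar sanity check of Proposition 9 (ii): filtering white noise with a short moving average and comparing a lag-window estimate of the output spectral density against $|\Psi_\theta|^2/(2\pi)$ (illustrative Python; the sample size and bandwidth are chosen ad hoc):

```python
import numpy as np

# For white noise X_t (F^X = 1/(2*pi)) filtered by Y_t = X_t + 0.5 X_{t-1}, the
# output spectral density is F^Y(theta) = |Psi_theta|^2 / (2*pi),
# with frequency response Psi_theta = 1 + 0.5 e^{-i theta}.
rng = np.random.default_rng(3)
n = 200_000
X = rng.normal(size=n)
Y = X[1:] + 0.5 * X[:-1]

theta = 1.0
target = np.abs(1 + 0.5 * np.exp(-1j * theta)) ** 2 / (2 * np.pi)

# crude Bartlett lag-window estimate of F^Y(theta)
q, m = 50, len(Y)
F_hat = np.mean(Y ** 2)                                  # lag-0 autocovariance
for h in range(1, q + 1):
    c_h = np.dot(Y[h:], Y[:m - h]) / m
    F_hat += 2 * (1 - h / q) * c_h * np.cos(h * theta)   # Bartlett weights
F_hat /= 2 * np.pi
```

The estimate agrees with the theoretical value up to the usual estimation error of order $\sqrt{q/n}$.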
A.4 Proofs for Section 3
To start with, observe that Propositions 2 and 4 directly follow from Proposition 9. Part (a) of Proposition 3 has also been established in the previous section (see Remark 4), and part (b) is immediate. Thus, we can proceed to the proof of Theorems 1 and 2.
Proof of Theorems 2 and 3. Assume we have filter coefficients $\Psi = (\Psi_k : k \in \mathbb{Z})$ and $\Upsilon = (\Upsilon_k : k \in \mathbb{Z})$, where $\Psi_k : H \to \mathbb{C}^p$ and $\Upsilon_k : \mathbb{C}^p \to H$ both belong to the class $\mathcal{C}$. If $(X_t)$ and $(Y_t)$ are $H$-valued and $\mathbb{C}^p$-valued processes, respectively, then there exist elements $\psi_{mk}$ and $\upsilon_{mk}$ in $H$ such that
$$\Psi(B)(X_t) = \sum_{k\in\mathbb{Z}} \big(\langle X_{t-k}, \psi_{1k}\rangle, \dots, \langle X_{t-k}, \psi_{pk}\rangle\big)'$$
and
$$\Upsilon(B)(Y_t) = \sum_{\ell\in\mathbb{Z}} \sum_{m=1}^{p} Y_{t+\ell,m}\, \upsilon_{m\ell}.$$
Hence, the $p$-dimensional reconstruction of $X_t$ in Theorem 3 is of the form
$$\sum_{m=1}^{p} \widehat{X}_{mt} = \Upsilon(B)[\Psi(B) X_t] =: \Upsilon\Psi(B) X_t.$$
Since $\Psi$ and $\Upsilon$ are required to belong to $\mathcal{C}$, we conclude from Proposition 9 that the processes $Y_t := \Psi(B) X_t$ and $\widehat{X}_t = \Upsilon(B) Y_t$ are mean-square convergent and possess a spectral density operator. Letting $\psi_m(\theta) = \sum_{k\in\mathbb{Z}} \psi_{mk} e^{ik\theta}$ and $\upsilon_m(\theta) = \sum_{\ell\in\mathbb{Z}} \upsilon_{m\ell} e^{i\ell\theta}$, we obtain, for $x \in H$ and $y = (y_1, \dots, y_p)' \in \mathbb{C}^p$, that the frequency response functions $\Psi_\theta$ and $\Upsilon_\theta$ satisfy
On $F'$, the integrand in (B.3) is greater than or equal to $\langle \varphi_m(\theta), v\rangle/2$. On $F''$, the inequality $\cos(z(\theta))\langle \varphi_m(\theta), v\rangle > \langle \varphi_m(\theta), v\rangle/2$ holds, and consequently
$$\langle \varphi_m(\theta), v\rangle\, |\sin(z(\theta))| > \frac{\langle \varphi_m(\theta), v\rangle}{2}\, |\sin(z(\theta))| > \frac{\langle \varphi_m(\theta), v\rangle}{\pi}\, |z(\theta)| \ge \frac{\langle \varphi_m(\theta), v\rangle}{\pi}\, \varepsilon'.$$
Altogether, this yields that the integrand in (B.3) is larger than or equal to $\langle \varphi_m(\theta), v\rangle \varepsilon'/\pi$. Now, it is easy to see that, due to Assumption B.3, (B.2) cannot hold. This leads to a contradiction.
Thus, we can conclude that $\max_{j\in\mathbb{Z}} \|\hat\phi_{mj} - \phi_{mj}\| = o_P(1)$, so that, for sufficiently slowly growing $L$, we also have $L \max_{j\in\mathbb{Z}} \|\hat\phi_{mj} - \phi_{mj}\| = o_P(1)$. Consequently,
$$\Big|\sum_{|j|\le L} \langle X_{k-j}, \hat\phi_{mj} - \phi_{mj}\rangle\Big| = o_P(1) \times \Big(L^{-1}\sum_{j=-L}^{L} \|X_{k-j}\|\Big). \qquad (B.4)$$
It remains to show that $L^{-1}\sum_{j=-L}^{L} \|X_{k-j}\| = O_P(1)$. By the weak stationarity assumption, we have $E\|X_k\|^2 = E\|X_1\|^2$, and hence, for any $x > 0$,
$$P\Big(L^{-1}\sum_{j=-L}^{L} \|X_{k-j}\| > x\Big) \le \frac{\sum_{j=-L}^{L} E\|X_{k-j}\|}{Lx} \le \frac{3\sqrt{E\|X_1\|^2}}{x}.$$
Lemma 3. Let $L = L(n) \to \infty$. Then, under condition (3.5), we have
$$\Big|\sum_{|j|>L} \langle X_{k-j}, \phi_{mj}\rangle\Big| = o_P(1).$$
Proof. This is immediate from Proposition 4, part (a).
Turning to the proof of Proposition 5, we first establish the following lemma, which is an extension to lag-$h$ autocovariance operators of a consistency result from [18] on the empirical covariance operator. Define, for $|h| < n$,
$$\widehat{C}_h = \frac{1}{n}\sum_{k=1}^{n-h} X_{k+h} \otimes X_k, \quad h \ge 0, \qquad \text{and} \qquad \widehat{C}_h = \widehat{C}_{-h}^{\,*}, \quad h < 0.$$
Lemma 4. Assume that $(X_t : t \in \mathbb{Z})$ is an $L^4$–$m$–approximable series. Then, for all $|h| < n$, $E\|\widehat{C}_h - C_h\|_S \le U\sqrt{(|h| \vee 1)/n}$, where the constant $U$ depends neither on $n$ nor on $h$.
Proof. See supplementary document.
Proof of Proposition 5. By the triangle inequality,
$$2\pi\|\widehat{F}^X_\theta - F^X_\theta\|_S = \Big\|\sum_{h\in\mathbb{Z}} C_h e^{-ih\theta} - \sum_{h=-q}^{q} \Big(1 - \frac{|h|}{q}\Big)\widehat{C}_h e^{-ih\theta}\Big\|_S$$
$$\le \Big\|\sum_{h=-q}^{q} \Big(1 - \frac{|h|}{q}\Big)(\widehat{C}_h - C_h)\, e^{-ih\theta}\Big\|_S + \Big\|\frac{1}{q}\sum_{h=-q}^{q} |h|\, C_h e^{-ih\theta}\Big\|_S + \Big\|\sum_{|h|>q} C_h e^{-ih\theta}\Big\|_S$$
$$\le \sum_{h=-q}^{q} \Big(1 - \frac{|h|}{q}\Big)\|\widehat{C}_h - C_h\|_S + \frac{1}{q}\sum_{h=-q}^{q} |h|\,\|C_h\|_S + \sum_{|h|>q} \|C_h\|_S.$$
The last two terms tend to 0 by condition (3.5) and Kronecker's lemma. For the first term, we may use Lemma 4. Taking expectations, we obtain that, for some $U_1$,
$$\sum_{h=-q}^{q} \Big(1 - \frac{|h|}{q}\Big) E\|\widehat{C}_h - C_h\|_S \le U_1 \frac{q^{3/2}}{\sqrt{n}}.$$
Note that the bound does not depend on $\theta$; hence $q^3 = o(n)$ and condition (3.5) jointly imply that $\sup_{\theta\in[-\pi,\pi]} E\|\widehat{F}^X_\theta - F^X_\theta\|_S \to 0$ as $n \to \infty$.
C Technical results and background
C.1 Linear operators
Consider the class L(H,H ′) of bounded linear operators between two Hilbert spaces H and H ′.
For Ψ ∈ L(H,H ′), the operator norm is defined as ‖Ψ‖L := sup‖x‖≤1 ‖Ψ(x)‖. The simplest operators
can be defined via a tensor product v ⊗ w; then v ⊗ w(z) := v〈z, w〉. Every operator Ψ ∈ L(H,H ′)
possesses an adjoint Ψ∗ ∈ L(H ′, H), which satisfies 〈Ψ(x), y〉 = 〈x,Ψ∗(y)〉 for all x ∈ H and y ∈ H ′.It holds that ‖Ψ∗‖L = ‖Ψ‖L. If H = H ′, then Ψ is called self-adjoint if Ψ = Ψ∗. It is called
non-negative definite if 〈Ψx, x〉 ≥ 0 for all x ∈ H.
A linear operator Ψ ∈ L(H,H′) is said to be Hilbert–Schmidt if, for some orthonormal basis (v_k : k ≥ 1) of H, we have ‖Ψ‖_S² := ∑_{k≥1} ‖Ψ(v_k)‖² < ∞. Then ‖Ψ‖_S defines a norm, the so-called Hilbert–Schmidt norm of Ψ, which bounds the operator norm (‖Ψ‖_L ≤ ‖Ψ‖_S) and can be shown to be independent of the choice of the orthonormal basis. Every Hilbert–Schmidt operator is compact. The class of Hilbert–Schmidt operators between H and H′ again defines a separable Hilbert space, with inner product 〈Ψ,Θ〉_S := ∑_{k≥1} 〈Ψ(v_k), Θ(v_k)〉; we denote this class by S(H,H′).
If Ψ ∈ L(H,H′) and Υ ∈ L(H′′, H), then ΨΥ is the operator mapping x ∈ H′′ to Ψ(Υ(x)) ∈ H′. Assume that Ψ is a compact operator in L(H,H′) and let (s_j²) be the eigenvalues of Ψ∗Ψ. Then Ψ is said to be trace class if ‖Ψ‖_T := ∑_{j≥1} s_j < ∞. In this case, ‖Ψ‖_T defines a norm, the so-called Schatten 1-norm. We have that ‖Ψ‖_S ≤ ‖Ψ‖_T, and hence any trace-class operator is Hilbert–Schmidt. For self-adjoint non-negative operators, it holds that ‖Ψ‖_T = tr(Ψ) := ∑_{k≥1} 〈Ψ(v_k), v_k〉. If ΨΨ = Ψ, then we have tr(Ψ) = ‖Ψ‖_S².
For further background on the theory of linear operators we refer to [13].
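In finite dimensions these are the familiar Schatten norms of a matrix, computable from its singular values. A small numerical sketch (our own illustration, not part of the thesis) verifies the chain ‖Ψ‖_L ≤ ‖Ψ‖_S ≤ ‖Ψ‖_T:

```python
import numpy as np

# For a matrix Psi with singular values s_1 >= s_2 >= ..., the norms above are:
#   operator norm       = s_1
#   Hilbert-Schmidt norm = sqrt(sum s_j^2)  (the Frobenius norm)
#   trace (Schatten-1) norm = sum s_j
rng = np.random.default_rng(1)
Psi = rng.standard_normal((4, 6))
s = np.linalg.svd(Psi, compute_uv=False)  # singular values, descending

op_norm = s[0]
hs_norm = np.sqrt((s**2).sum())
tr_norm = s.sum()

assert np.isclose(hs_norm, np.linalg.norm(Psi, 'fro'))
assert op_norm <= hs_norm <= tr_norm  # ||.||_L <= ||.||_S <= ||.||_T
```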
C.2 Random sequences in Hilbert spaces
All random elements that appear in the sequel are assumed to be defined on a common probability
space (Ω, A, P). We write X ∈ L^p_H(Ω, A, P) (in short, X ∈ L^p_H) if X is an H-valued random variable such that E‖X‖^p < ∞. Every element X ∈ L¹_H possesses an expectation, which is the unique µ ∈ H satisfying E〈X, y〉 = 〈µ, y〉 for all y ∈ H. Provided that X and Y are in L²_H, we can define the cross-covariance operator as C_XY := E(X − µ_X) ⊗ (Y − µ_Y), where µ_X and µ_Y are the expectations of X and Y, respectively. We have that ‖C_XY‖_T ≤ E‖(X − µ_X) ⊗ (Y − µ_Y)‖_T = E‖X − µ_X‖‖Y − µ_Y‖, and so these operators are trace class. An important specific role is played by the covariance operator C_XX. This operator is non-negative definite and self-adjoint, with tr(C_XX) = E‖X − µ_X‖². An H-valued process (X_t) is called (weakly) stationary if X_t ∈ L²_H, and EX_t and C_{X_{t+h} X_t} do not depend on t. In this case, we write C^X_h, or shortly C_h, for C_{X_{t+h} X_t} if it is clear to which process it belongs.
Many useful results on random processes in Hilbert spaces or more general Banach spaces are
collected in Chapters 1 and 2 of [8].
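The identity tr(C_XX) = E‖X − µ_X‖² has an exact empirical analogue once observations are discretized to vectors. The sketch below (our own illustration, with hypothetical variable names) checks it, along with self-adjointness and non-negative definiteness:

```python
import numpy as np

# Empirical covariance operator of vector-valued observations:
# C = (1/n) * sum_k (X_k - mean) (outer) (X_k - mean).
rng = np.random.default_rng(2)
X = rng.standard_normal((1000, 3)) @ np.diag([2.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / len(X)

# trace equals the average squared norm of the centered observations,
# the finite-sample version of tr(C_XX) = E||X - mu_X||^2
assert np.allclose(np.trace(C), (Xc**2).sum(axis=1).mean())
assert np.allclose(C, C.T)                        # self-adjoint
assert np.all(np.linalg.eigvalsh(C) >= -1e-12)    # non-negative definite
```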
C.3 Proofs for Appendix A
Proof of Proposition 6. Letting 0 < m < n, note that
\begin{align*}
\|S_n - S_m\|_2^2
&= \Big( \sum_{m\le|k|\le n} f_k e_k,\ \sum_{m\le|\ell|\le n} f_\ell e_\ell \Big) \\
&= \frac{1}{2\pi} \int_{-\pi}^{\pi} \sum_{m\le|k|\le n} \sum_{m\le|\ell|\le n} \langle f_k, f_\ell \rangle e^{i(k-\ell)\theta}\, d\theta
 = \sum_{m\le|k|\le n} \|f_k\|^2.
\end{align*}
To prove the first statement, we need to show that $(S_n)$ defines a Cauchy sequence in $L^2_H([-\pi,\pi])$, which follows if we show that $\sum_{k\in\mathbb{Z}} \|f_k\|^2 < \infty$. We use the fact that, for any $v \in H$, the function $\langle x(\theta), v\rangle$ belongs to $L^2([-\pi,\pi])$. Then, by Parseval's identity and (A.1), we have, for any $v \in H$,
\[
\frac{1}{2\pi} \int_{-\pi}^{\pi} |\langle x(\theta), v\rangle|^2\, d\theta
= \sum_{k\in\mathbb{Z}} \bigg| \frac{1}{2\pi} \int_{-\pi}^{\pi} \langle x(s), v\rangle e^{-iks}\, ds \bigg|^2
= \sum_{k\in\mathbb{Z}} |\langle f_k, v\rangle|^2.
\]
Let $(v_\ell : \ell \ge 1)$ be an orthonormal basis of $H$. Then, by the last result and Parseval's identity again, it follows that
\begin{align*}
\|x\|_2^2 = \frac{1}{2\pi} \int_{-\pi}^{\pi} \sum_{\ell\ge1} |\langle x(\theta), v_\ell\rangle|^2\, d\theta
&= \frac{1}{2\pi} \sum_{\ell\ge1} \int_{-\pi}^{\pi} |\langle x(\theta), v_\ell\rangle|^2\, d\theta \\
&= \sum_{\ell\ge1} \sum_{k\in\mathbb{Z}} |\langle f_k, v_\ell\rangle|^2
 = \sum_{k\in\mathbb{Z}} \|f_k\|^2.
\end{align*}
As for the second statement, we conclude from classical Fourier analysis results that, for each $v \in H$,
\[
\lim_{n\to\infty} \frac{1}{2\pi} \int_{-\pi}^{\pi} \bigg| \langle x(\theta), v\rangle - \sum_{k=-n}^{n} \Big( \frac{1}{2\pi} \int_{-\pi}^{\pi} \langle x(s), v\rangle e^{-iks}\, ds \Big) e^{ik\theta} \bigg|^2 d\theta = 0.
\]
Now, by the definition of $S_n$, this is equivalent to
\[
\lim_{n\to\infty} \frac{1}{2\pi} \int_{-\pi}^{\pi} |\langle x(\theta) - S_n(\theta), v\rangle|^2\, d\theta = 0, \quad \forall v \in H.
\]
Combined with the first statement of the proposition and
\[
\int_{-\pi}^{\pi} |\langle x(\theta) - S(\theta), v\rangle|^2\, d\theta
\le 2 \int_{-\pi}^{\pi} |\langle x(\theta) - S_n(\theta), v\rangle|^2\, d\theta
+ 2\|v\|^2 \int_{-\pi}^{\pi} \|S_n(\theta) - S(\theta)\|^2\, d\theta,
\]
this implies that
\[
\frac{1}{2\pi} \int_{-\pi}^{\pi} |\langle x(\theta) - S(\theta), v\rangle|^2\, d\theta = 0, \quad \forall v \in H. \tag{C.1}
\]
Let $(v_i)$, $i \in \mathbb{N}$, be an orthonormal basis of $H$, and define
\[
A_i := \{\theta \in [-\pi,\pi] : \langle x(\theta) - S(\theta), v_i\rangle \ne 0\}.
\]
By (C.1), we have that $\lambda(A_i) = 0$ ($\lambda$ denotes the Lebesgue measure), and hence $\lambda(A) = 0$ for $A = \cup_{i\ge1} A_i$. Consequently, since the $(v_i)$ form an orthonormal basis, for any $\theta \in [-\pi,\pi] \setminus A$ we have $\langle x(\theta) - S(\theta), v\rangle = 0$ for all $v \in H$, which in turn implies that $x(\theta) - S(\theta) = 0$.
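The Parseval-type identity $\sum_{k} \|f_k\|^2 = \|x\|_2^2$ used in the proof above can be checked numerically. A small sketch (our own, not part of the proof), treating $x$ as an $\mathbb{R}^d$-valued function sampled on an $N$-point grid of $[-\pi,\pi)$, so that the DFT coefficients play the role of the $f_k$ and both sides become Riemann sums:

```python
import numpy as np

# x(theta_j) in R^d on an N-point grid; f_k ~ (1/2pi) int x(s) e^{-iks} ds
# is approximated by the normalized DFT along the grid axis.
rng = np.random.default_rng(3)
N, d = 256, 4
x = rng.standard_normal((N, d))
f = np.fft.fft(x, axis=0) / N

# (1/2pi) int ||x(theta)||^2 dtheta  vs  sum_k ||f_k||^2 -- equal by the
# discrete Parseval identity (exactly, for the DFT).
lhs = (np.abs(x)**2).sum() / N
rhs = (np.abs(f)**2).sum()
assert np.allclose(lhs, rhs)
```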
Proof of Proposition 7. Without loss of generality, we assume that $EX_0 = 0$. Since $X_0$ and $X^{(h)}_h$,