
OXFORD STATISTICAL SCIENCE SERIES

A. C. Atkinson: Plots, transformations and regression
M. Stone: Coordinate-free multivariate statistics
W. J. Krzanowski: Principles of multivariate analysis: a user's perspective
M. Aitkin, D. Anderson, B. Francis, and J. Hinde: Statistical modelling in GLIM
P. J. Diggle: Time series: a biostatistical introduction
H. Tong: Non-linear time series: a dynamical system approach
V. P. Godambe: Estimating functions
A. C. Atkinson and A. N. Donev: Optimum experimental design
J. K. Lindsey: Models for repeated measurements
N. T. Longford: Random coefficient models
P. J. Brown: Measurement, regression and calibration
P. J. Diggle, K. Y. Liang, and S. L. Zeger: The analysis of longitudinal data
J. I. Ansell and M. J. Phillips: Practical methods for reliability data analysis
J. K. Lindsey: Modelling frequency and count data
J. L. Jensen: Saddlepoint approximations
Steffen L. Lauritzen: Graphical models
A. W. Bowman and A. Azzalini: Applied smoothing methods for data analysis
J. K. Lindsey: Models for repeated measurements (Second edition)
Michael Evans and Tim Swartz: Approximating integrals via Monte Carlo and deterministic methods
D. F. Andrews and J. E. Stafford: Symbolic computation for statistical inference
T. A. Severini: Likelihood methods in statistics
W. J. Krzanowski: Principles of multivariate analysis: a user's perspective (updated edition)


Time Series Analysis by State Space Methods

J. DURBIN Department of Statistics

London School of Economics and Political Science

and

S. J. KOOPMAN Department of Econometrics Free University Amsterdam

OXFORD UNIVERSITY PRESS


OXFORD UNIVERSITY PRESS

Great Clarendon Street, Oxford OX2 6DP
Oxford University Press is a department of the University of Oxford.

It furthers the University's objective of excellence in research, scholarship, and education by publishing worldwide in

Oxford New York
Auckland Cape Town Dar es Salaam Hong Kong Karachi Kuala Lumpur Madrid Melbourne Mexico City Nairobi New Delhi Shanghai Taipei Toronto

With offices in

Argentina Austria Brazil Chile Czech Republic France Greece Guatemala Hungary Italy Japan Poland Portugal Singapore South Korea Switzerland Thailand Turkey Ukraine Vietnam

Oxford is a registered trade mark of Oxford University Press in the UK and in certain other countries

Published in the United States by Oxford University Press Inc., New York

© J. Durbin and S. J. Koopman 2001

The moral rights of the author have been asserted
Database right Oxford University Press (maker)

First published 2001 Reprinted 2008

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means,

without the prior permission in writing of Oxford University Press, or as expressly permitted by law, or under terms agreed with the appropriate

reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department,

Oxford University Press, at the address above. You must not circulate this book in any other binding or cover

and you must impose this same condition on any acquirer. A catalogue record for this book is available from the British Library

Library of Congress Cataloging in Publication Data Durbin, James.

Time series analysis by state space methods / J. Durbin and S. J. Koopman. (Oxford statistical science series)

Includes bibliographical references and index. 1. Time-series analysis. 2. State-space methods. I. Koopman, S. J. (Siem Jan). II. Title.

III. Series

QA280.D87 2001 519.5'5-dc21 00-054845

ISBN 978-0-19-852354-3
Printed in Great Britain on acid-free paper by Biddles Ltd., King's Lynn, Norfolk



To Anne JD

To my family and friends SJK


Preface

This book presents a comprehensive treatment of the state space approach to time series analysis. The distinguishing feature of state space time series models is that observations are regarded as made up of distinct components such as trend, seasonal, regression elements and disturbance terms, each of which is modelled separately. The models for the components are put together to form a single model called a state space model which provides the basis for analysis. The techniques that emerge from this approach are very flexible and are capable of handling a much wider range of problems than the main analytical system currently in use for time series analysis, the Box-Jenkins ARIMA system.

The exposition is directed primarily towards students, teachers, methodologists and applied workers in time series analysis and related areas of applied statistics and econometrics. Nevertheless, we hope that the book will also be useful to workers in other fields such as engineering, medicine and biology where state space models are employed. We have made a special effort to make the presentation accessible to readers with no previous knowledge of state space methods. For the development of all important parts of the theory, the only mathematics required that is not elementary is matrix multiplication and inversion, while the only statistical theory required, apart from basic principles, is elementary multivariate normal regression theory.

The techniques that we develop are aimed at practical application to real problems in applied time series analysis. Nevertheless, a surprising degree of beauty is to be found in the elegant way that many of the results drop out; no doubt this is due to the Markovian nature of the models and the recursive structure of the calculations.

State space time series analysis began with the pathbreaking paper of Kalman (1960) and early developments of the subject took place in the field of engineering. The term 'state space' came from engineering, and although it does not strike a natural rapport in statistics and econometrics, we, like others, use it because it is strongly established. We assure beginners that the meaning of the word 'state' will become clear very quickly; however, the attachment of the word 'space' might perhaps remain mysterious to non-engineers.

The book is divided into two parts. Part I discusses techniques of analysis based on the linear Gaussian state space model; the methods we describe represent the core of traditional state space methodology together with some new developments. We have aimed at presenting a state-of-the-art treatment of time series methodology based on this state space model. Although the model has been studied extensively over the past forty years, there is much that is new in our treatment. We use the word 'new' here and below to refer to results derived from original research in our recently published papers or obtained for the first time in this book. In particular, this usage relates to the treatment of disturbance smoothing in Chapter 4, exact initial filtering and smoothing in Chapter 5, the univariate treatment of multivariate observations plus computing algorithms in Chapter 6, aspects of diffuse likelihood, the score vector, the EM algorithm and allowing for the effects of parameter estimation on estimates of variance in Chapter 7, and the use of importance sampling for Bayesian analysis in Chapter 8.

The linear Gaussian model, often after transformation of the observations, provides an adequate basis for the analysis of many of the time series that are encountered in practice. However, situations occur where this model fails to provide an acceptable representation of the data. For example, in the study of road accidents when the numbers observed are small, the Poisson distribution gives an intrinsically better model for the behaviour of the data than the normal distribution. There is therefore a need to extend the scope of state space methodology to cover non-Gaussian observations; it is also important to allow for nonlinearities in the model and for heavy-tailed densities to deal with outliers in the observations and structural shifts in the state.

These extensions are considered in Part II of the book. The treatment given there is new in the sense that it is based partly on the methods developed in our Durbin and Koopman (2000) paper and partly on additional new material. The methodology is based on simulation since exact analytical results are not available for the problems under consideration. Earlier workers had employed Markov chain Monte Carlo simulation methods to study these problems; however, during the research that led to this book, we investigated whether traditional simulation methods based on importance sampling and antithetic variables could be made to work satisfactorily. The results were successful so we have adopted this approach throughout Part II. The simulation methods that we propose are computationally transparent, efficient and convenient for the type of time series applications considered in this book.

An unusual feature of the book is that we provide analyses from both classical and Bayesian perspectives. We start in all cases from the classical standpoint since this is the mode of inference within which each of us normally works. This also makes it easier to relate our treatment to earlier work in the field, most of which has been done from the classical point of view. We discovered as we developed simulation methods for Part II, however, that we could obtain Bayesian solutions by relatively straightforward extensions of the classical treatments. Since our views on statistical inference are eclectic and tolerant, and since we believe that methodology for both approaches should be available to applied workers, we are happy to include solutions from both perspectives in the book.

Both the writing of the book and doing the research on which it was based have been highly interactive and highly enjoyable joint efforts. Subject to this, there have naturally been differences in the contributions that the two authors have made to the work. Most of the new theory in Part I was initiated by SJK while most of the new theory in Part II was initiated by JD. The expository and literary styles of the book were the primary responsibility of JD while the illustrations and computations were the primary responsibility of SJK.

Our collaboration in this work began while we were working in the Statistics Department of the London School of Economics and Political Science (LSE). We were fortunate in having as departmental colleagues three distinguished workers in the field. Our main thanks are to Andrew Harvey who helped us in many ways and whose leadership in the development of the methodology of state space time series analysis at the LSE was an inspiration to us both. We also thank Neil Shephard for many fruitful discussions on various aspects of the statistical treatment of state space models and for his incisive comments on an earlier draft of this book. We are grateful to Piet de Jong for some searching discussions on theoretical points.

We thank Jurgen Doornik of Nuffield College, Oxford for his help over a number of years which has assisted in the development of the computer packages STAMP and SsfPack. SJK thanks Nuffield College, Oxford for its hospitality and the Royal Netherlands Academy of Arts and Sciences for its financial support while he was at CentER, Tilburg University.

The book was written in LaTeX using the MiKTeX system (http://www.miktex.org). We thank Jurgen Doornik for assistance in setting up this LaTeX system.

London J.D.
Amsterdam S.J.K.
November 2000


Contents

1 Introduction 1
   1.1 Basic ideas of state space analysis 1
   1.2 Linear Gaussian model 1
   1.3 Non-Gaussian and nonlinear models 3
   1.4 Prior knowledge 4
   1.5 Notation 4
   1.6 Other books on state space methods 5
   1.7 Website for the book 5

I THE LINEAR GAUSSIAN STATE SPACE MODEL

2 Local level model 9
   2.1 Introduction 9
   2.2 Filtering 11
      2.2.1 The Kalman filter 11
      2.2.2 Illustration 12
   2.3 Forecast errors 13
      2.3.1 Cholesky decomposition 14
      2.3.2 Error recursions 15
   2.4 State smoothing 16
      2.4.1 Smoothed state 16
      2.4.2 Smoothed state variance 17
      2.4.3 Illustration 18
   2.5 Disturbance smoothing 19
      2.5.1 Smoothed observation disturbances 20
      2.5.2 Smoothed state disturbances 20
      2.5.3 Illustration 21
      2.5.4 Cholesky decomposition and smoothing 22
   2.6 Simulation 22
      2.6.1 Illustration 23
   2.7 Missing observations 23
      2.7.1 Illustration 25
   2.8 Forecasting 25
      2.8.1 Illustration 27
   2.9 Initialisation 27
   2.10 Parameter estimation 30
      2.10.1 Loglikelihood evaluation 30
      2.10.2 Concentration of loglikelihood 31
      2.10.3 Illustration 32
   2.11 Steady state 32
   2.12 Diagnostic checking 33
      2.12.1 Diagnostic tests for forecast errors 33
      2.12.2 Detection of outliers and structural breaks 35
      2.12.3 Illustration 35
   2.13 Appendix: Lemma in multivariate normal regression 37

3 Linear Gaussian state space models 38
   3.1 Introduction 38
   3.2 Structural time series models 39
      3.2.1 Univariate models 39
      3.2.2 Multivariate models 44
      3.2.3 STAMP 45
   3.3 ARMA models and ARIMA models 46
   3.4 Exponential smoothing 49
   3.5 State space versus Box-Jenkins approaches 51
   3.6 Regression with time-varying coefficients 54
   3.7 Regression with ARMA errors 54
   3.8 Benchmarking 54
   3.9 Simultaneous modelling of series from different sources 56
   3.10 State space models in continuous time 57
      3.10.1 Local level model 57
      3.10.2 Local linear trend model 59
   3.11 Spline smoothing 61
      3.11.1 Spline smoothing in discrete time 61
      3.11.2 Spline smoothing in continuous time 62

4 Filtering, smoothing and forecasting 64
   4.1 Introduction 64
   4.2 Filtering 65
      4.2.1 Derivation of Kalman filter 65
      4.2.2 Kalman filter recursion 67
      4.2.3 Steady state 68
      4.2.4 State estimation errors and forecast errors 68
   4.3 State smoothing 70
      4.3.1 Smoothed state vector 70
      4.3.2 Smoothed state variance matrix 72
      4.3.3 State smoothing recursion 73
   4.4 Disturbance smoothing 73
      4.4.1 Smoothed disturbances 73
      4.4.2 Fast state smoothing 75
      4.4.3 Smoothed disturbance variance matrices 75
      4.4.4 Disturbance smoothing recursion 76
   4.5 Covariance matrices of smoothed estimators 77
   4.6 Weight functions 81
      4.6.1 Introduction 81
      4.6.2 Filtering weights 81
      4.6.3 Smoothing weights 82
   4.7 Simulation smoothing 83
      4.7.1 Simulating observation disturbances 84
      4.7.2 Derivation of simulation smoother for observation disturbances 87
      4.7.3 Simulation smoothing recursion 89
      4.7.4 Simulating state disturbances 90
      4.7.5 Simulating state vectors 91
      4.7.6 Simulating multiple samples 92
   4.8 Missing observations 92
   4.9 Forecasting 93
   4.10 Dimensionality of observational vector 94
   4.11 General matrix form for filtering and smoothing 95

5 Initialisation of filter and smoother 99
   5.1 Introduction 99
   5.2 The exact initial Kalman filter 101
      5.2.1 The basic recursions 101
      5.2.2 Transition to the usual Kalman filter 104
      5.2.3 A convenient representation 105
   5.3 Exact initial state smoothing 106
      5.3.1 Smoothed mean of state vector 106
      5.3.2 Smoothed variance of state vector 107
   5.4 Exact initial disturbance smoothing 109
   5.5 Exact initial simulation smoothing 110
   5.6 Examples of initial conditions for some models 110
      5.6.1 Structural time series models 110
      5.6.2 Stationary ARMA models 111
      5.6.3 Nonstationary ARIMA models 112
      5.6.4 Regression model with ARMA errors 114
      5.6.5 Spline smoothing 115
   5.7 Augmented Kalman filter and smoother 115
      5.7.1 Introduction 115
      5.7.2 Augmented Kalman filter 115
      5.7.3 Filtering based on the augmented Kalman filter 116
      5.7.4 Illustration: the local linear trend model 118
      5.7.5 Comparisons of computational efficiency 119
      5.7.6 Smoothing based on the augmented Kalman filter 120

6 Further computational aspects 121
   6.1 Introduction 121
   6.2 Regression estimation 121
      6.2.1 Introduction 121
      6.2.2 Inclusion of coefficient vector in state vector 122
      6.2.3 Regression estimation by augmentation 122
      6.2.4 Least squares and recursive residuals 123
   6.3 Square root filter and smoother 124
      6.3.1 Introduction 124
      6.3.2 Square root form of variance updating 125
      6.3.3 Givens rotations 126
      6.3.4 Square root smoothing 127
      6.3.5 Square root filtering and initialisation 127
      6.3.6 Illustration: local linear trend model 128
   6.4 Univariate treatment of multivariate series 128
      6.4.1 Introduction 128
      6.4.2 Details of univariate treatment 129
      6.4.3 Correlation between observation equations 131
      6.4.4 Computational efficiency 132
      6.4.5 Illustration: vector splines 133
   6.5 Filtering and smoothing under linear restrictions 134
   6.6 The algorithms of SsfPack 134
      6.6.1 Introduction 134
      6.6.2 The SsfPack function 135
      6.6.3 Illustration: spline smoothing 136

7 Maximum likelihood estimation 138
   7.1 Introduction 138
   7.2 Likelihood evaluation 138
      7.2.1 Loglikelihood when initial conditions are known 138
      7.2.2 Diffuse loglikelihood 139
      7.2.3 Diffuse loglikelihood evaluated via augmented Kalman filter 140
      7.2.4 Likelihood when elements of initial state vector are fixed but unknown 141
   7.3 Parameter estimation 142
      7.3.1 Introduction 142
      7.3.2 Numerical maximisation algorithms 142
      7.3.3 The score vector 144
      7.3.4 The EM algorithm 147
      7.3.5 Parameter estimation when dealing with diffuse initial conditions 149
      7.3.6 Large sample distribution of maximum likelihood estimates 150
      7.3.7 Effect of errors in parameter estimation 150
   7.4 Goodness of fit 152
   7.5 Diagnostic checking 152

8 Bayesian analysis 155
   8.1 Introduction 155
   8.2 Posterior analysis of state vector 155
      8.2.1 Posterior analysis conditional on parameter vector 155
      8.2.2 Posterior analysis when parameter vector is unknown 155
      8.2.3 Non-informative priors 158
   8.3 Markov chain Monte Carlo methods 159

9 Illustrations of the use of the linear Gaussian model 161
   9.1 Introduction 161
   9.2 Structural time series models 161
   9.3 Bivariate structural time series analysis 167
   9.4 Box-Jenkins analysis 169
   9.5 Spline smoothing 172
   9.6 Approximate methods for modelling volatility 175

II NON-GAUSSIAN AND NONLINEAR STATE SPACE MODELS

10 Non-Gaussian and nonlinear state space models 179
   10.1 Introduction 179
   10.2 The general non-Gaussian model 179
   10.3 Exponential family models 180
      10.3.1 Poisson density 181
      10.3.2 Binary density 181
      10.3.3 Binomial density 181
      10.3.4 Negative binomial density 182
      10.3.5 Multinomial density 182
   10.4 Heavy-tailed distributions 183
      10.4.1 t-distribution 183
      10.4.2 Mixture of normals 184
      10.4.3 General error distribution 184
   10.5 Nonlinear models 184
   10.6 Financial models 185
      10.6.1 Stochastic volatility models 185
      10.6.2 General autoregressive conditional heteroscedasticity 187
      10.6.3 Durations: exponential distribution 188
      10.6.4 Trade frequencies: Poisson distribution 188

11 Importance sampling 189
   11.1 Introduction 189
   11.2 Basic ideas of importance sampling 190
   11.3 Linear Gaussian approximating models 191
   11.4 Linearisation based on first two derivatives 193
      11.4.1 Exponential family models 195
      11.4.2 Stochastic volatility model 195
   11.5 Linearisation based on the first derivative 195
      11.5.1 t-distribution 197
      11.5.2 Mixture of normals 197
      11.5.3 General error distribution 197
   11.6 Linearisation for non-Gaussian state components 198
      11.6.1 t-distribution for state errors 199
   11.7 Linearisation for nonlinear models 199
      11.7.1 Multiplicative models 201
   11.8 Estimating the conditional mode 202
   11.9 Computational aspects of importance sampling 204
      11.9.1 Introduction 204
      11.9.2 Practical implementation of importance sampling 204
      11.9.3 Antithetic variables 205
      11.9.4 Diffuse initialisation 206
      11.9.5 Treatment of t-distribution without importance sampling 208
      11.9.6 Treatment of Gaussian mixture distributions without importance sampling 210

12 Analysis from a classical standpoint 212
   12.1 Introduction 212
   12.2 Estimating conditional means and variances 212
   12.3 Estimating conditional densities and distribution functions 213
   12.4 Forecasting and estimating with missing observations 214
   12.5 Parameter estimation 215
      12.5.1 Introduction 215
      12.5.2 Estimation of likelihood 215
      12.5.3 Maximisation of loglikelihood 216
      12.5.4 Variance matrix of maximum likelihood estimate 217
      12.5.5 Effect of errors in parameter estimation 217
      12.5.6 Mean square error matrix due to simulation 217
      12.5.7 Estimation when the state disturbances are Gaussian 219
      12.5.8 Control variables 219

13 Analysis from a Bayesian standpoint 222
   13.1 Introduction 222
   13.2 Posterior analysis of functions of the state vector 222
   13.3 Computational aspects of Bayesian analysis 225
   13.4 Posterior analysis of parameter vector 226
   13.5 Markov chain Monte Carlo methods 228

14 Non-Gaussian and nonlinear illustrations 230
   14.1 Introduction 230
   14.2 Poisson density: van drivers killed in Great Britain 230
   14.3 Heavy-tailed density: outlier in gas consumption in UK 233
   14.4 Volatility: pound/dollar daily exchange rates 236
   14.5 Binary density: Oxford-Cambridge boat race 237
   14.6 Non-Gaussian and nonlinear analysis using SsfPack 238

References 241

Author index 249

Subject index 251


1 Introduction

1.1 Basic ideas of state space analysis

State space modelling provides a unified methodology for treating a wide range of problems in time series analysis. In this approach it is assumed that the development over time of the system under study is determined by an unobserved series of vectors $\alpha_1, \ldots, \alpha_n$, with which are associated a series of observations $y_1, \ldots, y_n$; the relation between the $\alpha_t$'s and the $y_t$'s is specified by the state space model. The purpose of state space analysis is to infer the relevant properties of the $\alpha_t$'s from a knowledge of the observations $y_1, \ldots, y_n$. This book presents a systematic treatment of this approach to problems of time series analysis.

Our starting point when deciding the structure of the book was that we wanted to make the basic ideas of state space analysis easy to understand for readers with no previous knowledge of the approach. We felt that if we had begun the book by developing the theory step by step for a general state space model, the underlying ideas would be obscured by the complicated appearance of many of the formulae. We therefore decided instead to devote Chapter 2 of the book to a particularly simple example of a state space model, the local level model, and to develop as many as possible of the basic state space techniques for this model. Our hope is that this will enable readers new to the techniques to gain insights into the ideas behind state space methodology that will help them when working through the greater complexities of the treatment of the general case. With this purpose in mind, we introduce topics such as Kalman filtering, state smoothing, disturbance smoothing, simulation smoothing, missing observations, forecasting, initialisation, maximum likelihood estimation of parameters and diagnostic checking for the local level model. We demonstrate how the basic theory that is needed can be developed very simply from elementary results in regression theory.

1.2 Linear Gaussian model

Before going on to develop the theory for the general model, we present a series of examples that show how the linear Gaussian state space model relates to problems of practical interest. This is done in Chapter 3 where we begin by showing how structural time series models can be put into state space form. By structural time series models we mean models in which the observations are made up of trend, seasonal, cycle and regression components plus error, where the components are generally represented by forms of random walk models. We go on to put Box-Jenkins ARIMA models into state space form, thus demonstrating that these models are special cases of state space models. Next we discuss the history of exponential smoothing and show how it relates to simple forms of state space and ARIMA models. We then compare the relative merits of Box-Jenkins and state space methodologies for applied time series analysis. We follow this by considering various aspects of regression with or without time-varying coefficients or autocorrelated errors. We also show that the benchmarking problem, which is an important problem in official statistics, can be dealt with by state space methods. Further topics discussed are simultaneous modelling of series from different sources, continuous time models and spline smoothing in discrete and continuous time.

Chapter 4 begins the development of the theory for the analysis of the general linear Gaussian state space model. This model has provided the basis for most earlier work on state space time series analysis. We give derivations of the Kalman filter and smoothing recursions for the estimation of the state vector and its conditional variance matrix given the data. We also derive recursions for estimating the observation and state disturbances. We derive the simulation smoother which is an important tool in the simulation methods we employ later in the book. We show that forecasting and allowance for missing observations are easily dealt with in the state space framework.

Computational algorithms in state space analyses are mainly based on recursions, that is, formulae in which we calculate the value at time $t+1$ from earlier values for $t, t-1, \ldots, 1$. The question of how these recursions are started up at the beginning of the series is called initialisation; it is dealt with in Chapter 5. We give a general treatment in which some elements of the initial state vector have known distributions while others are diffuse, that is, treated as random variables with infinite variance, or are treated as unknown constants to be estimated by maximum likelihood. The treatment is based on recent work and we regard it as more transparent than earlier discussions of the problem.

Chapter 6 discusses computational aspects of filtering and smoothing and begins by considering the estimation of the regression component of the model. It next considers the square root filter and smoother which may be used when the Kalman filter and smoother show signs of numerical instability. It goes on to discuss how multivariate time series can be treated as univariate series by bringing elements of the observational vectors into the system one at a time, with computational savings relative to the multivariate treatment in some cases. The chapter concludes by discussing computer algorithms.

In Chapter 7, maximum likelihood estimation of unknown parameters is considered both for the case where the distribution of the initial state vector is known and for the case where at least some elements of the vector are diffuse or are treated as fixed and unknown. The use of the score vector and the EM algorithm is discussed. The effect of parameter estimation on variance estimation is examined.

Up to this point the exposition has been based on the classical approach to inference in which formulae are worked out on the assumption that parameters are known, while in applications parameter values are replaced by appropriate estimates. In Bayesian analysis the parameters are treated as random variables with a specified or a non-informative prior joint density. In Chapter 8 we consider Bayesian analysis of the linear Gaussian model both for the case where the prior density is proper and for the case where it is non-informative. We give formulae from which the posterior mean can be calculated for functions of the state vector, either by numerical integration or by simulation. We restrict attention to functions which, for given values of the parameters, can be calculated by the Kalman filter and smoother; this however includes a substantial number of quantities of interest.

In Chapter 9 we illustrate the use of the methodology by applying the techniques that have been developed to a number of analyses based on real data. These include a study of the effect of the seat belt law on road accidents in Great Britain, forecasting the number of users logged on to an Internet server, fitting acceleration against time for a simulated motorcycle accident and modelling the volatility of financial time series.

1.3 Non-Gaussian and nonlinear models

Part II of the book extends the treatment to state space models which are not both linear and Gaussian. The analyses are based on simulation using the approach developed by Durbin and Koopman (2000). Chapter 10 illustrates the range of non-Gaussian and nonlinear models that can be analysed using the methods of Part II. This includes exponential family models such as the Poisson distribution for the conditional distribution of the observations given the state. It also includes heavy-tailed distributions for the observational and state disturbances, such as the $t$-distribution and mixtures of normal densities. Departures from linearity of the models are studied for cases where the basic state space structure is preserved. Financial models such as stochastic volatility models are investigated from the state space point of view.

The simulation techniques employed in this book for handling non-Gaussian models are based on importance sampling and antithetic variables. The methods that we use are described in Chapter 11. We first show how to calculate the conditional mode of the state given the observations for the non-Gaussian model by iterated use of the Kalman filter and smoother. We then find the linear Gaussian model with the same conditional mode given the observations and we use the conditional density of the state given the observations for this model as the importance density. Since this density is Gaussian we can draw random samples from it for the simulation using the simulation smoother described in Chapter 4. To improve efficiency we introduce two antithetic variables intended to balance the simulation sample for location and scale. Finally, we show how cases where the observation or state errors have $t$-distributions or distributions which are mixtures of normal densities can be treated without using importance sampling.

Chapter 12 discusses the implementation of importance sampling numerically. It describes how to estimate conditional means and variances of functions of elements of the state vector given the observations and how to estimate conditional densities and distributions of these functions. It also shows how variances due to simulation can be estimated by means of simple formulae. It next discusses the estimation of unknown parameters by maximum likelihood. The final topic considered is the effect of errors in the estimation of the parameters on the estimates.

In Chapter 13 we show that by an extension of the importance sampling techniques of Chapter 11 we can obtain a Bayesian treatment for non-Gaussian and nonlinear models. Estimates of posterior means, variances, densities and distribution functions of functions of elements of the state vector are provided. We also estimate the posterior densities of the parameters of the model. Estimates of variances due to simulation are obtained.

We provide examples in Chapter 14 which illustrate the methods that have been developed in Part II for analysing observations using non-Gaussian and nonlinear state space models. We cover both classical and Bayesian analyses. The illustrations include the monthly number of van drivers killed in road accidents in Great Britain, outlying observations in quarterly gas consumption, the volatility of exchange rate returns and analysis of the results of the annual boat race between teams of the universities of Oxford and Cambridge. They demonstrate that the methods we have developed work well in practice.

1.4 Prior knowledge

Only basic knowledge of statistics and matrix algebra is needed in order to understand the theory in this book. In statistics, an elementary knowledge is required of the conditional distribution of a vector $y$ given a vector $x$ in a multivariate normal distribution; the central results needed from this area for much of the theory of the book are stated in the lemma in §2.13. Little previous knowledge of time series analysis is required beyond an understanding of the concepts of a stationary time series and the autocorrelation function. In matrix algebra all that is needed are matrix multiplication and inversion of matrices, together with basic concepts such as rank and trace.

1.5 Notation

Although a large number of mathematical symbols are required for the exposition of the theory in this book, we decided to confine ourselves to the standard English and Greek alphabets. The effect of this is that we occasionally need to use the same symbol more than once; we have aimed however at ensuring that the meaning of the symbol is always clear from the context.

• The same symbol 0 is used to denote zero, a vector of zeros or a matrix of zeros.

• We use the generic notation $p(\cdot)$, $p(\cdot, \cdot)$, $p(\cdot \mid \cdot)$ to denote a probability density, a joint probability density and a conditional probability density.

• If $x$ is a vector which is normally distributed with mean vector $\mu$ and variance matrix $V$, we write $x \sim N(\mu, V)$.

• If $x$ is a random variable with the chi-squared distribution with $v$ degrees of freedom, we write $x \sim \chi^2_v$.

• We use the same symbol $\mathrm{Var}(x)$ to denote the variance of a scalar random variable $x$ and the variance matrix of a random vector $x$. Similarly, we use the same symbol $\mathrm{Cov}(x, y)$ to denote the covariance between scalar random variables $x$ and $y$ and between random vectors $x$ and $y$.

1.6 Other books on state space methods

Without claiming complete coverage, we list here a number of books which contain treatments of state space methods.

First we mention three early books written from an engineering standpoint: Jazwinski (1970), Sage and Melsa (1971) and Anderson and Moore (1979). A later book from a related standpoint is Young (1984).

Books written from the standpoint of statistics and econometrics include Harvey (1989), who gives a comprehensive state space treatment of structural time series models together with related state space material, West and Harrison (1997), who give a Bayesian treatment with emphasis on forecasting, Kitagawa and Gersch (1996) and Kim and Nelson (1999).

More general books on time series analysis and related topics which cover partial treatments of state space topics include Brockwell and Davis (1987) (39 pages on state space out of about 570), Harvey (1993) (48 pages out of about 300), Hamilton (1994) (one chapter of 37 pages on state space out of about 800 pages) and Shumway and Stoffer (2000) (one chapter of 112 pages out of about 545 pages). The monograph of Jones (1993) on longitudinal models has three chapters on state space (66 pages out of about 225). The book by Fahrmeir and Tutz (1994) on multivariate analysis based on generalised linear modelling has a chapter on state space models (48 pages out of about 420).

Books on time series analysis and similar topics with minor treatments of state space analysis include Granger and Newbold (1986) and Mills (1993). We mention finally the book edited by Doucet, de Freitas and Gordon (2000) which contains a collection of articles on Monte Carlo (particle) filtering and the book edited by Akaike and Kitagawa (1999) which contains six chapters (88 pages) on illustrations of state space analysis out of a total of 22 chapters (385 pages).

1.7 Website for the book

We will maintain a website for the book at http://www.ssfpack.com/dkbook/ for data, code, corrections and other relevant information. We will be grateful to readers who inform us of comments on, and errors in, the book so that corrections can be placed on the site.


Part I

The linear Gaussian state space model

In Part I we present a state-of-the-art treatment of the construction and analysis of linear Gaussian state space models, and we discuss the software required for implementing the resulting methodology. Methods based on these models, possibly after transformation of the observations, are appropriate for a wide range of problems in practical time series analysis. We also present illustrations of the applications of the methods to real series.


2 Local level model

2.1 Introduction

The purpose of this chapter is to introduce the basic techniques of state space analysis, such as filtering, smoothing, initialisation and forecasting, in terms of a simple example of a state space model, the local level model. This is intended to help beginners grasp the underlying ideas more quickly than they would if we were to begin the book with a systematic treatment of the general case. So far as inference is concerned, we shall limit the discussion to the classical standpoint; a Bayesian treatment of the local level model may be obtained as a special case of the Bayesian analysis of the linear Gaussian model in Chapter 8.

A time series is a set of observations $y_1, \ldots, y_n$ ordered in time. The basic model for representing a time series is the additive model

$$y_t = \mu_t + \gamma_t + \varepsilon_t, \qquad t = 1, \ldots, n. \tag{2.1}$$

Here, $\mu_t$ is a slowly varying component called the trend, $\gamma_t$ is a periodic component of fixed period called the seasonal and $\varepsilon_t$ is an irregular component called the error or disturbance. In general, the observation $y_t$ and the other variables in (2.1) can be vectors but in this chapter we assume they are scalars. In many applications, particularly in economics, the components combine multiplicatively, giving

$$y_t = \mu_t \gamma_t \varepsilon_t. \tag{2.2}$$

By taking logs, however, and working with logged values, model (2.2) reduces to model (2.1), so we can use model (2.1) for this case also.

To develop suitable models for $\mu_t$ and $\gamma_t$ we need the concept of a random walk. This is a scalar series $\alpha_t$ determined by the relation $\alpha_{t+1} = \alpha_t + \eta_t$, where the $\eta_t$'s are independent and identically distributed random variables with zero means and variances $\sigma_\eta^2$.

Consider a simple form of model (2.1) in which $\mu_t = \alpha_t$ where $\alpha_t$ is a random walk, no seasonal is present and all random variables are normally distributed. We assume that $\varepsilon_t$ has constant variance $\sigma_\varepsilon^2$. This gives the model

$$y_t = \alpha_t + \varepsilon_t, \qquad \varepsilon_t \sim N(0, \sigma_\varepsilon^2),$$
$$\alpha_{t+1} = \alpha_t + \eta_t, \qquad \eta_t \sim N(0, \sigma_\eta^2), \tag{2.3}$$

for $t = 1, \ldots, n$, where the $\varepsilon_t$'s and $\eta_t$'s are all mutually independent and are independent of $\alpha_1$. This model is called the local level model. Although it has a simple form, this model is not an artificial special case and indeed it provides the basis for the analysis of important real problems in practical time series analysis; for example, it provides the basis for exponentially weighted moving averages as discussed in §3.4. It exhibits the characteristic structure of state space models in which there is a series of unobserved values $\alpha_1, \ldots, \alpha_n$ which represents the development over time of the system under study, together with a set of observations $y_1, \ldots, y_n$ which are related to the $\alpha_t$'s by the state space model (2.3). The object of the methodology that we shall develop is to infer relevant properties of the $\alpha_t$'s from a knowledge of the observations $y_1, \ldots, y_n$.

We assume initially that $\alpha_1 \sim N(a_1, P_1)$ where $a_1$ and $P_1$ are known, and that $\sigma_\varepsilon^2$ and $\sigma_\eta^2$ are known. Since random walks are non-stationary, the model is non-stationary. By non-stationary here we mean that the distributions of the random variables $y_t$ and $\alpha_t$ depend on the time $t$.
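As a concrete illustration of model (2.3), the following minimal sketch simulates a path from the local level model. It is our own illustrative code, not the book's; it is written in Python with NumPy, and the parameter values in the example call are arbitrary.

```python
import numpy as np

def simulate_local_level(n, sigma2_eps, sigma2_eta, a1, P1, seed=0):
    """Simulate y_1,...,y_n and alpha_1,...,alpha_n from the local level
    model (2.3): y_t = alpha_t + eps_t, alpha_{t+1} = alpha_t + eta_t."""
    rng = np.random.default_rng(seed)
    alpha = np.empty(n)
    alpha[0] = a1 + np.sqrt(P1) * rng.standard_normal()  # alpha_1 ~ N(a1, P1)
    eta = np.sqrt(sigma2_eta) * rng.standard_normal(n - 1)
    for t in range(n - 1):
        alpha[t + 1] = alpha[t] + eta[t]                 # random walk state
    eps = np.sqrt(sigma2_eps) * rng.standard_normal(n)
    y = alpha + eps                                      # observations
    return y, alpha

# Illustrative parameter values (ours, not from the book):
y, alpha = simulate_local_level(n=100, sigma2_eps=1.0, sigma2_eta=0.1,
                                a1=0.0, P1=1.0)
```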

For applications of the model to real series using classical inference, we need to compute quantities such as the mean of $\alpha_t$ given $y_1, \ldots, y_{t-1}$ or the mean of $\alpha_t$ given $y_1, \ldots, y_n$, together with their variances; we also need to fit the model to data by calculating maximum likelihood estimates of the parameters $\sigma_\varepsilon^2$ and $\sigma_\eta^2$. In principle, this could be done by using standard results from multivariate normal theory as described in books such as Anderson (1984). In this approach the observations $y_t$ generated by the local level model are represented as an $n \times 1$ vector $y$ such that

$$y \sim N(1 a_1, \Omega), \qquad \text{with} \quad y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad 1 = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}, \quad \Omega = 1 1' P_1 + \Sigma, \tag{2.4}$$

where the $(i, j)$th element of the $n \times n$ matrix $\Sigma$ is given by

$$\sigma_{ij} = \begin{cases} (i-1)\sigma_\eta^2 + \sigma_\varepsilon^2, & i = j, \\ (i-1)\sigma_\eta^2, & i < j, \\ (j-1)\sigma_\eta^2, & i > j, \end{cases} \qquad i, j = 1, \ldots, n, \tag{2.5}$$

which follows since the local level model implies that

$$y_t = \alpha_1 + \sum_{j=1}^{t-1} \eta_j + \varepsilon_t, \qquad t = 1, \ldots, n. \tag{2.6}$$

Starting from this knowledge of the distribution of $y$, estimation of conditional means, variances and covariances is in principle a routine matter using standard results in multivariate analysis based on the properties of the multivariate normal distribution. However, because of the serial correlation between the observations $y_t$, the routine computations rapidly become cumbersome as $n$ increases. This naive approach to estimation can be improved on considerably by using the filtering and smoothing techniques described in the next three sections. In effect, these techniques provide efficient computing algorithms for obtaining the same results as those derived by multivariate analysis theory. The remaining sections of this chapter deal with other important issues such as fitting the local level model and forecasting future observations.
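To make the naive approach concrete, the sketch below (our own illustration, not the book's code) constructs the variance matrix $\Omega$ of (2.4) element by element from (2.5). Working with $\Omega$ directly requires storing and inverting an $n \times n$ matrix, which is exactly the cost the recursions of the next sections avoid.

```python
import numpy as np

def build_Omega(n, sigma2_eps, sigma2_eta, P1):
    """Variance matrix Omega = 1 1' P1 + Sigma of (2.4), where by (2.5)
    Cov(y_i, y_j) = P1 + (min(i, j) - 1) sigma2_eta + (i == j) sigma2_eps."""
    idx = np.arange(1, n + 1)
    min_ij = np.minimum.outer(idx, idx)          # matrix with entries min(i, j)
    Omega = P1 + (min_ij - 1.0) * sigma2_eta
    Omega[np.diag_indices(n)] += sigma2_eps
    return Omega

# Brute-force estimation of E(alpha_t | y) would now proceed by multivariate
# normal regression of alpha_t on y, at O(n^3) cost for the inversion of Omega.
```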

The local level model (2.3) is a simple example of a linear Gaussian state space model. The variable $\alpha_t$ is called the state and is unobserved. The overall object of the analysis is to study the development of the state over time using the observed values $y_1, \ldots, y_n$. Further examples of these models will be described in Chapter 3. The general form of the linear Gaussian state space model is given in equation (3.1); its properties will be considered in detail from the standpoint of classical inference in Chapters 4-7. A Bayesian treatment of the model will be given in Chapter 8.

2.2 Filtering

2.2.1 THE KALMAN FILTER

The object of filtering is to update our knowledge of the system each time a new observation $y_t$ is brought in. We shall develop the theory of filtering for the local level model (2.3). Since all distributions are normal, conditional joint distributions of one set of observations given another set are also normal. Let $Y_{t-1}$ be the set of past observations $\{y_1, \ldots, y_{t-1}\}$ and assume that the conditional distribution of $\alpha_t$ given $Y_{t-1}$ is $N(a_t, P_t)$ where $a_t$ and $P_t$ are to be determined. Given that $a_t$ and $P_t$ are known, our object is to calculate $a_{t+1}$ and $P_{t+1}$ when $y_t$ is brought in. We do this by using some results from elementary regression theory.

Since $a_{t+1} = E(\alpha_{t+1} \mid Y_t) = E(\alpha_t + \eta_t \mid Y_t)$ and $P_{t+1} = \mathrm{Var}(\alpha_{t+1} \mid Y_t) = \mathrm{Var}(\alpha_t + \eta_t \mid Y_t)$ from (2.3), we have

$$a_{t+1} = E(\alpha_t \mid Y_t), \qquad P_{t+1} = \mathrm{Var}(\alpha_t \mid Y_t) + \sigma_\eta^2. \tag{2.7}$$

Define $v_t = y_t - a_t$ and $F_t = \mathrm{Var}(v_t)$. Then

$$E(v_t \mid Y_{t-1}) = E(\alpha_t + \varepsilon_t - a_t \mid Y_{t-1}) = a_t - a_t = 0.$$

Thus $E(v_t) = E[E(v_t \mid Y_{t-1})] = 0$ and $\mathrm{Cov}(v_t, y_j) = E(v_t y_j) = E[E(v_t \mid Y_{t-1}) y_j] = 0$, so $v_t$ and $y_j$ are independent for $j = 1, \ldots, t-1$. When $Y_t$ is fixed, $Y_{t-1}$ and $y_t$ are fixed, so $Y_{t-1}$ and $v_t$ are fixed, and vice versa. Consequently, $E(\alpha_t \mid Y_t) = E(\alpha_t \mid Y_{t-1}, v_t)$ and $\mathrm{Var}(\alpha_t \mid Y_t) = \mathrm{Var}(\alpha_t \mid Y_{t-1}, v_t)$. Since all variables are normally distributed, the conditional expectation and variance are given by standard formulae from multivariate normal regression theory. For a general treatment of the results required see the regression lemma in §2.13.

It follows from equation (2.49) of this lemma that

$$E(\alpha_t \mid Y_t) = E(\alpha_t \mid Y_{t-1}, v_t) = E(\alpha_t \mid Y_{t-1}) + \mathrm{Cov}(\alpha_t, v_t)\,\mathrm{Var}(v_t)^{-1} v_t, \tag{2.8}$$


where

$$\mathrm{Cov}(\alpha_t, v_t) = E[\alpha_t(y_t - a_t)] = E[\alpha_t(\alpha_t + \varepsilon_t - a_t)] = E[\alpha_t(\alpha_t - a_t)] = \mathrm{Var}(\alpha_t \mid Y_{t-1}) = P_t,$$

and

$$\mathrm{Var}(v_t) = F_t = \mathrm{Var}(\alpha_t + \varepsilon_t - a_t) = \mathrm{Var}(\alpha_t \mid Y_{t-1}) + \mathrm{Var}(\varepsilon_t) = P_t + \sigma_\varepsilon^2.$$

Since $a_t = E(\alpha_t \mid Y_{t-1})$ we have from (2.8)

$$E(\alpha_t \mid Y_t) = a_t + K_t v_t, \tag{2.9}$$

where $K_t = P_t / F_t$ is the regression coefficient of $\alpha_t$ on $v_t$. From equation (2.50) of the regression lemma in §2.13 we have

$$\mathrm{Var}(\alpha_t \mid Y_t) = \mathrm{Var}(\alpha_t \mid Y_{t-1}, v_t) = \mathrm{Var}(\alpha_t \mid Y_{t-1}) - \mathrm{Cov}(\alpha_t, v_t)^2 \mathrm{Var}(v_t)^{-1} = P_t - P_t^2 / F_t = P_t(1 - K_t). \tag{2.10}$$

From (2.7), (2.9) and (2.10), we have the full set of relations for updating from time $t$ to time $t+1$,

$$v_t = y_t - a_t, \quad F_t = P_t + \sigma_\varepsilon^2, \quad K_t = P_t / F_t, \quad a_{t+1} = a_t + K_t v_t, \quad P_{t+1} = P_t(1 - K_t) + \sigma_\eta^2, \tag{2.11}$$

for $t = 1, \ldots, n$. Note that $a_1$ and $P_1$ are assumed known here; however, more general initial specifications will be dealt with in §2.9. Relations (2.11) constitute the celebrated Kalman filter for this problem. It should be noted that $P_t$ depends only on $\sigma_\varepsilon^2$ and $\sigma_\eta^2$ and does not depend on $Y_{t-1}$. We include the case $t = n$ in (2.11) for convenience even though $a_{n+1}$ and $P_{n+1}$ are not normally needed for anything except forecasting.

The notation employed in (2.11) is not the simplest that could have been used; we have chosen it in order to be compatible with the notation that we consider appropriate for the treatment of the general multivariate model in Chapter 4. A set of relations such as (2.11) which enables us to calculate quantities for $t+1$ given those for $t$ is called a recursion. Formulae (2.9) and (2.10) could be derived in many other ways. For example, a routine Bayesian argument for normal densities could be used in which the prior density is $p(\alpha_t \mid Y_{t-1}) = N(a_t, P_t)$, the likelihood is $p(y_t \mid \alpha_t)$ and we obtain the posterior density as $p(\alpha_t \mid Y_t) = N(a_t^*, P_t^*)$, where $a_t^*$ and $P_t^*$ are given by (2.9) and (2.10), respectively. We have used the regression approach because it is particularly simple and direct for deriving filtering and smoothing recursions for the general state space model in Chapter 4.
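The recursion (2.11) translates directly into a few lines of code. The sketch below is our own illustrative implementation for the local level model (the function name and signature are ours, not the book's); it returns the innovations $v_t$, their variances $F_t$, and the filtered quantities $a_t$ and $P_t$.

```python
import numpy as np

def kalman_filter_local_level(y, sigma2_eps, sigma2_eta, a1, P1):
    """Kalman filter (2.11) for the local level model (2.3)."""
    n = len(y)
    a = np.empty(n + 1)   # a[t] holds the book's a_{t+1}; a[0] = a_1
    P = np.empty(n + 1)
    v = np.empty(n)       # innovations v_t = y_t - a_t
    F = np.empty(n)       # innovation variances
    a[0], P[0] = a1, P1
    for t in range(n):
        v[t] = y[t] - a[t]
        F[t] = P[t] + sigma2_eps                  # F_t = P_t + sigma_eps^2
        K = P[t] / F[t]                           # K_t = P_t / F_t
        a[t + 1] = a[t] + K * v[t]                # a_{t+1} = a_t + K_t v_t
        P[t + 1] = P[t] * (1 - K) + sigma2_eta    # P_{t+1} = P_t(1-K_t) + sigma_eta^2
    return v, F, a, P
```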

2.2.2 ILLUSTRATION

In this chapter we shall illustrate the algorithms using observations from the river Nile. The data set consists of a series of readings of the annual flow volume at Aswan from 1871 to 1970. The series has been analysed by Cobb (1978) and Balke (1993). We analyse the data using the local level model (2.3) with $a_1 = 0$, $P_1 = 10^7$, $\sigma_\varepsilon^2 = 15099$ and $\sigma_\eta^2 = 1469.1$. The values for $a_1$ and $P_1$ were chosen arbitrarily for illustrative purposes. The values for $\sigma_\varepsilon^2$ and $\sigma_\eta^2$ are the maximum likelihood estimates which we obtain in §2.10.3. The output of the Kalman filter (that is, $v_t$, $F_t$, $a_t$ and $P_t$ for $t = 2, \ldots, n$) is presented graphically together with the raw data in Figure 2.1.

Fig. 2.1. Nile data and output of Kalman filter: (i) filtered state $a_t$ and its 90% confidence intervals; (ii) filtered state variance $P_t$; (iii) prediction errors $v_t$; (iv) prediction variance $F_t$.

The most obvious feature of the four graphs is that $F_t$ and $P_t$ converge rapidly to constant values, which confirms that the local level model has a steady state solution; for discussion of the concept of a steady state see §2.11. Numerically, this local level model converged to its steady state in around 25 updates of $P_t$, although the graph of $P_t$ seems to suggest that the steady state was obtained after around 10 updates.
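Using the filter sketched after (2.11) above, this illustration can be reproduced once the readings are available as an array; `nile` below is a placeholder for the data, and the parameter values are the ones quoted in the text. The convergence check is our own rough numerical analogue of the steady-state observation; the update count it reports depends on the tolerance chosen.

```python
import numpy as np

# 'nile' is assumed to hold the 100 annual flow readings, 1871-1970.
v, F, a, P = kalman_filter_local_level(nile, sigma2_eps=15099.0,
                                       sigma2_eta=1469.1, a1=0.0, P1=1e7)

# First update after which P_t changes by less than a small tolerance;
# for a tolerance of this order the text reports around 25 updates.
diffs = np.abs(np.diff(P))
steady = int(np.argmax(diffs < 1e-6))
print("P_t numerically constant after about", steady, "updates")
```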

2.3 Forecast errors

The Kalman filter residual $v_t = y_t - a_t$ and its variance $F_t$ are the one-step-ahead forecast error and the one-step-ahead forecast error variance of $y_t$ given $Y_{t-1}$, as defined in §2.2. The forecast errors $v_1, \ldots, v_n$ are sometimes called innovations because they represent the new part of $y_t$ that cannot be predicted from the past for $t = 1, \ldots, n$. We shall make use of $v_t$ and $F_t$ for a variety of results in the next sections. It is therefore important to study them in detail.



2.3.1 CHOLESKY DECOMPOSITION

First we show that $v_1, \ldots, v_n$ are mutually independent. The joint density of $y_1, \ldots, y_n$ is

$$p(y_1, \ldots, y_n) = p(y_1) \prod_{t=2}^{n} p(y_t \mid Y_{t-1}). \tag{2.12}$$

Transform from $y_1, \ldots, y_n$ to $v_1, \ldots, v_n$. Since each $v_t$ equals $y_t$ minus a linear function of $y_1, \ldots, y_{t-1}$ for $t = 2, \ldots, n$, the Jacobian is one. From (2.12), on making the substitution we have

$$p(v_1, \ldots, v_n) = \prod_{t=1}^{n} p(v_t), \tag{2.13}$$

since $p(v_1) = p(y_1)$ and $p(v_t) = p(y_t \mid Y_{t-1})$ for $t = 2, \ldots, n$. Consequently, the $v_t$'s are independently distributed.

We next show that the forecast errors $v_t$ are effectively obtained from a Cholesky decomposition of the observation vector $y$. The Kalman filter recursions compute the forecast error $v_t$ as a linear function of the initial mean $a_1$ and the observations $y_1, \ldots, y_t$ since

$$v_1 = y_1 - a_1,$$
$$v_2 = y_2 - a_1 - K_1(y_1 - a_1),$$
$$v_3 = y_3 - a_1 - K_2(y_2 - a_1) - K_1(1 - K_2)(y_1 - a_1),$$

and so on. It should be noted that $K_t$ does not depend on the initial mean $a_1$ or the observations $y_1, \ldots, y_n$; it depends only on the initial state variance $P_1$ and the disturbance variances $\sigma_\varepsilon^2$ and $\sigma_\eta^2$. Using the definitions in (2.4), we have

$$v = C(y - 1 a_1), \qquad \text{with} \quad v = \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix}, \tag{2.14}$$

where $C$ is the lower triangular matrix

$$C = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ c_{21} & 1 & 0 & \cdots & 0 \\ c_{31} & c_{32} & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ c_{n1} & c_{n2} & c_{n3} & \cdots & 1 \end{bmatrix}, \qquad c_{ij} = -(1 - K_{i-1})(1 - K_{i-2}) \cdots (1 - K_{j+1}) K_j,$$

for $i = 2, \ldots, n$ and $j = 1, \ldots, i-1$, the product being taken as one when $j = i-1$. The distribution of $v$ is therefore

$$v \sim N(0, C \Omega C'), \tag{2.15}$$


where $\Omega = \mathrm{Var}(y)$ as given by (2.4). On the other hand, we know from (2.11) and (2.13) that $E(v_t) = 0$, $\mathrm{Var}(v_t) = F_t$ and $\mathrm{Cov}(v_t, v_j) = 0$ for $t, j = 1, \ldots, n$ and $t \neq j$; therefore,

$$v \sim N(0, F), \qquad \text{with} \quad F = \begin{bmatrix} F_1 & 0 & 0 & \cdots & 0 \\ 0 & F_2 & 0 & \cdots & 0 \\ 0 & 0 & F_3 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & F_n \end{bmatrix},$$

where $C \Omega C' = F$. The transformation of a symmetric positive definite matrix (say $\Omega$) into a diagonal matrix (say $F$) using a lower triangular matrix (say $C$) by means of the relation $C \Omega C' = F$ is known as the Cholesky decomposition of the symmetric matrix. The Kalman filter can therefore be regarded as essentially a Cholesky decomposition of the variance matrix implied by the local level model (2.3). This result is important for understanding the role of the Kalman filter and it will be used further in §§2.5.4 and 2.10.1. Note also that $F^{-1} = (C')^{-1} \Omega^{-1} C^{-1}$ so we have $\Omega^{-1} = C' F^{-1} C$.
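The identities $v = C(y - 1a_1)$ and $C\Omega C' = F$ can be verified numerically by combining the earlier sketches: build $\Omega$ from (2.5), build $C$ from the gains $K_t$ using the expression for $c_{ij}$ below (2.14), and check that $C\Omega C'$ is diagonal with entries $F_t$. This is our own illustrative check, reusing the hypothetical helper functions defined in the sketches above.

```python
import numpy as np

def build_C(K):
    """Lower triangular C of (2.14): c_ij = -K_j (1-K_{j+1})...(1-K_{i-1})."""
    n = len(K)
    C = np.eye(n)
    for i in range(1, n):
        prod = 1.0
        for j in range(i - 1, -1, -1):   # fill row i from the diagonal outwards
            C[i, j] = -K[j] * prod
            prod *= 1.0 - K[j]
    return C

n, s2e, s2h, a1, P1 = 50, 1.0, 0.1, 0.0, 1.0
y, _ = simulate_local_level(n, s2e, s2h, a1, P1)
v, F, a, P = kalman_filter_local_level(y, s2e, s2h, a1, P1)
C = build_C(P[:-1] / F)                  # gains K_t = P_t / F_t
Omega = build_Omega(n, s2e, s2h, P1)
assert np.allclose(C @ Omega @ C.T, np.diag(F))   # C Omega C' = F
assert np.allclose(C @ (y - a1), v)               # v = C(y - 1 a_1)
```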

2.3.2 ERROR RECURSIONS

Define the state estimation error as

$$x_t = \alpha_t - a_t, \qquad \text{with} \quad \mathrm{Var}(x_t) = P_t. \tag{2.16}$$

We now show that the state estimation errors $x_t$ and forecast errors $v_t$ are linear functions of the initial state error $x_1$ and the disturbances $\varepsilon_t$ and $\eta_t$, analogously to the way that $\alpha_t$ and $y_t$ are linear functions of the initial state and the disturbances, for $t = 1, \ldots, n$. It follows directly from the Kalman filter relations (2.11) that

$$v_t = y_t - a_t = \alpha_t + \varepsilon_t - a_t = x_t + \varepsilon_t,$$

and

$$x_{t+1} = \alpha_{t+1} - a_{t+1} = \alpha_t + \eta_t - a_t - K_t v_t = x_t + \eta_t - K_t(x_t + \varepsilon_t) = L_t x_t + \eta_t - K_t \varepsilon_t,$$

where

$$L_t = 1 - K_t = \sigma_\varepsilon^2 / F_t. \tag{2.17}$$

Thus, analogously to the local level model relations

$$y_t = \alpha_t + \varepsilon_t, \qquad \alpha_{t+1} = \alpha_t + \eta_t,$$

we have the error relations

$$v_t = x_t + \varepsilon_t, \qquad x_{t+1} = L_t x_t + \eta_t - K_t \varepsilon_t, \qquad t = 1, \ldots, n, \tag{2.18}$$


with $x_1 = \alpha_1 - a_1$. These relations will be used in the next section. We note that $P_t$, $F_t$, $K_t$ and $L_t$ do not depend on the initial state mean $a_1$ or the observations $y_1, \ldots, y_n$ but only on the initial state variance $P_1$ and the disturbance variances $\sigma_\varepsilon^2$ and $\sigma_\eta^2$. We note also that the recursion for $P_{t+1}$ in (2.11) can alternatively be derived by

$$P_{t+1} = \mathrm{Var}(x_{t+1}) = \mathrm{Cov}(x_{t+1}, \alpha_{t+1}) = \mathrm{Cov}(x_{t+1}, \alpha_t + \eta_t) = L_t \mathrm{Cov}(x_t, \alpha_t + \eta_t) + \mathrm{Cov}(\eta_t, \alpha_t + \eta_t) - K_t \mathrm{Cov}(\varepsilon_t, \alpha_t + \eta_t) = L_t P_t + \sigma_\eta^2.$$
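The error relations (2.18) and the interpretation $P_t = \mathrm{Var}(x_t)$ can be checked by simulation: across many replications of the model, the sample variance of $x_t = \alpha_t - a_t$ should match the filter's $P_t$, which is the same for every replication since it does not depend on the data. A minimal Monte Carlo sketch (ours, reusing the earlier hypothetical helpers):

```python
import numpy as np

reps, n = 5000, 40
s2e, s2h, a1, P1 = 1.0, 0.1, 0.0, 1.0
x_n = np.empty(reps)
for r in range(reps):
    y, alpha = simulate_local_level(n, s2e, s2h, a1, P1, seed=r)
    v, F, a, P = kalman_filter_local_level(y, s2e, s2h, a1, P1)
    x_n[r] = alpha[-1] - a[n - 1]      # x_n = alpha_n - a_n
print(np.var(x_n), P[n - 1])           # sample variance should be close to P_n
```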

2.4 State smoothing

We now consider the estimation of $\alpha_1, \ldots, \alpha_n$ given the entire sample $Y_n$. We shall find it convenient to use in place of the collective symbol $Y_n$ its representation as the vector $y = (y_1, \ldots, y_n)'$ defined in (2.4). Since all distributions are normal, the conditional density of $\alpha_t$ given $y$ is $N(\hat{\alpha}_t, V_t)$ where $\hat{\alpha}_t = E(\alpha_t \mid y)$ and $V_t = \mathrm{Var}(\alpha_t \mid y)$. We call $\hat{\alpha}_t$ the smoothed state, $V_t$ the smoothed state variance and the operation of calculating $\hat{\alpha}_1, \ldots, \hat{\alpha}_n$ state smoothing.

2.4.1 SMOOTHED STATE

The forecast errors $v_1, \ldots, v_n$ are mutually independent with zero means and are a linear transformation of $y_1, \ldots, y_n$, and $v_t, \ldots, v_n$ are independent of $y_1, \ldots, y_{t-1}$. Moreover, when $y_1, \ldots, y_n$ are fixed, $Y_{t-1}$ and $v_t, \ldots, v_n$ are fixed, and vice versa. By (2.49) of the regression lemma in §2.13 we therefore have

$$\hat{\alpha}_t = E(\alpha_t \mid y) = E(\alpha_t \mid Y_{t-1}, v_t, \ldots, v_n) = E(\alpha_t \mid Y_{t-1}) + \mathrm{Cov}[\alpha_t, (v_t, \ldots, v_n)'] \, \mathrm{Var}[(v_t, \ldots, v_n)']^{-1} (v_t, \ldots, v_n)' = a_t + \sum_{j=t}^{n} \mathrm{Cov}(\alpha_t, v_j) F_j^{-1} v_j, \tag{2.19}$$

since $\mathrm{Var}[(v_t, \ldots, v_n)'] = \mathrm{diag}(F_t, \ldots, F_n)$. Now $\mathrm{Cov}(\alpha_t, v_j) = \mathrm{Cov}(x_t, v_j)$ for $j = t, \ldots, n$, and

$$\mathrm{Cov}(x_t, v_t) = E[x_t(x_t + \varepsilon_t)] = \mathrm{Var}(x_t) = P_t,$$
$$\mathrm{Cov}(x_t, v_{t+1}) = E[x_t(x_{t+1} + \varepsilon_{t+1})] = E[x_t(L_t x_t + \eta_t - K_t \varepsilon_t)] = P_t L_t.$$

Similarly,

$$\mathrm{Cov}(x_t, v_{t+2}) = P_t L_t L_{t+1}, \quad \ldots, \quad \mathrm{Cov}(x_t, v_n) = P_t L_t L_{t+1} \cdots L_{n-1}. \tag{2.20}$$


Substituting in (2.19) gives

$$\hat{\alpha}_t = a_t + P_t \frac{v_t}{F_t} + P_t L_t \frac{v_{t+1}}{F_{t+1}} + P_t L_t L_{t+1} \frac{v_{t+2}}{F_{t+2}} + \cdots + P_t L_t L_{t+1} \cdots L_{n-1} \frac{v_n}{F_n} = a_t + P_t r_{t-1},$$

where

$$r_{t-1} = \frac{v_t}{F_t} + L_t \frac{v_{t+1}}{F_{t+1}} + L_t L_{t+1} \frac{v_{t+2}}{F_{t+2}} + \cdots + L_t L_{t+1} \cdots L_{n-1} \frac{v_n}{F_n} \tag{2.21}$$

is a weighted sum of innovations after $t - 1$. The value of this at time $t$ is

$$r_t = \frac{v_{t+1}}{F_{t+1}} + L_{t+1} \frac{v_{t+2}}{F_{t+2}} + L_{t+1} L_{t+2} \frac{v_{t+3}}{F_{t+3}} + \cdots + L_{t+1} L_{t+2} \cdots L_{n-1} \frac{v_n}{F_n}. \tag{2.22}$$

Obviously, $r_n = 0$ since no observations are available after time $n$. By substituting from (2.22) into (2.21), it follows that the values of $r_{t-1}$ can be evaluated using the backwards recursion

$$r_{t-1} = \frac{v_t}{F_t} + L_t r_t, \tag{2.23}$$

with $r_n = 0$, for $t = n, n-1, \ldots, 1$. The smoothed state can therefore be calculated by the backwards recursion

$$r_{t-1} = F_t^{-1} v_t + L_t r_t, \qquad \hat{\alpha}_t = a_t + P_t r_{t-1}, \qquad t = n, \ldots, 1, \tag{2.24}$$

with $r_n = 0$. The relations in (2.24) are collectively called the state smoothing recursion.

2.4.2 SMOOTHED STATE VARIANCE The error variance of the smoothed state, Vt ~ Var(ar|F„). is derived in a similar way. Using the properties of the innovations and the regression lemma in §2.13 with x = at, y = (y\,..., yt-\)' and z — (ty, • • •» vn)' in (2.50), we have

Vt - Var(af|y) - Vw{at\YtUvt, ...,vn) = Var(«f |Yt -0 - Covfa„ (i>„ . . . , vn)'] Var[(u„. . . , vn)'Vl

xCovfoTi, (vt,. . ., Vn)J

= p t -

( Cov(af, vt) \ '

\ Coy (a,, u„)

Ft 0

0 Fn

i=t (2.25)

Page 35: Time Series Analysis by State Space Methods by Durbin and Koopman

18 LOCAL LEVEL MODEL,

where the expressions for Cov(ar, Vj) are given by (2.20). Substituting these into (2.25) leads to

V ~ P - P2— P2J2 — P2I212 —! p2j2j2 . . . r 2 J_ Ft Ft+1 ft+ 2 Fn

= Pt- P /M-i , (2.26)

where

Nt-i = + L}-^- + + L2tL2

t+lL2t+2-±- + • • • Ft *t+1 Ft+2 Ft+3

+ L2tL2

t+l--L2n_ * (2.27)

Fn

is a weighted sum of the inverse variances of innovations after time t — 1. Its value at time t is

1 2 ^ 2 2 ^ 2 2 2 1 N, = — |- //f+|— h Lt+lLt+2— ) h Lt+lLt+2 Ft+1 Tf+2

(2.28)

and, obviously, — 0 since no variances are available after time n. Substituting from (2.28) into (2.27) it follows that the value for Nt-i can be calculated using the backwards recursion

N,_1 = - I + Z.?iVt, (2.29) Ft

with Nn- 0, for t = n,n - 1 , . . . , 1. We see from (2.22) and (2.28) that Nt = Var(rt) since the forecast errors i>, are independent.

Combining these results, the error variance of the smoothed state can be cal-culated by the backwards recursion

= f " 1 + Vt = P, - P?Nt-U t = 1, (2.30)

with Nn — 0. The relations in (2.30) are collectively called the state variance smoothing recursion. From the standard error ^/V, of &t we can construct confidence intervals for a, for t — 1 , . . . , n. It is also possible to derive the smoothed covariances between the states, that is, Cov(a,, a,|)>), / / st using similar arguments. We shall not give them here but will derive them for the general case in §4.5.

2.4.3 ILLUSTRATION

We now show the results of state smoothing for the Nile data of §2.2.2 using the same local level model. The Kalman filter is applied first and the output vt, Ft, at and P, is stored for t — 1 , . . . , n. Figure 2.2 presents the output of the backwards smoothing recursions (2.24) and (2.30); that is rt, Nt, a, and Vt. The plot of at includes the confidence bands for at. The graph of Var(or,|y) shows that the

Page 36: Time Series Analysis by State Space Methods by Durbin and Koopman

2.5. DISTURBANCE SMOOTHING 19

CO 4000-

1250 - . , j S\ ; .t-ik 1 An

1000-3500

500

750

2500

3000

1880 1900 1920 1940 1960 1880 1900 1920 1940 I960

.02

0

Cm)

4 1880 1900 1920 1940 1960 1880 1900 1920 1940 1960

Fig. 2.2. Nile data and output of state smoothing recursion: (i) smoothed state a, and its 90% confidence intervals; (ii) smoothed state variance Vf; (iii) smoothing cumulant rt \ (iv) smoothing variance cumulant Nt.

conditional variance of a t is larger at the beginning and end of the sample, as it obviously should be on intuitive grounds. Comparing the graphs of at and at in Figures 2.1 and 2.2, we see that the graph of 6tt is much smoother than that of at, except at time points close to the end of the series, as it should be.

2.5 Disturbance smoothing

In this section we consider the calculation of the smoothed observation disturbance et — E(et |_y) — yt — &t and the smoothed state disturbance fjt — E(r}t\y)— a t+1 — a t together with their error variances. Of course, these could be calcu-lated directly from a knowledge of a . \ , . . . , an and covariances Cov(a,, a j for j <t. However, it turns out to be computationally advantageous to compute them from rt and Nt without first calculating at, particularly for the general model discussed in Chapter 4. The merits of smoothed disturbances are discussed in §4.4. For example, the estimates e, and f}t are useful for detecting outliers and structural breaks, respectively; see §2.12.2.

In order to economise on the amount of algebra in this chapter we shall present the required recursions for the local level model without proof, referring the reader to §4.4 for derivations of the analogous recursions for the general model.

Page 37: Time Series Analysis by State Space Methods by Durbin and Koopman

20 LOCAL LEVEL MODEL,

2.5.1 SMOOTHED OBSERVATION DISTURBANCES

From (4.36) in §4.4.1, the smoothed observation disturbance et — Efely) is calculated by

s, = o]ut, t = n,...,l, (2.31)

where

ut - F~lvt - Ktrt, (2.32)

and where the recursion for rt is given by (2.23). The scalar ut is referred to as the smoothing error. Similarly, from (4.44) in §4.4.3, the smoothed variance Var(e? | y) is obtained by

Var(e,|y) = a2 - a*Dt> f = * , . . . , 1, (2.33)

where

A - F~l + K?Nt, (2.34)

and where the recursion for Nt is given by (2.29). Since from (2.22) vt is indepen-dent of rt, and Var(r,) — Nt> we have

Var(«,) - Var(F~xvt - Ktrt) - F~2 Var(ur) + K2Var(r() = Dt.

Consequently, from (2.31) we obtain Var(e,) ~<y^Dt. Note that the methods for calculating a, and et are consistent since Kf —

PtF~\ Lt — \ — Kt — a2F~l and

et — yt - &t

= yt-<k- ptrt-1 = vt- Pt(F~1vt + L{rt) = F-'vtiFt-P^^o^PtF^r, - o2(F~lv( - Ktrt), 1.

Similar equivalences can be shown for Vt and Var(er jy).

2.5.2 SMOOTHED STATE DISTURBANCES

From (4.41) in §4.4.1, the smoothed mean of the disturbance r), — |y) is calculated by

Vt=cr2rt, f 1, (2.35)

where the recursion for rt is given by (2.23). Similarly, from (4.47) in §4.4.3, the smoothed variance Var(?^ jy) is computed by

Varfo,\y) = a2 - o*Nt, / = I, (2.36)

Page 38: Time Series Analysis by State Space Methods by Durbin and Koopman

2.5. DISTURBANCE SMOOTHING 21

where the recursion for N( is given by (2.29). Since Var(rf) = Nt, we have that Var(fjt) = o*Nt. These results are interesting because they give an interpretation to the values r, and Nt~, they are the scaled smoothed estimator of r}( — a i + 1 — at and its unconditional variance, respectively.

The method of calculating fjt is consistent with the definition r)t ~ a i + i — at

since

fjt = &t+i - &t

= at+1 4- Pt+ir, - a t - Ptrt_i = at + Ktvt-at + PtLtrt + tfr, - Pt(F~lvt + Ltrt)

=

Similar consistencies can be shown for N, and Var(r/, |y).

2.5.3 ILLUSTRATION

The smoothed disturbances and their related variances for the Nile data and the local level model of §2.2.2 are calculated by the above recursions and presented in Figure 2.3. We note from the graphs of Var(£f ) and Var(^ | y) the extent that these conditional variances are larger at the beginning and end of the sample. Obviously, the plot of rt in Figure 2.2 and the plot of r)t in Figure 2.3 are the same apart from a different scale.

Fig. 2.3. Output of disturbance smoothing recursion: (i) observation error st; (ii) obser-vation error variance Var(£, [y); (iii) state error f},\ (iv) state error variance Var(ijf |>>).

Page 39: Time Series Analysis by State Space Methods by Durbin and Koopman

22 LOCAL LEVEL MODEL,

2.5.4 CHOLESKY DECOMPOSITION AND SMOOTHING

We now consider the calculation of £t ~ E(£,|y) by direct regression of s — (êi,..., en)' on the observation vector y to obtain s — , . . . , ên)', that is,

e - E(£) + Cov(£, y)Var(y)1[y - E(y)] - Cov(£, y)Qr\y - 1 ax).

It is obvious from (2.6) that Cov(e, y) — cr2/„; also, from the Cholesky decom-position considered in §2.3.1 we have — C'F~lC and C(y — la\) — v. We therefore have

è — OçC' F~lv,

which, by consulting the definitions of the lower triangular elements of C in (2.14), also leads to the disturbance equations (2.31) and (2.32). Thus

ê — a u =

where

u^C'F^v with v ~ C(y — la*)*

It follows that

u = C'F~lC{y - lai) - S2~l(y - lfli), (2.37)

where Q — Var(y) and F — CQC', as is consistent with standard regression theory.

2.6 Simulation

It is simple to draw samples generated by the local level model (2.3). We first draw the random normal deviates

e? ~ N(0, erf), ^ - N(0, rf), * = 1 , . . . , n.

Then we generate observations using the local level recursion as follows

yf(-> ^ a,0 + a,+1 = <Xt} + t = 1 , . . . , n,

for some starting value a ^ . For certain applications, which will be mainly discussed in Part II of this

book, we may require samples generated by the local level model conditional on the observed time series y i , . . . , yn. Such samples can be obtained by use of the simulation smoother developed for the general linear Gaussian state space model by de Jong and Shephard (1995) which we derive in §4.7. For the local level model, a simulated sample for the disturbances £t, t — 1,..., n, given the observations

Page 40: Time Series Analysis by State Space Methods by Durbin and Koopman

2.7. MISSING OBSERVATIONS 23

^ vrt can be obtained using (4.77) by the backwards recursion

et = dt+cr?(vt/Ft-Ktrt), ft-\ — vt/Ft — Wtdf/Ct + Ltrt,

where d, ~ N(0, C{) with

Ct=o*-a*(l/Ft + K?Nt), Wt = cre

2(l/Ft - KtNtLt), JVf_! - 1 /Ft + W?/Ct + L*Nti

for t = n , . . . , 1 with r„ — Nn — 0. The quantities vt, Ft, Kt and Lt — 1 — Kt are obtained from the Kalman filter and they need to be stored for t — 1 , . . . , n. We note that the recursions for rt and Nt are similar to the recursions for rt and Nt given by (2.23) and (2.29); in fact, they are equivalent if Wt — 0 for t — 1 , . . . , n.

Given a sample for et, t — 1 , . . . , n, we obtain simulated samples for at and Tjt via the relations

at=yt—et, t = 1,..., n, Vt — «i+i f = 1, ...,n - 1.

2.6.1 ILLUSTRATION

To illustrate the difference between simulating a sample from the local level model unconditionally and simulating a sample conditional on the observations, we consider the Nile data and the local level model of §2.2.2. In Figure 2.4 (i) we present the smoothed state a t and a sample generated by the local level model unconditionally. The two series have seemingly nothing in common. In the next panel, again the smoothed state is presented but now together with a sample gen-erated conditional on the observations. Here we see that the generated sample is much closer to a.t. The remaining two panels present the smoothed disturbances together with a sample from the corresponding disturbances conditional on the observations.

2.7 Missing observations

A considerable advantage of the state space approach is the ease with which missing observations can be dealt with. Suppose we have a local level model where observations yj, with j — r , . . . , r* — 1, are missing for 1 < r < x* < n. For the filtering stage, the most obvious way to deal with the situation is to define a new series y* where y* = yt for t — 1 , . . . , r — 1 and y* — yi+T*-T for t ~ x,... ,n* with n* — n — (t* — x). The model for y* with time scale t — 1 , . . . , n* is then the same as (2.3) with yt = y* except that a r — + where /?t_i ~ N[0, (r* — x)cr2]. Filtering for this model can be treated by the methods developed in Chapter 4 for the general state space model. The treatment is readily extended if more than one group of observations is missing.

Page 41: Time Series Analysis by State Space Methods by Durbin and Koopman

24 LOCAL LEVEL MODEL,

- 2 0 0

-400

1880 1900 1920 1940 1960 1880 1900 1920 1940 1960

200

- 2 0 0

-400 b

(iv)

• * ( ( • • • •

• Smooth slate eu I CoDd.sample

1880 1900 1920 1940 1960 1880 1900 1920 1940 1960

Fig. 2.4. Simulation: (i) smoothed state at and sample a^; (ii) smoothed state a, and sample a t; (iii) smoothed observation error and sample et; (iv) smoothed state error f}t and sample ijt.

It is, however, easier and more transparent to proceed as follows, using the original time domain. For filtering at times t — x,..., r* — 1, we have

E(a,|y,_i) - E(a,|y t_i) - E V j = T '

— aT

and

Var(a(|y,_i) - Y m f o ^ . , ) - Var(aT + ^ J ^ - i ) - Px + (/ - r W 2 .

giving

ö(+i = ax, t+i Pt+< t = r , r * - 1, (2.38)

the remaining values at and Pt being given as before by (2.11) for t — 1 , . . . , x and t — x*,... ,n. Hie consequence is that we can use the original filter (2.11) for all t by taking vt = 0 and Kt — 0 at the missing time points. The same procedure is used when more than one group of observations is missing. It follows that allowing for missing observations when using the Kalman filter is extremely simple.

The forecast error recursions from which we derive the smoothing recursions are given by (2.18). These error-updating equations at the missing time points

Page 42: Time Series Analysis by State Space Methods by Durbin and Koopman

2.8. FORECASTING 25

become

vt — xt 4 et, xt+i — x( 4- rjt, t — t , . . . , r* — 1,

since K, — 0 and therefore Lt — 1. The covariances between the state at the missing time points and the innovations after the missing period are given by

Co \ (a t ,Vr*)= Pt, Cov(ar, Vj) — PtLr*LT*+i.. j ~ r* 4 1 , . . . , n, t — t,..., r* — 1.

By deleting the terms associated with the missing time points, the state smoothing equation (2.19) for the missing time points becomes

Substituting the covariance terms into this and talcing into account the definition (2.21) leads directly to

r(-i - rt, a( = at 4 Ptrt-u i = r , . . . , r* - 1. (2.39)

Again, the consequence is that we can use the original state smoother (2.24) for all t by taking v{ — 0 and Kt — 0, and hence Lt — 1, at the missing time points. This device applies to any missing observation within the sample period. In the same way the equations for the variance of the state error and the smoothed disturbances can be obtained by putting vt — 0 and Kt = 0 at missing time points.

2.7.1 ILLUSTRATION

Here we consider the Nile data and the same local level model as before; however, we treat the observations at time points 2 1 , . . . , 40 and 61 , . . . , 80 as missing. The Kalman filter is applied first and the output vt, Ft, at and Pt is stored for t — 1,..., n. Then, the state smoothing recursions are applied. The first two graphs in Figure 2.5 are the Kalman filter values of at and Pt, respectively. The last two graphs are the smoothing output at and Vt, respectively.

Note that the application of the Kalman filter to missing observations can be regarded as extrapolation of the series to the missing time points, while smoothing at these points is effectively interpolation.

2.8 Forecasting

Let yn+J be the minimum mean square error forecast of y n + j given the time series yu...,yn for j — 1 , 2 , . . . , . / with J as some pre-defined positive integer. By minimum mean square error forecast here we mean the function yn+j of _vi, . . . , yn

which minimises E[(y„+/ — yn+j)2\Yn]. Then y n + j — E(yn+y|y„). This follows immediately from the well-known result that if x is a random variable with mean

n

Page 43: Time Series Analysis by State Space Methods by Durbin and Koopman

26 LOCAL LEVEL MODEL,

Fig. 2.5. Filtering and smoothing output when observations are missing: (i) filtered state at (extrapolation); (ii) filtered state variance P,; (iii) smoothed state at (interpolation); (iv) smoothed state variance Vt.

fji the value of X that minimises E(x — A.)2 is X — yu. The variance of the forecast error is denoted by F n + j — Var(_y„+;-|7n). The theory of forecasting for the local level model turns out to be surprisingly simple; we merely regard forecasting as filtering the observations yi, ...,yn, y„ + i , . . . , yn+j using the recursion (2.11) and treating the last J observations , . . . , yM+j as missing.

Letting an+j — E(a/!+7-|y/!) and Pn+j = Var(a„+y j7n), it follows immediately from equation (2.38) with r — n + 1 and r* — n + J in §2.7 that

Gn+j+l &n+j> Pn+j-1-1 = Pn+j + j = I , > . . , J — I ,

with dn+i —an+i and Pn+i = Pn+i obtained from the Kalman filter (2.11). Furthermore, we have

yn+j = E(yn+j\Yn) = E(an+ji|y„) + E(£/1+j,-|7n) = an+j,

Pn+j = Var(yn+j\Yn) = Var(an+7-1Yn) + Var(en+J|Yn) = Pn+J + of,

for j — 1 , . . . , J. The consequence is that the Kalman filter can be applied for t ~ 1 , . . . ,n + J where we treat the observations at times n + I,... ,n + J as missing. Thus we conclude that forecasts and their error variances are delivered by applying the Kalman filter in a routine way with vt = 0 and Kt — 0 for

Page 44: Time Series Analysis by State Space Methods by Durbin and Koopman

2.9. INITIALISATION 27

Fig. 2.6. Nile data and output of forecasting: (i) state forecast a, and 50% confidence intervals; (ii) state variance P,; (iii) observation forecast E(j>, | yf_x); (iv) observation forecast variance Ft.

t — n + l , . . . , n + / . The same property holds for the general linear Gaussian state space model as we shall show in §4.9.

2.8.1 ILLUSTRATION

The Nile data set is now extended by 30 missing observations allowing the computation of forecasts for the observations yioi , . . . , yi3o- The Kalman filter only is required. The graphs in Figure 2.6 contain %+j\n = an+j\n, Pn+j\n, an+j\n

and Fn+j\n, respectively, for j — I,..., J with J — 30. The confidence interval for E(y„+j |y) is yn+j\„ dt k*/Fn+j\n where k is determined by the required probability of inclusion; in Figure 2.6 this probability is 50%.

2.9 Initialisation

We assumed in previous sections that the distribution of the initial state a.\ is , Pi) where a\ and Pi are known. We now consider how to start up the filter

(2.11) when nothing is known about the distribution of ai , which is the usual situation in practice. In this situation it is reasonable to represent oc\ as having a diffuse prior density, that is, fix a\ at an arbitrary value and let P\ oo. From (2.11) we have

ui = yi - Fi = Pi +

Page 45: Time Series Analysis by State Space Methods by Durbin and Koopman

28 LOCAL LEVEL MODEL,

and, by substituting into the equations for a2 and P2 in (2.11), it follows that

«2 = 01 + P1

P\+Gl

Pi 11-

Pl Pi+O?

;(.yi -ai),

( ' " i ^ W 2 i 2 <7-, + <7. >1 '

(2.40)

(2.41)

Letting Fj -> oo, we obtain a2 — yi, P2 = °p2 + arf and we can then proceed

normally with the Kalman filter (2.11) for t — 2 , . . . , n. This process is called diffuse initialisation of the Kalman filter and the resulting filter is called the diffuse Kalman filter. We note the interesting fact that the same values of at and Pt for t = 2 , . . . , n can be obtained by treating y\ as fixed and taking ol\ ~ N(}>i, CT2). This is intuitively reasonable in the absence of information about the marginal distribution of a\ since (yi — a{) ~ N(0, a2).

We also need to take account of the diffuse distribution of the initial state a\ in the smoothing recursions. It is shown above that the filtering equations for t — 2 , . . . , n are not affected by letting Pi -> oo. Therefore, the state and disturbance smoothing equations are also not affected for t —n,..., 2 since these only depend on the Kalman filter output. From (2.24), the smoothed mean of the state «i is given by

ai = a\ + P\

— ai +

1 Pi +OT(

P\

•wt+ 1 0 Pi +*e2)

V\ + r2 £

Letting P\ have

P y + o j 1 ' Pj + a;

oo, we obtain ai — tij + t'i + ofr\ and by substituting for t>i we

&i = yi +<^1-1.

The smoothed conditional variance of the state oti given y is, from (2.30)

1 ( . Pi Vi - Pi - P, P i + a 2 V P\+°s)

Letting Pi oo, we obtain V\ = a 2 — .

Page 46: Time Series Analysis by State Space Methods by Durbin and Koopman

2.9. INITIALISATION 29

The smoothed means of the disturbances for t — 1 are given by

2 1 P \ si = a e u u with Mj - 2Vi - —^ri, Py + a I Pi + <Tg

and fji = . Letting Py oo, we obtain — —cr2ri. Note that ry depends on the Kalman filter output for t = 2 , . . . , n. The smoothed variances of the disturbances for t = 1 depend on D\ and Ny of which only Dy is affected by Pi oo; using (2.34),

2

MW) Letting Pi oo, we obtain Di — Ny and therefore Var(êi) = <J*Ni. The variance

1 Dy = +

Pi + <j2

of the smoothed estimate of ï)y remains unaltered as Var(i/i) = cr, TV The initial smoothed state ây under diffuse conditions can also be obtained by

assuming that yi is fixed and «j = yy — sy where si ~ N(0, tr2). For example, for the smoothed mean of the state at t — 1, we have now only n — 1 varying y,'s so that

&i — ay + '—V j= 2

Cov(ai, Uy) v / F; 1

with ay — yy. It follows from (2.40) that a2 = ay — yy. Further, v2 — y2—a2 — a% + e2 — yy — orj + rjy + s2 - yy ~ —£y +rj i + £2- Consequently, Cov(«i, v2) — Cov(—£i, Si + rjy + s2) — <x2. We therefore have from (2.19),

a 2 , (l-^2)o-g2 , (1 - K2)(l - K3)o2

ai=ai+ —v2 + -v3 H H t2 rj 14

= yi + a ; r u

as before with ry as defined in (2.21) for t = 1. The equations for the remaining a t ' s are the same as previously.

Use of a diffuse prior for initialisation is the approach preferred by most time series analysts in the situation where nothing is known about the initial value «1. However, some workers find the diffuse approach uncongenial because they regard the assumption of an infinite variance as unnatural since all observed time series have finite values. From this point of view an alternative approach is to assume that ay is an unknown constant to be estimated from the data by maximum likelihood. The simplest form of this idea is to estimate ay by maximum likelihood from the first observation yi. Denote this maximum likelihood estimate by ay and its variance by Var(ai). We then initialise the Kalman filter by taking ay ~ ay and Pi — Var(ai). Since when ai is fixed yy ~ N(ai, cr2), we have ay — yy and Varied) = cr2. We therefore initialise the filter by taking ai = yy and Py — cr2.But these are the same values as we obtain by assuming that ay is diffuse. It follows that we obtain the same initialisation of the Kalman filter by representing ay as a random variable with infinite variance as by assuming that it is fixed and unknown

Page 47: Time Series Analysis by State Space Methods by Durbin and Koopman

30 LOCAL LEVEL MODEL,

and estimating it from y^. We shall show that a similar result holds for the general linear Gaussian state space model in §5.7.3.

2.10 Parameter estimation

We now consider the fitting of the local level model to data from the standpoint of classical inference. In effect, this amounts to deriving formulae on the assumption that the parameters are known and then replacing these by their maximum likelihood estimates. Bayesian treatment will be considered for the general linear Gaussian model in Chapter 8. Parameters in state space models are often called hyperparameters, possibly to distinguish them from elements of state vectors which can plausibly be thought of as random parameters; however, in this book we shall just call them parameters, since with the usual meaning of the word parameter this is what they are. We will discuss methods for calculating the loglikelihood function and the maximisation of it with respect to the parameters, of and of.

2.10.1 LOGLIKELIHOOD EVALUATION

Since

p(yu---,yt) = p(Yi-\)p(yt\Yt-\)>

for t — 2,..., n, the joint density of yi,..., yn can be expressed as n

p(y) = l\p(yt\Yt-i)> i=I

where p(y\|Fo) — p(yi)- Now p(yt |yt_0 — Fs) and vt — yt — at so on tak-ing logs and assuming that a\ and P\ are known the loglikelihood is given by

log L log p(y) = - ~ log(2 jt) - i J 2 ( l o S + f ) " (2-42)

The exact loglikelihood can therefore be constructed easily from the Kalman filter (2.11).

Alternatively, let us derive the loglikelihood for the local level model from the representation (2.4). This gives

IogL - ~ log(2jt) - i log |G| - - f l i i y Q _ 1 ( y - f l i i ) ,

which follows from the multivariate normal distribution y ~ N(<zi 1, £2). Since Q = CFC\ \C\ = 1, Q"1 - C'P"1 Candy = C(y - ^1) , it follows that

logl^S = loglCFC'l - log|C||F||C| - l o g i n

and

(y -fliiyfi"1^ - f l j l ) - v'F~lv.

Page 48: Time Series Analysis by State Space Methods by Durbin and Koopman

2.10. PARAMETER ESTIMATION 31

Substitution and using the results log|F| = Yll^i l o S Ft and v'F~lv — £?=i F r l v f l e a d d i r e c t l y t 0 (2.42).

The loglikelihood in the diffuse case is derived as follows. All terms in (2.42) remain finite as Pi --̂ oo with _y fixed except the term for t — 1. It thus seems reasonable to remove the influence of Pi as P\ oo by defining the diffuse loglikelihood as

since Fi /Pi —>• 1 and v2/Fy 0 as Pi oo. Note that v, and Ft remain finite as Pi -> oo for t = 2,..., n.

Since Pi does not depend on a 2 and of, the values of of and o 2 that maximise log L are identical to the values that maximise log L + \ log Pi. As Pi ^ oo, these latter values converge to the values that maximise log L j because first and second derivatives with respect to of and of converge, and second derivatives are finite and strictly negative. It follows that the maximum likelihood estimators of of and of obtained by maximising (2.42) converge to the values obtained by maximising (2.43) as Pi oo.

We estimate the unknown parameters of and of by maximising expression (2.42) or (2.43) numerically according to whether ai and Pi are known or un-known. In practice it is more convenient to maximise numerically with respect to the quantities 1fre — logo2 and ^ — log of. An efficient algorithm for numerical maximisation is implemented in the STAMP 6.0 package of Koopman, Harvey, Doomik and Shephard (2000). This optimisation procedure is based on the quasi-Newton scheme BFGS for which details are given in §7.3.2.

2.10.2 CONCENTRATION OF LOGLIKELIHOOD

It can be advantageous to re-parameterise the model prior to maximisation in order to reduce the dimensionality of the numerical search. For example, for the local level model we can put q — o f /o f to obtain the model

H T O - i g ^ + f) (2.43)

y, = at + st, st ~ N(0, of) ,

at+1 = at + r)t, i)t ~ N(0, qo2),

and estimate the pair of , q in preference to of, of. Put P* — Ptja2 and F* — Ft/o2; from (2.11) and §2.9, the diffuse Kalman filter for the re-parameterised

Page 49: Time Series Analysis by State Space Methods by Durbin and Koopman

32 LOCAL LEVEL MODEL,

Table 2.1. Estimation of parameters of local level model by maximum likelihood.

Iteration q Score Log-likelihood 0 1 0 -3.32 -495.68 1 0.0360 —3.32 0.93 -492.53 2 0.0745 -2.60 0.25 -492.10 3 0.0974 -2.32 -0.001 -492.07 4 0.0973 -2.33 0.0 -492.07

local level model is then vt = yt-at, = +1,

at+1 - a, + Ktvt, P*+l = Pt*( 1 - Kt) + q,

where Kt — P*/F* — Pt/Ft fori = 2 , . . . , n and it is initialised with a2 — y\ and p* = 1 + q. Note that F* depends on q but not on a 2 . The loglikelihood (2.43) then becomes

log Ld = ~ l log(27T) - ~ log of - i £ ( i o S Ft + - ( 2 4 4 )

By maximising (2.44) with respect to of, for given . . . , F*, we obtain

The value of log Ld obtained by substituting of for erf in (2.44) is called the concentrated diffuse loglikelihood and is denoted by log giving

log Ldc = log(2jr) - ^ - ^ log <rf - i I > g F , * . (2.46)

This is maximised with respect to q by a one-dimensional numerical search.

2.10.3 ILLUSTRATION

The estimates of the variances of and of = 4 of for the Nile data are obtained by maximising the concentrated diffuse loglikelihood (2.46) with respect to xfr where q — exp(^). In Table 2.1 the iterations of the BFGS procedure are reported starting with \jf — 0. The relative percentage change of the loglikelihood goes down very rapidly and convergence is achieved after 4 iterations. The final estimate for ^ is —2.33 and hence the estimate of q is q — 0.097. The estimate of of given by (2.45) is 15099 which implies that the estimate of of is of = qo2 — 0.097 x 15099 - 1469.1.

2.11 Steady state

We now consider whether the Kahnan filter (2.11) converges to a steady state as n—> 00. This will be the case if P, converges to a positive value, P say. Obviously,

Page 50: Time Series Analysis by State Space Methods by Durbin and Koopman

2.12. DIAGNOSTIC CHECKING 33

we would then have Ft P + of and K, P j{P + a2). To check whether there is a steady state, put Pt+\ — Pt = P in (2.11) and verify whether the resulting equation in P has a positive solution. The equation is

P = p ( l - - P _ j+cr„2, V P + a V n

which reduces to the quadratic

x2-xh-h=0> (2.47)

where x = P jo2 and h — o f / o f , with the solution

x = (A + V ^ + 4 A ) / 2 .

This is positive when h > 0 which holds for non-trivial models. The other solution to (2.47) is inapplicable since it is negative for h > 0. Thus all non-trivial local level models have a steady state solution.

The practical advantage of knowing that a model has a steady state solution is that, after convergence of Pt to P has been verified to be close enough, we can stop computing Ft and Kt and the filter (2.11) reduces to the single relation

1 — d( -{- Ki)t,

with K = P/{P + a2) and vt = yt — at. While this has little consequence for the simple local level model we are concerned with, it is a useful property for the more complicated models we shall consider in Chapter 4, where Pt can be a large matrix.

2.12 Diagnostic checking

2.12.1 DIAGNOSTIC TESTS FOR FORECAST ERRORS

The assumptions underlying the local level model are that the disturbances st and r)t are normally distributed and serially independent with constant variances. On these assumptions the standardised one-step forecast errors

et = ~=, t-1,...,«, (2.48) V Ft

(or for t = 2 , . . . , n in the diffuse case) are also normally distributed and serially independent with unit variance. We can check that these properties hold by means of the following large-sample diagnostic tests:

• Normality The first four moments of the standardised forecast errors are given by

m\ 1 "

= i y > i - m i ) * , q = 2,3,4, tt r <

Page 51: Time Series Analysis by State Space Methods by Durbin and Koopman

34 LOCAL LEVEL MODEL,

with obvious modifications in the diffuse case. Skewness and kurtosis are denoted by 5 and K, respectively, and are defined as

™2

and it can be shown that when the model assumptions are valid they are asymptotically normally distributed as

N

see Bowman and Shenton (1975). Standard statistical tests can be used to check whether the observed values of S and K are consistent with their asymptotic densities. They can also be combined as

3)2

N = n { — + 6 24

which asymptotically has a / 2 distribution with 2 degrees of freedom on the null hypothesis that the normality assumption is valid. The QQ plot is a graphical display of ordered residuals against their theoretical quantiles. The 45 degree line is taken as a reference line (the closer the residual plot to this line, the better the match). Heteroscedasticity A simple test for heteroscedasticity is obtained by comparing the sum of squares of two exclusive subsets of the sample. For example, the statistic

Tn e2

H(h) = 1 , J2-

is /^./»-distributed for some preset positive integer h, under the null hypothesis of homoscedasticity. Here, et is defined in (2.48) and the sum of h squared forecast errors in the denominator starts at t — 2 in the diffuse case. Serial correlation When the local level model holds, the standardised forecast errors are serially uncorrelated as we have shown in §2.3.1. Therefore, the correlogram of the forecast errors should reveal serial correlation insignificant. A standard portmanteau test statistic for serial correlation is based on the Box-Ljung statistic suggested by Ljung and Box (1978). This is given by

* c 2

Q(k) = n(n + 2)V—^ • ' M J = I71

for some preset positive integer k where Cj is the jth correlogram value

1 Ci — (e t — m i ) ( e , _ j - mi).

f=j+1 More details on diagnostic checking will be given in §7.5.

nm2 t=j+l

Page 52: Time Series Analysis by State Space Methods by Durbin and Koopman

2.12. DIAGNOSTIC CHECKING 35

2.12.2 DETECTION OF OUTLIERS AND STRUCTURAL BREAKS

The standardised smoothed residuals are given by

u* = £t/y/Var(£t) = D^ut,

r* - hfy/Vwtfid = NT*rtt f = 1,..., n;

see §2.5 for details on computing the quantities ut, Dt, rt and Nt. Harvey and Koopman (1992) refer to these standardised residuals as auxiliary residuals and they investigate their properties in detail. For example, they show that the auxiliary residuals are autocorrelated and they discuss their autocorrelation function. The auxiliary residuals can be useful in detecting outliers and structural breaks in time series because st and fjt are estimators of st and r]t. An outlier in a series that we postulate as generated by the local level model is indicated by a large (positive or negative) value for st and a break in the level a, is indicated by a large (positive or negative) value for . A discussion of use of auxiliary residuals for the general model will be given in §7.5.

2.12.3 ILLUSTRATION

We consider the fitted local level model for the Nile data as obtained in §2.10.3. A plot of et is given in Figure 2.7 together with the histogram, the QQ plot and the correlogram. These plots are satisfactory and they suggest that the assumptions

. (ii)

1880 1900 1920 1940 1960 1980

(in)

/ / \ K

\

.5

-.5

-4 -3

Ov)

- 2 - 1 0 1

— r j j r j ^ j" L I.

10 15

Fig. 2.7. Diagnostic plots for standardised prediction errors: (i) standardised residual; (ii) histogram plus estimated density; (iii) ordered residuals; (iv) correlogram.

Page 53: Time Series Analysis by State Space Methods by Durbin and Koopman

36 LOCAL LEVEL MODEL,

-2-

1880 1900 1920 1940 1960 1980

OH)

1880 1900 1920 1940 I960 1980

J

-4 -3

(iv)

/ / \

- 2

\ \

Fig. 2.8. Diagnostic plots for auxiliary residuals: (i) observation residual u*; (ii) histogram and estimated density for «*; (iii) state residual r*\ (iv) histogram and estimated density for r*.

underlying the local level model are valid for the Nile data. This is largely confirmed by the following diagnostic test statistics

S = -0.03, K = 0.09, N = 0.05, if (33) = 0.61, Q(9) - 8.84.

The low value for the heteroscedasticity statistic H indicates a degree of heteroscedasticity in the residuals. This is apparent in the plots of «* and r* together with their histograms in Figure 2.8. These diagnostic plots indicate outliers in 1913 and 1918 and a level break in 1899. The plot of the Nile data confirms these findings.

Page 54: Time Series Analysis by State Space Methods by Durbin and Koopman

2.13. APPENDIX: LEMMA IN MULTIVARIATE NORMAL REGRESSION 37

2.13 Appendix: Lemma in multivariate normal regression

We present here a simple lemma in multivariate normal regression theory which we use extensively in the book to derive results in filtering, smoothing and related problems.

Suppose that x, y and z are random vectors of arbitrary orders that are jointly normally distributed with means (j,p and covariance matrices — E[(p — fJ>P)(l ~ /A?)'] f° r p,g ~x,y and z with /xz—0 and £y z — 0. The symbols x, y, z, p and q are employed for convenience and their use here is unrelated to their use in other parts of the book.

Lemma

Var(x|y,z) = Var(*Sy) -(2.49) (2.50)

PROOF. By standard multivariate normal regression theory we have

E(x| y) = }jlx + VxvV71'

Var(x|y) - E, y \ 1 — 1 / ljXy*-'yy ^xy,

see, for example, Anderson (1984, Theorem 2.5.1). Applying (2.51) to vector ( y ) in place of y gives

E(*\y, Z) = tlx + [ ^xy Vxz ] y - l

0

0

- lix + ^Xy^yyl(y - fly) + E^E^Z, - I ,

since fiz — 0 and T*yz — 0. This proves (2.49). Applying (2.52) to vector ( ̂ ) in place of y gives

(2.51)

(2.52)

E" 1 yy

xp — 1 v** ' y1

->XX ^xy y y ^Xy — ̂ XZ^zz ' E:.

E'

since Eyz — 0. This proves (2.50). This simple lemma provides the basis for the treatment of the Kalman filter and smoother in this book.

Page 55: Time Series Analysis by State Space Methods by Durbin and Koopman

3 Linear Gaussian state space models

3.1 Introduction

The general linear Gaussian state space model can be written in a variety of ways; we shall use the form

y, = Z,a, +et, et ~ N(0, Ht\ <*t+i = Ttat + Rtr}t, r\t ~ N(0, Qt), t = 1 , . . . , n,

where yt is a p x 1 vector of observations and at is an unobserved m x 1 vector called the state vector. The idea underlying the model is that the development of the system over time is determined by a t according to the second equation of (3.1), but because a t cannot be observed directly we must base the analysis on observations yt. The first equation of (3.1) is called the observation equation and the second is called the state equation. Hie matrices Zt, Tu Rti Ht and Qt are initially assumed to be known and the error terms st and rjt are assumed to be serially independent and independent of each other at all time points. Matrices Z, and Tt-i are permitted to depend on j i , . . . , yt i- The initial state vector a\ is assumed to be N(ai, P\) independently of £ i , . . . , and . . . ,r)n, where a\ and P\ are first assumed known; we will consider in Chapter 5 how to proceed in the absence of knowledge of a\ and P\. In practice, some or all of the matrices Zt, Ht,Tt, Rt and Qt will depend on elements of an unknown parameter vector i f f , the estimation of which will be considered in Chapter 7.

The first equation of (3.1) has the structure of a linear regression model where the coefficient vector a t varies over time. The second equation represents a first order vector autoregressive model, the Markovian nature of which accounts for many of the elegant properties of the state space model. The local level model (2.3) considered in the last chapter is a simple special case of (3.1). In many applica-tions Rt is the identity. In others, one could define rj* — R,i], and Q* — RtQtR't and proceed without explicit inclusion of Rt, thus making the model look simpler. However, if Rt is m x r with r < m and Qt is nonsingular, there is an obvious advantage in working with nonsingular rjt rather than singular ift. We assume that Rt is a subset of the columns of /,„; in this case Rt is called a selection matrix since it selects the rows of the state equation which have nonzero disturbance terms; however, much of the theory remains valid if Rt is a general m x r matrix.

Page 56: Time Series Analysis by State Space Methods by Durbin and Koopman

3.2. STRUCTURAL TIME SERIES MODELS 39

Model (3.1) provides a powerful tool for the analysis of a wide range of problems. In this chapter we shall give substance to the general theory to be presented in Chapter 4 by describing a number of important applications of the model to problems in time series analysis and in spline smoothing analysis.

3.2 Structural time series models

3.2.1 UNIVARIATE MODELS

A structural time series model is one in which the trend, seasonal and error terms in the basic model (2.1), plus other relevant components, are modelled explicitly. This is in sharp contrast to the philosophy underlying Box-Jenkins ARIMA models where trend and seasonal are removed by differencing prior to detailed analysis. In this section we shall consider structural models for the case where yt is univariate; we shall extend this to the case where yt is multivariate in §3.2.2. A detailed discussion of structural time series models, together with further references, has been given by Harvey (1989).

The local level model considered in Chapter 2 is a simple form of a structural time series model. By adding a slope term vt, which is generated by a random walk, to this we obtain the model

Ulis is called the local linear trend model. If = f, = 0 then vf+i = vt — v, say, and (JLt+i = ßt + v so the trend is exactly linear and (3.2) reduces to the deterministic linear trend plus noise model. The form (3.2) with of > 0 and of > 0 allows the trend level and slope to vary over time.

Applied workers sometimes complain that the series of values of ß t obtained by fitting this model does not look smooth enough to represent their idea of what a trend should look like. This objection can be met by setting of = 0 at the outset and fitting the model under this restriction. Essentially the same effect can be obtained by using in place of the second and third equation of (3.2) the model A2ß t + i — i.e. ßt+i — 2ß t — ßt-1 + Kt where A is the first difference operator defined by Axt -- xt — Jt'e-i. This and its extension Arßt — & for r > 2 have been advocated for modelling trend in state space models in a series of papers by Peter Young and his collaborators under the name integrated random walk models; see, for example, Young, Lane, Ng and Palmer (1991). We see that (3.2) can be written in the form

yt = (it+£t, £t ~ N(0, of) , ßt+i = ßt + v, + ~ N(0, of) , vt+i - -KÎ, Ç, ~N(0 , of).

(3.2)

which is a special case of (3.1).

Page 57: Time Series Analysis by State Space Methods by Durbin and Koopman

40 LINEAR GAUSSIAN STATE SPACE MODELS

To model the seasonal term yt in (2.1), suppose there are s "months' per 'year'. Thus for monthly data s = 12, for quarterly data s — 4 and for daily data, when modelling the weekly pattern, s = 7. If the seasonal pattern is constant over time, the seasonal values for months 1 to s can be modelled by the constants y*,..., y* where Yfj=i Yj = For the y'th 'month' in 'year' i we have yt — y* where t — s(i — 1) + j for i — 1 , 2 , . . . and j — 1 , . . . , s. It follows that Y.%o Yt+i-s = 0 so Yt+i = - E j=1 Yt+i-j With t — s — 1, j , . . . . In practice we often wish to allow the seasonal pattern to change over time. A simple way to achieve this is to add an error term a>t to this relation giving the model

s-l

Yt+1 = - y t + 1 ~ j + ^ ^ N ( 0 ' (3-3>

j=i for t ~ 1 , . . . , n where initialisation at t — 1, ..., s — 1 will be taken care of later by our general treatment of the initialisation question in Chapter 5, An alternative suggested by Harrison and Stevens (1976) is to denote the effect of season j at time t by yj, and then let yj t be generated by the quasi-random walk

Yjj+i = Yjt + o>jt, t = (i-l)s + j, i — 1 , 2 , . . . , j ~ ..., s,

with an adjustment to ensure that each successive set of s seasonal components sums to zero; see Harvey (1989, §2.3.4) for details of the adjustment.

It is often preferable to express the seasonal in a trigonometric form, one version of which, for a constant seasonal, is

yt = J2(yj cos Xjt-hyfsmXjt), = — , j= 1 [s/2], (3.4) ." , s

where [a] is the largest integer < a and where the quantities yj and y* are given constants. For a time-varying seasonal this can be made stochastic by replacing y-}

and y* by the random walks

Yj.t+i = Yjt + a>jt, y*,t+i = 9jt+&*jt> J = 1 > • • • > C^/2], t - 1 , . . . , n, (3.5)

where&jt and &>*t are independent N(0, a^) variables; for details see Young et al. (1991). An alternative trigonometric form is the quasi-random walk model

t/2]

rt = jlYjt> (3-6) 7=1

where

yi.t+i — Yjt cos kj+ Y*t sin Xj + a>jti

Yjj+i = -Yjt sin Xj + yjt cos Xj + co*jt, 7 = 1 , . . . , [s/2], (3.7)

Page 58: Time Series Analysis by State Space Methods by Durbin and Koopman

3.2. STRUCTURAL TIME SERIES MODELS 41

in which the a>}t and (o*t terms are independent N(0, o f ) variables. We can show that when the stochastic terms in (3.7) are zero, the values of yt defined by (3.6) are periodic with period s by taking

Yjt — f j cos Xjt + Y* sin Xjt,

Yjt ~ ~Yj swXjt + yJco&Xjt,

which are easily shown to satisfy the deterministic part of (3.7). The required result follows since yt defined by (3.4) is periodic with period s. In effect, the deterministic part of (3.7) provides a recursion for (3.4).

The advantage of (3.6) over (3.5) is that the contributions of the errors (Ojt and (o*t are not amplified in (3.6) by the trigonometric functions cos Xjt and sin Xjt. We regard (3.3) as the main time domain model and (3.6) as the main frequency domain model for the seasonal component in structural time series analysis.

Each of the four seasonal models can be combined with either of the trend models to give a structural time series model and all these can be put in the state space form (3.1). For example, for the local linear trend model (3.2) together with model (3.3) we take

at = (fit vt yt yt-1 ... tt-i+2)',

with

Z, - ( Z M , Z [ y ] ) , % = diag(7jM], T [ y ] ) ,

Rt - diag(/?M, /?[}/)) , Qt - d i a g Q [ y ] ) , (3.8)

where

Z M = (1 ,0 ) , Z[y] = ( 1 , 0 , . . . , 0 ) ,

Tlp-} ~ f1 M 0 1 J'

% } = h,

Qifi]z 0 a \

" - 1 - 1 . . . - 1 - 1 " 1 0 0 0

T[y] - 0 1 0 0

0 0 1 0

R[y] =(i ,o, . . . ,oy,

Qly] = <*l-

Tliis model plays a prominent part in the approach of Harvey (1989) to structural time series analysis; he calls it the basic structural time series model. The state

Page 59: Time Series Analysis by State Space Methods by Durbin and Koopman

42 LINEAR GAUSSIAN STATE SPACE MODELS

space form of this basic model with s = 4 is therefore

at = (fit Vt Yt Yt-i Yt-2)', 1 1 0 0 0 0 1 0 0 0

Zf = (1 0 1 0 0), Tt = 0 0 -1 -1 -1 0 0 1 0 0 0 0 0 1 0

1 0 0 0 1 0 ~ a l 0 0

Qt= 0 o f 0

0 0 al R t = 0 0 1

0 0 0 0 0 0

Alternative seasonal specifications can also be used within the basic structural model. The Harrison and Stevens (1976) seasonal model referred to below (3.3) has the (s + 2) x 1 state vector

where the relevant parts of the system matrices for substitution in (3.8) are given

in which 1 is an s x 1 vector of ones, a>it H j~ a)sj — 0 and variance matrix <2[y] has rank s — 1. The trigonometric seasonal specification (3.7) has the (s + 1) x I state vector

oct = (fit vt yu y* y2t ...)',

with the relevant parts of the system matrices given by

Zjyj = (1, 0, 1, 0 , 1 , . . . , 1, 0,1) T[y] = diag(Cj, . . . , Cs*, - 1 ) ,

R[y] — 1, Qiy] = (J2IS-1,

where we assume that £ is even, s* — and

at = ( f i t vt yt ... Yt-s+i )',

by

= h,

Z[y]=( 1 , 0 , . . . , 0), T[y] = r _ 0 * Erl — j q

ÖM = " i i ' A ) ,

When Ä is odd, we have

Z ^ = ( 1 , 0 , 1 , 0 , 1 , . . . , 1,0)

i? f ) /] -- Is-1,

T[y] = diag(Ci, . . . , Cs*),

Qlyl = alh-\>

Page 60: Time Series Analysis by State Space Methods by Durbin and Koopman

3.2. STRUCTURAL TIME SERIES MODELS 43

Another important component in some time series is the cycle ct which we can introduce by extending the basic time series model (2.2) to

yt — V<t + Yt + ct + bu t-— 1 , . . . , n. (3.9)

In its simplest form ct is a pure sine wave generated by the relation

ct = ccos Xct + c* sinAci,

where Xc is the frequency of the cycle; the period is 2TifXc which is normally substantially greater that the seasonal period s. As with the seasonal, we can allow the cycle to change stochastically over time by means of the relations analogous to (3.7)

c (+i = ct cos Xc + c* sin A,c -j- a>t, c*+i ~ ~ct sin Ac 4- c* cos Xc -f £>*, (3.10)

with

c* — —c sin Xct + c* cos Xct,

where wt and w* are independent N(0, cr|) variables. Cycles of this form fit nat-urally into the structural time series model framework. The frequency Xc can be treated as an unknown parameter to be estimated.

Explanatory variables and intervention effects are easily allowed for in the structural model framework. Suppose we have k regressors x i , , . . . , x^ with re-gression coefficients fii,..., which are constant over time and that we also wish to measure the change in level due to an intervention at time r . We define an intervention variable wt as follows:

u)( — 0, t < r, = 1, t > r.

Adding these to the model (3.9) gives k

yt = fit + yt + ct + Y.fijxj, + Xwt + £,, ? = l , . . . , n . (3.11) j=i

We see that X measures the change in the level of the series at a known time r due to an intervention at time r. The resulting model can readily be put into state space form. For example, if Yt — Ct — X — 0, k — 1 and if \it is determined by a local level model, we can take

ott = (/Af M , Zt = (1 xu),

in (3.1). Here, although we have attached a suffix t to it is made to satisfy yfr, i + l = so it is constant. Other examples of intervention variables are the

Tt = 1 0 0 1

Page 61: Time Series Analysis by State Space Methods by Durbin and Koopman

44 LINEAR GAUSSIAN STATE SPACE MODELS

pulse intervention variable defined by

XVt — 0, t < T, = 1, i = T,

t > r,

and the s/ope intervention variable defined by

u/, - 0, = 1 +t - T,

t < X, t > T.

For other forms of intervention variable designed to represent a more gradual change of level or a transient change see Box and Tiao (1975). Coefficients such as A. which do not change over time can be incorporated into the state vector by setting the corresponding state errors equal to zero. Regression coefficients fijt which change over time can be handled straightforwardly in the state space framework by modelling them by random walks of the form

An example of the use of model (3.11) for intervention analysis is given by Harvey and Durbin (1986) who used it to measure the effect of the British seat belt law on road traffic casualties. Of course, if the cycle term, the regression term or the intervention term are not required, they can be omitted from (3.11). Instead of including regression and intervention coefficients in the state vector, an alternative way of dealing with them is to concentrate them out of the likelihood function and estimate them via regression, as we will show in §6.2.3.

3.2.2 MULTIVARIATE MODELS

The methodology of structural time series models lends itself easily to generalisa-tion to multivariate time series. Consider the local level model for a p x 1 vector of observations yt, that is

with p x p variance matrices £ e and In this so-called seemingly unrelated time series equations model, each series in yt is modelled as in the univariate case, but the disturbances may be correlated instantaneously across series. In the case of a model with other components such as slope, cycle and seasonal, the disturbances associated with the components become vectors which have p x p variance matrices. The link across the p different time series is through the corre-lations of the disturbances driving the components.

A seemingly unrelated time series equations model is said to be homogeneous when the variance matrices associated with the different disturbances are propor-tional to each other. For example, the homogeneity restriction for the multivariate

ßj,t+1 - ßjt + X j t , Xjt ~ N(0, a l ) , j — 1, . , . ,k. (3.12)

yt - fit+et, fit± i = p,t + rjt,

where p,t, st and rjt are p x 1 vectors and

s,~ N(0,Ef i), rjt~ N(0,S,) ,

(3.13)

Page 62: Time Series Analysis by State Space Methods by Durbin and Koopman

3.2. STRUCTURAL TIME SERIES MODELS 45

local level model is

EjJ = g^E,

where scalar q is the signal-to-noise ratio. This means that all the series in yt, and linear combinations thereof, have the same dynamic properties which implies that they have the same autocorrelation function for the stationary form of the model. A homogeneous model is a rather restricted model but it is easy to estimate. For further details we refer to Harvey (1989, Chapter 8).

Consider the multivariate local level model without the homogeneity restriction but with the assumption that the rank of is r < p. The model then contains only r underlying level components. We may refer to these as common levels. Recognition of such common factors yields models which may not only have an interesting interpretation, but may also provide more efficient inferences and forecasts. With an appropriate ordering of the series the model may be written as

yt — a + A/A* + £t,

1 - V* +

where p* and jjf are r x l vectors, a is a p x 1 vector and A is a p x r matrix. We further assume that

where a* is a (p — r) x 1 vector and A* is a (p — r) x r matrix of nonzero values and where variance matrix E* is a r x r positive definite matrix. The matrix A may be interpreted as a factor loading matrix. When there is more than one common factor (r > 1), the factor loadings are not unique. A factor rotation may give components with a more interesting interpretation.

Further discussion of multivariate extensions of structural time series models are given by Harvey (1989, Chapter 8) and Harvey and Koopman (1997).

3.2.3 STAMP

A wide-ranging discussion of structural time series models can be found in the book of Harvey (1989). A good supplementary source for further applications and later work is Harvey and Shephard (1993). The computer package STAMP 6.0 of Koopman et al (2000) is designed to analyse, model and forecast time series based on univariate and multivariate structural time series models. The package has implemented the Kalman filter and associated algorithms leaving the user free to concentrate on the important part of formulating a model. More information on STAMP can be obtained from the Internet at

http://stamp-software.com/

Page 63: Time Series Analysis by State Space Methods by Durbin and Koopman

46 LINEAR GAUSSIAN STATE SPACE MODELS

3.3 ARMA models and AREVÍA models

Autoregressive integrated moving average (ARJMA) time series models were introduced by Box and Jenkins in their pathbreaking (1970) book; see Box, Jenkins and Reinsel (1994) for the current version of this book. As with structural time series models considered in the last section, Box and Jenkins typically regarded a univariate time series yt as made up of trend, seasonal and irregular components. However, instead of modelling the various components separately, their idea was to eliminate the trend and seasonal by differencing at the outset of the analysis. The resulting differenced series are treated as a stationary time series, that is, a series where characteristic properties such as means, covariances, etc. remain invariant under translation through time. Let Ayt =yt~ yt-1, A2>v = A(Ay(), Asy t - yt - yt-s, A2

syt = A^A^y,), and so on, where we are assuming that we have s 'months' per 'year'. Box and Jenkins suggest that differencing is continued until trend and seasonal effects have been eliminated, giving a new variable y* — Ad A®yt for d, D — 0, 1 , . . . , which we model as a stationary autoregressive moving average ARMA(/?, q) model given by

y* - M - i + • • • + 4>py*-p + b + QxKt-i + • • •+e q ç t . q , ç, ~ N(o, af2),

(3.14)

with non-negative integers p and q and where is a serially independent series of N(0, cr2) disturbances. This can be written in the form

/--l yt

j=1 / = 1 , . . . , « , (3.15)

i

where r = max(/>, q H- 1) and for which some coefficients are zero. Box and Jenkins normally included a constant term in (3.15) but for simplicity we omit this; the modifications needed to include it are straightforward. We use the sym-bols d, p and q here and elsewhere in their familiar ARIMA context without prejudice to their use in different contexts in other parts of the book.

We now demonstrate how to put these models into state space form, beginning with case where d — D — 0, that is, no differencing is needed, so we can model the series by (3.15) with y* replaced by yt. Take

Z, = ( 1 0 0 - • • 0),

/ » <hyt~\ H + 4>ryt-r+1 + 01 + h Ör-l?/-r+2

Ott = <hyt-1 H + 4>ryt~r+2 + #2<Tí h 1- 0r-l&-r+3

V

(3.16)

/

Page 64: Time Series Analysis by State Space Methods by Durbin and Koopman

3.3. ARMA MODELS AND ARIMA MODELS 47

and write the state equation for a t + i as in (3.1) with

Tt = T

<Pi 1

<f>r-l 4>r

Rt = R

( 1 \ Oi

\0r-tJ

% = 6+1-

(3.17)

This, together with the observation equation yt — Ztat, is equivalent to (3.15) but is now in the state space form (3.1) with et — 0, implying that Ht — 0. For example, with r — 2 we have the state equation

( f c t t + O i f t + i ) = 0 (̂ -fVeifi)"1"̂ )̂ 1,

The form given is not the only state space version of an ARMA model but is a convenient one.

We now consider the case of a univariate non-seasonal nonstationary ARIMA model of order p, d and q, with d > 0, given by (3.14) with y* = Adyt. As an example, we first consider the state space form of the ARIMA model with p — 2, d — 1 and q = 1 which is given by

yt = ( 1 1 0)«„

1 1 «r+l =

with the state vector defined as

a, =

and y* = Ayt = yt — yt-\. This example generalises easily to ARIMA models with d — 1 with other values for p and q. The ARIMA model with p — 2, d = 2 and q — 1 in state space form is given by

0" 1 1 1 tr+l. 0 U/

* = ( 1 1 1 0)a„

ctt+i

1 1 1 0 0 1 1 0 0 0 4>1 1 0 0 (h. 0

a.t +

/ 0 \ 0 1 Zt+1.

Page 65: Time Series Analysis by State Space Methods by Durbin and Koopman

48 LINEAR GAUSSIAN STATE SPACE MODELS

with

l \ a, ~ Aj/_i

y* K<i>2 ytJ+eü,)

and y* — A 2yt — A (y, — yr-i). The relations between yt, A yt and A 2yt follow immediately since

We deal with the unknown nonstationary values yo and Ayo in the initial state vector «i in §5.6.3 where we describe the initialisation procedure for filtering and smoothing. Instead of estimating yo and Ayo directly, we treat these elements of a^ as diffuse random elements while the other elements, including y*, are stationary which have proper unconditional means and variances. The need to facilitate the initialisation procedure explains why we set up the state space model in this form. The state space forms for ARIMA models with other values for p*, d and q* can be represented in similar ways. The advantage of the state space formulation is that the array of techniques that have been developed for state space models are made available for ARMA and ARIMA models. In particular, techniques for exact maximum likelihood estimation and for initialisation are available.

As indicated above, for seasonal series both trend and seasonal are eliminated by the differencing operation yf = AdAfyt prior to modelling y* by a stationary ARMA model of the form (3.15). The resulting model for y* can be put into state space form by a straightforward extension of the above treatment. A well-known seasonal ARIMA model is the so-called airline model which is given by

which has a standard ARIMA state space representation. It is interesting to note that for many state space models an inverse relation

holds in the sense that the state space model has an ARIMA representation. For example, if second differences are taken in the local linear trend model (3.2), the terms in jit and vf disappear and we obtain

Since the first two autocorrelations of this are nonzero and the rest are zero, we can write it as a moving average series + 0\ + where the £*'s are independent N(0, cr?*) disturbances, and we obtain the representation

A yt = A 2yt + A y,_lt

yt — Ay, + y,_i = A2yf + Ay,_i + yt^.

y* - AA12yt = & - - 012f,_12 + 0i012C,_I3, (3.18)

A2yt = et+1 - 2ef+i + et + £t+i - Ht + ç,

(3.19)

In Box and Jenkins' notation this is an ARIMA(0,2,2) model. It is important to recognise that the model (3.19) is less informative than (3.2) since it has lost the

Page 66: Time Series Analysis by State Space Methods by Durbin and Koopman

3.4. EXPONENTIAL SMOOTHING 49

information that exists in the form (3.2) about the level p, and the slope vf. If a seasonal term generated by model (3.3) is added to the local linear trend model, the corresponding ARIMA model has the form

s+2

A2A = +

where 6 i , . . . , (9v+2 are determined by the four variances of, of, cr2 and of. In this model, information about the seasonal is lost as well as information about the trend. The fact that structural time series models provide explicit information about trend and seasonal, whereas ARIMA models do not, is an important advantage that the structural modelling approach has over ARIMA modelling. We shall make a detailed comparison of the two approaches to time series analysis in §3.5.

3.4 Exponential smoothing

In this section we consider the development of exponential smoothing in the 1950s and we examine its relation to simple forms of state space and Box-Jenkins models.

Let us start with the introduction in the 1950s of the exponentially weighted moving average (EWMA) for one-step-ahead forecasting of y_{t+1} given a univariate time series y_t, y_{t−1}, …. This has the form

ŷ_{t+1} = (1 − λ) Σ_{j=0}^{∞} λ^j y_{t−j},    0 < λ < 1.    (3.20)

From (3.20) we deduce immediately the recursion

ŷ_{t+1} = (1 − λ) y_t + λ ŷ_t,    (3.21)

which is used in place of (3.20) for practical computation. This has a simple structure and requires little storage, so it was very convenient for the primitive computers available in the 1950s. As a result, EWMA forecasting became very popular in industry, particularly for sales forecasting of many items simultaneously. We call the operation of calculating forecasts by (3.21) exponential smoothing.
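As a small illustration, here is a sketch (ours, in Python) of the recursion (3.21); the starting value ŷ_1 is a convention we adopt, not something (3.21) fixes:

```python
# A sketch of exponential smoothing by the recursion (3.21);
# yhat[0] is initialised to the first observation by convention.
import numpy as np

def ewma_forecasts(y, lam):
    yhat = np.empty(len(y) + 1)
    yhat[0] = y[0]
    for t in range(len(y)):
        yhat[t + 1] = (1.0 - lam) * y[t] + lam * yhat[t]
    return yhat                      # yhat[t] is the forecast of y[t]

print(ewma_forecasts(np.array([1.0, 2.0, 1.5, 1.8]), lam=0.6))
```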

Denote the one-step forecast error y_t − ŷ_t by u_t and substitute in (3.21) with t replaced by t − 1; this gives

y_t − u_t = (1 − λ) y_{t−1} + λ (y_{t−1} − u_{t−1}),

that is,

Δy_t = u_t − λ u_{t−1}.    (3.22)

Taking u_t to be a series of independent N(0, σ²_u) variables, we see that we have deduced from the EWMA recursion (3.21) the simple ARIMA model (3.22).

An important contribution was made by Muth (1960) who showed that EWMA forecasts produced by the recursion (3.21) are minimum mean square error forecasts in the sense that they minimise E(y_{t+1} − ŷ_{t+1})² for observations


y_t, y_{t−1}, … generated by the local level model (2.3), which for convenience we write in the form

y_t = μ_t + ε_t,
μ_{t+1} = μ_t + ξ_t,    (3.23)

where ε_t and ξ_t are serially independent random variables with zero means and constant variances. Taking first differences of observations y_t generated by (3.23) gives

Δy_t = y_t − y_{t−1} = ε_t − ε_{t−1} + ξ_{t−1}.

Since ε_t and ξ_t are serially uncorrelated, the autocorrelation of Δy_t at the first lag is nonzero but all higher autocorrelations are zero. This is the autocorrelation function of a moving average model of order one which, with λ suitably defined, we can write in the form

Δy_t = u_t − λ u_{t−1},

which is the same as model (3.22). We observe the interesting point that these two simple forms of state space and ARIMA models produce the same one-step forecasts and that these can be calculated by the EWMA (3.21), which has proven practical value. We can write this in the form

ŷ_{t+1} = ŷ_t + (1 − λ)(y_t − ŷ_t),

which is the Kalman filter for the simple state space model (3.23). The EWMA was extended by Holt (1957) and Winters (1960) to series containing trend and seasonal. The extension for trend in the additive case is

ŷ_{t+1} = m_t + b_t,

where m_t and b_t are level and slope terms generated by the EWMA-type recursions

m_t = (1 − λ_1) y_t + λ_1 (m_{t−1} + b_{t−1}),
b_t = (1 − λ_2)(m_t − m_{t−1}) + λ_2 b_{t−1}.
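A minimal sketch of these recursions follows (ours; the starting level and slope are ad hoc choices that the text does not prescribe):

```python
# A sketch of the Holt recursions above; m and b are initialised ad hoc.
import numpy as np

def holt_forecasts(y, lam1, lam2):
    m, b = y[0], 0.0                        # crude starting level and slope
    yhat = [m + b]                          # forecast of y[1]
    for t in range(1, len(y)):
        m_new = (1 - lam1) * y[t] + lam1 * (m + b)
        b = (1 - lam2) * (m_new - m) + lam2 * b
        m = m_new
        yhat.append(m + b)                  # forecast of y[t+1]
    return np.array(yhat)

print(holt_forecasts(np.array([1.0, 1.4, 1.9, 2.3, 2.9]), lam1=0.5, lam2=0.7))
```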

In an interesting extension of the results of Muth (1960), Theil and Wage (1964) showed that the forecasts produced by these Holt-Winters recursions are minimum mean square error forecasts for the state space model

y_t = μ_t + ε_t,
μ_{t+1} = μ_t + ν_t + ξ_t,    (3.24)
ν_{t+1} = ν_t + ζ_t,

which is the local linear trend model (3.2). Taking second differences of y_t generated by (3.24), we obtain

Δ²y_t = ζ_{t−2} + ξ_{t−1} − ξ_{t−2} + ε_t − 2ε_{t−1} + ε_{t−2}.


This is a stationary series with nonzero autocorrelations at lags 1 and 2 but zero autocorrelations elsewhere. It therefore follows the moving average model

Δ²y_t = u_t − θ_1 u_{t−1} − θ_2 u_{t−2},

which is a simple form of ARIMA model. Adding the seasonal term γ_{t+1} = −γ_t − ⋯ − γ_{t−s+2} + ω_t from (3.3) to the measurement equation of (3.23) gives the model

y_t = μ_t + γ_t + ε_t,
μ_{t+1} = μ_t + ξ_t,    (3.25)
γ_{t+1} = −γ_t − ⋯ − γ_{t−s+2} + ω_t,

which is a special case of the structural time series models of §3.2. Now take first differences and first seasonal differences of (3.25). We find

ΔΔ_s y_t = ξ_{t−1} − ξ_{t−s−1} + ω_{t−1} − 2ω_{t−2} + ω_{t−3} + ε_t − ε_{t−1} − ε_{t−s} + ε_{t−s−1},    (3.26)

which is a stationary time series with nonzero autocorrelations at lags 1, 2, s − 1, s and s + 1. Consider the airline model (3.18) for general s,

ΔΔ_s y_t = u_t − θ_1 u_{t−1} − θ_s u_{t−s} + θ_1 θ_s u_{t−s−1},

which has been found to fit well many economic time series containing trend and seasonal. It has nonzero autocorrelations at lags 1, s − 1, s and s + 1. Now the autocorrelation at lag 2 from model (3.25) arises only from Var(ω_t), which in most cases in practice is small. Thus when we add a seasonal component to the models we find again a close correspondence between state space and ARIMA models. A slope component ν_t can be added to (3.25) as in (3.24) without significantly affecting the conclusions.

A pattern is now emerging. Starting with EWMA forecasting, which in appropriate circumstances has been found to work well in practice, we have found that there are two distinct types of models, the state space models and the Box-Jenkins ARIMA models, which appear to be very different conceptually but which both give minimum mean square error forecasts from EWMA recursions. The explanation is that when the time series has an underlying structure which is sufficiently simple, the appropriate state space and ARIMA models are essentially equivalent. It is when we move towards more complex structures that the differences emerge. In the next section we will make a comparison of the state space and Box-Jenkins approaches to a broader range of problems in time series analysis. The above discussion has been based on Durbin (2000b, §2).

3.5 State space versus Box-Jenkins approaches

In this section we compare the state space and Box-Jenkins approaches to time series analysis. The early development of state space methodology took place in


the field of engineering rather than statistics, starting with the pathbreaking paper of Kalman (1960). In this paper Kalman did two crucially important things. First, he showed that a very wide class of problems could be encapsulated in a simple linear model, essentially the state space model (3.1). Secondly, he showed how, due to the Markovian nature of the model, the calculations needed for practical application of the model could be set up in recursive form in a way that was particularly convenient on a computer. A huge amount of work was subsequently done in the engineering field to develop these ideas. From the 1960s to the early 1980s, contributions to state space methodology from statisticians and econometricians were isolated and sporadic. In recent years, however, there has been a rapid growth of interest in the field in both statistics and econometrics, as is indicated by references throughout the book.

The key advantage of the state space approach is that it is based on a structural analysis of the problem. The different components that make up the series, such as trend, seasonal, cycle and calendar variations, together with the effects of explanatory variables and interventions, are modelled separately before being put together in the state space model. It is up to the investigator to identify and model any features in particular situations that require special treatment. In contrast, the Box-Jenkins approach is a kind of 'black box', in which the model adopted depends purely on the data without prior analysis of the structure of the system that generated the data. A second advantage of state space models is that they are flexible. Because of the recursive nature of the models and of the computational techniques used to analyse them, it is straightforward to allow for known changes in the structure of the system over time. On the other hand, Box-Jenkins models are homogeneous through time since they are based on the assumption that the differenced series is stationary.

State space models are very general. They cover a very wide range including all ARIMA models. Multivariate observations can be handled by straightforward extensions of univariate theory, which is not the case with Box-Jenkins models. It is easy to allow for missing observations with state space models. Explanatory variables can be incorporated into the model without difficulty. Moreover, the associated regression coefficients can be permitted to vary stochastically over time if this seems to be called for in the application. Trading day adjustments and other calendar variations can be readily taken care of. Because of the Markovian nature of state space models, the calculations needed to implement them can be put in recursive form. This enables increasingly large models to be handled effectively without disproportionate increases in the computational burden. No extra theory is required for forecasting. All that is needed is to project the Kalman filter forward into the future. This gives the forecasts required together with their estimated standard errors using the standard formulae used earlier in the series.

It might be asked, if these are all the advantages of state space modelling, what are the disadvantages relative to the Box-Jenkins approach? In our opinion, the only disadvantages are the relative lack in the statistical and econometric communities of information, knowledge and software regarding these models.


ARIMA modelling forms a core part of university courses on time series analysis and there are numerous textbooks on the subject. Software is widely available in major general packages such as SAS, S-PLUS and MINITAB as well as in many specialist time series packages. In contrast, state space modelling for time series is taught in relatively few universities and, on the statistical side as distinct from the engineering side, very few books are available. There is not much state space software in general statistical packages, and specialist software has only recently become available.

Now let us consider some of the disadvantages of the Box-Jenkins approach. The elimination of trend and seasonal by differencing may not be a drawback if forecasting is the only object of the analysis, but in many contexts, particularly in official statistics and some econometric applications, knowledge about these components has intrinsic importance. It is true that estimates of trend and seasonal can be 'recovered' from the differenced series by maximising the residual mean square as in Burman (1980) but this seems an artificial procedure which is not as appealing as modelling the components directly.

The requirement that the differenced series should be stationary is a weakness of the theory. In the economic and social fields, real series are never stationary however much differencing is done. The investigator has to face the question, how close to stationarity is close enough? This is a hard question to answer.

In the Box-Jenkins system it is relatively difficult to handle matters like missing observations, adding explanatory variables, calendar adjustments and changes in behaviour over time. These are straightforward to deal with in the state space approach. In practice it is found that the airline model and similar ARIMA models fit many data sets quite well, but it can be argued that the reason for this is that they are approximately equivalent to plausible state space models. This point is discussed at length by Harvey (1989, pp. 72, 73). As we move away from airline-type models, the model identification process in the Box-Jenkins system becomes difficult to apply. The main tool is the sample autocorrelation function which is notoriously imprecise due to its high sampling variability. Practitioners in applied time series analysis are familiar with the fact that many examples can be found where the data appear to be explained equally well by models whose specifications look very different.

A final point in favour of structural models is their transparency. One can examine graphs of trend, seasonal and other components to check whether they behave in accordance with expectation. If, for example, a blip was found in the graph of the seasonal component, one could go back to the original data to trace the source and perhaps make an adjustment to the model accordingly.

To sum up, state space models are based on modelling the observed structure of the data. They are more general, more flexible and more transparent than Box-Jenkins models. They can deal with features of the data which are hard to handle in the Box-Jenkins system. Software for state space models is publicly available, as we indicate at appropriate points of the exposition in the chapters to follow. The above discussion has been based on Durbin (2000b, §3).


3.6 Regression with time-varying coefficients

Suppose that in the linear regression model y_t = Z_t α + ε_t we wish the coefficient vector α to vary over time. A suitable model for this is to put α = α_t and permit each coefficient α_{it} to vary according to a random walk α_{i,t+1} = α_{it} + η_{it}. This gives a state equation for the vector α_t in the form α_{t+1} = α_t + η_t. Since the model is a special case of (3.1), it can be handled in a routine fashion by Kalman filter and smoothing techniques.
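As an illustration of how this fits the form (3.1), a short sketch follows (our own; the dimensions and variance values are purely illustrative assumptions):

```python
# A sketch of regression with time-varying coefficients in state space form:
# the state is the coefficient vector, the transition is a random walk, and
# the observation matrix at time t is the 1 x k regressor row.
import numpy as np

k = 2
T = np.eye(k)                         # alpha_{t+1} = alpha_t + eta_t
R = np.eye(k)
Q = np.diag([1e-4, 1e-3])             # Var(eta_t), illustrative values
H = np.array([[0.5]])                 # Var(eps_t), illustrative value

def Z_t(x_t):
    return np.asarray(x_t, dtype=float).reshape(1, k)
```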

3.7 Regression with ARMA errors

Consider a regression model of the form

y_t = X_t β + ε_t,    t = 1, …, n,    (3.27)

where y_t is a univariate dependent variable, X_t is a 1 × k regressor vector, β is its coefficient vector and ε_t denotes the error, which is assumed to follow an ARMA model of form (3.15); this ARMA model may or may not be stationary, and some of the coefficients φ_j, θ_j may be zero as long as φ_r and θ_{r−1} are not both zero. Let α_t be defined as in (3.16) and let

α*_t = ( β_t )
       ( α_t ),

where β_t = β. Writing the state equation implied by (3.17) as α_{t+1} = T α_t + R η_t, let

T* = [ I_k  0 ]        R* = ( 0 )
     [ 0    T ],            ( R ),        Z*_t = (X_t, 1, 0, …, 0),

where T and R are defined in (3.17). Then the model

y_t = Z*_t α*_t,      α*_{t+1} = T* α*_t + R* η_t,

is equivalent to (3.27) and is in state space form (3.1), so Kalman filter and smoothing techniques are applicable; these provide an efficient means of fitting model (3.27). It is evident that the treatment can easily be extended to the case where the regression coefficients are determined by random walks as in §3.6. Moreover, with this approach, unlike some others, it is not necessary for the ARMA model used for the errors to be stationary.
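A sketch of the augmentation follows (our own helper names; the ARMA matrices are those of (3.16)-(3.17), passed in as arguments so the block stands alone):

```python
# A sketch building T*, R* and Z*_t of the augmented model from given
# ARMA system matrices Z, T, R and k regression coefficients.
import numpy as np

def augment_for_regression(Z_arma, T, R, k):
    r = T.shape[0]
    T_star = np.block([[np.eye(k), np.zeros((k, r))],
                       [np.zeros((r, k)), T]])
    R_star = np.vstack([np.zeros((k, R.shape[1])), R])
    def Z_star(X_t):                   # X_t is the 1 x k regressor row
        return np.hstack([np.asarray(X_t).reshape(1, k),
                          Z_arma.reshape(1, r)])
    return T_star, R_star, Z_star

# ARMA(1, 1) errors (r = 2) with two regressors, illustrative values:
T = np.array([[0.7, 1.0], [0.0, 0.0]])
R = np.array([[1.0], [0.4]])
Z = np.array([1.0, 0.0])
T_star, R_star, Z_star = augment_for_regression(Z, T, R, k=2)
```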

3.8 Benchmarking

A common problem in official statistics is the adjustment of monthly or quarterly observations, obtained from surveys and therefore subject to survey errors, to agree with annual totals obtained from censuses and assumed to be free from error. The annual totals are called benchmarks and the process is called benchmarking. We shall show how the problem can be handled within a state space framework.


Denote the survey observations, which we take to be monthly (s = 12), by y_t and the true values they are intended to estimate by y*_t for t = 12(i − 1) + j, i = 1, …, ℓ and j = 1, …, 12, where ℓ is the number of years. The survey error is y_t − y*_t, which we denote by σ_t ε_t, where σ_t is the standard deviation of the survey error at time t. The error ε_t is modelled as an AR(1) process with unit variance; in principle, ARMA models of higher order could be used. We assume that the values of σ_t are available from survey experts and that the errors are bias free; we will mention the estimation of bias later. The benchmark values are given by x_i = Σ_{j=1}^{12} y*_{12(i−1)+j} for i = 1, …, ℓ. We suppose for simplicity of exposition that we have these annual values for all years in the study, though in practice the census values will usually lag a year or two behind the survey observations. We take as the model for the observations

y_t = μ_t + γ_t + Σ_{j=1}^{k} δ_{jt} w_{jt} + σ_t ε_t,    t = 1, …, 12ℓ,    (3.28)

where μ_t is trend, γ_t is seasonal and the terms w_{jt} represent systematic effects, such as the influence of calendar variations, which can have a substantial effect on quantities such as retail sales; the coefficients δ_{jt} can vary slowly over time.

The series is arranged in the form

y_1, …, y_12, x_1, y_13, …, y_24, x_2, y_25, …, y_36, x_3, …, x_ℓ.

Let us regard the time point in the series at which the benchmark occurs as t = (12i)′; thus the point t = (12i)′ occurs in the series between t = 12i and t = 12i + 1. It seems reasonable to update the regression coefficients δ_{jt} only once a year, say in January, so we take for these coefficients the model

δ_{j,12i+1} = δ_{j,12i} + χ_{j,12i},    j = 1, …, k,    i = 1, …, ℓ,
δ_{j,t+1} = δ_{jt},    otherwise.

Take the integrated random walk model for the trend component and model (3.3) for the seasonal component, that is,

Δ²μ_t = ζ_t,      γ_t = −Σ_{j=1}^{11} γ_{t−j} + ω_t;

see §3.2 for alternative trend and seasonal models. It turns out to be convenient to put the observation errors into the state vector, so we take

α_t = ( μ_t, …, μ_{t−11}, γ_t, …, γ_{t−11}, δ_{1t}, …, δ_{kt}, ε_t, …, ε_{t−11} )′.

Thus y_t = Z_t α_t where

Z_t = (1, 0, …, 0, 1, 0, …, 0, w_{1t}, …, w_{kt}, σ_t, 0, …, 0),    t = 1, …, n,


and x_i = Z_t α_t where

Z_t = ( 1, …, 1, 0, …, 0, Σ_{s=12i−11}^{12i} w_{1s}, …, Σ_{s=12i−11}^{12i} w_{ks}, 0, …, 0 ),    t = (12i)′,

for i = 1, …, ℓ. Using results from §3.2 it is easy to write down the state transition from α_t to α_{t+1} for t = 12i − 11 to t = 12i − 1, taking account of the fact that δ_{j,t+1} = δ_{jt}. From t = 12i to t = (12i)′ the transition is the identity. From t = (12i)′ to t = 12i + 1, the transition is the same as for t = 12i − 11 to t = 12i − 1, except that we now take account of the relation δ_{j,12i+1} = δ_{j,12i} + χ_{j,12i}.

There are many variants of the benchmarking problem. For example, the annual totals may be subject to error; the benchmarks may be values at a particular month, say December, instead of annual totals; the survey observations may be biased and the bias may need to be estimated; more complicated models than the AR(1) model can be used for the survey error; finally, the observations may behave multiplicatively whereas the benchmark constraint is additive, thus leading to a nonlinear model. All these variants are dealt with in a comprehensive treatment of the benchmarking problem by Durbin and Quenneville (1997). They also consider a two-step approach to the problem in which a state space model is first fitted to the survey observations and the adjustment to satisfy the benchmark constraints takes place in a second stage.

Essentially, this example demonstrates that the state space approach can be used to deal with situations in which the data come from two different sources. Another example of such problems will be given in §3.9 where we model different series which aim at measuring the same phenomenon simultaneously, which are all subject to sampling error and which are observed at different time intervals.

3.9 Simultaneous modelling of series from different sources

A different problem in which data come from two different sources has been considered by Harvey and Chung (2000). Here the objective is to estimate the level of UK unemployment and the month-to-month change of unemployment given two different series. Of these, the first is a series y_t of observations obtained from a monthly survey designed to estimate unemployment according to an internationally accepted standard definition (the so-called ILO definition, where ILO stands for International Labour Office); this estimate is subject to survey error. The second series consists of monthly counts x_t of the number of individuals claiming unemployment benefit; although these counts are known accurately, they do not themselves provide an estimate of unemployment consistent with the ILO definition. Even though the two series are closely related, the relationship is not even approximately exact and it varies over time. The problem to be considered is how to use the knowledge of x_t to improve the accuracy of the estimate based on y_t alone.


The solution suggested by Harvey and Chung (2000) is to model the bivariate series (y_t, x_t)′ by the structural time series model

( y_t )
( x_t ) = μ_t + ε_t,      ε_t ∼ N(0, Σ_ε),

μ_{t+1} = μ_t + ν_t + ξ_t,      ξ_t ∼ N(0, Σ_ξ),    (3.29)
ν_{t+1} = ν_t + ζ_t,      ζ_t ∼ N(0, Σ_ζ),

for t = 1, …, n. Here, μ_t, ε_t, ν_t, ξ_t and ζ_t are 2 × 1 vectors and Σ_ε, Σ_ξ and Σ_ζ are 2 × 2 variance matrices. Seasonals can also be incorporated. Many complications are involved in implementing the analysis based on this model, particularly those arising from design features of the survey such as overlapping samples. A point of particular interest is that the claimant count x_t is available one month ahead of the survey value y_t. This extra value of x_t can easily and efficiently be made use of by the missing observations technique discussed in §4.8. For a discussion of the details we refer the reader to Harvey and Chung (2000).

This is not the only way in which the information available in x_t can be utilised. For example, in the published discussion of the paper, Durbin (2000a) suggested two further possibilities, the first of which is to model the series y_t − x_t by a structural time series model of one of the forms considered in §3.2.1; the unemployment level could then be estimated by μ̂_t + x_t, where μ̂_t is the forecast of the trend μ_t in the model using information up to time t − 1, while the month-to-month change could be estimated by ν̂_t + x_t − x_{t−1}, where ν̂_t is the forecast of the slope ν_t. Alternatively, x_t could be incorporated as an explanatory variable into an appropriate form of model (3.11) with coefficient β_j replaced by β_{jt}, which varies over time according to (3.12). In an obvious notation, trend and change would then be estimated by μ̂_t + β̂_t x_t and ν̂_t + β̂_t x_t − β̂_{t−1} x_{t−1}.

3.10 State space models in continuous time

In contrast to all the models that we have considered so far, suppose that the observation y(t) is a continuous function of time for t in an interval which we take to be 0 ≤ t ≤ T. We shall aim at constructing state space models for y(t) which are the analogues in continuous time of models that we have already studied in discrete time. Such models are useful not only for studying phenomena which genuinely operate in continuous time, but also for providing a convenient theoretical base for situations where the observations take place at time points t_1 < ⋯ < t_n which are not equally spaced.

3.10.1 LOCAL LEVEL MODEL

We begin by considering a continuous version of the local level model (2.3). To construct this, we need a continuous analogue of the Gaussian random walk. This is the Brownian motion process, defined as the continuous stochastic process w(t) such that w(0) = 0, w(t) ∼ N(0, t) for 0 < t < ∞, and such that increments w(t_2) − w(t_1), w(t_4) − w(t_3) for 0 ≤ t_1 < t_2 ≤ t_3 < t_4 are independent. We



sometimes need to consider increments dw(t), where dw(t) ∼ N(0, dt) for dt infinitesimally small. Analogously to the random walk α_{t+1} = α_t + η_t, η_t ∼ N(0, σ²_η), for the discrete model, we define α(t) by the continuous time relation dα(t) = σ_η dw(t), where σ_η is an appropriate positive scale parameter. This suggests that as the continuous analogue of the local level model we adopt the continuous time state space model

y(t) = α(t) + ε(t),
α(t) = α(0) + σ_η w(t),    0 ≤ t ≤ T,    (3.30)

where T > 0.

The nature of ε(t) in (3.30) requires careful thought. It must first be recognised that for any analysis that is performed digitally, which is all that we consider in this book, y(t) cannot be admitted into the calculations as a continuous record; we can only deal with it as a series of values observed at a discrete set of time points 0 ≤ t_1 < t_2 < ⋯ < t_n ≤ T. Secondly, Var[ε(t)] must be bounded significantly away from zero; there is no point in carrying out an analysis when y(t) is indistinguishably close to α(t). Thirdly, in order to obtain a continuous analogue of the local level model we need to assume that Cov[ε(t_i), ε(t_j)] = 0 for observational points t_i, t_j (i ≠ j). It is obvious that if the observational points are close together it may be advisable to set up an autocorrelated model for ε(t), for example a low-order autoregressive model; however, the coefficients of this would have to be put into the state vector and the resulting model would not be a continuous local level model. In order to allow Var[ε(t)] to vary over time we assume that Var[ε(t)] = σ²(t), where σ²(t) is a non-stochastic function of t that may depend on unknown parameters. We conclude that in place of (3.30) a more appropriate form of the model is

y(t_i) = α(t_i) + ε(t_i),    ε(t_i) ∼ N[0, σ²(t_i)],    i = 1, …, n,
α(t) = α(0) + σ_η w(t),    0 ≤ t ≤ T.    (3.31)

We next consider the estimation of unknown parameters by maximum likelihood. Since by definition the likelihood is equal to

p[y(t_1)] p[y(t_2)|y(t_1)] ⋯ p[y(t_n)|y(t_1), …, y(t_{n−1})],

it depends on α(t) only at the values t_1, …, t_n. Thus for estimation of parameters we can employ the reduced model

y_i = α_i + ε_i,
α_{i+1} = α_i + η_i,    i = 1, …, n,    (3.32)

where y_i = y(t_i), α_i = α(t_i), ε_i = ε(t_i), η_i = σ_η[w(t_{i+1}) − w(t_i)] and where the ε_i's are assumed to be independent. This is a discrete local level model which differs from (2.3) only because the variances of the disturbances can be unequal; consequently we can calculate the loglikelihood by a slight modification of the method of §2.10.1 which allows for the variance inequality.
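Concretely, under our reading of (3.32) the state disturbances inherit the Brownian increments, so Var(η_i) = σ²_η (t_{i+1} − t_i); a minimal sketch (ours, assuming numpy) of the only quantity that changes relative to the equally spaced case:

```python
# A minimal sketch: for irregularly spaced times the local level state
# disturbances in (3.32) have Var(eta_i) = sigma2_eta * (t[i+1] - t[i]);
# the times and variance below are illustrative values, not data from the book.
import numpy as np

t = np.array([0.0, 0.4, 1.1, 1.5, 3.0])   # observation times t_1, ..., t_n
sigma2_eta = 0.8
var_eta = sigma2_eta * np.diff(t)         # Var(eta_i) for each interval
print(var_eta)
```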

Having estimated the model parameters, suppose that we wish to estimate α(t) at a value t = t_{j*} between t_j and t_{j+1} for 1 ≤ j < n. We adjust and extend equations


(3.32) to give

α_{j*} = α_j + η_j,      α_{j+1} = α_{j*} + η_{j*},      y_{j*} treated as missing,    (3.33)

where η_j = σ_η[w(t_{j*}) − w(t_j)] and η_{j*} = σ_η[w(t_{j+1}) − w(t_{j*})]. We can now calculate E[α_{j*}|y(t_1), …, y(t_n)] and Var[α_{j*}|y(t_1), …, y(t_n)] by routine applications of the Kalman filter and smoother for series with missing observations, as described in §2.7, with a slight modification to allow for unequal observational error variances.

3.10.2 LOCAL LINEAR TREND MODEL

Now let us consider the continuous analogue of the local linear trend model (3.2) for the case where σ²_ξ = 0, so that, in effect, the trend term μ_t is modelled by the relation Δ²μ_{t+2} = ζ_t. For the continuous case, denote the trend by μ(t) and the slope by ν(t) by analogy with (3.2). The natural model for the slope is then dν(t) = σ_ζ dw(t), where w(t) is standard Brownian motion and σ_ζ > 0, which gives

ν(t) = ν(0) + σ_ζ w(t),    0 ≤ t ≤ T.    (3.34)

By analogy with (3.2) with σ²_ξ = 0, the model for the trend level is dμ(t) = ν(t) dt, giving

μ(t) = μ(0) + ∫_0^t ν(s) ds
     = μ(0) + ν(0) t + σ_ζ ∫_0^t w(s) ds.    (3.35)

As before, suppose that y(t) is observed at times t_1 < ⋯ < t_n. Analogously to (3.31), the observation equation for the continuous model is

y(t_i) = μ(t_i) + ε(t_i),    ε(t_i) ∼ N[0, σ²(t_i)],    i = 1, …, n,    (3.36)

and the state equation can be written in the form

d ( μ(t) )     [ 0  1 ] ( μ(t) )          ( 0   )
  ( ν(t) )  =  [ 0  0 ] ( ν(t) ) dt   +   ( σ_ζ ) dw(t).    (3.37)

For maximum likelihood estimation we employ the discrete state space model

y_i = (1  0) α_i + ε_i,      α_{i+1} = [ 1  δ_i ] α_i + η_i,    i = 1, …, n,    (3.38)
                                       [ 0  1   ]


where α_i = (μ_i, ν_i)′ with μ_i = μ(t_i), ν_i = ν(t_i), ε_i = ε(t_i) and δ_i = t_{i+1} − t_i; also η_i = (ξ_i, ζ_i)′ with

ξ_i = σ_ζ ∫_{t_i}^{t_{i+1}} [w(s) − w(t_i)] ds,      ζ_i = σ_ζ [w(t_{i+1}) − w(t_i)],

as can be verified from (3.34) and (3.35). From (3.36), Var(ε_i) = σ²(t_i). Since E[w(s) − w(t_i)] = 0 for t_i ≤ s ≤ t_{i+1}, E(ξ_i) = E(ζ_i) = 0. To calculate Var(ξ_i), approximate ξ_i by the sum

(δ_i/M) Σ_{j=0}^{M−1} [w(t_i + jδ_i/M) − w(t_i)],

where the increments w_j = w(t_i + (j+1)δ_i/M) − w(t_i + jδ_i/M) satisfy w_j ∼ N(0, σ²_ζ δ_i/M) and E(w_j w_k) = 0 (j ≠ k). This has variance

σ²_ζ δ_i³ (1/M) Σ_{j=0}^{M−1} (1 − j/M)²,

which converges to

σ²_ζ δ_i³ / 3

as M → ∞. Also,

E(ξ_i ζ_i) = σ²_ζ ∫_{t_i}^{t_{i+1}} E[{w(s) − w(t_i)}{w(t_{i+1}) − w(t_i)}] ds = σ²_ζ ∫_0^{δ_i} x dx = σ²_ζ δ_i²/2,

and E(ζ_i²) = σ²_ζ δ_i. Thus the variance matrix of the disturbance term in the state equation (3.38) is

Q_i = Var(η_i) = σ²_ζ δ_i [ δ_i²/3   δ_i/2 ]
                          [ δ_i/2    1     ].    (3.39)

The loglikelihood is then calculated by means of the Kalman filter as in §7.3. As with model (3.31), adjustments such as (3.33) can also be introduced into model (3.36) and (3.37) in order to estimate the conditional mean and variance matrix of the state vector [μ(t), ν(t)]′ at values of t other than t_1, …, t_n.
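A small sketch (our own code, not the book's) assembling the exact discrete-time system (3.38)-(3.39) for a given gap δ_i:

```python
# A sketch of the transition matrix and disturbance variance (3.38)-(3.39).
import numpy as np

def trend_step(delta, sigma2_zeta):
    T = np.array([[1.0, delta],
                  [0.0, 1.0]])
    Q = sigma2_zeta * delta * np.array([[delta**2 / 3.0, delta / 2.0],
                                        [delta / 2.0,    1.0]])
    return T, Q

T1, Q1 = trend_step(delta=0.5, sigma2_zeta=1.0)   # illustrative values
```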

Chapter 9 of Harvey (1989) may be consulted for extensions to more general models.


3.11 Spline smoothing

3.11.1 SPLINE SMOOTHING IN DISCRETE TIME

Suppose we have a univariate series y_1, …, y_n of values which are equispaced in time and we wish to approximate the series by a relatively smooth function μ(t). A standard approach is to choose μ(t) by minimising

Σ_{t=1}^{n} [y_t − μ(t)]² + λ Σ_{t=1}^{n} [Δ²μ(t)]²    (3.40)

with respect to μ(t) for given λ > 0. It is important to note that we are considering μ(t) here to be a discrete function of t at time points t = 1, …, n, in contrast to the situation considered in the next section where μ(t) is a continuous function of time. If λ is small, the values of μ(t) will be close to the y_t's but μ(t) may not be smooth enough. If λ is large the μ(t) series will be smooth but the values of μ(t) may not be close enough to the y_t's. The function μ(t) is called a spline. Reviews of methods related to this idea are given in Silverman (1985), Wahba (1990) and Green and Silverman (1994). Note that in this book we usually take t as the time index but it can also refer to other sequentially ordered measures such as temperature, earnings and speed.

Let us now consider this problem from a state space standpoint. Let α_t = μ(t) for t = 1, …, n and assume that y_t and α_t obey the state space model

y_t = α_t + ε_t,      Δ²α_t = ζ_t,    t = 1, …, n,    (3.41)

where Var(ε_t) = σ²_ε and Var(ζ_t) = σ²_ε/λ with λ > 0. We observe that the second equation of (3.41) is one of the smooth models for trend considered in §3.2. For simplicity suppose that α_{−1} and α_0 are fixed and known. The log of the joint density of α_1, …, α_n, y_1, …, y_n is then, apart from irrelevant constants,

−(1/2σ²_ε) { Σ_{t=1}^{n} (y_t − α_t)² + λ Σ_{t=1}^{n} (Δ²α_t)² }.    (3.42)

Now suppose that our objective is to smooth the y_t series by estimating α_t by α̂_t = E(α_t|Y_n). We shall employ a technique that we shall use extensively later, so we state it in general terms. Suppose α = (α′_1, …, α′_n)′ and y = (y′_1, …, y′_n)′ are jointly normally distributed stacked vectors with density p(α, y) and we wish to calculate α̂ = E(α|y). Then α̂ is the solution of the equations

∂ log p(α, y) / ∂α = 0.

This follows since log p(α|y) = log p(α, y) − log p(y), so ∂ log p(α|y)/∂α = ∂ log p(α, y)/∂α. Now the solution of the equations ∂ log p(α|y)/∂α = 0 is the mode of the density p(α|y), and since the density is normal the mode is equal to the mean vector α̂. The conclusion follows. Since p(α|y) is the conditional distribution of α given y, we call this technique conditional mode estimation of α_1, …, α_n.


Applying this technique to (3.42), we see that α̂_1, …, α̂_n can be obtained by minimising

Σ_{t=1}^{n} (y_t − α_t)² + λ Σ_{t=1}^{n} (Δ²α_t)².

Comparing this with (3.40), and ignoring for the moment the initialisation question, we see that the spline smoothing problem can be solved by finding E(α_t|Y_n) for model (3.41). This is achieved by a standard extension of the smoothing technique of §2.4 that will be given in §4.3. It follows that state space techniques can be used for spline smoothing. Treatments along these lines have been given by Kohn, Ansley and Wong (1992). This approach has the advantage that the models can be extended to include extra features such as explanatory variables, calendar variations and intervention effects in the ways indicated earlier in this chapter; moreover, unknown quantities, for example λ in (3.40), can be estimated by maximum likelihood using methods that we shall describe in Chapter 7.

3.11.2 SPLINE SMOOTHING IN CONTINUOUS TIME

Let us now consider the smoothing problem where the observation y(t) is a continuous function of time t for t in an interval which for simplicity we take to be 0 ≤ t ≤ T. Suppose that we wish to smooth y(t) by a function μ(t) given a sample of values y(t_i) for i = 1, …, n, where 0 < t_1 < ⋯ < t_n < T. A traditional approach to the problem is to choose μ(t) to be the twice-differentiable function on (0, T) which minimises

Σ_{i=1}^{n} [y(t_i) − μ(t_i)]² + λ ∫_0^T [μ″(t)]² dt    (3.43)

for given λ > 0. We observe that (3.43) is the analogue in continuous time of (3.40) in discrete time. This is a well-known problem, a standard treatment of which is presented in Chapter 2 of Green and Silverman (1994). Their approach is to show that the resulting μ(t) must be a cubic spline, which is defined as a cubic polynomial function in t between each pair of time points t_i, t_{i+1} for i = 0, 1, …, n, with t_0 = 0 and t_{n+1} = T, such that μ(t) and its first two derivatives are continuous at each t_i for i = 1, …, n. The properties of the cubic spline are then used to solve the minimisation problem. In contrast, we shall present a solution based on a continuous time state space model of the kind considered in §3.10.

We begin by adopting a model for μ(t) in the form (3.35), which for convenience we reproduce here as

μ(t) = μ(0) + ν(0) t + σ_ζ ∫_0^t w(s) ds,    0 ≤ t ≤ T.    (3.44)

This is a natural model to consider since it is the simplest model in continuous time for a trend with smoothly varying slope. As the observation equation we take

y(t_i) = μ(t_i) + ε_i,    ε_i ∼ N(0, σ²_ε),    i = 1, …, n,


where the ε_i's are independent of each other and of w(t) for 0 ≤ t ≤ T. We have taken Var(ε_i) to be constant since this is a reasonable assumption for many smoothing problems, and also since it leads to the same solution to the problem of minimising (3.43) as the Green-Silverman approach.

Since μ(0) and ν(0) are normally unknown, we represent them by diffuse priors. On these assumptions, Wahba (1978) has shown that on taking

λ = σ²_ε / σ²_ζ,

the conditional mean μ̂(t) of μ(t) defined by (3.44), given the observations y(t_1), …, y(t_n), is the solution to the problem of minimising (3.43) with respect to μ(t). We shall not give details of the proof here but will instead refer to discussions of the result by Wecker and Ansley (1983, §2) and Green and Silverman (1994, §3.8.3). The result is important since it enables problems in spline smoothing to be solved by state space methods. We note that Wahba and Wecker and Ansley in the papers cited consider the more general problem in which the second term of (3.43) is replaced by the more general form

λ ∫_0^T [μ^{(m)}(t)]² dt

for m = 2, 3, …, where μ^{(m)}(t) denotes the mth derivative of μ(t). We have reduced the problem of minimising (3.43) to the treatment of a special case of the state space model (3.36) and (3.37) in which σ²(t_i) = σ²_ε for all i. We can therefore compute μ̂(t) and Var[μ(t)|y(t_1), …, y(t_n)] by routine Kalman filtering and smoothing. We can also compute the loglikelihood and, consequently, estimate λ by maximum likelihood; this can be done efficiently by concentrating out σ²_ε by a straightforward extension of the method described in §2.10.2 and then maximising the concentrated loglikelihood with respect to λ in a one-dimensional search. The implication of these results is that the flexibility and computational power of state space methods can be employed to solve problems in spline smoothing.


Filtering, smoothing and forecasting

4.1 Introduction

In this chapter and the following three chapters we provide a general treatment from the standpoint of classical inference of the linear Gaussian state space model (3.1). The observations y_t will be treated as multivariate. For much of the theory, the development is a straightforward extension to the general case of the treatment of the simple local level model in Chapter 2. In this chapter we will consider filtering, smoothing, simulation, missing observations and forecasting. Filtering is aimed at updating our knowledge of the system as each observation y_t comes in. Smoothing enables us to base our estimates of quantities of interest on the entire sample y_1, …, y_n. As with the local level model, we shall show that when state space models are used, it is easy to allow for missing observations. Forecasting is of special importance in many applications of time series analysis; we will demonstrate that by using our approach, we can obtain the required forecasting results merely by treating the future values y_{n+1}, y_{n+2}, … as missing values. Chapter 5 will discuss the initialisation of the Kalman filter in cases where some or all of the elements of the initial state vector are unknown or have unknown distributions. Chapter 6 will discuss various computational aspects of filtering and smoothing. In the classical approach to inference, the parameter vector ψ is regarded as fixed but unknown. In Chapter 7 we will consider the estimation of ψ by maximum likelihood together with the related questions of goodness of fit of the model and diagnostic checking. In the Bayesian approach to inference, ψ is treated as a random vector which has a specified prior density or a non-informative prior. We will show how the linear Gaussian state space model can be dealt with from the Bayesian standpoint by simulation methods in Chapter 8. The use in practice of the techniques presented in these four chapters is illustrated in detail in Chapter 9 by describing their application to five real time series.

Denote the set of observations y_1, …, y_t by Y_t. In §4.2 we will derive the Kalman filter, which is a recursion for calculating a_{t+1} = E(α_{t+1}|Y_t) and P_{t+1} = Var(α_{t+1}|Y_t) given a_t and P_t. The derivation requires only some elementary properties of multivariate normal regression theory. In the following section we investigate some properties of one-step forecast errors which we shall need for subsequent


work. In §4.3 we use the output of the Kalman filter and the properties of forecast errors to obtain recursions for smoothing the series, that is, calculating the conditional mean and variance matrix of α_t given all the observations y_1, …, y_n for t = 1, …, n. Estimates of the disturbance vectors ε_t and η_t and their error variance matrices given all the data are investigated in §4.4. The weights associated with filtered and smoothed estimates of functions of the state and disturbance vectors are discussed in §4.6. Section 4.7 describes how to generate random samples for purposes of simulation from the smoothed density of the state and disturbance vectors given the observations. The problem of missing observations is considered in §4.8 where we show that with the state space approach the matter is easily dealt with by means of simple modifications of the Kalman filter and the smoothing recursions. Section 4.9 discusses forecasting by using the results of §4.8. A comment on varying dimensions of the observation vector is given in §4.10. Finally, in §4.11 we consider a general matrix formulation of the state space model.

4.2 Filtering

4.2.1 DERIVATION OF KALMAN FILTER

For convenience we restate the linear Gaussian state space model (3.1) here as

y_t = Z_t α_t + ε_t,      ε_t ∼ N(0, H_t),
α_{t+1} = T_t α_t + R_t η_t,      η_t ∼ N(0, Q_t),    t = 1, …, n,    (4.1)
α_1 ∼ N(a_1, P_1),

where details are given below (3.1). Let Y_{t−1} denote the set of past observations y_1, …, y_{t−1}. Starting at t = 1 and building up the distributions of α_t and y_t recursively, it is easy to show that p(y_t|α_t, Y_{t−1}) = p(y_t|α_t) and p(α_{t+1}|α_1, …, α_t, Y_t) = p(α_{t+1}|α_t). In Table 4.1 we give the dimensions of the vectors and matrices of the state space model.

In this section we derive the Kalman filter for model (4.1) for the case where the initial state α_1 is N(a_1, P_1) with a_1 and P_1 known. Our object is to obtain the conditional distribution of α_{t+1} given Y_t for t = 1, …, n where Y_t = {y_1, …, y_t}. Since all distributions are normal, conditional distributions of subsets of variables given other subsets of variables are also normal; the required

Table 4.1. Dimensions of state space model (4.1).

    Vector           Matrix
    y_t    p × 1     Z_t    p × m
    α_t    m × 1     T_t    m × m
    ε_t    p × 1     H_t    p × p
    η_t    r × 1     R_t    m × r
                     Q_t    r × r
    a_1    m × 1     P_1    m × m


distribution is therefore determined by a knowledge of a_{t+1} = E(α_{t+1}|Y_t) and P_{t+1} = Var(α_{t+1}|Y_t). Assume that α_t given Y_{t−1} is N(a_t, P_t). We now show how to calculate a_{t+1} and P_{t+1} from a_t and P_t recursively.

Since α_{t+1} = T_t α_t + R_t η_t, we have

a_{t+1} = E(T_t α_t + R_t η_t | Y_t) = T_t E(α_t|Y_t),    (4.2)

P_{t+1} = Var(T_t α_t + R_t η_t | Y_t)
        = T_t Var(α_t|Y_t) T′_t + R_t Q_t R′_t,    (4.3)

for t = 1, …, n. Let

v_t = y_t − E(y_t|Y_{t−1}) = y_t − E(Z_t α_t + ε_t|Y_{t−1}) = y_t − Z_t a_t.    (4.4)

Then v_t is the one-step forecast error of y_t given Y_{t−1}. When Y_{t−1} and v_t are fixed then Y_t is fixed, and vice versa. Thus E(α_t|Y_t) = E(α_t|Y_{t−1}, v_t). But E(v_t|Y_{t−1}) = E(y_t − Z_t a_t|Y_{t−1}) = E(Z_t α_t + ε_t − Z_t a_t|Y_{t−1}) = 0. Consequently, E(v_t) = 0 and Cov(y_j, v_t) = E[y_j E(v′_t|Y_{t−1})] = 0 for j = 1, …, t − 1. By (2.49) in the regression lemma in §2.13 we therefore have

E(α_t|Y_t) = E(α_t|Y_{t−1}, v_t)
           = E(α_t|Y_{t−1}) + Cov(α_t, v_t)[Var(v_t)]^{−1} v_t
           = a_t + M_t F_t^{−1} v_t,    (4.5)

where M_t = Cov(α_t, v_t), F_t = Var(v_t) and E(α_t|Y_{t−1}) = a_t by definition of a_t. Here,

M_t = Cov(α_t, v_t) = E[E{α_t (Z_t α_t + ε_t − Z_t a_t)′ | Y_{t−1}}]
    = E[E{α_t (α_t − a_t)′ Z′_t | Y_{t−1}}] = P_t Z′_t,    (4.6)

and

F_t = Var(Z_t α_t + ε_t − Z_t a_t) = Z_t P_t Z′_t + H_t.    (4.7)

We assume that F_t is nonsingular; this assumption is normally valid in well-formulated models, but in any case it is relaxed in §6.4. Substituting in (4.2) and (4.5) gives

a_{t+1} = T_t a_t + T_t M_t F_t^{−1} v_t
        = T_t a_t + K_t v_t,    t = 1, …, n,    (4.8)

with

K_t = T_t M_t F_t^{−1} = T_t P_t Z′_t F_t^{−1}.    (4.9)

We observe that a_{t+1} has been obtained as a linear function of the previous value a_t and of v_t, the forecast error of y_t given Y_{t−1}.


By (2.50) of the regression lemma in §2.13 we have

Var(α_t|Y_t) = Var(α_t|Y_{t−1}, v_t)
             = Var(α_t|Y_{t−1}) − Cov(α_t, v_t)[Var(v_t)]^{−1} Cov(α_t, v_t)′
             = P_t − M_t F_t^{−1} M′_t
             = P_t − P_t Z′_t F_t^{−1} Z_t P_t.    (4.10)

Substituting in (4.3) gives

P_{t+1} = T_t P_t L′_t + R_t Q_t R′_t,    (4.11)

where

L_t = T_t − K_t Z_t.    (4.12)

The recursions (4.8) and (4.11) constitute the celebrated Kalman filter for model (4.1). They enable us to update our knowledge of the system each time a new observation comes in. It is noteworthy that we have derived these recursions by simple applications of standard results of multivariate normal regression theory. The key advantage of the recursions is that we do not have to invert a (pt × pt) matrix to fit the model each time the tth observation comes in for t = 1, …, n; we only have to invert the (p × p) matrix F_t, and p is generally much smaller than n; indeed, in the most important case in practice, p = 1. Although relations (4.8) and (4.11) constitute the forms in which the multivariate Kalman filter recursions are usually presented, we shall show in §6.4 that variants of them in which elements of the observational vector y_t are brought in one at a time, rather than the entire vector y_t, are in general computationally superior.

It can be shown that when the observations are not normally distributed and we restrict attention to estimates which are linear in the y_t's, and when also the matrices Z_t and T_t do not depend on previous y's, then under appropriate assumptions the value of a_{t+1} given by the filter minimises the mean square error of the estimate of each component of α_{t+1}. See, for example, Duncan and Horn (1972) or Anderson and Moore (1979) for details of this approach. A similar result holds for arbitrary linear functions of the elements of the α_t's. This emphasises the point that although our results are obtained under the assumption of normality, they have a wider validity in the sense of minimum mean square errors when the variables involved are not normally distributed.

4.2.2 KALMAN FILTER RECURSION

For convenience we collect together the filtering equations

v_t = y_t − Z_t a_t,            F_t = Z_t P_t Z′_t + H_t,
K_t = T_t P_t Z′_t F_t^{−1},    L_t = T_t − K_t Z_t,
a_{t+1} = T_t a_t + K_t v_t,    P_{t+1} = T_t P_t L′_t + R_t Q_t R′_t,

for t = 1, …, n,    (4.13)


Table 4.2. Dimensions of Kalman filter.

    Vector              Matrix
    v_t      p × 1      F_t      p × p
                        K_t      m × p
                        L_t      m × m
                        M_t      m × p
    a_t      m × 1      P_t      m × m
    a_{t|t}  m × 1      P_{t|t}  m × m

with a_1 and P_1 as the mean vector and variance matrix of the initial state vector α_1, respectively. The recursion (4.13) is called the Kalman filter.

The so-called contemporaneous filtering equations incorporate the computation of the state vector estimator E(α_t|Y_t) and its associated error variance matrix, which we denote by a_{t|t} and P_{t|t}, respectively. These equations are just a reformulation of the Kalman filter and are given by

v_t = y_t − Z_t a_t,      F_t = Z_t P_t Z′_t + H_t,      M_t = P_t Z′_t,
a_{t|t} = a_t + M_t F_t^{−1} v_t,      P_{t|t} = P_t − M_t F_t^{−1} M′_t,    (4.14)
a_{t+1} = T_t a_{t|t},      P_{t+1} = T_t P_{t|t} T′_t + R_t Q_t R′_t,

for given a_1 and P_1. In Table 4.2 we give the dimensions of the vectors and matrices of the Kalman filter equations.

4.2.3 STEADY STATE

When dealing with a time-invariant state space model in which the system matrices are constant over time, the Kalman recursion for P_{t+1} converges to a constant matrix P̄ which is the solution of the matrix equation

P̄ = T P̄ T′ − T P̄ Z′ F^{−1} Z P̄ T′ + R Q R′,

where F = Z P̄ Z′ + H. The solution that is reached after convergence to P̄ is referred to as the steady state solution of the Kalman filter. Use of the steady state after convergence leads to considerable computational savings because the computations for F_t, K_t and P_{t+1} are no longer required.
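One simple way to compute P̄ numerically is to iterate the P_{t+1} recursion until it stops changing; the sketch below is ours, with arbitrary tolerance and iteration cap:

```python
# A sketch iterating the time-invariant Riccati recursion to convergence.
import numpy as np

def steady_state_P(Z, H, T, R, Q, P0, tol=1e-10, max_iter=10_000):
    P = P0.copy()
    for _ in range(max_iter):
        K = T @ P @ Z.T @ np.linalg.inv(Z @ P @ Z.T + H)
        P_new = T @ P @ (T - K @ Z).T + R @ Q @ R.T
        if np.max(np.abs(P_new - P)) < tol:
            break
        P = P_new
    return P_new
```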

4.2.4 STATE ESTIMATION ERRORS AND FORECAST ERRORS

Define the state estimation error as

x_t = α_t − a_t,      with Var(x_t) = P_t,    (4.15)

as for the local level model in §2.3.2. We now investigate how these errors are related to each other and to the one-step forecast errors v_t = y_t − E(y_t|Y_{t−1}) = y_t − Z_t a_t. Since v_t is the part of y_t that cannot be predicted from the past, we shall sometimes refer to the v_t's as innovations. It follows immediately from the Kalman


filter relations and the definition of x_t that

v_t = y_t − Z_t a_t
    = Z_t α_t + ε_t − Z_t a_t
    = Z_t x_t + ε_t,    (4.16)

and

x_{t+1} = α_{t+1} − a_{t+1}
        = T_t α_t + R_t η_t − T_t a_t − K_t v_t
        = T_t x_t + R_t η_t − K_t Z_t x_t − K_t ε_t
        = L_t x_t + R_t η_t − K_t ε_t,    (4.17)

where we note that these recursions are similar to (2.18) for the local level model in Chapter 2. Analogously to the state space relations

y_t = Z_t α_t + ε_t,      α_{t+1} = T_t α_t + R_t η_t,

we obtain the innovation analogue of the state space model, that is,

v_t = Z_t x_t + ε_t,      x_{t+1} = L_t x_t + R_t η_t − K_t ε_t,    (4.18)

with x_1 = α_1 − a_1, for t = 1, …, n. The recursion for P_{t+1} can be derived more easily than in §4.2.1 by the steps

P_{t+1} = Var(x_{t+1}) = E[(α_{t+1} − a_{t+1}) x′_{t+1}] = E(α_{t+1} x′_{t+1})
        = E[(T_t α_t + R_t η_t)(L_t x_t + R_t η_t − K_t ε_t)′]
        = T_t P_t L′_t + R_t Q_t R′_t,

since Cov(x_t, η_t) = 0. Relations (4.18) will be used for deriving the smoothing recursions in the next section.

We finally show that the forecast errors are independent of each other, using the same arguments as in §2.3.1. The joint density of the observational vectors y_1, …, y_n is

p(y_1, …, y_n) = p(y_1) Π_{t=2}^{n} p(y_t|Y_{t−1}).

Transforming from y_t to v_t = y_t − Z_t a_t, we have

p(v_1, …, v_n) = Π_{t=1}^{n} p(v_t),

since p(y_1) = p(v_1) and the Jacobian of the transformation is unity because each v_t is y_t minus a linear function of y_1, …, y_{t−1} for t = 2, …, n. Consequently v_1, …, v_n are independent of each other, from which it also follows that v_t, …, v_n are independent of Y_{t−1}.


4.3 State smoothing

We now consider the estimation of α_t given the entire series y_1, …, y_n. Let us denote the stacked vector (y′_1, …, y′_n)′ by y; thus y is Y_n represented as a vector. We shall estimate α_t by its conditional mean α̂_t = E(α_t|y) and we shall also calculate the error variance matrix V_t = Var(α_t − α̂_t) = Var(α_t|y) for t = 1, …, n. Our approach is to construct recursions for α̂_t and V_t on the assumption that α_1 ∼ N(a_1, P_1) where a_1 and P_1 are known, deferring consideration of the case where a_1 and P_1 are unknown until Chapter 5. We emphasise the point that the derivations are elementary since the only theoretical background they require is the regression lemma in §2.13.

4.3.1 SMOOTHED STATE VECTOR

The vector y is fixed when Y_{t−1} and v_t, …, v_n are fixed. By (2.49) in the regression lemma in §2.13 and the fact that v_t, …, v_n are independent of Y_{t−1} and of each other with zero means, we therefore have

α̂_t = E(α_t|y) = E(α_t|Y_{t−1}, v_t, …, v_n)
    = a_t + Σ_{j=t}^{n} Cov(α_t, v_j) F_j^{−1} v_j,    (4.19)

for t = 1, …, n, with Cov(α_t, v_j) = E(α_t v′_j). It follows from (4.18) that

E(α_t v′_j) = E[α_t (Z_j x_j + ε_j)′] = E(α_t x′_j) Z′_j,    j = t, …, n.    (4.20)

Moreover,

E(α_t x′_t) = E[E{α_t x′_t | Y_{t−1}}] = E[E{α_t (α_t − a_t)′ | Y_{t−1}}] = P_t,
E(α_t x′_{t+1}) = E[E{α_t (L_t x_t + R_t η_t − K_t ε_t)′ | Y_{t−1}}] = P_t L′_t,
E(α_t x′_{t+2}) = P_t L′_t L′_{t+1},    …,
E(α_t x′_n) = P_t L′_t L′_{t+1} ⋯ L′_{n−1}.    (4.21)

Note that here and elsewhere we interpret L′_t ⋯ L′_{n−1} as I_m when t = n and as L′_{n−1} when t = n − 1. Substituting into (4.19) gives

α̂_n = a_n + P_n Z′_n F_n^{−1} v_n,
α̂_{n−1} = a_{n−1} + P_{n−1} Z′_{n−1} F_{n−1}^{−1} v_{n−1} + P_{n−1} L′_{n−1} Z′_n F_n^{−1} v_n,
α̂_t = a_t + P_t Z′_t F_t^{−1} v_t + P_t L′_t Z′_{t+1} F_{t+1}^{−1} v_{t+1} + ⋯ + P_t L′_t ⋯ L′_{n−1} Z′_n F_n^{−1} v_n,

for t = n − 2, n − 3, …, 1. We can express the smoothed state vector as

α̂_t = a_t + P_t r_{t−1},    (4.22)


where r_{n−1} = Z′_n F_n^{−1} v_n, r_{n−2} = Z′_{n−1} F_{n−1}^{−1} v_{n−1} + L′_{n−1} Z′_n F_n^{−1} v_n, and

r_{t−1} = Z′_t F_t^{−1} v_t + L′_t Z′_{t+1} F_{t+1}^{−1} v_{t+1} + ⋯ + L′_t L′_{t+1} ⋯ L′_{n−1} Z′_n F_n^{−1} v_n,    (4.23)

for t = n − 2, n − 3, …, 1. The vector r_{t−1} is a weighted sum of the innovations v_j occurring after time t − 1, that is, for j = t, …, n. The value at time t is

r_t = Z′_{t+1} F_{t+1}^{−1} v_{t+1} + L′_{t+1} Z′_{t+2} F_{t+2}^{−1} v_{t+2} + ⋯ + L′_{t+1} ⋯ L′_{n−1} Z′_n F_n^{−1} v_n,    (4.24)

and r_n = 0 since no innovations are available after time n. Substituting (4.24) into (4.23) we obtain the backwards recursion

r_{t−1} = Z′_t F_t^{−1} v_t + L′_t r_t,    t = n, …, 1,    (4.25)

with /"„ = 0. Collecting these results together gives the recursion for state smoothing,

r,_i = Z'tF~lvt + Vtrt, &t = at + Ptrt-U t = n,..., 1, (4.26)

with rn — 0; this provides an efficient algorithm for calculating « i , . . . , an. The smoother, together with the recursion for computing the variance matrix of the smoothed state vector which we present in §4.3.2, is sometimes referred to as the fixed interval smoother and was proposed in the forms (4.26) and (4.31) below by de Jong (1988a), de Jong (1989) and Kohn and Ansley (1989) although the earlier treatments in the engineering literature by Bryson and Ho (1969) and Young (1984) are similar.

Alternative algorithms for state smoothing have also been proposed. For example, Anderson and Moore (1979) present the so-called classical fixed interval smoother, which for our state space model is given by

α̂_t = a_{t|t} + P_{t|t} T′_t P_{t+1}^{−1} (α̂_{t+1} − a_{t+1}),    t = n, …, 1,    (4.27)

where

a_{t|t} = E(α_t|Y_t) = a_t + P_t Z′_t F_t^{−1} v_t,      P_{t|t} = Var(α_t|Y_t) = P_t − P_t Z′_t F_t^{−1} Z_t P_t;

see equations (4.5) and (4.10). Notice that T_t P_{t|t} = L_t P_t. Following Koopman (1998), we now show that (4.26) can be derived from

(4.27). Substituting for a_{t|t} and T_t P_{t|t} into (4.27) we have

α̂_t = a_t + P_t Z′_t F_t^{−1} v_t + P_t L′_t P_{t+1}^{−1} (α̂_{t+1} − a_{t+1}).

By defining r_t = P_{t+1}^{−1} (α̂_{t+1} − a_{t+1}) and re-ordering the terms, we obtain

P_t^{−1} (α̂_t − a_t) = Z′_t F_t^{−1} v_t + L′_t P_{t+1}^{−1} (α̂_{t+1} − a_{t+1}),

and hence

r_{t−1} = Z′_t F_t^{−1} v_t + L′_t r_t,

which is (4.25). Note that the alternative definition of r_t also implies that r_n = 0. Finally, it follows immediately from the definitional relation r_{t−1} = P_t^{−1}(α̂_t − a_t) that α̂_t = a_t + P_t r_{t−1}.


A comparison of the two algorithms shows that the Anderson and Moore smoother requires inversion of n − 1 possibly large matrices P_t, whereas the smoother (4.26) requires no inversion other than that of F_t, which has already been inverted during the Kalman filter. This is a considerable advantage for large models. For both smoothers the Kalman filter vector a_t and matrix P_t need to be stored together with v_t, F_t^{−1} and K_t, for t = 1, …, n. The state smoothing equation of Koopman (1993), which we consider in §4.4.2, does not involve a_t and P_t and it therefore leads to further computational savings.

4.3.2 SMOOTHED STATE VARIANCE MATRIX

A recursion for calculating V_t = Var(α_t|y) will now be derived. Using (2.50) in the regression lemma in §2.13 with x = α_t, y = (y′_1, …, y′_{t−1})′ and z = (v′_t, …, v′_n)′, we obtain

V_t = Var(α_t|Y_{t−1}, v_t, …, v_n) = P_t − Σ_{j=t}^{n} Cov(α_t, v_j) F_j^{−1} Cov(α_t, v_j)′,

since v_t, …, v_n are independent of each other and of Y_{t−1} with zero means. Using (4.20) and (4.21) we obtain immediately

V_t = P_t − P_t Z′_t F_t^{−1} Z_t P_t − P_t L′_t Z′_{t+1} F_{t+1}^{−1} Z_{t+1} L_t P_t
      − ⋯ − P_t L′_t ⋯ L′_{n−1} Z′_n F_n^{−1} Z_n L_{n−1} ⋯ L_t P_t
    = P_t − P_t N_{t−1} P_t,

where

N_{t−1} = Z′_t F_t^{−1} Z_t + L′_t Z′_{t+1} F_{t+1}^{−1} Z_{t+1} L_t + ⋯ + L′_t ⋯ L′_{n−1} Z′_n F_n^{−1} Z_n L_{n−1} ⋯ L_t.    (4.28)

We note that here, as in the previous subsection, we interpret L′_t ⋯ L′_{n−1} as I_m when t = n and as L′_{n−1} when t = n − 1. The value at time t is

N_t = Z′_{t+1} F_{t+1}^{−1} Z_{t+1} + L′_{t+1} Z′_{t+2} F_{t+2}^{−1} Z_{t+2} L_{t+1} + ⋯
      + L′_{t+1} ⋯ L′_{n−1} Z′_n F_n^{−1} Z_n L_{n−1} ⋯ L_{t+1}.    (4.29)

Substituting (4.29) into (4.28) we obtain the backwards recursion

N_{t−1} = Z′_t F_t^{−1} Z_t + L′_t N_t L_t,    t = n, …, 1.    (4.30)

Noting from (4.29) that N_{n−1} = Z′_n F_n^{−1} Z_n, we deduce that recursion (4.30) is initialised with N_n = 0. Collecting these results, we find that V_t can be efficiently calculated by the recursion

N_{t−1} = Z′_t F_t^{−1} Z_t + L′_t N_t L_t,      V_t = P_t − P_t N_{t−1} P_t,    t = n, …, 1,    (4.31)


Table 4.3. Dimensions of smoothing recursions of §§4.3.3 and 4.4.4.

    Vector           Matrix
    r_t    m × 1     N_t    m × m
    α̂_t   m × 1     V_t    m × m
    u_t    p × 1     D_t    p × p
    ε̂_t   p × 1
    η̂_t   r × 1

with N_n = 0. Since v_{t+1}, …, v_n are independent, it follows from (4.24) and (4.29) that N_t = Var(r_t).

4.3.3 STATE SMOOTHING RECURSION

For convenience we collect together the smoothing equations for the state vector,

r_{t−1} = Z′_t F_t^{−1} v_t,  + L′_t r_t,      N_{t−1} = Z′_t F_t^{−1} Z_t + L′_t N_t L_t,
α̂_t = a_t + P_t r_{t−1},      V_t = P_t − P_t N_{t−1} P_t,    (4.32)

for t = n, …, 1, initialised with r_n = 0 and N_n = 0. We refer to these collectively as the state smoothing recursion. Taken together, the recursions (4.13) and (4.32) will be referred to as the Kalman filter and smoother. We see that the way the filtering and smoothing is performed is that we proceed forwards through the series using (4.13) and backwards through the series using (4.32) to obtain α̂_t and V_t for t = 1, …, n. During the forwards pass we need to store the quantities v_t, F_t, K_t, a_t and P_t for t = 1, …, n. Alternatively we can store a_t and P_t only and re-calculate v_t, F_t and K_t from a_t and P_t, but this is usually not done since the dimensions of v_t, F_t and K_t are usually small relative to those of a_t and P_t, so the extra storage required is small. In Table 4.3 we present the dimensions of the vectors and matrices of the smoothing equations of this section and §4.4.4.
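A sketch (ours) of the backwards pass (4.32), consuming the output stored by the Kalman filter sketch given after (4.13):

```python
# A sketch of the state smoothing recursion (4.32).
import numpy as np

def state_smoother(Z, T, flt):
    n, m = flt["a"].shape
    r, N = np.zeros(m), np.zeros((m, m))
    alpha_hat = np.empty((n, m))
    V = np.empty((n, m, m))
    for t in range(n - 1, -1, -1):
        Finv, v = flt["Finv"][t], flt["v"][t]
        L = T - flt["K"][t] @ Z
        r = Z.T @ Finv @ v + L.T @ r           # r_{t-1}
        N = Z.T @ Finv @ Z + L.T @ N @ L       # N_{t-1}
        alpha_hat[t] = flt["a"][t] + flt["P"][t] @ r
        V[t] = flt["P"][t] - flt["P"][t] @ N @ flt["P"][t]
    return alpha_hat, V
```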

4.4 Disturbance smoothing

In this section we will derive recursions for computing the smoothed estimates s, ~ E(ef|y) and r)t = E(^|y) of the disturbance vectors st and r)t given all the observations y i , . . . , y„. These estimates have a variety of uses, particularly for parameter estimation and diagnostic checking, as will be indicated in §§7.3

4.4.1 SMOOTHED DISTURBANCES

Let e, = E(e, |y). By (2.49) of the regression lemma at the end of Chapter 2 we have

r,_! - Z'tF~lvt + L\ru N^ - Z[F~lZt + L'tNtLt, at = at + Ptrt~\, Vt = Pt - PtNt-,Pt,

(4.32)

and 7.5.

n st - E(st\Yt-u vt, ...,«„) = £ E ( s ^ F ^ v j , t - 1 , . . . , n, (4.33)

j=t

Page 91: Time Series Analysis by State Space Methods by Durbin and Koopman

74 FILTERING, SMOOTHING AND FORECASTING

since E(st\Yt-\) = 0. It follows from (4.18) that E(e,i>)) = E{stx'^Z) + E with E(s tx' t) = 0 for t — 1 , . . . , n and j — t,..., n. Therefore

with

, _ f Ht, j = t, ^^-{E(stx'^Z'j, j — t -\r \ ,.. • ,n,

E(etx't+l) = -HtK'n

E(etx>+2) = — HtK'tL't+1,

E(Btx'n) = -HtK,tL,

t+i'--L'n_1,

(4.34)

(4.35)

which follow from (4.16) and (4.18), for t = i, ...,n — I. Note that here as elsewhere we interpret L't+1 • • • L'n_x as Im when t = n — 1 and as L'n__l when t — n - 2. Substituting (4.34) into (4.33) leads to

è, - Ht{F~lvt - K'tZ't+lFtl\vt+i - K'tL't+1Z't+2^t+2

- K'tLW 1 • ' • Ln-lZnFnlvn) - Ht(F~lv, K'trt)

— Htut, t = « , . . . , 1, (4.36)

where rt is defined in (4.24) and

ut - F~lvt - K'trt. (4.37)

We refer to the vector ut as the smoothing error. The smoothed estimate of is denoted by fjt — E(r)t |_y) and analogously to

(4.33) we have n

f}t=J2 Eintv'jWj-'vj, t = 1 , . . . , n. (4.38) j=*

The relations (4.18) imply that

[QWt+v J=t +1. E (r]tVj) = I . (4.39)

with

E (!hxft+1)=QtR'tL't+l,

E (ritx't+3) = Q tRtL'nlL' t+2,

E (ntx,H) = QtR'tL't+1-~L'n_lt

(4.40)

Page 92: Time Series Analysis by State Space Methods by Durbin and Koopman

4.4. DISTURBANCE SMOOTHING 75

for t — 1 , . . . , n - 1. Substituting (4.39) into (4.38) and noting that E(r]tv't) = 0 leads to

f j , = QtR'f {Z't+1Ft-\vt+l + L't+lZ<+2Ft^2vi+2 + • • •-f L'(+l • - • L'n^Z'nFnlvn)

where rt is obtained from (4.25). This result is useful as we will show in the next section but it also gives the vector rt the interpretation as the 'scaled' smoothed esti-mator of r}{. Note that in many practical cases the matrix Qt R[ is diagonal or sparse. Equations (4.36) and (4.44) below were first given by de Jong (1988a) and Kohn and Ansley (1989). Equations (4.41) and (4.47) below were given by Koopman (1993).

4.4.2 FAST STATE SMOOTHING

The smoothing recursion for the disturbance vector t], of the transition equation is particularly useful since it leads to a computationally more efficient method of calculating a, for t — 1 , . . . , n than (4.26). Since the state equation is

off+i = Ttat + Rjr]t, it follows immediately that

at+i = Ttoit + Rti)t

which is initialised via the relation (4.22) for t — 1, that is, 6t\ — a\ -f /V'o where r0 is obtained from (4.25). This recursion, due to Koopman (1993), can be used to generate the smoothed states « i , . . . , 6sn by an algorithm different from (4.26); it does not require the storage of at and Pt and it does not involve multiplications by the full matrix Pt, for t — 1 , . . . , n. After the Kalman filter and the storage of vt, F~ l and Kt has taken place, the backwards recursion (4.25) is undertaken and the vector rt is stored for which the storage space of Kt can be used so no additional stor-age space is required. It should be kept in mind that the matrices Tt and Rt Qt R't are usually sparse matrices containing many zero and unity values which makes the ap-plication of (4.42) rapid; this property does not apply to Pt which is a full variance matrix. This approach, however, cannot be used to obtain a recursion for the calculation of Vt — Variety); if V{ is required then (4.26) and (4.31) should be used.

4.4.3 SMOOTHED DISTURBANCE VARIANCE MATRICES

The error variance matrices of the smoothed disturbances are developed by the same approach that we used in §4.3.2 to derive the error variance matrix of the smoothed state vector. Using (2.50) of the regression lemma in §2.13 we have

Var(£,|y) = Var(£,|y;_i, vt,..., vn)

— QtR'trt, t = n,..., 1, (4.41)

Tt&t + RtQtR'trt, i= 1 «, (4.42)

n

Var(£,|y,_,) - J 2 C o v ( £ " ^ V a r ^ - r 1 Covfe, vj)'

n Ht~YL Cov(£<> v J ^ f J ~ 1 C o v ( f " (4.43)

Page 93: Time Series Analysis by State Space Methods by Durbin and Koopman

76 FILTERING, SMOOTHING AND FORECASTING

where Cov(f, ,Vj)~ E(st, i/. ) which is given by (4.34). By substitution we obtain

Var(£f|y) = Ht - Ht (F~l + K'tZ't+lFt~+\Zt+1Kt

~~ K'tL't+lZ't+2Ft±2Zt+2Lt+\Kt - • • •

— KtLt+1 • • • Ln_lZnFn Z„Ln^i • • • L,+i Kt) H;

= Ht-Ht (Ft~l + K'tNtKt) H,

= Ht- HtDtHt, (4.44)

with Dt = F~l + K'tNtKt, (4.45)

where N, is defined in (4.29) and can be obtained from the backwards recursion (4.30).

In a similar way the variance matrix Var(j7, |y) is given by

n Var(^iy) - Varfa,) - £ Covfo,, Vj)FJl Cov(?/f, V j ) \ (4.46)

where Cov(r)t, Vj) = E(r)t, v'j) which is given by (4.39). Substitution gives

Var(rjAy) =Qt QtR't (Z't+1F-+\Zt+l + L!t+lZf

t+2Ft~^Zi+2Lt+l + • - •

Lt+i) RtQt

= Qt - QtKNtRtQt, (4.47)

where N, is obtained from (4.30).

4.4.4 DISTURBANCE SMOOTHING RECURSION

For convenience we collect together the smoothing equations for the disturbance vectors,

et = Ht (Ft lvt - K'tr{) , Varfoly) - Ht - Ht (F'1 + Kf

tNtKt) Hu

% = QtR[rt, Var(iy,Sy) = Qt - QtR[NtRtQt, r,_, - Z'tF~xvt + L'trt, Nt-i = Z[Ft"lZt + L'tNtLu

(4.48)

for t = n,..., 1 where r„ — 0 and Nn — 0. These equations can be reformulated as

s, - Htut, Var(e,|y) = H, - HtDtHt, fit - QtR'trt, Var(^|y) = Q, - QtR'tNtRtQt, ut = F~lvt - K'trt, z>, - F,-' +

rf_i = z;u, + r / r„ Nt_y = z ; A Z , + T ; n , T , - z ; A - ; a t , r , - T ; n , K , Z T ,

Page 94: Time Series Analysis by State Space Methods by Durbin and Koopman

4.5. COVARIANCE MATRICES OF SMOOTHED ESTIMATORS 77

for t = n,..., 1, which are computationally more efficient since they rely directly on the system matrices Zt and Tt which have the property that they usually contain many zeros and ones. We refer to these equations collectively as the disturbance smoothing recursion. The smoothing error ut and vector rt are important in their own rights for a variety of reasons which we will discuss in §7.5. The dimensions of the vectors and matrices of disturbance smoothing are given in Table 4.3.

We see that disturbance smoothing is performed in a similar way to state smoothing: we proceed forwards through the series using (4.13) and backwards through the series using (4.48) to obtain st and fjt together with the corresponding conditional variances for t = I,..., n. The storage requirement for (4.48) during the forwards pass is less than for the state smoothing recursion (4.32) since here we only need vt, Ft and Kt of the Kalman filter. Also, the computations are quicker for disturbance smoothing since they do not involve the vector a, and the matrix Pt which are not sparse.

4.5 Covariance matrices of smoothed estimators

In this section we develop expressions for the covariances between the errors of the smoothed estimators s t , r)t and a t contemporanously and for all leads and lags.

It turns out that the covariances of smoothed estimators rely basically on the cross-expectations E (etr'j), E(r]ttJj) and E(atr'j) for j — t + 1 , . . . , « . To develop these expressions we collect from equations (4.34), (4.35), (4.39), (4.40), (4.21) and (4.20) the results

E(stx't) -0, E(etv't) =Ht, E (£tx'j) = ~HtK'tL't+1 • • • L'._ J, E(etv'j) = E(stx))Z'}, E(r}tx't)=0, E f o , « ; ) = 0 ,

(4.49) E = Efytv'j) = EiritxfiZ'j, E(atx't) - Pt, E(atv't) - PtZ'„ E(atx)) - PtVtL't+l - - - L'j-\, E(atvfj) = E(o^.)Z;.,

for j — t + 1 , . . . , n. For the case j ~ t -f 1, we replace L't+l • • • L't by the identity matrix /,„.

We derive the cross-expectations below using the definitions

n

k=i+1 n

Nj = E L'j+l'"L'k-lZkFkiZkLk-l---Lj+U k=j+1

Page 95: Time Series Analysis by State Space Methods by Durbin and Koopman

78 FILTERING, SMOOTHING AND FORECASTING

which are given by (4.23) and (4.28), respectively. It follows that

E(e,rj) = E(£,U;.+1)Fr+\zy+1 + E(stv'j+2)F^2Zj+2Lj+l + • - • + E(stv'n)F-1ZnLn-.1-'-Lj+1

= -HfK'tL't+1 • • • L'jZ'j^F^Zj+i )L , _ i _ i — • • •

(4.50)

(4.51)

(4.52)

~HtK'tL't+l • • • L'j+lZj+2Fj+2Zj+2Lj+1

~H1KtL't+l • ' ' Ln-\Z'nFnlZnLn-l ' • ' Lj+1 — - fitK'tL't+l • ••L'j_lL'jNjt

E (r}tr'j) = E(r)tv/].+l)F^lZJ+1 + E(rjtvf

]+2)F^2Zj+2Lj+l + • • --\-E(r}tv'n)F~lZnLn~i • • * Lj+i

— QtR'tL't+1 • • • L'jz'j+iFj+izi+i + QtRf

tL'i+1 • • • L'j+lZj+2Fj*2zj+2Lj+1 H

+ QtR'tL't+j • • • L'n_xZ'nF^lZnLn_x •••Lj+i = QtR>tL't+r--L'J_lL'JNj,

E(atr'j) = EiatV'j^Fj-^Zj+i + E(at v'j+2)F7*2 ZH2Lj+i + • • • ZnLn-1 • • ' + l

+ PtL'tL't+l • • • Lfj+lZj+2Fj+2Zj+2Lj+i H -F PTL'TLF

L+I • • • L'^Z^F^ZNLN-I • • • LJ+I

for j — t,..., n. Hence

EferJ) - E(etx't+l)N?+l J,

Hr)tr'j) = E(Vix'i+l)K+ij> (4-53)

E(atr'}) = E(atx'i+l)N*+iJ,

where N*j =Lft--- L'j^L'jNj for j = t n.

The cross-expectations of st, r}t and at between the smoothed estimators

= H A F J l v i ~ K ' j r i ) ' Vj = Q j R ' j r j , ccj xj - Pjrj U

for j — t + 1 , . . . , ft, are given by

E(ete'j) = Eistv'jWj-'Hj - E ^ K j H j ,

EM«,- - aj)f] = Eietx'j) - E(eSj^Pj,

E(n ts ' j) = E ( w ' ^ F j ' H ; - E ( t h r p K j H j ,

Page 96: Time Series Analysis by State Space Methods by Durbin and Koopman

4.5. C0VAR1ANCE MATRICES OF SMOOTHED ESTIMATORS 79

E M J ) = E (ThfyRjQj,

E M a j - «;)'] - E^x'j) - E ( m r ' j - J P j ,

E f a f y = E(atvr-)FJ{ Hj - E(ectrfj)KjHj,

E(atfj'J) = E(atr'j)RjQj,

E[ar((a/ - ajY] = E (a t x f j ) - E(ai^_1)i>J-,

into which the expressions in equations (4.49), (4.50), (4.51) and (4.52) can be substituted.

The covaiiance matrices of the smoothed estimators at different times are derived as follows. We first consider the covariance matrix for the smoothed diturbance vector st, that is, Cov(et — st, Sj — Sj) for t = 1 , . . . , « and j = t + 1 , . . . , n. Since

E i s t (s j - sj)'] = E[E{es(Sj - S,)'|y}] - 0, we have

Cov(£( - et, Sj - Sj) = E[st(sj - Sj)'] = - E (s(s'j)

= HtK'tL't+! • • • L'j_{Z'jFJxHj + HtK'tL't+i • - - L'j^L'jNjKjHj

~ H(K'tL'l+l • • - L'j_{ W'j, where

Wj=Hj(FflZj-KtJNjLj), (4.54)

for j — t -f 1 , . . . , n. In a similar way obtain

Cov(r)t — r)t, rjj — f j j ) — - E ( r j t f j ' j )

= -QtR'tL>+l.--L>j„lLfjNjRjQj,

Co\(at — at, otj — otj) — —E\at(a.j — ctj)']

= PtL'tL't+1 • • • L'j_x - PtL'tL'i+i • • • L'j^Nj-iPj

= PtL'tL't+x---L)_x(I -Nj-yPj),

for j — t -{- 1 The cross-covariance matrices of the smoothed disturbances are obtained as

follows. We have

Co\(s t - st, r/j - f j j ) = E[(st - et)(T)j - fjj)']

= E[et(T)j - fjj)']

= -E(e fn'j)

= HtK'tL,t+r-'L'j_iL,jNjRjQj,

Page 97: Time Series Analysis by State Space Methods by Durbin and Koopman

80 FILTERING, SMOOTHING AND FORECASTING

for j — t, t + 1 , . . . , n, and

CoV(ij, - fjt, sj - sj) - -E(?it&j)

= - Q t K L ' t + \ ' " L ' j - i z ' j F j ~ l H J

+ QtR,tL't+x---L'j_lN'jKjHj

= -QtR'tL't+} •••L'j_iW'j, for j ~ t + 1 , . . . , n.

The cross-covariances between the smoothed state vector and the smoothed disturbances are obtained in a similar way. We have

Cov(a, - at, Sj - Sj) = - E ( a t s ' j )

- -PtL'tL't+3 • • • L'j^Z'jFJ^Hj

+ PtL'tL't+l--.L'j„lN>jKjHj

= -PtL'tL,t+l---L'j_iW'j,

Cov(tt, — at, Tjj — f j j ) — —E(atfi'j)

^-PtLftL't+l---L'j_iLf

jNjRjQj,

for j = t, t -j- 1 , . . . , ny and

Cov(£/ - st,otj -&j) = EfoCay - a,-)'] = -HtK'tL't+x • • • L'j_!

+ HtKltLt

t+1...Ltj_1Nj-1Pj

- - f f f • • • L'j^(l - Nj_xPj), Co\(i)t - f)t,oij - &j) = E[r]t(aj - ay)']

for j — f + 1 , . . . , n.

Table 4.4. Covariances of smoothed estimators for t — 1 , . . . , n.

Ht • - • T' W' j >' Vj tit&tLt+\' • •L^L'jNjRjQj j > t

rj y-t j / lif JV j •••L'j^iln-Nj-xPj) j >t

Vt -QtRiLt+1 • T' W' WJ j > t -QtR'tL't+x ••.L'J^L'jNJRjQJ j >t QtR'tL't+! •• •L'j^In-Nj-yPj) j > t

at —PtL'tL't+1 • • V W' y-1 7 j > f Vj —PtLtLt+l • '•Lj-iL'jNjRjQj j > t OLj PtLtLt+l • • L ' j ^ - N j - r P j ) j > t

Page 98: Time Series Analysis by State Space Methods by Durbin and Koopman

4.6. WEIGHT FUNCTIONS 81

These results here have been developed by de Jong and MacKinnon (1988), who derived the covariances between smoothed state vector estimators, and by Koopman (1993) who derived the covariances between the smoothed disturbance vectors estimators. The results in this section have also been reviewed by de Jong (1998). The auto- and cross-co variance matrices are for convenience collected in Table 4.4.

4.6 Weight functions

4.6.1 INTRODUCTION

Up to this point we have developed recursions for the evaluation of the conditional mean vector and variance matrix of the state vector a t given the observations _Vi,..., ;yf_i (filtering), given the observations yi,...,yt (contemporaneous filtering) and given the observations >>i, . . . , yn (smoothing). We also have developed recursions for the conditional mean vectors and variance matrices of the disturbance vectors st and rjt given the observation y i , . . . , yn. It follows that these conditional means are weighted sums of past (filtering), of past and present (contemporaneous filtering) and of all (smoothing) observations. It is of interest to study these weights to gain a better understanding of the properties of the estimators as is argued in Koopman and Harvey (1999). For example, the weights for the smoothed estimator of a trend component around time t = n/2, that is, at the middle of the series, should be symmetric and centred around t with exponentially declining weights unless specific circumstances require a different pattern. Models which produce weight patterns for the trend components which differ from what is regarded as appropriate should be investigated. In effect, the weights can be regarded as what are known as kernel functions in the field of nonparametric regression; see, for example, Green and Silverman (1994, Chapter 2).

In the case when the state vector contains regression coefficients, the associated weights for the smoothed state vector can be interpreted as leverage statistics as studied in Cook and Weisberg (1982) and Atkinson (1985) in the context of regression models. Such statistics for state space models have been developed with the emphasis on the smoothed signal estimator Ztat by, for example, Kohn and Ansley (1989), de Jong (1989), Harrison and West (1991) and de Jong (1998). Since the concept of leverage is more useful in a regression context, we will refer to the expressions below as weights. Given the results of this chapter so far, it is straightforward to develop the weight expressions.

4.6.2 FILTERING WEIGHTS

It follows from the linear properties of the normal distribution that the filtered estimator of the state vector can be expressed as a weighted vector sum of past observations, that is

t-\

Page 99: Time Series Analysis by State Space Methods by Durbin and Koopman

82 FILTERING, SMOOTHING AND FORECASTING

Table 4.5. Expressions for E(sf£y) with 1 < t < n given (filtering).

St j <t j = t j>t

at U-x • • Lj+iKjHj 0 0 (I -MtF~ lZt)Lt-i" Lj+iKjHj MtF~{Ht 0

Z,at ZtLt-\ •• Lj+iKjHj 0 0 Ztat\t HtF; ~lZtLt-1 • • (I - HtFt~l) Ht 0 Vt —ZtLt-1 • • Lj+iKjHj Ht 0

where cojt is an m x p matrix of weights associated with the estimator at and the 7tli observation. An expression for the weight matrix can be obtained using the fact that

E (atSj) = (ojtE(yje'j) = cojtHj.

Since xt — at — at and E(ate'-) ~ 0, we can use (4.17) to obtain

E (ate'j) = E(xte'j) = L ^ E f o - i ^ ) = L(_iL,_2 ••• KjHj,

which gives

<*>jt = • • • Lj+iKj,

for j = t — 1 , . . . , 1. In a similar way we can obtain the weights associated with other filtering estimators. In Table 4.5 we give a selection of such expressions from which the weights are obtained by disregarding the last matrix Hj. Finally, the expression for weights of Zt at follows since ZtMt = F( — Ht and

Zt (/ - MtF~lZt) = [ I - (Ft - Ht)Ft~l] Zt = HtF~lZt.

4.6.3 SMOOTHING WEIGHTS

The weighting expressions for smoothing estimators can be obtained in a similar way to that used for filtering. For example, the smoothed estimator of the measurement disturbance vector can be expressed as a weighted vector sum of past, current and future observations, that is

n

where coEjt is a p x p matrix of weights associated with the estimator e, and the

j th observation. An expression for the weight matrix can be obtained using the fact that

Page 100: Time Series Analysis by State Space Methods by Durbin and Koopman

4.7. SIMULATION SMOOTHING 83

Table 4.6. Expressions f o r E O ^ ) with 1 < t < n given (smoothing).

if j <t j=t j > t

h -WtLt.i • • • L j + i K j H j HtDtHt —HtK'tLft+i • • •w

fit -QtR'tNtLtLt-i" • Lj+iKjHj QtR'tNtKtHt QtR'tL'(+l • • at (/ - PtNt-OLt-i • • • Lj±iKjHj p*w; P T'T'

WtLt-i•• •Lj+iKjHj (I - HtDt)Ht 11 V1 T ' MtKtLt+i • • -i wi

Expressions for the covariance matrices for smoothed disturbances are developed in §4.5 and they are directly related to the expression for E(ste'j) because

Covfo -et,£j ~£j) = E[(sf - st)s'j] = -E

with 1 < t < n and j = ! , . . . , « . Therefore, no new derivations need to be given here and we only state the results as presented in Table 4.6.

For example, to obtain the weights for the smoothed estimator of a t , we require

E(6cte!j) = -E[(ar - at)e'j) = -E[s,-(af - a,)']' = Cov(fiy - £j, at - &i)\

for j < t. An expression for this latter quantity can be directly obtained from Table 4.4 but notice that the indices j and t of Table 4.4 need to be reversed here. Further,

E(ats'j) — Cov(at — at, £j — sj)

for j > t can also be obtained from Table 4.4. In the same way we can obtain the weights for the smoothed estimators of et and rjt from Table 4.4 as reported in Table 4.6. Finally, the expression for weights of Ztat follows since ZtMt — Ft - Ht and ZtPtL't = HtK'r Hence, by using the equations (4.30), (4.45) and (4.54), we have

Zt(I - PtNt-0 = Zf - ZtPtZ'tF-YZt + ZtPtL'tNtLt

= HtF~lZt + HtK'tNtLt - Wt, ZtPtw; = {ZtPtZ\F;x - ZtPtL'tNtKt)Ht

- [(Ft ~ Ht)F~l - HtK'tNtKt]Ht

= Cl-HtDt)Ht.

4.7 Simulation smoothing

The drawing of samples of state or disturbance vectors conditional on the observations held fixed is called simulation smoothing. Such samples are useful for investigating the performance of techniques of analysis proposed for the linear Gaussian model, and for Bayesian analysis based on this model. The primary purpose of simulation smoothing in this book, however, will be to serve as the basis

Page 101: Time Series Analysis by State Space Methods by Durbin and Koopman

84 FILTERING, SMOOTHING AND FORECASTING

for the importance sampling techniques we shall develop in Part II for dealing with non-Gaussian and nonlinear models from both classical and Bayesian perspectives.

In this section we will show how to draw random samples of the disturbance vectors st and r\t, and the state vector at, for t — 1 , . . . , «, generated by the linear Gaussian model (4.1) conditional on the observed vector y. The resulting algorithm is sometimes called a forwards filtering, backwards sampling algorithm.

Fruhwirth-Schnatter (1994) and Carter and Kohn (1994) independently developed methods for simulation smoothing of the state vector based on the identity

p(au... ,an\y) ~ p(ctn\y)p(ctn-i\y, an) - - • p(ai\y, a2,..., an). (4.55)

de Jong and Shephard (1995) made significant progress by first concentrating on sampling the disturbances and subsequently sampling the states. Our treatment to follow is based on the de Jong-Shephard approach.

[Added in Proof: Since this book went to the printer we have discovered a new simulation smoother which is simpler than the one presented below. A paper on it with the title 'A simple and efficient simulation smoother for state space time series analysis' can be found on the book's website at http://www.ssfpack.com/ dkbook/]

4.7.1 SIMULATING OBSERVATION DISTURBANCES

We begin by considering how to draw a sample of values si,... ,sn from the conditional density p{s\,..., e„|y). We do this by proceeding sequentially, first drawings„ from p(sn\y), then drawing e„„i from /?(e„_i |y, sn), then drawing £„_2 from p(en-2\y, s„, £n-i)> and so on, making use of the relation

p(si, ...,en\y) = p(en\y)p(En-i\y, £„)••• ^(fiiiy, £2,..., sn). (4.56)

Since we deal with Gaussian models, we only need expressions for E(e(|y, , . . . ,£„ ) and Var(£, |y, £ i + i , . . . , e„) to draw from p(s( |y, £ i+1 , . . . ,£„) . These

expressions for the conditional means and variances can be evaluated recursively using backwards algorithms as we will now show.

We first develop a recursion for the calculation of

£t =E(etlylt...,yH,et+it ...,£„), t =n - 1 , . . . , 1, (4.57)

with £n = E(e„|yi, . . . , y„). Let vj ^ yj - E(y; | y , , . . . , y^O, as in §4.2, for y = Then

st = E(£f|l>i, 1*2, . . . , vn, £t+i, . ..,£„)

= wm,..., u„,fii+i,.. .,£„), t = n - 1,..., 1,

since £t and s[+i,... ,sn are independent of t>i,..., . Now let

dj — £j — £j, 7 — / + 1 , . . . , n. (4.58)

Since vt,..., vn,d(+\,..., dn are fixed when vt,... ,vn, st+i,... ,sn are fixed and

Page 102: Time Series Analysis by State Space Methods by Durbin and Koopman

4.7. SIMULATION SMOOTHING 85

vice versa,

s, = E(e, | v{, v i + i , . . . , vn,d(+1, ...,dn), t = n - 1 , . . . , 1,

with e„ =E(e J I | i ; i , . . . , u „ ) - e n = Efo.li>,,). We know from §4.2.4 that vt,..., vn are independent so the joint distribution

of vt,..., vn,st+1,

is

n n—I

j-t k=t+l since

) = pfatlu*,. . . , Vn, Sk+1, . . . , £„),

due to the independence of . . . , v^-i and Transforming from vt,..., vn, s t + i , s n to Vt,..., vn, d!+i,... ,dn, noting that the Jacobian is one and that

pfotlv*. - • •> vn, £k+1, ...,«„) = p(dk),

it follows that vt, ...,vn, dt+\,..., dn are independently distributed. From the results in §4.4.1, we have

Bt - E f o | y i , . . . , yn) = E(st\vt, ...,vn) = Ht (F~lvt - K'trt) , (4.59)

where rt is generated by the backwards recursion

r t„ i = Z'tF~lvt + L'trt,

for t = « , . . . , 1 with rn — 0. It follows since sn — s„ ~ Efo,|vw) that

dn — — HnF~lvn. (4.60)

For the next conditional mean sn-\ we have

«„_ i = EfoI_i|i;ll_i,t>II,4l)

= E(f„-1 |Wn_i, w») + E(sn-1 \dn)

- £ n _ 1 + E ( £ n _ 1 < ) C - 1 4 , (4.61)

where we define

Now

Ct = Var(«i,), t = l,...,n.

E(£„_!<) - E[£N L(£„ - / /„F;1^)']

- -E{en-Wn)F~lHn

Page 103: Time Series Analysis by State Space Methods by Durbin and Koopman

86 FILTERING, SMOOTHING A N D FORECASTING

using (4.34), so we have from (4.59),

1 — Hn-\F~\vn-i — Hn-\K'n_xrn-i -f Hn-\K'n_xZ'nF~x HnC~ld„.

We note for comparison with a general result that will be obtained later that this can be written in a form analogous to e„_i given by (4.59) with t — n — 1,

«„-i - Hn i (F~\vn_, - K'n^rn l ) , (4.62)

where r„_i = rn l - Z'nF~lHnC~ldn. For calculating

— E(sn-2\vn-2, vn-i, vn,dn-\,dn) = £n-2 +E(£„_ 2 <_ 1 )C" i 1 4- i + E(en-2d'n)C~1dn, (4.63)

we need the results from (4.34) and (4.50),

E f o v ' t + ^ -HtK'tZ't+l, E(stv't+2) ~ -HtK'tL't+1Z't+2,

E(etr't+1) - -//^/M+i, E{etr't+2) = -HtK'tL't+iNt+2,

where Af, is generated by the backwards recursion

Nt-i =Z,tF-lZt + L'tNtLtt

for t 1 with N„ ~ 0. This recursion was developed in §4.3.2. Further,

dn — — HnFn 1vn,

dn-1 = sn—\ — Hn-iF^yVn-i + Hn-\K'n_xrn-.i

H-n—i Kn //n C„ dn,

so that

E(e„-2 d'n_{) — —E(e„-2^-3)

- E {sn-2d'n)C; lHnF~ = Hn2K!

nlZ'nX Fn^ Hni — Hn-2K'n-2Nn-\Kn-iHn-i + Hn2K'n_2L'n_lZ'nFn

1 Hn Cn 1 Hn Fn

1 Z n K n i H n _ i .

Also,

E (e n - 2 4) = -E(£„_ 2 ^) - - E ( S n - j v ' J F - ' H » = H ^ K ' ^ L ' ^ Z ^ F ' 1 Hn.

To simplify the substitution of these expressions into (4.63), we adopt a notation that will be useful for generalisation later,

Wn - HnF~lZn,

Wn-\ = Hn-i{F~\Zn-\ — K'n^Nn-iLn-i),

Page 104: Time Series Analysis by State Space Methods by Durbin and Koopman

4.7. SIMULATION SMOOTHING 87

Thus using (4.59) we can rewrite (4.63) as

Sn-2 — Hn-2(FnX

2Vn-2 — ~ + Hn_2K'n_2L'n_lWnC-ldn.

It follows that we can put this in the form analogous to (4.62),

£w_2 - Hn_2(F-_}2vn_2 ~ K_ 2 r n„ 2 ) , (4.64)

where r„„2 is determined by the backwards recursion

rf_i - Z; F~l vt ~ WiC~ldf + L'trt, t = n, n - 1, (4.65)

with rn ~ 0.

4.7.2 DERIVATION OF SIMULATION SMOOTHER FOR OBSERVATION DISTURBANCES

The results of the previous subsection lead us to conjecture the general form

Bt = Ht(F~xvt - K'{ft), t = n, n - 1 , . . . , 1, (4.66)

where rt is determined by the recursion

ft-x - z ; F^vt - W'tC~ldt 4- L'trt, t = n, n - 1 , . . . , 1 (4.67)

with rn — 0 and

Wt = Ht(Ft-lZt - K'tNtLt), (4.68)

Nt-x = Z'(F~lZt + W;C~lWt + L'tNtLu (4.69) with Nrt — 0, for t ~ n, n — 1 , . . . , 1. All related definitions given earlier are consistent with these general definitions. We shall prove the result (4.66) by backwards induction assuming that it holds when t is replaced by 14- 1 , . . . , « — 1. Note that throughout the proof we take j > t.

We have st = EFO\vt, ...,vn, dt+1,..., dn)

n

^E(st\vt,...,vn)+ ^ E(st\dj) j=t+1

n = s t + J 2 V & d ^ C f d j , (4.70)

i=t+\ where st is given by (4.59). Since we are assuming that (4.66) holds when t is replaced by j with j > t we have

E(std'j) = -E(sts'j)

= —E(st v'-)FJ{ Hj + E (EtfyKjHj

- HtK>L't+x - • • L'j_xZ'jFJlHj + E ( s t r ' j ) K j H j , (4.71)

by (4.34), noting that L't+l • • • Lfj_x = Im for j = t + 1 and L't+l • • • L'}_x = L't+i

for j =t-1-2.

Page 105: Time Series Analysis by State Space Methods by Durbin and Koopman

88 FILTERING, SMOOTHING AND FORECASTING

We now prove by induction that

E(s tr'j) = ~HtK'tL't+1 • • • L)Nh j = t + 1 , . . . , n, (4.72)

assuming that it holds when j is replaced by j + 1. Using (4.67), (4.34) and also (4.71) and (4.72) with j replaced by j + 1, we have

E - E ( e . v ' j ^ F ^ Z j + i - E ( e ^ C J ^ W j ^ +E ( s t r ' j + 1 )L j + i

= —HsK'tL't+l • • • L'j[Zj+1

- -HtK'tL't+1 • '•L,J(Z,

J+iF7+iZj+i + Wj+lCjl1Wj+i -\-L'j+lNj+lLj+1) = -HtK'tL't+l~-L'jNi.

The second last equality follows from (4.68) and the last equality follows from (4.69) with / = j + 1. This completes the proof of (4.72). Substituting (4.72) into (4.71) gives

E(s td'}) = HtK'tL't+1 • • • {Z.'-FJlHj - L)NSKSH})

= HtK,tLf

t+l.--L'J_1W'j. (4.73)

Substituting (4.73) into (4.70), we obtain

n

et=et + HtK't E L't+1 • • • L^W'jCj'dj, j=t+1

with the interpretation that for j — t 4- 1, L't+l • • • L!}_x — Im and for j — t + 2,

L' t+l • • • L'j-x = ^i+i-'K follows from (4.59) that

n - HtF~lvt - HtKf

t £ L;+1 • • .L^jZJFTV 7=i+I

Combining the two last expressions gives (4.66), that is,

S, = HtF^vt - HtK> J2 Li+i • • • L'j-i(ZjFJ~lvJ ~ WjCJ^j) j=t+i

= Ht(Ft-lvt ~ K[ft),

since it follows from (4.67) that

* = £ • • • LU (Z'jFilvJ - W'jCjtdj). j=t+i

Thus (4.66) is proved.

Page 106: Time Series Analysis by State Space Methods by Durbin and Koopman

4.7. SIMULATION SMOOTHING 89

To perform the simulation we take et = et + dt for t — 1 , . . . , n where dt is a random draw from the normal distribution

dt ~ N(0, Ct).

The variance matrix Ct is given by

Ct — Varfo jy, £>+ i , . . . , sn) = VarfofUi, u/+li..., vn,dt+i, ...,dn)

n

= Ht- HtDtHt - J2 Hstd^Cj'Eistd'j)' j=t+i

Ht-Ht

+ W,jC]lWj)Lj-l---Lt+iKt Ht,

from §4.4.3 and equation (4.73). It follows from (4.69) that n

fit = E • • • L U ( z J F i l z J + W j c i l W j ) L J - i • -• w i=t+\

Thus we have C, = Ht - HtDtHt, t = n, . . . , 1 , (4.74)

where Dt = Ft

l + K'tNtKt,

and Nt is recursively evaluated by the backwards recursion (4.69). It is remarkable that in spite of complications caused by the conditioning

on st+1, the structures of the expressions (4.66) for st and (4.74) for Var(ei — s t) are essentially the same as (4.36) for s t and (4.44) for Varfo — et).

4.7.3 SIMULATION SMOOTHING RECURSION

For convenience we collect together the simulation smoothing recursions for the observation disturbance vectors:

5-1 -z'tFt~\ - + L\ft, Nf-i —Z[F~lZt -f WtC~lWt + L'tNtLt,

for t — n , . . . , 1 and initialised by fn — 0 and Nn = 0. For simulating from the smoothed density of the observation disturbances, we

define

Wt = Ht(F~lZt - K'tNtLt), dt ~ N(0, Ct), (4.76) Ct — Ht — HtD,Ht,

Page 107: Time Series Analysis by State Space Methods by Durbin and Koopman

FILTERING, SMOOTHING AND FORECASTING

Dimensions of simulation smoothing recursions of §§4.7.3 and 4.7.4.

Observation disturbance State disturbance st,dt p x 1 n t,dt r x 1

wt p x m wt r x m Q pxp Ct r x r

where Dt — Ft 1 + K'tNtKt. A draw s t , which is the ith vector of a draw from the

smoothed joint density p(e\y), is then computed by

et=dt + Ht (Ft~l v< - K'tft). (4.77)

Simulation smoothing is performed in a similar way to disturbance smoothing: we proceed forwards through the series using (4.13) and backwards through the series using (4.75), (4.76) and (4.77) to obtain the series of draws st for t — !,..., n. The quantities v,, F, and Kt of the Kalman filter need to be stored fori = 1 , . . . , n. In Table 4.7 we present the dimensions of the vectors and matrices occurring in the simulation smoothing recursion.

4.7.4 SIMULATING STATE DISTURBANCES

Similar techniques to those of §4.7.1 can be developed for drawing random samples from the smoothed density p{q\ |}>)- We first consider obtaining a recursion for calculating

f j , = E(r]t\y, t]t+1,..., r}n), t = 1,..., n.

For comparison with the treatment of §4.7.1 it is convenient to use some of the same symbols, but with different interpretations. Thus we take

Wj - Q j R ' j N j L j ,

dj - Vj - r)h (4.78) Cj = Var (djl

where N j is determined by the backwards recursion (4.69) with values of Wj as given in (4.78); this form of Wj can be found to be appropriate by experimenting with j — n, n — 1 as was done in §4.7.1. We shall explain how to prove by induction that

f}t = QtKn, i - B , . . . , 1 , (4.79)

90

Table 4.7.

where rt is determined by recursion (4.67) with Wj defined by (4.78), leaving some details to the reader.

Page 108: Time Series Analysis by State Space Methods by Durbin and Koopman

4.7. SIMULATION SMOOTHING 91

As in §4.7.1, n

ifc = E 0 f r | y ) + £ E(^My) y=t+i

n

= Q t R > t + £ E ^ X ^ , (4.80) y - m

where

E (ritd'j) = -Ei r j r f j )

= - E ( t h f y R j Q j .

By methods similar to those leading to (4.72) we can show that

E (%?}) - QtR'tL'i+i - - • L'jN j, j = t+l,...,n. (4.81)

Substituting in (4.79) gives

E (md j ) = -QtR'tL't+i •••L'.NjRJQJ, = -QtR'tL't+l.--L!

j_lW'j, (4.82)

whence substituting in (4.80) and expanding rt proves (4.79) analogously to (4.66). By similar means we can show that

Q - Qt - QtR'tNtRtQt, t = « , . . . , 1. (4.83)

As with the simulation smoother for the observation disturbances, it is remarkable that fjt and Ct have the same forms as for |y) and Var(rçf|y) in (4.41) and (4.47), apart from the replacement of r{ and Nt by r, and Nt.

The procedure for drawing a vector f j , from the smoothed density p(i]\y) is then: draw a random element dt from N(0, C() and take

m=dt + Q(R'trt> t — n,..., I. (4.84)

The quantities vt, F, and Kt of the Kalman filter need to be stored for t — 1 , . . . , n. The storage space of Kt is used to store the fjt

f& during the backwards recursion.

4 .7 .5 SIMULATING STATE VECTORS

By adopting the same arguments as we used for developing the quick state smoother in §4.4.2, we obtain the following forwards recursion for simulating from the conditional density p(ct\y)

at+l = Tt&t + R,T)„ (4.85)

for t — 1 , . . . , n with ô>\ — a\-f PitQ. For simulation state smoothing we proceed forwards through the series using (4.13) and backwards through the series using (4.75) and (4.84) to obtain the series of draws ^ fori — 1 , . . . , «.Then we proceed forwards again through the series using (4.85) to obtain at for t — 1 , . . . , n.

Page 109: Time Series Analysis by State Space Methods by Durbin and Koopman

92 FILTERING, SMOOTHING AND FORECASTING

4.7.6 SIMULATING MULTIPLE SAMPLES

In the previous subsections we have derived methods to generate draws from the linear Gaussian model conditional on the observations. When more than one draw is required we can simply repeat the algorithm with different random values. However, we can save a considerable amount of computing after the first draw at the cost of storage. When simulating observation disturbance vectors using the method described in §4.7.3, we can store Nt-i, Wt and C, for t = 1 , . . . , n. In order to generate a new draw, we only require to draw new values for dt, apply tbe recursion for rt-\ and compute et for / = 1 , . . . , n. The matrices Nt-i, Wt and Ct

do not change when different values of dt, rt-\ and s, are obtained. This implies huge computational savings when multiple draws are required which is usually the case when the methods and techniques described in Part II of the book are used. The same arguments apply when multiple draws are required for state disturbance vectors or state vectors.

4.8 Missing observations

In §2.7 we discussed the effects of missing observations on filtering and smoothing for the local level model. In case of the general state space model the effects are similar. Therefore, we need not repeat details of the theory for missing observations but merely indicate the extension to the general model of the results of §2.7. Thus when the set of observations yt for t — r , . . . , r* — 1 is missing, the vector vt and the matrix K, of the Kalman filter are set to zero for these values, that is, v, — 0 and Kt — 0, and the Kalman updates become

at+l=Ttat, Pt+l^TtPtT; + RtQtR't, t = r T* - 1; (4.86)

similarly, the backwards smoothing recursions (4.25) and (4.30) become

r,_i = r / r i t N t-! = T;NtTt, f = r* - 1 , . . . , r. (4.87)

Other relevant equations for smoothing remain the same. The reader can easily verify these results by applying the arguments of §2.7 to the general model. This simple treatment of missing observations is one of the attractions of the state space methods for time series analysis.

Suppose that at time t some but not all of the elements of the observation vector y, are missing. Let y* be the vector of values actually observed. Then y* = W,y, where W, is a known matrix whose rows are a subset of the rows of Ip. Consequently, at time points where not all elements of yt are available the first equation of (4.1) is replaced by the equation

y* = z*at+e*, <~N(o ,#;),

whereZ; = WtZtie; = Wt£: and H* ^ Ht W{. The Kalman filter and smoother then proceed exactly as in the standard case, provided that yt, Zt and H, are

Page 110: Time Series Analysis by State Space Methods by Durbin and Koopman

4.9. FORECASTING 93

replaced by y*, Z* and H* at relevant time points. Of course, the dimensionality of the observation vector varies over time, but this does not affect the validity of the formulae. The missing elements can be estimated by appropriate elements of Ztoc, where at is the smoothed value. A more convenient method for dealing with missing elements for such multivariate models is given in §6.4 which is based on an element by element treatment of the observation vector yt.

When observations or observational elements are missing, the corresponding values should be omitted from the simulation samples obtained by the methods described in §4.7. When the observation vector at time t is missing, such that the Kalman filter gives vt = 0, F~l = 0 and Kt - 0, it follows that W, = 0, D, = 0 and Ct — Ht in the recursive equations (4.75). The consequence is that a draw from p(st |y) as given by (4.77) becomes

g, ~N(0, t t , ) , when yt is missing. We notice that this draw does not enter the simulation smooth-ing recursions since Wt = 0 when yt is missing. When only some elements of the observational vector are missing, we only need to adjust the dimensions of the system matrices Z, and Ht. The equations for simulating state disturbances when vector yt is missing remain unaltered except that vt =0, F~l =0 and Kt = 0.

4.9 Forecasting

Suppose we have vector observations y i , . - . , y„ which follow the state space model (4.1) and we wish to forecast yn + ; for j — 1 , . . . , / . For this purpose let us choose the estimate yn+j which has minimum mean square error matrix given Yn, that is, Fn+j ~ E[(yH+j ~ yn+j)(%+j - yn+y/iy] is a minimum in the matrix sense for all estimates of yn+J-. It is standard knowledge that if x is a random vector with mean /x and finite variance matrix, then the value of a constant vector X which minimises E[(A. — x)(X — x)'} is X — (JL. It follows that the minimum mean square error forecast of yn+j given Yn is the conditional mean yn + j ~ E(yn+y |y).

For j — 1 the forecast is straightforward. We have yn+l — Zn+ia„+i + £,i+1 so

yn+i - Zn+lE(an+1\y)

where an+] is the estimate (4.8) of an+i produced by the Kalman filter. The error variance matrix or mean square error matrix

Fn+i = E[(y„+1 - y„+i)(yn+i - y^+i)'] ~ Zn+iPn+\Z'n+l + Hn.|_i,

is produced by the Kalman filter relation (4.7). We now demonstrate that we can generate the forecasts yn+j for j = 2 , J merely by treating yM+i,... , yn+j

as missing values as in §4.8. Let an+j — E(ctn+y jy) and Pn+j — E[(«,2+y — a n + j )

Page 111: Time Series Analysis by State Space Methods by Durbin and Koopman

94 FILTERING, SMOOTHING AND FORECASTING

(an+J - )'!}']. Since yn+j = Zn+Jan+j + sn+j we have

%+j = Zn+j E(an+j\y) — Zn-\-jdn-\-j,

with mean square error matrix

Fn+j = E [{Zn+j(an+j — ocn+j) — en+j}{Zn+j(an+j — an+j) — £„+;}'] — Pn+jZn+j + Hn+j.

We now derive recursions for calculating an+j and Pn+j- We have ot„+j+i ~ Tn+ian+j + ^n+y'^n+j so

— Tn+jan+j,

for j — 1 , — 1 and with <5„+i — <3„+i. Also,

Pn+j+1 — E[(A„+j+I -art+y+i)(fl„+y+1 - ttn+z+iyij] = E[(i2„+y - an+j)(an+j -an+j)'\y]T>+j

+ ^ n + 7 ^Uh+jln+^K+j

— Tn+jP n+j T,'l+j + Rn+jQn+jRn+j-

We observe that the recursions for an+j and Pn+j are the same as the recursions for an+j and Pn+j of the Kalman filter (4.13) provided that we take vn+j — 0 and Kn+j — 0 for j — 1 , — 1. But these are precisely the conditions that in §4.8 enabled us to deal with missing observations by routine application of the Kalman filter. We have therefore demonstrated that forecasts of yn+x,..., v n + j together with their forecast error variance matrices can be obtained merely by treating yt for t > n as missing observations and using the results of §4.8. In a sense this conclusion could be regarded as intuitively obvious; however, we thought it worth while demonstrating it algebraically. To sum up, forecasts and their associated error variance matrices can be obtained routinely in state space time series analysis based on linear Gaussian models by continuing the Kalman filter beyond t = n with vt — 0 and K{ — 0 for t > n. These results for forecasting are a particularly elegant feature of state space methods for time series analysis.

4.10 Dimensionality of observational vector

Throughout this chapter we have assumed both for convenience of exposition, and also because this is by far the most common case in practice, that the dimensionality of the observation vector y, isafixed value p. It is easy to verify however that none of the basic formulae that we have derived depend on this assumption. For example, the filtering recursion (4.13) and the disturbance smoother (4.48) both remain valid when the dimensionality of yt is allowed to vary. This convenient generalisation arises because of the recursive nature of the formulae. In fact we made use of

Page 112: Time Series Analysis by State Space Methods by Durbin and Koopman

4.11. GENERAL MATRIX FORM FOR FILTERING AND SMOOTHING 95

this property in the treatment of missing observational elements in §4.8. We do not, however, consider explicitly in this book the situation where the dimensionality of the state vector a, varies with t, apart from the treatment of missing observations just referred to and the conversion of multivariate séries to univariate series in §6.4.

4.11 General matrix form for filtering and smoothing

The state space model can itself be represented in a general matrix form. The observation equation can be formulated as

y — Za + £, s ~ N(0, H), (4.88)

with

y -yi

\y*.

£ —

z = ~ZX 0 0

0 a —

H

Z„ 0

Hx 0

0 Hn

/ ai \

a n

1/

The state equation takes the form

a = T(ot\ + Rn), i] ~ N(0, Q),

with

(ai\ "0 0 0 0 Ri 0 0

< = 0 R - 0 Ri 0

u> 0 0 •• • Rn

~Qi 0 "

: • G =

Kin) _ 0 Qn_

(4.89)

(4.90)

r im 0 0 0 0 0 " Ti In 0 0 0 0

T2Tx T2 Im 0 0 0 T -- T^TlTl TiT2 r3 Im 0 0

Tn-l — T! T n - i • • 'T2 Tn-i - - T3 Tn-i • •-Î4 Im 0 _ Tn-.-Tx Tn • • - T2 Tn • ••Ti Tn • •-7*4 • • Tn Im.

(4.91)

(4.92)

(4.93)

Page 113: Time Series Analysis by State Space Methods by Durbin and Koopman

96 FILTERING, SMOOTHING AND FORECASTING

This representation of the state space model is useful for getting a better understanding of some of the results in this chapter. For example, it can now be seen immediately that the observation vectors yt are linear functions of the initial state vector a} and the disturbance vectors et and rjt for t = 1 s i n c e it follows by substitution that

It also follows that

E(y) - ZTa\,

where

* —

y = ZTct* + ZTRrj + e. (4.94)

Var(y) = S = ZT(P* + RQR')T'Z' 4- ff, (4.95)

0 0

\°7

Px 0 0 • • 0' 0 0 0 0 0 0 0 0

0 0 0 0

We now show that the vector of innovations can be represented as v — Cy — Ba* where v = (v\,..., v(

n)' and where C and B are matrices of which C is lower block triangular. Firstly we observe that

a,+1 = Ltat + Ktyt,

which follows from (4.8) with vt — yt — Ztat and Lt — Tt — KtZt. Then by substituting repeatedly we have

t-i at+\ — LtLt-x • - • L\a\ + ^ LtU-1 * • • Lj+iKjyj + K,yt

J=i

and

v\ = y\ - Zxax, v2 = -Z2Lia1 +y2 — Z2Kiyu

v3 = —Z$L2Liai + - Z3K2y2 ~ Z3L2K1yu

v4 = —Z4L3L2L\a\ + y4 - Z4K3y3 - Z4L3K2y2 - Z4L3L2Kxyx,

and so on. Generally,

vt — ~Ztht-\Lt-2- " L\a\ +yt -ZtKt^xyt-x t—2

— zt E ^ f - i • •• Lj+iKjyj. 7=1

Note that the matrices Kt and Lt depend on P\, ZyT, R, H and Q but do not depend on the initial mean vector ax or the observations yx,..., yn, fori — 1 , . . . ,n. The

Page 114: Time Series Analysis by State Space Methods by Durbin and Koopman

4.11. GENERAL MATRIX FORM FOR FILTERING AND SMOOTHING 97

innovations can thus be represented as

v=(I - ZLK)y - ZLa\ = Cy - ZLa\,

where C — I — ZLK,

L =

I 0 0 0 0 0" Li I 0 0 0 0

L2Lx L2 I 0 0 0 L3L2Li L3L2 Ls / 0 0

Ln-i • • - L\ Ln-i ••• L2 Ln~i • Ln -I ' "U I 0 Ln-• L\ Ln ' • • L2 • •L-i Ln' • La • • Ln /_

~ 0 0 0 " 0 0

K = 0 K2 0

_ 0 0 Kn

and matrix Z is defined in (4.89). It can be easily verified that matrix C is lower block triangular with identity matrices on the leading diagonal blocks. Since v = Cy- ZLaf, Var(y) = £ and a\ is constant, it follows that Var(u) = CSC' . However, we know from §4.2.4 that the innovations are independent of each other so that Var(t>) is the block diagonal matrix

F =

Fi 0 •• 0 0 F2 0

0 0 F L n

This shows that in effect the Kalman filter is essentially equivalent to a block version of a Cholesky algorithm applied to the observational variance matrix implied by the state space model (4,1).

Given the special structure of C = / — ZLK we can easily reproduce some smoothing results in matrix form. We first notice that E(u) CE(y) and since E(y) = ZTa*, we obtain the identity CZT — ZL.lt follows that

v — C[y — E(y)], F - Var(u) = C£C ' ,

with E(y) — ZLa*. Further we notice that C is nonsingular and

ZLa\ = 0

S"1 = C'F~lC.

Page 115: Time Series Analysis by State Space Methods by Durbin and Koopman

98 FILTERING, SMOOTHING AND FORECASTING

The stack of smoothed observation disturbance vectors can be obtained directly by

i = E(s\y) = Cov(£, y)^[y - E(>)j,

where y — E(y) — ZTRrj + s. Given the results v = C[y — E(y)] and S" 1 — C'F lC it follows straightforwardly that

e = HC'F~lv = H(J ~ K'L'Z')F~xv = H(F~xv — K'r) = Hu,

where u — F"1 v — K'r and r = L'Z'F~xv. Notice that

« = £-»[),-EGO] ^C 'F - 1 ! ; .

It can be verified that the definitions of u and r are consistent with the stack of vectors ut of (4.37) and rt of (4.25), respectively.

In a similar way we obtain the stack of smoothed state disturbance vectors directly via

i) = Covfo, y)V-l[y-E(y)] = QR'T'Z'u = QR'r,

where r = T'Z'u. This result is consistent with the stack of fjt — QtR'trt where rt is evaluated by = Z'tu, + T(rt; see §4.4.4. Notice the equality r — L'Z'Fxv = T'Z'u.

Finally, we obtain the smoothed estimator of a via

a = Cov(oi, y ^ - ' y - E i y ) } = Co v(a, y)u - Cov(a, QR'T'Z'u + Cov(a, s)u = TRQR'T'Z'u — TRfj,

since y — E(y) = Z T Rrj -}- e. This is consistent with the way at is evaluated using fast state smoothing as described in §4.4.2.

Page 116: Time Series Analysis by State Space Methods by Durbin and Koopman

5 Initialisation of filter and smoother

5.1 Introduction

in the previous chapter we have considered the operations of filtering and smoothing for the linear Gaussian state space model

under the assumption that «i ~ N(«i, Pi) where ai and Pi are known. In most practical applications, however, at least some of the elements of ay and Pj are unknown. We now develop methods of starting up the series when this is the situation; the process is called initialisation. We shall consider the general case where some elements of ot\ have a known joint distribution while about other elements we are completely ignorant.

A general model for the initial state vector a\ is

where the rax 1 vector a is known, 8 is a q x 1 vector of unknown quantities, the m x q matrix A and the ra x (ra — q) matrix R0 are selection matrices, that is, they consist of columns of the identity matrix lm; they are defined so that when taken together, their columns constitute a set of g columns of Im with g <m and A'Rq — 0. The matrix Qq is assumed to be positive definite and known. In most cases vector a will be treated as a zero vector unless some elements of the initial state vector are known constants. When all elements of the state vector a t are stationary, the initial means, variances and covariances of these initial state elements can be derived from the model parameters. For example, in the case of a stationary ARMA model it is straightforward to obtain the unconditional variance matrix Qo as we will show in §5.6.2. The Kalman filter (4.13) can then be applied routinely with ai = 0 and P\ — Qq.

To illustrate the structure and notation of (5.2) for readers unfamiliar with the subject, we present a simple example in which

yt = Ztat + st, st ~ N(0, Ht), «i+i = Ttott + RtVt, Vt ~ N(0, Qt),

(5.1)

ai =a + A& + R0t]0, T)o ~ N ( 0 , ö o X (5.2)

yt = ßt+Pt + £t ~ N(0, ae2),

Page 117: Time Series Analysis by State Space Methods by Durbin and Koopman

100 INITIALISATION OF FILTER AND SMOOTHER

where p,t+1 = p,t + v, + ~ N(0, a 2 ) ,

= * + C, ~N(0,crf2),

A+i = 4>Pt + rt, xt~ N(0, a2), in which \(f>\ < 1 and the disturbances are all mutually and serially uncorrelated. Thus fij is a local linear trend as in (3.2), which is nonstationary, while pt is an unob-served stationary first order AR(1) series with zero mean. In state space form this is

* = 0 *

Thus we have

> 0 0" ( Pt\ "1 0 0" 0 1 1 * + 0 1 0 0 0 1 V v, / 0 0 1

0 0" / > A - 1 0 9 Ro =

0 1 \ rço = pi and Qq ~ a2/(\ — 02) where o^/(l — <f>2) is the variance of the stationary series pt.

Although we treat (j> as known for the purpose of this section, in practice it is an unknown parameter which in a classical analysis is replaced by its maximum likelihood estimate. We see that the object of the representation (5.2) is to separate out «i into a constant part a, a nonstationary part A8 and a stationary part Roqo. In a Bayesian analysis, can be treated as having a known or noninformative prior density.

The vector <5 can be treated as a fixed vector of unknown parameters or as a vector of random normal variables with infinite variances. For the case where 8 is fixed and unknown, we may estimate it by maximum likelihood; a technique for doing this was developed by Rosenberg (1973) and we will discuss this in §5.7. For the case where S is random we assume that

8 ~ N(0, Klq\ (5.3)

where we let tc —> co. We begin by considering the Kalman filter with initial conditions a\ — E(o'1) — a and Pi — Var(ari) where

'1 = K Poo P* (5.4)

and we let K oo at a suitable point later. Here P^ = A A' and P* = RQQQRQ, since A consists of columns of Im it follows that P^ is an m x m diagonal matrix with q diagonal elements equal to one and the other elements equal to zero. Also, without loss of generality, when a diagonal element of P^ is nonzero we take the

Page 118: Time Series Analysis by State Space Methods by Durbin and Koopman

5.2. THE E X A C T INITIAL K A L M A N FILTER 101

corresponding element of a to be zero. A vector S with distribution N(0, Klq) as ic -» oo is said to be diffuse. Initialisation of the Kalman filter when some elements of a\ are diffuse is called diffuse initialisation of the filter. We now consider the modifications required to the Kalman filter in the diffuse initialisation case.

A simple approximate technique is to replace K in (5.4) by an arbitrary large number and then use the standard Kalman filter (4.13). This approach was employed by Harvey and Phillips (1979). While the device can be useful for approximate exploratory work, it is not recommended for general use since it can lead to large rounding errors. We therefore develop an exact treatment.

The technique we shall use is to expand matrix products as power series in K taking only the first two or three terms as required, and then let K oo to obtain the dominant term. The underlying idea was introduced by Ansley and Kohn (1985) in a somewhat inaccessible way. Koopman (1997) presented a more transparent treatment of diffuse filtering and smoothing based on the same idea. Further developments were given by Koopman and Durbin (2001) who obtained the results that form the basis of §5.2 on filtering and §5.3 on state smoothing. This approach gives recursions different from those obtained from the augmentation technique of de Jong (1991) which is based on ideas of Rosenberg (1973); see §5.7. Illustrations of these initialisation methods are given in §§5.6 and 5.7.4.

A direct approach to the initialisation problem for the general multivariate linear Gaussian state space model turns out to be somewhat complicated as can be seen from the treatment of Koopman (1997). The reason for this is that for multi-variate series the inverse matrix Ft~l does not have a simple general expansion in powers of K~1 for the first few terms of the series, due to the fact that in very specific situations the part of Ft associated with P^ can be singular with varying rank. Rank deficiencies may occur when observations are missing near the beginning of the series, for example. For univariate series, however, the treatment is much simpler since F, is a scalar so the part associated with P*, can only be either zero or positive, both of which are easily dealt with. In complicated cases, it turns out to be simpler in the multivariate case to adopt the filtering and smoothing approach of §6.4 in which the multivariate series is converted to a univariate series by introducing the elements of the observational vector y, one at a time, rather than deal with the series directly as a multivariate series. We therefore begin by assuming that the part of Ft associated with P^ is nonsingular or zero ÍOT any t. In this way we can treat most multivariate series, for which this assumption holds directly, and at the same time obtain general results for all univariate time series. We shall use these results in §6.4 for the univariate treatment of multivariate series.

5.2 The exact initial Kalman filter

In this section we use the notation 0(tc~i) to denote a function f ( i c ) of ic such that the limit of ic} f(jc) as k —> oo is finite for j — 1 ,2.

Page 119: Time Series Analysis by State Space Methods by Durbin and Koopman

102 INITIALISATION OF FILTER AND SMOOTHER

5.2.1 THE BASIC RECURSIONS

Analogously to the decomposition of the initial matrix P\ in (5.4) we show that the mean square error matrix Pt has the decomposition

Pt = KPoo.t + p*,t 4- t = 2 , . . . , n, (5.5)

where P^ f and P* t do not depend on K. It will be shown that P^, t = 0 for t > d where d is a positive integer which in normal circumstances is small relative to n. The consequence is that the usual Kalman filter (4.13) applies without change for t = d 4- I,.. .,n with Pt — P*t. Note that when all initial state elements have a known joint distribution or are fixed and known, matrix P^ ~ 0 and therefore d = 0.

The decomposition (5.5) leads to the similar decompositions

Ft = KFcoj 4- + 0(K~1), Mt = KM^t + A/*,, + 0(K~1), (5.6)

and, since Ft ~ ZtPtZ't 4- Ht and Mt = PtZ't, we have

Foo t ~ ZfPoo tZf, Futj ~ ZtP* tZft H~ if/» _ „

(5 7) Moo,, - Poo,tZ'f, M,tt = P*jZ't, J

for t = 1 , . . . , d. The Kalman filter that we shall derive as K oo we shall call the exact initial Kalman filter. We use the word exact here to distinguish the resulting filter from the approximate filter obtained by choosing an arbitrary large value for K and applying the standard Kalman filter (4.13). In deriving it, it is important to note from (5.7) that a zero matrix M^j (whether Pooj is a zero matrix or not) implies that F^t = 0. As in the development of the Kalman filter in §4.2.2 we assume that Ft is nonsingular. The derivation of the exact initial Kalman filter is based on the expansion for Ft~l = [kF^ + T7*,, 4- O ^ 1 ) ] 1 as a power series in K~~l

y that is

F~l = F/0) 4- / c 1 Fi1} + K~2Fl2) + 0(/c~3), (5.8)

for large k. Since Ip = FtFt~l we have

lp - (kFoo,! + F*J + ic-lFatt + K~2Fb,t + • - •)

On equating coefficients of K* for j = 0, —1, —2,... we obtain

F^tFP + FoojF^ = Ip, (5.9)

Fa,tFi0) 4 F*,,Ff(1) 4- F^F^ = 0, etc.

We need to solve equations (5.9) for Ff°\ F^ and F^2); further terms are not required. We shall consider only the cases where F0Q>, is nonsingular or F^j = 0. This limitation of the treatment is justified for three reasons. First, it gives a complete solution for the important special case of univariate series, since if yt is

Page 120: Time Series Analysis by State Space Methods by Durbin and Koopman

5.2. THE EXACT INITIAL KALMAN FILTER 103

univariate FQ0i, must obviously be positive or zero. Secondly, if yt is multivariate the restriction is satisfied in most practical cases. Thirdly, for those rare cases where yt js multivariate but the restriction is not satisfied, the series can be dealt with as a univariate series by the technique described in §6.4. By limiting the treatment at this point to these two cases, the derivations are essentially no more difficult than those required for treating the univariate case. However, solutions for the gen-eral case where no restrictions are placed on F ^ , are algebraically complicated; see Koopman (1997). We mention that although F ^ , nonsingular is the most common case, situations can arise in practice where F^ — 0 even when P^t ^ 0 if Moo.t = PoojZ'r — 0-

Taking first the case where F ^ , is nonsingular we have from (5.9),

F , ( 0 ) ^0 , F/1-1 = F~lt, F,(2) = - F - ] t F . t t F Z ] r (5-10)

The matrices Kt — TtM(F~l and Lt — Tt — KtZt depend on the inverse matrix F" 1 so they also can be expressed as power series in <c_1. We have

Kt = TAtcMooit 4- 4- Ofo" 1 ) !^" 1 F,(I) + k~2F,(2) + •••),

so

Kt = K<0) + k " 1 / ^ + 0(k~\ Lt = L(0) + K~1L{P 4- 0(K~2), (5.11)

where

K,(0) = T.M^Fi", L,® = T, - K f z t ,

By following the recursion (4.8) for at+\ starting with t — 1 we find that at has the form

at = 4 0 ) + + 0(k~2),

where af* ~ a and =• 0. As a consequence vt has the form

where v f ] = yt — Ztaf] and — -Ztdp. The updating equation (4.8) for at+l

can now be expressed as

at+i - T,a, + Ktvt

^ T ^ + K - ^ + OiK"2)}

+ [/Sf> + K"1 Kt(l) 4 O(K-2)][i>i0) 4- K " 1 ^ + 0(k~%

which becomes as ic —oo,

a f l = Tt<40) 4- Ki%f\ t = 1 , . . . , n. (5.13)

The updating equation (4.11) for Pt+l is

Pt+i — TtPtL't 4- RtQtR't

= TAkP^j + P*,, + O f o " 1 ) ] ^ + k~1L(P + 0(K~2)]' 4- RtQtR't-

Page 121: Time Series Analysis by State Space Methods by Durbin and Koopman

104 INITIALISATION OF FILTER AND SMOOTHER

Consequently, on letting K oo, the updates for Poo.f+i and P*,f+i are given by

p T P I Poo.i+l - l,Pco,tLt ,

1 - TtP^L^' + TtP^Lf* + RtQtR't,

for t = 1 , . . . ,n. The matrix P,+i also depends on terms in K1 , tc~~2, etc. but these terms will not be multiplied by K or higher powers of K within the Kalman filter recursions. Thus the updating equations for Pt+i do not need to take account of these terms. Recursions (5.13) and (5.14) constitute the exact Kalman filter.

In the case where FOQJ — 0, we have

Ft = F*,, + 0(k~1), Mt = + 0(tc-1),

and the inverse matrix F" 1 is given by

F" 1 - F - j + 0(k~1).

Therefore,

Kt = Tt[M^t + 0(k~1)][F-J + 0(K~l)] — TtM^tF~l + 0(k~1).

The updating equation for (5.13) has

Kf{0) - TtM^tF~}, (5.15)

and the updating equation for P f+1 becomes

Pt+l = TtPtL't + RtQtR't

= UKPou + P*,t + O f o - 1 ) ] ^ + K-'lSP + 0(K~2)]' + Rt QtRf,

where L f } = Tt - tf((0)Z, and F;

(1) - ~K^Zt. The updates for P^^ and P*,, can be simplified considerably since — Poc,jZ': — 0. By letting ic oo we have

p t p r (0)' 'oo, H-l — 't*ooj^t TD T' T P 7' Y^1

— li *00,t11 ~~ 1troo.t£jtfit

= TtPoojTl, (5.16)

— FjPqo^LI ̂ -(- TfPfc jLf^' -f RtQR't = -TtP^Z'tK^' + TtP^L^' + R,QR'S

= T t P ^ L f y + RtQR't, (5.17) for / - 1 , . . . , d, with Poo j = Poo- A A' and = P* = RQQQR'Q. It might be thought that an expression of the form F^t -f k_,F** f + 0(k2) should be used for Ft here so that two-term expansions could be carried out throughout. It can be shown however that when M ^ , = PoojZ't — 0, so that F^ f = 0, the contribution of the term t is zero; we have therefore omitted it to simplify the presentation.

Page 122: Time Series Analysis by State Space Methods by Durbin and Koopman

5.2. THE EXACT INITIAL KALMAN FILTER 105

5.2.2 TRANSITION TO THE USUAL KALMAN FILTER

We now show that for non-degenerate models there is a value of d of t such that Poo.t i1 0 for / < d and P»,, — 0 for t > d. From (5.2) the vector of diffuse elements of «{ is 8 and its dimensionality is q. For finite k. the log density of 8 is

logp(8) - - ^ l o g 2 j r - f l o g * - ~8'8, 2 2 2fc

since E(5) = 0 and Var(5) — tclq. Now consider the joint density of <5 and Yt. In an obvious notation the log conditional density of 8 given Yt is

l o g p(8\Yt) = l o g p(S, Y() ~ l o g p(Yt)t

for t — 1 , . . . , n. Differentiating with respect to 8, letting K oo, equating to zero and solving for 5, gives a solution <5—5 which is the conditional mode, and hence the conditional mean, of 8 given Yt. Since p(8, Yt) is Gaussian, log p(8, Y,) is quadratic in 8 so its second derivative does not depend on 8. The reciprocal of minus the second derivative is the variance matrix of 8 given Yt. Let d be the first value of t for which this variance matrix exists. In practical cases d will usually be small relative to n. If there is no value of t for which the variance matrix exists we say that the model is degenerate, since a series of observations which does not even contain enough information to estimate the initial conditions is clearly useless.

By repeated substitution from the state equation at+i ~ T{at -f Rtr]t we can express a t + \ as a linear function of «i and rji, . . . , r\t. Elements of «i other than those in 5 and also elements of . . . , r\t have finite unconditional variances and hence have finite conditional variances given Yt. We have also ensured that elements of 8 have finite conditional variances given Yt for t >d by definition of d. It follows that Var(oi/+1 jFj) — Pt+i is finite and hence foo.r+i = 0 for t > d.

On the other hand, for t < d, elements of Var(5|Yr) become infinite as K —> oo from which it follows that elements of Var(a,+i | Yt) become infinite, so Poo,r+i / 0 for t < d. This establishes that for non-degenerate models there is a value doit

such that Poo r / 0 for t < d and PaQ>t — 0 for t > d. Thus when t > d we have Pt — P*,i + 0(K~1) so on letting K oo we can use the usual Kalman filter (4.13) starting with aj+i — afh and Pd+i — P*,d+i- A similar discussion of this point is given by de Jong (1991).

5.2.3 A CONVENIENT REPRESENTATION

The updating equations for P*i/+i and Poo,r+i, for / — I,... ,d, can be combined to obtain a very convenient representation. Let

P( — Pcoj), —

From (5.14), the limiting initial state filtering equations as K oo can be written as

P/+1 - TtP}L\' + [RtQtR't 0], t = l , . . . , d , (5.19)

with the initialisation p/ — pt = [P+ P^]. For the case F ^ j = 0, the equations

L?> L) o L(;

(i) (0)

(5.18)

Page 123: Time Series Analysis by State Space Methods by Durbin and Koopman

106 INITIALISATION OF FILTER AND SMOOTHER

in (5.19) with the definitions in (5.18) are still valid but with

KP = TtM^tF~l Li°> = Tt - K?>Zt, iJp = 0.

This follows directly from the argument used to derive (5.15), (5.16) and (5.17). The recursion (5.19) for diffuse state filtering is due to Koopman and Durbin (2001). It is similar in form to the standard Kalman filtering (4.13) recursion, leading to simplifications in implementing the computations.

5.3 Exact initial state smoothing

5.3.1 SMOOTHED MEAN OF STATE VECTOR

To obtain the limiting recursions for the smoothing equation at — at + Ptrt~i given in (4.26) for t — d,..., 1, we return to the recursion (4.25) for r f_i, that is,

i = Z'tF~xvt + L'trt, t - «,..., 1,

with r„ — 0. Since rf l depends on F~l and Lt which can both be expressed as power series in we write

rt~\ = r,(0)j + K~lrf\ + 0(K~\ t = dt..., 1. (5.20)

Substituting the relevant expansions into the recursion for r t_\ we have for the case Foo^ nonsingular,

r!°\ + / c - V S + - • • = Z ' ^ F P + + - - -)(„?» + + . . . )

+ (L?~> + K - 1 ^ + - • • )'(r® + K-^rj» + •--),

leading to recursions for rf* and

r(0) _ r(0V (0) 't—I — ^t ' t > -(I) _ 7/ pi1),/0) _L I (0Vr(!) -i- T PV,® rt_j — Ztrt

vt rt + Lt rt ,

for t = d,..., 1 with r<0) - rd and rf - 0. The smoothed state vector is

at = at + Ptrt-\ = a, + [icPoo., + P*,, + O^"1)]^ + K-'r^ + 0(K~2)}

= * + kPoo,M% + + P*,t(r?\ + *"V<i\) + 0(K~1)

- a, + KP^if}, + P*ytrf}x + Poo,^ + 0(K~1\ (5.22)

where a, = . It is immediately obvious that for this expression to make sense we must have Poo,ti~f\ — 0 for all t. This will be the case if we can show that Var(o:f | Yn) is finite for all t as K 00. Analogously to the argument in §5.2.2 we can express a t as a linear function of 5, ??!>••-> rjt-1 • But Var(5jFrf) is finite by definition ofd so Var(5j7„)mustbefiniteas K 00 sinced < «.Also,

Page 124: Time Series Analysis by State Space Methods by Durbin and Koopman

5.3. EXACT INITIAL STATE SMOOTHING 107

Qj = Var(^) is finite so Var(?yj \ Yn) is finite for j — 0 , . . . , t — 1. It follows that Var(a( |7n) is finite for all t as k oo so from (5.22) P0ojrf} i — 0.

Letting K —> oo we obtain

&t = «!0) + + Poojtj^, t=d,..., 1, (5.23)

with r f ] - rd and r(J} = 0. The equations (5.21) and (5.23) can be re-formulated to obtain

rli = + Lfrl a, = ar(0) + P/,-^, 1,

(5.24)

where

*•=(;«>)•with ^(o)' and the partioned matrices p/ and l J are defined in (5.18). This formulation is convenient since it has the same form as the standard smoothing recursion (4.26). Considering the complexity introduced into the model by the presence of the diffuse elements in oti, it is very interesting that the state smoothing equations in (5.24) have the same basic structure as the corresponding equations (4.26). This is a useful property in constructing software for implementation of the algorithms.

To avoid extending the treatment further, and since the case FOQJ — 0 is rare in practice, we omit consideration of it here and in §5.3.2 and refer the reader to the discussion in Koopman and Durbin (2001).

5 .3 .2 SMOOTHED VARIANCE OF STATE VECTOR

We now consider the evaluation of the variance matrix of the estimation error a, ~~ at for t ~ d,..., 1 in the diffuse case. We shall not derive the cross-cova-riances between the estimation errors at different time points in the diffuse case because in practice there is little interest in these quantities.

From §4.3.2, the error variance matrix of the smoothed state vector is given by V, = P, - PtNt-\Pt with the recursion Nt-1 = Z'tF~lZt + L'tNtLt, for £ — 1, and Nn — 0. To obtain exact finite expressions for Vt and N,_i where Fqoj is nonsingular and ic oo, for t ~ d,..., 1, we find that we need to take three-term expansions instead of the two-term expressions previously employed. Thus we write

Nt = Nj0) + + K^-N^ + 0(K"3). (5.25)

Ignoring residual terms and on substituting in the expression for Nt-\, we obtain the recursion for Nt~\ as

= Z\(K~X F,(1) + k"2P<2) + • • - )Z t + (Lj0) + + *-2Li2 ) + . . . ) '

x (Atf°> + K- 'Ni l ) + + • • - )(L}0) + + k - 2 L ^ + - . . ) ,

Page 125: Time Series Analysis by State Space Methods by Durbin and Koopman

108 INITIALISATION OF FILTER AND SMOOTHER

which leads to the set of recursions

N.

K = Z'tF™Zt + L^'N^lT + L™N?»Lf> + L?»N?>L?\

= Z'M2)Zt + L^'N^L™ + L^'N^L? + L^'N^L™

+ Lf)'Nf)L?) + L^N^L™ + LpTV^zW, (5.26)

with N f } = Nd and = Np - 0. Substituting the power series in K~2, etc. and the expression P, =

KPooj -(- [ into the relation Vt = Pt — PfNt_iPt we obtain

V, = + P*)f

- (kP«,., + P * , , ) ^ + k - ' N ^ + K - 2 N ^ -1- • • • )(*?«,., + P*,,) ,,2p ATC) p

+ - / W t y - l ^ " PtjN^ P^ ~ P o o j N ^ P ^ j ) m

(0) 00,f

(1) + P*,, - - P*,tK-lPooJ - PooM%P^ t

Poo.tN^Pooj + 0(KL). -U (5.27)

It was shown in the previous section that Vt — Var(cKf | Yn) is finite fori — 1 , . . . , n. Thus the two matrix terms associated with K and K2 in (5.27) must be zero. Letting K oo, the smoothed state variance matrix is given by

Vf — P*t ~ P*,t ̂ t-l ~ Poo J ~ P*,T ~~ Poo,T^tJ\ PQGJ • (5.28)

Note that P+JN^POOJ ~ (Poo jN^P^) ' . Koopman and Durbin (2001) have shown by exploiting the equality

P o o j L t N ^ = 0, that when the recursions for n{1) and N p in (5.26) are used to calculate the terms in and Af^ , respectively, in (5.28), various items vanish and that the effect is that we can proceed in effect as if the recursions are

N™ = Z'tFll)Zt + LT'N^LT + L^'N^LfK (5.29)

N?\ = Z'tF^Zt + Lf^NPLT + LT'N^L^ + L^'N^Lf + L™N?>L?\

and that we can compute Vt by

V, — P*,/ — — (PcojN^P*,,) ~ P<x>,tN^2\P*,t ~ P<x>,t^i~\Poo,t-(5.30)

Thus the matrices calculated from (5.26) can be replaced by the ones computed by

Page 126: Time Series Analysis by State Space Methods by Durbin and Koopman

5.4. EXACT INITIAL DISTURBANCE SMOOTHING 109

(5.29) to obtain the correct value for V,. The new recursions in (5.29) are convenient since the matrix Lj;2) drops out from our calculations for

Indeed the matrix recursions for N ^ and N ^ in (5.29) are not the same as the recursions for N,L) and N^ in (5.26). Also, it can be noticed that matrix N ^ in (5.26) is symmetric while N; l ) in (5.29) is not symmetric. However, the same notation is employed because N ^ is only relevant for computing V{ and matrix P 0 i s the same when is computed by (5.26) as when it is computed by (5.29). The same argument applies to matrix N;2\

It can now be easily verified that equations (5.30) and the modified recursions (5.29) can be re-formulated as

N{ -0 Z'tF^Zt

Z'tF\y)Zt Z'rFi2)Zt + L ] ' N } L \ , v

t = />*., - P}N}_X P?',

(5.31)

for t — d,..., 1, where

-

Nm Na y

N, (i) N (2) t-1 "i-l

t r t

with yV1 - \Nd

I « and the partioned matrices P, and L) are defined in (5.18). Again, this formulation has the same form as the standard smoothing recursion (4.31) which is a useful property when writing software. The formulations (5.24) and (5.31) are given by Koopman and Durbin (2001).

5.4 Exact initial disturbance smoothing

Calculating the smoothed disturbances does not require as much computing as calculating the smoothed state vector when the initial state vector is diffuse. This is because the smoothed disturbance equations do not involve matrix multiplications which depend on terms in K or higher order terms. From (4.36) the smoothed estimator is st — Ht(F^xvt — K'trt) where F~l — 0(fc~l) for F^t positive definite,F,-1 - F~J + O(ic^) f o r = 0,Kt = /T,(0) + 0{K l)<m&rt = rj0) + 0(K~1) SO, as tc oo, we have

i—HtKf^'rf^ if Fooyt is nonsingular,

H,(F-Jv, - X f r ® ) if F„,, = 0, for t = </, . . . , 1. Other results for disturbance smoothing are obtained in a similar way and for convenience we collect them together in the form

S, = - H t K f ) ! r f \ fh - QrR[rf\

Varfoly) = Ht - HtK^'N^tcf]H,, Var^jy) = Qt - QtR'tN{0) RtQt,

Page 127: Time Series Analysis by State Space Methods by Durbin and Koopman

110 INITIALISATION OF FILTER AND SMOOTHER

for the case where F ^ j ^ 0 and

n< = Q<R'tr?\ Varfoly) = Ht - ^ ( F " / + J ^ a W K Var(r?f|y) = g , - Qt^N^RtQt,

for the case F^j — 0 and for t = d,..., 1. It is fortunate that disturbance smoothing in the diffuse case does not require as much extra computing as for state smoothing. This is particularly convenient when the score vector is computed repeatedly within the process of parameter estimation as we will discuss in §7.3.3.

5.5 Exact initial simulation smoothing

When the initial state vector is diffuse it turns out that the simulation smoother of §4.7 can be used without the complexities of §5.3 required for diffuse state smoothing. We first consider how to obtain a simulated sample of a given y.

Taking a — (orj,. . . , a„, an + i) ' as before, define a/\ as a but without a\. It follows that

p(a\y) = p(ai\y)p(a/i\y,ai). (5.32) Obtain a simulated value of a\ by drawing a sample value from p(«i|y) — N(#i, V\) for which a\ and V\ are computed by the exact initial state smoother as developed in §5.3. Next initialise the Kalman filter with a.\ ~ and P\ — 0, since we now treat ct\ as given, and apply the Kalman filter and simulation smoother as decribed in §4.7. This procedure for obtaining a sample value of a given y is justified by equation (5.32).

To obtain multiple samples, we repeat this procedure. This requires computing a new value of «i, and new values of vt from the Kalman filter for each new draw. The Kalman filter quantities F{y Kt and Pt+\ do not need to be re-computed. Note that for a model with non-diffuse initial conditions, we do not re-compute the vt's for generating multiple samples; see §4.7.6.

A similar procedure can be followed for simulating disturbance vectors given y: weinitialisetheKalmanfilterwithai — a\ and P\ — 0 as above and then use the simulation smoothing recursions of §4.7 to generate samples for the disturbances.

5.6 Examples of initial conditions for some models

In this section we give some examples of the exact initial Kalman filter for t — 1 , . . . , d for a range of state space models.

5.6.1 STRUCTURAL TIME SERIES MODELS

Structural time series models are usually set up in terms of nonstationary components. Therefore, most of the models in this class have the initial state

Page 128: Time Series Analysis by State Space Methods by Durbin and Koopman

5.6. EXAMPLES OF INITIAL CONDITIONS FOR SOME MODELS 111

vector equals 5, that is, a\ — S so that a.\ — 0, P* — 0 and Poo — fm- We then proceed with the algorithms provided by §§5.2, 5.3 and 5.4.

To illustrate the exact initial Kalman filter in detail we consider the local linear trend model (3.2) with system matrices

Z r = < 1 0), «-[j : Qt = m o ]

and with Ht — of , Rt = I2, where — cr^/cr2 and q^ — o2 Jo2, The exact initial Kalman filter is started with

ax = 0, - 0,

and the first update is based on

Poo.i ~ h,

o]' Ki" =

L( D _ —

such that

The second update gives the quantities

i _ 2 ~ I I I '

and

IS 0 oo ,2

I 1 1 1

r(0) _ 2 — - 1 1 - 1 1

(1) 3 + ^ 0 ] Ï 2 + 0

- < 5 + 2 q ç + q ç 3 + q^+q^ 3 + qç+qç l + q^+lq^

together with the state update results

o ^ f 2 » - » ) , \yi-y\ )

Poo,3 =

It follows that the usual Kalman filter (4.13) can be used for t — 3 , . . . , n.

5.6.2 STATIONARY ARMA MODELS

The univariate stationary ARMA model with zero mean of order p and q is given by

yt = 01K-1 + • • • + <t>py,-p + St + OxSt-x + • • • + 0qSt-q> Zt ~ N(0, a \

"0 0] _o o j -

Page 129: Time Series Analysis by State Space Methods by Durbin and Koopman

12 INITIALISATION OF FILTER AND SMOOTHER

The state space form is

yt = (1 ,0 , oi+i = Tott + RCt+u

where the system matrices T and R are given by (3.17) with r — max(p, q + 1). All elements of the state vector are stationary so that the part a + AS in (5.2) is zero and RQ — lm. The unconditional distribution of the initial state vector a\ is therefore given by

ax ~N(0 ,a 2 Q 0 ) ,

.I1 !ii' il:

Ji1-1'"' • '

1..::"'

? h

where, since Var(a,+i) = VariTa, + Rh+i), O2QQ = O2TQQT' + o2RR'. This equation needs to be solved for QQ. It can be shown that a solution can be obtained by solving the linear equation (Im2 — T ® r )vec(0 o) — vec(RR') for QQ, where vec(£?o) and vec(RR') are the stacked columns of QQ and RR' and where

T =

tnT h\T

tm\T

hmT hmT

with tij denoting the (/, j ) element of matrix T\ see, for example, Magnus and Neudecker (1988, Theorem 2, p. 30) who give a general treatment of problems of this type. The Kalman filter is initialised by ax = 0 and P\ = QQ,

As an example, consider the ARMA(1,1) model

Then

yt = (f>yt-i +Çt + eÇt-i, - N(0, a2).

[s i] - *=(0-so the solution is

So (i-4>*yl(i+el + 24>0) 0

0 o2 i 5.6.3 NONSTATIONARY ARIMA MODELS

The univariate nonstationary ARIMA model of order p, d and q can be put in the form

y* = + • • • + 4>Py*-p + Cr + + • • • + N(o, a2).

Page 130: Time Series Analysis by State Space Methods by Durbin and Koopman

5.6. EXAMPLES OF INITIAL CONDITIONS FOR SOME MODELS 113

where y * — Adyt. The state space form of the ARIMA model with p — 2, d — 1 and q — 1 is given in §3.3 with the state vector given by

a t -

where y* — Ay, = yt — . The first element of the initial state vector ot\, that is yo, is nonstationary while the other elements are stationary. Therefore, the initial vector «i — a + + RQT]Q is given by

0 0 1 0 0 1

m ~ N(o, Q 0 \

where QQ is the 2 x 2 unconditional variance matrix for an ARMA model with p — 2 and q — 1 which we obtain from §5.6.2. When S is diffuse, the mean vector and variance matrix are

ax = 0 , Pi = KPvo + P*,

where

Poo 1 0 0 0 0 0 0 0 0

p* -ro o " L° ô o .

Analysis then proceeds using the exact initial Kalman filter and smoother of §§5.2, 5.3 and 5.4. The initial state specification for ARIMA models with d — 1 but with other values for p and q is obtained in a similar way.

From §3.3, the initial state vector for the ARIMA model with p — 2, d — 2 and <7 = 1 is given by

/ yo \ Avo y* «1 -

The first two elements of «i, that is, jo and Ayo, are nonstationary and we therefore treat them as diffuse. Thus we write

W ~ N ( 0 , fi0),

where QQ is as for the previous case. It follows that the mean vector and variance matrix of ocx are

/ o \ 1 0" 0 0" 0

+ 0 1 5 + 0 0 «1 = 0 + 0 0 5 + 1 0 Vo

0 0 0 1

ai = 0, Pi = tcPoo + P*,

Page 131: Time Series Analysis by State Space Methods by Durbin and Koopman

114 INITIALISATION OF FILTER AND SMOOTHER

where

Poo -[ h 01 [0 0 1 [ 0 0 ' 0 ÖoJ '

We then proceed with the methods of §§5.2, 5.3 and 5.4. The initial conditions for n on-seasonal ARIMA models with other values for p, d and q and seasonal models are derived in similar ways.

5.6 .4 REGRESSION MODEL WITH ARMA ERRORS

The regression model with k explanatory variables and ARMA(p, q) errors (3.27) can be written in state space form as in §3.7. The initial state vector is

«1 - 00 + [ o ] 5 + [ ° ] m ' Q o X

where Qo is obtained as in §5.6.2 and r — max(p, q + 1). When 8 is treated as diffuse we have a.\ ~ N(fli, Pi) where a\ — 0 and P\ — kP^ -f with

P = 1 no h 01 _ [ 0 0 1 0 O j ' o ßo J'

We then proceed as described in the last section. To illustrate the use of the exact initial Kalman filter we consider the simple

case of an AR(1) model with a constant, that is

y, +

6 - fa-1 + St,

In state space form we have

a, =

and the system matrices are given by

Zt - (1 1), Tt =

St ~ N(0, <r2).

ft) with Ht = 0 and Qt — a2. The exact initial Kalman filter is started with

ax - 1 0 0 0 ( o ) ' p*>i=c[°o

where c — cr2/(l — (pi2). The first update is based on

L<'> = C [ . ~<t> -<t>

Page 132: Time Series Analysis by State Space Methods by Durbin and Koopman

5.7. AUGMENTED KALMAN FILTER AND SMOOTHER 115

such that

( y i \ P g2

r 1 P _r° °i p*'2~T̂ l-<i> i J' ^-[o o.'

It follows that the usual Kalman filter (4.13) can be used for í — 2 , . . . , n.

5.6 .5 SPLINE SMOOTHING

The initial state vector for the spline model (3.41) is simply a.\ = 5, implying that ax = 0, P* = 0 and P^ - I2.

5.7 Augmented Kalman filter and smoother

5.7.1 INTRODUCTION

An alternative approach for dealing with the initialisation problem is due to Rosenberg (1973), de Jong (1988b) and de Jong (1991). As in (5.2), the initial state vector is defined as

Rosenberg (1973) treats 5 as a fixed unknown vector and he employs maximum likelihood to estimate 5 while de Jong (1991) treats 8 as a diffuse vector. Since the treatments of Rosenberg and de Jong are both based on the idea of augmenting the observed vector, we will refer to their procedures collectively as the augmentation approach. The approach of Rosenberg offers relief to analysts who feel uncomfort-able about using diffuse initialising densities on the ground that infinite variances have no counterpart in real data. In fact, as we shall show, the two approaches give effectively the same answer so these analysts could regard the diffuse assumption as a device for achieving initialisation based on maximum likelihood estimates of the unknown initial state elements.

5.7 .2 AUGMENTED KALMAN FILTER

In this subsection we establish the groundwork for both the Rosenberg and the de Jong techniques. For given 5, apply the Kalman filter, (4.13) with a\ = E(«i) — a + A8, P\ — Var(a!i) — P* = RQQQR'Q and denote the resulting value of at from the filter output by aS t. Since a$,, is a linear function of the observations and ax — a + AS we can write

where aa ! is the value of at in the filter output obtained by taking a\ = a, Px — P* and where the yth column of AAJ is the value of a, in the filter output obtained from an observational vector of zeros with ax = Aj, Pi = P*, where Ay is the yth column of A. Denote the value of vt in the filter output obtained by taking ax — a 4- A5, Pi = P* by v&J. Analogously to (5.34) we can write

= a + AS 4- Rom, m ~ N(0, Q0). (5.33)

as,t = Oa,t + AAtt&, (5.34)

V$j = Va,t + VA,t&> (5.35)

Page 133: Time Series Analysis by State Space Methods by Durbin and Koopman

116 INITIALISATION OF FILTER AND SMOOTHER

where vaJ and VAJ are given by the same Kalman filters that produced aa , and AAJ.

The matrices (aat, AAJ) and (vaJ, VAJ) can be computed in one pass through a Kalman filter which inputs the observation vector yt augmented by zero values. This is possible because for each Kalman filter producing a particular column of (aait, AAJ), the same variance initialisation P\ = P* applies, so the variance output, which we denote by FS t, K& t and Psj+i, is the same for each Kalman filter. Replacing the Kalman filter equations for the vectors vt and at by the corresponding equations for the matrices (aati AA {) and (vaJ, VA,t) leads to the equations

(va,t,VA<t) = (yt,0)-Zt(aa<t,AAj), (5.36)

(«fl.t+i, Aa,{+i) = Tt(aaJ, AAJ) + KNjt(vaJ> VAJ),

where (G0,I , AAJ) ~ (a, A); the recursions corresponding to Ft, K, and Pt+\ remain as for the standard Kalman filter, that is,

Fs.t = ZtP$jZ't + Ht, Ks,t = Tt P$jZ'tFfo , L$j -Tt — KsjZt, (5.37)

P&j+i = TtPsjL'&t + RtQtR't,

for t ~ 1 , . . . , n with Ps, i = P*. We have included the suffix <5 in these expressions not because they depend mathematically on 8 but because they have been calculated on the assumption that 8 is fixed. The modified Kalman filter (5.36) and (5.37) will be referred to as the augmented Kalman filter in this book.

5.7 .3 FILTERING BASED ON THE AUGMENTED KALMAN FILTER

With these preliminaries, let us first consider the diffuse case (5.2) with 8 ~ N(0, Klq) where k —> oo; we will consider later the case where 8 is fixed and is estimated by maximum likelihood. From (5.34) we obtain for given K,

al+y - E(at+i\Yt) = aaJ+l + AAj+1Sti (5.38)

where <5, — E(5|y,). Now

log p(8\Yt) = log p(8) + log p(Yf |5) - log p(Yt) t

= log p(8) + log p(vSj) - log p(Yt) j=i

1 1 = - b[8 - -8'SA>t8 + terms independent of 8, (5.39)

where

= E VAJFs:lv°-j> - E VajF8~]VAJ- (5.40)

Page 134: Time Series Analysis by State Space Methods by Durbin and Koopman

5.7. AUGMENTED KALMAN FILTER AND SMOOTHER 117

Since densities are normal, the mean of p(8\Yt) is equal to the mode, and this is the value of S which maximises log p(<5|F,), so on differentiating (5.39) with respect to <5 and equating to zero, we have

St (sA,t + bt. (5.41)

Also,

Pt+1 = E[(a/+i - ai+1)(ai+i - a,+1)'] = E[{a5,i+i - a(+i - Aa>,+i(5 - 5*)}{aa,i+i - ai+i - AA,i+i(5 - 3*)}'] = PSit+1 + AA,t+1Var(8\Yt)A'A,t+i

= p4i f+l + A^r+i ( s A j + A'At(+1, (5.42)

since Var(5|yf) - (SAj 4-Letting K oo we have

*t = (5.43) VaKW) = S j i , (5.44)

when SAj is nonsingular. The calculations of b, and SAt are easily incorporated into the augmented Kalman filter (5.36) and (5.37). It follows that

at+1 = Oa.i+i - A A j + iS^ ( b t , (5.45)

Pt+l = Pitt+1 + AA,,+1S^,A'Alm> (5.46)

as k —> oo. For t < d, SAJ is singular so values of at+i and P i+i given by (5.45) and (5.46) do not exist. However, when t — d, ad+1 and Pj+ 1 exist and consequently when t > d the values a f + i and Pt+i can be calculated by the standard Kalman filter for t — d + 1 , . . . , n. Thus we do not need to use the augmented Kalman filter (5.36) for t — d + 1 , . . . , n. These results are due to de Jong (1991) but our derivation here is more transparent.

We now consider a variant of the maximum likelihood method for initialising the filter due to Rosenberg (1973). In this technique, S is regarded as fixed and unknown and we employ maximum likelihood given Yt to obtain estimate The Logiikelihood of Yt given 5 is

' 1 log p(Y,\8) = } log piVsj) — —bf

t8 8'SA,t8 + terms independent of 5, ;=i 2

which is the same as (5.39) apart from the term —8'8/(2K). Differentiating with respect to 8, equating to zero and taking the second derivative gives

8, = Var(^) -

Page 135: Time Series Analysis by State Space Methods by Durbin and Koopman

118 INITIALISATION OF FILTER AND SMOOTHER

when SAit is nonsingular, that is for t — d,..., n. These values are the same as St and Var^jy,) when K oo. In practice we choose t to be the smallest value for which §t exists, which is d. It follows that the values of at+\ and for t > d given by this approach are the same as those obtained in the diffuse case. Thus the solution of the initialisation problem given in §5.2 also applies to the case where S is fixed and unknown. From a computational point of view the calculations of §5.2 are more efficient than those for the augmented device described in this section when the model is reasonably large. A comparison of the computational efficiency is given in §5.7.5. Rosenberg (1973) used a procedure which differed slightly from this. Although he employed essentially the same augmentation technique, in the notation above he estimated 5 by the value S„ based on all the data.

5.7.4 ILLUSTRATION: THE LOCAL LINEAR TREND MODEL

To illustrate the augmented Kalman filter we consider the same local linear trend model as in §5.6.1. The system matrices of the local linear trend model (3.2) are given by

Z — (1 0), T = 1 1 0 1 ft J'

with H = 07 and R = 1% and where q^ = a^f of and q( — The augmented Kalman filter is started with

0 1 0 0 0 1 (00,1. Aa,I)

and the first update is based on

(va,i , K u ) = (yi " 1 0), Ft,i=cr?,

L$t i =

[ 0 0 [o 0

K)

1 1 0 1

so

and

bx = Sa, I — 1

0 1 1 0 0 1 (aa>2, Aa,2) =

The second update gives the quantities

K2,VA .2) = (ya - l ) ,

ft

ft 0 L ° ft J

^,2 = 0 - / (1+^ ) , r_L_ \

T - 1

Page 136: Time Series Analysis by State Space Methods by Durbin and Koopman

5.7. AUGMENTED KALMAN FILTER AND SMOOTHER 119

with

b2 — ( 1 + V (i + qi)y\ + y2

yi 2 — 1 2 + 11

1 1 J '

and the state update results

(a0i3, Aa,3) = 1

1 + i t

1 2 + ^ 0 0 1 + ^ ^,3 = ft + i f c + if if

The augmented part can be collapsed since SA,2 is nonsingular, giving

It follows that

03 = «0,3 + =

P3 = ft,3 + A

/ 2y2 - y 1 \ V - yi J [ A a , - <x: 3 + + ^ 2 + qt: + 2q% _

and the usual Kalman filter (4.13) can be used for t — 3 , . . , , n. These results are exactly the same as obtained in §5.6.1, though the computations take longer as we will now show.

5.7.5 COMPARISONS OF COMPUTATIONAL EFFICIENCY

The adjusted Kalman filters of §§5.2 and 5.7.2 both require more computations than the Kalman filter (4.13) with known initial conditions. Of course the adjustments are only required for a limited number of updates. The additional computations for the exact initial Kalman filter are due to updating the matrix Poo(i+i computing the matrices K p and ifP when F ^ , ^ 0, for t — 1 , . . . , d. For many practical state space models the system matrices Zt and Tt are sparse selection matrices containing many zeros and ones; this is the case for the models discussed in Chapter 3. Therefore, calculations involving Z, and Tt are particularly cheap for most models. Table 5.1 compares the number of additional multiplications (compared to the Kalman filter with known initial conditions) required for filtering using the devices of §§5.2 and 5.7.2 applied to several structural time series models which are discussed in §3.2. The results in Table 5.1 show that the additional

Table 5.1. Number of additional multiplications for filtering.

Model Exact initial Augmenting Difference (%) Local level 3 7 57 Local linear trend 18 46 61 Basic seasonal (s — 4) 225 600 63 Basic seasonal (s — 12) 3549 9464 63

Page 137: Time Series Analysis by State Space Methods by Durbin and Koopman

120 INITIALISATION OF FILTER AND SMOOTHER

number of computations for the exact initial Kalman filter of §5.2 is less than half the extra computations required for the augmentation device of §5.7.2. Such computational efficiency gains are important when the Kalman filter is used many times as is the case for parameter estimation; a detailed discussion of estimation is given in Chapter 7. It will also be argued in §7.3.5 that many computations for the exact initial Kalman filter only need to be done once for a specific model since the computed values remain the same when the parameters of the model change. This argument does not apply to the augmentation device and this is another important reason why our approach in §5.2 is more efficient than the augmentation approach.

5.7.6 SMOOTHING BASED ON THE AUGMENTED KALMAN FILTER

The smoothing algorithms can also be developed using the augmented approach. Tire smoothing recursion for in (4.48) needs to be augmented in the same way as is done for v{ and at of the Kalman filter. When the augmented Kalman filter is applied for t — 1, the modifications for smoothing are straightforward after computing 8„ and Var(<5„) and then applying similar expressions to those of (5.45) and (5.46). The collapse of the augmented Kalman filter to the standard Kalman filter is computationally efficient for filtering but, as a result, the estimates and Var(<5,[) are not available for calculating the smoothed estimates of the state vector. It is not therefore straightforward to do smoothing when the collapsing device is used in the augmentation approach. A solution for this problem has been given by Chu-Chun-Lin and de Jong (1993).

Page 138: Time Series Analysis by State Space Methods by Durbin and Koopman

6 Further computational aspects

6.1 Introduction

In this chapter we will discuss a number of remaining computational aspects of the Kalman filter and smoother. Two different ways of incorporating regression effects within the Kalman filter are described in §6.2. The standard Kalman filter recursion for the variance matrix of the filtered state vector does not rule out the possibility that this matrix becomes negative definite; this is obviously an undesirable outcome since it indicates the presence of rounding errors. The square root Kalman filter eliminates this problem at the expense of slowing down the filtering and smoothing processes; details are given in §6.3. The computational costs of implementing the filtering and smoothing procedures of Chapters 4 and 5 can become high for high-dimensional multivariate models, particularly in dealing with the initialisation problem. It turns out that by bringing the elements of the observation vectors in multivariate models into the filtering and smoothing computing operations one at a time, substantial gains in computational efficiency are achieved and the initialisation problem is simplified considerably. Methods based on this idea are developed in §6.4. The various algorithms presented in the Chapters 4 and 5 and this chapter need to be implemented efficiently on a computer. Recently, the computer package SsfPack has been developed which has implemented all the algorithms considered in this book in a computationally efficient way. Section 6.6 describes the main features of the package together with an example.

6.2 Regression estimation

6.2.1 INTRODUCTION

As for the structural time series model considered in §3.2, the general state space model can be extended to allow for the incorporation of explanatoiy variables and intervention variables into the model. To accomplish this generalisation we replace the measurement equation of the state space model (3.1) by

yt = Ztat + Xtp + et, (6.1)

where Xt — (xi> ; , . . . , is a p x k matrix of explanatory variables and is a k x 1 vector of unknown regression coefficients which we assume are constant

Page 139: Time Series Analysis by State Space Methods by Durbin and Koopman

122 FURTHER COMPUTATIONAL ASPECTS

over time and which we wish to estimate. We will not discuss here time-varying regression coefficients because they can be included as part of the state vector a t in an obvious way as in §3.6 and are then dealt with by the standard Kalman filter and smoother. There are two ways in which the inclusion of regression effects with fixed coefficients can be handled. First, we may include the coefficient vector £ in the state vector. Alternatively, and particularly on occasions where we wish to keep the dimensionality of the state vector as low as possible, we can use the augmented Kalman filter and smoother. Both solutions will be discussed in the next two sections. Different types of residuals exist when regression variables are included in the model. We show in §6.2.4 how to calculate them within the two different solutions.

6.2 .2 INCLUSION OF COEFFICIENT VECTOR IN STATE VECTOR

The state space model in which the constant coefficient vector ft in (6.1) is included in the state vector has the form

for t — 1 , . . . , n. In the initial state vector, ft can be taken as diffuse or fixed. In the diffuse case the model for the initial state vector is

where ic —oo; see also §5.6.4 where we give the initial state vector for the regression model with ARMA errors. We attach suffixes to the fts purely for convenience in the state space formulation since ft+1 — ft = ft The exact initial Kalman filter (5.19) and the Kalman filter (4.13) can be applied straightforwardly to this enlarged state space model to obtain the estimate of ft The enlargement of the state space model will not cause much extra computing because of the sparse nature of the system matrices.

6.2 .3 REGRESSION ESTIMATION BY AUGMENTATION

Another method of estimating /3 is by augmentation of the Kalman filter. This technique is essentially the same as was used in the augmented Kalman filter of §5.7. We will give details of this approach on the assumption that the initial state vector does not contain diffuse elements. The likelihood function in terms of fi is constructed by applying the Kalman filter to the variables yt, x\j,..., x*,/ in turn. Each of the variables jq >t, . . . , x^, is treated in the Kalman filter as the observation vector with the same variance elements as used for yt. Denote the resulting one-step forecast errors by v*, x*t,..., x£ t, respectively. Since the filtering operations

Page 140: Time Series Analysis by State Space Methods by Durbin and Koopman

6.2. REGRESSION ESTIMATION 123

are linear, the one-step forecast errors for the series yt — Xtft are given by vt = v* — X*ft where X* = (x*t... It should be noted that the k + 1 Kalman filters are the same except that the values for vt and at in (4.13) are different. We can therefore combine these filters into an augmented Kalman filter where we replace vector yt by matrix (yi; Xt) to obtain the 'innovations' matrix (vf, X*)\ this is analogous to the augmented Kalman filter described in §5.7. The likelihood may be written as

Maximising this with respect to ft gives the generalised least squares estimates ft and Var(/i) where

In the case where the initial state vector contains diffuse elements we can extend the augmented Kalman filter for <5 as shown in §5.7. However, we ourselves prefer to use the exact initial Kalman filter for dealing with S. The equations for

t and Poo j of the exact initial Kalman filter are not affected since they do not depend on the data. The update for the augmented state vector is given by the equations

(v*, xlt) = (y,, xu,..., xKt) - Zt(a0tt, AXJ), /qi (6.3)

(aa,t+i, Ax>i+i) = Tt(aaJ, AXyt) + K) \v*, x * t , . . . , x*t),

for t — 1 , . . . , d with (aa i , Ax l ) = (a, 0 , . . . , 0)and where Kj® is defined in §5.2. Note that F~l in (6.2) must be replaced by zero or depending on the value for Fooit in the exact initial Kalman filter. Overall, the treatment given in the previous section where we include ft in the state vector, treat S and ft as diffuse and then apply the exact initial Kalman filter is conceptually simpler, though it may not be as efficient computationally for large models.

6 .2 .4 LEAST SQUARES AND RECURSIVE RESIDUALS

By considering the measurement equation (6.1) we define two different types of residuals following Harvey (1989, §7.4.1): recursive residuals and least squares residuals. The first type are defined as

n log L(y — constant v't F(

lv, 2 r=i 1 - constant - - - X*ft)'F~\v* - X f f t ) .

(6.2)

vt = y{ — Ztat — X t 0 t - t — d + 1,..., n,

Page 141: Time Series Analysis by State Space Methods by Durbin and Koopman

124 FURTHER COMPUTATIONAL ASPECTS

where ftt~i is the maximum likelihood estimate of ft given Yt-x. The residuals v, are computed easily by including ft in the state vector with ct\ diffuse since the filtered state vector of the enlarged model in §6.2.2 contains ftt-\• The augmentation method can of course also evaluate vt but it needs to compute ftt-i at each time point which is not computationally efficient. Note that the residuals vt are serially uncorrected. The least squares residuals are given by

„+ =yt- Ztat - Xt$, t = d+ 1,... ,n,

where ft is the maximum likelihood estimate of ft based on the entire series, so ft — ftn. For the case where the method of §6.2.2 is used to compute ft, we require two Kalman filters: one for the enlarged model to compute ft and a Kalman filter for the constructed measurement equation yt — Xtft = Ztat -f et whose 'innovations' vt are in fact . Hie same applies to the method of §6.2.3, except that ft is computed using (6.2). The least squares residuals are correlated due to the presence in them of p, which is calculated from the whole sample.

Both sets of residuals can be used for diagnostic purposes. For these purposes the residuals vt have the advantage of being serially uncorrelated whereas the residuals i>,+ have the advantage of being based on the estimate ft calculated from the whole sample. For further discussion we refer to Harvey (1989, §7.4.1).

6.3 Square root filter and smoother

6.3.1 INTRODUCTION

! In this section we deal with the situation where, because of rounding errors and ji matrices being close to singularity, the possibility arises that the calculated value

of Pt is negative definite, or close to this, giving rise to unacceptable rounding :) errors. From (4.13), the state variance matrix Pt is updated by the Kalman filter " equations

Ft — ZtPtZ't + Ht, Kt - TtPtZ'tF-\

:,; , , (6.4) - Pt+l = TtPsL>t + RtQtR't

= TtPtT{ + RtQtKt-KtFtKl.

It can happen that the calculated value of Pt becomes negative definite when, for example, erratic changes occur in the system matrices over time. The problem can be avoided by using a transformed version of the Kalman filter called the square root filter. However, the amount of computation required is substantially larger than that required for the standard Kalman filter. The square root filter is based on orthogonal lower triangular transformations for which we can use Givens rotation techniques. The standard reference to square root filtering is Morf and Kailath (1975).

Page 142: Time Series Analysis by State Space Methods by Durbin and Koopman

6.3. SQUARE ROOT FILTER AND SMOOTHER 125

6.3.2 SQUARE ROOT FORM OF VARIANCE UPDATING

Define the partitioned matrix Ut by

r Z T P T HT o ]

[TTPT 0 RTQ,Y (6.5)

where

PT = PTP,N H, = HTH'T, QT = QTQ'T,

in which the matrices PT, HT and QT are lower triangular matrices. It follows that

utu; = FT

TTPTZ< T(PT

The matrix Ut can be transformed to a lower triangular matrix using the orthogonal matrix G such that GG' — IM+P+R. Note that a lower triangular matrix for a rectangular matrix such as ULY where the number of columns exceeds the number of rows, is defined as a matrix of the form [A 0] where A is a square and lower triangular matrix. Postmultiplying by G we have

U(G = T/;, (6.7)

and U*U*' — UT U'T as given by (6.6). The lower triangular rectangular matrix U* has the same dimensions as UT and can be represented as the partitioned matrix

' K * , "{ , o j •

where U*T and are lower triangular square matrices. It follows that

U*U*' =

from which we deduce that

TT* TT*'

vhKt Ft

L Tt Pt z\

a KVt ]

ZtPtT/ TTPTTF + RTQTR

where FT — Pt+1 since

U*2T - TTPTZ'TF'~L = KTFT,

FTF'T and F, is lower triangular. It is remarkable to find that U%T —

vitKt = T'p'Ti + RtQ'K - uitUZ

= TT PT R/ + RT QTR'T — KT FT K'T

— Pt+1,

Page 143: Time Series Analysis by State Space Methods by Durbin and Koopman

126 FURTHER COMPUTATIONAL ASPECTS

which follows from (6.4). Thus by transforming U, in (6.5) to a lower triangular matrix we obtain P(+j; this operation can thus be regarded as a square root recursion for Pt. The update for the state vector at can be easily incorporated using

at+i ~ Ttat + Ktvt, = Ttat + TtPtZ'(F'-lF;lv(

= Ttat + UlsWulvt,

where vt — yt — Ztat. Note that the inverse of U*t is easy to calculate since it is a lower triangular matrix.

6.3.3 GIVENS ROTATIONS

Matrix G can be any orthogonal matrix which transforms Ut to a lower triangular matrix. Many different techniques can be used to achieve this objective; for example, Golub and Van Loan (1997) give a detailed treatment of the Householder and Givens matrices for the purpose. We will give here a short description of the latter. The orthogonal 2 x 2 matrix

G2 c 5 : (6.8) —s c

with c2 + s2 — 1 is the key to Givens transformations. It is used to transform the vector

x (Ai X2),

into a vector in which the second element is zero, that is

y = xG2 = (yi 0),

by taking

» = (6.9) / 2 2 Í 2

for which c2 -{- s2 — 1 and S-Xi + cx2 — 0. Note that yi = cxx — sx2 and yG'2 — xG2G!2 — JC.

The general Givens matrix G is defined as the identity matrix Iq but with four elements la, Ijj, , replaced by

Gu — Gjj = c, Gij = 5, Gji = -i,

for 1 < i < j < q and with c and s given by (6.9) but now enforcing element (i, j) of matrix XG to be zero for all 1 < i < j < q and for any matrix X. It follows that GG' — I so when Givens rotations are applied repeatedly to create zero blocks in a matrix, the overall transformation matrix is also orthogonal. These properties of the Givens rotations, their computational efficiency and their numerical stability

Page 144: Time Series Analysis by State Space Methods by Durbin and Koopman

6.3. SQUARE ROOT FILTER AND SMOOTHER 127

makes them a popular tool to transform nonzero matrices into sparse matrices such as a lower triangular matrix. More details and efficient algorithms for Givens rotations are given by Golub and Van Loan (1997).

6.3.4 SQUARE ROOT SMOOTHING

The backwards recursion (4.30) for N t-i of the basic smoothing equations can also be given in a square root form. These equations use the output of the square root Kalman filter as given by:

Kt = ft.

Vit = ft+1.

The recursion for A^-i is given by

NT-I — Z'TFT + L'TN(LT,

where FT~L = (UITU*;TRL,

L, — TT — U*(U*JLZT.

We introduce the lower triangular square matrix Nt such that

NT = NTN'T,

and the m x (m + p) matrix

from which it follows that NT L — N*_XN*'_V

The matrix can be transformed to a lower triangular matrix using some orthogonal matrix G such that GG' — IM+P\ compare §6.3.2. We have

tfüUG 0],

such that ATf-i — N*_XN*'_{ ~ N t-iN' t_x. Thus by transforming the matrix depending on matrices indexed by time t, to a lower triangular matrix we obtain the square root matrix of NT-1. Consequently we have developed a backwards recursion for N t-\ in square root form. The backwards recursion for r t i is not affected apart from the way in which F~L and LT are computed.

6.3.5 SQUARE ROOT FILTERING AND INITIALISATION

The square root formulation could be developed for the exact initial version of the Kalman filter of §5.2. However, the motivation for developing square root versions for filtering and smoothing is to avoid computational numerical instabilities due to rounding errors which are built up during the recursive computations. Since initialisation usually only requires a limited number of d updates, the numerical problems are not substantial during this process. Thus, although use of the square

Page 145: Time Series Analysis by State Space Methods by Durbin and Koopman

128 FURTHER COMPUTATIONAL ASPECTS

root filter may be important for t = d,..., n, it will normally be adequate to employ the standard exact initial Kalman filter as described in §5.2 for / — 1 , . . . , d.

The square root formulation of the augmented Kalman filter of §5.7 is more or less the same as the usual Kalman filter because updating equations for FT, KT and PT are unaltered. Some adjustments are required for the updating of the augmented quantities but these can be derived straightforwardly; some details are given by de Jong (1991). Recently, Snyder and Saligari (1996) have proposed a Kalman filter based on Givens rotations, such as the ones developed in §6.3.3, with the fortunate property that diffuse priors K -> oo can be dealt with explicitly within the Givens operations. Their application of this solution however was limited to filtering only and it does not seem to provide an adequate solution for initial diffuse smoothing.

6.3.6 ILLUSTRATION: LOCAL LINEAR TREND MODEL

For the local linear trend model (3.2) we take

The zero elements of U* are created row-wise by a sequence of Givens rotations applied to the matrix U,. Some zero elements in U* are already zero in UT and they mostly remain zero within the overall Givens transformation so the number of computations can be limited somewhat,

6.4 Univariate treatment of multivariate series

6.4.1 INTRODUCTION

In Chapters 4 and 5 and in this chapter we have treated the filtering and smoothing of multivariate series in the traditional way by taking the entire observational vectors y, as the items for analysis. In this section we present an alternative approach in which the elements of yt are brought into the analysis one at a time, thus in effect converting the multivariate series into a univariate time series. This device not only offers significant computational gains for the filtering and smoothing of the bulk of the series but it also provides substantial simplification of the initialisation process when the initial state vector orj is partially or wholly diffuse.

This univariate approach to vector observations was suggested for filtering by Anderson and Moore (1979, §6.4) and for filtering and smoothing longitudinal models by Fahrmeir and Tutz (1994, §8.4). The treatment given by these authors

PIU 0 CR, 0 0 C/f- P\\,t + Pzu P 22,t 0 o^ 0

P2U P22,t 0 0 <7?

<*B 0 0

which is transformed to the lower triangular matrix

FT 0 0 0 0 E ? = K i u F j Pju+1 0 0

0 0

Page 146: Time Series Analysis by State Space Methods by Durbin and Koopman

6.4. UNIVARIATE TREATMENT OF MULTIVARIATE SERIES 129

was, however, incomplete and in particular did not deal with the initialisation problem, where the most substantia] gains are made. The following discussion of the univariate approach is based on Koopman and Durbin (2000) who gave a complete treatment including a discussion of the initialisation problem.

6.4.2 DETAILS OF UNIVARIATE TREATMENT

Our analysis will be based on the standard model

yt = Ztat + st, ori+i = Ttat + Rtt],,

with st ~ N(0, Ht) and r}t ~ N(0, Qt) for t — 1, ..., n. To begin with, let us as-sume that «i ~ N((j!, P\) and H{ is diagonal; this latter restriction will be removed later. On the other hand, we introduce two slight generalisations of the basic model: first, we permit the dimensionality of yt to vary over time by taking the dimen-sion of vector yt to be pt x 1 for t = 1 , . . . , n; secondly, we do not require the prediction error variance matrix Ft to be non-singular.

Write the observation and disturbance vectors as

Í >'U \ yt - Et -

\yt,pj

and the observation equation matrices as

/ Zr.i

\S',P1 /

fit = \ Zt,pi}

T2 7tA

0 0 0

0 \

/ 0

r2 t,pt

where ytti, etj and of t are scalars and Z/i(- is a(1 x m) row vector, for/ — 1 , . . . , pt. The observation equation for the univariate representation of the model is

ytti — Ztjdtj + st>i, i = 1,..., pt, t — 1,..., n,

where atj — a,. The state e q u a t i o n co iTesponding to (6.10) i s

«í+ M = Ttttt.p, + Rtm, i = 1,..., pt - 1, t = 1,...,«,

(6.10)

(6.11)

where the initial state vector = ai ~ N(a\, P\). Define

atA =ECaifi|yf_i), PtA - Vaifa.iiri-O,

and

atyi = E4atji\Yt-i, yt.u ytj-i), Pt,i = Var(afij|yf_i, yfj,..., yf,/_i),

Page 147: Time Series Analysis by State Space Methods by Durbin and Koopman

130 FURTHER COMPUTATIONAL ASPECTS

for i — 2,..., pt. By treating the vector series y i , . . . , y„ as the scalar series

y u . • • • > yi.pi>y2,u •"•> yn,Pn, the filtering equations (4.13) can be written as

atj+1 = (hj + KtJvt.i, Pt,i+i = Pt,i ~ KuFuK'ti, (6.12) where

Vt,i = ytj — Ztjatj, F,j = ZtiP(jZ'ti +cffi, Kti — PtjZ'tiF~x, (6.13)

for i — 1 , . . . , pt and t — I, . . . , n. This formulation has vtj and Ft i as scalars and Kt i ^ a column vector. The transition from time t to time t + 1 is achieved by the relations

at+i,i = Ttat,Pt+u Pt+i,i - TtPt,Pl+lTt' + RtQtR'r (6.14) The values aJ+u and Pt+i,\ are the same as the values at+\ and P(+1 computed by the standard Kalman filter.

It is important to note that the elements of the innovation vector v, are not the same as vtj, for i — 1 , . . . , pt; only the first element of vt is equal to vtt i. The same applies to the diagonal elements of the variance matrix Ft and the variances Ft j, for / — 1 , . . . , pt ; only the first diagonal element of Ft is equal to Ft^. It should be emphasised that there are models for which Ft i can be zero, for example the case where yt is a multinomial observation with all cell counts included in yt. This indicates that yu- is linearly dependent on previous observations for some i. In this case,

atj+i = E(of,./+i|yi_i, y , , i , . . . , ytJ) = E(aM+iHV_i, y f , i , . . -, y,,i_i) = atJ,

and similarly Ptj+\ = Ptj- The contingency is therefore easily dealt with. The basic smoothing recursions (4.48) for the standard state space model can

be reformulated for the univariate series

as = Z'^F-fvtj + Lf

t irtJ, N ^ - Z'tJF^Zhi + L'tiNtJLt<i, rt-hPl — Tl_Yrty o, Nt-i,p, = T(_xNtfiTt-u

(6.15) where Lti = lm — KttiZtiF~lyioxi — pt,..., 1 andi — n,..., 1. The initialisa-tions are rn-Pa — Oand N„tPn = 0. The equations for rf_i>Pl and ty-i.p, do not apply for t — 1. The values for rt o and are the same as the values for the smoothing quantities rt-\ and of the standard smoothing equations, respectively.

The smoothed state vector at = E(a,|y) and the variance error matrix Vt — Var(a, {y), together with other related smoothing results for the transition equation,

Page 148: Time Series Analysis by State Space Methods by Durbin and Koopman

6.4. UNIVARIATE TREATMENT OF MULTIVARIATE SERIES 131

are computed by using the standard equations (4.26) and (4.31) with

at = af> i, Pt = PtA, rf_! = rt>0, Nt-i =

Finally, the smoothed estimators for the observation disturbances s t j of (6.10) follow directly from our approach and are given by

%,i = tfiFt~Hvt,i - Ktjrt,i\ Var(cfii) = + K'tiNtAKtA).

Similar formula can also be developed for the simulation smoother of §4.7 after conversion to univariate models.

6.4.3 CORRELATION BETWEEN OBSERVATION EQUATIONS

For the case where Ht is not diagonal, the univariate representation of the state space model (6.10) does not apply due to the correlations between the et/s. In this situation we can pursue two different approaches. Firstly, we can put the disturbance vector st into the state vector. For the observation equation of (3.1) define

and for the state equation define

M i £)• Mo' 2,)' leading to

yt — Ztat, a(+i - ftat ~h RtTjt, f j , ~ N(0, Qt),

for t — 1 W e then proceed with the same technique as for the case where Ht is diagonal by treating each element of the observation vector individually. The second approach is to transform the observations. In the case where Ht is not diagonal, we diagonalise it by the Cholesky decomposition

Ht = CtH*C't,

where H* is diagonal and Ct is lower triangular with ones on the diagonal. By transforming the observations, we obtain the observation equation

y* = Z*at + e% s* ~ N(0, H?),

where y* ~ C~lyt, Z* — C~xZt and e* — C~xet. Since C, is a lower triangular matrix, it is easy to compute its inverse. The state vector is not affected by the transformation. Since the elements of e* are independent we can treat the series )>* as a univariate series in the above way.

Page 149: Time Series Analysis by State Space Methods by Durbin and Koopman

132 FURTHER COMPUTATIONAL ASPECTS

Table 6.1. Percentage savings for filtering using univariate approach.

State dim. Obs. dim. p= I P = 2 p = 3 p = 5 p = 10 p = 20 m — 1 0 39 61 81 94 98 m — 2 0 27 47 69 89 97 m — 3 0 21 38 60 83 95 m = 5 0 15 27 47 73 90 fn ~ 10 0 8 16 30 54 78 m — 20 0 5 9 17 35 58

These two approaches for correlated observation disturbances are complemen-tary. The first method has the drawback that the state vector can become large. The second method is illustrated in §6.4.5 where we show that simultaneously trans-forming the state vector can also be convenient.

6.4.4 COMPUTATIONAL EFFICIENCY

The main motivation for this 'univariate' approach to filtering and smoothing for multivariate state space models is computational efficiency. This approach avoids the inversion of matrix F( and two matrix multiplications. Also, the implementation of the recursions is more straightforward. Table 6.1 shows that the percentage savings in the number multiplications for filtering using the univariate approach compared to the standard approach are considerable. The calculations concerning the transition are not taken into account in the calculations for this table because matrix TT is usually sparse with most elements equal to zero or unity.

Table 6.2 presents the considerable percentage savings in the number of multiplications for state smoothing compared to the standard multivariate approach. Again, the computations involving the transition matrix TT are not taken into account in compiling these figures.

Table 6.2. Percentage savings for smoothing using univariate approach.

State dim. Obs. dim. p= 1 p = 2 P = 3 p = 5 p-10 p = 20 m = 1 0 27 43 60 77 87 m = 2 0 22 36 53 72 84 m — 3 0 19 32 48 68 81 m — 5 0 14 25 40 60 76 m = 10 0 9 16 28 47 65 m - 20 0 5 10 18 33 51

Page 150: Time Series Analysis by State Space Methods by Durbin and Koopman

6.4. UNIVARIATE TREATMENT OF MULTIVARIATE SERIES 133

6.4.5 ILLUSTRATION: VECTOR SPLINES

We now consider the application of the univariate approach to vector splines. The generalisation of smoothing splines of Hastie and Tibshirani (1990) to the multivariate case is considered by Fessler (1991) and Yee and Wild (1996). The vector spline model is given by

yt = 0(ti) + st, Efe) - 0, Var(£;) = , i = 1 , . . . , n,

where yf is a p x 1 vector response at scalar i( , 0 (•) is an arbitrary smooth vector function and errors St are mutually uncorrected. The variance matrix is assumed to be known and is usually constant for varying r. This is a generalisation of the univariate problem considered in §3.11.2. The standard method of estimating the smooth vector function is by minimising the generalized least squares criterion

J2 -0 ft-)}' & -0 (*t))+£ fe" o>2 dt' where the non-negative smoothing parameter X j determines the smoothness of the yth smooth function Oj (•) of vector 9 (•) for j — 1 , . . . , p. Note that ti+l > i, for i — 1 , . . . , n — 1 and 0" (t) denotes the second derivative of 0j (t) with respect to t. In the same way as for the univariate case in (3.38), we use the discrete model

$y_i = \mu_i + \varepsilon_i,$
$\mu_{i+1} = \mu_i + \delta_i \nu_i + \eta_i, \qquad \mathrm{Var}(\eta_i) = \frac{\delta_i^3}{3}\Lambda,$
$\nu_{i+1} = \nu_i + \zeta_i, \qquad \mathrm{Var}(\zeta_i) = \delta_i \Lambda, \qquad \mathrm{Cov}(\eta_i, \zeta_i) = \frac{\delta_i^2}{2}\Lambda,$

with vector $\mu_i = \theta(t_i)$, scalar $\delta_i = t_{i+1} - t_i$ and diagonal matrix $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$. This model is equivalent to the continuous-time representation of the multivariate local linear trend model with no disturbance vector for the level equation; see Harvey (1989, Chapter 8). In the case of $\Sigma_{\varepsilon,i} = \Sigma_\varepsilon$ and with the Cholesky decomposition $\Sigma_\varepsilon = CDC'$, where the matrix $C$ is lower triangular and the matrix $D$ is diagonal, we obtain the transformed model

$y_i^* = \mu_i^* + \varepsilon_i^*,$
$\mu_{i+1}^* = \mu_i^* + \delta_i \nu_i^* + \eta_i^*, \qquad \mathrm{Var}(\eta_i^*) = \frac{\delta_i^3}{3} Q,$
$\nu_{i+1}^* = \nu_i^* + \zeta_i^*, \qquad \mathrm{Var}(\zeta_i^*) = \delta_i Q, \qquad \mathrm{Cov}(\eta_i^*, \zeta_i^*) = \frac{\delta_i^2}{2} Q,$

with $y_i^* = C^{-1} y_i$ and $\mathrm{Var}(\varepsilon_i^*) = D$, where we have used (3.39). Furthermore, we have $\mu_i^* = C^{-1}\mu_i$, $\nu_i^* = C^{-1}\nu_i$ and $Q = C^{-1}\Lambda C'^{-1}$. The Kalman filter and smoother algorithm provides the fitted smoothing spline. The untransformed model and the transformed model can both be handled by the univariate strategy of filtering and smoothing discussed in this section. The advantage of the transformed model is that $\varepsilon_i^*$ can be excluded from the state vector, which is not possible for the untransformed model because $\mathrm{Var}(\varepsilon_i) = \Sigma_\varepsilon$ is not necessarily diagonal; see the discussion in §6.4.3.

The percentage computational saving of the univariate approach for spline smoothing depends on the size of $p$. The state vector dimension for the transformed model is $m = 2p$, so the percentage saving in computing for filtering is 30 if $p = 5$ and 35 if $p = 10$; see Table 6.1. The percentages for smoothing are 28 and 33, respectively; see Table 6.2.


6.5 Filtering and smoothing under linear restrictions

We now consider how to carry out filtering and smoothing subject to a set of time-varying linear restrictions on the state vector of the form

$R_t^* \alpha_t = r_t^*, \qquad t = 1, \ldots, n, \qquad (6.16)$

where the matrix $R_t^*$ and the vector $r_t^*$ are known and where the number of rows in $R_t^*$ can vary with $t$. Although linear restrictions on the state vector can often easily be dealt with by re-specifying the elements of the state vector, an alternative is to proceed as follows. To impose the restrictions (6.16) we augment the observation equation as

$\begin{pmatrix} y_t \\ r_t^* \end{pmatrix} = \begin{pmatrix} Z_t \\ R_t^* \end{pmatrix} \alpha_t + \begin{pmatrix} \varepsilon_t \\ 0 \end{pmatrix}, \qquad t = 1, \ldots, n. \qquad (6.17)$

For this augmented model, filtering and smoothing will produce estimates $a_t$ and $\hat\alpha_t$ which are subject to the restrictions $R_t^* a_t = r_t^*$ and $R_t^* \hat\alpha_t = r_t^*$; for a discussion of this procedure, see Doran (1992). Equation (6.17) represents a multivariate model whether $y_t$ is univariate or not. It can, however, be converted into a univariate model by the device of §6.4.
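A minimal Python sketch (our illustration, not the book's code) of the augmentation in (6.17): the restriction rows are appended as extra "observations" with zero measurement noise.

import numpy as np

def augment_observation(y_t, Z_t, H_t, R_star_t, r_star_t):
    k = R_star_t.shape[0]               # number of restrictions at time t
    p = H_t.shape[0]
    y_aug = np.concatenate([y_t, r_star_t])
    Z_aug = np.vstack([Z_t, R_star_t])
    H_aug = np.block([
        [H_t,               np.zeros((p, k))],
        [np.zeros((k, p)),  np.zeros((k, k))],   # exact restrictions: no noise
    ])
    return y_aug, Z_aug, H_aug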


6.6 The algorithms of SsfPack

6.6.1 INTRODUCTION

SsfPack is a suite of C routines for carrying out computations involving the statistical analysis of univariate and multivariate models in state space form. SsfPack allows for a range of different state space forms, from simple time-invariant models to complicated time-varying models. Functions are specified which put standard models such as ARIMA and spline models in state space form. Routines are available for filtering, smoothing and simulation smoothing. Ready-to-use functions are provided for standard tasks such as likelihood evaluation, forecasting and signal extraction. The headers of these routines are documented by Koopman, Shephard and Doornik (1999). SsfPack can be easily used for implementing, fitting and analysing Gaussian and non-Gaussian models relevant to many areas of


time series analysis. In future versions, the exact initial Kalman filter and smoother and the square root filter will become part of SsfPack. Routines for dealing with multivariate models after conversion to univariate models will also be provided. A Gaussian illustration is given in §6.6.3. A further discussion of SsfPack in the context of non-Gaussian models is given in §14.6.

6.6.2 THE SSFPACK FUNCTIONS

A list of SsfPack functions is given below; they are grouped into functions which put specific univariate models into state space form, functions which perform the basic filtering and smoothing operations, and functions which execute specific important tasks for state space analysis such as likelihood evaluation. The first column contains the function names, the second column gives the reference to the section number of the SsfPack documentation of Koopman et al. (1999) and the third column describes the function with references to equation or section numbers in this book.

Models in state space form
AddSsfReg        §3.3        adds regression effects (3.27) to state space.
GetSsfArma       §3.1        puts ARMA model (3.15) in state space.
GetSsfReg        §3.3        puts regression model (3.27) in state space.
GetSsfSpline     §3.4        puts cubic spline model (3.43) in state space.
GetSsfStsm       §3.2        puts structural time series model of §3.2.1 in state space.
SsfCombine       §6.2        combines system matrices of two models.
SsfCombineSym    §6.2        combines symmetric system matrices of two models.

General state space algorithms
KalmanFil        §4.3        provides output of the Kalman filter in §4.2.2.
KalmanSmo        §4.4        provides output of the basic smoothing algorithm in §4.3.3.
SimSmoDraw       §4.5        provides a simulated sample such as (4.77).
SimSmoWgt        §4.5        provides output of the simulation smoother (4.75).

Ready-to-use functions
SsfCondDens      §4.6        provides mean or a draw from the conditional density (4.56).
SsfLik           §5.1        provides log-likelihood function (7.3).
SsfLikConc       §5.1        provides concentrated log-likelihood function (2.46).
SsfLikSco        §5.1        provides score vector information (7.15).
SsfMomentEst     §5.2, §5.3  provides output from prediction, forecasting and smoothing.
SsfRecursion     §4.2        provides output of the state space recursion (3.1).


These SsfPack functions are documented in Koopman et al. (1999). We will not discuss the functions further, but an example of Ox code which utilises the link with the SsfPack library is given below.

6.6.3 ILLUSTRATION: SPLINE SMOOTHING


In the Ox code below we consider the continuous spline smoothing problem which aims at minimising (3.43) for a given value of $\lambda$. The aim of the program is to fit a spline through the Nile time series of Chapter 2 (see §2.2.2 for details). To illustrate that the SsfPack functions can deal with missing observations, we have treated two parts of the data set as missing. The continuous spline model is easily put in state space form using the function GetSsfSpline. The smoothing parameter $\lambda$ is chosen to take the value 250 (the function requires the input of $\lambda^{-1} = 0.004$). We need to compute an estimator for the unknown scalar value of $\sigma^2$ in (3.44), which can be obtained using the function SsfLik. After rescaling, the estimated spline function using filtering (ST_FIL) and smoothing (ST_SMO) is computed using the function SsfMomentEst. The output is presented in Figure 6.1 and shows the filtered and smoothed estimates of the spline function. The two diagrams illustrate the point that filtering can be interpreted as extrapolation and that, when observations are missing, smoothing is in fact interpolation.


Fig. 6.1. Output of Ox program spline.ox: (i) Nile data with filtered estimate of spline function with 95% confidence interval; (ii) Nile data with smoothed estimate of spline function with 95% confidence interval.


spline.ox

#include <oxstd.h>
#include <oxdraw.h>
#include <oxfloat.h>
#include <packages/ssfpack/ssfpack.h>

main()
{
    decl mphi, momega, msigma, myt, mfil, msmo, cm, dlik, dvar;

    myt = loadmat("Nile.dat")';
    myt[][1890-1871:1900-1871] = M_NAN;          // set 1890..1900 to missing
    myt[][1950-1871:1960-1871] = M_NAN;          // set 1950..1960 to missing

    GetSsfSpline(0.004, <>, &mphi, &momega, &msigma);  // SSF for spline
    SsfLik(&dlik, &dvar, myt, mphi, momega);           // need dvar
    cm = columns(mphi);                                // dimension of state
    momega *= dvar;                                    // set correct scale of Omega
    SsfMomentEst(ST_FIL, &mfil, myt, mphi, momega);
    SsfMomentEst(ST_SMO, &msmo, myt, mphi, momega);

    // NB: first filtered estimator does not exist
    DrawTMatrix(0, myt, {"Nile"}, 1871, 1, 1);
    DrawTMatrix(0, mfil[cm][1:], {"Pred +/- 2SE"}, 1872, 1, 1, 0, 3);
    DrawZ(sqrt(mfil[2*cm+1][1:]), "", ZMODE_BAND, 2.0, 14);
    DrawTMatrix(1, myt, {"Nile"}, 1871, 1, 1);
    DrawTMatrix(1, msmo[cm][], {"Smooth +/- 2SE"}, 1871, 1, 1, 0, 3);
    DrawZ(sqrt(msmo[2*cm+1][]), "", ZMODE_BAND, 2.0, 14);
    ShowDrawWindow();
}


7 Maximum likelihood estimation

7.1 Introduction


In virtually all applications in practical work the models depend on unknown parameters. In this chapter we consider the estimation of these parameters by maximum likelihood. For the linear Gaussian model we shall show that the likelihood can be calculated by a routine application of the Kalman filter, even when the initial state vector is fully or partially diffuse. We go on to consider how the loglikelihood can be maximised by means of iterative numerical procedures. An important part in this process is played by the score vector and we show how this is calculated, both for the case where the initial state vector has a known distribution and for the diffuse case. A useful device for maximisation of the loglikelihood in some cases, particularly in the early stages of maximisation, is the EM algorithm; we give details of this for the linear Gaussian model. We go on to consider biases in estimates due to errors in parameter estimation. The chapter ends with a discussion of some questions of goodness-of-fit and diagnostic checks.

7.2 Likelihood evaluation

7.2.1 LOGLIKELIHOOD WHEN INITIAL CONDITIONS ARE KNOWN

We first assume that the initial state vector $\alpha_1$ has density $N(a_1, P_1)$, where $a_1$ and $P_1$ are known. The likelihood is

$L(y) = p(y_1, \ldots, y_n) = p(y_1) \prod_{t=2}^n p(y_t|Y_{t-1}),$

where $Y_t = \{y_1, \ldots, y_t\}$. In practice we generally work with the loglikelihood

$\log L(y) = \sum_{t=1}^n \log p(y_t|Y_{t-1}), \qquad (7.1)$

where $p(y_1|Y_0) = p(y_1)$. For model (3.1), $E(y_t|Y_{t-1}) = Z_t a_t$. Putting $v_t = y_t - Z_t a_t$ and $F_t = \mathrm{Var}(y_t|Y_{t-1})$, and substituting $N(Z_t a_t, F_t)$ for $p(y_t|Y_{t-1})$ in (7.1), we obtain

$\log L(y) = -\frac{np}{2}\log 2\pi - \frac{1}{2}\sum_{t=1}^n \left(\log|F_t| + v_t' F_t^{-1} v_t\right). \qquad (7.2)$


The quantities $v_t$ and $F_t$ are calculated routinely by the Kalman filter (4.13), so $\log L(y)$ is easily computed from the Kalman filter output. We assume that $F_t$ is nonsingular for $t = 1, \ldots, n$; if this condition is not satisfied initially, it is usually possible to redefine the model so that it is satisfied. The representation (7.2) of the loglikelihood was first given by Schweppe (1965). Harvey (1989, §3.4) refers to it as the prediction error decomposition.
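As a concrete illustration of (7.1)-(7.2), the following Python sketch (ours, not the book's; it assumes a time-invariant model and a known, non-diffuse initial distribution) evaluates the loglikelihood with the Kalman filter recursions of §4.2.

import numpy as np

def kalman_loglik(y, Z, H, T, R, Q, a1, P1):
    # y has shape (n, p); Z, H, T, R, Q are time-invariant system matrices
    n, p = y.shape
    a, P = a1.copy(), P1.copy()
    loglik = -0.5 * n * p * np.log(2.0 * np.pi)
    for t in range(n):
        v = y[t] - Z @ a                       # innovation v_t
        F = Z @ P @ Z.T + H                    # innovation variance F_t
        Finv_v = np.linalg.solve(F, v)
        loglik -= 0.5 * (np.linalg.slogdet(F)[1] + v @ Finv_v)
        K = T @ P @ Z.T @ np.linalg.inv(F)     # Kalman gain K_t
        a = T @ a + K @ v                      # prediction a_{t+1}
        P = T @ P @ (T - K @ Z).T + R @ Q @ R.T
    return loglik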

7.2.2 DIFFUSE LOGLIKELIHOOD

We now consider the case where some elements of $\alpha_1$ are diffuse. As in §5.1, we assume that $\alpha_1 = a + A\delta + R_0\eta_0$, where $a$ is a known constant vector, $\delta \sim N(0, \kappa I_q)$, $\eta_0 \sim N(0, Q_0)$ and $A'R_0 = 0$, giving $\alpha_1 \sim N(a_1, P_1)$ where $P_1 = \kappa P_\infty + P_*$ and $\kappa \to \infty$. From (5.6) and (5.7),

$F_t = \kappa F_{\infty,t} + F_{*,t} + O(\kappa^{-1}), \qquad \text{with } F_{\infty,t} = Z_t P_{\infty,t} Z_t',$

where, by the definition of $d$, $P_{\infty,t} \neq 0$ for $t = 1, \ldots, d$. The number of diffuse elements in $\alpha_1$ is $q$, which is the dimensionality of the vector $\delta$. Thus the loglikelihood (7.2) will contain a term $-\frac{q}{2}\log 2\pi\kappa$, so $\log L(y)$ will not converge as $\kappa \to \infty$. Following de Jong (1991), we therefore define the diffuse loglikelihood as

$\log L_d(y) = \lim_{\kappa\to\infty}\left[\log L(y) + \frac{q}{2}\log\kappa\right],$

and we work with $\log L_d(y)$ in place of $\log L(y)$ for estimation of unknown parameters in the diffuse case. Similar definitions of the diffuse loglikelihood function have been adopted by Harvey and Phillips (1979) and Ansley and Kohn (1986). As in §5.2, and for the same reasons, we assume that $F_{\infty,t}$ is positive definite or is a zero matrix. We also assume that $q$ is a multiple of $p$. This covers the important special case of univariate series and is generally satisfied in practice for multivariate series; if not, the series can be dealt with as if it were univariate, as in §6.4.

Suppose first that $F_{\infty,t}$ is positive definite and therefore has rank $p$. From (5.8) and (5.10) we have, for $t = 1, \ldots, d$,

$F_t^{-1} = \kappa^{-1}F_{\infty,t}^{-1} + O(\kappa^{-2}).$

It follows that

$-\log|F_t| = \log|F_t^{-1}| = \log|\kappa^{-1}F_{\infty,t}^{-1} + O(\kappa^{-2})| = -p\log\kappa + \log|F_{\infty,t}^{-1} + O(\kappa^{-1})|,$

and

$\lim_{\kappa\to\infty}\left(-\log|F_t| + p\log\kappa\right) = \log|F_{\infty,t}^{-1}| = -\log|F_{\infty,t}|.$


Moreover,

$\lim_{\kappa\to\infty} v_t'F_t^{-1}v_t = \lim_{\kappa\to\infty}\left[v_t^{(0)} + \kappa^{-1}v_t^{(1)} + O(\kappa^{-2})\right]'\left[\kappa^{-1}F_{\infty,t}^{-1} + O(\kappa^{-2})\right]\left[v_t^{(0)} + \kappa^{-1}v_t^{(1)} + O(\kappa^{-2})\right] = 0,$

for $t = 1, \ldots, d$, where $v_t^{(0)}$ and $v_t^{(1)}$ are defined in §5.2.1. When $F_{\infty,t} = 0$, it follows from §5.2.1 that $F_t = F_{*,t} + O(\kappa^{-1})$ and $F_t^{-1} = F_{*,t}^{-1} + O(\kappa^{-1})$. Consequently,

$\lim_{\kappa\to\infty}\left(-\log|F_t|\right) = -\log|F_{*,t}| \qquad \text{and} \qquad \lim_{\kappa\to\infty} v_t'F_t^{-1}v_t = v_t^{(0)\prime}F_{*,t}^{-1}v_t^{(0)}.$

Putting these results together, we obtain the diffuse loglikelihood as

$\log L_d(y) = -\frac{np}{2}\log 2\pi - \frac{1}{2}\sum_{t=1}^d w_t - \frac{1}{2}\sum_{t=d+1}^n \left(\log|F_t| + v_t'F_t^{-1}v_t\right), \qquad (7.3)$

where

$w_t = \begin{cases}\log|F_{\infty,t}|, & \text{if } F_{\infty,t} \text{ is positive definite},\\ \log|F_{*,t}| + v_t^{(0)\prime}F_{*,t}^{-1}v_t^{(0)}, & \text{if } F_{\infty,t} = 0,\end{cases}$

for $t = 1, \ldots, d$. The expression (7.3) for the diffuse loglikelihood is given by Koopman (1997).

7.2.3 DIFFUSE LOGLIKELIHOOD EVALUATED VIA AUGMENTED KALMAN FILTER

In the notation of §5.7.3, the joint density of $\delta$ and $y$ for given $\kappa$ is

$p(\delta, y) = p(\delta)p(y|\delta) = (2\pi)^{-(np+q)/2}\kappa^{-q/2}\prod_{t=1}^n |F_{\delta,t}|^{-1/2}\exp\left\{-\frac{1}{2}\left(\kappa^{-1}\delta'\delta + s_{a,n} + 2b_n'\delta + \delta'S_{A,n}\delta\right)\right\}, \qquad (7.4)$

where $v_{\delta,t}$ is defined in (5.35), $b_n$ and $S_{A,n}$ are defined in (5.40), and $s_{a,n} = \sum_{t=1}^n v_{a,t}'F_{\delta,t}^{-1}v_{a,t}$. From (5.41) we have $\hat\delta_n = E(\delta|Y_n) = -(S_{A,n} + \kappa^{-1}I_q)^{-1}b_n$. The exponent of (7.4) can now be rewritten as

$-\frac{1}{2}\left[s_{a,n} + (\delta - \hat\delta_n)'(S_{A,n} + \kappa^{-1}I_q)(\delta - \hat\delta_n) - \hat\delta_n'(S_{A,n} + \kappa^{-1}I_q)\hat\delta_n\right],$


as is easily verified. Integrating out $\delta$ from $p(\delta, y)$ we obtain the marginal density of $y$. After taking logs, the loglikelihood appears as

$\log L(y) = -\frac{np}{2}\log 2\pi - \frac{q}{2}\log\kappa - \frac{1}{2}\log|S_{A,n} + \kappa^{-1}I_q| - \frac{1}{2}\sum_{t=1}^n\log|F_{\delta,t}| - \frac{1}{2}\left[s_{a,n} - \hat\delta_n'(S_{A,n} + \kappa^{-1}I_q)\hat\delta_n\right]. \qquad (7.5)$

Adding $\frac{q}{2}\log\kappa$ and letting $\kappa\to\infty$ we obtain the diffuse loglikelihood

$\log L_d(y) = -\frac{np}{2}\log 2\pi - \frac{1}{2}\log|S_{A,n}| - \frac{1}{2}\sum_{t=1}^n\log|F_{\delta,t}| - \frac{1}{2}\left(s_{a,n} - \hat\delta_n'S_{A,n}\hat\delta_n\right), \qquad (7.6)$

which is due to de Jong (1991). In spite of its very different structure, (7.6) necessarily has the same numerical value as (7.3).

It is shown in §5.7.3 that the augmented Kalman filter can be collapsed at time point $t = d$. We could therefore form a partial likelihood based on $Y_d$ for fixed $\kappa$, integrate out $\delta$ and let $\kappa \to \infty$ as in (7.6). Subsequently we could add the contribution from the innovations $v_{d+1}, \ldots, v_n$ obtained from the collapsed Kalman filter. However, we will not give detailed formulae here.

These results were originally derived by de Jong (1988b) and de Jong (1991). The calculations required to compute (7.6) are more complicated than those required to compute (7.3). This is another reason why we ourselves prefer the initialisation technique of §5.2 to the augmentation device of §5.7. A further reason for preferring our computation of (7.3) is given in §7.3.5.

7.2.4 LIKELIHOOD WHEN ELEMENTS OF INITIAL STATE VECTOR ARE FIXED BUT UNKNOWN

Now let us consider the case where $\delta$ is treated as fixed. The density of $y$ given $\delta$ is, as in the previous section,

$p(y|\delta) = (2\pi)^{-np/2}\prod_{t=1}^n |F_{\delta,t}|^{-1/2}\exp\left\{-\frac{1}{2}\left(s_{a,n} + 2b_n'\delta + \delta'S_{A,n}\delta\right)\right\}. \qquad (7.7)$

The usual way to remove the influence of an unknown parameter vector such as $\delta$ from the likelihood is to estimate it by its maximum likelihood estimate, $\hat\delta_n$ in this case, and to employ the concentrated loglikelihood $\log L_c(y)$ obtained by substituting $\hat\delta_n = -S_{A,n}^{-1}b_n$ for $\delta$ in $p(y|\delta)$. This gives

$\log L_c(y) = -\frac{np}{2}\log 2\pi - \frac{1}{2}\sum_{t=1}^n\log|F_{\delta,t}| - \frac{1}{2}\left(s_{a,n} - \hat\delta_n'S_{A,n}\hat\delta_n\right). \qquad (7.8)$


Comparing (7.6) and (7.8) we see that the only difference between them is the presence in (7.6) of the term $-\frac{1}{2}\log|S_{A,n}|$. The relation between (7.6) and (7.8) was demonstrated by de Jong (1988b) using a different argument.

Harvey and Shephard (1990) argue that parameter estimation should preferably be based on the loglikelihood function (7.3), for which the initial vector $\delta$ is treated as diffuse, not fixed. They have shown for the local level model of Chapter 2 that maximising (7.8) with respect to the signal-to-noise ratio $q$ leads to a much higher probability of estimating $q$ to be zero than maximising (7.3). This is undesirable from a forecasting point of view since it results in no discounting of past observations.

7.3 Parameter estimation

7.3.1 INTRODUCTION

So far in this book we have assumed that the system matrices $Z_t$, $H_t$, $T_t$, $R_t$ and $Q_t$ in model (3.1) are known for $t = 1, \ldots, n$. We now consider the more usual situation in which at least some of the elements of these matrices depend on a vector $\psi$ of unknown parameters. We shall estimate $\psi$ by maximum likelihood. To make explicit the dependence of the loglikelihood on $\psi$ we write $\log L(y|\psi)$, $\log L_d(y|\psi)$ and $\log L_c(y|\psi)$. In the diffuse case we shall take it for granted that, for models of interest, estimates of $\psi$ obtained by maximising $\log L(y|\psi)$ for fixed $\kappa$ converge to the estimates obtained by maximising the diffuse loglikelihood $\log L_d(y|\psi)$ as $\kappa \to \infty$.

7.3.2 NUMERICAL MAXIMISATION ALGORITHMS

A wide range of numerical search algorithms is available for maximising the loglikelihood. Many of these are based on Newton's method, which solves the equation

$\partial_1(\psi) = \frac{\partial \log L(y|\psi)}{\partial\psi} = 0, \qquad (7.9)$

using the first-order Taylor series

$\partial_1(\psi) \approx \partial_1(\tilde\psi) + \partial_2(\tilde\psi)(\psi - \tilde\psi), \qquad (7.10)$

for some trial value $\tilde\psi$, where

$\partial_1(\tilde\psi) = \left.\frac{\partial\log L(y|\psi)}{\partial\psi}\right|_{\psi=\tilde\psi}, \qquad \partial_2(\tilde\psi) = \left.\frac{\partial^2\log L(y|\psi)}{\partial\psi\,\partial\psi'}\right|_{\psi=\tilde\psi}. \qquad (7.11)$

By equating (7.10) to zero we obtain a revised value $\bar\psi$ from the expression

$\bar\psi = \tilde\psi - \partial_2(\tilde\psi)^{-1}\partial_1(\tilde\psi).$


This process is repeated until it converges or until a switch is made to another optimisation method. If the Hessian matrix $\partial_2(\psi)$ is negative definite for all $\psi$, the loglikelihood is said to be concave and a unique maximum of the likelihood exists. The gradient $\partial_1(\tilde\psi)$ determines the direction of the step taken towards the optimum and the Hessian modifies the size of the step. It is possible to overstep the maximum in the direction determined by the vector $-\partial_2(\tilde\psi)^{-1}\partial_1(\tilde\psi)$, and therefore it is common practice to include a line search along the gradient vector within the optimisation process. We obtain the algorithm

$\bar\psi = \tilde\psi - s\,\partial_2(\tilde\psi)^{-1}\partial_1(\tilde\psi),$

where various methods are available to find the optimum value for $s$, which is usually found to be between 0 and 1. In practice it is often computationally demanding or impossible to compute $\partial_1(\psi)$ and $\partial_2(\psi)$ analytically. Numerical evaluation of $\partial_1(\psi)$ is usually feasible. A variety of computational devices is available to approximate $\partial_2(\psi)$ in order to avoid computing it analytically or numerically. For example, the STAMP package of Koopman et al. (2000) and the Ox matrix programming system of Doornik (1998) both use the so-called BFGS (Broyden-Fletcher-Goldfarb-Shanno) method, which approximates the Hessian matrix by a device in which, at each new value of $\psi$, a new approximate inverse Hessian matrix is obtained via a recursion based on the change in the gradient and the change in $\psi$ between successive trial values. The BFGS method ensures that the approximate Hessian matrix remains negative definite. The details and derivations of Newton's method of optimisation, and of the BFGS method in particular, can be found, for example, in Fletcher (1987).

Model parameters are sometimes constrained. For example, the parameters in the local level model (2.3) must satisfy the constraints $\sigma_\varepsilon^2 \geq 0$ and $\sigma_\eta^2 \geq 0$ with $\sigma_\varepsilon^2 + \sigma_\eta^2 > 0$. However, the introduction of constraints such as these within the numerical procedure is inconvenient, and it is preferable that the maximisation is performed with respect to quantities which are unconstrained. For this example we therefore make the transformations $\psi_\varepsilon = \frac{1}{2}\log\sigma_\varepsilon^2$ and $\psi_\eta = \frac{1}{2}\log\sigma_\eta^2$, where $-\infty < \psi_\varepsilon, \psi_\eta < \infty$, thus converting the problem to one of unconstrained maximisation. The parameter vector is $\psi = (\psi_\varepsilon, \psi_\eta)'$. Similarly, if we have a parameter $x$ which is restricted to the range $[-a, a]$, where $a$ is positive, we can


make a transformation to $\psi_x$, with $-\infty < \psi_x < \infty$, for which $x$ is a smooth monotone function of $\psi_x$ taking values in $(-a, a)$.
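A minimal Python sketch of these reparametrisations (ours, not the book's): variances are handled on the half-log scale, and for the bounded parameter we use $a\tanh(\psi_x)$ as one illustrative choice of monotone mapping, which is our assumption rather than the book's specific function.

import numpy as np

def to_unconstrained(sigma2):
    return 0.5 * np.log(sigma2)       # psi = (1/2) log sigma^2

def to_variance(psi):
    return np.exp(2.0 * psi)          # sigma^2 = exp(2 psi) > 0

def to_bounded(psi_x, a):
    return a * np.tanh(psi_x)         # maps (-inf, inf) onto (-a, a)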

7.3.3 THE SCORE VECTOR

We now consider details of the calculation of the gradient or score vector

$\frac{\partial\log L(y|\psi)}{\partial\psi}.$

As indicated in the last section, this vector is important in numerical maximisation since it specifies the direction in the parameter space along which a search should be made.

We begin with the case where the initial vector $\alpha_1$ has the distribution $\alpha_1 \sim N(a_1, P_1)$, where $a_1$ and $P_1$ are known. Let $p(\alpha, y|\psi)$ be the joint density of $\alpha$ and $y$, let $p(\alpha|y, \psi)$ be the conditional density of $\alpha$ given $y$, and let $p(y|\psi)$ be the marginal density of $y$ for given $\psi$. We now evaluate the score vector $\partial\log L(y|\psi)/\partial\psi = \partial\log p(y|\psi)/\partial\psi$ at the trial value $\tilde\psi$. We have

$\log p(y|\psi) = \log p(\alpha, y|\psi) - \log p(\alpha|y, \psi).$

Let $\tilde E$ denote expectation with respect to the density $p(\alpha|y, \tilde\psi)$. Since $p(y|\psi)$ does not depend on $\alpha$, taking $\tilde E$ of both sides gives

$\log p(y|\psi) = \tilde E[\log p(\alpha, y|\psi)] - \tilde E[\log p(\alpha|y, \psi)].$

To obtain the score vector at $\tilde\psi$, we differentiate both sides with respect to $\psi$ and put $\psi = \tilde\psi$. Assuming that differentiation under the integral sign is legitimate,

$\tilde E\left[\left.\frac{\partial\log p(\alpha|y,\psi)}{\partial\psi}\right|_{\psi=\tilde\psi}\right] = \int \frac{1}{p(\alpha|y,\tilde\psi)}\left.\frac{\partial p(\alpha|y,\psi)}{\partial\psi}\right|_{\psi=\tilde\psi} p(\alpha|y,\tilde\psi)\,d\alpha = \left.\frac{\partial}{\partial\psi}\int p(\alpha|y,\psi)\,d\alpha\,\right|_{\psi=\tilde\psi} = 0.$

Thus

$\left.\frac{\partial\log p(y|\psi)}{\partial\psi}\right|_{\psi=\tilde\psi} = \tilde E\left[\left.\frac{\partial\log p(\alpha, y|\psi)}{\partial\psi}\right|_{\psi=\tilde\psi}\right].$

With the substitutions $\eta_t = R_t'(\alpha_{t+1} - T_t\alpha_t)$ and $\varepsilon_t = y_t - Z_t\alpha_t$, and putting $\alpha_1 - a_1 = \eta_0$ and $P_1 = Q_0$, we obtain


$\log p(\alpha, y|\psi) = \text{constant} - \frac{1}{2}\sum_{t=1}^n\left(\log|H_t| + \log|Q_{t-1}| + \varepsilon_t'H_t^{-1}\varepsilon_t + \eta_{t-1}'Q_{t-1}^{-1}\eta_{t-1}\right). \qquad (7.12)$

On taking the expectation $\tilde E$ and differentiating with respect to $\psi$, this gives the score vector at $\psi = \tilde\psi$,

$\left.\frac{\partial\log L(y|\psi)}{\partial\psi}\right|_{\psi=\tilde\psi} = -\frac{1}{2}\frac{\partial}{\partial\psi}\sum_{t=1}^n\Big(\log|H_t| + \log|Q_{t-1}| + \mathrm{tr}\big[\{\hat\varepsilon_t\hat\varepsilon_t' + \mathrm{Var}(\varepsilon_t|y)\}H_t^{-1}\big] + \mathrm{tr}\big[\{\hat\eta_{t-1}\hat\eta_{t-1}' + \mathrm{Var}(\eta_{t-1}|y)\}Q_{t-1}^{-1}\big]\Big)\bigg|_{\psi=\tilde\psi}, \qquad (7.13)$

where $\hat\varepsilon_t$, $\mathrm{Var}(\varepsilon_t|y)$ and $\mathrm{Var}(\eta_{t-1}|y)$ are obtained for $\psi = \tilde\psi$ as in §4.4. Only the terms in $H_t$ and $Q_t$ in (7.13) require differentiation with respect to $\psi$.

Since in practice $H_t$ and $Q_t$ are often simple functions of $\psi$, this means that the score vector is often easy to calculate, which can be a considerable advantage in the numerical maximisation of the loglikelihood. A similar technique can be developed for the system matrices $Z_t$ and $T_t$, but this requires more computations, which involve the state smoothing recursions. Koopman and Shephard (1992), to whom the result (7.13) is due, therefore conclude that the score values for elements of $\psi$ associated with the system matrices $Z_t$ and $T_t$ are better evaluated numerically than analytically.

We now consider the diffuse case. In §5.1 we specified the initial state vector $\alpha_1$ as

$\alpha_1 = a + A\delta + R_0\eta_0, \qquad \delta \sim N(0, \kappa I_q), \qquad \eta_0 \sim N(0, Q_0),$

where $Q_0$ is nonsingular. Equation (7.12) is still valid except that $\alpha_1 - a_1 = \eta_0$ is now replaced by $\alpha_1 - a = A\delta + R_0\eta_0$ and $P_1 = \kappa P_\infty + P_*$ where $P_* = R_0Q_0R_0'$. Thus for finite $\kappa$ the term

$-\frac{1}{2}\frac{\partial}{\partial\psi}\left(q\log\kappa + \kappa^{-1}\mathrm{tr}\{\hat\delta\hat\delta' + \mathrm{Var}(\delta|y)\}\right)$

must be included in (7.13). Defining

$\frac{\partial\log L_d(y|\psi)}{\partial\psi} = \lim_{\kappa\to\infty}\frac{\partial}{\partial\psi}\left[\log L(y|\psi) + \frac{q}{2}\log\kappa\right],$

analogously to the definition of $\log L_d(y)$ in §7.2.2, and letting $\kappa \to \infty$, we have

$\left.\frac{\partial\log L_d(y|\psi)}{\partial\psi}\right|_{\psi=\tilde\psi} = \left.\frac{\partial\log L(y|\psi)}{\partial\psi}\right|_{\psi=\tilde\psi}, \qquad (7.14)$

which is given by (7.13). In the event that $\alpha_1$ consists only of diffuse elements, so that the vector $\eta_0$ is null, the terms in $Q_0$ disappear from (7.13).


As an example, consider the local level model (2.3) with $\eta_t$ replaced by $\xi_t$, for which

$\psi = \begin{pmatrix}\psi_\varepsilon\\ \psi_\xi\end{pmatrix} = \begin{pmatrix}\frac{1}{2}\log\sigma_\varepsilon^2\\ \frac{1}{2}\log\sigma_\xi^2\end{pmatrix},$

with a diffuse initialisation for $\alpha_1$. We have, on substituting $y_t - \alpha_t = \varepsilon_t$ and $\alpha_{t+1} - \alpha_t = \xi_t$,

$\log p(\alpha, y|\psi) = -\frac{2n-1}{2}\log 2\pi - \frac{n}{2}\log\sigma_\varepsilon^2 - \frac{n-1}{2}\log\sigma_\xi^2 - \frac{1}{2\sigma_\varepsilon^2}\sum_{t=1}^n (y_t - \alpha_t)^2 - \frac{1}{2\sigma_\xi^2}\sum_{t=2}^n (\alpha_t - \alpha_{t-1})^2,$

and

$\tilde E[\log p(\alpha, y|\psi)] = -\frac{2n-1}{2}\log 2\pi - \frac{n}{2}\log\sigma_\varepsilon^2 - \frac{n-1}{2}\log\sigma_\xi^2 - \frac{1}{2\sigma_\varepsilon^2}\sum_{t=1}^n\{\hat\varepsilon_t^2 + \mathrm{Var}(\varepsilon_t|y)\} - \frac{1}{2\sigma_\xi^2}\sum_{t=2}^n\{\hat\xi_{t-1}^2 + \mathrm{Var}(\xi_{t-1}|y)\},$

where the conditional means and variances for $\varepsilon_t$ and $\xi_t$ are obtained from the Kalman filter and disturbance smoother with $\sigma_\varepsilon^2$ and $\sigma_\xi^2$ implied by $\psi = \tilde\psi$. To obtain the score vector we differentiate both sides with respect to $\psi$, noting that, with $\psi_\varepsilon = \frac{1}{2}\log\sigma_\varepsilon^2$,

$-\frac{1}{2}\frac{\partial}{\partial\psi_\varepsilon}\left[\log\sigma_\varepsilon^2 + \frac{1}{\sigma_\varepsilon^2}\{\hat\varepsilon_t^2 + \mathrm{Var}(\varepsilon_t|y)\}\right] = -1 + \frac{1}{\sigma_\varepsilon^2}\{\hat\varepsilon_t^2 + \mathrm{Var}(\varepsilon_t|y)\}.$

The terms $\hat\varepsilon_t$ and $\mathrm{Var}(\varepsilon_t|y)$ do not vary with $\psi$ since they have been calculated on the assumption that $\psi = \tilde\psi$. We obtain

$\frac{\partial\log L_d(y|\psi)}{\partial\psi_\varepsilon} = -n + \frac{1}{\sigma_\varepsilon^2}\sum_{t=1}^n\{\hat\varepsilon_t^2 + \mathrm{Var}(\varepsilon_t|y)\}.$

In a similar way we have

$\frac{\partial\log L_d(y|\psi)}{\partial\psi_\xi} = -(n-1) + \frac{1}{\sigma_\xi^2}\sum_{t=2}^n\{\hat\xi_{t-1}^2 + \mathrm{Var}(\xi_{t-1}|y)\}.$


The score vector for $\psi$ of the local level model evaluated at $\psi = \tilde\psi$ is therefore

$\left.\frac{\partial\log L_d(y|\psi)}{\partial\psi}\right|_{\psi=\tilde\psi} = \begin{pmatrix}\tilde\sigma_\varepsilon^2\sum_{t=1}^n (u_t^2 - D_t)\\[4pt] \tilde\sigma_\xi^2\sum_{t=2}^n (r_{t-1}^2 - N_{t-1})\end{pmatrix},$

with $\tilde\sigma_\varepsilon^2$ and $\tilde\sigma_\xi^2$ from $\tilde\psi$. This result follows since, from §§2.5.1 and 2.5.2, $\hat\varepsilon_t = \sigma_\varepsilon^2 u_t$, $\mathrm{Var}(\varepsilon_t|y) = \sigma_\varepsilon^2 - \sigma_\varepsilon^4 D_t$, $\hat\xi_t = \sigma_\xi^2 r_t$ and $\mathrm{Var}(\xi_t|y) = \sigma_\xi^2 - \sigma_\xi^4 N_t$.

It is very satisfactory that after so much algebra we obtain such a simple expression for the score vector, which can be computed efficiently using the disturbance smoothing equations of §4.4. We can compute the score vector for the diffuse case efficiently because it is shown in §5.4 that no extra computing is required for disturbance smoothing when dealing with a diffuse initial state vector. Finally, score vector elements associated with variances or variance matrices in more complicated models, such as multivariate structural time series models, continue to have similar relatively simple expressions. Koopman and Shephard (1992) give for these models the score vector for parameters in $H_t$, $R_t$ and $Q_t$ as the expression

$\frac{\partial\log L_d(y|\psi)}{\partial\psi_i} = \frac{1}{2}\sum_{t=1}^n \mathrm{tr}\left[(u_tu_t' - D_t)\frac{\partial H_t}{\partial\psi_i} + (r_tr_t' - N_t)\frac{\partial R_tQ_tR_t'}{\partial\psi_i}\right],$

where $u_t$, $D_t$, $r_t$ and $N_t$ are evaluated by the Kalman filter and smoother as discussed in §§4.4 and 5.4.

7.3.4 THE EM ALGORITHM

The EM algorithm is a well-known tool for iterative maximum likelihood estimation which for many state space models has a particularly neat form. The earlier EM methods for the state space model were developed by Shumway and Stoffer (1982) and Watson and Engle (1983). The EM algorithm can be used either entirely instead of, or in place of the early stages of, direct numerical maximisation of the loglikelihood. It consists of an E-step (expectation) and an M-step (maximisation), of which the former involves the evaluation of the conditional expectation $\tilde E[\log p(\alpha, y|\psi)]$ and the latter maximises this expectation with respect to the elements of $\psi$. The details of estimating unknown elements in $H_t$ and $Q_t$ are given by Koopman (1993); they are close to those required for the evaluation of the score function. Taking first the case of $a_1$ and $P_1$ known and starting with (7.12), we evaluate $\tilde E[\log p(\alpha, y|\psi)]$ and, as in (7.13), we obtain

$\frac{\partial}{\partial\psi}\tilde E[\log p(\alpha, y|\psi)] = -\frac{1}{2}\frac{\partial}{\partial\psi}\sum_{t=1}^n\Big(\log|H_t| + \log|Q_{t-1}| + \mathrm{tr}\big[\{\hat\varepsilon_t\hat\varepsilon_t' + \mathrm{Var}(\varepsilon_t|y)\}H_t^{-1}\big] + \mathrm{tr}\big[\{\hat\eta_{t-1}\hat\eta_{t-1}' + \mathrm{Var}(\eta_{t-1}|y)\}Q_{t-1}^{-1}\big]\Big), \qquad (7.16)$


where $\hat\varepsilon_t$, $\hat\eta_{t-1}$, $\mathrm{Var}(\varepsilon_t|y)$ and $\mathrm{Var}(\eta_{t-1}|y)$ are computed assuming $\psi = \tilde\psi$, while $H_t$ and $Q_{t-1}$ retain their original dependence on $\psi$. The equations obtained by setting (7.16) equal to zero are then solved for the elements of $\psi$ to obtain a revised estimate of $\psi$. This is taken as the new trial value of $\psi$ and the process is repeated either until adequate convergence is achieved or until a switch is made to numerical maximisation of $\log L(y|\psi)$. The latter option is often used since, although the EM algorithm usually converges fairly rapidly in the early stages, its rate of convergence near the maximum is frequently substantially slower than that of numerical maximisation; see Watson and Engle (1983) and Harvey and Peters (1990) for discussion of this point. As for the score vector in the previous section, when $\alpha_1$ is diffuse we merely redefine $\eta_0$ and $Q_0$ in such a way that they are consistent with the initial state vector model $\alpha_1 = a + A\delta + R_0\eta_0$, where $\delta \sim N(0, \kappa I_q)$ and $\eta_0 \sim N(0, Q_0)$, and we ignore the part associated with $\delta$. When $\alpha_1$ consists only of diffuse elements, the term in $Q_0^{-1}$ disappears from (7.16).

To illustrate, we apply the EM algorithm to the local level model as in the previous section, but now we take $\psi = (\sigma_\varepsilon^2, \sigma_\xi^2)'$. The E-step involves the Kalman filter and disturbance smoother to obtain $\hat\varepsilon_t$, $\hat\xi_{t-1}$, $\mathrm{Var}(\varepsilon_t|y)$ and $\mathrm{Var}(\xi_{t-1}|y)$ of (7.16) given $\psi = \tilde\psi$. The M-step solves for $\sigma_\varepsilon^2$ and $\sigma_\xi^2$ by equating (7.16) to zero. For example, in a similar way as in the previous section we have

$\frac{\partial}{\partial\sigma_\varepsilon^2}\tilde E[\log p(\alpha, y|\psi)] = -\frac{n}{2\sigma_\varepsilon^2} + \frac{1}{2\sigma_\varepsilon^4}\sum_{t=1}^n\{\hat\varepsilon_t^2 + \mathrm{Var}(\varepsilon_t|y)\} = 0,$

and similarly for the term in $\sigma_\xi^2$. New trial values for $\sigma_\varepsilon^2$ and $\sigma_\xi^2$ are therefore obtained from

$\hat\sigma_\varepsilon^2 = \frac{1}{n}\sum_{t=1}^n\{\hat\varepsilon_t^2 + \mathrm{Var}(\varepsilon_t|y)\} = \tilde\sigma_\varepsilon^2 + \frac{\tilde\sigma_\varepsilon^4}{n}\sum_{t=1}^n (u_t^2 - D_t),$

$\hat\sigma_\xi^2 = \frac{1}{n-1}\sum_{t=2}^n\{\hat\xi_{t-1}^2 + \mathrm{Var}(\xi_{t-1}|y)\} = \tilde\sigma_\xi^2 + \frac{\tilde\sigma_\xi^4}{n-1}\sum_{t=2}^n (r_{t-1}^2 - N_{t-1}),$

since $\hat\varepsilon_t = \tilde\sigma_\varepsilon^2 u_t$, $\mathrm{Var}(\varepsilon_t|y) = \tilde\sigma_\varepsilon^2 - \tilde\sigma_\varepsilon^4 D_t$, $\hat\xi_t = \tilde\sigma_\xi^2 r_t$ and $\mathrm{Var}(\xi_t|y) = \tilde\sigma_\xi^2 - \tilde\sigma_\xi^4 N_t$. The disturbance smoothing values $u_t$, $D_t$, $r_t$ and $N_t$ are based on $\tilde\sigma_\varepsilon^2$ and $\tilde\sigma_\xi^2$. The new values $\hat\sigma_\varepsilon^2$ and $\hat\sigma_\xi^2$ replace $\tilde\sigma_\varepsilon^2$ and $\tilde\sigma_\xi^2$ and the procedure is repeated until either convergence has been attained or a switch is made to numerical optimisation. Similar elegant results are obtained for more general time series models where unknown parameters occur only in the $H_t$ and $Q_t$ matrices.
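To make the E- and M-steps concrete, here is a minimal Python sketch (ours, not the book's code; SsfPack is not used) of the EM iteration for the local level model. The diffuse initialisation is approximated crudely by a large initial variance rather than the exact treatment of Chapter 5, and the function name is our own.

import numpy as np

def em_local_level(y, s2_eps=1.0, s2_xi=1.0, iters=50):
    n = len(y)
    for _ in range(iters):
        # Kalman filter for the local level model (Z = T = 1)
        a, P = y[0], 1e7                    # crude approximation to diffuse start
        v = np.zeros(n); F = np.zeros(n); K = np.zeros(n)
        for t in range(n):
            v[t] = y[t] - a
            F[t] = P + s2_eps
            K[t] = P / F[t]
            a = a + K[t] * v[t]
            P = P * (1.0 - K[t]) + s2_xi
        # Backward disturbance smoother: u_t, D_t, r_t, N_t (cf. section 4.4)
        r = np.zeros(n + 1); N = np.zeros(n + 1)
        u = np.zeros(n); D = np.zeros(n)
        for t in range(n - 1, -1, -1):
            u[t] = v[t] / F[t] - K[t] * r[t + 1]
            D[t] = 1.0 / F[t] + K[t] ** 2 * N[t + 1]
            L = 1.0 - K[t]
            r[t] = v[t] / F[t] + L * r[t + 1]
            N[t] = 1.0 / F[t] + L ** 2 * N[t + 1]
        # M-step updates as in the text
        s2_eps = s2_eps + s2_eps ** 2 * np.mean(u ** 2 - D)
        s2_xi = s2_xi + s2_xi ** 2 * np.mean(r[1:n] ** 2 - N[1:n])
    return s2_eps, s2_xi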


7.3.5 PARAMETER ESTIMATION WHEN DEALING WITH DIFFUSE INITIAL CONDITIONS

It was shown in previous sections that only minor adjustments are required for parameter estimation when dealing with a diffuse initial state vector. The diffuse loglikelihood requires either the exact initial Kalman filter or the augmented Kalman filter. In both cases the diffuse loglikelihood is calculated in much the same way as for the non-diffuse case. No real new complications arise when computing the score vector or when estimating parameters via the EM algorithm. There is, however, a compelling argument for using the exact initial Kalman filter of §5.2 rather than the augmented Kalman filter of §5.7 for the estimation of parameters. For most practical models, the matrix $P_{\infty,t}$ and its associated matrices $F_{\infty,t}$, $M_{\infty,t}$ and $K_t^{(0)}$ do not depend on the parameter vector $\psi$. This may be surprising but, for example, by studying the illustration given in §5.6.1 for the local linear trend model we see that the matrices $P_{\infty,t}$, $F_{\infty,t}$, $M_{\infty,t}$, $K_t^{(0)} = T_tM_{\infty,t}F_{\infty,t}^{-1}$ and $L_t^{(0)} = T_t - K_t^{(0)}Z_t$ do not depend on $\sigma_\varepsilon^2$, $\sigma_\xi^2$ or $\sigma_\zeta^2$. On the other hand, all the matrices reported in §5.7.4, which deals with the augmentation approach to the same example, depend on $q_\xi = \sigma_\xi^2/\sigma_\varepsilon^2$ and $q_\zeta = \sigma_\zeta^2/\sigma_\varepsilon^2$. Therefore, every time the parameter vector $\psi$ changes during the estimation process we need to recalculate the augmented part of the augmented Kalman filter, whereas we do not have to recalculate the matrices related to $P_{\infty,t}$ for the exact initial Kalman filter.

First we consider the case where only the system matrices $H_t$, $R_t$ and $Q_t$ depend on the parameter vector $\psi$. The matrices $F_{\infty,t} = Z_tP_{\infty,t}Z_t'$ and $M_{\infty,t} = P_{\infty,t}Z_t'$ do not depend on $\psi$ since the update equation for $P_{\infty,t}$ is given by

$P_{\infty,t+1} = T_tP_{\infty,t}(T_t - K_t^{(0)}Z_t)',$

where $K_t^{(0)} = T_tM_{\infty,t}F_{\infty,t}^{-1}$ and $P_{\infty,1} = AA'$, for $t = 1, \ldots, d$. Thus for all quantities related to $P_{\infty,t}$ the parameter vector $\psi$ does not play a role. The same holds for computing $a_{t+1}$ for $t = 1, \ldots, d$ since

$a_{t+1} = T_ta_t + K_t^{(0)}v_t,$

where $v_t = y_t - Z_ta_t$ and $a_1 = a$. Here again no quantity depends on $\psi$. The update equation

$P_{*,t+1} = T_tP_{*,t}(T_t - K_t^{(0)}Z_t)' - F_{\infty,t}K_t^{(0)}K_t^{(1)\prime} + R_tQ_tR_t',$

where $K_t^{(1)} = T_tM_{*,t}F_{\infty,t}^{-1} - K_t^{(0)}F_{*,t}F_{\infty,t}^{-1}$, does depend on $\psi$. Thus we compute the vector $v_t$ and the matrices $K_t^{(0)}$ and $F_{\infty,t}$ for $t = 1, \ldots, d$ once, at the start of parameter estimation, and we store them. When the Kalman filter is called again for likelihood evaluation we do not need to re-compute these quantities; we only need to update the matrix $P_{*,t}$ for $t = 1, \ldots, d$. This implies considerable computational savings during parameter estimation using the EM algorithm or when maximising the diffuse loglikelihood using a variant of Newton's method.

For the case where $\psi$ also affects the system matrices $Z_t$ and $T_t$ we achieve the same computational savings for all the nonstationary models we have considered in this book. The matrices $Z_t$ and $T_t$ may depend on $\psi$, but the parts of $Z_t$ and $T_t$ which affect the computation of $P_{\infty,t}$, $F_{\infty,t}$, $M_{\infty,t}$ and $K_t^{(0)}$ for $t = 1, \ldots, d$ do not depend on $\psi$. It should be noted that the rows and columns of $P_{\infty,t}$ associated with elements of $\alpha_1$ which are not elements of $\delta$ are zero for $t = 1, \ldots, d$. Thus the columns of $Z_t$ and the rows and columns of $T_t$ related to stationary elements of the state vector do not influence the matrices $P_{\infty,t}$, $F_{\infty,t}$, $M_{\infty,t}$ and $K_t^{(0)}$. In the nonstationary time series models of Chapter 3, such as the ARIMA and structural time series models, all elements of $\psi$ which affect $Z_t$ and $T_t$ relate only to the stationary part of the model, for $t = 1, \ldots, d$. The parts of $Z_t$ and $T_t$ associated with $\delta$ only have values equal to zero and unity. For example, the ARIMA(2,1,1) model of §3.3 shows that $\psi = (\phi_1, \phi_2, \theta_1, \sigma^2)'$ does not influence the elements of $Z_t$ and $T_t$ associated with the first element of the state vector.

7.3.6 LARGE SAMPLE DISTRIBUTION OF MAXIMUM LIKELIHOOD ESTIMATES


It can be shown that, under reasonable assumptions about the stability of the model over time, the distribution of $\hat\psi$ for large $n$ is approximately

$\hat\psi \sim N(\psi, \Omega), \qquad (7.17)$

where

$\Omega = \left[-E\left(\frac{\partial^2\log L}{\partial\psi\,\partial\psi'}\right)\right]^{-1}. \qquad (7.18)$

This distribution has the same form as the large sample distribution of maximum likelihood estimators from samples of independent and identically distributed observations. The result (7.17) is discussed by Hamilton (1994) in §5.8 for general time series models and in §13.4 for the special case of linear Gaussian state space models. In his discussion, Hamilton gives a number of references to theoretical work on the subject.

7.3.7 EFFECT OF ERRORS IN PARAMETER ESTIMATION

Up to this point we have followed standard classical statistical methodology by first deriving estimates of quantities of interest on the assumption that the parameter vector $\psi$ is known, and then replacing $\psi$ in the resulting formulae by its maximum likelihood estimate $\hat\psi$. We now consider the estimation of the biases in the estimates that might arise from following this procedure. Since an analytical solution in the general case seems intractable, we employ simulation. We deal with cases where $\mathrm{Var}(\hat\psi) = O(n^{-1})$, so the biases are also of order $n^{-1}$.

The technique that we propose is simple. Pretend that $\hat\psi$ is the true value of $\psi$. From (7.17) and (7.18) we know that the approximate large sample distribution of the maximum likelihood estimate of $\psi$, given that the true value is $\hat\psi$, is $N(\hat\psi, \hat\Omega)$, where $\hat\Omega$ is $\Omega$ given by (7.18) evaluated at $\psi = \hat\psi$. Draw a simulation sample of $N$


independent values $\psi^{(i)}$ from $N(\hat\psi, \hat\Omega)$, $i = 1, \ldots, N$. Denote by $e$ a scalar, vector or matrix quantity that we wish to estimate from the sample $y$ and let

$\hat e = E(e|y)\big|_{\psi=\hat\psi}$

be the estimate of $e$ obtained by the methods of Chapter 4. For simplicity we focus on smoothed values, though an essentially identical technique holds for filtered estimates. Let $\hat e^{(i)}$ be the estimate of $e$ obtained by taking $\psi = \psi^{(i)}$, for $i = 1, \ldots, N$. Then estimate the bias by

$\hat B_e = \frac{1}{N}\sum_{i=1}^N \hat e^{(i)} - \hat e. \qquad (7.19)$

The accuracy of $\hat B_e$ can be improved significantly by the use of antithetic variables, which are discussed in detail in §11.9.3 in connection with the use of importance sampling in the treatment of non-Gaussian models. For example, we can balance the sample of $\psi^{(i)}$'s for location by taking only $N/2$ draws from $N(\hat\psi, \hat\Omega)$, where $N$ is even, and defining $\psi^{(i+N/2)} = 2\hat\psi - \psi^{(i)}$ for $i = 1, \ldots, N/2$. Since $\psi^{(i+N/2)} - \hat\psi = -(\psi^{(i)} - \hat\psi)$ and the distribution of $\psi^{(i)}$ is symmetric about $\hat\psi$, the distribution of $\psi^{(i+N/2)}$ is the same as that of $\psi^{(i)}$. In this way we not only reduce the number of draws required from the $N(\hat\psi, \hat\Omega)$ distribution by a half, but we also introduce negative correlation between the $\psi^{(i)}$'s, which will reduce sample variation, and we have arranged the simulation sample so that the sample mean $(\psi^{(1)} + \cdots + \psi^{(N)})/N$ is equal to the population mean $\hat\psi$. We can balance the sample for scale by a technique described in §11.9.3, using the fact that $(\psi^{(i)} - \hat\psi)'\hat\Omega^{-1}(\psi^{(i)} - \hat\psi) \sim \chi^2_w$, where $w$ is the dimensionality of $\psi$; however, our expectation is that in most cases balancing for location only would be sufficient. The mean square error matrix due to simulation can be estimated in a manner similar to that described in §12.5.6.
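A minimal Python sketch of (7.19) with location-balanced antithetic draws (ours, not the book's). Here smooth_estimate(y, psi) is a hypothetical stand-in for whichever Chapter 4 quantity $\hat e$ is being studied; psi_hat and Omega_hat come from the maximum likelihood fit.

import numpy as np

def estimate_bias(y, psi_hat, Omega_hat, smooth_estimate, N=200, seed=0):
    rng = np.random.default_rng(seed)
    half = rng.multivariate_normal(psi_hat, Omega_hat, size=N // 2)
    draws = np.vstack([half, 2.0 * psi_hat - half])        # antithetic pairs
    e_hat = smooth_estimate(y, psi_hat)
    e_bar = np.mean([smooth_estimate(y, psi) for psi in draws], axis=0)
    return e_bar - e_hat                                   # estimate of B_e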

Of course, we are not proposing that bias should be estimated as a standard part of routine time series analysis. We have included a description of this technique in order to assist workers in investigating the degree of bias in particular types of problems; in most practical cases we would expect the bias to be small enough to be neglected.

Simulation for correcting for bias due to errors in parameter estimates has previously been suggested by Hamilton (1994, §13.7). His methods differ from ours in two respects. First, he uses simulation to estimate the entire function under study, which in his case is a mean square error matrix, rather than just the bias, as in our treatment. Secondly, he has omitted a term of the same order as the bias, namely $n^{-1}$, as demonstrated for the local level model that we considered in Chapter 2 by Quenneville and Singh (1997). This latter paper corrects Hamilton's method and provides interesting analytical and simulation results, but it only gives details for the local level model. A different method, based on parametric bootstrap samples, has been proposed by Pfeffermann and Tiller (2000).

7.4 Goodness of fit

Given the estimated parameter vector $\hat\psi$, we may want to measure the fit of the model under consideration for the given time series. Goodness-of-fit measures for time series models are usually associated with forecast errors. A basic measure of fit is the forecast variance $F_t$, which can be compared with the forecast variance of a naive model. For example, when we analyse a time series with time-varying trend and seasonal, we could compare the forecast variance of this model with the forecast variance of the time series after adjusting it with fixed trend and seasonal.

When dealing with competing models, we may want to compare the loglikelihood value of a particular fitted model, denoted by $\log L(y|\hat\psi)$ or $\log L_d(y|\hat\psi)$, with the corresponding loglikelihood values of competing models. Generally speaking, the larger the number of parameters that a model contains, the larger its loglikelihood. In order to have a fair comparison between models with different numbers of parameters, information criteria such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) are used. For a univariate series they are given by

$\mathrm{AIC} = n^{-1}[-2\log L(y|\hat\psi) + 2w], \qquad \mathrm{BIC} = n^{-1}[-2\log L(y|\hat\psi) + w\log n],$

and with diffuse initialisation they are given by

$\mathrm{AIC} = n^{-1}[-2\log L_d(y|\hat\psi) + 2(q+w)], \qquad \mathrm{BIC} = n^{-1}[-2\log L_d(y|\hat\psi) + (q+w)\log n],$

where $w$ is the dimension of $\psi$. Models with more parameters or more nonstationary elements receive a larger penalty. More details can be found in Harvey (1989, §2.6.3 and §5.5.6). In general, a model with a smaller value of AIC or BIC is preferred.
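A short Python sketch of the criteria above (ours; loglik is the maximised (diffuse) loglikelihood, w the number of parameters and q the number of diffuse initial elements, with q = 0 in the non-diffuse case).

import numpy as np

def aic_bic(loglik, n, w, q=0):
    aic = (-2.0 * loglik + 2.0 * (q + w)) / n
    bic = (-2.0 * loglik + (q + w) * np.log(n)) / n
    return aic, bic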

7.5 Diagnostic checking

The diagnostic statistics and graphics discussed in §2.12 for the local level model (2.3) can be used in the same way for all univariate state space models. The basic diagnostics of §2.12.1 for normality, heteroscedasticity and serial correlation are


applied to the one-step forecast errors defined in (4.4) after standardisation by dividing by the standard deviation $F_t^{1/2}$. In the case of multivariate models, we can consider the standardised individual elements of the vector $v_t$,

but the individual elements are correlated since the matrix $F_t$ is not diagonal. The innovations can be transformed such that they are uncorrelated:

$v_t^s = B_tv_t, \qquad F_t^{-1} = B_t'B_t.$

It is then appropriate to apply the basic diagnostics to the individual elements of $v_t^s$. Another possibility is to apply multivariate generalisations of the diagnostic tests to the full vector $v_t$. A more detailed discussion of diagnostic checking can be found in Harvey (1989, §§5.4 & 8.4) and throughout the STAMP manual of Koopman et al. (2000).
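A minimal Python sketch of this standardisation (ours, not the book's): taking $B_t$ as the inverse of the lower Cholesky factor of $F_t$ gives $B_t'B_t = F_t^{-1}$, so the elements of $v_t^s$ have unit variance and are uncorrelated.

import numpy as np

def standardise_innovation(v_t, F_t):
    L = np.linalg.cholesky(F_t)       # F_t = L L'
    return np.linalg.solve(L, v_t)    # v^s_t = L^{-1} v_t, Var(v^s_t) = I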

Auxiliary residuals for the general state space model are constructed by

$\varepsilon_t^s = B_t^\varepsilon\hat\varepsilon_t, \qquad [\mathrm{Var}(\hat\varepsilon_t)]^{-1} = B_t^{\varepsilon\prime}B_t^\varepsilon,$
$\eta_t^s = B_t^\eta\hat\eta_t, \qquad [\mathrm{Var}(\hat\eta_t)]^{-1} = B_t^{\eta\prime}B_t^\eta,$

for $t = 1, \ldots, n$. The auxiliary residual $\varepsilon_t^s$ can be used to identify outliers in the $y_t$ series. Large absolute values in $\varepsilon_t^s$ indicate that the behaviour of the observed value cannot be appropriately represented by the model under consideration. The usefulness of $\eta_t^s$ depends on the interpretation of the state elements in $\alpha_t$ implied by the design of the system matrices $T_t$, $R_t$ and $Q_t$. The way these auxiliary residuals can be exploited depends on their interpretation. For the local level model considered in §§7.3.3 and 7.3.4 it is clear that the state $\alpha_t$ is the time-varying level and $\xi_t$ is the change of the level for time $t+1$. It follows that structural breaks in the series $y_t$ can be identified by detecting large absolute values in the series for $\xi_t^s$. In the same way, for the univariate local linear trend model (3.2), the second element of $\eta_t^s$ can be exploited to detect slope changes in the series $y_t$. Harvey and Koopman (1992) have formalised these ideas further for the structural time series models of §3.2 and have constructed some diagnostic normality tests for the auxiliary residuals.

It is argued by de Jong and Penzer (1998) that such auxiliary residuals can be computed for any element of the state vector and that they can be considered as t-tests for the hypotheses

$H_0: (\alpha_{t+1} - T_t\alpha_t - R_t\eta_t)_i = 0,$

the appropriate large-sample statistic for which is computed by

$r_{it}^s = r_{it}\big/\sqrt{N_{ii,t}},$

for $i = 1, \ldots, m$, where $(\cdot)_i$ is the $i$th element of the vector within brackets, $r_{it}$ is the $i$th element of the vector $r_t$ and $N_{ii,t}$ is the $(i,i)$th element of the matrix $N_t$; the recursions for evaluating $r_t$ and $N_t$ are given in §4.4.4. The same applies to the


measurement equation, for which t-test statistics for the hypotheses

$H_0: (y_t - Z_t\alpha_t - \varepsilon_t)_i = 0,$

are computed by

$e_{it}^s = e_{it}\big/\sqrt{D_{ii,t}},$

for $i = 1, \ldots, p$, where the equations for computing $e_t$ and $D_t$ are given in §4.4.4. These diagnostics can be regarded as model specification tests. Large values in $\eta_{it}^s$ and $e_{it}^s$, for some values of $i$ and $t$, may reflect departures from the overall model and may indicate specific adjustments to the model.


8 Bayesian analysis

8.1 Introduction

In this chapter we consider the analysis of observations generated by the linear Gaussian state space model from a Bayesian point of view. There have been many disputes in statistics about whether the classical or the Bayesian is the more valid mode of inference. Our approach to such questions is eclectic and pragmatic: we regard both approaches as valid in appropriate circumstances. If software were equally available for both standpoints, a practical worker might wish to try them both. Hence, although our starting point in both parts of the book is the classical approach, we shall include treatment of Bayesian methodology and, in separate publications, we will provide software which will enable practical Bayesian analyses to be performed. For discussions of this eclectic approach to statistical inference see Durbin (1987) and Durbin (1988).

8.2 Posterior analysis of state vector

8.2.1 POSTERIOR ANALYSIS CONDITIONAL ON PARAMETER VECTOR

For the linear Gaussian state space model (3.1), with parameter vector $\psi$ specified, the posterior analysis of the model from a Bayesian standpoint is straightforward. The Kalman filter and smoother provide the posterior means, variances and covariances of the state vector $\alpha_t$ given the data. Since the model is Gaussian, posterior densities are normal, so these together with quantiles can be estimated easily from standard properties of the normal distribution.

8.2.2 POSTERIOR ANALYSIS WHEN PARAMETER VECTOR IS UNKNOWN

We now consider Bayesian analysis for the usual situation where the parameter vector $\psi$ is not fixed and known; instead, we treat $\psi$ as a random vector with a known prior density $p(\psi)$, which to begin with we take as a proper prior, leaving the non-informative case until later. For discussions of the choice of prior see Gelman, Carlin, Stern and Rubin (1995) and Bernardo and Smith (1994). The problems we shall consider amount essentially to the estimation of the posterior mean

$\bar x = E[x(\alpha)|y] \qquad (8.1)$


of a function $x(\alpha)$ of the stacked state vector $\alpha$. Let

$\bar x(\psi) = E[x(\alpha)|\psi, y]$

be the conditional expectation of $x(\alpha)$ given $\psi$ and $y$. We shall restrict consideration in this chapter to those functions $x(\alpha)$ for which $\bar x(\psi)$ can readily be calculated by the Kalman filter and smoother, leaving aside the treatment of other functions until Chapter 13. This restricted class of functions still, however, includes many important cases, such as the posterior mean and variance matrix of $\alpha$ and the forecasts $E(y_{n+j}|y)$ for $j = 1, 2, \ldots$. The treatment of initialisation in Chapter 5 permits elements of the initial state vector to have either proper or diffuse prior densities.

Obviously,

$\bar x = \int \bar x(\psi)p(\psi|y)\,d\psi. \qquad (8.2)$

By Bayes theorem, $p(\psi|y) = Kp(\psi)p(y|\psi)$, where $K$ is the normalising constant defined by

$K^{-1} = \int p(\psi)p(y|\psi)\,d\psi. \qquad (8.3)$

We therefore have

$\bar x = \frac{\int \bar x(\psi)p(\psi)p(y|\psi)\,d\psi}{\int p(\psi)p(y|\psi)\,d\psi}. \qquad (8.4)$

Now $p(y|\psi)$ is the likelihood, which for the linear Gaussian model is easily calculated by the Kalman filter, as shown in §7.2. We have already restricted $x(\alpha)$ to cases which can be calculated by Kalman filtering and smoothing operations. Thus the integrands of both integrals in (8.4) can be computed relatively easily by standard numerical methods in cases where the dimensionality of $\psi$ is not large.

The main technique we shall employ in this book for Bayesian analysis, however, is simulation. As will be seen in Chapters 11 and 13, simulation produces methods that are not only effective for linear Gaussian models of any reasonable size but which also provide accurate Bayesian analyses for models that are non-Gaussian or nonlinear. The starting point for our simulation approach is importance sampling, for a discussion of which see Ripley (1987, pp. 122-123) or Geweke (1989). We shall outline here the basic idea of importance sampling applied to the linear Gaussian model but will defer completion of the numerical treatment until Chapter 13, when full details of the simulation techniques are worked out.

In principle, simulation could be applied directly to formula (8.4) by drawing a random sample $\psi^{(1)}, \ldots, \psi^{(N)}$ from the distribution with density $p(\psi)$ and then estimating the numerator and denominator of (8.4) by the sample means of $\bar x(\psi^{(i)})p(y|\psi^{(i)})$ and $p(y|\psi^{(i)})$ respectively. However, this estimator is inefficient in cases of practical interest. By using importance sampling we are able to achieve greater efficiency in a way that we now describe.


Suppose that simulation from the density $p(\psi|y)$ is impractical. Let $g(\psi|y)$ be a density which is as close as possible to $p(\psi|y)$ while at the same time permitting simulation. We call this an importance density. From (8.2) we have

$\bar x = \int \bar x(\psi)\frac{p(\psi|y)}{g(\psi|y)}g(\psi|y)\,d\psi = E_g\left[\bar x(\psi)\frac{p(\psi|y)}{g(\psi|y)}\right] = K\,E_g[\bar x(\psi)z_g(\psi, y)] \qquad (8.5)$

by Bayes theorem, where $E_g$ denotes expectation with respect to the density $g(\psi|y)$,

$z_g(\psi, y) = \frac{p(\psi)p(y|\psi)}{g(\psi|y)}, \qquad (8.6)$

and $K$ is a normalising constant. By replacing $\bar x(\psi)$ by 1 in (8.5) we obtain

$K^{-1} = E_g[z_g(\psi, y)],$

so the posterior mean of $x(\alpha)$ can be expressed as

$\bar x = \frac{E_g[\bar x(\psi)z_g(\psi, y)]}{E_g[z_g(\psi, y)]}. \qquad (8.7)$

This expression is evaluated by simulation. We choose a random sample of $N$ draws of $\psi$, denoted by $\psi^{(i)}$, from the importance density $g(\psi|y)$ and estimate $\bar x$ by

$\hat x = \frac{\sum_{i=1}^N \bar x(\psi^{(i)})z_i}{\sum_{i=1}^N z_i}, \qquad (8.8)$

where

$z_i = \frac{p(\psi^{(i)})p(y|\psi^{(i)})}{g(\psi^{(i)}|y)}. \qquad (8.9)$

As an importance density for $p(\psi|y)$ we take its large sample normal approximation

$g(\psi|y) = N(\hat\psi, \hat\Omega),$

where $\hat\psi$ is the solution to the equation

$\frac{\partial\log p(\psi|y)}{\partial\psi} = \frac{\partial\log p(\psi)}{\partial\psi} + \frac{\partial\log p(y|\psi)}{\partial\psi} = 0, \qquad (8.10)$

and

$\hat\Omega^{-1} = -\frac{\partial^2\log p(\psi)}{\partial\psi\,\partial\psi'} - \frac{\partial^2\log p(y|\psi)}{\partial\psi\,\partial\psi'}. \qquad (8.11)$

For a discussion of this large sample approximation to $p(\psi|y)$ see Gelman et al. (1995, Chapter 4) and Bernardo and Smith (1994, §5.3). Since $p(y|\psi)$ can


easily be computed by the Kalman filter for $\psi = \psi^{(i)}$, $p(\psi)$ is given and $g(\psi|y)$ is Gaussian, the value of $z_i$ is easy to compute. The draws $\psi^{(i)}$ are independent and therefore $\hat x$ converges probabilistically to $\bar x$ as $N \to \infty$ under very general conditions.

The value $\hat\psi$ is computed iteratively by an obvious extension of the technique of maximum likelihood estimation discussed in Chapter 7, while the second derivatives can be calculated numerically. Once $\hat\psi$ and $\hat\Omega$ are computed, it is straightforward to generate samples from $g(\psi|y)$ by use of a standard normal random number generator. Where needed, efficiency can be improved by the use of antithetic variables, which we discuss in §11.9.3. For example, for each draw $\psi^{(i)}$ we could take another value $\check\psi^{(i)} = 2\hat\psi - \psi^{(i)}$, which is equiprobable with $\psi^{(i)}$. The use of $\psi^{(i)}$ and $\check\psi^{(i)}$ together introduces balance in the sample.

The posterior mean of the parameter vector $\psi$ is $\bar\psi = E(\psi|y)$. An estimate of $\bar\psi$ is obtained by putting $\bar x(\psi^{(i)}) = \psi^{(i)}$ in (8.8) and taking the estimate equal to $\hat x$. Similarly, an estimate $\hat V(\psi|y)$ of the posterior variance matrix $\mathrm{Var}(\psi|y)$ is obtained by putting $\bar x(\psi^{(i)}) = \psi^{(i)}\psi^{(i)\prime}$ in (8.8), taking $\hat S = \hat x$ and then taking $\hat V(\psi|y) = \hat S - \hat{\bar\psi}\hat{\bar\psi}'$.

To estimate the posterior distribution function of an element $\psi_1$ of $\psi$, which is not necessarily the first element of $\psi$, we introduce the indicator function $I_1(\psi_1^{(i)})$, which equals one if $\psi_1^{(i)} \leq \psi_1$ and zero otherwise, where $\psi_1^{(i)}$ is the value of $\psi_1$ in the $i$th simulated value of $\psi$ and $\psi_1$ is fixed. Then $F(\psi_1|y) = \Pr(\psi_1^{(i)} \leq \psi_1) = E[I_1(\psi_1^{(i)})|y]$ is the posterior distribution function of $\psi_1$. Putting $\bar x(\psi^{(i)}) = I_1(\psi_1^{(i)})$ in (8.8), we estimate $F(\psi_1|y)$ by $\hat F(\psi_1|y) = \hat x$. This is equivalent to taking $\hat F(\psi_1|y)$ as the sum of the values of $z_i$ for which $\psi_1^{(i)} \leq \psi_1$ divided by the sum of all values of $z_i$. Similarly, if $\mathcal{S}$ is the interval $(\psi_1 - \frac{1}{2}d, \psi_1 + \frac{1}{2}d)$, where $d$ is small and positive, then we can estimate the posterior density of $\psi_1$ by $\hat p(\psi_1|y) = d^{-1}S_\delta / \sum_i z_i$, where $S_\delta$ is the sum of the values of $z_i$ for which $\psi_1^{(i)} \in \mathcal{S}$.
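A minimal Python sketch of the estimator (8.8)-(8.9) with the Gaussian importance density and antithetic location balancing (ours, not the book's). The callables log_prior, loglik and xbar are hypothetical stand-ins: log_prior(psi) is $\log p(\psi)$, loglik(psi) is $\log p(y|\psi)$ from the Kalman filter, and xbar(psi) is the smoothed quantity $\bar x(\psi)$ of interest.

import numpy as np

def log_mvn_pdf(x, mean, cov):
    # log density of N(mean, cov) at x
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (mean.size * np.log(2.0 * np.pi) + logdet
                   + d @ np.linalg.solve(cov, d))

def importance_estimate(psi_hat, Omega_hat, log_prior, loglik, xbar,
                        N=1000, seed=0):
    rng = np.random.default_rng(seed)
    half = rng.multivariate_normal(psi_hat, Omega_hat, size=N // 2)
    draws = np.vstack([half, 2.0 * psi_hat - half])      # antithetic pairs
    # log z_i = log p(psi) + log p(y|psi) - log g(psi|y), cf. (8.9)
    logz = np.array([log_prior(p) + loglik(p)
                     - log_mvn_pdf(p, psi_hat, Omega_hat) for p in draws])
    z = np.exp(logz - logz.max())                        # stabilised weights
    vals = np.array([xbar(p) for p in draws])
    return np.sum(z * vals) / np.sum(z)                  # x-hat of (8.8)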

8.2.3 NON-INFORMATIVE PRIORS

For cases where a proper prior is not available we may wish to use a non-informative prior, in which we assume that the prior density is proportional to a specified function $p(\psi)$ in a domain of $\psi$ of interest even though the integral $\int p(\psi)\,d\psi$ does not exist. For a discussion of non-informative priors see, for example, Chapters 3 and 4 of Gelman et al. (1995). Where it exists, the posterior density is $p(\psi|y) = Kp(\psi)p(y|\psi)$ as in the proper prior case, so all the previous formulae apply without change. This is why we use the same symbol $p(\psi)$ for both cases even though in the non-informative case $p(\psi)$ is not a density. An important special case is the diffuse prior, for which $p(\psi) = 1$ for all $\psi$.

Our intention in this chapter has been to describe the basic ideas underlying Bayesian analysis of linear Gaussian state space models by means of simulation based on importance sampling. Practical implementation of the analysis will be dealt with in Chapter 13 and illustrated by an example after we have developed further techniques, including antithetic variables for state space models, which we consider in §11.9.3.


8.3 Markov chain Monte Carlo methods

Another approach to Bayesian analysis based on simulation is provided by the Markov chain Monte Carlo (MCMC) method, which has recently received a considerable amount of interest in the statistical and econometric literature on time series. Fruhwirth-Schnatter (1994) was the first to give a full Bayesian treatment of the linear Gaussian model using MCMC techniques. The proposed algorithms for simulation sample selection were later refined by de Jong and Shephard (1995). This work resulted in the simulation smoother, which we discussed in §4.7. We showed there how to generate random draws from the conditional densities $p(\varepsilon|y, \psi)$, $p(\eta|y, \psi)$ and $p(\alpha|y, \psi)$ for a given parameter vector $\psi$. Now we briefly discuss how this technique can be incorporated into a Bayesian MCMC analysis in which we treat the parameter vector as stochastic.

The basic idea is as follows. We evaluate the posterior mean of $x(\alpha)$ or of the parameter vector $\psi$ via simulation by choosing samples from an augmented joint density $p(\psi, \alpha|y)$. In the MCMC procedure, the sampling from this joint density is implemented as a Markov chain. After initialisation for $\psi$, say $\psi = \psi^{(0)}$, we repeatedly cycle through the two simulation steps:

(1) sample $\alpha^{(i)}$ from $p(\alpha|y, \psi^{(i-1)})$;
(2) sample $\psi^{(i)}$ from $p(\psi|y, \alpha^{(i)})$;

for $i = 1, 2, \ldots$. After a number of 'burn-in' iterations we are allowed to treat the samples from step (2) as being generated from the density $p(\psi|y)$. The attraction of this MCMC scheme is that sampling from the conditional densities is easier than sampling from the marginal density $p(\psi|y)$. The circumstances under which successive samples from the conditional densities $p(\alpha|y, \psi^{(i-1)})$ and $p(\psi|y, \alpha^{(i)})$ converge to samples from the joint density $p(\psi, \alpha|y)$ are considered in books on MCMC, for example Gamerman (1997). It is not straightforward to develop appropriate diagnostics which indicate whether convergence within the MCMC process has taken place, as is discussed, for example, in Gelman (1995).

There exist various implementations of the basic MCMC algorithm for the state space model. For example, Carlin, Polson and Stoffer (1992) propose to sample individual state vectors from $p(\alpha_t|y, \alpha^{(-t)}, \psi)$, where $\alpha^{(-t)}$ is equal to $\alpha$ excluding $\alpha_t$. It turns out that this approach to sampling is inefficient. It is argued by Fruhwirth-Schnatter (1994) that it is more efficient to sample all the state vectors directly from the density $p(\alpha|y, \psi)$; she provides the technical details of the implementation. de Jong and Shephard (1995) have developed this approach further by concentrating on the disturbance vectors $\varepsilon_t$ and $\eta_t$ instead of the state vector $\alpha_t$. The details of the resulting simulation smoother were given in §4.7.

Implementing the two steps of the MCMC is not as straightforward as suggested so far. Sampling from the density p(a|y, ijs) for a given f is done by using the simulation smoother of §4.7. Sampling from p(xjf\y,a) depends partly on the model for \Jr and is usually only possible up to proportionality. To sample under such circumstances, accept-reject algorithms have been developed; for example,


the Metropolis algorithm is often used for this purpose. Details and an excellent general review of these matters are given by Gilks, Richardson and Spiegelhalter (1996). Applications to state space models have been developed by Carter and Kohn (1994), Shephard (1994) and Gamerman (1998).

In the case of the structural time series models of §3.2 for which the parameter vector consists only of the variances of the disturbances associated with the components, the distribution of the parameter vector can be modelled such that sampling from p(ψ|y, α) in step (2) is relatively straightforward. For example, a model for a variance can be based on the inverse gamma distribution with logdensity

log p(σ²|c, s) = (c/2) log(s/2) − log Γ(c/2) − ((c + 2)/2) log σ² − s/(2σ²),  σ² > 0,

and p(σ²|c, s) = 0 for σ² < 0; see, for example, Poirier (1995, Table 3.3.1). We denote this density by σ² ~ IG(c/2, s/2), where c determines the shape and s determines the scale of the distribution. It has the convenient property that if we take this as the prior density of σ² and we take a sample u₁, …, uₙ of independent N(0, σ²) variables, the posterior density of σ² is

p(σ²|u₁, …, uₙ) = IG( (c + n)/2, {s + Σᵢ₌₁ⁿ uᵢ²}/2 );

for further details see, for example, Poirier (1995, Chapter 6). For the implementation of step (2) a sample value of σ² is drawn from this density. We can take uₜ as an element of εₜ or ηₜ obtained by the simulation smoother in step (1). Further details of this approach are given by Fruhwirth-Schnatter (1994) and Carter and Kohn (1994).
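As a concrete sketch, a draw from this posterior can be obtained from a gamma generator, using the fact that if X ~ Gamma(a, 1) then b/X ~ IG(a, b); the numerical values below are illustrative assumptions only.

import numpy as np

# Minimal sketch: one step (2) update for a single variance with prior
# IG(c/2, s/2), given disturbances u_1, ..., u_n from step (1).
def sample_variance(u, c, s, rng):
    """Draw sigma^2 from IG((c + n)/2, (s + sum u_i^2)/2)."""
    n = len(u)
    shape = (c + n) / 2.0
    scale = (s + np.sum(u**2)) / 2.0
    # If X ~ Gamma(shape, rate 1) then scale / X ~ IG(shape, scale).
    return scale / rng.gamma(shape)

rng = np.random.default_rng(0)
u = rng.normal(0.0, 0.1, size=192)                # illustrative disturbances
sigma2 = sample_variance(u, c=2.0, s=0.01, rng=rng)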

Our general view is that, at least for practitioners who are not simulation experts, the methods of §8.2 are more transparent and computationally more convenient than MCMC for the practical problems we consider in this book.



9 Illustrations of the use of the linear Gaussian model

9.1 Introduction

In this chapter we give some illustrations which show how the linear Gaussian model is used in practice. State space methods are usually employed for time series problems and most of our examples will be from this area, but we will also treat a smoothing problem which is not normally regarded as part of time series analysis and which we solve using cubic splines.

The first example is an analysis of road accident data to estimate the reduction in car drivers killed and seriously injured in the UK due to the introduction of a law requiring the wearing of seat belts. In the second example we consider a bivariate model in which we include data on numbers of front seat passengers killed and seriously injured and on numbers of rear seat passengers killed and seriously injured, and we estimate the effect that the inclusion of the second variable has on the accuracy of the estimation of the drop in the first variable. The third example shows how state space methods can be applied to Box-Jenkins ARMA models employed to model a series of users logged onto the Internet. In the fourth example we consider the state space solution to the spline smoothing of motorcycle acceleration data. The final example provides an approximate analysis based on the linear Gaussian model of a stochastic volatility series of the logged exchange rate between the British pound and the American dollar. The software we have used for most of the calculations is SsfPack and is described in §6.6.

9.2 Structural time series models

The study by Durbin and Harvey (1985) and Harvey and Durbin (1986) on the effect of the seat belt law on road accidents in Great Britain provides an illustration of the use of structural time series models for the treatment of problems in applied time series analysis. They analysed data sets which contained numbers of casualties in various categories of road accidents to provide an independent assessment on behalf of the Department of Transport of the effects of the British seat belt law on road casualties. Most series were analysed by means of linear Gaussian state space models. We concentrate here on monthly numbers of drivers, front seat passengers and rear seat passengers who were killed or seriously injured in road accidents in cars in Great Britain from January 1969 to December 1984. Data were transformed into logarithms since logged values fitted the model better. Data on the average number of kilometres travelled per car per month and the real price of petrol are included as possible explanatory variables. We start with a univariate analysis of the drivers series. In the next section we perform a bivariate analysis using the front and rear seat passengers.

Fig. 9.1. Monthly numbers (logged) of drivers who were killed or seriously injured (KSI) in road accidents in cars in Great Britain.

The log of the monthly numbers of car drivers killed or seriously injured is displayed in Figure 9.1. The graph shows a seasonal pattern which may be due to weather conditions and festive celebrations. The overall trend of the series is basically constant over the years with breaks in the mid-seventies, probably due to the oil crisis, and in February 1983 after the introduction of the seat belt law. The model that we shall consider initially is the basic structural time series model which is given by

yₜ = μₜ + γₜ + εₜ,

where μₜ is the local level component modelled as the random walk μₜ₊₁ = μₜ + ξₜ, γₜ is the trigonometric seasonal component (3.6) and (3.7), and εₜ is a disturbance term with mean zero and variance σ_ε². Note that for illustrative purposes we do not at this stage include an intervention component to measure the effect of the seat belt law.
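For readers working outside STAMP, a model of this form can also be fitted with the state space tools in the Python statsmodels library; the sketch below is only an approximate analogue of the analysis reported here, and the array y of logged monthly counts is an assumed input.

import statsmodels.api as sm

# Minimal sketch: local level plus a trigonometric seasonal of period 12,
# fitted by maximum likelihood; 'y' is an assumed array of the logged series.
model = sm.tsa.UnobservedComponents(y, level='local level',
                                    freq_seasonal=[{'period': 12}])
res = model.fit(disp=False)
print(res.summary())                  # estimated disturbance variances
level_smoothed = res.level.smoothed   # smoothed estimate of mu_t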

The model is estimated by maximum likelihood using the techniques described in Chapter 7. The iterative method of finding the estimates for σ_ε², σ_ξ² and σ_ω² is implemented in STAMP 6.0 based on the concentrated diffuse loglikelihood as discussed in §2.10.2; the estimation output is given below, where the first element of the parameter vector is φ₁ = 0.5 log q_ξ and the second element is φ₂ = 0.5 log q_ω, where q_ξ = σ_ξ²/σ_ε² and q_ω = σ_ω²/σ_ε². We present the results of the successive iterations in the parameter estimation process as displayed on the computer output of the STAMP package (version 6) of Koopman et al. (2000). Here 'it' indicates the iteration number and 'f' is the loglikelihood times 1/n.

MaxLik iterating...
it    f         parameter vector        score vector
0     2.26711   -0.63488   -4.2642     -0.00077507   -0.00073413
1     2.26715   -0.64650   -4.3431      7.9721e-005  -0.00030219
2     2.26716   -0.64772   -4.3937      7.3021e-005  -7.7059e-005
3     2.26716   -0.64742   -4.4106      9.2115e-006  -1.1057e-005
4     2.26716   -0.64738   -4.4135      3.6016e-008  -5.0568e-007
5     2.26716   -0.64739   -4.4136     -3.2285e-008  -5.5955e-009
6     2.26716   -0.64739   -4.4136     -3.2507e-008  -5.3291e-009

The initial estimates are determined by an ad hoc method which is described in detail by Koopman et al. (2000, §10.6.3.1). It amounts to some cycles of univariate optimisations with respect to one parameter, with the other parameters kept fixed at their current values, starting from arbitrary values for the univariate optimisations. Usually the initial parameter values are close to their final maximum likelihood estimates, as is the case in this illustration. Note that during estimation one parameter is concentrated out. The resulting parameter estimates are given below. The estimate for σ_ω² is very small and may be set equal to zero, implying a fixed seasonal component.

Estimated variances of disturbances
Component    drivers       (q-ratio)
Irr          0.0034160     ( 1.0000)
Lvl          0.00093585    ( 0.2740)
Sea          5.0109e-007   ( 0.0001)

The estimated components are displayed in Figure 9.2. We conclude that the seasonal effect hardly changes over time and therefore can be treated as fixed. The estimated level does pick up the underlying movements in the series and the estimated irregular does not cause much concern to us.

In Figure 9.3 the estimated level is displayed; the filtered estimator is based on only the past data, that is E(μₜ|y₁, …, yₜ₋₁), and the smoothed estimator is based on all the data, that is E(μₜ|y₁, …, yₙ). It can be seen that the filtered estimator lags the shocks in the series, as is to be expected since this estimator does not take account of current and future observations.


Fig. 9.4. (i) one-step ahead prediction residuals; (ii) auxiliary irregular residuals; (iii) auxiliary level residuals.

The model fit of this series and the basic diagnostics initially appear satisfactory. The standard output provided by STAMP is given by

Diagnostic summary report

Estimation sample is 69.1 - 84.12. (T = 192, n = 180).
Log-Likelihood is 435.295 (-2 LogL = -870.59).
Prediction error variance is 0.00586998

Summary statistics
             drivers
Std.Error    0.076616
N            4.6692
H(60)        1.0600
r(1)         0.038623
r(12)        0.014140
Q(12,10)     11.610

The definitions of the diagnostics can be found in §2.12. When we inspect the graphs of the residuals in Figure 9.4, however, in particular the auxiliary level residuals, we see a large negative value for February 1983. This suggests a need to incorporate an intervention variable to measure the level shift in February 1983. We have performed this analysis without inclusion of such a variable purely for illustrative purposes; obviously, in a real analysis the variable would be included since a drop in casualties was expected to result from the introduction of the seat belt law.

By introducing an intervention variable which equals one from February 1983 onwards and zero before, together with the price of petrol as a further explanatory variable, we re-estimate the model and obtain the regression output

Estimated coefficients of explanatory variables.
Variable     Coefficient   R.m.s.e.    t-value
petrol       -0.29140      0.098318    -2.9638   [ 0.0034]
Lvl 83.2     -0.23773      0.046317    -5.1328   [ 0.0000]

The estimated components, when the intervention variable and the regression effect due to the price of petrol are included, are displayed in Figure 9.5.

The estimated coefficient of the break in the level after January 1983 is −0.238, indicating a fall of 21%, that is, 1 − exp(−0.238) = 0.21, in car drivers killed and seriously injured after the introduction of the law. The t-value of −5.1 indicates that the break is highly significant. The coefficient of petrol price is also significant.

Fig. 9.5. Estimated components for model with intervention and regression effects: (i) level; (ii) seasonal; (iii) irregular.


9.3 Bivariate structural time series analysis

Multivariate structural time series models are introduced in §3.2.2. To illustrate state space methods for a multivariate model we analyse a bivariate monthly time series of front seat passengers and rear seat passengers killed and seriously injured in road accidents in Great Britain which were included in the assessment study by Harvey and Durbin (1986).

The graphs in Figure 9.6 indicate that the local level specification is appropriate for the trend component and that we need to include a seasonal component. We start by estimating a bivariate model with level, trigonometric seasonal and irregular components together with explanatory variables for the number of kilometres travelled and the real price of petrol. We estimate the model using only observations available before 1983, the year in which the seat belt law was introduced. The variance matrices of the three disturbance vectors are estimated by maximum likelihood in the way described in §7.3. The maximum likelihood estimates of the three variance matrices, associated with the level, seasonal and irregular components, respectively, are reported below, where the upper off-diagonal elements are transformed to correlations:

Irregular disturbance                Level disturbance
 0.0050306    0.67387                 0.00025301    0.93059
 0.0043489    0.0082792               0.00022533    0.00023173

Seasonal disturbance
 7.5475e-007  -1.0000
-1.6954e-007   3.8082e-008

Fig. 9.6. Front seat and rear seat passengers killed and seriously injured in road accidents in Great Britain.


The variance matrix of the seasonal component is small, which leads us to model the seasonal component as fixed and to re-estimate the remaining two variance matrices:

Irregular disturbance                Level disturbance
 0.0051672    0.66126                 0.00025411    0.92359
 0.0043261    0.0082832               0.00022490    0.00023334

The loglikelihood value of the estimated model is 761.85 with AIC equal to −4.344. The correlation between the two level disturbances is close to one. It may therefore be interesting to re-estimate the model with the restriction that the rank of the level variance matrix is one:

Irregular disturbance                Level disturbance
 0.0051840    0.63746                 0.00024909    1.0000
 0.0042717    0.0086623               0.00023132    0.00021483

The loglikelihood value of this estimated model is 758.34 with AIC equal to −4.329. A comparison of the two AICs shows only a marginal preference for the unrestricted model.

We now assess the effect of the introduction of the seat belt law as we have done for the drivers series using a univariate model in §9.2. We concentrate on the effect of the law on front seat passengers. We also have the rear seat series, which is highly correlated with the front seat series. However, the law did not apply to rear seat passengers, so their data should not be affected by the introduction of the law. Under such circumstances the rear seat series may be used as a control group, which may result in a more precise measure of the effect of the seat belt law on front seat passengers; for the reasoning behind this idea see the discussion by Harvey (1996), to whom this approach is due.

We consider the unrestricted bivariate model but with a level intervention for February 1983 added to both series. This model is estimated using the whole dataset, giving the parameter estimates:

Irregular disturbance                Level disturbance
 0.0054015    0.65404                 0.00024596    0.92292
 0.0044490    0.0085664               0.00022498    0.00023216

Estimates level intervention February 1983
             Coefficient    R.m.s.e.    t-value
front seat   -0.33704       0.049243    -6.8445   [ 0.0000]
rear seat     0.00089514    0.051814     0.017276 [ 0.9862]


From these results and the time series plot of casualties among rear seat passengers in Figure 9.6, it is clear that they are unaffected by the introduction of the seat belt law, as we expect. By removing the intervention effect from the rear seat equation of the model we obtain the estimation results:

Irregular disturbance                Level disturbance
 0.0054047    0.65395                 0.00024575    0.92041
 0.0044474    0.0085578               0.00021333    0.00021859

Estimates level intervention February 1983
             Coefficient    R.m.s.e.    t-value
front seat   -0.33797       0.028151    -12.005   [ 0.0000]

The almost two-fold decrease of the root mean squared error for the estimated intervention coefficient for the front seat series is remarkable. Enforcing the rank of the level variance matrix to be one, so that the levels are proportional to each other, also leads to a large increase of the t-value:

Irregular disturbance                Level disturbance
 0.0054446    0.63057                 0.00023128    1.0000
 0.0043819    0.0088694               0.00021988    0.00020903

Estimates level intervention February 1983
             Coefficient    R.m.s.e.    t-value
front seat   -0.33640       0.019711    -17.066   [ 0.0000]

The graphs of the estimated (non-seasonal) signals and the estimated levels for the last model are presented in Figure 9.7. The substantial drop of the underlying level in front seat passenger casualties at the introduction of the seat belt law is clearly visible.

9.4 Box-Jenkins analysis

In this section we show that the fitting of ARMA models, which is an important part of the Box-Jenkins methodology, can be done using state space methods. Moreover, we show that missing observations can be handled within the state space framework without problems, whereas this is difficult within the Box-Jenkins methodology; see the discussion in §3.5. Finally, since an important objective of the Box-Jenkins methodology is forecasting, we also present forecasts of the series under investigation. In this illustration we use the series which is analysed by Makridakis, Wheelwright and Hyndman (1998): the number of users logged on to an Internet server each minute over 100 minutes. The data are differenced in order to get them closer to stationarity and these 99 observations are presented in Figure 9.8(i).



Fig. 9.7. (i) front seat passengers and estimated signal (without seasonal); (ii) estimated front seat casualties level; (iii) rear seat passengers and estimated signal (without seasonal); (iv) estimated rear seat casualties level.

Fig. 9.8. (i) First difference of the number of users logged on to Internet server each minute; (ii) the same series with 14 observations omitted.


Table 9.1. AIC for different ARMA models.

        q     0           1           2           3           4           5
p  0                2.111       2.636       2.648       2.653       2.661
   1    2.673       2.608       2.628       2.629 (1)   2.642       2.658
   2    2.647       2.628       2.642       2.657       2.642 (1)   2.660 (4)
   3    2.606       2.626       2.645       2.662       2.660 (2)   2.681 (4)
   4    2.626       2.646 (8)   2.657       2.682       2.670 (1)   2.695 (1)
   5    2.645       2.665 (2)   2.654 (9)   2.673 (10)  2.662 (12)  2.727 (A)
The value between parentheses indicates the number of times the loglikelihood could not be evaluated during optimisation. The symbol A indicates that the maximisation process was automatically aborted due to numerical problems.

We have estimated a range of ARMA models (3.14) with different choices for p and q. They were estimated in state space form based on (3.17). Table 9.1 presents the Akaike information criterion (AIC), which is defined in §7.4, for these different ARMA models. We see that the ARMA models with (p, q) equal to (1, 1) and (3, 0) are optimal according to the AIC values. We prefer the ARMA(1, 1) model because it is more parsimonious. A similar table was produced for the same series by Makridakis et al. (1998) but the AIC statistic was computed differently. They concluded that the ARMA(3, 0) model was best.

We repeat the calculations for the same differenced series but now with 14 observations treated as missing (these are observation numbers 6, 16, 26, 36, 46, 56, 66, 72, 73, 74, 75, 76, 86, 96). The graph of the amended series is shown in Figure 9.8(ii). The reported AICs in Table 9.2 lead to the same conclusion as for the series without missing observations: the preferred model is ARMA(1, 1), although its case is less strong now. We also learn from this illustration that estimation of higher order ARMA models with missing observations leads to more numerical problems. A small code sketch illustrating the state space treatment of the missing values is given below, after Table 9.2.

Table 9.2. AIC for different ARMA models with missing observations.

        q     0           1           2           3           4           5
p  0                3.027       2.893       2.904       2.908       2.926
   1    2.891       2.855       2.877       2.892       2.899       2.922
   2    2.883       2.878       2.895 (6)   2.915       2.912       2.931
   3    2.856 (1)   2.880       2.909       2.924       2.918 (12)  2.940 (1)
   4    2.880       2.901       2.923       2.946       2.943       2.957 (2)
   5    2.901       2.923       2.877 (A)   2.897 (A)   2.956 (26)  2.979
The value between parentheses indicates the number of times the loglikelihood could not be evaluated during optimisation. The symbol A indicates that the maximisation process was automatically aborted due to numerical problems.
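As an illustrative sketch of why missing observations cause no difficulty here, the following Python fragment (using the statsmodels library rather than SsfPack) fits the ARMA(1,1) model in state space form with the 14 points set to NaN; 'users' is an assumed array holding the 99 differenced observations, and the Kalman filter simply skips the missing values when building the likelihood.

import numpy as np
import statsmodels.api as sm

# Minimal sketch: ARMA(1,1) in state space form with missing observations.
# The listed observation numbers are 1-based in the text, hence the shift.
y = np.asarray(users, dtype=float).copy()
missing = np.array([6, 16, 26, 36, 46, 56, 66, 72, 73, 74, 75, 76, 86, 96])
y[missing - 1] = np.nan                    # mark the 14 points as missing
res = sm.tsa.SARIMAX(y, order=(1, 0, 1)).fit(disp=False)
print(res.aic)                             # compare with Table 9.2
print(res.forecast(steps=20))              # out-of-sample forecasts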



Fig. 9.9. Internet series (solid blocks) with in-sample one-step forecasts and out-of-sample forecasts with 50% confidence interval.

Finally, we present in Figure 9.9 the in-sample one-step-ahead forecasts for the series with missing observations and the out-of-sample forecasts with 50% confidence intervals. It is one of the many advantages of state space modelling that it allows for missing observations without difficulty.

9.5 Spline smoothing

The connection between smoothing splines and the local linear trend model has been known for many years; see, for example, Wecker and Ansley (1983). In §3.11 we showed that the equivalence is with a local linear trend formulated in continuous time with the variance of the level disturbance equal to zero.

Consider a set of observations y₁, …, yₙ which are irregularly spaced and associated with an ordered series τ₁, …, τₙ. The variable τₜ can also be a measure of age, length or income, for example, as well as time. The discrete time model implied by the underlying continuous time model is the local linear trend model with


Var(ηₜ) = σ_ζ² δₜ³/3,  Var(ζₜ) = σ_ζ² δₜ,  Cov(ηₜ, ζₜ) = σ_ζ² δₜ²/2,   (9.1)

as shown in §3.10, where the distance variable δₜ = τₜ₊₁ − τₜ is the time between observation t and observation t + 1. We shall show how irregularly spaced data can be analysed using state space methods. With evenly spaced observations the δₜ's are set to one.
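A minimal sketch of the implied discrete time system matrices, under the reconstruction of (9.1) above with q standing for σ_ζ², is:

import numpy as np

# Minimal sketch: transition and disturbance variance matrices for one step
# of length delta in the continuous time spline model; the state is
# (level, slope)' and q = sigma_zeta^2.
def spline_system(delta, q):
    T = np.array([[1.0, delta],
                  [0.0, 1.0]])
    Q = q * np.array([[delta**3 / 3.0, delta**2 / 2.0],
                      [delta**2 / 2.0, delta]])
    return T, Q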



Fig. 9.10. Motorcycle acceleration data analysed by a cubic spline. (i) observations against time with spline and 95% confidence intervals; (ii) standardised irregular.

We consider 133 observations of acceleration against time (measured in milliseconds) for a simulated motorcycle accident. This data set was originally analysed by Silverman (1985) and is often used as an example of curve fitting techniques; see, for example, Hardle (1990) and Harvey and Koopman (2000). The observations are not equally spaced and at some time points there are multiple observations; see Figure 9.10. Cubic spline and kernel smoothing techniques depend on the choice of a smoothness parameter. This is usually determined by a technique called cross-validation. However, setting up a cubic spline as a state space model enables the smoothness parameter to be estimated by maximum likelihood and the spline to be computed by the Kalman filter and smoother. The model can easily be extended to include other unobserved components and explanatory variables, and it can be compared with alternative models using standard statistical criteria.

We follow here the analysis given by Harvey and Koopman (2000). The smoothing parameter λ = σ_ζ²/σ_ε² is estimated by maximum likelihood (assuming normally distributed disturbances) using the transformation λ = exp(ψ). The estimate of ψ is −3.59 with asymptotic standard error 0.22. This implies that the estimate of λ is 0.0275 with an asymmetric 95% confidence interval of 0.018 to 0.043. Silverman (1985) estimates λ by cross-validation, but does not report its value. In any case, it is not clear how one would compute a standard error for an estimate obtained by cross-validation. The Akaike information criterion (AIC) is 9.43. Figure 9.10 (i) presents the cubic spline. One of the advantages of representing the cubic spline by means of a statistical model is that, with little additional computing, we can obtain variances of our estimates and, therefore, standardised residuals defined as the residuals divided by the overall standard deviation. The 95% confidence intervals for the fitted spline are also given in Figure 9.10 (i). These are based on the root mean square errors of the smoothed estimates obtained from Vₜ as computed by (4.31), but without an allowance for the uncertainty arising from the estimation of λ, as discussed in §7.3.7.

In Figure 9.10 (ii) the standardised irregular is presented for the cubic spline model and it is evident that the errors are heteroscedastic. Harvey and Koopman (2000) correct for heteroscedasticity by fitting a local level signal through the absolute values of the smoothed estimates of the irregular component. Subsequently, the measurement error variance σ_ε² of the original cubic spline model is replaced by σ_ε² hₜ*², where hₜ* is the smoothed estimate of the local level signal, scaled so that the average of the hₜ*'s is one. The hₜ*'s are always positive because the weights of a local level model with uncorrelated disturbances are always positive. The absolute values of the smoothed irregular and the hₜ*'s are presented in Figure 9.11 (i). Estimating the heteroscedastic model, that is, with measurement error variances proportional to the hₜ*²'s, gives an AIC of 8.74 (treating the hₜ*'s as given). The resulting spline, shown in Figure 9.11 (ii), is not too different from the one shown in Figure 9.10 but the confidence interval is much narrower at the beginning and at the end. The smoothed irregular component in Figure 9.11 (iii) is now closer to being homoscedastic.


Fig. 9.11. Motorcycle acceleration data analysed by a cubic spline corrected for heteroscedasticity. (i) absolute values of smoothed irregular and hₜ*; (ii) data with signal and confidence intervals; (iii) standardised irregular.


9.6 Approximate methods for modelling volatility

Let yₜ be the first difference of log prices of some portfolio of stocks, bonds or foreign currencies. Such a financial time series will normally be approximately serially uncorrelated. It may not be serially independent, however, because of serial dependence in the variance. This behaviour can be modelled by

yₜ = σₜεₜ = σεₜ exp(hₜ/2),  εₜ ~ N(0, 1),  t = 1, …, n,   (9.2)

where

hₜ₊₁ = φhₜ + ηₜ,  ηₜ ~ N(0, σ_η²),  |φ| < 1,   (9.3)

and the disturbances εₜ and ηₜ are mutually and serially uncorrelated. In this exposition we assume that yₜ has zero mean. The term σ is a scale factor, φ is an unknown coefficient, and ηₜ is a disturbance term which in the simplest model is uncorrelated with εₜ. This model is called a stochastic volatility (SV) model; it is discussed further in §10.6.1. The model can be regarded as the discrete time analogue of the continuous time model used in papers on option pricing, such as Hull and White (1987). The statistical properties of yₜ are easy to determine. However, the model as it stands is not linear and therefore the techniques described in Part I of this book cannot provide an exact solution. We require the methods of Part II for a full analysis of this model.

To obtain an approximate solution based on a linear model, we transform the observations yₜ as follows:

log yₜ² = κ + hₜ + ξₜ,  t = 1, …, n,   (9.4)

where

ξₜ = log εₜ² − E(log εₜ²)

and

κ = log σ² + E(log εₜ²).   (9.5)

The noise term ξₜ is not normally distributed, but the model for log yₜ² is linear and therefore we can proceed approximately with the linear techniques of Part I. This approach is taken by Harvey, Ruiz and Shephard (1994), who call it quasi-maximum likelihood (QML). Parameter estimation is done via the Kalman filter; smoothed estimates of the volatility component, hₜ, can be constructed and forecasts of volatility can be generated. One of the attractions of the QML approach is that it can be carried out straightforwardly using STAMP. This is an advantage compared with the more complex methods of Part II.
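A sketch of the transformation step is given below; the small offset guarding against log 0 is an assumed implementation device, and the constant E(log εₜ²) ≈ −1.2704 for standard normal εₜ is a known fact used in forming κ.

import numpy as np

# Minimal sketch of the QML transformation (9.4): the transformed series
# log y_t^2 can then be treated by the Kalman filter as a linear model.
def qml_transform(y, offset=1e-7):
    x = np.log(y**2 + offset)   # offset guards against log(0) on zero returns
    e_log_eps2 = -1.2704        # E(log eps_t^2) for eps_t ~ N(0, 1)
    return x, e_log_eps2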

In our illustration of the QML method we use the same data as analysed by Harvey et al. (1994), in which yₜ is the first difference of the logged exchange rate between the pound sterling and the US dollar. Parameter estimation is carried out by STAMP and some estimation results are given below. The high value for the normality statistic confirms that the observation error of the SV model is not Gaussian. The smoothed estimated volatility, measured as exp(hₜ/2), is presented in Figure 9.12. A further discussion of the SV model is given in §10.6.1.

Fig. 9.12. Estimated exp(hₜ/2) for the Pound series.

Summary statistics
                SVDLPound
Std.Error       1.7549
Normality       30.108
H(314)          1.1705
r(1)            -0.019915
r(29)           0.044410
Q(29,27)        19.414

Estimated variances of disturbances.
Component       SVDLPound     (q-ratio)
Irr             2.9278        ( 1.0000)
Ar1             0.011650      ( 0.0040)

Estimated autoregressive coefficient.
The AR(1) rho coefficient is 0.986631.


Part II

Non-Gaussian and nonlinear state space models

In Part I we presented a comprehensive treatment of the construction and analysis of linear Gaussian state space models, and we discussed the software required for implementing the related methodology. Methods based on these models, possibly after transformation of the observations, are appropriate for a wide range of problems in practical time series analysis.

There are situations, however, where the linear Gaussian model fails to provide an acceptable representation of the behaviour of the data. For example, if the observations are monthly numbers of people killed in road accidents in a particular region, and if the numbers concerned are relatively small, the Poisson distribution will generally provide a more appropriate model for the data than the normal distribution. We therefore need to seek a suitable model for the development over time of a Poisson variable rather than a normal variable. Similarly, there are cases where a linear model fails to represent the behaviour of the data to an adequate extent. For example, if the trend and seasonal terms of a series combine multiplicatively but the disturbance term is additive, a linear model is inappropriate. In Part II we develop a unified methodology, based on efficient simulation techniques, for handling broad classes of non-Gaussian and nonlinear state space models.


10 Non-Gaussian and nonlinear state space models

10.1 Introduction

In this chapter we present the classes of non-Gaussian and nonlinear models that we shall consider in this book; we leave aside the analysis of observations generated by these models until later chapters. In §10.2 we specify the general forms of non-Gaussian models that we shall study and in §§10.3, 10.4 and 10.6 we consider special cases of the three main classes of models of interest, namely exponential family models, heavy-tailed models and financial models. In §10.5 we describe a class of nonlinear models that we shall investigate.

10.2 The general non-Gaussian model

The general multivariate non-Gaussian model that we shall consider has a similar state space structure to (3.1) in the sense that observational vectors yₜ are determined by a relation of the form

p(yₜ|α₁, …, αₜ, y₁, …, yₜ₋₁) = p(yₜ|Zₜαₜ),   (10.1)

while the state vectors are determined independently of previous observations by the relation

αₜ₊₁ = Tₜαₜ + Rₜηₜ,  ηₜ ~ p(ηₜ),   (10.2)

for t = 1, …, n, where the ηₜ's are serially independent and where either p(yₜ|Zₜαₜ) or p(ηₜ) or both can be non-Gaussian. The matrix Zₜ has a role and a form analogous to those in the linear Gaussian models considered in Part I. We denote Zₜαₜ by θₜ and refer to it as the signal. While we begin by considering a general form for p(yₜ|θₜ), we shall pay particular attention to two special cases:

(1) observations which come from exponential family distributions with densities of the form

p(yₜ|θₜ) = exp[yₜ′θₜ − bₜ(θₜ) + cₜ(yₜ)],  −∞ < θₜ < ∞,   (10.3)

where bₜ(θₜ) is twice differentiable and cₜ(yₜ) is a function of yₜ only;


(2) observations generated by the relation

yₜ = θₜ + εₜ,  εₜ ~ p(εₜ),   (10.4)

where the εₜ's are non-Gaussian and serially independent.

The model (10.3) together with (10.2), where ηₜ is assumed to be Gaussian, was introduced by West, Harrison and Migon (1985) under the name the dynamic generalised linear model. The origin of this name is that in the treatment of non-time series data the model (10.3), with θₜ = Zₜα where α does not depend on t, is called a generalised linear model. In this context θₜ is called the link function; for a treatment of generalised linear models see McCullagh and Nelder (1989). Further development of the West, Harrison and Migon model is described in West and Harrison (1997). Smith (1979), Smith (1981) and Harvey and Fernandes (1989) gave an exact treatment for the special case of a Poisson observation with mean modelled as a local level model. Their approach however does not lend itself to generalisation.

10.3 Exponential family models

For model (10.3), let

ḃₜ(θₜ) = ∂bₜ(θₜ)/∂θₜ  and  b̈ₜ(θₜ) = ∂²bₜ(θₜ)/∂θₜ∂θₜ′.   (10.5)

For brevity, we will write ḃₜ(θₜ) as ḃₜ and b̈ₜ(θₜ) as b̈ₜ in situations where it is unnecessary to emphasise the dependence on θₜ. Using the standard results

E[∂ log p(yₜ|θₜ)/∂θₜ] = 0,
E[∂² log p(yₜ|θₜ)/∂θₜ∂θₜ′] = −E[{∂ log p(yₜ|θₜ)/∂θₜ}{∂ log p(yₜ|θₜ)/∂θₜ′}],   (10.6)

it follows immediately from (10.3) and (10.6) that

E(yₜ) = ḃₜ  and  Var(yₜ) = b̈ₜ.

Consequently b̈ₜ must be positive definite for non-degenerate models. Results (10.6) are easily proved, assuming that relevant regularity conditions are satisfied, by differentiating the relation

∫ p(yₜ|θₜ) dyₜ = 1

twice with respect to θₜ and using the result

∂ log p(yₜ|θₜ)/∂θₜ = p(yₜ|θₜ)⁻¹ ∂p(yₜ|θₜ)/∂θₜ.


10.3.1 POISSON DENSITY

For our first example of an exponential family distribution, suppose that the univariate observation yₜ comes from a Poisson distribution with mean μₜ. For example, yₜ could be the number of road accidents in a particular area during the month. Observations of this kind are called count data.

The logdensity of yₜ is

log p(yₜ|μₜ) = yₜ log μₜ − μₜ − log yₜ!.   (10.7)

Comparing (10.7) with (10.3) we see that we need to take θₜ = log μₜ and bₜ = exp θₜ with θₜ = Zₜαₜ, so the density of yₜ given the signal θₜ is

p(yₜ|θₜ) = exp[yₜθₜ − exp θₜ − log yₜ!],  t = 1, …, n.   (10.8)

It follows that the mean ḃₜ = exp θₜ = μₜ equals the variance b̈ₜ = μₜ. Mostly we will assume that ηₜ in (10.2) is generated from a Gaussian distribution, but all or some elements of ηₜ may come from other continuous distributions.
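As a small sketch, the Poisson member of (10.3) can be coded directly from these definitions:

import numpy as np
from math import lgamma

# Minimal sketch of the Poisson case of (10.3): theta = log(mu),
# b(theta) = exp(theta), c(y) = -log(y!).
def poisson_logdensity(y, theta):
    return y * theta - np.exp(theta) - lgamma(y + 1.0)

# mean = b'(theta) = exp(theta); variance = b''(theta) = exp(theta)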

10.3.2 BINARY DENSITY

An observation yₜ has a binary distribution if the probability that yₜ = 1 has a specified value, say πₜ, and the probability that yₜ = 0 is 1 − πₜ. For example, we could score 1 if Cambridge won the Boat Race in a particular year and 0 if Oxford won.

Thus the density of yₜ is

p(yₜ|πₜ) = πₜ^yₜ (1 − πₜ)^(1−yₜ),  yₜ = 0, 1,   (10.9)

so we have

log p(yₜ|πₜ) = yₜ[log πₜ − log(1 − πₜ)] + log(1 − πₜ).   (10.10)

To put this in form (10.3) we take θₜ = log[πₜ/(1 − πₜ)] and bₜ(θₜ) = log(1 + exp θₜ), and the density of yₜ given the signal θₜ is

p(yₜ|θₜ) = exp[yₜθₜ − log(1 + exp θₜ)],   (10.11)

for which cₜ = 0. It follows that the mean and variance are given by

E(yₜ) = ḃₜ = πₜ,  Var(yₜ) = b̈ₜ = πₜ(1 − πₜ),

as is well-known.

10.3.3 BINOMIAL DENSITY

Observation yₜ has a binomial distribution if it is equal to the number of successes in kₜ independent trials with a given probability of success, say πₜ. As in the binary case we have

log p(yₜ|πₜ) = yₜ[log πₜ − log(1 − πₜ)] + kₜ log(1 − πₜ) + log{kₜ!/(yₜ!(kₜ − yₜ)!)},   (10.12)

with yₜ = 0, …, kₜ. We therefore take θₜ = log[πₜ/(1 − πₜ)] and bₜ(θₜ) = kₜ log(1 + exp θₜ), giving for the density of yₜ in form (10.3)

p(yₜ|θₜ) = exp[yₜθₜ − kₜ log(1 + exp θₜ) + log{kₜ!/(yₜ!(kₜ − yₜ)!)}].   (10.13)

10.3.4 NEGATIVE BINOMIAL DENSITY

There are various ways of defining the negative binomial density; we consider the case where yₜ is the number of independent trials, each with a given probability πₜ of success, that are needed to reach a specified number kₜ of successes. The density of yₜ is

p(yₜ|πₜ) = [(yₜ − 1)!/{(kₜ − 1)!(yₜ − kₜ)!}] πₜ^kₜ (1 − πₜ)^(yₜ−kₜ),  yₜ = kₜ, kₜ + 1, …,   (10.14)

and the logdensity is

log p(yₜ|πₜ) = yₜ log(1 − πₜ) + kₜ[log πₜ − log(1 − πₜ)] + log[(yₜ − 1)!/{(kₜ − 1)!(yₜ − kₜ)!}].   (10.15)

We take θₜ = log(1 − πₜ) and bₜ(θₜ) = kₜ[θₜ − log(1 − exp θₜ)] so the density in the form (10.3) is

p(yₜ|θₜ) = exp[yₜθₜ − kₜ{θₜ − log(1 − exp θₜ)} + log{(yₜ − 1)!/((kₜ − 1)!(yₜ − kₜ)!)}].   (10.16)

Since in non-trivial cases 1 − πₜ < 1 we must have θₜ < 0, which implies that we cannot use the relation θₜ = Zₜαₜ since Zₜαₜ is not restricted to negative values. A way around the difficulty is to take θₜ = −exp θₜ*, where θₜ* = Zₜαₜ. The mean E(yₜ) is given by

ḃₜ = kₜ[1 + exp θₜ/(1 − exp θₜ)] = kₜ/(1 − exp θₜ) = kₜ/πₜ,

as is well-known.

10.3.5 MULTINOMIAL DENSITY

Suppose that we have h > 2 cells for which the probability of falling in the ith cell is πᵢₜ, and suppose also that in kₜ independent trials the number observed in the ith cell is yᵢₜ for i = 1, …, h. An example is monthly opinion polls of voting preference: labour, conservative, liberal democrat, others.


Let yₜ = (y₁ₜ, …, y_{h−1,t})′ and πₜ = (π₁ₜ, …, π_{h−1,t})′ with Σ_{i=1}^{h−1} πᵢₜ < 1. Then yₜ is multinomial with logdensity

log p(yₜ|πₜ) = Σ_{i=1}^{h−1} yᵢₜ[log πᵢₜ − log(1 − Σ_{j=1}^{h−1} πⱼₜ)] + kₜ log(1 − Σ_{j=1}^{h−1} πⱼₜ) + log Cₜ,   (10.17)

for 0 ≤ yᵢₜ ≤ kₜ, where

Cₜ = kₜ! / [(∏_{i=1}^{h−1} yᵢₜ!)(kₜ − Σ_{i=1}^{h−1} yᵢₜ)!].

We therefore take θₜ = (θ₁ₜ, …, θ_{h−1,t})′, where θᵢₜ = log[πᵢₜ/(1 − Σ_{j=1}^{h−1} πⱼₜ)], and

bₜ(θₜ) = kₜ log(1 + Σ_{i=1}^{h−1} exp θᵢₜ),

so the density of yₜ in form (10.3) is

p(yₜ|θₜ) = exp[yₜ′θₜ − kₜ log(1 + Σ_{i=1}^{h−1} exp θᵢₜ) + log Cₜ].   (10.18)

10.4 Heavy-tailed distributions

10.4.1 t-DISTRIBUTION

A common way to introduce error terms into a model with heavier tails than those of the normal distribution is to use Student's t. We therefore consider modelling εₜ of (10.4) by the t-distribution with logdensity

log p(εₜ) = log a(v) + ½ log k − ((v + 1)/2) log(1 + kεₜ²),   (10.19)

where v is the number of degrees of freedom and

a(v) = Γ((v + 1)/2) / [Γ(v/2)Γ(1/2)],  k⁻¹ = (v − 2)σ_ε²,  σ_ε² = Var(εₜ),  v > 2,  t = 1, …, n.

The mean of εₜ is zero and the variance is σ_ε² for any v degrees of freedom, which need not be an integer. The quantities v and σ_ε² can be permitted to vary over time, in which case k also varies over time.


10.4.2 MIXTURE OF NORMALS

A second common way to represent error terms with tails that are heavier than those of the normal distribution is to use a mixture of normals with density

p(εₜ) = λ*(2πσ_ε²)^(−1/2) exp{−εₜ²/(2σ_ε²)} + (1 − λ*)(2πχσ_ε²)^(−1/2) exp{−εₜ²/(2χσ_ε²)},   (10.20)

where λ* is near to one, say 0.95 or 0.99, and χ is large, say from 10 to 100. This is a realistic model for situations when outliers are present, since we can think of the first normal density of (10.20) as the basic error density which applies 100λ* per cent of the time, and the second normal density of (10.20) as representing the density of the outliers. Of course, λ* and χ can be made to depend on t if appropriate. The investigator can assign values to λ* and χ but they can also be estimated when the sample is large enough.

10.4.3 GENERAL ERROR DISTRIBUTION

A third heavy-tailed distribution that is sometimes used is the general error distribution with density

p(εₜ) = (w(ℓ)/σ_ε) exp[−c(ℓ)|εₜ/σ_ε|^ℓ],  1 ≤ ℓ ≤ 2,   (10.21)

where

c(ℓ) = [Γ(3/ℓ)/Γ(1/ℓ)]^(ℓ/2),  w(ℓ) = ℓ[Γ(3/ℓ)]^(1/2) / (2[Γ(1/ℓ)]^(3/2)).

Some details about this distribution are given by Box and Tiao (1973, §3.2.1), from which it follows that Var(εₜ) = σ_ε² for all t.

10.5 Nonlinear models

In this section we introduce a class of nonlinear models which is obtained from the standard linear Gaussian model (3.1) in a natural way by permitting yₜ to depend nonlinearly on αₜ in the observation equation and αₜ₊₁ to depend nonlinearly on αₜ in the state equation. Thus we obtain the model

yₜ = Zₜ(αₜ) + εₜ,  εₜ ~ N(0, Hₜ),   (10.22)
αₜ₊₁ = Tₜ(αₜ) + Rₜηₜ,  ηₜ ~ N(0, Qₜ),   (10.23)

for t = 1, …, n, where Zₜ(·) and Tₜ(·) are differentiable vector functions of αₜ with dimensions p and m respectively. In principle it would be possible to extend this model by permitting εₜ and ηₜ to be non-Gaussian but we shall not pursue this extension in this book. Models with general forms similar to this were considered by Anderson and Moore (1979), but only for filtering, and their solutions were approximate only, whereas our treatment is exact. A simple example of the relation


(10.22) is a structural time series model in which the trend μₜ and seasonal γₜ combine multiplicatively and the observation error εₜ is additive, giving

yₜ = μₜγₜ + εₜ;

a model of this kind has been considered by Shephard (1994).

10.6 Financial models

10.6.1 STOCHASTIC VOLATILITY MODELS

In the standard state space model (3.1) the variance of the observational error εₜ is frequently assumed to be constant. In the analysis of financial time series, such as daily fluctuations in stock prices and exchange rates, it is often found that the observational error variance is subject to substantial variability over time. This phenomenon is usually referred to as volatility clustering. An allowance for this variability in models for such series may be achieved via the stochastic volatility (SV) model. The SV model has a strong theoretical foundation in the financial theory of option pricing based on the work of the economists Black and Scholes; for a discussion see Taylor (1986). Further, the SV model has a strong connection with the state space approach, as will become apparent below. In later chapters we shall give an exact treatment of this model based on simulation, in contrast to the approximate treatment of a zero-mean version of the model in §9.6 based on the linear Gaussian model. An alternative treatment of models with stochastic heterogeneous error variances is provided by the autoregressive conditional heteroscedasticity (ARCH) model, which will be discussed in §10.6.2.

Denote the first (daily) differences of a particular series of asset log prices by yₜ. Financial time series are often constructed by first differencing log prices of some portfolio of stocks, bonds, foreign currencies, etc. A basic SV model for yₜ is given by

yₜ = a + σ exp(θₜ/2)εₜ,  εₜ ~ N(0, 1),   (10.24)

where the mean a and the average standard deviation σ are assumed fixed and unknown. The signal θₜ is regarded as the unobserved log-volatility and it can be modelled in the usual way by θₜ = Zₜαₜ, where αₜ is generated by (10.2). In standard cases θₜ is modelled by the AR(1) with Gaussian disturbances, that is

θₜ₊₁ = φθₜ + ηₜ,  ηₜ ~ N(0, σ_η²),  0 < φ < 1,   (10.25)

but the generality of the state equation (10.2) can be exploited. For a review of work on and developments of the SV model see Shephard (1996) and Ghysels, Harvey and Renault (1996).
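To make the structure concrete, the following small sketch simulates the basic SV model (10.24)-(10.25); all parameter values shown are illustrative assumptions only.

import numpy as np

# Minimal sketch: simulate y_t = a + sigma * exp(theta_t / 2) * eps_t with
# theta_{t+1} = phi * theta_t + eta_t; parameter values are illustrative.
def simulate_sv(n, a=0.0, sigma=1.0, phi=0.98, sigma_eta=0.15, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.empty(n)
    theta[0] = rng.normal(0.0, sigma_eta / np.sqrt(1.0 - phi**2))  # stationary
    for t in range(n - 1):
        theta[t + 1] = phi * theta[t] + rng.normal(0.0, sigma_eta)
    y = a + sigma * np.exp(theta / 2.0) * rng.normal(size=n)
    return y, theta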

Various extensions of the SV model can be considered. For example, the Gaussian distributions can be replaced by distributions with heavier tails such as the t-distribution. This extension is often appropriate because many empirical studies find outliers due to unexpected jumps or downfalls in asset prices caused by 'overheated' markets. A key example is the 'black Monday' crash in October 1987.

The basic SV model (10.24) captures only the salient features of changing volatility in financial series over time. The model becomes more precise when the mean of yₜ is modelled by incorporating explanatory variables. For example, the SV model may be formulated as

b(L)yₜ = a + c(L)′xₜ + σ exp(θₜ/2)εₜ,

where L is the lag operator defined so that L^j zₜ = zₜ₋ⱼ for zₜ = yₜ, xₜ, and where b(L) = 1 − b₁L − ⋯ − b_{p*}L^{p*} is a scalar lag polynomial of order p*; the column vector polynomial c(L) = c₀ + c₁L + ⋯ + c_{k*}L^{k*} contains k* + 1 vectors of coefficients and xₜ is a vector of exogenous explanatory variables. Note that the lagged value yₜ₋ⱼ, for j = 1, …, p*, can be considered as an explanatory variable to be added to the exogenous explanatory variables. The signal θₜ may also depend on explanatory variables.

Another useful SV model extension is the inclusion of the volatility in the mean process, which permits the measurement of risk premiums offered by the market. When risk (as measured by volatility) is higher, traders want to receive higher premiums on their transactions. Such a model is labelled the SV in Mean (SVM) model and its simplest form is given by

yₜ = a + d exp(θₜ) + σ exp(θₜ/2)εₜ,

where d is the risk premium coefficient which is fixed and unknown. Other forms of the SVM model may also be considered but this one is particularly convenient; see the discussion in Koopman and Hol-Uspensky (1999).

Finally, another characteristic of financial time series is the phenomenon of leverage. The volatility of financial markets may adapt differently to positive and negative shocks. It is often observed that markets might remain more or less stable when large positive earnings have been achieved, but when huge losses have to be digested, markets become more unpredictable in the periods ahead. To incorporate leverage in the model we introduce εₜ₋₁, …, εₜ₋ₚ, for some lag p, as explanatory variables in the generating equation for θₜ.

The extended SV model is then given by

b(L)yₜ = a + c(L)′xₜ + d exp(θₜ) + σ exp(θₜ/2)εₜ,   (10.26)

and

φ(L)θₜ = β(L)′xₜ + γ(L)εₜ + ηₜ,   (10.27)

with lag polynomials φ(L) = 1 − φ₁L − ⋯ − φₚL^p, β(L) = β₀ + β₁L + ⋯, and γ(L) = γ₁L + ⋯ + γ_qL^q. The disturbances εₜ and ηₜ are not necessarily Gaussian; they may come from densities with heavy tails.



Parameter estimation for the SV models based on maximum likelihood has been considered a difficult problem; see the reviews in Shephard (1996) and Ghysels et al. (1996). Linear Gaussian techniques only offer approximate maximum likelihood estimates of the parameters, as is shown in §9.6, and can only be applied to the basic SV model (10.24). The techniques we develop in the remaining chapters of this book, however, provide analyses of SV models based on importance sampling methods which are as accurate as is required.

10.6.2 GENERAL AUTOREGRESSIVE CONDITIONAL HETEROSCEDASTICITY

The general autoregressive conditional heteroscedasticity (GARCH) model, a special case of which was introduced by Engle (1982) and is known as the ARCH model, is a widely discussed model in the financial and econometrics literature. A simplified version of the GARCH(1,1) model is given by

yₜ = σₜεₜ,  εₜ ~ N(0, 1),  σₜ₊₁² = α*yₜ² + β*σₜ²,   (10.28)

where the parameters to be estimated are α* and β*. For a review of the GARCH model and its extensions see Bollerslev, Engle and Nelson (1994).

It is shown by Barndorff-Nielsen and Shephard (2001) that recursion (10.28) is equivalent to the steady state Kalman filter for a particular representation of the SV model. Consider the model

yₜ = σₜεₜ,  εₜ ~ N(0, 1),
σₜ₊₁² = φσₜ² + ηₜ,  ηₜ > 0,   (10.29)

for t = 1, …, n, where the disturbances εₜ and ηₜ are serially and mutually independently distributed. Possible distributions for ηₜ are the gamma, inverse gamma or inverse Gaussian distributions. We can write the model in its squared form as follows:

yₜ² = σₜ² + uₜ,  uₜ = σₜ²(εₜ² − 1),

which is in linear state space form with E(uₜ) = 0. The Kalman filter provides the minimum mean squared error estimator aₜ of σₜ². When in steady state, the Kalman update equation for aₜ₊₁ can be represented as the GARCH(1,1) recursion

aₜ₊₁ = α*yₜ² + β*aₜ,

with

α* = φP/(P + 1),  β* = φ/(P + 1),

where P is the steady state value of Pₜ in the Kalman filter, which we have defined for the local level model in §2.11 and for the general linear model in §4.2.3. We note that α* + β* = φ.
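A sketch of this recursion, with α* and β* treated as given quantities, is:

import numpy as np

# Minimal sketch: run the GARCH(1,1)-type recursion a_{t+1} = alpha* y_t^2
# + beta* a_t, which coincides with the steady state Kalman update above.
def garch_recursion(y, alpha_star, beta_star, a1):
    a = np.empty(len(y) + 1)
    a[0] = a1
    for t in range(len(y)):
        a[t + 1] = alpha_star * y[t]**2 + beta_star * a[t]
    return a   # a[t] estimates sigma_t^2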


10.6.3 DURATIONS: EXPONENTIAL DISTRIBUTION

Consider a series of transactions in a stock market in which the tth transaction is time-stamped by the time τₜ at which it took place. When studying the behaviour of traders in the market, attention may be focused on the duration between successive transactions, that is yₜ = Δτₜ = τₜ − τₜ₋₁. The duration yₜ with mean μₜ can be modelled by a simple exponential density given by

p(yₜ|μₜ) = μₜ⁻¹ exp(−yₜ/μₜ),  yₜ > 0,  μₜ > 0.   (10.30)

This density is a special case of the exponential family of densities and to put it in the form (10.3) we define

θₜ = −1/μₜ  and  bₜ(θₜ) = log μₜ = −log(−θₜ),

so we obtain

log p(yₜ|θₜ) = yₜθₜ + log(−θₜ).   (10.31)

Since ḃₜ = −θₜ⁻¹ = μₜ, we confirm that μₜ is the mean of yₜ, as is obvious from (10.30). The mean is restricted to be positive and so we model θₜ* = log(μₜ) rather than μₜ directly. The durations in financial markets are typically short at the opening and closing of the daily market hours due to heavy trading in these periods. The time stamp τₜ is therefore often used as an explanatory variable in the mean function of durations and, in order to smooth out the huge variations of this effect, a cubic spline is used. A simple durations model which allows for the daily seasonal is then given by

θₜ* = γ(τₜ₋₁) + ψₜ,  ψₜ = ρψₜ₋₁ + χₜ,  χₜ ~ N(0, σ_χ²),

where γ(·) is the cubic spline function and χₜ is serially uncorrelated. Such models can be regarded as state space counterparts of the influential autoregressive conditional duration (ACD) model of Engle and Russell (1998).

10.6.4 TRADE FREQUENCIES: POISSON DISTRIBUTION

Another way of analysing market activity is to divide the daily market trading period into intervals of one or five minutes and record the number of transactions in each interval. The counts in each interval can be modelled by a Poisson density for which the details are given in §10.3.1. Such a model would be a basic discrete version of what Rydberg and Shephard (1999) have labelled BIN models.


11 Importance sampling

11.1 Introduction

In this chapter we begin the development of a comprehensive methodology for the analysis of observations from the non-Gaussian and nonlinear models that we specified in the previous chapter. Since no purely analytical techniques are available, the methodology will be based on simulation. The simulation techniques we shall describe were considered briefly for maximum likelihood estimation of the parameters in these models by Shephard and Pitt (1997) and in more detail by Durbin and Koopman (1997). These techniques were extended to provide comprehensive classical and Bayesian analyses by Durbin and Koopman (2000); the resulting methods are easy to apply using publicly available software and are computationally efficient.

We shall consider inferential aspects of the analyses from both the classical and the Bayesian standpoints. From the classical point of view we shall discuss the estimation of functions of the state vector and the estimation of the error variance matrices of the resulting estimates. We shall also provide estimates of conditional densities, distribution functions and quantiles of interest, given the observations. Methods of estimating unknown parameters by maximum likelihood will be developed. From the Bayesian standpoint we shall describe methods for estimating posterior means, variances and densities.

The methods are based on standard ideas in simulation methodology, namely importance sampling and antithetic variables, as described, for example, in Ripley (1987, pp. 122-123). In this chapter we will develop the basic ideas of importance sampling that we employ in our methodology. Details of applications to particular models will be given in later chapters.

Denote the stacked vectors (α₁′, …, α′ₙ₊₁)′ and (y₁′, …, yₙ′)′ by α and y. From the classical standpoint, most of the problems we shall consider in this book are essentially the estimation of the conditional mean

x̄ = E[x(α)|y] = ∫ x(α)p(α|y) dα   (11.1)

of an arbitrary function x(α) of α given the observation vector y. This formulation includes estimates of quantities of interest such as the mean E(αₜ|y) of the state vector αₜ given y and its conditional variance matrix Var(αₜ|y); it also includes


estimates of the conditional density and distribution function of x(α) given y when x(α) is scalar. The conditional density p(α|y) depends on an unknown parameter vector ψ, but in order to keep the notation simple we shall not indicate this dependence explicitly in this chapter. In applications based on classical inference, ψ is replaced by its maximum likelihood estimator while in Bayesian analyses ψ is treated as a random vector.

In theory, we could draw a random sample of values from the distribution with density p(α|y) and estimate x̄ by the sample mean of the corresponding values of x(α). In practice, however, since explicit expressions are not available for p(α|y) for the models of Chapter 10, this idea is not feasible. Instead, we seek a density as close to p(α|y) as possible for which random draws are available, and we sample from this, making an appropriate adjustment to the integral in (11.1). This technique is called importance sampling and the density is referred to as the importance density. The techniques we shall describe will be based on Gaussian importance densities since these are available for the problems we shall consider and they work well in practice. We shall use the generic notation g(·), g(·, ·) and g(·|·) for Gaussian marginal, joint and conditional densities.

Techniques for handling t-distributions and Gaussian mixture distributions without using importance sampling are considered in §§11.9.5 and 11.9.6.

11.2 Basic ideas of importance sampling

In order to keep the exposition simple we shall assume in this section and the next that the initial density p(α₁) is non-degenerate and known. The case where some of the elements of α₁ are diffuse will be considered in §11.9.4, where we shall show that the treatment of partially diffuse initialisation is a good deal simpler for the non-Gaussian case than for the linear Gaussian model.

For given ψ, let g(α|y) be a Gaussian importance density which is chosen to resemble p(α|y) as closely as is reasonably possible; we have from (11.1),

x̄ = ∫ x(α) {p(α|y)/g(α|y)} g(α|y) dα = E_g[x(α)p(α|y)/g(α|y)],   (11.2)

where E_g denotes expectation with respect to the importance density g(α|y). For the models of Chapter 10, p(α|y) and g(α|y) are complicated algebraically, whereas the corresponding joint densities p(α, y) and g(α, y) are straightforward. We therefore put p(α|y) = p(α, y)/p(y) and g(α|y) = g(α, y)/g(y) in (11.2), giving

x̄ = {g(y)/p(y)} E_g[x(α)p(α, y)/g(α, y)].   (11.3)

Putting x(α) = 1 in (11.3) we have

1 = {g(y)/p(y)} E_g[p(α, y)/g(α, y)].   (11.4)


Taking the ratio of (11.3) and (11.4) gives

x̄ = E_g[x(α)w(α, y)] / E_g[w(α, y)],  where w(α, y) = p(α, y)/g(α, y).   (11.5)

In this formula,

p(α, y) = p(α₁) ∏ₜ₌₁ⁿ p(ηₜ)p(yₜ|αₜ),   (11.6)

with the substitution ηₜ = Rₜ′(αₜ₊₁ − Tₜαₜ) for t = 1, …, n. The formula for log g(α, y) will be given in equation (11.9). Expression (11.5) provides the basis for the simulation methods in this book. We could in principle obtain a Monte Carlo estimate x̂ of x̄ in the following way. Choose a series of independent draws α⁽¹⁾, …, α⁽ᴺ⁾ from the distribution with density g(α|y) and take

x̂ = Σᵢ₌₁ᴺ xᵢwᵢ / Σᵢ₌₁ᴺ wᵢ,  where xᵢ = x(α⁽ⁱ⁾) and wᵢ = w(α⁽ⁱ⁾, y).   (11.7)

Since the draws are independent, and under assumptions which are usually satisfied in cases of practical interest, x̂ converges to x̄ probabilistically as N → ∞. This simple estimate however is numerically inefficient and we shall improve it considerably in Chapter 12.
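A sketch of this simple ratio estimate is given below; draw_g and log_w are hypothetical placeholders for a routine drawing from g(α|y) and one evaluating log w(α, y) = log p(α, y) − log g(α, y).

import numpy as np

# Minimal sketch of the ratio estimate (11.7); 'draw_g' and 'log_w' are
# hypothetical placeholder routines supplied by the user.
def importance_estimate(x_fn, draw_g, log_w, N, rng):
    draws = [draw_g(rng) for _ in range(N)]
    logw = np.array([log_w(a) for a in draws])
    w = np.exp(logw - logw.max())       # stabilised importance weights
    xs = np.array([x_fn(a) for a in draws])
    return np.sum(xs * w) / np.sum(w)   # x-hat of (11.7)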

An important special case is where the observations are non-Gaussian but the state vector is generated by the state equation in the linear Gaussian model (3.1). We then have p(α) = g(α), so

p(α, y)/g(α, y) = p(α)p(y|α) / {g(α)g(y|α)} = p(y|α)/g(y|α) = p(y|θ)/g(y|θ),

where as before θ is the stacked vector of the signals θₜ = Zₜαₜ for t = 1, …, n. Thus (11.5) becomes the simpler formula

x̄ = E_g[x(α)w*(θ, y)] / E_g[w*(θ, y)],  where w*(θ, y) = p(y|θ)/g(y|θ);   (11.8)

its estimate x̂ is given by an obvious analogue of (11.7) which we shall improve in Chapter 12. The advantage of (11.8) relative to (11.5) is that the dimensionality of θₜ is often much smaller than that of αₜ. In the important case in which yₜ is univariate, θₜ is a scalar.

11.3 Linear Gaussian approximating models

In this section we obtain the Gaussian importance densities that we need for simu-lation. We shall take these as the densities derived from the linear Gaussian models which have the same conditional modes of a given y as the non-Gaussian models.

Since we are considering the case of p(a\) known, it is reasonable to suggest taking the density of a\ in the approximating model as the Gaussian density g(ai)

Page 209: Time Series Analysis by State Space Methods by Durbin and Koopman

192 IMPORTANCE SAMPLING

which has the same mean vector and variance matrix as p{a\). In fact, this does not have much practical significance since the assumption that p(ct\) is known was made for simplicity of exposition in the early stages of the development of the theory. In practice a\ normally consists of diffuse or stationary elements and the treatment we shall give of this case in §11.9.4 does not normally require the construction of g(ai).

Let g(a\y) and g(a, y) be the conditional and joint densities generated by linear Gaussian model (3.1) and let p(a\y) and p(a, y) be the corresponding densities generated by the general model (10.1) and (10.2). We will determine the approximating model by choosing Ht and Qt in model (3.1) so that densities g(a\y) and p(a\y) have the same mode a. The possibility that p(a, y) might be multimodal will be considered in §11.8. Taking the Gaussian model first, the mode a is the solution of the vector equation 9 log g{a\y)/da — 0. Now logg(a|y) = logg(a, y) — logg(y) . Thus the mode is also the solution of the vector equation d log g(a, y)/da — 0. This version of the equation is easier to manage since g(a, y) has a simple form whereas g(a | y) does not. Since Rt consists of columns of Im, r)t — R't(at+i — T,at). Assuming that g(ai) ~ N(ai, Pi), we therefore have

\ logg(a, y) = constant - - ( « i - a y ) ' - ax) 2

1 " - r E («'+1 - W RtQT'R'M 4-1 - Ttat) 1

- ~ Z ^ y ^ i y t - Zt<*t). (11.9)

Differentiating with respect to at and equating to zero gives the equations

(dt - O P f V i - fli) - dtRt-1 < 2 ~ \ - T^a^)

+ T/RtQ^R;(at+1 - T^t) + Z'tH~\yt - Ztat) - 0, (11.10)

for t — 1 , . . . , n, where d\ = 0 and dt — 1 for t — 2 , . . . , n, together with the equation

~ Tnan) - 0.

The solution to these equations is the conditional mode o:. Since g(a \ y) is Gaussian the mode is equal to the mean so a can be routinely calculated by the Kalman filter and smoother as described in Chapters 4 and 5. It follows that linear equations of the form (11.10) can be solved by the Kalman filter and smoother which is efficient computationally.

Assuming that the non-Gaussian model (10.1) and (10.2) is sufficiently well behaved, the mode a of /?(a|y) is the solution of the vector equation

aiogp(a |y) da

Page 210: Time Series Analysis by State Space Methods by Durbin and Koopman

11.4. LINEARISATION BASED ON FIRST TWO DERIVATIVES 193

and hence, as in the Gaussian case, of the equation

9 log/?(«,)>) _ 0

da Let qt{rjt) = - log p(iy) and let ht(yt\6t) = - log p(y, |0,) where 6t — Ztat. Then,

n logp(a , y) - constant + log/j(ai) - ^ [q, (rjt) + ht CM^)], (11.11)

with r)t — R[(<XT+\ — T(at), so a is a solution of the vector equations

aiog^a,^) Slog/KaO , dqt-i(r}t-1) — = (1 - dt) dtRt_ j ————-

oat dan or)t-1 + T , R i ^ M _ z f h , i y , W,)

3% ddt

for t — 1 , . . . , n, where, as before, d\ — 0 and d, — 1 for t — 2 , . . . , n, together with the equation

dVn We solve these equations by iteration, where at each step we linearise,

put the result in the form (11.10) and solve by the Kalman filter and smoother. Convergence is fast and normally only around ten iterations or less are needed. The final linearised model in the iteration is then the linear Gaussian model with the same conditional mode of a given y as the non-Gaussian model. We use the conditional density of a given y for this model as the importance density. A different method of solving these equations was given by Fahrmeir and Kaufmann (1991) but it is more cumbersome than our method; moreover, while it finds the mode, it does not find the linear approximating model directly.

11.4 Linearisation based on first two derivatives

We shall consider two methods of linearising the observation component of the mode equation (11.12). The first method is based on the first two derivatives of the observational logdensity and it enables exponential family observations, such as Poisson distributed observations, to be handled; the second method is given in §11.5 and deals with observations having the form (10.4) when p(st) is a function of £?; this is suitable for distributions with heavy tails such as the /-distribution.

To simplify the exposition we shall ignore the term (dt — l)P11(o'f — a\)

in (11.10) and (1 — dt)d \ogp(on)/dai in (11.12) in the derivation of the approximating model that follows. While there would be no difficulty in including them if this were thought to be necessary, we shall claim in §11.9.4 that the contri-butions of their analogues in the case of diffuse initialisation, which is the important case in practice, are generally negligible.

Page 211: Time Series Analysis by State Space Methods by Durbin and Koopman

194 IMPORTANCE SAMPLING

Suppose that a = [ a \ , . . . , is atrial value of a, let 0t — Ztat and define

ht = dht(yt\9t)

d&(

Expanding about 9t gives approximately dht{yt\et)

ht = d%(yt\o{)

ot = Qt detse; et - e; (11.13)

= ht+ht(8t~0t). (11.14) ovt

Substituting in the final term of (11.12) gives the linearised form

-Z't(ht + ht0t - htet). (11.15)

To put this in the same format as the final term of (11.10) put

H t = h ; \ yt = e t - h ; l h t . (11.16)

Then the final term becomes Z'tH~l(yt — 9t) as required. Consider, for example, the important special case in which the state equation

retains the original linear Gaussian form a i + i = Ttat + Rtrjt in which Tji ~ N(0, Qt). Equations (11.12) then have the linearised form

- Tt-Xat-i) + T;RtQ;lR't(at+l ~~ Ttat) + Z'tHTxiyt-Ztat) = 0, (11.17)

which has the same structure as (11.10) and therefore can be solved for a by the Kalman filter and smoother to give a new trial value; the process is repeated until convergence. The values of a and 9 after convergence to the mode are denoted by a and 9, respectively.

It is evident from (11.16) that this method only works when h{ is positive definite. When ht is negative definite or semi-definite, the method based on one derivative, as presented in the next section, should normally be used. Finally, it is important to note that in the special case when the state equation is linear and Gaussian, the second derivative of the log of the approximating density with respect to a t is

—tff-iGtJi^i-i — Tt'R,Qt iR'iTt — Z'f htZt,

for t — 2 , . . . , n, and

~RtQ7lK for i — n + 1, which is the same as that of the logdensity of the non-Gaussian density. This means that not only does the approximating linear Gaussian model have the same conditional mode as the non-Gaussian model when the state equation is linear and Gaussian, it also has the same curvature at the mode. In consequence, when the state equation is given by (10.2) and p(r}t) is not too far from normality, it is reasonable to expect that the curvatures at the mode of the non-Gaussian model and the approximating model will be approximately equal. This is a useful bonus for the technique of equalising modes.

Page 212: Time Series Analysis by State Space Methods by Durbin and Koopman

11.4. LINEARISATION BASED ON FIRST TWO DERIVATIVES 195

11.4.1 EXPONENTIONAL FAMILY MODELS

The most important application of these results is to observations from exponential family distributions. For density (10.3) we have

ht{yt\0t) = - l ogp(y , \ 0 t ) - -\y[et - bM) + ct(yt)}. (11.18)

Define dbt{dt)

bt

bt =

8BT

d2bt(0t) ddtde;

BT - E;

0t = §t

Then h( =bt — yt and ht — bt so using (11.16) we take Ht — bf-1 and yt —

Bt — bt (bt — yt). These values can be substituted in (11.17) to obtain a solution for the case where the state equation is linear and Gaussian. Since, as shown in §10.3, bt = VarOvlflf), it is positive definite in non-degenerate cases, so for the exponential family, linearisation based on first two derivatives can always be used.

As an example, for the Poisson distribution with density (10.8) we have b, — bt exp(0r) so we take Ht — exp(—Bt) and yt — Bt — 1 + Qxp(—0t)yt. Other examples for the exponential family models are given in Table 11.1.

11.4 .2 STOCHASTIC VOLATILITY M O D E L

For the basic SV model (10.24) we have

ht(ytIOt) ~ log p(yt19t) - ~[xf exp(-0 i) + log 2na2],

where xt — (yr —a)/a. It follows that

h ~ ~\ [*t e xP (~°t) - !]. kt = ^xf exp(-0r).

Using the definitions in (11.16), we have

H t - 2exp(0f)/*?, Jv =6t + I - exp (Bt)/xf,

where we note that Ht is always positive as required. The approximating model for the SV model (10.24) is obtained by starting with

putting Bt — 0 and computing Ht and yt as given above. New 6>f's are obtained by applying the Kalman filter and smoother to the model implied by (11.16). The new Bt values can be used to compute new H/s and new yt's. This recursive process is repeated until convergence.

11.5 Linearisation based on the first derivative

We now consider the case where the observations are generated by model (10.4) and where linearisation is based on the first derivative of the observational logdensity.

Page 213: Time Series Analysis by State Space Methods by Durbin and Koopman

196 IMPORTANCE SAMPLING

Table 11.1. Approximating model details for exponential family models.

Distribution Poisson bt exp 9{

bt exp dt

h exp 9(

K lbt 1

binary bt log(l +exp 9t) bt exp 0,(1 + exp et)~l

h lbt

expô^l +exp0,)~2

K lbt 1 + exp 9t

binomial bt kt log(l + expfy) bt kt exp0,(1 -+- exp0,)_I

bt kt exp 0,(1 + exp0,)~2

b't 1bt t + exp 9t

negative binomial bt kt{@t — log(l — exp0,)} bt kt{\ -exp^r1

h 1bt

kt exp 0/(1 — exp0,)~2

K 1bt exp(-0,)™l exponential bt - log 9t

bt -or1 h 8t~2

K lbt ~ot Note that and yt — 6t + Hty, -

We shall assume that y, is univariate and that p(et) is a function of sf; this case is important for heavy-tailed densities such as the ^-distribution, and for Gaussian mixtures with zero means.

Let log p(st) — —^h*(ef). Then the contribution of the observation component to the equation 3 log p(a, y)/dat — 0 is

ldh*(e?)def ,dhUef)

Let dh*(s?)

K = d t f ) £t = yt-ot'

(11.20)

Then take Z'th*(yt — 0t) as the linearised form of (11.19). By taking H~x = h* we have the observation component in the correct form (11.10) so we can use the Kalman filter and smoother at each step of the solution of the equation 3 log p(ct, y)/d(xt — 0. We emphasise that now only the mode of the implied

Page 214: Time Series Analysis by State Space Methods by Durbin and Koopman

11.5. LINEARISATION BASED ON THE FIRST DERIVATIVE 197

Gaussian density of st is equal to that of p(st), compared to linearisation using two derivatives with the linear Gaussian state which equalised both modes and curvatures at the mode.

It is of course necessary for this method to work that h{ is positive with probability one; however, this condition is satisfied for the applications we consider below. Strictly speaking it is not essential that p(st) is a function of s f . In other cases we could define

h* = 1 d log p(st) st dst st = yt -Õt'

(11.21)

and proceed in the same way. Again, the method only works when h* is positive with probability one.

11.5.1 DISTRIBUTION

As a first example, we take the i-distribution (10.19) for which

h* = (V + l)[cr? (v - 2) + (yt - 6 t f } - \

11.5.2 MIXTURE OF NORMALS

From (10.20),

h* = —2 log p(st)

— —2 log

So from (11.20) we have

K = 1

P&t) | Ot^l-KO.

where sf = (yt - 6tf/(2<y*).

11.5.3 GENERAL ERROR DISTRIBUTION

From (10.21),

log p(et) — constant — c(l)

X* 1 — X* exp ( - s f ) + — — _ exp ( ~ s f / X )

ajx^X^e

Í < 2.

Since this is not a function of sf we employ (11.21) which gives i-1

h* = 1

C(t)i S.Or

£-1

for &t > 0,

for et < 0,

Page 215: Time Series Analysis by State Space Methods by Durbin and Koopman

198 IMPORTANCE SAMPLING

where et — yt — df. Thus h* is positive with probability one so we can take H~l = ht and proceed with the construction of the linear Gaussian approximation as before.

11.6 Linearisation for non-Gaussian state components

We now consider the linearisation of the state component in equations (11.12) when the state disturbances r}t are non-Gaussian. Suppose that r\ — [?][,..., fj'n}' is atrial value of r] ~ (i][,..., r)'n)' where f\t — R't (at+i — Ttat\ We shall confine ourselves to the situation where the elements r]n of r\t are mutually independent and where the density p(r}it) of r)it is a function of r]ft. These assumptions are not very restrictive since they enable us to deal relatively easily with two cases of particular interest in practice, namely heavy-tailed errors and models with structural shifts for univariate series.

Let q*t(r]ft) — —2log p(t]n) and denote the /th column of Rt by Rit. Then the state contribution to the conditional mode equations (11.12) is

o dtRi, d<ih M t \) T/J? dvîM)

driu-i dm

The linearised form of (11.22) is r

~ ^ydtRij-iqlt-iVU-i ~ Tt'Ruqpiitl

where

m dq*M)

ànl

t = 1,..., n.

(11.22)

(11.23)

(11.24) m = rit

Putting Qt 1 = diag(^* , . . . , q*t), T), = R[ (a i + I - Ttat), and similarly for Qt+\

and rj t+i, we see that (11.23) has the same form as the state component of (11.10). Consequently, in the iterative estimation of a the Kalman filter and smoother can be used to update the trial value a .

Hie general form for the joint density g(ce, y) for all three types of linearisation is

g(a, y) = constant x g(«i)exp E Wer1*+j, (u.25) where for the approximation in §11.4 we substitute et = yt — 6t and H, = h~ in (11.25), where y, and 0t are the values of yt and 0, at the mode and ht is the value of h, given by (11.13) at the mode; for the approximation in §11.5 we substitute £t = yt — 6t and Ht ~ h} , where 0t is as before and h* is the value given by (11.20) at the mode; and finally, for the approximation in this section we substitute

Page 216: Time Series Analysis by State Space Methods by Durbin and Koopman

11.7. LINEARISATION FOR NONLINEAR MODELS 199

rjt — R't(oit+1 — Ttat) and Qt — diag(^j ? , . . . , q*rt), where q]t is the value given by (11.24) at the mode.

This approach can be extended to functions of rja other than r]ft and to non-diagonal variance matrices Qt.

11.6.1 i-DISTRIBUTION FOR STATE ERRORS

As an illustration we consider the local level model (2.3) but with a ^-distribution for the state error term rj, so that

yt — at + st, st ~ N(0, a£2),

ofr+i = <xt + r\t, j]t ~ iv,

and we assume that ai ~ N(0, K) with tc —oo. It follows that

Starting with initial values for r)t, we compute q* and apply the Kalman filter and disturbance smoother to the approximating Gaussian local level model with

<*t+i = <*t + fit, Vt ~ N(0, q*).

New values for the smoothed estimates f j , are used to compute new values for q* until convergence to fjt.

11.7 Linearisation for nonlinear models

We now develop an analogous treatment of the nonlinear model (10.22) and (10.23). As with the non-Gaussian models, our objective is to find an approximating linear Gaussian model with the same conditional mode of a given y as the nonlinear model. We do this by a technique which is slightly different, though simpler, than that used for the non-Gaussian models. The basic idea is to linearise the obser-vation and state equations (10.22) and (10.23) directly, which immediately delivers an approximating linear Gaussian model. We then iterate to ensure that this approximating model has the same conditional mode as the original nonlinear model.

Taking first the observation equation (10.22), let a t be a trial value of a t . Expanding about a t gives approximately

Zt(at) - Zt(at) + Zt(cet)(at - af),

where Z(at) — dZt(ctt)/da't, which is a p xm matrix, and Zt(at) is the value of this matrix at at — a,. Putting yt — yt — Zt(at) + Zt(at)at, we approximate (10.22) by the relation

y, - Zt(a()at+et, (11.26)

which is in standard form. Similarly, if we expand (10.23) about a t we obtain approximately

Tt(at) = T{{a.t) + f{(cit)(at - «,),

Page 217: Time Series Analysis by State Space Methods by Durbin and Koopman

200 IMPORTANCE SAMPLING

where f(ots) — dTt(oit)/da't, which is a mxm matrix. Thus we obtain the linearised relation

at+i = ft + Tt(at)at + Rtr}t, (11.27)

where

ft = Tt(at)-ft(<it)at.

This is not quite in standard form because of the presence of the non-random term %. Instead of approximating the nonlinear model (10.22) and (10.23) at its conditional mode by a linear Gaussian model of the standard form (3.1), we must therefore consider approximating it by a linear Gaussian model of the modified form

y, - Z,(a,)oct + st, et~ N(0, Ht\ <11 28) at+l = ft + tt(cet)at + Rtrit, rjt ~ N(0, Qt\

where ft is a known vector. This does not raise any difficulties for the methods of Chapter 4. It can easily be shown that the standard Kalman filter (4.13) applies to model (11.28) except that the state update equation (4.8),

at+i — T,at + Ktvt>

is replaced by the equation

at+1 = ft + Ttat + Ktvt,

with the appropriate change of notation from model (4.1) to model (11.28). We use the output of the Kalman filter to define a new a t which gives a

new approximating model (11.28), and we continue to iterate as in §11.4 until convergence is achieved. Denote the resulting value of a by a .

To show that a is the conditional mode for the nonlinear model (10.22) and (10.23), we write

p(a, y) — constant — ^ £>t+1 - TMy\'RtQ-l#t{ctt+l - Tt(at)} t=i

+ D * ™ ZtMYH^tot - ZtM) t=i

Differentiating with respect to a t gives for the mode equation

- dtRt-iQt-iR't_x[cxt ~ r ,-i(a,-i)]

+ tt{atyRtQjlR't[at+l - Tt(at)\ (11.29)

- Zt(atyH~l\yt - Zt(oct)] - 0,

Page 218: Time Series Analysis by State Space Methods by Durbin and Koopman

11.7. LINEARISATION FOR NONLINEAR MODELS 201

for / = I n with d\ — 0 and dt — 1 for t > 1. The mode equation for the approximating model (11.28) is

- dtRt-iQt-iR't-ilat ~ Tt-i ~ Tt-i(at-i)at-i]

+ tt(aiyRtQ-lRl[al+i-ft-Tt(at)at] (11.30)

- Zt{atyH~l{yt - Z((a,)a(],

for t — 1 , . . . , n. Now when the iteration converges we have a t = a t — ott, so

a,+i - ft - Tt{at)at = at+l - Tt{6tt) + Tt{a{)at - ft(at)a(

— «i+i - Tt(a:),

and

yt - Zt(at)at = y, - Zt(at) + Zt(at)at - Zt(cit)at

= yt- Zt(&t).

It follows that a ( — &t satisfies (11.29) so a is the conditional mode for the nonlinear model. As in §11.3, the linear approximating model (11.28) at a = a is used to provide the density.

Suppose that, as in the non-Gaussian case, we wish to estimate the mean x of a function x(ar) of the state vector a . The same formula (11.5) applies, that is,

_ Kg\_x(ot)w(ct, y)] p(a,y) x ^ — — — — , with w(ot, y)= ,

Eg[w(a, y)] g(a, y)

where, by neglecting p(ct\) as before,

p(a, y) — constant x exp (11.31)

with the substitutions ly — R't[at+1 — Tt(at)] and st ~ yt — Zt(at). For g(ay y) the same formula (11.31) applies, with substitutions

T]t = R',[at+i ~ t t - ft(at)at], where ft ^ Tt(at) - ft(at)6t},

and

£,=$,- Zt{at)ctt, where % = yt - Zt(at) + Zc(a()a(.

11.7.1 MULTIPLICATIVE M O D E L S

Shephard (1994) considered the multiplicative trend and seasonal model with additive Gaussian observation noise. We consider here a simple version of this model with the trend modelled as a local level and the seasonal given by a single trigonometric term such as given in (3.6) with s — 3. We have

yt = f i tYt+E t , s t ~ N(0, ore2),

Page 219: Time Series Analysis by State Space Methods by Durbin and Koopman

202 IMPORTANCE SAMPLING

with

a t + i = 1 0 0 0 cos X sin A, 0 — sin X cos X

a t +

and X = 27t/3. It follows that Zt(at) = toYt and Zt(at) — (yt, \xt, 0) which lead us to the approximating model

where et and are mutually and serially uncorrelated Gaussian disturbance terms. For the general model (10.22) and (10.23) we have at — st, rjr = (s,+lt W Zt(at) = iitst, Ht = 0, Tt(at) = <JL&, 0, 0)', Rt - [0, J2]' and Qt is a 2 x 2 diagonal matrix. It follows that

with yt — yt -fLtet. Thus the Kalman filter and smoother can be applied to an approximating time-varying local level model with state vector a t — fi t.

11.8 Estimating the conditional mode

So far we have emphasised the use of the mode a of p(ajy) to obtain a linear approximating model which we use to obtain Gaussian importance densities for simulation. If, however, the sole object of the investigation was to estimate a and if economy in computation was desired, then a could be used for the purpose without recurse to simulation; indeed, this was the estimator used by Durbin and Koopman (1992) and an approximation to it was used by Fahrmeir (1992).

The property that the conditional mode is the most probable value of the state vector given the observations can be regarded as an optimality property; we now consider a further optimality property possessed by the conditional mode. To find it we examine the analogous situation in maximum likelihood estimation. The maximum likelihood estimate of a parameter \jr is the most probable value of it given the observations and is well known to be asymptotically efficient. To develop a finite-sample property analogous to asymptotic efficiency, Godambe (1960) and Durbin (1960) introduced the idea of unbiased estimating equations and Godambe showed that the maximum likelihood estimate of scalar \jr is the solution to an

% = ( f t , to, 0)af

where yt — yt + jityt. Another example of a multiplicative model we consider is

yt ^ (J-t+i = toHt,

%t 0 fit Zt(oit) = (st, (it, 0), Tt(at)= 0 0 0 .

0 0 0 The approximating model (11.26) and (11.27) reduces to

yt = £tfJ>t + — Ailr +1t&t + &

Page 220: Time Series Analysis by State Space Methods by Durbin and Koopman

11.8. ESTIMATING THE CONDITIONAL MODE 203

unbiased estimating equation which has a minimum variance property. This can be regarded as a finite-sample analogue of asymptotic efficiency. The extension to multidimensional ijr was indicated by Durbin (1960). Since that time there have been extensive developments of this basic idea, as can be seen from the collection of papers edited by Basawa, Godambe and Taylor (1997). Following Durbin (1997), we now develop a minimum-variance unbiased estimating equation property for the conditional mode estimate a of the random vector a.

If a* is the unique solution for a of the rnn x 1 vector equation H(a, y) — 0 and if E[H(a, y)] — 0, where expectation is taken with respect to the joint density p(a, y), we say that H(a, y) = 0 is an unbiased estimating equation. It is obvious that the equation can be multiplied through by an arbitrary nonsingular matrix and still give the same solution a*. We therefore standardise H(a, y) in the way that is usual in estimating equation theory and multiply it by \E{H{a, y)}]-1, where H(a, y) = 3 H(a, y)/da!, and then seek a minimum variance property for the resulting function h(a, y) = [E{#(a, y)}]_1/7(a, y).

Let

Var[h(a,y)]=Eih(ot,y)h(a,yy], d log p(a, y) 9 log p(a, y)

J ^ E dot da'

Under mild conditions that are likely to be satisfied in many practical cases, Durbin (1997) showed that Var [h (a, y)] — J~l is non-negative definite. If this is a zero matrix we say that the corresponding equation H(a, y) — 0 is an optimal esti-mating equation. Now take H(a, y) — d log p(a, y)fda. Then E[H(a, y)] = — J, so h(pi, y) — — J~ld log p(a, y)/dot. Thus Var[h(a, y)] — J~l and consequently the equation 3 log p(a, y)/da — 0 is optimal. Since a is the solution of this, it is the solution of an optimal estimating equation. In this sense the conditional mode has an optimality property analogous to that of maximum likelihood estimates of fixed parameters in finite samples.

We have assumed above that there is a single mode and the question arises whether multimodality will create complications. If multimodality is suspected it can be investigated by using different starting points and checking whether iterations from them converge to the same mode. In none of the cases we have examined has multimodality of p(a\y) caused any difficulties. For this reason we regard it as unlikely that this will give rise to problems in routine time series analysis. If, however, multimodality were to occur in a particular case, we would suggest fitting a linear Gaussian model to the data at the outset and using this to define the first importance density gi(rj\y) and conditional joint density gi(r), y). Simulation is employed to obtain a first estimate rf^ of E(r) \ y) and from this a first estimate of 6, is calculated for t ~ 1,..., n. Now linearise the true densities at or to obtain a new approximating linear Gaussian model which defines a new gMy), g2(*l\y)> and a new g(rj, y), g%{ri, y). Simulation using these gives a new estimate rf^ of E(?/jy). This iterative process is continued until adequate

Page 221: Time Series Analysis by State Space Methods by Durbin and Koopman

204 IMPORTANCE SAMPLING

convergence is achieved. We emphasise, however, that it is not necessary for the final value of a at which the model is linearised to be a precisely accurate esti-mate of either the mode or the mean of /?(a|y). The only way that the choice of the value of a used as the basis for the simulation affects the final estimate x is in the variances due to simulation as we shall show later. Where necessary, the simu-lation sample size can be increased to reduce these error variances to any required extent. It will be noted that we are basing these iterations on the mean, not the mode. Since the mean, when it exists, is unique, no question of 'multimeanality' can arise.

11.9 Computational aspects of importance sampling

11.9.1 INTRODUCTION

In this section we describe the practical methods we use to implement importance sampling for the models of Chapter 10. The first step is to express the relevant formulae in terms of variables which are as simple as possible; we do this in §11.9.2. In §11.9.3 we describe antithetic variables, which increase the efficiency of the simulation by introducing a balanced structure into the simulation sample. Questions of initialisation of the approximating linear Gaussian model are considered in §11.9.4. In §11.9.5 and §11.9.6 we develop simulation-based methods which do not require importance sampling for special cases of linear state space models for which the errors have /-distributions or distributions which are mixtures of Gaussian distributions.

11.9.2 PRACTICAL IMPLEMENTATION OF IMPORTANCE SAMPLING

Up to this point we have based our exposition of the ideas underlying the use of importance sampling on a and y since these are the basic vectors of interest in the state space model. However, for practical computations it is important to express formulae in terms of variables that are as simple as possible. In particular, in place of the a, 's it is usually more convenient to work with the state disturbance terms rjt = R [ (at+1 Ttat). We therefore consider how to reformulate the previous results in terms of r) rather than a.

By repeatedsubstitutionfiromtherelationoi/+i = Tta, + Rtt),,fort = 1 , . . . , n, we express Jc(a) as a function of ott and i}\ for notational convenience and because we intend to deal with the initialisation in §11.9.4, we suppress the dependence on ot\ and write as a function of rj in the form x*(t]). We next note that we could have written (11.1) in the form

(11.32)

Analogously to (11.5) we have

_ Bg[x*{tiyw*(rjiy)] x — — , £«[»>*(??, y)]

X — (11.33)

Page 222: Time Series Analysis by State Space Methods by Durbin and Koopman

11.9. COMPUTATIONAL, ASPECTS OF IMPORTANCE SAMPLING 205

In this formula, Eg denotes expectation with respect to importance density g(i]\y), which is the conditional density of rj given y in the approximating model, and

n

p(v, y) = n p(rjt)p(yt\0t), r=i

where 9t = Ztcct. In the special case where yt = 9t + et, p(yt\Qt) — p(£t)- In a similar way, for the same special case,

n

g(n,y) = Y[8(rit)g(Gt)> t=1

For cases where the state equation is not linear and Gaussian, formula (11.33) provides the basis for the simulation estimates. When the state is linear and Gaussian, p(rjt) — g(rjt) so in place of w*(rj, y) in (11.33) we take

= (11-34) t=1

For the case p{rjt) — gOh) and yt — 9t + £t, we replace w*(rj, y) by

PM w U 11.9.3 ANTITHETIC VARIABLES

The simulations are based on random draws of rj from the importance density g(r}\y) using the simulation smoother as described in §4.7; this computes efficiently a draw of rj as a linear function of rn independent standard normal deviates where r is the dimension of vector r)t and n is the number of observations. Efficiency is increased by the use of antithetic variables. An antithetic variable in this context is a function of a random draw of rj which is equiprobable with r] and which, when included together with r) in the estimate of x increases the efficiency of the estimation. We shall employ two types of antithetic variables. The first is the standard one given by r\ — 2r) — rj where r) ~Eg(r]\y) is obtained from the disturbance smoother as described in §4.4. Since T) — f j — —(77 — rj) and rj is normal, the two vectors rj and fj are equi-probable. Thus we obtain two simulation samples from each draw of the simulation smoother; moreover, values of conditional means calculated from the two samples are negatively correlated, giving further efficiency gains. When this antithetic is used we say that the simulation sample is balanced for location.

The second antithetic variable was developed by Durbin and Koopman (1997). Let u be the vector of rn N(0,1) variables that is used in the simulation smoother to generate rj and let c — u'u\ then c ~ x}n- For a given value of c let q — Pr(x^M < c) — F(c) and let c = F _ 1 ( l — <?)• Then as c varies, c and c have the same distribution. Now take, f j — f j + +/c/c(tj — fj). Then f j has the same distribution as rj. This follows because c and {tj — fj)/sfc are independently

Page 223: Time Series Analysis by State Space Methods by Durbin and Koopman

206 IMPORTANCE SAMPLING

distributed. Finally, take 77 = ?) + Jc(c(r] — rj). When this antithetic is used we say that the simulation sample is balanced for scale. By using both antitheses we obtain a set of four equi-probable values of q for each run of the simulation smoother giving a simulation sample which is balanced for location and scale.

The number of antithetics can be increased without difficulty. For example, take c and q as above. Then q is uniformly distributed on (0, 1) and we write q ~ U(0,1). Let ql—q-\- 0.5 modulo 1; then q\ ~ £7(0, 1) and we have a balanced set of four £/(0, 1) variables, q, q\, 1 —q and 1 — q\. Take c — F _ 1 ( l — q) as before and similarly c\ — F~l(q\) and c\ — F~l( 1 — q\). Then each of c\ and c\ can be combined with r} and q as was c previously and we emerge with a balanced set of eight equi-probable values of rj for each simulation. In principle this process could be extended indefinitely by taking q\ ~ q and qj+i ~ qj -f- 2~k

modulo 1, for j — 1 , . . . , 2k~l and k — 2, 3 , . . . ; however, two or four values of q are probably enough in practice. By using the standard normal distribution function applied to elements of a, the same idea could be used to obtain a new balanced value q 1 from ry so by taking q 1 —2rj — t)i we would have four values of q to combine with the four values of c. In the following we will assume that we have generated N draws of r\ using the simulation smoother and the antithetic variables; this means that TV is a multiple of the number of different values of r) obtained from a single draw of the simulation smoother. For example, when 250 simulation samples are drawn by the smoother and the two basic antithetics are employed, one for location and the other for scale, N ~ 1000. In practice, we have found that satisfactory results are obtained by only using the two basic antithetics.

In theory, importance sampling could give an inaccurate result on a particular occasion if in the basic formulae (11.33) very high values of w*(q, y) are associated with very small values of the importance density g(rj|y) in such a way that together they make a significant contribution to x, and if also, on this particular occasion, these values happen to be over- or under-represented; for further discussion of this point see Gelman et al. (1995, p. 307). In practice, we have not experienced difficulties from this source in any of the examples we have considered.

11.9.4 DIFFUSE INITIALISATION

We now consider the situation where the model is non-Gaussian and some elements of the initial state vector are diffuse, the remaining elements having a known joint density; for example, they could come from stationary series. Assume that is given by (5.2) with qq ~ Poi^o) where pq(-) is a known density. It is legitimate to assume that 5 is normally distributed as in (5.3) since we intend to let tc —> 00. The joint density of a and y is

n p(at y) - pfoo)*(i) n P(nt)p(yt\0tl (11.36)

t=i

Page 224: Time Series Analysis by State Space Methods by Durbin and Koopman

11.9. COMPUTATIONAL, ASPECTS OF IMPORTANCE SAMPLING 207

withrjo — Ro(cti —a),8 = A'(a\ — a)andr?< — R't(at+1 — Ttat)for/ — 1 , . . . , n, since p(yt\at) = p(yt\9t).

As in §11.3, we find the mode of p(a|y) by differentiating log p(a, y) with respect to a i, . . . , «„+1. Forgiven K the contribution from 3 log g(8)/d<xi is —A8/K which —• 0 as K —> oo. Thus in the limit the mode equation is the same as (11.12) except that 3 log p(a.\)/da\ is replaced by 3 log p(r]o)/da{. In the case that ai is entirely diffuse, the term p(r)o) does not enter into (11.36), so the procedure given in § 11.4 for finding the mode applies without change.

When p(r]o) exists but is non-Gaussian, it is preferable to incorporate a normal approximation to it, g(r)o) say, in the approximating Gaussian density g(a, y), rather than include a linearised form of its derivative 3 log p(rjo)/drjo within the linearisation of 3 log p(a, y)/da. The reason is that we are then able to initialise the Kalman filter for the linear Gaussian approximating model by means of the standard initialisation routines developed in Chapter 5. For g(rj0) we could take either the normal distribution with mean vector and variance matrix equal to those of p(r)o) or with mean vector equal to the mode of p(rjo) and variance matrix equal to [—d2p(R}o)/d R}0D TJ'Q]'1 • For substitution in the basic formula (11.5) we take

, . p(.m)p(<x2,---,0tn+l,y\rio) /if m\ w(a, y) = - ; , (11.37) g(m)gK<*2, ...,an+uy\m)

since the denisties p(8) and g(<5) are the same and therefore cancel out; thus w(a,y) remains unchanged as K -> oo. The corresponding equation for (11.33) becomes simply

w(ri,y)=——— - . (11.38) g(riQ)g(m, •••,Vn,y)

While the expressions (11.37) and (11.38) are technically manageable, the practical worker may well believe in a particular situation that knowledge of p(r)o) contributes such a small amount of information to the investigation that it can be simply ignored. In that event the factor p(r}o)/g(r]o) disappears from (11.37), which amounts to treating the whole vector ct\ as diffuse, and this simplifies the analysis significantly. Expression (11.37) then reduces to

r-r p(ott)p(yt\at) w(a, y) - I I . gMgiytM

Expression (11.38) reduces to

h p(nt)p(yt\et) u > jLi g(*it)g(yt\0t)

with rjf — R't(at+\ — T,at) for / = 1 , . . . , n. For nonlinear models, the initialisation of the Kalman filter is similar and the

details are handled in the same way.

Page 225: Time Series Analysis by State Space Methods by Durbin and Koopman

208 IMPORTANCE SAMPLING

11.9.5 TREATMENT OF DISTRIBUTION WITHOUT IMPORTANCE SAMPLING

In some cases it is possible to construct simulations by using antithetic variables without importance sampling. For example, it is well-known that if a random variable ut has the standard /^distribution with v degrees of freedom then u, has the representation

£*~N(0,1) , C, ~X2(V)> v > 2 , (11.39) 1/2 '

Lt

where s* and ct are independent. In the case where v is not an integer we take jct as a gamma variable with parameter It follows that if we consider the case where et is univariate and we take s, in model (10.4) to have logdensity (10.19) then e, has the representation

(v - 2) i /2aeef * = - Trr^-' ( 1 L 4 ° )

<V

where s* and c, are as in (11.39). Now take sp . . . , s* and c\,..., c„ to be mutually independent. Then conditional on c\,..., cn fixed, model (10.4) and (10.2), with T}t ~ N(0, Qt), is a linear Gaussian model with Ht — Var(e/) — (i> — 2)or(?ci~1. Put c — (c i , . . . , c„)'. We now show how to estimate the conditional means of functions of the state using simulation samples from the distribution of c.

Suppose first that a, is generated by the linear Gaussian model at+i ~ Ttat + RtWt, Vf ~ N(0, Qt), and that as in (11.32) we wish to estimate

x = E[x*(r})\y)

= Jx*(v)p(c,r}\y)dcdr}

= jx*(7])p(r]\c, y)p(c\y)dcdrj

= Jx*(r})p(r]\c,y)p(c,y)p(yr1dcdr]

= P(y)~X Jx*(r))p(rj\c,y)p(y\c)p(c)dcdr]. (11.41)

For given c, the model is linear and Gaussian. Let

x(c) = Jx*(r})p(ri\c,y)dr).

For many cases of interest, x(c) is easily calculated by the Kalman filter and smoother, as in Chapters 4 and 5; to begin with, let us restrict attention to these cases. We have

p(y) = fp(y>c)dc = j p(y\c)p(c)dc,

Page 226: Time Series Analysis by State Space Methods by Durbin and Koopman

11.9. COMPUTATIONAL, ASPECTS OF IMPORTANCE SAMPLING 209

where p(y|c) is the likelihood given c which is easily calculated by the Kalman filter as in 7.2. Denote expectation with respect to density p{c) by Ec Then from (11.41),

_ Ec[x(c)p(y\cy\ x — — — •. (11.42)

Ec[/>(y|c)]

We estimate this by simulation. Independent simulation samples c(I), c ( 2 ) , . . . of c are easily obtained since c is a vector of independent x„ variables. We suggest that antithetic values of x l are employed for each element of c, either in balanced pairs or balanced sets of four as described in §11.9.3. Suppose that values . . . , c ^ have been selected. Then estimate x by

( 1 1 4 3 )

When x(c) cannot be computed by the Kalman filter and smoother we first draw a value of c ^ as above and then for the associated linear Gaussian model that is obtained when this value is fixed we draw a simulated value rf1^ of rj using the simulated smoother of §4.7, employing antithetics independently for both c ^ and The value x*(r}^) is then calculated for each If there are N pairs of values if** we estimate x by

" lXiP(.vk«>) ' '

Since we now have sampling variation arising from the drawing of values of rj as well as from drawing values of c, the variance of x* will be larger than of x for a given value of N. We present formulae (11.43) and (11.44) at this point for expository convenience in advance of the general treatment of analogous formulae in §12.2.

Now consider the case where the error term st in the observation equation is N(0, <re2) and where the elements of the error vector rjt in the state equation are independently distributed as Student's t. For simplicity assume that the number of degrees of freedom in these ^-distributions are all equal to v, although there is no difficulty in extending the treatment to the case where some of the degrees of freedom differ or where some elements are normally distributed. Analogously to (11.40) we have the representation

(v __ 2)l/2a -if rjit = 1/2 , rfu - N(°> 1). ~ X,2, v > 2, (11.45)

cu

for i — 1 , . . . , r and t — 1 , . . . , n, where a^ ~ Var(??;i). Conditional on e n , . . . , crn held fixed, the model is linear and Gaussian with H, — <7e

2 and rjt ~ N(0, Qr) where Q, = diag[(v - 2)a^cu

l, ...,(v- 2)a^c~1]. Formulae (11.43) and (11.44) remain valid except that c(i) is now a vector with r elements. The

Page 227: Time Series Analysis by State Space Methods by Durbin and Koopman

210 IMPORTANCE SAMPLING

extension to the case where both e, and elements of rjt have /-distributions is straightforward.

The idea of using representation (11.39) for dealing with disturbances with /-distributions in the local level model by means of simulation was proposed by Shephard (1994) in the context of MCMC simulation.

11.9.6 TREATMENT OF GAUSSIAN MIXTURE DISTRIBUTIONS WITHOUT

IMPORTANCE SAMPLING

An alternative to the /-distribution for representing error distributions with heavy tails is to employ the Gaussian mixture density (10.20), which for univariate st we write in the form

p(st) - X*N(0, cr2) + (1 - r )N(0 , X°f)-> 0 < A* < 1. (11.46)

It is obvious that values of et with this density can be realised by means of a two-stage process in which we first select the value of a binomial variable bt such that Pr(b, = 1) - X* and Pr(bt — 0) = 1 ~ A*, and then take et ~ N(0, cr2) if bt = 1 and s{ ~ N(0, if bt — 0. Assume that the state vector at is generated by the linear Gaussian model at+1 — Ttat Rti)t, r), ~ N(0, Qt). Putting b — (b\,..., bn)', it follows that for b given, the state space model is linear and Gaussian. We can therefore employ the same approach for the mixture distribution that we used for the /-distribution in the previous subsection, giving as in (11.41),

M m p x - p{yTlM~l J2 / x*(n)p(n\bUby)p(y\b(jMb{j))dr1, (11.47)

j=i J

where b(i),..., b(M) are the M = 2" possible values of b. Let

x(b) = J x*(7])p(n\b, y)dr},

and consider cases where this can be calculated by the Kalman filter and smoother. Denote expectation over the distribution of b by Et,. Then

M p(y) = M~l J2p(y\ba))p(hj)) = Efc[/>(y|Z>)],

j=1 and analogously to (11.42) we have,

_ Eb[x(b)p(y\b)] Eb[p(y\b))

We estimate this by simulation. A simple way to proceed is to choose asequence . . . , of random values of b and then estimate x by

(11.48)

Page 228: Time Series Analysis by State Space Methods by Durbin and Koopman

11.9. COMPUTATIONAL, ASPECTS OF IMPORTANCE SAMPLING 228

Variability in this formula arises only from the random selection of b. To construct antithetic variables for the problem we consider how this variability can be restricted while preserving correct overall probabilities. We suggest the following approach. Consider the situation where the probability 1 — k* in (11.46) of taking N(0, x a s ) is small. Take 1 — X* — l / B where B is an integer, say B = 10 or 20. Divide the simulation sample of values of b into K blocks of B, with N — KB. Within each block, and for each t — 1 , . . . , n, choose integer j randomly from 1 to B, put the yth value in the block as b, — 0 and the remaining B — 1 values in the block as bt = 1. Then take b(l) — (bi,..., bn)( with b\,... ,bn defined in this way for i — I,... ,N and use formula (11.49) to estimate x. With this procedure we have ensured that for each i, Pr(bt — 1) = X* as desired, with bs and bt independent for s ^ t, while enforcing balance in the sample by requiring that within each block b, has exactly B — 1 values of 1 and one value of 0. Of course, choosing integers at random from 1 to B is a much simpler way to select a simulation sample than using the simulation smoother.

The restriction of B to integer values is not a serious drawback since the results are insensitive to relatively small variations in the value of X*, and in any case the value of X* is normally determined on a trial-and error basis. It should be noted that for purposes of estimating mean square errors due to simulation, the numerator and denominator of (11.49) should be treated as composed of M independent values.

The idea of using the binomial representation of (11.46) in MCMC simulation for the local level model was proposed by Shephard (1994).

Page 229: Time Series Analysis by State Space Methods by Durbin and Koopman

12 Analysis from a classical standpoint

12.1 Introduction

In this chapter we will discuss classical inference methods based on importance sampling for analysing data from non-Gaussian and nonlinear models. In §12.2 we show how to estimate means and variances of functions of the state using simulation and antithetic variables. We also derive estimates of the additional variances of estimates due to simulation. We use these results in §12.3 to obtain estimates of conditional densities and distribution functions of scalar functions of the state. In §12.4 we investigate the complications due to missing observations and we also consider the related question of forecasting. Finally, in §12.5 we show how to estimate unknown parameters by maximum likelihood and we examine the effect of parameter estimation errors on estimates of variance.

12.2 Estimating conditional means and variances

Following up the treatment in §11.9.2, we now consider details of the estimation of conditional means x of functions x*(rj) of the stacked state error vector and the estimation of error variances of our estimates. Let

">*()?) = —

taking the dependence of w*(q) on y as implicit since y is constant from now on. Then (11.33) gives

x = g " — — , (12.1)

which is estimated by

EN ,-_I Xi W;

* = "N > (12-2)

Page 230: Time Series Analysis by State Space Methods by Durbin and Koopman

12.3. ESTIMATING CONDITIONAL DENSITIES AND DISTRIBUTION FUNCTIONS 213

where

p(v(i), y) Xi =x*(n(i)), wi = w*(rj^) = g(*}(i),y)'

and is the j'th draw of rj from the importance density g(r/|y) for i = 1, ..., N. Note that by 'the /th draw of rj' here we include antithetic variables as described in §11.9.3. For the case where x*(rj) is a vector we could at this point present formulae for estimating the matrix Var[x*(?})|y] and also the variance matrix due to simulation of x — x. However, from a practical point of view the covariance terms are of little interest so it seems sensible to focus on variance terms by taking x*(rj) as a scalar for estimation of variances; extension to include covariance terms is straightforward. We estimate Var[x*(r?)|y] by

- ^ ^ - (12.3)

The estimation error due to the simulation is

HiLi w»(xi -x — X

EN l wi

To estimate the vaiiance of this, consider the introduction of the antithetic variables as described in §11.9.3 and for simplicity will restrict the exposition to the case of the two basic antithetics for location and scale; the extension to a larger number of antithetics is straightforward. Denote the sum of the four values of Wi(xi — i ) that come from the jth run of the simulation smoother by vj and the sum of the corresponding values of w^Xi — x) by vj. For N large enough, since the draws from the simulation smoother are independent, the variance due to simulation is, to a good approximation,

which we estimate by

Wars(x) — , 7 (12.5) (EiLi w»i)

The ability to estimate simulation variances so easily is an attractive feature of our methods.

12.3 Estimating conditional densities and distribution functions

When x*(i]) is a scalar function the above technique can be used to estimate the conditional distribution function and the conditional density function of x given y. Let G[x|y] — Pr[x*(7j) < x|y] and let Ix(t]) be an indicator which is

Page 231: Time Series Analysis by State Space Methods by Durbin and Koopman

214 ANALYSIS FROM A CLASSICAL STANDPOINT

unity if X*(T]) < x and is zero if x*(RJ) > x. Then G(;t|y) = Eg(Ix(r})\y). Since Ix(v) is a function of t] we can treat it in the same way as x*(rj). Let Sx be the sum of the values of W; for which x{ < x, for i — 1 , . . . , N. Then estimate G(x |y) by

E*=i w

i

This can be used to estimate quantiles. We order the values of x, and we order the corresponding values u>,- accordingly. The ordered sequences for xt- and Wj are denoted by x^ and w^, respectively. The I00k% quantile is given by x^-j which is chosen such that

I t i«m

We may interpolate between the two closest values for m in this approximation to estimate the 100k% quantile. The approximation error becomes smaller as N increases.

Similarly, if S is the interval (x. — jd, x + \d) where d is suitably small and positive, let Ss be the sum of the values of W; for which x*(jj) e S. Then the estimate of the conditional density p{x\y) of x given y is

P(x\y) = d~l f . (12.7) Ei=i

This estimate can be used to construct a histogram. We now show how to generate a sample of M independent values from

the estimated conditional distribution of x*(rj) using importance resampling; for further details of the method see Gelfand and Smith (1999) and Gelman et al (1995). Take x[k] - xj with probability wj/ £ i l i w; for j = 1 , . . . , N. Then

=6{x\y). i wi

Thus x[k] is a random draw from the distribution function given by (12.6). Doing this M times with replacement gives a sample oi M < N independent draws. The sampling can also be done without replacement but the values are not then independent.

12.4 Forecasting and estimating with missing observations

The treatment of missing observations and forecasting by the methods of this chapter is straightforward. For missing observations, our objective is to estimate x ~ f x*(rj)p(ri\y)dr) where the stacked vector y contains only those observational elements actually observed. We achieve this by omitting from the linear Gaussian approximating model the observational components that correspond to the missing elements in the original model. Only the Kalman filter

Page 232: Time Series Analysis by State Space Methods by Durbin and Koopman

12.5. PARAMETER ESTIMATION 215

and smoother algorithms are needed in the determination of the approximating model and we described in §4.8 how the filter is modified when observational vectors or elements are missing. For the simulation, the simulation smoother of §4.7 must be similarly modified to allow for the missing elements.

For forecasting, our objective is to estimate yn+j = E(y„+y|y), j — 1, where we assume that y„ + i , . . . y n + j and a„ + 2, . . . , an+J have been generated by model (10.1) and (10.2), noting that a n + i has already been generated by (10.2) with t = n. It follows from (10.1) that

for jr" ~ 1 , . . . , where 8n+j — Zn+join+j, with Zn+\,..., Zn+J assumed known. We estimate this as in §12.2 with x*(rj) — E(y,l+y-\8n+j), extending the simulation smoother for t — n + 1 , . . . , n + J.

For exponential families,

l^n+V") — bn+j(0n+j),

as in §10.3 for / < n, so we take x*(r/) = bn+j($n+j), for j — 1 , . . . , J. For the model yt ~0t + st in (10.4) we take x*(rj) = 6t.

12.5 Parameter estimation

12.5.1 INTRODUCTION

In this section we consider the estimation of the parameter vector iff by maximum likelihood. Since analytical methods are not feasible we employ techniques based on simulation using importance sampling. We shall find that the techniques we develop are closely related to those we employed earlier in this chapter for estimation of the mean of x(a) given y, with x{ot) — 1. Estimation of tf/ by maximum likelihood using importance sampling was considered briefly by Shephard and Pitt (1997) and in more detail by Durbin and Koopman (1997) for the special case where p(yt\6t) is non-Gaussian but at is generated by a linear Gaussian model. In this section we will begin by considering first the general case where both p(yt\0t) and the state error density p(t]t) in (10.2) are non-Gaussian and will specialise later to the simpler case where p(rjt) is Gaussian. We will also consider the case where the state space models are nonlinear. Our approach will be to estimate the loglikelihood by simulation and then to estimate \fr by maximising the resulting value numerically.

12.5.2 ESTIMATION OF LIKELIHOOD

The likelihood is defined by — p(y where for convenience we suppress the dependence of L(\jr) on y, so we have

yn+j ~ EIE(yrt+y|0n+y)|y], (12.8)

Page 233: Time Series Analysis by State Space Methods by Durbin and Koopman

216 ANALYSIS FROM A CLASSICAL STANDPOINT

Dividing and multiplying by the importance density g(a|y) as in §11.2 gives

J g(a, y) = Lg(ir)Eg [w(a, y)L (12.9)

where Lg(\fr) = g(y) is the likelihood of the approximating linear Gaussian model that we employ to obtain the importance density g(a\y),Bg denotes expectation with respect to density g(a\y), and w(ot, y) — p(a, y)/g(a, y) as in (11.5). Indeed we observe that (12.9) is essentially equivalent to (11.4). We note the elegant feature of (12.9) that the non-Gaussian likelihood L ( f ) has been obtained as an adjustment to the linear Gaussian likelihood Lg(f), which is easily calculated by the Kalman filter; moreover, the adjustment factor Eg[w(oc, y)] is readily estimable by simulation. Obviously, the closer the importance joint density g(oc, y) is to the non-Gaussian density p(a, y), the smaller will be the simulation sample required.

For practical computations we follow the practice discussed in §11.9.2 and §12.2 of working with the signal $t = Ztat in the observation equation and the state disturbance r}t in the state equation, rather than with a t directly, since these lead to simpler computational procedures. In place of (12.9) we therefore use the form

L { f ) = Lgty)Eg[w*(ri, y)l, (12.10)

where L ( f ) and L g ( f ) are the same as in (12.9) but Eg and w*(rj, y) have the interpretations discussed in §11.9.2. We then suppress the dependence on y and write w*(t)) in place of w*(t), y) as in §12.2. We employ antithetic variables as in § 11.9.3, and analogously to (12.2) our estimate of L ( f ) is

LW)=Lg(jr)w, (12.11)

where w = (1/7V) X ^ i w,-, with wt = w*{rfl)) where rf^,... r)(hf) is the simulation sample generated by the importance density g(i]\y).

12.5.3 MAXIMISATION OF LOGLIKELIHOOD

We estimate f by the value ty of \fr that maximises L(f). In practice, it is numer-ically more stable to maximise

logL(f) = log LgW) -flog u>, (12.12)

rather than to maximise L(ijr) directly because the likelihood value can become very large. Moreover, the value of i\r that maximises log L(i/r) is the same as the value that maximises L(f ) .

To calculate ip, log L ( f ) is maximised by any convenient iterative numerical optimisation technique, as discussed, for example in §7.3.2. To ensure stability of the iterative process, it is important to use the same random numbers from the

Page 234: Time Series Analysis by State Space Methods by Durbin and Koopman

12.5. PARAMETER ESTIMATION 217

simulation smoother for each value of i(r. To start the iteration, an initial value of xfr can be obtained by maximising the approximate loglikelihood

log L ( f ) & log L g ( f ) + log 10(f)),

where fj is the mode of g(r;|y) that is determined during the process of approximating p(r]\y) by g(r)\y); alternatively, the more accurate non-simulated approximation given in expression (21) of Durbin and Koopman (1997) may be used.

12.5.4 VARIANCE MATRIX OF MAXIMUM LIKELIHOOD ESTIMATE

Assuming that appropriate regularity conditions are satisfied, the estimate of the large-sample variance matrix of \jr is given by the standard formula

2 1 W.XH-1 [ a2 log Ujr) 3 \[fd\fff (12.13)

xjr = iff

where the derivatives of log L{ijf) are calculated numerically from values of \{r in the neighbourhood of

12.5.5 EFFECT OF ERRORS IN PARAMETER ESTIMATION

In the above treatment we have performed classical analyses in the traditional way by first assuming that the parameter vector is known and then substituting the maximum likelihood estimate rjr for \jf. The errors \jr — if/ give rise to biases in the estimates of functions of the state and disturbance vectors, but since the biases are of order n~l they are usually small enough to be neglected. It may, however, be important to investigate the amount of bias in particular cases. In §7.3.7 we described techniques for estimating the bias for the case where the state space model is linear and Gaussian. Exactly the same procedure can be used for estimating the bias due to errors -\jr — iff for the non-Gaussian and nonlinear models considered in this chapter.

12.5.6 MEAN SQUARE ERROR MATRIX DUE TO SIMULATION

We have denoted the estimate of $\psi$ that is obtained from the simulation by $\hat{\psi}$; let us denote by $\tilde{\psi}$ the 'true' maximum likelihood estimate of $\psi$ that would be obtained by maximising the exact $\log L(\psi)$ without simulation, if this could be done. The error due to simulation is $\hat{\psi} - \tilde{\psi}$, so the mean square error matrix is

$$\mathrm{MSE}(\hat{\psi}) = E_g\left[(\hat{\psi} - \tilde{\psi})(\hat{\psi} - \tilde{\psi})'\right].$$

Now $\hat{\psi}$ is the solution of the equation

$$\frac{\partial \log \hat{L}(\psi)}{\partial \psi} = 0,$$


which on expansion about $\tilde{\psi}$ gives approximately

$$\frac{\partial \log \hat{L}(\tilde{\psi})}{\partial \psi} + \frac{\partial^2 \log \hat{L}(\tilde{\psi})}{\partial \psi\, \partial \psi'}\,(\hat{\psi} - \tilde{\psi}) = 0,$$

where

$$\frac{\partial \log \hat{L}(\tilde{\psi})}{\partial \psi} = \left.\frac{\partial \log \hat{L}(\psi)}{\partial \psi}\right|_{\psi = \tilde{\psi}}, \qquad \frac{\partial^2 \log \hat{L}(\tilde{\psi})}{\partial \psi\, \partial \psi'} = \left.\frac{\partial^2 \log \hat{L}(\psi)}{\partial \psi\, \partial \psi'}\right|_{\psi = \tilde{\psi}},$$

giving

$$\hat{\psi} - \tilde{\psi} = -\left[\frac{\partial^2 \log \hat{L}(\tilde{\psi})}{\partial \psi\, \partial \psi'}\right]^{-1} \frac{\partial \log \hat{L}(\tilde{\psi})}{\partial \psi}.$$

Thus to a first approximation we have

$$\mathrm{MSE}(\hat{\psi}) = \Omega\, E_g\!\left[\frac{\partial \log \hat{L}(\tilde{\psi})}{\partial \psi}\, \frac{\partial \log \hat{L}(\tilde{\psi})}{\partial \psi'}\right] \Omega, \qquad (12.14)$$

where $\Omega$ is given by (12.13). From (12.12) we have

$$\log \hat{L}(\psi) = \log L_g(\psi) + \log \bar{w},$$

so

$$\frac{\partial \log \hat{L}(\psi)}{\partial \psi} = \frac{\partial \log L_g(\psi)}{\partial \psi} + \frac{1}{\bar{w}}\frac{\partial \bar{w}}{\partial \psi}.$$

Similarly, for the true loglikelihood $\log L(\psi)$ we have

$$\frac{\partial \log L(\psi)}{\partial \psi} = \frac{\partial \log L_g(\psi)}{\partial \psi} + \frac{\partial \log \mu_w}{\partial \psi},$$

where $\mu_w = E_g(w)$. Since $\tilde{\psi}$ is the 'true' maximum likelihood estimator of $\psi$,

$$\frac{\partial \log L(\tilde{\psi})}{\partial \psi} = 0.$$

Thus

$$\frac{\partial \log \hat{L}(\tilde{\psi})}{\partial \psi} = \frac{1}{\bar{w}}\frac{\partial \bar{w}}{\partial \psi} - \frac{\partial \log \mu_w}{\partial \psi},$$

so we have

$$\frac{\partial \log \hat{L}(\tilde{\psi})}{\partial \psi} = \frac{1}{\bar{w}}\frac{\partial \bar{w}}{\partial \psi} - \frac{1}{\mu_w}\frac{\partial \mu_w}{\partial \psi},$$

with the derivatives evaluated at $\psi = \tilde{\psi}$.


It follows that, to a first approximation,

$$\frac{\partial \log \hat{L}(\tilde{\psi})}{\partial \psi} \approx \frac{1}{\mu_w}\frac{\partial}{\partial \psi}\left(\bar{w} - \mu_w\right),$$

and hence

$$E_g\!\left[\frac{\partial \log \hat{L}(\tilde{\psi})}{\partial \psi}\, \frac{\partial \log \hat{L}(\tilde{\psi})}{\partial \psi'}\right] = \frac{1}{\mu_w^2}\,\mathrm{Var}\!\left(\frac{\partial \bar{w}}{\partial \psi}\right).$$

Taking the case of two antithetics, denote the sum of the four values of $w$ obtained from the $j$th draw of the simulation smoother by $w_j^{+}$, for $j = 1, \ldots, N/4$. Then $\bar{w} = N^{-1}\sum_{j=1}^{N/4} w_j^{+}$ and, since the $w_j^{+}$ are independent across draws,

$$\mathrm{Var}\!\left(\frac{\partial \bar{w}}{\partial \psi}\right) = \frac{1}{N^2}\sum_{j=1}^{N/4} \mathrm{Var}\!\left(\frac{\partial w_j^{+}}{\partial \psi}\right).$$

Let $q^{(j)} = \partial w_j^{+}/\partial \psi$, which we calculate numerically at $\psi = \hat{\psi}$, and let $\bar{q} = (4/N)\sum_{j=1}^{N/4} q^{(j)}$. We then estimate (12.14) by

$$\widehat{\mathrm{MSE}}(\hat{\psi}) = \frac{1}{N^2 \bar{w}^2}\,\Omega\left[\sum_{j=1}^{N/4}\left(q^{(j)} - \bar{q}\right)\left(q^{(j)} - \bar{q}\right)'\right]\Omega. \qquad (12.15)$$

The square roots of the diagonal elements of (12.15) may be compared with the square roots of the diagonal elements of $\Omega$ in (12.13) to obtain relative standard errors due to simulation.
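Once the $q^{(j)}$ have been obtained by numerical differentiation, the estimate (12.15) reduces to a few lines of code. The sketch below follows the reconstruction above; its argument names are illustrative.

```python
import numpy as np

def simulation_mse(Omega, q, w_bar, N):
    # Omega : inverse negative Hessian from (12.13), shape (p, p)
    # q     : array of shape (N//4, p); row j holds q^(j), the numerical
    #         derivative of w_j^+ with respect to psi at psi_hat
    # w_bar : overall mean of the N importance weights
    q_bar = q.mean(axis=0)          # equals (4/N) * sum of the q^(j)
    d = q - q_bar
    S = d.T @ d                     # sum of outer products (q - qbar)(q - qbar)'
    return Omega @ S @ Omega / (N**2 * w_bar**2)

# Relative standard errors due to simulation:
# np.sqrt(np.diag(simulation_mse(Omega, q, w_bar, N))) / np.sqrt(np.diag(Omega))
```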

12.5.7 ESTIMATION WHEN THE STATE DISTURBANCES ARE GAUSSIAN

When the state disturbance vector is distributed as $\eta_t \sim N(0, Q_t)$, the calculations are simplified substantially. Denote the normal density of $\eta$ by $g(\eta)$ and arrange the approximating linear Gaussian model to have the same state disturbance density. Then $p(\eta, y) = g(\eta)\,p(y|\theta)$ and $g(\eta, y) = g(\eta)\,g(y|\theta)$, so for $w^*(\eta, y)$ in (12.10) we have $w^*(\eta, y) = p(\eta, y)/g(\eta, y) = p(y|\theta)/g(y|\theta)$, giving

$$w^*(\eta, y) = \frac{p(y|\theta)}{g(y|\theta)}, \qquad (12.16)$$

which is the same as (8) of Durbin and Koopman (1997). Since in many important applications $y_t$ is univariate, or at least has dimensionality significantly smaller than that of $\eta_t$, (12.16) is normally substantially easier to handle than (12.10). Other aspects of the analysis proceed as in the general case considered in earlier sections.
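Because the weight (12.16) is a ratio of observation densities only, its logarithm accumulates one term per time point. The sketch below assumes, as in the models of this chapter, that the observations are conditionally independent given the signal; log_p_obs and log_g_obs are hypothetical density functions supplied by the user for the true and approximating observation models.

```python
def log_w(theta, y, log_p_obs, log_g_obs):
    # log w*(eta, y) = log p(y|theta) - log g(y|theta), accumulated over t
    return sum(log_p_obs(y[t], theta[t]) - log_g_obs(y[t], theta[t])
               for t in range(len(y)))
```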

12.5.8 CONTROL VARIABLES

Another traditional device for improving the efficiency of simulations is to use control variables. A control variable is a variable whose mean is exactly known and for which an estimate can be calculated from the simulation sample. The idea is then


to use the difference between the sample estimate and the true mean to construct an adjustment to the initial estimate of a quantity of interest which enhances efficiency. Although we do not make use of control variables in the illustrations considered in Chapter 14, we believe it is worth including a brief treatment of them here in order to stimulate readers to consider their potential for other problems. We shall base our discussion on the use we made of control variables for the estimation of the likelihood function in Durbin and Koopman (1997) for the case where observations have non-Gaussian density $p(y_t|\theta_t)$, where $\theta_t = Z_t\alpha_t$ and where the state equation has the linear Gaussian form $\alpha_{t+1} = T_t\alpha_t + R_t\eta_t$, $\eta_t \sim N(0, Q_t)$, for $t = 1, \ldots, n$. For simplicity, we assume that $y_t$ is univariate and that the distribution of $\alpha_1$ is known.

In (12.16) let

$$w(\theta) = \frac{p(y|\theta)}{g(y|\theta)} = \prod_{t=1}^{n} w_t(\theta_t), \qquad w_t(\theta_t) = \frac{p(y_t|\theta_t)}{g(y_t|\theta_t)}, \qquad (12.17)$$

and put

$$w_t = w_t(\theta_t), \qquad l_t = \log w_t, \qquad l_t' = \frac{\partial l_t}{\partial \theta_t}, \qquad l_t'' = \frac{\partial^2 l_t}{\partial \theta_t^2},$$

with the higher derivatives $l_t'''$ and $l_t''''$ defined analogously. We start with the idea that a way to obtain a control variable for $w(\theta)$ is to expand it as a Taylor series and then take the difference between exact and simulation means of the first few terms. We therefore expand $w(\theta)$ about $\hat{\theta} = E_g(\theta|y)$. We have

$$w_t' = w_t l_t', \qquad w_t'' = w_t\left[l_t'' + (l_t')^2\right], \qquad w_t''' = w_t\left[l_t''' + 3 l_t'' l_t' + (l_t')^3\right],$$
$$w_t'''' = w_t\left[l_t'''' + 4 l_t''' l_t' + 3 (l_t'')^2 + 6 l_t'' (l_t')^2 + (l_t')^4\right].$$

Since the state equation is linear and Gaussian it follows from §11.4 that $l_t' = 0$ and $l_t'' = 0$ at $\theta = \hat{\theta}$. Denote the values of $w$, $l_t'''$ and $l_t''''$ at $\theta = \hat{\theta}$ by $\tilde{w}$, $\tilde{l}_t'''$ and $\tilde{l}_t''''$. The required Taylor series as far as the term in $(\theta_t - \hat{\theta}_t)^4$ is therefore

$$w(\theta) = \tilde{w}\left[1 + \frac{1}{6}\sum_{t=1}^{n} \tilde{l}_t'''\left(\theta_t - \hat{\theta}_t\right)^3 + \frac{1}{24}\sum_{t=1}^{n} \tilde{l}_t''''\left(\theta_t - \hat{\theta}_t\right)^4 + \cdots\right].$$

Now draw a sample of $N$ values $\varepsilon_t^{(i)}$ of $\varepsilon_t$ given $y$, and hence of $\theta_t^{(i)} = y_t - \varepsilon_t^{(i)}$, for the approximating linear Gaussian model using the simulation smoother. Take as the control variable $\tilde{w}\bar{f}$, where $\bar{f} = N^{-1}\sum_{i=1}^{N} f_i$ with

$$f_i = \frac{1}{6}\sum_{t=1}^{n} \tilde{l}_t'''\left(\theta_t^{(i)} - \hat{\theta}_t\right)^3 + \frac{1}{24}\sum_{t=1}^{n} \tilde{l}_t''''\left(\theta_t^{(i)} - \hat{\theta}_t\right)^4,$$

and assume that the location balanced antithetic of §11.9.3 is used. The effect of this is that $\sum_{i=1}^{N}\left(\theta_t^{(i)} - \hat{\theta}_t\right)^3 = 0$, so we can neglect the first term in the expression for $f_i$. The control variable therefore reduces to $\tilde{w}\bar{f}$, where $\bar{f} = N^{-1}\sum_{i=1}^{N} f_i$ with

$$f_i = \frac{1}{24}\sum_{t=1}^{n} \tilde{l}_t''''\left(\theta_t^{(i)} - \hat{\theta}_t\right)^4.$$

Let $q_t = \mathrm{Var}(\varepsilon_t|y)$, which is obtained from the disturbance smoother. Then, since $E_g[(\theta_t^{(i)} - \hat{\theta}_t)^4\,|\,y] = 3q_t^2$,

$$E_g(f_i|y) = \frac{1}{8}\sum_{t=1}^{n} \tilde{l}_t'''' q_t^2.$$

Without the use of the control variable, the estimate of (12.16) that we would have used is

$$\hat{L}(\psi) = L_g(\psi)\,\bar{w}, \qquad (12.18)$$

where $\bar{w} = N^{-1}\sum_{i=1}^{N} w_i$ with $w_i = w(\theta^{(i)})$ given by (12.17). Using the control variable, we estimate $L(\psi)$ by

$$\hat{L}^{+}(\psi) = L_g(\psi)\,\bar{w}^{+}, \qquad (12.19)$$

where $\bar{w}^{+} = N^{-1}\sum_{i=1}^{N} w_i^{+}$ with

$$w_i^{+} = w_i - \tilde{w}\left(f_i - \frac{1}{8}\sum_{t=1}^{n} \tilde{l}_t'''' q_t^2\right).$$

What we have set out to achieve by the use of the control variable is to take out a substantial amount of the variation in $\bar{w}$ from expression (12.18).
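The adjustment in (12.19) is a one-line correction to the raw weights. In the hedged sketch below all inputs (the raw weights, the control-variable values $f_i$, $\tilde{w}$ and the exact mean of $f_i$) are assumed to have been computed as described above.

```python
import numpy as np

def control_variate_weights(w, f, w_tilde, mu_f):
    # w       : array of raw importance weights w_i from (12.17)
    # f       : array of control-variable values f_i (fourth-order term)
    # w_tilde : w evaluated at the conditional mean theta-hat
    # mu_f    : exact mean of f_i under g, e.g. (1/8) * sum_t l4_t * q_t**2
    return w - w_tilde * (f - mu_f)

# Likelihood estimate with the control variable, as in (12.19):
# L_plus = L_g * control_variate_weights(w, f, w_tilde, mu_f).mean()
```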

We applied the technique in Durbin and Koopman (1997) to two illustrations. Our overall conclusion was stated in the following terms: 'The location and scale balancing variables (by this we meant the antithetics) together are so efficient that the extra variance reduction provided by the new control variable is small. Nevertheless, the control variable is so cheap computationally relative to the cost of extra simulation samples that it is worthwhile using them in practical applications.' Since we wrote these words our views have shifted somewhat due to the rapid reduction in computing costs. We therefore decided not to highlight control variables in our presentation of simulation methods in Chapter 11. Nevertheless, as we stated above, we believe it is worthwhile including this brief presentation here in the belief that control variables might prove to be of value in specific time series applications.


13 Analysis from a Bayesian standpoint

13.1 Introduction

In this chapter we discuss the analysis of non-Gaussian and nonlinear state space models from the standpoint of Bayesian inference. As we made clear in the introduction to Chapter 8, which deals with the Bayesian approach to the analysis of the linear Gaussian model, we regard both the classical and Bayesian approaches as providing valid modes of inference in appropriate circumstances.

In the next section we develop Bayesian techniques for estimating posterior means and posterior variance matrices of functions of the state vector. We also show how to estimate posterior distribution and density functions of scalar functions of the state vector. Remarkably, it turns out that the basic ideas of importance sampling and antithetic variables developed for classical analysis in Chapter 11 can be applied with little essential change to the Bayesian case. Different considerations apply to questions regarding the posterior distribution of the parameter vector and we deal with these in §13.4. The treatment is based on the methods developed by Durbin and Koopman (2000). We have found these methods to be transparent and computationally efficient.

Previous work on Bayesian analysis of non-Gaussian state space models has been based almost entirely on Markov chain Monte Carlo (MCMC) methods: we note in particular here the contributions by Carlin et al. (1992), Shephard (1994), Carter and Kohn (1994), Carter and Kohn (1996), Carter and Kohn (1997), Shephard and Pitt (1997), Cargnoni, Muller and West (1997) and Gamerman (1998). General accounts of Bayesian methodology and computation are given by Gelman et al. (1995) and Bernardo and Smith (1994). In §13.5 we give a brief overview of MCMC methods as applied to state space models. References to software will also be given so that workers who wish to compare the methods based on importance sampling with the MCMC approach will be able to do so.

13.2 Posterior analysis of functions of the state vector

In the Bayesian approach the parameter vector $\psi$ is treated as random with a prior density $p(\psi)$, which to begin with we shall take to be a proper prior. We first


obtain some basic formulae analogous to those derived in §11.2 for the classical case. Suppose that we wish to calculate the posterior mean

$$\bar{x} = E[x(\alpha)|y],$$

of a function of the stacked state vector $\alpha$ given the stacked observation vector $y$. As we shall show, this is a general formulation which enables us not only to estimate posterior means of quantities of interest such as the trend or seasonal, but also posterior variance matrices and posterior distribution functions and densities of scalar functions of the state. We shall estimate $\bar{x}$ by simulation techniques based on importance sampling and antithetic variables analogous to those developed in Chapter 11 for the classical case.

We have

$$\bar{x} = \int x(\alpha)\, p(\psi, \alpha|y)\, d\psi\, d\alpha = \int x(\alpha)\, p(\psi|y)\, p(\alpha|\psi, y)\, d\psi\, d\alpha. \qquad (13.1)$$

As an importance density for $p(\psi|y)$ we take its large sample normal approximation

$$g(\psi|y) = N(\hat{\psi}, V),$$

where $\hat{\psi}$ is the solution of the equation

$$\frac{\partial \log p(\psi|y)}{\partial \psi} = \frac{\partial \log p(\psi)}{\partial \psi} + \frac{\partial \log p(y|\psi)}{\partial \psi} = 0, \qquad (13.2)$$

and

$$V = -\left[\frac{\partial^2 \log p(\psi)}{\partial \psi\, \partial \psi'} + \frac{\partial^2 \log p(y|\psi)}{\partial \psi\, \partial \psi'}\right]^{-1}_{\psi = \hat{\psi}}. \qquad (13.3)$$

For a discussion of this large sample approximation to $p(\psi|y)$ see Gelman et al. (1995, Chapter 4) and Bernardo and Smith (1994, §5.3).

Let $g(\alpha|\psi, y)$ be a Gaussian importance density for $\alpha$ given $\psi$ and $y$ which is obtained from an approximating linear Gaussian model in the way described in Chapter 11. From (13.1),

$$\bar{x} = \int x(\alpha)\, \frac{p(\psi|y)\, p(\alpha|\psi, y)}{g(\psi|y)\, g(\alpha|\psi, y)}\, g(\psi|y)\, g(\alpha|\psi, y)\, d\psi\, d\alpha.$$

By Bayes theorem,

$$p(\psi|y) = K\, p(\psi)\, p(y|\psi),$$


in which K is a normalising constant, so we have

$$\bar{x} = K \int x(\alpha)\, \frac{p(\psi)\, g(y|\psi)}{g(\psi|y)} \cdot \frac{p(\alpha, y|\psi)}{g(\alpha, y|\psi)}\, g(\psi, \alpha|y)\, d\psi\, d\alpha = K\, E_g[x(\alpha)\, z(\psi, \alpha, y)], \qquad (13.4)$$

where $E_g$ denotes expectation with respect to the joint importance density

$$g(\psi, \alpha|y) = g(\psi|y)\, g(\alpha|\psi, y),$$

and where

$$z(\psi, \alpha, y) = \frac{p(\psi)\, g(y|\psi)}{g(\psi|y)} \cdot \frac{p(\alpha, y|\psi)}{g(\alpha, y|\psi)}. \qquad (13.5)$$

In this formula, $g(y|\psi)$ is the likelihood for the approximating Gaussian model, which is easily calculated by the Kalman filter.

Taking $x(\alpha) = 1$ in (13.4) gives

$$K^{-1} = E_g[z(\psi, \alpha, y)],$$

so we have finally

$$\bar{x} = \frac{E_g[x(\alpha)\, z(\psi, \alpha, y)]}{E_g[z(\psi, \alpha, y)]}. \qquad (13.6)$$

We note that (13.6) differs from the corresponding formula (11.5) in the classical inference case only in the replacement of $w(\alpha, y)$ by $z(\psi, \alpha, y)$ and the inclusion of $\psi$ in the importance density $g(\psi, \alpha|y)$.

In the important special case in which the state equation error $\eta_t$ is $N(0, Q_t)$, then $\alpha$ is Gaussian so we can write its density as $g(\alpha)$ and use this as the state density for the approximating model. This gives $p(\alpha, y|\psi) = g(\alpha)\, p(y|\theta, \psi)$ and $g(\alpha, y|\psi) = g(\alpha)\, g(y|\theta, \psi)$, where $\theta$ is the stacked vector of signals $\theta_t = Z_t\alpha_t$, so (13.5) simplifies to

$$z(\psi, \alpha, y) = \frac{p(\psi)\, g(y|\psi)}{g(\psi|y)} \cdot \frac{p(y|\theta, \psi)}{g(y|\theta, \psi)}. \qquad (13.7)$$

For cases where a proper prior is not available, we may wish to use a non-informative prior in which we assume that the prior density is proportional to a specified function $p(\psi)$ in a domain of $\psi$ of interest even though the integral $\int p(\psi)\, d\psi$ does not exist. The posterior density, where it exists, is

$$p(\psi|y) = K\, p(\psi)\, p(y|\psi),$$

which is the same as in the proper prior case, so all the previous formulae apply without change. This is why we can use the same symbol $p(\psi)$ in both cases even when $p(\psi)$ is not a proper density. An important special case is the diffuse prior


for which $p(\psi) = 1$ for all $\psi$. For a general discussion of non-informative priors, see, for example, Gelman et al. (1995, Chapters 2 and 3).

13.3 Computational aspects of Bayesian analysis

For practical computations based on these ideas we express the formulae in terms of variables that are as simple as possible, as in §§11.9, 12.3 and 12.4 for the classical analysis. This means that to the maximum feasible extent we employ formulae based on the disturbance terms $\eta_t = R_t'(\alpha_{t+1} - T_t\alpha_t)$ and $\varepsilon_t = y_t - \theta_t$ for $t = 1, \ldots, n$. By repeated substitution for $\alpha_t$ we first obtain $x(\alpha)$ as a function $x^*(\eta)$ of $\eta$. We then note that in place of (13.1) we obtain the posterior mean

$$\bar{x} = \int x^*(\eta)\, p(\psi|y)\, p(\eta|\psi, y)\, d\psi\, d\eta. \qquad (13.8)$$

By reductions analogous to those above we obtain in place of (13.6)

$$\bar{x} = \frac{E_g[x^*(\eta)\, z^*(\psi, \eta, y)]}{E_g[z^*(\psi, \eta, y)]}, \qquad (13.9)$$

where

$$z^*(\psi, \eta, y) = \frac{p(\psi)\, g(y|\psi)}{g(\psi|y)} \cdot \frac{p(\eta, y|\psi)}{g(\eta, y|\psi)}, \qquad (13.10)$$

and $E_g$ denotes expectation with respect to the importance density $g(\psi, \eta|y)$. Let $\psi^{(i)}$ be a random draw from the importance density for $\psi$, $g(\psi|y) = N(\hat{\psi}, V)$, where $\hat{\psi}$ satisfies (13.2) and $V$ is given by (13.3), and let $\eta^{(i)}$ be a random draw from the density $g(\eta|\psi^{(i)}, y)$, for $i = 1, \ldots, N$. To obtain this we need an approximation to the mode of the density $g(\eta|\psi^{(i)}, y)$, but this is rapidly obtained in a few iterations starting from the mode of $g(\eta|\hat{\psi}, y)$. Let

$$x_i = x^*(\eta^{(i)}), \qquad z_i = z^*(\psi^{(i)}, \eta^{(i)}, y), \qquad (13.11)$$

and consider as an estimate of $\bar{x}$ the ratio

$$\hat{x} = \frac{\sum_{i=1}^{N} x_i z_i}{\sum_{i=1}^{N} z_i}. \qquad (13.12)$$

The efficiency of this estimate can obviously be improved by the use of antithetic variables. For $\eta^{(i)}$ we can use the location and scale antithetics described in §11.9.3. Antithetics may not be needed for $\psi^{(i)}$ since $V = O(n^{-1})$, but it is straightforward to allow for them if their use is worthwhile; for example, it would be an easy matter to employ the location antithetic $2\hat{\psi} - \psi^{(i)}$.

There is flexibility in the way the pairs $(\psi^{(i)}, \eta^{(i)})$ are chosen, depending on the number of antithetics employed and the way the values of $\psi$ and $\eta$ are combined. For example, one could begin by making a random selection $\psi_s$ of $\psi$ from $N(\hat{\psi}, V)$. Next we compute the antithetic value $\psi_s^- = 2\hat{\psi} - \psi_s$. For each of the values $\psi_s$ and $\psi_s^-$


one could draw separate values of $\eta$ from $g(\eta|\psi, y)$, and then employ the two antithetics for each $\eta$ that are described in §11.9.3. Thus in the sample there are four values of $\eta$ combined with each value of $\psi$, so $N$ is a multiple of four and the number of draws of $\eta$ from the simulation smoother is $N/4$. For estimation of variances due to simulation we need however to note that, since $\psi_s$ and $\psi_s^-$ are related, there are only $N/8$ independent draws from the joint importance density $g(\psi, \eta|y)$.

For the purpose of estimating posterior variances of scalar quantities, assume that $x^*(\eta)$ is a scalar. Then, as in (12.3), the estimate of its posterior variance is

$$\widehat{\mathrm{Var}}[x^*(\eta)|y] = \frac{\sum_{i=1}^{N} x_i^2 z_i}{\sum_{i=1}^{N} z_i} - \hat{x}^2. \qquad (13.13)$$

Let us now consider the estimation of the variance due to simulation of the estimate $\hat{x}$ of the posterior mean of scalar $x^*(\eta)$. As indicated above, the details depend on the way values of $\psi$ and $\eta$ are combined. For the example we considered, with a single antithetic for $\psi$ and two antithetics for $\eta$, combined in the way described, let $v_j$ be the sum of the eight associated values of $z_i(x_i - \hat{x})$. Then as in (12.5), the estimate of the variance of $\hat{x}$ due to errors of simulation is

$$\widehat{\mathrm{Var}}_s(\hat{x}) = \frac{\sum_{j=1}^{N/8} v_j^2}{\left(\sum_{i=1}^{N} z_i\right)^2}. \qquad (13.14)$$

(EiU) For the estimation of posterior distribution functions and densities of scalar

x*(ri), let lx(rj) be an indicator which is unity if x*(rj) < x and is zero if x*(rj) > x. Then the posterior distribution function is estimated by (12.6) provided that Wi is replaced by Zi- With the same proviso, the posterior density of x*(if) is estimated by (12.7). Samples of independent values from the estimated posterior distribution can be obtained by a method analogous to that described by a method at the end of§12.3.
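The ratio estimates (13.12) and (13.13), together with the indicator-based estimate of the distribution function, can be computed in a few lines from the simulation output, as in this minimal sketch (the argument names are illustrative):

```python
import numpy as np

def posterior_summaries(x, z, x0=None):
    # x : array of x_i = x*(eta^(i)), function values at the draws
    # z : array of z_i = z*(psi^(i), eta^(i), y), importance weights
    z_sum = z.sum()
    mean = (x * z).sum() / z_sum                 # (13.12)
    var = (x**2 * z).sum() / z_sum - mean**2     # (13.13)
    out = {"mean": mean, "var": var}
    if x0 is not None:
        # estimated posterior distribution function at x0, using the
        # indicator I_x in place of x in the same ratio estimate
        out["cdf_at_x0"] = ((x <= x0) * z).sum() / z_sum
    return out
```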

13.4 Posterior analysis of parameter vector

In this section we consider the estimation of posterior means, variances, distribution functions and densities of functions of the parameter vector $\psi$. Denote by $v(\psi)$ the function of $\psi$ whose posterior properties we wish to investigate. Using Bayes theorem, the posterior mean of $v(\psi)$ is

$$\bar{v} = E[v(\psi)|y] = \int v(\psi)\, p(\psi|y)\, d\psi = K \int v(\psi)\, p(\psi)\, p(y|\psi)\, d\psi = K \int v(\psi)\, p(\psi)\, p(\eta, y|\psi)\, d\psi\, d\eta, \qquad (13.15)$$


where $K$ is a normalising constant. Introducing importance densities $g(\psi|y)$ and $g(\eta|\psi, y)$ as in §13.3, we have

$$\bar{v} = K \int v(\psi)\, z^*(\psi, \eta, y)\, g(\psi, \eta|y)\, d\psi\, d\eta = K\, E_g[v(\psi)\, z^*(\psi, \eta, y)], \qquad (13.16)$$

where $E_g$ denotes expectation with respect to the joint importance density $g(\psi, \eta|y)$ and

$$z^*(\psi, \eta, y) = \frac{p(\psi)\, g(y|\psi)}{g(\psi|y)} \cdot \frac{p(\eta, y|\psi)}{g(\eta, y|\psi)}.$$

Putting $v(\psi) = 1$ in (13.16) we obtain, as in (13.9),

$$\bar{v} = \frac{E_g[v(\psi)\, z^*(\psi, \eta, y)]}{E_g[z^*(\psi, \eta, y)]}. \qquad (13.17)$$

In the simulation, take $\psi^{(i)}$ and $\eta^{(i)}$ as in §13.3 and let $v_i = v(\psi^{(i)})$. Then the estimates $\hat{v}$ of $\bar{v}$ and $\widehat{\mathrm{Var}}[v(\psi)|y]$ of $\mathrm{Var}[v(\psi)|y]$ are given by (13.12) and (13.13) on replacing $x_i$ by $v_i$. Similarly, the variance of $\hat{v}$ due to simulation can, for the antithetics considered in §13.3, be calculated by defining $v_j^{+}$ as the sum of the eight associated values of $z_i(v_i - \hat{v})$ and using (13.14) to obtain the estimate $\widehat{\mathrm{Var}}_s(\hat{v})$. Estimates of the posterior distribution and density functions are obtained by the indicator function techniques described at the end of §13.3. While $\hat{v}$ can be a vector, for the remaining estimates $v(\psi)$ has to be a scalar quantity.

The estimate of the posterior density $p[v(\psi)|y]$ obtained in this way is essentially a histogram estimate, which is accurate at values of $v(\psi)$ near the midpoints of the intervals containing them. An alternative estimate of the posterior density of a particular element of $\psi$, which is accurate at any value of the element, was proposed by Durbin and Koopman (2000). Without loss of generality take this element to be the first element of $\psi$ and denote it by $\psi_1$; denote the remaining elements by $\psi_2$. Let $g(\psi_2|\psi_1, y)$ be the approximate conditional density of $\psi_2$ given $\psi_1$ and $y$, which is easily obtained by applying standard regression theory to $g(\psi|y) = N(\hat{\psi}, V)$. We take $g(\psi_2|\psi_1, y)$ as an importance density in place of $g(\psi|y)$. Then

$$p(\psi_1|y) = \int p(\psi|y)\, d\psi_2 = K \int p(\psi)\, p(y|\psi)\, d\psi_2 = K \int p(\psi)\, p(\eta, y|\psi)\, d\psi_2\, d\eta = K\, E_{\tilde{g}}[\tilde{z}(\psi, \eta, y)], \qquad (13.18)$$

where $E_{\tilde{g}}$ denotes expectation with respect to the importance density $g(\psi_2|\psi_1, y)\, g(\eta|\psi, y)$ and

$$\tilde{z}(\psi, \eta, y) = \frac{p(\psi)\, g(y|\psi)}{g(\psi_2|\psi_1, y)} \cdot \frac{p(\eta, y|\psi)}{g(\eta, y|\psi)}. \qquad (13.19)$$

Let $\psi_2^{(i)}$ be a draw from $g(\psi_2|\psi_1, y)$, let $\psi^{(i)} = (\psi_1, \psi_2^{(i)\prime})'$ and let $\eta^{(i)}$ be a draw from $g(\eta|\psi^{(i)}, y)$. Then take

$$\tilde{z}_i = \frac{p(\psi^{(i)})\, g(y|\psi^{(i)})}{g(\psi_2^{(i)}|\psi_1, y)} \cdot \frac{p(\eta^{(i)}, y|\psi^{(i)})}{g(\eta^{(i)}, y|\psi^{(i)})}. \qquad (13.20)$$

Now, as in (13.17),

$$K^{-1} = E_g[z^*(\psi, \eta, y)],$$

where $E_g$ denotes expectation with respect to the importance density $g(\psi|y)\, g(\eta|\psi, y)$ and

$$z^*(\psi, \eta, y) = \frac{p(\psi)\, g(y|\psi)}{g(\psi|y)} \cdot \frac{p(\eta, y|\psi)}{g(\eta, y|\psi)}.$$

Let $\psi_i^*$ be a draw from $g(\psi|y)$ and let $\eta_i^*$ be a draw from $g(\eta|\psi_i^*, y)$. Then take

$$z_i^* = \frac{p(\psi_i^*)\, g(y|\psi_i^*)}{g(\psi_i^*|y)} \cdot \frac{p(\eta_i^*, y|\psi_i^*)}{g(\eta_i^*, y|\psi_i^*)}, \qquad (13.21)$$

and estimate $p(\psi_1|y)$ by the simple form

$$\hat{p}(\psi_1|y) = \frac{\sum_{i=1}^{N} \tilde{z}_i}{\sum_{i=1}^{N} z_i^*}. \qquad (13.22)$$

The simulations for the numerator and denominator of (13.22) are different

since for the numerator only $\psi_2$ is drawn, whereas for the denominator the whole vector $\psi$ is drawn. The variability of the ratio can be reduced, however, by employing the same set of $N(0, 1)$ deviates for choosing $\eta^{(i)}$ from $g(\eta|\psi^{(i)}, y)$ in the simulation smoother as for choosing $\eta_i^*$ from $g(\eta|\psi_i^*, y)$. The variability can be reduced further by first selecting $\psi_{1i}^*$ from $g(\psi_1|y)$ and then using the same set of $N(0, 1)$ deviates to select $\psi_{2i}^*$ from $g(\psi_2|\psi_{1i}^*, y)$ as were used to select $\psi_2^{(i)}$ from $g(\psi_2|\psi_1, y)$ when computing $\tilde{z}_i$; in this case $g(\psi_i^*|y)$ in (13.21) is replaced by $g(\psi_{1i}^*|y)\, g(\psi_{2i}^*|\psi_{1i}^*, y)$.

To improve efficiency, antithetics may be used for draws of $\psi$ and $\eta$ in the way suggested in §13.3.

13.5 Markov chain Monte Carlo methods

A substantial number of publications have appeared on Markov chain Monte Carlo (MCMC) methods for non-Gaussian and nonlinear state space models in the statistical, econometric and engineering literatures. It is beyond the scope of this book to give a treatment of MCMC methods for these models. An introductory


and accessible book on MCMC methods has been written by Gamerman (1997). Other contributions in the development of MCMC methods for non-Gaussian and nonlinear state space models are referred to in §13.1. In addition, MCMC methods have been developed for the stochastic volatility model as introduced in §10.6.1; we mention in this connection the contributions of Shephard (1993) and Jacquier, Polson and Rossi (1994) which more recently have been improved by Kim, Shephard and Chib (1998) and Shephard and Pitt (1997).

In the next chapter we will present numerical illustrations of the methods of Part II for which we use the package SsfNong.ox developed by Koopman, Shephard and Doornik (1998). This is a collection of Ox functions which use SsfPack, as discussed in §6.6, and it provides the tools for the computational implementation of the methods of Part II. However, SsfNong.ox can also be used to do MCMC computations and it provides a basis for comparing the MCMC methods with the simulation techniques developed in Chapters 11 to 13 on specific examples. We reiterate our view that, at least for practitioners who are not simulation experts, the methods presented in the earlier sections of this chapter are more transparent and computationally more convenient than MCMC for the type of applications that we are concerned with in this book.


14 Non-Gaussian and nonlinear illustrations

14.1 Introduction

In this chapter we illustrate the methodology of Part II by applying it to four real data sets. In the first example we examine the effects of seat belt legislation on deaths of van drivers due to road accidents in Great Britain modelled by a Poisson distribution. In the second we consider the usefulness of the $t$-distribution for modelling observation errors in a gas consumption series containing outliers. The third example fits a stochastic volatility model to a series of pound/dollar exchange rates. In the fourth example we fit a binary model to the results of the Oxford-Cambridge boat race over a long period with many missing observations and we forecast the probability that Cambridge will win in 2001. We discuss the software needed to perform these and similar calculations for non-Gaussian and nonlinear models in §14.6.

14.2 Poisson density: van drivers killed in Great Britain

The assessment for the Department of Transport of the effects of seat belt legislation on road traffic accidents in Great Britain, described by Harvey and Durbin (1986) and also discussed in §9.2, was based on linear Gaussian methods as described in Part I. One series that was excluded from this study was the monthly numbers of light goods vehicle (van) drivers killed in road accidents from 1969 to 1984. The numbers of deaths of van drivers were too small to justify the use of the linear Gaussian model. A better model for the data is based on the Poisson distribution with mean $\exp(\theta_t)$ and density

$$p(y_t|\theta_t) = \exp\{\theta_t y_t - \exp(\theta_t) - \log y_t!\}, \qquad t = 1, \ldots, n, \qquad (14.1)$$

as discussed in §10.3.1. We model $\theta_t$ by the relation

$$\theta_t = \mu_t + \lambda x_t + \gamma_t,$$

where the trend $\mu_t$ is the random walk

$$\mu_{t+1} = \mu_t + \eta_t, \qquad \eta_t \sim N(0, \sigma_\eta^2), \qquad (14.2)$$


$\lambda$ is the intervention parameter which measures the effects of the seat belt law, $x_t$ is an indicator variable for the post-legislation period and the monthly seasonal $\gamma_t$ is generated by

$$\sum_{j=0}^{11} \gamma_{t+1-j} = \omega_t, \qquad \omega_t \sim N(0, \sigma_\omega^2). \qquad (14.3)$$

The disturbances $\eta_t$ and $\omega_t$ are mutually independent Gaussian white noise terms with variances $\sigma_\eta^2 = \exp(\psi_1)$ and $\sigma_\omega^2 = \exp(\psi_2)$, respectively. The parameter estimates are reported by Durbin and Koopman (1997) as $\hat{\sigma}_\eta = \exp(-3.708) = 0.0245$ and $\hat{\sigma}_\omega = 0$. The fact that $\hat{\sigma}_\omega = 0$ implies that the seasonal is constant over time.

For the Poisson model we have $b_t(\theta_t) = \exp(\theta_t)$. As in §11.4.1 we have $\dot{b}_t = \ddot{b}_t = \exp(\theta_t)$, so we take

$$H_t = \exp(-\tilde{\theta}_t), \qquad \tilde{y}_t = \tilde{\theta}_t + H_t y_t - 1,$$

where $\tilde{\theta}_t$ is some trial value of $\theta_t$, for $t = 1, \ldots, n$. The iterative process for determining the approximating model as described in §11.4 converges quickly; usually, between three and five iterations are needed for the Poisson model. For a classical analysis, the conditional mean of $\mu_t + \lambda x_t$ for fixed $\psi = \hat{\psi}$ is computed and exponentiated values of this mean are plotted together with the raw data in Figure 14.1. The posterior mean from a Bayesian perspective with a diffuse prior was also calculated and its exponentiated values are also plotted in Figure 14.1. The difference between the graphs is virtually imperceptible. Conditional and posterior standard deviations of $\mu_t + \lambda x_t$ are plotted in Figure 14.2. The posterior standard deviations are about 12% larger than the conditional standard deviations; this is due to the fact that in the Bayesian analysis $\psi$ is random. The ratios of simulation standard deviations to actual standard deviations never exceeded the 9% level before the break and never exceeded the 1% level after the break. The ratios for a Bayesian analysis are slightly greater at 10% and 8%, respectively.
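A minimal sketch of one linearisation step for the Poisson model, using the expressions for $H_t$ and $\tilde{y}_t$ above; kalman_smooth_signal is a hypothetical routine returning the smoothed signal of the linear Gaussian model, and the starting value for theta is an arbitrary illustrative choice.

```python
import numpy as np

def poisson_approx_step(y, theta):
    # H_t = exp(-theta_t);  y~_t = theta_t + H_t * y_t - 1
    H = np.exp(-theta)
    y_tilde = theta + H * y - 1.0
    return y_tilde, H

# A possible driver (kalman_smooth_signal is hypothetical):
# theta = np.log(np.maximum(y, 0.5))       # arbitrary starting value
# for _ in range(5):                       # three to five iterations suffice
#     y_tilde, H = poisson_approx_step(y, theta)
#     theta = kalman_smooth_signal(y_tilde, H)
```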

The main objective of the analysis is the estimation of the effect of the seat belt law on the number of deaths. Here, this is measured by $\lambda$ which in the Bayesian analysis has a posterior mean of $-0.280$; this corresponds to a reduction in the number of deaths of 24.4%. The posterior standard deviation is 0.126 and the standard error due to simulation is 0.0040. The corresponding values for the classical analysis are $-0.278$, 0.114 and 0.0036, which are not very different. It is clear that the value of $\lambda$ is significant, as is obvious visually from Figure 14.1. The posterior distribution of $\lambda$ is presented in Figure 14.3 in the form of a histogram. This is based on the estimate of the posterior distribution function calculated as indicated in §13.3. The distribution is slightly skewed to the right. All the above calculations were based on a sample of 500 generated draws from the simulation smoother with four antithetics per draw. The reported results show that this relatively small number of samples is adequate for this particular example.


Fig. 14.1. (i) Numbers of van drivers killed and estimated level including intervention; (ii) Seasonally adjusted numbers of van drivers killed and estimated level including intervention.


Fig. 14.2. Standard errors for level including intervention.



Fig. 14.3. Posterior distribution of intervention effect.

What we learn from this exercise so far as the underlying real investigation is concerned is that up to the point where the law was introduced there was a slow regular decline in the number of deaths coupled with a constant multiplicative seasonal pattern, while at that point there was an abrupt drop in the trend of around 25%; afterwards, the trend appeared to flatten out, with the seasonal pattern remaining the same. From a methodological point of view we learn that our simulation and estimation procedures work straightforwardly and efficiently. We find that the results of the conditional analysis from a classical perspective and the posterior analysis from a Bayesian perspective are very similar apart from the posterior densities of the parameters. So far as computing time is concerned, the calculation of trend and variance of trend for $t = 1, \ldots, n$ took 78 seconds on a Pentium II computer for the classical analysis and 216 seconds for the Bayesian analysis. While the Bayesian time is greater, the time required is still small.

14.3 Heavy-tailed density: outlier in gas consumption in UK

In this example we analyse the logged quarterly demand for gas in the UK from 1960 to 1986, which is a series from the standard data set provided by Koopman et al. (2000). We use a structural time series model of the basic form as discussed in §3.2.1:

$$y_t = \mu_t + \gamma_t + \varepsilon_t, \qquad (14.4)$$

where $\mu_t$ is the local linear trend, $\gamma_t$ is the seasonal and $\varepsilon_t$ is the observation disturbance. The purpose of the investigation underlying the analysis is to study


the seasonal pattern in the data with a view to seasonally adjusting the series. It is known that for most of the series the seasonal component changes smoothly over time, but it is also known that there was a disruption in the gas supply in the third and fourth quarters of 1970 which has led to a distortion in the seasonal pattern when a standard analysis based on a Gaussian density for $\varepsilon_t$ is employed. The question under investigation is whether the use of a heavy-tailed density for $\varepsilon_t$ would improve the estimation of the seasonal in 1970.

To model $\varepsilon_t$ we use the $t$-distribution as in §10.4.1 with logdensity

$$\log p(\varepsilon_t) = \log a(\nu) + \tfrac{1}{2}\log\lambda - \frac{\nu + 1}{2}\log\left(1 + \lambda\varepsilon_t^2\right), \qquad (14.5)$$

where

$$a(\nu) = \frac{\Gamma\!\left(\frac{\nu + 1}{2}\right)}{\sqrt{\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)}, \qquad \lambda^{-1} = (\nu - 2)\,\sigma_\varepsilon^2, \qquad \nu > 2, \qquad t = 1, \ldots, n.$$

The mean of $\varepsilon_t$ is zero and the variance is $\sigma_\varepsilon^2$ for any $\nu$ degrees of freedom, which need not be an integer. The approximating model is easily obtained by the method of §11.5 with

$$h_t^*\left(\varepsilon_t^2\right) = \text{constant} + \frac{\nu + 1}{2}\log\left(1 + \lambda\varepsilon_t^2\right), \qquad H_t = \frac{\tilde{\varepsilon}_t^2 + (\nu - 2)\sigma_\varepsilon^2}{\nu + 1},$$

where $\tilde{\varepsilon}_t$ is a trial value of $\varepsilon_t$.

The iterative scheme is started with $H_t = \sigma_\varepsilon^2$, for $t = 1, \ldots, n$. The number of iterations required for a reasonable level of convergence using the $t$-distribution is usually higher than for densities from the exponential family; for this example we required around ten iterations. In the classical analysis, the parameters of the model, including the degrees of freedom $\nu$, were estimated by Monte Carlo maximum likelihood as described in §12.5.3; the estimated value for $\nu$ was 12.8.
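Under the reconstruction of $H_t$ given above, the update of the approximating variances for the $t$-model is a single vectorised expression; this is a hedged sketch, not the book's own code.

```python
import numpy as np

def t_model_variances(eps, sigma2_eps, nu):
    # H_t = (eps_t^2 + (nu - 2) * sigma_eps^2) / (nu + 1), evaluated at
    # the current trial residuals eps_t; the iteration starts from
    # H_t = sigma_eps^2 and around ten iterations were needed here
    return (eps**2 + (nu - 2.0) * sigma2_eps) / (nu + 1.0)
```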

We now compare the estimated seasonal and irregular components based on the Gaussian model and the model with a $t$-distribution for $\varepsilon_t$. Figures 14.4 and 14.5 give the graphs of the estimated seasonal and irregular for the Gaussian model and the $t$-model. The most striking feature of these graphs is the greater effectiveness with which the $t$-model picks up and corrects for the outlier relative to the Gaussian model. We observe that in the graph of the seasonal the difference between the classical and Bayesian analyses is imperceptible. Differences are visible in the graphs of the residuals, but they are not large since the residuals themselves are small. The $t$-model estimates are based on 250 simulation samples from the simulation smoother with four antithetics for each sample. The number of simulation samples is sufficient because the ratio of the variance due to simulation to the variance never exceeds 2% for all estimated components in the state vector except at the beginning and end of the series where it never exceeds 4%.


Fig. 14.4. Analyses of gas data based on Gaussian model: Gaussian conditional and posterior seasonal (top); Gaussian conditional and posterior irregular (bottom).

We learn from the analysis that the change over time of the seasonal pattern in the data is in fact smooth. We also learn that if model (14.4) is to be used to estimate the seasonal for this or similar cases with outliers in the observations, then a Gaussian model for $\varepsilon_t$ is inappropriate and a heavy-tailed model should be used.

Fig. 14.5. Analyses of gas data based on $t$-model: $t$ conditional and posterior seasonal (top); $t$ conditional and posterior irregular (bottom).


14.4 Volatility: pound/dollar daily exchange rates

The data are the pound/dollar daily exchange rates from 1/10/81 to 28/6/85 which have been used by Harvey et al. (1994). Denoting the daily exchange rate by $x_t$, the observations we consider are given by $y_t = \Delta \log x_t$, for $t = 1, \ldots, n$. A zero-mean stochastic volatility (SV) model of the form

$$y_t = \sigma \exp\left(\tfrac{1}{2}\theta_t\right) u_t, \qquad u_t \sim N(0, 1), \qquad t = 1, \ldots, n,$$
$$\theta_{t+1} = \phi\,\theta_t + \eta_t, \qquad \eta_t \sim N(0, \sigma_\eta^2), \qquad 0 < \phi < 1, \qquad (14.6)$$

was used for analysing these data by Harvey et al. (1994); see also the illustration of §9.6. The purpose of the investigations for which this type of analysis is carried out is to study the structure of the volatility of price ratios in the market, which is of considerable interest to financial analysts. The level of $\theta_t$ determines the amount of volatility and the value of $\phi$ measures the autocorrelation present in the logged squared data.

To illustrate our approach to SV models we consider the Gaussian logdensity of model (14.6),

$$\log p(y_t|\theta_t) = -\tfrac{1}{2}\log 2\pi\sigma^2 - \tfrac{1}{2}\theta_t - \frac{y_t^2}{2\sigma^2}\exp(-\theta_t). \qquad (14.7)$$

The linear approximating model can be obtained by the method of §11.4 with

$$H_t = \frac{2\sigma^2}{y_t^2}\exp(\tilde{\theta}_t), \qquad \tilde{y}_t = \tilde{\theta}_t + 1 - \tfrac{1}{2}H_t,$$

for which $H_t$ is always positive. The iterative process can be started with $H_t = 2$ and $\tilde{y}_t = \log(y_t^2/\sigma^2)$, for $t = 1, \ldots, n$, since it follows from (14.6) that $y_t^2/\sigma^2 \approx \exp(\theta_t)$. When $y_t$ is zero or very close to zero, it should be replaced by a small constant value to avoid numerical problems; this device is only needed to obtain the approximating model so we do not depart from our exact treatment. The number of iterations required is usually fewer than ten.
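The corresponding linearisation step for the SV model, again as a hedged sketch with an ad hoc guard for observations close to zero (the guard constant is an assumption, not a value from the text):

```python
import numpy as np

def sv_approx_step(y, theta, sigma2):
    # H_t = (2*sigma^2 / y_t^2) * exp(theta_t);  y~_t = theta_t + 1 - H_t/2
    y2 = np.maximum(y**2, 1e-10)   # guard for y_t near zero (assumption)
    H = 2.0 * sigma2 * np.exp(theta) / y2
    y_tilde = theta + 1.0 - 0.5 * H
    return y_tilde, H

# Start with H_t = 2 and y~_t = log(y_t**2 / sigma2), then iterate the
# Gaussian smoother as in the Poisson sketch above; usually fewer than
# ten iterations are needed.
```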

The interest here is usually focussed on the estimates of the parameters or their posterior distributions. For the classical analysis we obtain by the maximum likelihood methods of §12.5 the following estimates:

$$\hat{\sigma} = 0.6338, \qquad \hat{\psi}_1 = \log\hat{\sigma} = -0.4561, \qquad \mathrm{SE}(\hat{\psi}_1) = 0.1033,$$
$$\hat{\sigma}_\eta = 0.1726, \qquad \hat{\psi}_2 = \log\hat{\sigma}_\eta = -1.7569, \qquad \mathrm{SE}(\hat{\psi}_2) = 0.2170,$$
$$\hat{\phi} = 0.9731, \qquad \hat{\psi}_3 = \log\frac{\hat{\phi}}{1 - \hat{\phi}} = 3.5876, \qquad \mathrm{SE}(\hat{\psi}_3) = 0.5007,$$

where SE denotes the standard error of the maximum likelihood estimator. We present the results in this form since we estimate the log-transformed parameters, so the standard errors that we calculate apply to them and not to the original parameters of interest. For the Bayesian analysis discussed in Chapter 13 we


Fig. 14.6. Posterior densities of transformed parameters: (i) $\psi_1$; (ii) $\psi_2$; (iii) $\psi_3$.

present in Figure 14.6 the posterior densities of the parameters. These results confirm that stochastic volatility models can be handled by our methods from both classical and Bayesian perspectives.

14.5 Binary density: Oxford-Cambridge boat race

In the last illustration we consider the outcomes of the annual boat race between teams representing the universities of Oxford and Cambridge. The race takes place from Putney to Mortlake on the river Thames in the month of March or April. The first took place in 1829 and was won by Oxford and, at the time of writing, the last took place in 2000 and was won by Oxford for the first time in eight years. There have been some occasions, especially in the 19th century, when the race took place elsewhere and in other months. In the years of both World Wars the race did not take place and there were also some years when the race finished with a dead heat or some other irregularity took place. Thus the time series of yearly outcomes contains missing observations for the years: 1830-1835, 1837, 1838, 1843, 1844, 1847, 1848, 1850, 1851, 1853, 1855, 1877, 1915-1919 and 1940-1945. We deal with these missing observations as described in §12.4.

The appropriate model is the binary distribution as described in §10.3.2. We take $y_t = 1$ if Cambridge wins and $y_t = 0$ if Oxford wins. Denoting the probability that Cambridge wins in year $t$ by $\pi_t$, then as in §10.3.2 we take $\theta_t = \log[\pi_t/(1 - \pi_t)]$.


Fig. 14.7. Dot at zero is a win for Oxford and dot at one is a win for Cambridge; the solid line is the probability of a Cambridge win and the dotted lines constitute the 50% (asymmetric) confidence interval.

A winner this year is likely to be a winner next year because of overlapping crew membership, training methods and other factors. Thus we model the transformed probability by the random walk

$$\theta_{t+1} = \theta_t + \eta_t, \qquad \eta_t \sim N(0, \sigma_\eta^2),$$

where $\eta_t$ is serially uncorrelated, for $t = 1, \ldots, n$. The method described in §11.4 provides the approximating model for this case

and maximum likelihood estimation for the unknown variance $\sigma_\eta^2$ is carried out as described in §12.5. We estimated the variance as $\hat{\sigma}_\eta^2 = 0.521$.

The estimated conditional mean of the probability $\pi_t$, indicating a win for Cambridge in year $t$, is computed using the method described in §12.2. The resulting time series of $\hat{\pi}_t$ is given in Figure 14.7. The forecast probability of a Cambridge win in 2001 is 0.67.

14.6 Non-Gaussian and nonlinear analysis using SsfPack

The calculations in this chapter were carried out using the object-oriented matrix programming language Ox of Doornik (1998) together with the library of state space functions SsfPack 2.2 by Koopman et al. (1999) and the Ox package SsfNong.ox for non-Gaussian and nonlinear state space analysis by Koopman et al.


(1998). The data and programs are freely available on the Internet at http://www.ssfpack.com/dkbook. In each illustration of this chapter we have referred to relevant Ox programs. Documentation of the functions used here and a discussion of computational matters can be found on the Internet at http://www.ssfpack.com/dkbook. The SsfPack library of computer routines for state space models is discussed in §6.6. The SsfNong.ox package can be downloaded from the Internet at http://www.ssfpack.com.


References

Akaike, H. and Kitagawa, G. (eds.) (1999). The Practice of Time Series Analysis. New York: Springer-Verlag.

Anderson, B. D. O. and Moore, J. B. (1979). Optimal Filtering. Englewood Cliffs: Prentice-Hall.

Anderson, T. W. (1984). An Introduction to Multivariate Statistical Analysis, 2nd edition. New York: John Wiley & Sons.

Ansley, C. F. and Kohn, R. (1985). Estimation, filtering and smoothing in state space models with incompletely specified initial conditions, Annals of Statistics, 13, 1286-1316.

Ansley, C. F. and Kohn, R. (1986). Prediction mean square error for state space models with estimated parameters, Biometrika, 73, 467-74.

Atkinson, A. C. (1985). Plots, Transformations and Regression. Oxford: Clarendon Press.

Balke, N. S. (1993). Detecting level shifts in time series, J. Business and Economic Statist., 11, 81-92.

Barndorff-Nielsen, O. E. and Shephard, N. (2001). Non-Gaussian OU based models and some of their uses in financial economics (with discussion), J. Royal Statistical Society B, 63. Forthcoming.

Basawa, I. V., Godambe, V. P. and Taylor, R. L. (eds.) (1997). Selected Proceedings of Athens, Georgia Symposium on Estimating Functions. Hayward, California: Institute of Mathematical Statistics.

Bernardo, J. M. and Smith, A. F. M. (1994). Bayesian Theory. Chichester: John Wiley.

Bollerslev, T., Engle, R. F. and Nelson, D. B. (1994). ARCH Models, In Engle, R. F. and McFadden, D. (eds.), The Handbook of Econometrics, Volume 4, pp. 2959-3038. Amsterdam: North-Holland.

Bowman, K. O. and Shenton, L. R. (1975). Omnibus test contours for departures from normality based on $\sqrt{b_1}$ and $b_2$, Biometrika, 62, 243-50.

Box, G. E. P., Jenkins, G. M. and Reinsel, G. C. (1994). Time Series Analysis, Forecasting and Control, 3rd edition. San Francisco: Holden-Day.

Box, G. E. P. and Tiao, G. C. (1973). Bayesian Inference in Statistical Analysis. Reading, MA: Addison-Wesley.

Box, G. E. P. and Tiao, G. C. (1975). Intervention analysis with applications to economic and environmental problems, J. American Statistical Association, 70, 70-79.

Brockwell, P. J. and Davis, R. A. (1987). Time Series: Theory and Methods. New York: Springer-Verlag.

Bryson, A. E. and Ho, Y. C. (1969). Applied Optimal Control. Massachusetts: Blaisdell.

Burman, J. P. (1980). Seasonal adjustment by signal extraction, J. Royal Statistical Society A, 143, 321-37.


Cargnoni, C., Muller, P. and West, M. (1997). Bayesian forecasting of multinomial time series through conditionally Gaussian dynamic models, J. American Statistical Association, 92, 640-47.

Carlin, B. P., Polson, N. G. and Stoffer, D. S. (1992). A Monte Carlo approach to nonnormal and nonlinear state-space modelling, J. American Statistical Association, 87, 493-500.

Carter, C. K. and Kohn, R. (1994). On Gibbs sampling for state space models, Biometrika, 81, 541-53.

Carter, C. K. and Kohn, R. (1996). Markov chain Monte Carlo in conditionally Gaussian state space models, Biometrika, 83, 589-601.

Carter, C. K. and Kohn, R. (1997). Semiparameteric Bayesian inference for time series with mixed spectra, J. Royal Statistical Society B, 59, 255-68.

Chu-Chun-Lin, S. and de Jong, P. (1993). A note on fast smoothing, Discussion paper, University of British Columbia.

Cobb, G. W. (1978). The problem of the Nile: conditional solution to a change point problem, Biometrika, 65, 243-51.

Cook, R. D. and Weisberg, S. (1982). Residuals and Influence in Regression. New York: Chapman and Hall.

de Jong, P. (1988a). A cross validation filter for time series models, Biometrika, 75, 594-600.

de Jong, P. (1988b). The likelihood for a state space model, Biometrika, 75, 165-69.

de Jong, P. (1989). Smoothing and interpolation with the state space model, J. American Statistical Association, 84, 1085-88.

de Jong, P. (1991). The diffuse Kalman filter, Annals of Statistics, 19, 1073-83.

de Jong, P. (1998). Fixed interval smoothing, Discussion paper, London School of Economics.

de Jong, P. and MacKinnon, M. J. (1988). Covariances for smoothed estimates in state space models, Biometrika, 75, 601-2.

de Jong, P. and Penzer, J. (1998). Diagnosing shocks in time series, J. American Statistical Association, 93, 796-806.

de Jong, P. and Shephard, N. (1995). The simulation smoother for time series models, Biometrika, 82, 339-50.

Doornik, J. A. (1998). Object-Oriented Matrix Programming Using Ox 2.0. London: Timberlake Consultants Press.

Doran, H. E. (1992). Constraining Kalman filter and smoothing estimates to satisfy time-varying restrictions, Rev. Economics and Statistics, 74, 568-72.

Doucet, A., de Freitas, J. F. G. and Gordon, N. J. (eds.) (2000). Sequential Monte Carlo Methods in Practice. New York: Springer-Verlag.

Duncan, D. B. and Horn, S. D. (1972). Linear dynamic regression from the viewpoint of regression analysis, J. American Statistical Association, 67, 815-21.

Durbin, J. (1960). Estimation of parameters in time series regression models, J. Royal Statistical Society B, 22, 139-53.

Durbin, J. (1987). Statistics and statistical science, (Presidential address), J. Royal Statistical Society A, 150, 177-91.

Durbin, J. (1988). Is a philosophical consensus for statistics attainable? J. Econometrics, 37, 51-61.

Durbin, J. (1997). Optimal estimating equations for state vectors in non-Gaussian and nonlinear state space time series models, in Basawa et al. (1997).


Durbin, J. (2000a). Contribution to discussion of Harvey and Chung (2000), J. Royal Statistical Society A, 163, 303-39.

Durbin, J. (2000b). The state space approach to time series analysis and its potential for official statistics, (The Foreman lecture), Australian and New Zealand J. of Statistics, 42, 1-23.

Durbin, J. and Harvey, A. C. (1985). The effects of seat belt legislation on road casualties in Great Britain: report on assessment of statistical evidence, Annexe to Compulsory Seat Belt Wearing Report, Department of Transport, London, HMSO.

Durbin, J. and Koopman, S. J. (1992). Filtering, smoothing and estimation for time series models when the observations come from exponential family distributions, Unpublished paper: Department of Statistics, LSE.

Durbin, J. and Koopman, S. J. (1997). Monte Carlo maximum likelihood estimation of non-Gaussian state space models, Biometrika, 84, 669-84.

Durbin, J. and Koopman, S. J. (2000). Time series analysis of non-Gaussian observations based on state space models from both classical and Bayesian perspectives (with discussion), J. Royal Statistical Society B, 62, 3-56.

Durbin, J. and Quenneville, B. (1997). Benchmarking by state space models, International Statistical Review, 65, 23-48.

Engle, R. F. (1982). Autoregressive conditional heteroskedasticity with estimates of the variance of the United Kingdom inflation, Econometrica, 50,987-1007.

Engle, R. F. and Russell, J. R. (1998). Forecasting transaction rates: the autoregressive conditional duration model, Econometrica, 66,1127-62.

Fahrmeir, L. (1992). Posterior mode estimation by extended Kalman filtering for multivariate dynamic generalised linear models, J. American Statistical Association, 87, 501-9.

Fahrmeir, L. and Kaufmann, H. (1991). On Kalman filtering, posterior mode estimation and Fisher scoring in dynamic exponential family regression, Metrika, 38, 37-60.

Fahrmeir, L. and Tutz, G. (1994). Multivariate Statistical Modelling Based on Generalized Linear Models. Berlin: Springer.

Fessler, J. A. (1991). Nonparametric fixed-interval smoothing with vector splines, IEEE Trans. Signal Process, 39, 852-59.

Fletcher, R. (1987). Practical Methods of Optimisation, 2nd edition. New York: John Wiley.

Fruhwirth-Schnatter, S. (1994). Data augmentation and dynamic linear models, J. Time Series Analysis, 15, 183-202.

Gamerman, D. (1997). Markov Chain Monte Carlo: Stochastic Simulations for Bayesian Inference. London: Chapman and Hall.

Gamerman, D. (1998). Markov chain Monte Carlo for dynamic generalised linear models, Biometrika, 85, 215-27.

Gelfand, A. E. and Smith, A. F. M. (eds.) (1999). Bayesian Computation. Chichester: John Wiley and Sons.

Gelman, A. (1995). Inference and monitoring convergence, in Gilks et al. (1996), pp. 131-143.

Gelman, A., Carlin, J. B., Stern, H. S. and Rubin, D. B. (1995). Bayesian Data Analysis. London: Chapman & Hall.

Geweke, J. (1989). Bayesian inference in econometric models using Monte Carlo integration, Econometrica, 57, 1317-39.


Ghysels, E., Harvey, A. C. and Renault, E. (1996). Stochastic volatility, In Rao, C. R. and Maddala, G. S. (eds.), Statistical Methods in Finance, pp. 119-91. Amsterdam: North-Holland.

Gilks, W. R., Richardson, S. and Spiegelhalter, D. J. (eds.) (1996). Markov Chain Monte Carlo in Practice. London: Chapman & Hall.

Godambe, V. P. (1960). An optimum property of regular maximum likelihood estimation, Annals of Mathematical Statistics, 31, 1208-12.

Golub, G. H. and Van Loan, C. F. (1997). Matrix Computations, 2nd edition. Baltimore: The Johns Hopkins University Press.

Granger, C. W. J. and Newbold, P. (1986). Forecasting Economic Time Series, 2nd edition. Orlando: Academic Press.

Green, P. and Silverman, B. W. (1994). Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. London: Chapman & Hall.

Hamilton, J. (1994). Time Series Analysis. Princeton: Princeton University Press.

Hardle, W. (1990). Applied Nonparametric Regression. Cambridge: Cambridge University Press.

Harrison, J. and Stevens, C. F. (1976). Bayesian forecasting (with discussion), J. Royal Statistical Society B, 38, 205-47.

Harrison, J. and West, M. (1991). Dynamic linear model diagnostics, Biometrika, 78, 797-808.

Harvey, A. C. (1989). Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge: Cambridge University Press.

Harvey, A. C. (1993). Time Series Models, 2nd edition. Hemel Hempstead: Harvester Wheatsheaf.

Harvey, A. C. (1996). Intervention analysis with control groups, International Statistical Review, 64, 313-28.

Harvey, A. C. and Chung, C.-H. (2000). Estimating the underlying change in unemployment in the UK (with discussion), J. Royal Statistical Society A, 163, 303-39.

Harvey, A. C. and Durbin, J. (1986). The effects of seat belt legislation on British road casualties: A case study in structural time series modelling, (with discussion), J. Royal Statistical Society A, 149, 187-227.

Harvey, A. C. and Fernandes, C. (1989). Time series models for count data or qualitative observations, J. Business and Economic Statist., 7, 407-17.

Harvey, A. C. and Koopman, S. J. (1992). Diagnostic checking of unobserved components time series models, J. Business and Economic Statist., 10, 377-89.

Harvey, A. C. and Koopman, S. J. (1997). Multivariate structural time series models, In Heij, C., Schumacher, H., Hanzon, B. and Praagman, C. (eds.), Systematic Dynamics in Economic and Financial Models, pp. 269-98. Chichester: John Wiley and Sons.

Harvey, A. C. and Koopman, S. J. (2000). Signal extraction and the formulation of unobserved components models, Econometrics Journal, 3, 84-107.

Harvey, A. C. and Peters, S. (1990). Estimation procedures for structural time series models, J. of Forecasting, 9, 89-108.

Harvey, A. C. and Phillips, G. D. A. (1979). The estimation of regression models with autoregressive-moving average disturbances, Biometrika, 66, 49-58.

Harvey, A. C., Ruiz, E. and Shephard, N. (1994). Multivariate stochastic variance models, Rev. Economic Studies, 61, 247-64.


Harvey, A. C. and Shephard, N. (1990). On the probability of estimating a deterministic component in the local level model, J. Time Series Analysis, 11, 339-47.

Harvey, A. C. and Shephard, N. (1993). Structural time series models, In Maddala, G. S., Rao, C. R. and Vinod, H. D. (eds.), Handbook of Statistics, Volume 11. Amsterdam: Elsevier Science Publishers.

Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. London: Chapman & Hall.

Holt, C. C. (1957). Forecasting seasonals and trends by exponentially weighted moving averages, Research memorandum, Carnegie Institute of Technology, Pittsburgh, Pennsylvania.

Hull, J. and White, A. (1987). The pricing of options on assets with stochastic volatilities, J. Finance, 42, 281-300.

Jacquier, E., Polson, N. G. and Rossi, P. E. (1994). Bayesian analysis of stochastic volatility models (with discussion), J. Business and Economic Statist., 12, 371-417.

Jazwinski, A. H. (1970). Stochastic Processes and Filtering Theory. New York: Academic Press.

Jones, R. H. (1993). Longitudinal Data with Serial Correlation: A State-Space Approach. London: Chapman & Hall.

Kalman, R. E. (1960). A new approach to linear filtering and prediction problems, J. Basic Engineering, Transactions ASME, Series D, 82, 35-45.

Kim, C. J. and Nelson, C. R. (1999). State Space Models with Regime Switching. Cambridge, Massachusetts: MIT Press.

Kim, S., Shephard, N. and Chib, S. (1998). Stochastic volatility: likelihood inference and comparison with ARCH models, Rev. Economic Studies, 65, 361-93.

Kitagawa, G. and Gersch, W. (1996). Smoothness Priors Analysis of Time Series. New York: Springer-Verlag.

Kohn, R. and Ansley, C. F. (1989). A fast algorithm for signal extraction, influence and cross-validation, Biometrika, 76,65-79.

Kohn, R., Ansley, C. F. and Wong, C.-M. (1992). Nonparametric spline regression with autoregressive moving average errors, Biometrika, 79,335—46.

Koopman, S. J. (1993). Disturbance smoother for state space models, Biometrika, 80,117-26.

Koopman, S. J. (1997). Exact initial Kalman filtering and smoothing for non-stationary time series models, J. American Statistical Association, 92, 1630-38.

Koopman, S. J. (1998). Kalman filtering and smoothing, In Armitage, P. and Colton, T. (eds.), Encyclopedia of Biostatistics. Chichester: Wiley and Sons.

Koopman, S. J. and Durbin, J. (2000). Fast filtering and smoothing for multivariate state space models, J. Time Series Analysis, 21, 281-96.

Koopman, S. J. and Durbin, J. (2001). Filtering and smoothing of state vector for diffuse state space models, mimeo, Free University, Amsterdam. http://www.ssfpack.com/dkbook/.

Koopman, S. J. and Harvey, A. C. (1999). Computing observation weights for signal extraction and filtering, mimeo, Free University, Amsterdam.

Koopman, S. J., Harvey, A. C., Doornik, J. A. and Shephard, N. (2000). Stamp 6.0: Structural Time Series Analyser, Modeller and Predictor. London: Timberlake Consultants.


Koopman, S. J. and Hol-Uspensky, E. (1999). The Stochastic Volatility in Mean model: Empirical evidence from international stock markets, Discussion paper, Tinbergen Institute, Amsterdam.

Koopman, S. J. and Shephard, N. (1992). Exact score for time series models in state space form, Biometrika, 79,823-26.

Koopman, S. J., Shephard, N. and Doornik, J. A. (1998). Fitting non-Gaussian state space models in econometrics: Overview, developments and software, mimeo, Nuffield College, Oxford.

Koopman, S. J., Shephard, N. and Doornik, J. A. (1999). Statistical algorithms for models in state space form using SsfPack 2.2, Econometrics Journal, 2, 113-66. http://www.ssfpack.com/.

Ljung, G. M. and Box, G. E. P. (1978). On a measure of lack of fit in time series models, Biometrika, 65, 297-303.

Magnus, J. R. and Neudecker, H. (1988). Matrix Differential Calculus with Applications in Statistics and Econometrics. New York: Wiley.

Makridakis, S., Wheelwright, S. C. and Hyndman, R. J. (1998). Forecasting: Methods and Applications 3rd edition. New York: John Wiley and Sons.

McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, 2nd edition. London: Chapman & Hall.

Mills, T. C. (1993). Time Series Techniques for Economists 2nd edition. Cambridge: Cambridge University Press.

Morf, J. F. and Kailath, T. (1975). Square root algorithms for least squares estimation, IEEE Transactions on Automatic Control, 20, 487-97.

Muth, J. F. (1960). Optimal properties of exponentially weighted forecasts, J. American Statistical Association, 55, 299-305.

Pfeffermann, D. and Tiller, R. (2000). Bootstrap approximation to prediction MSE for state-space models with estimated parameters, mimeo, Department of Statistics, Hebrew University, Jerusalem.

Poirier, D. J. (1995). Intermediate Statistics and Econometrics. Cambridge: MIT Press.

Quenneville, B. and Singh, A. C. (1997). Bayesian prediction mean squared error for state space models with estimated parameters, J. Time Series Analysis, 21, 219-36.

Ripley, B. D. (1987). Stochastic Simulation. New York: Wiley.

Rosenberg, B. (1973). Random coefficients models: the analysis of a cross-section of time series by stochastically convergent parameter regression, Annals of Economic and Social Measurement, 2, 399-428.

Rydberg, T. H. and Shephard, N. (1999). BIN models for trade-by-trade data. Modelling the number of trades in a fixed interval of time, Working paper, Nuffield College, Oxford.

Sage, A. P. and Melsa, J. L. (1971). Estimation Theory with Applications to Communication and Control. New York: McGraw Hill.

Schweppe, F. (1965). Evaluation of likelihood functions for Gaussian signals, IEEE Transactions on Information Theory, 11, 61-70.

Shephard, N. (1993). Fitting non-linear time series models, with applications to stochastic variance models, J. Applied Econometrics, 8, S135-52.

Shephard, N. (1994). Partial non-Gaussian state space, Biometrika, 81, 115-31.

Shephard, N. (1996). Statistical aspects of ARCH and stochastic volatility, In Cox, D. R., Hinkley, D. V. and Barndorff-Nielsen, O. E. (eds.), Time Series Models in Econometrics, Finance and Other Fields, pp. 1-67. London: Chapman & Hall.



Shephard, N. and Pitt, M. K. (1997). Likelihood analysis of non-Gaussian measurement time series, Biometrika, 84, 653-67.

Shumway, R. H. and Stoffer, D. S. (1982). An approach to time series smoothing and forecasting using the EM algorithm, J. Time Series Analysis, 3, 253-64.

Shumway, R. H. and Stoffer, D. S. (2000). Time Series Analysis and Its Applications. New York: Springer-Verlag.

Silverman, B. W. (1985). Some aspects of the spline smoothing approach to non-parametric regression curve fitting, J. Royal Statistical Society B, 47, 1-52.

Smith, J. Q. (1979). A generalization of the Bayesian steady forecasting model, J. Royal Statistical Society B, 41, 375-87.

Smith, J. Q. (1981). The multiparameter steady model, J. Royal Statistical Society B, 43, 256-60.

Snyder, R. D. and Saligari, G. R. (1996). Initialization of the Kalman filter with partially diffuse initial conditions, J. Time Series Analysis, 17, 409-24.

Taylor, S. J. (1986). Modelling Financial Time Series. Chichester: John Wiley.

Theil, H. and Wage, S. (1964). Some observations on adaptive forecasting, Management Science, 10, 198-206.

Wahba, G. (1978). Improper priors, spline smoothing, and the problems of guarding against model errors in regression, J. Royal Statistical Society B, 40, 364-72.

Wahba, G. (1990). Spline Models for Observational Data. Philadelphia: SIAM.

Watson, M. W. and Engle, R. F. (1983). Alternative algorithms for the estimation of dynamic factor, MIMIC and varying coefficient regression models, J. Econometrics, 23, 385-400.

Wecker, W. E. and Ansley, C. F. (1983). The signal extraction approach to nonlinear regression and spline smoothing, J. American Statistical Association, 78, 81-9.

West, M. and Harrison, J. (1997). Bayesian Forecasting and Dynamic Models, 2nd edition. New York: Springer-Verlag.

West, M., Harrison, J. and Migon, H. S. (1985). Dynamic generalised models and Bayesian forecasting (with discussion), J. American Statistical Association, 80, 73-97.

Winters, P. R. (1960). Forecasting sales by exponentially weighted moving averages, Management Science, 6, 324-42.

Yee, T. W. and Wild, C. J. (1996). Vector generalized additive models, J. Royal Statistical Society B, 58, 481-93.

Young, P. C. (1984). Recursive Estimation and Time Series Analysis. New York: Springer-Verlag.

Young, P. C., Lane, K., Ng, C. N. and Palmer, D. (1991). Recursive forecasting, smoothing and seasonal adjustment of nonstationary environmental data, J. Forecasting, 10, 57-89.


Author index

Akaike, H., 5
Anderson, B. D. O., 5, 67, 71, 128, 184
Anderson, T. W., 10, 37
Ansley, C. F., 62, 63, 71, 75, 81, 101, 139, 172
Atkinson, A. C., 81

Balke, N. S., 13
Barndorff-Nielsen, O. E., 187
Basawa, I. V., 203
Bernardo, J. M., 155, 157, 222, 223
Bollerslev, T., 187
Bowman, K. O., 34
Box, G. E. P., 34, 44, 46, 184
Brockwell, P. J., 5
Bryson, A. E., 71
Burman, J. P., 53

Cargnoni, C., 222
Carlin, B. P., 159, 222
Carlin, J. B., 155, 157, 158, 206, 214, 222, 223, 225
Carter, C. K., 84, 159, 160, 222
Chib, S., 229
Chu-Chun-Lin, S., 120
Chung, C.-H., 56, 57
Cobb, G. W., 13
Cook, R. D., 81

Davis, R. A., 5
de Freitas, J. F. G., 5
de Jong, P., 22, 71, 75, 81, 84, 101, 105, 115, 117, 120, 128, 139, 141, 142, 153, 159
Doornik, J. A., 31, 45, 134, 135, 136, 143, 153, 163, 229, 233
Doran, H. E., 134
Doucet, A., 5
Duncan, D. B., 67
Durbin, J., viii, 3, 44, 51, 53, 56, 57, 101, 103, 106, 107, 108, 109, 129, 155, 161, 167, 189, 202, 203, 205, 215, 217, 219, 220, 221, 222, 227, 230, 231

Engle, R. F., 147, 148, 187, 188

Fahrmeir, L., 5, 128, 193, 202
Fernandes, C., 180
Fessler, J. A., 133
Fletcher, R., 143
Frühwirth-Schnatter, S., 84, 159, 160

Gamerman, D., 159, 160, 222, 229
Gelfand, A. E., 214
Gelman, A., 155, 157, 158, 159, 206, 214, 222, 223, 225
Gersch, W., 5
Geweke, J., 156
Ghysels, E., 185, 187
Godambe, V. P., 202, 203
Golub, G. H., 126, 127
Gordon, N. J., 5
Granger, C. W. J., 5
Green, P., 61, 62, 63, 81

Hamilton, J., 5, 150, 151
Härdle, W., 173
Harrison, J., 5, 40, 42, 81, 180
Harvey, A. C., 5, 31, 35, 39, 40, 41, 44, 45, 53, 56, 57, 60, 81, 101, 123, 124, 133, 139, 142, 143, 148, 153, 161, 163, 167, 168, 173, 174, 175, 180, 185, 187, 230, 233, 236
Hastie, T., 133
Ho, Y. C., 71
Hol-Uspensky, E., 186
Holt, C. C., 50
Horn, S. D., 67
Hull, J., 175
Hyndman, R. J., 169, 171

Jacquier, E., 229
Jazwinski, A. H., 5
Jenkins, G. M., 46
Jones, R. H., 5

Kailath, T., 124
Kalman, R. E., vii, 52
Kaufmann, H., 193
Kim, C. J., 5
Kim, S., 229
Kitagawa, G., 5
Kohn, R., 62, 71, 75, 81, 84, 101, 139, 160, 222
Koopman, S. J., viii, 3, 31, 35, 45, 71, 72, 75, 81, 101, 103, 106, 107, 108, 109, 129, 134, 135, 136, 140, 143, 145, 147, 153, 163, 173, 174, 186, 189, 202, 205, 215, 217, 219, 220, 221, 222, 227, 229, 231, 233, 238

Lane, K., 39, 40
Ljung, G. M., 34

MacKinnon, M. J., 81
Magnus, J. R., 112
Makridakis, S., 169, 171
McCullagh, P., 180
Melsa, J. L., 5
Migon, H. S., 180
Mills, T. C., 5
Moore, J. B., 5, 67, 71, 128, 184
Morf, M., 124
Müller, P., 222
Muth, J. F., 49, 50

Nelder, J. A., 180
Nelson, C. R., 5
Nelson, D. B., 187
Neudecker, H., 112
Newbold, P., 5
Ng, C. N., 39, 40

Palmer, D., 39, 40
Penzer, J., 153
Peters, S., 148
Pfeffermann, D., 152
Phillips, G. D. A., 101, 139
Pitt, M. K., 189, 215, 222, 229
Poirier, D. J., 160
Polson, N. G., 159, 222, 229

Quenneville, B., 56, 152

Reinsel, G. C., 46
Renault, E., 185, 187
Ripley, B. D., 156, 189
Rosenberg, B., 100, 101, 115, 117, 118
Rossi, P. E., 229
Rubin, D. B., 155, 157, 158, 206, 214, 222, 223, 225
Ruiz, E., 175, 235, 236

Russell, J. R., 188
Rydberg, T. H., 188

Sage, A. P., 5
Saligari, G. R., 128
Schweppe, F., 139
Shenton, L. R., 34
Shephard, N., 22, 31, 45, 84, 134, 135, 136, 142, 143, 145, 147, 153, 158, 159, 160, 163, 175, 185, 187, 188, 189, 201, 210, 211, 215, 222, 229, 233, 236, 238

Shumway, R. H., 5, 147
Silverman, B. W., 61, 62, 63, 81, 173
Singh, A. C., 152
Smith, A. F. M., 155, 157, 214, 222, 223
Smith, J. Q., 180
Snyder, R. D., 129
Stern, H. S., 155, 157, 158, 206, 214, 222, 223, 225
Stevens, C. F., 40, 42
Stoffer, D. S., 5, 147, 159, 222

Taylor, R. L., 203
Taylor, S. J., 185
Theil, H., 50
Tiao, G. C., 44, 184
Tibshirani, R., 133
Tiller, R., 152
Tutz, G., 5, 128

Van Loan, C. F., 126, 127

Wage, S., 50
Wahba, G., 53, 61, 63
Watson, M. W., 147, 148
Wecker, W. E., 63, 172
Weisberg, S., 81
West, M., 5, 81, 180, 222
Wheelwright, S. C., 169, 171
White, A., 175
Wild, C. J., 133
Winters, P. R., 50
Wong, C.-M., 62

Yee, T. W., 133
Young, P. C., 5, 39, 40, 71


Subject index

ACD model, 188
airline model, 48
Akaike information criterion, 152
antithetic variables, 205
approximating model, 191
  based on first two derivatives, 193
  based on the first derivative, 195
  exponential family model, 195
  for non-Gaussian state component, 198
  for nonlinear model, 199
  general error distribution, 197
  mixture of normals, 197
  mode estimation, 202
  multiplicative model, 201
  Poisson distribution, 195
  stochastic volatility model, 195
  t-distribution, 197, 199

ARIMA model, 46, 112
  in state space form, 47
ARMA model, 46, 111, 161
  in state space form, 46
  with missing observations, 171
augmented filtering
  loglikelihood evaluation, 140
  regression estimation, 122
augmented Kalman filter and smoother, 115
augmented simulation smoothing, 120
augmented smoothing, 120
autoregressive conditional duration model, 188
autoregressive conditional heteroscedasticity, 187

Bayesian analysis, 155, 222
  computational aspects, 225
  for linear Gaussian model, 155
  for non-Gaussian and nonlinear model, 222
  MCMC methods, 159, 228
  non-informative prior, 158
  posterior analysis of parameter vector, 226
  posterior analysis of state vector, 155, 222

benchmarking, 54
BFGS method, 143
BIN model, 188
binary distribution, 181, 237
binomial distribution, 181
bivariate models, 161
boat race example, 237

Box-Jenkins analysis, 2, 46, 51, 161
Box-Ljung statistic, 34
Brownian motion, 57

Cholesky decomposition, 14, 22, 97, 131, 133
common factors, 45
concentration of loglikelihood, 31
conditional
  density estimation, 213
  distribution, 11, 16, 65, 84
  distribution function estimation, 213
  mean estimation, 212
  mode estimation, 202
  variance estimation, 212
continuous time, 57, 62
control variables, 219
cycle, 43

data
  exchange rate pound/dollar, 236
  gas consumption, 233
  internet users, 169
  motorcycle acceleration, 172
  Nile, 12, 18, 21, 23, 25, 27, 32, 35, 136
  Oxford/Cambridge boat race, 237
  road accidents, 161, 167, 230

diagnostic checking, 33, 152
diffuse initialisation, 27, 99, 206
diffuse Kalman filter, 27, 101
  augmented filtering, 115
  exact initial Kalman filter, 101

diffuse loglikelihood
  parameter estimation, 149
diffuse loglikelihood evaluation
  using augmented filtering, 140
  using exact initial Kalman filter, 139

diffuse simulation smoothing
  augmented simulation smoothing, 120
  exact initial simulation smoothing, 110

diffuse smoothing
  augmented smoothing, 120
  exact initial smoothing, 106, 109

disturbance smoothing, 19, 73
  augmented smoothing, 120
  covariance matrices, 77
  derivation of, 73
  exact initial smoothing, 109
  in matrix form, 98
  variance matrix, 75

EM algorithm, 147
estimating equations, 203
EWMA, 49
exact initial filtering
  convenient representation, 105
  derivation of, 101
  regression estimation, 122
  transition to Kalman filter, 104

exact initial Kalman filter, 101
  loglikelihood evaluation, 139

exact initial simulation smoothing, 110
exact initial smoothing, 106
  convenient representation, 107, 109
  derivation of, 106

explanatory variables, 43
exponential distribution, 188
exponential family model, 180
  approximating model, 195
  binary distribution, 181, 237
  binomial distribution, 181
  exponential distribution, 188
  multinomial distribution, 182
  negative binomial distribution, 182
  Poisson distribution, 181, 188, 230

exponential smoothing, 49
exponentially weighted moving average, 49

filtering, 11, 65
financial model, 185
  duration, 188
  trade frequency, 188
  volatility, 185, 187, 236

forecast errors, 13, 33, 49, 68, 152
forecasting, 25, 93, 214

GARCH model, 187
general autoregressive conditional heteroscedasticity, 187
general error distribution, 184
  approximating model, 197
Givens rotation, 126
goodness of fit, 152
gradient of loglikelihood, 144

heavy-tailed distribution, 183
  approximating model, 195
  general error distribution, 184
  mixture of normals, 184
  t-distribution, 183, 233

heteroscedasticity diagnostic test, 34

importance resampling, 214
importance sampling, 156, 177, 223
  antithetic variables, 205
  computational aspects, 204
  diffuse initialisation, 206

initialisation, 27, 99
innovation model, 69
innovations, 13, 68, 96
intervention variables, 43
irregular, 9

Kalman filter, 11-13, 65, 67
  augmented filtering, 116
  Cholesky decomposition, 14, 97
  derivation of, 65
  exact initial Kalman filter, 101
  in matrix form, 97
  missing observations, 23

least squares residuals, 123
leverage, 186
likelihood evaluation, 138
linearisation
  based on first two derivatives, 193
  based on the first derivative, 195
  for non-Gaussian state component, 198
  for nonlinear model, 199

local level model, 9, 44, 50, 57, 142, 174, 180
local linear trend model, 39, 50, 59, 100, 111, 118, 128, 133
loglikelihood evaluation, 30, 138
  score vector, 144
  using augmented filtering, 140, 141
  using exact initial Kalman filter, 139
  when initial conditions are diffuse, 139, 140
  when initial conditions are fixed but unknown, 141
  when initial conditions are known, 138

Markov chain Monte Carlo, 159, 228
maximum likelihood estimation, 30, 138, 142, 215
mean square error due to simulation, 151, 217
minimum mean square error, 25, 49, 67, 93
missing observations, 23, 59, 92, 136, 169, 214, 237
mixture of normals, 184
  approximating model, 197
mode estimation, 202
  estimating equations, 203
multinomial distribution, 182
multiplicative models, 185, 201
multivariate models, 44, 167

negative binomial distribution, 182
Newton's method, 142
non-Gaussian models, 179
  binary illustration, 237
  exponential family model, 180
  heavy-tailed distribution, 183
  Poisson illustration, 230
  stochastic volatility model, 185
  SV model illustration, 236
  t-distribution illustration, 233

nonlinear models, 179, 184
normality diagnostic test, 33
numerical maximisation, 142

outliers, 35
Ox, 136, 143, 229, 239

parameter estimation, 30, 142
  effect of errors, 150, 217
  EM algorithm, 147
  for non-Gaussian and nonlinear models, 215
  large sample distribution, 150
  when initial conditions are diffuse, 149

Poisson distribution, 181, 188, 230
  approximating model, 195

posterior analysis of state vector, 155

random walk, 9
recursive residuals, 123
regression estimation, 121
  least squares residuals, 123
  recursive residuals, 123

regression lemma, 37
regression model
  with ARMA errors, 54, 114
  with time-varying coefficients, 54

restrictions on state vector, 134

score vector, 144
seasonal, 9, 40, 48, 162, 231
seemingly unrelated time series equations, 44
serial correlation diagnostic test, 34
simulation, 22, 156, 159, 228
  mean square error due to, 217
  variance due to, 213

simulation smoothing, 22, 83, 159
  augmented simulation smoothing, 120
  derivation of, 87
  exact initial simulation smoothing, 110
  missing observations, 93
  multiple samples, 92
  observation errors, 89
  recursion for disturbances, 90
  state vectors, 91

smoothing, 16, 19, 70
spline smoothing, 61, 115, 133, 161
square root filtering, 124
square root smoothing, 127
SsfPack, 134, 136, 161, 229, 238
STAMP, 45, 143, 163
state estimation error, 15, 68
state smoothing, 16, 70
  augmented smoothing, 120
  covariance matrices, 77
  derivation of, 70
  exact initial smoothing, 106
  fast state smoothing, 75
  in matrix form, 22, 98
  missing observations, 92
  variance matrix, 72

state space model, 1, 38, 65, 99
steady state, 13, 32, 68
stochastic volatility model, 185, 236
  approximate method, 175
  approximating model, 195

structural breaks, 35
structural time series model, 39, 110, 160
  multivariate model, 44, 167
Student's t-distribution, 183
SV model, 185, 236
SVM model, 186

t-distribution, 183, 233
  approximating model, 197, 199

time series, 9
trend, 9, 39

univariate treatment of multivariate series, 128

volatility models, 185, 236

weight functions, 81