
FACULTY OF NUCLEAR SCIENCES AND PHYSICAL ENGINEERING
CZECH TECHNICAL UNIVERSITY PRAGUE

SEASONAL TIME SERIES MODELING VIA NEURAL NETWORKS WITH SWITCHING UNITS

PHD THESIS

PRAGUE, SEPTEMBER 2009
MAREK HLAVÁČEK


ABSTRACT

Neural networks with switching units were originally designed for fast (on-line) data classification. This thesis introduces several innovations to the basic model to increase its potential in the context of seasonal time series forecasting. The innovative features include the implementation of feedback and the smoothing of the response function, as suggested in [HvH05]. The remaining improvements are inspired by ideas behind other time series models adapted to neural networks with switching units. The performance of the proposed model, as well as the impact of the individual innovations, has been examined experimentally. The secondary objective of this thesis is to provide a forecasting model for the daily series of volumes of currency in circulation. Therefore, this particular series is exclusively used in the experiments. Finally, the performance of the enhanced models is compared with two conventional stochastic models - an ARMA model and a structural time series model. The results show that the innovated model outperforms the conventional models and performs at least as well as the combination of both conventional models.


Acknowledgement

Firstly, I would like to thank my advisor Frantisek Hakl for his valuable comments during the whole period of the preparation of this thesis.

Many thanks belong to Jan Vachulka. Jan, who was my colleague in our small research team, turned, with extraordinary effort, the original implementation of neural networks into a very flexible and robust software framework.

I owe much to my colleague and friend, Roman Kalous, who readily helped me with the visualizations of the network architectures. I also spent a great amount of time with him discussing related issues and improvements; Roman provided me with comments on time series learning procedures, model comparison, the application of genetic algorithms, and other matters. I am really happy that I had the chance to work with him. I also believe that we will stay in touch.

Regarding the research stage, I need to thank Marian Hlavacek and Tomas Senft. The two students put extraordinary effort into the development of the stochastic models mentioned in the second chapter of this thesis.

The task of currency in circulation emerged during my employment at the Czech National Bank. It is my pleasure to mention line manager Martin Perina, liquidity management expert Vladimir Pulkrab and Michael Konak, advisor to Marian and Tomas. I am very grateful for their readiness and helpfulness.

I am also thankful to my current employer, the Bank for International Settlements in Basel. Namely, my appreciation goes to Marc Klau for the flexibility I was given to proceed with the thesis further and to finalize it.

Next I want to thank Ludmila Nyvltova. Ludmila is a librarian at the Institute of Computer Science of the Academy of Sciences of the Czech Republic and she provided me with extraordinary technical support, thanks to which the list of references is so extensive.

Finally, I would like to express my thanks to my family, particularly to my wife Veronika. Her support was extraordinary and essential for me. And of course, to Mariana for the patience she demonstrated in the first few months of her life.


Contents

1 Introduction

2 Theoretical Background
  2.1 Linear regression
    2.1.1 Ordinary least squares
    2.1.2 Generalized least squares
    2.1.3 Geographically weighted regression
  2.2 Seasonal time series forecasting
    2.2.1 Basic terms and notation
    2.2.2 Box-Jenkins methodology
    2.2.3 Fitting ARMA model
    2.2.4 Seasonality and ARMA model
    2.2.5 Structural time series model
  2.3 Neural networks and time series forecasting
    2.3.1 Feedforward networks
    2.3.2 Recurrent networks
  2.4 Metamodels
    2.4.1 Hybrid models
    2.4.2 Ensembles
    2.4.3 Hierarchical models
  2.5 Neural networks with switching units
    2.5.1 Neural networks with switching units (NNSU)
    2.5.2 Related concepts
  2.6 Currency in circulation
    2.6.1 Stochastic behavior of CIC
    2.6.2 Seasonal factors with impact to CIC
    2.6.3 Forecasting models for CIC

3 Time Series Modeling using NNSU
  3.1 Sequence of NSU - good classifier but poor forecasting tool
  3.2 Turning NNSU into efficient forecasting tool
    3.2.1 Cooperative switching
    3.2.2 Local feedback
    3.2.3 Time dissipation
    3.2.4 Complex topology
    3.2.5 Weightening by quality
  3.3 Neuron with cooperative switching
    3.3.1 Response function
    3.3.2 Training algorithm
  3.4 Neural network with cooperative switching
    3.4.1 Architecture
    3.4.2 Training algorithm
  3.5 Genetic optimization of architectures

4 Forecasting currency in circulation
  4.1 Setup of experiments
    4.1.1 Data
    4.1.2 Optimization by genetic algorithms
    4.1.3 Feasible architectures
  4.2 Results of experiments
    4.2.1 Quality criteria for selection and comparison
    4.2.2 Summary of results
    4.2.3 Discussion - impact study
  4.3 Comparison with conventional models

5 Conclusion

A Detailed results of CIC forecasting experiments

References


Used abbreviations and notation

AR - Autoregressive series
CIC - Currency in circulation (also the series of volumes of currency in circulation)
CRU - Correction unit
CU - Computational unit
GLS - Generalized least squares
GWR - Geographically weighted regression
MA - Moving averages
NNCS - Neural network with cooperative switching
NCS - Neuron with cooperative switching
NNSU - Neural network with switching units
NSU - Neuron with switching unit
OLS - Ordinary least squares
RMSE - Root mean square error
STS - Structural time series
SU - Switching unit

N - positive integers
N_0 - non-negative integers
R - real numbers
R^n - R × R × … × R, n times, where × denotes the Cartesian product
(y_t)_{t=t_1}^{t_2} - time series of a quantity or random variable running from time t_1 to t_2


Chapter 1

Introduction

This thesis has two objectives. First, to investigate and potentially increase the potential of neural networks with switching units in the context of seasonal time series forecasting. And second, to develop a model for the daily series of volumes of currency in circulation.

Neural networks with switching units were introduced in [BvK+95] and originally designed for fast (on-line) data classification. They were successfully applied to the recognition of elementary particle decays [HHK02].

The same basic model was later applied to the series of volumes of currency in circulation (CIC) and compared with conventional stochastic models [HvH05]. The performance of the network with switching units was found to be slightly better than that of the stochastic models, but the difference was not significant. The missing feedback and the discontinuity of the response function were identified as the two main drawbacks of the neural network model.

On the other hand, the training of a simple network with switching units is very fast, as is the calculation of its response. It is therefore possible to employ genetic algorithms for the optimization of its structure.

An enhanced version of the neural network with switching units, designed for time series forecasting, is proposed in this thesis. The neurons used in the network are called neurons with cooperative switching (NCS) - after the behavior of the switching unit, which controls the cooperation between several clusters of the neuron.

The innovative features include the implementation of feedback and the smoothing of the response function, as suggested in [HvH05]. The remaining improvements are inspired by ideas behind other time series models adapted to neural networks with switching units.

The performance of the proposed model, as well as the impact of the individual innovations, has been examined experimentally. Models combining different innovative features were applied to the CIC series. This series was chosen for two reasons. First, it is in compliance with the secondary objective of this thesis, and second, the series is highly seasonal and exhibits a great variety of seasonal patterns.

The results of the experiments were mutually compared to figure out what the impacts of the individual innovations are. Finally, they were compared with the accuracy of other forecasting methods. First with two conventional time series models, namely with an ARMA model and a structural time series model. These models were chosen because they were successfully applied to the same task at the European Central Bank.

Second, the results were compared with the predictions made by an expert at the Czech National Bank who does not use any mathematical model, only his own knowledge and experience. It has to be mentioned that the expert's forecast is a very tough benchmark, as his experience is extensive and, moreover, he can react to unexpected events.

The rest of this thesis is structured as follows. First, chapter 2 provides the necessary theoretical background. Second, chapter 3 rigorously defines the enhanced architecture. Next, the results of the application to forecasting the daily series of currency in circulation are summarized and discussed in chapter 4. Chapter 5 concludes the thesis. Finally, detailed results of the experiments from chapter 4 are reported in appendix A.


Chapter 2

Theoretical Background

To set up a starting point, the most important facts and ideas of related concepts are summarized in this chapter. First, a brief overview of linear regression is given together with a few notes on geographically weighted regression. Next, the common stochastic methods used for time series forecasting are summarized with a focus on ARMA models and their variations. An overview of recent developments of neural network and other connectionist models for time series forecasting follows. An introduction of neural networks with switching units concludes the background summary.

2.1 Linear regression

Although linear regression is quite a simple concept, it is a broadly used one. In fact, it is the simplicity of the model that makes it so popular. The straightforward idea of linear regression helps to interpret the results and also to estimate the model accuracy. Neural networks with switching units try to benefit from this simplicity as well and employ the model as an essential part of the training algorithm.

The linear regression model has the form:

y = Xβ + ε (2.1)

where y ∈ R^m is the vector of observations of the explained variable, X ∈ R^{m,n} is the matrix of explanatory variables (sometimes called the design matrix), β ∈ R^n is the parameter vector of the model and ε is the vector of unobservable disturbances. The disturbances are supposed to be random variables.


In other words, the model describes a variable y by a linear combination of one or more explanatory variables. The coefficients of the linear combination are the parameters of the model. These parameters have to be estimated from available data. Under certain assumptions an estimate with good stochastic properties is the solution of a linear equation, as described in the following section.

2.1.1 Ordinary least squares

Let's assume that a data set (y, X) ∈ R^m × R^{m,n} is given and that ε ∈ R^m is a vector of random variables representing the disturbances in 2.1. Then

β̂_OLS = arg min_{β ∈ R^n} ‖y − Xβ‖    (2.2)

is the ordinary least squares (OLS) estimate of the parameter β. The solution of 2.2 can be found as the solution of a linear equation:

β̂_OLS = (X^T X)^{−1} X^T y    (2.3)

If the following conditions

E(ε_i) = 0, ∀i ∈ 1, …, m
E(ε_i ε_j) = 0, ∀i, j ∈ 1, …, m; i ≠ j    (2.4)
Var(ε_i) = σ², ∀i ∈ 1, …, m

are met, then β̂_OLS is an unbiased and the most efficient linear estimate of β. If the disturbances are normally distributed, then it is the most efficient estimate of β (e.g. [JJM00]).

The last condition in 2.4 is called homoscedasticity. The absence of homoscedasticity is referred to as heteroscedasticity. Unfortunately, heteroscedasticity is much more common than homoscedasticity. If the source of heteroscedasticity is known, a generalized least squares estimate can be used instead of the OLS.

Different methods can be used to verify the linear regression model. Two commonly used indicators of the quality of the fitted model are highlighted here - these are the coefficient of determination R² and its adjusted alternative, defined as follows:

R² = 1 − ‖y − X β̂_OLS‖² / ‖y − E(y)‖²    (2.5)

adjR² = 1 − (1 − R²) (m − 1) / (m − n − 1)    (2.6)


Complete guidance for model verification can be found in any monograph on linear regression.
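To make the formulas above concrete, the following is a minimal numerical sketch (not part of the original thesis): synthetic data and illustrative variable names are used, the OLS estimate 2.3 is obtained with a least squares solver rather than by forming (X^T X)^{−1} explicitly, and the indicators 2.5 and 2.6 are evaluated on the fit.

```python
import numpy as np

# Synthetic data set (y, X): m observations, n explanatory variables (illustrative only).
rng = np.random.default_rng(0)
m, n = 200, 3
X = np.column_stack([np.ones(m), rng.normal(size=(m, n - 1))])   # includes an intercept column
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=m)                 # disturbances ε

# OLS estimate (2.3); lstsq solves the least squares problem in a numerically stable way.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Coefficient of determination (2.5) and its adjusted alternative (2.6).
residuals = y - X @ beta_ols
r2 = 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)
adj_r2 = 1.0 - (1.0 - r2) * (m - 1) / (m - n - 1)
print(beta_ols, r2, adj_r2)
```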

2.1.2 Generalized least squares

Generalized least squares (GLS) is an extension of OLS where the correlation structure of the disturbances is incorporated into the model. Therefore it is capable of dealing not only with heteroscedasticity but also with correlated disturbances.

Let’s assume that the correlation structure of residuals is Σ = E(εεT

). Then

the generalized least squares estimate of β is

βGLS = arg minβ∈Rn

(y −Xβ)T Σ−1(y −Xβ) (2.7)

The solution of 2.7 is

β̂_GLS = (X^T Σ^{−1} X)^{−1} X^T Σ^{−1} y.    (2.8)

As a correlation matrix, Σ is positive definite and hence there exists a lower triangular matrix Σ^{1/2} such that Σ = Σ^{1/2} (Σ^{1/2})^T. It means that the GLS estimate is equivalent to the OLS estimate of β′ in a transformed problem:

S y = S X β′ + S ε    (2.9)

where S = (Σ^{1/2})^{−1}. Therefore the theory of OLS can easily be applied to GLS as well.

Obviously, the knowledge of Σ is crucial for a successful application of GLS. In some cases the source of heteroscedasticity, or of the correlations in general, is known a priori. However, in most cases the correlation matrix is unknown. If the correlation matrix has to be estimated, then one faces the risk that GLS would produce a well-behaving estimate of a completely misspecified model. On the other hand, it is shown that the OLS estimate is consistent even if the disturbances are heteroscedastic. Therefore it is often suggested not to use GLS unless there is significant evidence of heteroscedasticity or correlations.
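A hedged sketch of the whitening idea behind 2.9 follows; it is not from the thesis. Σ is assumed known and, purely for illustration, given an AR(1)-type structure; the problem is transformed with S = (Σ^{1/2})^{−1} and then solved by OLS.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 120
X = np.column_stack([np.ones(m), rng.normal(size=m)])
# Illustrative correlation structure: AR(1)-type covariance with parameter rho.
rho = 0.7
Sigma = rho ** np.abs(np.subtract.outer(np.arange(m), np.arange(m)))
eps = np.linalg.cholesky(Sigma) @ rng.normal(size=m)       # correlated disturbances
y = X @ np.array([0.5, 1.5]) + eps

# GLS as OLS on the transformed problem (2.9): S = (Sigma^{1/2})^{-1}.
S = np.linalg.inv(np.linalg.cholesky(Sigma))
beta_gls, *_ = np.linalg.lstsq(S @ X, S @ y, rcond=None)
print(beta_gls)
```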

2.1.3 Geographically weighted regression

Exactly the same transformation as 2.9 in GLS is used in geographically weighted regression (GWR), which is also known as local regression. GWR is an example of so-called lazy learning techniques. These methods assume a priori that the problem cannot be solved globally.

The common behavior of these methods is to wait until a response is required. Once it receives an "order", the model is trained to provide the best answer for the given input, using the known data patterns from the neighborhood of the current input.

Going back to linear regression, the geographically weighted model is basically a GLS model where the source of heteroscedasticity is assumed to be the structure of the analyzed data.

For a given input x_0 the weights used in the model are derived as

w(x) = K(d(x, x_0)), ∀x ∈ x_1, …, x_m    (2.10)

where x_1, …, x_m are all available inputs, d is a distance function (usually a metric or pseudometric) and K is a kernel function. The kernel function K should have its maximum at zero and decrease to zero with increasing distance. Typical examples of kernel functions are

K(d) = e^{−d²}    (2.11)

K(d) = 1 − |d|  if |d| < 1;  0 otherwise    (2.12)

Different authors use different kernel functions. However, according to a comprehensive summary of issues related to geographically weighted regression [CA97], there is no clear evidence of an advantage of a particular kernel function.

Much more important seems to be the selection of the distance function. It could of course be the Euclidean distance. However, more general and even asymmetric distances are also considered. There are also applications where the distance function is optimized for each input separately. However, to the author's current knowledge, there is no universal guidance for the selection of the globally optimal distance.
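A minimal sketch of the lazy-learning behavior described above, assuming the Gaussian kernel 2.11 and the Euclidean distance; the bandwidth, data and function names are illustrative and not taken from the thesis.

```python
import numpy as np

def gwr_predict(x0, X, y, bandwidth=1.0):
    """Fit a weighted linear model around the query point x0 and predict there."""
    d = np.linalg.norm(X - x0, axis=1) / bandwidth        # distances d(x, x0), cf. 2.10
    w = np.exp(-d ** 2)                                    # Gaussian kernel, cf. 2.11
    Xd = np.column_stack([np.ones(len(X)), X])             # local design matrix
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(Xd * sw[:, None], y * sw, rcond=None)
    return np.concatenate([[1.0], x0]) @ beta

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)      # locally linear, globally non-linear
print(gwr_predict(np.array([1.0]), X, y))
```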

2.2 Seasonal time series forecasting

The aim of this section is to give a brief overview of common models that are used for time series forecasting. The focus is put on ARMA models and state space models, as they are used as benchmark models in chapter 4. Before moving to particular models, the definition of the forecasting problem as well as of the related terms is given.

2.2.1 Basic terms and notation

A time series is a sequence of either values of a quantity or random variables where each element of the sequence is uniquely associated with a distinct time period. Time series where each element is a vector (of fixed dimension) are called multivariate. If the quantity is a scalar or a single random variable, then the series is univariate. Only univariate series are considered in this thesis.

Furthermore, it is assumed in this thesis that all intervals between any two subsequent elements of a time series are of the same length (let's call the length of this interval the sampling frequency). If the sampling frequency and a reference point in time are given, then any time period is defined by an integer, where 0 is associated with the reference point.

The symbol (y_t)_{t=t_s}^{t_e} denotes a time series starting at time t_s ∈ Z and running until t_e ∈ Z (t_e ≥ t_s). If the boundaries are omitted, then the default values are considered: t_s = 0 and t_e = ∞.

Forecasting of a time series is a process during which an estimate of a future value of the time series is derived from current and past observations only. It is further assumed that the series is a series of random variables where only one realization is observed per time period. It means that the aim is to estimate the expected value of the realization of the corresponding random variable.

The forecast of (y_t)_{t=0}^{∞} at time t_0 with horizon k is

ŷ_{t_0+k} = F(x_{t_0}, …, x_{t_0−r}, y_{t_0}, …, y_{t_0−p}, ε_{t_0}, …, ε_{t_0−q})    (2.13)

where F is an arbitrary function representing the forecasting model. Furthermore, for all t, y_t are the observed realizations of the random variables, ŷ_t are the forecasts and ε_t = y_t − ŷ_t are the random disturbances, i.e. the errors of the forecasting model. Finally, for all t, x_t is a vector of exogenous factors (drivers).

Thereafter, both the random variables and their realizations are denoted identically; it means y_t refers either to a random variable or to an observed value. This is to simplify the formulas, as the exact meaning is always clear from the context.

Next, a forecast with horizon 1 is also referred to as a one step ahead prediction.

White noise is an important example of a time series. It is a series of uncorrelated, homoscedastic random variables with zero mean. Formally, a series (e_t)_{t=−∞}^{∞} is called white noise if and only if for all t ∈ Z:

• E(e_t) = 0,

• Var(e_t) = σ²,

• ∀s ∈ Z, s ≠ t ⇒ Cov(e_t, e_s) = 0

Usually, the aim of a forecasting model is to produce white noise residuals, as there is no structure in them that might be used by the model to improve the forecast.

Finally, before moving to the Box-Jenkins methodology, let's adopt the frequently used backshift operator

B y_t = y_{t−1}    (2.14)

which simplifies the notation.

2.2.2 Box-Jenkins methodology

One of the most common and well developed approaches to time series forecasting is the methodology introduced by Box and Jenkins [BJ76]. It focuses on weakly stationary series (or series that might be transformed to weakly stationary series).

Definition 2.2.1. (weakly stationary series) A time series (y_t)_{t=0}^{∞} is weakly stationary if and only if it satisfies the following conditions:

• ∀t ∈ N_0, E(y_t) = µ

• ∀t ∈ N_0, Var(y_t) = σ²

• ∃K ∈ R such that ∀t ∈ N_0, ∀h ∈ N, Cov(y_t, y_{t+h}) = ρ_h and ρ_h < K

The keystone of the methodology is Wold's decomposition theorem (e.g. [JJM00]).

Theorem 2.2.2. (Wold) For any weakly stationary series (y_t)_{t=0}^{∞} with zero mean there exist a number p ∈ Z and sequences (α_i)_{i=1}^{p} and (β_i)_{i=1}^{∞} such that

y_t = Σ_{i=1}^{p} α_i y_{t−i} − Σ_{i=1}^{∞} β_i ε_{t−i} + ε_t    (2.15)

where (ε_t)_{t=0}^{∞} is white noise, Σ_{i=1}^{∞} β_i² < ∞ and Σ_{i=1}^{p} α_i y_{t−i} is an optimal linear predictor of y_t based on y_{t−1}, …, y_{t−p}.


The first sum in 2.15 is an autoregressive series (AR) and the second one is a series of moving averages (MA). The number of summands in each sum is the order of the AR or MA series, respectively. The superposition of AR and MA series is known as an ARMA series.

Obviously, the condition on (β_i)_{i=1}^{∞} in 2.15 implies that lim_{i→∞} β_i = 0. Hence a weakly stationary series can be approximated by an ARMA(p,q) series when taking only the first q ∈ N summands of the MA component in 2.15.

The ARMA(p,q) process has the form

y_t = Σ_{i=1}^{p} α_i y_{t−i} − Σ_{i=1}^{q} β_i ε_{t−i} + ε_t    (2.16)

or, alternatively, using the backshift operator:

Φ(B) y_t = θ_0 + Θ(B) ε_t    (2.17)

where Φ and Θ are polynomials with absolute terms equal to 1.

The Box-Jenkins methodology is a three stage process which is used to find the best ARMA representation of a weakly stationary series. The three stages are: identification, estimation and verification, respectively.

During the first stage the order of the ARMA process is determined by examining the properties of the forecasted series. A useful indicator is the autocorrelation function (ACF)

ACF(h) = E((y_t − µ)(y_{t−h} − µ)) / σ²    (2.18)

where (y_t) is a weakly stationary series with mean µ and variance σ².

A bit more complicated is the definition of another indicator - the partial autocorrelation function (PACF). The partial autocorrelation of a weakly stationary series (y_t) is again a function of the distance h ∈ N between two elements of the series. It is the correlation adjusted for the effect of the elements between the compared pair. Formally, PACF(h) = φ_h where φ_h is the highest coefficient in the regression equation

y_t = c + Σ_{i=1}^{h} φ_i y_{t−i} + ε_t.    (2.19)

ACF and PACF are dual in the sense that ACF(h) of MA(m) is zero for h > m, while the same is true for PACF(h) in the case of AR(m).

Usually, both the ACF and the PACF are unknown, therefore they have to be substituted by their estimates, called the sample ACF/PACF. Using the estimated correlation functions and the knowledge of the typical correlation structure of AR and MA series, the orders of the ARMA process can be determined - or at least guessed.
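As an illustration of the identification stage, a small sketch of the sample ACF 2.18 and of a PACF estimate obtained from the regression 2.19 follows; it is purely illustrative and not code from the thesis.

```python
import numpy as np

def sample_acf(y, max_lag):
    y = np.asarray(y, dtype=float) - np.mean(y)
    denom = y @ y
    return np.array([(y[h:] @ y[:len(y) - h]) / denom for h in range(max_lag + 1)])

def sample_pacf(y, max_lag):
    # PACF(h) is taken as the last coefficient of an AR(h) regression, cf. 2.19.
    y = np.asarray(y, dtype=float)
    out = []
    for h in range(1, max_lag + 1):
        target = y[h:]
        lags = np.column_stack([np.ones(len(target))] +
                               [y[h - i:len(y) - i] for i in range(1, h + 1)])
        phi, *_ = np.linalg.lstsq(lags, target, rcond=None)
        out.append(phi[-1])
    return np.array(out)

rng = np.random.default_rng(3)
e = rng.normal(size=500)
y = np.convolve(e, [1.0, 0.6], mode="valid")        # an MA(1) example series
print(sample_acf(y, 5).round(2), sample_pacf(y, 5).round(2))
```

For an MA(1) series the sample ACF should be close to zero for lags h > 1, while the sample PACF decays gradually - the duality mentioned above.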

Having p and q, the ARMA(p,q) model can be fitted. Due to the MA component the model is not linear and therefore it cannot be fitted using OLS or GLS. Most of the algorithms for ARMA model fitting are based on a gradient optimization of the likelihood function. Selected algorithms are discussed later in section 2.2.3.

Finally, the fitted model has to be verified by checking the residuals (ε_t). If the series is weakly stationary and the correct order of the ARMA process was selected, then the residuals should be white noise. Again, the sample ACF and PACF might be useful to check whether some correlation remains in the residuals. In such a case the order of the ARMA process should be revised and the model fitted again. This process is repeated until the residuals are "white enough".

In a more general case the difference operator might be applied to the forecasted series first. This is necessary when the series is non-stationary - such a model is called an integrated ARMA model (ARIMA). For more details see [BJ76]. For the remaining part of this overview the difference operator is omitted to simplify the notation.

2.2.3 Fitting ARMA model

Fitting an ARMA model is a quite general optimization problem. Therefore the scale of techniques is pretty large. It includes methods based on the Bayesian approach (e.g. [MBR99]), and alternative approaches like evolutionary algorithms have also been examined (e.g. [RSU97]). However, the focus here is on the traditional methods. These are non-linear least squares, likelihood maximization, linear regression, or their combination.

The idea of least squares algorithms can be demonstrated on the algorithm presented in [BJ76]. Given that a finite sample (y_t)_{t=0}^{m} is available, it minimizes the sum of squared residuals (SSR) Σ_{t=0}^{m} ε_t² from 2.16. The algorithm assumes that preliminary estimates of the parameters are available. It uses them to evaluate the initial value of the SSR and then it applies the Marquardt algorithm [Mar63] to minimize it. This algorithm is an iterative procedure where each iteration requires evaluation of the derivatives of the SSR with respect to the ARMA parameters.

In the case of likelihood maximization there are two problems - to evaluate the likelihood function and to maximize it. Both phases are covered in detail, for example, by [Ham94]. There, the likelihood functions for simple AR and MA series are derived first, and then it is shown how to use the Kalman filter to evaluate the likelihood function of a general ARMA series. The maximization of the likelihood is a common optimization problem where steepest ascent methods or other algorithms like Newton-Raphson and its various modifications might be used. These algorithms are again iterative procedures that require evaluation of the derivatives of the likelihood function.

Two points are common to both approaches. First, the numerical algorithms might reach a local minimum. Secondly, if the number of parameters is large, then they are computationally expensive due to the evaluation of the derivatives. Therefore it is important to provide them with a good preliminary estimate. This is where the regression based methods are usually applied.

Three different regression procedures for ARMA estimation are compared in [HM88]. Algorithm 1 is one of them. This algorithm was originally introduced in [Spl83] and it can also be used for fitting the more general model 2.27. This procedure was finally incorporated into the training of the neural network model proposed in chapter 3.

The two other procedures discussed in [HM88] are the Hannan-Rissanen algorithm (some authors use the same name for algorithm 1) and a modification of algorithm 1 where the residuals from the last instead of the current iteration are used in 2.22. For more details see the cited paper.

All three algorithms have the same principle. The estimate is retrieved in two or more steps. First, the residuals are estimated from a simplified (AR) model. Then an iterative procedure is employed to fit the complete model. The second phase could be realized simply by a linear regression, or there might be some adjustments involved.

When the estimation is initiated by fitting an AR model, then the order of the AR series must be determined. In algorithm 1 it is supposed that the order h is provided, and the authors recommend using h = p + q. However, the order of the AR process could be subjected to an optimization procedure (like e.g. in the case of the Hannan-Rissanen algorithm) with respect to the quality of the resulting model.

The convergence of algorithm 1 is not guaranteed. However, when the algorithm converges, the residuals and forecasts are orthogonal. The algorithm tends to diverge particularly in these situations: (1) the problem is over-parameterized, (2) the MA component is almost non-invertible, (3) there are non-stationary variations in the data, (4) the series is too short.
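Algorithm 1 itself is not reproduced in this extract, so the following is only a rough sketch of the shared principle described above: a long AR(h) regression provides preliminary residuals, which are then used in an iterated regression of y_t on its own lags and on lagged residuals. The function names, the fixed iteration count and the simulated series are illustrative assumptions.

```python
import numpy as np

def lagged(x, lags, start):
    # Columns are x_{t-1}, ..., x_{t-lags} for t = start, ..., len(x)-1.
    return np.column_stack([x[start - i:len(x) - i] for i in range(1, lags + 1)])

def fit_arma_by_regression(y, p, q, iterations=5):
    y = np.asarray(y, dtype=float)
    h = p + q                                         # order of the initial AR fit, h = p + q
    ar, *_ = np.linalg.lstsq(lagged(y, h, h), y[h:], rcond=None)
    eps = np.zeros_like(y)
    eps[h:] = y[h:] - lagged(y, h, h) @ ar            # preliminary residual estimates
    start = h + max(p, q)
    for _ in range(iterations):
        X = np.hstack([lagged(y, p, start), lagged(eps, q, start)])
        coef, *_ = np.linalg.lstsq(X, y[start:], rcond=None)
        eps[start:] = y[start:] - X @ coef            # refresh the residual estimates
    return coef[:p], coef[p:]                         # coefficients on lagged y and lagged residuals

rng = np.random.default_rng(4)
e = rng.normal(size=2000)
y = np.zeros(2000)
for t in range(1, 2000):                              # simulate an ARMA(1,1) series
    y[t] = 0.7 * y[t - 1] + e[t] + 0.4 * e[t - 1]
print(fit_arma_by_regression(y, 1, 1))
```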


2.2.4 Seasonality and ARMA model

A time series is supposed to be seasonal if it exhibits periodically repeated patterns. This behavior is quite common in many areas, including economics, social science and industry. Therefore techniques for modeling seasonality became an essential part of forecasting techniques.

The simplest adjustment that might help in many situations is averaging. For example, if the series has monthly seasonality, then monthly averages might be used to estimate the seasonal influence. Authors recommend such techniques particularly in situations where only a small sample of data is available - see e.g. [MW03], where two advanced seasonal adjustment techniques are also presented.

In the context of ARMA series, Box and Jenkins [BJ76] proposed the multiplicative seasonal ARMA model

φ_p(B) Φ_P(B^s) y_t = θ_q(B) Θ_Q(B^s) ε_t    (2.23)

where the power s defines the length of the season (period), the Greek letters represent polynomials with absolute terms equal to one and p, q, P, Q ∈ N are the orders of the model (capital letters refer to the seasonal components).

In fact, 2.23 (sometimes called SARMA) is an ARMA model where some of the parameters are assumed to be zero and some restrictions apply to the remaining non-zero parameters. The motivation is to benefit from the periodical nature of the series and to reduce the number of parameters.

[WP90] highlights that the same applies to an extension of the SARMA model

y_t = T_t + S_t + I_t    (2.24)

where T, S, I are the trend, seasonal and stochastic components respectively, each following a SARMA model. The cited reference compares the component model 2.24 with a pure ARMA model, concluding that the component model is more difficult to estimate and that better results were achieved by the ARMA model for most of the examined data.

An alternative to this purely stochastic approach is to model the seasonal effects as a function of external variables. In combination with an ARMA model, two different concepts are possible.

The first one is an extension of the linear regression model where the residuals are supposed to be an ARMA series instead of white noise. Hence the model is

y_t = Σ_{i=1}^{n} γ_i (x_t)_i + z_t    (2.25)

z_t = Σ_{i=1}^{p} α_i y_{t−i} + Σ_{i=1}^{q} β_i ε_{t−i}    (2.26)

where the same notation is used as in 2.16. The simplest method for the estimation of this model is to apply OLS to 2.25 and fit an ARMA process to the residuals. Finally, GLS with the correlation matrix given by the estimated ARMA term is applied and new residuals are obtained. The last two steps are iterated until the estimates stabilize. For more details see e.g. [Pie71] or [BH83].
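A compact sketch of the iterative scheme just described, simplified (as an assumption, not the thesis setup) to AR(1) disturbances and monthly indicator regressors: OLS first, an AR model fitted to the residuals, then GLS via the quasi-differencing transformation implied by the estimated AR term, repeated until the estimates stabilize.

```python
import numpy as np

rng = np.random.default_rng(5)
T, s = 600, 12
season = np.arange(T) % s
X = np.eye(s)[season]                           # seasonal indicator regressors
gamma_true = np.linspace(-1.0, 1.0, s)
z = np.zeros(T)
for t in range(1, T):                           # AR(1) disturbances z_t
    z[t] = 0.6 * z[t - 1] + rng.normal(scale=0.5)
y = X @ gamma_true + z

gamma, *_ = np.linalg.lstsq(X, y, rcond=None)   # step 1: plain OLS on 2.25
for _ in range(20):
    resid = y - X @ gamma
    alpha = (resid[1:] @ resid[:-1]) / (resid[:-1] @ resid[:-1])   # step 2: AR(1) on residuals
    y_star = y[1:] - alpha * y[:-1]             # step 3: GLS via quasi-differencing
    X_star = X[1:] - alpha * X[:-1]
    gamma_new, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)
    if np.max(np.abs(gamma_new - gamma)) < 1e-8:
        break
    gamma = gamma_new
print(np.round(gamma, 2))
```

For AR(1) disturbances this iteration is the classical Cochrane-Orcutt procedure; with a general ARMA term, the GLS step uses the full correlation matrix as described above.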

The second possibility is to extend the ARMA model by an exogenous term

y_t = Σ_{i=1}^{p} α_i y_{t−i} + Σ_{i=1}^{q} β_i ε_{t−i} + Σ_{i=1}^{n} γ_i (x_t)_i + ε_t    (2.27)

where γ ∈ R^n is a vector of parameters that applies to the exogenous variables. It was already mentioned that this model might be estimated by algorithm 1 or by algorithms based on the Hannan-Rissanen method (e.g. [HG90]).

In both cases the exogenous variables might be of a different nature. The most common are indicator functions (1 if the season occurs, 0 otherwise), the ordinal number of a period within the season, or trigonometric functions of the former. In general, the exogenous variables in both models 2.25 and 2.27 do not have to be related to seasonality modelling. However, such a general case is out of the scope of this thesis.

2.2.5 Structural time series model

Formula 2.16 is an intuitive representation of Box-Jenkins ARMA models. However, it is very specific and it can hardly capture the variability of time series models. An alternative to 2.16 and other model driven representations is the general state space representation, which (in the multivariate case) has the form

y_t = A′ x_t + H′ ξ_t + ε_t    (2.28)
ξ_{t+1} = F ξ_t + ϑ_t    (2.29)


where for a given time t ∈ N_0: y_t ∈ R^k is the vector of observed data, x_t ∈ R^n is the vector of exogenous variables, ξ_t ∈ R^r is the (in general unobserved) state vector, and A′ ∈ R^{k,n}, H′ ∈ R^{k,r} and F ∈ R^{r,r} are matrices of parameters. Finally, (ε_t)_{t=0}^{∞} and (ϑ_t)_{t=0}^{∞} are white noise series.

Equation 2.28 is known as the observation equation and 2.29 as the state equation. The vector x_t is exogenous in the sense that there is no relation between it and the state vector ξ.

This concept is quite popular, particularly due to the Kalman filter algorithm, which is used for model estimation. This recurrent algorithm optimizes the least squares error of the estimate of the state vector at time t based on the data observed through time t. For more details see for example [Ham94]. In many cases the Kalman filter is an interesting alternative to other estimation methods.
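A minimal Kalman filter sketch for the state space form 2.28-2.29 follows, restricted to scalar observations and no exogenous variables; the matrix names mirror the equations above, while the noise variances and the local level example are illustrative assumptions.

```python
import numpy as np

def kalman_filter(y, F, H, Q, R, xi0, P0):
    """One-step-ahead filtering for y_t = H' xi_t + eps_t, xi_{t+1} = F xi_t + theta_t."""
    xi, P = xi0.copy(), P0.copy()
    forecasts = []
    for yt in y:
        y_hat = H @ xi                        # forecast of the observation
        forecasts.append(y_hat)
        S = H @ P @ H + R                     # innovation variance
        K = P @ H / S                         # Kalman gain
        xi = xi + K * (yt - y_hat)            # update the state estimate
        P = P - np.outer(K, H @ P)            # update its covariance
        xi = F @ xi                           # predict the next state
        P = F @ P @ F.T + Q
    return np.array(forecasts)

# Illustrative special case: a local level model, where xi_t is a random-walk level.
F = np.array([[1.0]]); H = np.array([1.0])
Q = np.array([[0.01]]); R = 0.1
rng = np.random.default_rng(6)
level = np.cumsum(rng.normal(scale=0.1, size=200))
y = level + rng.normal(scale=0.3, size=200)
one_step_ahead = kalman_filter(y, F, H, Q, R, np.zeros(1), np.eye(1))
print(np.mean((one_step_ahead - y) ** 2))     # mean squared one-step-ahead error
```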

The state space representation is very flexible and it might be used for a large scale of models. This also includes models from the family of structural time series (STS) models. An example of an STS model is the one used in [CCMH02] for forecasting the volume of currency in circulation. The model is defined by the following set of formulas:

y_t = µ_t + γ_t + ε_t    (2.30)
µ_t = β_{t−1} + µ_{t−1} + η_t    (2.31)
β_t = β_{t−1} + ν_t    (2.32)
γ_t = Σ_{i=1}^{k} z_t^i δ_t^i    (2.33)
δ_t^i = δ_{t−1}^i + χ_t^i, ∀i ∈ 1, …, k    (2.34)

where for all t ∈ N_0, µ_t is the stochastic trend component, γ_t is the superposition of k ∈ N stochastic seasonal components and β_t ∈ R is the trend level. Furthermore, for all i ∈ 1, …, k, z_t^i ∈ R^{g_i} are fixed seasonal vectors such that the sum of z_t^i over the seasonal range for a given i is equal to the zero vector. Finally, (ε_t)_{t=0}^{∞}, (η_t)_{t=0}^{∞}, (ν_t)_{t=0}^{∞} and (χ_t^i)_{t=0}^{∞} for all i ∈ 1, …, k are white noise series. Moreover, it is supposed that (η_t)_{t=0}^{∞} and (ν_t)_{t=0}^{∞} are series of normally distributed variables.

Equation 2.30 defines the decomposition of the series into trend, seasonal and noise components. Both the trend and the seasonal component are supposed to be stochastic in general. The trend is further defined by equations 2.31 and 2.32 as a locally linear trend. The character of the seasonal components depends on the vectors z. In the already cited [CCMH02] some of the seasons are modeled via periodic cubic splines - for the rest of the seasons the vector z is simply a vector of indicators that sums up to zero.

Cubic splines are functions with continuous first two derivatives, piecewise composed of polynomials of degree 3. The points where one polynomial changes to another are called knots. The selection of the number of knots and their positions is crucial for the model performance. This usually requires an analysis of the series and iterative adjustments based on visual analysis of the residuals.

The details about the model fitting might be found in [CCMH02] or in [Sen06], where an STS model for the series of currency in circulation in the Czech Republic is derived.

It should be stated that the above brief overview of frequently used time series models is far from complete. There exist many modifications of the models mentioned, as well as completely different models. Examples are GARCH models or Markov switching models (e.g. [Ham94]).

2.3 Neural networks and time series forecasting

The discussed models are divided into three categories: feedforward networks, recurrent networks and metamodels. Feedforward networks process each input pattern separately, starting from the input neurons towards the output neurons. Recurrent networks take into account the time order of the data. They calculate the response using the forecasting errors of preceding input patterns together with the current input. Finally, metamodels are combinations of neural network models and other modeling concepts.

The first experiments with neural network applications to time series modeling started in the late 80's ([LF87], [DS88]). Then the universal approximation capabilities of a feedforward neural network were proven ([HSW89]) and more experiments followed afterwards (e.g. [WHR90], [TF93]). Later, recurrent networks were involved (e.g. [CMAI94]) to achieve higher accuracy with regard to stochastic components. The basic idea was that the feedback in a neural network is what a moving average component is in an ARMA model. However, recurrent networks were introduced earlier, in the context of natural language processing ([Jor86], [Elm90]). With the enormous growth of computational power, neural networks are more often used as one of the building blocks of more complex models. For the purposes of this overview such models are called metamodels. These models combine one or more neural networks with other methods to compensate for the disadvantages of the particular models.


Before moving to particular models, it is worth defining two basic terms - topology and architecture. The way these terms are used in the connectionist literature differs paper by paper. In this thesis, the topology refers to the structure of the model only. The architecture is more general and the topology is only a part of the architecture. In addition to the topology, the architecture defines other parameters and the dynamics of the model.

2.3.1 Feedforward networks

The two most common feedforward neural networks are the multilayer perceptron (MLP) and the radial basis function network (RBF). The main difference between the models is that MLP networks deal with hyperplanes, while in the case of RBF networks hyperspheres are used instead. For more details about basic feedforward architectures see for example [Bis95] or [Hay94]. In the context of time-series prediction both models are usually considered as non-linear generalizations of autoregressive models from the Box-Jenkins methodology ([BJ76]).

The first studies investigating the MLP as a time series modeling tool usually compare the performance of the MLP with stochastic methods. The comparisons are based on various problems. However, the MLP in combination with backpropagation, both in their basic forms, is usually applied. Later, the drawbacks and pitfalls of backpropagation gave an impulse to investigate possible improvements or to study other training algorithms. Also, several modifications of the basic architecture were proposed in the last decades.

The output of an MLP with n inputs, one hidden layer and a single output neuron has the form:

y = Σ_{i=1}^{m} v_i f_i(w_i′ x + w_{i,0})    (2.35)

where m is the number of hidden units; v_1, …, v_m ∈ R are the weights of the connections between the hidden and output layer; f_i are activation functions of a sigmoidal type and w_i ∈ R^n are the weights between the input layer and the i-th hidden neuron.
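A hedged sketch of the model 2.35 used as a one step ahead forecaster with lagged values as inputs, trained by plain gradient descent with backpropagated gradients; the toy series, the lag order, the learning rate and the initialization are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
t = np.arange(600)
series = np.sin(2 * np.pi * t / 12) + 0.1 * rng.normal(size=600)    # toy seasonal series

p, m, lr = 12, 8, 0.05                       # lag order, hidden units, learning rate
X = np.column_stack([series[p - i - 1:len(series) - i - 1] for i in range(p)])   # lags 1..p
y = series[p:]

W = rng.normal(scale=0.3, size=(m, p)); b = np.zeros(m)   # input -> hidden weights w_i, w_{i,0}
v = rng.normal(scale=0.3, size=m)                          # hidden -> output weights v_i

for _ in range(2000):                        # full-batch gradient descent with backpropagation
    h = np.tanh(X @ W.T + b)                 # sigmoidal hidden activations f_i
    y_hat = h @ v                            # network output, cf. 2.35
    err = y_hat - y
    grad_h = np.outer(err, v) * (1 - h ** 2)                # error backpropagated to hidden layer
    v -= lr * h.T @ err / len(y)
    W -= lr * (grad_h.T @ X) / len(y)
    b -= lr * grad_h.mean(axis=0)

print(float(np.mean(err ** 2)))              # in-sample mean squared error
```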

Originally, gradient descent algorithms were used for the training of the model 2.35. These algorithms are used in combination with the famous backpropagation method for the estimation of the gradients during the optimization. The main drawback of the gradient optimization is that it might converge to a local instead of the global minimum. The results are influenced by the initial values of the weights and also by the learning rate.

Both issues are broadly studied. For example, [WCL00] proposes the use of linear regression for the initialization of the weights between the hidden and output layer of an MLP. The estimate is obtained using the Taylor approximation of the sigmoid and generalized least squares. The authors found that the method does not improve the optimization process when the popular steepest descent gradient optimization is used. However, it gives a good starting point for optimization based on conjugate gradients. The second issue is dealt with, for example, in [LSK07]. The paper introduces a gradient algorithm with an optimized learning rate. The algorithm is designed for online learning and also involves the weighting of the inputs with respect to time - the errors of more recent observations play a more important role than the older observations. Moreover, the time weights are flexible and they are adapted during the optimization.

An alternative to the gradient optimization of an error function is the Bayesian approach. Algorithms based on this approach search for distributions of the weights, not only for optimal (expected) values of the weights. A typical algorithm based on this framework is the Extended Kalman Filter (EKF) applied to the state space representation of the network weights, introduced in [SW89]. The advantage of the EKF, from the point of view of time series modeling, is that it is an on-line algorithm. It means that the weights are adjusted with respect to individual observations and therefore the model might be easily updated whenever a new observation is available. In addition, the EKF deals with an approximation of second order derivatives, while only first order derivatives are involved in the backpropagation algorithm. For more information about Kalman filters and neural networks see e.g. [Hay01].

However, algorithms based on the Kalman filter have several drawbacks too. Paper [BT00] points out that the Gaussian approximation to the Bayesian solution used in the EKF is not centered about the modes of the weight distributions, which might cause problems in the case of highly non-linear mappings. As an alternative, a sequential learning algorithm based on posterior modes and means is proposed. Another example of the Bayesian approach is given in [ARM+03]. The Bayesian algorithm is combined with a GLM for the training of the model:

y = α_0 + α′ x + Σ_{i=1}^{m} v_i f_i(w_i′ x + w_{i,0})

where α_0 ∈ R and α ∈ R^n are the parameters of the linear model. The least squares algorithm is used to estimate the initial parameters of the linear model. The combination of a linear model and a non-linear neural network helps to keep the model simple if the problem is linear, or to reduce its complexity if a non-linear component is present in the data.


A similar combination of a linear and a non-linear neural network model is tested in [LW01]. In this paper the activations of the neurons in 2.35 are replaced by a Bernoulli random variable with mean defined by the parameters and inputs of the neuron. In addition, the weights between the hidden and output layer are linear functions of the network inputs:

y = α_0 + α′ x + Σ_{i=1}^{m} (v_i′ x + v_{i,0}) I(f_i(w_i′ x + w_{i,0}))

where I(s) is a Bernoulli random variable with mean s. The softening of the activation using a Bernoulli random variable is motivated by the hierarchical mixture of experts ([JR94]). The network is trained using the EM algorithm, which is computationally less complex than a gradient optimization.

Apart from the training algorithms, the methods of construction of neural network models are also studied in the context of time series forecasting. For example, a three stage procedure is proposed and tested in [LF05], a statistical framework is set up in [MTR02] and the noise injection method is tested on time series data e.g. in [CGD].

The second broadly used type of feedforward neural networks are RBF networks, with the hidden layer built of radial neurons. These neurons are defined by centers and diameters or scaling factors. The activation function of a neuron is then a function of the distance of the input vector from the center of the neuron. Mostly the Gaussian function is used as the activation function. Once an input vector is processed by the hidden neurons, the linear combination of the responses is calculated by the output layer. The linear combination might be a weighted sum or a weighted average, where the latter is thought to be more accurate for time series forecasting (as discussed e.g. in [RPB+02]). The mathematical representation of a simple RBF network with n hidden units with a Gaussian activation function has the form:

y = Σ_{i=1}^{n} w_i Φ_i(x),    Φ_i(x) = exp(−‖x − c_i‖² / σ_i)    (2.36)

where c_i, σ_i and w_i are the center, scaling factor and weight of hidden unit i, respectively, for any i ∈ 1 … n.
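A minimal sketch of 2.36 follows: the centers are picked by a plain k-means step, the scaling factors are set from the distances between centers, and the output weights are obtained by linear least squares. All of these choices (and the toy data) are illustrative assumptions, not the procedure used in the thesis.

```python
import numpy as np

def fit_rbf(X, y, n_units=10, n_iter=20, seed=8):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_units, replace=False)].copy()
    for _ in range(n_iter):                                    # plain k-means for the centers
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        for j in range(n_units):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    d2 = ((centers[:, None, :] - centers) ** 2).sum(-1)        # scaling factors: squared distance
    np.fill_diagonal(d2, np.inf)                               #   to the nearest other center
    sigma = d2.min(axis=1)
    Phi = np.exp(-((X[:, None, :] - centers) ** 2).sum(-1) / sigma)   # hidden responses, cf. 2.36
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)                # linear output weights
    return centers, sigma, w

rng = np.random.default_rng(9)
X = rng.uniform(-2, 2, size=(400, 2))
y = np.sin(X[:, 0]) * np.cos(X[:, 1])
centers, sigma, w = fit_rbf(X, y)
Phi = np.exp(-((X[:, None, :] - centers) ** 2).sum(-1) / sigma)
print(np.mean((Phi @ w - y) ** 2))                             # in-sample mean squared error
```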

The aims of the training algorithms of RBF networks are to find the best centers and to optimize the weights of the linear combination. Unsupervised algorithms like k-means clustering might be used for the identification of the centers, while the weights are mostly optimized by supervised methods. However, for given centers the weight optimization is a linear problem - it means the positioning of the centers is crucial for the network performance. Therefore most of the attention is paid to the identification of the centers and it is recommended to use the target data (expected outputs) as well (e.g. [WD92]). This is solved, for example, by [Cio02] using two parallel EKFs estimating the centers and weights respectively, or in [CCM96] where an OLS algorithm is used for the optimization of the centers.

The latter cited paper [CCM96] is also interesting for another reason. It deals with the gradient RBF network (GRBF), particularly designed for time-series forecasting. Hidden neurons in this type of network process differences of the series rather than the series itself. This idea is driven by the fact that in the case of a non-stationary series (which includes trends) the current value of the series might be far away from the centers of the hidden neurons. So if (y_t)_{t=−∞}^{∞} is a time series and

x_t = (y_{t−1}, y_{t−2}, …, y_{t−p})    (2.37)

then the response of a first order GRBF network is

y_t = Σ_{i=1}^{n} w_i Φ_i(∆x_t) × (y_{t−1} + δ_i)    (2.38)

where δ_i is an additional parameter associated with the i-th hidden unit. This parameter might be considered as the one step ahead forecast of the time series difference for input vectors close to the center c_i. The value of δ_i is set to ∆y_{t_0} given that c_i = ∆x_{t_0}. Note that in [CCM96] the diameters of the hidden units were assumed to be constant. The GRBF might also be of higher orders, and seasonal GRBF networks were studied in [WWT06].

These networks also deal with seasonal differences of the inputs (∆_s x_t = x_t − x_{t−s}). However, the hidden units are standard RBF neurons (without the term (y_{t−1} + δ_i)). It means that the model is more or less an RBF network with input preprocessing. The model should have a better forecasting performance when applied to a seasonal time series. The limitation mentioned by the authors is that the parameters of a network are fixed once it is trained. Therefore it might not be so efficient in the long term when the series has non-stationary variations.

Another approach to improving the approximation capabilities of RBF networks is presented in [RPB+02]. The hidden units in the proposed PG-RBF network have a pseudo-Gaussian (PG) activation function. The PG function has two scaling parameters instead of one, as in the case of a standard RBF network. Therefore the activation functions are asymmetric in general. In addition, the weights between the hidden units and the output layer are linear functions of the input variables instead of being constant. Several experiments have shown that the proposed modifications lead to a reduction of the number of hidden units.

Finally, the paper introduced a sequential training process with a pruning ability. When an input fulfills a novelty criterion, a new unit is added, and simultaneously, if a unit fails one of several tests of relevance, it is removed from the network. The tests of relevance simply disqualify units with a too small radius or with an insignificant response to any input, and also one of two (or more) highly correlated units is removed.

A problem of RBF network application might be the high dimension of the data. It is quite common that irrelevant variables are present in real-world data and they cannot be identified a priori. This might dramatically increase the dimension of the data and then RBF networks might be useless. To deal with this problem, [HK91] suggest using a combination of radial and sigmoidal units. The semilocal unit with activation function

f(x) = Σ_i w_i e^{−(1/2) ((x_i − µ_i) / σ_i)²}    (2.39)

is proposed. The advantage of the semilocal unit is that it has a better pruning ability than an RBF unit, while the training of the network is not significantly slowed down. Experiments confirmed that the performance of semilocal units is better, particularly in the case that irrelevant inputs are present in the data.

2.3.2 Recurrent networks

It is obvious that the time dimension is very important when working with time series data. It might be wrong to think about it as a categorical or continuous variable. The explicit definition of the moment in time is only one part of the information carried by the time dimension. More important is that it allows the development of models which implicitly deal with the data ordering using a sort of memory.

The memory in neural networks is implemented by recurrent (feedback) connections. The first recurrent neural networks used for time series data processing were introduced by Jordan [Jor86] and Elman [Elm90]. Both are multilayer perceptrons with an additional layer of context neurons and recurrent connections.

Figure 2.1: Topology of Jordan network with single hidden layer

In the case of the Jordan network the inputs of the context layer are the outputs of the network (fig. 2.1), while in the case of the Elman network the inputs of the context layer are the outputs of the hidden layer (fig. 2.2). In both models the context layer passes these inputs in the next time interval to the hidden layer. It means that the hidden neurons see the previous output of the network or their own previous outputs. The backpropagation algorithm might be used for training these recurrent alternatives of the MLP. The only difference is that the inputs must always be presented in the time sequential order.

Figure 2.2: Topology of Elman network with single hidden layer
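A rough sketch of an Elman-style network used for one step ahead forecasting follows. The training loop presents the inputs in time order, as required above, and uses the simplest variant in which gradients are not propagated through the stored context copy; the series, sizes and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(10)
T = 400
series = np.sin(2 * np.pi * np.arange(T) / 20) + 0.05 * rng.normal(size=T)

n_hidden, lr = 10, 0.05
Wx = rng.normal(scale=0.3, size=n_hidden)               # input -> hidden weights
Wc = rng.normal(scale=0.3, size=(n_hidden, n_hidden))   # context -> hidden (recurrent copy)
b = np.zeros(n_hidden)
v = rng.normal(scale=0.3, size=n_hidden)                # hidden -> output weights

for _ in range(200):                                    # epochs; inputs presented in time order
    context = np.zeros(n_hidden)
    sq_err = 0.0
    for t in range(T - 1):
        h = np.tanh(Wx * series[t] + Wc @ context + b)  # hidden layer sees input and context
        y_hat = v @ h                                   # one step ahead forecast of series[t+1]
        err = y_hat - series[t + 1]
        sq_err += err ** 2
        grad_h = err * v * (1 - h ** 2)                 # gradient stops at the context copy
        v -= lr * err * h
        Wx -= lr * grad_h * series[t]
        Wc -= lr * np.outer(grad_h, context)
        b -= lr * grad_h
        context = h                                     # context neurons store the hidden state
print(sq_err / (T - 1))                                 # mean squared error of the last epoch
```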

The work of Elman and Jordan was primarily focused on speech recognition. However, it was pointed out later that these networks are a non-linear alternative to ARMA models and that they are more accurate than feedforward networks if an MA component is present in the data. A quite comprehensive overview of known neural network models applied to time series analysis is given in [Dor96]. The paper shows the parallel between Jordan networks and ARMA models and between Elman networks and state-space models. In addition, it also covers multi-feedback networks, which combine several types of feedback connections.

Theoretical properties of simple ARMA(1,1) recurrent neural network arestudied in [TLH99] where the conditions for ergodicity and stationarity are in-vestigated. The model is equivalent to a simple Jordan neural network with addi-tional shortcut connections. It is shown that under certain natural conditions theergodicity and stationarity of the network is given by the weight corresponding toautoregressive part of shortcut connection (connection between input and outputlayer).

For an experimental comparison of stochastic models, feedforward and recurrent neural networks see e.g. [HG95], [Bal97], or [HD98]. In [HD98] the models were tested against artificial processes including series generated by the tested neural networks. It was shown that only autoregressive feedforward networks are able to optimally estimate their own time series. In general the performance of recurrent neural networks was worse than that of feedforward networks in most of the cases. Gradient descent learning was pointed out as the main limitation of recurrent networks and the reason for such poor results.

Algorithms based on the Kalman filter were mostly studied as an alternative to gradient descent optimization. A hierarchical algorithm for identification and optimization of a neural network based on the EKF is introduced in [TB92], for example. However, at the same time other authors (e.g. [Wil92]) pointed out that the EKF is computationally expensive. In the cited paper the relations between the EKF and gradient descent algorithms are studied with the aim of identifying the advantages of the EKF that might be implemented using other training methods.

Various simplifications and modifications of training algorithms using Kalman filtering were suggested to overcome the problem of computational complexity. For example, [dSN99] tests an algorithm where the Kalman filter is applied locally for each neuron regardless of the information related to errors in estimates of other weights. The simplified version of the algorithm achieved results comparable to more complex and sophisticated Kalman filtering schemes.

Other modifications and improved training algorithms based on Kalman filtering are investigated in [GHDRD01] and [WdJRX05]. The first paper examines the applicability of the unscented Kalman filter while the second one deals with the risk-sensitive Kalman filter. The unscented Kalman filter is a generalization of the Kalman filter for non-linear systems which approximates the random variable us-


ing a discrete probability distribution - unlike the EKF where a first order approximation of the non-linear system is used instead. The difference between the EKF and the risk-sensitive KF is that the covariance of the state space vector is multiplied by the term (1 + δ) in the case of the risk-sensitive variant. The risk-sensitive factor δ helps to control the stability of the training algorithm. It is suggested that it should be a decreasing function of errors, like the learning rate of gradient descent algorithms.

Recurrent architectures based on the MLP in combination with a Kalman filtering learning algorithm became quite popular and they are probably the most studied recurrent neural network models in the context of time series forecasting. The reasons are probably the straightforward implementation of all possible feedback connections and promising results.

A special case of recurrent networks are networks with local feedback (e.g. [FGS92]). The local feedback is usually combined with the application of finite impulse response (FIR) or infinite impulse response (IIR) filters to network connections (e.g. [KLSK96]). A more extensive comparison of local feedback MLP networks is given in [CUPR99] where four different topologies with local activation and response feedback are compared to feedforward networks. The performed experiments indicated that architectures with local feedback are superior to feedforward models where only FIR or IIR filters are applied.

An interesting architecture with local feedback is proposed in [ZC00] where the concept of recurrent neural networks is combined with the Fourier transform. The Fourier recurrent network is designed to decompose a time series into different frequency components. Due to the link to Fourier analysis the weights of the model are complex numbers, therefore the paper also proposes a modification of the training algorithm for complex weights.

Recurrent networks based on MLP architectures have been discussed up to this point. The definition of a recurrent alternative to RBF neural networks in the context of time series analysis is less natural than in the case of the MLP. According to the author's current knowledge, the application of recurrent RBF architectures to time series forecasting is far less frequent and limited to networks with local feedback.

Zemouri et al. ([ZRZ03]) for example proposed an RBF network where the dynamics is realized via a layer of sigmoidal neurons with local feedback. The sigmoidal neurons form the input layer of the network and they are connected into a cascade. The first input neuron has a linear activation function. The activation


function of the other neurons in the input layer is given by the formula

y(t) = f\left(w\, y(t-1) + x(t)\right) \qquad (2.40)

where x(t) is the input of the neuron at time t, y(t) is its output at time t, w is the weight of the loop connection and f is the sigmoidal function. Once the output of the input layer is calculated it is propagated to the standard RBF network. This model was for example applied by Gowrishankar and Satyanarayana to the prediction of the bit error rate of a wireless network [GS08].
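The following hedged Python sketch (numpy only; the wiring, the toy series and all parameter values are illustrative assumptions rather than the exact architecture of [ZRZ03]) shows the dynamic preprocessing of eq. (2.40) followed by a standard Gaussian RBF layer:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_feedback_unit(x_seq, w, y0=0.0):
    """Locally recurrent sigmoidal unit of eq. (2.40):
    y(t) = f(w * y(t-1) + x(t)), evaluated over a whole input sequence."""
    y, out = y0, []
    for x in x_seq:
        y = sigmoid(w * y + x)
        out.append(y)
    return np.array(out)

def rbf_layer(z, centers, widths):
    """Gaussian RBF layer applied to the (scalar) outputs of the input layer."""
    return np.exp(-((z[:, None] - centers[None, :]) ** 2) / (2.0 * widths ** 2))

x_seq = np.sin(np.linspace(0, 6 * np.pi, 50))       # toy input series
z = local_feedback_unit(x_seq, w=0.8)               # dynamic preprocessing
h = rbf_layer(z, centers=np.linspace(0, 1, 5), widths=np.full(5, 0.2))
print(h.shape)                                      # (50, 5) hidden activations
```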

Note that this architecture is different from the architecture studied for example in [FCMS96] as a representation of finite state automata. In that case the RBF network is combined with an additional layer of sigmoidal neurons too. However, the layer of sigmoidal neurons is between the hidden and output layer and the feedback connection propagates the output of the sigmoidal units to the RBF layer - hence the feedback is not local.

There are also papers that classify architectures with time delay as recurrent networks (e.g. [mC02]). However, according to the terminology adopted by this thesis, models where lagged values of the processed series are used as model inputs are considered feedforward models.

2.4 Metamodels

If you ask for advice it is usually better not to ask a single person but to compose your own decision based on the opinion of several advisors. And this is the main principle of combining models. The same idea is formulated more rigorously in [Arm01] in the context of forecasting:

"Instead of trying to choose the single best method, one frames the problem by asking which methods would help to improve accuracy, assuming that each has something to contribute. Many things affect the forecasts and these might be captured by using alternative approaches. Combining can reduce errors from faulty assumptions, bias, or mistakes in data."

The paper gives a comprehensive summary of model combining and related topics and also several tips based on many applications and experiments. It is for example recommended to combine forecasts when it is important to avoid large errors and to use at least five different forecasts when possible.


The increase in computing capacity motivated researchers to employ single neural networks as building blocks of more complex models too. Currently, neural networks are not only used as standalone models anymore but they are also combined with various stochastic models, or several neural networks are encapsulated into a single model.

Recently, the term metamodel has often been used for any model which consists of several submodels of the same or different types with a flat or more structured hierarchy. Historically, three special cases of metamodels might be recognized: hybrid, ensemble and hierarchical models. Usually, combinations of models are called hybrid models if several different concepts are involved but the output is simply derived as a weighted average of sub-models. If all the sub-models are of the same type and a weighted average or a nonlinear mapping is used to combine them, then the technique is referred to as ensemble modeling. Finally, hierarchical models are those models where submodels of the same or different types are connected into a sequence or a more complex structure.

This terminology is not very strict and there is a lot of overlap in it, and there are also other names used for similar concepts, like for example mixture of experts (which might be considered a hierarchical model according to the specification above). Nevertheless, it is beyond the scope of this thesis to fill in the gaps in terminology and for the purposes of this literature overview the above intuitive description should be sufficient.

2.4.1 Hybrid models

A typical example of a hybrid model combining ARIMA and an RBF network was introduced in [WC96]. Both models are trained as independent models and their outputs are combined with respect to the confidence in the RBF accuracy. The certainty for each input x is measured by a recursively defined certainty factor:

CF_N(x) = CF_{N-1}(x) + \left(1 - CF_{N-1}(x)\right)\varphi(\nu_N) \qquad (2.41)

where N is the number of hidden units, φ is the RBF function and ν_1, . . . , ν_N is the sequence of distances of x from the centers in descending order. The forecast of the network was compiled using the CF together with the outputs of the RBF network and the ARIMA model. Three different compilation strategies were tested and it was concluded that the following one achieved the highest accuracy:

y_t = \begin{cases} \dfrac{RBF(x_t) + ARIMA(x_t)}{2} & \text{if } CF(x_t) \ge \mu, \\ ARIMA(x_t) & \text{if } CF(x_t) < \mu, \end{cases} \qquad (2.42)


dominating the weighted averaging with weights equal to CF and 1 − CF, respectively.
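A minimal Python sketch of the certainty-factor combination (2.41)-(2.42) follows; the Gaussian choice of φ, the initial value CF_0 = 0 and the toy numbers are assumptions made for the sake of the example, not details taken from [WC96]:

```python
import numpy as np

def certainty_factor(x, centers, width):
    """Certainty factor of eq. (2.41) with a Gaussian phi and CF_0 = 0.
    Distances of x from the hidden-unit centers are taken in descending order,
    so CF approaches 1 when x lies close to at least one center."""
    phi = lambda d: np.exp(-d ** 2 / (2.0 * width ** 2))
    dists = np.sort(np.linalg.norm(centers - x, axis=1))[::-1]
    cf = 0.0
    for d in dists:
        cf = cf + (1.0 - cf) * phi(d)
    return cf

def combine(rbf_forecast, arima_forecast, cf, mu=0.5):
    """Compilation rule of eq. (2.42): trust the RBF only where CF >= mu."""
    if cf >= mu:
        return 0.5 * (rbf_forecast + arima_forecast)
    return arima_forecast

centers = np.array([[0.0, 0.0], [1.0, 1.0]])            # toy RBF centers
x = np.array([0.1, -0.1])
cf = certainty_factor(x, centers, width=0.5)
print(cf, combine(rbf_forecast=1.2, arima_forecast=1.0, cf=cf))
```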

A similar concept is used in [LYWZ06] where an MLP is combined with exponential smoothing. The aim of this approach is to combine linear and non-linear effects to achieve a synergy of both approaches. Both the exponential smoothing and neural network models are fitted separately, and the response of the resulting model is a weighted average of the individual responses. The weights are set with respect to the absolute forecasting error using linear programming.

Another combination of linear and non-linear approaches is introduced in [FK07]. The paper proposes a simple hierarchical model which consists of an AR model and a feedforward backpropagation network. The output of the AR model is used as an input of the network, which is an MLP with a cascaded hidden layer. The remaining inputs are lagged values of the modeled series. In this case the AR model is employed as a preprocessor with the aim of capturing the linear structure of the data.

A more complex model hybridizing a neural network, an ARIMA model, and a genetically optimized fuzzy system was proposed in [VRR+08]. The fuzzy system, composed of 38 rules with weights optimized by the genetic algorithm, is used to identify the ARIMA model configuration. Residuals of the ARIMA model are processed by an MLP and the outputs of the ARIMA model and the MLP are finally combined. This model achieved significantly better performance than stochastic or neural network models when applied to the Lorenz series. It forecasted the series with RMSE equal to 0.0027 while the single MLP and RBF networks achieved RMSE around 0.02 and 0.0114, respectively.

2.4.2 Ensembles

In all the hybrid models only two or three submodels were combined. This is not the case for ensemble models where usually ten or more sub-models are used. The basic idea of ensemble modeling is to build a reasonably large number of accurate but well diversified models and to combine them efficiently.

An important impulse for other researchers was given in [KV95] where thequadratic error of the ensemble

f(x) = \sum_i w_i f_i(x), \qquad \sum_i w_i = 1, \; w_i \ge 0

is split into two components: the average quadratic error of the sub-models and a so-called


ambiguity:

(f(x) - y)^2 = \sum_i w_i\,(f_i(x) - y)^2 - \sum_i w_i\,(f(x) - f_i(x))^2

showing that at a single data point the quadratic error of the ensemble model isguaranteed to be less than or equal to the average quadratic error of sub-models.

Also important is the interpretation of the ambiguity \sum_i w_i\,(f(x) - f_i(x))^2 as a measure of disagreement between the models. It shows that diversity is a crucial matter in ensemble construction.
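The decomposition can be verified numerically; the short numpy snippet below (toy data, illustrative only) checks that the ensemble error equals the weighted average error of the sub-models minus the ambiguity at a single data point:

```python
import numpy as np

rng = np.random.default_rng(0)
y = 1.0                                          # target at a single data point
f_i = rng.normal(loc=1.0, scale=0.3, size=5)     # outputs of five sub-models
w = rng.random(5); w /= w.sum()                  # convex combination weights

f = np.sum(w * f_i)                              # ensemble output
avg_error = np.sum(w * (f_i - y) ** 2)           # weighted sub-model error
ambiguity = np.sum(w * (f - f_i) ** 2)           # disagreement between models

print((f - y) ** 2, avg_error - ambiguity)       # the two numbers coincide
```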

In general there are four basic strategies that might be used to obtain a large scale of diversified models. The first, broadly studied one, is to use the same model but to train it with different data sets. The second possibility is to penalize the error function by a diversity measure when optimizing individual models. The next degree of freedom is the variability of a particular type of model - e.g. the architecture in the case of a neural network, the selection of time series and error lags in the case of an ARIMA model, or the selection of inputs in both cases. Finally, various optimization algorithms might be used for training of sub-models.

The simplest technique which might be used to feed models with different data during the learning phase is to define several training sets by excluding several samples from the data. The excluded samples might be used for validation of the models. This approach is for example used in [KV95] or [LYWH06]. It is also shown in the former that the size of the training sets has a significant influence on the performance of the whole ensemble.

Two other techniques based on modification of training sets are bagging and boosting (see [Bre96] and [SFBL98], respectively). The main idea of both techniques is the same - the training set for a new model is resampled from the original data with respect to a probability distribution (with repetition). The difference between bagging and boosting is that the first technique uses the uniform distribution while the latter one modifies the distribution during each iteration with respect to the forecasting errors of the last model. The motivation of boosting is to focus on the data that were not processed sufficiently. Therefore the higher the error is, the higher is the probability that the particular observation is used for training of the next model.

Bagging as well as boosting were proposed in the context of categorization problems, therefore the first applications were limited to this area. An example is [SFBL98] where the impact of bagging and boosting on variance and bias reduction is studied. The paper shows that the common expectations might be misleading. It is shown that bagging might fail to improve performance when the training data are not independent - even if the average of the individual


model’s error is small comparing to the variance. Next it shows that bagging andboosting might produce over fitted models.

This theoretical result is also supported by the comparison published in [OM99]. The authors applied boosting and bagging to 23 different classification tasks using decision trees and neural networks as basic models. One of the outputs was the fact that boosting might give worse results than a single model because it tends to overfit when noisy data are presented to the model. On the other hand boosting dominated bagging in the other cases.

Bagging might be applied to time series problems without any need to modify the basic algorithm (e.g. [GMH05]). This is not the case for boosting as the mechanism of the evolution of the training sample distribution has to be adjusted. An interesting example of a boosting algorithm suited for time-series forecasting, as well as a structured summary of other alternatives, might be found in [MRH08].

The algorithm assumes that the whole sample is used for training of all the sub-models. The diversity is achieved by applying different weights, associated with individual observations, to the aggregated square error. The weights for the next iteration are updated with respect to the linear (absolute), square, or exponential error. The experimental results showed that this algorithm improved the forecasting performance compared to a single recurrent network which was used as the sub-model.
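The following Python sketch illustrates the weight-update idea described above on a weighted AR weak learner; it is only a rough illustration under simplifying assumptions (OLS weak learner, plain averaging of sub-models), not the exact algorithm of [MRH08]:

```python
import numpy as np

def fit_weighted_ar(X, y, obs_w):
    """Weak learner: AR coefficients minimising the weighted square error."""
    W = np.diag(obs_w)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

def boost_time_series(X, y, n_models=5, update="square"):
    """Boosting-style ensemble: every sub-model sees the whole sample, but
    each round re-weights the observations according to the previous round's
    errors (absolute, square or exponential update)."""
    obs_w = np.full(len(y), 1.0 / len(y))
    models = []
    for _ in range(n_models):
        coef = fit_weighted_ar(X, y, obs_w)
        models.append(coef)
        err = np.abs(y - X @ coef)
        if update == "absolute":
            obs_w = err
        elif update == "square":
            obs_w = err ** 2
        else:                                  # "exponential"
            obs_w = np.exp(err)
        obs_w = obs_w / obs_w.sum()            # keep the weights normalised
    return models

rng = np.random.default_rng(1)
series = np.cumsum(rng.normal(size=200))            # toy random-walk series
X = np.column_stack([series[1:-1], series[:-2]])    # AR(2) design matrix
y = series[2:]
models = boost_time_series(X, y)
print(np.mean([X[-1] @ c for c in models]))         # averaged one-step forecast
```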

The next strategy of diversification is penalization of the error function. According to the author's current knowledge there is only one method described in the literature - Negative Correlation Learning (NCL) [LY99]. This technique was designed for optimization of ensembles of several MLPs. The aim of NCL is to force a particular MLP to be negatively correlated with the others. It was shown that this might be achieved by simultaneous optimization of all the MLPs using a modified error function.

For any MLP representing a function F_i in a simple ensemble

F(x_t) = \frac{1}{M} \sum_{i=1}^{M} F_i(x_t)

the error function is defined by

E_i = \frac{1}{T} \sum_{t=0}^{T-1} \left(F_i(x_t) - y_t\right)^2 + \frac{1}{T} \sum_{t=0}^{T-1} \lambda\, p_i(x_t) \qquad (2.43)

where T is the size of the training sample and λ is a factor controlling the influence of the


penalization p_i:

p_i(x_t) = \left(F_i(x_t) - F(x_t)\right) \sum_{j \ne i} \left(F_j(x_t) - F(x_t)\right) \qquad (2.44)

The advantage of NCL is that the standard gradient descent algorithm in combination with backpropagation might be used for training the networks. Moreover, accuracy and diversity are optimized in parallel, which is more efficient than a selection from independent individuals - the same diversity might be achieved using far fewer individual models. From this point of view NCL and boosting might be considered active strategies while bagging is a passive one.
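A small numpy sketch of the penalized error (2.43)-(2.44) is given below; in real NCL training the gradient of E_i would drive backpropagation, here only the error values are evaluated on toy predictions (all data and the value of λ are illustrative assumptions):

```python
import numpy as np

def ncl_errors(preds, y, lam=0.5):
    """Penalised errors of eq. (2.43) for an averaged ensemble.
    preds has shape (M, T): outputs F_i(x_t) of M networks; y are the targets."""
    F = preds.mean(axis=0)                                  # ensemble output
    p = np.empty_like(preds)
    for i in range(len(preds)):                             # eq. (2.44)
        others = np.delete(preds, i, axis=0)
        p[i] = (preds[i] - F) * np.sum(others - F, axis=0)
    accuracy = np.mean((preds - y) ** 2, axis=1)            # fit term
    diversity = np.mean(lam * p, axis=1)                    # penalisation term
    return accuracy + diversity

rng = np.random.default_rng(2)
y = np.sin(np.linspace(0, 4, 50))                           # toy target series
preds = y + rng.normal(scale=0.1, size=(3, 50))             # three toy networks
print(ncl_errors(preds, y))
```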

An interesting application of NCL might be found in [IYM03] where it forms one part of a constructive algorithm for ensemble building. Starting with a single network with one hidden unit, the algorithm decides in every step whether to add a new network or to add a hidden neuron to an existing one. At the end of each step NCL is used to train all the networks in the ensemble until convergence or growth criteria are reached.

Unlike boosting and NCL, the remaining two diversification strategies are passive again. They simply bet on computational power, which allows enough models to be generated randomly so that accurate and diverse ones might be selected from them. These strategies are usually combined, and particularly the variability of model architectures brings enormous freedom. The next paragraph gives a few examples of the large variety of applications.

[BB08] uses ensembles of time delayed RBF networks where different activation functions are used in the hidden layer. An ensemble built of several different models including neural networks (RBF and a hybrid RBF-MLP architecture), a nearest trajectory model, and linear models is tested in [WO04]. [LWH06] examines the random initialization of weights in combination with cross-validation on different training sets.

Unfortunately, searching for accurate individuals in the huge universe of all feasible models is becoming more difficult with the increasing complexity of models. Therefore it is quite common that evolutionary algorithms are employed when constructing an ensemble.

But searching for accurate networks is not the only problem when dealing with a large scale of models. If one has a lot of accurate models available then it should be considered that the principle "the more the better" does not usually work. It is essential to select only several well diversified sub-models. To do so a diversity measure must be defined first.


Various diversity measures are defined for classification tasks. These are mostly based on ratios of the numbers of correctly and incorrectly classified samples from the testing set (see e.g. [BHBK03]). However, such measures cannot be applied when working with time series models. According to the author's current knowledge the correlation of errors on training data is the only diversity measure used in this context.

Once the correlations between any two sub-models are evaluated the selection might be made using e.g. principal component analysis or some heuristic algorithm. An example of such an algorithm, called conditional generalized variance, is proposed in [YLW08]. Here the determinant of the correlation matrix is taken as a generalized variance of a set of sub-models. Iteratively, sub-models whose contribution to the determinant is below a certain threshold are excluded from the ensemble. The problem of this method is that the threshold must be set. In general the problem is to choose the maximum number of members in an ensemble so as not to overfit the training set.

The last issue related to ensemble models is how to combine the outputs of individual sub-models. The very basic method is to take the average or a weighted average of the outputs where the weights are set with respect to the errors of the sub-models (e.g. [CYA07]). However, more sophisticated methods were also examined. An example is the model employed in [SK04] where the ensemble output is a weighted average but the weights associated with sub-models are calculated dynamically using the estimated accuracy of the individual models.

Beyond the limitation of linear combination are models where a non-linear model is used to combine the outputs of sub-models - like the support vector regression in [LYWW06] or an RBF network combining outputs of other RBF networks in the ensemble as proposed in [YLW08]. These models outperform the linear ensembles in selected experiments. However, the employment of such models opens the question of the optimal model architecture at the metamodel level.

2.4.3 Hierarchical models

A simple hierarchical model introduced in [GL01] combines a self organizing map (SOM) and an Elman network. Here the SOM first identifies the patterns of the time series in a given period. Delayed outputs of the SOM are processed by the network afterwards. This combination allows one to benefit from two completely different concepts which are suited to particular aspects of the problem.

A different strategy is to use the same models and to benefit from a more complex structure - like for example the model of chained neural networks ([DSMV01]) designed for long term prediction of time series. Networks are connected into a


chain and the first network calculates the one step ahead prediction, the second one the two steps ahead prediction and so on. Also ensembles with a non-linear combination of sub-models might be considered hierarchical models from this point of view.

A fractal neural network (e.g. [IT99]) is another, a bit more sophisticated, example of such a model. The architecture of a fractal neural network might be described as a tree with a neural network in each node. These sub-networks are special recurrent neural networks with non-monotone activation functions

f(x) = \frac{1 - e^{-cx}}{1 + e^{-cx}} \cdot \frac{1 + \kappa\, e^{c'(|x|-h)}}{1 + e^{c'(|x|-h)}} \qquad (2.45)

The motivation of the authors was the parallel with biological systems, hence the levels of the tree might be interpreted as sensors and association, frontal association or motor cortex, respectively. The employment of non-monotone neural networks was particularly driven by the task - implementation of higher cognitive functions.

A similar system, but this time suited for time series forecasting, was proposed in [CYZ06]. The topology at the metamodel level is given by a tree again but the non-monotone networks are replaced by RBF networks. In addition the tree defining the structure of the metamodel might be arbitrary, while exactly four levels were expected in the case of fractal neural networks.

In general hierarchical models might be far more complex than the given examples. Even though only models with two levels are considered in most applications, a network of ensembles of regression trees is also a valid metamodel architecture. In any case the more complex the models are, the more problematic is the construction of the model - to find a balance between complexity and the generalization property.

It might be said that this was one of the crucial issues related to the research of neural networks a couple of years ago. The rapid growth of computational power allowed researchers to play with combining models rather than constructing the optimal one. However, the issue is arising again one or several levels above.

For a few types of metamodels some heuristic algorithms for the construction of optimal hierarchical models were proposed [?]. However, like in the case of finding a good structure of a neural network, evolutionary algorithms are one of the most common tools broadly used to find a good combination of models and their mutual interactions.


2.5 Neural networks with switching units

This section introduces neural networks with switching units and summarizes the most important findings related to this concept. A brief comparison of NNSU and other similar models concludes this section. Only intuitive descriptions of the basic terms are provided here as they are sufficient for this general overview.

2.5.1 Neural networks with switching units (NNSU)

This concept, introduced in [BvK+95], was designed as a classifier with the possibility of efficient hardware implementation. The already trained network implemented as a circuit has to perform only basic arithmetic operations (summation, multiplication). Therefore it might be considered an on-line classifier for a high frequency data stream. One can imagine various applications where such a fast classifier might be useful. The one at the beginning of NNSU development was the recognition of elementary particle decays.

The basic building block of a NNSU is the neuron with switching unit (NSU). Each NSU consists of a single switching unit (SU) and several computational units (CU). The switching unit controls the flow of data inside the NSU - it assigns each input pattern to one of the CUs according to certain criteria. Once the input pattern is assigned, the response of the NSU is calculated by the selected CU - the remaining CUs are not used.

More precisely, the switching unit defines a disjoint covering

\Omega_I = \bigcup_{i=1}^{k} \Omega_I^i, \qquad \Omega_I^i \cap \Omega_I^j = \emptyset \quad \forall\, i \ne j

of the input space \Omega_I \subset R^n. The covering is found during the training phase employing a cluster analysis algorithm - that is, an unsupervised training process.

Using indicator functions

\chi_i(x) = \begin{cases} 1 & \text{if } x \in \Omega_I^i \\ 0 & \text{otherwise} \end{cases} \qquad (2.46)

the output of an NSU with k CUs for any x \in \Omega_I is

o(x) = \sum_{i=1}^{k} \chi_i(x)\, f_i(x) \qquad (2.47)


where f_i, i = 1, \ldots, k are the activation functions of the CUs. As already mentioned, the aim was not to use anything other than simple arithmetic operations during the processing of data. Therefore the activation functions of the CUs are assumed to be linear functions

f_i(x) = \left(p_1^i x_1, \ldots, p_n^i x_n\right)' \qquad (2.48)

where p^i \in R^n is the vector of parameters of the i-th CU. The parameters of each CU are set during the training phase via linear regression. At the beginning the training set is split into k parts according to the clusters. Next the parameters are set separately for each CU minimizing the square error \|e\|^2 in the regression equation

y = Xp + e \qquad (2.49)

where X is the matrix of inputs and y is the column vector of desired outputs, both restricted to patterns from the given cluster only. Note that (2.47), (2.48) and (2.49) imply that \sum_{i=1}^{n} o_i(x) is the linear approximation of the desired output.
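To make the training procedure concrete, the following Python sketch (numpy only) clusters the inputs with a naive k-means, fits one bias-free linear CU per cluster by least squares as in (2.49), and evaluates the response (2.46)-(2.48); the clustering algorithm and all toy data are illustrative assumptions, not the exact procedure used in the NNSU implementation:

```python
import numpy as np

def train_nsu(X, y, k, iters=20):
    """Train a single neuron with switching unit: (1) unsupervised clustering
    of the inputs (naive k-means here), (2) bias-free linear regression per
    cluster minimising ||y - Xp||^2 as in eq. (2.49)."""
    rng = np.random.default_rng(0)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    params = []
    for j in range(k):
        p, *_ = np.linalg.lstsq(X[labels == j], y[labels == j], rcond=None)
        params.append(p)
    return centers, np.array(params)

def nsu_response(x, centers, params):
    """Eqs. (2.46)-(2.48): the switching unit selects one CU; the CU output is
    the element-wise product p^i * x and its sum approximates the target."""
    i = np.argmin(((centers - x) ** 2).sum(axis=1))    # indicator chi_i
    return params[i] * x                               # vector output o(x)

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
y = np.where(X[:, 0] > 0, X @ [1.0, 2.0, 0.0, 0.0], X @ [0.0, 0.0, 3.0, -1.0])
centers, params = train_nsu(X, y, k=2)
o = nsu_response(X[0], centers, params)
print(o, o.sum())                                      # sum of o(x) ~ target
```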

The architecture proposed in [BvK+95] was a sequence of several NSUs with different numbers of clusters. The clusters are defined with respect to the Euclidean distance in the first neuron. For the rest, the sum of inputs is used instead. The sequence is terminated by an output neuron which only sums its inputs.

Training of the network is realized via a one-pass forward algorithm. The neurons are trained one by one using the cluster analysis and linear regression as described above. The motivation of this model might be found in a simple heuristic.

Let us assume that the task is to separate two classes by mapping inputs from the first class to the value -1 and the others to +1. Once the data are classified by the first neuron, some of the input patterns are mapped to the vicinity of -1 and some of them to the vicinity of +1. If the task is not trivial then one might expect that the rest of the outputs would be around 0, which is a grey zone where the undistinguished inputs are mapped. If these data are clustered into three clusters then with high probability there will be two of them where input patterns of one kind dominate the other. In the last cluster (around 0) both classes would be equally represented.

Therefore the model might fine-tune the separation in the first two clusters and re-analyze the data from the grey zone without being influenced by the rest of the data. Repeating this several times for different numbers of clusters, the separation might be improved as the parameters of different NSUs are fitted to different subsets of the input data.


In general the output might be multidimensional, but the output dimensions are treated independently, therefore we might assume without any loss of generality that the output is one-dimensional. The sequence of NSU represents a piecewise linear function defined on the input space. The class of functions it can realize is dense in the space of continuous functions (see [Hla01]).

The sequence of NSU, both from a theoretical and an experimental point of view, is further investigated in [Hla02]. Apart from an estimate of the upper bound of the Vapnik-Chervonenkis dimension it points out one drawback of the sequence of NSU. When a bias is not included in the activation function of the CUs then the resulting activation function of an NSU is not a one-to-one mapping. Therefore it might happen that patterns from different classes are mapped into the same region and the next NSU in the sequence cannot further improve the separation.

To overcome this limitation a correction unit (CRU) was introduced. The inputs of a correction unit are the inputs of the network plus the result of the classification of the foregoing NSU. The correction unit shifts the network inputs into disjoint regions according to the results of the previous classification.

It was shown experimentally on the problem of Higgs boson detection that the sequence of NSU with correction units is capable of learning the training patterns to any precision. Furthermore, it reduces the noise in the network output in general, which might be useful in evolutionary optimization.

Unfortunately, it is obvious that the concept of the CRU is meaningful only in the case of classification problems and that there is no simple generalization to regression problems. Therefore the application of CRUs is not considered in this thesis.

Another study of NNSU might be found in [Vac06]. It investigates and compares various hierarchical and non-hierarchical clustering algorithms and also the influence of different criteria for cluster definitions (clusters according to the Euclidean distance, or the sum of inputs).

Since the very beginning, NNSU were designed as a model that might be optimized by genetic algorithms. Originally, only the sequence of neurons was considered (e.g. [HHK02]). It was encoded by a vector of integers defining the number of clusters in each NSU. Recently a more complex encoding scheme combining the Program Symbol Trees approach [Gru94] and Read's linear codes [Rea72] was proposed in [KH04] and further investigated in [Kal09].

This encoding is capable of describing any acyclic topology. However, not every directed acyclic graph (DAG) is a valid topology of a neural network - e.g. there should be only one output neuron. Therefore, it is even more important that the representation provides a mechanism to define recombination operators in a generic way but with output restricted to the valid topologies of the considered models.


Using Read's codes it is quite easy to identify components that might be replaced or exchanged without violating the restrictions on topologies. And this is a great advantage of this representation when compared to the adjacency matrix or similar schemes.

Concerning applications, there are two important areas where NNSU were successfully applied. First, they proved to give satisfactory results in the problem of Higgs boson detection (e.g. [HJ99], [HHK02]) - NNSU were used to classify data generated by the simulator of CERN's Large Hadron Collider, which had been under construction. The aim was to identify decays where a Higgs boson was present. In this case NNSU significantly dominated the statistical techniques.

Second, NNSU were successfully applied to the analysis of data coming from a gamma-ray Cherenkov telescope. The case study [BCG+04] compares several methods including decision trees, neural networks but also non-connectionist models like linear discriminant analysis for the classification of simulated data. The results show that NNSU in many aspects dominated the other methods. These promising results are a great motivation for further investigation and development of NNSU.

2.5.2 Related concepts

The ideas and principles behind NNSU are not unique; the same or similar mechanisms might be found also in other concepts. Some of them are broadly used and well established models. A brief introduction of all the models that appear to be similar to NNSU (from the author's point of view) is given in the rest of this section. The aim is to point out similarities and differences with NNSU, not to compare their performances.

The threshold autoregressive model (TAR) is a simple generalization of the AR model. [Ton90] defines the model as follows:

y_t = \alpha_{0,j} + \sum_{i=1}^{p} \alpha_{i,j}\, y_{t-i}, \quad \text{if } y_{t-d} \in I_j \qquad (2.50)

where I_j, j = 1, \ldots, k, are disjoint intervals covering R. One can say that this is an NSU whose only inputs are lagged values of the series and whose switching is driven by a single one of them. It is obvious that the appropriate selection of the input controlling the switching is crucial. However,


there is no rule or algorithm to identify it. It means that it must be defined by the modeler, like the order of the autoregressive process.

Moreover, the model expects a quite simple form of non-linearity of the series. It is probably very common that the non-linear behavior is driven by several inputs simultaneously.
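A one-step TAR forecast according to (2.50) can be sketched as follows (Python/numpy; the regimes, coefficients and delay are toy assumptions):

```python
import numpy as np

def tar_forecast(history, alphas, thresholds, d=1):
    """One-step forecast of the TAR model of eq. (2.50).
    alphas[j] = [alpha_0, alpha_1, ..., alpha_p] for regime j, the regimes
    being the intervals I_j delimited by the increasing `thresholds`;
    d is the delay of the switching variable y_{t-d}."""
    p = len(alphas[0]) - 1
    j = int(np.searchsorted(thresholds, history[-d]))   # which interval I_j
    a = alphas[j]
    lags = history[-1:-p - 1:-1]                        # y_{t-1}, ..., y_{t-p}
    return a[0] + float(np.dot(a[1:], lags))

# two regimes split at zero: alphas[0] applies when y_{t-d} < 0, alphas[1] otherwise
alphas = [np.array([0.1, 0.5]), np.array([-0.1, 0.9])]
history = np.array([0.2, -0.3, 0.4])
print(tar_forecast(history, alphas, thresholds=[0.0]))
```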

The network of autoregressive processing units (NAPU) is again a model equivalent to a single NSU. Here the clusters are defined with respect to the Euclidean distance. Interestingly, [LSKH96] proposed NAPU in the context of time series forecasting.

Going a bit deeper, a NAPU is an NSU with several processing units where K-means clustering is used for finding the centers with respect to the Euclidean distance. The processing units tested in the paper were linear units trained by OLS. However, it is mentioned that the processing unit might in general be any autoregressive model (e.g. an MLP). A recommendation to combine linear and MLP units is also given, saying that an MLP should be used only for those clusters where linear units failed before.

A comparison of NAPU with AR, TAR, MLP and RBF models using two artificial non-linear series showed that NAPU gives slightly worse results than an RBF network with a similar number of parameters. However, it outperformed the other models.

The parallel between NAPU and NNSU is really obvious. However, according to the author's current knowledge, there are no later papers further investigating NAPU or derived models following the ideas presented in [LSKH96].

The autoregressive tree (ART) presented in [MCH02] is an example of a regression or decision tree based model. It also generalizes the TAR model like NAPU does. Here the clusters are defined hierarchically using a tree of different binary splits with AR models in the leaves. Such an architecture allows a more precise definition of clusters; however, it is still equivalent to a single NSU with a more sophisticated switching mechanism.

All the splits are, like in the TAR model, defined by criteria of the type y_{t-i} < c_i where the lag i and threshold c_i are the only parameters defining the split. It means that the clusters of a tree with n inputs are always intervals in R^n.

The training of the tree is an iterative greedy process. Starting from a single node, a predefined set of splitting criteria is tested against a fitness function and the best one is chosen. To evaluate the fitness, the leaf nodes must be fitted for each examined criterion. Therefore the set of criteria must be limited to a reasonable size.


For example, about 8 different thresholds are considered for all the inputs in the above cited paper.

The greedy algorithm for finding optimal clusters might be considered an advantage of ART over the NSU. In the case of the NSU the clusters are defined with respect to the structure of the input data regardless of the quality of the whole NSU. One can easily imagine a case where the "geographically" different clusters do not match the break points of the underlying model.

On the other hand the greedy training of ART is more time consuming. It means an NNSU with several neurons might be trained in the same time as an ART, and the synergy of individual neurons might compensate for the advantage of the optimized splits in the ART model.

Assuming a sequence of NSU, the number of different regions in the input space is given by the product of the numbers of clusters of all the neurons in the sequence. It means that the same number of clusters might be achieved with far fewer parameters than with ART.

Moreover, as the tree grows, the number of observations available for fitting the leaf nodes decreases. This is not the case for the series of NSU. Each NSU in the series has a certain small number of clusters (usually between two and five) and so there are always enough patterns to avoid overfitting.

Finally, subsequent neurons usually have different numbers of clusters and, starting from the second neuron, already transformed data are used for clustering. These two features further increase the complexity and variability of clusters in NNSU.

Note that in the case of classification problems the sequence of several NSU with different numbers of clusters and switching driven by the sum of inputs has a kind of self-correction property. This correction property is based on the following heuristic. Let us assume that data are classified into two classes - call them positive (expected output 1) and negative (expected output -1). Then a neuron with three clusters splits the input space into data classified as positive (around 1), negative (around -1) or undistinguished (around 0) inputs. It means inputs where the previous neuron could not decide what class they belonged to are treated separately. This effect might be further emphasized using correction units.

However, it was already mentioned that correction units cannot be applied to time series modeling (or any other regression problem in general). And the same applies to this heuristic - when the output is not categorical then the clusters simply cannot correspond to individual categories. Hence this might be considered a motivation for further development of the NSU.


The hierarchical mixture of experts (HME) is the last and most complex model discussed in this section. HME as described in [JR94] is a model with a tree structure where the leaves are called expert networks and the inner nodes are gating networks (see figure 2.3).

Figure 2.3: Hierarchical mixture of experts with 3 gating networks

Expert networks might be considered computational units, applying NNSU terminology. The expert networks are linear models - in the case of classification they have a single output non-linearity of sigmoidal type.

Gating networks calculate weights associated with particular expert networks. The weights are different for each observation; however, for any one of them they sum up to one. It means that HME is a convex combination of regression models, logit models for classification purposes, or autoregression models for time series analysis problems (see [VS03]).

Both gating and expert networks are trained simultaneously using a supervised training algorithm (e.g. [JR94] proposed an EM algorithm). It means that the topology of the model defines the number of clusters and the splits are chosen to allow the best fit using the available expert networks. Hence the clusters are optimized as in the case of ART, but the approach employed here need not be restricted to intervals and several thresholds only.


Even though the definition of clusters in HME is more general, there is not a clear advantage of fine tuned clusters when compared to a sequence of NSU. However, the fact that any input is processed by all the expert networks and the output is derived as a combination of their responses is probably an advantage of this model.

In the terminology employed in HME papers it is said that the gating networks realize soft splits - rather than assigning an input to a single cluster they define the level of contribution for all of them. Obviously, the parallel application of all the expert networks to each input further emphasizes the effect of split optimization.
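The soft-split mechanism can be illustrated by the following Python sketch of a two-level HME with softmax gating networks and linear experts; the softmax parametrization and all numbers are illustrative assumptions and the sketch omits the EM training of [JR94]:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def hme_two_level(x, gate_root, gates, experts):
    """Two-level HME (cf. figure 2.3): the root gating network weights the
    branches, each branch gating network weights its experts, and every
    linear expert processes the same input x."""
    g_root = softmax(gate_root @ x)                    # weights of the branches
    out = 0.0
    for i, gate in enumerate(gates):
        g = softmax(gate @ x)                          # weights inside branch i
        for j, w in enumerate(experts[i]):
            out += g_root[i] * g[j] * float(w @ x)     # soft split of the input
    return out

rng = np.random.default_rng(4)
x = np.array([1.0, 0.5, -0.2])
gate_root = rng.normal(size=(2, 3))
gates = [rng.normal(size=(2, 3)) for _ in range(2)]
experts = [[rng.normal(size=3) for _ in range(2)] for _ in range(2)]
print(hme_two_level(x, gate_root, gates, experts))
```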

Using the NSU or a network of NSU the "soft" splits cannot be easily achieved by a more complex structure of the model. However, HME together with RBF networks encouraged the author to implement a similar feature into NNSU, which is discussed in detail later in chapter 3.

In general, HME and NNSU share a lot of ideas but they are indeed very different. To compare them a serious analysis must be performed, which is beyond the scope of this thesis.

2.6 Currency in circulation

For the purposes of this thesis the currency in circulation is defined as banknotes and coins held outside the central bank. The volume of currency in circulation is monitored on a daily basis as one of the most important factors in the liquidity forecasting process of a central bank. Unfortunately it is out of the control of the central bank, hence it cannot be determined exactly.

Therefore the future value has to be forecasted on a daily basis using knowledge from the past, which is a typical problem of time series forecasting. To simplify the text, the daily series of volumes of currency in circulation is further also referred to as currency in circulation (CIC).

Figure 2.4 shows that CIC is a typical seasonal time series with the significant influence of several seasonal factors. The next two sections explain the stochastic behavior of CIC and summarize possible seasonal drivers.

2.6.1 Stochastic behavior of CIC

The distribution of banknotes and coins to the non-banking system is mainly carried out by commercial banks. They have to supply their branches and ATM networks


Figure 2.4: Currency in Circulation - (a) volume of currency in circulation (bln CZK, 1997-2004); (b) volume of CIC over a one year period (07/02-06/03); (c) daily changes of CIC volume (bln CZK); (d) one year (seasonal) difference of daily changes of CIC volume (bln CZK).

as any client expects that any nearby ATM is always ready and full of banknotes, and no one can even imagine that it would be impossible to withdraw cash in a bank branch - unless something went wrong.

Commercial banks are allowed to withdraw cash from their accounts kept with the central bank. And obviously, they aim to return spare cash to the central bank as soon as possible because cash in an ATM or in a safe cannot be further invested. In other words the banks flexibly follow their clients' requirements and hence the demand for cash is mainly influenced by the non-banking sector, including commercial subjects as well as households.

It means that the changes of the volume of currency in circulation are caused by an enormous number of factors and circumstances that are obviously uncontrollable. Thus it is impossible to assess the exact volume of CIC and it is necessary to estimate it without prior knowledge of all the factors and circumstances.

To deal with the influence of the non-banking sector it is possible to attribute the superposition of all uncontrollable factors to a stochastic and a seasonal behavior of the CIC volume. This interpretation means the CIC volume is supposed to be a random variable.


2.6.2 Seasonal factors with impact on CIC

The identification of all significant seasonal patterns and shocks and the choice of the correct form for their description proved to be crucial for the model's forecasting performance. The choice of seasonal factors is a result of discussions with the experts who are responsible for the liquidity forecasting in the Czech National Bank.

Intra-monthly effect The intra-monthly effect is particularly influenced by salary payments. It means the demand for cash is higher around the paydays and then decreases till the salaries are paid again. The difficulty related to the monthly cycles is that the length of the months differs during the year. In addition, the same month in two subsequent years could have a different length as well due to the restriction to working days only.

Day of Week The effect of the week day is similar to the intra-monthly effect. Again the demand oscillates during the week and reaches its maximum before the weekend as the ATM networks have to withstand all the shopping activities.

Floating Holiday - Easter Easter is probably the trickiest season. Even though Easter Monday is always on a Monday, its date varies year by year. Hence it interferes with the intra-monthly seasonality and it seems that the influence of the Easter holiday depends on its position in the month.

Fixed Holiday Other national holidays are, unlike Easter, fixed to a particular date, hence their positions in the month are given. On the other hand their position in the week differs year by year.

Number of non-working days A national holiday on Friday or Monday might influence the demand for CIC more than when it is in the "middle" of a week. Everybody knows that every "long weekend" is a great opportunity that has to be taken. Therefore the number of non-working days following a particular day is also a relevant candidate for an explanatory variable.

Christmas and New Year The most significant season is around Christmas and New Year when shopping activities rise dramatically.


Shocks Apart from the seasonal effects listed above there are two significant shocks that have to be considered. The first is the Y2K effect and the second is a bank failure on June 16th, 2000, preceded by a significant growth of cash withdrawals.

The seasonal factors listed above are essential inputs to any forecasting model. There exist many ways to use them in a model; however, the most common is to use indicator functions or superpositions of trigonometric functions. For the inputs of the neural network models these two approaches were combined as further explained in 4.1.1.
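As an illustration only, the following Python sketch (assuming pandas and numpy are available; the feature names, the chosen indicators and harmonics are the author's toy choices, not the exact encoding of section 4.1.1) builds a few such inputs for a business-day calendar:

```python
import numpy as np
import pandas as pd

def seasonal_features(dates, holidays=()):
    """Toy encoding mixing indicator variables with trigonometric terms."""
    dates = pd.DatetimeIndex(dates)
    feats = pd.DataFrame(index=dates)
    for d in range(5):                                   # day-of-week indicators
        feats[f"dow_{d}"] = (dates.dayofweek == d).astype(float)
    pos = (dates.day - 1) / (dates.days_in_month - 1)    # position within month
    feats["month_sin"] = np.sin(2 * np.pi * pos)         # intra-monthly harmonics
    feats["month_cos"] = np.cos(2 * np.pi * pos)
    hol = pd.DatetimeIndex(holidays)
    feats["pre_holiday"] = dates.isin(hol - pd.Timedelta(days=1)).astype(float)
    feats["december"] = (dates.month == 12).astype(float)   # Christmas season
    return feats

days = pd.bdate_range("2003-12-01", "2004-01-15")        # working days only
print(seasonal_features(days, holidays=["2003-12-25", "2003-12-26"]).head())
```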

2.6.3 Forecasting models for CIC

Surprisingly, it is not far back in history when the most common models used for CIC forecasting were experts who simply knew. They had done the job for a long time and hence they learned how it works. Experts can also use alternative indicators like the number of armored cars queuing in front of the central bank early in the morning.

The fact is that good experts significantly outperformed conventional time series models. However, the problem is that when they are on holiday someone else has to do the job and it usually results in inconsistent forecasts. Therefore central banks tend to replace, or at least support, the experts' decisions with mathematical models.

Two models are compared in [CCMH02] where also a comprehensive summary of all the pitfalls related to CIC forecasting is given. The models discussed in this paper are an ARMA model and an STS model (see section 2.2.5). These models were applied to data of the Eurozone and the results proved that the STS model outperforms the ARMA model. However, the accuracy was further increased by a combination of the models, indicating that there is still some information left in the residuals of both models.

According to the author's current knowledge the STS model introduced in the given reference is the most accurate model used in the context of CIC forecasting. Therefore it is used, together with the ARMA model, as a benchmark.

Finally, it should be mentioned that the Czech CIC series is more volatile and less stable than the series of the whole Eurozone, which is given by the developing character of the Czech economy in the 1990s. Therefore a lower accuracy of forecasts is expected.


Algorithm 1: Regression algorithm for ARMA model fitting

Input: (y_t)_{t=0}^{m}, (x_t)_{t=0}^{m} \subset R^n, p, q \in N, p + q > 0
Output: \alpha \in R^p, \beta \in R^q, \gamma \in R^n, (\varepsilon_t)_{t=0}^{m}

Set h = p + q.
Fit

y_t = \sum_{i=1}^{h} a_i y_{t-i} + \sum_{i=1}^{n} \gamma_i^0 (x_t)_i + \varepsilon_t^0 \qquad (2.20)

using linear regression to find an initial estimate of the residuals \varepsilon^0.
Set it = 1.
while \|w^{it} - w^{it-1}\| > c do
    Fit

    y_t = \sum_{i=1}^{p} \alpha_i^{it} y_{t-i} + \sum_{i=1}^{q} \beta_i^{it} \varepsilon_{t-i}^{it-1} + \sum_{i=1}^{n} \gamma_i^{it} (x_t)_i + \varepsilon_t^{it} \qquad (2.21)

    using linear regression.
    Set

    \varepsilon_t^{it} = y_t - \sum_{i=1}^{p} \alpha_i^{it} y_{t-i} - \sum_{i=1}^{q} \beta_i^{it} \varepsilon_{t-i}^{it} - \sum_{i=1}^{n} \gamma_i^{it} (x_t)_i \qquad (2.22)

    w^{it} = [\alpha, \beta, \gamma]^{it}
    Set it = it + 1
end
Set [\alpha, \beta, \gamma, \varepsilon] = [w^{it}, \varepsilon^{it}]
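A minimal Python sketch of Algorithm 1 follows (numpy only); the convergence constant c is exposed as tol, the residual update is simplified by reusing the lagged residuals from the design matrix, and the toy data are illustrative assumptions:

```python
import numpy as np

def lagmat(v, lags, start):
    """Matrix whose row for time t >= start is [v[t-1], ..., v[t-lags]]."""
    return np.column_stack([v[start - i:len(v) - i] for i in range(1, lags + 1)])

def fit_arma_regression(y, X, p, q, tol=1e-6, max_iter=50):
    """Iterated-regression ARMA(p, q) fit with exogenous regressors X:
    the residuals of the previous pass serve as the MA regressors of the
    next pass, cf. eqs. (2.20)-(2.22)."""
    h = p + q
    ys = y[h:]
    # initial AR(h) + X regression -> first residual estimate eps^0, eq. (2.20)
    D = np.column_stack([lagmat(y, h, h), X[h:]])
    w, *_ = np.linalg.lstsq(D, ys, rcond=None)
    eps = np.zeros_like(y)
    eps[h:] = ys - D @ w
    w_prev = None
    for _ in range(max_iter):
        # eq. (2.21): AR lags, lagged residuals and exogenous inputs
        D = np.column_stack([lagmat(y, p, h), lagmat(eps, q, h), X[h:]])
        w, *_ = np.linalg.lstsq(D, ys, rcond=None)
        eps[h:] = ys - D @ w        # simplified residual update, cf. eq. (2.22)
        if w_prev is not None and np.linalg.norm(w - w_prev) < tol:
            break
        w_prev = w
    alpha, beta, gamma = w[:p], w[p:p + q], w[p + q:]
    return alpha, beta, gamma, eps

rng = np.random.default_rng(6)
y = np.cumsum(rng.normal(size=300))                        # toy series
X = np.sin(np.arange(300) * 2 * np.pi / 5).reshape(-1, 1)  # toy seasonal regressor
print(fit_arma_regression(y, X, p=2, q=1)[0])              # estimated AR part
```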


Chapter 3

Time Series Modeling using NNSU

Several innovations of neural networks with switching units are proposed in this chapter. The innovations are designed to improve the forecasting capabilities of NNSU with a focus on seasonal time series. The basic ideas behind the proposed improvements are adopted from other time series models and adapted to the concept of NNSU. The selection of particular features and their specific implementation draws from the results of the NNSU application to time series data.

3.1 Sequence of NSU - good classifier but poor forecasting tool

The sequence of NSU in its basic form (as described in section 2.5.1) has been applied to the series of CIC (see [HKv05]). The results demonstrate that the model is comparable with conventional models on the one hand, but they also uncover the limitations of the sequence of NSU on the other. The results suggested that the missing feedback and the trivial topology are the most limiting constraints.

The importance of feedback is a well-known fact. But why is the model quite good in classification tasks but not in time series forecasting? The reason is that the heuristic behind switching explained in 2.5.1 does not work for time series data.

In both cases the output of a neuron with switching unit is multidimensional. The sum of the output across all dimensions is the approximation of the desired output of the network. It means that it approximates the number associated with the class the particular input belongs to, or the value of the time series at the given time.


It makes perfect sense for the subsequent neuron to take this value and cluster the data with respect to it. Different clusters would contain data assigned by the previous neuron to different classes or those that the previous neuron could not distinguish. But the actual value of the time series cannot be used to select the appropriate model which the series is subjected to at the given time.

Therefore the Euclidean distance in the multidimensional input space has to be used instead of the sum of inputs in all hidden neurons, not only in the input one. This strategy was tested on several classification tasks in [Vac06] - the results of networks with the Euclidean distance were much less accurate than those where the clustering was subjected to the sum of inputs.

The next section describes a few modifications of the basic concept of NNSU. They should overcome the discussed limitations of the sequence of NSU and further improve the forecasting performance of neural networks with switching units.

3.2 Turning NNSU into efficient forecasting tool

The modifications proposed in this section are at both the neuron and the network level. The neuron with switching unit is extended by three new features: cooperative switching, local feedback and time dissipation. Concerning the network level, the general architecture is described and the concept of weighting by quality is introduced. The impact of these improvements is examined in chapter 4.

3.2.1 Cooperative switching

The neuron with switching unit splits the input data into several clusters and then processes the data from each cluster separately using a linear transformation. It means that the activation function is piecewise linear and typically discontinuous at the borders of the clusters.

This makes the response of the NSU very sensitive to disturbances. Obviously, such behavior has a negative impact on the forecasting performance as random noise is a native component of any stochastic time series. In addition, the variable length of seasons or missing values might also interact with the discontinuity of the network response and influence the model performance.

The influence of variable length of a season could be demonstrated on a dailyseries restricted to working days only which increases in the first half of eachmonth while it is decreasing in the second one. Let’s neuron have two clusters

Page 52: SEASONAL TIME SERIES MODELING VIA NEURAL …Chapter 1 Introduction This thesis has two objectives. First, to investigate and potentially increase the potential of neural networks with

52 CHAPTER 3. TIME SERIES MODELING USING NNSU

where the first one is associated with 11 working days and the other with remain-ing data. Then the model will be more accurate in months with 22 working daysthan in the months with 20 or 21 working days.

In addition, the misclassified points may have a negative impact during the training process too. They are potential leverage points in the linear regression.

To reduce the sensitivity of the neuron the response has to be made continuous. A possible solution is to calculate the response around the borders as the superposition of responses of all the neighboring clusters. Naturally, the impact of individual clusters should reflect the distance of the processed input from the borders. This can be realized by weights determined by the geographical position of the inputs.

Similarly, weights could reduce the impact of the potential leverage points during training. The farther the input is from the center of the cluster, the lower weight it has. This is similar to geographically weighted regression. The only difference is that in GWR the centers are floating and for each input the regression is fitted again (see section 2.1.3).

Putting all the ideas above together, both problems can be solved by a single system of weights. The same weights can be used in the linear regression and also for the calculation of the response. Only one of many possible systems of weights is considered in this thesis. This system of weights is briefly described in the following paragraphs, and the exact definition follows in section 3.3.1.

First of all, each computational unit assigns a weight to every input. Second, the weights are positive and for a given input they sum up to one. It means that the response is a convex combination of the responses calculated by all computational units. For a given computational unit and a given input the weight is determined by the distance from the center of the associated cluster.

The weight is equal to one until the input gets close to a border; then it linearly decreases until it reaches zero. Inputs at the border of two clusters have weight equal to 0.5. The area around the border where the weight is decreasing is called the smoothing band and its width determines the level of smoothing. If the width of the smoothing band is equal to zero then smoothing is deactivated and the response is equal to the response of the basic NSU. Figure 3.1 shows the weights for two neighboring clusters in the case of a univariate input.

In general, the weights reflect the geometry of the clusters. They are calculated with respect to the norm of their projection on the normal of the closest border of the cluster. Figure 3.2 shows the distribution of weights for a two-dimensional input and three clusters. For more details see equations (3.7) - (3.9) in section 3.3.1.


[Figure 3.1: Weights of two neighboring clusters for univariate input. Axes: univariate input of the neuron (horizontal) versus weight (vertical, 0 to 1); curves: weights of the first cluster, weights of the second cluster, and the border between the clusters.]

[Figure 3.2: Weights of three neighboring clusters for two-dimensional input. The figure shows the distribution of weights between clusters with centers c1, c2 and c3 over the unit square (axes x1, x2); the weight of a given cluster is determined by the intensity of the associated color (red, green and blue respectively).]


Of course, as already mentioned, the system of weights could be defined in many other ways. For example, the weights could be symmetrical and ignore the shape of the clusters, or they could decrease exponentially rather than linearly. However, only a minor effect of a different decrease function is expected, considering the fact that there is no dominant kernel function in GWR either ([?]). Next, the asymmetrically defined distance reflecting the shape of the clusters is chosen because it provides a natural generalization of the original NSU. But a study of different systems of weights is out of the scope of this thesis, and it is possible that another choice could further increase the performance of the model.

3.2.2 Local feedback

The next weak point of the network with switching units used in [HvH05] is that it has no feedback. Therefore it fails when applied, for example, to a Box-Jenkins model with an MA component, even if it is a simple MA(1) process.

Chapter 2 presents some examples of models with feedback. In general, there are two possible implementations of feedback in connectionist models: global and local. Models with global feedback are simply models with cyclic topology where outputs of descendants are used as inputs of predecessors. Local feedback is implemented at the neuron level, which means that only cycles of length one are allowed. In fact, with local feedback the topology can be considered acyclic as the feedback connections can be hidden "inside" the neurons.

Taking this consequence into account, only the implementation of local feedback is suggested, to preserve the acyclic character of NNSU. This option also corresponds with the fact that each neuron in NNSU is trained separately to best fit the desired output. Hence the local implementation of feedback has no impact on the feedforward character of training.

Next, a neuron with switching unit is in fact a piecewise ARMAX model without the MA component (see 2.27), because the inputs are exogenous variables and lagged values of the modeled series. Therefore, feedback only provides the missing MA component, which is a very straightforward extension of the basic model.

Finally, the parallel with ARMAX models offers the guidelines for training of the model with local feedback. The proposed training algorithm is based on the combination of two methods - linear regression applied recurrently and the Marquardt algorithm. For more details on the training algorithm see section 3.3.2.


3.2.3 Time dissipation

Real-world time series have a tendency to change their behavior in time. The series of CIC is a typical example of such a series. There are two general options how to deal with such behavior. First, the model provides a mechanism to describe the fluctuations of its parameters, or second, the model simply gives more importance to recent observations than to older ones.

The first possibility is used, for example, in the structural time series models (see section 2.2.5). But there is no simple way to implement a similar mechanism in the context of NNSU.

On the contrary, it is very easy to adopt the second approach, which is commonly used in regression models. It can be realized by weighting in the linear regression where the weights are a function of time. The weight of the most recent observation is equal to one and it decreases with time going to the past. Like in GWR, different functions could determine the actual value of the weight. Here the most common exponential decrease is considered.

The only problem left is to determine how steep the exponential decrease should be - that is, the speed of time dissipation. There are two factors in contradiction. On one side, if it is too slow then the accuracy could be influenced by too old observations. On the other side, if it is too fast then there could be only little information left to explain the behavior of the series in rare seasons (e.g. Christmas).

Neural networks with switching units can benefit from their connectionist character and the independent training of individual neurons, and use a different speed of dissipation for every neuron. In this sense the output of the network is a consensus of models with different lengths of memory.

3.2.4 Complex topology

Section 3.1 explains why the sequence of NSUs is not an optimal architecture for time series forecasting. Fortunately, the restriction to the sequence is not enforced by the properties of NSU. Actually, the sequences were used only because of their simple representation in the context of genetic algorithms.

The poor performance of the sequence of NSU inspired the development of a more complex topology. Once [KH04] introduced a new encoding of neural networks, the stiff topological restrictions of genetic algorithms disappeared. The so-called architecture with variables selection has been proposed, given that a genetic algorithm capable of dealing with an arbitrary acyclic topology exists.


The architecture with variables selection is based on a two-level topology. An arbitrary directed acyclic graph (DAG) with a single input and output node defines the top level. The nodes in this topology are blocks that, in general, contain any neural network with switching units.

Apart from the connections determined by the top level topology, each block is also connected with the input of the network. But only some of the variables from the input are selected before the inputs of the network are passed into the submodel inside the block. The selection of variables is fixed for a given block, but different blocks can in general use different variables.

The two-level topology has one advantage in the context of genetic algorithms. It makes it possible to define a recombination operator which keeps the contents of blocks untouched during the optimization. The idea behind this construction is to provide higher control over the topologies created by genetic algorithms.

Even though the contents of a block could be of any complexity, it is assumed that it is not too complex, as the topology should be subjected to optimization by the genetic algorithm. The sequence of NSU is a natural option in the context of classification tasks. An example of a successful application of these architectures to classification of elementary particle decays is given in [Kal09]. Mainly, though, the cited reference provides a very detailed explanation of the application of the genetic algorithm to the optimization of the topology of neural networks.

Obviously, the contents of a block could also be a single neuron. This should be the primary choice for tasks where no other option based on a priori knowledge or experience is available. And this is exactly the case of time series forecasting - the sequence of NSU seems not to be a good choice and there is no evidence for the dominance of any other architecture.

3.2.5 Weighting by quality

The last innovation proposed in this thesis is inspired by the hybrid model introduced in [WC96]. This model combines an ARMA model and an RBF network with respect to the confidence in the RBF accuracy. The confidence is measured for each input individually and is determined by the RBF network parameters (see 2.4.1 for a more detailed description).

A similar concept can be used in NNSU because each neuron with switching unit can estimate the quality of its output for a given input. Although it applies to the NSU in its basic form as well, the following (somewhat heuristic) argumentation makes more sense when cooperative switching is considered.


In the cooperative switching regime, the weights applied in the linear regression reduce the impact of the observations around the border. Therefore, one can expect that a computational unit processes inputs close to the center with higher accuracy than those around the borders. In other words, the geographical position determines the quality of the output.

The weights used by different clusters already reflect the position of every input and hence they can be used to estimate the quality of the corresponding outputs. Following the underlying idea, the quality should be highest for inputs around the centers of clusters and lowest for inputs at their borders. This holds, for example, for the sum of squares of the weights, which is one of the proposed measures of quality.

The second one is based on the quality of the linear regression measured by adjusted R2. In this case the quality is given by the weighted average of adjusted R2 across the computational units. The weights used in the average are again the weights used in the calculation of the response in the cooperative switching regime.

This concludes the list of proposed innovations. The ideas and concepts presented above are further formalized in the next sections, where the formal description of the enhanced neuron and network is given.

3.3 Neuron with cooperative switching

This section rigorously defines an enhanced version of the neuron with switching unit. It is referred to as the neuron with cooperative switching (NCS) to distinguish it from the neuron with switching unit in its basic form. But it implements all the innovations described in the previous section 3.2, not only the cooperative switching.

An NCS is defined by its response function and by the training algorithm which sets the parameters of the activation function according to the training data. Both the response function and the training algorithm are described in the next two sections.

3.3.1 Response function

The response function of NCS maps an input from R^n to an output in R^{n+3}. The extra three dimensions of the output are for the bias in the linear transformation and for two estimates of quality as described later. Inputs for a given time t are composed, in general, from four different components:

• outputs of parent neurons: o_t ∈ (−1, 1)^u,

• exogenous variables: x_t ∈ (−1, 1)^v,

• lagged values of the forecasted series: (y_{t−k_1}, ..., y_{t−k_p}) ∈ R^p,

• lagged values of the prediction error: (ε_{t−l_1}, ..., ε_{t−l_q}) ∈ R^q,

where the dimension n of the input

z_t = ((o_t)_1, ..., (o_t)_u, (x_t)_1, ..., (x_t)_v, y_{t−k_1}, ..., y_{t−k_p}, ε_{t−l_1}, ..., ε_{t−l_q})    (3.1)

equals the sum of the dimensions of the individual components, n = u + v + p + q, and the indices k_1, ..., k_p, l_1, ..., l_q are parameters of the response function defined later. It is worth highlighting that the inputs o_t and x_t are supposed to be from the unit cube - that is, normalized inputs are expected.

The exact form of the response function of a given neuron is determined by several parameters that are fixed and not subject to optimization by the training algorithm. These are:

• the number of clusters M,

• the lags of the forecasted series k_1, ..., k_p ∈ N,

• the lags of the errors l_1, ..., l_q ∈ N,

• the smoothing band width θ (see (3.9)).

The optimized parameters are associated with the computational units only. Each computational unit has the following parameters:

• center of the cluster: c ∈ (−1, 1)^{u+v},

• transformation vector: α ∈ R^{u+v+1},

• coefficients of the series lag operator: Φ(y_t) = φ_1 y_{t−k_1} + φ_2 y_{t−k_2} + ... + φ_p y_{t−k_p},

• coefficients of the errors lag operator: Ψ(ε_t) = ψ_1 ε_{t−l_1} + ψ_2 ε_{t−l_2} + ... + ψ_q ε_{t−l_q}.


These parameters form a diagonal matrix P^r:

A^r = diag(α^r_1, ..., α^r_{u+v+1})
Φ^r = diag(φ^r_1, ..., φ^r_p)    (3.2)
Ψ^r = diag(ψ^r_1, ..., ψ^r_q)
P^r = diag(A^r, Φ^r, Ψ^r)

where the upper index r ∈ {1, ..., M} is used to distinguish parameters related to different computational units.

Finally, the training algorithm assigns a value of adjusted R2 to each computational unit, which is further referred to as (R2_adj)_r.

For the inputs and parameters described above the response of NCS is

F(z_t) = [f(z_t), g(x_t, o_t), h(x_t, o_t)]    (3.3)

where

f(z_t) = Σ_{r=1}^{M} w_r(x_t, o_t) P^r z_t    (3.4)

g(x_t, o_t) = Σ_{r=1}^{M} w_r(x_t, o_t)^2    (3.5)

h(x_t, o_t) = Σ_{r=1}^{M} w_r(x_t, o_t) (R2_adj)_r    (3.6)

The weights w_r(x_t, o_t) are derived from the distance of the inputs from the borders between clusters and control the influence of a particular computational unit. The clusters are defined with respect to the components x_t and o_t of the input z_t. To simplify the notation, a_t stands hereafter for the vector ((o_t)_1, ..., (o_t)_u, (x_t)_1, ..., (x_t)_v).

The weight for cluster r and input a_t is calculated in several steps. First, the relative distance of the input a_t from the border between clusters r and s

d_r(a_t, s) = 2 ⟨c^r − a_t, c^r − c^s⟩ / ‖c^r − c^s‖^2 − 1    (3.7)

is calculated. Then the maximal relative distance

d_r(a_t) = max_{s ∈ {1,...,M}, s ≠ r} d_r(a_t, s)    (3.8)

is selected. Note that the relative distance d_r(a_t) is apparently not a metric - it depends on the direction as well, it has a negative value for any a_t from cluster r, and d_r(c^r) = −1.

Next, the non-normalized weight w̄_r(a_t) is set with respect to d_r(a_t) according to

w̄_r(a_t) = 0                       if d_r(a_t) > θ,
w̄_r(a_t) = (θ − d_r(a_t)) / (2θ)   if |d_r(a_t)| ≤ θ,    (3.9)
w̄_r(a_t) = 1                       if d_r(a_t) < −θ,

where θ is a fixed parameter of the neuron and defines the relative width of the band around the cluster borders. Parameter θ is referred to as the smoothing band width.

Finally, the weight w_r(a_t) is normalized,

w_r(a_t) = w̄_r(a_t) / Σ_{v=1}^{M} w̄_v(a_t),    (3.10)

to make sure that the weights of all computational units for a given input sum up to one. With the weights w_r(a_t) defined, the description of the response function is complete. However, it is worth giving a meaning to some of its components, to demonstrate the general idea behind the model. First of all, there are computational units and one switching unit. The role of the computational units is pretty clear: they estimate the prediction using an ARMA model, and their parameters are fitted during the training process. The switching unit assigns the weights w_1(a_t), ..., w_M(a_t) to the individual computational units for any feasible input a_t. Finally, the outputs of the computational units are combined with respect to the weights. The component responsible for this operation can be called the smoother, as it makes the response continuous. In parallel, the weights can be used for the estimation of the quality of the response, hence the last component is referred to as the quality controller (QC). Figure 3.3 schematically presents the mutual connections of these components.
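For illustration only, the following Python sketch computes the normalized cooperative-switching weights of equations (3.7)-(3.10) and the two quality components (3.5) and (3.6) for a single input a_t. The function names are hypothetical (they are not part of the thesis implementation), NumPy is assumed, and at least two clusters with θ > 0 are expected.

import numpy as np

def switching_weights(a, centers, theta):
    # Normalized cooperative-switching weights, eqs. (3.7)-(3.10); theta > 0 assumed.
    # a: input vector a_t (parent outputs o_t and exogenous variables x_t), shape (u+v,)
    # centers: cluster centers c^1, ..., c^M, shape (M, u+v)
    M = centers.shape[0]
    w_bar = np.empty(M)
    for r in range(M):
        # relative distances from the borders between cluster r and every other cluster s, eq. (3.7)
        d_rs = [2.0 * np.dot(centers[r] - a, centers[r] - centers[s])
                / np.linalg.norm(centers[r] - centers[s]) ** 2 - 1.0
                for s in range(M) if s != r]
        d_r = max(d_rs)                                  # maximal relative distance, eq. (3.8)
        if d_r > theta:
            w_bar[r] = 0.0                               # outside the smoothing band of cluster r
        elif d_r < -theta:
            w_bar[r] = 1.0                               # deep inside cluster r
        else:
            w_bar[r] = (theta - d_r) / (2.0 * theta)     # inside the smoothing band, eq. (3.9)
    return w_bar / w_bar.sum()                           # normalization, eq. (3.10)

def quality_estimates(w, r2_adj):
    # Quality components g and h of the response, eqs. (3.5) and (3.6).
    return np.sum(w ** 2), np.sum(w * r2_adj)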

3.3.2 Training algorithm

Training of NCS has two stages. The clusters are found during the first one, and the computational units associated with them are fitted in the second.

A great variety of clustering algorithms could be employed in the first stage. The influence of clustering algorithms has been extensively studied in [Vac06].


[Figure 3.3: Schema of neuron with cooperative switching. The inputs (x, o) are passed to the switching unit (SU) and to the computational units (CU); the smoother combines the outputs of the computational units into the prediction y′ and the prediction error e = y − y′, while the quality controller (QC) produces the quality estimate Q. Here, y′ stands for the first component of the response defined by (3.4), which is in fact the prediction of y; Q represents the remaining two components of the response defined by (3.5) and (3.6), which are estimates of the quality of the output. The rectangular, executive blocks are described in the text.]


Following the suggestion of this study, a combination of two methods is used in the NCS: the k-means algorithm and a simplified version of the principal component algorithm. The latter is used for the initialization of the former. For more details see the cited reference.

The output of the clustering algorithms is the set of cluster centers c^1, ..., c^M. As soon as the centers of the clusters are known, the weights w_r(a_t) can be calculated using the formulas (3.7) - (3.10). The calculation of the weights concludes the first stage.

Training of the computational units is similar to the fitting of an ARMA model because each computational unit is, in fact, an ARMA model. The only difference is that the mutual interaction of the computational units has to be considered.

Section 2.2.3 summarizes common methods used for fitting ARMA models. As the problem is not linear, many of these algorithms are based on gradient optimization of the error or likelihood function. But these methods are expensive in terms of computational time, which limits their applicability in the context of genetic optimization where many networks have to be trained.

Hence, the regression procedures are more appropriate when the genetic optimization is taken into account. From the theoretical point of view, these algorithms are less accurate and their convergence is not guaranteed. But they are much faster than the gradient algorithms. Therefore, the proposed solution is to use a modified version of algorithm 1 for the fitting of the computational units during the genetic optimization.

For selected networks a gradient algorithm can be applied subsequently to further increase the accuracy. Generally, any non-linear least squares algorithm can be directly applied to minimize the squared error of prediction and optimize the parameters of the response function. Therefore only the initial regression procedure has to be explained.

This procedure is essentially the same as algorithm 1. There are only two differences. First, in each iteration all the computational units have to be fitted, and the residuals for the next iteration are derived subsequently from the response of the neuron as defined in (3.4).

Second, the weighted regression is applied using the weights w_r(a_t) instead of OLS. In addition, if the time dissipation is activated, the weights w_r(a_t) are adjusted with respect to time. This is done by a multiplicative factor e^{−λt}, where λ determines the speed of time dissipation; consistently with section 3.2.3, the most recent observation keeps its weight unchanged.
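A minimal sketch of this adjustment (illustrative names; it assumes that the time index in the exponent is measured backwards from the most recent training observation, so that the newest observation keeps weight one, as described in section 3.2.3):

import numpy as np

def dissipated_weights(w, t_index, lam):
    # Multiply the regression weights w_r(a_t) by the time-dissipation factor.
    # w       -- weights of one computational unit over the sample, shape (m,)
    # t_index -- time index of each observation, 0, 1, ..., m-1 (m-1 = most recent)
    # lam     -- speed of time dissipation (lambda); lam = 0 disables dissipation
    age = t_index.max() - t_index          # 0 for the newest observation
    return w * np.exp(-lam * age)          # exponential decrease into the past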

The procedure is described in algorithm 2 using the notation from the previous section. It is an iterative process where the convergence criteria for the maximal change of parameters (µ), the maximal change of error (η), and the maximum number of iterations (K) are chosen with respect to the computational power at disposal. In some cases this algorithm does not converge. So if the maximum number of iterations is reached without the convergence criteria being met, then the estimation of parameters from the first iteration is used - it means that the feedback is ignored in this case.

Algorithm 2: Regression algorithm for fitting of computational units

Input: (y_t)_{t=0}^{m}, (a_t)_{t=0}^{m} ⊂ R^{u+v}, lags k_1, ..., k_p and l_1, ..., l_q (p, q ∈ N), weights w_r(a_t) (∀r ∈ {1, ..., M})
Output: parameters p^r = diag(P^r) for all r ∈ {1, ..., M}, residuals (ε_t)_{t=0}^{m}

Set z^0_t = ((a_t)_1, ..., (a_t)_{u+v}, y_{t−k_1}, ..., y_{t−k_p}, 0, ..., 0) for t ∈ {0, ..., m}
for r = 1 to M do
    Fit y_t = z^0_t p^r_0 + ε_t using weighted linear regression with weights w_r(a_t)
end
Set ε^0_t = y_t − f(z^0_t) for all t ∈ {0, ..., m}
Set i = 1
while i ≤ K and Σ_{r=1}^{M} ‖p^r_i − p^r_{i−1}‖ > µ and ‖(ε^i_t)_{t=0}^{m} − (ε^{i−1}_t)_{t=0}^{m}‖ > η do
    Set z^i_t = ((a_t)_1, ..., (a_t)_{u+v}, y_{t−k_1}, ..., y_{t−k_p}, ε^{i−1}_{t−l_1}, ..., ε^{i−1}_{t−l_q}) for t ∈ {0, ..., m}
    for r = 1 to M do
        Fit y_t = z^i_t p^r_i + ε_t using weighted linear regression with weights w_r(a_t)
    end
    Set ε^i_t = y_t − f(z^i_t) for all t ∈ {0, ..., m}
    Set i = i + 1
end
if i = K and Σ_{r=1}^{M} ‖p^r_i − p^r_{i−1}‖ > µ and ‖(ε^i_t)_{t=0}^{m} − (ε^{i−1}_t)_{t=0}^{m}‖ > η then
    Set p^r = p^r_0 and ε_t = ε^0_t
else
    Set p^r = p^r_{i−1} and ε_t = ε^{i−1}_t
end
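A compact Python sketch of the iterative part of algorithm 2 might look as follows. The helper names and the simplified convergence test are illustrative assumptions, not the thesis implementation; each computational unit is fitted by weighted least squares, the residuals of the whole neuron are recomputed from the smoothed response (3.4), and the lagged residuals are fed back into the regressors.

import numpy as np

def fit_ncs(y, A, series_lags, error_lags, W, K=20, mu=1e-4, eta=1e-4):
    # y: target series, shape (m,); A: inputs a_t, shape (m, u+v); W: weights w_r(a_t), shape (M, m)
    m, M = len(y), W.shape[0]

    def regressors(eps):
        # design matrix z_t: a_t, bias, lagged y, lagged residuals (zeros where unavailable)
        cols = [A, np.ones((m, 1))]
        cols += [np.r_[np.zeros(k), y[:-k]].reshape(-1, 1) for k in series_lags]
        cols += [np.r_[np.zeros(l), eps[:-l]].reshape(-1, 1) for l in error_lags]
        return np.hstack(cols)

    def fit_once(Z):
        P = np.empty((M, Z.shape[1]))
        for r in range(M):
            sw = np.sqrt(W[r])                           # weighted least squares for unit r
            P[r], *_ = np.linalg.lstsq(Z * sw[:, None], y * sw, rcond=None)
        y_hat = np.einsum('rm,mn,rn->m', W, Z, P)        # smoothed response, eq. (3.4)
        return P, y - y_hat

    eps = np.zeros(m)
    P0, eps0 = fit_once(regressors(eps))                 # first pass: feedback ignored
    P_prev, eps_prev = P0, eps0
    for i in range(1, K + 1):
        P, eps = fit_once(regressors(eps_prev))
        if np.abs(P - P_prev).sum() <= mu or np.linalg.norm(eps - eps_prev) <= eta:
            return P, eps                                # converged
        P_prev, eps_prev = P, eps
    return P0, eps0                                      # no convergence: fall back, feedback ignored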

3.4 Neural network with cooperative switching

The neuron with cooperative switching was defined in the previous section. A neural network based on this type of neuron is proposed here. The network is referred to hereafter as a neural network with cooperative switching (NNCS).


Let us first have a look at the motivation for building networks of NCS, before describing the architecture and dynamics. In general, a network can have parallel and serial connections. If one of them were obviously meaningless it should not be included in the model, so as not to complicate it needlessly. But this is not the case of NCS.

Serially connected NCS are, in fact, decomposing the series step by step. If the first neuron can capture the most significant components then the next one has a better chance to find less obvious patterns. This is an analogy to the commonly used approach where the long-term trend is estimated first, next the seasonal components, and finally a stochastic model is fitted with respect to the residuals.

In addition, different neurons can use different inputs and hence their clusters are also different, which further increases the potential of the whole model. This is even more clear for parallel neurons. Figure 3.4 presents the results of an application of two parallel neurons to a simple time series.

The series is deterministic and is composed of two seasonal components, each of them with a different frequency. The inputs of the neurons define the position of a particular time within the season (e.g. if the season has period 4 then the inputs are 1, 2, 3, 4, 1, 2, ... subsequently). The two neurons each use one of these two inputs. The results show that the two parallel neurons are able to deal with the series quite well.
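A toy series and seasonal-position inputs of this kind can be generated along the following lines (an illustrative sketch; the periods, amplitudes and length are arbitrary and not taken from the experiment):

import numpy as np

t = np.arange(240)                       # 12 repetitions of the longer season
pos1 = t % 4 + 1                         # position within the short season: 1,2,3,4,1,...
pos2 = t % 20 + 1                        # position within the long season: 1,...,20,1,...
series = np.sin(2 * np.pi * pos1 / 4) + 2 * np.cos(2 * np.pi * pos2 / 20)
# one parallel neuron receives pos1 as its input, the other receives pos2;
# each of them can only capture the seasonal component visible in its own input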

[Figure 3.4: Approximation of a simple series with two parallel neurons with cooperative switching. The plot shows the value of the series over several periods together with the estimates of the two seasonal components (Season 1 - est., Season 2 - est.), the estimated series and the residuals.]

Of course, the ideas expressed in the above paragraphs do not prove anything - but they are quite illustrative and they demonstrate the capabilities of the model from a bird's eye view. From the theoretical point of view, the sequence of NSU - which is a special case of the architecture proposed in the next section - is dense in continuous functions (see [Hla01]). Putting it all together, it gives at least a good motivation to "give it a try".

3.4.1 Architecture

A brief description of the proposed architecture was already given in section 3.2.4. This section presents the ideas introduced there in a structured way, benefiting from the rigorous description of NCS. Respecting the terminology introduced at the beginning of section 2.3, the topology is defined first and the description of the dynamics follows.

The topology has two levels. The top level topology is defined by an arbitrary directed acyclic graph (DAG) with a single input and a single output node, and with the input node connected to all nodes except the output one.

The first restriction is implied by the genetic algorithms. Considering the fact that both the input and output neuron could be consistently ignored, the restriction could be eliminated easily. Here it is assumed that the input node carries the inputs of the network - that is, the exogenous variables in the context of time series forecasting. The limitation to a single output neuron is then in compliance with the demanded behavior of a neural network - only one response is expected when a single input is provided. Therefore, the restriction to a single input and a single output is not a real limitation.

The second restriction is enforced to make sure that the nodes have a connection to the inputs of the network. This comes from the idea of step by step filtering of particular patterns sketched in the motivation paragraphs above. The missing connection between the input and output node has no impact on the generality of the topology and is discussed later.

The nodes of the top level topology are referred to as blocks. There are no restrictions on the contents of a block from the topological point of view. But it is expected that the structure of blocks is a priori logical and simple, so as not to reduce the capabilities of the genetic algorithms, as the inner structure of blocks is not subjected to optimization.

The inner topology of the blocks is the second, bottom layer of the topology. Currently, there is no evidence for the dominance of any topology in relation to time series tasks, as the sequences of NSU turned out to be inefficient in this area. Therefore, only blocks with a single NCS are considered.


There are three additional components in the block: the variable selector, the quality controller and the normalizer. These realize more or less the interface and support the functionality of the NCS. The diagram of the block structure is given in figure 3.5. Here the description of the topology ends and the next paragraphs describe the dynamics, starting with the specification of the three listed components of the block.

The variable selector works solely with the inputs from the input block - that is, the exogenous variables. It selects a fixed subset of the exogenous variables and presents them to the normalizer untouched. The remaining exogenous variables are blocked.

The quality controller, on the contrary, deals with the outputs of the other parent blocks but not with the input one. Taking into account that the output of a block is equal to the response of an NCS (3.3), it also includes the estimates of quality defined by formulas (3.5) and (3.6).

For a given time the quality controller takes the estimated qualities of all the ancestors and normalizes them, separately, to sum up to 1. Then it multiplies the remaining component of their outputs (defined by (3.4)) with them. It presents to the normalizer both the multiplied and the original value of the response.

The normalizer simply takes all the inputs and normalizes them using multiplicative factors determined during the training algorithm for each input dimension separately. The normalized inputs are passed to the NCS.

Finally, the component (3.4) of the response is summed across all dimensions, hence the output of the block is the prediction of the series plus the two quality measures (3.5) and (3.6). Note that lagged observations of the series as well as errors are presented to the NCS as they are.
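A rough sketch of the data flow through one block is given below. All function and parameter names are hypothetical; the way the quality-scaled parent outputs are combined is an assumption about one plausible reading of the description above, and the NCS response is assumed to return the triple from equation (3.3).

import numpy as np

def block_forward(exog, parent_outputs, lagged_y, lagged_eps,
                  selected, norm_factors, ncs_response):
    # exog: exogenous variables at time t; parent_outputs: list of parent triples (y'_p, g_p, h_p)
    # selected: indices chosen by the variable selector; norm_factors: factors set by the normalizer
    x = exog[selected]                                    # variable selector
    pieces = [x]
    if parent_outputs:                                    # quality controller
        y_p = np.array([o[0] for o in parent_outputs])
        g_p = np.array([o[1] for o in parent_outputs])
        h_p = np.array([o[2] for o in parent_outputs])
        g_p, h_p = g_p / g_p.sum(), h_p / h_p.sum()       # normalize each quality separately to sum to 1
        pieces += [y_p, g_p * y_p, h_p * y_p]             # original and quality-scaled parent outputs
    a = np.concatenate(pieces) * norm_factors             # normalizer: scale into the unit cube
    f, g, h = ncs_response(a, lagged_y, lagged_eps)       # NCS response, eq. (3.3); lags passed untouched
    return f.sum(), g, h                                  # series prediction plus the two quality measures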

The only exception is the last, output block, where neither the exogenous variables nor the lagged values are considered. The reason for this construction is rather technical than methodological. It has already been mentioned that the convergence of the training algorithm of NCS is not guaranteed if the lagged errors are presented to the model. Moreover, it might happen that for a certain combination of inputs the regression could be very sensitive due to the limitations of the employed implementations of linear regression.

Therefore, it is worth keeping the last neuron as simple as possible, because if this neuron is not trained properly then the whole network is odd. If this happens inside the network then the last neuron cuts off the appropriate inputs accordingly and provides a meaningful response.


[Figure 3.5: Diagram of a block with a single neuron with cooperative switching. The exogenous variables pass through the variable selector and the outputs of the parent blocks pass through the quality controller; together with the delayed observations they are normalized and fed into the NCS, whose first response component is summed to give the prediction y′. Here, y′ stands for the first component of the response defined by (3.4), which is in fact the prediction of y; Q represents the remaining two components of the response defined by (3.5) and (3.6), which are estimates of the quality of the output. The meaning of the rectangular, executive blocks is described in the text.]


Literally, this turns the output node into the role of a judge who decides whom it should trust.

The only thing left unexplained (concerning the architecture discussed in this section) is the interaction of the blocks. Actually, it is very intuitive and naturally implied by the acyclic topology. When an input is presented to the network, it is first processed by the blocks with no parents other than the input block. Their outputs are passed on to the subsequent blocks until the input is processed by all the parents of the output block, which afterwards calculates the response of the whole network.

An architecture of this type is referred to as an architecture with variables selection in the context of neural networks with switching units. Its application is definitely not limited to time series tasks. Experiments with this type of architecture applied to the classification of elementary particle decays can be found in [Kal09].

3.4.2 Training algorithm

Training of NNCS is given by the character of the training algorithm of NCS. Each neuron is trained to best fit the forecasted series based on the presented training data and its inputs. It means that an NCS can be trained whenever the response of its parents has been calculated for the training data.

Hence the training of NNCS is a feedforward algorithm. The training data flow through the network in the same way as if the response were calculated. But, obviously, each neuron is trained right before it processes the data.

Finally, the training of the normalizer inside the blocks should be defined to make the description of the training algorithm complete. This component is responsible for the normalization of the inputs of the NCS. It should compress the incoming data into the unit cube. Therefore, its parameters are simply set with respect to the maximal value of the presented data for each dimension separately.

For a given dimension the multiplicative normalization factor is the inverse value of the maximum of the presented data increased by a safety margin. A natural choice of the safety margin is half of the smoothing band width. This makes the system of weights applied in the subsequent NCS consistent.
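As a small illustration (the function name is hypothetical, and it is assumed here that the safety margin enlarges the observed maximum multiplicatively, since the text does not spell out the exact form):

import numpy as np

def fit_normalizer(training_inputs, theta):
    # Multiplicative normalization factors, one per input dimension.
    # training_inputs -- array of shape (m, dim); theta -- smoothing band width (theta/2 = safety margin)
    max_abs = np.abs(training_inputs).max(axis=0)
    return 1.0 / (max_abs * (1.0 + theta / 2.0))   # inverse of the enlarged maximum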

At this point the definition of the neural network with cooperative switching is complete. Chapter 4 presents the results of several experiments with this type of network, with the aim to show how important the particular features are. The construction of the networks involved in the experiments is driven by a genetic algorithm. Therefore, the basic concept of genetic optimization of neural network topologies is briefly explained in the next section.


3.5 Genetic optimization of architectures

The purpose of this section is definitely not to provide detailed information about genetic algorithms. The aim is to explain the basic terms to simplify the orientation in the next chapter. An exact and comprehensive description of the genetic algorithms applied in the experiments presented later can be found in [Kal09]. The study of genetic algorithms is not a subject of this thesis and they were used exactly in the form presented in the cited reference.

A genetic algorithm tutorial [Whi] says:

"Genetic Algorithms are a family of computational models inspired by evolution. These algorithms encode a potential solution to a specific problem on a simple chromosome-like data structure and apply recombination operators to these structures so as to preserve critical information."

This general description of genetic algorithms indicates that a great number of actual implementations exist. Considering neural networks, they can be applied both to the training of an individual network and to the optimization of architectures. Only the latter application is considered in the context of NNSU and NNCS.

In general, the idea of genetic algorithms is to find good individuals by evolution of an initial population of individuals, which is usually randomly generated. Here the term individual stands for any type of object subjected to the optimization, the evolution is an iterative process, and a population is the set of individuals in one iteration of the evolution, referred to as a generation.

Each iteration of a genetic algorithm has two stages. First, good individuals are selected, and second, new individuals are created through recombination operators. The quality of individuals is measured by a fitness function. The two recombination operators are cross-over and mutation. While cross-over operates on two (or more) individuals, mutation deals with one. The definition of the fitness function and of the recombination operators determines the actual implementation of the genetic algorithm.

Recombination operators work with descriptions of individuals encoded by genomes. Finding an efficient encoding is the crucial task when a genetic algorithm is designed. On one side it has to be flexible so as not to limit the feasible individuals, and on the other it has to be simple to make the recombination possible.

Concerning the current implementation of the genetic algorithm for the optimization of NNSU architectures, the individuals are neural networks. The representation of the topologies combines two methods: the Program Symbol Trees (PST) approach introduced by Frederic Gruau [Gru94], and Read's linear codes (Codes) by Ronald C. Read [Rea72]. The resulting representation is then called Instruction-Parameter Code.

Roughly speaking, the applied genomes encode the top level topology and also carry the description of the individual blocks. The definition of the top level topology is affected by both cross-over and mutation, whereas the description of the individual blocks can only be changed through mutation. Exact definitions of the genomes and the recombination operators are beyond the scope of this thesis; they are explained in detail in the already cited reference [Kal09].

This brief introduction of genetic algorithms in the context of neural networks with switching units and their enhanced alternative proposed here concludes this chapter. The next chapter presents the results of the application of neural networks with cooperative switching to the forecasting of a typical seasonal time series - the daily series of currency in circulation.


Chapter 4

Forecasting currency in circulation

Several improvements of neural networks with switching units in the context of time series forecasting have been presented in the previous chapter. The improvements have been implemented as an extension to the existing NNSU engine. With this system several experiments have been conducted to solve two issues.

First, to find the best setup that would produce an optimal forecasting model for the daily series of volumes of currency in circulation. And second, to examine the impacts of the particular modifications proposed in chapter 3.

A summary of the results and a discussion of the benefits and drawbacks of different setups is given in this chapter. The experiments proved that the application of neural networks with switching units to time series forecasting is relevant when the proposed modifications are implemented. The comparison with conventional models shows that NNSU outperforms the individual models and also their combinations.

4.1 Setup of experiments

Each experiment is a single run of a genetic algorithm where the optimized individuals are neural networks. The aim of the experiments is to provide the most accurate forecasts of the daily series of volumes of currency in circulation (CIC series). A typical example of an output is given in figure 4.1, which shows the series, the one step ahead prediction and their difference (residuals) evaluated over the period from July 2003 until June 2004.


[Figure 4.1: Forecasting volume of currency in circulation. The upper panel shows the actual and forecasted values of the volume (in billions of CZK) from June 2003 to May 2004; the lower panel shows the residuals.]

All experiments deal with the same data and they use the same engine for genetic optimization. The experiments differ in the feasible architectures of the networks, as explained in section 4.1.3.

4.1.1 Data

The CIC series in the period from 1 January 1996 until 30 June 2004 is examined in the experiments. This period is split into three distinct intervals: the training period (1 January 1996 until 30 June 2002), the evaluation period (1 July 2002 until 30 June 2003) and the testing period (1 July 2003 until 30 June 2004). The data related to these periods are referred to as training, evaluation and testing data respectively. The abbreviations TR, EV and TE refer to the time interval, the observed values of the series or the exogenous data, depending on the context.

Figure 4.1 as well as figure 2.4 shows that there is a long-term trend present in the series. Therefore, the first difference of the series (see figure 2.4 (c)) is considered instead of the original series. This is also in compliance with the findings presented in [Hla04] and [Sen06].


Lagged values of the CIC series are important inputs of all the neurons. Lags up to the one year delay are used due to the long memory of the series. However, to limit the total number of lags involved, only the most recent ones (1-15) plus those referring to the monthly (20-23), bi-monthly (41-45), quarterly (60-65) and annual (250-255) delays¹ are considered. The same selection of delays applies to the error lags as well.
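Written out explicitly, the candidate lag list contains the following working-day delays (a small illustrative helper, matching the delay groups listed above):

# candidate lags of the CIC series (and of the errors): recent, monthly,
# bi-monthly, quarterly and annual delays, in working days
candidate_lags = (list(range(1, 16)) + list(range(20, 24)) + list(range(41, 46))
                  + list(range(60, 66)) + list(range(250, 256)))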

Exogenous factors are the last but not least portion of the inputs. The exogenous factors are exclusively used to describe the seasons that might influence the series. They enter the models in two different forms. The first type of seasonal factor is a superposition of trigonometric functions and the second is a lagged polynomial of an indicator function.

A seasonal factor expressed as a superposition of trigonometric functions has the form:

d_{t,i} = Σ_{j=1}^{p} ( a_{i,j} sin(2jπ m_{i,t} / M_{i,t}) + b_{i,j} cos(2jπ m_{i,t} / M_{i,t}) ),    (4.1)

where d_{t,i} is the value of the ith factor at time t, M_{i,t} is the length of the current cycle (for example the length of a month) and m_{i,t} is the position in the current cycle at time t (the current day in the month). Finally, the positive number p sets the number of different frequencies forming the factor. Naturally, the more frequencies are considered, the better the approximation is. However, it is recommended to keep the number of frequencies p below nine.
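A sketch of the harmonic terms behind equation (4.1) is given below (Python; it is assumed here that the coefficients a_{i,j}, b_{i,j} are absorbed into the fitted linear transformation, so only the sine and cosine regressors are constructed; the function name is illustrative):

import numpy as np

def monthly_harmonics(m, M, p):
    # Harmonic terms of eq. (4.1) for one time index.
    # m -- position within the current cycle (e.g. current day in the month)
    # M -- length of the current cycle (e.g. number of days in the month)
    # p -- number of frequencies (kept below nine)
    j = np.arange(1, p + 1)
    phase = 2.0 * np.pi * j * m / M
    return np.concatenate([np.sin(phase), np.cos(phase)])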

The second group of seasonal factors is mostly applied to isolated events like national holidays or shocks, and is of the form:

d_{t,i} = Γ_i(B) B^{−F_i} τ_i(t),    (4.2)

where Γ_i is a polynomial in the backshift operator B, τ_i is the seasonal indicator function (τ_i(t) = 1 if the ith season occurs at time t and τ_i(t) = 0 otherwise), and F_i is a positive integer. The combination of the polynomial Γ_i and the term B^{−F_i} guarantees that a particular season might influence future as well as past observations.
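In code, the shifted indicator columns implied by equation (4.2) can be prepared as follows (an illustrative sketch; as with the trigonometric factors, the coefficients of Γ_i are assumed to be fitted by the model, so only the shifted copies of the indicator are built):

import numpy as np

def shifted_indicators(tau, order, F):
    # Shifted copies of an indicator series used as regressors in eq. (4.2).
    # tau: 0/1 indicator of the season, shape (m,); order: order of Gamma_i; F: forward shift F_i
    # Column k holds tau(t + F - k), so the coefficients gamma_k can be fitted by the model.
    m = len(tau)
    cols = []
    for k in range(order + 1):
        shift = F - k                       # positive shift looks into the future
        col = np.zeros(m)
        if shift >= 0:
            col[:m - shift] = tau[shift:]   # tau(t + shift)
        else:
            col[-shift:] = tau[:m + shift]  # tau(t + shift), shift negative
        cols.append(col)
    return np.column_stack(cols)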

Finally, all the factors described in section 2.6.2 were included in the list of explanatory variables. The only season modeled in the form (4.1) is the monthly cycle. The rest of them are represented by lags of indicator functions, including the day of the week, where the indicator function is related to Friday. A brief overview of all factors together with the numbers of lags is summarized in table 4.1.

¹ Observations are limited to working days only; hence the number of observations per year is approximately 252.


seasonal factors, shocks    order of Γ    power F    number of frequencies
intra-monthly effect        —             —          8
day of week                 3             0          —
Easter                      10            5          —
fixed holiday               10            5          —
non-working days            10            5          —
Christmas                   15            5          —
New Year                    5             5          —
bank failure                15            5          —
Y2K                         10            5          —

Table 4.1: Seasonal factors and shocks

4.1.2 Optimization by genetic algorithms

The genetic algorithm optimizes the architectures of the networks, which include:

• Topology of network

• Number of computing units for each neuron

• Selection of exogenous variables for each input neuron

• Selection of error and series lags for each neuron

• Other characteristics of neurons like power of time dissipation or width ofsmoothing band (if applicable)

The remaining parameters of the neurons are optimized using the training data set. Optimization of neurons with local feedback is restricted to the iterative regression only. Gradient optimization is not activated due to its enormous demand for computation time. For more details on the training algorithm see chapter 3.

Due to technical limitations of the system, the genome codes only the first three items on the above list. The remaining characteristics of the neurons are not included in the genome. Therefore, when a network is carried from one population to the next these characteristics are lost and randomly generated again. The same applies to recombined networks - parameters that are not included in the genome are always generated randomly. It means that there is an extremely high probability of mutation, almost equal to 1. However, the results prove that it is worth using the genetic algorithms rather than a completely random search even though their functionality is limited.

The fitness function is defined as the inverse of the root mean square error (RMSE) of the one step ahead prediction over the evaluation period. Formally, let us assume that NN(P) is a neural network trained using the data from the time period P. Next, let y^{NN(P)}_t be the one step ahead prediction of the time series (y_t) calculated by the trained neural network NN(P) at time t. Then

C_EV(NN) = sqrt( Σ_{t∈EV} (y_t − y^{NN(TR)}_t)^2 / |EV| )    (4.3)

is the RMSE of the one step ahead prediction over the evaluation period and

fitness(NN) = 1 / C_EV(NN)    (4.4)

is the value of the fitness function. Note that the testing data are not used during the optimization - they are reserved for testing the quality of the resulting architectures.
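In code, the evaluation criterion and the fitness might be computed along these lines (the names are illustrative):

import numpy as np

def rmse(y_true, y_pred):
    # Root mean square error of one step ahead predictions, as in eq. (4.3).
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def fitness(y_eval, y_eval_pred):
    # Fitness of a network: inverse RMSE over the evaluation period, eq. (4.4).
    return 1.0 / rmse(y_eval, y_eval_pred)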

4.1.3 Feasible architectures

The common properties of the performed experiments were discussed up to this point. In the following paragraphs the differences, mostly given by the architectures involved, are highlighted and explained using table 4.2 as an outline.

First of all, the size of an experiment is determined by the size of the population and the number of generations. In most cases populations of 50 individuals and 50 generations have been used. Only in the most general cases, where all the extra features have been activated, the total number of networks has been approximately doubled by evolving a population of 70 networks over 70 generations, to make sure that the results are not negatively influenced by the increase in variability.

In both cases recombination and mutation gave rise to 25 networks in each generation. The remaining ones were rolled over from the previous generation. As already mentioned in 4.1.2, only the topology and the counts of computing units remain untouched or are affected by the recombination or mutation; the rest of the parameters was always generated randomly.


Table 4.2: Setup of experiments

experiment              A   B   C   D   E   F   G   H   I   J
number of generations   50  50  50  50  50  50  70  50  70  50
population size         50  50  50  50  50  50  70  50  70  50
min blocks              2   2   2   2   2   2   2   2   2   2
max blocks              15  15  15  10  20  15  15  15  15  20
min NCSU per block      1   1   1   2   1   1   1   1   1   1
max NCSU per block      1   1   1   4   1   1   1   1   1   1
min CU per NCSU         2   2   2   2   2   2   2   1   2   2
max CU per NCSU         5   5   5   5   5   5   5   1   5   5

Feature flags of the experiments: cooperative switching is active in nine of the ten experiments, local feedback in nine, quality weights in eight, time dissipation in three, and the topology is restricted to one layer only in three experiments (E, I and J).

The next parameters of the experiment determine the minimum and maximum number of blocks, of neurons in blocks and of computing units in neurons. While the limits for the number of neurons per block and the number of computing units per neuron are valid for all the networks in a particular experiment, the limit for the number of blocks applies to the randomly created networks only - that is, to the initial population. In the later populations the recombination of topologies may lead to networks with a higher and also a lower number of blocks.

Most of the experiments deal with 2-15 blocks. However, there are two exceptions. First, experiment D, which is the only experiment that allows blocks with more than a single neuron (2-4). And second, two of the experiments where the topology is restricted to a single layer network (E and J). In the first case the lower number of blocks balances the larger size of the blocks, while in the second one a bigger number of blocks is allowed to balance the topological constraint. To see if the size is important for single layer networks, the number of neurons is kept in the range 2 to 15 for experiment I, which also deals with this kind of networks.

Actually, more than a single experiment with non-singular blocks has been performed. However, all these experiments produced similar (not very optimistic) results. Therefore experiment D is the only one of them included in this summary.

The number of computing units per neuron is fixed to the range 2-5. There is again one exception - experiment H - which demonstrates the capability of networks where each neuron is simply an ARMA model.

Another degree of freedom of the experiments is given by the activation and deactivation of the features introduced in chapter 3, namely: cooperative switching (3.2.1), local feedback (3.2.2), quality weights (3.2.5) and time dissipation (3.2.3). When activated, the local feedback deals with the lags summarized in the description of inputs above (4.1.1). In the experiments where smoothing was considered, the relative width of the smoothing band was fixed to 0.2. The time dissipation was employed in three experiments and the power was drawn randomly for every neuron separately and uniformly from the range 0-0.003.

Finally, as already mentioned, the topology in experiments E, I and J was restricted to a single layer. Hence it is possible to check whether more complex topologies increase the forecasting capabilities. Obviously, the number of experiments that might be combined from the listed features is endless. From this point of view it might be misleading to draw conclusions based on the results of 10 experiments. However, the experiments presented here are only a selection of all the results obtained during the research of neural networks with switching units, carefully selected to demonstrate the most important findings. Moreover, the aim is not to answer every relevant question in detail but to figure out which of the implemented improvements significantly improved forecasting accuracy. For this purpose the presented results should be sufficient.

4.2 Results of experiments

Hundreds of networks were constructed, trained and evaluated during each experiment and it is impossible to study them all one by one. This section presents the results of experiments A-J in an aggregated form. The overview is focused on the accuracy of the best networks because it is the most important quantitative measure of the performance achieved by the individual experiments.

The definition of the best networks is formalized in section 4.2.1 together with the criteria of quality that are employed in the comparison of the selected networks. The summary of results follows in 4.2.2 and more detailed results are further available in appendix A.


4.2.1 Quality criteria for selection and comparison

The best networks for a given experiment are selected according to the fitness function (4.4). Recall that the fitness is the inverse value of C_EV, which denotes the RMSE of the one step ahead prediction over the evaluation period as defined by equation (4.3). Keeping this in mind, and following the common practice in time series forecasting, C_EV is reported here instead of the value of the fitness function.

It means that the target of the genetic algorithms, which is to maximize the fitness function, can be reworded in terms of C_EV. Obviously, maximization of the fitness is then equal to minimization of C_EV.

As the fitness function is already used in the optimization process and for the selection of good networks produced by a particular experiment, it cannot be used for the crosswise comparison of the experiments. For these purposes the quality of the networks is measured by the RMSE of the one step ahead prediction over the testing period.

Therefore, all the networks are also applied to the testing data. But first, they are trained again using the data from the training and evaluation period together. This is necessary because the behavior of the CIC series might not be consistent over the whole period and so the recent observations are particularly important. Hence the criterion is defined as

C_TE(NN) = sqrt( Σ_{t∈TE} (y_t − y^{NN(TR ∪ EV)}_t)^2 / |TE| )    (4.5)

using the same formalism as in formula (4.3).

With the quality criteria C_EV and C_TE defined, it can be formalized what the best networks are. For a given experiment, the C_EV-top n or C_TE-top n networks are the n networks with the lowest value of the criterion C_EV or C_TE respectively that were created during the experiment. When a single network is considered, C_EV-best or C_TE-best is used instead of C_EV-top 1 or C_TE-top 1.

Note that both quality criteria C_EV and C_TE have the same meaning. The only differences are the data used for training of the network and the period over which the RMSE is evaluated. Of course, it is possible to define more complex criteria of quality that might also be used in the genetic algorithms as a fitness function, alone or in combination with C_EV.


An important example of an alternative criterion of neural network quality is

C_{OF}(NN) = \dfrac{ C_{EV}(NN) }{ \sqrt{ \dfrac{ \sum_{t \in TR} \left( y_t - y_t^{NN(TR)} \right)^2 }{ |TR| } } }        (4.6)

which is referred to in this thesis as the coefficient of overfitting. This criterion compares the RMSE over the evaluation period with the RMSE over the training period. Although the criterion CEV automatically ignores networks that perform well on the training data but give poor results on the evaluation data, it might be helpful to explicitly penalize models for unstable performance. The criterion COF is not included in the summary of results, but average values are reported in appendix A.

Because the RMSE controls only the average size of the errors, the overview of results also contains the 95% and 99% percentiles of the absolute values of the residuals of the one step ahead prediction over the testing period. This means that CTE and the reported percentiles are statistics derived from the same residuals. However, the estimates of the percentiles are determined by the worst 14 and 4 errors respectively, and hence they are considered as supportive indicators only.

The percentiles can be decisive in two particular situations. First, an excessively large percentile value might disqualify a model. Second, a large difference in percentiles could give preference to one of two models whose CTE values are very close.
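As a sketch, the supportive percentile indicators can be derived from the same testing-period residuals as follows (Python; the variable names are illustrative only):

```python
import numpy as np

def error_percentiles(actual, predicted, percentiles=(95, 99)):
    """Percentiles of the absolute one step ahead prediction errors."""
    abs_errors = np.abs(np.asarray(actual) - np.asarray(predicted))
    return {p: float(np.percentile(abs_errors, p)) for p in percentiles}

# Note: with the length of the testing period used here, the 95% and 99%
# percentiles are determined by only the worst 14 and 4 errors, which is
# why they serve as supportive indicators rather than primary criteria.
```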

Finally, the size of the networks is also considered in the comparison of different architectures. The size can be measured by the number of neurons or by the number of parameters of the network.

4.2.2 Summary of results

The purpose of the genetic algorithm involved in the experiments is to find the best network in terms of forecasting accuracy. Therefore the results of the best network created during an experiment might be considered as the results of the experiment as a whole. However, it could be misleading to compare different architectures based on the outputs of a single network. Therefore the CEV-top 10 networks are selected from each experiment to make the basis for comparison more robust.

Table 4.3 (a) presents the results of experiments A-J sorted by the average value of CTE calculated from the CEV-top 10 networks. To simplify the text, the averages calculated from the results of the top 10 networks are hereafter referred to as average values, and the results of the best network as minimum values, whenever it is clear from the context.


Table 4.3: Forecasting accuracy of selected networks

(a) CEV-top 10 networks

             CEV-top 10, averages              CEV-best
experiment   CEV     95% pct.  99% pct.   CEV     95% pct.  99% pct.
G            0.434   0.846     1.197      0.419   0.806     1.046
A            0.437   0.825     1.174      0.427   0.770     1.087
E            0.441   0.837     1.101      0.428   0.806     0.977
J            0.443   0.839     1.157      0.434   0.783     1.064
B            0.446   0.898     1.168      0.427   0.830     1.070
I            0.456   0.879     1.193      0.445   0.835     1.082
H            0.461   0.888     1.228      0.450   0.836     1.050
C            0.461   0.891     1.232      0.445   0.835     1.079
F            0.465   0.899     1.262      0.451   0.836     1.141
D            0.471   0.917     1.329      0.449   0.868     1.169

(b) CTE-top 10 networks

             CTE-top 10, averages              CTE-best
experiment   CEV     95% pct.  99% pct.   CEV     95% pct.  99% pct.
G            0.416   0.807     1.157      0.411   0.746     1.051
A            0.417   0.809     1.078      0.414   0.764     1.009
B            0.421   0.819     1.123      0.419   0.776     1.070
J            0.424   0.834     1.088      0.418   0.801     0.911
E            0.425   0.808     1.056      0.422   0.793     0.920
I            0.426   0.828     1.070      0.419   0.791     0.960
C            0.428   0.817     1.067      0.424   0.775     0.983
F            0.437   0.846     1.116      0.427   0.796     1.074
D            0.437   0.851     1.125      0.428   0.783     1.014
H            0.445   0.853     1.179      0.439   0.826     1.023


Experiment G gives the best results both in terms of the average and the minimum value of CEV. It is followed by A, which gives results comparable to G when looking at the averages of CTE; however, the best network from G is slightly better than the best network from A. On the other hand, model A gives slightly better results when looking at the percentiles. On the other side of the spectrum is experiment D - it is worse than G by about 8.5% in the average and by about 7.2% in the minimum value of CEV.

Somewhere in the middle of the scale is experiment I. The rest of the experiments can be split into two groups: results better than those of experiment I were produced by E, J and B, while the accuracy of H, C and F lies somewhere between I and D.

The results presented in table 4.3 (a) are definitely the most important. However, they are heavily influenced by the selection of the best networks driven by the CEV criterion. It is also worth taking the results of the CTE-top networks into account to see where the limits of the architectures involved in the experiments lie. These results are reported in table 4.3 (b), again sorted by the average value of CTE.

The results of the CTE-top 10 networks further confirm the dominance of experiments G and A. The smallest difference (around 3.5%) between the CTE of the CEV-top 10 and the CTE-top 10 networks was observed for experiments E and H. In contrast, the biggest difference, around 7%, applies to experiments C, D and I. This proportional inconsistency moves H to the last position of table 4.3 (b) and mixes the mutual order of the remaining experiments as well.

The forecasting accuracy presented in tables 4.3 (a) and (b) may be further improved. First, an ensemble of networks can be used instead of individual networks. Second, the networks can be trained more precisely using gradient optimization when fitting the parameters of neurons with local feedback.

Results of the ensembles of the CEV-top 10 networks are available in appendix A. The prediction of an ensemble is simply constructed as the average of the individual predictions. The improvement ranges from 1.3% (exp. H) to 4% (exp. D) when comparing the CTE of an ensemble with the average CTE of the same networks. In general the CTE of an ensemble is usually only slightly better than the CTE of the best network, and in some cases (B, D, F and H) it is even worse. In addition, the best networks also produce better results than the ensembles when looking at the percentiles.
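The ensemble forecasts reported in appendix A are plain averages of the individual predictions; a minimal sketch (Python, illustrative names):

```python
import numpy as np

def ensemble_prediction(member_predictions):
    """Average the one step ahead predictions of several networks.

    member_predictions: array of shape (n_networks, n_time_steps),
    e.g. the CEV-top 10 networks of an experiment.
    """
    return np.asarray(member_predictions).mean(axis=0)

# The CTE of the ensemble is then the RMSE of this averaged prediction over
# the testing period, compared with the average and the minimum CTE of the
# same ten networks.
```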

Concerning the application of gradient optimization, the improvement for the most interesting experiments A, E and G is quite stable but unfortunately very low. The improvement in accuracy is around 0.2%, which is not very promising. Therefore the impact of gradient optimization was not tested for the rest of the experiments, and all the results reported in this thesis are obtained by networks where the neurons with local feedback are trained by iterative regression only.

The next paragraphs go into a bit more detail and refer to the results presented in appendix A. The appendix contains reports with a fixed layout; the contents of the reports are described precisely in the appendix itself. In brief, the reports show

• evolution of the quality criteria CEV, CTE and COF,

• distribution of CTE for all networks,

• residuals of the one step ahead prediction of the CEV-best network together with the histogram and correlogram,

• scheme of the CEV-best network.

Looking at the evolution of the quality criteria during the optimization (figure (b) of the reports), experiments D and I are particularly interesting. In the case of D the progression of the genetic algorithm is very chaotic and, in addition, the level of overfitting is quite high (see figure A.7).

In experiment I the coefficient of overfitting increases steeply from the middle of the evolution, while CEV is decreasing and CTE is more or less constant but still slowly decreasing (A.17). This indicates that better results are obtained by overfitted networks, which contradicts common experience.

Moving to figures (c) of the reports, experiment D is again special. The figure shows the distribution of CEV over all the networks involved in the experiment. For experiment D the distribution has two peaks (see figure A.7). This suggests that two classes of networks with different forecasting capability are involved. A detailed study of the topologies showed that all the better networks have a single layer only, while the worse results are produced by networks with more complex topologies.

The remaining graphs in the reports refer to the residuals of the one step ahead prediction produced by the CEV-best network. Particularly important are the correlograms in figures (f) and (g). The residuals of a good forecasting model are uncorrelated and hence should not display any significant values. From this point of view the networks from experiments B, C, D, F and I give slightly worse outputs than the others.

The other two graphs - the plot of residuals and the histogram of residuals (figures (d) and (e) respectively) - can hardly be used for a comparison. There is only one interesting message in the plots of residuals: they show that there are three particular days on which the networks from all experiments are consistently wrong, namely 2 January 2004, 18 March 2004 and 8 June 2004.


The poor result in January indicates that additional effort is needed to capture the seasonal effect of the New Year. Unfortunately, no explanation was found for the coincidence on the other days; there is nothing special about these dates and they are not related to any particular season.

Finally, the schemes of the topologies are of informative character only; they are included in the appendix to demonstrate the variability of the topologies involved in the experiments. For comparison, the size of the networks is a more transparent indicator. The size can be measured by the average number of neurons or by the number of their parameters. These quantities are reported in tables 4.4 (a) and (b). The averages are calculated for several sets of networks to show the effect of the genetic evolution.

The average size of the networks increases during the evolution, and the size of the CEV-top 10 networks is above the average in almost all the experiments. The only exceptions are experiments E and D. Experiment D has the completely opposite trend - the average size decreases during the evolution and the average size of the CEV-top 10 networks is below the overall average. In the case of E the average size is almost constant.

4.2.3 Discussion - impact study

Several features typical of neural networks with switching units were described in chapter 2. These have been examined in numerous applications to classification tasks but never tested on time series data. In addition, chapter 3 proposed a couple of modifications to the original model.

The previous section summarized the results of ten different experiments designed to assess the impact of the original as well as the newly proposed features when they are used for time series forecasting.

In brief, the examined features are

• switching units,

• sequences of neurons as building blocks,

• cooperative switching,

• local feedback,

• complex topology,

• weighting by quality,


Table 4.4: Average size of networks

(a) Average number of neurons in selected subsets of networks

experiment   initial       first 25       all        CEV-top 10
             population    generations    networks   networks
A            9             12             17         22
B            9             16             21         27
C            10            14             15         15
D            20            12             12         9
E            12            12             13         12
F            9             11             13         16
G            10            14             20         24
H            9             19             22         25
I            10            14             18         23
J            13            13             15         18

(b) Average number of parameters in selected subsets of networks

experiment   initial       first 25       all        CEV-top 10
             population    generations    networks   networks
A            1678          2207           3174       3955
B            1702          2868           3571       4761
C            1747          2521           2529       2325
D            5525          2998           2962       2041
E            2234          2420           2515       2430
F            1480          1598           1885       2195
G            1836          2608           3308       3782
H            622           1301           1417       1412
I            1773          2808           3809       4778
J            2326          2291           2756       3259


• time dissipation.

The exact meaning of these entries was explained in the previous chapters and is also briefly recalled later in this section, where they are studied one by one. However, two general outcomes of the experiments are discussed first.

Fixed topology is immune to structural changes

At least at first glance, it seems a bit strange that the networks achieve significantly higher accuracy in the testing period than in the evaluation period. There are actually two good reasons for this.

First, it has to be considered that all the networks are trained twice - once using the training data only and once using the training plus evaluation data. The accuracy in the evaluation period is calculated after the first training, while the re-trained networks determine the accuracy in the testing period. This means that the forecasts in the testing period are produced by networks trained on a longer interval.

The second reason, which further emphasizes the first one, is that the CIC series is not stable due to the developing character of the banking system - for example, an increase in cash withdrawal fees would motivate people to withdraw higher amounts but less frequently. As time passes, such systemic changes become less frequent and the series becomes more predictable.

In any case it is clear that any application of the model in a production environment requires regular updates of the model to reflect recent changes. Therefore the fact that the quality of a given network is preserved when it is trained again on an extended data set and applied to a more recent period is very important. It would be much more difficult to re-optimize the topology whenever the model has to be updated than to re-train a single neural network.
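The two-stage protocol described above can be summarised by the following sketch, where train, predict and rmse stand in for the actual NNSU routines (an assumed interface used only for illustration, not the real API):

```python
def evaluate_network(topology, data, train, predict, rmse):
    """Two-stage evaluation used in the experiments (illustrative pseudo-API).

    `data` is assumed to expose the training (TR), evaluation (EV) and
    testing (TE) periods, each with `.inputs` and `.targets`.
    """
    # Stage 1: train on the training period only; C_EV drives the selection
    # inside the genetic algorithm.
    net = train(topology, data.TR)
    c_ev = rmse(data.EV.targets, predict(net, data.EV.inputs))

    # Stage 2: re-train the same (fixed) topology on training + evaluation
    # data; C_TE over the testing period is used to compare experiments.
    net = train(topology, data.TR_plus_EV)
    c_te = rmse(data.TE.targets, predict(net, data.TE.inputs))
    return c_ev, c_te
```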

Genetic algorithms succeeded in the optimization

Section 4.1.2 explains the setup of the genetic algorithms in the executed experiments. It highlights the limited coding capacity of the genome given by the current implementation of the genetic algorithms for NNSU optimization. This limitation results in a completely random selection of error lags and powers of time dissipation whenever a new network is created, and even when it is used "without any change" in the next population. This affects all the experiments except F, where neither local feedback nor time dissipation is considered.


Despite these limitations, the influence of the evolution is evidently positive. Excluding experiment D, the average value of CEV decreases (equivalently, the fitness increases) from the first to the last population. Of course, there are some fluctuations, but the trend is visible. These results indicate that the selection of the right inputs from all the available exogenous variables, and of the topology, is of great importance.

Concerning the topologies, one outcome of experiment D is valuable even though the results of this particular experiment are very poor. It is the only experiment where the average size of the networks decreases during the evolution (see tables 4.4 (a) and (b)). This shows that recombination of networks can influence the size in both directions.

The remaining paragraphs discuss the impact of the features listed at the beginning of this section. The most important experiments used as benchmarks are experiments A and G; the reason is that G is the most complex and A lies somewhere in the "center". For the complete list of features included in a particular experiment see table 4.2.

Switching units

The presence of switching units is obviously the essential characteristic of neural networks with switching units, and it is hopefully not necessary to explain the idea of switching here again. A neuron without a switching unit is equivalent to an ARMA model with exogenous inputs. Hence a network of such neurons is simply a combination of several ARMA models. This architecture is used in experiment H.

The absence of switching units is the only difference between experiments H and G. The importance of switching units is therefore evident, as the performance of the CTE-best network is the best for experiment G and the worst for experiment H. This result was expected, since many seasonal patterns are non-linear and therefore cannot be approximated by linear functions with sufficient precision.
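To make the equivalence concrete, a neuron without a switching unit amounts to a single least-squares regression of the target on its own lags and the exogenous inputs, roughly as in the following sketch (Python with NumPy; the number of lags and the way the regressors are built are illustrative, not the exact NNSU input coding):

```python
import numpy as np

def fit_linear_neuron(y, exog, n_lags=7):
    """Least-squares fit of y_t on its own lags and exogenous inputs.

    Without a switching unit the neuron applies this single fitted
    hyperplane to every input, so a network of such neurons merely
    combines linear (ARMA-like) models; a switching unit would instead
    fit one such regression per input cluster.
    """
    rows, targets = [], []
    for t in range(n_lags, len(y)):
        rows.append(np.concatenate(([1.0], y[t - n_lags:t], exog[t])))
        targets.append(y[t])
    beta, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return beta
```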

Sequences of neurons as building blocks

A sequence of neurons is a serially connected group of neurons - each neuron has exactly one predecessor and one successor, except for the first and the last neuron in the sequence. Originally, it was the only considered architecture of neural networks with switching units. Sequences of neurons proved to be very powerful in many separation tasks. Therefore they were used as building blocks of neural networks once the genetic algorithms were generalized to general acyclic topologies [Kal09].

Experiment D was designed to test whether this strategy could overcome the limitations of sequences of neurons in the context of time series discussed in section 3.1. The results show that this approach is not the right way.

Networks created during experiment D are very often heavily overfitted. During the optimization the topologies are reduced to a single-layer topology as the genetic algorithm works to minimize the effect of overfitting. In the end, networks consisting of a single block are not rare and achieve some of the best results among the networks in experiment D.

Cooperative switching

The response of a neuron with a switching unit is equal to the response of the single computational unit selected to process the given input. The response is therefore a piecewise linear function of the inputs which is discontinuous at the borders between clusters. A neuron with cooperative switching takes the outputs of all computational units into account, and its response is their weighted superposition. The weights make the response continuous, and during the training they also give higher importance to the inputs that are closer to the centers of the clusters. For more details see section 3.2.1.
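A minimal sketch of the difference between the two switching modes, assuming each computational unit i has a cluster centre c_i and a linear map (W_i, b_i); the particular weighting function used below (a softmax of negative distances) is only one way of making the superposition continuous and is not necessarily the exact scheme of section 3.2.1:

```python
import numpy as np

def hard_switch_response(x, centers, weights, biases):
    """Response of the single unit whose cluster centre is closest to x:
    piecewise linear and discontinuous at the borders between clusters."""
    i = int(np.argmin([np.linalg.norm(x - c) for c in centers]))
    return weights[i] @ x + biases[i]

def cooperative_response(x, centers, weights, biases, beta=1.0):
    """Weighted superposition of all unit responses; inputs closer to a
    cluster centre give that unit a higher weight, which makes the
    response a continuous function of x."""
    d = np.array([np.linalg.norm(x - c) for c in centers])
    w = np.exp(-beta * d)
    w /= w.sum()
    return sum(wi * (Wi @ x + bi) for wi, Wi, bi in zip(w, weights, biases))
```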

Cooperative switching was deactivated in experiment C; otherwise the experiment is identical to experiment A. Both the CTE-top and the CEV-top networks from experiment A are better than those from C, and the difference is proportionally higher in the case of the CEV-top networks. This means that, for the networks from experiment C, the quality of the prediction in the evaluation period is less related to the accuracy in the testing period.

A detailed study of a few selected networks indicates that this inconsistent behavior is probably a consequence of the two training cycles. A network trained on the training data that is quite accurate on the evaluation data can be "damaged" during the second training, when the data from the training and evaluation periods are used together. The additional data affect the position of the clusters and hence also the position of the sensitive areas where the response is discontinuous. The same applies in the opposite direction as well.

This means that cooperative switching increases the accuracy and also makes the networks less sensitive to the training data, which is of great importance for application in a production environment.


Local feedback

In an ARMA model the feedback is represented by the MA component. The dependence of current values on disturbances from the past is quite common, so feedback is a very important feature of time series models in general. In neural networks with switching units it is implemented locally, that is, for each neuron separately. For more details see section 3.2.2.
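Because the feedback terms depend on residuals that are only known after a fit, they are estimated by iterative regression (as noted earlier for the reported results). A rough sketch of one possible iteration scheme (Python; illustrative only, the actual procedure of section 3.2.2 may differ in its details):

```python
import numpy as np

def fit_with_local_feedback(X, y, error_lags=(1,), n_iter=5):
    """Iteratively re-fit a linear neuron with lagged residuals as
    additional regressors (an MA-like local feedback)."""
    n = len(y)
    residuals = np.zeros(n)
    beta = None
    for _ in range(n_iter):
        # Lagged residuals from the previous pass enter as extra columns.
        lagged = np.column_stack(
            [np.concatenate((np.zeros(lag), residuals[:-lag]))
             for lag in error_lags])
        Xf = np.column_stack((X, lagged))
        beta, *_ = np.linalg.lstsq(Xf, y, rcond=None)
        residuals = y - Xf @ beta
    return beta
```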

Experiment F was executed to test the impact of local feedback. Only networks without local feedback were involved in this experiment; otherwise the setup is again the same as that of A. The results of experiment F are among the worst, which suggests the importance of local feedback.

Finally, the feedback was added manually to some of the networks from experiment F. This significantly improved the forecasting accuracy, which further confirms the positive impact of local feedback.

Complex topology

The training of an NNSU is a feedforward algorithm. Once a neuron is trained, it cannot adjust its weights in the context of other neurons; the only correction can be realized by subsequent neurons. The question, however, is whether they have the capability to do so. If not, then a single layer of neurons should be as efficient as any complex topology, and the concept of NNSU is reduced to an ensemble of very basic regression trees consisting of a single layer.

Experiments E, I and J restrict the topology of the networks to a single layer. Among these, the accuracy of the networks from E and J is comparable; experiment I gives somewhat worse results, which is due to the activation of time dissipation, as discussed later. Therefore experiments E and J are the only candidates for the study of the impact of the topological restriction. However, the closest benchmark for experiment J is G, which also has quality weighting activated and which can therefore influence the comparison. So only the pair of experiments E and A is considered.

The accuracy of the CEV-top 10 networks is better for experiment A, but the difference is only about 1% when looking at the average accuracy, and even smaller when looking at the best networks. The improvement in the CTE-top 10 networks is more significant.

Another difference is in the progress of the genetic optimization (see A.1 (b) and A.9 (b)). While both the CEV and CTE criteria decrease during the evolution in experiment A, in experiment E the criterion CTE fluctuates even though CEV has a decreasing tendency, which means that the genetic optimization has only a minor effect in experiment E.

In addition, the activation of time dissipation can further increase the performance of more complex networks, as demonstrated in experiment G.

All in all, the findings suggest that the use of more complex topologies is worthwhile. The potential probably lies in the higher flexibility of the architectures combined with the genetic optimization. Further improvement is expected if the selection of lags were subjected to the optimization or a more complex fitness function were used.

Weighting by quality

Each neuron with cooperative switching can estimate the quality of its prediction at a given time. The quality is determined by two factors: the geographical position of the input and the weighted average of the linear regression quality across the computational units, measured by the adjusted coefficient of determination adjR2 (see (2.6)). Clearly, the quality of prediction of a given neuron varies in time. Therefore any neuron with two or more inputs can use the estimated quality of prediction of its parents to weight their outputs. This concept is explained in section 3.2.5.
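A sketch of how a neuron with several parents could weight their outputs by the reported quality (Python; the per-parent quality estimate is assumed to be available for every time step, and the simple normalisation shown here is an illustration rather than the exact rule of section 3.2.5):

```python
import numpy as np

def quality_weighted_inputs(parent_outputs, parent_qualities):
    """Scale each parent's output by its relative estimated quality before
    it enters the neuron's own regression.

    parent_outputs, parent_qualities: arrays of shape (n_parents, n_steps);
    the quality could, for instance, combine the distance of the input to
    the cluster centres with the adjusted R^2 of the computational units.
    """
    q = np.clip(np.asarray(parent_qualities, dtype=float), 1e-12, None)
    w = q / q.sum(axis=0, keepdims=True)      # relative weight per time step
    return w * np.asarray(parent_outputs)
```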

Weighting by quality is suspended in experiments B and J. However, only B can be used for the comparison, as there is no appropriate benchmark for J. The benchmark for B is again experiment A.

The results do not show any significant difference in the forecasting accuracy measured by CTE; the best networks from the two experiments differ only slightly. Hence weighting by quality has not improved the forecasting accuracy much.

However, quality weighting has a positive impact on the size of the networks. The average number of parameters (and neurons) is about 20% higher for the CEV-top 10 networks from experiment B, even though the size of the networks in the initial population is the same for both experiments.

Time dissipation

When time dissipation is activated, the training data are weighted with respect to time - the weight of the most recent observation is equal to one and decreases exponentially going back into the past. The speed of dissipation varies from neuron to neuron in order to deal with the changing behavior of the time series and with rare seasons in parallel. For more details see section 3.2.3.
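A sketch of the exponentially decaying observation weights and of how they enter a weighted least-squares fit (Python; the decay parameter is illustrative, since in the model the speed of dissipation differs from neuron to neuron):

```python
import numpy as np

def time_dissipation_weights(n_obs, decay=0.99):
    """Weight 1 for the most recent observation, decaying exponentially
    going back into the past."""
    ages = np.arange(n_obs - 1, -1, -1)      # age 0 = most recent observation
    return decay ** ages

def weighted_least_squares(X, y, w):
    """Weighted LS fit: every training observation enters with its weight."""
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta
```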


Time dissipation was activated in three experiments: G, H and I. However, the results of experiment H are strongly influenced by the complete absence of switching units; concerning time dissipation one can only say that it has not compensated for the missing switching units. Therefore only the comparison of experiments G and I with A and E, respectively, is meaningful.

Although the difference is definitely not striking, the networks from experiment G are a bit better than those from experiment A. In contrast, in the case of experiments I and E the results are better for experiment E, where time dissipation is deactivated.

In addition, the average size of the CEV-top networks in experiment I is twice that in experiment E, even though the networks in the initial population of experiment I were smaller than those in experiment E. Moreover, the need for additional neurons is also demonstrated by the evolution of the overfitting coefficient: from the middle of the evolution, its average value increases steeply together with the size of the networks.

Taking into account the similar size of the networks in A and G, there is an indication that the activation of time dissipation interferes negatively with the topological restriction employed in experiment I.

On the other hand, the activation of time dissipation positively influenced the results of the complex topologies in experiment G. Unfortunately, the difference is not too significant. It is possible that further improvement would be achieved if time dissipation were subjected to the genetic optimization, but to test this hypothesis the functionality of the genetic algorithms has to be extended first.

4.3 Comparison with conventional models

This section summarizes the outcomes of the experimental study of the forecasting performance of the proposed model applied to the CIC series. Based on the outcomes of the previous section, the CEV-top 10 networks are chosen and compared with conventional time series forecasting models, namely an ARMA model and an STS model.

The networks from experiment G are the most complex ones, employing all the innovations suggested in chapter 3. These networks also achieved the best accuracy of all the experiments carried out. The CEV-top networks are selected rather than the CTE-top networks to make sure that the testing data were not used in any form for the construction or selection of the resulting networks.


The results of the conventional models presented in this section are obtained by the models developed in [Hla04] and [Sen06].2

The accuracy of the models is compared with respect to the root mean square error of the one step ahead prediction over the testing period, referred to here as CTE. The overview of the forecasting accuracy of all the models involved in the comparison, plus the accuracy of the experts from the CNB (hereafter referred to as the expert), is given in table 4.5. The performance of the expert is used as the benchmark in this comparison.

Table 4.5: Comparison of forecasting performance of different models applied to the series of volumes of currency in circulation

model                       RMSE    improvement
ARMA                        0.463   -6.55%
STS                         0.452   -4.23%
STS-ARMA                    0.419    3.47%
NNCS average of top 10      0.434   -0.01%
NNCS best                   0.419    3.56%
NNCS ensemble of top 10     0.418    3.68%
Expert                      0.434    —

The improvement of the models is measured relative to the accuracy of the expert.
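Assuming the improvement is the relative reduction of the RMSE with respect to the expert, the column can be reproduced from the table (Python; small deviations from table 4.5 are expected because the published RMSE values are rounded):

```python
expert_rmse = 0.434
models = [("ARMA", 0.463), ("STS", 0.452), ("STS-ARMA", 0.419),
          ("NNCS best", 0.419), ("NNCS ensemble of top 10", 0.418)]
for name, model_rmse in models:
    improvement = (expert_rmse - model_rmse) / expert_rmse
    print(f"{name}: {improvement:+.2%}")
```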

It is obvious that the accuracy of the ARMA and STS models is worse than the accuracy of the expert. On the other hand, when the two models are combined they deliver significantly better predictions, outperforming the accuracy of the expert. This indicates that each of these models can extract a different type of information from the series.

From this point of view, the results of the neural networks with cooperative switching are particularly promising. The performance of the best network is comparable to the performance of the combination of both conventional models.

In addition, it has to be highlighted that the comparison with the expert is not completely fair. The reason is that all the models were trained only once, before the start of the testing period, whereas the expert could take recent trends into account to adjust the predictions. It means that a higher accuracy of the models can be expected if they are re-trained on a regular basis.

2 The cited publications are a Bachelor and a Master thesis, respectively. In both cases the author of this thesis closely collaborated with the authors and their supervisor as a consultant.


In particular, the accuracy of the neural networks decreases significantly over time. Looking at the trimesters of the testing period, the RMSE of the prediction of the best network rises from 0.336 in the first trimester to 0.484 in the last one. In contrast, the performance of the expert does not exhibit any significant fluctuations.

In summary, based on the available results, it was shown that the neural networks are able to deliver predictions with at least the same performance as the combination of the two conventional models.


Chapter 5

Conclusion

This thesis examined the potential of the concept of neural networks with switching units in the context of time series forecasting. Several innovations to the basic model of the network with switching units were suggested in chapter 3. The proposed improvements are inspired by ideas behind other stochastic or connectionist time series models and adapted to the specific properties of neural networks with switching units.

The impact of the innovations has been tested by comprehensive experiments employing genetic algorithms. Even though the results of this complex study can hardly be translated into generally valid theorems, they provide a sufficient background for a more detailed study of the proposed architecture. In particular, the results indicate which of the proposed innovations are fruitful and where the potential for future development lies.

In general, the implemented innovations are based on the basic algorithms, models or criteria related to the corresponding concept. Therefore, further improvement could be achieved by applying the most recent knowledge and by fine tuning the individual features. Particular attention should be paid to the areas where only a small contribution was achieved.

In particular, the impact of more complex topologies in combination with other features should be examined in detail. However, to do so the current implementation of neural networks with switching units has to be substantially extended. First, the genetic algorithm needs to be extended to handle the specific features of the new architectures. Second, a tool for a deeper analysis of the fitted networks is necessary.

To demonstrate the capabilities of the enhanced model - named after one of the new features the neural network with cooperative switching - the model was applied to the daily series of volumes of currency in circulation. The performance was compared with two conventional time series models: ARMA and STS.

These two stochastic models have already been applied successfully to currency in circulation forecasting at the European Central Bank. Based on the experience of the ECB, these models were also fitted to the volumes of the Czech koruna. The results suggest that the neural network with cooperative switching outperforms both models when they are applied separately, and is comparable to the accuracy of their combination. In any case it provides more accurate predictions than the forecasts made by the experts in the Czech National Bank. Moreover, further improvement could be achieved by regular updates of the network to reflect the changing behavior of the forecasted series.

Another advantage of the neural networks with cooperative switching is that their construction is fully automated. This is probably not relevant in the context of liquidity management in a central bank, but it is very helpful when hundreds of similar series have to be forecasted. An example of such a task is the management of cash demand in a bank's ATMs, which is very close to the modelling of currency in circulation.


Appendix A

Detailed results of CIC forecasting experiments

This appendix is a set of reports with a fixed layout that present the results of the experiments from chapter 4 in a structured and consistent way. A detailed explanation of the setup of the particular experiments is given in section 4.1.

Each report is related to a single experiment and contains

• table (a) summarizing the results in terms of RMSE and percentiles of the distribution of the absolute error of the one step ahead prediction over the testing period,

• two graphs (b) and (c) with aggregated results based on all networks created during the particular experiment,

• four graphs (d) - (g) with results related to the CEV-best network (the definition of the CEV-best network is given in section 4.2.1),

• a scheme of the topology of the CEV-best network.


                            CEV     95% pct.  99% pct.
best (CTE)                  0.414   0.764     1.009
best (CEV)                  0.427   0.770     1.087
average of top 10 (CTE)     0.417   0.809     1.078
average of top 10 (CEV)     0.437   0.825     1.174
ensemble of top 10 (CEV)    0.425   0.803     1.012

(a) Summary of results

[Graphs (b)-(g):]
(b) Evolution of average values of the quality criteria CEV, CTE and coefficient of overfitting (right hand side axis)
(c) Histogram of CEV in the whole population
(d) Residuals of the one step ahead prediction of the CEV-best network
(e) Histogram of standardized residuals of the one step ahead prediction of the CEV-best network and the N(0,1) distribution
(f) Autocorrelation function of the one step ahead prediction residuals of the CEV-best network
(g) Partial autocorrelation function of the one step ahead prediction residuals of the CEV-best network

Figure A.1: Results of experiment A


Figure A.2: Topology of the CEV-best network in the experiment A


                            CEV     95% pct.  99% pct.
best (CTE)                  0.419   0.776     1.070
best (CEV)                  0.427   0.830     1.070
average of top 10 (CTE)     0.421   0.819     1.123
average of top 10 (CEV)     0.446   0.898     1.168
ensemble of top 10 (CEV)    0.432   0.883     1.025

(a) Summary of results

[Graphs (b)-(g):]
(b) Evolution of average values of the quality criteria CEV, CTE and coefficient of overfitting (right hand side axis)
(c) Histogram of CEV in the whole population
(d) Residuals of the one step ahead prediction of the CEV-best network
(e) Histogram of standardized residuals - CEV-best
(f) Autocorrelation function of the one step ahead prediction residuals of the CEV-best network
(g) Partial autocorrelation function of the one step ahead prediction residuals of the CEV-best network

Figure A.3: Results of experiment B


Figure A.4: Topology of the CEV-best network in the experiment B


                            CEV     95% pct.  99% pct.
best (CTE)                  0.424   0.775     0.983
best (CEV)                  0.445   0.835     1.079
average of top 10 (CTE)     0.428   0.817     1.067
average of top 10 (CEV)     0.461   0.891     1.232
ensemble of top 10 (CEV)    0.444   0.872     1.119

(a) Summary of results

[Graphs (b)-(g):]
(b) Evolution of average values of the quality criteria CEV, CTE and coefficient of overfitting (right hand side axis)
(c) Histogram of CEV in the whole population
(d) Residuals of the one step ahead prediction of the CEV-best network
(e) Histogram of standardized residuals - CEV-best
(f) Autocorrelation function of the one step ahead prediction residuals of the CEV-best network
(g) Partial autocorrelation function of the one step ahead prediction residuals of the CEV-best network

Figure A.5: Results of experiment C


Figure A.6: Topology of the CEV-best network in the experiment C


                            RMSE    95% pct.  99% pct.
best (CTE)                  0.428   0.783     1.014
best (CEV)                  0.449   0.868     1.169
average of top 10 (CTE)     0.437   0.851     1.125
average of top 10 (CEV)     0.471   0.917     1.329
ensemble of top 10 (CEV)    0.452   0.846     1.190

(a) Summary of results

[Graphs (b)-(g):]
(b) Evolution of average values of the quality criteria CEV, CTE and coefficient of overfitting (right hand side axis)
(c) Histogram of CEV in the whole population
(d) Residuals of the one step ahead prediction of the CEV-best network
(e) Histogram of standardized residuals - CEV-best
(f) Autocorrelation function of the one step ahead prediction residuals of the CEV-best network
(g) Partial autocorrelation function of the one step ahead prediction residuals of the CEV-best network

Figure A.7: Results of experiment D


Figure A.8: Topology of the CEV-best network in the experiment D


                            RMSE    95% pct.  99% pct.
best (CTE)                  0.422   0.793     0.920
best (CEV)                  0.428   0.806     0.977
average of top 10 (CTE)     0.425   0.808     1.056
average of top 10 (CEV)     0.441   0.837     1.101
ensemble of top 10 (CEV)    0.426   0.794     0.950

(a) Summary of results

[Graphs (b)-(g):]
(b) Evolution of average values of the quality criteria CEV, CTE and coefficient of overfitting (right hand side axis)
(c) Histogram of CEV in the whole population
(d) Residuals of the one step ahead prediction of the CEV-best network
(e) Histogram of standardized residuals - CEV-best
(f) Autocorrelation function of the one step ahead prediction residuals of the CEV-best network
(g) Partial autocorrelation function of the one step ahead prediction residuals of the CEV-best network

Figure A.9: Results of experiment E


Figure A.10: Topology of the CEV-best network in the experiment E


                            CEV     95% pct.  99% pct.
best (CTE)                  0.427   0.796     1.074
best (CEV)                  0.451   0.836     1.141
average of top 10 (CTE)     0.437   0.846     1.116
average of top 10 (CEV)     0.465   0.899     1.262
ensemble of top 10 (CEV)    0.452   0.838     1.189

(a) Summary of results

[Graphs (b)-(g):]
(b) Evolution of average values of the quality criteria CEV, CTE and coefficient of overfitting (right hand side axis)
(c) Histogram of CEV in the whole population
(d) Residuals of the one step ahead prediction of the CEV-best network
(e) Histogram of standardized residuals - CEV-best
(f) Autocorrelation function of the one step ahead prediction residuals of the CEV-best network
(g) Partial autocorrelation function of the one step ahead prediction residuals of the CEV-best network

Figure A.11: Results of experiment F


Figure A.12: Topology of the CEV-best network in the experiment F


                            RMSE    95% pct.  99% pct.
best (CTE)                  0.411   0.746     1.051
best (CEV)                  0.419   0.806     1.046
average of top 10 (CTE)     0.416   0.807     1.157
average of top 10 (CEV)     0.434   0.846     1.197
ensemble of top 10 (CEV)    0.418   0.786     1.178

(a) Summary of results

[Graphs (b)-(g):]
(b) Evolution of average values of the quality criteria CEV, CTE and coefficient of overfitting (right hand side axis)
(c) Histogram of CEV in the whole population
(d) Residuals of the one step ahead prediction of the CEV-best network
(e) Histogram of standardized residuals - CEV-best
(f) Autocorrelation function of the one step ahead prediction residuals of the CEV-best network
(g) Partial autocorrelation function of the one step ahead prediction residuals of the CEV-best network

Figure A.13: Results of experiment G


Figure A.14: Topology of the CEV-best network in the experiment G


                            CEV     95% pct.  99% pct.
best (CTE)                  0.439   0.826     1.023
best (CEV)                  0.450   0.836     1.050
average of top 10 (CTE)     0.445   0.853     1.179
average of top 10 (CEV)     0.461   0.888     1.228
ensemble of top 10 (CEV)    0.455   0.870     1.044

(a) Summary of results

[Graphs (b)-(g):]
(b) Evolution of average values of the quality criteria CEV, CTE and coefficient of overfitting (right hand side axis)
(c) Histogram of CEV in the whole population
(d) Residuals of the one step ahead prediction of the CEV-best network
(e) Histogram of standardized residuals - CEV-best
(f) Autocorrelation function of the one step ahead prediction residuals of the CEV-best network
(g) Partial autocorrelation function of the one step ahead prediction residuals of the CEV-best network

Figure A.15: Results of experiment H

Figure A.16: Topology of the CEV-best network in the experiment H


                            RMSE    95% pct.  99% pct.
best (CTE)                  0.419   0.791     0.960
best (CEV)                  0.445   0.835     1.082
average of top 10 (CTE)     0.426   0.828     1.070
average of top 10 (CEV)     0.456   0.879     1.193
ensemble of top 10 (CEV)    0.442   0.821     1.119

(a) Summary of results

[Graphs (b)-(g):]
(b) Evolution of average values of the quality criteria CEV, CTE and coefficient of overfitting (right hand side axis)
(c) Histogram of CEV in the whole population
(d) Residuals of the one step ahead prediction of the CEV-best network
(e) Histogram of standardized residuals - CEV-best
(f) Autocorrelation function of the one step ahead prediction residuals of the CEV-best network
(g) Partial autocorrelation function of the one step ahead prediction residuals of the CEV-best network

Figure A.17: Results of experiment I


Figure A.18: Topology of the CEV-best network in the experiment I


                            RMSE    95% pct.  99% pct.
best (CTE)                  0.418   0.801     0.911
best (CEV)                  0.434   0.783     1.064
average of top 10 (CTE)     0.424   0.834     1.088
average of top 10 (CEV)     0.443   0.839     1.157
ensemble of top 10 (CEV)    0.432   0.799     1.068

(a) Summary of results

[Graphs (b)-(g):]
(b) Evolution of average values of the quality criteria CEV, CTE and coefficient of overfitting (right hand side axis)
(c) Histogram of CEV in the whole population
(d) Residuals of the one step ahead prediction of the CEV-best network
(e) Histogram of standardized residuals - CEV-best
(f) Autocorrelation function of the one step ahead prediction residuals of the CEV-best network
(g) Partial autocorrelation function of the one step ahead prediction residuals of the CEV-best network

Figure A.19: Results of experiment J


Figure A.20: Topology of the CEV-best network in the experiment J

Figure A.20: Topology of the CEV -best network in the experiment J


Bibliography

[AAES10] R. Andrawis, A. F. Atiya, and H. El-Shishiny. Forecast combination model using computational intelligence/linear models for the NN5 time series forecasting competition. International Journal of Forecasting, conditionally accepted, expected 2010.

[Arm01] J. S. Armstrong. Combining forecasts. In J. S. Armstrong, editor, Principles of Forecasting: A Handbook for Researchers and Practitioners. Kluwer Academic Publishers, 2001.

[ARM+03] F. Acernese, R. D. Rosa, L. Milano, F. Barone, A. Eleuteri, and R. Tagliaferri. A hierarchical Bayesian learning framework for autoregressive neural network modeling of time series. In Image and Signal Processing and Analysis, volume 2, pages 897–902, 2003.

[Bal97] S. Balkin. Using recurrent neural networks for time series forecasting. Technical Report 97-11, Pennsylvania State University, 1997.

[BB08] A. Bouchachia and S. Bouchachia. Ensemble learning for time series prediction. 2008.

[BCG+04] R. K. Bock, A. Chilingarian, M. Gaug, F. Hakl, T. Hengstebeck, M. Jirina, J. K. E. Kotrc, P. Savicky, S. Towers, A. Vaiciulis, and W. Wittek. Methods for multidimensional event classification: a case study using images from a Cherenkov gamma-ray telescope. In Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, volume 516, pages 511–528, 2004.

[BH83] W. R. Bell and S. C. Hillmer. Modeling time series with calendar variation. Journal of the American Statistical Association, (78):526–534, 1983.

[BHBK03] R. E. Banfield, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer. A new ensemble diversity measure applied to thinning ensembles. In 4th International Workshop on Multiple Classifier Systems, pages 306–316, 2003.

[Bis95] C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

[BJ76] G. Box and G. Jenkins. Time Series Analysis: Forecasting and Control. Holden-Day, 1976.

[Bre96] L. Breiman. Bagging predictors. In Machine Learning, pages 123–140, 1996.

[BT00] T. Briegel and V. Tresp. Dynamic neural regression models. Discussion paper 181, Ludwig-Maximilians-Universität München, 2000.

[BvK+95] P. Bitzan, J. Smejkalova, M. Kucera, M. Parzek, and M. Matyas. Theory and technical implementation of a neural network with switching units. Technical report, CERN, Geneva, 1995.

[CA97] C. Atkeson, A. Moore, and S. Schaal. Locally weighted learning. AI Review, 11:11–73, April 1997.

[CCM96] E. Chng, S. Chen, and B. Mulgrew. Gradient radial basis function networks for nonlinear and non-stationary time series prediction. IEEE Transactions on Neural Networks, 7(1):190–194, 1996.

[CCMH02] A. Cabrero, G. Camba-Mendez, and A. Hirsch. Modelling the daily series of banknotes in circulation in the context of the liquidity management operations of the ECB. ECB Working Paper, 2002.

[CGD] S. Canu, Y. Grandvalet, and X. Ding. One step ahead forecasting using multilayered perceptron.

[Cio02] I. B. Ciocoiu. RBF networks training using a dual extended Kalman filter. Neurocomputing, 48:609–622, 2002.

[CMAI94] J. T. Connor, R. D. Martin, and L. E. Atlas. Recurrent neural networks and robust time series prediction. IEEE Transactions on Neural Networks, 5:240–254, 1994.

[CUPR99] P. Campolucci, A. Uncini, F. Piazza, and B. D. Rao. On-line learning algorithms for locally recurrent neural networks. IEEE Transactions on Neural Networks, 10:253–271, 1999.

[CYA07] Y. Chen, B. Yang, and A. Abraham. Flexible neural trees ensemble for stock index modeling, 2007.

[CYZ06] Y. Chen, B. Yang, and J. Zhou. Automatic design of hierarchical RBF networks for system identification. In P. S. et al., editor, PRICAI 2006: Trends in Artificial Intelligence, volume 4099/2006 of Lecture Notes in Computer Science/Computational Science, pages 1191–1195. Springer-Verlag Berlin Heidelberg, 2006.

[Dor96] G. Dorffner. Neural networks for time series processing. Neural Network World, (6):447–468, 1996.

[DS88] S. Dutta and S. Shekhar. Bond rating: A non-conservative application of neural networks. In IEEE International Conference on Neural Networks, pages 443–450. IEEE Press, 1988.

[DSMV01] M. Duhoux, J. A. K. Suykens, B. D. Moor, and J. Vandewalle. Improved long-term temperature prediction by chaining of neural networks. International Journal of Neural Systems, 11(1):1–10, 2001.

[dSN99] J. A. da Silva and R. Neto. Preliminary testing and analysis of an adaptive neural network training Kalman filtering. In Proceedings of the IV Brazilian Conference on Neural Networks, pages 247–251, Sao Jose dos Campos, Brazil, July 1999.

[Elm90] J. L. Elman. Finding structure in time. Cognitive Science, 14:179–211, 1990.

[FCMS96] P. Frasconi, M. Gori, M. Maggini, and G. Soda. Representation of finite state automata in recurrent radial basis function networks. Machine Learning, 23:5–32, 1996.

[FGS92] P. Frasconi, M. Gori, and G. Soda. Local feedback multilayered networks. Neural Computation, 4:120–130, 1992.

[FK07] U. B. Filik and M. Kurban. A new approach for the short-term load forecasting with autoregressive and artificial neural network models. International Journal of Computational Intelligence Research, 3(1):66–71, 2007.

[GHDRD01] P. Gil, J. Henriques, H. Duarte-Ramos, and A. Dourado. State-space neural networks and the unscented Kalman filter in on-line nonlinear system identification. In Proceedings of the IASTED International Conference on Intelligent Systems and Control (ISC 2001), 2001.

[GL01] C. L. Giles and S. Lawrence. Noisy time series prediction using a recurrent neural network and grammatical inference. In Machine Learning, pages 161–183, 2001.

[GMH05] Z. Gao, F. Ming, and Z. Hongling. Bagging neural networks for predicting water consumption, 2005.

[Gru94] F. Gruau. Neural Network Synthesis using Cellular Encoding and the Genetic Algorithm. PhD thesis, Ecole Normale Superieure de Lyon, 1994.

[GS08] Gowrishankar and P. Satyanarayana. Recurrent neural network based bit error rate prediction for 802.11 wireless local area network. International Journal of Computer Sciences and Engineering Systems, 2(3), 2008.

[Ham94] J. D. Hamilton. Time Series Analysis. Princeton University Press, 1994.

[Hay94] S. Haykin. Neural Networks: A Comprehensive Foundation. Macmillan Publishing, New York, 1994.

[Hay01] S. Haykin, editor. Kalman Filtering and Neural Networks. Adaptive and Learning Systems for Signal Processing, Communication and Control. John Wiley & Sons, Inc., 2001.

[HD98] M. Hallas and G. Dorffner. A comparative study in feedforward and recurrent neural networks in time series prediction using gradient descent learning. Technical report, University of Vienna, 1998.

[HG90] D. Huang and L. Guo. Estimation of nonstationary ARMAX models based on the Hannan-Rissanen method. Annals of Statistics, 18(4):1729–1756, 1990.

[HG95] B. G. Horne and C. L. Giles. An experimental comparison of recurrent neural networks. Neural Information Processing Systems, 7:697, 1995.

[HHK02] F. Hakl, M. Hlavacek, and R. Kalous. Application of neural networks optimized by genetic algorithms to Higgs boson search. In P. S. et al., editor, Computational Science ICCS 2002, volume 2331/2002 of Lecture Notes in Computer Science, pages 554–563. Springer-Verlag Berlin Heidelberg, 2002.

[HJ99] F. Hakl and M. Jirina. Using GMDH neural net and neural net with switching units to find rare particles. In Proceedings of International Conference on Artificial Neural Nets and Genetic Algorithms, Slovenia, 1999. Springer-Verlag Wien.

[HK91] E. Hartman and J. D. Keeler. Predicting the future: Advantages of semilocal units. Neural Computation, 3:566–578, 1991.

[HKv05] M. Hlavacek, M. Konak, and J. Cada. The application of feedforward structured neural networks to the modelling of daily series of currency in circulation. In Working Paper Series, number 11/2005. Czech National Bank, Prague, 2005.

[Hla01] M. Hlavacek. Application of neural networks with switching units on Higgs boson search. Technical report, Czech Technical University, Faculty of Nuclear Science and Physical Engineering, Prague, 2001.

[Hla02] M. Hlavacek. Design and analysis of a neural network with switching units suitable for the study of elementary particle decay processes using genetic optimization (in Czech). Master's thesis, Czech Technical University, Faculty of Nuclear Science and Physical Engineering, 2002.

[Hla04] M. Hlavacek. Modelling the volume of currency in circulation in the Czech Republic (in Czech). Bachelor thesis, 2004.

[HM88] E. Hannan and A. McDougall. Regression procedures for ARMA estimation. Journal of the American Statistical Association, 83(402):490–498, 1988.

[HSW89] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, (2):359–366, 1989.

[HvH05] M. Hlavacek, J. Cada, and F. Hakl. The application of structured feedforward neural networks to the modelling of the daily series of currency in circulation. In L. Wang, K. Chen, and Y. Ong, editors, Computational Science ICCS 2005, volume 3610/2005 of Lecture Notes in Computer Science, pages 1234–1246. Springer-Verlag Berlin Heidelberg, 2005.

[IT99] T. Ieshima and A. Tokosumi. How could neural networks represent higher cognitive functions?: A computational model based on a fractal neural network. In The Second International Conference on Cognitive Science and The 16th Annual Meeting of the Japanese Cognitive Science Society Joint Conference (ICCS/JCSS99), 1999.

[IYM03] M. M. Islam, X. Yao, and K. Murase. A constructive algorithm for training cooperative neural network ensembles. IEEE Transactions on Neural Networks, 14:820–834, 2003.

[JJM00] G. Judge, J. D. Miller, and R. Mittelhammer. Econometric Foundations. Cambridge University Press, New York, 2000.

[Jor86] M. I. Jordan. Serial order: a parallel distributed processing approach. Technical Report 86-104, 1986.

[JR94] M. Jordan and R. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181–184, 1994.

[Kal09] R. Kalous. Evolutionary Optimization of Neural Network Architectures Using Genetic Algorithms. PhD thesis, Czech Technical University, Faculty of Nuclear Science and Physical Engineering, 2009.

[KH04] R. Kalous and F. Hakl. Evolutionary operators on DAG representations. In Proceedings of the International Conference on Computing, Communication and Control Technologies: CCCT 04, pages 14–17, Austin, Texas, USA, August 2004.

[KLSK96] T. Koskela, M. Lehtokangas, J. Saarinen, and K. Kaski. Time series prediction with multilayer perceptron, FIR and Elman neural networks. In Proceedings of the World Congress on Neural Networks, pages 491–496, 1996.

[KV95] A. Krogh and J. Vedelsby. Neural network ensembles, cross validation, and active learning. In Advances in Neural Information Processing Systems, pages 231–238. MIT Press, 1995.

[LF87] A. Lapedes and R. Farber. Nonlinear signal processing using neural networks: prediction and system modeling. Technical Report LA-UR-87-2662, Los Alamos National Laboratory, NM, 1987.

[LF05] K.-P. Liao and R. Fildes. The accuracy of a procedural approach to specifying feedforward neural networks for forecasting. Comput. Oper. Res., 32(8):2151–2169, 2005.

[LSK07] Y. Lean, W. Shouyang, and L. K. Keung. An online learning algorithm with adaptive forgetting factors for feedforward neural networks in financial time series forecasting. Nonlinear Dynamics and Systems Theory, 7(1):116–131, 2007.

[LSKH96] M. Lehtokangas, J. Saarinen, K. Kaski, and P. Huuhtanen. A network of autoregressive processing units for time series modeling. Appl. Math. Comput., 75(2-3):151–165, 1996.

[LW01] T. L. Lai and S. P.-S. Wong. Stochastic neural networks with applications to nonlinear time series. Journal of the American Statistical Association, 96, 2001.

[LWH06] K. Lai, L. Wang, and W. Huang. Neural-network-based metamodeling for financial time series forecasting. In JCIS-2006 Proceedings, Advances in Intelligent System Research. Atlantis Press, 2006.

[LY99] Y. Liu and X. Yao. Ensemble learning via negative correlation. Neural Networks, 12:1399–1404, 1999.

[LYWH06] K. K. Lai, L. Yu, S. Wang, and W. Huang. Hybridizing exponential smoothing and neural network for financial time series prediction. In ICCS 2006, Part IV, LNCS 3994, pages 493–500. Springer-Verlag Berlin, 2006.

[LYWW06] K. K. Lai, L. Yu, S. Wang, and H. Wei. A novel nonlinear neural network ensemble model for financial time series forecasting. In Lecture Notes in Computer Science 3991, pages 790–793, 2006.

[LYWZ06] K. K. Lai, L. Yu, S. Wang, and C. Zhou. Neural-network-based metamodeling for financial time series forecasting. In Proceedings of JCIS-2006: Advances in Intelligent Systems Research, 2006.

[Mar63] D. W. Marquardt. An algorithm for least-squares estimation of nonlinear parameters. SIAM Journal on Applied Mathematics, 11(2):431–441, 1963.

[MBR99] A. Monfort, M. Billio, and C. Robert. Bayesian estimation of switching ARMA models. Journal of Econometrics, 1999.

[mC02] H. Ming Cheung. A new recurrent radial basis function network. In Proceedings of the 9th International Conference on Neural Information Processing, ICONIP, volume 2, pages 1032–1036, 2002.

[MCH02] C. Meek, D. M. Chickering, and D. Heckerman. Autoregressive tree models for time-series analysis. In Proceedings of the Second International SIAM Conference on Data Mining, 2002.

[MRH08] A. Mohammad, B. Romuald, and C. Hubert. A new boosting algorithm for improved time-series forecasting with recurrent neural networks. Inf. Fusion, 9(1):41–55, 2008.

[MTR02] M. C. Medeiros, T. Teräsvirta, and G. Rech. Building neural network models for time series: A statistical approach. Working Paper Series in Economics and Finance 508, Stockholm School of Economics, Sep 2002.

[MW03] D. M. Miller and D. Williams. Shrinkage estimators of time series seasonal factors and their effect on forecasting accuracy. International Journal of Forecasting, 19(4):669–684, 2003.

[OM99] D. Opitz and R. Maclin. Popular ensemble methods: an empirical study. Journal of Artificial Intelligence Research, 11:169–198, 1999.

[Pie71] D. A. Pierce. Least squares estimation in the regression model with autoregressive-moving average errors. Biometrika, 58(2):299–312, 1971.

[Rea72] R. C. Read. The coding of various kinds of unlabeled trees. In R. C. Read, editor, Graph Theory and Computing, pages 153–182. Academic Press, 1972.

[RPB+02] I. Rojas, H. Pomares, J. L. Bernier, J. Ortega, B. Pino, F. J. Pelayo, and A. Prieto. Time series analysis using normalized PG-RBF network with regression weights. Neurocomputing, 42:267–285, 2002.

[RSU97] S. Rolf, J. Sprave, and W. Urfer. Model identification and parameter estimation of ARMA models by means of evolutionary algorithms. In Proceedings of the IEEE/IAFE, pages 237–243, 1997.

[Sen06] T. Senft. Prediction of the demand for currency in circulation in the economy from the central bank's perspective (in Czech). Master's thesis, Charles University, Faculty of Mathematics and Physics, 2006.

[SFBL98] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26:322–330, 1998.

[SK04] Z.-Q. Shen and F.-S. Kong. Dynamically weighted ensemble neural networks for regression problems. In Proceedings of 2004 International Conference on Machine Learning and Cybernetics, volume 6, pages 3492–3496, 2004.

[Spl83] H. Spliid. A fast estimation method for the vector autoregressive moving average model with exogenous variables. Journal of the American Statistical Association, 78(384):843–849, 1983.

[SW89] S. Singhal and L. Wu. Training multilayer perceptron with the extended Kalman algorithm. In D. Touretzky, editor, Advances in Neural Information Processing Systems, pages 133–140, 1989.

[TB92] T. Theodosopoulos and M. S. Branicky. A hierarchical algorithm for neural training and control. Technical report, Massachusetts Institute of Technology, 1992.

[TF93] Z. Tang and P. A. Fishwick. Feed-forward neural nets as models for time series forecasting. ORSA Journal on Computing, 5:374–385, 1993.

[TLH99] A. Trapletti, F. Leisch, and K. Hornik. On the ergodicity and stationarity of the ARMA(1,1) recurrent neural network process. Technical Report 37, Vienna University of Economics and Business Administration, Wien, 1999.

[Ton90] H. Tong. Nonlinear Time Series: A Dynamical System Approach. Oxford University Press, Oxford, U.K., 1990.

[Vac06] J. Vachulka. Monitoring the learning process of neural networks with switching units (in Czech). Master's thesis, Czech Technical University, Faculty of Nuclear Science and Physical Engineering, 2006.

[VRR+08] Valenzuela, I. Rojas, F. Rojas, H. Pomares, L. J. Herrera, A. Guillen, L. Marquez, and M. Pasadas. Hybridization of intelligent techniques and ARIMA models for time series prediction. Fuzzy Sets and Systems, 159(7):821–845, 2008.

[VS03] C. Vidal and A. Suarez. Hierarchical mixtures of autoregressive models for time-series modeling. In Artificial Neural Networks and Neural Information Processing ICANN/ICONIP 2003, Lecture Notes in Computer Science, pages 597–604. Springer Berlin / Heidelberg, 2003.

[WC96] D. K. Wedding and K. J. Cios. Time series forecasting by combining RBF networks, certainty factors, and the Box-Jenkins model, 1996.

[WCL00] C.-C. C. Wong, M.-C. Chan, and C.-C. Lam. Financial time series forecasting by neural network using conjugate gradient learning algorithm and multiple linear regression weight initialization. In Computing in Economics and Finance, number 61. Society for Computational Economics, 2000.

[WD92] D. Wettschereck and T. Dietterich. Improving the performance of radial basis function networks by learning centre locations. In J. Moody, S. Hanson, and R. Lippmann, editors, Advances in Neural Information Processing Systems, volume 4, pages 1133–1140. Morgan Kaufmann, San Francisco, 1992.

[WdJRX05] Y. Wen, J. de Jesus Rubio, and L. Xiaoou. Recurrent neural networks training with stable risk-sensitive Kalman filter algorithm. In Proceedings of International Joint Conference on Neural Networks, Montreal, Canada, 2005.

[Whi] D. Whitley. A genetic algorithm tutorial.

[WHR90] A. Weigend, B. Huberman, and D. Rumelhart. Predicting the future: A connectionist approach. International Journal of Neural Systems, 1:193–209, 1990.

[Wil92] R. J. Williams. Training recurrent networks using the extended Kalman filter. In International Joint Conference on Neural Networks, pages 241–246. Available: citeseer.nj.nec.com/williams92training.html, 1992.

[WO04] J. Wichard and M. Ogorzalek. Time series prediction with ensemble models, pages 1625–1629, Budapest, 2004.

[WP90] W. R. Bell and M. Pugh. Alternative approaches to the analysis of the time series components. Technical Report CENSUS/SRD/RR-90/01, Bureau of the Census, Statistical Research Division, 1990.

[WWT06] H. Wang, J. Wang, and W. Tian. A seasonal GRBF network for non-stationary time series prediction. Measurement Science and Technology, 17:2806–2810, 2006.

[YLW08] L. Yu, K. K. Lai, and S. Wang. Multistage RBF neural network ensemble learning for exchange rates forecasting. Neurocomputing, 71, 2008.

[ZC00] Y.-Q. Zhang and L.-W. Chan. ForeNet: Fourier recurrent networks for time series prediction. In Proceedings of International Conference on Neural Information Processing, ICONIP 2000, Korea, 2000.

[ZRZ03] R. Zemouri, D. Racoceanu, and N. Zerhouni. Recurrent radial basis function network for time-series prediction. Engineering Applications of Artificial Intelligence, (16):453–463, 2003.